Given a graph, the PageRank algorithm outputs a probability distribution representing the likelihood that a person randomly traversing the graph will arrive at any particular vertex. This algorithm was originally used by Google to rank websites where the World Wide Web was modeled as a directed graph with the vertices representing the websites. The PageRank algorithm initially proposed by Larry Page and Sergey Brin is implemented here [1].
We also implement personalized PageRank, in which a notion of importance provides personalization to a query. For example, importance scores can be biased according to a specified set of vertices in the graph that are of interest or special in some way [2].
pagerank( vertex_table, vertex_id, edge_table, edge_args, out_table, damping_factor, max_iter, threshold, grouping_cols, personalization_vertices )
Arguments
TEXT. Name of the table containing the vertex data for the graph. Must contain the column specified in the 'vertex_id' parameter below.
TEXT, default = 'id'. Name of the column(s) in 'vertex_table' containing vertex ids. The vertex ids can be of type INTEGER or BIGINT with no duplicates. They do not need to be contiguous. If multiple columns are used as vertex ids, they are passed in the following format: [<vertex_id1>,<vertex_id2>,...]
TEXT. Name of the table containing the edge data. The edge table must contain columns for source vertex and destination vertex.
TEXT. A comma-delimited string containing multiple named arguments of the form "name=value". The following parameters are supported for this string argument:
TEXT. Name of the table to store the result of PageRank. It will contain a row for every vertex from 'vertex_table' with the following columns:
A summary table is also created that contains information regarding the number of iterations required for convergence. It is named by adding the suffix '_summary' to the 'out_table' parameter.
FLOAT8, default 0.85. The probability, at any step, that a user will continue following the links in a random surfer model.
INTEGER, default: 100. The maximum number of iterations allowed.
FLOAT8, default: (1/number of vertices * 1000). If the difference between the PageRank of every vertex of two consecutive iterations is smaller than 'threshold', or the iteration number is larger than 'max_iter', the computation stops. If you set the threshold to zero, then you will force the algorithm to run for the full number of iterations specified in 'max_iter'. It is advisable to set threshold to a value lower than 1/(number of vertices in the graph) since the PageRank value of nodes is initialized to that value.
Download the example sql file here.
DROP TABLE IF EXISTS vertex, edge; CREATE TABLE vertex( node_id INTEGER ); CREATE TABLE edge( conn_src INTEGER, conn_dest INTEGER, user_id INTEGER ); INSERT INTO vertex VALUES (0), (1), (2), (3), (4), (5), (6); INSERT INTO edge VALUES (0, 1, 1), (0, 2, 1), (0, 4, 1), (1, 2, 1), (1, 3, 1), (2, 3, 1), (2, 5, 1), (2, 6, 1), (3, 0, 1), (4, 0, 1), (5, 6, 1), (6, 3, 1), (0, 1, 2), (0, 2, 2), (0, 4, 2), (1, 2, 2), (1, 3, 2), (2, 3, 2), (3, 0, 2), (4, 0, 2), (5, 6, 2), (6, 3, 2);
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary; SELECT madlib.pagerank( 'vertex', -- Vertex table 'node_id', -- Vertex id column 'edge', -- Edge table 'src=conn_src, dest=conn_dest', -- Comma delimted string of edge arguments 'pagerank_out'); -- Output table of PageRank SELECT * FROM pagerank_out ORDER BY pagerank DESC;
node_id | pagerank ---------+------------------- 0 | 0.28753749341184 3 | 0.21016988901855 2 | 0.14662683454062 4 | 0.10289614384217 1 | 0.10289614384217 6 | 0.09728637768887 5 | 0.05258711765692 (7 rows)
SELECT * FROM pagerank_out_summary;
__iterations__ ----------------+ 16 (1 row)
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary; SELECT madlib.pagerank( 'vertex', -- Vertex table 'node_id', -- Vertex id column 'edge', -- Edge table 'src=conn_src, dest=conn_dest', -- Comma delimted string of edge arguments 'pagerank_out', -- Output table of PageRank 0.5); -- Damping factor SELECT * FROM pagerank_out ORDER BY pagerank DESC;
node_id | pagerank ---------+-------------------- 0 | 0.225477161441199 3 | 0.199090328586664 2 | 0.136261327206477 6 | 0.132691559968224 4 | 0.109009291409508 1 | 0.109009291409508 5 | 0.0884610399788161 (7 rows)
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary; SELECT madlib.pagerank( 'vertex', -- Vertex table 'node_id', -- Vertex id column 'edge', -- Edge table 'src=conn_src, dest=conn_dest', -- Comma delimted string of edge arguments 'pagerank_out', -- Output table of PageRank NULL, -- Default damping factor (0.85) NULL, -- Default max iters (100) 0.00000001, -- Threshold 'user_id'); -- Grouping column name SELECT * FROM pagerank_out ORDER BY user_id, pagerank DESC;
user_id | node_id | pagerank ---------+---------+-------------------- 1 | 0 | 0.27825488388552 1 | 3 | 0.20188114667075 1 | 2 | 0.14288112346059 1 | 6 | 0.11453637832147 1 | 1 | 0.10026745615438 1 | 4 | 0.10026745615438 1 | 5 | 0.06191155535288 2 | 0 | 0.31854625004173 2 | 3 | 0.23786686773343 2 | 2 | 0.15914876489397 2 | 1 | 0.11168334437971 2 | 4 | 0.11168334437971 2 | 6 | 0.03964285714285 2 | 5 | 0.02142857142857 (14 rows)
SELECT * FROM pagerank_out_summary ORDER BY user_id;
user_id | __iterations__ ---------+---------------- 1 | 27 2 | 31 (2 rows)
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary; SELECT madlib.pagerank( 'vertex', -- Vertex table 'node_id', -- Vertex id column 'edge', -- Edge table 'src=conn_src, dest=conn_dest', -- Comma delimted string of edge arguments 'pagerank_out', -- Output table of PageRank NULL, -- Default damping factor (0.85) NULL, -- Default max iters (100) NULL, -- Default Threshold NULL, -- No Grouping '{2,4}'); -- Personalization vertices SELECT * FROM pagerank_out ORDER BY pagerank DESC;
node_id | pagerank ---------+-------------------- 0 | 0.565232961966315 2 | 0.378139420991773 3 | 0.355003292266017 4 | 0.310111215897626 1 | 0.160111215897626 6 | 0.148615315574136 5 | 0.0803403307142321 (7 rows)
SELECT * FROM pagerank_out_summary;
__iterations__ ----------------+ 37 (1 row)
DROP TABLE IF EXISTS vertex_multicol_pagerank, edge_multicol_pagerank; CREATE TABLE vertex_multicol_pagerank( node_id_major BIGINT, node_id_minor BIGINT ); CREATE TABLE edge_multicol_pagerank( conn_src_major BIGINT, conn_dest_major BIGINT, user_id_major BIGINT, conn_src_minor BIGINT, conn_dest_minor BIGINT, user_id_minor BIGINT ); INSERT INTO vertex_multicol_pagerank VALUES (0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6); INSERT INTO edge_multicol_pagerank VALUES (0, 1, 1, 0, 1, 1), (0, 2, 1, 0, 2, 1), (0, 4, 1, 0, 4, 1), (1, 2, 1, 1, 2, 1), (1, 3, 1, 1, 3, 1), (2, 3, 1, 2, 3, 1), (2, 5, 1, 2, 5, 1), (2, 6, 1, 2, 6, 1), (3, 0, 1, 3, 0, 1), (4, 0, 1, 4, 0, 1), (5, 6, 1, 5, 6, 1), (6, 3, 1, 6, 3, 1), (0, 1, 2, 0, 1, 2), (0, 2, 2, 0, 2, 2), (0, 4, 2, 0, 4, 2), (1, 2, 2, 1, 2, 2), (1, 3, 2, 1, 3, 2), (2, 3, 2, 2, 3, 2), (3, 0, 2, 3, 0, 2), (4, 0, 2, 4, 0, 2), (5, 6, 2, 5, 6, 2), (6, 3, 2, 6, 3, 2);
DROP TABLE IF EXISTS pagerank_multicol_out, pagerank_multicol_out_summary; SELECT madlib.pagerank( 'vertex_multicol_pagerank', -- Vertex table '[node_id_major,node_id_minor]', -- Vertex id column 'edge_multicol_pagerank', -- Edge table 'src=[conn_src_major,conn_src_minor], dest=[conn_dest_major,conn_dest_minor]', -- Comma delimted string of edge arguments 'pagerank_multicol_out', -- Output table of PageRank NULL, -- Default damping factor (0.85) NULL, -- Default max iters (100) NULL, -- Default Threshold 'user_id_major,user_id_minor', -- Grouping Columns '{{2,2},{4,4}}'); -- Personalization vertices SELECT * FROM pagerank_multicol_out ORDER BY pagerank DESC;
user_id_major | user_id_minor | id | pagerank ---------------+---------------+-------+-------------------- 2 | 2 | {0,0} | 0.448826703440932 2 | 2 | {3,3} | 0.325943770128465 1 | 1 | {0,0} | 0.270635964385879 2 | 2 | {2,2} | 0.256179815391031 2 | 2 | {4,4} | 0.202149921235622 1 | 1 | {2,2} | 0.18423239851445 1 | 1 | {3,3} | 0.166801820206414 1 | 1 | {4,4} | 0.151661035568349 2 | 2 | {1,1} | 0.127149921235622 1 | 1 | {6,6} | 0.0965411872854988 1 | 1 | {1,1} | 0.0766610355683489 2 | 2 | {5,5} | 0.075 2 | 2 | {6,6} | 0.06375 1 | 1 | {5,5} | 0.0521896024086525 (7 rows)
SELECT * FROM pagerank_multicol_out_summary;
user_id_major | user_id_minor | __iterations__ ---------------+---------------+---------------- 2 | 2 | 45 1 | 1 | 41 (1 row)
[1] Brin, S. and Page, L. (1998), "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems. 30: 107–117, http://infolab.stanford.edu/pub/papers/google.pdf
[2] Jeh, Glen and Widom, Jennifer. "Scaling Personalized Web Search", Proceedings of the 12th international conference on World Wide Web, Pages 271-279 Budapest, Hungary, May 20-24, 2003, http://ilpubs.stanford.edu:8090/530/1/2002-12.pdf