2.1.0
User Documentation for Apache MADlib
Weakly Connected Components

Given a directed graph, a weakly connected component (WCC) is a subgraph of the original graph where all vertices are connected to each other by some path, ignoring the direction of edges. In case of an undirected graph, a weakly connected component is also a strongly connected component. This module also includes a number of helper functions that operate on the WCC output.

Weakly Connected Components
weakly_connected_components( vertex_table,
            vertex_id,
            edge_table,
            edge_args,
            out_table,
            grouping_cols,
            iteration_limit,
            warm_start
          )

Arguments

vertex_table

TEXT. Name of the table containing the vertex data for the graph. Must contain the column specified in the 'vertex_id' parameter below.

vertex_id

TEXT, default = 'id'. Name of the column(s) in 'vertex_table' containing vertex ids. The vertex ids can be of type INTEGER or BIGINT with no duplicates. They do not need to be contiguous. If multiple columns are used as vertex ids, they are passed in the following format: [<vertex_id1>,<vertex_id2>,...]

edge_table

TEXT. Name of the table containing the edge data. The edge table must contain columns for source vertex and destination vertex.

edge_args

TEXT. A comma-delimited string containing multiple named arguments of the form "name=value". The following parameters are supported for this string argument:

  • src (INTEGER or BIGINT): Name of the column(s) containing the source vertex ids in the edge table. Default column name is 'src'.
  • dest (INTEGER or BIGINT): Name of the column(s) containing the destination vertex ids in the edge table. Default column name is 'dest'.

out_table

TEXT. Name of the table to store the component ID associated with each vertex. It will contain a row for every vertex from 'vertex_table' with the following columns:

  • vertex_id : The id of a vertex. Will use the input parameter 'vertex_id' for column naming. If multiple columns are used for identifying vertices, this column will be an array named "id".
  • component_id : Component that the vertex belongs to. We use the convention where 'component_id' is the id of the first vertex in a particular group. It means that component ids are generally not contiguous.
  • grouping_cols : Grouping column (if any) values associated with the vertex_id.

A summary table named <out_table>_summary is also created. This is an internal table that keeps a record of some of the input parameters and is used by the weakly connected component helper functions.

grouping_cols (optional)
TEXT, default: NULL. A single column or a list of comma-separated columns that divides the input data into discrete groups, which are treated independently as separate graphs. When this value is NULL, no grouping is used and weakly connected components are generated for all data (single graph).
Note
Expressions are not currently supported for 'grouping_cols'.
iteration_limit (optional)

INTEGER, default: NULL. Maximum number of iterations to run wcc. This parameter is used to stop wcc early to limit the number of subtransactions created by wcc. For such subtx issues, it is advised to set this parameter to

  1. A wcc run that stopped early by this parameter can resume its progress by using the warm_start parameter. An additional table named <out_table>_message is also created. This table is necessary in case the iteration_limit is reached and there are still vertices to update. It gets used when the wcc continues the process via warm_start and gets dropped when the wcc determines there are no more updates necessary. The user might determine if the wcc is completed or not by checking the nodes_to_update column of <out_table>_summary table where 0 means wcc is complete.

warm_start (optional)

BOOLEAN, default: NULL. If True, wcc will look for the <out_table>_message table and continue using it and the partial output from <out_table> to continue the wcc process.

Note
On a Greenplum cluster, the edge table should be distributed by the source vertex id column for better performance. In addition, the user should note that this function creates a duplicate of the edge table (on Greenplum cluster) for better performance.

Retrieve Largest Connected Component

The largest connected component retrieval function finds the largest weakly connected component(s) in a graph. If weakly connected components was run with grouping, the largest connected components are computed for each group.

graph_wcc_largest_cpt( wcc_table,
                      largest_cpt_table
                     )

Arguments

wcc_table

TEXT. Name of the table that contains the output of weakly connected components.

largest_cpt_table
TEXT. Name of the output table that contains the largest component's information. It contains one or more rows for every group and has the following columns:
  • grouping_cols: The grouping columns given in the creation of wcc_table. If there are no grouping columns, this column is not created.
  • component_id: The ID of the largest component. Recall that we use the convention where 'component_id' is the id of the first vertex in a particular group. It means that component ids are generally not contiguous. If there are multiple components of the same size, a row is created for each component. If grouping_cols is specified, the largest component is computed for each group.
  • num_vertices: Number of vertices in the largest component.

Retrieve Histogram of Vertices Per Connected Component

This function creates a histogram of the number of vertices per connected component.

graph_wcc_histogram( wcc_table,
                    histogram_table
                   )

Arguments

wcc_table

TEXT. Name of the table that contains the output of weakly connected components.

histogram_table

TEXT. Name of the output table that contains the number of vertices per component. A row is created for every comoponent in every group if grouping_cols was specified when running weakly connected components. The output table has the following columns:

  • grouping_cols: The grouping columns given during the creation of the wcc_table. If there are no grouping columns, this column is not created.
  • component_id: The ID of the component.
  • num_vertices: Number of vertices in the component specified by the component_id column.

Check if Two Vertices Belong to the Same Component

This function determines if two vertices belong to the same component.

graph_wcc_vertex_check( wcc_table,
                       vertex_pair,
                       pair_table
                      )

Arguments

wcc_table

TEXT. Name of the table that contains the output of weakly connected components.

vertex_pair

BIGINT[]. A pair of vertex IDs separated by a comma. If multiple columns are used for identifying vertices, a 2D array will be required for this parameter.

pair_table

TEXT. Name of the output table that specifies if the two vertices in vertex_pair belong to the same component. If wcc_table was generated using grouping_cols, all the components in all groups are considered. The output table has the following columns:

  • component_id: Component ID that contains both the vertices in vertex_pair.
  • grouping_cols: The grouping columns given in the creation of wcc_table. If there are no grouping columns, this column is not created.

Retrieve All Vertices Reachable from a Vertex

This function finds all the vertices that can be reached from a given vertex via weakly connected paths.

graph_wcc_reachable_vertices( wcc_table,
                             src,
                             reachable_vertices_table
                            )

Arguments

wcc_table

TEXT. Name of the table that contains the output of weakly connected components.

src

BIGINT or BIGINT[]. The vertex ID from which all reachable vertices have to be found.

reachable_vertices_table

TEXT. Name of the output table that contains the list of vertices that are reachable from the src vertex. The output table has the following columns:

  • grouping_cols : The grouping columns given in the creation of wcc_table. If there are no grouping columns, this column is not created.
  • component_id : The ID of the component that both the src and dest vertices belong to.
  • dest : Vertex ID that is reachable from the src vertex. Reachability is computed with regard to a component.

Count of Connected Components

This function finds the total number of components in the input graph.

graph_wcc_num_cpts( wcc_table,
                   count_table
                  )

Arguments

wcc_table

TEXT. Name of the table that contains the output of weakly connected components.

count_table

TEXT. Name of the output table that contains the total number of components per group in the graph, if there are any grouping_cols in wcc_table. The output table has the following columns:

  • grouping_cols : The grouping columns given in the creation of wcc_table. If there are no grouping columns, this column is not created, and count is with regard to the entire graph.
  • num_components : Count of weakly connected components in a graph, or the number of components within a group if grouping_cols is defined.

Examples

Download the example sql file here.

  1. Create vertex and edge tables to represent the graph:
    DROP TABLE IF EXISTS vertex, edge;
    CREATE TABLE vertex(
        node_id INTEGER
    );
    CREATE TABLE edge(
        conn_src INTEGER,
        conn_dest INTEGER,
        user_id INTEGER
    );
    INSERT INTO vertex VALUES
    (0),
    (1),
    (2),
    (3),
    (4),
    (5),
    (6),
    (10),
    (11),
    (12),
    (13),
    (14),
    (15),
    (16);
    INSERT INTO edge VALUES
    (0, 1, 1),
    (0, 2, 1),
    (1, 2, 1),
    (1, 3, 1),
    (2, 3, 1),
    (2, 5, 1),
    (2, 6, 1),
    (3, 0, 1),
    (5, 6, 1),
    (6, 3, 1),
    (10, 11, 2),
    (10, 12, 2),
    (11, 12, 2),
    (11, 13, 2),
    (12, 13, 2),
    (13, 10, 2),
    (15, 16, 2),
    (15, 14, 2);
    
  2. Find all the weakly connected components in the graph:
    DROP TABLE IF EXISTS wcc_out, wcc_out_summary;
    SELECT madlib.weakly_connected_components(
        'vertex',                        -- Vertex table
        'node_id',                       -- Vertex id column
        'edge',                          -- Edge table
        'src=conn_src, dest=conn_dest',  -- Comma delimted string of edge arguments
        'wcc_out');                      -- Output table of weakly connected components
    SELECT * FROM wcc_out ORDER BY component_id, node_id;
    
     node_id | component_id
    ---------+--------------
           0 |            0
           1 |            0
           2 |            0
           3 |            0
           5 |            0
           6 |            0
           4 |            4
          10 |           10
          11 |           10
          12 |           10
          13 |           10
          14 |           14
          15 |           14
          16 |           14
    (14 rows)
    
  3. Now get the weakly connected components associated with each 'user_id' using the grouping feature:
    DROP TABLE IF EXISTS wcc_out, wcc_out_summary;
    SELECT madlib.weakly_connected_components(
        'vertex',                       -- Vertex table
        'node_id',                      -- Vertex id column
        'edge',                         -- Edge table
        'src=conn_src, dest=conn_dest', -- Comma delimted string of edge arguments
        'wcc_out',                      -- Output table of weakly connected components
        'user_id');                     -- Grouping column name
    SELECT * FROM wcc_out ORDER BY user_id, component_id, node_id;
    
     node_id | component_id | user_id
    ---------+--------------+---------
           0 |            0 |       1
           1 |            0 |       1
           2 |            0 |       1
           3 |            0 |       1
           5 |            0 |       1
           6 |            0 |       1
          10 |           10 |       2
          11 |           10 |       2
          12 |           10 |       2
          13 |           10 |       2
          14 |           14 |       2
          15 |           14 |       2
          16 |           14 |       2
    (13 rows)
    
    Note that vertex 4 is not identified as a separate component above. This is because there is no entry in the edge table for vertex 4 indicating which group it belongs to (though you could do that if you wanted to).
  4. Retrieve the largest connected component:
    DROP TABLE IF EXISTS largest_cpt_table;
    SELECT madlib.graph_wcc_largest_cpt(
                             'wcc_out',             -- WCC output table
                             'largest_cpt_table');  -- output table containing largest component ID
    SELECT * FROM largest_cpt_table ORDER BY component_id;
    
     user_id | component_id | num_vertices
    ---------+--------------+--------------
           1 |            0 |            6
           2 |           10 |            4
    (2 rows)
    
  5. Retrieve histogram of the number of vertices per connected component:
    DROP TABLE IF EXISTS histogram_table;
    SELECT madlib.graph_wcc_histogram(
                             'wcc_out',           -- WCC output table
                             'histogram_table');  -- output table containing the histogram of vertices
    SELECT * FROM histogram_table ORDER BY component_id;
    
     user_id | component_id | num_vertices
    ---------+--------------+--------------
           1 |            0 |            6
           2 |           10 |            4
           2 |           14 |            3
    (3 rows)
    
  6. Check if two vertices belong to the same component:
    DROP TABLE IF EXISTS vc_table;
    SELECT madlib.graph_wcc_vertex_check(
                             'wcc_out',    -- WCC output table
                             '14,15',      -- Pair of vertex IDs
                             'vc_table');  -- output table containing components that contain the two vertices
    SELECT * FROM vc_table ORDER BY component_id;
    
     user_id | component_id
    ---------+--------------
           2 |           14
    (1 row)
    
  7. Retrieve all vertices reachable from a vertex
    DROP TABLE IF EXISTS reach_table;
    SELECT madlib.graph_wcc_reachable_vertices(
                             'wcc_out',         -- WCC output table
                             '0',               -- source vertex
                             'reach_table');    -- output table containing all vertices reachable from source vertex
    SELECT * FROM reach_table ORDER BY component_id, dest;
    
     user_id | component_id | dest
    ---------+--------------+------
           1 |            0 |    1
           1 |            0 |    2
           1 |            0 |    3
           1 |            0 |    5
           1 |            0 |    6
    (5 rows)
    
  8. Count of connected components:
    DROP TABLE IF EXISTS count_table;
    SELECT madlib.graph_wcc_num_cpts(
                             'wcc_out',       -- WCC output table
                             'count_table');  -- output table containing number of components per group
    SELECT * FROM count_table;
    
     user_id | num_components
    ---------+----------------
           1 |              1
           2 |              2
    (2 rows)
    
  9. Create vertex and edge tables with multiple column ids to represent the graph:
    DROP TABLE IF EXISTS vertex_multicol_wcc, edge_multicol_wcc;
    CREATE TABLE vertex_multicol_wcc(
        node_id_major BIGINT,
        node_id_minor BIGINT
    );
    CREATE TABLE edge_multicol_wcc(
        conn_src_major BIGINT,
        conn_dest_major BIGINT,
        user_id_major BIGINT,
        conn_src_minor BIGINT,
        conn_dest_minor BIGINT,
        user_id_minor BIGINT
    );
    INSERT INTO vertex_multicol_wcc VALUES
    (0, 0),
    (1, 1),
    (2, 2),
    (3, 3),
    (4, 4),
    (5, 5),
    (6, 6);
    INSERT INTO edge_multicol_wcc VALUES
    (0, 1, 1, 0, 1, 1),
    (0, 2, 1, 0, 2, 1),
    (0, 4, 1, 0, 4, 1),
    (1, 2, 1, 1, 2, 1),
    (1, 3, 1, 1, 3, 1),
    (2, 3, 1, 2, 3, 1),
    (2, 5, 1, 2, 5, 1),
    (2, 6, 1, 2, 6, 1),
    (3, 0, 1, 3, 0, 1),
    (4, 0, 1, 4, 0, 1),
    (5, 6, 1, 5, 6, 1),
    (6, 3, 1, 6, 3, 1),
    (0, 1, 2, 0, 1, 2),
    (0, 2, 2, 0, 2, 2),
    (0, 4, 2, 0, 4, 2),
    (1, 2, 2, 1, 2, 2),
    (1, 3, 2, 1, 3, 2),
    (2, 3, 2, 2, 3, 2),
    (3, 0, 2, 3, 0, 2),
    (4, 0, 2, 4, 0, 2),
    (5, 6, 2, 5, 6, 2),
    (6, 3, 2, 6, 3, 2);
    
  10. Find all the weakly connected components in the graph:
    DROP TABLE IF EXISTS wcc_multicol_out, wcc_multicol_out_summary;
    SELECT madlib.weakly_connected_components(
        'vertex_multicol_wcc',                                                          -- Vertex table
        '[node_id_major,node_id_minor]',                                                -- Vertex id column
        'edge_multicol_wcc',                                                            -- Edge table
        'src=[conn_src_major,conn_src_minor], dest=[conn_dest_major,conn_dest_minor]',  -- Comma delimted string of edge arguments
        'wcc_multicol_out',                                                             -- Output table of weakly connected components
        'user_id_major,user_id_minor');                                                 -- Grouping column name
    SELECT * FROM wcc_multicol_out ORDER BY user_id_major, user_id_minor, component_id, id;
    
      id   | component_id | user_id_major | user_id_minor
    -------+--------------+---------------+---------------
     {0,0} |            3 |             1 |             1
     {1,1} |            3 |             1 |             1
     {2,2} |            3 |             1 |             1
     {3,3} |            3 |             1 |             1
     {4,4} |            3 |             1 |             1
     {5,5} |            3 |             1 |             1
     {6,6} |            3 |             1 |             1
     {0,0} |            3 |             2 |             2
     {1,1} |            3 |             2 |             2
     {2,2} |            3 |             2 |             2
     {3,3} |            3 |             2 |             2
     {4,4} |            3 |             2 |             2
     {5,5} |            3 |             2 |             2
     {6,6} |            3 |             2 |             2
    (14 rows)
    

Notes
  1. On a Greenplum cluster, the edge table should be distributed by the source vertex id column for better performance.