1.9.1
User Documentation for MADlib

This module implements the association rules data mining technique on a transactional data set. Given the name of a table, the names of its transaction id and item columns, and minimum support and confidence values, the function generates all single and multidimensional association rules that meet the minimum thresholds.

Association rule mining is a widely used technique for discovering relationships between variables in a large data set (e.g., items in a store that are commonly purchased together). The classic market basket analysis example using association rules is the "beer and diapers" rule. According to data mining urban legend, a study of customers' purchase behavior in a supermarket found that men often purchased beer and diapers together. After making this discovery, the managers strategically placed beer and diapers closer together on the shelves and saw a dramatic increase in sales. In addition to market basket analysis, association rules are also used in bioinformatics, web analytics, and several other fields.

This type of data mining algorithm uses transactional data. Every transaction has a unique identifier and consists of a set of items (an itemset). Purchases are treated as binary (an item was either purchased or not); this implementation does not take the quantity of each item into account. The MADlib association rules function assumes that the data is stored in two columns, with one transaction id and one item per row. A transaction with multiple items spans multiple rows, one row per item.

     tran_id | product
    ---------+---------
           1 | 1
           1 | 2
           1 | 3
           1 | 4
           2 | 3
           2 | 4
           2 | 5
           3 | 1
           3 | 4
           3 | 6
    ...
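
To view each transaction as a single itemset, the rows can be grouped by transaction id. The following is a minimal sketch in plain PostgreSQL (9.0 or later for the ordered aggregate) and is not part of MADlib. The CTE name transactions is hypothetical and only holds the ten rows shown above. DISTINCT mirrors the binary treatment of purchases: a repeated item would collapse to a single entry.

```sql
-- Hypothetical CTE holding the ten sample rows from the table above.
WITH transactions (tran_id, product) AS (
    VALUES (1, 1), (1, 2), (1, 3), (1, 4),
           (2, 3), (2, 4), (2, 5),
           (3, 1), (3, 4), (3, 6)
)
-- Collapse the one-row-per-item layout into one itemset per transaction.
SELECT tran_id,
       array_agg(DISTINCT product ORDER BY product) AS itemset
FROM transactions
GROUP BY tran_id
ORDER BY tran_id;
-- tran_id 1 -> {1,2,3,4}, tran_id 2 -> {3,4,5}, tran_id 3 -> {1,4,6}
```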

Rules

Association rules take the form "If X, then Y", where X and Y are non-empty itemsets. X and Y are called the antecedent and consequent, or the left-hand side and right-hand side, of the rule, respectively. Using our previous example, an association rule may state "If {diapers}, then {beer}" with .2 support and .85 confidence.

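Given any association rule "If X, then Y", the association rules function will also calculate the following metrics. Here $ s(X) $ denotes the support of an itemset X, that is, the fraction of all transactions that contain every item in X.

Support

The fraction of all transactions that contain every item in both X and Y: $ s(X \cup Y) $.

Confidence

The fraction of transactions containing X that also contain Y: $ c(X \Rightarrow Y) = \frac{s(X \cup Y)}{s(X)} $.

Lift

The ratio of the observed support of X and Y together to the support expected if X and Y were independent: $ l(X \Rightarrow Y) = \frac{s(X \cup Y)}{s(X) \cdot s(Y)} = \frac{c(X \Rightarrow Y)}{s(Y)} $.

Conviction

The ratio of the expected frequency of X occurring without Y, if X and Y were independent, to the observed frequency of X occurring without Y: $ conv(X \Rightarrow Y) = \frac{1 - s(Y)}{1 - c(X \Rightarrow Y)} $.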

Apriori Algorithm

Although many algorithms can generate association rules, the classic one is Apriori, which this module implements. It is a breadth-first search, as opposed to depth-first searches like Eclat. Frequent itemsets of order $ n $ are generated from sets of order $ n - 1 $. By the downward closure property, every subset of a frequent itemset must itself be frequent, so a candidate with an infrequent subset can be discarded. The algorithm has two steps: generating frequent itemsets, and using these itemsets to construct the association rules. A simplified version of the algorithm follows; it assumes that minimum levels of support and confidence are provided:

Initial step

  1. Generate all itemsets of order 1
  2. Eliminate itemsets whose support is less than the minimum support

Main algorithm

  1. For $ n \ge 2 $, generate itemsets of order $ n $ by combining the itemsets of order $ n - 1 $: take the union of two itemsets that are identical except for one item.
  2. Eliminate itemsets that have a subset of order $ n - 1 $ with insufficient support
  3. Eliminate itemsets with insufficient support
  4. Repeat until no new itemsets can be generated
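
The initial step and the first pass of the main algorithm (order 2) can be sketched in plain SQL. This is only an illustration of the idea and makes no claim about how the module works internally. The CTE names (baskets, n_trans, frequent_1, frequent_2) are hypothetical, the data is inlined so that the query is self-contained, and the minimum support of 0.25 is an arbitrary choice.

```sql
-- Hypothetical inline data: one (transaction id, item) pair per row.
WITH baskets (trans_id, product) AS (
    VALUES (1, 'beer'), (1, 'diapers'), (1, 'chips'),
           (2, 'beer'), (2, 'diapers'),
           (3, 'beer'), (3, 'diapers'),
           (4, 'beer'), (4, 'chips'),
           (5, 'beer'),
           (6, 'beer'), (6, 'diapers'), (6, 'chips'),
           (7, 'beer'), (7, 'diapers')
),
n_trans AS (
    SELECT count(DISTINCT trans_id)::float8 AS n FROM baskets
),
-- Initial step: itemsets of order 1 that meet the minimum support.
frequent_1 AS (
    SELECT product, count(DISTINCT trans_id) / n AS support
    FROM baskets CROSS JOIN n_trans
    GROUP BY product, n
    HAVING count(DISTINCT trans_id) / n >= 0.25
),
-- Main algorithm for n = 2: pair up frequent single items, count the
-- transactions that contain both, and drop pairs below the threshold.
frequent_2 AS (
    SELECT a.product AS item_a,
           b.product AS item_b,
           count(DISTINCT a.trans_id) / n AS support
    FROM baskets a
    JOIN baskets b
      ON a.trans_id = b.trans_id
     AND a.product < b.product          -- each unordered pair only once
    CROSS JOIN n_trans
    WHERE a.product IN (SELECT product FROM frequent_1)
      AND b.product IN (SELECT product FROM frequent_1)
    GROUP BY a.product, b.product, n
    HAVING count(DISTINCT a.trans_id) / n >= 0.25
)
SELECT * FROM frequent_2
ORDER BY support DESC, item_a, item_b;
```

On this data all three pairs clear the threshold, with supports 5/7, 3/7, and 2/7. Higher orders repeat the same pattern: join the surviving itemsets, prune, and count again.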

Association rule generation

Given a frequent itemset $ A $ generated by the Apriori algorithm, we consider each non-empty proper subset $ B $ of $ A $ and generate the rule $ B \Rightarrow (A - B) $ whenever it meets the minimum confidence requirement.
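
For itemsets of order 2, rule generation can be sketched as follows. This is an assumption-laden illustration, not the module's implementation. The CTE names (item_support, pair_support, candidate_rules) are hypothetical, the support values are inlined by hand from a seven-transaction data set, and the minimum confidence of 0.5 is an arbitrary choice. The confidence of $ B \Rightarrow (A - B) $ is taken to be the support of $ A $ divided by the support of $ B $.

```sql
-- Hypothetical, hand-entered supports of frequent itemsets (7 transactions).
WITH item_support (item, support) AS (
    VALUES ('beer', 7/7.0), ('chips', 3/7.0), ('diapers', 5/7.0)
),
pair_support (item_a, item_b, support) AS (
    VALUES ('beer',  'chips',   3/7.0),
           ('beer',  'diapers', 5/7.0),
           ('chips', 'diapers', 2/7.0)
),
-- Each frequent pair A = {a, b} yields two candidate rules: a => b and b => a.
candidate_rules AS (
    SELECT item_a AS pre, item_b AS post, support FROM pair_support
    UNION ALL
    SELECT item_b, item_a, support FROM pair_support
)
-- confidence(B => A - B) = support(A) / support(B); keep rules >= 0.5.
SELECT r.pre,
       r.post,
       r.support,
       r.support / s.support AS confidence
FROM candidate_rules r
JOIN item_support s ON s.item = r.pre
WHERE r.support / s.support >= 0.5
ORDER BY confidence DESC, pre, post;
```

With these inputs four of the six candidate rules survive; beer => chips (confidence 3/7) and diapers => chips (confidence 2/5) fall below the threshold.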

Function Syntax
The association rules function can be called with the following syntax.
assoc_rules( support,
             confidence,
             tid_col,
             item_col,
             input_table,
             output_schema,
             verbose
           );
This generates all association rules that satisfy the specified minimum support and confidence.

Arguments

support

The minimum level of support needed for each itemset to be included in result.

confidence

The minimum level of confidence needed for each rule to be included in result.

tid_col

The name of the column storing the transaction ids.

item_col

The name of the column storing the products.

input_table

The name of the table containing the input data.

The input data is expected to be of the following form:

{TABLE|VIEW} input_table (
    trans_id INTEGER,
    product TEXT
)

The algorithm maps the product names to consecutive integer ids starting at 1. If they are already structured this way, then the ids will not change.

output_schema

The name of the schema where the final results will be stored. The schema must be created before the function is called. If NULL is passed, the current schema is used.

The results containing the rules, support, confidence, lift, and conviction are stored in the table assoc_rules in the schema specified by output_schema.

The table has the following columns.

ruleid integer
pre text
post text
support double
confidence double
lift double
conviction double

On Greenplum Database the table is distributed by the ruleid column.

The pre and post columns are the itemsets on the left-hand and right-hand sides of the association rule, respectively. The support, confidence, lift, and conviction columns hold the metrics described in the Rules section.

verbose

BOOLEAN, default FALSE. Determines whether the function prints comments on its progress while it runs.

Examples

Let us take a look at some sample transactional data and generate association rules.

  1. Create an input dataset.
    DROP TABLE IF EXISTS test_data;
    CREATE TABLE test_data (
        trans_id INT,
        product TEXT
    );
    INSERT INTO test_data VALUES (1, 'beer');
    INSERT INTO test_data VALUES (1, 'diapers');
    INSERT INTO test_data VALUES (1, 'chips');
    INSERT INTO test_data VALUES (2, 'beer');
    INSERT INTO test_data VALUES (2, 'diapers');
    INSERT INTO test_data VALUES (3, 'beer');
    INSERT INTO test_data VALUES (3, 'diapers');
    INSERT INTO test_data VALUES (4, 'beer');
    INSERT INTO test_data VALUES (4, 'chips');
    INSERT INTO test_data VALUES (5, 'beer');
    INSERT INTO test_data VALUES (6, 'beer');
    INSERT INTO test_data VALUES (6, 'diapers');
    INSERT INTO test_data VALUES (6, 'chips');
    INSERT INTO test_data VALUES (7, 'beer');
    INSERT INTO test_data VALUES (7, 'diapers');
    
  2. Let $ min(support) = .25 $ and $ min(confidence) = .5 $, and let the output schema be 'myschema', which must already exist. For this example, we set verbose to TRUE to get some insight into the progress of the function. We can now generate association rules as follows:
    SELECT * FROM madlib.assoc_rules( .25,
                                      .5,
                                      'trans_id',
                                      'product',
                                      'test_data',
                                      'myschema',
                                      TRUE
                                    );
    
    Result:
     output_schema | output_table | total_rules | total_time
    ---------------+--------------+-------------+-----------------
     myschema      | assoc_rules  |           7 | 00:00:03.162094
    (1 row)
    
    The association rules are stored in the myschema.assoc_rules table:
    SELECT * FROM myschema.assoc_rules
    ORDER BY support DESC;
    
    Result:
     ruleid |       pre       |      post      |      support      |    confidence     |       lift        |    conviction
    --------+-----------------+----------------+-------------------+-------------------+-------------------+-------------------
          4 | {diapers}       | {beer}         | 0.714285714285714 |                 1 |                 1 |                 0
          2 | {beer}          | {diapers}      | 0.714285714285714 | 0.714285714285714 |                 1 |                 1
          1 | {chips}         | {beer}         | 0.428571428571429 |                 1 |                 1 |                 0
          5 | {chips}         | {beer,diapers} | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
          6 | {chips,beer}    | {diapers}      | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
          7 | {chips,diapers} | {beer}         | 0.285714285714286 |                 1 |                 1 |                 0
          3 | {chips}         | {diapers}      | 0.285714285714286 | 0.666666666666667 | 0.933333333333333 | 0.857142857142857
    (7 rows)
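
The metrics in a single row can be checked by hand against test_data. The following sketch recomputes them for the rule {chips} => {diapers} using the standard definitions of support, confidence, lift, and conviction; it is an independent cross-check in plain SQL, not the code the function runs. The CTE names are hypothetical. NULLIF guards against division by zero, so for a rule with confidence 1 this query returns NULL for conviction where the table above shows 0.

```sql
WITH n AS (
    SELECT count(DISTINCT trans_id)::float8 AS total FROM test_data
),
x AS (   -- transactions containing the left-hand side
    SELECT DISTINCT trans_id FROM test_data WHERE product = 'chips'
),
y AS (   -- transactions containing the right-hand side
    SELECT DISTINCT trans_id FROM test_data WHERE product = 'diapers'
),
xy AS (  -- transactions containing both sides
    SELECT trans_id FROM x
    INTERSECT
    SELECT trans_id FROM y
),
s AS (
    SELECT (SELECT count(*) FROM x)  / total AS s_x,
           (SELECT count(*) FROM y)  / total AS s_y,
           (SELECT count(*) FROM xy) / total AS s_xy
    FROM n
)
SELECT s_xy                                     AS support,
       s_xy / s_x                               AS confidence,
       s_xy / (s_x * s_y)                       AS lift,
       (1 - s_y) / NULLIF(1 - s_xy / s_x, 0)    AS conviction
FROM s;
```

The result should agree with the row for {chips} => {diapers} in the table above, up to floating-point rounding.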
    

Notes

The association rules function always creates a table named assoc_rules. Make a copy of this table before running the function again if you would like to keep multiple association rule tables.
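
One way to keep the results of a run is to copy the table under a different name. A minimal sketch, assuming the results are in myschema as in the example above; the target name assoc_rules_run1 is hypothetical:

```sql
-- Copy the current results before calling assoc_rules again.
-- assoc_rules_run1 is a hypothetical name; any unused table name works.
CREATE TABLE myschema.assoc_rules_run1 AS
SELECT * FROM myschema.assoc_rules;
```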

Related Topics

File assoc_rules.sql_in documenting the SQL function.