User Documentation for Apache MADlib

MFVSketch: Most Frequent Values variant of CountMin sketch, implemented as a UDA.

Produces an n-bucket histogram for a column where each bucket counts one of the most frequent values in the column. The output is an array of doubles {value, count} in descending order of frequency; counts are approximated via CountMin sketches. Ties are handled arbitrarily.

The MFV frequent-value UDA comes in two different versions:

In PostgreSQL the two UDAs are identical. In Greenplum, the quick version should produce good results unless the number of values requested is small, or the distribution is flat.

This is a User Defined Aggregate which returns the results when used in a query. Use "CREATE TABLE AS ", with the UDA as subquery if the results are to be stored. This is unlike the usual MADlib stored procedure interface which places the results in a table instead of returning it.

  1. Generate some data.
    CREATE TABLE data(class INT, a1 INT);
    INSERT INTO data SELECT 1,1 FROM generate_series(1,10000);
    INSERT INTO data SELECT 1,2 FROM generate_series(1,15000);
    INSERT INTO data SELECT 1,3 FROM generate_series(1,10000);
    INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
    INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
  2. Produce a histogram of 5 bins and return the most frequent value and associated count in each bin.
    SELECT mfvsketch_top_histogram( a1, 5 )
    FROM data;
    (1 row)

This method is not usually called an MFV sketch in the literature; it is a natural extension of the CountMin sketch.

Related Topics

File sketch.sql_in documenting the SQL functions.

Module CountMin (Cormode-Muthukrishnan).