1.10.0
User Documentation for MADlib

The MADlib summary() function produces summary statistics for any data table. The function invokes various methods from the MADlib library to provide the data overview.

Summary Function Syntax
The summary() function has the following syntax:
summary ( source_table,
          output_table,
          target_cols,
          grouping_cols,
          get_distinct,
          get_quartiles,
          ntile_array,
          how_many_mfv,
          get_estimates
        )

The summary() function returns a composite type containing three fields:

output_table TEXT. The name of the output table.
row_count INTEGER. The number of rows in the output table.
duration FLOAT8. The time taken (in seconds) to compute the summary.

Arguments

source_table
TEXT. The name of the table containing the input data.
output_table

TEXT. The name of the table to contain the output summary data.

Summary statistics are saved in a table with the name specifed in the output_table argument. The table contains the following columns:

group_by Group-by column name. NULL if none provided.
group_by_value Value of the Group-by column. NULL if there is no grouping.
target_column Targeted column values for which summary is requested.
column_number Physical column number for the target column, as described in pg_attribute catalog.
data_type Data type of the target column. Standard GPDB type descriptors are displayed.
row_count Number of rows for the target column.
distinct_values Number of distinct values in the target column. When the summary() function is called with the get_estimates argument set to TRUE, this is an estimated statistic based on the Flajolet-Martin distinct count estimator.
missing_values Number of missing values in the target column.
blank_values Number of blank values. Blanks are defined by this regular expression:
'^\w*$'
fraction_missing Percentage of total rows that are missing, as a decimal value, e.g. 0.3.
fraction_blank Percentage of total rows that are blank, as a decimal value, e.g. 0.3.
mean Mean value of target column if target is numeric, otherwise NULL.
variance Variance of target column if target is numeric, otherwise NULL.
min Minimum value of target column. For strings this is the length of the shortest string.
max Maximum value of target column. For strings this is the length of the longest string.
first_quartile First quartile (25th percentile), only for numeric columns. Currently unavailable for PostgreSQL 9.3 or lower.
median Median value of target column, if target is numeric, otherwise NULL. Currently unavailable for PostgreSQL 9.3 or lower.
third_quartile Third quartile (25th percentile), only for numeric columns. Currently unavailable for PostgreSQL 9.3 or lower.
quantile_array Percentile values corresponding to ntile_array. Currently unavailable for PostgreSQL 9.3 or lower.
most_frequent_values An array containing the most frequently occurring values. The how_many_mfv argument determines the length of the array, 10 by default. If the summary() function is called with the get_estimates argument set to TRUE (default), the frequent values computation is performed using a parallel aggregation method that is faster, but in some cases can fail to detect the exact most frequent values.
mfv_frequencies Array containing the frequency count for each of the most frequent values.
target_columns (optional)
TEXT, default NULL. A comma-separated list of columns to summarize. If NULL, summaries are produced for all columns.
grouping_cols (optional)
TEXT, default: null. A comma-separated list of columns on which to group results. If NULL, summaries are produced on the complete table.
Note
Please note that summary statistics are calculated for each grouping column independently. That is, grouping columns are not combined together as in the regular PostgreSQL style GROUP BY directive. (This was done to reduce long run time and huge output table size which would otherwise result in the case of large input tables with a lot of grouping_cols and target_cols specified.)
get_distinct (optional)
BOOLEAN, default TRUE. If true, distinct values are counted.
get_quartiles (optional)
BOOLEAN, default TRUE. If TRUE, quartiles are computed.
ntile_array (optional)
FLOAT8[], default NULL. An array of quantile values to compute. If NULL, quantile values are not computed.
Note
Quartile and quantile functions are not available for PostgreSQL 9.3 or lower. If you are using PostgreSQL 9.3 or lower, the output table will not contain these values, even if you set 'get_quartiles' = TRUE or provide an array of quantile values for the parameter 'ntile_array'.
how_many_mfv (optional)
INTEGER, default: 10. The number of most-frequent-values to compute.
get_estimates (optional)
BOOLEAN, default TRUE. If TRUE, estimated values are produced for distinct values and most frequent values. If FALSE, exact values are calculated (may take longer to run depending on data size).

Examples
  1. View online help for the summary() function.
    SELECT * FROM madlib.summary();
    
  2. Create an input data set.
    CREATE TABLE houses (id INT, tax INT, bedroom INT, bath FLOAT, price INT,
                 size INT, lot INT);
    COPY houses FROM STDIN WITH DELIMITER '|';
      1 |  590 |       2 |    1 |  50000 |  770 | 22100
      2 | 1050 |       3 |    2 |  85000 | 1410 | 12000
      3 |   20 |       3 |    1 |  22500 | 1060 |  3500
      4 |  870 |       2 |    2 |  90000 | 1300 | 17500
      5 | 1320 |       3 |    2 | 133000 | 1500 | 30000
      6 | 1350 |       2 |    1 |  90500 |  820 | 25700
      7 | 2790 |       3 |  2.5 | 260000 | 2130 | 25000
      8 |  680 |       2 |    1 | 142500 | 1170 | 22000
      9 | 1840 |       3 |    2 | 160000 | 1500 | 19000
     10 | 3680 |       4 |    2 | 240000 | 2790 | 20000
     11 | 1660 |       3 |    1 |  87000 | 1030 | 17500
     12 | 1620 |       3 |    2 | 118600 | 1250 | 20000
     13 | 3100 |       3 |    2 | 140000 | 1760 | 38000
     14 | 2070 |       2 |    3 | 148000 | 1550 | 14000
     15 |  650 |       3 |  1.5 |  65000 | 1450 | 12000
    \.
    
  3. Run the summary() function.
    SELECT * FROM madlib.summary( 'houses',
                                  'houses_summary',
                                  'tax,bedroom,lot,bath,price,size,lot',
                                  'bedroom',
                                  TRUE,
                                  TRUE,
                                  NULL,
                                  5,
                                  FALSE
                                );
    
    Result:
      output_table  | row_count |    duration
    ----------------+-----------+----------------
     houses_summary |        21 | 0.207587003708
    (1 row)
    
  4. View the summary data.
    -- Turn on expanded display for readability.
    \x on
    SELECT * FROM houses_summary;
    
    Result:
     -[ RECORD 1 ]--------+-----------------------------------
     group_by             | bedroom
     group_by_value       | 3
     target_column        | tax
     column_number        | 2
     data_type            | int4
     row_count            | 9
     distinct_values      | 9
     missing_values       | 0
     blank_values         |
     fraction_missing     | 0
     fraction_blank       |
     mean                 | 1561.11111111111
     variance             | 936736.111111111
     min                  | 20
     max                  | 3100
     most_frequent_values | {20,1320,2790,1840,1660}
     mfv_frequencies      | {1,1,1,1,1}
     -[ RECORD 2 ]--------+-----------------------------------
     group_by             | bedroom
     group_by_value       | 3
     target_column        | bath
     column_number        | 4
     ...
    

Notes
  • Table names can be optionally schema qualified (current_schemas() would be searched if a schema name is not provided) and table and column names should follow case-sensitivity and quoting rules per the database. (For instance, 'mytable' and 'MyTable' both resolve to the same entity, i.e. 'mytable'. If mixed-case or multi-byte characters are desired for entity names then the string should be double-quoted; in this case the input would be '"MyTable"').
  • Estimated values are only implemented for the distinct values computation.
  • The get_estimates parameter controls computation for two statistics:
    • If get_estimates is TRUE then the distinct value computation is estimated. Further, the most frequent values computation is computed using a "quick and dirty" method that does parallel aggregation in Greenplum Database at the expense of missing some of the most frequent values.
    • If get_estimates is FALSE then the distinct values are computed in a slow but exact method. The most frequent values are computed using a faithful implementation that preserves the approximation guarantees of the Cormode/Muthukrishnan method (more information in MFV (Most Frequent Values)).
  • Summary statistics are calculated for each grouping column independently. That is, grouping columns are not combined together as in the regular PostgreSQL style GROUP BY directive. (This was done to reduce long run time and huge output table size which would otherwise result in the case of large input tables with a lot of grouping_cols and target_cols specified.)
  • Quartile and quantile functions are not available for PostgreSQL 9.3 or lower. If you are using PostgreSQL 9.3 or lower, the output table will not contain these values, even if you set 'get_quartiles' = TRUE or provide an array of quantile values for the parameter 'ntile_array'.

Related Topics
File summary.sql_in documenting the summary() function

MFV (Most Frequent Values)