2.1.0
User Documentation for Apache MADlib

The functions in this module calculate robust variance (Huber-White estimates) for linear regression, logistic regression, multinomial logistic regression, and Cox proportional hazards. They are useful in calculating variances in a dataset with potentially noisy outliers. The Huber-White implemented here is identical to the "HC0" sandwich operator in the R module "sandwich".

The interfaces for robust linear, logistic, and multinomial logistic regression are similar. Each regression type has its own training function. The regression results are saved in an output table with small differences, depending on the regression type.

Warning
Please note that the interface for Cox proportional hazards, unlike the interface of other regression methods, accepts an output model table produced by coxph_train() function.

Robust Linear Regression Training Function

The robust_variance_linregr() function has the following syntax:

robust_variance_linregr( source_table,
                         out_table,
                         dependent_varname,
                         independent_varname,
                         grouping_cols
                       )
source_table
VARCHAR. The name of the table containing the training data.
out_table

VARCHAR. Name of the generated table containing the output model. The output table contains the following columns.

coef DOUBLE PRECISION[]. Vector of the coefficients of the regression.
std_err DOUBLE PRECISION[]. Vector of the standard error of the coefficients.
t_stats DOUBLE PRECISION[]. Vector of the t-stats of the coefficients.
p_values DOUBLE PRECISION[]. Vector of the p-values of the coefficients.

A summary table named <out_table>_summary is also created, which is the same as the summary table created by linregr_train function. Please refer to the documentation for linear regression for details.

dependent_varname
VARCHAR. The name of the column containing the dependent variable.
independent_varname
VARCHAR. Expression list to evaluate for the independent variables. An intercept variable is not assumed. It is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list.
grouping_cols (optional)
VARCHAR, default: NULL. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated. Default value: NULL.

Robust Logistic Regression Training Function

The robust_variance_logregr() function has the following syntax:

robust_variance_logregr( source_table,
                         out_table,
                         dependent_varname,
                         independent_varname,
                         grouping_cols,
                         max_iter,
                         optimizer,
                         tolerance,
                         verbose_mode
                       )
source_table
VARCHAR. The name of the table containing the training data.
out_table

VARCHAR. Name of the generated table containing the output model. The output table has the following columns:

coef Vector of the coefficients of the regression.
std_err Vector of the standard error of the coefficients.
z_stats Vector of the z-stats of the coefficients.
p_values Vector of the p-values of the coefficients.

A summary table named <out_table>_summary is also created, which is the same as the summary table created by logregr_train function. Please refer to the documentation for logistic regression for details.

dependent_varname
VARCHAR. The name of the column containing the independent variable.
independent_varname
VARCHAR. Expression list to evaluate for the independent variables. An intercept variable is not assumed. It is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list.
grouping_cols (optional)
VARCHAR, default: NULL. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated.
max_iter (optional)
INTEGER, default: 20. The maximum number of iterations that are allowed.
optimizer
VARCHAR, default: 'fista'. Name of optimizer, either 'fista' or 'igd'.
tolerance (optional)
DOUBLE PRECISION, default: 1e-6. The criteria to end iterations. Both the 'fista' and 'igd' optimizers compute the average difference between the coefficients of two consecutive iterations, and when the difference is smaller than tolerance or the iteration number is larger than max_iter, the computation stops.
verbose_mode (optional)
BOOLEAN, default: FALSE. Whether the regression fit should print any warning messages.

Robust Multinomial Logistic Regression Function

The robust_variance_mlogregr() function has the following syntax:

robust_variance_mlogregr( source_table,
                          out_table,
                          dependent_varname,
                          independent_varname,
                          ref_category,
                          grouping_cols,
                          optimizer_params,
                          verbose_mode
                        )
source_table
VARCHAR. The name of the table containing training data, properly qualified.
out_table

VARCHAR. The name of the table where the regression model will be stored. The output table has the following columns:

category The category.
ref_category The refererence category used for modeling.
coef Vector of the coefficients of the regression.
std_err Vector of the standard error of the coefficients.
z_stats Vector of the z-stats of the coefficients.
p_values Vector of the p-values of the coefficients.

A summary table named <out_table>_summary is also created, which is the same as the summary table created by mlogregr_train function. Please refer to the documentation for multinomial logistic regression for details.

dependent_varname
VARCHAR. The name of the column containing the dependent variable.
independent_varname
VARCHAR. Expression list to evaluate for the independent variables. An intercept variable is not assumed. It is common to provide an explicit intercept term by including a single constant 1 term in the independent variable list. The independent_varname can be the name of a column that contains an array of numeric values. It can also be a string with the format 'ARRAY[1, x1, x2, x3]', where x1, x2 and x3 are each column names.
ref_category (optional)
INTEGER, default: 0. The reference category.
grouping_cols (optional)
VARCHAR, default: NULL. Not currently implemented. Any non-NULL value is ignored. An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL "GROUP BY" clause. When this value is NULL, no grouping is used and a single result model is generated.
optimizer_params (optional)
TEXT, default: NULL, which uses the default values of optimizer parameters: max_iter=20, optimizer='newton', tolerance=1e-4. It should be a string that contains pairs of 'key=value' separated by commas.
verbose_mode (optional)
BOOLEAN, default FALSE. Not currently implemented. TRUE if the regression fit should print warning messages.

Robust Variance Function For Cox Proportional Hazards

The robust_variance_coxph() function has the following syntax:

robust_variance_coxph(model_table, output_table)

Arguments

model_table
TEXT. The name of the model table, which is exactaly the same as the 'output_table' parameter of coxph_train() function.
output_table
TEXT. The name of the table where the output is saved. It has the following columns:
coef FLOAT8[]. Vector of the coefficients.
loglikelihood FLOAT8. Log-likelihood value of the MLE estimate.
std_err FLOAT8[]. Vector of the standard error of the coefficients.
robust_se FLOAT8[]. Vector of the robust standard errors of the coefficients.
robust_z FLOAT8[]. Vector of the robust z-stats of the coefficients.
robust_p FLOAT8[]. Vector of the robust p-values of the coefficients.
hessian FLOAT8[]. The Hessian matrix.

Examples

Logistic Regression Example

  1. View online help for the logistic regression training function.
    SELECT madlib.robust_variance_logregr();
    
  2. Create the training data table.
    DROP TABLE IF EXISTS patients;
    CREATE TABLE patients (id INTEGER NOT NULL, second_attack INTEGER,
        treatment INTEGER, trait_anxiety INTEGER);
    COPY patients FROM STDIN WITH DELIMITER '|';
      1 |             1 |         1 |            70
      3 |             1 |         1 |            50
      5 |             1 |         0 |            40
      7 |             1 |         0 |            75
      9 |             1 |         0 |            70
     11 |             0 |         1 |            65
     13 |             0 |         1 |            45
     15 |             0 |         1 |            40
     17 |             0 |         0 |            55
     19 |             0 |         0 |            50
      2 |             1 |         1 |            80
      4 |             1 |         0 |            60
      6 |             1 |         0 |            65
      8 |             1 |         0 |            80
     10 |             1 |         0 |            60
     12 |             0 |         1 |            50
     14 |             0 |         1 |            35
     16 |             0 |         1 |            50
     18 |             0 |         0 |            45
     20 |             0 |         0 |            60
    \.
    
  3. Run the logistic regression training function and compute the robust logistic variance of the regression:
    DROP TABLE IF EXISTS patients_logregr, patients_logregr_summary;
    SELECT madlib.robust_variance_logregr( 'patients',
                                           'patients_logregr',
                                           'second_attack',
                                           'ARRAY[1, treatment, trait_anxiety]'
                                         );
    
  4. View the regression results.
    \x on
    SELECT * FROM patients_logregr;
    
    Result:
     -[ RECORD 1 ]-------------------------------------------------------
     coef     | {-6.36346994178179,-1.02410605239327,0.119044916668605}
     std_err  | {3.45872062333648,1.1716192578234,0.0534328864185018}
     z_stats  | {-1.83983346294192,-0.874094587943036,2.22793348156809}
     p_values | {0.0657926909738889,0.382066744585541,0.0258849510757339}
    
    Alternatively, unnest the arrays in the results for easier reading of output.
    \x off
    SELECT unnest(array['intercept', 'treatment', 'trait_anxiety' ]) as attribute,
           unnest(coef) as coefficient,
           unnest(std_err) as standard_error,
           unnest(z_stats) as z_stat,
           unnest(p_values) as pvalue
    FROM patients_logregr;
    

Cox Proportional Hazards Example

  1. View online help for the robust Cox Proportional hazards training method.
    SELECT madlib.robust_variance_coxph();
    
  2. Create an input data set.
    DROP TABLE IF EXISTS sample_data;
    CREATE TABLE sample_data (
        id INTEGER NOT NULL,
        grp DOUBLE PRECISION,
        wbc DOUBLE PRECISION,
        timedeath INTEGER,
        status BOOLEAN
    );
    COPY sample_data FROM STDIN DELIMITER '|';
      0 |   0 | 1.45 |        35 | t
      1 |   0 | 1.47 |        34 | t
      3 |   0 |  2.2 |        32 | t
      4 |   0 | 1.78 |        25 | t
      5 |   0 | 2.57 |        23 | t
      6 |   0 | 2.32 |        22 | t
      7 |   0 | 2.01 |        20 | t
      8 |   0 | 2.05 |        19 | t
      9 |   0 | 2.16 |        17 | t
     10 |   0 |  3.6 |        16 | t
     11 |   1 |  2.3 |        15 | t
     12 |   0 | 2.88 |        13 | t
     13 |   1 |  1.5 |        12 | t
     14 |   0 |  2.6 |        11 | t
     15 |   0 |  2.7 |        10 | t
     16 |   0 |  2.8 |         9 | t
     17 |   1 | 2.32 |         8 | t
     18 |   0 | 4.43 |         7 | t
     19 |   0 | 2.31 |         6 | t
     20 |   1 | 3.49 |         5 | t
     21 |   1 | 2.42 |         4 | t
     22 |   1 | 4.01 |         3 | t
     23 |   1 | 4.91 |         2 | t
     24 |   1 |    5 |         1 | t
    \.
    
  3. Run the Cox regression function.
    DROP TABLE IF EXISTS sample_cox, sample_cox_summary;
    SELECT madlib.coxph_train( 'sample_data',
                               'sample_cox',
                               'timedeath',
                               'ARRAY[grp,wbc]',
                               'status'
                             );
    
  4. Run the Robust Cox regression function.
    SELECT madlib.robust_variance_coxph( 'sample_cox',
                               'sample_robust_cox'
                             );
    
  5. View the results of the robust Cox regression.
    \x on
    SELECT * FROM sample_robust_cox;
    
    Results:
    -[ RECORD 1 ]-+----------------------------------------------------------------------------
    coef          | {2.54407073265105,1.67172094780081}
    loglikelihood | -37.8532498733452
    std_err       | {0.677180599295459,0.387195514577754}
    robust_se     | {0.621095581073685,0.274773521439328}
    robust_z      | {4.09610180811965,6.08399579058399}
    robust_p      | {4.2016521208424e-05,1.17223683104729e-09}
    hessian       | {{2.78043065745405,-2.25848560642669},{-2.25848560642669,8.50472838284265}}
    

Technical Background

When doing regression analysis, we are sometimes interested in the variance of the computed coefficients \( \boldsymbol c \). While the built-in regression functions provide variance estimates, we may prefer a robust variance estimate.

The robust variance calculation can be expressed in a sandwich formation, which is the form

\[ S( \boldsymbol c) = B( \boldsymbol c) M( \boldsymbol c) B( \boldsymbol c) \]

where \( B( \boldsymbol c)\) and \( M( \boldsymbol c)\) are matrices. The \( B( \boldsymbol c) \) matrix, also known as the bread, is relatively straight forward, and can be computed as

\[ B( \boldsymbol c) = n\left(\sum_i^n -H(y_i, x_i, \boldsymbol c) \right)^{-1} \]

where \( H \) is the hessian matrix.

The \( M( \boldsymbol c)\) matrix has several variations, each with different robustness properties. The form implemented here is the Huber-White sandwich operator, which takes the form

\[ M_{H} =\frac{1}{n} \sum_i^n \psi(y_i,x_i, \boldsymbol c)^T \psi(y_i,x_i, \boldsymbol c). \]

The above method for calculating robust variance (Huber-White estimates) is implemented for linear regression, logistic regression, and multinomial logistic regression. It is useful in calculating variances in a dataset with potentially noisy outliers. The Huber-White implemented here is identical to the "HC0" sandwich operator in the R module "sandwich".

When multinomial logistic regression is computed before the multinomial robust regression, it uses a default reference category of zero and the regression coefficients are included in the output table. The regression coefficients in the output are in the same order as the multinomial logistic regression function, which is described below. For a problem with \( K \) dependent variables \( (1, ..., K) \) and \( J \) categories \( (0, ..., J-1) \), let \( {m_{k,j}} \) denote the coefficient for dependent variable \( k \) and category \( j \) . The output is \( {m_{k_1, j_0}, m_{k_1, j_1} \ldots m_{k_1, j_{J-1}}, m_{k_2, j_0}, m_{k_2, j_1} \ldots m_{k_K, j_{J-1}}} \). The order is NOT CONSISTENT with the multinomial regression marginal effect calculation with function marginal_mlogregr. This is deliberate because the interfaces of all multinomial regressions (robust, clustered, ...) will be moved to match that used in marginal.

The robust variance of Cox proportional hazards is more complex because coeeficients are trained by maximizing a partial log-likelihood. Therefore, one cannot directly use the formula for \( M( \boldsymbol c) \) as in Huber-White robust estimator. Extra terms are needed. See [4] for details.

Literature

[1] vce(cluster) function in STATA: http://www.stata.com/help.cgi?vce_option

[2] clustered estimators in R: http://people.su.se/~ma/clustering.pdf

[3] Achim Zeileis: Object-oriented Computation of Sandwich Estimators. Research Report Series / Department of Statistics and Mathematics, 37. Department of Statistics and Mathematics, WU Vienna University of Economics and Business, Vienna. http://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-OOP.pdf

[4] D. Y. Lin and L . J. Wei, The Robust Inference for the Cox Proportional Hazards Model, Journal of the American Statistical Association, Vol. 84, No. 408, p.1074 (1989).

Related Topics
File robust.sql_in documenting the SQL functions File robust_variance_coxph.sql_in documenting more the SQL functions