Clustered Variance

The Clustered Variance module adjusts standard errors for clustering. For example, replicating a dataset 100 times should not increase the precision of parameter estimates, but estimating standard errors under the assumption of independent and identically distributed (IID) errors will report exactly such an increase. As another example, in economics of education research it is reasonable to expect that the error terms for children in the same class are not independent. Clustering standard errors can correct for this.
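The replication example can be checked numerically. The following is a minimal sketch in plain Python, not MADlib code: it estimates a mean (an intercept-only regression), uses the original row id as the cluster variable, and applies no finite-sample correction. The function name `mean_standard_errors` and the data are made up for illustration.

```python
import math

def mean_standard_errors(values, cluster_ids):
    """Return (iid_se, clustered_se) for the sample mean of `values`."""
    n = len(values)
    mean = sum(values) / n
    residuals = [v - mean for v in values]

    # IID standard error: s^2 / n with the usual n - 1 denominator for s^2.
    iid_var = sum(r * r for r in residuals) / (n - 1) / n

    # Clustered standard error: bread = 1 / n, meat = sum over clusters of
    # the squared within-cluster sum of residuals.
    cluster_scores = {}
    for r, g in zip(residuals, cluster_ids):
        cluster_scores[g] = cluster_scores.get(g, 0.0) + r
    meat = sum(s * s for s in cluster_scores.values())
    clustered_var = meat / (n * n)

    return math.sqrt(iid_var), math.sqrt(clustered_var)

original = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
ids = list(range(len(original)))

iid_1, cl_1 = mean_standard_errors(original, ids)

# Replicate every row 100 times; each copy keeps its original row id.
iid_100, cl_100 = mean_standard_errors(original * 100, ids * 100)

print("IID SE:       %.4f -> %.4f" % (iid_1, iid_100))
print("clustered SE: %.4f -> %.4f" % (cl_1, cl_100))
```

In this sketch the IID standard error shrinks by roughly a factor of ten after replication, while the clustered standard error is unchanged, because the copies of a row fall into the same cluster and add no independent information.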

The MADlib Clustered Variance module includes functions to compute clustered standard errors for linear, logistic, and multinomial logistic regression, and for the Cox Proportional Hazards model.

- Clustered Variance Linear Regression Training Function

The clustered variance linear regression training function has the following syntax.

clustered_variance_linregr ( source_table, out_table, dependent_varname, independent_varname, clustervar, grouping_cols )

**Arguments**

- source_table
TEXT. The name of the table containing the input data.

- out_table
VARCHAR. Name of the generated table containing the output model. The output table contains the following columns.

  - coef: DOUBLE PRECISION[]. Vector of the coefficients of the regression.
  - std_err: DOUBLE PRECISION[]. Vector of the standard errors of the coefficients.
  - t_stats: DOUBLE PRECISION[]. Vector of the t-stats of the coefficients.
  - p_values: DOUBLE PRECISION[]. Vector of the p-values of the coefficients.

A summary table named <out_table>_summary is also created, which is the same as the summary table created by the linregr_train function. Please refer to the documentation for linear regression for details.

- dependent_varname
- TEXT. An expression to evaluate for the dependent variable.
- independent_varname
- TEXT. An expression to evaluate for the independent variables.
- clustervar
- TEXT. A comma-separated list of the columns to use as cluster variables.
- grouping_cols (optional)
- TEXT, default: NULL.
*Not currently implemented. Any non-NULL value is ignored.* An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL GROUP BY clause. When this value is NULL, no grouping is used and a single result model is generated.

- Clustered Variance Logistic Regression Training Function

The clustered variance logistic regression training function has the following syntax.

clustered_variance_logregr( source_table, out_table, dependent_varname, independent_varname, clustervar, grouping_cols, max_iter, optimizer, tolerance, verbose_mode )

**Arguments**

- source_table
- TEXT. The name of the table containing the input data.
- out_table
VARCHAR. Name of the generated table containing the output model. The output table has the following columns:

  - coef: Vector of the coefficients of the regression.
  - std_err: Vector of the standard errors of the coefficients.
  - z_stats: Vector of the z-stats of the coefficients.
  - p_values: Vector of the p-values of the coefficients.

A summary table named <out_table>_summary is also created, which is the same as the summary table created by the logregr_train function. Please refer to the documentation for logistic regression for details.

- dependent_varname
- TEXT. An expression to evaluate for the dependent variable.
- independent_varname
- TEXT. An expression to evaluate for the independent variables.
- clustervar
- TEXT. A comma-separated list of columns to use as cluster variables.
- grouping_cols (optional)
- TEXT, default: NULL.
*Not yet implemented. Any non-NULL values are ignored.* An expression list used to group the input dataset into discrete groups, running one regression per group. Similar to the SQL GROUP BY clause. When this value is NULL, no grouping is used and a single result model is generated.
- max_iter (optional)
- INTEGER, default: 20. The maximum number of iterations that are allowed.
- optimizer (optional)
- TEXT, default: 'irls'. The name of the optimizer to use:
- 'newton' or 'irls': Iteratively reweighted least squares
- 'cg': conjugate gradient
- 'igd': incremental gradient descent.

- tolerance (optional)
- FLOAT8, default: 0.0001. The difference between log-likelihood values in successive iterations that should indicate convergence. A value of zero disables the convergence criterion, so that execution stops after *n* iterations have completed.
- verbose_mode (optional)
- BOOLEAN, default: FALSE. Provides verbose output of the results of training.
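
The output columns listed above are related in the usual way for Wald tests. The following sketch, in plain Python, assumes the standard relationship (z-statistic equals coefficient divided by standard error, two-sided p-value from the standard normal distribution). The helper `z_stats_and_p_values` and its inputs are hypothetical, and the sketch is not taken from the MADlib implementation.

```python
import math

def z_stats_and_p_values(coef, std_err):
    """Return (z_stats, p_values) for parallel lists of coefficients and standard errors."""
    z_stats = [c / s for c, s in zip(coef, std_err)]
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_values = [math.erfc(abs(z) / math.sqrt(2.0)) for z in z_stats]
    return z_stats, p_values

# Made-up coefficients and clustered standard errors, for illustration only.
z_stats, p_values = z_stats_and_p_values([1.0, -0.3, 0.0], [0.5, 0.3, 0.2])
print(z_stats)
print(p_values)
```

Because clustering changes only std_err, the coefficients match those of the unclustered regression, while the z-stats and p-values change with the standard errors.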

- Clustered Variance Multinomial Logistic Regression Training Function

clustered_variance_mlogregr( source_table, out_table, dependent_varname, independent_varname, cluster_varname, ref_category, grouping_cols, optimizer_params, verbose_mode )

**Arguments**

- source_table
- TEXT. The name of the table containing the input data.
- out_table
TEXT. The name of the table where the regression model will be stored. The output table has the following columns:

  - category: The category.
  - ref_category: The reference category used for modeling.
  - coef: Vector of the coefficients of the regression.
  - std_err: Vector of the standard errors of the coefficients.
  - z_stats: Vector of the z-stats of the coefficients.
  - p_values: Vector of the p-values of the coefficients.

A summary table named <out_table>_summary is also created, which is the same as the summary table created by the mlogregr_train function. Please refer to the documentation for multinomial logistic regression for details.

- dependent_varname
- TEXT. An expression to evaluate for the dependent variable.
- independent_varname
- TEXT. An expression to evaluate for the independent variables.
- cluster_varname
- TEXT. A comma-separated list of columns to use as cluster variables.
- ref_category (optional)
- INTEGER. Reference category in the range [0, num_category).
- grouping_cols (optional)
- TEXT, default: NULL.
*Not yet implemented. Any non-NULL values are ignored.* A comma-separated list of columns to use as grouping variables.
- optimizer_params (optional)
- TEXT, default: NULL, which uses the default values of optimizer parameters: max_iter=20, optimizer='newton', tolerance=1e-4. It should be a string that contains pairs of 'key=value' separated by commas.
- verbose_mode (optional)
- BOOLEAN, default: FALSE. If TRUE, detailed information is printed when computing logistic regression.

- Clustered Variance for Cox Proportional Hazards model

The clustered robust variance estimator function for the Cox Proportional Hazards model has the following syntax.

clustered_variance_coxph(model_table, output_table, clustervar)

**Arguments**

- model_table
- TEXT. The name of the model table, which is exactly the same as the 'output_table' parameter of the coxph_train() function.
- output_table
- TEXT. The name of the table where the output is saved. It has the following columns:
  - coef: FLOAT8[]. Vector of the coefficients.
  - loglikelihood: FLOAT8. Log-likelihood value of the MLE estimate.
  - std_err: FLOAT8[]. Vector of the standard errors of the coefficients.
  - clustervar: TEXT. A comma-separated list of columns to use as cluster variables.
  - clustered_se: FLOAT8[]. Vector of the robust standard errors of the coefficients.
  - clustered_z: FLOAT8[]. Vector of the robust z-stats of the coefficients.
  - clustered_p: FLOAT8[]. Vector of the robust p-values of the coefficients.
  - hessian: FLOAT8[]. The Hessian matrix.
- clustervar
- TEXT. A comma-separated list of columns to use as cluster variables.

- Examples

- View online help for the clustered variance linear regression function.
SELECT madlib.clustered_variance_linregr();

- Run the linear regression function and view the results.
DROP TABLE IF EXISTS out_table, out_table_summary;
SELECT madlib.clustered_variance_linregr( 'abalone',
                                          'out_table',
                                          'rings',
                                          'ARRAY[1, diameter, length, width]',
                                          'sex',
                                          NULL
                                        );
SELECT * FROM out_table;

- View online help for the clustered variance logistic regression function.
SELECT madlib.clustered_variance_logregr();

- Run the logistic regression function and view the results.
DROP TABLE IF EXISTS out_table, out_table_summary;
SELECT madlib.clustered_variance_logregr( 'abalone',
                                          'out_table',
                                          'rings < 10',
                                          'ARRAY[1, diameter, length, width]',
                                          'sex'
                                        );
SELECT * FROM out_table;

- View online help for the clustered variance multinomial logistic regression function.
SELECT madlib.clustered_variance_mlogregr();

- Run the multinomial logistic regression and view the results.
DROP TABLE IF EXISTS out_table, out_table_summary;
SELECT madlib.clustered_variance_mlogregr( 'abalone',
                                           'out_table',
                                           'CASE WHEN rings < 10 THEN 1 ELSE 0 END',
                                           'ARRAY[1, diameter, length, width]',
                                           'sex',
                                           0
                                         );
SELECT * FROM out_table;

- Run the Cox Proportional Hazards regression and compute the clustered robust estimator.
DROP TABLE IF EXISTS lung_cl_out;
DROP TABLE IF EXISTS lung_out;
DROP TABLE IF EXISTS lung_out_summary;
SELECT madlib.coxph_train( 'lung',
                           'lung_out',
                           'time',
                           'array[age, "ph.ecog"]',
                           'TRUE',
                           NULL,
                           NULL
                         );
SELECT madlib.clustered_variance_coxph( 'lung_out',
                                        'lung_cl_out',
                                        '"ph.karno"'
                                      );
SELECT * FROM lung_cl_out;

- Notes

- Note that an intercept term must be included manually in the independent variable expression. A NULL value for *grouping_cols* means that there is no grouping in the calculation.

- Technical Background

Assume that the data can be separated into \(m\) clusters. Usually this can be done by grouping the data table according to one or multiple columns.

The estimator has a form similar to the usual sandwich estimator:

\[ S(\vec{c}) = B(\vec{c}) M(\vec{c}) B(\vec{c}) \]

The bread part is the same as in the Huber-White sandwich estimator:

\begin{eqnarray} B(\vec{c}) & = & \left(-\sum_{i=1}^{n} H(y_i, \vec{x}_i, \vec{c})\right)^{-1}\\ & = & \left(-\sum_{i=1}^{n}\frac{\partial^2 l(y_i, \vec{x}_i, \vec{c})}{\partial c_\alpha \partial c_\beta}\right)^{-1} \end{eqnarray}

where \(H\) is the Hessian matrix, that is, the matrix of second derivatives of the target function

\[ L(\vec{c}) = \sum_{i=1}^n l(y_i, \vec{x}_i, \vec{c})\ . \]

The meat part is different:

\[ M(\vec{c}) = \mathbf{A}^T\mathbf{A} \]

where the \(m\)-th row of \(\mathbf{A}\) is

\[ A_m = \sum_{i\in G_m}\frac{\partial l(y_i,\vec{x}_i,\vec{c})}{\partial \vec{c}} \]

and \(G_m\) is the set of rows that belong to the \(m\)-th cluster.

The contributions to \(B\) and \(A\) are computed for each cluster during a single scan through the data table in an aggregate function. The per-cluster results are then combined into the full \(B\) and \(A\) outside the aggregate function. Finally, the matrix multiplications are done in a separate function on the master node.
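
The computation described above can be sketched for linear regression in plain Python. This is an illustration under stated assumptions, not the MADlib implementation: the per-row log-likelihood is taken as \(l = -(y - \vec{x}\cdot\vec{c})^2/2\), so the gradient is \(\vec{x}(y - \vec{x}\cdot\vec{c})\) and the negative Hessian is \(\vec{x}\vec{x}^T\); the model has exactly two coefficients so that a hand-written 2x2 inverse suffices; no finite-sample correction is applied. All function names and data are hypothetical.

```python
from collections import defaultdict

def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(p, q):
    """Multiply two matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def fit_ols(rows):
    """Ordinary least squares for two coefficients via the normal equations."""
    xtx = [[0.0, 0.0], [0.0, 0.0]]
    xty = [0.0, 0.0]
    for y, x, _cluster in rows:
        for a in range(2):
            xty[a] += x[a] * y
            for b in range(2):
                xtx[a][b] += x[a] * x[b]
    inv = inv2(xtx)
    return [inv[a][0] * xty[0] + inv[a][1] * xty[1] for a in range(2)]

def clustered_variance(rows, coef):
    """Return S = B M B, accumulating the pieces in one pass over the rows."""
    k = len(coef)
    neg_hessian = [[0.0] * k for _ in range(k)]   # -sum_i H_i = sum_i x_i x_i^T
    scores = defaultdict(lambda: [0.0] * k)       # rows A_m, keyed by cluster
    for y, x, cluster in rows:
        resid = y - sum(c * xi for c, xi in zip(coef, x))
        for a in range(k):
            scores[cluster][a] += x[a] * resid    # gradient summed within the cluster
            for b in range(k):
                neg_hessian[a][b] += x[a] * x[b]
    bread = inv2(neg_hessian)                     # B
    a_rows = list(scores.values())
    meat = [[sum(r[a] * r[b] for r in a_rows) for b in range(k)]
            for a in range(k)]                    # M = A^T A
    return matmul(matmul(bread, meat), bread)

# Each row is (y, [1, x], cluster id); the leading 1 is the intercept term.
rows = [(1.0, [1.0, 0.0], "a"), (2.1, [1.0, 1.0], "a"),
        (2.9, [1.0, 2.0], "b"), (4.2, [1.0, 3.0], "b"),
        (4.8, [1.0, 4.0], "c"), (6.1, [1.0, 5.0], "c")]

coef = fit_ols(rows)
S = clustered_variance(rows, coef)
std_err = [S[0][0] ** 0.5, S[1][1] ** 0.5]
print(coef, std_err)
```

The square roots of the diagonal of \(S\) are the clustered standard errors. Repeating every row while keeping its cluster id leaves \(S\) unchanged in this sketch, since the bread scales down by the replication factor and the meat scales up by its square.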

When multinomial logistic regression is computed before the multinomial clustered variance calculation, it uses a default reference category of zero, and the regression coefficients are included in the output table. The regression coefficients in the output are in the same order as in the multinomial logistic regression function, which is described below. For a problem with \( K \) independent variables \( (1, ..., K) \) and \( J \) categories \( (0, ..., J-1) \), let \( {m_{k,j}} \) denote the coefficient for independent variable \( k \) and category \( j \). The output is \( {m_{k_1, j_0}, m_{k_1, j_1} \ldots m_{k_1, j_{J-1}}, m_{k_2, j_0}, m_{k_2, j_1} \ldots m_{k_K, j_{J-1}}} \). This order is NOT CONSISTENT with the multinomial regression marginal effect calculation in the function *marginal_mlogregr*. This is deliberate because the interfaces of all multinomial regressions (robust, clustered, ...) will be moved to match that used in marginal.
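
The coefficient ordering described above (variable index outermost, category index innermost) can be made concrete with a short sketch. The helper names `coefficient_order` and `flat_index` are hypothetical, and the indexing simply follows the ranges stated in the text: variables \(1, ..., K\) and categories \(0, ..., J-1\).

```python
def coefficient_order(num_variables, num_categories):
    """List (k, j) pairs in output order: all categories for variable 1, then variable 2, ..."""
    return [(k, j)
            for k in range(1, num_variables + 1)
            for j in range(num_categories)]

def flat_index(k, j, num_categories):
    """Zero-based position of m_{k,j} in the flattened coefficient vector."""
    return (k - 1) * num_categories + j

order = coefficient_order(2, 3)
print(order)

# Position of the coefficient for variable 2, category 1 when J = 3.
print(flat_index(2, 1, 3))
```

A lookup of this kind is one way to read a single coefficient out of the flattened coef array without reshaping it.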

- Literature

[1] Standard, Robust, and Clustered Standard Errors Computed in R, http://diffuseprior.wordpress.com/2012/06/15/standard-robust-and-clustered-standard-errors-computed-in-r/

- Related Topics
- File clustered_variance.sql_in documenting the clustered variance SQL functions.

- File clustered_variance_coxph.sql_in documenting the clustered variance for Cox proportional hazards SQL functions.