Proportional-Hazard models enable the comparison of various survival models. These survival models are functions describing the probability of a one-item event (prototypically, this event is death) with respect to time. The interval of time before the occurrence of death can be called the survival time. Let T be a random variable representing the survival time, with a cumulative probability function P(t). Informally, P(t) is the probability that death has happened before time t.
Following is the syntax for the coxph_train() training function:
coxph_train( source_table, output_table, dependent_variable, independent_variable, right_censoring_status, strata, optimizer_params )
Arguments
TEXT. The name of the table where the output model is saved. The output is saved in the table named by the output_table argument. It has the following columns:
coef | FLOAT8[]. Vector of the coefficients. |
---|---|
loglikelihood | FLOAT8. Log-likelihood value of the MLE estimate. |
std_err | FLOAT8[]. Vector of the standard error of the coefficients. |
stats | FLOAT8[]. Vector of the statistics of the coefficients. |
p_values | FLOAT8[]. Vector of the p-values of the coefficients. |
hessian | FLOAT8[]. The Hessian matrix computed using the final solution. |
num_iterations | INTEGER. The number of iterations performed by the optimizer. |
Additionally, a summary output table is generated that contains a summary of the parameters used for building the Cox model. It is stored in a table named <output_table>_summary. It has the following columns:
source_table | The source table name. |
---|---|
dependent_variable | The dependent variable name. |
independent_variable | The independent variable name. |
right_censoring_status | The right censoring status |
strata | The stratification columns |
num_processed | The number of rows that were actually used in the computation. |
num_missing_rows_skipped | The number of rows that were skipped in the computation due to NULL values in them. |
VARCHAR, default: NULL, which uses the default values of optimizer parameters: max_iter=100, optimizer=newton, tolerance=1e-8, array_agg_size=10000000, sample_size=1000000. It should be a string that contains 'key=value' pairs separated by commas. The meanings of these parameters are:
The cox_zph() function tests the proportional hazards assumption (PHA) of a Cox regression.
Proportional-hazard models enable the comparison of various survival models. These PH models, however, assume that the hazard for a given individual is a fixed proportion of the hazard for any other individual, and the ratio of the hazards is constant across time. MADlib does not currently have support for performing any transformation of the time to compute the correlation.
The cox_zph() function is used to test this assumption by computing the correlation of the residual of the coxph_train() model with time.
Following is the syntax for the cox_zph() function:
cox_zph(cox_model_table, output_table)
Arguments
TEXT. The name of the table containing the Cox Proportional-Hazards model.
covariate | TEXT. The independent variables. |
---|---|
rho | FLOAT8[]. Vector of the correlation coefficients between survival time and the scaled Schoenfeld residuals. |
chi_square | FLOAT8[]. Chi-square test statistic for the correlation analysis. |
p_value | FLOAT8[]. Two-side p-value for the chi-square statistic. |
Additionally, the residual values are outputted to the table named output_table_residual. The table contains the following columns:
<dep_column_name> | FLOAT8. Time values (dependent variable) present in the original source table. |
---|---|
residual | FLOAT8[]. Difference between the original covariate values and the expectation of the covariates obtained from the coxph_train model. |
scaled_residual | Residual values scaled by the variance of the coefficients. |
coxph_predict(model_table, source_table, id_col_name, output_table, pred_type, reference)
Arguments
TEXT. Name of the table containing the cox model.
TEXT. Name of the table containing the prediction data.
TEXT. Name of the id column in the source table.
TEXT. Name of the table to store the prediction results in. The output table is named by the output_table argument and has the following columns:
id | TEXT. The id column name from the source table. |
---|---|
predicted_result | DOUBLE PRECISION. Result of prediction based of the value of the prediction type parameter. |
TEXT, OPTIONAL. Type of prediction. This can be one of 'linear_predictors', 'risk', or 'terms'. DEFAULT='linear_predictors'.
DROP TABLE IF EXISTS sample_data; CREATE TABLE sample_data ( id INTEGER NOT NULL, grp DOUBLE PRECISION, wbc DOUBLE PRECISION, timedeath INTEGER, status BOOLEAN ); COPY sample_data FROM STDIN WITH DELIMITER '|'; 0 | 0 | 1.45 | 35 | t 1 | 0 | 1.47 | 34 | t 3 | 0 | 2.2 | 32 | t 4 | 0 | 1.78 | 25 | t 5 | 0 | 2.57 | 23 | t 6 | 0 | 2.32 | 22 | t 7 | 0 | 2.01 | 20 | t 8 | 0 | 2.05 | 19 | t 9 | 0 | 2.16 | 17 | t 10 | 0 | 3.6 | 16 | t 11 | 1 | 2.3 | 15 | t 12 | 0 | 2.88 | 13 | t 13 | 1 | 1.5 | 12 | t 14 | 0 | 2.6 | 11 | t 15 | 0 | 2.7 | 10 | t 16 | 0 | 2.8 | 9 | t 17 | 1 | 2.32 | 8 | t 18 | 0 | 4.43 | 7 | t 19 | 0 | 2.31 | 6 | t 20 | 1 | 3.49 | 5 | t 21 | 1 | 2.42 | 4 | t 22 | 1 | 4.01 | 3 | t 23 | 1 | 4.91 | 2 | t 24 | 1 | 5 | 1 | t \.
DROP TABLE IF EXISTS sample_cox, sample_cox_summary; SELECT madlib.coxph_train( 'sample_data', 'sample_cox', 'timedeath', 'ARRAY[grp,wbc]', 'status' );
\x on SELECT * FROM sample_cox;Results:
-[ RECORD 1 ]--+---------------------------------------------------------------------------- coef | {2.54407073265254,1.67172094779487} loglikelihood | -37.8532498733 std_err | {0.677180599294897,0.387195514577534} z_stats | {3.7568570855419,4.31751114064138} p_values | {0.000172060691513886,1.5779844638453e-05} hessian | {{2.78043065745617,-2.25848560642414},{-2.25848560642414,8.50472838284472}} num_iterations | 5
\x off -- Display the linear predictors for the original dataset DROP TABLE IF EXISTS sample_pred; SELECT madlib.coxph_predict('sample_cox', 'sample_data', 'id', 'sample_pred'); SELECT * FROM sample_pred;
id | predicted_value ----+-------------------- 0 | -2.97110918125034 4 | -2.41944126847803 6 | -1.5167119566688 8 | -1.96807661257341 10 | 0.623090856508638 12 | -0.58054822590367 14 | -1.04863009128623 16 | -0.714285901727259 18 | 2.01061924317838 20 | 2.98327228490375 22 | 3.85256717775708 24 | 5.507570916074 1 | -2.93767476229444 3 | -1.71731847040418 5 | -1.09878171972008 7 | -2.03494545048521 9 | -1.78418730831598 15 | -0.881457996506747 19 | -1.53342916614675 11 | 0.993924357027849 13 | -0.343452401208048 17 | 1.02735877598375 21 | 1.19453087076323 23 | 5.35711603077246 (24 rows)
-- Display the relative risk for the original dataset DROP TABLE IF EXISTS sample_pred; SELECT madlib.coxph_predict('sample_cox', 'sample_data', 'id', 'sample_pred', 'risk'); SELECT * FROM sample_pred;
id | predicted_value ----+-------------------- 1 | 0.0529887971503509 3 | 0.179546963459175 5 | 0.33327686110022 7 | 0.130687611255372 9 | 0.167933483703554 15 | 0.414178600294289 19 | 0.215794402223054 11 | 2.70181658768287 13 | 0.709317242984782 17 | 2.79367735395696 21 | 3.30200833843654 23 | 212.112338046551 0 | 0.0512464372091503 4 | 0.0889713146524469 6 | 0.219432204682557 8 | 0.139725343898993 10 | 1.86468261037506 12 | 0.559591499901241 14 | 0.350417460388247 16 | 0.489541567796517 18 | 7.46794038691975 20 | 19.7523463121038 22 | 47.1138577624204 24 | 246.551504798816 (24 rows)
SELECT madlib.cox_zph( 'sample_cox', 'sample_zph_output' );
\x on SELECT * FROM sample_zph_output;Results:
-[ RECORD 1 ]----------------------------------------- covariate | ARRAY[grp,wbc] rho | {0.00237308357328641,0.0375600568840431} chi_square | {0.000100675718191977,0.0232317400546175} p_value | {0.991994376850758,0.878855984657948}
Generally, proportional-hazard models start with a list of \( \boldsymbol n \) observations, each with \( \boldsymbol m \) covariates and a time of death. From this \( \boldsymbol n \times m \) matrix, we would like to derive the correlation between the covariates and the hazard function. This amounts to finding the parameters \( \boldsymbol \beta \) that best fit the model described below.
Let us define:
Note that this model does not include a constant term, and the data cannot contain a column of 1s.
By definition,
\[ P[T_k = t_i | \boldsymbol R(t_i)] = \frac{e^{\beta^T x_k} }{ \sum_{j \in R(t_i)} e^{\beta^T x_j}}. \,. \]
The partial likelihood function can now be generated as the product of conditional probabilities:
\[ \mathcal L = \prod_{i = 1}^n \left( \frac{e^{\beta^T x_i}}{ \sum_{j \in R(t_i)} e^{\beta^T x_j}} \right). \]
The log-likelihood form of this equation is
\[ L = \sum_{i = 1}^n \left[ \beta^T x_i - \log\left(\sum_{j \in R(t_i)} e^{\beta^T x_j }\right) \right]. \]
Using this score function and Hessian matrix, the partial likelihood can be maximized using the Newton-Raphson algorithm. Breslow's method is used to resolved tied times of deaths. The time of death for two records are considered "equal" if they differ by less than 1.0e-6
The inverse of the Hessian matrix, evaluated at the estimate of \( \boldsymbol \beta \), can be used as an approximate variance-covariance matrix for the estimate, and used to produce approximate standard errors for the regression coefficients.
\[ \mathit{se}(c_i) = \left( (H)^{-1} \right)_{ii} \,. \]
The Wald z-statistic is
\[ z_i = \frac{c_i}{\mathit{se}(c_i)} \,. \]
The Wald \( p \)-value for coefficient \( i \) gives the probability (under the assumptions inherent in the Wald test) of seeing a value at least as extreme as the one observed, provided that the null hypothesis ( \( c_i = 0 \)) is true. Letting \( F \) denote the cumulative density function of a standard normal distribution, the Wald \( p \)-value for coefficient \( i \) is therefore
\[ p_i = \Pr(|Z| \geq |z_i|) = 2 \cdot (1 - F( |z_i| )) \]
where \( Z \) is a standard normally distributed random variable.
The condition number is computed as \( \kappa(H) \) during the iteration immediately preceding convergence (i.e., \( A \) is computed using the coefficients of the previous iteration). A large condition number (say, more than 1000) indicates the presence of significant multicollinearity.
A somewhat random selection of nice write-ups, with valuable pointers into further literature:
[1] John Fox: Cox Proportional-Hazards Regression for Survival Data, Appendix to An R and S-PLUS companion to Applied Regression Feb 2012, http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf
[2] Stephen J Walters: What is a Cox model? http://www.medicine.ox.ac.uk/bandolier/painres/download/whatis/cox_model.pdf
If number of ties in the source table is very large, a memory allocation error may be raised. The limitation is about \((10^8 / m)\), where \(m\) is number of featrues. For instance, if there are 100 featrues, the number of ties should be fewer than 1 million.
File cox_prop_hazards.sql_in documenting the functions