Cox models

The Cox Proportional Hazards Model: Theory

The Cox Proportional Hazards Model [4] is a semi-parametric model used to analyze time-to-event data. It models the relationship between the survival time of an individual and a set of explanatory variables (covariates).

1. The Hazard Function

The model is defined by the hazard function, which describes the risk of an event occurring at time $t$, given that the event has not occurred before time $t$. The hazard function is given by:

\[h(t | \mathbf{X}) = h_0(t) \exp(\mathbf{X}^T\mathbf{\beta})\]

where:

  • $h_0(t)$ is the baseline hazard function. This is an unspecified, non-negative function of time that represents the hazard for an individual with all covariates equal to zero. It captures the underlying risk profile common to all individuals.
  • $\mathbf{X}$ is the covariate vector for an individual. These are the independent variables (e.g., age, treatment, gender) that influence the event time.
  • $\mathbf{\beta}$ is the vector of regression coefficients.

The term $\exp(\mathbf{X}^T\mathbf{\beta})$ is often called the hazard ratio.

The Baseline Hazard Function

As seen above, $h_0(t)$ is the baseline hazard function: the hazard when all the covariates are zero. The Cox model does not estimate it parametrically.

For predictions, we use the cumulative baseline hazard function: $H_0(t) = \int_0^t h_0(u) du$. It is estimated using the Breslow estimator:

\[\tilde{H}_0(t) = \sum_{t_i \le t} \frac{d_i}{\sum_{j \in R(t_i)} \exp(\mathbf{X}_j^T\mathbf{\beta})}\]

where:

  • $d_i$ is the number of events that occur at time $t_i$
  • $R(t_i)$ is the set of individuals still at risk just before $t_i$
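As an illustration, the Breslow estimator can be sketched in a few lines of Julia (a simplified stand-alone version assuming inputs `T`, `Δ` and linear predictors `η = Xβ`; not the package's internal code):

```julia
# Sketch of the Breslow estimator of the cumulative baseline hazard H₀(t).
# T: observed times, Δ: event indicators, η: linear predictors Xβ.
function breslow_H0(T, Δ, η)
    r = exp.(η)
    H0, out = 0.0, Float64[]
    for i in sortperm(T)
        # each event adds 1 / Σ_{j ∈ R(tᵢ)} exp(ηⱼ); with dᵢ ties this sums to dᵢ / Σ
        Δ[i] && (H0 += 1 / sum(r[T .>= T[i]]))
        push!(out, H0)
    end
    return out   # cumulative baseline hazard at each sorted time
end
```

With all covariate effects zero this reduces to the Nelson–Aalen estimator, which provides a quick sanity check.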

2. The Survival Function

The survival function $S(t)$ represents the probability that an individual survives beyond time $t$:

\[S(t | \mathbf{X}) = \exp\left(-\int_0^t h(u | \mathbf{X}) du\right)\]

After substituting the Cox hazard function:

\[S(t | \mathbf{X}) = \exp\left(-\exp(\mathbf{X}^\top \boldsymbol{\beta}) \int_0^t h_0(u) du\right)\]

with $H_0(t) = \int_0^t h_0(u) du$

Finally,

\[S(t | \mathbf{X}) = \exp\left(-\exp(\mathbf{X}^\top \boldsymbol{\beta}) H_0(t)\right)\]
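A quick numeric check of this relationship, with made-up values for $\boldsymbol{\beta}$, $\mathbf{X}$ and $H_0(t)$:

```julia
# S(t|X) = exp(-exp(X'β) H₀(t)); all values below are purely illustrative.
β   = [0.5, -0.2]
X   = [1.0, 2.0]
H0t = 0.8                  # assumed cumulative baseline hazard at time t
H   = exp(X' * β) * H0t    # individual cumulative hazard H(t | X)
S   = exp(-H)              # survival probability, here ≈ 0.413
```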

3. The Full Likelihood Function

The full likelihood function includes the baseline hazard:

\[L(\boldsymbol{\beta}, h_0(\cdot)) = \prod_{i=1}^n \left( h(t_i | \mathbf{X}_i) \right)^{\Delta_i} S(t_i | \mathbf{X}_i)\]

with:

  • $h(t_i | \mathbf{X}_i)$ enters for individuals who experience the event at time $t_i$ ($\Delta_i = 1$)
  • $S(t_i | \mathbf{X}_i)$ enters for individuals censored at time $t_i$ ($\Delta_i = 0$)

And if we substitute the hazard and survival function:

\[L(\boldsymbol{\beta}, h_0(\cdot)) = \prod_{i=1}^n \left( h_0(t_i) \exp(\mathbf{X}_i^\top \boldsymbol{\beta}) \right)^{\Delta_i} \exp\left(-\exp(\mathbf{X}_i^\top \boldsymbol{\beta}) H_0(t_i)\right)\]

4. The Partial-Likelihood Function

Since the baseline hazard function $h_0(t)$ is unspecified, a standard likelihood function cannot be formed directly. Instead, Cox introduced the concept of a partial likelihood. This approach focuses on the order of events rather than their exact timings, factoring out the unknown $h_0(t)$.

For each distinct observed event time $t_{(j)}$, we consider the set of individuals who are "at risk" of experiencing the event just before $t_{(j)}$. This is called the risk set, $R(t_{(j)})$. The partial likelihood is constructed by considering the probability that the specific individual(s) who experienced the event at $t_{(j)}$ were the ones to fail, given that some event occurred among the individuals in $R(t_{(j)})$. We can write this as follows: $P(\text{Individual } i \text{ fails at } t \mid \text{An event occurs in } R(t) \text{ at } t)$. Using the definition of conditional probability:

\[P(\text{Individual } i \text{ fails at } t \mid \text{An event occurs in } R(t) \text{ at } t) = \frac{P(\text{Individual } i \text{ fails at } t)}{\sum_{l \in R(t)} P(\text{Individual } l \text{ fails at } t)}\]

By substituting the hazard function and canceling out $h_0(t)$ and $dt$:

\[\frac{h_i(t)dt}{\sum_{l \in R(t)} h_l(t)dt} = \frac{h_0(t)\exp(\mathbf{X}_i^T\mathbf{\beta})dt}{\sum_{l \in R(t)} h_0(t)\exp(\mathbf{X}_l^T\mathbf{\beta})dt} = \frac{\exp(\mathbf{X}_i^T\mathbf{\beta})}{\sum_{l \in R(t)} \exp(\mathbf{X}_l^T\mathbf{\beta})}\]

To obtain the partial likelihood, we multiply these conditional probabilities over all event times. For tied events, we use the Breslow approximation: when several individuals experience the event at the exact same time $t_{(j)}$, they are treated as if they failed simultaneously.

The partial-likelihood function for the Cox model, accounting for tied event times using Breslow's approximation, is given by:

\[L(\mathbf{\beta}) = \prod_{j=1}^{k} \frac{\exp\left(\sum_{i \in D_j} \mathbf{X}_i^T\mathbf{\beta}\right)}{\left( \sum_{l \in R_j} \exp(\mathbf{X}_l^T\mathbf{\beta}) \right)^{d_j}}\]

where:

  • $k$ is the number of distinct event times.
  • $t_{(j)}$ denotes the $j$-th distinct ordered event time.
  • $D_j$ is the set of individuals who experience the event at time $t_{(j)}$, and $d_j = |D_j|$ is the number of such events.
  • $R_j$ is the risk set at time $t_{(j)}$, comprising all individuals who are still at risk (have not yet experienced the event or been censored) just before $t_{(j)}$.
  • $\mathbf{X}_i$ is the covariate vector for individual $i$.

5. The Loss Function (Negative Log-Partial-Likelihood)

Our goal is to estimate the regression coefficients $\mathbf{\beta}$ by maximizing the partial-likelihood function $L(\mathbf{\beta})$. Equivalently, it is often more convenient to minimize its negative logarithm, which we define as our loss function:

\[\text{Loss}(\mathbf{\beta}) = - \log L(\mathbf{\beta}) \]

Taking the negative logarithm of the Breslow partial likelihood, we get:

\[ \text{Loss}(\mathbf{\beta}) = - \sum_{i=1}^{n} \Delta_i \left( \mathbf{X}_i^T\mathbf{\beta} - \log \left( \sum_{j \in R_i} \exp(\mathbf{X}_j^T\mathbf{\beta}) \right) \right) \]

This function is convex, which facilitates optimization.

The loss function is coded as follows:

function loss(beta, M::Cox)
    # M.X: Design matrix (n x m), where n is number of observations, m is number of covariates.
    # M.T: Vector of observed times (n) for each individual.
    # M.Δ: Vector of event indicators (n), 1 if event, 0 if censored.
    η = M.X*beta
    return dot(M.Δ, log.((M.T .<= M.T') * exp.(η)) .- η)
end

6. Gradient of the Loss Function

To find the optimal $\mathbf{\beta}$, we need to minimize the loss function.

The gradient of the loss function with respect to a specific coefficient $\beta_k$ is:

\[\frac{\partial}{\partial \beta_k} \text{Loss}(\mathbf{\beta}) = - \sum_{i=1}^{n} \Delta_i \left( X_{ik} - \frac{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jk}}{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j)} \right)\]
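Mirroring the vectorized `loss` shown earlier, the gradient can be written in the same style (a stand-alone sketch taking `X`, `T`, `Δ` directly instead of the model struct):

```julia
# Gradient of the negative log-partial-likelihood (Breslow ties).
# X: n×m design matrix, T: observed times, Δ: event indicators.
function loss_gradient(beta, X, T, Δ)
    r = exp.(X * beta)                  # rⱼ = exp(ηⱼ)
    Rmat = T .<= T'                     # Rmat[i, j] = 1 if j is at risk at Tᵢ
    denom = Rmat * r                    # Σ_{j ∈ Rᵢ} exp(ηⱼ)
    wavg = (Rmat * (r .* X)) ./ denom   # risk-set weighted averages of the covariates
    return -vec(sum(Δ .* (X .- wavg); dims = 1))
end
```

A finite-difference comparison against the loss is an easy way to validate such a gradient.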

7. Hessian Matrix of the Loss Function

For optimization algorithms like Newton-Raphson and for calculating standard errors, the Hessian matrix (matrix of second partial derivatives) of the loss function is required.

The entry for the $k$-th row and $l$-th column of the Hessian matrix is:

\[\frac{\partial^2}{\partial \beta_k \partial \beta_l} \text{Loss}(\mathbf{\beta}) = \sum_{i=1}^{n} \Delta_i \left[ \frac{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jk}X_{jl}}{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j)} - \frac{\left( \sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jk} \right) \left( \sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jl} \right)}{\left( \sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) \right)^2} \right]\]
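A direct, unoptimized implementation of this Hessian, looping over events, might look like the following stand-alone sketch (assumed inputs as before; not the package's code):

```julia
# Full Hessian of the loss: for each event, the weighted second moment of X
# over the risk set minus the outer product of its weighted mean.
function loss_hessian(beta, X, T, Δ)
    n, m = size(X)
    r = exp.(X * beta)
    H = zeros(m, m)
    for i in 1:n
        Δ[i] || continue
        w  = r .* (T .>= T[i])     # weights exp(ηⱼ) restricted to the risk set Rᵢ
        s0 = sum(w)
        s1 = X' * w                # Σ wⱼ Xⱼ
        s2 = X' * (w .* X)         # Σ wⱼ Xⱼ Xⱼᵀ
        H .+= s2 ./ s0 .- (s1 * s1') ./ s0^2
    end
    return H
end
```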

8. Information Matrix and Variance-Covariance Matrix

The observed Information Matrix, $I(\hat{\boldsymbol{\beta}})$, is defined as the negative of the Hessian matrix of the log-likelihood function, evaluated at the maximum likelihood estimates $\hat{\boldsymbol{\beta}}$.

\[I(\hat{\boldsymbol{\beta}}) = -H(\hat{\boldsymbol{\beta}})\]

Recall that $\text{Loss}(\boldsymbol{\beta}) = -\log L(\boldsymbol{\beta})$, so $\mathbf{H}_{\text{Loss}} = -\mathbf{H}_{\text{log-likelihood}}$. Therefore, the observed Information Matrix is equal to $\mathbf{H}_{\text{Loss}}$ itself.

The variances and covariances of our estimators $\hat{\boldsymbol{\beta}}$ are obtained by inverting the observed information matrix.

\[\text{Var}(\hat{\boldsymbol{\beta}}) = I(\hat{\boldsymbol{\beta}})^{-1}\]

This final matrix contains:

  • On its diagonal: the variances of each coefficient ($\text{Var}(\hat{\beta}_1)$, $\text{Var}(\hat{\beta}_2)$, ...).
  • Off-diagonal: the covariances between pairs of coefficients.

9. Standard Error

The standard error for a specific coefficient ($\hat{\beta}_k$) is the square root of its variance.

\[SE(\hat{\beta}_k) = \sqrt{\text{Var}(\hat{\beta}_k)}\]

10. Wald Test for Significance

To determine if a variable has a statistically significant effect, a Wald test is performed. A z-score is calculated:

\[z = \frac{\text{Coefficient}}{\text{Standard Error}} = \frac{\hat{\beta}}{SE(\hat{\beta})}\]

This $z$-score is then compared to a normal distribution to obtain a $p$-value. A low $p$-value (typically < 0.05) suggests that the coefficient is significantly different from zero.

The p-value for each coefficient is calculated by comparing its z-score to a standard normal distribution. This p-value indicates the probability of observing a z-score as extreme as, or more extreme than, the one calculated, assuming the null hypothesis (that the coefficient is zero) is true.

11. Confidence Interval

The standard error allows for the construction of a confidence interval (CI) around the coefficient, which provides a range of plausible values for the true coefficient.

The general formula for a $(1 - \alpha) \times 100\%$ confidence interval is:

\[\text{CI for } \hat{\beta} = \hat{\beta} \pm z_{\alpha/2} \times SE(\hat{\beta})\]
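Putting sections 8 to 11 together, the standard errors, z-scores, p-values and confidence intervals can all be derived from the loss Hessian at the optimum. A stand-alone sketch using Distributions.jl (illustrative; not the package's summary code):

```julia
using Distributions, LinearAlgebra

# Wald statistics from the loss Hessian H evaluated at the estimates β̂.
function wald_table(βhat, H; level = 0.95)
    V  = inv(Symmetric(H))                        # variance-covariance matrix I(β̂)⁻¹
    se = sqrt.(diag(V))                           # standard errors
    z  = βhat ./ se                               # Wald z-scores
    p  = 2 .* ccdf.(Normal(), abs.(z))            # two-sided p-values
    zq = quantile(Normal(), 1 - (1 - level) / 2)  # ≈ 1.96 for a 95% CI
    return (se = se, z = z, p = p,
            ci_lower = βhat .- zq .* se, ci_upper = βhat .+ zq .* se)
end
```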

As an example, let us look at the output on the colon dataset:

using SurvivalModels
using RDatasets


colon = dataset("survival", "colon")
colon.Time = Float64.(colon.Time)
colon.Status = Bool.(colon.Status)
model_colon = fit(Cox, @formula(Surv(Time, Status) ~ Age + Rx), colon)
Cox Model (n: 1858, m: 3, method: SurvivalModels.CoxDefault, C-index: 0.5435170918176299)

The output DataFrame contains columns giving, respectively, the name of each predictor, the estimated coefficient, its standard error, the associated p-value, and the test statistic $z$, as just described.

SurvivalModels.CoxMethodType
StatsBase.fit(Cox, @formula(Surv(T,Δ)~predictors), dataset)

Arguments:

  • Cox: The model type to fit, e.g. Cox. Can be any of CoxNM, CoxOptim, CoxApprox or CoxDefault if you want a different solver to be used; see their own documentation. Default is CoxDefault.
  • formula: A StatsModels.FormulaTerm specifying the survival model
  • df: A DataFrame containing the variables specified in the formula

Returns:

  • predictor: A Vector{String} containing the names of the predictor variables included in the model

  • beta: A Vector{Float64} containing the estimated regression coefficients (β) for each predictor

  • se: A Vector{Float64} containing the standard errors of the estimated regression coefficients

  • loglikelihood: A Vector{Float64} containing the log-likelihood of the fitted model. This value is repeated for each predictor row

  • coef: A vector of the estimated coefficients

  • formula: The applied formula

Example:

ovarian = dataset("survival", "ovarian")
ovarian.FUTime = Float64.(ovarian.FUTime)  # Time column needs to be Float64
ovarian.FUStat = Bool.(ovarian.FUStat)     # Status column needs to be Bool
model = fit(Cox, @formula(Surv(FUTime, FUStat) ~ Age + ECOG_PS), ovarian)


Types:

  • Cox : the base abstract type
  • CoxGrad<:Cox : abstract type for Cox models that are solved using gradient-based optimization
  • CoxLLH<:CoxGrad : abstract type for Cox models that are solved by optimizing the log-likelihood
source

Model Summary & Extraction

To inspect the model statistics programmatically (such as extracting standard errors, p-values, or confidence intervals), you can use the summary function. This returns a DataFrame allowing for easy data manipulation.

SurvivalModels.summaryMethod
summary(model::Cox)

Compute a statistical summary table for a fitted Cox proportional hazards model.

This function calculates standard errors, z-scores, p-values, and confidence intervals.

Returns

A DataFrame with the following columns:

  • predictor: Name of the covariate.
  • β: Estimated regression coefficient.
  • e_β: Hazard ratio (exp(β)).
  • se: Standard error of the coefficient.
  • z_scores: Wald statistic (β / se).
  • p_values: Two-sided p-value.
  • ci_lower_β: Lower bound of the 95% confidence interval for β.
  • ci_upper_β: Upper bound of the 95% confidence interval for β.

Example

model = fit(Cox, @formula(Surv(time, status) ~ age + sex), df)

# Get the full summary dataframe
summ = summary(model)

# Extract standard errors specifically
standard_errors = summ.se
source

12. Prediction Interface

After fitting a Cox model, you can obtain various types of predictions using the predict function:

predict(model; type=:lp)        # Linear predictor (default)
predict(model; type=:risk)      # Relative risk (exp(linear predictor))
predict(model; type=:expected)  # Expected cumulative hazard
predict(model; type=:survival)  # Survival probability
predict(model; type=:terms)     # Contribution of each term (covariate) to the linear predictor

Available types:

  • :lp — Linear predictor (Xβ)
  • :risk — Relative risk, exp(Xβ)
  • :expected — Expected cumulative hazard for each subject
  • :survival — Survival probability for each subject
  • :terms — Covariate-wise contributions to the linear predictor

If you want to center predictions (e.g., for survival curves), you can use the centered keyword argument.

Example: Prediction on New Data

The following example is drawn from this article.

    using DataFrames, SurvivalModels

    df = DataFrame(
        time = [1.0, 3.0, 5.0, 6.0, 2.0, 7.0, 9.0, 11.0],
        status = [true, false, true, true, true, false, true, true],
        sex = [:male, :male, :male, :male, :female, :female, :female, :female],
        age = [57, 52, 48, 42, 39, 31, 26, 22],
    )
    model = fit(Cox, @formula(Surv(time, status) ~ age + sex), df)
Cox Model (n: 8, m: 2, method: SurvivalModels.CoxDefault, C-index: 0.9047619047619048)

You can extract the baseline hazard, centered or not, with the following:

    SurvivalModels.baseline_hazard(model, centered = false)
8-element Vector{Float64}:
 3.4424556285043997e-13
 5.942770374703066e-12
 5.942770374703066e-12
 1.0965739923044279e-10
 1.8972983721645534e-9
 1.8972983721645534e-9
 6.646861703876311e-8
 9.459173599386564e-7
    SurvivalModels.baseline_hazard(model, centered = true)
8-element Vector{Float64}:
     0.027808077917814818
     0.4800556331330888
     0.4800556331330888
     8.858099656581187
   153.26332903066987
   153.26332903066987
  5369.31969841062
 76410.98822358932

Each prediction type can then be accessed as follows:

    predict(model, :lp)
8-element Vector{Float64}:
   3.5189684309689433
   0.34988421408591464
  -2.185383159420514
  -5.98828421968015
  -0.3961355271103777
  -5.466670274123228
  -8.635754491006256
 -11.171021864512682
    predict(model, :risk)
8-element Vector{Float64}:
 33.749595463125615
  1.4189032500719052
  0.11243464500647969
  0.0025079634750864986
  0.6729154914651001
  0.004225277777741304
  0.00017763947357070042
  1.4076245966063964e-5
    predict(model, :terms)
8×2 Matrix{Float64}:
  11.0126    -7.4936
   7.84348   -7.4936
   5.30822   -7.4936
   1.50532   -7.4936
  -0.396136   0.0
  -5.46667    0.0
  -8.63575    0.0
 -11.171      0.0
    predict(model, :expected)
8-element Vector{Float64}:
   0.9385113803333266
   0.6811524980678659
   0.05397488469467969
   0.022215790397381873
 103.13326837825056
   0.6475801382959431
   0.9538031246584544
   1.0755798647452601
    predict(model, :survival)
8-element Vector{Float64}:
 0.3912097646648285
 0.506033453588987
 0.9474559018472358
 0.9780291629766359
 1.621028471053042e-45
 0.5233105850573463
 0.38527299245365143
 0.34109990613348057

Different versions of the optimisation routine

To implement the Cox proportional hazards model, different versions were coded, using different methods. The final goal is to compare these versions and choose the most efficient one: the fastest and the closest to the true values of the coefficients.

V0: Derivative-Free Optimization with Nelder-Mead

SurvivalModels.CoxNMType
CoxNM(T, Δ, X)
fit(CoxNM, @formula(Surv(T,Δ)~X), data = ...)

An implementation of the Cox proportional hazards model that minimizes the negative partial log-likelihood function (cox_nllh). This version uses the Nelder-Mead method, a derivative-free optimization algorithm.

Fields:

  • X::Matrix{Float64}: The design matrix of covariates, where rows correspond to individuals and columns to features
  • T::Vector{Float64}: The observed times, sorted in ascending order
  • Δ::Vector{Bool}: The event indicator vector (true for event, false for censoring)
source

V1: Implementation using the 'Optimization.jl' Julia package

SurvivalModels.CoxOptimType
CoxOptim(T, Δ, X)
fit(CoxOptim, @formula(Surv(T,Δ)~X), data = ...)

The first implementation of the Cox proportional hazards model uses optimization libraries (Optimization.jl, Optim.jl) for coefficient estimation. It uses the BFGS algorithm to minimize the negative partial log-likelihood.

Fields:

  • X::Matrix{Float64}: The design matrix of covariates, where rows correspond to individuals and columns to features.
  • T::Vector{Float64}: The observed times, sorted in ascending order
  • Δ::Vector{Int64}: The event indicator vector (true for event, false for censoring)
source

V2: Implementation using the gradient and the Hessian matrix

For this version the Hessian matrix was implemented directly. Its complexity is $O(n^2 m^2)$, which makes it the slowest version.

SurvivalModels.CoxHessianType
CoxHessian(T, Δ, X)
fit(CoxHessian, @formula(Surv(T,Δ)~X), data = ...)

The second implementation of the Cox proportional hazards model uses a Newton-Raphson-like iterative update that directly calculates and utilizes the gradient and Hessian matrix. This version is updating coefficients via the update! function.

Fields:

  • X::Matrix{Float64}: The design matrix of covariates, where rows correspond to individuals and columns to features
  • T::Vector{Float64}: The observed times, sorted in ascending order
  • Δ::Vector{Int64}: The event indicator vector (true for event, false for censoring)
  • R::BitMatrix: A boolean risk matrix, where 'R[i,j]' is 'true' if individual 'j' is at risk at time 'T[i]'
source

V3: Improved version of V2, much faster because it is non-allocating.

SurvivalModels.CoxDefaultType
CoxDefault(T, Δ, X)
fit(CoxDefault, @formula(Surv(T,Δ)~X), data = ...)
fit(Cox, @formula(Surv(T,Δ)~X), data = ...)

The third implementation of the Cox proportional hazards model is a highly optimized and significantly faster iteration of the previous implementation, CoxHessian.

This is the default implementation called when you ask for a Cox model.

Fields:

  • Xᵗ::Matrix{Float64}: The design matrix of covariates, transposed (m rows, n columns)
  • sX::Vector{Float64}: Sum of X' multiplied by Δ
  • T::Vector{Float64}: The observed times sorted in descending order
  • Δ::Vector{Bool}: The event indicator vector (true for event, false for censoring)
  • loss::Vector{Float64}: Stores the current negative partial log-likelihood value
  • G::Vector{Float64}: Stores the gradient vector
  • H::Matrix{Float64}: Stores the Hessian matrix
  • S₁::Vector{Float64}: Sum of rⱼxₖⱼ
  • S₂::Matrix{Float64}: Sum of rⱼxₖⱼ * xⱼ
  • μ::Vector{Float64}: Updates the gradient and Hessian
  • η::Vector{Float64}: ηi = Xiβ
  • r::Vector{Float64}: ri = exp(ηi)
  • R::Vector{UnitRange{Int64}}: A vector of the risk ranges for each output time.
source

V4: Bounding the Hessian matrix by a universal upper bound.

Coding the Hessian matrix for CoxHessian was very impractical, so for this version we tried a different approach.

The Hessian matrix simplifies for the diagonal terms as follows:

\[\frac{\partial^2 \text{Loss}(\mathbf{\beta})}{\partial \beta_k^2} = \sum_{i=1}^{n} \Delta_i \left[ \frac{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jk}^2}{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j)} - \left( \frac{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jk}}{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j)} \right)^2 \right]\]

This can be expressed using the concept of variance.

\[\frac{\partial^2 \text{Loss}(\mathbf{\beta})}{\partial \beta_k^2} = \sum_{i=1}^{n} \Delta_i \left[ \sum_{j \in R_i} A_j X_{jk}^2 - \left( \sum_{j \in R_i} A_j X_{jk} \right)^2 \right]\]

with:

\[A_{j} = \frac{\exp(\mathbf{\beta}^T\mathbf{X}_j)}{\sum_{l \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_l)}\]

The variance of a random variable $X$ taking values in an interval $[a,b]$ is bounded above (Popoviciu's inequality) by:

\[\text{Var}(X) \le \frac{1}{4}(b-a)^2\]

Applying this bound to the diagonal Hessian elements, we get an upper bound $B_k$ for the $k$-th diagonal element:

\[B_k = \sum_{i=1}^n \frac{1}{4} \Delta_i \left( \max_{j \in R_i} X_{jk} - \min_{j \in R_i} X_{jk} \right)^2\]
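A direct sketch of this bound (assuming, as elsewhere in this document, that the risk set at time $T_i$ is $\{j : T_j \ge T_i\}$; the function name is illustrative, not the package's code):

```julia
# Universal diagonal bound Bₖ on the Hessian: each event contributes
# (range of the k-th covariate over its risk set)² / 4, per the variance bound.
function hessian_bound(X, T, Δ)
    n, m = size(X)
    B = zeros(m)
    for i in 1:n
        Δ[i] || continue
        atrisk = T .>= T[i]
        for k in 1:m
            xs = X[atrisk, k]
            B[k] += (maximum(xs) - minimum(xs))^2 / 4
        end
    end
    return B
end
```

Since the bound does not depend on $\boldsymbol{\beta}$, it can be pre-computed once and reused at every iteration.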

SurvivalModels.CoxApproxType
CoxApprox(T, Δ, X)
fit(CoxApprox, @formula(Surv(T,Δ)~X), data = ...)

The fourth implementation of the Cox proportional hazards model uses a Hessian approximation based on a pre-computed upper bound. This version was created for cases where working with the full Hessian is difficult, offering faster iterations.

Fields:

  • X::Matrix{Float64}: The design matrix of covariates, where rows correspond to individuals and columns to features
  • T::Vector{Float64}: The observed times sorted in ascending order
  • Δ::Vector{Bool}: The event indicator vector (true for event, false for censoring)
  • sX::Vector{Float64}: Sum of X' multiplied by Δ
  • G::Vector{Float64}: Stores the gradient vector
  • η::Vector{Float64}: ηi = Xiβ
  • A::Vector{Float64}: Ai = exp(ηi)
  • B::Vector{Float64}: Stores the upper bounds on the diagonal elements of the Hessian matrix
  • C::Vector{Float64}: Used in the mkA! function
  • K::Vector{Int64}: Number of events at each unique observed event time
  • loss::Vector{Float64}: Stores the current negative partial log-likelihood value, used in CoxLLH getβ
source

Comparison of the different methods' speed

We propose to compare the different methods on simulated data, with varying numbers of rows and columns, to verify empirically the theoretical complexity of the different methods. We will then compare the results with Julia's and R's existing Cox implementations.

We will start by coding our Julia and R structures:

using SurvivalModels, Plots, Random, Distributions, StatsBase, LinearAlgebra, DataFrames, RCall, Survival
using SurvivalModels: getβ, CoxNM, CoxOptim, CoxHessian, CoxDefault, CoxApprox

We add specific code to compare with Survival.jl and with R's survival::coxph():

struct CoxVJ<:SurvivalModels.CoxMethod
    T::Vector{Float64}
    Δ::Vector{Bool}
    X::Matrix{Float64}
    function CoxVJ(T,Δ,X)
        new(T,Bool.(Δ),X)
    end
end

function SurvivalModels.getβ(M::CoxVJ)
    return fit(Survival.CoxModel, M.X, Survival.EventTime.(M.T,M.Δ)).β
end

R"""
library(survival)
"""

struct CoxVR<:SurvivalModels.CoxMethod
    df::DataFrame
    function CoxVR(T,Δ,X)
        df = DataFrame(X,:auto)
        df.status = Δ
        df.time = T
        new(df)
    end
end

function SurvivalModels.getβ(M::CoxVR)
    df = M.df
    @rput df
    R"""
    beta  <- coxph(Surv(time,status)~., data = df, ties="breslow")$coefficients
    """
    @rget beta
    return beta
end

Then, we will simulate our data and then run our models:

# Creating a dictionary for all the models:
# Label => (constructor, plotting color)

const design = Dict(
"CoxNM"=> (CoxNM, :blue),
"CoxOptim"=> (CoxOptim, :orange),
"CoxHessian"=> (CoxHessian, :brown),
"CoxDefault"=> (CoxDefault, :purple),
"CoxApprox"=> (CoxApprox, :green),
"VR"=> (CoxVR, :red),
"VJ"=> (CoxVJ, :black)
);


function simulate_survival_data(n, m; β=ones(m))
    X = randn(n,m)
    O = rand.(Exponential.(exp.(.- X * β)))
    C = rand(Exponential(1/3),n)
    T = min.(O, C)
    Δ = Bool.(O .<= C)
    return (T, Δ, X)
end

# Run the models and get the running time, the β coefficients and the difference between the true β and the obtained ones:
function run_models()
    Ns = (1000, 2000, 4000)
    Ms = (5,10,20)
    true_betas = randn(maximum(Ms))
    df = []
    for n in Ns, m in Ms
        if (n == 2000) | (m == 10) # Only if they end up in the graphs.
            data = simulate_survival_data(n,m, β = true_betas[1:m])
            for (name, (constructor, _)) in design
                display((n,m,name))
                model = constructor(data...)
                beta = getβ(model)
                time = @elapsed getβ(model)
                push!(df, (
                    n = n,
                    m = m,
                    name = name,
                    time = time,
                    beta = beta,
                    diff_to_truth = sqrt(sum((beta .- true_betas[1:m]).^2)/sum(true_betas[1:m].^2)),
                ))
            end
        end
    end
    df = DataFrame(df)
    sort!(df, :name)
    return df
end
run_models (generic function with 1 method)

The first graph compares the runtime of all implementations: the left panel varies the number of observations at a fixed number of covariates, and the right panel varies the number of covariates for n = 2000 observations.

# Plot the results, starting with the time:
function timing_graph(df)
    group1 = groupby(filter(r -> r.m==10, df), :name)
    p1 = plot(; xlabel = "Number of observations (n)",
        ylabel = "Time (in seconds)",
        yscale= :log10,
        xscale= :log10,
        title = "For m=10 covs., varying n",
        legend = :bottomright,
        lw = 1);
    for g in group1
        plot!(p1, g.n, g.time, label = g.name[1] , color = design[g.name[1]][2], marker = :circle, markersize = 3)
    end
    group2 = groupby(filter(r -> r.n==2000, df), :name)
        p2 = plot(; xlabel = "Number of covariates (m)",
            ylabel = "Time (in seconds)",
            yscale= :log10,
            xscale= :log10,
            title = "For n=2000 obs., varying m",
            legend = :bottomright,
            lw = 1);
        for g in group2
            plot!(p2, g.m, g.time, label = g.name[1] , color = design[g.name[1]][2], marker = :circle, markersize = 3)
        end
        p = plot(p1,p2, size=(1200,600), plot_title = "Runtime (logscale) of the various implementations")
        return p
end
timing_graph (generic function with 1 method)
df = run_models()
timing_graph(df)
Example block output

We can see that CoxDefault is the fastest, while CoxHessian is the slowest. Let us zoom in on our implementations vs Survival.jl vs R's survival:

timing_graph(filter(r -> r.name ∈ ("CoxApprox", "CoxDefault", "VJ", "VR"), df))
Example block output

So we are about 10× faster than the reference R implementation (and than the previous Julia versions) on this example.

function beta_correctness_graphs(df; ref="VJ")

    reflines = filter(r -> r.name == ref, df)
    rename!(reflines, :beta => :refbeta)
    select!(reflines, Not([:name, :time, :diff_to_truth]))
    otherlines = filter(r -> r.name != ref, df)
    rez = leftjoin(otherlines, reflines, on=[:n,:m])
    percent(x,y) = sqrt(sum((x .- y).^2)/sum(y .^2))
    rez.error = percent.(rez.beta, rez.refbeta)
    select!(rez, [:n,:m,:name,:error])
    rez = filter!(r -> !isnan(r.error), rez)

    group1 = groupby(filter(r -> r.m==10, rez), :name)
    p1 = plot(; xlabel = "Number of observations (n)",
                ylabel = "L2dist to $ref's β",
                yscale=:log10,
                xscale= :log10,
                title = "m=10, varying n",
                legend = :bottomright,
                lw = 1);
    for g in group1
        plot!(p1, g.n, g.error, label = g.name[1] , color = design[g.name[1]][2], marker = :circle, markersize = 3)
    end

    group2 = groupby(filter(r -> r.n==2000, rez), :name)
    p2 = plot(; xlabel = "Number of covariates (m)",
                ylabel = "L2Dist to $ref's β",
                yscale=:log10,
                xscale= :log10,
                title = "n=2000, varying m",
                legend = :bottomright,
                lw = 1);
    for g in group2
        plot!(p2, g.m, g.error, label = g.name[1] , color = design[g.name[1]][2], marker = :circle, markersize = 3)
    end
    p = plot(p1,p2, size=(1200,600), plot_title="β-correctness w.r.t. $ref's version.")
    return p
end

beta_correctness_graphs(df)
Example block output

We now compare the different models' coefficient values with the true β used to simulate the data.

function beta_wrt_truth(df)
    group1 = groupby(filter(r -> r.m==10, df), :name)
    p1 = plot(; xlabel = "Number of observations (n)",
                ylabel = "L2dist to the truth",
                yscale=:log10,
                xscale= :log10,
                title = "m=10, varying n",
                legend = :bottomright,
                lw = 1,
                plot_title="β-correctness w.r.t. the truth.");
    for g in group1
        plot!(p1, g.n, g.diff_to_truth, label = g.name[1] , color = design[g.name[1]][2], marker = :circle, markersize = 3)
    end

    return p1
end

beta_wrt_truth(df)
Example block output

Model Evaluation: Harrell's Concordance Index (C-index)

A key metric for assessing the predictive discrimination of a Cox model is Harrell's concordance index (C-index). The C-index measures the proportion of all usable patient pairs in which the predictions and outcomes are concordant: it is the probability that, for a randomly chosen pair of comparable subjects, the subject with the higher predicted risk actually experiences the event before the other. Tied risk scores count as half-concordant.
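The pairwise definition translates directly into a naive $O(n^2)$ sketch (for illustration only; the package's `harrells_c` is the function to use in practice):

```julia
# Harrell's C: concordant pairs / usable pairs; tied risk scores count ½.
function cindex_naive(times, statuses, risk)
    num, den = 0.0, 0
    n = length(times)
    for i in 1:n, j in 1:n
        # pair (i, j) is usable if i has an event strictly before time Tⱼ
        (statuses[i] && times[i] < times[j]) || continue
        den += 1
        num += risk[i] > risk[j] ? 1.0 : (risk[i] == risk[j] ? 0.5 : 0.0)
    end
    return num / den
end
```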

SurvivalModels.harrells_cFunction
harrells_c(times, statuses, risk_scores)
harrells_c(C::Cox)

Compute Harrell's concordance index (C-index) for survival models.

Arguments

  • times: Vector of observed times (Float64).
  • statuses: Vector of event indicators (Bool; true if event, false if censored).
  • risk_scores: Vector of predicted risk scores (higher means higher risk).

Returns

  • The C-index, a value between 0 and 1 indicating the proportion of all usable patient pairs in which predictions and outcomes are concordant.

Details

The C-index measures the discriminative ability of a survival model: it is the probability that, for a randomly chosen pair of comparable subjects, the subject with the higher predicted risk actually experiences the event before the other. Tied risk scores count as half-concordant.

You can also call harrells_c(C::Cox) to compute the C-index for a fitted Cox model.

source

The C-index ranges from 0.5 (no better than random) to 1.0 (perfect discrimination). It is output by the show function, and can be queried separately too:

You can compute the C-index for a fitted Cox model using:

model = fit(Cox, @formula(Surv(Time, Status) ~ Age + Rx), colon)
cindex = SurvivalModels.harrells_c(model)
model
Cox Model (n: 1858, m: 3, method: SurvivalModels.CoxDefault, C-index: 0.5435170918176299)

A higher C-index indicates better predictive discrimination.

[4]
D. R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34, 187–202 (1972).