Cox models

The Cox Proportional Hazards Model: Theory

The Cox Proportional Hazards Model [4] is a semi-parametric model used to analyze time-to-event data. It models the relationship between the survival time of an individual and a set of explanatory variables (covariates).

1. The Hazard Function

The model is defined by the hazard function, which describes the risk of an event occurring at time $t$, given that the event has not occurred before time $t$. The hazard function is given by:

\[h(t | \mathbf{X}) = h_0(t) \exp(\mathbf{X}^T\mathbf{\beta})\]

where:

  • $h_0(t)$ is the baseline hazard function. This is an unspecified, non-negative function of time that represents the hazard for an individual with all covariates equal to zero. It captures the underlying risk profile common to all individuals.
  • $\mathbf{X}$ is the covariate vector for an individual. These are the independent variables (e.g., age, treatment, gender) that influence the event time.
  • $\mathbf{\beta}$ is the vector of regression coefficients.

The term $\exp(\mathbf{X}^T\mathbf{\beta})$ is often called the hazard ratio.

The Baseline Hazard Function

As seen above, $h_0(t)$ is the baseline hazard function: the hazard when all the covariates are zero. The Cox model does not estimate it parametrically.

For predictions, we use the cumulative baseline hazard function: $H_0(t) = \int_0^t h_0(u) du$. It is estimated using the Breslow estimator:

\[\tilde{H}_0(t) = \sum_{t_i \le t} \frac{d_i}{\sum_{j \in R(t_i)} \exp(\mathbf{X}_j^T\mathbf{\beta})}\]

where:

  • $d_i$ is the number of events that occur at time $t_i$
  • $R(t_i)$ is the set of individuals still at risk just before $t_i$
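As an illustration, the Breslow estimator can be sketched in a few lines of Julia (a simplified stand-alone version assuming inputs `T`, `Δ` and linear predictors `η = Xβ`; not the package's internal code):

```julia
# Sketch of the Breslow estimator of the cumulative baseline hazard H₀(t).
# T: observed times, Δ: event indicators, η: linear predictors Xβ.
function breslow_H0(T, Δ, η)
    r = exp.(η)
    H0, out = 0.0, Float64[]
    for i in sortperm(T)
        # each event adds 1 / Σ_{j ∈ R(tᵢ)} exp(ηⱼ); with dᵢ ties this sums to dᵢ / Σ
        Δ[i] && (H0 += 1 / sum(r[T .>= T[i]]))
        push!(out, H0)
    end
    return out   # cumulative baseline hazard at each sorted time
end
```

With all covariate effects zero this reduces to the Nelson–Aalen estimator, which provides a quick sanity check.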

2. The Survival Function

The survival function $S(t)$ represents the probability that an individual survives beyond time $t$:

\[S(t | \mathbf{X}) = \exp\left(-\int_0^t h(u | \mathbf{X}) du\right)\]

After substituting the Cox hazard function:

\[S(t | \mathbf{X}) = \exp\left(-\exp(\mathbf{X}^\top \boldsymbol{\beta}) \int_0^t h_0(u) du\right)\]

with $H_0(t) = \int_0^t h_0(u) du$

Finally,

\[S(t | \mathbf{X}) = \exp\left(-\exp(\mathbf{X}^\top \boldsymbol{\beta}) H_0(t)\right)\]
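A quick numeric check of this relationship, with made-up values for $\boldsymbol{\beta}$, $\mathbf{X}$ and $H_0(t)$:

```julia
# S(t|X) = exp(-exp(X'β) H₀(t)); all values below are purely illustrative.
β   = [0.5, -0.2]
X   = [1.0, 2.0]
H0t = 0.8                  # assumed cumulative baseline hazard at time t
H   = exp(X' * β) * H0t    # individual cumulative hazard H(t | X)
S   = exp(-H)              # survival probability, here ≈ 0.413
```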

3. The Full Likelihood Function

The full likelihood function includes the baseline hazard:

\[L(\boldsymbol{\beta}, h_0(\cdot)) = \prod_{i=1}^n \left( h(t_i | \mathbf{X}_i) \right)^{\Delta_i} S(t_i | \mathbf{X}_i)\]

with:

  • $h(t_i | \mathbf{X}_i)$ enters for individuals who experience the event at time $t_i$ ($\Delta_i = 1$)
  • $S(t_i | \mathbf{X}_i)$ enters for individuals censored at time $t_i$ ($\Delta_i = 0$)

And if we substitute the hazard and survival function:

\[L(\boldsymbol{\beta}, h_0(\cdot)) = \prod_{i=1}^n \left( h_0(t_i) \exp(\mathbf{X}_i^\top \boldsymbol{\beta}) \right)^{\Delta_i} \exp\left(-\exp(\mathbf{X}_i^\top \boldsymbol{\beta}) H_0(t_i)\right)\]

4. The Partial-Likelihood Function

Since the baseline hazard function $h_0(t)$ is unspecified, a standard likelihood function cannot be formed directly. Instead, Cox introduced the concept of a partial likelihood. This approach focuses on the order of events rather than their exact timings, factoring out the unknown $h_0(t)$.

For each distinct observed event time $t_{(j)}$, we consider the set of individuals who are "at risk" of experiencing the event just before $t_{(j)}$. This is called the risk set, $R(t_{(j)})$. The partial likelihood is constructed by considering the probability that the specific individual(s) who experienced the event at $t_{(j)}$ were the ones to fail, given that some event occurred among the individuals in $R(t_{(j)})$. We can write this as follows: $P(\text{Individual } i \text{ fails at } t \mid \text{An event occurs in } R(t) \text{ at } t)$. Using the definition of conditional probability:

\[P(\text{Individual } i \text{ fails at } t \mid \text{An event occurs in } R(t) \text{ at } t) = \frac{P(\text{Individual } i \text{ fails at } t)}{\sum_{l \in R(t)} P(\text{Individual } l \text{ fails at } t)}\]

By substituting the hazard function and canceling out $h_0(t)$ and $dt$:

\[\frac{h_i(t)dt}{\sum_{l \in R(t)} h_l(t)dt} = \frac{h_0(t)\exp(\mathbf{X}_i^T\mathbf{\beta})dt}{\sum_{l \in R(t)} h_0(t)\exp(\mathbf{X}_l^T\mathbf{\beta})dt} = \frac{\exp(\mathbf{X}_i^T\mathbf{\beta})}{\sum_{l \in R(t)} \exp(\mathbf{X}_l^T\mathbf{\beta})}\]

To obtain the partial likelihood, we multiply these conditional probabilities over all event times. For tied events, we use the Breslow approximation: when several individuals experience the event at the exact same time $t_{(j)}$, they are treated as if they failed simultaneously.

The partial-likelihood function for the Cox model, accounting for tied event times using Breslow's approximation, is given by:

\[L(\mathbf{\beta}) = \prod_{j=1}^{k} \frac{\exp\left(\sum_{i \in D_j} \mathbf{X}_i^T\mathbf{\beta}\right)}{\left( \sum_{l \in R_j} \exp(\mathbf{X}_l^T\mathbf{\beta}) \right)^{d_j}}\]

where:

  • $k$ is the number of distinct event times.
  • $t_{(j)}$ denotes the $j$-th distinct ordered event time.
  • $D_j$ is the set of individuals who experience the event at time $t_{(j)}$, and $d_j = |D_j|$ is the number of such events.
  • $R_j$ is the risk set at time $t_{(j)}$, comprising all individuals who are still at risk (have not yet experienced the event or been censored) just before $t_{(j)}$.
  • $\mathbf{X}_i$ is the covariate vector for individual $i$.

5. The Loss Function (Negative Log-Partial-Likelihood)

Our goal is to estimate the regression coefficients $\mathbf{\beta}$ by maximizing the partial-likelihood function $L(\mathbf{\beta})$. Equivalently, it is often more convenient to minimize its negative logarithm, which we define as our loss function:

\[\text{Loss}(\mathbf{\beta}) = - \log L(\mathbf{\beta}) \]

Taking the negative logarithm of the Breslow partial likelihood, we get:

\[ \text{Loss}(\mathbf{\beta}) = - \sum_{i=1}^{n} \Delta_i \left( \mathbf{X}_i^T\mathbf{\beta} - \log \left( \sum_{j \in R_i} \exp(\mathbf{X}_j^T\mathbf{\beta}) \right) \right) \]

This function is convex, which facilitates optimization.

The loss function is coded as follows:

function loss(beta, M::Cox)
    # M.X: Design matrix (n x m), where n is number of observations, m is number of covariates.
    # M.T: Vector of observed times (n) for each individual.
    # M.Δ: Vector of event indicators (n), 1 if event, 0 if censored.
    η = M.X*beta
    return dot(M.Δ, log.((M.T .<= M.T') * exp.(η)) .- η)
end

6. Gradient of the Loss Function

To find the optimal $\mathbf{\beta}$, we need to minimize the loss function.

The gradient of the loss function with respect to a specific coefficient $\beta_k$ is:

\[\frac{\partial}{\partial \beta_k} \text{Loss}(\mathbf{\beta}) = - \sum_{i=1}^{n} \Delta_i \left( X_{ik} - \frac{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jk}}{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j)} \right)\]
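Mirroring the vectorized `loss` shown earlier, the gradient can be written in the same style (a stand-alone sketch taking `X`, `T`, `Δ` directly instead of the model struct):

```julia
# Gradient of the negative log-partial-likelihood (Breslow ties).
# X: n×m design matrix, T: observed times, Δ: event indicators.
function loss_gradient(beta, X, T, Δ)
    r = exp.(X * beta)                  # rⱼ = exp(ηⱼ)
    Rmat = T .<= T'                     # Rmat[i, j] = 1 if j is at risk at Tᵢ
    denom = Rmat * r                    # Σ_{j ∈ Rᵢ} exp(ηⱼ)
    wavg = (Rmat * (r .* X)) ./ denom   # risk-set weighted averages of the covariates
    return -vec(sum(Δ .* (X .- wavg); dims = 1))
end
```

A finite-difference comparison against the loss is an easy way to validate such a gradient.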

7. Hessian Matrix of the Loss Function

For optimization algorithms like Newton-Raphson and for calculating standard errors, the Hessian matrix (matrix of second partial derivatives) of the loss function is required.

The entry for the $k$-th row and $l$-th column of the Hessian matrix is:

\[\frac{\partial^2}{\partial \beta_k \partial \beta_l} \text{Loss}(\mathbf{\beta}) = \sum_{i=1}^{n} \Delta_i \left[ \frac{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jk}X_{jl}}{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j)} - \frac{\left( \sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jk} \right) \left( \sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jl} \right)}{\left( \sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) \right)^2} \right]\]
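A direct, unoptimized implementation of this Hessian, looping over events, might look like the following stand-alone sketch (assumed inputs as before; not the package's code):

```julia
# Full Hessian of the loss: for each event, the weighted second moment of X
# over the risk set minus the outer product of its weighted mean.
function loss_hessian(beta, X, T, Δ)
    n, m = size(X)
    r = exp.(X * beta)
    H = zeros(m, m)
    for i in 1:n
        Δ[i] || continue
        w  = r .* (T .>= T[i])     # weights exp(ηⱼ) restricted to the risk set Rᵢ
        s0 = sum(w)
        s1 = X' * w                # Σ wⱼ Xⱼ
        s2 = X' * (w .* X)         # Σ wⱼ Xⱼ Xⱼᵀ
        H .+= s2 ./ s0 .- (s1 * s1') ./ s0^2
    end
    return H
end
```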

8. Information Matrix and Variance-Covariance Matrix

The observed Information Matrix, $I(\hat{\boldsymbol{\beta}})$, is defined as the negative of the Hessian matrix of the log-likelihood function, evaluated at the maximum likelihood estimates $\hat{\boldsymbol{\beta}}$.

\[I(\hat{\boldsymbol{\beta}}) = -H(\hat{\boldsymbol{\beta}})\]

Recall that $\text{Loss}(\boldsymbol{\beta}) = -\log L(\boldsymbol{\beta})$, so $\mathbf{H}_{\text{Loss}} = -\mathbf{H}_{\text{log-likelihood}}$. Therefore, the observed Information Matrix is equal to $\mathbf{H}_{\text{Loss}}$ itself.

The variances and covariances of our estimators $\hat{\boldsymbol{\beta}}$ are obtained by inverting the observed information matrix.

\[\text{Var}(\hat{\boldsymbol{\beta}}) = I(\hat{\boldsymbol{\beta}})^{-1}\]

This final matrix contains:

  • On its diagonal: the variances of each coefficient ($\text{Var}(\hat{\beta}_1)$, $\text{Var}(\hat{\beta}_2)$, ...).
  • Off-diagonal: the covariances between pairs of coefficients.

9. Standard Error

The standard error for a specific coefficient ($\hat{\beta}_k$) is the square root of its variance.

\[SE(\hat{\beta}_k) = \sqrt{\text{Var}(\hat{\beta}_k)}\]

10. Wald Test for Significance

To determine if a variable has a statistically significant effect, a Wald test is performed. A z-score is calculated:

\[z = \frac{\text{Coefficient}}{\text{Standard Error}} = \frac{\hat{\beta}}{SE(\hat{\beta})}\]

This $z$-score is then compared to a normal distribution to obtain a $p$-value. A low $p$-value (typically < 0.05) suggests that the coefficient is significantly different from zero.

The p-value for each coefficient is calculated by comparing its z-score to a standard normal distribution. This p-value indicates the probability of observing a z-score as extreme as, or more extreme than, the one calculated, assuming the null hypothesis (that the coefficient is zero) is true.

11. Confidence Interval

The standard error allows for the construction of a confidence interval (CI) around the coefficient, which provides a range of plausible values for the true coefficient.

The general formula for a $(1 - \alpha) \times 100\%$ confidence interval is:

\[\text{CI for } \hat{\beta} = \hat{\beta} \pm z_{\alpha/2} \times SE(\hat{\beta})\]
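Putting sections 8 to 11 together, the standard errors, z-scores, p-values and confidence intervals can all be derived from the loss Hessian at the optimum. A stand-alone sketch using Distributions.jl (illustrative; not the package's summary code):

```julia
using Distributions, LinearAlgebra

# Wald statistics from the loss Hessian H evaluated at the estimates β̂.
function wald_table(βhat, H; level = 0.95)
    V  = inv(Symmetric(H))                        # variance-covariance matrix I(β̂)⁻¹
    se = sqrt.(diag(V))                           # standard errors
    z  = βhat ./ se                               # Wald z-scores
    p  = 2 .* ccdf.(Normal(), abs.(z))            # two-sided p-values
    zq = quantile(Normal(), 1 - (1 - level) / 2)  # ≈ 1.96 for a 95% CI
    return (se = se, z = z, p = p,
            ci_lower = βhat .- zq .* se, ci_upper = βhat .+ zq .* se)
end
```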

As an example, let us look at the output on the colon dataset:

using SurvivalModels
using RDatasets


colon = dataset("survival", "colon")
colon.Time = Float64.(colon.Time)
colon.Status = Bool.(colon.Status)
model_colon = fit(Cox, @formula(Surv(Time, Status) ~ Age + Rx), colon)
Cox Model (n: 1858, m: 3, method: SurvivalModels.CoxDefault, C-index: 0.5435170918176299)

The output DataFrame contains columns giving, respectively, the name of each predictor, the estimated coefficient, its standard error, the associated p-value, and the test statistic $z$, as just described.

SurvivalModels.CoxMethodType
StatsBase.fit(Cox, @formula(Surv(T,Δ)~predictors), dataset)

Arguments:

  • Cox: The model type to fit, e.g. Cox. Can be any of CoxNM, CoxOptim, CoxApprox or CoxDefault if you want a different solver to be used; see their own documentation. Default is CoxDefault.
  • formula: A StatsModels.FormulaTerm specifying the survival model
  • df: A DataFrame containing the variables specified in the formula

Returns:

  • predictor: A Vector{String} containing the names of the predictor variables included in the model

  • beta: A Vector{Float64} containing the estimated regression coefficients (β) for each predictor

  • se: A Vector{Float64} containing the standard errors of the estimated regression coefficients

  • loglikelihood: A Vector{Float64} containing the log-likelihood of the fitted model. This value is repeated for each predictor row

  • coef: A vector of the estimated coefficients

  • formula: The applied formula

Example:

ovarian = dataset("survival", "ovarian")
ovarian.FUTime = Float64.(ovarian.FUTime)  # Time column needs to be Float64
ovarian.FUStat = Bool.(ovarian.FUStat)     # Status column needs to be Bool
model = fit(Cox, @formula(Surv(FUTime, FUStat) ~ Age + ECOG_PS), ovarian)


Types:

  • Cox : the base abstract type
  • CoxGrad<:Cox : abstract type for Cox models that are solved using gradient-based optimization
  • CoxLLH<:CoxGrad : abstract type for Cox models that are solved by optimizing the log-likelihood
source

Model Summary & Extraction

To inspect the model statistics programmatically (such as extracting standard errors, p-values, or confidence intervals), you can use the summary function. This returns a DataFrame allowing for easy data manipulation.

SurvivalModels.summaryMethod
summary(model::Cox)

Compute a statistical summary table for a fitted Cox proportional hazards model.

This function calculates standard errors, z-scores, p-values, and confidence intervals.

Returns

A DataFrame with the following columns:

  • predictor: Name of the covariate.
  • β: Estimated regression coefficient.
  • e_β: Hazard ratio (exp(β)).
  • se: Standard error of the coefficient.
  • z_scores: Wald statistic (β / se).
  • p_values: Two-sided p-value.
  • ci_lower_β: Lower bound of the 95% confidence interval for β.
  • ci_upper_β: Upper bound of the 95% confidence interval for β.

Example

model = fit(Cox, @formula(Surv(time, status) ~ age + sex), df)

# Get the full summary dataframe
summ = summary(model)

# Extract standard errors specifically
standard_errors = summ.se
source

12. Prediction Interface

After fitting a Cox model, you can obtain various types of predictions using the predict function:

predict(model; type=:lp)        # Linear predictor (default)
predict(model; type=:risk)      # Relative risk (exp(linear predictor))
predict(model; type=:expected)  # Expected cumulative hazard
predict(model; type=:survival)  # Survival probability
predict(model; type=:terms)     # Contribution of each term (covariate) to the linear predictor

Available types:

  • :lp — Linear predictor (Xβ)
  • :risk — Relative risk, exp(Xβ)
  • :expected — Expected cumulative hazard for each subject
  • :survival — Survival probability for each subject
  • :terms — Covariate-wise contributions to the linear predictor

If you want to center predictions (e.g., for survival curves), you can use the centered keyword argument.

Example: Prediction on New Data

The following example is drawn from this article.

    using DataFrames, SurvivalModels

    df = DataFrame(
        time = [1.0, 3.0, 5.0, 6.0, 2.0, 7.0, 9.0, 11.0],
        status = [true, false, true, true, true, false, true, true],
        sex = [:male, :male, :male, :male, :female, :female, :female, :female],
        age = [57, 52, 48, 42, 39, 31, 26, 22],
    )
    model = fit(Cox, @formula(Surv(time, status) ~ age + sex), df)
Cox Model (n: 8, m: 2, method: SurvivalModels.CoxDefault, C-index: 0.9047619047619048)

You can extract the baseline hazard, centered or not, with the following:

    SurvivalModels.baseline_hazard(model, centered = false)
8-element Vector{Float64}:
 3.4424556285043997e-13
 5.942770374703066e-12
 5.942770374703066e-12
 1.0965739923044279e-10
 1.8972983721645534e-9
 1.8972983721645534e-9
 6.646861703876311e-8
 9.459173599386564e-7
    SurvivalModels.baseline_hazard(model, centered = true)
8-element Vector{Float64}:
     0.027808077917814818
     0.4800556331330888
     0.4800556331330888
     8.858099656581187
   153.26332903066987
   153.26332903066987
  5369.31969841062
 76410.98822358932

Each prediction type can then be accessed as follows:

    predict(model, :lp)
8-element Vector{Float64}:
   3.5189684309689433
   0.34988421408591464
  -2.185383159420514
  -5.98828421968015
  -0.3961355271103777
  -5.466670274123228
  -8.635754491006256
 -11.171021864512682
    predict(model, :risk)
8-element Vector{Float64}:
 33.749595463125615
  1.4189032500719052
  0.11243464500647969
  0.0025079634750864986
  0.6729154914651001
  0.004225277777741304
  0.00017763947357070042
  1.4076245966063964e-5
    predict(model, :terms)
8×2 Matrix{Float64}:
  11.0126    -7.4936
   7.84348   -7.4936
   5.30822   -7.4936
   1.50532   -7.4936
  -0.396136   0.0
  -5.46667    0.0
  -8.63575    0.0
 -11.171      0.0
    predict(model, :expected)
8-element Vector{Float64}:
   0.9385113803333266
   0.6811524980678659
   0.05397488469467969
   0.022215790397381873
 103.13326837825056
   0.6475801382959431
   0.9538031246584544
   1.0755798647452601
    predict(model, :survival)
8-element Vector{Float64}:
 0.3912097646648285
 0.506033453588987
 0.9474559018472358
 0.9780291629766359
 1.621028471053042e-45
 0.5233105850573463
 0.38527299245365143
 0.34109990613348057

Different versions of the optimisation routine

To implement the Cox proportional hazards model, different versions were coded, using different methods. The final goal is to compare these versions and choose the most efficient one: the fastest and the closest to the true values of the coefficients.

V0: Derivative-Free Optimization with Nelder-Mead

SurvivalModels.CoxNMType
CoxNM(T, Δ, X)
fit(CoxNM, @formula(Surv(T,Δ)~X), data = ...)

An implementation of the Cox proportional hazards model that minimizes the negative partial log-likelihood function (cox_nllh). This version uses the Nelder-Mead method, a derivative-free optimization algorithm.

Fields:

  • X::Matrix{Float64}: The design matrix of covariates, where rows correspond to individuals and columns to features
  • T::Vector{Float64}: The observed times, sorted in ascending order
  • Δ::Vector{Bool}: The event indicator vector (true for event, false for censoring)
source

V1: Implementation using the 'Optimization.jl' Julia package

SurvivalModels.CoxOptimType
CoxOptim(T, Δ, X)
fit(CoxOptim, @formula(Surv(T,Δ)~X), data = ...)

The first implementation of the Cox proportional hazards model uses optimization libraries (Optimization.jl, Optim.jl) for coefficient estimation. It uses the BFGS algorithm to minimize the negative partial log-likelihood.

Fields:

  • X::Matrix{Float64}: The design matrix of covariates, where rows correspond to individuals and columns to features.
  • T::Vector{Float64}: The observed times, sorted in ascending order
  • Δ::Vector{Int64}: The event indicator vector (true for event, false for censoring)
source

V2: Implementation using the gradient and the Hessian matrix

For this version the Hessian matrix was implemented directly. Its complexity is $O(n^2 m^2)$, which makes it the slowest version.

SurvivalModels.CoxHessianType
CoxHessian(T, Δ, X)
fit(CoxHessian, @formula(Surv(T,Δ)~X), data = ...)

The second implementation of the Cox proportional hazards model uses a Newton-Raphson-like iterative update that directly calculates and utilizes the gradient and Hessian matrix. This version is updating coefficients via the update! function.

Fields:

  • X::Matrix{Float64}: The design matrix of covariates, where rows correspond to individuals and columns to features
  • T::Vector{Float64}: The observed times, sorted in ascending order
  • Δ::Vector{Int64}: The event indicator vector (true for event, false for censoring)
  • R::BitMatrix: A boolean risk matrix, where 'R[i,j]' is 'true' if individual 'j' is at risk at time 'T[i]'
source

V3: Improved version of V2, much faster because it is non-allocating.

SurvivalModels.CoxDefaultType
CoxDefault(T, Δ, X)
fit(CoxDefault, @formula(Surv(T,Δ)~X), data = ...)
fit(Cox, @formula(Surv(T,Δ)~X), data = ...)

The third implementation of the Cox proportional hazards model is a highly optimized and significantly faster iteration of the previous implementation, CoxHessian.

This is the default implementation called when you ask for a Cox model.

Fields:

  • Xᵗ::Matrix{Float64}: The design matrix of covariates, transposed (m rows, n columns)
  • sX::Vector{Float64}: Sum of X' multiplied by Δ
  • T::Vector{Float64}: The observed times sorted in descending order
  • Δ::Vector{Bool}: The event indicator vector (true for event, false for censoring)
  • loss::Vector{Float64}: Stores the current negative partial log-likelihood value
  • G::Vector{Float64}: Stores the gradient vector
  • H::Matrix{Float64}: Stores the Hessian matrix
  • S₁::Vector{Float64}: Sum of rⱼxₖⱼ
  • S₂::Matrix{Float64}: Sum of rⱼxₖⱼ * xⱼ
  • μ::Vector{Float64}: Updates the gradient and Hessian
  • η::Vector{Float64}: ηi = Xiβ
  • r::Vector{Float64}: ri = exp(ηi)
  • R::Vector{UnitRange{Int64}}: A vector of the risk ranges for each output time.
source

V4: Bounding the Hessian matrix by a universal upper bound.

Coding the Hessian matrix for CoxHessian was very impractical, so for this version we tried a different approach.

The Hessian matrix simplifies for the diagonal terms as follows:

\[\frac{\partial^2 \text{Loss}(\mathbf{\beta})}{\partial \beta_k^2} = \sum_{i=1}^{n} \Delta_i \left[ \frac{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jk}^2}{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j)} - \left( \frac{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j) X_{jk}}{\sum_{j \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_j)} \right)^2 \right]\]

This can be expressed using the concept of variance.

\[\frac{\partial^2 \text{Loss}(\mathbf{\beta})}{\partial \beta_k^2} = \sum_{i=1}^{n} \Delta_i \left[ \sum_{j \in R_i} A_j X_{jk}^2 - \left( \sum_{j \in R_i} A_j X_{jk} \right)^2 \right]\]

with:

\[A_{j} = \frac{\exp(\mathbf{\beta}^T\mathbf{X}_j)}{\sum_{l \in R_i} \exp(\mathbf{\beta}^T\mathbf{X}_l)}\]

The variance of a random variable $X$ taking values in an interval $[a,b]$ is bounded above (Popoviciu's inequality) by:

\[\text{Var}(X) \le \frac{1}{4}(b-a)^2\]

Applying this bound to the diagonal Hessian elements, we get an upper bound $B_k$ for the $k$-th diagonal element:

\[B_k = \sum_{i=1}^n \frac{1}{4} \Delta_i \left( \max_{j \in R_i} X_{jk} - \min_{j \in R_i} X_{jk} \right)^2\]
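A direct sketch of this bound (assuming, as elsewhere in this document, that the risk set at time $T_i$ is $\{j : T_j \ge T_i\}$; the function name is illustrative, not the package's code):

```julia
# Universal diagonal bound Bₖ on the Hessian: each event contributes
# (range of the k-th covariate over its risk set)² / 4, per the variance bound.
function hessian_bound(X, T, Δ)
    n, m = size(X)
    B = zeros(m)
    for i in 1:n
        Δ[i] || continue
        atrisk = T .>= T[i]
        for k in 1:m
            xs = X[atrisk, k]
            B[k] += (maximum(xs) - minimum(xs))^2 / 4
        end
    end
    return B
end
```

Since the bound does not depend on $\boldsymbol{\beta}$, it can be pre-computed once and reused at every iteration.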

SurvivalModels.CoxApproxType
CoxApprox(T, Δ, X)
fit(CoxApprox, @formula(Surv(T,Δ)~X), data = ...)

The fourth implementation of the Cox proportional hazards model uses a Hessian approximation based on a pre-computed upper bound. This version was created for cases where working with the full Hessian is difficult, offering faster iterations.

Fields:

  • X::Matrix{Float64}: The design matrix of covariates, where rows correspond to individuals and columns to features
  • T::Vector{Float64}: The observed times sorted in ascending order
  • Δ::Vector{Bool}: The event indicator vector (true for event, false for censoring)
  • sX::Vector{Float64}: Sum of X' multiplied by Δ
  • G::Vector{Float64}: Stores the gradient vector
  • η::Vector{Float64}: ηi = Xiβ
  • A::Vector{Float64}: Ai = exp(ηi)
  • B::Vector{Float64}: Stores the upper bounds on the diagonal elements of the Hessian matrix
  • C::Vector{Float64}: Used in the mkA! function
  • K::Vector{Int64}: Number of events at each unique observed event time
  • loss::Vector{Float64}: Stores the current negative partial log-likelihood value, used in CoxLLH getβ
source

Comparison of the different methods' speed

We propose to compare the different methods on simulated data, with varying numbers of rows and columns, to verify empirically the theoretical complexity of the different methods. We will then compare the results with Julia's and R's existing Cox implementations.

We will start by coding our Julia and R structures:

using SurvivalModels, Plots, Random, Distributions, StatsBase, LinearAlgebra, DataFrames, RCall, Survival
using SurvivalModels: getβ, CoxNM, CoxOptim, CoxHessian, CoxDefault, CoxApprox

We add specific code to compare with Survival.jl and with R's survival::coxph():

struct CoxVJ<:SurvivalModels.CoxMethod
    T::Vector{Float64}
    Δ::Vector{Bool}
    X::Matrix{Float64}
    function CoxVJ(T,Δ,X)
        new(T,Bool.(Δ),X)
    end
end

function SurvivalModels.getβ(M::CoxVJ)
    return fit(Survival.CoxModel, M.X, Survival.EventTime.(M.T,M.Δ)).β
end

R"""
library(survival)
"""

struct CoxVR<:SurvivalModels.CoxMethod
    df::DataFrame
    function CoxVR(T,Δ,X)
        df = DataFrame(X,:auto)
        df.status = Δ
        df.time = T
        new(df)
    end
end

function SurvivalModels.getβ(M::CoxVR)
    df = M.df
    @rput df
    R"""
    beta  <- coxph(Surv(time,status)~., data = df, ties="breslow")$coefficients
    """
    @rget beta
    return beta
end

Then, we will simulate our data and then run our models:

# Creating a dictionary for all the models:
# Label => (constructor, plotting color)

const design = Dict(
"CoxNM"=> (CoxNM, :blue),
"CoxOptim"=> (CoxOptim, :orange),
"CoxHessian"=> (CoxHessian, :brown),
"CoxDefault"=> (CoxDefault, :purple),
"CoxApprox"=> (CoxApprox, :green),
"VR"=> (CoxVR, :red),
"VJ"=> (CoxVJ, :black)
);


function simulate_survival_data(n, m; β=ones(m))
    X = randn(n,m)
    O = rand.(Exponential.(exp.(.- X * β)))
    C = rand(Exponential(1/3),n)
    T = min.(O, C)
    Δ = Bool.(O .<= C)
    return (T, Δ, X)
end

# Run the models and get the running time, the β coefficients and the difference between the true β and the obtained ones:
function run_models()
    Ns = (1000, 2000, 4000)
    Ms = (5,10,20)
    true_betas = randn(maximum(Ms))
    df = []
    for n in Ns, m in Ms
        if (n == 2000) | (m == 10) # Only if they end up in the graphs.
            data = simulate_survival_data(n,m, β = true_betas[1:m])
            for (name, (constructor, _)) in design
                display((n,m,name))
                model = constructor(data...)
                beta = getβ(model)
                time = @elapsed getβ(model)
                push!(df, (
                    n = n,
                    m = m,
                    name = name,
                    time = time,
                    beta = beta,
                    diff_to_truth = sqrt(sum((beta .- true_betas[1:m]).^2)/sum(true_betas[1:m].^2)),
                ))
            end
        end
    end
    df = DataFrame(df)
    sort!(df, :name)
    return df
end
run_models (generic function with 1 method)

The first graph compares the runtime of all implementations: the left panel varies the number of observations at a fixed number of covariates, and the right panel varies the number of covariates for n = 2000 observations.

# Plot the results, starting with the time:
function timing_graph(df)
    group1 = groupby(filter(r -> r.m==10, df), :name)
    p1 = plot(; xlabel = "Number of observations (n)",
        ylabel = "Time (in seconds)",
        yscale= :log10,
        xscale= :log10,
        title = "For m=10 covs., varying n",
        legend = :bottomright,
        lw = 1);
    for g in group1
        plot!(p1, g.n, g.time, label = g.name[1] , color = design[g.name[1]][2], marker = :circle, markersize = 3)
    end
    group2 = groupby(filter(r -> r.n==2000, df), :name)
        p2 = plot(; xlabel = "Number of covariates (m)",
            ylabel = "Time (in seconds)",
            yscale= :log10,
            xscale= :log10,
            title = "For n=2000 obs., varying m",
            legend = :bottomright,
            lw = 1);
        for g in group2
            plot!(p2, g.m, g.time, label = g.name[1] , color = design[g.name[1]][2], marker = :circle, markersize = 3)
        end
        p = plot(p1,p2, size=(1200,600), plot_title = "Runtime (logscale) of the various implementations")
        return p
end
timing_graph (generic function with 1 method)
df = run_models()
timing_graph(df)
Example block output

We can see that CoxDefault is the fastest, while CoxHessian is the slowest. Let us zoom in on our implementations vs Survival.jl vs R's survival:

timing_graph(filter(r -> r.name ∈ ("CoxApprox", "CoxDefault", "VJ", "VR"), df))
Example block output

So we are about 10× faster than the reference R implementation (and than the previous Julia versions) on this example.

function beta_correctness_graphs(df; ref="VJ")

    reflines = filter(r -> r.name == ref, df)
    rename!(reflines, :beta => :refbeta)
    select!(reflines, Not([:name, :time, :diff_to_truth]))
    otherlines = filter(r -> r.name != ref, df)
    rez = leftjoin(otherlines, reflines, on=[:n,:m])
    percent(x,y) = sqrt(sum((x .- y).^2)/sum(y .^2))
    rez.error = percent.(rez.beta, rez.refbeta)
    select!(rez, [:n,:m,:name,:error])
    rez = filter!(r -> !isnan(r.error), rez)

    group1 = groupby(filter(r -> r.m==10, rez), :name)
    p1 = plot(; xlabel = "Number of observations (n)",
                ylabel = "L2dist to $ref's β",
                yscale=:log10,
                xscale= :log10,
                title = "m=10, varying n",
                legend = :bottomright,
                lw = 1);
    for g in group1
        plot!(p1, g.n, g.error, label = g.name[1] , color = design[g.name[1]][2], marker = :circle, markersize = 3)
    end

    group2 = groupby(filter(r -> r.n==2000, rez), :name)
    p2 = plot(; xlabel = "Number of covariates (m)",
                ylabel = "L2Dist to $ref's β",
                yscale=:log10,
                xscale= :log10,
                title = "n=2000, varying m",
                legend = :bottomright,
                lw = 1);
    for g in group2
        plot!(p2, g.m, g.error, label = g.name[1] , color = design[g.name[1]][2], marker = :circle, markersize = 3)
    end
    p = plot(p1,p2, size=(1200,600), plot_title="β-correctness w.r.t. $ref's version.")
    return p
end

beta_correctness_graphs(df)
Example block output

We now compare the different models' coefficient values with the true β used to simulate the data.

function beta_wrt_truth(df)
    group1 = groupby(filter(r -> r.m==10, df), :name)
    p1 = plot(; xlabel = "Number of observations (n)",
                ylabel = "L2dist to the truth",
                yscale=:log10,
                xscale= :log10,
                title = "m=10, varying n",
                legend = :bottomright,
                lw = 1,
                plot_title="β-correctness w.r.t. the truth.");
    for g in group1
        plot!(p1, g.n, g.diff_to_truth, label = g.name[1] , color = design[g.name[1]][2], marker = :circle, markersize = 3)
    end

    return p1
end

beta_wrt_truth(df)
Example block output

Model Evaluation: Harrell's Concordance Index (C-index)

A key metric for assessing the predictive discrimination of a Cox model is Harrell's concordance index (C-index). The C-index measures the proportion of all usable patient pairs in which the predictions and outcomes are concordant: it is the probability that, for a randomly chosen pair of comparable subjects, the subject with the higher predicted risk actually experiences the event before the other. Tied risk scores count as half-concordant.
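The pairwise definition translates directly into a naive $O(n^2)$ sketch (for illustration only; the package's `harrells_c` is the function to use in practice):

```julia
# Harrell's C: concordant pairs / usable pairs; tied risk scores count ½.
function cindex_naive(times, statuses, risk)
    num, den = 0.0, 0
    n = length(times)
    for i in 1:n, j in 1:n
        # pair (i, j) is usable if i has an event strictly before time Tⱼ
        (statuses[i] && times[i] < times[j]) || continue
        den += 1
        num += risk[i] > risk[j] ? 1.0 : (risk[i] == risk[j] ? 0.5 : 0.0)
    end
    return num / den
end
```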

SurvivalModels.harrells_cFunction
harrells_c(times, statuses, risk_scores)
harrells_c(C::Cox)

Compute Harrell's concordance index (C-index) for survival models.

Arguments

  • times: Vector of observed times (Float64).
  • statuses: Vector of event indicators (Bool; true if event, false if censored).
  • risk_scores: Vector of predicted risk scores (higher means higher risk).

Returns

  • The C-index, a value between 0 and 1 indicating the proportion of all usable patient pairs in which predictions and outcomes are concordant.

Details

The C-index measures the discriminative ability of a survival model: it is the probability that, for a randomly chosen pair of comparable subjects, the subject with the higher predicted risk actually experiences the event before the other. Tied risk scores count as half-concordant.

You can also call harrells_c(C::Cox) to compute the C-index for a fitted Cox model.

source

The C-index ranges from 0.5 (no better than random) to 1.0 (perfect discrimination). It is output by the show function, and can be queried separately too:

You can compute the C-index for a fitted Cox model using:

model = fit(Cox, @formula(Surv(Time, Status) ~ Age + Rx), colon)
cindex = SurvivalModels.harrells_c(model)
model
Cox Model (n: 1858, m: 3, method: SurvivalModels.CoxDefault, C-index: 0.5435170918176299)

A higher C-index indicates better predictive discrimination.

[4]
D. R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34, 187–202 (1972).