Kaplan-Meier Estimator

The Kaplan-Meier estimator [1] is a non-parametric estimator of the survival function from lifetime data, designed for samples in which some observations are censored. Its variance is estimated with the Greenwood formula [2].

Suppose we observe $n$ individuals, with observed times $T_1, T_2, \ldots, T_n$ and event indicators $\Delta_1, \Delta_2, \ldots, \Delta_n$ ($\Delta_i = 1$ if the event occurred, $0$ if censored).

Let $t_1 < t_2 < \cdots < t_k$ be the ordered unique event times, and set:

  • $d_j$: number of events at time $t_j$
  • $n_j$: number of individuals at risk just before $t_j$

The Kaplan-Meier estimator of the survival function $S(t)$ is:

\[\hat{S}(t) = \prod_{t_j \leq t} \left(1 - \frac{d_j}{n_j}\right)\]

This product runs over all event times $t_j$ less than or equal to $t$.
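
The product can be sketched in a few lines of plain Julia. This is only an illustration of the formula as written (with $t_j \leq t$), not the package's implementation; the helper name `km_survival` is hypothetical and the small data set is chosen here for the example.

```julia
# Minimal sketch of the product-limit formula.
# `km_survival` is a hypothetical helper, not part of SurvivalModels.
function km_survival(T, Δ, t)
    S = 1.0
    for tj in sort(unique(T[Δ .== 1]))        # ordered unique event times t_j
        tj <= t || break                       # keep only t_j ≤ t
        d = count((T .== tj) .& (Δ .== 1))     # d_j: events at t_j
        n = count(T .>= tj)                    # n_j: at risk just before t_j
        S *= 1 - d / n
    end
    return S
end

T = [2, 3, 4, 5, 8]
Δ = [1, 1, 0, 1, 0]
km_survival(T, Δ, 3)   # (1 - 1/5) * (1 - 1/4)
```

Before the first event time the loop never runs, so the sketch returns 1; censored observations only enter through the risk-set count `n`.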

Greenwood's Formula

The Greenwood estimator [2] for the variance of $\hat{S}(t)$ is:

\[\widehat{\mathrm{Var}}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_j \leq t} \frac{d_j}{n_j (n_j - d_j)}\]

This allows for the construction of confidence intervals for the survival curve.
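
As a worked example of the formula (the numbers are illustrative and chosen here): suppose one event occurs at $t_1 = 2$ with $n_1 = 5$ individuals at risk, and one event at $t_2 = 3$ with $n_2 = 4$ at risk. Then $\hat{S}(3) = (4/5)(3/4) = 0.6$, and:

\[\widehat{\mathrm{Var}}[\hat{S}(3)] = 0.6^2 \left(\frac{1}{5 \cdot 4} + \frac{1}{4 \cdot 3}\right) = 0.36 \times 0.1333\ldots = 0.048\]

The corresponding standard error is $\sqrt{0.048} \approx 0.219$.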

How to use it

You can compute these estimators using the following code:

using SurvivalModels

T = [2, 3, 4, 5, 8]
Δ = [1, 1, 0, 1, 0]
km = KaplanMeier(T, Δ)
KaplanMeier{Float64}([2.0, 3.0, 4.0, 5.0, 8.0], [1, 1, 0, 1, 0], [5, 4, 3, 2, 1], [0.2, 0.25, 0.0, 0.5, 0.0], [0.05, 0.08333333333333333, 0.0, 0.5, 0.0])

The same fit can be obtained with the formula interface:

using DataFrames
df = DataFrame(time=Float64.(T), status=Bool.(Δ))
km = fit(KaplanMeier, @formula(Surv(time, status) ~ 1), df)
KaplanMeier{Float64}([2.0, 3.0, 4.0, 5.0, 8.0], [1, 1, 0, 1, 0], [5, 4, 3, 2, 1], [0.2, 0.25, 0.0, 0.5, 0.0], [0.05, 0.08333333333333333, 0.0, 0.5, 0.0])

The resulting object has the following fields:

  • t: Sorted unique observed times (in the example above, the censoring times 4.0 and 8.0 are included).
  • ∂N: Number of uncensored deaths at each time point.
  • Y: Number of individuals at risk at each time point.
  • ∂Λ: Increments of cumulative hazard.
  • ∂σ: Greenwood variance increments.
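
The field values printed above are consistent with the increments $d_j / Y_j$ and $d_j / (Y_j (Y_j - d_j))$. The following sketch recomputes them from plain vectors; it assumes those two formulas, does not use the package, and copies the vectors `∂N` and `Y` from the printed output.

```julia
# Sketch: recompute the increments from the event and at-risk counts.
# These are plain vectors copied from the printed output, not package fields.
∂N = [1, 1, 0, 1, 0]            # events at each observed time
Y  = [5, 4, 3, 2, 1]            # individuals at risk at each observed time

∂Λ = ∂N ./ Y                    # cumulative hazard increments d_j / Y_j
∂σ = ∂N ./ (Y .* (Y .- ∂N))     # Greenwood increments d_j / (Y_j (Y_j - d_j))

surv = cumprod(1 .- ∂Λ)         # Kaplan-Meier curve at each observed time
```

Note that a time point where every individual at risk has the event (`Y == ∂N`) would make the Greenwood increment divide by zero; this does not occur in the example.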

The obtained object can be used to compute survival and variance estimates as follows:

using SurvivalModels: greenwood
Ŝ = km(5.0)  # Survival probability at time 5
v̂ = greenwood(km, 5.0)  # Greenwood variance at time 5
Ŝ, v̂
(0.6000000000000001, 0.13333333333333333)

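Finally, a $(1-\alpha) \times 100\%$ confidence interval for $S(t)$ can be constructed using the log-minus-log transformation. On the transformed scale, the interval is:

\[\log(-\log \hat{S}(t)) \pm z_{1-\alpha/2} \frac{\sqrt{\widehat{\mathrm{Var}}[\hat{S}(t)]}}{\hat{S}(t) \, |\log \hat{S}(t)|}\]

Applying $x \mapsto \exp(-\exp(x))$ to both endpoints maps it back to an interval for $S(t)$ that always lies within $[0, 1]$.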
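
As a minimal sketch of this construction, the function below computes the interval on the $\log(-\log)$ scale using the delta-method standard error $\sqrt{\widehat{\mathrm{Var}}[\hat{S}(t)]} / (\hat{S}(t) \, |\log \hat{S}(t)|)$ and maps it back with $\exp(-\exp(\cdot))$. The name `loglog_interval` is hypothetical and the normal quantile is hard-coded for a 95% level; this is not the package's `confint`.

```julia
# Hypothetical helper, not the package's `confint`.
# S     : Kaplan-Meier estimate at time t
# var_S : Greenwood variance of S (including the S^2 factor)
# z     : standard normal quantile (1.959964 for a 95% interval)
function loglog_interval(S, var_S; z = 1.959964)
    0 < S < 1 || return (NaN, NaN)         # log(-log(S)) is undefined at 0 and 1
    se = sqrt(var_S) / (S * abs(log(S)))   # standard error of log(-log(S))
    center = log(-log(S))
    # exp(-exp(x)) is decreasing, so the larger endpoint gives the lower bound
    lower = exp(-exp(center + z * se))
    upper = exp(-exp(center - z * se))
    return (lower, upper)
end

S = 0.8
var_S = S^2 * (1 / (5 * 4))   # one event among five individuals at risk
lower, upper = loglog_interval(S, var_S)
```

With these inputs the result is approximately (0.2038, 0.9692).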

The confint function computes these intervals:

using SurvivalModels

T = [2, 3, 4, 5, 8]
Δ = [1, 1, 0, 1, 0]
km = KaplanMeier(T, Δ)

# Compute confidence intervals at each event time (default 95%)
ci = confint(km)
first(ci, 5)  # show the first 5 rows
5×4 DataFrame
 Row │ time     surv     lower      upper
     │ Float64  Float64  Float64    Float64
─────┼───────────────────────────────────────
   1 │     2.0      0.8  0.203809   0.96918
   2 │     3.0      0.6  0.12573    0.881756
   3 │     4.0      0.6  0.12573    0.881756
   4 │     5.0      0.3  0.0123015  0.719218
   5 │     8.0      0.3  0.0123015  0.719218

References

SurvivalModels.KaplanMeier (Type)
KaplanMeier(T, Δ)
fit(KaplanMeier, @formula(Surv(T, Δ) ~ 1), df)

Efficient Kaplan-Meier estimator.

Mathematical Description

Suppose we observe $n$ individuals, with observed times $T_1, T_2, \ldots, T_n$ and event indicators $\Delta_1, \Delta_2, \ldots, \Delta_n$ ($\Delta_i = 1$ if the event occurred, $0$ if censored).

Let $t_1 < t_2 < \cdots < t_k$ be the ordered unique event times.

  • $d_j$: number of events at time $t_j$
  • $Y_j$: number of individuals at risk just before $t_j$

The Kaplan-Meier estimator of the survival function $S(t)$ is:

\[\hat{S}(t) = \prod_{t_j \leq t} \left(1 - \frac{d_j}{Y_j}\right)\]

This product runs over all event times $t_j$ less than or equal to $t$.

The Greenwood estimator for the variance of $\hat{S}(t)$ is:

\[\widehat{\mathrm{Var}}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_j \leq t} \frac{d_j}{Y_j (Y_j - d_j)}\]

Arguments

  • T: Vector of event or censoring times.
  • Δ: Event indicator vector (1 if event, 0 if censored).

Stores

  • t: Sorted unique observed times (event or censoring).
  • ∂N: Number of uncensored deaths at each time point.
  • Y: Number of individuals at risk at each time point.
  • ∂Λ: Increments of cumulative hazard.
  • ∂σ: Greenwood variance increments.

Example: Direct usage

using SurvivalModels
T = [2, 3, 4, 5, 8]
Δ = [1, 1, 0, 1, 0]
km = KaplanMeier(T, Δ)

Example: Using the fit() interface

using SurvivalModels, DataFrames, StatsModels
df = DataFrame(time=T, status=Δ)
km2 = fit(KaplanMeier, @formula(Surv(time, status) ~ 1), df)
SurvivalModels.greenwood (Function)
greenwood(S::KaplanMeier, t)

Compute the Greenwood variance estimate for the Kaplan-Meier survival estimator at time t.

The Greenwood formula provides an estimate of the variance of the Kaplan-Meier survival function at a given time point. For a fitted Kaplan-Meier object S, the variance at time t is:

\[\widehat{\mathrm{Var}}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_j < t} \frac{d_j}{Y_j (Y_j - d_j)}\]

[1]
E. L. Kaplan and P. Meier. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53, 457–481 (1958).
[2]
M. Greenwood. The natural duration of cancer. Reports on Public Health and Medical Subjects 33, 1–26 (1926).