Kaplan-Meier Estimator
The Kaplan-Meier estimator [1] is a non-parametric estimator of the survival function from lifetime data, and it remains valid when some observations are censored. Its variance is usually estimated with Greenwood's formula [2].
Suppose we observe $n$ individuals, with observed times $T_1, T_2, \ldots, T_n$ and event indicators $\Delta_1, \Delta_2, \ldots, \Delta_n$ ($\Delta_i = 1$ if the event occurred, $0$ if censored).
Let $t_1 < t_2 < \cdots < t_k$ be the ordered unique event times, and set:
- $d_j$: number of events at time $t_j$
- $n_j$: number of individuals at risk just before $t_j$
The Kaplan-Meier estimator of the survival function $S(t)$ is:
\[\hat{S}(t) = \prod_{t_j \leq t} \left(1 - \frac{d_j}{n_j}\right)\]
This product runs over all event times $t_j$ less than or equal to $t$.
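The product can be sketched by hand in plain Julia, without any package. The toy data and the helper name `km_survival` are assumptions made for illustration; this is not the package's implementation.

```julia
# Toy data (hypothetical): observed times and event indicators (1 = event, 0 = censored).
T = [2, 3, 4, 5, 8]
Δ = [1, 1, 0, 1, 0]

# Ordered unique times at which at least one event occurred.
event_times = sort(unique(T[Δ .== 1]))

# d[j]: number of events at t_j; n[j]: individuals still at risk just before t_j.
d = [count((T .== t) .& (Δ .== 1)) for t in event_times]
n = [count(T .>= t) for t in event_times]

# Product-limit estimate: multiply (1 - d_j / n_j) over all event times t_j <= t.
function km_survival(t, event_times, d, n)
    s = 1.0
    for j in eachindex(event_times)
        event_times[j] <= t && (s *= 1 - d[j] / n[j])
    end
    return s
end

km_survival(4, event_times, d, n)  # (1 - 1/5) * (1 - 1/4), roughly 0.6
```

Censored observations never contribute a factor of their own; they only shrink the risk set `n` used by later event times.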
Greenwood's Formula
The Greenwood estimator [2] for the variance of $\hat{S}(t)$ is:
\[\widehat{\mathrm{Var}}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_j \leq t} \frac{d_j}{n_j (n_j - d_j)}\]
This allows for the construction of confidence intervals for the survival curve.
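As a rough check of the formula, the sketch below evaluates it directly from event counts and risk-set sizes. The numbers and the helper name `greenwood_variance` are illustrative assumptions, and the sum uses the same $t_j \leq t$ convention as the formula above.

```julia
# Hypothetical toy summary: event times with their event counts and risk-set sizes.
event_times = [2, 3, 5]
d = [1, 1, 1]   # events at each t_j
n = [5, 4, 2]   # individuals at risk just before each t_j

function greenwood_variance(t, event_times, d, n)
    s = 1.0     # running Kaplan-Meier estimate
    acc = 0.0   # running sum of d_j / (n_j * (n_j - d_j))
    for j in eachindex(event_times)
        event_times[j] <= t || continue
        s *= 1 - d[j] / n[j]
        acc += d[j] / (n[j] * (n[j] - d[j]))
    end
    return s^2 * acc
end

greenwood_variance(5, event_times, d, n)  # 0.3^2 * (1/20 + 1/12 + 1/2), roughly 0.057
```

If every individual still at risk fails at some $t_j$ (so $n_j = d_j$), the corresponding term divides by zero, so this sketch is only meaningful while the estimated survival is strictly positive.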
How to use it
You can compute these estimators using the following code:
using SurvivalModels
T = [2, 3, 4, 5, 8]
Δ = [1, 1, 0, 1, 0]
km = KaplanMeier(T, Δ)
KaplanMeier{Float64}([2.0, 3.0, 4.0, 5.0, 8.0], [1, 1, 0, 1, 0], [5, 4, 3, 2, 1], [0.2, 0.25, 0.0, 0.5, 0.0], [0.05, 0.08333333333333333, 0.0, 0.5, 0.0])
The same fit can be obtained with the formula interface:
using DataFrames
df = DataFrame(time=Float64.(T), status=Bool.(Δ))
km = fit(KaplanMeier, @formula(Surv(time, status) ~ 1), df)
KaplanMeier{Float64}([2.0, 3.0, 4.0, 5.0, 8.0], [1, 1, 0, 1, 0], [5, 4, 3, 2, 1], [0.2, 0.25, 0.0, 0.5, 0.0], [0.05, 0.08333333333333333, 0.0, 0.5, 0.0])
The resulting object has the following fields:
- t: Sorted unique event times.
- ∂N: Number of uncensored deaths at each time point.
- Y: Number of individuals at risk at each time point.
- ∂Λ: Increments of cumulative hazard.
- ∂σ: Greenwood variance increments.
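To see how the stored increments relate to the formulas of the previous sections, the sketch below rebuilds the survival curve and the Greenwood variance from the values printed above. The arrays are copied by hand from that output rather than read from the object, and the variable names are illustrative; how the package itself combines these fields is not shown here.

```julia
# Values copied from the printed KaplanMeier object (not read from the object itself).
t  = [2.0, 3.0, 4.0, 5.0, 8.0]
∂Λ = [0.2, 0.25, 0.0, 0.5, 0.0]                   # d_j / n_j at each time
∂σ = [0.05, 0.08333333333333333, 0.0, 0.5, 0.0]   # d_j / (n_j (n_j - d_j)) at each time

surv     = cumprod(1 .- ∂Λ)    # product of (1 - d_j / n_j) up to and including each time
σ²       = cumsum(∂σ)          # Greenwood sum up to and including each time
variance = surv .^ 2 .* σ²     # Greenwood variance of the survival estimate
```

Censoring times (4.0 and 8.0 here) carry zero increments, so the curve stays flat there.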
The obtained object can be used to compute survival and variance estimates as follows:
using SurvivalModels: greenwood
Ŝ = km(5.0) # Survival probability at time 5
v̂ = greenwood(km, 5.0) # Greenwood variance at time 5
Ŝ, v̂
(0.6000000000000001, 0.13333333333333333)
Finally, a $(1-\alpha) \times 100\%$ confidence interval for $S(t)$ can be constructed using the log-minus-log transformation:
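\[\log(-\log \hat{S}(t)) \pm z_{1-\alpha/2} \frac{\sqrt{\widehat{\mathrm{Var}}[\hat{S}(t)]}}{\hat{S}(t) \log \hat{S}(t)}\]
These bounds are on the $\log(-\log)$ scale; applying $x \mapsto \exp(-\exp(x))$ to each of them gives an interval for $S(t)$ that always lies within $[0, 1]$.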
The confint function can do it for you:
using SurvivalModels
T = [2, 3, 4, 5, 8]
Δ = [1, 1, 0, 1, 0]
km = KaplanMeier(T, Δ)
# Compute confidence intervals at each event time (default 95%)
ci = confint(km)
first(ci, 5) # show the first 5 rows
| Row | time | surv | lower | upper |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
| 1 | 2.0 | 0.8 | 0.203809 | 0.96918 |
| 2 | 3.0 | 0.6 | 0.12573 | 0.881756 |
| 3 | 4.0 | 0.6 | 0.12573 | 0.881756 |
| 4 | 5.0 | 0.3 | 0.0123015 | 0.719218 |
| 5 | 8.0 | 0.3 | 0.0123015 | 0.719218 |
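The rows of this table can be checked by hand. The sketch below assumes the usual delta-method standard error for $\log(-\log \hat{S}(t))$, namely $\sqrt{\widehat{\mathrm{Var}}[\hat{S}(t)]} / (\hat{S}(t)\,|\log \hat{S}(t)|)$, and a hard-coded normal quantile; the helper name `loglog_ci` is hypothetical and is not part of the package.

```julia
# Log-minus-log confidence interval from a survival estimate S and its Greenwood variance.
# z defaults to an approximate 0.975 standard normal quantile (hard-coded assumption).
function loglog_ci(S, var_S; z = 1.959964)
    se = sqrt(var_S) / (S * abs(log(S)))   # standard error of log(-log S)
    centre = log(-log(S))
    # Back-transform with x -> exp(-exp(x)); the map is decreasing, so the bounds swap.
    lower = exp(-exp(centre + z * se))
    upper = exp(-exp(centre - z * se))
    return lower, upper
end

# Row 1 (time 2.0): S = 0.8, Greenwood variance = 0.8^2 * 0.05.
loglog_ci(0.8, 0.8^2 * 0.05)

# Row 4 (time 5.0): S = 0.3, Greenwood variance = 0.3^2 * (0.05 + 1/12 + 0.5).
loglog_ci(0.3, 0.3^2 * (0.05 + 1 / 12 + 0.5))
```

Both calls should agree with the `lower` and `upper` columns above to within rounding of the quantile. The helper is undefined when `S` is exactly 0 or 1, because $\log(-\log S)$ is not finite there.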
References
SurvivalModels.KaplanMeier — Type
KaplanMeier(T, Δ)
fit(KaplanMeier, @formula(Surv(T, Δ) ~ 1), df)
Efficient Kaplan-Meier estimator.
Mathematical Description
Suppose we observe $n$ individuals, with observed times $T_1, T_2, \ldots, T_n$ and event indicators $\Delta_1, \Delta_2, \ldots, \Delta_n$ ($\Delta_i = 1$ if the event occurred, $0$ if censored).
Let $t_1 < t_2 < \cdots < t_k$ be the ordered unique event times.
- $d_j$: number of events at time $t_j$
- $Y_j$: number of individuals at risk just before $t_j$
The Kaplan-Meier estimator of the survival function $S(t)$ is:
\[\hat{S}(t) = \prod_{t_j \leq t} \left(1 - \frac{d_j}{Y_j}\right)\]
This product runs over all event times $t_j$ less than or equal to $t$.
The Greenwood estimator for the variance of $\hat{S}(t)$ is:
\[\widehat{\mathrm{Var}}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_j \leq t} \frac{d_j}{Y_j (Y_j - d_j)}\]
Arguments
- T: Vector of event or censoring times.
- Δ: Event indicator vector (1 if event, 0 if censored).
Stores
- t: Sorted unique event times.
- ∂N: Number of uncensored deaths at each time point.
- Y: Number of individuals at risk at each time point.
- ∂Λ: Increments of cumulative hazard.
- ∂σ: Greenwood variance increments.
Example: Direct usage
using SurvivalModels
T = [2, 3, 4, 5, 8]
Δ = [1, 1, 0, 1, 0]
km = KaplanMeier(T, Δ)
Example: Using the fit() interface
using SurvivalModels, DataFrames, StatsModels
df = DataFrame(time=T, status=Δ)
km2 = fit(KaplanMeier, @formula(Surv(time, status) ~ 1), df)
SurvivalModels.greenwood — Function
greenwood(S::KaplanMeier, t)
Compute the Greenwood variance estimate for the Kaplan-Meier survival estimator at time t.
The Greenwood formula provides an estimate of the variance of the Kaplan-Meier survival function at a given time point. For a fitted Kaplan-Meier object S, the variance at time t is:
\[\widehat{\mathrm{Var}}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_j < t} \frac{d_j}{Y_j (Y_j - d_j)}\]
- [1] E. L. Kaplan and P. Meier. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53, 457–481 (1958).
- [2] M. Greenwood. The natural duration of cancer. Reports on Public Health and Medical Subjects 33, 1–26 (1926).