Case Study: Survival Analysis on the Colon Dataset
Dataset Description
The colon dataset from the R survival package contains data from a clinical trial of colon cancer patients. Key variables include:
Time: Survival or censoring time (days)Status: Event indicator (1 = death, 0 = censored)Rx: Treatment group (Obs, Lev, Lev+5FU)Sex: Sex (1 = male, 0 = female)Age: Age at entryNode: Number of positive lymph nodesExtent: Extent of local spread (1 = confined, 2 = adjacent, 3 = adherent, 4 = invaded)Differ: Tumor differentiation (1 = well, 2 = moderate, 3 = poor)Perfor: Perforation (0 = no, 1 = yes)
Let’s load and inspect the data:
using SurvivalModels, RDatasets, DataFrames, Plots
colon = dataset("survival", "colon")
colon.Time = Float64.(colon.Time)
colon.Status = Bool.(colon.Status)
first(colon, 5)
describe(colon)| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| Symbol | Union… | Any | Union… | Any | Int64 | Type | |
| 1 | Id | 465.0 | 1 | 465.0 | 929 | 0 | Int32 |
| 2 | Study | 1.0 | 1 | 1.0 | 1 | 0 | Int32 |
| 3 | Rx | Obs | Lev+5FU | 0 | CategoricalValue{String, UInt8} | ||
| 4 | Sex | 0.52099 | 0 | 1.0 | 1 | 0 | Int32 |
| 5 | Age | 59.7546 | 18 | 61.0 | 85 | 0 | Int32 |
| 6 | Obstruct | 0.193757 | 0 | 0.0 | 1 | 0 | Int32 |
| 7 | Perfor | 0.0290635 | 0 | 0.0 | 1 | 0 | Int32 |
| 8 | Adhere | 0.145318 | 0 | 0.0 | 1 | 0 | Int32 |
| 9 | Nodes | 3.65971 | 0 | 2.0 | 33 | 36 | Union{Missing, Int32} |
| 10 | Status | 0.495156 | false | 0.0 | true | 0 | Bool |
| 11 | Differ | 2.06291 | 1 | 2.0 | 3 | 46 | Union{Missing, Int32} |
| 12 | Extent | 2.88698 | 1 | 3.0 | 4 | 0 | Int32 |
| 13 | Surg | 0.265877 | 0 | 0.0 | 1 | 0 | Int32 |
| 14 | Node4 | 0.274489 | 0 | 0.0 | 1 | 0 | Int32 |
| 15 | Time | 1537.55 | 8.0 | 1855.0 | 3329.0 | 0 | Float64 |
| 16 | EType | 1.5 | 1 | 1.5 | 2 | 0 | Int32 |
Exploratory Analysis
Let’s look at the distribution of survival times and events:
histogram(colon.Time, bins=50, xlabel="Time (days)", ylabel="Frequency", title="Distribution of Survival Times")
sum(colon.Status), length(colon.Status) - sum(colon.Status) # events, censored(920, 938)Let’s check the treatment groups and other covariates:
unique(colon.Rx)
combine(groupby(colon, :Rx), nrow)| Row | Rx | nrow |
|---|---|---|
| Cat… | Int64 | |
| 1 | Obs | 630 |
| 2 | Lev | 620 |
| 3 | Lev+5FU | 608 |
Kaplan-Meier Estimator
Estimate and plot the overall survival curve:
km = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon)
plot(km.t, cumprod(1 .- km.∂Λ), title = "Kaplan-Meier estimator", xlabel="Time (days)", ylabel="Survival probability")Compare survival curves by treatment group:
km_obs = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon[colon.Rx .== "Obs", :])
km_lev = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon[colon.Rx .== "Lev", :])
km_lev5fu = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon[colon.Rx .== "Lev+5FU", :])
plot(km_obs.t, cumprod(1 .- km_obs.∂Λ), label="Obs", xlabel="Time (days)", ylabel="Survival probability")
plot!(km_lev.t, cumprod(1 .- km_lev.∂Λ), label="Lev")
plot!(km_lev5fu.t, cumprod(1 .- km_lev5fu.∂Λ), label="Lev+5FU", title="KM by Treatment Group")Regrouping the two similar curves, we get (including confidence intervals):
Let’s regroup "Lev" and "Lev+5FU" as a single treatment group, and compare it to "Obs".
# Create a new treatment variable
colon.TreatGroup = ifelse.(colon.Rx .== "Obs", "Obs", "Lev+5FU or Lev")
# Fit KM for each group
km_obs = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon[colon.TreatGroup .== "Obs", :])
km_levgroup = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon[colon.TreatGroup .== "Lev+5FU or Lev", :])
# Get survival and confidence intervals from the package
ci_obs = confint(km_obs)
ci_lev = confint(km_levgroup)
# Plot
plot(ci_obs.time, ci_obs.surv, ribbon=(ci_obs.surv .- ci_obs.lower, ci_obs.upper .- ci_obs.surv),
label="Obs", color=:blue, xlabel="Time (days)", ylabel="Survival probability",
title="KM by Regrouped Treatment", legend=:bottomleft)
plot!(ci_lev.time, ci_lev.surv, ribbon=(ci_lev.surv .- ci_lev.lower, ci_lev.upper .- ci_lev.surv),
label="Lev+5FU or Lev", color=:red)This plot shows the Kaplan-Meier survival curves for the two treatment groups, with 95% confidence intervals.
Log-Rank Test
Test for differences in survival between treatment groups:
lrt = fit(LogRankTest, @formula(Surv(Time, Status) ~ TreatGroup), colon)
lrt.stat, lrt.pval(22351.70913178704, 0.0)Interpretation: A small p-value suggests significant differences in survival between groups.
Cox Proportional Hazards Model
Fit a Cox model with treatment and other covariates:
fit(Cox, @formula(Surv(Time, Status) ~ TreatGroup + Sex + Age), colon)Cox Model (n: 1858, m: 3, method: SurvivalModels.CoxDefault, C-index: 0.5311999566589811)
Interpretation: The coefficients show the log hazard ratios for each covariate.
General Hazard Models (GHMs)
TODO
Model Comparison
Let’s compare the quality of the fits using log-likelihood and AIC:
TODO
Performance (Optional)
You can also compare computation times:
TODO
Summary
- The colon dataset provides a rich set of covariates for survival analysis.
- Kaplan-Meier curves show overall and group-wise survival.
- Log-rank tests confirm differences between treatment groups.
- Cox and General Hazard Models allow for multivariate modeling and flexible hazard shapes.
This case study can be extended with further diagnostics, stratified analyses, and more advanced parametric models as needed.