Case Study: Survival Analysis on the Colon Dataset

Dataset Description

The colon dataset from the R survival package contains data from a clinical trial of colon cancer patients. Key variables include:

Time: Survival or censoring time (days)
Status: Event indicator (1 = death, 0 = censored)
Rx: Treatment group (Obs, Lev, Lev+5FU)
Sex: Sex (1 = male, 0 = female)
Age: Age at entry
Node: Number of positive lymph nodes
Extent: Extent of local spread (1 = confined, 2 = adjacent, 3 = adherent, 4 = invaded)
Differ: Tumor differentiation (1 = well, 2 = moderate, 3 = poor)
Perfor: Perforation (0 = no, 1 = yes)

Let’s load and inspect the data:

using SurvivalModels, RDatasets, DataFrames, Plots
colon = dataset("survival", "colon")
colon.Time = Float64.(colon.Time)
colon.Status = Bool.(colon.Status)
first(colon, 5)
describe(colon)

16×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Union…	Any	Union…	Any	Int64	Type
1	Id	465.0	1	465.0	929	0	Int32
2	Study	1.0	1	1.0	1	0	Int32
3	Rx		Obs		Lev+5FU	0	CategoricalValue{String, UInt8}
4	Sex	0.52099	0	1.0	1	0	Int32
5	Age	59.7546	18	61.0	85	0	Int32
6	Obstruct	0.193757	0	0.0	1	0	Int32
7	Perfor	0.0290635	0	0.0	1	0	Int32
8	Adhere	0.145318	0	0.0	1	0	Int32
9	Nodes	3.65971	0	2.0	33	36	Union{Missing, Int32}
10	Status	0.495156	false	0.0	true	0	Bool
11	Differ	2.06291	1	2.0	3	46	Union{Missing, Int32}
12	Extent	2.88698	1	3.0	4	0	Int32
13	Surg	0.265877	0	0.0	1	0	Int32
14	Node4	0.274489	0	0.0	1	0	Int32
15	Time	1537.55	8.0	1855.0	3329.0	0	Float64
16	EType	1.5	1	1.5	2	0	Int32

Exploratory Analysis

Let’s look at the distribution of survival times and events:

histogram(colon.Time, bins=50, xlabel="Time (days)", ylabel="Frequency", title="Distribution of Survival Times")
sum(colon.Status), length(colon.Status) - sum(colon.Status) # events, censored

(920, 938)

Let’s check the treatment groups and other covariates:

unique(colon.Rx)
combine(groupby(colon, :Rx), nrow)

3×2 DataFrame

Row	Rx	nrow
	Cat…	Int64
1	Obs	630
2	Lev	620
3	Lev+5FU	608

Kaplan-Meier Estimator

Estimate and plot the overall survival curve:

km = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon)
plot(km.t, cumprod(1 .- km.∂Λ), title = "Kaplan-Meier estimator", xlabel="Time (days)", ylabel="Survival probability")

Compare survival curves by treatment group:

km_obs = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon[colon.Rx .== "Obs", :])
km_lev = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon[colon.Rx .== "Lev", :])
km_lev5fu = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon[colon.Rx .== "Lev+5FU", :])
plot(km_obs.t, cumprod(1 .- km_obs.∂Λ), label="Obs", xlabel="Time (days)", ylabel="Survival probability")
plot!(km_lev.t, cumprod(1 .- km_lev.∂Λ), label="Lev")
plot!(km_lev5fu.t, cumprod(1 .- km_lev5fu.∂Λ), label="Lev+5FU", title="KM by Treatment Group")

Regrouping the two similar curves, we get (including confidence intervals):

Let’s regroup "Lev" and "Lev+5FU" as a single treatment group, and compare it to "Obs".

# Create a new treatment variable
colon.TreatGroup = ifelse.(colon.Rx .== "Obs", "Obs", "Lev+5FU or Lev")

# Fit KM for each group
km_obs = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon[colon.TreatGroup .== "Obs", :])
km_levgroup = fit(KaplanMeier, @formula(Surv(Time, Status) ~ 1), colon[colon.TreatGroup .== "Lev+5FU or Lev", :])

# Get survival and confidence intervals from the package
ci_obs = confint(km_obs)
ci_lev = confint(km_levgroup)

# Plot
plot(ci_obs.time, ci_obs.surv, ribbon=(ci_obs.surv .- ci_obs.lower, ci_obs.upper .- ci_obs.surv),
     label="Obs", color=:blue, xlabel="Time (days)", ylabel="Survival probability",
     title="KM by Regrouped Treatment", legend=:bottomleft)
plot!(ci_lev.time, ci_lev.surv, ribbon=(ci_lev.surv .- ci_lev.lower, ci_lev.upper .- ci_lev.surv),
      label="Lev+5FU or Lev", color=:red)

This plot shows the Kaplan-Meier survival curves for the two treatment groups, with 95% confidence intervals.

Log-Rank Test

Test for differences in survival between treatment groups:

lrt = fit(LogRankTest, @formula(Surv(Time, Status) ~ TreatGroup), colon)
lrt.stat, lrt.pval

(22351.70913178704, 0.0)

Interpretation: A small p-value suggests significant differences in survival between groups.

Cox Proportional Hazards Model

Fit a Cox model with treatment and other covariates:

fit(Cox, @formula(Surv(Time, Status) ~ TreatGroup + Sex + Age), colon)

Cox Model (n: 1858, m: 3, method: SurvivalModels.CoxDefault, C-index: 0.5311999566589811)

Interpretation: The coefficients show the log hazard ratios for each covariate.

General Hazard Models (GHMs)

TODO

Model Comparison

Let’s compare the quality of the fits using log-likelihood and AIC:

TODO

Performance (Optional)

You can also compare computation times:

TODO

Summary

The colon dataset provides a rich set of covariates for survival analysis.
Kaplan-Meier curves show overall and group-wise survival.
Log-rank tests confirm differences between treatment groups.
Cox and General Hazard Models allow for multivariate modeling and flexible hazard shapes.

This case study can be extended with further diagnostics, stratified analyses, and more advanced parametric models as needed.