Case study
Introduction and datasets
We will illustrate with an example using the dataset colrec
, which comprises $5971$ patients diagnosed with colon or rectal cancer between $1994$ and $2000$. This dataset is sourced from the Slovenia cancer registry. Given the high probability that the patients are Slovenian, we will be using the Slovenian mortality table slopop
as reference for the populational rates. Subsequently, we can apply various non-parametric estimators for net survival analysis.
Mortality tables may vary in structure, with options such as the addition or removal of specific covariates. To confirm that the mortality table is in the correct format, please refer to RateTables.jl
's documentation, or directly extract it from there.
Cohort details
The patients in the study are diagnosed between January 1st 1994 and December 31st 2000. Before we move on to the survival probabilities, it is important to be aware of how your data is distributed and of what it comprises.
using NetSurvival, RateTables, DataFrames
first(colrec,10)
Row | age | year | sex | time | status | stage | site |
---|---|---|---|---|---|---|---|
Float64 | Float64 | Symbol | Float64 | Bool | Int64 | String7 | |
1 | 23004.0 | 728528.0 | male | 16.0 | false | 1 | rectum |
2 | 12082.0 | 729260.0 | female | 504.0 | false | 3 | rectum |
3 | 24277.0 | 728583.0 | male | 22.0 | false | 3 | colon |
4 | 29256.0 | 729843.0 | female | 3998.0 | false | 1 | colon |
5 | 30416.0 | 728869.0 | female | 9.0 | false | 99 | colon |
6 | 15156.0 | 729686.0 | male | 88.0 | false | 2 | colon |
7 | 23953.0 | 728713.0 | female | 7769.0 | false | 1 | rectum |
8 | 20775.0 | 728867.0 | male | 7366.0 | false | 1 | rectum |
9 | 17555.0 | 730141.0 | female | 6480.0 | false | 1 | rectum |
10 | 20329.0 | 730137.0 | female | 6449.0 | false | 1 | rectum |
Let's explore how our data is distributed first, starting with the age
variable.
println(minimum(colrec.age./365.241), " ; ",maximum(colrec.age./365.241))
12.482169307388821 ; 96.7169622249419
The study can be considered diverse in terms of age seeing as the patients are between 12 and 96 years old, approximately.
using Plots
plot(
histogram(colrec.age./365.241, label="Age"),
histogram(colrec.time./365.241, label="Follow-up time")
)
The graph above show us that although it has a wide range of patients within all age groups, it is mostly centered around older adults and elderly, with the majority of the patients being between 60 and 80 years old.
Looking at the second graph that details the distribution of the follow-up times, we notice that the values quickly drop. Unfortunately, this is a common theme in cancer studies.
Let's take a look at the sex
variable now:
combine(groupby(colrec, :sex), nrow)
Row | sex | nrow |
---|---|---|
Symbol | Int64 | |
1 | male | 3289 |
2 | female | 2682 |
This dataframe shows the number of male and female patients. There isn't too big of a difference between the two. We can say this study includes both gender relatively equally, thus, reducing bias.
With these two observations, it is also worth noting that colorectal cancer is most common with men and people older than 50.
In total, we note that we have $5971$ patients. By taking a look at the status
variable, we can determine the deaths and censorship:
sum(colrec.status)
4979
Out of the $5971$ patients, $4979$ have died. This is a very high mortality rate, and again, unfortunately common in cancer studies.
(nrow(colrec) - sum(colrec.status)) / nrow(colrec)
0.16613632557360575
In other terms, the censorship rate is of 16.6%, meaning the event, in this case death, was not observed for only 16.6% of the patients. This is a low censorship rate and thus the quality of the signal will be pretty good.
Mortality table
We will be using the mortality table slopop
taken from the RateTables.jl
package as the study is done on Slovenian patients.
slopop
RateTable(:sex,)
By examining slopop
, we notice it contains information regarding age
and year
, as expected for mortality tables. Additionally, it incorporates the covariate sex, which has two possible entries (:male
or :female
). The ratetable is then three dimensional, with the covariate sex
added. For example, the daily hazard rate for a woman turning $45$ on the January 1st $2006$ can be accessed through the following command:
daily_hazard(slopop, 45*365.241, 2006*365.241; sex=:female)
2.02680951796113e-6
Making the survival probability easily calculated with:
exp(-(daily_hazard(slopop, 45*365.241, 2006*365.241; sex=:female))*365)
0.9992604880997519
Overall and expected survival
For this part, we will be using the Survival.jl
package to apply the Kaplan Meier estimator for the overall survival.
using Survival
km = fit(KaplanMeier, colrec.time./365.241, colrec.status)
plot(km.events.time, km.survival, label=false, title = "Kaplan-Meier Estimator for the Overall Survival")
The graph above indicates a significant dip in survival probability within the first $5$ years, and even more at $10$ years. This period in the study is then crucial for the analysis.
Estimated net survival
In this part, we are interested in the first $5$ years of the study. We will thus limit the follow-up time to $5$ years, meaning we will censor all individuals with a follow-up time that is higher than this. Then, we will apply the different net survival methods.
colrec.time5 .= 0.0
colrec.status5 .= Bool(true)
for i in 1:nrow(colrec)
colrec.time5[i] = min(colrec.time[i], round(5*365.241))
if colrec.time[i] > 5*365.241
colrec.status5[i] = false
end
end
Now that we have defined our own time and status variables according to the observations made, we can apply the different non parametric methods for relative survival.
e1 = fit(EdererI, @formula(Surv(time5,status5)~1), colrec, slopop)
EdererI(t ∈ (1.0, 1826.0)) with summary stats:
1826×5 DataFrame
Row │ Sₑ ∂Λₑ σₑ lower_95_CI upper_95_CI
│ Float64 Float64 Float64 Float64 Float64
──────┼───────────────────────────────────────────────────────────────
1 │ 0.997105 0.00289458 0.000710541 0.995718 0.998495
2 │ 0.995215 0.00189576 0.000918415 0.993425 0.997008
3 │ 0.992152 0.00307809 0.0011755 0.989869 0.99444
4 │ 0.988753 0.00342609 0.00140734 0.986029 0.991484
5 │ 0.986023 0.00276087 0.00157124 0.982991 0.989064
6 │ 0.983292 0.00276916 0.00172041 0.979982 0.986614
7 │ 0.981734 0.00158443 0.00180287 0.978271 0.98521
8 │ 0.976991 0.00483135 0.00202379 0.973124 0.980874
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
1820 │ 0.456494 -0.000133259 0.017137 0.441416 0.472087
1821 │ 0.456555 -0.000133262 0.017137 0.441475 0.47215
1822 │ 0.456616 -0.000133307 0.017137 0.441534 0.472213
1823 │ 0.456677 -0.000133276 0.017137 0.441593 0.472276
1824 │ 0.456527 0.000328197 0.0171432 0.441442 0.472127
1825 │ 0.456377 0.000328384 0.0171494 0.441292 0.471977
1826 │ 0.456016 0.000790464 0.0171618 0.440932 0.471616
1811 rows omitted
With the EdererI method, after $1826$ days have passed, we can say that the survival rate at this mark is around $0.456$, in the hypothetical world where patients can only die of cancer.
crude_e1 = CrudeMortality(e1)
println(crude_e1.Mₒ[1826], " , ", crude_e1.Mₑ[1826], " , ", crude_e1.Mₚ[1826])
0.6368404579125496 , 0.5158124650476976 , 0.121027992864852
Out of the 0.63 patients that have died, according to the EdererI method, 0.51 died because of colorectal cancer and 0.12 died of other causes.
e2 = fit(EdererII, @formula(Surv(time5,status5)~1), colrec, slopop)
EdererII(t ∈ (1.0, 1826.0)) with summary stats:
1826×5 DataFrame
Row │ Sₑ ∂Λₑ σₑ lower_95_CI upper_95_CI
│ Float64 Float64 Float64 Float64 Float64
──────┼───────────────────────────────────────────────────────────────
1 │ 0.997105 0.00289456 0.000710541 0.995718 0.998495
2 │ 0.995215 0.00189611 0.000918415 0.993425 0.997008
3 │ 0.992151 0.00307874 0.0011755 0.989868 0.994439
4 │ 0.988751 0.00342711 0.00140734 0.986027 0.991482
5 │ 0.98602 0.00276208 0.00157124 0.982988 0.989061
6 │ 0.983288 0.00277059 0.00172041 0.979978 0.986609
7 │ 0.981728 0.00158595 0.00180287 0.978265 0.985203
8 │ 0.976983 0.00483311 0.00202379 0.973116 0.980866
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
1820 │ 0.441142 -0.00011516 0.017137 0.426571 0.45621
1821 │ 0.441193 -0.000115156 0.017137 0.42662 0.456263
1822 │ 0.441243 -0.000115176 0.017137 0.426669 0.456316
1823 │ 0.441294 -0.000115213 0.017137 0.426718 0.456368
1824 │ 0.441142 0.00034624 0.0171432 0.426565 0.456216
1825 │ 0.440989 0.000346509 0.0171494 0.426412 0.456063
1826 │ 0.440632 0.000808582 0.0171618 0.426057 0.455706
1811 rows omitted
Similarily, the EdererII method, also known as the conditional method, shows that at the $5$ year mark, the survival probability is of $0.44$ in this hypothetical world.
crude_e2 = CrudeMortality(e2)
println(crude_e2.Mₒ[1826], " , ", crude_e2.Mₑ[1826], " , ", crude_e2.Mₚ[1826])
0.6368404579125496 , 0.5334518856851325 , 0.10338857222741712
Here, out of the 0.63 patients that have dued, 0.53 are due to colorectal cancer and 0.1 due to other causes.
pp = fit(PoharPerme, @formula(Surv(time5,status5)~1), colrec, slopop)
PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
1826×5 DataFrame
Row │ Sₑ ∂Λₑ σₑ lower_95_CI upper_95_CI
│ Float64 Float64 Float64 Float64 Float64
──────┼───────────────────────────────────────────────────────────────
1 │ 0.997105 0.00289493 0.000710632 0.995717 0.998495
2 │ 0.995214 0.0018967 0.0009186 0.993424 0.997007
3 │ 0.992149 0.00307987 0.00117581 0.989865 0.994438
4 │ 0.988748 0.00342788 0.0014077 0.986024 0.99148
5 │ 0.986016 0.00276316 0.0015717 0.982983 0.989058
6 │ 0.983283 0.00277109 0.00172089 0.979972 0.986605
7 │ 0.981722 0.00158763 0.0018035 0.978258 0.985199
8 │ 0.976973 0.00483737 0.00202472 0.973104 0.980858
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
1820 │ 0.44143 -0.000148858 0.0178917 0.426219 0.457184
1821 │ 0.441496 -0.000148885 0.0178917 0.426282 0.457252
1822 │ 0.441561 -0.000148937 0.0178917 0.426345 0.45732
1823 │ 0.441627 -0.000149005 0.0178917 0.426409 0.457389
1824 │ 0.441403 0.00050879 0.0179038 0.426182 0.457167
1825 │ 0.441281 0.000276025 0.0179089 0.42606 0.457045
1826 │ 0.440922 0.000813485 0.0179221 0.425702 0.456685
1811 rows omitted
We conclude for the Poher-Perme method, that in a world where cancer patients could only die due to cancer, only 41% of these patients would still be alive $5$ year after their diagnosis.
crude_pp = CrudeMortality(pp)
println(crude_pp.Mₒ[1826], " , ", crude_pp.Mₑ[1826], " , ", crude_pp.Mₚ[1826])
0.6458893614663985 , 0.5319347797905097 , 0.11395458167588875
Finally, for this estimator, we have that of the 0.64 patients that have died, 0.53 is due to colorectal cancer while 0.11 is due to other causes.
We will plot the Pohar Perme method only.
conf_int = confint(pp; level = 0.05)
lower_bounds = [lower[1] for lower in conf_int]
upper_bounds = [upper[2] for upper in conf_int]
p1 = plot(pp.grid, pp.Sₑ, ribbon=(pp.Sₑ - lower_bounds, upper_bounds - pp.Sₑ), xlab = "Time (days)", ylab = "Net survival", label = false)
p2 = plot(pp.grid, crude_pp.Mₑ, label = "Excess Mortality Rate")
p2 = plot!(pp.grid, crude_pp.Mₚ, label = "Population Mortality Rate")
plot(p1,p2)
Looking at the graph, and the rapid dip it takes, it is evident that the first $5$ years are crucial and that the survival probability is highly affected in these years. Additionnally, the crude mortality graph allows us to see how much of this curve is due to the colorectacl cancer studied versus other undefined causes. It is clear that the large majority is due to the cancer.
Net survival with respect to covariates
We are now interested in comparing the different groups of patients defined by various covariates.
pp_sex = fit(PoharPerme, @formula(Surv(time5,status5)~sex), colrec, slopop)
pp_males = pp_sex[pp_sex.sex .== :male,:estimator][1]
pp_females = pp_sex[pp_sex.sex .== :female,:estimator][1]
PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
1826×5 DataFrame
Row │ Sₑ ∂Λₑ σₑ lower_95_CI upper_95_CI
│ Float64 Float64 Float64 Float64 Float64
──────┼──────────────────────────────────────────────────────────────
1 │ 0.99675 0.00325029 0.00111873 0.994567 0.998938
2 │ 0.994244 0.00251419 0.00149396 0.991337 0.997159
3 │ 0.989871 0.0043981 0.00198035 0.986036 0.99372
4 │ 0.987363 0.00253401 0.00221726 0.983081 0.991663
5 │ 0.983733 0.00367643 0.00251903 0.978888 0.988602
6 │ 0.980104 0.00368904 0.00278998 0.974759 0.985478
7 │ 0.97871 0.00142212 0.00289239 0.973177 0.984274
8 │ 0.97321 0.00561896 0.00324789 0.967035 0.979425
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
1820 │ 0.450196 -0.000129968 0.025504 0.428245 0.473272
1821 │ 0.450254 -0.000129989 0.025504 0.428301 0.473333
1822 │ 0.450313 -0.000130047 0.025504 0.428356 0.473395
1823 │ 0.450371 -0.000130101 0.025504 0.428412 0.473456
1824 │ 0.45043 -0.000130144 0.025504 0.428468 0.473518
1825 │ 0.450071 0.000797147 0.0255209 0.428112 0.473156
1826 │ 0.449584 0.00108151 0.0255496 0.427625 0.472671
1811 rows omitted
When comparing at time $1826$, we notice that the survival probability is slightly inferior for men than for women ($0.433 < 0.449$). It is also more probable for the women to die from other causes than the men seeing as $0.0255 > 0.025$. Still, the differences are minimal. Let's confirm this with the Grafféo log-rank test:
test_sex = fit(GraffeoTest, @formula(Surv(time5,status5)~sex), colrec, slopop)
Grafféo's log-rank-type-test
1×3 DataFrame
Row │ test_statistic degrees_of_freedom p_value
│ Float64 Int64 Float64
─────┼──────────────────────────────────────────────
1 │ 0.54326 1 0.461085
The p-value is indeed above $0.05$. We cannot reject the null hypothesis $H_0$ and thus we dismiss the differences between the two sexes.
As for the age, we will define two different groups: individuals aged 65 and above and those who are not.
colrec.age65 .= ifelse.(colrec.age .>= 65*365.241, :old, :young)
pp_age65 = fit(PoharPerme, @formula(Surv(time5,status5)~age65), colrec, slopop)
pp_young = pp_age65[pp_age65.age65 .== :young, :estimator][1]
pp_old = pp_age65[pp_age65.age65 .== :old, :estimator][1]
PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
1826×5 DataFrame
Row │ Sₑ ∂Λₑ σₑ lower_95_CI upper_95_CI
│ Float64 Float64 Float64 Float64 Float64
──────┼──────────────────────────────────────────────────────────────
1 │ 0.996032 0.00396827 0.0010703 0.993944 0.998123
2 │ 0.993167 0.00287628 0.00141168 0.990423 0.995919
3 │ 0.988642 0.00455641 0.00181945 0.985122 0.992173
4 │ 0.983839 0.0048581 0.00217215 0.979659 0.988036
5 │ 0.980416 0.00347923 0.00239706 0.97582 0.985033
6 │ 0.97644 0.00405544 0.00263427 0.971411 0.981494
7 │ 0.974672 0.00180987 0.00273909 0.969454 0.979919
8 │ 0.967646 0.00720934 0.0030984 0.961787 0.97354
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
1820 │ 0.40224 -0.000235167 0.0269335 0.381557 0.424044
1821 │ 0.402335 -0.0002352 0.0269335 0.381647 0.424144
1822 │ 0.402429 -0.000235279 0.0269335 0.381737 0.424244
1823 │ 0.402524 -0.000235379 0.0269335 0.381826 0.424344
1824 │ 0.402139 0.000955393 0.0269598 0.381442 0.42396
1825 │ 0.401925 0.000534307 0.0269708 0.38123 0.423743
1826 │ 0.401615 0.000770012 0.0269896 0.380922 0.423432
1811 rows omitted
Here, the difference between the two is much more important. In the first group, the individuals are aged under 65 and at $5$ years time, they have a $50.1$% chance of survival. On the other hand, the individuals aged 65 and up have a $40.1$% chance of survival.
It is also worth noting that their chances of dying from other causes is higher than the younger group, given their age.
When applying the Grafféo test, we get the results below:
test_age65 = fit(GraffeoTest, @formula(Surv(time5,status5)~age65), colrec, slopop)
Grafféo's log-rank-type-test
1×3 DataFrame
Row │ test_statistic degrees_of_freedom p_value
│ Float64 Int64 Float64
─────┼─────────────────────────────────────────────────
1 │ 76.8388 1 1.85495e-18
The p-value is well under $0.05$, meaning we reject the $H_0$ hypothesis and must admit there are differences between the individuals aged 65 and above and the others.
When plotting both we get:
conf_int_men = confint(pp_males; level = 0.05)
lower_bounds_men = [lower[1] for lower in conf_int_men]
upper_bounds_men = [upper[2] for upper in conf_int_men]
conf_int_women = confint(pp_females; level = 0.05)
lower_bounds_women = [lower[1] for lower in conf_int_women]
upper_bounds_women = [upper[2] for upper in conf_int_women]
conf_int_under65 = confint(pp_young; level = 0.05)
lower_bounds_under65 = [lower[1] for lower in conf_int_under65]
upper_bounds_under65 = [upper[2] for upper in conf_int_under65]
conf_int_65 = confint(pp_old; level = 0.05)
lower_bounds_65 = [lower[1] for lower in conf_int_65]
upper_bounds_65 = [upper[2] for upper in conf_int_65]
plot1 = plot(pp_males.grid, pp_males.Sₑ, ribbon=(pp_males.Sₑ - lower_bounds_men, upper_bounds_men - pp_males.Sₑ), xlab = "Time (days)", ylab = "Net survival", label = "men")
plot1 = plot!(pp_females.grid, pp_females.Sₑ, ribbon=(pp_females.Sₑ - lower_bounds_women, upper_bounds_women - pp_females.Sₑ), xlab = "Time (days)", ylab = "Net survival", label = "women")
plot2 = plot(pp_young.grid, pp_young.Sₑ, ribbon=(pp_young.Sₑ - lower_bounds_under65, upper_bounds_under65 - pp_young.Sₑ), xlab = "Time (days)", ylab = "Net survival", label = "Under 65")
plot2 = plot!(pp_old.grid, pp_old.Sₑ, ribbon=(pp_old.Sₑ - lower_bounds_65, upper_bounds_65 - pp_old.Sₑ), xlab = "Time (days)", ylab = "Net survival", label = "65 and up")
plot(plot1, plot2, layout = (1, 2))
Visually, it is almost immediately understood that there are no worthy differences between the two sexes whereas the age65
variable seems to play a big role.
Estimated sample size and life expectancy
Given that the age group plays a significant role in the study, we will now estimate the sample size by yearly intervals in order to better compare the age groups.
elt, ess = nessie(@formula(Surv(time,status)~age65), colrec, slopop)
elt
Row | age65 | expected_life_time |
---|---|---|
Symbol | Float64 | |
1 | young | 24.7882 |
2 | old | 10.2949 |
The expected life time for the younger patients is significatively higher than for older patients (24.78 years > 10.29 years).
hcat(ess[:,3]...)
23×2 Matrix{Float64}:
2352.0 3619.0
2323.53 3389.94
2293.83 3168.85
2262.95 2954.21
2230.76 2746.24
2197.39 2546.26
2162.79 2352.33
2127.0 2162.2
2090.07 1974.6
2052.34 1792.37
⋮
1846.31 1047.55
1801.39 933.738
1754.86 827.664
1706.36 727.389
1655.44 633.358
1601.95 546.265
1545.51 464.941
1484.84 388.018
1419.49 316.588
Finally, the table above represents yearly expected sample sizes for both age groups under 65 and above, with the second column representing the latter. We can see that the sample size decreases for the older patients in a much more dramatic way than for the younger ages.
Unsurprisingly, we can thus conclude that age plays an important role in the study.