Case study
Introduction and datasets
We will illustrate with an example using the dataset colrec
, which comprises $5971$ patients diagnosed with colon or rectal cancer between $1994$ and $2000$. This dataset is sourced from the Slovenia cancer registry. Given the high probability that the patients are Slovenian, we will be using the Slovenian mortality table slopop
as reference for the population mortality rates. Subsequently, we can apply various non-parametric estimators for net survival analysis.
Mortality tables may vary in structure, with options such as the addition or removal of specific covariates. To confirm that the mortality table is in the correct format, please refer to RateTables.jl
's documentation, or directly extract it from there.
Cohort details
The patients in the study are diagnosed between January 1st $1994$ and December 31st $2000$. Before we move on to the survival probabilities, it is important to be aware of how your data is distributed and of what it comprises.
using NetSurvival, RateTables, DataFrames
first(colrec,10)
Row | time | status | age | year | sex | stage | site |
---|---|---|---|---|---|---|---|
Float64 | Bool | Float64 | Float64 | Symbol | Int64 | Symbol | |
1 | 16.0 | false | 23004.0 | 728528.0 | male | 1 | rectum |
2 | 504.0 | false | 12082.0 | 729260.0 | female | 3 | rectum |
3 | 22.0 | false | 24277.0 | 728583.0 | male | 3 | colon |
4 | 3998.0 | false | 29256.0 | 729843.0 | female | 1 | colon |
5 | 9.0 | false | 30416.0 | 728869.0 | female | 99 | colon |
6 | 88.0 | false | 15156.0 | 729686.0 | male | 2 | colon |
7 | 7769.0 | false | 23953.0 | 728713.0 | female | 1 | rectum |
8 | 7366.0 | false | 20775.0 | 728867.0 | male | 1 | rectum |
9 | 6480.0 | false | 17555.0 | 730141.0 | female | 1 | rectum |
10 | 6449.0 | false | 20329.0 | 730137.0 | female | 1 | rectum |
Let's explore how our data is distributed first, starting with the age
variable.
println(minimum(colrec.age./365.241), " ; ",maximum(colrec.age./365.241))
12.482169307388821 ; 96.7169622249419
The study can be considered diverse in terms of age seeing as the patients are between 12 and 96 years old, approximately.
using Plots
plot(histogram(colrec.age./365.241, label="Age"),
histogram(colrec.time./365.241, label="Follow-up time"))
The graph above show us that although the dataset has a wide range of patients within all age groups, it is mostly centered around older adults and elderly, with the majority of the patients being between $60$ and $80$ years old. Looking at the second graph that details the distribution of the follow-up times. We notice there that the values quickly drop. Unfortunately, this is a common theme in cancer studies.
Let's take a look at the sex
variable now, by looking at the number of male and female patients:
combine(groupby(colrec, :sex), nrow)
Row | sex | nrow |
---|---|---|
Symbol | Int64 | |
1 | male | 3289 |
2 | female | 2682 |
There isn't too big of a difference between the two. We can say this study includes both gender relatively equally, thus, reducing bias. With these two observations, it is also worth noting that colorectal cancer is most common with men and people older than $50$.
In total, we note that we have $5971$ patients. By taking a look at the status
variable, we can determine the deaths and censorship:
sum(colrec.status)
4979
Out of the $5971$ patients, $4979$ have died. This is a very high mortality rate, and again, unfortunately common in cancer studies.
(nrow(colrec) - sum(colrec.status)) / nrow(colrec)
0.16613632557360575
In other terms, the censorship rate is of 16.6%, meaning the event, in this case death, was not observed for only 16.6% of the patients. This is a low censorship rate and thus the quality of the signal will be pretty good.
Mortality table
We will be using the mortality table slopop
taken from the RateTables.jl
package as the study is done on Slovenian patients.
slopop
RateTable(:sex,)
The show method of the RateTable
class shows the additional covariate sex
that the rate table has on top of the (mandatory) age
and year
variables. The sex
variable has two madalities, :male
and :female
. The ratetable is then three dimensional. For example, the daily hazard rate for a woman turning $45$ on the January 1st $2006$ can be accessed through the following command:
λ = daily_hazard(slopop, 45*365.241, 2006*365.241; sex=:female)
2.02680951796113e-6
Making the daily survival probability easily calculated with:
exp(-λ)
0.999997973192536
Overall and expected survival
For this part, we will be using the Survival.jl
package to apply the Kaplan Meier estimator for the overall survival.
using Survival
km = fit(KaplanMeier, colrec.time./365.241, colrec.status)
plot(km.events.time, km.survival, label=false, title = "Kaplan-Meier Estimator for the Overall Survival")
The graph above indicates a significant dip in survival probability within the first $5$ years, and even more at $10$ years. This period in the study is then crucial for the analysis.
Estimated net survival
We will restrict ourselves to the first $5$ years of the study. For that, let us re-censor the dataset as follows:
for i in 1:nrow(colrec)
if colrec.time[i] > 1826 # five years
colrec.status[i] = false
colrec.time[i] = 1826
end
end
We can now apply the different non-parametric methods to compute the relative survival.
e1 = fit(EdererI, @formula(Surv(time,status)~1), colrec, slopop)
EdererI(t ∈ (1.0, 1826.0)) with summary stats:
1826×5 DataFrame
Row │ Sₑ ∂Λₑ σₑ lower_95_CI upper_95_CI
│ Float64 Float64 Float64 Float64 Float64
──────┼───────────────────────────────────────────────────────────────
1 │ 0.997105 0.00289458 0.000710541 0.995718 0.998495
2 │ 0.995215 0.00189576 0.000918415 0.993425 0.997008
3 │ 0.992152 0.00307809 0.0011755 0.989869 0.99444
4 │ 0.988753 0.00342609 0.00140734 0.986029 0.991484
5 │ 0.986023 0.00276087 0.00157124 0.982991 0.989064
6 │ 0.983292 0.00276916 0.00172041 0.979982 0.986614
7 │ 0.981734 0.00158443 0.00180287 0.978271 0.98521
8 │ 0.976991 0.00483135 0.00202379 0.973124 0.980874
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
1820 │ 0.456957 -0.000133259 0.0171307 0.441869 0.47256
1821 │ 0.457018 -0.000133262 0.0171307 0.441928 0.472623
1822 │ 0.457079 -0.000133307 0.0171307 0.441987 0.472686
1823 │ 0.457139 -0.000133276 0.0171307 0.442045 0.472749
1824 │ 0.456989 0.000328197 0.017137 0.441895 0.472599
1825 │ 0.456839 0.000328384 0.0171432 0.441745 0.47245
1826 │ 0.456478 0.000790464 0.0171556 0.441385 0.472088
1811 rows omitted
With the EdererI method, after $1826$ days have passed, we can say that the survival rate at this mark is around $0.456$, in the hypothetical world where patients can only die of cancer.
crude_e1 = CrudeMortality(e1)
println(crude_e1.Mₒ[1826], " , ", crude_e1.Mₑ[1826], " , ", crude_e1.Mₚ[1826])
0.636476800600479 , 0.515341453104102 , 0.121135347496377
Out of the 0.63 patients that have died, according to the EdererI method, 0.51 died because of colorectal cancer and 0.12 died of other causes.
e2 = fit(EdererII, @formula(Surv(time,status)~1), colrec, slopop)
EdererII(t ∈ (1.0, 1826.0)) with summary stats:
1826×5 DataFrame
Row │ Sₑ ∂Λₑ σₑ lower_95_CI upper_95_CI
│ Float64 Float64 Float64 Float64 Float64
──────┼───────────────────────────────────────────────────────────────
1 │ 0.997105 0.00289456 0.000710541 0.995718 0.998495
2 │ 0.995215 0.00189611 0.000918415 0.993425 0.997008
3 │ 0.992151 0.00307874 0.0011755 0.989868 0.994439
4 │ 0.988751 0.00342711 0.00140734 0.986027 0.991482
5 │ 0.98602 0.00276208 0.00157124 0.982988 0.989061
6 │ 0.983288 0.00277059 0.00172041 0.979978 0.986609
7 │ 0.981728 0.00158595 0.00180287 0.978265 0.985203
8 │ 0.976983 0.00483311 0.00202379 0.973116 0.980866
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
1820 │ 0.441589 -0.00011516 0.0171307 0.427008 0.456667
1821 │ 0.44164 -0.000115156 0.0171307 0.427058 0.45672
1822 │ 0.441691 -0.000115176 0.0171307 0.427107 0.456772
1823 │ 0.441741 -0.000115213 0.0171307 0.427156 0.456825
1824 │ 0.441588 0.00034624 0.017137 0.427003 0.456672
1825 │ 0.441435 0.000346509 0.0171432 0.42685 0.45652
1826 │ 0.441079 0.000808582 0.0171556 0.426494 0.456162
1811 rows omitted
Similarly, the EdererII method, also known as the conditional method, shows that at the $5$ year mark, the survival probability is of $0.44$ in this hypothetical world.
crude_e2 = CrudeMortality(e2)
println(crude_e2.Mₒ[1826], " , ", crude_e2.Mₑ[1826], " , ", crude_e2.Mₚ[1826])
0.636476800600479 , 0.5329969626451269 , 0.10347983795535214
Here, out of the 0.63 patients that have died, 0.53 are due to colorectal cancer and 0.1 due to other causes.
pp = fit(PoharPerme, @formula(Surv(time,status)~1), colrec, slopop)
PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
1826×5 DataFrame
Row │ Sₑ ∂Λₑ σₑ lower_95_CI upper_95_CI
│ Float64 Float64 Float64 Float64 Float64
──────┼───────────────────────────────────────────────────────────────
1 │ 0.997105 0.00289493 0.000710632 0.995717 0.998495
2 │ 0.995214 0.0018967 0.0009186 0.993424 0.997007
3 │ 0.992149 0.00307987 0.00117581 0.989865 0.994438
4 │ 0.988748 0.00342788 0.0014077 0.986024 0.99148
5 │ 0.986016 0.00276316 0.0015717 0.982983 0.989058
6 │ 0.983283 0.00277109 0.00172089 0.979972 0.986605
7 │ 0.981722 0.00158763 0.0018035 0.978258 0.985199
8 │ 0.976973 0.00483737 0.00202472 0.973104 0.980858
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
1820 │ 0.44187 -0.000148858 0.017886 0.426649 0.457635
1821 │ 0.441936 -0.000148885 0.017886 0.426712 0.457703
1822 │ 0.442002 -0.000148937 0.017886 0.426776 0.457772
1823 │ 0.442068 -0.000149005 0.017886 0.426839 0.45784
1824 │ 0.441843 0.00050879 0.0178981 0.426612 0.457618
1825 │ 0.441721 0.000276025 0.0179032 0.42649 0.457496
1826 │ 0.441362 0.000813485 0.0179164 0.426132 0.457136
1811 rows omitted
We conclude for the Pohar Perme method, that in a world where cancer patients could only die due to cancer, only 41% of these patients would still be alive $5$ year after their diagnosis. The Pohar Perme estimator is the best estimator of the excess hazard under the standard hypotheses.
crude_pp = CrudeMortality(pp)
println(crude_pp.Mₒ[1826], " , ", crude_pp.Mₑ[1826], " , ", crude_pp.Mₚ[1826])
0.6455403791714492 , 0.5314853737936457 , 0.1140550053778036
Finally, for this estimator, we have that of the 0.64 patients that have died, 0.53 is due to colorectal cancer while 0.11 is due to other causes.
We will plot the Pohar Perme method only.
function mkribbon(pp)
S = pp.Sₑ
ci = confint(pp; level = 0.05)
l,u = getindex.(ci, 1), getindex.(ci,2)
rb = (S - l, u - S)
return rb
end
p1 = plot(pp.grid, pp.Sₑ, ribbon=mkribbon(pp), xlab = "Time (days)", ylab = "Net survival", label = false)
p2 = plot(pp.grid, crude_pp.Mₑ, label = "Excess Mortality Rate")
p2 = plot!(pp.grid, crude_pp.Mₚ, label = "Population Mortality Rate")
plot(p1,p2)
Looking at the graph, and the rapid dip it takes, it is evident that the first $5$ years are crucial and that the survival probability is highly affected in these years. Additionally, the crude mortality graph allows us to see how much of this curve is due to the colorectal cancer studied versus other undefined causes. It is clear that the large majority is due to the cancer.
Net survival with respect to covariates
We are now interested in comparing the different groups of patients defined by various covariates.
pp_sex = fit(PoharPerme, @formula(Surv(time,status)~sex), colrec, slopop)
pp_males = pp_sex[pp_sex.sex .== :male,:estimator][1]
pp_females = pp_sex[pp_sex.sex .== :female,:estimator][1]
PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
1826×5 DataFrame
Row │ Sₑ ∂Λₑ σₑ lower_95_CI upper_95_CI
│ Float64 Float64 Float64 Float64 Float64
──────┼──────────────────────────────────────────────────────────────
1 │ 0.99675 0.00325029 0.00111873 0.994567 0.998938
2 │ 0.994244 0.00251419 0.00149396 0.991337 0.997159
3 │ 0.989871 0.0043981 0.00198035 0.986036 0.99372
4 │ 0.987363 0.00253401 0.00221726 0.983081 0.991663
5 │ 0.983733 0.00367643 0.00251903 0.978888 0.988602
6 │ 0.980104 0.00368904 0.00278998 0.974759 0.985478
7 │ 0.97871 0.00142212 0.00289239 0.973177 0.984274
8 │ 0.97321 0.00561896 0.00324789 0.967035 0.979425
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
1820 │ 0.450638 -0.000129968 0.0254941 0.428674 0.473727
1821 │ 0.450697 -0.000129989 0.0254941 0.42873 0.473789
1822 │ 0.450755 -0.000130047 0.0254941 0.428786 0.473851
1823 │ 0.450814 -0.000130101 0.0254941 0.428841 0.473912
1824 │ 0.450873 -0.000130144 0.0254941 0.428897 0.473974
1825 │ 0.450513 0.000797147 0.025511 0.428541 0.473612
1826 │ 0.450026 0.00108151 0.0255398 0.428054 0.473126
1811 rows omitted
When comparing at time $1826$, we notice that the survival probability is slightly inferior for men than for women ($0.433 < 0.449$). It is also more probable for the women to die from other causes than the men seeing as $0.0255 > 0.025$. Still, the differences are minimal. Let's confirm this with the Grafféo log-rank test:
test_sex = fit(GraffeoTest, @formula(Surv(time,status)~sex), colrec, slopop)
Grafféo's log-rank-type-test (Method: NetSurvival.PoharPermeMethod())
1×3 DataFrame
Row │ test_statistic degrees_of_freedom p_value
│ Float64 Int64 Float64
─────┼──────────────────────────────────────────────
1 │ 0.533424 1 0.465171
The p-value is indeed above $0.05$. We cannot reject the null hypothesis $H_0$ and thus we dismiss the differences between the two sexes.
As for the age, we will define two different groups: individuals aged 65 and above and those who are not.
colrec.age65 .= ifelse.(colrec.age .>= 65*365.241, :old, :young)
pp_age65 = fit(PoharPerme, @formula(Surv(time,status)~age65), colrec, slopop)
pp_young = pp_age65[pp_age65.age65 .== :young, :estimator][1]
pp_old = pp_age65[pp_age65.age65 .== :old, :estimator][1]
PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
1826×5 DataFrame
Row │ Sₑ ∂Λₑ σₑ lower_95_CI upper_95_CI
│ Float64 Float64 Float64 Float64 Float64
──────┼──────────────────────────────────────────────────────────────
1 │ 0.996032 0.00396827 0.0010703 0.993944 0.998123
2 │ 0.993167 0.00287628 0.00141168 0.990423 0.995919
3 │ 0.988642 0.00455641 0.00181945 0.985122 0.992173
4 │ 0.983839 0.0048581 0.00217215 0.979659 0.988036
5 │ 0.980416 0.00347923 0.00239706 0.97582 0.985033
6 │ 0.97644 0.00405544 0.00263427 0.971411 0.981494
7 │ 0.974672 0.00180987 0.00273909 0.969454 0.979919
8 │ 0.967646 0.00720934 0.0030984 0.961787 0.97354
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
1820 │ 0.402479 -0.000235167 0.0269303 0.381786 0.424294
1821 │ 0.402574 -0.0002352 0.0269303 0.381876 0.424393
1822 │ 0.402669 -0.000235279 0.0269303 0.381966 0.424493
1823 │ 0.402763 -0.000235379 0.0269303 0.382056 0.424593
1824 │ 0.402379 0.000955393 0.0269566 0.381671 0.424209
1825 │ 0.402164 0.000534307 0.0269676 0.381459 0.423992
1826 │ 0.401854 0.000770012 0.0269863 0.381151 0.423681
1811 rows omitted
Here, the difference between the two is much more important. In the first group, the individuals are aged under 65 and at $5$ years time, they have a $50.1$% chance of survival. On the other hand, the individuals aged 65 and up have a $40.1$% chance of survival.
It is also worth noting that their chances of dying from other causes is higher than the younger group, given their age.
When applying the Grafféo test, we get the results below:
test_age65 = fit(GraffeoTest, @formula(Surv(time,status)~age65), colrec, slopop)
Grafféo's log-rank-type-test (Method: NetSurvival.PoharPermeMethod())
1×3 DataFrame
Row │ test_statistic degrees_of_freedom p_value
│ Float64 Int64 Float64
─────┼─────────────────────────────────────────────────
1 │ 77.4381 1 1.36951e-18
The p-value is well under $0.05$, meaning we reject the $H_0$ hypothesis and must admit there are differences between the individuals aged 65 and above and the others.
When plotting both we get:
plot1 = plot(pp_males.grid, pp_males.Sₑ, ribbon=mkribbon(pp_males), xlab = "Time (days)", ylab = "Net survival", label = "men")
plot1 = plot!(plot1, pp_females.grid, pp_females.Sₑ, ribbon=mkribbon(pp_females), label = "women")
plot2 = plot(pp_young.grid, pp_young.Sₑ, ribbon=mkribbon(pp_young), xlab = "Time (days)", ylab = "Net survival", label = "Under 65")
plot2 = plot!(pp_old.grid, pp_old.Sₑ, ribbon=mkribbon(pp_old), label = "65 and up")
plot(plot1, plot2, layout = (1, 2))
Visually, it is almost immediately understood that there are no worthy differences between the two sexes whereas the age65
variable seems to play a big role.
The same kind of graph can be made on the stage:
pp3 = fit(PoharPerme, @formula(Surv(time,status)~stage), colrec, slopop)
plot3 = plot(xlab = "Time (days)", ylab = "Net survival",title="Net survival per cancer stages")
for i in 1:nrow(pp3)
e = pp3[i,:estimator]
plot!(plot3, e.grid, e.Sₑ, ribbon=mkribbon(e), label=pp3[i,:stage])
end
plot(plot3)
Estimated sample size and life expectancy
Given that the age group plays a significant role in the study, we will now estimate the sample size by yearly intervals in order to better compare the age groups.
elt, ess = nessie(@formula(Surv(time,status)~age65), colrec, slopop)
elt
Row | age65 | expected_life_time |
---|---|---|
Symbol | Float64 | |
1 | young | 25.1699 |
2 | old | 10.3289 |
The expected life time for the younger patients is significatively higher than for older patients (24.78 years > 10.29 years).
hcat(ess[:,3]...)
5×2 Matrix{Float64}:
2352.0 3619.0
2323.53 3389.94
2293.83 3168.85
2262.95 2954.21
2230.76 2746.24
Finally, the table above represents yearly expected sample sizes for both age groups under 65 and above, with the second column representing the latter. We can see that the sample size decreases for the older patients in a much more dramatic way than for the younger ages.
Unsurprisingly, we can thus conclude that age plays an important role in the study.