Case study

Introduction and datasets

We will illustrate with an example using the dataset colrec, which comprises $5971$ patients diagnosed with colon or rectal cancer between $1994$ and $2000$. This dataset is sourced from the Slovenia cancer registry. Given the high probability that the patients are Slovenian, we will be using the Slovenian mortality table slopop as reference for the populational rates. Subsequently, we can apply various non-parametric estimators for net survival analysis.

N.B.

Mortality tables may vary in structure, with options such as the addition or removal of specific covariates. To confirm that the mortality table is in the correct format, please refer to RateTables.jl's documentation, or directly extract it from there.

Cohort details

The patients in the study are diagnosed between January 1st 1994 and December 31st 2000. Before we move on to the survival probabilities, it is important to be aware of how your data is distributed and of what it comprises.

using NetSurvival, RateTables, DataFrames

first(colrec,10)
10×7 DataFrame
Rowageyearsextimestatusstagesite
Float64Float64SymbolFloat64BoolInt64String7
123004.0728528.0male16.0false1rectum
212082.0729260.0female504.0false3rectum
324277.0728583.0male22.0false3colon
429256.0729843.0female3998.0false1colon
530416.0728869.0female9.0false99colon
615156.0729686.0male88.0false2colon
723953.0728713.0female7769.0false1rectum
820775.0728867.0male7366.0false1rectum
917555.0730141.0female6480.0false1rectum
1020329.0730137.0female6449.0false1rectum

Let's explore how our data is distributed first, starting with the age variable.

println(minimum(colrec.age./365.241), " ; ",maximum(colrec.age./365.241))
12.482169307388821 ; 96.7169622249419

The study can be considered diverse in terms of age seeing as the patients are between 12 and 96 years old, approximately.

using Plots

plot(
    histogram(colrec.age./365.241, label="Age"),
    histogram(colrec.time./365.241, label="Follow-up time")
)
Example block output

The graph above show us that although it has a wide range of patients within all age groups, it is mostly centered around older adults and elderly, with the majority of the patients being between 60 and 80 years old.

Looking at the second graph that details the distribution of the follow-up times, we notice that the values quickly drop. Unfortunately, this is a common theme in cancer studies.

Let's take a look at the sex variable now:

combine(groupby(colrec, :sex), nrow)
2×2 DataFrame
Rowsexnrow
SymbolInt64
1male3289
2female2682

This dataframe shows the number of male and female patients. There isn't too big of a difference between the two. We can say this study includes both gender relatively equally, thus, reducing bias.

With these two observations, it is also worth noting that colorectal cancer is most common with men and people older than 50.

In total, we note that we have $5971$ patients. By taking a look at the status variable, we can determine the deaths and censorship:

sum(colrec.status)
4979

Out of the $5971$ patients, $4979$ have died. This is a very high mortality rate, and again, unfortunately common in cancer studies.

(nrow(colrec) - sum(colrec.status)) / nrow(colrec)
0.16613632557360575

In other terms, the censorship rate is of 16.6%, meaning the event, in this case death, was not observed for only 16.6% of the patients. This is a low censorship rate and thus the quality of the signal will be pretty good.

Mortality table

We will be using the mortality table slopop taken from the RateTables.jl package as the study is done on Slovenian patients.

slopop
RateTable(:sex,)

By examining slopop, we notice it contains information regarding age and year, as expected for mortality tables. Additionally, it incorporates the covariate sex, which has two possible entries (:male or :female). The ratetable is then three dimensional, with the covariate sex added. For example, the daily hazard rate for a woman turning $45$ on the January 1st $2006$ can be accessed through the following command:

daily_hazard(slopop, 45*365.241, 2006*365.241; sex=:female)
2.02680951796113e-6

Making the survival probability easily calculated with:

exp(-(daily_hazard(slopop, 45*365.241, 2006*365.241; sex=:female))*365)
0.9992604880997519

Overall and expected survival

For this part, we will be using the Survival.jl package to apply the Kaplan Meier estimator for the overall survival.

using Survival

km = fit(KaplanMeier, colrec.time./365.241, colrec.status)
plot(km.events.time, km.survival, label=false, title = "Kaplan-Meier Estimator for the Overall Survival")
Example block output

The graph above indicates a significant dip in survival probability within the first $5$ years, and even more at $10$ years. This period in the study is then crucial for the analysis.

Estimated net survival

In this part, we are interested in the first $5$ years of the study. We will thus limit the follow-up time to $5$ years, meaning we will censor all individuals with a follow-up time that is higher than this. Then, we will apply the different net survival methods.

colrec.time5 .= 0.0
colrec.status5 .= Bool(true)
for i in 1:nrow(colrec)
    colrec.time5[i] = min(colrec.time[i], round(5*365.241))
    if colrec.time[i] > 5*365.241
        colrec.status5[i] = false
    end
end

Now that we have defined our own time and status variables according to the observations made, we can apply the different non parametric methods for relative survival.

e1 = fit(EdererI, @formula(Surv(time5,status5)~1), colrec, slopop)
EdererI(t ∈ (1.0, 1826.0)) with summary stats:
 1826×5 DataFrame
  Row  Sₑ        ∂Λₑ           σₑ           lower_95_CI  upper_95_CI  Float64   Float64       Float64      Float64      Float64     
──────┼───────────────────────────────────────────────────────────────
    1 │ 0.997105   0.00289458   0.000710541     0.995718     0.998495
    2 │ 0.995215   0.00189576   0.000918415     0.993425     0.997008
    3 │ 0.992152   0.00307809   0.0011755       0.989869     0.99444
    4 │ 0.988753   0.00342609   0.00140734      0.986029     0.991484
    5 │ 0.986023   0.00276087   0.00157124      0.982991     0.989064
    6 │ 0.983292   0.00276916   0.00172041      0.979982     0.986614
    7 │ 0.981734   0.00158443   0.00180287      0.978271     0.98521
    8 │ 0.976991   0.00483135   0.00202379      0.973124     0.980874
  ⋮   │    ⋮           ⋮             ⋮            ⋮            ⋮
 1820 │ 0.456494  -0.000133259  0.017137        0.441416     0.472087
 1821 │ 0.456555  -0.000133262  0.017137        0.441475     0.47215
 1822 │ 0.456616  -0.000133307  0.017137        0.441534     0.472213
 1823 │ 0.456677  -0.000133276  0.017137        0.441593     0.472276
 1824 │ 0.456527   0.000328197  0.0171432       0.441442     0.472127
 1825 │ 0.456377   0.000328384  0.0171494       0.441292     0.471977
 1826 │ 0.456016   0.000790464  0.0171618       0.440932     0.471616
                                                     1811 rows omitted

With the EdererI method, after $1826$ days have passed, we can say that the survival rate at this mark is around $0.456$, in the hypothetical world where patients can only die of cancer.

crude_e1 = CrudeMortality(e1)
println(crude_e1.Mₒ[1826], " , ", crude_e1.Mₑ[1826], " , ", crude_e1.Mₚ[1826])
0.6368404579125496 , 0.5158124650476976 , 0.121027992864852

Out of the 0.63 patients that have died, according to the EdererI method, 0.51 died because of colorectal cancer and 0.12 died of other causes.

e2 = fit(EdererII, @formula(Surv(time5,status5)~1), colrec, slopop)
EdererII(t ∈ (1.0, 1826.0)) with summary stats:
 1826×5 DataFrame
  Row  Sₑ        ∂Λₑ           σₑ           lower_95_CI  upper_95_CI  Float64   Float64       Float64      Float64      Float64     
──────┼───────────────────────────────────────────────────────────────
    1 │ 0.997105   0.00289456   0.000710541     0.995718     0.998495
    2 │ 0.995215   0.00189611   0.000918415     0.993425     0.997008
    3 │ 0.992151   0.00307874   0.0011755       0.989868     0.994439
    4 │ 0.988751   0.00342711   0.00140734      0.986027     0.991482
    5 │ 0.98602    0.00276208   0.00157124      0.982988     0.989061
    6 │ 0.983288   0.00277059   0.00172041      0.979978     0.986609
    7 │ 0.981728   0.00158595   0.00180287      0.978265     0.985203
    8 │ 0.976983   0.00483311   0.00202379      0.973116     0.980866
  ⋮   │    ⋮           ⋮             ⋮            ⋮            ⋮
 1820 │ 0.441142  -0.00011516   0.017137        0.426571     0.45621
 1821 │ 0.441193  -0.000115156  0.017137        0.42662      0.456263
 1822 │ 0.441243  -0.000115176  0.017137        0.426669     0.456316
 1823 │ 0.441294  -0.000115213  0.017137        0.426718     0.456368
 1824 │ 0.441142   0.00034624   0.0171432       0.426565     0.456216
 1825 │ 0.440989   0.000346509  0.0171494       0.426412     0.456063
 1826 │ 0.440632   0.000808582  0.0171618       0.426057     0.455706
                                                     1811 rows omitted

Similarily, the EdererII method, also known as the conditional method, shows that at the $5$ year mark, the survival probability is of $0.44$ in this hypothetical world.

crude_e2 = CrudeMortality(e2)
println(crude_e2.Mₒ[1826], " , ", crude_e2.Mₑ[1826], " , ", crude_e2.Mₚ[1826])
0.6368404579125496 , 0.5334518856851325 , 0.10338857222741712

Here, out of the 0.63 patients that have dued, 0.53 are due to colorectal cancer and 0.1 due to other causes.

pp = fit(PoharPerme, @formula(Surv(time5,status5)~1), colrec, slopop)
PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
 1826×5 DataFrame
  Row  Sₑ        ∂Λₑ           σₑ           lower_95_CI  upper_95_CI  Float64   Float64       Float64      Float64      Float64     
──────┼───────────────────────────────────────────────────────────────
    1 │ 0.997105   0.00289493   0.000710632     0.995717     0.998495
    2 │ 0.995214   0.0018967    0.0009186       0.993424     0.997007
    3 │ 0.992149   0.00307987   0.00117581      0.989865     0.994438
    4 │ 0.988748   0.00342788   0.0014077       0.986024     0.99148
    5 │ 0.986016   0.00276316   0.0015717       0.982983     0.989058
    6 │ 0.983283   0.00277109   0.00172089      0.979972     0.986605
    7 │ 0.981722   0.00158763   0.0018035       0.978258     0.985199
    8 │ 0.976973   0.00483737   0.00202472      0.973104     0.980858
  ⋮   │    ⋮           ⋮             ⋮            ⋮            ⋮
 1820 │ 0.44143   -0.000148858  0.0178917       0.426219     0.457184
 1821 │ 0.441496  -0.000148885  0.0178917       0.426282     0.457252
 1822 │ 0.441561  -0.000148937  0.0178917       0.426345     0.45732
 1823 │ 0.441627  -0.000149005  0.0178917       0.426409     0.457389
 1824 │ 0.441403   0.00050879   0.0179038       0.426182     0.457167
 1825 │ 0.441281   0.000276025  0.0179089       0.42606      0.457045
 1826 │ 0.440922   0.000813485  0.0179221       0.425702     0.456685
                                                     1811 rows omitted

We conclude for the Poher-Perme method, that in a world where cancer patients could only die due to cancer, only 41% of these patients would still be alive $5$ year after their diagnosis.

crude_pp = CrudeMortality(pp)
println(crude_pp.Mₒ[1826], " , ", crude_pp.Mₑ[1826], " , ", crude_pp.Mₚ[1826])
0.6458893614663985 , 0.5319347797905097 , 0.11395458167588875

Finally, for this estimator, we have that of the 0.64 patients that have died, 0.53 is due to colorectal cancer while 0.11 is due to other causes.

We will plot the Pohar Perme method only.

conf_int = confint(pp; level = 0.05)
lower_bounds = [lower[1] for lower in conf_int]
upper_bounds = [upper[2] for upper in conf_int]

p1 = plot(pp.grid, pp.Sₑ, ribbon=(pp.Sₑ - lower_bounds, upper_bounds - pp.Sₑ), xlab = "Time (days)", ylab = "Net survival", label = false)

p2 = plot(pp.grid, crude_pp.Mₑ, label = "Excess Mortality Rate")
p2 = plot!(pp.grid, crude_pp.Mₚ, label = "Population Mortality Rate")

plot(p1,p2)
Example block output

Looking at the graph, and the rapid dip it takes, it is evident that the first $5$ years are crucial and that the survival probability is highly affected in these years. Additionnally, the crude mortality graph allows us to see how much of this curve is due to the colorectacl cancer studied versus other undefined causes. It is clear that the large majority is due to the cancer.

Net survival with respect to covariates

We are now interested in comparing the different groups of patients defined by various covariates.

pp_sex = fit(PoharPerme, @formula(Surv(time5,status5)~sex), colrec, slopop)
pp_males = pp_sex[pp_sex.sex .== :male,:estimator][1]
pp_females = pp_sex[pp_sex.sex .== :female,:estimator][1]
PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
 1826×5 DataFrame
  Row  Sₑ        ∂Λₑ           σₑ          lower_95_CI  upper_95_CI  Float64   Float64       Float64     Float64      Float64     
──────┼──────────────────────────────────────────────────────────────
    1 │ 0.99675    0.00325029   0.00111873     0.994567     0.998938
    2 │ 0.994244   0.00251419   0.00149396     0.991337     0.997159
    3 │ 0.989871   0.0043981    0.00198035     0.986036     0.99372
    4 │ 0.987363   0.00253401   0.00221726     0.983081     0.991663
    5 │ 0.983733   0.00367643   0.00251903     0.978888     0.988602
    6 │ 0.980104   0.00368904   0.00278998     0.974759     0.985478
    7 │ 0.97871    0.00142212   0.00289239     0.973177     0.984274
    8 │ 0.97321    0.00561896   0.00324789     0.967035     0.979425
  ⋮   │    ⋮           ⋮            ⋮            ⋮            ⋮
 1820 │ 0.450196  -0.000129968  0.025504       0.428245     0.473272
 1821 │ 0.450254  -0.000129989  0.025504       0.428301     0.473333
 1822 │ 0.450313  -0.000130047  0.025504       0.428356     0.473395
 1823 │ 0.450371  -0.000130101  0.025504       0.428412     0.473456
 1824 │ 0.45043   -0.000130144  0.025504       0.428468     0.473518
 1825 │ 0.450071   0.000797147  0.0255209      0.428112     0.473156
 1826 │ 0.449584   0.00108151   0.0255496      0.427625     0.472671
                                                    1811 rows omitted

When comparing at time $1826$, we notice that the survival probability is slightly inferior for men than for women ($0.433 < 0.449$). It is also more probable for the women to die from other causes than the men seeing as $0.0255 > 0.025$. Still, the differences are minimal. Let's confirm this with the Grafféo log-rank test:

test_sex = fit(GraffeoTest, @formula(Surv(time5,status5)~sex), colrec, slopop)
Grafféo's log-rank-type-test
1×3 DataFrame
 Row  test_statistic  degrees_of_freedom  p_value   Float64         Int64               Float64  
─────┼──────────────────────────────────────────────
   1 │        0.54326                   1  0.461085

The p-value is indeed above $0.05$. We cannot reject the null hypothesis $H_0$ and thus we dismiss the differences between the two sexes.

As for the age, we will define two different groups: individuals aged 65 and above and those who are not.

colrec.age65 .= ifelse.(colrec.age .>= 65*365.241, :old, :young)
pp_age65 = fit(PoharPerme, @formula(Surv(time5,status5)~age65), colrec, slopop)
pp_young = pp_age65[pp_age65.age65 .== :young, :estimator][1]
pp_old = pp_age65[pp_age65.age65 .== :old, :estimator][1]
PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
 1826×5 DataFrame
  Row  Sₑ        ∂Λₑ           σₑ          lower_95_CI  upper_95_CI  Float64   Float64       Float64     Float64      Float64     
──────┼──────────────────────────────────────────────────────────────
    1 │ 0.996032   0.00396827   0.0010703      0.993944     0.998123
    2 │ 0.993167   0.00287628   0.00141168     0.990423     0.995919
    3 │ 0.988642   0.00455641   0.00181945     0.985122     0.992173
    4 │ 0.983839   0.0048581    0.00217215     0.979659     0.988036
    5 │ 0.980416   0.00347923   0.00239706     0.97582      0.985033
    6 │ 0.97644    0.00405544   0.00263427     0.971411     0.981494
    7 │ 0.974672   0.00180987   0.00273909     0.969454     0.979919
    8 │ 0.967646   0.00720934   0.0030984      0.961787     0.97354
  ⋮   │    ⋮           ⋮            ⋮            ⋮            ⋮
 1820 │ 0.40224   -0.000235167  0.0269335      0.381557     0.424044
 1821 │ 0.402335  -0.0002352    0.0269335      0.381647     0.424144
 1822 │ 0.402429  -0.000235279  0.0269335      0.381737     0.424244
 1823 │ 0.402524  -0.000235379  0.0269335      0.381826     0.424344
 1824 │ 0.402139   0.000955393  0.0269598      0.381442     0.42396
 1825 │ 0.401925   0.000534307  0.0269708      0.38123      0.423743
 1826 │ 0.401615   0.000770012  0.0269896      0.380922     0.423432
                                                    1811 rows omitted

Here, the difference between the two is much more important. In the first group, the individuals are aged under 65 and at $5$ years time, they have a $50.1$% chance of survival. On the other hand, the individuals aged 65 and up have a $40.1$% chance of survival.

It is also worth noting that their chances of dying from other causes is higher than the younger group, given their age.

When applying the Grafféo test, we get the results below:

test_age65 = fit(GraffeoTest, @formula(Surv(time5,status5)~age65), colrec, slopop)
Grafféo's log-rank-type-test
1×3 DataFrame
 Row  test_statistic  degrees_of_freedom  p_value      Float64         Int64               Float64     
─────┼─────────────────────────────────────────────────
   1 │        76.8388                   1  1.85495e-18

The p-value is well under $0.05$, meaning we reject the $H_0$ hypothesis and must admit there are differences between the individuals aged 65 and above and the others.

When plotting both we get:

conf_int_men = confint(pp_males; level = 0.05)
lower_bounds_men = [lower[1] for lower in conf_int_men]
upper_bounds_men = [upper[2] for upper in conf_int_men]

conf_int_women = confint(pp_females; level = 0.05)
lower_bounds_women = [lower[1] for lower in conf_int_women]
upper_bounds_women = [upper[2] for upper in conf_int_women]

conf_int_under65 = confint(pp_young; level = 0.05)
lower_bounds_under65 = [lower[1] for lower in conf_int_under65]
upper_bounds_under65 = [upper[2] for upper in conf_int_under65]

conf_int_65 = confint(pp_old; level = 0.05)
lower_bounds_65 = [lower[1] for lower in conf_int_65]
upper_bounds_65 = [upper[2] for upper in conf_int_65]

plot1 = plot(pp_males.grid, pp_males.Sₑ, ribbon=(pp_males.Sₑ - lower_bounds_men, upper_bounds_men - pp_males.Sₑ), xlab = "Time (days)", ylab = "Net survival", label = "men")
plot1 = plot!(pp_females.grid, pp_females.Sₑ, ribbon=(pp_females.Sₑ - lower_bounds_women, upper_bounds_women - pp_females.Sₑ), xlab = "Time (days)", ylab = "Net survival", label = "women")

plot2 = plot(pp_young.grid, pp_young.Sₑ, ribbon=(pp_young.Sₑ - lower_bounds_under65, upper_bounds_under65 - pp_young.Sₑ), xlab = "Time (days)", ylab = "Net survival", label = "Under 65")
plot2 = plot!(pp_old.grid, pp_old.Sₑ, ribbon=(pp_old.Sₑ - lower_bounds_65, upper_bounds_65 - pp_old.Sₑ), xlab = "Time (days)", ylab = "Net survival", label = "65 and up")

plot(plot1, plot2, layout = (1, 2))
Example block output

Visually, it is almost immediately understood that there are no worthy differences between the two sexes whereas the age65 variable seems to play a big role.

Estimated sample size and life expectancy

Given that the age group plays a significant role in the study, we will now estimate the sample size by yearly intervals in order to better compare the age groups.

elt, ess = nessie(@formula(Surv(time,status)~age65), colrec, slopop)
elt
2×2 DataFrame
Rowage65expected_life_time
SymbolFloat64
1young24.7882
2old10.2949

The expected life time for the younger patients is significatively higher than for older patients (24.78 years > 10.29 years).

hcat(ess[:,3]...)
23×2 Matrix{Float64}:
 2352.0   3619.0
 2323.53  3389.94
 2293.83  3168.85
 2262.95  2954.21
 2230.76  2746.24
 2197.39  2546.26
 2162.79  2352.33
 2127.0   2162.2
 2090.07  1974.6
 2052.34  1792.37
    ⋮     
 1846.31  1047.55
 1801.39   933.738
 1754.86   827.664
 1706.36   727.389
 1655.44   633.358
 1601.95   546.265
 1545.51   464.941
 1484.84   388.018
 1419.49   316.588

Finally, the table above represents yearly expected sample sizes for both age groups under 65 and above, with the second column representing the latter. We can see that the sample size decreases for the older patients in a much more dramatic way than for the younger ages.

Unsurprisingly, we can thus conclude that age plays an important role in the study.