Case study

Introduction and datasets

We will illustrate with an example using the dataset colrec, which comprises $5971$ patients diagnosed with colon or rectal cancer between $1994$ and $2000$. This dataset is sourced from the Slovenia cancer registry. Given the high probability that the patients are Slovenian, we will be using the Slovenian mortality table slopop as reference for the population mortality rates. Subsequently, we can apply various non-parametric estimators for net survival analysis.

N.B.

Mortality tables may vary in structure, with options such as the addition or removal of specific covariates. To confirm that the mortality table is in the correct format, please refer to RateTables.jl's documentation, or directly extract it from there.

Cohort details

The patients in the study are diagnosed between January 1st $1994$ and December 31st $2000$. Before we move on to the survival probabilities, it is important to be aware of how your data is distributed and of what it comprises.

using NetSurvival, RateTables, DataFrames
first(colrec,10)

10×7 DataFrame

Row	time	status	age	year	sex	stage	site
	Float64	Bool	Float64	Float64	Symbol	Int64	Symbol
1	16.0	false	23004.0	728528.0	male	1	rectum
2	504.0	false	12082.0	729260.0	female	3	rectum
3	22.0	false	24277.0	728583.0	male	3	colon
4	3998.0	false	29256.0	729843.0	female	1	colon
5	9.0	false	30416.0	728869.0	female	99	colon
6	88.0	false	15156.0	729686.0	male	2	colon
7	7769.0	false	23953.0	728713.0	female	1	rectum
8	7366.0	false	20775.0	728867.0	male	1	rectum
9	6480.0	false	17555.0	730141.0	female	1	rectum
10	6449.0	false	20329.0	730137.0	female	1	rectum

Let's explore how our data is distributed first, starting with the age variable.

println(minimum(colrec.age./365.241), " ; ",maximum(colrec.age./365.241))

12.482169307388821 ; 96.7169622249419

The study can be considered diverse in terms of age seeing as the patients are between 12 and 96 years old, approximately.

using Plots
plot(histogram(colrec.age./365.241, label="Age"),
    histogram(colrec.time./365.241, label="Follow-up time"))

The graph above show us that although the dataset has a wide range of patients within all age groups, it is mostly centered around older adults and elderly, with the majority of the patients being between $60$ and $80$ years old. Looking at the second graph that details the distribution of the follow-up times. We notice there that the values quickly drop. Unfortunately, this is a common theme in cancer studies.

Let's take a look at the sex variable now, by looking at the number of male and female patients:

combine(groupby(colrec, :sex), nrow)

2×2 DataFrame

Row	sex	nrow
	Symbol	Int64
1	male	3289
2	female	2682

There isn't too big of a difference between the two. We can say this study includes both gender relatively equally, thus, reducing bias. With these two observations, it is also worth noting that colorectal cancer is most common with men and people older than $50$.

In total, we note that we have $5971$ patients. By taking a look at the status variable, we can determine the deaths and censorship:

sum(colrec.status)

Out of the $5971$ patients, $4979$ have died. This is a very high mortality rate, and again, unfortunately common in cancer studies.

(nrow(colrec) - sum(colrec.status)) / nrow(colrec)

0.16613632557360575

In other terms, the censorship rate is of 16.6%, meaning the event, in this case death, was not observed for only 16.6% of the patients. This is a low censorship rate and thus the quality of the signal will be pretty good.

Mortality table

We will be using the mortality table slopop taken from the RateTables.jl package as the study is done on Slovenian patients.

slopop

RateTable(:sex,)

The show method of the RateTable class shows the additional covariate sex that the rate table has on top of the (mandatory) age and year variables. The sex variable has two madalities, :male and :female. The ratetable is then three dimensional. For example, the daily hazard rate for a woman turning $45$ on the January 1st $2006$ can be accessed through the following command:

λ  = daily_hazard(slopop, 45*365.241, 2006*365.241; sex=:female)

2.02680951796113e-6

Making the daily survival probability easily calculated with:

exp(-λ)

0.999997973192536

Overall and expected survival

For this part, we will be using the Survival.jl package to apply the Kaplan Meier estimator for the overall survival.

using Survival
km = fit(KaplanMeier, colrec.time./365.241, colrec.status)
plot(km.events.time, km.survival, label=false, title = "Kaplan-Meier Estimator for the Overall Survival")

The graph above indicates a significant dip in survival probability within the first $5$ years, and even more at $10$ years. This period in the study is then crucial for the analysis.

Estimated net survival

We will restrict ourselves to the first $5$ years of the study. For that, let us re-censor the dataset as follows:

for i in 1:nrow(colrec)
    if colrec.time[i] > 1826 # five years
        colrec.status[i] = false
        colrec.time[i] = 1826
    end
end

We can now apply the different non-parametric methods to compute the relative survival.

e1 = fit(EdererI, @formula(Surv(time,status)~1), colrec, slopop)

EdererI(t ∈ (1.0, 1826.0)) with summary stats:
 1826×5 DataFrame
  Row │ Sₑ        ∂Λₑ           σₑ           lower_95_CI  upper_95_CI 
      │ Float64   Float64       Float64      Float64      Float64     
──────┼───────────────────────────────────────────────────────────────
    1 │ 0.997105   0.00289458   0.000710541     0.995718     0.998495
    2 │ 0.995215   0.00189576   0.000918415     0.993425     0.997008
    3 │ 0.992152   0.00307809   0.0011755       0.989869     0.99444
    4 │ 0.988753   0.00342609   0.00140734      0.986029     0.991484
    5 │ 0.986023   0.00276087   0.00157124      0.982991     0.989064
    6 │ 0.983292   0.00276916   0.00172041      0.979982     0.986614
    7 │ 0.981734   0.00158443   0.00180287      0.978271     0.98521
    8 │ 0.976991   0.00483135   0.00202379      0.973124     0.980874
  ⋮   │    ⋮           ⋮             ⋮            ⋮            ⋮
 1820 │ 0.456957  -0.000133259  0.0171307       0.441869     0.47256
 1821 │ 0.457018  -0.000133262  0.0171307       0.441928     0.472623
 1822 │ 0.457079  -0.000133307  0.0171307       0.441987     0.472686
 1823 │ 0.457139  -0.000133276  0.0171307       0.442045     0.472749
 1824 │ 0.456989   0.000328197  0.017137        0.441895     0.472599
 1825 │ 0.456839   0.000328384  0.0171432       0.441745     0.47245
 1826 │ 0.456478   0.000790464  0.0171556       0.441385     0.472088
                                                     1811 rows omitted

With the EdererI method, after $1826$ days have passed, we can say that the survival rate at this mark is around $0.456$, in the hypothetical world where patients can only die of cancer.

crude_e1 = CrudeMortality(e1)
println(crude_e1.Mₒ[1826], " , ", crude_e1.Mₑ[1826], " , ", crude_e1.Mₚ[1826])

0.636476800600479 , 0.515341453104102 , 0.121135347496377

Out of the 0.63 patients that have died, according to the EdererI method, 0.51 died because of colorectal cancer and 0.12 died of other causes.

e2 = fit(EdererII, @formula(Surv(time,status)~1), colrec, slopop)

EdererII(t ∈ (1.0, 1826.0)) with summary stats:
 1826×5 DataFrame
  Row │ Sₑ        ∂Λₑ           σₑ           lower_95_CI  upper_95_CI 
      │ Float64   Float64       Float64      Float64      Float64     
──────┼───────────────────────────────────────────────────────────────
    1 │ 0.997105   0.00289456   0.000710541     0.995718     0.998495
    2 │ 0.995215   0.00189611   0.000918415     0.993425     0.997008
    3 │ 0.992151   0.00307874   0.0011755       0.989868     0.994439
    4 │ 0.988751   0.00342711   0.00140734      0.986027     0.991482
    5 │ 0.98602    0.00276208   0.00157124      0.982988     0.989061
    6 │ 0.983288   0.00277059   0.00172041      0.979978     0.986609
    7 │ 0.981728   0.00158595   0.00180287      0.978265     0.985203
    8 │ 0.976983   0.00483311   0.00202379      0.973116     0.980866
  ⋮   │    ⋮           ⋮             ⋮            ⋮            ⋮
 1820 │ 0.441589  -0.00011516   0.0171307       0.427008     0.456667
 1821 │ 0.44164   -0.000115156  0.0171307       0.427058     0.45672
 1822 │ 0.441691  -0.000115176  0.0171307       0.427107     0.456772
 1823 │ 0.441741  -0.000115213  0.0171307       0.427156     0.456825
 1824 │ 0.441588   0.00034624   0.017137        0.427003     0.456672
 1825 │ 0.441435   0.000346509  0.0171432       0.42685      0.45652
 1826 │ 0.441079   0.000808582  0.0171556       0.426494     0.456162
                                                     1811 rows omitted

Similarly, the EdererII method, also known as the conditional method, shows that at the $5$ year mark, the survival probability is of $0.44$ in this hypothetical world.

crude_e2 = CrudeMortality(e2)
println(crude_e2.Mₒ[1826], " , ", crude_e2.Mₑ[1826], " , ", crude_e2.Mₚ[1826])

0.636476800600479 , 0.5329969626451269 , 0.10347983795535214

Here, out of the 0.63 patients that have died, 0.53 are due to colorectal cancer and 0.1 due to other causes.

pp = fit(PoharPerme, @formula(Surv(time,status)~1), colrec, slopop)

PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
 1826×5 DataFrame
  Row │ Sₑ        ∂Λₑ           σₑ           lower_95_CI  upper_95_CI 
      │ Float64   Float64       Float64      Float64      Float64     
──────┼───────────────────────────────────────────────────────────────
    1 │ 0.997105   0.00289493   0.000710632     0.995717     0.998495
    2 │ 0.995214   0.0018967    0.0009186       0.993424     0.997007
    3 │ 0.992149   0.00307987   0.00117581      0.989865     0.994438
    4 │ 0.988748   0.00342788   0.0014077       0.986024     0.99148
    5 │ 0.986016   0.00276316   0.0015717       0.982983     0.989058
    6 │ 0.983283   0.00277109   0.00172089      0.979972     0.986605
    7 │ 0.981722   0.00158763   0.0018035       0.978258     0.985199
    8 │ 0.976973   0.00483737   0.00202472      0.973104     0.980858
  ⋮   │    ⋮           ⋮             ⋮            ⋮            ⋮
 1820 │ 0.44187   -0.000148858  0.017886        0.426649     0.457635
 1821 │ 0.441936  -0.000148885  0.017886        0.426712     0.457703
 1822 │ 0.442002  -0.000148937  0.017886        0.426776     0.457772
 1823 │ 0.442068  -0.000149005  0.017886        0.426839     0.45784
 1824 │ 0.441843   0.00050879   0.0178981       0.426612     0.457618
 1825 │ 0.441721   0.000276025  0.0179032       0.42649      0.457496
 1826 │ 0.441362   0.000813485  0.0179164       0.426132     0.457136
                                                     1811 rows omitted

We conclude for the Pohar Perme method, that in a world where cancer patients could only die due to cancer, only 41% of these patients would still be alive $5$ year after their diagnosis. The Pohar Perme estimator is the best estimator of the excess hazard under the standard hypotheses.

crude_pp = CrudeMortality(pp)
println(crude_pp.Mₒ[1826], " , ", crude_pp.Mₑ[1826], " , ", crude_pp.Mₚ[1826])

0.6455403791714492 , 0.5314853737936457 , 0.1140550053778036

Finally, for this estimator, we have that of the 0.64 patients that have died, 0.53 is due to colorectal cancer while 0.11 is due to other causes.

We will plot the Pohar Perme method only.

function mkribbon(pp)
    S = pp.Sₑ
    ci = confint(pp; level = 0.05)
    l,u = getindex.(ci, 1), getindex.(ci,2)
    rb = (S - l, u - S)
    return rb
end

p1 = plot(pp.grid, pp.Sₑ, ribbon=mkribbon(pp), xlab = "Time (days)", ylab = "Net survival", label = false)

p2 = plot(pp.grid, crude_pp.Mₑ, label = "Excess Mortality Rate")
p2 = plot!(pp.grid, crude_pp.Mₚ, label = "Population Mortality Rate")

plot(p1,p2)

Looking at the graph, and the rapid dip it takes, it is evident that the first $5$ years are crucial and that the survival probability is highly affected in these years. Additionally, the crude mortality graph allows us to see how much of this curve is due to the colorectal cancer studied versus other undefined causes. It is clear that the large majority is due to the cancer.

Net survival with respect to covariates

We are now interested in comparing the different groups of patients defined by various covariates.

pp_sex = fit(PoharPerme, @formula(Surv(time,status)~sex), colrec, slopop)
pp_males = pp_sex[pp_sex.sex .== :male,:estimator][1]
pp_females = pp_sex[pp_sex.sex .== :female,:estimator][1]

PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
 1826×5 DataFrame
  Row │ Sₑ        ∂Λₑ           σₑ          lower_95_CI  upper_95_CI 
      │ Float64   Float64       Float64     Float64      Float64     
──────┼──────────────────────────────────────────────────────────────
    1 │ 0.99675    0.00325029   0.00111873     0.994567     0.998938
    2 │ 0.994244   0.00251419   0.00149396     0.991337     0.997159
    3 │ 0.989871   0.0043981    0.00198035     0.986036     0.99372
    4 │ 0.987363   0.00253401   0.00221726     0.983081     0.991663
    5 │ 0.983733   0.00367643   0.00251903     0.978888     0.988602
    6 │ 0.980104   0.00368904   0.00278998     0.974759     0.985478
    7 │ 0.97871    0.00142212   0.00289239     0.973177     0.984274
    8 │ 0.97321    0.00561896   0.00324789     0.967035     0.979425
  ⋮   │    ⋮           ⋮            ⋮            ⋮            ⋮
 1820 │ 0.450638  -0.000129968  0.0254941      0.428674     0.473727
 1821 │ 0.450697  -0.000129989  0.0254941      0.42873      0.473789
 1822 │ 0.450755  -0.000130047  0.0254941      0.428786     0.473851
 1823 │ 0.450814  -0.000130101  0.0254941      0.428841     0.473912
 1824 │ 0.450873  -0.000130144  0.0254941      0.428897     0.473974
 1825 │ 0.450513   0.000797147  0.025511       0.428541     0.473612
 1826 │ 0.450026   0.00108151   0.0255398      0.428054     0.473126
                                                    1811 rows omitted

When comparing at time $1826$, we notice that the survival probability is slightly inferior for men than for women ($0.433 < 0.449$). It is also more probable for the women to die from other causes than the men seeing as $0.0255 > 0.025$. Still, the differences are minimal. Let's confirm this with the Grafféo log-rank test:

test_sex = fit(GraffeoTest, @formula(Surv(time,status)~sex), colrec, slopop)

Grafféo's log-rank-type-test (Method: NetSurvival.PoharPermeMethod())
1×3 DataFrame
 Row │ test_statistic  degrees_of_freedom  p_value  
     │ Float64         Int64               Float64  
─────┼──────────────────────────────────────────────
   1 │       0.533424                   1  0.465171

The p-value is indeed above $0.05$. We cannot reject the null hypothesis $H_0$ and thus we dismiss the differences between the two sexes.

As for the age, we will define two different groups: individuals aged 65 and above and those who are not.

colrec.age65 .= ifelse.(colrec.age .>= 65*365.241, :old, :young)
pp_age65 = fit(PoharPerme, @formula(Surv(time,status)~age65), colrec, slopop)
pp_young = pp_age65[pp_age65.age65 .== :young, :estimator][1]
pp_old = pp_age65[pp_age65.age65 .== :old, :estimator][1]

PoharPerme(t ∈ (1.0, 1826.0)) with summary stats:
 1826×5 DataFrame
  Row │ Sₑ        ∂Λₑ           σₑ          lower_95_CI  upper_95_CI 
      │ Float64   Float64       Float64     Float64      Float64     
──────┼──────────────────────────────────────────────────────────────
    1 │ 0.996032   0.00396827   0.0010703      0.993944     0.998123
    2 │ 0.993167   0.00287628   0.00141168     0.990423     0.995919
    3 │ 0.988642   0.00455641   0.00181945     0.985122     0.992173
    4 │ 0.983839   0.0048581    0.00217215     0.979659     0.988036
    5 │ 0.980416   0.00347923   0.00239706     0.97582      0.985033
    6 │ 0.97644    0.00405544   0.00263427     0.971411     0.981494
    7 │ 0.974672   0.00180987   0.00273909     0.969454     0.979919
    8 │ 0.967646   0.00720934   0.0030984      0.961787     0.97354
  ⋮   │    ⋮           ⋮            ⋮            ⋮            ⋮
 1820 │ 0.402479  -0.000235167  0.0269303      0.381786     0.424294
 1821 │ 0.402574  -0.0002352    0.0269303      0.381876     0.424393
 1822 │ 0.402669  -0.000235279  0.0269303      0.381966     0.424493
 1823 │ 0.402763  -0.000235379  0.0269303      0.382056     0.424593
 1824 │ 0.402379   0.000955393  0.0269566      0.381671     0.424209
 1825 │ 0.402164   0.000534307  0.0269676      0.381459     0.423992
 1826 │ 0.401854   0.000770012  0.0269863      0.381151     0.423681
                                                    1811 rows omitted

Here, the difference between the two is much more important. In the first group, the individuals are aged under 65 and at $5$ years time, they have a $50.1$% chance of survival. On the other hand, the individuals aged 65 and up have a $40.1$% chance of survival.

It is also worth noting that their chances of dying from other causes is higher than the younger group, given their age.

When applying the Grafféo test, we get the results below:

test_age65 = fit(GraffeoTest, @formula(Surv(time,status)~age65), colrec, slopop)

Grafféo's log-rank-type-test (Method: NetSurvival.PoharPermeMethod())
1×3 DataFrame
 Row │ test_statistic  degrees_of_freedom  p_value     
     │ Float64         Int64               Float64     
─────┼─────────────────────────────────────────────────
   1 │        77.4381                   1  1.36951e-18

The p-value is well under $0.05$, meaning we reject the $H_0$ hypothesis and must admit there are differences between the individuals aged 65 and above and the others.

When plotting both we get:

plot1 = plot(pp_males.grid, pp_males.Sₑ, ribbon=mkribbon(pp_males), xlab = "Time (days)", ylab = "Net survival", label = "men")
plot1 = plot!(plot1, pp_females.grid, pp_females.Sₑ, ribbon=mkribbon(pp_females), label = "women")

plot2 = plot(pp_young.grid, pp_young.Sₑ, ribbon=mkribbon(pp_young), xlab = "Time (days)", ylab = "Net survival", label = "Under 65")
plot2 = plot!(pp_old.grid, pp_old.Sₑ, ribbon=mkribbon(pp_old), label = "65 and up")

plot(plot1, plot2, layout = (1, 2))

Visually, it is almost immediately understood that there are no worthy differences between the two sexes whereas the age65 variable seems to play a big role.

The same kind of graph can be made on the stage:

pp3 = fit(PoharPerme, @formula(Surv(time,status)~stage), colrec, slopop)
plot3 = plot(xlab = "Time (days)", ylab = "Net survival",title="Net survival per cancer stages")
for i in 1:nrow(pp3)
    e = pp3[i,:estimator]
    plot!(plot3, e.grid, e.Sₑ, ribbon=mkribbon(e), label=pp3[i,:stage])
end
plot(plot3)

Estimated sample size and life expectancy

Given that the age group plays a significant role in the study, we will now estimate the sample size by yearly intervals in order to better compare the age groups.

elt, ess = nessie(@formula(Surv(time,status)~age65), colrec, slopop)
elt

2×2 DataFrame

Row	age65	expected_life_time
	Symbol	Float64
1	young	25.1699
2	old	10.3289

The expected life time for the younger patients is significatively higher than for older patients (24.78 years > 10.29 years).

hcat(ess[:,3]...)

5×2 Matrix{Float64}:
 2352.0   3619.0
 2323.53  3389.94
 2293.83  3168.85
 2262.95  2954.21
 2230.76  2746.24

Finally, the table above represents yearly expected sample sizes for both age groups under 65 and above, with the second column representing the latter. We can see that the sample size decreases for the older patients in a much more dramatic way than for the younger ages.

Unsurprisingly, we can thus conclude that age plays an important role in the study.