Getting Started

In many population-based studies, the specific cause of death is unidentified, unreliable or even unavailable. Relative survival analysis addresses this scenario, previously unexplored in general survival analysis. Different methods were created with the aim to construct a consistant and reliable estimator for this purpose.

Net survival settings

Note first that, for any positive random variable $X$, we will use extensively the following functions (that each fully characterize the distribution of $X$):

Function symbol and definitionFunction Name
$S_X(t) = \mathbb P(X > t)$Survival function
$\Lambda_X(t) = -\ln S_X(t)$Cumulative Hazard function
$\lambda_X(t) = \partial \Lambda_X(t)$Instantaneous hazard function

Consider a study that consists of censored survival times from a specific cause. Such a study consists of several random objects:

Random variableNameIs it observed ?
$E$"Excess" lifetime
$P$"Population" lifetime
$O = E \wedge P$"Overall" lifetime
$C$"Censoring" time
$\mathbf D$Vector of covariates
$T = O \wedge C$Event time
$\Delta = \mathbf{1}\{T \leq C\}$Event status

It is important to note that we do not observe a a potential indicator $\mathbf{1}\{E \geq P\}$. This is one of the key differences between net survival and standard survival. The standard Net survival analysis solves this problem by assuming that the underlying times $E$ and $P$ are independent from each other.

One central hypothesis in net survival (on top of non-informative censoring) is that $P$ and $E$ are independent. This independence can be written in terms of hazard rates $\lambda_O(t) = \lambda_P(t) + \lambda_E(t)$, or in terms of survival functions $S_O(t) = S_P(t)S_E(t)$.

The population hazard for each individual $\lambda_{P_i}$ is usally drawn from a reference life table, and may depend on covariates $\mathbf D_i$ such as age and date, sex, country, race, etc... See the RateTables.jl package for more details on the potential covariates. On the other hand, the excess mortality is assumed to be i.i.d. between individuals and not to depend on covariates at all. Thus, we mostly omit these covariates from our notations.

Available Estimators

The estimation of net survival is usually discussed in terms of the estimation of the cumulative excess hazard $\Lambda_E(t)$ and/or the instantaneous hazard $\lambda_E = \partial\Lambda_E$. To describe the estimators, we use the following counting processes notations, similar to standard survival analysis(see e.g. [6] or [7]).

  • The uncensored event indicatrix $\partial N_i(t)$ for individual $i$ at time $t$
  • The total number of uncensored events process $\partial N(t) = \sum_i \partial N_i(t)$ at time $t$
  • The at-risk indicatrix $Y_i(t)$, for whether an individual is still at risk
  • The total number at risk process $Y(t) = \sum_i Y_i(t)$ at time $t$

With these definitions and assumptions in mind, we will now present the four different methods implemented in this package, commonly used in literature, to estimate the excess hazard function $\partial\Lambda_E(t)$ and its variance. Recall that we estimate the variance as $\sigma_E^2(t) = \int_{0}^t \partial\sigma_E^2(s)$.

NameReferenceProposed (partial) Excess Hazard $\partial\hat{\Lambda}_E(s)$Proposed (partial) Variance $\partial\hat{\sigma}_E^2(s)$
Ederer I[1]$\frac{\sum_i N_i(s)}{\sum_i Y_i(s)} - \frac{\sum_i S_{P_i}(s)\partial\Lambda_{P_i}(s)}{\sum_i S_{P_i}(s)}$$\frac{\sum_i N_i(s)}{\left(\sum_i Y_i(s)\right)^2}$
Ederer II[2]$\frac{\sum_i N_i(s)}{\sum_i Y_i(s)} - \frac{\sum_i Y_i(s)\partial\Lambda_{P_i}(s)}{\sum_i Y_i(s)}$$\frac{\sum_i N_i(s)}{\left(\sum_i Y_i(s)\right)^2}$
Hakulinen[3]$\frac{\sum_i N_i(s)}{\sum_i Y_i(s)} - \frac{\sum_i \frac{Y_i(s)}{ S_{P_i}(s)}\partial\Lambda_{P_i}(s)}{\sum_i \frac{Y_i(s)}{ S_{P_i}(s)}}$$\frac{\sum_i N_i(s)}{\left(\sum_i Y_i(s)\right)^2}$
Pohar Perme[4]$\frac{\sum_i \frac{\partial N_i(s)}{S_{P_i}(s)} - \sum_i \frac{Y_i(s)}{S_{P_i}(s)}\partial\Lambda_{P_i}(s)}{\sum_i \frac{Y_i(s)}{S_{P_i}(s)}}$$\frac{\sum_{i=1}^n \frac{\partial N_i(s)}{S^2_{P_i}}}{\left(\sum_i \frac{Y_i(s)}{S_{p_i}(s)}\right)^2}$

where, in the variances, it is understood that when no more individuals are at risk $0/0$ gives $0$.

The Pohar Perme estimator [4] is the newest addition to relative survival analysis between the four methods, particularly designed to handle situations where covariates may change over time. It is trusted from the field (see e.g. [8] and [9]) that only this estimator should really be used, the other ones being included mostly for historical reasons and comparisons.

NetSurvival.PoharPermeType
PoharPerme

This method estimates net survival probabilities by applying the following estimation:

\[\partial\hat{\Lambda}_E(t) = \frac{\sum_i \frac{dN_i(u)}{S_{P_i}(u)} - \sum_i \frac{Y_i(u)}{S_{P_i}(u)}d\Lambda_{P_i}(u)}{\sum_i \frac{Y_i(u)}{S_{P_i}(u)}}\]

To fit the Pohar Perme to your data based on a certain rate table, apply the example below to your code :

fit(PoharPerme, @formula(Surv(time,status)~covariable1 + covariable2), data, ratetable)

References:

  • [4] Perme, Maja Pohar and Stare, Janez and Estève, Jacques (2012). On Estimation in Relative Survival.
source
NetSurvival.EdererIType
EdererI

This method estimates net survival probabilities by applying the following estimation:

\[\partial\hat{\Lambda}_E(t) = \frac{\sum_i N_i(u)}{\sum_i Y_i(u)} - \frac{\sum_i S_{P_i}(u)d\Lambda_{P_i}(u)}{\sum_i S_{P_i}(u)}\]

To call this function:

fit(EdererI, @formula(Surv(time,status)~covariable1 + covariable2), data, ratetable)
source
NetSurvival.EdererIIType
EdererII

This method estimates net survival probabilities by applying the following estimation:

\[\partial\hat{\Lambda}_E(t) = \frac{\sum_i N_i(u)}{\sum_i Y_i(u)} - \frac{\sum_i Y_i(u)d\Lambda_{P_i}(u)}{\sum_i Y_i(u)}\]

To call this function:

fit(EdererII, @formula(Surv(time,status)~covariable1 + covariable2), data, ratetable)
source
NetSurvival.HakulinenType
Hakulinen

To call this function:

fit(Hakulinen, @formula(Surv(time,status)~covariable1 + covariable2), data, ratetable)
source

Crude Mortality

The crude mortality rate is the global mortality rate (from all causes of death) for a population. This measure does not use the cause of death information which, as we previously mentioned, can be unreliable and incomplete. It is given as:

\[M_E(t) = \int_0^t S_O(u-) \lambda_E(u)du\]

There exists a few estimators of this quantity, the most known one being the Cronin-Feuer estimator [10], given by:

\[\hat{M}_E(t) = \int_0^t \hat{S}_O(u-) \hat{\lambda}_E(u)du\]

where, traditionally, Kaplan-Meier is used to estimate the overall survival function $\hat{S}_O$ and Ederer II is used for the excess hazard rate $\hat{\lambda}_E$ and the population hazard rate $\hat{\lambda}_P$.

Our implementation is a bit more permissive, as any net survival estimators can be used for $\hat{\lambda}_E$. Of course, the default is still the original Ederer II which provides the original Cronin-Feuer estimator.

NetSurvival.CrudeMortalityType
CrudeMortality

This method calculates the crude mortality and presents both the excess mortality and population mortality rates.

The default Cronin-Feuer estimator can be fitted to data with the following interface:

fit(CrudeMortality, args...)

where the args are passed to fit(EdererII,args...) to compute the excess hazard.

A more direct syntax can be used, specifying directly the estimator for the excess hazard:

CrudeMortality(fit(EdererII,args...))

To use another excess hazard estimator, simply replace EdererII with the method of your choice (PoharPerme, EdererI, Hakulinen).

References:

  • [10] Cronin, Kathleen A and Feuer, Eric J (2000). Cumulative cause-specific mortality for cancer patients in the presence of other causes: a crude analogue of relative survival.
source

Grafféo Log-Rank Test

The Grafféo Log-Rank Test [5] was constructed as a complement to the Pohar Perme estimator, aiming to compare the net survival functions provided by the latter. The test is designed to compare these functions across multiple groups, including stratified covariables, and to ultimately determine, with the given p-value, which covariables are impactful to the study.

The null $(H_0)$ hypothesis tests the following assumption:

\[\forall t \in [0,T], \; \; \Lambda_{E,g_1}(t) = \Lambda_{E,g_2}(t) = ... = \Lambda_{E,g_k}(t),\]

where $G = \{g_1,...,g_k\}$ is a partition of $1,...,n$ consisting of disjoint groups of individuals that we wish to compare to each other. For all group $g \in G$, let's denote the numerator and denominator of the Pohar Perme (partial) excess hazard estimators, restricted to individuals in the group, by:

  • $\partial N_{E,g}(s) = \sum_{i \in g} \frac{\partial N_i(s)}{S_{P_i}(s)} - \frac{Y_i(s)}{S_{P_i}(s)}\partial\Lambda_{P_i}(s)$
  • $Y_{E,g}(s) = \sum_{i \in g} \frac{Y_i(s)}{S_{P_i}(s)}$
  • $R_{g}(s) = \frac{Y_{E,g}(s)}{\sum_{g\in G} Y_{E,g}(s)}$

Then, define the vector $\mathbf Z = \left(Z_{g_r}: \; r \in 1,...,k-1 \right)$ with entries:

\[Z_g(T) = N_{E,g}(s) - \int_{0}^T Y_{E,g}(s) \partial\hat{\Lambda}_E(s)\]

The test statistic is then given by:

\[U(T) = \mathbf Z(T)'\hat{\Sigma}_Z^{-1} \mathbf Z(T)\]

where the entries of the $\hat{\Sigma}_Z$ matrix are given by:

\[\sigma_{g,h}(T) = \int_0^T \sum_{\ell \in G} \left(\delta_{g,\ell} - R_g(t) \right)\left(\delta_{h,\ell} - R_h(t)\right) \left(\sum_{i\in\ell} \frac{\partial N_i(s)}{S^2_{P_i}}\right)\]

Under $H_0$, the statistic $U(T)$ is asymptotically $\chi^2(k-1)$-distributed. We thus reject the $H_0$ hypothesis when the p-value obtained is under $0.05$, admitting the notable difference between the groups.

NetSurvival.GraffeoTestType
GraffeoTest

The Grafféo test is a log-rank type test and is typically used in net survival analysis to determine the impact of certain covariates in the study.

The null $(H_0)$ hypothesis tests the following assumption:

\[\forall t \in [0,T], \Lambda_{E,g_1}(t) = \Lambda_{E,g_2}(t) = ... = \Lambda_{E,g_k}(t)\]

where $G = {g_1,...,g_k}$ is a partition of $1,...,n$ consisting of disjoint groups of individuals that we wish to compare to each other. For all group $g \in G$, let's denote the numerator and denominator of the Pohar Perme (partial) excess hazard estimators, restricted to individuals in the group, by:

  • $\partial N_{E,g}(s) = \sum_{i \in g} \frac{\partial N_i(s)}{S_{P_i}(s)} - \frac{Y_i(s)}{S_{P_i}(s)}\partial\Lambda_{P_i}(s)$
  • $Y_{E,g}(s) = \sum_{i \in g} \frac{Y_i(s)}{S_{P_i}(s)}$
  • $R_{g}(s) = \frac{Y_{E,g}(s)}{\sum_{g\in G} Y_{E,g}(s)}$

Then, define the vector $\mathbf Z = \left(Z_{g_r}: r \in 1,...,k-1 \right)$ with entries:

\[Z_g(T) = N_{E,g}(s) - \int_{0}^T Y_{E,g}(s) \partial\hat{\Lambda}_E(s)\]

The test statistic is then given by:

\[U(T) = \mathbf Z(T)'\hat{\Sigma}_Z^{-1} \mathbf Z(T)\]

where the entries of the $\hat{\Sigma}_Z$ matrix are given by:

\[\sigma_{g,h}(T) = \int_0^T \sum_{\ell \in G} \left(\delta_{g,\ell} - R_g(t) \right)\left(\delta_{h,\ell} - R_h(t)\right) \left(\sum_{i\in\ell} \frac{\partial N_i(s)}{S^2_{P_i}}\right)\]

Under $H_0$, the statistic $U(T)$ is asymptotically $\chi^2(k-1)$-distributed.

To apply the test to your data based on a certain rate table, apply the example below to your code :

fit(GraffeoTest, @formula(Surv(time,status)~covariable1 + covariable2), data, ratetable)

If you wish to stratify a covariate:

fit(GraffeoTest, @formula(Surv(time,status)~covariable1 + Strata(covariable2)), data, ratetable)

The produced test statistics is supposed to follow a chi squared distribution under $(H_0)$. You can fetch the results using the .stat, .df and .pval fields of the returned object.

References:

  • [5] Grafféo, Nathalie and Castell, Fabienne and Belot, Aurélien and Giorgi, Roch (2016). A Log-Rank-Type Test to Compare Net Survival Distributions.
source

Nessie

The Nessie function estimates the sample size by yearly intervals as well as averages an estimated lifespan left for a given group.

This function is highly dependant on the Life function taken from the RateTables.jl package which you can find documented here.

The sample size is thus taken by the following formula:

\[ESS(t) = \sum_i^N S_{P_i}(t)\]

While the estimated lifepsan is directly taken from the expectation function.

NetSurvival.nessieFunction
nessie

To call this function, use the formula below:

nessie(@formula(Surv(time,status)~covariate), data, ratetable)
source

References

[1]
F. Ederer. The relative survival rate: a statistical methodology. Natl. Cancer Inst. Monogr. 6, 101–121 (1961).
[2]
F. Ederer and H. Heise. The effect of eliminating deaths from cancer in general survival rates, methodological notes 11. End Result Evaluation Section, National Cancer Institute (1959).
[3]
T. Hakulinen. On long-term relative survival rates. Journal of Chronic Diseases 30, 431–443 (1977).
[4]
M. P. Perme, J. Stare and J. Estève. On Estimation in Relative Survival. Biometrics 68, 113–120 (2011).
[5]
N. Grafféo, F. Castell, A. Belot and R. Giorgi. A Log-Rank-Type Test to Compare Net Survival Distributions. Biometrics 72, 760–769 (2016).
[6]
T. R. Fleming and D. P. Harrington. Counting Processes and Survival Analysis. Vol. 625 (John Wiley & Sons, 2013).
[7]
P. K. Andersen, O. Borgan, R. D. Gill and N. Keiding. Statistical Models Based on Counting Processes. Springer Series in Statistics (Springer US, New York, NY, 1993).
[8]
M. P. Perme and K. Pavlic. Nonparametric relative survival analysis with the R package relsurv. Journal of Statistical Software 87, 1–27 (2018).
[9]
H. Charvat and A. Belot. Mexhaz: An R package for fitting flexible hazard-based regression models for overall and excess mortality with a random effect. Journal of Statistical Software 98, 1–36 (2021).
[10]
K. A. Cronin and E. J. Feuer. Cumulative cause-specific mortality for cancer patients in the presence of other causes: a crude analogue of relative survival. Statistics in medicine 19, 1729–1740 (2000).