Survival Analysis Fundamentals

This page covers the statistical theory behind the Survival Analysis tabs; see the Survival Analysis page for usage instructions.

Time-to-Event Data and Censoring

Survival analysis is a set of methods for analyzing time until an event occurs. Despite the name "survival," the event need not be death — it can be machine failure, customer churn, time to recidivism, or any event of interest.

The defining feature of survival data is censoring. Subjects who did not experience the event during the observation period (e.g., patients still alive at the end of a clinical trial, or patients lost to follow-up) carry only incomplete information: "the event had not occurred by at least this time."

Simply excluding censored observations biases the analysis toward subjects who experienced the event sooner, underestimating survival times. Treating censored observations as "no event" overstates survival times since the true event time is unknown. Survival analysis methods are designed to handle censoring properly, provided that censoring is independent of event occurrence (non-informative censoring). When censoring is related to the likelihood of the event — for example, when patients drop out due to worsening side effects — the KM estimator and Cox model estimates become biased. MIDAS handles right censoring only (censoring due to end of observation or loss to follow-up). Left censoring and interval censoring are not supported.
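A common way to encode right-censored data is as (duration, event) pairs. The sketch below is illustrative plain Python (the subjects are made up, and MIDAS's actual input format may differ):

```python
# Each subject is a (duration, event) pair: event=1 means the event was
# observed at that time; event=0 means the subject was right-censored,
# i.e. the event had not occurred by at least that time.
subjects = [
    (5.0, 1),   # event observed at t = 5
    (8.0, 0),   # censored at t = 8 (end of study or lost to follow-up)
    (12.0, 1),
    (3.0, 0),
]

n_events = sum(e for _, e in subjects)
n_censored = sum(1 - e for _, e in subjects)

# Dropping the censored rows would keep only the observed (often earlier)
# events; coding them as events would misstate the event count.
print(n_events, n_censored)  # 2 2
```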

Why Ordinary Regression Fails

Without censoring, survival times could be analyzed as a response variable in ordinary regression. But censored data provides inequality information — "the true value is at least as large as the observed value" — and the usual residual (y_i - ŷ_i) cannot be defined. Survival analysis incorporates this inequality into the likelihood function, correctly accounting for censoring.

Survival Function and Hazard Function

The distribution of survival time T is characterized by two functions.

The survival function S(t) = P(T > t) is the probability of not having experienced the event by time t. It starts at S(0) = 1 and decreases monotonically over time.

The hazard function h(t) is the instantaneous rate of event occurrence at time t, given survival up to that point:

h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}

The hazard is a rate (per unit time), not a probability, so it can exceed 1. The survival and hazard functions are related by S(t) = \exp\left(-\int_0^t h(u)\,du\right); knowing one determines the other.
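The relation can be checked numerically for a constant hazard h(u) = λ, where the integral is λt and so S(t) = exp(-λt). A small sketch (the rate and time point are arbitrary illustrative values):

```python
import math

# Numerical check of S(t) = exp(-integral of h(u) du from 0 to t) for a
# constant hazard h(u) = lam, whose cumulative hazard is lam * t.
lam = 0.3       # hazard rate per unit time (illustrative)
t = 2.0

# Riemann-sum approximation of the cumulative hazard on [0, t]
n = 10_000
du = t / n
cum_hazard = sum(lam * du for _ in range(n))   # = lam * t for constant h

S_numeric = math.exp(-cum_hazard)
S_closed = math.exp(-lam * t)
print(abs(S_numeric - S_closed) < 1e-9)  # True
```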

Kaplan-Meier Estimator

The Kaplan-Meier estimator is a nonparametric estimator of the survival function. It makes no distributional assumptions, estimating S(t) directly from observed event times.

Let the distinct event times be t_1 < t_2 < \cdots < t_k, with n_i subjects at risk and d_i events at each time t_i:

\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)

This cumulatively multiplies the "survival fraction" at each event time. Censoring is reflected through changes in the risk set: when a subject is censored, they leave the risk set but are not counted as an event.
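The product formula can be traced in a few lines of plain Python. The data below are illustrative (time, event) pairs, with event = 0 marking right censoring:

```python
# Minimal Kaplan-Meier sketch (no libraries). Censored subjects leave the
# risk set without contributing an event, exactly as described above.
data = [(2, 1), (3, 0), (4, 1), (4, 1), (5, 0), (7, 1)]

def kaplan_meier(data):
    """Return [(t_i, S_hat(t_i))] at each distinct event time."""
    event_times = sorted({t for t, e in data if e == 1})
    s, curve = 1.0, []
    for t_i in event_times:
        n_i = sum(1 for t, _ in data if t >= t_i)             # at risk at t_i
        d_i = sum(1 for t, e in data if t == t_i and e == 1)  # events at t_i
        s *= 1 - d_i / n_i                                    # survival fraction
        curve.append((t_i, s))
    return curve

print(kaplan_meier(data))
```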

Under non-informative censoring, the KM estimator is a consistent estimator of S(t). The variance is estimated using Greenwood's formula, derived via the delta method:

\widehat{\operatorname{Var}}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_i \le t} \frac{d_i}{n_i(n_i - d_i)}

Confidence intervals are constructed from this variance. MIDAS uses the log transformation method by default, computing \exp(\log \hat{S}(t) \pm z \cdot \mathrm{SE} / \hat{S}(t)). The log transformation keeps the lower limit above 0; an upper limit exceeding 1 is conventionally truncated to 1.
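A worked example of Greenwood's formula and the log-transformed interval, with made-up risk-set counts (not MIDAS output):

```python
import math

# Greenwood variance and log-transformed 95% CI at one time point, using
# illustrative (n_i, d_i) counts at the event times up to t.
steps = [(10, 1), (8, 2), (5, 1)]    # (n_i, d_i) at successive event times

s_hat, greenwood_sum = 1.0, 0.0
for n_i, d_i in steps:
    s_hat *= 1 - d_i / n_i
    greenwood_sum += d_i / (n_i * (n_i - d_i))

se = math.sqrt(s_hat**2 * greenwood_sum)   # Greenwood standard error

z = 1.96                                   # 95% confidence level
lo = math.exp(math.log(s_hat) - z * se / s_hat)
hi = min(1.0, math.exp(math.log(s_hat) + z * se / s_hat))  # truncate at 1
print(round(s_hat, 4), round(lo, 4), round(hi, 4))
```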

Log-rank Test

The log-rank test is a nonparametric test for whether two or more groups have equal hazard functions.

At each event time t_i, the observed event count d_{ij} and expected event count e_{ij} = n_{ij} d_i / n_i are computed for group j, where n_{ij} is the size of group j's risk set at time t_i. Under the null hypothesis, events at each time point are expected to be distributed according to each group's share of the risk set. This allocation follows a hypergeometric distribution, from which the variance is also derived. For two groups, since O_1 + O_2 = \sum d_i and E_1 + E_2 = \sum d_i, it follows that O_1 - E_1 = -(O_2 - E_2), so only one group's information is needed. The test statistic is:

\chi^2 = \frac{(O_1 - E_1)^2}{V_1}

where O_1 = \sum_i d_{1i} (total observed events in group 1), E_1 = \sum_i e_{1i} (total expected events), and V_1 = \sum_i \frac{n_{1i} n_{2i} d_i (n_i - d_i)}{n_i^2(n_i - 1)} (variance). The denominator is the variance derived from the hypergeometric distribution, not the expected value E_1. For three or more groups, this extends to the quadratic form \chi^2 = \mathbf{U}' \mathbf{V}^{-1} \mathbf{U} using the variance-covariance matrix. Under the null hypothesis, this statistic approximately follows a chi-squared distribution with g - 1 degrees of freedom (g = number of groups).
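A plain-Python sketch of the two-group statistic on made-up (time, event, group) data, accumulating O_1, E_1, and V_1 over the event times:

```python
# Two-group log-rank sketch. Each subject is (time, event, group) with
# group 0 or 1; the data are illustrative.
data = [
    (3, 1, 0), (5, 1, 0), (7, 0, 0), (9, 1, 0),
    (4, 1, 1), (6, 0, 1), (8, 1, 1), (10, 0, 1),
]

event_times = sorted({t for t, e, _ in data if e == 1})
O1 = E1 = V1 = 0.0
for t_i in event_times:
    at_risk = [(t, e, g) for t, e, g in data if t >= t_i]
    n_i = len(at_risk)
    n1 = sum(1 for _, _, g in at_risk if g == 0)
    n2 = n_i - n1
    d_i = sum(1 for t, e, _ in data if t == t_i and e == 1)
    d1 = sum(1 for t, e, g in data if t == t_i and e == 1 and g == 0)
    O1 += d1
    E1 += n1 * d_i / n_i                       # hypergeometric mean
    if n_i > 1:
        V1 += n1 * n2 * d_i * (n_i - d_i) / (n_i**2 * (n_i - 1))

chi2 = (O1 - E1) ** 2 / V1
print(round(chi2, 3))
```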

The log-rank test is a member of the weighted log-rank test family. It has maximum power compared to other weighted variants (such as Wilcoxon-type tests) when the hazard ratio is constant over time (under the proportional hazards assumption). Power decreases when survival curves cross.

Cox Proportional Hazards Model

Model Formulation

The Cox (1972) proportional hazards model is a semiparametric model that estimates the effect of covariates on hazard:

h(t \mid X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)

h_0(t) is the baseline hazard (the hazard when all covariates are zero), and exp(β_j) is the hazard ratio for a one-unit increase in covariate X_j.

It is called "semiparametric" because β is estimated parametrically, but no functional form is specified for h_0(t). This removes the need to assume a distribution for the baseline hazard. After estimating β, the baseline hazard h_0(t) can be estimated nonparametrically using the Breslow estimator, which in turn allows computing the survival function S(t|X) for specific covariate values. MIDAS outputs the baseline cumulative hazard H_0(t) and the adjusted survival curve S(t|X) at user-specified covariate values.

The covariates X in this model are fixed values for each subject throughout the observation period. Handling covariates that change over time (time-varying covariates) requires extensions that MIDAS does not currently support.

Proportional Hazards Assumption

The core assumption is that covariate effects are constant over time. That is, the hazard ratio h(t|X_1) / h(t|X_2) = exp(β'(X_1 - X_2)) does not depend on t.

When this assumption is violated (e.g., a treatment effect that fades over time), the estimated β represents a weighted average of the time-varying effect, with weights that depend on the risk set composition and baseline hazard, making interpretation difficult (Struthers & Kalbfleisch, 1986).

Schoenfeld Residuals

Schoenfeld residuals are used to assess the proportional hazards assumption. They are defined at each event time t_(i) for each covariate j:

r_{ij} = X_{ij} - \bar{X}_j(t_{(i)})

X_{ij} is the value of covariate j for the subject who experienced the event. X̄_j(t_(i)) is the weighted mean of covariate j over the risk set, defined as:

\bar{X}_j(t_{(i)}) = \frac{\sum_{k \in \mathcal{R}(t_{(i)})} X_{kj} \exp(X_k'\hat\beta)}{\sum_{k \in \mathcal{R}(t_{(i)})} \exp(X_k'\hat\beta)}

Here k ranges over subjects in the risk set R(t_(i)), X_k is the covariate vector for subject k, and X_k'β̂ is the linear predictor. The weight exp(X_k'β̂) is subject-specific — subjects with higher hazard contribute more — and is the same for all covariates j.

The sum of Schoenfeld residuals equals the score function (the gradient of the log partial likelihood with respect to β) (Schoenfeld, 1982). At the MLE β̂ the score function is theoretically zero, so the sum of residuals is also zero. In practice, Newton-Raphson terminates after finitely many iterations, so the sum is zero only up to convergence tolerance.
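This identity can be checked numerically. The sketch below (illustrative single-covariate data, not MIDAS's implementation) computes raw Schoenfeld residuals at a given β and then locates the β where they sum to zero:

```python
import math

# Raw Schoenfeld residuals for a single covariate on illustrative
# (time, event, x) data: the residual at each event time is the event
# subject's x minus the risk-set weighted mean of x, weights exp(x * beta).
data = [(2, 1, 0.0), (3, 1, 1.0), (5, 0, 1.0), (6, 1, 0.0), (8, 0, 1.0)]

def schoenfeld(data, beta):
    residuals = []
    for t_i, e_i, x_i in sorted(data):
        if e_i != 1:
            continue                       # residuals exist only at events
        risk = [x for t, _, x in data if t >= t_i]
        w = [math.exp(x * beta) for x in risk]
        xbar = sum(x * wi for x, wi in zip(risk, w)) / sum(w)
        residuals.append(x_i - xbar)
    return residuals

def score(beta):
    # Sum of the residuals = score function evaluated at the same beta
    return sum(schoenfeld(data, beta))

# The log partial likelihood is concave, so the score is decreasing in
# beta; bisection finds the MLE, where the residuals sum to ~0.
lo, hi = -5.0, 5.0
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if score(mid) > 0 else (lo, mid)
beta_hat = (lo + hi) / 2
print(abs(score(beta_hat)) < 1e-8)  # True
```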

Scaled Schoenfeld residuals adjust the raw residuals by the variance-covariance matrix:

r^*_i = d \cdot \hat{V} \, r_i + \hat\beta

where d is the total number of events, V̂ is the estimated variance-covariance matrix (the inverse of the information matrix Î(β̂)), and r_i and r*_i are the raw and scaled residual vectors at event time i. The j-th component of r*_i can be interpreted as an estimate of β_j(t_(i)). Under proportional hazards, r*_i shows no systematic trend over time (Grambsch & Therneau, 1994).

MIDAS displays the following diagnostics:

  • Grambsch-Therneau test: Tests the association between scaled Schoenfeld residuals and time. Reports per-covariate tests and a global test
  • Scaled Schoenfeld residual plots: Plots r*_ij against time with a LOESS smooth
  • log(-log(S(t))) plot: Plots Kaplan-Meier estimates as log(-log(S(t))) versus log(t) by group. Under proportional hazards, the curves should be approximately parallel

Partial Likelihood

Cox model parameters are estimated using partial likelihood. For subject i who experienced an event at time t_(i), consider the conditional probability that subject i — among all subjects still at risk at that time, R(t_(i)) — is the one who experiences the event:

L(\beta) = \prod_{i:\text{event}} \frac{\exp(X_i'\beta)}{\sum_{j \in \mathcal{R}(t_{(i)})} \exp(X_j'\beta)}

Each factor corresponds to the conditional probability h(t_(i)|X_i) / Σ_j h(t_(i)|X_j) within the risk set at time t_(i). Substituting h(t|X) = h_0(t) exp(X'β), the h_0(t_(i)) terms cancel between numerator and denominator, so estimating β does not require knowing h_0(t). Although the partial likelihood is not a full likelihood, it has been shown to yield estimators with the same asymptotic properties as maximum likelihood — consistency and asymptotic normality (Cox, 1975).
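A minimal one-covariate sketch of maximizing the log partial likelihood with Newton-Raphson (illustrative data with distinct event times, so ties need no special handling; this is not MIDAS's implementation):

```python
import math

# One-covariate Cox fit: maximize the log partial likelihood by
# Newton-Raphson. Data are illustrative (time, event, x) triples.
data = [(2, 1, 0.0), (3, 1, 1.0), (5, 0, 1.0), (6, 1, 0.0), (8, 0, 1.0)]

def loglik_parts(beta):
    """Log partial likelihood, its gradient (score), and information."""
    ll = u = info = 0.0
    for t_i, e_i, x_i in data:
        if e_i != 1:
            continue
        risk = [x for t, _, x in data if t >= t_i]
        w = [math.exp(x * beta) for x in risk]
        sw = sum(w)
        xbar = sum(x * wi for x, wi in zip(risk, w)) / sw
        x2bar = sum(x * x * wi for x, wi in zip(risk, w)) / sw
        ll += x_i * beta - math.log(sw)   # log of one partial-likelihood factor
        u += x_i - xbar                   # score contribution
        info += x2bar - xbar ** 2         # observed information contribution
    return ll, u, info

beta = 0.0
for _ in range(25):                       # Newton-Raphson: beta += info^-1 * u
    _, u, info = loglik_parts(beta)
    beta += u / info
print(round(beta, 4))                     # MLE of the log hazard ratio
```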

Interpreting Hazard Ratios

exp(β_j) is interpreted as the hazard ratio (HR):

  • HR > 1: A one-unit increase in X_j increases the hazard by (HR - 1) × 100%
  • HR < 1: The hazard decreases by (1 - HR) × 100%
  • HR = 1: X_j has no effect on the hazard

When the confidence interval for the hazard ratio does not include 1, the data are inconsistent with the null hypothesis of no effect. The width of the confidence interval reflects estimation precision: a narrow interval indicates a more precise estimate, while a wide interval indicates limited information from the data. Hazard ratios directly convey the direction and magnitude of the effect, making them more informative than the p-value alone.
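As a numeric illustration (the coefficient and standard error below are made up, not MIDAS output), the HR and its 95% CI are obtained by exponentiating on the coefficient scale:

```python
import math

# Converting a Cox coefficient and its standard error into a hazard
# ratio with a 95% confidence interval (illustrative numbers).
beta_hat, se = 0.47, 0.21

hr = math.exp(beta_hat)
ci = (math.exp(beta_hat - 1.96 * se), math.exp(beta_hat + 1.96 * se))

# HR > 1 here: each one-unit increase in the covariate multiplies the
# hazard by hr, i.e. raises it by (hr - 1) * 100 percent.
print(round(hr, 3), round(ci[0], 3), round(ci[1], 3))
```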

Test Statistics

Three test statistics are reported for the null hypothesis that all β = 0. Each is asymptotically χ² with p (number of covariates) degrees of freedom, and they converge to the same value in large samples.

Likelihood Ratio Test

\Lambda = 2\bigl[\ell(\hat\beta) - \ell(0)\bigr]

Based on the difference in log partial likelihood between the null model (β = 0) and the fitted model. It reflects the global shape of the likelihood surface rather than relying on the local approximations used by the Wald and Score tests.

Wald Test

W = \hat\beta' \, \hat{I}(\hat\beta) \, \hat\beta

Based on the MLE β̂ and the estimated information matrix Î(β̂). This is a local approximation using the curvature of the log-likelihood at the MLE. The per-covariate p-values in the coefficients table are the univariate version of this test: z_j = β̂_j / SE(β̂_j). Comparing the overall Wald statistic with individual z values can reveal inconsistencies — for example, individually significant covariates that are not jointly significant — which may indicate multicollinearity or estimation instability.

The weakness of the Wald test is that it fails to capture asymmetry of the likelihood surface when β̂ is large.

Score Test

S = U(0)' \, I(0)^{-1} \, U(0)

U(0) is the score function (gradient of the log partial likelihood) evaluated at β = 0, and I(0) is the information matrix at that point. Since it does not require β̂, it can be computed even when Newton-Raphson fails to converge.
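Because U(0) and I(0) are evaluated at β = 0, the statistic needs no iterative fitting. A single-covariate sketch on illustrative data:

```python
# Score test for one covariate: at beta = 0 every risk-set weight
# exp(x * 0) equals 1, so the weighted means are plain averages.
# Data are illustrative (time, event, x) triples.
data = [(2, 1, 0.0), (3, 1, 1.0), (5, 0, 1.0), (6, 1, 0.0), (8, 0, 1.0)]

u = info = 0.0
for t_i, e_i, x_i in data:
    if e_i != 1:
        continue
    risk = [x for t, _, x in data if t >= t_i]
    xbar = sum(risk) / len(risk)                  # unweighted at beta = 0
    x2bar = sum(x * x for x in risk) / len(risk)
    u += x_i - xbar                               # U(0) contribution
    info += x2bar - xbar ** 2                     # I(0) contribution

S = u * u / info        # U(0)' I(0)^{-1} U(0), scalar covariate case
print(round(S, 4))
```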

The typical non-convergence scenario is when a covariate nearly perfectly predicts events. In a small clinical trial where events concentrate in the treatment group, β̂ diverges to infinity and the iterations fail. Because the model parameterizes the hazard ratio as exp(β), an extremely large effect causes the coefficient to diverge. The likelihood ratio and Wald tests cannot be computed in this case, but the Score test can still provide evidence that covariates are associated with survival. However, the effect size cannot be estimated. Firth correction and exact methods can address this, but MIDAS does not currently support them.
