Tutorial: Survival Analysis with the Kaplan-Meier Method

This tutorial walks through a Kaplan-Meier survival analysis from start to finish, using the heart failure clinical records included as sample data in MIDAS. No installation or coding is needed; everything runs in your browser. Open the sample data in MIDAS to follow along.

You have clinical records for 299 patients diagnosed with heart failure, tracking their survival status over a follow-up period. In this tutorial, you will estimate survival curves using the Kaplan-Meier method and compare survival curves grouped by patient characteristics (anaemia, high blood pressure) using RMST (Restricted Mean Survival Time).

  1. Load the sample data and examine its structure
  2. Understand the key feature of survival data: censoring
  3. Estimate the overall survival curve
  4. Compare survival curves between patients with and without anaemia
  5. Interpret the RMST results
  6. Explore other grouping variables

Load the data

On the launcher screen, click Heart Failure in the Sample Data section. A project is created and the data is loaded.

This dataset contains clinical records of heart failure patients collected at the Faisalabad Institute of Cardiology (Pakistan) in 2015 (Chicco & Jurman, 2020).

Examine the data structure

Open the Data Table tab. You will see 299 rows and 13 columns.

[Figure: Heart Failure data in Data Table]

The key columns for survival analysis fall into three categories.

Time and event variables

Column       Description
time         Follow-up period in days. The number of days from diagnosis to the last observation (death or censoring)
DEATH_EVENT  Whether the patient died during follow-up. 1 = death, 0 = alive (censored)

Patient characteristics (used for grouping)

Column               Description
age                  Age in years
anaemia              Presence of anaemia (0: No, 1: Yes)
diabetes             Presence of diabetes (0: No, 1: Yes)
high_blood_pressure  Presence of hypertension (0: No, 1: Yes)
sex                  Sex (0: Female, 1: Male)
smoking              Smoking status (0: No, 1: Yes)

Laboratory values

The remaining 5 columns (creatinine_phosphokinase, ejection_fraction, platelets, serum_creatinine, serum_sodium) are blood test results. They are not used in this tutorial but can serve as covariates in Cox regression.

What is censoring?

Of the 299 patients, some died during follow-up (DEATH_EVENT = 1) and others were still alive when follow-up ended (DEATH_EVENT = 0). The latter are called censored observations.

For a censored patient, the true survival time is at least the number of days recorded in the time column, but when the event will eventually occur is unknown.

If you simply excluded censored patients, you would lose the information from patients who survived for long periods without an event, estimating survival times only from those who died. This underestimates survival. The Kaplan-Meier method accounts for censoring by incorporating the "survived at least this long" information into the risk set calculation.

For this estimation to be valid, censoring must be independent of the likelihood of experiencing the event (non-informative censoring). For the mathematical treatment of censoring, see Survival Analysis Fundamentals.
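To make the risk-set idea concrete, here is a minimal Python sketch of the product-limit calculation. It is not how MIDAS is implemented; the toy records and the function name km_curve are invented for illustration and do not come from the Heart Failure data. The sketch compares the estimate that keeps censored patients in the risk set with the estimate obtained after discarding them.

```python
# Toy records (hypothetical): (time in days, event), event 1 = death, 0 = censored.
records = [(5, 1), (8, 0), (12, 1), (20, 0), (25, 1), (30, 0)]

def km_curve(records):
    """Return [(t, S(t))] at each distinct death time (product-limit estimate)."""
    surv = 1.0
    curve = []
    for t in sorted({t for t, e in records if e == 1}):
        # Everyone still under observation at t is at risk, including
        # patients who will be censored later.
        at_risk = sum(1 for u, _ in records if u >= t)
        deaths = sum(1 for u, e in records if u == t and e == 1)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

km_all = km_curve(records)                                  # censoring respected
km_naive = km_curve([(t, e) for t, e in records if e == 1]) # censored rows dropped

for (t, s_all), (_, s_naive) in zip(km_all, km_naive):
    print(f"day {t:>2}: with censored = {s_all:.3f}, deaths only = {s_naive:.3f}")
```

On this toy data the estimated survival at day 25 is about 0.31 when censored patients stay in the risk set and 0 when they are dropped, which is the downward bias described above.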

Estimate the overall survival curve

Select Analysis > Survival Analysis > Kaplan-Meier... from the menu bar. The Kaplan-Meier tab opens.

Set variables

  • Time Variable: select time
  • Event Variable: select DEATH_EVENT

Leave Group Variable (Optional) empty.

[Figure: Kaplan-Meier form setup]

Click Run Analysis.

[Figure: Overall Kaplan-Meier survival curve]

Read the survival curve

The horizontal axis shows follow-up time (days) and the vertical axis shows survival probability S(t). The Kaplan-Meier method does not assume a distributional form: it estimates survival probability directly at each event time, producing a step function that drops when a death occurs. The + marks on the curve indicate censoring times, the points where a patient's follow-up ended without a death being observed. The shaded band around the curve is the 95% confidence interval for the estimated survival probability S(t) at each time point (a pointwise interval constructed independently at each time, not a simultaneous band covering the entire curve).
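A pointwise interval of this kind can be sketched in a few lines of Python. The tutorial does not say which interval construction MIDAS uses, so this example assumes the plain Greenwood variance on the probability scale with a normal approximation, clipped to [0, 1]; other tools often use a log or log-log transformation instead. The toy records and the name km_with_ci are hypothetical.

```python
import math

# Toy records (hypothetical): (time in days, event), event 1 = death, 0 = censored.
records = [(5, 1), (8, 0), (12, 1), (20, 0), (25, 1), (30, 0)]

def km_with_ci(records, z=1.96):
    """Return [(t, S(t), lower, upper)] with pointwise Greenwood intervals."""
    surv, greenwood_sum, rows = 1.0, 0.0, []
    for t in sorted({t for t, e in records if e == 1}):
        n = sum(1 for u, _ in records if u >= t)            # number at risk
        d = sum(1 for u, e in records if u == t and e == 1) # deaths at t
        surv *= 1 - d / n
        if n > d:  # when everyone at risk dies, S(t) = 0 and the SE is 0
            greenwood_sum += d / (n * (n - d))
        se = surv * math.sqrt(greenwood_sum)
        rows.append((t, surv, max(0.0, surv - z * se), min(1.0, surv + z * se)))
    return rows

rows = km_with_ci(records)
for t, s, lo, hi in rows:
    print(f"day {t:>2}: S = {s:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Each row is computed from the data up to that time only, which is why the interval is pointwise: it says nothing about the probability that the whole curve lies inside the band.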

Check Summary Statistics

Item     Meaning
n        Number of subjects (299)
Events   Number of deaths
Median   Median survival time
95% CI   Confidence interval for the median

The median is the time point where the survival curve crosses the S(t) = 0.5 line. It is the estimated time by which half of the subjects have experienced the event, and it is widely used as a summary measure of survival. If S(t) does not fall below 0.5 within the observation period, the median is displayed as NR (Not Reached).
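Reading the median off a step function can be sketched as follows. This is an illustration only: the function name median_survival and both toy curves are made up, and the convention used here (the first time at which S(t) is at or below 0.5) is an assumption; software packages differ in how they treat a curve that sits exactly at 0.5.

```python
def median_survival(curve):
    """curve: list of (time, S(time)) pairs sorted by time.

    Returns the first time at which S(t) <= 0.5, or None if the curve
    never gets that low (reported as NR, Not Reached).
    """
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical curves, not taken from the Heart Failure data.
reached = [(5, 0.83), (12, 0.625), (25, 0.3125)]
not_reached = [(5, 0.9), (12, 0.8), (25, 0.7)]

median_a = median_survival(reached)
median_b = median_survival(not_reached)
print(median_a, "NR" if median_b is None else median_b)
```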

Compare survival curves by anaemia status

Next, examine whether survival differs between patients with and without anaemia.

Set the Group Variable

Select anaemia from the Group Variable (Optional) dropdown and click Run Analysis.

[Figure: Survival curves compared by anaemia status]

Two survival curves appear: anaemia = 0 (no anaemia) and anaemia = 1 (anaemia present).

Read the curves

The gap between the two curves at each time point is the estimated difference in survival probability between groups. The confidence bands represent the estimation precision of each group's survival function. Overlap between bands does not tell you whether the groups differ, because each band is constructed independently for its own group and is not a confidence interval for the difference between groups. Use RMST to compare groups quantitatively.

Interpret the RMST results

Below the curves, the RMST (Restricted Mean Survival Time) results appear for each group.

[Figure: RMST results for anaemia grouping]

RMST is the area under the Kaplan-Meier curve from 0 to a restriction time τ, estimating the average survival time up to τ. RMST does not require the proportional hazards assumption and remains interpretable even when survival curves cross.

Column   Description
Group    Group name
RMST     Restricted mean survival time estimate. Computed as the area under the KM curve up to τ
SE       Standard error, based on the Greenwood variance
95% CI   95% confidence interval for RMST
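The area calculation behind the RMST column can be sketched as a sum of rectangles under the step function. This is a minimal illustration, not the MIDAS implementation; the function name rmst, the toy curve, and the choice of τ = 30 days are all hypothetical.

```python
def rmst(curve, tau):
    """Area under a Kaplan-Meier step function from 0 to tau.

    curve: list of (time, S(time)) pairs sorted by time; S is 1 before
    the first event time and stays constant between event times.
    """
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)  # rectangle up to the next drop
        prev_t, prev_s = t, s
    area += prev_s * (tau - prev_t)    # final rectangle up to tau
    return area

# Hypothetical curve: drops at days 5, 12 and 25.
curve = [(5, 5 / 6), (12, 0.625), (25, 0.3125)]
rmst_30 = rmst(curve, tau=30)
print(f"RMST up to day 30: {rmst_30:.2f} days")
```

For this curve the rectangles are 1 x 5, 5/6 x 7, 0.625 x 13 and 0.3125 x 5, which sum to about 20.52 days.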

When there are two or more groups, an RMST Difference table appears below. It shows the pairwise difference in RMST, its SE, and confidence interval. Read the magnitude of the difference and its uncertainty from the point estimate and the width of the confidence interval. For example, if the estimated difference is 15 days with a 95% CI of [3, 27], the average survival time up to τ is estimated to differ by approximately 15 days, and the range from 3 to 27 days represents the uncertainty of that estimate. For three or more groups, the table heading changes to RMST Difference (Unadjusted) and the per-pair confidence intervals are unadjusted for multiplicity.
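One common way to build such an interval is sketched below. The tutorial does not state how MIDAS computes it, so this example assumes two independent groups and a normal approximation, which gives an SE for the difference of sqrt(SE1^2 + SE2^2). The RMST and SE values are invented so that the result lands near the 15-day example above; they are not output from the Heart Failure data.

```python
import math

def rmst_difference(rmst_1, se_1, rmst_2, se_2, z=1.96):
    """Difference in RMST with a normal-approximation confidence interval.

    Assumes the two groups are independent, so the variances add.
    """
    diff = rmst_1 - rmst_2
    se_diff = math.sqrt(se_1 ** 2 + se_2 ** 2)
    return diff, se_diff, diff - z * se_diff, diff + z * se_diff

# Hypothetical group summaries (days).
diff, se_diff, lower, upper = rmst_difference(240.0, 4.0, 225.0, 4.5)
print(f"difference = {diff:.1f} days, SE = {se_diff:.2f}, "
      f"95% CI = [{lower:.1f}, {upper:.1f}]")
```

With these made-up inputs the difference is 15 days with an interval of roughly 3.2 to 26.8 days.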

Number at Risk table

The Number at Risk table below the curve shows how many patients remain in the risk set (neither dead nor censored) at each time point.

The numbers decrease over time as patients leave the risk set through both death and censoring. At time points where few patients remain, the survival estimate becomes less precise and the confidence band widens.
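The counts in such a table can be reproduced with a short sketch. The convention assumed here is that a patient is at risk at time t if their recorded time is at least t, whether it ends in death or censoring; the toy records, the function name number_at_risk, and the display times are hypothetical.

```python
# Toy records (hypothetical): (time in days, event), event 1 = death, 0 = censored.
records = [(5, 1), (8, 0), (12, 1), (20, 0), (25, 1), (30, 0)]

def number_at_risk(records, times):
    """Count patients still under observation just before each time point."""
    return {t: sum(1 for u, _ in records if u >= t) for t in times}

at_risk = number_at_risk(records, times=[0, 10, 20, 30])
for t, n in at_risk.items():
    print(f"day {t:>2}: {n} at risk")
```

Between day 0 and day 10 the count falls from 6 to 4 because one patient died (day 5) and one was censored (day 8); both kinds of exit shrink the risk set.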

Compare by other variables

Follow the same steps to compare by high blood pressure (high_blood_pressure) or smoking (smoking).

[Figure: Survival curves compared by high blood pressure status]

You can compute RMST with different group variables, but trying several grouping variables in turn is hypothesis generation, not confirmatory testing. Report any findings from this kind of exploration as exploratory analysis.

Kaplan-Meier can only handle one grouping variable at a time. To consider multiple factors simultaneously, use the Cox proportional hazards model. For example, you can assess the effect of anaemia on survival while accounting for differences in age. See Survival Analysis for instructions.

Add results to a report

To save the survival curve for a paper or presentation, click the Add to Report button. In the dialog that appears, select an existing report or create a new one, and the survival curve is added to that report.

See Reports for details on working with reports.

Summary

  • Survival data structure: You need a time variable (follow-up period) and an event variable (death/censoring)
  • Censoring: Patients alive at the end of follow-up are included in the analysis as "survived at least this long"
  • Survival curve estimation: The Kaplan-Meier method estimates the survival curve directly from observed data without assuming a distribution
  • Group comparison: Setting a Group Variable produces group-specific survival curves and enables comparison via RMST differences

For the mathematical background of survival analysis, see Survival Analysis Fundamentals.

References

  • Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20, 16. https://doi.org/10.1186/s12911-020-1023-5
  • Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457-481. https://www.jstor.org/stable/2281868