Tutorial: Kaplan-Meier Survival Curves with Heart Failure Data

You have clinical records for 299 patients diagnosed with heart failure, tracking their survival status over a follow-up period. In this tutorial, you will estimate survival curves using the Kaplan-Meier method and compare survival curves grouped by patient characteristics (anaemia, high blood pressure) using the Log-rank test.

Load the sample data and examine its structure
Understand the key feature of survival data: censoring
Estimate the overall survival curve
Compare survival curves between patients with and without anaemia
Interpret the Log-rank test results
Explore other grouping variables

Load the data

On the launcher screen, click Heart Failure in the Sample Data section. A project is created and the data is loaded.

This dataset contains clinical records of heart failure patients collected at the Faisalabad Institute of Cardiology (Pakistan) in 2015 (Chicco & Jurman, 2020).

Examine the data structure

Open the Data Table tab. You will see 299 rows and 13 columns.

[Figure: Heart Failure data in Data Table]

The key columns for survival analysis fall into three categories.

Time and event variables

Column	Description
`time`	Follow-up period in days. The number of days from diagnosis to the last observation (death or censoring)
`DEATH_EVENT`	Whether the patient died during follow-up. 1 = death, 0 = censored (alive at the end of follow-up)

Patient characteristics (used for grouping)

Column	Description
`age`	Age in years
`anaemia`	Presence of anaemia (0: No, 1: Yes)
`diabetes`	Presence of diabetes (0: No, 1: Yes)
`high_blood_pressure`	Presence of hypertension (0: No, 1: Yes)
`sex`	Sex (0: Female, 1: Male)
`smoking`	Smoking status (0: No, 1: Yes)

Laboratory values

The remaining 5 columns (`creatinine_phosphokinase`, `ejection_fraction`, `platelets`, `serum_creatinine`, `serum_sodium`) are blood test results. They are not used in this tutorial but can serve as covariates in Cox regression.
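As a rough illustration of how the time and event columns fit together, the sketch below parses a few rows in the same column layout and counts deaths and censored observations. The rows are hypothetical placeholders (not real patient records), the laboratory columns are omitted for brevity, and the helper name `summarize_events` is an assumption made for this example.

```python
import csv
import io

# Hypothetical rows in the tutorial's column layout (not real patient records).
SAMPLE_CSV = """age,anaemia,diabetes,high_blood_pressure,sex,smoking,time,DEATH_EVENT
60,0,1,0,1,0,12,1
55,1,0,1,0,0,95,0
70,1,0,0,1,1,30,1
48,0,0,0,0,0,210,0
"""

def summarize_events(csv_text):
    """Count deaths (DEATH_EVENT = 1) and censored observations (DEATH_EVENT = 0)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    deaths = sum(1 for r in rows if int(r["DEATH_EVENT"]) == 1)
    return {"n": len(rows), "deaths": deaths, "censored": len(rows) - deaths}

summary = summarize_events(SAMPLE_CSV)
print(summary)
```

Every row contributes a `time` value whether or not the patient died; `DEATH_EVENT` only says how that follow-up ended.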

What is censoring?

Of the 299 patients, some died during follow-up (`DEATH_EVENT` = 1) and others were still alive when follow-up ended (`DEATH_EVENT` = 0). The latter are called censored observations.

A censored patient's survival time is known to be at least `time` days; when the event eventually occurs is unknown.

If you simply excluded censored patients, you would lose the information from patients who survived for long periods without an event, estimating survival times only from those who died. This underestimates survival. The Kaplan-Meier method accounts for censoring by incorporating the "survived at least this long" information into the risk set calculation.
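The effect of dropping censored patients can be seen on a small example. The following sketch uses hypothetical toy data (six patients, three of them censored at day 100) and compares a Kaplan-Meier estimate of survival at day 40 with a deaths-only calculation. The function names are assumptions made for this illustration and do not reflect the application's internals.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time (product-limit estimate)."""
    curve, surv = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        # Censored patients still count in the risk set until their own time.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

def survival_at(curve, t):
    """Step-function lookup: last estimate at or before t (1.0 before any death)."""
    s = 1.0
    for time, surv in curve:
        if time <= t:
            s = surv
    return s

# Hypothetical toy data: three deaths, three patients censored at day 100.
times = [10, 30, 50, 100, 100, 100]
events = [1, 1, 1, 0, 0, 0]

km_s40 = survival_at(kaplan_meier(times, events), 40)

# Naive approach: drop censored patients, then take the fraction alive past day 40.
death_times = [t for t, e in zip(times, events) if e == 1]
naive_s40 = sum(1 for t in death_times if t > 40) / len(death_times)

print(f"Kaplan-Meier S(40) = {km_s40:.3f}")
print(f"Deaths-only  S(40) = {naive_s40:.3f}")
```

On this toy data the Kaplan-Meier estimate at day 40 is about 0.667, while the deaths-only figure is about 0.333, because the three long-term survivors are ignored.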

For the mathematical treatment of censoring, see Survival Analysis Fundamentals.

Estimate the overall survival curve

Select Analysis > Survival Analysis > Kaplan-Meier... from the menu bar. The Kaplan-Meier tab opens.

Set variables

Time Variable: select time
Event Variable: select DEATH_EVENT

Leave Group Variable empty.

[Figure: Kaplan-Meier form setup]

Click Run Analysis.

[Figure: Overall Kaplan-Meier survival curve]

Read the survival curve

The horizontal axis shows follow-up time (days) and the vertical axis shows survival probability $S(t)$. The Kaplan-Meier method does not assume a distributional form: it estimates survival probability directly at each event time, producing a step function that drops when a death occurs. The + marks on the curve indicate censoring times, the points at which a patient's follow-up ended without a death being observed. The shaded band around the curve is the 95% confidence interval.
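The calculation behind the step function can be sketched in a few lines of Python. At each distinct death time $t_i$, with $n_i$ patients at risk and $d_i$ deaths, the running estimate is multiplied by $(1 - d_i / n_i)$. The function name `kaplan_meier_table` and the toy data are illustrative assumptions; this is not the application's implementation, and the values are not from the Heart Failure records.

```python
def kaplan_meier_table(times, events):
    """Return rows of (time, n_at_risk, deaths, survival) at each death time."""
    rows, surv = [], 1.0
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in death_times:
        # Risk set: everyone whose follow-up lasted at least until t.
        n_at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1 - deaths / n_at_risk  # the curve steps down only at death times
        rows.append((t, n_at_risk, deaths, surv))
    return rows

# Hypothetical toy data: event = 1 is a death, event = 0 is censored.
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]

km_rows = kaplan_meier_table(times, events)
for t, n, d, s in km_rows:
    print(f"t={t:>2}  at risk={n}  deaths={d}  S(t)={s:.4f}")
```

Censored patients (days 8 and 15 here) never cause a drop themselves, but they shrink the risk set for later death times, which makes each later step larger.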

Check Summary Statistics

Item	Meaning
n	Number of subjects (299)
Events	Number of deaths
Median	Median survival time

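The median is the time point where the survival curve crosses the $S(t) = 0.5$ line. It represents the time by which half of the subjects are estimated to have experienced the event, and is widely used as a summary measure of survival. If the curve never falls to 0.5 during follow-up, the median is not reached and cannot be estimated from the data.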
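Reading the median off a step function can be sketched as follows. The example assumes the convention that the median is the first time at which the estimate is at or below 0.5 (software packages differ slightly when the curve sits exactly at 0.5), and the curve values are hypothetical rather than results from the Heart Failure data.

```python
def median_survival(curve):
    """Return the first time at which S(t) <= 0.5, or None if never reached.

    `curve` is a list of (time, survival_probability) pairs sorted by time.
    """
    for t, surv in curve:
        if surv <= 0.5:
            return t
    return None

# Hypothetical step-function values, not results from the Heart Failure data.
reaches_half = [(5, 0.83), (8, 0.67), (12, 0.44), (20, 0.0)]
stays_above = [(10, 0.90), (50, 0.70)]

median_a = median_survival(reaches_half)
median_b = median_survival(stays_above)
print(median_a, median_b)
```

The second curve never drops to 0.5, so the function returns `None`, matching the case where the median cannot be estimated.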

Compare survival curves by anaemia status

Next, examine whether survival differs between patients with and without anaemia.

Set the Group Variable

Select anaemia from the Group Variable dropdown and click Run Analysis.

[Figure: Survival curves compared by anaemia status]

Two survival curves appear: anaemia = 0 (no anaemia) and anaemia = 1 (anaemia present).

Read the curves

The gap between the two curves at each time point is the estimated difference in survival probability between groups. The confidence bands represent the estimation precision of each group's survival function; overlap between bands does not indicate whether the groups differ. Use the Log-rank test to compare groups.
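To give a sense of where a confidence band comes from, here is a minimal sketch using Greenwood's variance formula with a normal approximation, clipped to the range 0 to 1. The source does not state which interval method the application uses, so treat this as one common textbook choice rather than a description of the software; the function name and toy data are assumptions.

```python
import math

def km_with_greenwood(times, events, z=1.96):
    """Return rows of (time, survival, lower, upper) using Greenwood's formula."""
    rows, surv, gw_sum = [], 1.0, 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for u in times if u >= t)                             # at risk
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)  # deaths
        surv *= 1 - d / n
        if n > d:
            gw_sum += d / (n * (n - d))  # Greenwood's cumulative sum
        se = surv * math.sqrt(gw_sum)    # standard error of S(t)
        rows.append((t, surv, max(0.0, surv - z * se), min(1.0, surv + z * se)))
    return rows

# Hypothetical toy data: event = 1 is a death, event = 0 is censored.
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]

ci_rows = km_with_greenwood(times, events)
for t, s, lo, hi in ci_rows:
    print(f"t={t:>2}  S(t)={s:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Each band is built from one group's data only, which is why comparing two bands by eye is not a substitute for a test that uses both groups together.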

Interpret the Log-rank test

When a Group Variable is specified, the Log-rank test results appear below the curves.

The null hypothesis of the Log-rank test is that all groups share the same survival function; with two groups, that the two survival curves are the same.

Item	Description
Chi-squared	Test statistic. Computed by aggregating the differences between observed and expected deaths in each group at each event time. Approximately follows a chi-squared distribution with df degrees of freedom under the null hypothesis
df	Degrees of freedom of the chi-squared distribution. Equal to the number of groups minus 1 (1 for two groups)
p-value	The probability of observing a test statistic as extreme as or more extreme than the one computed from the data, assuming the null hypothesis is true. Compare against a pre-specified significance level to decide whether to reject the null hypothesis

[Figure: Log-rank test results]

The detailed table for each group shows Observed (actual deaths) and Expected (deaths expected under the null hypothesis). The ratio of Observed to Expected (O/E) indicates the direction of the difference:

O/E > 1: more deaths than expected (lower survival)
O/E < 1: fewer deaths than expected (higher survival)
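The observed and expected counts and the chi-squared statistic described above can be sketched for the two-group case as follows. This is a minimal illustration of the standard Log-rank calculation, not the application's code; the function name, the 0/1 group coding, and the toy data are assumptions. For one degree of freedom, the chi-squared tail probability equals `erfc(sqrt(chi2 / 2))`, which avoids any external library.

```python
import math

def logrank_two_groups(times, events, groups):
    """Two-group Log-rank test; `groups` holds 0/1 labels."""
    data = list(zip(times, events, groups))
    obs1 = exp1 = var = 0.0
    for t in sorted({t for t, e, _ in data if e == 1}):
        n = sum(1 for u, _, _ in data if u >= t)                # at risk, both groups
        n1 = sum(1 for u, _, g in data if u >= t and g == 1)    # at risk, group 1
        d = sum(1 for u, e, _ in data if u == t and e == 1)     # deaths at t
        d1 = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        obs1 += d1
        exp1 += d * n1 / n  # deaths expected in group 1 under the null hypothesis
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = (obs1 - exp1) ** 2 / var
    total = sum(events)
    return {
        "observed": (total - obs1, obs1),   # (group 0, group 1)
        "expected": (total - exp1, exp1),
        "chi2": chi2,
        "df": 1,
        "p_value": math.erfc(math.sqrt(chi2 / 2)),
    }

# Hypothetical toy data: group 1 dies earlier; one censored patient per group.
times = [1, 2, 3, 8, 4, 5, 6, 9]
events = [1, 1, 1, 0, 1, 1, 1, 0]
groups = [1, 1, 1, 1, 0, 0, 0, 0]

result = logrank_two_groups(times, events, groups)
print(result)
```

In this toy example group 1 has 3 observed deaths against about 2.05 expected (O/E > 1), yet the p-value is roughly 0.40 because eight patients provide very little information.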

Number at Risk table

The Number at Risk table below the curve shows how many patients remain in the risk set (neither dead nor censored) at each time point.

The numbers decrease over time as patients leave the risk set through both death and censoring. At time points where few patients remain, the survival estimate becomes less precise and the confidence band widens.
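A Number at Risk row can be reproduced by counting, for each displayed time point, the patients whose follow-up lasted at least that long. The sketch below assumes that convention (a patient with `time` equal to the time point still counts as at risk) and uses hypothetical toy follow-up times.

```python
def number_at_risk(times, at_times):
    """Map each requested time point to the count of patients still under observation."""
    return {t: sum(1 for u in times if u >= t) for t in at_times}

# Hypothetical follow-up times in days (deaths and censored patients alike).
times = [5, 8, 8, 12, 15, 20]

risk_table = number_at_risk(times, [0, 10, 20, 25])
print(risk_table)
```

The event indicator is not needed here: a patient leaves the risk set at their `time` value regardless of whether follow-up ended in death or censoring.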

Compare by other variables

Follow the same steps to compare by high blood pressure (high_blood_pressure) or smoking (smoking).

[Figure: Survival curves compared by high blood pressure status]

You can run the Log-rank test with different group variables, but trying multiple grouping variables is hypothesis generation, not testing. Repeating the test introduces a multiple testing problem. Report findings from exploration as exploratory analysis; to test those hypotheses, use independently collected data.
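A quick calculation shows why repeated testing matters. Assuming, as a simplification, that the tests are independent and every null hypothesis is true, the chance of at least one false positive across $k$ tests at level $\alpha$ is $1 - (1 - \alpha)^k$. The sketch below evaluates this for the five binary grouping variables in the dataset; the function name is an assumption, and real tests on the same patients are not fully independent.

```python
def familywise_error_rate(alpha, k):
    """Chance of at least one false positive in k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

# Five binary grouping variables: anaemia, diabetes, high_blood_pressure, sex, smoking.
fwer = familywise_error_rate(0.05, 5)
print(f"Family-wise error rate for 5 tests at alpha = 0.05: {fwer:.3f}")
```

Under these assumptions the rate is about 0.226, well above the nominal 0.05 of any single test.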

Kaplan-Meier can only handle one grouping variable at a time. To consider multiple factors simultaneously (for example, to assess the effect of anaemia adjusted for age), use the Cox proportional hazards model. See Survival Analysis for instructions.

Add results to a report

To save the survival curve for a paper or presentation, click the Add to Report button. The curve is added to the report.

See Reports for details on working with reports.

Summary

Survival data structure: You need a time variable (follow-up period) and an event variable (death/censoring)
Censoring: Patients alive at the end of follow-up are included in the analysis as "survived at least this long"
Survival curve estimation: The Kaplan-Meier method estimates the survival curve directly from observed data without assuming a distribution
Group comparison: Setting a Group Variable produces group-specific survival curves and a Log-rank test

For the mathematical background of survival analysis, see Survival Analysis Fundamentals.

References

Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20, 16.
Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457-481.