Data Analysis: Kaplan-Meier Survival Curve
Recommended Article : 【Bioinformatics】 Table of Contents for Bioinformatics Analysis
1. Number of patients who died this time
3. Probability of having survived so far
4. R code
5. Python code
Total Time (t) | Number of Patients Died (d) | Number of Patients Alive Until Now (n) | Number of Censored | Probability of Death in this Period (d/n) | Probability of Survival in this Period (1 - d/n) | Probability of Being Alive Up to Now (L) |
---|---|---|---|---|---|---|
6 | 1 | 23 | 0 | 0.0435 | 0.9565 | 0.9565 |
12 | 1 | 22 | 0 | 0.0455 | 0.9545 | 0.9130 |
21 | 1 | 21 | 0 | 0.0476 | 0.9524 | 0.8695 |
27 | 1 | 20 | 0 | 0.0500 | 0.9500 | 0.8260 |
32 | 1 | 19 | 0 | 0.0526 | 0.9474 | 0.7826 |
39 | 1 | 18 | 0 | 0.0556 | 0.9444 | 0.7391 |
43 | 2 | 17 | 1 | 0.1176 | 0.8824 | 0.6522 |
89 | 1 | 14 | 5 | 0.0714 | 0.9286 | 0.6056 |
261 | 1 | 8 | 0 | 0.1250 | 0.8750 | 0.5299 |
263 | 1 | 7 | 0 | 0.1429 | 0.8571 | 0.4542 |
270 | 1 | 6 | 1 | 0.1667 | 0.8333 | 0.3785 |
311 | 1 | 4 | . | 0.2500 | 0.7500 | 0.2839 |
Table 1. Example Kaplan-Meier Survival Curve
1. Number of patients who died this time (number of patients died)
⑴ The 23 people in the first row mean that 23 were alive when the time was 0.
⑵ The 22 people in the second row mean that 22 were alive when the time was 6.
⑶ The number of patients still alive (number at risk) is sometimes shown and sometimes not.
2. # of censored
⑴ Censored means there is no further record for reasons such as no longer being hospitalized.
⑵ It is assumed that censored patients have the same survival probability as other survivors.
⑶ Typically marked on the graph with a ⊕ symbol.
3. Probability of having survived so far
⑴ Probability of surviving until time 12 = 0.9130 = Probability of surviving until time 6 × Probability of surviving this period (6 ~ 12) = 0.9565 × 0.9545
⑵ Probability of surviving until time 21 = 0.8695 = Probability of surviving until time 12 × Probability of surviving this period (12 ~ 21) = 0.9130 × 0.9524
4. R code
⑴ R code related to the survival curve (considering # of censored)
install.packages("survival")
install.packages("survminer")
library(survival)
library(survminer)
surv_data <- data.frame(
time = c(6, 12, 21, 27, 32, 39, 43, 43, 43, 89, 89, 89, 89, 89, 89, 261, 263, 270, 270, 311),
status = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1) # 사망은 1, censoring은 0
)
surv_obj <- Surv(time = surv_data$time, event = surv_data$status)
fit <- survfit(surv_obj ~ 1, data = surv_data)
ggsurvplot(
fit,
data = surv_data,
xlab = "Time",
ylab = "Survival probability",
title = "Kaplan-Meier Survival Curve",
surv.median.line = "hv", # 중앙값 생존 시간 선 추가
ggtheme = theme_minimal(), # 테마 설정
risk.table = TRUE, # 위험 테이블 추가
palette = "Dark2" # 색상 팔레트 설정
)
⑵ If a p-value is added : Note, the following code is hypothetical and unrelated to the above table.
library(survival)
library(survminer)
# Prepare Condition, Overall Survival Time, and Overall Survival Status
print(condition) # continuous variable
print(overall_surv_time)
print(overall_surv_status)
# Convert Overall Survival Status to a binary variable, where 1 = event occurred (DECEASED) and 0 = censored (LIVING or NA)
overall_surv_status_binary <- ifelse(overall_surv_status == "DECEASED", 1, 0)
# Create survival objects
surv_obj_overall <- Surv(time = overall_surv_time, event = overall_surv_status_binary)
# Find the median value of the condition
median_condition <- median(condition, na.rm = TRUE)
# Split data based on the median condition
data_overall <- data.frame(surv_obj_overall, Status = overall_surv_status_binary, Condition = condition)
# Overall Survival Analysis
ggsurvplot(
survfit(surv_obj_overall ~ Condition >= median_condition, data = data_overall),
data = data_overall,
pval = TRUE,
risk.table = TRUE,
title = "Overall Survival based on Median Condition",
legend.title = "Condition >= Median"
)
5. Python code
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Simulate some survival data for demonstration purposes
np.random.seed(42) # for reproducibility
n_patients = 100
study_duration = 1000 # days
# Generate some survival times with a predefined survival rate (we will use an exponential distribution)
survival_times = np.random.exponential(scale=365, size=n_patients) # mean survival time
# Generate censoring times, assuming that if a patient hasn't had an event by the end of the study, they are censored
censoring_times = np.random.uniform(low=0, high=study_duration, size=n_patients)
# The observed time is the minimum of the survival time and censoring time
observed_times = np.minimum(survival_times, censoring_times)
# The event is observed if the survival time is less than or equal to the censoring time
events_observed = (survival_times <= censoring_times).astype(int)
# Create a DataFrame
df_patients = pd.DataFrame({
'duration': observed_times,
'event_observed': events_observed
})
# Fit the Kaplan-Meier survival estimator on the data
kmf = KaplanMeierFitter()
kmf.fit(df_patients['duration'], event_observed=df_patients['event_observed'])
# Plot the survival function
plt.figure(figsize=(10, 6))
kmf.plot_survival_function()
plt.title('Kaplan-Meier Survival Curve')
plt.xlabel('Days since Start of Study')
plt.ylabel('Survival Probability')
plt.grid(True)
plt.show()
Input : 2021.04.13 17:29
Updated : 2024.03.11 21:42