Survival analysis aims to study the relationship between independent covariates and survival time and event outcomes. With the increasing demand for accuracy in practical applications, survival models are constantly being developed and improved. Meanwhile, how to measure the prediction performance of survival models has also attracted more and more attention ([4]). The censoring problem of survival data is the main reason why survival models are difficult to evaluate ([8]; [11]). In this package, we want to provide an “all-in-one” package which includes most of the evaluation metrics for survival predictions.
In the following, we will show how these model evaluation metrics work based on some simulated datasets and the popular Cox and random survival forest (RSF) models.
The four simulated survival data scenarios used here have different degrees of violation of the proportional hazard assumption and are constructed as follows:
Dataset 1: Data is created using 300 independent observations, where the covariate vector (W1, ..., W25) is multivariate normal with mean zero and a covariance matrix having elements (i, j) equal to 0.9|i − j|. Survival times are simulated from an exponential distribution with mean $\mu = e^{0.1 \sum_{i=11}^{20}} W_i$ (i.e., a proportional hazards model) and the censoring distribution is exponential with mean chosen to get approximately 30% censoring.
Dataset 2: Data is created using 200 independent observations where the covariate vector (W1, ..., W25) consists of 25 iid uniform random variables on the interval [0, 1]. The survival times follow an exponential distribution with mean μ = sin(W1π) + 2|W2 − 0.5| + W33. Censoring is uniform on [0, 6], which results in approximately 24% censoring. Here, the proportional hazards assumption is mildly violated.
Dataset 3: Data is created using 300 independent observations where the covariates (W1, ..., W25) are multivariate normal with mean zero and a covariance matrix having elements (i, j) equal to 0.75|i − j|. Survival times are gamma distributed with shape parameter $\mu=0.5+0.3|\sum_{i=11}^{15} W_i|$ and scale parameter 2. Censoring times are uniform on [0, 15], which results in approximately 20% censoring. Here, the proportional hazards assumption is strongly violated.
Dataset 4: Data is created using 300 independent observations where the covariates (W1, ..., W25) are multivariate normal with mean zero and a covariance matrix having elements (i, j) equal to 0.75|i − j|. Survival times are simulated according to a log-normal distribution with mean $\mu = 0.1|\sum_{i=1}^5 W_i| + 0.1|\sum_{i=21}^{25}W_i|$. Censoring times are log-normal with mean μ + 0.5 and scale parameter one, and the censoring rate is approximately 32%. Here, the underlying censoring distribution depends on covariates.
The most used survival model evaluation metrics is the Concordance index (CI) ([5]; [9]; [13]; [12]; [2]), which is the generalization of the ROC curve in the complete data in the survival data ([4]; [14]; [10]; [13]). CI can be divided into two categories according to whether considering the data tied.
In practical applications, deaths may be observed in both samples (δi = 1 and δj = 1) in the sample pair, and the observed survival times are the same (Ti = Xi = Xj = Tj). In this case, when the survival probabilities or survival times predicted by the model are also equal(Yi = Yj), the model has a perfect predictive ability. However, without considering ties between survival times, this kind of sample pair cannot improve CI for the original CI definition.
To deal with this problem, Ishwaran introduced an improved CI calculation method for the tied survival data ([7]). According to their definition, if the prediction result is survival time or survival probability, the smaller the prediction result, the worse it is; if the prediction result is the hazard function, the larger the value, the worse the prediction result. Detailed information on how to calculate CI are shown below:
npij(Xi, δi, Xj, δj) = max(I(Xi ≥ Xj)δj, I(Xi ≤ Xj)δi)
CC = ∑i, jI(sign(Yi, Yj) = csign(Xi, Xj)|npij = 1) where:
sign(Yi, Yj) = I(Yi ≥ Yj) − I(Yi ≤ Yj) csign (Xi, δi, Xj, δj) = I(Xi ≥ Xj)δj − I(Xi ≤ Xj)δi
$$\begin{aligned} PC & =\sum_{i, j} I\left(Y_{i}=Y_{j} \mid n p_{i j}=1, X_{i} \neq X_{j}\right)\\ &+I\left(Y_{i} \neq Y_{j} \mid n p_{i j}=1, X_{i}=X_{j}, \delta_{i}=\delta_{j}=1\right) \\ &+ I\left(Y_{i} \geq Y_{j} \mid n p_{i j}=1, X_{i}=X_{j}, \delta_{i}=1, \delta_{j}=0\right) \end{aligned}$$
$$CI =\frac{Concordance}{Permissible} = \frac{CC+0.5*PC}{\sum_{i,j}np_{ij}(X_i,\delta_i,X_j,\delta_j)} $$
where, δi is the survival status of sample i (0 means censoring, 1 means event), Yi and Xi represents the predicted survival time and the observed survival time, respectively.
Take simulation data set 4 as an example. First, we randomly divide the data into training set and test set. Then, using Cox and RSF models, we can get the corresponding CI values.
library(SurvMetrics)
library(caret)
library(randomForestSRC)
library(survival)
library(pec)
library(ggplot2)
set.seed(123)
#Initialization
metrics_cox = 0
metrics_rsf = 0
for (i in 1:20) {
mydata = SDGM4(N = 100, p = 20, c_step = 0.2)
index_data = createFolds(1:nrow(mydata), 2)
train_data = mydata[index_data[[1]],]
test_data = mydata[index_data[[2]],]
#fit the models
fitrsf = rfsrc(Surv(time, status)~., data = train_data, nsplit = 3, ntree = 500)
mat_rsf = predict(fitrsf, test_data)$survival
dis_time = fitrsf$time.interest
fitcox = coxph(Surv(time, status)~., data = train_data, x = TRUE)
mat_cox = predictSurvProb(fitcox, test_data, dis_time)
#calculate the C index
med_index = median(1:length(dis_time))
surv_obj = Surv(test_data$time, test_data$status)
#C index for Cox
metrics_cox[i] = Cindex(surv_obj, predicted = mat_cox[, med_index])
#C index for RSF
metrics_rsf[i] = Cindex(surv_obj, predicted = mat_rsf[, med_index])
}
data_CI = data.frame('Cindex' = c(metrics_cox, metrics_rsf),
'model' = c(rep('Cox', 20), rep('RSF', 20)))
ggplot(data_CI, aes(x = model, y = Cindex, fill = model)) +
geom_boxplot() +
scale_fill_manual(values = c("#FFBBCC", "#88CCFF"))
The larger the C index, the better the model prediction. It can be seen that the performance of RSF is better than Cox, which is also consistent with the data generation method.
In survival analysis, we usually use the prediction error curve (PEC) to evaluate the prediction performance of the survival model at different time points. When there is no censored sample in the test set, PEC corresponding to sample i at time t is the expected value of the square of the difference between the true survival state of sample i at time t and the predicted survival probability.
PEC(t) = E(Ŝ(t|zi) − δi(t))2
Where Ŝ(t|zi) is the survival function obtained by the model, which is used to estimate the survival probability of sample i at time t; δi(t) is a binary variable used to represent the true survival state of sample i at time t: 1 means that the sample is dead, 0 means that the sample is still alive at time t.
Based on the ideas of PEC and Brier’s ([1]) prediction index for the weather forecast model, Graf ([3]) proposed another common evaluation metric in survival analysis: Brier Score (BS). The prediction error at the time point was improved by inverse probability censoring weighting (IPCW) to evaluate the prediction performance of the survival model. BS at point t* can be expressed as:
$$BS(t^*)=\frac{1}{N}\sum_{i=1}^N\left[\frac{(\hat{S}(t^*|z_i))^2}{\hat{G}(X_i)}\cdot I(X_i<t^*,\delta_i=1)+\frac{(1-\hat{S}(t^*|z_i))^2}{\hat{G}(t^*)}\cdot I(X_i\ge t^*)\right]$$
Where t* is the time point at which BS is to be calculated; N is the sample size; Zi is the covariate corresponding to sample i; Ŝ(⋅) is the survival function predicted by the model; Ĝ(⋅) is the survival function corresponding to censoring.
From the definition of BS, it can be found that BS is essentially a square prediction error based on IPCW, so the smaller BS, the better the prediction effect of the survival model.
At the same time, we found that BS depends on the selection of time point t*. Generally, the median of the observation time is selected as the time point. However, in practice, different selection methods may also have large differences in the evaluation of the model’s prediction effect.
Again, take simulation data set 4 as an example. First, we randomly divide the data into training set and test set. Then, using Cox and RSF models, we get the corresponding Brier Scores.
#Initialization
metrics_cox = 0
metrics_rsf = 0
set.seed(123)
for (i in 1:10) {
mydata = SDGM4(N = 100, p = 20, c_step = 0.5)
index_data = createFolds(1:nrow(mydata), 2)
train_data = mydata[index_data[[1]],]
test_data = mydata[index_data[[2]],]
#fit the models
fitrsf = rfsrc(Surv(time, status)~., data = train_data, nsplit = 3, ntree = 500)
mat_rsf = predict(fitrsf, test_data)$survival
dis_time = fitrsf$time.interest
fitcox = coxph(Surv(time, status)~., data = train_data, x = TRUE)
mat_cox = predictSurvProb(fitcox, test_data, dis_time)
#calculate the Brier Score
med_index = median(1:length(dis_time))
surv_obj = Surv(test_data$time, test_data$status)
t_star = median(fitrsf$time.interest)
#Brier Score for Cox
metrics_cox[i] = Brier(surv_obj, pre_sp = mat_cox[, med_index], t_star)
#Brier Score for RSF
metrics_rsf[i] = Brier(surv_obj, pre_sp = mat_rsf[, med_index], t_star)
}
data_BS = data.frame('BS' = c(metrics_cox, metrics_rsf),
'model' = c(rep('Cox', 10), rep('RSF', 10)))
ggplot(data_BS, aes(x = model, y = BS, fill = model)) +
geom_boxplot() +
scale_fill_manual(values = c("#FFBBCC", "#88CCFF"))
The smaller the BS, the better the model prediction. From the figure, it can be seen that the performance of RSF is better than Cox, which is also consistent with the data generation method SDGM4.
In practice, IBS, the integral form of Brier Score, is often preferred in that IBS value does not depend on the selection of a single time point.
$$IBS=\frac{1}{\max\limits_i{(X_i)}} \int_0^{\max\limits_i(X_i)}BS(t)dt $$
At the same time, one of the reasons why BS and IBS are not as widely used as CI is that the existing packages for input objects in the BS calculation method in R are not clear, which leads to the prediction results of many survival models that cannot be directly used to calculate the value of BS or IBS.
Also, we take simulation data set 4 as an example. First, we randomly divide the data into training set and test set. Then, using Cox and RSF models, we get the corresponding IBS values:
#Initialization
metrics_cox = 0
metrics_rsf = 0
set.seed(123)
for (i in 1:5) {
mydata = SDGM4(N = 100, p = 20, c_step = -0.5)
index_data = createFolds(1:nrow(mydata), 2)
train_data = mydata[index_data[[1]],]
test_data = mydata[index_data[[2]],]
#fit the models
fitrsf = rfsrc(Surv(time, status)~., data = train_data, nsplit = 3, ntree = 500)
mat_rsf = predict(fitrsf, test_data)$survival
dis_time = fitrsf$time.interest
fitcox = coxph(Surv(time, status)~., data = train_data, x = TRUE)
mat_cox = predictSurvProb(fitcox, test_data, dis_time)
#calculate the IBS
med_index = median(1:length(dis_time))
surv_obj = Surv(test_data$time, test_data$status)
#IBS for Cox
metrics_cox[i] = IBS(surv_obj, sp_matrix = mat_cox, dis_time)
#IBS for RSF
metrics_rsf[i] = IBS(surv_obj, sp_matrix = mat_rsf, dis_time)
}
data_IBS = data.frame('IBS' = c(metrics_cox, metrics_rsf),
'model' = c(rep('Cox', 5), rep('RSF', 5)))
ggplot(data_IBS, aes(x = model, y = IBS, fill = model)) +
geom_boxplot() +
scale_fill_manual(values = c("#FFBBCC", "#88CCFF"))
The smaller the IBS, the better the model prediction. From the figure, it can be seen that the performance of RSF is better than Cox, which is also consistent with the data generation method SDGM4.
Two additional evaluation metrics for survival models used are IAE (Integrated Absolute Error) and ISE (Integrated Square Error). ([6]; [16];).
IAE = ∫t|S(t) − Ŝ(t)|dt ISE = ∫t(S(t) − Ŝ(t))2dt
where, S(t) and Ŝ(t) represent the true survival function and the predicted survival function, respectively. However, in practical applications, the mathematical expression of the survival function is usually unknown, which results in that the existing IAE and ISE can only be used in simulation experiments. In the SurvMetrics package, the non-parametric KM estimation method is used to obtain the approximate expression of S(t), and an evaluation standard for the prediction effect of the model is given, which also extends the scope of application of IAE and ISE to the situation of real data sets.
Also, we take simulation data set 4 as an example. First, we randomly divide the data into training set and test set. Then, using Cox and RSF models, we get the corresponding IAE and ISE values:
#Initialization
metrics_cox_IAE = 0
metrics_rsf_IAE = 0
metrics_cox_ISE = 0
metrics_rsf_ISE = 0
set.seed(123)
for (i in 1:10) {
mydata = SDGM4(N = 100, p = 20, c_step = 0.2)
index_data = createFolds(1:nrow(mydata), 2)
train_data = mydata[index_data[[1]],]
test_data = mydata[index_data[[2]],]
#fit the models
fitrsf = rfsrc(Surv(time, status)~., data = train_data, nsplit = 3, ntree = 500)
mat_rsf = predict(fitrsf, test_data)$survival
dis_time = fitrsf$time.interest
fitcox = coxph(Surv(time, status)~., data = train_data, x = TRUE)
mat_cox = predictSurvProb(fitcox, test_data, dis_time)
#calculate the IAE and ISE
med_index = median(1:length(dis_time))
surv_obj = Surv(test_data$time, test_data$status)
#IAE and ISE for Cox
temp1 = IAEISE(surv_obj, sp_matrix = mat_cox, dis_time)
metrics_cox_IAE[i] = temp1[1]
metrics_cox_ISE[i] = temp1[2]
#IAE and ISE for RSF
temp2 = IAEISE(surv_obj, sp_matrix = mat_rsf, dis_time)
metrics_rsf_IAE[i] = temp2[1]
metrics_rsf_ISE[i] = temp2[2]
}
data_IAE = data.frame('IAE' = c(metrics_cox_IAE, metrics_rsf_IAE),
'model' = c(rep('Cox', 10), rep('RSF', 10)))
data_ISE = data.frame('ISE' = c(metrics_cox_ISE, metrics_rsf_ISE),
'model' = c(rep('Cox', 10), rep('RSF', 10)))
P1 = ggplot(data_IAE, aes(x = model, y = IAE, fill = model)) +
geom_boxplot() +
scale_fill_manual(values = c("#FFBBCC", "#88CCFF")) +
theme(legend.position = 'none')
P2 = ggplot(data_ISE, aes(x = model, y = ISE, fill = model)) +
geom_boxplot() +
scale_fill_manual(values = c("#FFBBCC", "#88CCFF")) +
theme(legend.position = 'none')
library(ggpubr)
ggarrange(P1, P2, ncol = 2)
The smaller the IAE and ISE, the better the model prediction. From the figure, it can be seen that the performance of RSF is better than Cox both in IAE and ISE metrics, which is also consistent with the data generation method SDGM4.
The final evaluation metric of survival model presented here is MAE (Mean Absolute Error) ([15]).
$$MAE=\frac{1}{n} \sum_{i=1}^{N}\left(\delta_{i}\left|Y_{i}-{X}_{i}\right|\right)$$ where, N is the observed sample size, and n is the death sample size. MAE only calculates the average absolute error between the predicted survival time and the true survival time in the uncensored sample, because it does not consider the information of the censored data and requires the model to output survival time, while the general model can only output the survival probability or cumulative risk, so MAE is rarely used in practice.
In this vignette, we present the survival model evaluation metrics and give examples of how these calculation methods work. In the current version of SurvMetrics, six survival model evaluation metrics are present. We are working to provide a more user-friendly interface and facilitate both statistical and non-statistical clinical research workers in evaluating survival models.