--- title: "Predictive Evaluation Metrics in Survival Analysis" author: "Zhou Hanpu" date: "2021/7/26" output: rmarkdown::html_vignette bibliography: myref.bib csl: journal-style.csl vignette: > %\VignetteIndexEntry{SurvMetrics-vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Vignette ## 1. Introduction Survival analysis aims to study the relationship between independent covariates and survival time and event outcomes. With the increasing demand for accuracy in practical applications, survival models are constantly being developed and improved. Meanwhile, how to measure the prediction performance of survival models has also attracted more and more attention (@HEaGerty.2005). The censoring problem of survival data is the main reason why survival models are difficult to evaluate (@Ivanescu.2015; @Klein.2010). In this package, we want to provide an "all-in-one" package which includes most of the evaluation metrics for survival predictions. In the following, we will show how these model evaluation metrics work based on some simulated datasets and the popular Cox and random survival forest (RSF) models. ## 2. Data The four simulated survival data scenarios used here have different degrees of violation of the proportional hazard assumption and are constructed as follows: Dataset 1: Data is created using 300 independent observations, where the covariate vector $( W_1, . . . ,W_{25} )$ is multivariate normal with mean zero and a covariance matrix having elements $(i, j)$ equal to $0.9^{|i−j|}$. Survival times are simulated from an exponential distribution with mean $\mu = e^{0.1 \sum_{i=11}^{20}} W_i$ (i.e., a proportional hazards model) and the censoring distribution is exponential with mean chosen to get approximately 30% censoring. Dataset 2: Data is created using 200 independent observations where the covariate vector ($W_1, . . . ,W_{25}$) consists of 25 iid uniform random variables on the interval [0, 1]. The survival times follow an exponential distribution with mean $μ = sin(W_1\pi) + 2|W_2 − 0.5| + W_3^3$. Censoring is uniform on [0, 6], which results in approximately 24% censoring. Here, the proportional hazards assumption is mildly violated. Dataset 3: Data is created using 300 independent observations where the covariates ($W_1, . . . ,W_{25}$) are multivariate normal with mean zero and a covariance matrix having elements $(i, j)$ equal to $0.75^{|i−j|}$. Survival times are gamma distributed with shape parameter $\mu=0.5+0.3|\sum_{i=11}^{15} W_i|$ and scale parameter 2. Censoring times are uniform on [0, 15], which results in approximately 20% censoring. Here, the proportional hazards assumption is strongly violated. Dataset 4: Data is created using 300 independent observations where the covariates $(W_1, . . . ,W_{25})$ are multivariate normal with mean zero and a covariance matrix having elements $(i, j)$ equal to $0.75^{|i−j|}$. Survival times are simulated according to a log-normal distribution with mean $\mu = 0.1|\sum_{i=1}^5 W_i| + 0.1|\sum_{i=21}^{25}W_i|$. Censoring times are log-normal with mean $\mu + 0.5$ and scale parameter one, and the censoring rate is approximately 32%. Here, the underlying censoring distribution depends on covariates. ## 3. Predictive Evaluation Metrics in Survival Analysis ### 3.1 Concordance index The most used survival model evaluation metrics is the Concordance index (CI) (@Harrell.1982; @Jing.2019; @Li.2016; @Lee.2018; @Devlin.2021), which is the generalization of the ROC curve in the complete data in the survival data (@HEaGerty.2005; @Obuchowski.2006; @Kang.2015; @Li.2016). CI can be divided into two categories according to whether considering the data tied. In practical applications, deaths may be observed in both samples ($\delta_i=1$ and $\delta_j=1$) in the sample pair, and the observed survival times are the same ($T_i=X_i=X_j=T_j$). In this case, when the survival probabilities or survival times predicted by the model are also equal($Y_i=Y_j$), the model has a perfect predictive ability. However, without considering ties between survival times, this kind of sample pair cannot improve CI for the original CI definition. To deal with this problem, Ishwaran introduced an improved CI calculation method for the tied survival data (@Ishwaran.2008). According to their definition, if the prediction result is survival time or survival probability, the smaller the prediction result, the worse it is; if the prediction result is the hazard function, the larger the value, the worse the prediction result. Detailed information on how to calculate CI are shown below: - First define the comparable sample pair: $$np_{ij}\left(X_{i}, \delta_{i}, X_{j}, \delta_{j}\right) = max(I\left(X_{i} \geqslant X_{j}\right) \delta_{j},I\left(X_{i} \leqslant X_{j}\right) \delta_{i}) $$ - In the second step, calculate the Complete Concordance (CC): $$ CC = \sum_{i,j}I(sign(Y_i,Y_j)=csign(X_i,X_j)|np_{ij}=1) $$ where: $$sign(Y_i,Y_j) = I\left(Y_{i} \geqslant Y_{j}\right) -I\left(Y_{i} \leqslant Y_{j}\right)$$ $$\operatorname{csign}\left(X_{i}, \delta_{i}, X_{j}, \delta_{j}\right)=I\left(X_{i} \geqslant X_{j}\right) \delta_{j}-I\left(X_{i} \leqslant X_{j}\right) \delta_{i}$$ - Then, derive the Partial Concordance (PC): $$\begin{aligned} PC & =\sum_{i, j} I\left(Y_{i}=Y_{j} \mid n p_{i j}=1, X_{i} \neq X_{j}\right)\\ &+I\left(Y_{i} \neq Y_{j} \mid n p_{i j}=1, X_{i}=X_{j}, \delta_{i}=\delta_{j}=1\right) \\ &+ I\left(Y_{i} \geq Y_{j} \mid n p_{i j}=1, X_{i}=X_{j}, \delta_{i}=1, \delta_{j}=0\right) \end{aligned}$$ - Finally, CI can be calculated as: $$CI =\frac{Concordance}{Permissible} = \frac{CC+0.5*PC}{\sum_{i,j}np_{ij}(X_i,\delta_i,X_j,\delta_j)} $$ where, $\delta_i$ is the survival status of sample $i$ (0 means censoring, 1 means event), $Y_i$ and $X_i$ represents the predicted survival time and the observed survival time, respectively. ### 3.2 Example on CI Take simulation data set 4 as an example. First, we randomly divide the data into training set and test set. Then, using Cox and RSF models, we can get the corresponding CI values. ```{r fig.align='center', message=FALSE, warning=FALSE, results='hide'} library(SurvMetrics) library(caret) library(randomForestSRC) library(survival) library(pec) library(ggplot2) set.seed(123) #Initialization metrics_cox = 0 metrics_rsf = 0 for (i in 1:20) { mydata = SDGM4(N = 100, p = 20, c_step = 0.2) index_data = createFolds(1:nrow(mydata), 2) train_data = mydata[index_data[[1]],] test_data = mydata[index_data[[2]],] #fit the models fitrsf = rfsrc(Surv(time, status)~., data = train_data, nsplit = 3, ntree = 500) mat_rsf = predict(fitrsf, test_data)$survival dis_time = fitrsf$time.interest fitcox = coxph(Surv(time, status)~., data = train_data, x = TRUE) mat_cox = predictSurvProb(fitcox, test_data, dis_time) #calculate the C index med_index = median(1:length(dis_time)) surv_obj = Surv(test_data$time, test_data$status) #C index for Cox metrics_cox[i] = Cindex(surv_obj, predicted = mat_cox[, med_index]) #C index for RSF metrics_rsf[i] = Cindex(surv_obj, predicted = mat_rsf[, med_index]) } data_CI = data.frame('Cindex' = c(metrics_cox, metrics_rsf), 'model' = c(rep('Cox', 20), rep('RSF', 20))) ggplot(data_CI, aes(x = model, y = Cindex, fill = model)) + geom_boxplot() + scale_fill_manual(values = c("#FFBBCC", "#88CCFF")) ``` The larger the C index, the better the model prediction. It can be seen that the performance of RSF is better than Cox, which is also consistent with the data generation method. ### 3.3 Brier Score In survival analysis, we usually use the prediction error curve (PEC) to evaluate the prediction performance of the survival model at different time points. When there is no censored sample in the test set, PEC corresponding to sample $i$ at time $t$ is the expected value of the square of the difference between the true survival state of sample $i$ at time $t$ and the predicted survival probability. $$PEC(t)=E\left(\hat{S}(t|z_i)-\delta_i(t)\right)^2$$ Where $\hat{S}(t|z_i)$ is the survival function obtained by the model, which is used to estimate the survival probability of sample $i$ at time $t$; $\delta_i(t)$ is a binary variable used to represent the true survival state of sample $i$ at time $t$: 1 means that the sample is dead, 0 means that the sample is still alive at time $t$. Based on the ideas of PEC and Brier's (@Brier.1950) prediction index for the weather forecast model, Graf (@Graf.1999) proposed another common evaluation metric in survival analysis: Brier Score (BS). The prediction error at the time point was improved by inverse probability censoring weighting (IPCW) to evaluate the prediction performance of the survival model. BS at point $t^*$ can be expressed as: $$BS(t^*)=\frac{1}{N}\sum_{i=1}^N\left[\frac{(\hat{S}(t^*|z_i))^2}{\hat{G}(X_i)}\cdot I(X_i