This vignette gives the instructions on how to fit the Bernstein
polynomial (BP) based survival regression models using the unprecedent
routines implemented in the spsurv package. In addition to
this vignette, you can have access to a full description of the
methodology (MSc dissertation) available at arXiv, check it out https://arxiv.org/abs/2003.10548.
The spsurv::spbp function is the main routine of this
package, as it allows to fit all BP related survival regression
approaches presented in the dissertation. The acronym
spbp refers to semi-parametric Bernstein polynomial
based regression. The formula argument in
spsurv::spbp makes use of the same structure available at
the survival package in order to provide a familiar
environment to the public. The development version can be found on github.com/rvpanaro/spsurv,
it can be installed using:
The spsurv package imports specific routines that
provide the necessary support for internal calculations. During the
installation, other dependencies are required, such as
survival, loo, coda,
rstan and MASS packages.
The target data set is passed to the spsurv::spbp
function through a mandatory data.frame object class. Also,
it is possible to switch between bayesian and
mle approaches through the approach argument
and between the ph, po and aft
frameworks using the model argument. Naturally, prior
choices are ignored if the approach argument is set to
mle referring to maximum likelihood (ML) estimation; a
warning is displayed in this case. In addition, consider extra arguments
that may be passed directly to Stan software to apply
rstan::optimizing (if ML method), or
rstan::sampling if Markov chain Monte Carlo (MCMC) method,
function control options. Both methods were applied under the
rstan interface, see https://mc-stan.org/users/interfaces/rstan.
As mentioned in Chapter 4, the polynomial degree (highest basis order) can be chosen arbitrarily. In particular, the polynomial degree must be greater than zero, the default value of the polynomial degree \(m = \sqrt{n}\) is rounded to the greatest integer. Note that, the domain restriction for the BP, referred to as \(\tau\) here, is defined internally, see the discussion in Chapter 5. The reason for not allowing a user-defined \(\tau\) is to avoid an improper definition that would cause computing problems.
Considering the variety of settings that Stan can
provide and the modeling options above, we believe that the package is
flexible regarding user-defined applications of the BP based models.
Beyond that, a class, namely spbp was created to extend
some S3 methods that meet the R community need for
printing, summarizing, and plotting functions. Accordingly, we had
developed S3 methods extensions to accomplish these tasks such as the
spsurv::print.spbp, spsurv::summary.spbp,
spsurv::model.matrix.spbp and other summary printing
extensions such as print.summary.bpph.bayes.
Further, there are some coding instructions on how to fit the
semi-parametric models: Bernstein based proportional hazards (BPPH),
Bernstein based proportional odds (BPPO), and Bernstein based
accelerated failure time (BPAFT) models for right-censored data, under
the Bayesian or frequentist approach. In the Bayesian perspective,
Normal prior distributions are attributed to the regression coefficients
while Log-Normal, Gamma, or Inverse Gamma can be attributed to the BP
parameters. The default arguments for the spbp functions
were set as follows:
spbp.default <-
spbp(formula, degree, data,
approach = c("mle", "bayes"),
model = c("ph", "po", "aft"),
priors = list(beta = c("normal(0,4)"),
gamma = "lognormal(0,10)"),
scale = TRUE,
...)Consider formula an object of class formula, with the
Surv object (survival package) for
right-censored time-to-event data on the left side of \(\sim\) and the explanatory terms on the
right; degree for the integer value of the BP degree,
non-integer values are rounded to the greatest valid degree;
data for a mandatory data.frame object with
variables named in the formula; approach for either
Bayesian or ML estimation methods, default is bayes;
model proportional hazards (PH), proportional odds (PO) or
accelerated failure time (AFT) for the modeling classes discussed in Chapter 2, default is
ph; priors list of prior settings, which is
ignored when mle, and scale for a logical
value that indicates whether to apply the standardization discussed in
Chapter 5. Recall that
extra arguments can be passed to rstan::sampling
(e.g. iter, chains, init), more
details in https://mc-stan.org/users/documentation/.
Most statistical packages about survival regression returns an ANOVA
table. In this sense, the object of class spbp follows the
design provided in the survival package. The output
corresponding to the ANOVA table can be obtained with:
library("KMsurv")
data("larynx")
library(spsurv)
fit <- spsurv::spbp(Surv(time, delta)~ age + factor(stage),
approach = "mle", data = larynx)
summary(fit) ## Bernstein Polynomial based Proportional Hazards model
## Call:
## spbp.default(formula = Surv(time, delta) ~ age + factor(stage),
## data = larynx, approach = "mle", model = "ph")
##
## n= 90, number of events= 50
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.0193 1.0195 0.0144 1.34 0.180
## factor(stage)2 0.1719 1.1876 0.4626 0.37 0.710
## factor(stage)3 0.6585 1.9318 0.3557 1.85 0.064 .
## factor(stage)4 1.7991 6.0440 0.4290 4.19 2.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Likelihood ratio test= 19.6 on 4 df, p=6e-04
## Wald test = 22.6 on 4 df, p=2e-04
One can reproduce this example by copying and pasting the indicated
code in the R console. The output is as follows:
library("KMsurv")
data("larynx")
library(spsurv)
fit <- spsurv::spbp(Surv(time, delta)~age + factor(stage),
approach = "mle", data = larynx)Consider that coef refers to the ML point estimates;
exp(coef) is the point estimate for the hazard ratio;
se(coef) represents the standard errors; z is
the test statistic for the Wald test and p is the p-value
of the Wald test. The estimated BP parameters, the value of the
evaluated log-likelihood of the null (reference) model and the
stan object can be obtained having access to the
spbp class object elements. Moreover, apart from the
fit object, it is also possible to obtain the matrix of
covariates, the covariance matrix and the likelihood value. This can be
done using the following code:
## age factor(stage)2 factor(stage)3 factor(stage)4 gamma1
## 1.927115e-02 1.719313e-01 6.584601e-01 1.799067e+00 1.687247e-02
## gamma2 gamma3 gamma4 gamma5 gamma6
## 4.163216e-02 8.007208e-85 1.400546e-02 4.677120e-02 9.441894e-58
## gamma7 gamma8 gamma9 gamma10
## 1.330574e-01 1.904282e-107 8.952326e-193 5.080056e-226
## age factor(stage)2 factor(stage)3 factor(stage)4
## 1 77 0 0 0
## 2 53 0 0 0
## 3 45 0 0 0
## 4 57 0 0 0
## 5 58 0 0 0
## 6 51 0 0 0
## age factor(stage)2 factor(stage)3 factor(stage)4
## 0.0002064493 0.2140166058 0.1264874601 0.1840017964
## [1] -149.8360 -140.0512
From the Bayesian point of view, the spbp class contains
posterior summary statistics such as the mode, median, mean and standard
deviation, along with 95% HPD interval based on the posterior density.
Note that the arguments passed after data are considered
Stan specific control parameters. For instance, the
argument chain allows to choose the number of chains in the
MCMC . Other settings such as iter and warmup
are also flexible and might be set at convenience. The user can simply
type in the R console the code
??rstan::sampling” for help. The following R
console outcome refers to the Bayesian estimation for the
larynx data set:
fit <- spsurv::spbp(Surv(time, delta)~age + factor(stage),
approach = "bayes", data = larynx,
iter = 2000, chains = 1, warmup = 1000, cores = 1)##
## SAMPLING FOR MODEL 'spbp' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 2.7e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.27 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 2.201 seconds (Warm-up)
## Chain 1: 2.03 seconds (Sampling)
## Chain 1: 4.231 seconds (Total)
## Chain 1:
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
## Warning:
## 3 (3.3%) p_waic estimates greater than 0.4. We recommend trying loo instead.
As with the ML estimation, the summary method is extended to the
spsurv::spbp class when applying to a Bayesian fit. Along
with the regression estimates, this output also contains descriptive
statistics for the posterior hazard ratio denoted by _exp
(in the console output) and the diagnosis statistics from the
loo package. The effective sample size n_eff
gives an estimate of the independent draws from the posterior
distribution, and Rhat referred to as the potential scale
reduction statistic, is one of the useful ways to monitor whether a
chain has converged to the equilibrium distribution. This statistic
measures the ratio between the average variation of the samples within
each chain and the variation of the combined samples in the chains; if
the chains have not converged to a common distribution, this statistic
will be greater than one. It is worth noting that all, the information
provided by the Stan output, including warnings, is passed
to the final user. One can have access to the stanfit
object with the fit$stanfit command. In particular, one can
have access to built-in plot functions and even to a shiny
app (details in https://shiny.rstudio.com/
developed by Stan developer’s team. The summary outcome is
as follows:
## Bayesian Bernstein Polynomial based Proportional Hazards model
##
## Call:
## spbp.default(formula = Surv(time, delta) ~ age + factor(stage),
## data = larynx, approach = "bayes", cores = 1, iter = 2000,
## chains = 1, warmup = 1000, model = "ph")
##
## n= 0, number of events= 0
##
## mode mean se_mean sd 50% n_eff Rhat lowerHPD
## age 0.0213 0.020 0.000484 0.0146 0.0202 909 0.999 -0.00872
## factor(stage)2 0.3173 0.147 0.016925 0.4520 0.1716 713 1.001 -0.72621
## factor(stage)3 0.6173 0.663 0.012350 0.3594 0.6573 847 1.000 -0.05280
## factor(stage)4 1.6345 1.792 0.014202 0.4338 1.7969 933 1.007 0.91792
## upperHPD
## age 0.0487
## factor(stage)2 1.0632
## factor(stage)3 1.3445
## factor(stage)4 2.5967
## ---
## mean_exp median_exp sd_exp lowerHPD_exp upperHPD_exp
## age 1.02 1.02 0.0149 0.991 1.05
## factor(stage)2 1.28 1.19 0.5668 0.393 2.49
## factor(stage)3 2.07 1.93 0.7611 0.876 3.69
## factor(stage)4 6.59 6.03 2.9427 2.046 12.84
## ---
##
## Computed from 1000 by 90 log-likelihood matrix.
##
## Estimate SE
## elpd_waic -149.9 9.6
## p_waic 9.6 1.0
## waic 299.7 19.1
##
## 3 (3.3%) p_waic estimates greater than 0.4. We recommend trying loo instead.
##
## Computed from 1000 by 90 log-likelihood matrix.
##
## Estimate SE
## elpd_loo -150.0 9.6
## p_loo 9.7 1.0
## looic 300.0 19.1
## ------
## MCSE of elpd_loo is NA.
## MCSE and ESS estimates assume independent draws (r_eff=1).
##
## Pareto k diagnostic values:
## Count Pct. Min. ESS
## (-Inf, 0.67] (good) 89 98.9% 437
## (0.67, 1] (bad) 1 1.1% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
The next code chunk shows the code for trace and density plotting and
to give access to the shiny app from the shinystan package
shinystan. Figures illustrate the trace plot and the
density plot of the BPPH for the larynx data set. The
graphs show unimodal posterior densities and well behaved chains with
good mixing, this is a good behavior indication.
Not least, a S3 method had to be created rather than extended. The
survivor method was created to accomplish the calculation
of the survival function evaluated in each time point. The goal is
similar to the survival::survfit S3 method, that could be
extended instead. The difference is that spbp classes
allows both Bayesian and frequentist approaches. The following code was
used to generate Figure:
## CoxPH model
fitcoxph <- survival::coxph(Surv(time , delta)~age + factor(stage),
data = larynx)
## Determine the groups of patients
newdata <- data.frame(age =c(77,77,77,77), stage = c(1,2,3,4))
## survfit Breslow estimator
breslowsurv <- survival::survfit(fitcoxph, newdata = newdata)
## spbp point-wise estimate
spbpsurv <- spsurv::survivor(fit, newdata = newdata)
plot(breslowsurv, bty = "n", lwd = 3, main = "77 years old patient survival per Stage")
points(spbpsurv$time, spbpsurv$survival1, col = 1, pch = 23)
points(spbpsurv$time, spbpsurv$survival2, col = 2, pch = 23)
points(spbpsurv$time, spbpsurv$survival3, col = 3, pch = 23)
points(spbpsurv$time, spbpsurv$survival4, col = 4, pch = 23)
legend("topright", c("Stage I", "Stage II", "Stage III", "Stage IV"), pch = 23, bty = "n", col = 1:4)The content of this vignette introduces the spsurv
package. Here, the analysis was dedicated to illustrating, in practice,
the commands implemented in the proposed package spsurv. We still have
work to do to improve and update this tool, however, the present version
is ready for the main statistical study in the field of survival
analysis. The routines presented in this dissertation are unprecedented.
Many efforts with regard to the instruction the routines documentation
were carried out concurrently with the spsurv package implementation.
This document is part of the content submitted to CRAN. The package is also in
public use and is available at the github development platform, the link
is: https://github.com/rvpanaro/spsurv.