merlin

Description

The merlin package allows the fitting of multi-outcome models and with any number of nested random effects, outcomes can be modelled jointly, or with shared random effects.

Missing data

For n record to be included in the model there must be at least one observation per level in the model. For example in a joint survival and longitudinal biomarker model, if the survival time for a patient is available, but all longitudinal biomarker observations are missing, the individual will be excluded from the analysis.

For each individual outcome in the model, a complete case analysis is used. The response, covariates and level indicators are required to fit the model, which may vary by outcome. For example if a model simultaneously models two separate biomarkers, which are sampled at different time points, data in each record will only be included in the model for the appropriate outcome, potentially resulting in different sample sizes being used for each outcome.

Variance-covariance matrix structure

There is a separate variance-covariance matrix for each level of random effects. The structure of the variance-covariance matrix can be varied to allow for correlation between parameters at each level. The possible structures are identity, where all diagonal elements are estimated and constrained to be the same, diagonal where all diagonal elements are estimated uniquely (the default), unstructured where all elements of the variance-covariance matrix are estimated, and exchangeable which assumes a common variance and a common covariance.

Predictions

Predictions using the fixed-effects only can be found using the predict() function with the argument type = fixedonly. Marginal predictions, where the random effects are integrated out, can be obtained using type = marginal.

Examples

In order to illustrate the potential uses of merlin a number of increasingly advanced models have been fitted to the commonly used (in the area of joint modelling of longitudinal and survival data) primary biliary cirrhosis dataset.

This data set needs some re-formatting in order to fit joint models

library(merlin)
data("pbc")
pbc[1:11,c("id","years","status","status2","drug","serBilir","prothrombin","year")]
#>    id    years status status2      drug serBilir prothrombin      year
#> 1   1  1.09517   dead       1 D-penicil     14.5        12.2 0.0000000
#> 2   1  1.09517   dead       1 D-penicil     21.3        11.2 0.5256817
#> 3   2 14.15234  alive       0 D-penicil      1.1        10.6 0.0000000
#> 4   2 14.15234  alive       0 D-penicil      0.8        11.0 0.4983025
#> 5   2 14.15234  alive       0 D-penicil      1.0        11.6 0.9993429
#> 6   2 14.15234  alive       0 D-penicil      1.9        10.6 2.1027270
#> 7   2 14.15234  alive       0 D-penicil      2.6        11.3 4.9008871
#> 8   2 14.15234  alive       0 D-penicil      3.6        11.5 5.8892783
#> 9   2 14.15234  alive       0 D-penicil      4.2        11.5 6.8858833
#> 10  2 14.15234  alive       0 D-penicil      3.6        11.5 7.8907020
#> 11  2 14.15234  alive       0 D-penicil      4.6        11.5 8.8325485

The event times are given in the years variable and the event indicator in the status variable which can take the values alive, dead or transplanted or the status2 variable which is 0 for alive and 1 for dead. Each survival observation should only appear once for each individual, otherwise each observation will be treated as a separate event, this allows for recurrent events to be modelled. The survival observations can be reformatted as follows:

pbc$stime <- pbc$years
pbc$stime[duplicated(pbc$id)] <- NA
pbc$died <- pbc$status2
pbc$died[duplicated(pbc$id)] <- NA

We are also going to work with the log of biomarkers prothrombin index and serum bilirubin throughout, the time of this measurements will be recorded in the variable time. We will also change the treatment variable to be numeric rather than a factor variable.

pbc$logb <- log(pbc$serBilir) 
pbc$logp <- log(pbc$prothrombin)
pbc$time <- pbc$year
pbc$trt <- as.numeric(pbc$drug) - 1

The data now looks like this. Failing to set the data up in this way will lead to errors in the parameter estimates.

pbc[1:11,c("id","stime","died","logb","logp","time")]
#>    id    stime died        logb     logp      time
#> 1   1  1.09517    1  2.67414865 2.501436 0.0000000
#> 2   1       NA   NA  3.05870707 2.415914 0.5256817
#> 3   2 14.15234    0  0.09531018 2.360854 0.0000000
#> 4   2       NA   NA -0.22314355 2.397895 0.4983025
#> 5   2       NA   NA  0.00000000 2.451005 0.9993429
#> 6   2       NA   NA  0.64185389 2.360854 2.1027270
#> 7   2       NA   NA  0.95551145 2.424803 4.9008871
#> 8   2       NA   NA  1.28093385 2.442347 5.8892783
#> 9   2       NA   NA  1.43508453 2.442347 6.8858833
#> 10  2       NA   NA  1.28093385 2.442347 7.8907020
#> 11  2       NA   NA  1.52605630 2.442347 8.8325485

We can fit a simple linear model

lin.model <- merlin(logb ~ time, family = "gaussian", data = pbc)
lin.model
#> Merlin: mixed-effects model
#> Data: pbc 
#> 
#> Coefficients:
#>           time           _cons  log_sd(resid.)  
#>        0.01392         0.55956         0.10354
summary(lin.model)
#> Mixed effects regression model
#> Log likelihood = -2961.414
#> 
#>                 Estimate Std. Error      z Pr(>|z|) [95% Conf. Interval]
#> time            0.013916   0.008128  1.712  0.08687  -0.002014  0.029846
#> _cons           0.559560   0.035806 15.628  0.00000   0.489382  0.629739
#> log_sd(resid.)  0.103540   0.016032  6.458  0.00000   0.072119  0.134962

We can add flexibility to the model using restricted cubic splines:

rcs.model <- merlin(logb ~ rcs(time, df = 3), family = "gaussian", timevar = "time", data = pbc)
summary(rcs.model)
#> Mixed effects regression model
#> Log likelihood = -2960.733
#> 
#>                 Estimate Std. Error      z Pr(>|z|) [95% Conf. Interval]
#> rcs():1         0.043260   0.025147  1.720   0.0854  -0.006028  0.092548
#> rcs():2         0.028845   0.025148  1.147   0.2514  -0.020443  0.078133
#> rcs():3        -0.009251   0.025147 -0.368   0.7130  -0.058539  0.040037
#> _cons           0.603291   0.025147 23.990   0.0000   0.554003  0.652579
#> log_sd(resid.)  0.103511   0.016037  6.455   0.0000   0.072079  0.134943

By default, the restricted cubic splines are orthogonalised (orthog = TRUE). The serum bilirubin observations are clustered within individuals, so we can add a normally-distributed random intercept term named M1. The coefficient of the random-effect term will normally constrained to 1 using the *1 notation, unless the random effect is being shared with another outcome model:

r.int.model <- merlin(logb ~ rcs(time, df = 3) + M1[id]*1,
                      family = "gaussian",
                      levels = "id",
                      timevar = "time",
                      data = pbc)
summary(r.int.model)
#> Mixed effects regression model
#> Log likelihood = -1978.61
#> 
#>                 Estimate Std. Error       z Pr(>|z|) [95% Conf. Interval]
#> rcs():1         0.266362   0.013699  19.444  0.00000   0.239513  0.293211
#> rcs():2         0.058592   0.013014   4.502  0.00001   0.033086  0.084098
#> rcs():3         0.016696   0.012695   1.315  0.18845  -0.008186  0.041578
#> _cons           0.614076   0.023175  26.498  0.00000   0.568655  0.659498
#> log_sd(resid.) -0.611477   0.016776 -36.450  0.00000  -0.644357 -0.578597
#> log_sd(M1)     -0.127962   0.019616  -6.523  0.00000  -0.166409 -0.089515
#> 
#> Integration method: Non-adaptive Gauss-Hermite quadrature 
#> Integration points: 7

We can also add an additional random-slope term (M2) to the model by forming an interaction between the time variable and random-effect using :. We can increase the number of quadrature nodes to improve estimation of the likelihood using the ip option within the control argument.

r.slope.model <- merlin(logb ~ rcs(time, df = 3) + M1[id]*1 + time:M2[id]*1,
                        family = "gaussian",
                        timevar = "time",
                        levels = "id",
                        data = pbc,
                        control = list(ip = 15))
summary(r.slope.model)
#> Mixed effects regression model
#> Log likelihood = -1718.679
#> 
#>                 Estimate Std. Error       z Pr(>|z|) [95% Conf. Interval]
#> rcs():1        -0.024415   0.016517  -1.478  0.13935  -0.056787  0.007957
#> rcs():2        -0.020279   0.010947  -1.852  0.06396  -0.041736  0.001177
#> rcs():3        -0.025302   0.009456  -2.676  0.00745  -0.043835 -0.006769
#> _cons           0.234964   0.019792  11.872  0.00000   0.196172  0.273756
#> log_sd(resid.) -1.017033   0.017687 -57.502  0.00000  -1.051698 -0.982367
#> log_sd(M1)     -0.276734   0.021426 -12.916  0.00000  -0.318728 -0.234740
#> log_sd(M2)     -1.432000   0.023614 -60.642  0.00000  -1.478283 -1.385717
#> 
#> Integration method: Non-adaptive Gauss-Hermite quadrature 
#> Integration points: 15

Initial estimates

If a model has random effects, merlin will fit fixed effects models to obtain starting values and then assume an identity matrix for the variance of the random effects. Initial values can be manually set using the from option.