DynForest
with
continuous outcome?data_simu1
and
data_simu2
datasetsIn this section, we present an illustration of DynForest
with a continuous outcome. DynForest
was used on a
simulated dataset with 200 subjects and 10 predictors (6 time-dependent
and 4 time-fixed predictors). The 6 longitudinal predictors were
generated using a linear mixed model with linear trajectory according to
time. We considered 6 measurements by subject (at baseline and then
randomly drawn around theoretical annual visits up to 5 years). Then, we
generated the continuous outcome using a linear regression with the
random intercept of marker 1 and random slope of marker 2 as linear
predictors. We generated two datasets (data_simu1
and
data_simu2
), one for each step (training and prediction).
These datasets are available in the DynForest
package.
library("DynForest")
data(data_simu1)
head(data_simu1)
#> id time cont_covar1 cont_covar2 bin_covar1 bin_covar2 marker1
#> 1 1 0.000000 -1.749785 -1.003791 0 0 2.2335135
#> 2 1 1.202289 -1.749785 -1.003791 0 0 -0.3215715
#> 3 1 2.347419 -1.749785 -1.003791 0 0 -4.7698038
#> 4 1 3.068631 -1.749785 -1.003791 0 0 -8.4552464
#> 5 1 4.233713 -1.749785 -1.003791 0 0 -13.8477022
#> 6 1 5.100951 -1.749785 -1.003791 0 0 -15.0028533
#> marker2 marker3 marker4 marker5 marker6 Y_res
#> 1 5.223892 3.1760262 5.273218 5.812561 4.965971 -10.80882
#> 2 8.101329 -0.7506024 6.333628 7.297978 0.452370 -10.80882
#> 3 14.533583 -0.9671289 8.487332 10.833834 -1.986883 -10.80882
#> 4 16.789004 -0.3502227 9.695727 13.149591 -2.932720 -10.80882
#> 5 20.209609 -2.1694950 12.568234 15.442259 -6.615383 -10.80882
#> 6 25.862403 -7.8668698 11.540265 17.758242 -8.698395 -10.80882
data(data_simu2)
head(data_simu2)
#> id time cont_covar1 cont_covar2 bin_covar1 bin_covar2 marker1 marker2
#> 1 1 0.000000 -1.335695 -0.7260684 1 1 2.9558630 3.112574
#> 2 1 1.289630 -1.335695 -0.7260684 1 1 0.3125889 4.469843
#> 3 1 2.251666 -1.335695 -0.7260684 1 1 -2.3155137 5.426203
#> 4 1 3.119182 -1.335695 -0.7260684 1 1 -2.2991755 7.043745
#> 5 1 4.007813 -1.335695 -0.7260684 1 1 -3.0584749 7.905586
#> 6 1 5.075597 -1.335695 -0.7260684 1 1 -4.9786537 7.268507
#> marker3 marker4 marker5 marker6 Y_res
#> 1 0.5024613 7.254809 3.27673801 1.722360 1.758953
#> 2 2.7152092 6.570345 1.12816586 5.113413 1.758953
#> 3 2.7096358 9.545796 0.08583794 5.006489 1.758953
#> 4 2.2938817 9.844153 -0.53377189 9.649994 1.758953
#> 5 1.8877637 12.749989 -1.80312531 9.900544 1.758953
#> 6 1.0021239 12.522054 -2.92588781 11.557506 1.758953
First of all, we load the data and we build the mandatory objects
needed to execute dynforest()
function that are
timeData_train
for time-dependent predictors and
fixedData_train
for time-fixed predictors. We specify the
model for the longitudinal predictors in timeVarModel
object. We considered linear trajectories over time for the 6
longitudinal predictors.
timeData_train <- data_simu1[,c("id","time",
paste0("marker",seq(6)))]
timeVarModel <- lapply(paste0("marker",seq(6)),
FUN = function(x){
fixed <- reformulate(termlabels = "time",
response = x)
random <- ~ time
return(list(fixed = fixed, random = random))
})
fixedData_train <- unique(data_simu1[,c("id",
"cont_covar1","cont_covar2",
"bin_covar1","bin_covar2")])
To define the Y
object for a continuous outcome, the
type
argument should be set to numeric
to run
the random forest in regression mode. The dataframe Y
should include two columns with the unique identifier id
and the continuous outcome, here Y_res
.
To build the random forest, we chose default hyperparameters (i.e.,
ntree
= 200 and nodesize
= 1), except for
mtry
which was fixed at its maximum (i.e.,
mtry
= 10). We ran dynforest()
function with
the following code:
For continuous outcome, the OOB prediction error is evaluated using
the mean square error (MSE). We used compute_ooberror()
function to compute the OOB prediction error and we provided overall
results with summary()
function as shown below:
summary(res_dyn_OOB)
dynforest executed for continuous outcome
Splitting rule: Minimize weighted within-group variance
Out-of-bag error type: Mean square error
Leaf statistic: Mean
----------------
Input
Number of subjects: 200
Longitudinal: 6 predictor(s)
Numeric: 2 predictor(s)
Factor: 2 predictor(s)
----------------
Tuning parameters
mtry: 10
nodesize: 1
ntree: 200
----------------
----------------
dynforest summary
Average depth per tree: 9.06
Average number of leaves per tree: 126.47
Average number of subjects per leaf: 1
----------------
Out-of-bag error based on Mean square error
Out-of-bag error: 4.3713
----------------
Computation time
Number of cores used: 7
Time difference of 8.261093 mins
----------------
The random forest was executed in regression mode (for a continuous
outcome). The splitting rule aimed to minimize the weighted within-group
variance. We built the random forest using 200 subjects and 10
predictors (6 time-dependent and 4 time-fixed predictors) with
hyperparameters ntree
= 200, mtry
= 10 and
nodesize
= 1. As we can see, nodesize
= 1
leads to deeper trees (the average depth by tree is 9.1) and a single
subject by leaf. We obtained 4.4 for the MSE. This quantity can be
minimized by tuning hyperparameters mtry
and
nodesize
.
In regression mode, the tree and leaf-specific means are averaged
across the trees to get a unique prediction over the random forest.
predict()
function provides the individual predictions. We
first created the timeData
and fixedData
from
the testing sample data_simu2
. We then predicted the
continuous outcome by running predict()
function:
timeData_pred <- data_simu2[,c("id","time",
paste0("marker",seq(6)))]
fixedData_pred <- unique(data_simu2[,c("id","cont_covar1","cont_covar2",
"bin_covar1","bin_covar2")])
pred_dyn <- predict(object = res_dyn,
timeData = timeData_pred,
fixedData = fixedData_pred,
idVar = "id", timeVar = "time")
Individual predictions can be extracted using print()
function:
For instance, we predicted 5.22 for subject 1, -1.28 for subject 2 and 0.86 for subject 3.
In this illustration, we want to evaluate if can identify the true predictors (i.e., random intercept of marker1 and random slope of marker2). To do this, we used the minimal depth which allows to understand the random forest at the feature level.
Minimal depth information can be extracted using
compute_vardepth()
function and can be displayed with
plot()
function. For the purpose of this illustration, we
displayed the minimal depth in Figure 1 by predictor and by feature.
depth_dyn <- compute_vardepth(dynforest_obj = res_dyn)
p1 <- plot(depth_dyn, plot_level = "predictor")
p2 <- plot(depth_dyn, plot_level = "feature")
We observe in figure @ref(fig:DynForestRdepthscalar)A that marker2 and marker1 have the lowest minimal depth, as expected. To go further, we also looked into the minimal depth computed on features. We perfectly identified the random slope of marker2 (i.e., marker2.bi1) and the random intercept of marker1 (i.e., marker1.bi0) as the predictors in this simulated dataset.