DynForest
packageDynForest
methodology was implemented into the R package
DynForest
(Devaux 2024)
freely available on The Comprehensive R Archive Network (CRAN) to
users.
The package includes two main functions: dynforest()
and
predict()
for the learning and the prediction steps. These
functions are fully described in section 3.1 and 3.2. Other functions
available are briefly described in the table below. These functions are
illustrated in examples, one for a survival outcome, one for a
categorical outcome and one for a continuous outcome.
Function | Description |
---|---|
Learning and prediction steps | |
dynforest() |
Function that builds the random forest |
predict() |
Function for S3 class dynforest predicting the outcome
on new subjects using the individual-specific information |
Assessment function | |
compute_ooberror() |
Function that computes the Out-Of-Bag error to be minimized to tune the random forest |
Exploring functions | |
compute_vimp() |
Function that computes the importance of variables |
compute_gvimp() |
Function that computes the importance of a group of variables |
compute_vardepth() |
Function that extracts information about the tree building process |
plot() functions for S3 class: | |
dynforest |
Plot the estimated CIF for given tree nodes or subjects |
dynforestpred |
Plot the predicted CIF for the cause of interest for given subjects |
dynforestvimp |
Plot the importance of variables by value or percentage |
dynforestgvimp |
Plot the importance of a group of variables by value or percentage |
dynforestvardepth |
Plot the minimal depth by predictors or features |
Other functions | |
summary() |
Function for class S3 dynforest or
dynforestoob displaying information about the type of
random forest, predictors included, parameters used, Out-Of-Bag error
(only for dynforestoob class) and brief summaries about the
leaves |
print() |
Function to print object of class dynforest ,
dynforestoob , dynforestvimp ,
dynforestgvimp , dynforestvardepth and
dynforestpred |
get_tree() |
Function that extracts the tree structure for a given tree |
get_treenode() |
Function that extracts the terminal node identifiers for a given tree |
dynforest()
functiondynforest()
is the function to build the random forest.
The call of this function is:
dynforest(timeData = NULL, fixedData = NULL, idVar = NULL,
timeVar = NULL, timeVarModel = NULL, Y = NULL,
ntree = 200, mtry = NULL, nodesize = 1, minsplit = 2, cause = 1,
nsplit_option = "quantile", ncores = NULL,
seed = 1234, verbose = TRUE)
timeData
is an optional argument that contains the
dataframe in longitudinal format (i.e., one observation per row) for the
time-dependent predictors. In addition to time-dependent predictors,
this dataframe should include a unique identifier and the measurement
times. This argument is set to NULL
if no time-dependent
predictor is included. Argument fixedData
contains the
dataframe in wide format (i.e., one subject per row) for the time-fixed
predictors. In addition to time-fixed predictors, this dataframe should
also include the same identifier as used in timeData.
This
argument is set to NULL
if no time-fixed predictor is
included. Argument idVar
provides the name of identifier
variable included in timeData
and fixedData
dataframes. Argument timeVar
provides the name of time
variable included in timeData
dataframe. Argument
timeVarModel
contains as many lists as time-dependent
predictors defined in timeData
to specify the structure of
the mixed models assumed for each predictor. For each time-dependent
predictor, the list should contain a fixed
and a
random
argument to define the formula of a mixed model to
be estimated with lcmm
R package (Proust-Lima, Philipps, and Liquet 2017).
fixed
defines the formula for the fixed-effects and
random
for the random-effects (e.g.,
list(Y1 = list(fixed = Y1 ~ time, random = ~ time))
.
Argument Y
contains a list of two elements
type
and Y
. Element type
defines
the nature of the outcome (surv
for survival outcome with
possibly competing causes, numeric
for continuous outcome
and factor
for categorical outcome) and element
Y
defines the dataframe which includes the identifier (same
as in timeData
and fixedData
dataframes) and
outcome variables.
Arguments ntree
, mtry
,
nodesize
and minsplit
are the hyperparameters
of the random forest. Argument ntree
controls the number of
trees in the random forest (200 by default). Argument mtry
indicates the number of variables randomly drawn at each node (square
root of the total number of predictors by default). Argument
nodesize
indicates the minimal number of subjects allowed
in the leaves (1 by default). Argument minsplit
controls
the minimal number of events required to split the node (2 by
default).
For survival outcome, argument cause
indicates the event
the interest. Argument nsplit_option
indicates the method
to build the two groups of individuals at each node. By default, we
build the groups according to deciles (quantile
option) but
they could be built according to random values (sample
option).
Argument ncores
indicates the number of cores used to
grow the trees in parallel mode. By default, we set the number of cores
of the computer minus 1. Argument seed
specifies the random
seed. It can be fixed to replicate the results. Argument
verbose
allows to display a progression bar during the
execution of the function.
dynforest()
function returns an object of class
dynforest
containing several elements:
data
a list with longitudinal predictors
(Longitudinal
element), continuous predictors
(Numeric
element) and categorical predictors
(Factor
element)rf
is a dataframe with one column per tree containing a
list with several elements, which includes:
leaves
the leaf identifier for each subject used to
grow the treeidY
the identifiers for each subject used to grow the
treeV_split
the split summary (more detailed below)Y_pred
the estimated outcome in each leafmodel_param
the estimated parameters of the mixed model
for the longitudinal predictors used to split the subjects at each
nodeYtype
, hist_nodes
, Y
,
boot
and Ylevels
internal information used in
other functionstype
the nature of the outcometimes
the event times (only for survival outcome)cause
the cause of interest (only for survival
outcome)causes
the unique causes (only for survival
outcome)Inputs
the list of predictors names for
Longitudinal
(longitudinal predictor),
Continuous
(continuous predictor) and Factor
(categorical predictor)Longitudinal.model
the mixed model specification for
each longitudinal predictorparam
a list of hyperparameters used to grow the random
forestcomput.time
the computation timeThe main information returned by rf
is
V_split
element which can also be extract using
get_tree()
function. This element contains a table sorted
by the node/leaf identifier (id_node
column) with each row
representing a node/leaf. Each column provides information about the
splits:
type
: the nature of the predictor
(Longitudinal
for longitudinal predictor,
Numeric
for continuous predictor or Factor
for
categorical predictor) if the node was split, Leaf
otherwise;var_split
: the predictor used for the split defined by
its order in timeData
and fixedData
;feature
: the feature used for the split defined by its
position in random statistic;threshold
: the threshold used for the split (only with
Longitudinal
and Numeric
). No information is
returned for Factor
;N
: the number of subjects in the node/leaf;Nevent
: the number of events of interest in the
node/leaf (only with survival outcome);depth
: the depth level of the node/leaf.dynforest()
function internally calls other functions
from related packages to build the random forest:
hlme()
function (from lcmm
package (Proust-Lima, Philipps, and Liquet 2017)) to fit
the mixed models for the time-dependent predictors defined in
timeData
and timeVarModel
argumentsEntropy()
function (from base package) to compute the
Shannon entropysurvdiff()
function (from survival
package
(Therneau 2022)) to compute the log-rank
statistic testcrr()
function (from cmprsk
package (Gray 2020)) to compute the Fine & Gray
statistic testpredict()
functionpredict()
is the S3 function for class
dynforest
to predict the outcome on new subjects. Landmark
time can be specified to consider only longitudinal data collected up to
this time to compute the prediction. The call of this function is:
Argument object
contains a dynforest
object
resulting from dynforest()
function. Argument
timeData
contains the dataframe in longitudinal format
(i.e., one observation per row) for the time-dependent predictors of new
subjects. In addition to time-dependent predictors, this dataframe
should also include a unique identifier and the time measurements. This
argument can be set to NULL
if no time-dependent predictor
is included. Argument fixedData
contains the dataframe in
wide format (i.e., one subject per row) for the time-fixed predictors of
new subjects. In addition to time-fixed predictors, this dataframe
should also include an unique identifier. This argument can be set to
NULL
if no time-fixed predictor is included. Argument
idVar
provides the name of the identifier variable included
in timeData
and fixedData
dataframes. Argument
timeVar
provides the name of time-measurement variable
included in timeData
dataframe. Argument t0
defines the landmark time; only the longitudinal data collected up to
this time are to be considered. This argument should be set to
NULL
to include all longitudinal data.
predict()
function returns several elements:
t0
the landmark time defined in argument
(NULL
by default)times
times used to compute the individual predictions
(only with survival outcome). The times are defined according to the
time-to-event subjects used to build the random forest.pred_indiv
the predicted outcome for the new subject.
With survival outcome, predictions are provided for each time defined in
times
element.pred_leaf
a table giving for each tree (in column) the
leaf in which each subject is assigned (in row)pred_indiv_proba
the proportion of the trees leading to
the category prediction for each subject (only with categorical
outcome)