---
title: "Introduction to `DynForest` methodology"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to `DynForest` methodology}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
preamble: >
  \usepackage{amsmath}
  \usepackage{graphicx}
  \usepackage{tabularx}
bibliography: refs.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

`DynForest` is a random forest methodology which can include both time-fixed predictors of any nature and time-dependent predictors possibly measured at irregular times. The purpose of `DynForest` is to predict an outcome which can be categorical, continuous or survival (with possibly competing events).

The random forest should first be built on a learning dataset of $N$ subjects including: $Y$ the outcome; $\mathcal{M}_x$ an ensemble of $P$ time-fixed predictors; $\mathcal{M}_z$ an ensemble of $Q$ time-dependent predictors. The random forest consists of an ensemble of $B$ trees which are grown as detailed below.

```{r dynforestRgraph, eval = TRUE, echo = FALSE, out.width = "80%", out.height = "30%", fig.cap = "Figure 1: Overall scheme of the tree building in DynForest with (A) the tree structure, (B) the node-specific treatment of time-dependent predictors to obtain time-fixed features, (C) the dichotomization of the time-fixed features, (D) the splitting rule."}
knitr::include_graphics("Figures/dynforestR_graph.png")
```

## The tree building

The tree building process, summarized in Figure 1, aims to recursively partition the subjects into groups/nodes that are the most homogeneous regarding the outcome $Y$. 

For each tree $b$ $(b = 1, ..., B)$, we first draw a bootstrap sample from the original dataset of $N$ subjects ($N$ draws among the $N$ subjects with replacement). The subjects excluded by the bootstrap constitute the out-of-bag (OOB) sample, noted $OOB^b$ for tree $b$. At each node $d \in \mathcal{D}^b$ of the tree, we recursively repeat the following steps using the $N^{(d)}$ subjects located at node $d$:

1. An ensemble of $\mathcal{M}^{(d)}=\{\mathcal{M}_x^{(d)},\mathcal{M}_z^{(d)}\}$ candidate predictors are randomly selected among $\{\mathcal{M}_x,\mathcal{M}_z\}$ (see Figure 1B). The size of $\mathcal{M}^{(d)}$ is defined by the hyperparameter $mtry$.

2. For each time-dependent predictor in $\mathcal{M}_z^{(d)}$:
    a. We independently model the trajectory of the predictor using a flexible linear mixed model [@laird_random_effects_1982] according to time (the specification of the model is defined by the user). It is defined as: $Z_{ij} = X1_{ij}(t_{ij})\beta + X2_{ij}(t_{ij})b_i + \epsilon_{ij}$ where $Z_{ij}$ is the value of predictor $Z$ for subject $i$ at occasion $j$, $X1_{ij}$ and $X2_{ij}$ are vectors of time functions and covariates to be specified by the user and $t_{ij}$ is the time at occasion $j$ for subject $i$. Beta are population effects, and $b_i$ individual random effects which follow a multivariate normal distribution. $\epsilon_{ij}$ are the zero-mean independent Gaussian errors of measurement. 
    b. We predict the vector of random-effects $\hat{b}_i$ using the available information of individual $i = 1, ..., N^{(d)}$. $\hat{b}_i$ constitute time-independent features summarizing each time-dependent predictors. We thus derive the ensemble $\mathcal{M}_{z \star}^{(d)}$ for all the variables in $\mathcal{M}_z^{(d)}$. 
    
3. We define $\mathcal{M}_\star^{(d)}=\{\mathcal{M}_x^{(d)},\mathcal{M}_{z \star}^{(d)}\}$ our new ensemble of candidate features.

4. For each candidate feature $W \in \mathcal{M}_\star^{(d)}$:
    a. We build a series of splits $c_W^{(d)}$ according to the feature values if continuous, or subsets of categories otherwise (see Figure 1C), leading each time to two groups.
    b. We quantify the distance between the two groups according to the nature of $Y$:
        * If $Y$ continuous: we compute the weighted within-group variance with the proportion of subjects in each group as weights 
        * If $Y$ categorical: we compute the weighted within-group Shannon entropy [@shannon_mathematical_1948] (i.e., the amount of uncertainty) with the proportion of subjects in each group as weights 
        * If $Y$ survival \underline{without} competing events: we compute the log-rank statistic test [@peto_asymptotically_1972]
        * If $Y$ survival \underline{with} competing events: we compute the Fine \& Gray statistic test [@gray_class_1988]

5. We split the subjects into the two groups that minimize (for continuous and categorical outcome) or maximize (for survival outcome) the quantity defined previously. We denote $\{W_0^d,c_0^d\}$ the optimal couple used to split the subjects and assign them to the left and right daughter nodes $2d$ and $2d + 1$, respectively (see Figure 1D and A).

6. Step 1 to 5 are iterated on the daughter nodes until stopping criteria are met. 

We define two stopping criteria: `nodesize` the minimal number of subjects in a node required to reiterate the split and `minsplit` the minimal number of events required to split the node. `minsplit` is only defined with survival outcome. In the following, we call leaves the terminal nodes.

In each leaf $h \in \mathcal{H}^b$ of tree $b$, a summary $\pi^{h^b}$ is computed using the individuals belonging to the leaf. The leaf summary is defined according to the outcome:

* the mean, for $Y$ continuous
* the category with the highest probability, for $Y$ categorical
* the cumulative incidence function over time computed using the Nelson-Aalen cumulative hazard function estimator [@nelson_hazard_1969; @aalen_nonparametric_1976], for $Y$ single cause time-to-event
* the cumulative incidence function over time computed using the non-parametric Aalen-Johansen estimator [@aalen_empirical_1978], for $Y$ time-to-event with multiple causes

NOTE: Step 2 involves the estimation of a parametric mixed model for each time-dependent predictor. The specification of this model should be carefully determined in preliminary analyses by the user, keeping in mind that there should be a trade-off between goodness-of-fit and parameter parsimony. The critical point is to adequately specify the trajectory shape over time at the population and individual levels. Different bases of time functions can be considered (e.g., fractional polynomials, splines, adhoc) and compared in terms of fit.

## Individual prediction of the outcome

### Out-Of-Bag individual prediction

The overall OOB prediction $\hat{\pi}_{ \star}$ for a subject $\star$ can be computed by averaging the tree-based predictions of $\star$ over the random forest as follows:
\begin{equation}
  \hat{\pi}_{ \star} = \frac{1}{|\mathcal{O}_\star|} \sum_{b \in \mathcal{O}_\star} \hat{\pi}^{h_\star^b}
\end{equation}
where $\mathcal{O}_\star$ is the ensemble of trees where $\star$ is $OOB$ and $|\mathcal{O}_\star|$ denotes its cardinal The prediction $\hat{\pi}^{h_\star^b}$ is obtained by dropping down subject $\star$ along tree $b$. At each node $d \in \mathcal{D}^b$, the subject $\star$ is assigned to the left or right node according to his/her data and the optimal couple $\{W_0^d,c_0^d\}$. $W_0^d$ is a random-effect feature, its value for $\star$ is predicted from the individual repeated measures using the estimated parameters from the linear mixed model.

### Individual dynamic prediction from a landmark time

With a survival outcome, the OOB prediction described in the previous paragraph can be extended to compute the individual probability of event from a landmark time $s$ by exploiting the repeated measures of subject $\star$ only until $s$. For a new subject $\star$, we thus define the individual probability of event, noted $\hat{\pi}_{\star}(s,t)$, as the probability of experiencing the event by time $s+t$ given the information prior to landmark time $s$:
\begin{equation}
  \hat{\pi}_{\star}(s,t) = \frac{1}{B} \sum_{b=1}^B \hat{\pi}^{h_\star^b}(s,t)
\end{equation}
where $\hat{\pi}^{h_\star^b}(s,t)$ is the tree-based probability of event at time $s+t$ computed by dropping down $\star$ along the tree by considering longitudinal predictors collected until $s$ and time-fixed predictors. Note that any horizon $t$ can be considered provided $s+t$ remains in the time window on which the random forest was trained. In the `DynForest` package, by default, the probability is computed at all the observed event times after the landmark time.

## Out-Of-Bag prediction error

Using the OOB individual predictions, an OOB prediction error can be internally assessed. The OOB prediction error quantifies the difference between the observed and the predicted values. It is defined according to the nature of $Y$ as:

* for $Y$ continuous, the mean square error (MSE) defined by:

\begin{equation}
  errOOB = \frac{1}{N} \sum_{i=1}^N ( \hat{\pi}_i - \pi_i^0 )^2
\end{equation}

* for $Y$ categorical, the missclassification error defined by: 

\begin{equation}
  errOOB = \frac{1}{N} \sum_{i=1}^N {1}_{( \hat{\pi}_i \neq \pi_i^0 )}
\end{equation}

* for $Y$ survival, the Integrated Brier Score (IBS) [@sene_individualized_2016] between $\tau_1$ and $\tau_2$ defined by: 

\begin{equation}
  errOOB = \int_{\tau_1}^{\tau_2} \frac{1}{N} \sum_{i=1}^{N}  \hat{\omega}_i(t) \Big\{ I(T_i \leq t, \delta_i = k) - 
  \hat{\pi}_{ik}(t) \Big) \Big\}^2 dt
\end{equation}

with $T$ the time-to-event, $k$ the cause of interest and $\hat{\omega}(t)$ the estimated weights using Inverse Probability of Censoring Weights (IPCW) technique that accounts for censoring [@gerds_consistent_2006].

The OOB error of prediction is used to particular to tune the random forest by determining the hyperparameters (i.e., `mtry`, `nodesize` and `minsplit`) which give the smallest OOB prediction error.


## Explore the most predictive variables

### Variable importance

The variable importance (VIMP) measures the loss of predictive performance [@ishwaran_random_2008] when removing the link between a predictor and the outcome. The link is removed by permuting the predictor values at the subject level for time-fixed predictors or at observation level for time-dependent predictors. A large VIMP value indicates a good prediction ability for the predictor. 

However, in case of correlated predictors, the VIMP may not properly quantify the variable importance [@gregorutti_correlation_2017] as the information of the predictor may still be present. To better handle situations with highly correlated predictors, the grouped variable importance (gVIMP) can be computed indirectly. It consists in simultaneously evaluate the importance of a group of predictors defined by the user. The computation is the same as for the VIMP except the permutation is performed simultaneously on all the predictors of the group. A large gVIMP value indicates a good prediction ability for the group of predictors.

### Minimal depth

The minimal depth is another statistic to quantify the importance of a variable. It assesses the distance between the root node and the first node for which the predictor is used to split the subjects (1 for first level, 2 for second level, 3 for third level, ...). This statistic can be computed at the predictor level or at the feature level, allowing to fully understand the tree building process.

We strongly advice to compute the minimal depth with `mtry` hyperparameter chosen at its maximum to ensure that all predictors are systematically among candidate predictors for splitting the subjects.


## References