This R package implements the Covariance Regression with Random Forests (CovRegRF) method described in Alakus et al. (2023) <doi:10.1186/s12859-023-05377-y>. The theoretical details of the proposed method are presented in Section 1 followed by a data analysis using this method in Section 2.
Most of the existing multivariate regression analyses focus on estimating the conditional mean of the response variable given its covariates. However, it is also crucial in various areas to capture the conditional covariances or correlations among the elements of a multivariate response vector based on covariates. We consider the following setting: let \(\mathbf{Y}_{n \times q}\) be a matrix of \(q\) response variables measured on \(n\) observations, where \(\mathbf{y}_i\) represents the \(i\)th row of \(\mathbf{Y}\). Similarly, let \(\mathbf{X}_{n \times p}\) be a matrix of \(p\) covariates available for all \(n\) observations, where \(\mathbf{x}_i\) represents the \(i\)th row of \(\mathbf{X}\). We assume that the observation \(\mathbf{y}_i\) with covariates \(\mathbf{x}_i\) has a conditional covariance matrix \(\Sigma_{\mathbf{x}_i}\). We propose a novel method called Covariance Regression with Random Forests (CovRegRF) to estimate the covariance matrix of a multivariate response \(\mathbf{Y}\) given a set of covariates \(\mathbf{X}\), using a random forest framework. Random forest trees are built with a specialized splitting criterion \[\sqrt{n_Ln_R}*d(\Sigma^L, \Sigma^R)\] where \(\Sigma^L\) and \(\Sigma^R\) are the covariance matrix estimates of left and right nodes, and \(n_L\) and \(n_R\) are the left and right node sizes, respectively, \(d(\Sigma^L, \Sigma^R)\) is the Euclidean distance between the upper triangular part of the two matrices and computed as follows: \[d(A, B) = \sqrt{\sum_{i=1}^{q}\sum_{j=i}^{q} (\mathbf{A}_{ij} - \mathbf{B}_{ij})^2}\] where \(\mathbf{A}_{q \times q}\) and \(\mathbf{B}_{q \times q}\) are symmetric matrices. For a new observation, the random forest provides the set of nearest neighbour out-of-bag observations which is used to estimate the conditional covariance matrix for that observation.
We propose a hypothesis test to evaluate the effect of a subset of covariates on the covariance matrix estimates while controlling for the other covariates. Let \(\Sigma_\mathbf{X}\) be the conditional covariance matrix of \(\mathbf{Y}\) given all \(X\) variables and \(\Sigma_{\mathbf{X}^c}\) is the conditional covariance matrix of \(\mathbf{Y}\) given only the set of controlling \(X\) variables. If a subset of covariates has an effect on the covariance matrix estimates obtained with the proposed method, then \(\Sigma_\mathbf{X}\) should be significantly different from \(\Sigma_{\mathbf{X}^c}\). We conduct a permutation test for the null hypothesis \[H_0 : \Sigma_\mathbf{X} = \Sigma_{\mathbf{X}^c}\] We estimate a \(p\)-value with the permutation test. If the \(p\)-value is less than the pre-specified significance level \(\alpha\), we reject the null hypothesis.
We will show how to use the CovRegRF package on a generated data set. The data set consists of two multivariate data sets: \(\mathbf{X}_{n \times 3}\) and \(\mathbf{Y}_{n \times 3}\). The sample size (\(n\)) is 200. The covariance matrix of \(\mathbf{Y}\) depends on \(X_1\) and \(X_2\) (i.e. \(X_3\) is a noise variable). We load the data and split it into train and test sets:
library(CovRegRF)
data(data)
<- colnames(data$X)
xvar.names <- colnames(data$Y)
yvar.names <- data.frame(data$X, data$Y)
data1
set.seed(4567)
<- sample(1:nrow(data1), size = round(nrow(data1)*0.6), replace = FALSE)
smp <- data1[smp,, drop=FALSE]
traindata <- data1[-smp, xvar.names, drop=FALSE] testdata
Firstly, we check the global effect of \(\mathbf{X}\) on the covariance matrix estimates by applying the significance test for the three covariates.
<- as.formula(paste(paste(yvar.names, collapse="+"), ".", sep=" ~ "))
formula <- significance.test(formula, traindata, params.rfsrc = list(ntree = 200),
globalsig.obj nperm = 10, test.vars = NULL)
$pvalue
globalsig.obj#> [1] 0
Using 10 permutations, the estimated \(p\)-value is 0 which is smaller than the significance level (\(\alpha\)) of 0.05 and we reject the null hypothesis indicating the conditional covariance matrices significantly vary with the set of covariates. When performing a permutation test to estimate a \(p\)-value, we need more than 10 permutations. Using 500 permutations, the estimated \(p\)-value is 0.012. The computational time increases with the number of permutations.
Next, we apply the proposed method with covregrf()
and
get the out-of-bag (OOB) covariance matrix estimates for the training
observations.
<- covregrf(formula, traindata, params.rfsrc = list(ntree = 200))
covregrf.obj <- covregrf.obj$predicted.oob
pred.oob head(pred.oob, 2)
#> [[1]]
#> y1 y2 y3
#> y1 1.1286879 0.8387699 1.101836
#> y2 0.8387699 1.9878143 1.507416
#> y3 1.1018365 1.5074164 3.591839
#>
#> [[2]]
#> y1 y2 y3
#> y1 1.353311 1.050629 1.761897
#> y2 1.050629 2.153400 2.142313
#> y3 1.761897 2.142313 4.453531
Then, we get the variable importance (VIMP) measures for the covariates. VIMP measures reflect the predictive power of \(\mathbf{X}\) on the estimated covariance matrices. Also, we can plot the VIMP measures.
<- vimp(covregrf.obj)
vimp.obj $importance
vimp.obj#> x1 x2 x3
#> 1.38012921 0.51122499 0.07159317
plot.vimp(vimp.obj)
From the VIMP measures, we see that \(X_3\) has smaller importance than \(X_1\) and \(X_2\). We apply the significance test to evaluate the effect of \(X_3\) on the covariance matrices while controlling for \(X_1\) and \(X_2\).
<- significance.test(formula, traindata, params.rfsrc = list(ntree = 200),
partialsig.obj nperm = 10, test.vars = "x3")
$pvalue
partialsig.obj#> [1] 0.3
Using 10 permutations, the estimated p-values is 0.3 and we fail to reject the null hypothesis, indicating that we do not have enough evidence to prove that \(X_3\) has an effect on the estimated covariance matrices while \(X_1\) and \(X_2\) are in the model. Using 500 permutations, the estimated \(p\)-value is 0.218.
Finally, we can get the covariance matrix predictions for the test observations.
<- predict(covregrf.obj, testdata)
pred.obj <- pred.obj$predicted
pred head(pred, 2)
#> [[1]]
#> y1 y2 y3
#> y1 1.0710647 0.5039413 0.6859008
#> y2 0.5039413 1.6077176 1.1050398
#> y3 0.6859008 1.1050398 2.7710218
#>
#> [[2]]
#> y1 y2 y3
#> y1 1.294158 1.306257 1.854186
#> y2 1.306257 2.183658 2.424231
#> y3 1.854186 2.424231 4.387248
Alakus, C., Larocque, D., and Labbe, A. (2023). Covariance regression with random forests. BMC Bioinformatics 24, 258.
sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Monterey 12.6.3
#>
#> Matrix products: default
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] CovRegRF_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] visNetwork_2.1.0 digest_0.6.29 R6_2.5.1 jsonlite_1.8.0 magrittr_2.0.3 evaluate_0.15
#> [7] highr_0.9 rlang_1.0.6 stringi_1.7.8 cli_3.6.0 data.table_1.14.2 rstudioapi_0.13
#> [13] DiagrammeR_1.0.9 rmarkdown_2.14 data.tree_1.0.0 RColorBrewer_1.1-3 tools_4.2.0 stringr_1.4.0
#> [19] htmlwidgets_1.5.4 glue_1.6.2 parallel_4.2.0 xfun_0.31 yaml_2.3.5 fastmap_1.1.0
#> [25] compiler_4.2.0 htmltools_0.5.3 knitr_1.39