symDMatrix is an R package that provides symmetric matrices partitioned into file-backed blocks.
A symmetric matrix G
is partitioned into blocks as
follows:
+ --- + --- + --- +
| G11 | G12 | G13 |
+ --- + --- + --- +
| G21 | G22 | G23 |
+ --- + --- + --- +
| G31 | G32 | G33 |
+ --- + --- + --- +
Because the matrix is assumed to be symmetric (i.e., Gij
equals Gji
), only the diagonal and upper-triangular blocks
are stored and the other blocks are virtual transposes of the
corresponding diagonal blocks. Each block is a file-backed matrix of
type ff_matrix
of the ff package.
The package defines the class and multiple methods that allow treating this file-backed matrix as a standard RAM matrix.
Before we start, let’s create a symmetric matrix in RAM.
library(BGLR)
# Load genotypes from a mice data set
data(mice)
<- mice.X
X rownames(X) <- paste0("ID_", 1:nrow(X))
# Compute a symmetric genetic relationship matrix (G matrix) in RAM
<- tcrossprod(scale(X))
G1 <- G1 / mean(diag(G1)) G1
In practice, if we can hold a matrix in RAM, there is not much of a
point to convert it to a symDMatrix
object; however, this
will help us to get started.
library(symDMatrix)
<- as.symDMatrix(G1, blockSize = 400, vmode = "double", folderOut = "mice") G2
Now that we have a symDMatrix
object, let’s illustrate
some operators.
# Basic operators applied to a matrix in RAM and to a symDMatrix
# Dimension operators
all.equal(dim(G1), dim(G2))
nrow(G1) == nrow(G2)
ncol(G1) == ncol(G2)
all.equal(diag(G1), diag(G2))
# Names operators
all.equal(dimnames(G1), dimnames(G2))
all(rownames(G1) == rownames(G2))
all(colnames(G1) == rownames(G2))
# Block operators
nBlocks(G2)
blockSize(G2)
# Indexing (can use booleans, integers or labels)
1:2, 1:2]
G2[c("ID_1", "ID_2"), c("ID_1", "ID_2")]
G2[<- c(TRUE, TRUE, rep(FALSE, nrow(G2) - 2))
tmp
G2[tmp, tmp]head(G2[tmp, ])
# Exhaustive check of indexing
for (i in 1:100) {
<- sample(1:50, size = 1)
n1 <- sample(1:50, size = 1)
n2 <- sample(1:nrow(X), size = n1)
i1 <- sample(1:nrow(X), size = n2)
i2 <- G1[i1, i2, drop = FALSE]
TMP1 <- G2[i1, i2, drop = FALSE]
TMP2 stopifnot(all.equal(TMP1, TMP2))
}
The function getG_symDMatrix
of the BGData package
computes G=XX’ (with options for centering and scaling) without ever
loading G in RAM. It creates the symDMatrix
object
directly, block by block. In this example, X
is a matrix in
RAM. For large genotype data sets, X
could be a file-backed
matrix, e.g., a BEDMatrix
or ff
object.
library(BGData)
<- getG_symDMatrix(X, blockSize = 400, vmode = "double", folderOut = "mice2")
G3 class(G3)
all.equal(diag(G1), diag(G3))
for(i in 1:10){
<- sample(1:25, size = 1)
n1 <- sample(1:25, size = n1)
i1 for(j in 1:10){
<- sample(1:nrow(G1), size = 1)
n2 <- sample(1:nrow(G1), size = n2)
i2 <- G1[i1, i2]
tmp1 <- G3[i1, i2]
tmp2 stopifnot(all.equal(tmp1, tmp2))
} }
ff
files containing the
blocksThe function symDMatrix
allows creating a
symDMatrix
object from a list of .RData
files
containing ff_matrix
objects. The list is assumed to
provide, in order, files for
G11, G12, ..., G1q, G22, G23, ..., G2q, ..., Gqq
. This
approach is useful for very large G matrices. If n
is large
it may make sense to compute the blocks of the symDMatrix
object in parallel jobs (e.g., in an HPC). The function
getG
of the BGData package is
similar to getG_symDMatrix
but accepts arguments
i1
and i2
which define a block of G (i.e.,
rows of X
).
library(BGLR)
library(BGData)
library(ff)
# Load genotypes from a wheat data set
data(wheat)
<- wheat.X
X rownames(X) <- paste0("ID_", 1:nrow(X))
# Compute G matrix in RAM
<- colMeans(X)
centers <- apply(X = X, MARGIN = 2, FUN = sd)
scales <- tcrossprod(scale(X, center = centers, scale = scales))
G1 <- G1 / mean(diag(G1))
G1
# Compute G matrix block by block (each block computation can be distributed)
<- 3
nBlocks <- ceiling(nrow(X) / nBlocks)
blockSize <- 1:nrow(X)
i <- split(i, ceiling(i / blockSize))
blockIndices for (r in 1:nBlocks) {
for (s in r:nBlocks) {
<- paste0("wheat_", r, "_", s - r + 1)
blockName <- getG(X, center = centers, scale = scales, scaleG = TRUE,
block i = blockIndices[[r]], i2 = blockIndices[[s]])
<- ff::as.ff(block, filename = paste0(blockName, ".bin"), vmode = "double")
block save(block, file = paste0(blockName, ".RData"))
}
}<- as.symDMatrix(list.files(pattern = "^wheat.*RData$"))
G2 attr(G2, "centers") <- centers
attr(G2, "scales") <- scales
all.equal(diag(G1), diag(G2)) # there will be a slight numerical penalty
Install the stable version from CRAN:
install.packages("symDMatrix")
Alternatively, install the development version from GitHub:
# install.packages("remotes")
::install_github("QuantGen/symDMatrix") remotes
Further documentation can be found on RDocumentation.