filematrix package

The filematrix package was originally conceived as an alternative to the bigmemory package for three reasons.

First, matrices created with bigmemory on NFS (network file system) were often corrupted (contained all zeros). This is most likely a fault of memory-mapped files on NFS.

Second, bigmemory was not available for Windows for a long period of time. It is now fully cross-platform.

Finally, the bigmemory package uses a memory-mapped file interface to work with data files. This delivers great performance for matrices smaller than the amount of computer memory, but we experienced major slowdowns for larger matrices.

filematrix and bigmemory packages

The packages use different libraries to read from and write to their big files. The filematrix package uses the readBin and writeBin R functions. The bigmemory package uses memory-mapped file access via the BH R package interface (Boost C++).

Note that filematrix can store real values in a short 4-byte format. This feature is not available in bigmemory.
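
The file access approach described above can be illustrated with base R alone. The following is a minimal sketch, not the actual filematrix implementation: it assumes a plain column-major binary layout with no header, and the helper name read_column is hypothetical. It writes a small matrix with writeBin, reads back a single column with seek and readBin, and then stores the same values in 4-byte format to show the halved file size.

```r
# Illustrative sketch only; the file layout and helper name are assumptions,
# not the filematrix on-disk format or API.
path8 = tempfile(fileext = ".bin")
path4 = tempfile(fileext = ".bin")
nr = 4L
nc = 3L
m = matrix(as.double(seq_len(nr * nc)), nrow = nr, ncol = nc)

# Write the matrix column by column (R stores matrices column-major),
# using 8 bytes per value.
con = file(path8, open = "wb")
writeBin(as.vector(m), con, size = 8L)
close(con)

# Write the same values using 4 bytes per value (single precision).
con = file(path4, open = "wb")
writeBin(as.vector(m), con, size = 4L)
close(con)

# Read one column without loading the rest of the file:
# skip (j - 1) full columns, then read nr values.
read_column = function(path, j, nr, size) {
    con = file(path, open = "rb")
    on.exit(close(con))
    seek(con, where = (j - 1) * nr * size, origin = "start")
    readBin(con, what = "double", n = nr, size = size)
}

col2_double = read_column(path8, 2L, nr, size = 8L)
col2_single = read_column(path4, 2L, nr, size = 4L)

size8 = file.size(path8)
size4 = file.size(path4)

unlink(c(path8, path4))
```

Because each column occupies a contiguous range of bytes in this layout, reading or writing a whole column touches one region of the file. The 4-byte file is half the size of the 8-byte one, at the cost of single precision.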
Due to the different file access approaches:

bigmemory accumulates changes to the matrix in memory and writes them to the file upon a call of flush or file closure.
filematrix writes the changes to the file upon request, without delay.

Consequently:

bigmemory works well for matrices smaller than the system memory. Writing to larger matrices is much slower because the system tries to keep as much of the matrix in memory (cache) as possible.
filematrix's performance does not deteriorate on matrices many times larger than the system memory.
bigmemory is better for random access of small file matrices.
filematrix is equally good or better for block and column-wise access of file matrices.

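
Block and column-wise access, as referred to above, can be sketched as follows. This is a small illustration under assumptions: the dimensions and the block size of 5 columns are arbitrary, the file is placed in tempdir(), and the value pattern (row index plus column index) is chosen only so the result is easy to check.

```r
library(filematrix)

# Small, arbitrary dimensions chosen for illustration.
n_row = 1000L
n_col = 20L
block = 5L

fm = fm.create(
    filenamebase = file.path(tempdir(), "block_fm"),
    nrow = n_row,
    ncol = n_col)

# Write the matrix in blocks of consecutive columns.
for( start in seq(1L, n_col, by = block) ) {
    cols = start:min(start + block - 1L, n_col)
    # Element [r, c] of the block is r + c, stored as doubles.
    fm[, cols] = outer(as.double(seq_len(n_row)), cols, "+")
}

# Read a single column back.
col7 = as.vector(fm[, 7])

# Cleanup
closeAndDeleteFiles(fm)
```

Each assignment covers whole columns, so it maps to one contiguous write in the backing file rather than many scattered ones.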
filematrix is much more efficient than bigmemory

Let us consider a simple task of filling in a large matrix (twice the size of system memory). Below is the code using filematrix. It finishes in 10 minutes and does not interfere with other programs.

library(filematrix)

# Create a 1e5 x 1e5 file matrix of doubles
fm = fm.create(
    filenamebase = "big_fm",
    nrow = 1e5,
    ncol = 1e5)

# Fill the matrix one column at a time and time the loop
tic = proc.time()
for( i in seq_len(ncol(fm)) ) {
    message(i, " of ", ncol(fm))
    fm[,i] = i + 1:nrow(fm)
}
toc = proc.time()
show(toc-tic)

# Cleanup
closeAndDeleteFiles(fm)
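
Before running the example above, it may help to estimate how much disk space the matrix needs. The arithmetic below is a simple sketch, assuming the default 8-byte double storage and ignoring any small descriptor file; the variable names are illustrative.

```r
# Dimensions used in the example above.
n_row = 1e5
n_col = 1e5

# Assumption: values are stored as 8-byte doubles.
bytes_per_value = 8

# Size of the data file in gigabytes (1 GB = 1e9 bytes).
size_gb = n_row * n_col * bytes_per_value / 1e9
print(size_gb)
```

The matrix therefore needs roughly 80 GB of free disk space, and the same estimate applies to a double-typed matrix of these dimensions in bigmemory.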

Filling a big matrix of the same size with bigmemory can be very slow (2.5 times slower in this experiment). The bigmemory package uses the memory-mapped file technique to access the file. When the matrix is written to, the memory-mapped file occupies all available RAM and the computer slows to a halt.

Please exercise caution when running the code below.

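library(bigmemory)

# Create a file-backed 1e5 x 1e5 matrix of doubles
fm = filebacked.big.matrix(
    nrow = 1e5,
    ncol = 1e5,
    type = "double",
    backingfile = "big_bm.bmat",
    backingpath = "./",
    descriptorfile = "big_bm.desc.txt")

# Fill the matrix one column at a time and time the loop,
# including the final flush of changes to the file
tic = proc.time()
for( i in seq_len(ncol(fm)) ) {
    message(i, " of ", ncol(fm))
    fm[,i] = i + 1:nrow(fm)
}
flush(fm)
toc = proc.time()
show(toc-tic)

# Cleanup: release the object before deleting its files
rm(fm)
gc()
unlink("big_bm.bmat")
unlink("big_bm.desc.txt")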