Motivation for creating the filematrix package

The filematrix package was originally conceived as an alternative to the bigmemory package, for three reasons.

First, matrices created with bigmemory on NFS (Network File System) were often corrupted (they contained all zeros). This is most likely caused by the use of memory-mapped files on NFS.

Second, bigmemory was not available for Windows for a long period of time. It is now fully cross-platform.

Finally, the bigmemory package uses a memory-mapped file interface to work with data files. This delivers great performance for matrices smaller than the amount of computer memory, but we experienced major slowdowns for larger matrices.

Differences between filematrix and bigmemory packages

The packages use different mechanisms to read from and write to their big files. The filematrix package uses the R functions readBin and writeBin. The bigmemory package uses memory-mapped file access via the interface of the BH R package (Boost C++).
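To make the readBin/writeBin approach concrete, here is a minimal base R sketch of reading a single column from a binary file that stores a matrix in column-major order. It is not the filematrix implementation; the temporary file, the dimensions, and the variable names are assumptions made for illustration only.

```r
# Hypothetical file and dimensions, for illustration only.
path <- tempfile(fileext = ".bin")
nr <- 4L
nc <- 3L
m <- matrix(as.double(seq_len(nr * nc)), nrow = nr, ncol = nc)

# Write the whole matrix in column-major order, 8 bytes per value.
con <- file(path, open = "wb")
writeBin(as.vector(m), con, size = 8L)
close(con)

# Read back only column j: jump to its byte offset, then read nr values.
j <- 2L
con <- file(path, open = "rb")
seek(con, where = (j - 1) * nr * 8, origin = "start")
col2 <- readBin(con, what = "double", n = nr, size = 8L)
close(con)

unlink(path)
```

Only the requested column is transferred between disk and memory, and nothing stays mapped into memory once the connection is closed.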

Note that filematrix can store real values in a short 4-byte format. This feature is not available in bigmemory.
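The trade-off of 4-byte storage can be sketched in base R with the size argument of writeBin and readBin. This only illustrates the storage format and is not a call into filematrix; the sample values and the temporary file are assumptions.

```r
# Sample values chosen for illustration.
x <- c(1, 0.1, pi, 1e10)
path <- tempfile(fileext = ".bin")

# Store each double in 4 bytes (single precision) instead of 8.
con <- file(path, open = "wb")
writeBin(x, con, size = 4L)
close(con)

bytes_on_disk <- file.size(path)  # 4 bytes per stored value

# Read the values back; they are widened to doubles again.
con <- file(path, open = "rb")
y <- readBin(con, what = "double", n = length(x), size = 4L)
close(con)
unlink(path)

# Largest relative rounding error introduced by the 4-byte format.
max_rel_err <- max(abs(y - x) / abs(x))
```

The file is half the size of its 8-byte equivalent, at the cost of keeping only about seven significant decimal digits per value.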

Differences in tests

Because the two packages access files differently:

  • bigmemory accumulates changes to the matrix in memory and writes them to the file when flush is called or the file is closed.
  • filematrix writes changes to the file immediately, as soon as they are requested.

Consequently:

  • bigmemory works well for matrices smaller than the system memory. Writing to larger matrices is much slower because the system tries to keep as much of the matrix as possible in memory (cache).
  • filematrix’s performance does not deteriorate on matrices many times larger than the system memory.

  • bigmemory is better for random access of small file matrices.
  • filematrix is equally good or better for block and column-wise access of the file matrices.
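As an illustration of block-wise access, the sketch below fills a small file matrix several columns at a time. It uses only fm.create, column indexing, and closeAndDeleteFiles from filematrix; the dimensions, the block size, and the temporary file name are hypothetical values chosen so that the example runs quickly.

```r
library(filematrix)

# Small, hypothetical dimensions so the sketch runs in a moment.
nr <- 1000L
nc <- 20L
block <- 5L
fm <- fm.create(filenamebase = tempfile("block_fm"), nrow = nr, ncol = nc)

# Fill `block` columns per assignment instead of one column at a time.
for (start in seq(1L, nc, by = block)) {
    cols <- start:min(start + block - 1L, nc)
    # Column j holds the values j + 1:nr.
    fm[, cols] <- outer(as.double(seq_len(nr)), cols, "+")
}

# Spot check: column 7 should equal 7 + 1:nr.
col7 <- fm[, 7]
ok <- all(col7 == 7 + seq_len(nr))

# Close the file matrix and remove its files.
closeAndDeleteFiles(fm)
```

Each assignment covers a contiguous run of columns, so a larger block size means fewer separate write requests at the cost of a larger temporary matrix in memory.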

Example when filematrix is much more efficient than bigmemory

Let us consider the simple task of filling in a large matrix (twice the size of system memory). Below is the code using filematrix. In our experiment it finished in 10 minutes and did not interfere with other programs.

library(filematrix)
# Create a 100,000 x 100,000 matrix of doubles (about 80 GB on disk)
fm = fm.create(
        filenamebase = "big_fm",
        nrow = 1e5,
        ncol = 1e5)

tic = proc.time()
for( i in seq_len(ncol(fm)) ) {
    message(i, " of ", ncol(fm))
    fm[,i] = i + 1:nrow(fm)
}
toc = proc.time()
show(toc-tic)

# Cleanup

closeAndDeleteFiles(fm)

Filling a big matrix of the same size with bigmemory can be very slow (2.5 times slower in this experiment). The bigmemory package accesses the file through a memory-mapped file. As the matrix is written to, the memory-mapped file grows to occupy all available RAM (as seen in Task Manager) and the computer slows to a crawl.

Please exercise caution when running the code below.

library(bigmemory)
fm = filebacked.big.matrix(
        nrow = 1e5,
        ncol = 1e5,
        type = "double",
        backingfile = "big_bm.bmat",
        backingpath = "./",
        descriptorfile = "big_bm.desc.txt")

tic = proc.time()
for( i in seq_len(ncol(fm)) ) {
    message(i, " of ", ncol(fm))
    fm[,i] = i + 1:nrow(fm)
}
# Write any changes still held in memory to the backing file
flush(fm)
toc = proc.time()
show(toc-tic)

# Cleanup

rm(fm)
gc()
unlink("big_bm.bmat")
unlink("big_bm.desc.txt")