git2rdata is an R package for writing and reading dataframes as plain text
files. A metadata file stores important information.
git2rdata optimizes the data for file storage. The optimization is most
effective on data containing factors. The optimization makes the data less
human readable. The user can turn this off when they prefer a human readable
format over smaller files. Details on the implementation are available in
vignette("plain_text", package = "git2rdata").
For details on using git2rdata with version control, see
vignette("version_control", package = "git2rdata").
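To make the trade-off between the optimized and the human readable format concrete, here is a minimal sketch. The data frame, the file name "colours" and the two temporary directories are hypothetical; the sketch assumes that the optimization is switched off through the optimize argument of write_vc() and that sorting names the column used to order the rows.

```r
library(git2rdata)

# Hypothetical example data: an id column plus a factor
my_data <- data.frame(
  id = 1:6,
  colour = factor(rep(c("red", "green", "blue"), 2))
)

# Two scratch directories, one per storage format
root_optimized <- tempfile("optimized")
root_verbose <- tempfile("verbose")
dir.create(root_optimized)
dir.create(root_verbose)

# Default: optimized storage (smaller files, less human readable)
write_vc(my_data, file = "colours", root = root_optimized, sorting = "id")
# Assumption: optimize = FALSE selects the human readable format
write_vc(my_data, file = "colours", root = root_verbose, sorting = "id",
         optimize = FALSE)

# Inspect which files were written and how large they are
file.size(list.files(root_optimized, full.names = TRUE))
file.size(list.files(root_verbose, full.names = TRUE))

# Reading works the same way for both formats
restored <- read_vc(file = "colours", root = root_verbose)
```

Opening the data files in a text editor shows the difference in readability; on a data frame this small the difference in size says little.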
Although we envisioned git2rdata with a git workflow in mind, you can use it
in combination with other version control systems like subversion or
mercurial.
git2rdata is a useful tool in a reproducible and traceable workflow.
vignette("workflow", package = "git2rdata") gives a toy example.
vignette("efficiency", package = "git2rdata") provides some insight into the
efficiency of file storage, git repository size and speed for writing and
reading.
git2rdata checks the data and metadata during reading. read_vc() informs the
user if there is tampering with the data or metadata.
The git2r package is available for working with git repositories from R.
Writing to a HDD takes about 70% more time than write.table(). For reading
speed relative to read.table(), see
vignette("efficiency", package = "git2rdata").
See also vignette("workflow", package = "git2rdata").
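The check that read_vc() performs on the data and metadata can be tried out with a sketch like the one below. The data frame, the file name "measurements" and the temporary directory are hypothetical. The sketch assumes that the metadata is stored in a file ending in .yml next to the data file, and it does not assume whether read_vc() reports the problem as a warning or as an error: it captures either.

```r
library(git2rdata)

# Scratch directory and hypothetical example data
root <- tempfile("tamper-demo")
dir.create(root)
my_data <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
write_vc(my_data, file = "measurements", root = root, sorting = "id")

# Locate the data file: everything except the .yml metadata file (assumption)
data_file <- grep("\\.yml$", list.files(root, full.names = TRUE),
                  invert = TRUE, value = TRUE)

# Simulate tampering by duplicating the last record outside of git2rdata
lines <- readLines(data_file)
writeLines(c(lines, lines[length(lines)]), data_file)

# Capture whatever condition read_vc() signals instead of stopping the script
result <- tryCatch(
  read_vc(file = "measurements", root = root),
  warning = function(w) w,
  error = function(e) e
)
if (inherits(result, "condition")) message(conditionMessage(result))
```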
git2rdata at useR!2019 in Toulouse, France
Install from CRAN
install.packages("git2rdata")
Install the development version from GitHub
# installation requires the "remotes" package
# install.packages("remotes")
# install with vignettes (recommended)
remotes::install_github(
  "ropensci/git2rdata",
  build = TRUE,
  dependencies = TRUE,
  build_opts = c("--no-resave-data", "--no-manual")
)
# install without vignettes
remotes::install_github("ropensci/git2rdata")
The user stores dataframes with write_vc() and retrieves them with read_vc().
Both functions share the arguments root and file. root refers to a base
location where to store the dataframe. It can either point to a local
directory or a local git repository. file is the file name to use and can
include a path relative to root. Make sure the relative path stays within
root.
# using a local directory
library(git2rdata)
root <- "~/myproject"
write_vc(my_data, file = "rel_path/filename", root = root)
read_vc(file = "rel_path/filename", root = root)
root <- git2r::repository("~/my_git_repo") # git repository
More details on storing dataframes as plain text files are available in
vignette("plain_text", package = "git2rdata").
# using a git repository
library(git2rdata)
repo <- repository("~/my_git_repo")
pull(repo)
write_vc(my_data, file = "rel_path/filename", root = repo, stage = TRUE)
commit(repo, "My message")
push(repo)
read_vc(file = "rel_path/filename", root = repo)
Please read vignette("version_control", package = "git2rdata") for more
details on using git2rdata in combination with version control.
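The git workflow above needs an existing repository with a remote. For trying it out locally, the following sketch uses a throwaway repository without a remote, so it skips the pull and push steps. The repository location, the author details, the data frame and the file name "my_data" are all hypothetical. It calls git2r functions with an explicit git2r:: prefix and assumes that sorting names the column used to order the rows.

```r
library(git2rdata)

# Throwaway repository in a temporary directory (hypothetical location)
repo_path <- tempfile("git2rdata-demo")
dir.create(repo_path)
repo <- git2r::init(repo_path)
# A commit needs an author; these values are placeholders
git2r::config(repo, user.name = "Alice", user.email = "alice@example.org")

# Hypothetical example data
my_data <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# stage = TRUE adds the written files to the git index
write_vc(my_data, file = "my_data", root = repo, sorting = "id",
         stage = TRUE)
git2r::commit(repo, message = "Add my_data")

# Read the data back from the working directory of the repository
restored <- read_vc(file = "my_data", root = repo)
n_commits <- length(git2r::commits(repo))
```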
The recommendation for git repositories is to use files smaller than
100 MiB, a repository size less than 1 GiB and fewer than 25k files. The
individual file size is the limiting factor. Storing the airbag dataset
(DAAG::nassCDS) with write_vc() requires on average 68 (optimized) or 97
(verbose) bytes per record. The file reaches the 100 MiB limit for this
data after about 1.5 million (optimized) or 1 million (verbose)
observations.
Storing a 90% random subset of the airbag dataset requires 370 KiB (optimized) or 400 KiB (verbose) storage in the git history. Updating the dataset with other 90% random subsets requires on average 60 KiB (optimized) to 100 KiB (verbose) per commit. The git history reaches the limit of 1 GiB after 17k (optimized) to 10k (verbose) commits.
Your mileage might vary.
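The limits quoted above follow from simple arithmetic on the reported averages. The sketch below reproduces them; the variable names are illustrative, and it assumes that the first commit takes the 370 KiB or 400 KiB mentioned above and that every later commit adds the average size.

```r
# File size limit: 100 MiB expressed in bytes
file_limit <- 100 * 1024^2
bytes_per_record <- c(optimized = 68, verbose = 97)
# Number of records that fit in a single file
max_records <- floor(file_limit / bytes_per_record)

# Repository size limit: 1 GiB expressed in KiB
repo_limit_kib <- 1024^2
initial_kib <- c(optimized = 370, verbose = 400)
kib_per_commit <- c(optimized = 60, verbose = 100)
# Number of update commits that fit after the initial commit
max_commits <- floor((repo_limit_kib - initial_kib) / kib_per_commit)

max_records
max_commits
```

This gives about 1.54 million (optimized) and 1.08 million (verbose) records per file, and about 17.5k (optimized) and 10.5k (verbose) commits, in line with the rounded figures in the text.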
Please use the output of citation("git2rdata")
R: the source scripts of the R functions with documentation in Roxygen format
man: the help files in Rd format
inst/efficiency: pre-calculated data to speed up vignette("efficiency", package = "git2rdata")
testthat: R scripts with unit tests using the testthat framework
vignettes: source code for the vignettes describing the package
man-roxygen: templates for documentation in Roxygen format
pkgdown: source files for the git2rdata website
.github: guidelines and templates for contributors
git2rdata
├── .github
├─┬ inst
│ └── efficiency
├── man
├── man-roxygen
├── pkgdown
├── R
├─┬ tests
│ └── testthat
└── vignettes
git2rdata welcomes contributions. Please read our Contributing guidelines
first. The git2rdata project has a Contributor Code of Conduct. By
contributing to this project, you agree to abide by its terms.