--- title: "Suggested Workflow for Storing a Variable Set of Dataframes under Version Control" author: "Thierry Onkelinx" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Suggested Workflow for Storing a Variable Set of Dataframes under Version Control} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteDepends{git2r} --- ```{r setup, include = FALSE} library(knitr) opts_chunk$set( collapse = TRUE, comment = "#>" ) set.seed(20120225) ``` ## Introduction This vignette describes a suggested workflow for storing a snapshot of dataframes as git2rdata objects under version control. The workflow comes in two flavours: 1. A single repository holding both the data and the analysis code. The single repository set-up is simple. A single reference (e.g. commit) points to both the data and the code. 1. One repository holding the data and a second repository holding the code. The data and the code have an independent history under a two repository set-up. Documenting the analysis requires one reference to each repository. Such a set-up is useful for repeating the same analysis (stable code) on updated data. In this vignette we use a `git2r::repository()` object as the root. This adds git functionality to `write_vc()` and `read_vc()`, provided by the [`git2r`](https://cran.r-project.org/package=git2r) package. This allows to pull, stage, commit and push from within R. Each commit in the data git repository describes a complete snapshot of the data at the time of the commit. The difference between two commits can consist of changes in existing git2rdata object (updated observations, new observations, deleted observations or updated metadata). Besides updating the existing git2rdata objects, we can also add new git2rdata objects or remove existing ones. We need to track such higher level addition and deletions as well. We illustrate the workflow with a mock analysis on the `datasets::beaver1` and `datasets::beaver2` datasets. ## Setup We start by initializing a git repository. `git2rdata` assumes that is already done. We'll use the `git2r` functions to do so. We start by creating a local bare repository. In practice we will use a remote on an external server (GitHub, Gitlab, Bitbucket, ...). The example below creates a local git repository with an upstream git repository. Any other workflow to create a similar structure is fine. ```{r initialize} # initialize a bare git repo to be used as remote remote <- tempfile("git2rdata-workflow-remote") remote <- normalizePath(remote, winslash = "/") dir.create(remote) git2r::init(remote, bare = TRUE) # initialize a local git repo path <- tempfile("git2rdata-workflow") path <- normalizePath(path, winslash = "/") dir.create(path) init_repo <- git2r::clone(remote, path, progress = FALSE) git2r::config(init_repo, user.name = "me", user.email = "me@me.com") # add an initial commit with .gitignore file writeLines("*extra*", file.path(path, ".gitignore")) git2r::add(init_repo, ".gitignore", force = TRUE) git2r::commit(init_repo, message = "Initial commit") # push initial commit to remote branch_name <- git2r::branches(init_repo)[[1]]$name git2r::push( init_repo, "origin", file.path("refs", "heads", branch_name, fsep = "/") ) rm(init_repo) ``` ## Structuring Git2rdata Objects Within a Project `git2rdata` imposes a minimal structure. Both the `.tsv` and the `.yml` file need to be in the same folder. That's it. For the sake of simplicity, in this vignette we dump all git2rdata objects at the root of the repository. This might not be good idea for real project. We recommend to use at least a different directory tree for each import script. This directory can go into the root of a data repository. It goes in the `data` directory in case of a data and code repository. Or the `inst` directory in case of an R package. Your project might need a different directory structure. Feel free to choose the most relevant data structure for your project. ## Storing Dataframes _ad Hoc_ into a Git Repository ### First Commit In the first commit we use `datasets::beaver1`. We connect to the git repository using `repository()`. Note that this assumes that `path` is an existing git repository. Now we can write the dataset as a git2rdata object in the repository. If the `root` argument of `write_vc()` is a `git_repository`, it gains two extra arguments: `stage` and `force`. Setting `stage = TRUE`, will automatically stage the files written by `write_vc()`. ```{r store_data_1} library(git2rdata) repo <- repository(path) fn <- write_vc(beaver1, "beaver", repo, sorting = "time", stage = TRUE) ``` We can use `status()` to check that `write_vc()` wrote and staged the required files. Then we `commit()` the changes. ```{r avoid_subsecond_commit, echo = FALSE} Sys.sleep(1.2) ``` ```{r commit_data_1} status(repo) cm1 <- commit(repo, message = "First commit") cat(cm1$message) ``` ### Second Commit The second commit adds `beaver2`. ```{r store_data_2} fn <- write_vc(beaver2, "extra_beaver", repo, sorting = "time", stage = TRUE) status(repo) ``` Notice that `extra_beaver` is not listed in the `status()`, although `write_vc()` wrote it to the repository. The reason is that we set a `.gitignore` which contains `"*extra*`, so git ignores any git2rdata object with a name containing "extra". We force git to stage it by setting `force = TRUE`. ```{r avoid_subsecond_commit2, echo = FALSE} Sys.sleep(1.2) ``` ```{r} status(repo, ignored = TRUE) fn <- write_vc(beaver2, "extra_beaver", repo, sorting = "time", stage = TRUE, force = TRUE) status(repo) cm2 <- commit(repo, message = "Second commit") ``` ### Third Commit Now we decide that a single git2rdata object containing the data of both beavers is more relevant. We add an ID variable for each of the animals. This requires updating the `sorting` to avoid ties. And `strict = FALSE` to update the metadata. The "extra_beaver" git2rdata object is no longer needed so we remove it. We use `all = TRUE` to stage the removal of "extra_beaver" while committing the changes. ```{r avoid_subsecond_commit3, echo = FALSE} Sys.sleep(1.2) ``` ```{r store_data_3} beaver1$beaver <- 1 beaver2$beaver <- 2 beaver <- rbind(beaver1, beaver2) fn <- write_vc(beaver, "beaver", repo, sorting = c("beaver", "time"), strict = FALSE, stage = TRUE) file.remove(list.files(path, "extra", full.names = TRUE)) status(repo) cm3 <- commit(repo, message = "Third commit", all = TRUE) status(repo) ``` ## Scripted Workflow for Storing Dataframes We strongly recommend to add git2rdata object through an import script instead of adding them [_ad hoc_](#storing-dataframes-ad-hoc-into-a-git-repository). Store this script in the (analysis) repository. It documents the creation of the git2rdata objects. Rerun this script whenever updated data becomes available. Old versions of the import script and the associated git2rdata remain available through the version control history. Remove obsolete git2rdata objects from the import script. This keeps both the import script and the working directory tidy and minimal. Basically, the import script should create all git2rdata objects within a given directory tree. This gives the advantage that we start the import script by clearing any existing git2rdata object in this directory. If the import script no longer creates a git2rdata object, it gets removed without the need to track what git2rdata objects existed in the previous version. The brute force method of removing all files or all `.tsv` / `.yml` pairs is not a good idea. This removes the existing metadata which we need for efficient storage (see `vignette("efficiency", package = "git2rdata")`). A better solution is to use `rm_data()` on the directory at the start of the import script. This removes all `.tsv` files which have valid metadata. The existing metadata remains untouched at this point. Then write all git2rdata objects and stage them. Unchanged objects will not lead to a diff, even if we first deleted and then recreated them. The script won't recreate the `.tsv` file of obsolete git2rdata objects. Use `prune_meta()` to remove any leftover metadata files. Commit and push the changes at the end of the script. Below is an example script recreating the "beaver" git2rdata object from the [third commit](#third-commit). ```{r eval = FALSE} # load package library(git2rdata) # step 1: setup the repository and data path repo <- repository(".") data_path <- file.path("data", "beaver") # step 1b: sync the repository with the remote pull(repo = repo) # step 2: remove all existing data files rm_data(root = repo, path = data_path, stage = TRUE) # step 3: write all relevant git2rdata objects to the data path beaver1$beaver <- 1 beaver2$beaver <- 2 body_temp <- rbind(beaver1, beaver2) fn <- write_vc(x = body_temp, file = file.path(data_path, "body_temperature"), root = repo, sorting = c("beaver", "time"), stage = TRUE) # step 4: remove any dangling metadata files prune_meta(root = repo, path = data_path, stage = TRUE) # step 5: commit the changes cm <- commit(repo = repo, message = "import") # step 5b: sync the repository with the remote push(repo = repo) ``` ## R Package Workflow for Storing Dataframes We recommend a two repository set-up in case of recurring analyses. These are relative stable analyses which have to run with some frequency on updated data (e.g. once a month). That makes it worthwhile to convert the analyses into an R package. Split long scripts into a set of shorter functions which are much easier to document and maintain. An R package offers lots of [functionality](http://r-pkgs.had.co.nz/check.html) out of the box to check the quality of your code. The example below converts the import script above into a function. We illustrate how you can use Roxygen2 (see `vignette("roxygen2", package = "roxygen2")`) tags to document the function and to list its dependencies. Note that we added `session = TRUE` to `commit()`. This will append the `sessionInfo()` at the time of the commit to the commit message. Thus documenting all loaded R packages and their version. This documents to code used to create the git2rdata object since your analysis code resides in a dedicated package with its own version number. We strongly recommend to run the import from a fresh R session. Then the `sessionInfo()` at commit time contains those packages with are strictly required for the import. Consider running the import from the command line. e.g. `Rscript -e 'mypackage::import_body_temp("path/to/root")'`. ```{r eval = FALSE} #' Import the beaver body temperature data #' @param path the root of the git repository #' @importFrom git2rdata repository pull rm_data write_vc prune_meta commit push #' @export import_body_temp <- function(path) { # step 1: setup the repository and data path repo <- repository(path) data_path <- file.path("data", "beaver") # step 1b: sync the repository with the remote pull(repo = repo) # step 2: remove all existing data files rm_data(root = repo, path = data_path, stage = TRUE) # step 3: write all relevant git2rdata objects to the data path beaver1$beaver <- 1 beaver2$beaver <- 2 body_temp <- rbind(beaver1, beaver2) write_vc(x = body_temp, file = file.path(data_path, "body_temperature"), root = repo, sorting = c("beaver", "time"), stage = TRUE) # step 4: remove any dangling metadata files prune_meta(root = repo, path = data_path, stage = TRUE) # step 5: commit the changes commit(repo = repo, message = "import", session = TRUE) # step 5b: sync the repository with the remote push(object = repo) } ``` ## Analysis Workflow with Reproducible Data The example below is a small trivial example of a standardized analysis in which documents the source of the data by describing the name of the data, the repository URL and the commit. We can use this information when reporting the results. This makes the data underlying the results traceable. ```{r standardized_analysis} analysis <- function(ds_name, repo) { ds <- read_vc(ds_name, repo) list( dataset = ds_name, repository = git2r::remote_url(repo), commit = recent_commit(ds_name, repo, data = TRUE), model = lm(temp ~ activ, data = ds) ) } report <- function(x) { knitr::kable( coef(summary(x$model)), caption = sprintf("**dataset:** %s \n**commit:** %s \n**repository:** %s", x$dataset, x$commit$commit, x$repository) ) } ``` In this case we can run every analysis by looping over the list of datasets in the repository. ```{r run_current_analyses, results = "asis"} repo <- repository(path) current <- lapply(list_data(repo), analysis, repo = repo) names(current) <- list_data(repo) result <- lapply(current, report) junk <- lapply(result, print) ``` The example below does the same thing for the first and second commit. ```{r run_previous_analyses, results = "asis"} # checkout first commit git2r::checkout(cm1) # do analysis previous <- lapply(list_data(repo), analysis, repo = repo) names(previous) <- list_data(repo) result <- lapply(previous, report) junk <- lapply(result, print) # checkout second commit git2r::checkout(cm2) # do analysis previous <- lapply(list_data(repo), analysis, repo = repo) names(previous) <- list_data(repo) result <- lapply(previous, report) junk <- lapply(result, print) ``` If you inspect the reported results you'll notice that all the output (coefficients and commit hash) for "beaver" object is identical for the first and second commit. This makes sense since the "beaver" object didn't change during the second commit. The output for the current (third) commit is different because the dataset changed. ### Long running analysis Imagine the case where an individual analysis takes a while to run. We store the most recent version of each analysis and add the information from `recent_commit()`. When preparing the analysis, you can run `recent_commit()` again on the dataset and compare the commit hash with that one of the available analysis. If the commit hashes match, then the data hasn't changed. Then there is no need to rerun the analysis^[assuming the code for running the analysis didn't change.], saving valuable computing resources and time.