---
title: "Data formats"
author: "Dieter Menne, dieter.menne@menne.biomed.de"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data formats}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Concepts

13C data can be imported in generic formats in Excel files, and in several vendor-specific formats, e.g. from BreathID and Wagner/IRIS. A collection of sample files with and without errors is available in the directory ``r R.home()`/library/breathtestcore/extdata`; function `btcore_file()` retrieves the names and long path of the available data sets.

```{r, echo = FALSE, include = FALSE}
library(knitr)
library(dplyr)
library(stringr)
opts_chunk$set(comment = NA, fig.width = 4, fig.height = 3)
knitr::opts_knit$set(unnamed.chunk.label = "btcore_data_")
options(digits = 3)
```

```{r}
library(breathtestcore)
head(btcore_file())
btcore_file("Standard.TXT")
```

* When you know the format, you can read the data using the special functions, e.g. `read_breathid()` or `read_breathid_xml()`.
* When you do not know the format, or when you want to read several different file formats at once, use function `read_any_breathtest()` which tries to guess the format.

```{r}
files = c(
  btcore_file("IrisCSV.TXT"), # Wagner/IRIS format
  btcore_file("350_20043_0_GER.txt") # BreathID
)
bt = read_any_breathtest(files)
# Returns a list of elements of class breathtest_data
str(bt, 1)
bt_df = cleanup_data(bt)
str(bt_df)
```

Passing through `cleanup_data()` returns a data frame/tibble and adds a grouping variable. To plot data without fitting, use `null_fit()`.

```{r, nf, fig.height = 2, fig.width =4}
nf = null_fit(bt_df)
str(nf)
plot(nf) # dispatches to plot.breathtestfit
```

To add new formats, override `breathtest_read_function()` and add a new function that returns a structure given by `breathtest_data()`.

> Always pass data through function `cleanup_data()` to obtain a data frame that can be fed to one of the fitting functions `nls_fit()`, `nlme_fit()`, `null_fit()` or `breathteststan::stan_fit()`.

## Automatic grouping

You can add a grouping variable, e.g. for multiple meal types, to compute between-group differences of means. Cross-over, randomized or mixed designs (some patients cross-over) are supported. You must explicitly name the group for every single file, as shown below. Without names you can pass a vector of files, e.g. `read_any_breathtest(c(file1, file2))`, but a name given to a nested vector does not carry over unchanged: `c()` makes the names unique by appending numbers, so `c(group_a = c(file1, file2))` yields the names `group_a1` and `group_a2`.

```{r, three, fig.height = 2.5, fig.width = 8}
files1 = c(
  group_a = btcore_file("IrisCSV.TXT"), # Name every single file with its group
  group_a = btcore_file("Standard.TXT"),
  group_b = btcore_file("350_20043_0_GER.txt")
)
# Alternative syntax using magrittr operator
suppressPackageStartupMessages(library(dplyr))
read_any_breathtest(files1) %>%
  cleanup_data() %>%
  null_fit() %>%
  plot()
```

## Simulated data

Function `simulate_breathtest_data()` generates sample data you can use to test different algorithms. Curves with outliers can be generated by setting `student_t_df` to values from 2 (very strong outliers) to 10 (almost Gaussian).

```{r, simulated, fig.height = 5, fig.width = 6, fig.cap = "Example of a cross-over design with missing data, outliers and a missing record in the red curve."}
set.seed(212)
data = list(meal_a = simulate_breathtest_data(n_records = 3, noise = 2,
                       student_t_df = 3, missing = 0.3),
            meal_b = simulate_breathtest_data(n_records = 4))
data %>%
  cleanup_data() %>%
  nlme_fit() %>%
  plot()
```

```{r, fig.cap= "Function simulate_breathtest_data returns the values of the parameters used to generate the data. These can be used to check the results of the model prediction."}
data$meal_a$record
```

## Built-in data sets

Three data sets are included in R format and can be loaded as shown below.
All data were provided by the University Hospital of Zürich; details are given in the documentation.

```{r}
data("usz_13c")
cat("usz_13c has data from", length(unique(usz_13c$patient_id)),
    "patients with", length(unique(usz_13c$group)), "different meals")
```

* `breathtestcore::usz_13c` A large data set used to establish reference ranges for healthy volunteers and patients.
* `breathtestcore::usz_13c_a` Exotic data, a challenge for fitting algorithms.
* `breathtestcore::usz_13c_d` Has gastric emptying half time from MRI as attribute, and can be used to compare recorded data with gold standards; see the example in the documentation.

# Generic formats

The easiest way to supply generic formats is from Excel files. The data formats described in the following are shown as examples in the workbook ``r R.home()`/library/breathtestcore/extdata/ExcelSamples.xlsx`. Any other tab-separated data set can directly be inserted into the editor of the [breathtestshiny](https://github.com/dmenne/breathtestshiny) web app using copy/paste.

## How to use the Excel data formats

* Use function `read_breathtest_excel()`; this is the only way to select a worksheet other than the first one in the workbook, by passing parameter `sheet`. All other methods only read the first worksheet.
* Use function `read_any_breathtest()`. This always reads the first worksheet, but you can combine results from several files, even when they have different formats.

```{r, fig.height = 3, include = FALSE}
knitr::include_graphics("breathtestshiny.png")
```

### Two-column format

When you have only data from one record, you can supply data in a two-column format as given in sheet `2col` of workbook `ExcelSamples.xlsx`. The column headers must be `minute,

```{r, echo = FALSE, include = FALSE}
options(tibble.print_min= 4)
options(digits = 2)
```

```{r}
(bt = read_breathtest_excel(btcore_file("ExcelSamples.xlsx"), "2col"))
```

A list is returned, and its only element is a tibble with two columns.
To create a standardized format for fitting and plotting, pass it through `cleanup_data()`, which adds dummy columns `patient_id` (all `pat_a`) and `group` (all `A`).

```{r}
(cbt = cleanup_data(bt))
```

Compute the fit and plot:

```{r, nlsfit, fig.height = 3, fig.width = 4}
cbt %>%
  nls_fit() %>%
  plot()
```

### Three-column format

When you have more than one patient, you must add a column `patient_id` which may be numeric or a string.

```{r}
(bt = read_breathtest_excel(btcore_file("ExcelSamples.xlsx"), "3col"))
```

```{r}
(cbt = cleanup_data(bt))
```

`cleanup_data()` adds a dummy group `A`, so the data are now in the standardized format.

### Four-column format

The four-column format with column names `patient_id, group, minute, pdr` is the standard format. In cross-over designs, you can have different groups for one patient.

```{r, four_col}
bt = read_breathtest_excel(btcore_file("ExcelSamples.xlsx"), "4col_2group") %>%
  cleanup_data()
kable(sample_frac(bt, 0.08) %>% arrange(patient_id, group),
      caption = "A sample from a four-column format. See worksheet 4col_2group.")
```

```{r, nlme_fit, fig.width = 7}
bt %>%
  nlme_fit() %>%
  plot()
```

### DOB instead of PDR

When you have DOB data (d), you can use `dob` instead of `pdr` as the header of the last column. DOB data will be automatically converted to PDR with function `dob_to_pdr()`. Since no body weight and height are given, the defaults of 75 kg and 180 cm are assumed. The half-emptying time and lags do not depend on these assumptions. Only the parameter `m` of the fit, which normalizes area and amplitude, is affected, and I do not know of a case where `m` has been used in clinical practice.
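
The effect of the assumed body size can be checked directly. The following is a minimal sketch, not part of the package examples: the DOB values are made up, the alternative weight and height are arbitrary, and it assumes that `dob_to_pdr()` accepts `weight` and `height` arguments with the defaults stated above.

```r
library(breathtestcore)

# Hypothetical DOB readings, for illustration only
dob <- c(2, 5, 10, 15, 12, 8)

# Conversion with the default body weight (75 kg) and height (180 cm)
pdr_default <- dob_to_pdr(dob)

# Conversion for a heavier, taller subject; these values are assumptions
pdr_large <- dob_to_pdr(dob, weight = 95, height = 190)

# If body size only rescales the curve, this ratio is the same at every
# time point, so the curve shape and the half-emptying time do not change
ratio <- pdr_large / pdr_default
ratio
```

A constant ratio means that only the amplitude of the PDR curve changes, which is why only `m` should be affected by the assumed weight and height.
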
# Vendor-specific formats

### IRIS-Wagner composite data

The first lines of `IrisMulti.TXT`:

```
"Testergebnis"
"Nummer","22"
"Datum","12.06.2009"
"Testart"
"Name","Magenentleerung fest"
"Abkürzung","GE FEST"
"Substrat","Natriumoktanoat"
```

Use `read_iris()` or `read_any_breathtest()`:

```{r, iriswagner, fig.cap = "IRIS/Wagner composite file. These data cannot be fitted successfully with the single-curve fit method, therefore only data are shown."}
read_iris(btcore_file("IrisMulti.TXT")) %>%
  cleanup_data() %>%
  null_fit() %>%
  plot()
```

### IRIS/Wagner CSV format

Files in this format start like this (lines shortened ...)

```
"Name","Vorname","Test","Identifikation","Testzeit[min]",...
"Einstein","Albert","GE FEST","330240","0","0","-26.32","4.501891E-02", ...
"Einstein","Albert","GE FEST","330240","10","2.02","-24.3","5.617962E-02","2.391013",..
"Einstein","Albert","GE
```

Use `read_iris_csv()` or `read_any_breathtest()`:

```{r, iris_csv, fig.cap = "IRIS/Wagner CSV file"}
read_iris_csv(btcore_file("Standard.TXT")) %>%
  cleanup_data() %>%
  nls_fit() %>%
  plot()
```

### BreathID composite format

Files in this format start like this

```
Test and Patient parameters
Date - 12/11/12
End time - 08:54
Start time - 12:49
Patient # - 0
Patient ID - Franz
```

Use `read_breathid()` or `read_any_breathtest()`:

```{r, breathidc, fig.cap = "BreathID composite file"}
read_breathid(btcore_file("350_20043_0_GER.txt")) %>%
  cleanup_data() %>%
  nls_fit() %>%
  plot()
```

### BreathID XML format

The more recent XML format from BreathID can contain data from multiple records and starts like this:

```
TEST123 N/A 19Jul2017 11:56 19Jul2017 12:12 0 true 45689 19Jul2017 12:22 19Jul2017 12:29 0
```

Use `read_breathid_xml()` or `read_any_breathtest()`:

```{r, breathid_xml, fig.cap = "BreathID XML format"}
read_breathid_xml(btcore_file("NewBreathID_multiple.xml")) %>%
  cleanup_data() %>%
  nls_fit() %>%
  plot()
```

Grouping is most useful in a cross-over design to force within-subject comparisons by
functions `coef_by_group()` and `coef_diff_by_group()`; in the above case, the default grouping might not be what you require. To remove the groups, overwrite the `group` column manually, but do not delete the column with `group = NULL`, because the fitting functions require a dummy group name.

```{r, breathid_man, fig.cap = "BreathID XML format with manual grouping."}
# Could also use read_any_breathtest()
read_breathid_xml(btcore_file("NewBreathID_multiple.xml")) %>%
  cleanup_data() %>%
  mutate(
    group = "New"
  ) %>%
  nls_fit() %>%
  plot()
```
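
For data with meaningful groups, the comparison functions named above can be applied to a fit. The sketch below is only an illustration: it assumes that the built-in `usz_13c` data set contains the patient identifiers listed in the filter (the selection itself is arbitrary) and that both functions can be called on the fit with their default arguments.

```r
library(breathtestcore)
suppressPackageStartupMessages(library(dplyr))

data("usz_13c")

# Arbitrary subset to keep the example fast; the IDs are assumed to exist
fit <- usz_13c %>%
  filter(patient_id %in% c("norm_001", "norm_002", "norm_003", "norm_004",
                           "pat_001", "pat_002", "pat_003")) %>%
  cleanup_data() %>%
  nls_fit()

# Estimates of the fit parameters for each group
cf <- coef_by_group(fit)
cf

# Pairwise differences of the parameters between groups
cf_diff <- coef_diff_by_group(fit)
cf_diff
```

Groups with very few records can make these comparisons unstable, so a larger subset is preferable for real analyses.
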