---
title: "Synthesising Data from Marginals"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Synthesising Data from Marginals}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

Data is synthesised by sampling from a multivariate cumulative distribution (a copula) using the `simstudy` package.

# Without Correlations

Data can be synthesised from marginal distributions using the `synthesise_data()` function:

```{r synthesise_data}
library(RESIDE)
marginals <- import_marginal_distributions()
simulated_data <- synthesise_data(marginals)
```

# With Correlations

User-specified correlations can be added to the synthesised data by supplying a correlation matrix.
An empty correlation matrix can be generated using the `export_empty_cor_matrix()` function, passing the marginals imported with `import_marginal_distributions()` and a folder path:

```{r export_cor_matrix}
library(RESIDE)
marginals <- import_marginal_distributions()
export_empty_cor_matrix(marginals, folder_path = tempdir())
```

By default the file will be named `correlation_matrix.csv`; this can be changed with the `file_name` parameter.

The exported CSV file will be a symmetric table which looks like:

```{r print_cor_matrix, eval = TRUE, echo = FALSE}
.cor_matrix <- utils::read.csv("correlation_matrix.csv")
.cor_matrix <- tibble::column_to_rownames(.cor_matrix, names(.cor_matrix)[1])
DT::datatable(
  .cor_matrix,
  options = list(
    pageLength = 10,
    scrollX = '400px'
  )
)
```

Correlations should then be added to the CSV file, without modifying the column or row names.
Correlations should be specified as rank-order correlations.
Categorical variables are represented as dummy variables named in the format `<variable name>_<category name>`, e.g. `SEX_F`.

**Note** the correlation matrix should be symmetrical and positive semi-definite.

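
Before importing an edited matrix, it can be useful to check both requirements by hand. The following is a minimal base R sketch, not part of RESIDE: the helper `is_valid_cor_matrix()` and the variable names in the toy matrices are hypothetical and chosen only for illustration. It tests symmetry with `isSymmetric()` and positive semi-definiteness by checking that no eigenvalue is negative, up to a small tolerance.

```{r check_cor_matrix}
# Hypothetical helper (not a RESIDE function): TRUE when the matrix is
# symmetric and all of its eigenvalues are non-negative
is_valid_cor_matrix <- function(m, tol = 1e-8) {
  if (!isSymmetric(m)) {
    return(FALSE)
  }
  eigenvalues <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
  all(eigenvalues >= -tol)
}

# Illustrative variable names only
vars <- c("AGE", "BMI", "SEX_F")

# A consistent set of correlations
good <- matrix(
  c(1.0, 0.3, 0.2,
    0.3, 1.0, 0.5,
    0.2, 0.5, 1.0),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# Symmetric, but the three pairwise correlations contradict each other,
# so the matrix has a negative eigenvalue
bad <- matrix(
  c( 1.0, 0.9, -0.9,
     0.9, 1.0,  0.9,
    -0.9, 0.9,  1.0),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

good_ok <- is_valid_cor_matrix(good)
bad_ok <- is_valid_cor_matrix(bad)
```

Here `good_ok` is `TRUE` and `bad_ok` is `FALSE`: a matrix can be symmetric with a unit diagonal and still fail the positive semi-definite requirement.
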
Once the correlations have been added to the CSV file, they can be imported using the `import_cor_matrix()` function:

```{r import_cor_matrix}
library(RESIDE)
correlation_matrix <- import_cor_matrix()
```

By default the correlation matrix is read from the exported filename (`correlation_matrix.csv`) in the current working directory.
This can be changed with the `file_path` parameter of the `import_cor_matrix()` function, which accepts a relative or absolute file path.

The `import_cor_matrix()` function will produce an error if the matrix is not symmetrical and positive semi-definite, or if the file does not exist.

With a correlation matrix, data can now be synthesised with the user-specified correlations using the `synthesise_data()` function, passing the correlation matrix imported by the `import_cor_matrix()` function:

```{r synthesise_data_with_correlations}
library(RESIDE)
marginals <- import_marginal_distributions()
export_empty_cor_matrix(marginals)
# Add the correlations to the exported CSV file before importing it
correlation_matrix <- import_cor_matrix()
simulated_data <- synthesise_data(
  marginals,
  correlation_matrix
)
```

**NB** It is not possible to entirely maintain all the marginal distributions when specifying correlations; this is a known limitation and is not likely to change.
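
One way to see how closely a synthesised data set follows the requested correlations is to compare its Spearman rank correlations with the supplied matrix. The sketch below is an illustration under stated assumptions and does not call RESIDE or `simstudy`: `toy_data` is a stand-in for `simulated_data`, drawn in base R with a simple Gaussian copula, and the variable names and marginal distributions are invented for the example.

```{r compare_rank_correlations}
set.seed(42)

# Requested correlation between two illustrative variables
vars <- c("AGE", "BMI")
target <- matrix(c(1, 0.6, 0.6, 1), nrow = 2, dimnames = list(vars, vars))

# Stand-in for synthesised data: correlated standard normals built from
# the Cholesky factor, mapped to uniforms and then to arbitrary marginals
n <- 5000
z <- matrix(rnorm(n * 2), ncol = 2) %*% chol(target)
u <- pnorm(z)
toy_data <- data.frame(
  AGE = qunif(u[, 1], min = 18, max = 90),
  BMI = qgamma(u[, 2], shape = 20, rate = 0.8)
)

# Achieved rank correlations and the largest deviation from the target
achieved <- cor(toy_data, method = "spearman")
max_abs_diff <- max(abs(achieved - target))
```

The achieved rank correlations will usually be close to, but not exactly equal to, the requested values. The same comparison could be applied to the numeric columns of a data set returned by `synthesise_data()`.
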