SimMultiCorrData generates continuous (normal or non-normal), binary, ordinal, and count (Poisson or Negative Binomial) variables with a specified correlation matrix. It can also produce a single continuous variable. This package can be used to simulate data sets that mimic real-world situations (e.g. clinical data sets or plasmodes, as in Vaughan et al. (2009)). All variables are generated from standard normal variables with an imposed intermediate correlation matrix. Continuous variables are simulated by specifying mean, variance, skewness, standardized kurtosis, and (for the fifth-order method) fifth and sixth standardized cumulants, using either Fleishman’s Third-Order (1978) or Headrick’s Fifth-Order (2002) Polynomial Transformation. Binary and ordinal variables are simulated using a modification of GenOrd::ordsample (Barbiero and Ferrari 2015a). Count variables are simulated using the inverse cdf method. There are two simulation pathways, which differ primarily in how the intermediate correlation matrix Sigma is calculated. In Correlation Method 1, the intercorrelations involving count variables are determined using a simulation-based, logarithmic correlation correction (adapting the method of Yahav and Shmueli (2012)). In Correlation Method 2, the count variables are treated as ordinal (adapting Barbiero and Ferrari’s (2015b) modification of GenOrd). There is an optional error loop that corrects the final correlation matrix to be within a user-specified precision value of the target. The package also includes functions to calculate standardized cumulants for theoretical distributions or from real data sets, check if a target correlation matrix is within the possible correlation bounds (given the distributions of the simulated variables), summarize results (numerically or graphically), verify valid power method pdfs, and calculate lower standardized kurtosis bounds.
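As a rough illustration of the pipeline described above, the following Python sketch (not the package's R code) draws two correlated standard normal variables, passes one through a third-order polynomial, and maps the other to a Poisson count with the inverse cdf method. The function names, the default identity polynomial constants, and the fixed intermediate correlation rho are assumptions made for illustration only; in the package the constants are solved from the target cumulants and the intermediate correlation is derived from the target correlation.

```python
import math
import random


def std_normal_cdf(z):
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def poisson_quantile(u, lam):
    """Inverse cdf: smallest k with P(Y <= k) >= u for Y ~ Poisson(lam)."""
    u = min(u, 1.0 - 1e-12)  # u == 1 has no finite quantile, so cap it
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def simulate_pair(n, rho, lam, consts=(0.0, 1.0, 0.0, 0.0), seed=1234):
    """Return n (continuous, count) pairs built from correlated normals.

    consts = (c0, c1, c2, c3) are hypothetical polynomial constants;
    the default leaves the continuous variable standard normal.
    """
    c0, c1, c2, c3 = consts
    rng = random.Random(seed)  # fixed seed for reproducibility
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Second normal with intermediate correlation rho to the first.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        y = c0 + c1 * z1 + c2 * z1 ** 2 + c3 * z1 ** 3
        count = poisson_quantile(std_normal_cdf(z2), lam)
        pairs.append((y, count))
    return pairs


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    sample = simulate_pair(5000, rho=0.6, lam=3.0)
    ys = [y for y, _ in sample]
    counts = [c for _, c in sample]
    print("mean count:", sum(counts) / len(counts))
    print("final correlation:", pearson(ys, counts))
```

The final correlation printed here generally differs from rho, because the transformations change the correlation; this gap is why an intermediate correlation matrix has to be calculated rather than set equal to the target.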
The main strengths of SimMultiCorrData are:
The user may generate correlated continuous (normal or non-normal), ordinal (r >= 2 categories), Poisson and/or Negative Binomial variables simultaneously, based on either theoretical distributions or empirical data.
Two distinct methods for generating non-normal continuous variables: Fleishman’s third-order or Headrick’s fifth-order polynomial transformation.
Two distinct methods for generating count variables (see Comparison of Correlation Method 1 and Correlation Method 2 vignette). The user may test each to see which yields greater simulation accuracy.
Calculation of the precise lower kurtosis boundary using the Lagrangean constraint equations, instead of an approximation (see calc_lower_skurt).
Valid power method pdf checks during the calculation of the constants for continuous variables, and optional use of a sixth cumulant correction value to enable the discovery of valid pdf constants.
Computation of feasible correlation bounds based on the data simulation method (see valid_corr for Correlation Method 1 or valid_corr2 for Correlation Method 2).
Numerous attempts to reproduce the desired correlation matrix, including correcting for non-positive-definite intermediate correlation matrices and an optional final error loop (see Overview of Error Loop vignette). This error loop enables reproduction of many correlation structures that cannot be achieved through other methods.
Function arguments (e.g. seed, n, maxit, epsilon) that allow the user to have greater control over simulation accuracy, speed, and reproducibility.
Detailed simulation results, including the simulation time (in minutes) and descriptions of the generated variables and the correlation structure.
Additional functions to supplement the simulation process:
Calculation of standardized cumulants for theoretical distributions (calc_theory) or a vector of data by the method of moments (calc_moments) or based on Fisher’s k-statistics (calc_fisherk). Additional summary functions compute important statistics for the generated continuous variables.
Graphical summaries are returned as ggplot2 objects so that the user may save them or further adapt the graphs as necessary.
There are several other simulation packages, for example Barbiero & Ferrari’s (2015a) GenOrd, Amatya & Demirtas’ (2016a) MultiOrd, Leisch, Kaiser, & Hornik’s (2010) orddata, and Demirtas, Nordgren, & Allozi’s (2017) PoisBinOrdNonNor. The first three generate only binary and ordinal data, while the last generates Poisson, binary, ordinal, and non-normal variables.
GenOrd
GenOrd generates discrete random variables (i.e. binary or ordinal) with a given correlation matrix and marginal distributions. The method used to determine the intermediate multivariate normal (MVN) correlation matrix in GenOrd::ordcont has been modified in SimMultiCorrData’s ordnorm function. It works by initially setting the intermediate correlation equal to the target correlation of the discrete variables. Each intermediate pairwise correlation is then updated until the final pairwise correlation is within a user-specified precision value (epsilon) of the target correlation or the maximum number of iterations (maxit) has been reached. GenOrd::ordcont has been modified in the following ways:
[...] SimMultiCorrData::valid_corr or valid_corr2.
[...] Sigma for all variable types, and if necessary, Sigma is converted to the nearest positive-definite matrix using Higham’s (2002) algorithm in Matrix::nearPD.
SimMultiCorrData::ordnorm uses GenOrd::contord to calculate the ordinal correlation obtained from discretizing the normal variables generated from the intermediate correlation matrix Sigma. This function does not require random generation of the normal variables, which ensures greater reproducibility.
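The update loop described above can be sketched for the simplest case: two binary variables, each obtained by splitting a standard normal variable at its median. In that special case the correlation after discretization has the closed form 2·asin(rho)/π, so no random draws are needed, which is the same reproducibility argument made for contord. The Python below is a hypothetical illustration, not the code of ordnorm or GenOrd::ordcont; the function names and the multiplicative update rule are assumptions.

```python
import math


def binary_corr_from_normal(rho):
    """Correlation of two median-split binary variables whose underlying
    standard normals have correlation rho (closed form, no simulation)."""
    return 2.0 * math.asin(rho) / math.pi


def intermediate_corr(target, epsilon=0.001, maxit=1000):
    """Iterate the intermediate correlation until the discretized
    correlation is within epsilon of target, or maxit is reached."""
    rho = target  # start with the intermediate correlation equal to the target
    achieved = binary_corr_from_normal(rho)
    iterations = 0
    while abs(achieved - target) > epsilon and iterations < maxit:
        # Assumed update rule: rescale by the ratio target / achieved.
        rho = rho * target / achieved
        # Keep rho strictly inside (-1, 1) so it stays a valid correlation.
        rho = max(-0.999999, min(0.999999, rho))
        achieved = binary_corr_from_normal(rho)
        iterations += 1
    return rho, achieved, iterations


if __name__ == "__main__":
    rho, achieved, iterations = intermediate_corr(0.4)
    print("intermediate:", rho)
    print("achieved:", achieved)
    print("iterations:", iterations)
```

With this simple update rule, convergence is not guaranteed for targets close to ±1, which is one reason a maxit cap is useful: the loop then stops with the last value rather than running indefinitely.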
SimMultiCorrData also improves the way ordinal variables are generated, as compared to GenOrd::ordsample:
SimMultiCorrData::rcorrvar and rcorrvar2 allow a user-specified seed, maximum number of iterations, and epsilon value.
GenOrd::ordsample stops if the intermediate correlation matrix Sigma is not positive-definite. As described above, SimMultiCorrData attempts to correct the problem and gives a warning that it may not be possible to produce the desired correlation matrix.
MultiOrd
MultiOrd generates multivariate ordinal data with a given correlation matrix and marginal distributions via the binary conversion method of Demirtas (2006). This method computes the binary marginals by collapsing the marginal distributions of the ordinal variables. The intermediate correlation matrix is also computed through an iterative process based on matching the target matrix. Binary data are then converted to ordinal data through a randomization step. This procedure requires the simulation of large samples of binary data in order to maximize accuracy, which requires greater computational time and resources than the methods used in SimMultiCorrData.
orddata
orddata generates binary and ordinal data through 4 available methods:
PoisBinOrdNonNor
PoisBinOrdNonNor is one in an extensive series of simulation packages created by Demirtas with additional authors. Other packages include OrdNor (Amatya and Demirtas 2015), BinNonNor (Inan and Demirtas 2016a), BinOrdNonNor (Demirtas, Wang, and Allozi 2017), PoisBinOrd (Inan and Demirtas 2016b), PoisNor (Amatya and Demirtas 2016b), and PoisBinOrdNor (Demirtas, Hu, and Allozi 2017). PoisBinOrdNonNor generates Poisson, binary, ordinal, and non-normal variables. It differs from SimMultiCorrData in the following ways:
SimMultiCorrData’s simulation functions rcorrvar and rcorrvar2 allow the user either to provide an intermediate matrix or to have it calculated during the simulation.
[...] SimMultiCorrData. However, PoisBinOrdNonNor does not produce Negative Binomial variables.
[...] SimMultiCorrData. However, those for ordinal variables are found using ordcont, which, as previously mentioned, will stop if the intermediate matrix is not positive-definite.
SimMultiCorrData contains the functions power_norm_corr and pdf_check. The function that solves for the constants (SimMultiCorrData::find_constants) executes these checks when finding the constants and attempts to produce valid pdf constants. In the case of Headrick’s fifth-order method, the user may specify a sixth cumulant correction value to help find these constants.
The lower kurtosis boundary used in PoisBinOrdNonNor is a simple approximation: \(\text{standardized kurtosis} \ge \text{skew}^2 - 2\). SimMultiCorrData::calc_lower_skurt solves the Lagrangean expressions (as described in Headrick (2002) and Headrick and Sawilowsky (2002)) that determine the precise lower kurtosis boundary. Examination of the boundaries computed in PoisBinOrdNonNor demonstrates that the approximate boundaries are much lower than the actual Fleishman boundaries, indicating that the guideline is not accurate (see calc_lower_skurt for examples).
PoisBinOrdNonNor does not allow the user to specify a seed for random number generation, or an epsilon value and maximum number of iterations to use when determining the intermediate ordinal correlations. These specifications, as found in SimMultiCorrData’s simulation functions rcorrvar and rcorrvar2, are essential for reproducibility and controlling accuracy.
SimMultiCorrData’s simulation functions produce detailed summaries of the variables, the final correlation matrix, the maximum error between the final and target correlation matrices, and the simulation time.
Amatya, A, and H Demirtas. 2015. OrdNor: An R Package for Concurrent Generation of Correlated Ordinal and Normal Data. Journal of Statistical Software Code Snippets. Vol. 68. doi:10.18637/jss.v068.c02.
———. 2016a. MultiOrd: Generation of Multivariate Ordinal Variates. https://CRAN.R-project.org/package=MultiOrd.
———. 2016b. PoisNor: Simultaneous Generation of Multivariate Data with Poisson and Normal Marginals. https://CRAN.R-project.org/package=PoisNor.
Barbiero, A, and P A Ferrari. 2015a. GenOrd: Simulation of Discrete Random Variables with Given Correlation Matrix and Marginal Distributions. https://CRAN.R-project.org/package=GenOrd.
———. 2015b. “Simulation of Correlated Poisson Variables.” Applied Stochastic Models in Business and Industry 31: 669–80. doi:10.1002/asmb.2072.
Demirtas, H. 2006. “A Method for Multivariate Ordinal Data Generation Given Marginal Distributions and Correlations.” Journal of Statistical Computation and Simulation 76 (11): 1017–25. doi:10.1080/10629360600569246.
Demirtas, H, D Hedeker, and R J Mermelstein. 2012. “Simulation of Massive Public Health Data by Power Polynomials.” Statistics in Medicine 31 (27): 3337–46. doi:10.1002/sim.5362.
Demirtas, H, Y Hu, and R Allozi. 2017. PoisBinOrdNor: Data Generation with Poisson, Binary, Ordinal and Normal Components. https://CRAN.R-project.org/package=PoisBinOrdNor.
Demirtas, H, R Nordgren, and R Allozi. 2017. PoisBinOrdNonNor: Generation of up to Four Different Types of Variables. https://CRAN.R-project.org/package=PoisBinOrdNonNor.
Demirtas, H, Y Wang, and R Allozi. 2017. BinOrdNonNor: Concurrent Generation of Binary, Ordinal and Continuous Data. https://CRAN.R-project.org/package=BinOrdNonNor.
Fleishman, A I. 1978. “A Method for Simulating Non-Normal Distributions.” Psychometrika 43: 521–32. doi:10.1007/BF02293811.
Headrick, T C. 2002. “Fast Fifth-Order Polynomial Transforms for Generating Univariate and Multivariate Non-Normal Distributions.” Computational Statistics and Data Analysis 40 (4): 685–711. doi:10.1016/S0167-9473(02)00072-5.
Headrick, T C, and S S Sawilowsky. 2002. “Weighted Simplex Procedures for Determining Boundary Points and Constants for the Univariate and Multivariate Power Methods.” Journal of Educational and Behavioral Statistics 25: 417–36. doi:10.3102/10769986025004417.
Inan, G, and H Demirtas. 2016a. BinNonNor: Data Generation with Binary and Continuous Non-Normal Components. https://CRAN.R-project.org/package=BinNonNor.
———. 2016b. PoisBinOrd: Data Generation with Poisson, Binary and Ordinal Components. https://CRAN.R-project.org/package=PoisBinOrd.
Kaiser, S, D Traeger, and F Leisch. 2011. “Generating Correlated Ordinal Random Values.” Technical Report Number 94. Department of Statistics at University of Munich. https://epub.ub.uni-muenchen.de/12157/1/kaiser-tr-94-ordinal.pdf.
Leisch, F, A W S Kaiser, and K Hornik. 2010. Orddata: Generation of Artificial Ordinal and Binary Data.
Vaughan, L K, J Divers, M Padilla, D T Redden, H K Tiwari, D Pomp, and D B Allison. 2009. “The Use of Plasmodes as a Supplement to Simulations: A Simple Example Evaluating Individual Admixture Estimation Methodologies.” Computational Statistics and Data Analysis 53 (5): 1755–66. doi:10.1016/j.csda.2008.02.032.
Yahav, I, and G Shmueli. 2012. “On Generating Multivariate Poisson Data in Management Science Applications.” Applied Stochastic Models in Business and Industry 28 (1): 91–102. doi:10.1002/asmb.901.