Benefits of SimMultiCorrData and Comparison to Other Packages

Allison C Fialkowski

2018-06-28

SimMultiCorrData generates continuous (normal or non-normal), binary, ordinal, and count (Poisson or Negative Binomial) variables with a specified correlation matrix. It can also produce a single continuous variable. This package can be used to simulate data sets that mimic real-world situations (i.e. clinical data sets, plasmodes, as in Vaughan et al. (2009)). All variables are generated from standard normal variables with an imposed intermediate correlation matrix. Continuous variables are simulated by specifying mean, variance, skewness, standardized kurtosis, and fifth and sixth standardized cumulants using either Fleishman’s Third-Order (1978) or Headrick’s Fifth-Order (2002) Polynomial Transformation. Binary and ordinal variables are simulated using a modification of GenOrd::ordsample Barbiero and Ferrari (2015a). Count variables are simulated using the inverse cdf method. There are two simulation pathways which differ primarily according to the calculation of the intermediate correlation matrix Sigma. In Correlation Method 1, the intercorrelations involving count variables are determined using a simulation based, logarithmic correlation correction (adapting Yahav and Shmueli (2012)’s method). In Correlation Method 2, the count variables are treated as ordinal (adapting Barbiero and Ferrari (2015b)’s modification of GenOrd). There is an optional error loop that corrects the final correlation matrix to be within a user-specified precision value. The package also includes functions to calculate standardized cumulants for theoretical distributions or from real data sets, check if a target correlation matrix is within the possible correlation bounds (given the distributions of the simulated variables), summarize results (numerically or graphically), verify valid power method pdfs, and calculate lower standardized kurtosis bounds.

Benefits of this package:

The main strengths of SimMultiCorrData are:

  1. The user may generate correlated continuous (normal or non-normal), ordinal (r >= 2 categories), Poisson and/or Negative Binomial variables simultaneously, based on either theoretical distributions or empirical data.

  2. Two distinct methods for generating non-normal continuous variables: Fleishman’s third-order or Headrick’s fifth-order polynomial transformation.

    • Simulate distributions whose higher moments are unknown or do not exist using Fleishman’s third-order method.
    • Accurately reproduce non-normal data up to the sixth moment using Headrick’s fifth-order method.
    • Generate data with a wider range of standardized kurtosis values, given skewness and standardized fifth and sixth cumulants, using Headrick’s fifth-order method.
    • Generate more distributions with a valid pdf using Headrick’s fifth-order method.
  3. Two distinct methods for generating count variables (see Comparison of Correlation Method 1 and Correlation Method 2 vignette). The user may test each to see which yields greater simulation accuracy.

  4. Calculation of the precise lower kurtosis boundary using the Lagrangean constraint equations, instead of an approximation (see calc_lower_skurt).

  5. Valid power method pdf checks during the calculation of the constants for continuous variables, and optional use of a sixth cumulant correction value to enable the discovery of valid pdf constants.

  6. Computation of feasible correlation bounds based on data simulation method (see valid_corr for correlation method 1 or valid_corr2 for correlation method 2).

  7. Numerous attempts to reproduce the desired correlation matrix, including correcting for non-positive-definite intermediate correlation matrices and an optional final error loop (see Overview of Error Loop vignette). This error loop enables reproduction of many correlation structures that can not be achieved through other methods.

  8. Function arguments (i.e. seed, n, maxit, epsilon) that allow the user to have greater control over simulation accuracy, speed, and reproducibility.

  9. Detailed simulation results, including the simulation time (in minutes) and descriptions of the generated variables and the correlation structure.

  10. Additional functions to supplement the simulation process:

    • Summary functions that may be used to determine standardized cumulants from theoretical distributions (calc_theory) or a vector of data by the method of moments (calc_moments) or based on Fisher’s k-statistics (calc_fisherk). Additional summary functions compute important statistics for the generated continuous variables.
    • Plotting functions for comparing simulated data to either theoretical distributions or empirical data. The plots display either actual data, cdfs, or pdfs, and work for continuous, ordinal, and/or count variables. The cdf plotting functions enable the user to specify a y-value to calculate the cumulative probability. This region on the graph is filled in and the probability is given on the graph. The data values and pdf graphs enable the user to overlay the target distributions. These functions produce ggplot2 objects so that the user may save them or further adapt the graphs as necessary.

Comparison to other R packages:

There are several other simulation packages. For example, Barbiero & Ferrari’s (2015a) GenOrd, Amatya & Demirtas’ (2016a) MultiOrd, Leisch, Kaiser, & Hornik’s (2010) orddata, and Demirtas, Nordgren, & Allozi’s (2017) PoisBinOrdNonNor. The first three generate only binary and ordinal data, while the last generates Poisson, binary, ordinal, and non-normal variables.

Barbiero & Ferrari’s (2015) GenOrd

GenOrd generates discrete random variables (i.e. binary or ordinal) with given correlation matrix and marginal distributions. The method used to determine the intermediate MVN correlation matrix in GenOrd::ordcont has been modified in SimMultiCorrData’s ordnorm function. It works by setting the intermediate correlation equal to the target correlation of the discrete variables. Each intermediate pairwise correlation is updated until the final pairwise correlation is within a user-specified precision value (epsilon) of the target correlation or the maximum number of iterations (maxit) has been reached. GenOrd::ordcont has been modified in the following ways:

  1. The initial correlation check has been removed because it is assumed the user has done this before simulation using SimMultiCorrData::valid_corr or valid_corr2.
  2. The final positive-definite check has been removed because this is done on the intermediate correlation matrix Sigma for all variable types, and if necessary, Sigma is converted to the nearest positive-definite matrix using Higham’s (2002) algorithm in Matrix::nearPD.
  3. The intermediate correlation update function was changed to accommodate more situations.

SimMultiCorrData::ordnorm uses GenOrd::contord to calculate the ordinal correlation obtained from discretizing the normal variables generated from the intermediate correlation matrix Sigma. The reason is because the function does not require random generation of the normal variables, which ensures greater reproducibility.

SimMultiCorrData also improves the way ordinal variables are generated, as compared to GenOrd::ordsample:

  1. The simulation functions SimMultiCorrData::rcorrvar and rcorrvar2 allow a user-specified seed, maximum number of iterations, and epsilon value.
  2. GenOrd::ordsample stops if the intermediate correlation matrix Sigma is not positive-definite. As described above, SimMultiCorrData attempts to correct the problem and a warning is given that it may not be possible to produce the desired correlation matrix.

Amatya & Demirtas’ (2016) MultiOrd

MultiOrd generates multivariate ordinal data with given correlation matrix and marginal distributions via the binary conversion method of Demirtas (2006). This method computes the binary marginals by collapsing the marginal distributions of the ordinal variables. The intermediate correlation matrix is also computed through an iterative process based on matching the target matrix. Binary data are then converted to ordinal data through a randomization step. This procedure requires the simulation of large samples of binary data in order to maximize accuracy, which requires greater computational time and resources than the methods used in SimMultiCorrData.

Leisch, Kaiser, & Hornik’s (2010) orddata

orddata generates binary and ordinal data through 4 available methods:

  1. Mean mapping: this method involves an analytical algorithm for determining the intermediate MVN correlation. It creates the ordinal variables by thresholding the normal variates. The mean mapping method provides a wider correlation range than the binary conversion method, but the run time can be greater, depending on the number of categories and variables.
  2. Binary conversion: Kaiser et al.‘s binary conversion is similar to the method of Demirtas. However, while Demirtas’ method uses simulations to determine the binary correlation required to create the desired ordinal variables, Kaiser et al. developed an algorithm to analytically determine the binary correlation (see Kaiser, Traeger, and Leisch (2011)). The authors note that although the feasible correlation range is smaller than that of Demirtas, the analytical solution decreases computational time.
  3. Monte Carlo simulation: this method generates ordinal values by shifting the variables until the desired correlation structure is achieved.
  4. A “native” method: this method involves a simpler analytical algorithm than the mean mapping method and also thresholds normal variates to generate the ordinal values.

Demirtas, Nordgren, & Allozi’s (2017) PoisBinOrdNonNor

PoisBinOrdNonNor is one in an extensive series of simulation packages created by Demirtas with additional authors. Other packages include OrdNor (Amatya and Demirtas 2015), BinNonNor (Inan and Demirtas 2016a), BinOrdNonNor (Demirtas, Wang, and Allozi 2017), PoisBinOrd (Inan and Demirtas 2016b), PoisNor (Amatya and Demirtas 2016b), and PoisBinOrdNor (Demirtas, Hu, and Allozi 2017). PoisBinOrdNonNor generates Poisson, binary, ordinal, and non-normal variables. It differs from SimMultiCorrData in the following ways:

  1. The intermediate correlation matrix must be found using a function separate from the simulation function, while SimMultiCorrData’s simulation functions rcorrvar and rcorrvar2 allow the user to either provide an intermediate matrix or the matrix is calculated during the simulation.
  2. The intermediate correlations for the Poisson variables are found using the method of Yahav & Shmueli (2012), as in method 1 of SimMultiCorrData. However, PoisBinOrdNonNor does not produce Negative Binomial variables.
  3. Binary intermediate correlations are calculated from tetrachoric correlations based on the method of Demirtas, Hedeker, and Mermelstein (2012), as in SimMultiCorrData. However, those for ordinal variables are found using ordcont, which, as previously mentioned, will stop if the intermediate matrix is not positive-definite.
  4. Non-normal variables are generated using Fleishman’s third-order polynomial transformation, which generally yields simulation accuracy inferior to Headrick’s fifth-order transformation and has a smaller range of possible kurtosis values.
  5. The authors do not provide any checks for positive correlation of the continuous variable with the generating standard normal variable or for a valid power method pdf, while SimMultiCorrData contains the functions power_norm_corr and pdf_check. The function that solves for the constants (SimMultiCorrData::find_constants) executes these checks when finding the constants and attempts to produce valid pdf constants. In the case of Headrick’s fifth-order method, the user may specify a sixth cumulant correction value to help find these constants.
  6. The lower kurtosis boundary check in PoisBinOrdNonNor is a simple approximation: \(\Large standardized\ kurtosis \ge skew^2 - 2\). SimMultiCorrData::calc_lower_skurt solves the Lagrangean expressions (as described in Headrick (2002) and Headrick and Sawilowsky (2002)) that determine the precise lower kurtosis boundary. Examination of the boundaries computed in PoisBinOrdNonNor demonstrates that the approximate boundaries are much lower than the actual Fleishman boundaries, indicating that the guideline is not accurate (see calc_lower_skurt for examples).
  7. PoisBinOrdNonNor does not allow the user to specify a seed for random number generation, or an epsilon value and maximum number of iterations to use when determining the intermediate ordinal correlations. These specifications, as found in SimMultiCorr’s simulation functions rcorrvar and rcorrvar2, are essential for reproducibility and controlling accuracy.
  8. The results include only the generated variables, while SimMultiCorr’s simulation functions produce detailed summaries of the variables, the final correlation matrix, the maximum error between the final and target correlation matrices, and the simulation time.

References

Amatya, A, and H Demirtas. 2015. OrdNor: An R Package for Concurrent Generation of Correlated Ordinal and Normal Data. Journal of Statistical Software Code Snippets. Vol. 68. doi:10.18637/jss.v068.c02.

———. 2016a. MultiOrd: Generation of Multivariate Ordinal Variates. https://CRAN.R-project.org/package=MultiOrd.

———. 2016b. PoisNor: Simultaneous Generation of Multivariate Data with Poisson and Normal Marginals. https://CRAN.R-project.org/package=PoisNor.

Barbiero, A, and P A Ferrari. 2015a. GenOrd: Simulation of Discrete Random Variables with Given Correlation Matrix and Marginal Distributions. https://CRAN.R-project.org/package=GenOrd.

———. 2015b. “Simulation of Correlated Poisson Variables.” Applied Stochastic Models in Business and Industry 31: 669–80. doi:10.1002/asmb.2072.

Demirtas, H. 2006. “A Method for Multivariate Ordinal Data Generation Given Marginal Distributions and Correlations.” Journal of Statistical Computation and Simulation 76 (11): 1017–25. doi:10.1080/10629360600569246.

Demirtas, H, D Hedeker, and R J Mermelstein. 2012. “Simulation of Massive Public Health Data by Power Polynomials.” Statistics in Medicine 31 (27): 3337–46. doi:10.1002/sim.5362.

Demirtas, H, Y Hu, and R Allozi. 2017. PoisBinOrdNor: Data Generation with Poisson, Binary, Ordinal and Normal Components. https://CRAN.R-project.org/package=PoisBinOrdNor.

Demirtas, H, R Nordgren, and R Allozi. 2017. PoisBinOrdNonNor: Generation of up to Four Different Types of Variables. https://CRAN.R-project.org/package=PoisBinOrdNonNor.

Demirtas, H, Y Wang, and R Allozi. 2017. BinOrdNonNor: Concurrent Generation of Binary, Ordinal and Continuous Data. https://CRAN.R-project.org/package=BinOrdNonNor.

Fleishman, A I. 1978. “A Method for Simulating Non-Normal Distributions.” Psychometrika 43: 521–32. doi:10.1007/BF02293811.

Headrick, T C. 2002. “Fast Fifth-Order Polynomial Transforms for Generating Univariate and Multivariate Non-Normal Distributions.” Computational Statistics and Data Analysis 40 (4): 685–711. doi:10.1016/S0167-9473(02)00072-5.

Headrick, T C, and S S Sawilowsky. 2002. “Weighted Simplex Procedures for Determining Boundary Points and Constants for the Univariate and Multivariate Power Methods.” Journal of Educational and Behavioral Statistics 25: 417–36. doi:10.3102/10769986025004417.

Inan, G, and H Demirtas. 2016a. BinNonNor: Data Generation with Binary and Continuous Non-Normal Components. https://CRAN.R-project.org/package=BinNonNor.

———. 2016b. PoisBinOrd: Data Generation with Poisson, Binary and Ordinal Components. https://CRAN.R-project.org/package=PoisBinOrd.

Kaiser, S, D Traeger, and F Leisch. 2011. “Generating Correlated Ordinal Random Values.” Technical Report Number 94. Department of Statistics at University of Munich. https://epub.ub.uni-muenchen.de/12157/1/kaiser-tr-94-ordinal.pdf.

Leisch, F, A W S Kaiser, and K Hornik. 2010. Orddata: Generation of Artificial Ordinal and Binary Data.

Vaughan, L K, J Divers, M Padilla, D T Redden, H K Tiwari, D Pomp, and D B Allison. 2009. “The Use of Plasmodes as a Supplement to Simulations: A Simple Example Evaluating Individual Admixture Estimation Methodologies.” Computational Statistics and Data Analysis 53 (5): 1755–66. doi:10.1016/j.csda.2008.02.032.

Yahav, I, and G Shmueli. 2012. “On Generating Multivariate Poisson Data in Management Science Applications.” Applied Stochastic Models in Business and Industry 28 (1): 91–102. doi:10.1002/asmb.901.