---
title: "How to use the wqspt package"
author: "Drew Day, James Peng"
date: "5/26/2022 (Updated: 3/4/2025)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How to use the wqspt package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction 

Weighted quantile sum (WQS) regression is a statistical technique to evaluate the 
effect of complex exposure mixtures on an outcome (
[Carrico 2015](https://doi.org/10.1007/s13253-014-0180-3)). It is a 
single-index method which estimates a combined mixture sum effect as well as weights 
determining each individual mixture component's contributions to the sum effect. 
However, the model features a statistical power and Type I error (i.e., false positive) 
rate tradeoff, as there is a machine learning step to determine the weights that 
optimize the linear model fit. If the full data is used to estimate both the mixture 
component weights and the regression coefficients, there is high power but also a 
high false positive rate since coefficient p-values are calculated for a weighted 
mixture independent variable with weights that have already been optimized to find 
a large effect. 

We recently proposed alternative methods based on a permutation test that should 
reliably allow for both high power and low false positive rate when utilizing WQS
regression. The permutation test is a method of obtaining a p-value by simulating 
the null distribution through permutations of the data. The permutation test 
algorithm is described more in detail and validated in 
[Day et al. 2022](https://doi.org/10.1289/EHP10570). 
The version of this permutation test used for a continuous outcome 
variable has been applied in [Loftus et al. 2021](https://doi.org/10.1016/j.envint.2021.106409), 
[Day et al. 2021](https://doi.org/10.1016/j.envint.2020.106330), 
[Wallace et al. 2022](https://doi.org/10.1016/j.envint.2021.107039), 
[Barrett et al. 2022](https://doi.org/10.1016/j.envint.2022.107078),
[Freije et al. 2022](https://doi.org/10.1016/j.envint.2022.107246), 
[Wallace et al. 2023](https://doi.org/10.1016/j.envres.2022.114759),
[Sun et al. 2023](https://doi.org/10.1016/j.envint.2023.108009), 
[Barrett et al. 2024](https://doi.org/10.1016/j.envint.2024.108425), and 
[Hazlehurst et al. 2024](https://doi.org/10.1007/s11524-024-00829-z).

Another version of the permutation test adapted for WQS logistic regression with a binary outcome
variable is applied in [Loftus et al. 2022](https://doi.org/10.1016/j.envint.2022.107494) 
and described in the supplement of that paper. This WQS logistic regression 
with the permutation test has also been utilized in 
[Barrett et al. 2024](https://doi.org/10.1016/j.envint.2024.108425), 
[Day et al. 2024](https://doi.org/10.1016/j.envint.2024.108486), 
[Sherris et al. 2024](https://doi.org/10.1186/s12940-024-01066-2), and 
[Yin et al. 2024](https://doi.org/10.1016/j.ecoenv.2024.116428).

## About WQS 

The goal of WQS regression is to determine whether an exposure mixture is associated 
with an outcome in a prespecified direction. It fits the following model: 

$Y = \beta_0 + \beta_1(\sum_{i=1}^{m} w_i {X_q}_i) + Z'\gamma$   

Where $Y$ is the outcome variable, $\beta_0$ is the intercept, $\beta_1$ is the 
coefficient for the weighted quantile sum, $\sum_{i=1}^{c} w_i {X_q}_i$ is the 
weighted index for the set of quantiled mixture exposures, $Z$ is the set of 
covariates, and $\gamma$ is the regression coefficients for the covariates. 

A full description of the WQS methodology is described in [Carrico 2015](https://doi.org/10.1007/s13253-014-0180-3). 

## Permutation Test 

The WQS regression comprises two steps, for which we typically split the data 
into a training and validation set. Doing this reduces statistical power since 
we are training our model on only part of the data. On the other hand, if we skip this 
training/test split, we can get a skewed representation of uncertainty for the WQS 
coefficient. A permutation test method gives us a p-value for the 
uncertainty while also allowing us to use the full dataset for training and 
validation. This p-value is based on comparing a test value (e.g., coefficient or 
naive p-values) to iterated values, and so the minimum non-zero p-value that can 
be detected by the permutation test would be 1 divided by the number of permutation 
test iterations. For example, if we run 200 iterations, we’d be able to define a 
p-value of as low as 1/200 = 0.005, and any lower p-value would appear as zero 
and be interpreted as <0.005.

### Continuous outcome algorithm (linear regression)

1. Run WQS regression without splitting the data, obtaining a WQS coefficient 
estimate. 

2. Regress the outcome on all covariates but not the WQS variable. Then obtain the 
predicted outcome values and their residuals from this regression. 

3. Randomly permute the residual values and add them to the predicted outcome 
values to get a new outcome variable $y*$. 

4. Run a WQS regression without splitting the data in which $y*$ replaces the vector of 
observed outcome variables, obtaining an estimate for the WQS coefficient $\beta_1^*$. 

5. Repeat steps 3 and 4. 

6. Calculate the p-value by taking the proportion of $\beta_1^*$ values greater than 
the WQS coefficient estimate obtained in Step 1. 

### Binary or Count outcome algorithms (Generalized linear models (GLMs))

1. 	Regress each of the $m$ mixture components on all covariates $Z$ and obtain a $n$
observations x $m$ matrix with columns being the residuals from each of the $m$ 
models ($R_{m|Z}$).

2. 	Obtain the initial fit ($fit1$) by running a “non-split” WQS logistic regression 
(or other WQS GLM) in which the binary (or count) outcome variable $Y$ is regressed 
on the WQS vector and the covariates, and the mixture matrix used to calculate the 
WQS vector is the matrix of residuals from Step 1, $R_{m|Z}$. 

3. 	Obtain the reduced fit ($fit2$) by running a logistic regression (or other GLM) 
regressing $Y$ on $Z$. 

4. Calculate the test p-value ($p_t$) as $1-pchisq(d(fit1)-d(fit2),1)$ where d 
is the deviance for a given model and $pchisq(x,1)$ is the probability density 
function of the chi-square distribution in which the input $x$ is the difference 
between the deviances of $fit1$ and $fit2$ and there is 1 degree of freedom. 

5. Permute the rows of the $R_{m|Z}$ residual matrix from Step 1 and repeat 
Step 2 to get a series of null fit1 models ($fit1^*$) for K iterations. 
Obtain a distribution of permuted p-values ($p^*$) using the following formula: 
$p^*=1-pchisq(fit1^*)-d(fit2),1$).

6. 	Obtain the number of permuted $p^*$ less than or equal to the test $p_t$ from 
Step 4 and divide that by the number of iterations K to calculate the permutation
test p-value.

Note that the above algorithm has been validated in WQS logistic regressions but
not yet for other forms of WQS GLMs (e.g., WQS Poisson regression). However, since 
deviances can also be derived from those models, the algorithm should work for 
those other WQS GLMs as well.

# How to use the `wqspt` package 

The `wqspt` package builds from the `gWQS` package.

The two main functions of the `wqspt` package are `wqs_pt` and `wqs_full_perm`. 

## `wqs_pt`

### Arguments 

`wqs_pt` uses a `gwqs` object (from the `gWQS` [package](https://CRAN.R-project.org/package=gWQS)) as an input. To use 
`wqs_pt`, we first need to run an initial *permutation test reference WQS regression*
run while setting `validation = 0`. Note that permutation test can 
currently take in `gwqs` inputs with the following families: 
`family = gaussian(link = "identity")`, `family = binomial()` with any accepted link
function (e.g., "logit" or "probit"), `family = poisson(link = "log")`, 
`family = quasipoisson(link = "log")`, and `family = "negbin"` for negative binomial. 
It is not currently able to accommodate multinomial WQS regression, stratified 
weights, or WQS interaction terms. 

We will use this `gwqs` object as the `model` argument for the `wqs_pt` 
function and set the following additional parameters: 

* `boots`: Number of bootstraps for the WQS regression run in each permutation test iteration. 
Note that we may elect a bootstrap count `boots` lower than that specified in the 
`model` object for the sake of efficiency. If we do, `wqs_pt` will run the
iterated WQS regressions for the permutation test with the number of bootstraps defined in 
`boots`. If `boots` is not specified, then the function will use the same bootstrap count 
in the permutation test iterated WQS regressions as that specified in the main WQS regression.
* `niter`: Number of permutation test iterations. 
* `b1_pos`: A logical value that indicates whether beta values should be positive 
or negative.
* `rs`: A logical value indicating whether the random subset implementation for WQS 
should be performed ([Curtin 2019](https://doi.org/10.1080/03610918.2019.1577971))
* `plan_strategy`: Evaluation strategy for the `plan` function (`"sequential"`, 
`"multisession"`, `"multicore"`, or `"cluster"`). See the documentation for the 
`future::plan` function for full details. This defaults to `"multicore"`.
* `seed`: An optional random seed for the permutation test WQS regression 
reference run. This defaults to `NULL`, meaning that no random seed is set.
* `nworkers`: An optional argument to specify the number of parallel processes 
to use when `plan_strategy` is not `"sequential"`. By default this uses 
`future::availableWorkers()` to determine the number of parallel processes. If 
one is submitting `wqs_pt` to a high-performance computing cluster, one should 
make sure that this number is less than or equal to the number of allotted 
computing cores.
* `...`: Additional arguments to be passed to the `gWQS::gwqs` function.

The arguments `b1_pos` and `rs` should be consistent with the inputs chosen in 
the `model` object. The `seed` should ideally be consistent with the seed set 
in the `model` object, though this is not required. 

### Outputs 

The permutation test returns an object of class `wqs_pt`, which contains three 
sublists: 

* **perm_test**
  * **pval**: permutation test p-value
  * *Linear WQS regression only*
  * **testbeta**: reference WQS coefficient $\beta_1$ value 
  * **betas**: a vector of $\beta_1$ values from each iteration of the permutation 
test
  * *WQS GLM only*
  * **testpval**: test reference p-value
  * **permpvals**: p-values from each iteration of the permutation test 
* **gwqs_main**: main gWQS object (same as `model` input)
* **gwqs_perm**: permutation test reference gWQS object (NULL if model 
`family != "gaussian"` or if same number of bootstraps are used in permutation 
test WQS regression runs as in the main run.)

### Plotting method

The `wqs_pt` class has a `wqspt_plot` method to help visualize and summarize WQS 
permutation test results. Plots include (1) a forest plot of the beta WQS coefficient 
with the naive confidence intervals as well as the permutation test p-value and
(2) a heatmap of the WQS weights for each mixture component. 

## `wqs_full_perm` 

The second function `wqs_full_perm` is a full wrapper which implements the initial 
WQS regression run using gWQS::gwqs and the permutation test in one function call. 

To use `wqs_full_perm`, you must specify the same required arguments as needed in 
the `gwqs` call. This function can run WQS regressions and the permutation test 
for the following families: `family = gaussian(link = "identity")`, 
`family = binomial()` with any accepted link function (e.g., "logit" or "probit"), 
`family = poisson(link = "log")`, `family = quasipoisson(link = "log")`, 
and `family = "negbin"` for negative binomial. `wqs_full_perm `is not currently 
able to accommodate multinomial WQS regression, stratified weights, or WQS 
interaction terms.

For the bootstrap count `b` argument, you must specify `b_main`,the number of 
bootstraps for the *main WQS regression* run, as well as `b_perm`, the number of 
bootstraps for the *permutation test reference WQS regression* run (linear WQS
regression only) and each WQS regression iteration of the permutation test. As 
with before, you can choose to set `b_main` $>$ `b_perm` for the sake of efficiency. 
Finally, you should indicate the number of desired permutation test runs `niter`.

Since the WQS permutation test can be computationally intensive, you can specify 
`stop_if_nonsig = TRUE` if you do not wish for the permutation test to proceed 
if the naive main WQS regression run produces an nonsignificant result (if the p-value 
is below the `stop_thresh` argument, for which the default is 0.05). 
See *Recommendations for Use* section below.

The `wqs_full_perm` returns an object of class `wqs_pt`, with outputs described 
above. 

## Recommendations for Use 

Larger bootstrap counts and numbers of iterations lead to better estimates, though
it is unclear how many iterations or bootstraps are needed for a stable estimate. We 
generally recommend using 1000 bootstraps on the main WQS regression and then 
performing 200 iterations of 200-boostrap WQS regressions for the permutation test.
However, this takes a substantial amount of computational time, and one could also
get relatively stable p-values for instance for 100 iterations of 100-boostrap WQS
regressions for the permutation test.

We recommend that users only apply the permutation test in cases where the naive 
WQS test approaches significance or near-significance. If the naive test produces 
a non-significant result, then there likely is no reason to run the permutation 
test, as it will produce a result which is more conservative than the naive 
method (i.e., it will have a larger p-value). This is the strategy that we have 
applied in our published papers 
([Loftus et al. 2021](https://doi.org/10.1016/j.envint.2021.106409) and 
[Day et al. 2021](https://doi.org/10.1016/j.envint.2020.106330)). 

# Examples 

## Example 1 (using `wqs_pt`)

This is an example where the WQS permutation test confirms a significant naive 
result.

We first load both the wqspt and gWQS packages.

```{r, warning=F, message=F}
library(gWQS)
library(wqspt)
```

Then we produce a simulated dataset with the following parameters: 

* WQS coefficient $\beta_1$: 0.2 
* Mixture weights: 0.15 for first 5 components, 0.05 for remaining 5 components 

```{r, warning = F, message = F, eval=F}
# simulated dataset
sim_res1 <- wqs_sim(nmix = 10,
                    ncovrt = 10,
                    nobs = 1000,
                    ntruewts = 10, 
                    ntruecovrt = 5, 
                    truewqsbeta = 0.2, 
                    truebeta0 = 2, 
                    truewts = c(0.15, 0.15, 0.15, 0.15, 0.15,
                                0.05, 0.05, 0.05, 0.05, 0.05), 
                    q = 10, 
                    seed = 16)

sim_data1 <- sim_res1$Data

wqs_form <- formula(paste0("y ~ wqs + ", paste(paste0("C", 1:10), 
                                               collapse = "+")))
```

Now we run WQS regression on the simulated data.  

```{r, warning = F, eval=F}
# mixture names
mix_names1 <- colnames(sim_data1)[2:11]

# create reference wqs object
wqs_main1 <- gwqs(wqs_form, mix_name = mix_names1, data = sim_data1, 
                  q = 10, validation = 0, b = 20, b1_pos = TRUE, 
                  plan_strategy = "multicore", family = "gaussian", 
                  seed = 16)
```

Finally, we can perform a permutation test on the WQS object. 

```{r, eval=F}
# run permutation test
perm_test_res1 <- wqs_pt(wqs_main1, niter = 50, boots = 5, b1_pos = TRUE, 
                         seed = 16)
```

Note that the naive WQS regression produces a significant result for the 
WQS coefficient (p-value < 0.001).

```{r, echo=F}
load("data/introduction-vignette.RData")
```

```{r, eval = F}
main_sum1 <- summary(perm_test_res1$gwqs_main)
```
```{r}
main_sum1$coefficients
```

The permutation test confirms the significance of this result. 

```{r}
perm_test_res1$perm_test$pval
```

Here are the summary plots: 

```{r, fig.height = 6}
wqspt_plot(perm_test_res1)$FullPlot
```


## Example 2 (using `wqs_pt`)

This is an example where the WQS permutation test goes against a (false positive) 
significant naive result. 

We produce a simulated dataset with the following parameters: 

* WQS coefficient $\beta_1$: 0
* Mixture weights: 0.15 for first 5 components, 0.05 for remaining 5 components 

```{r, eval = F}
sim_res2 <- wqs_sim(nmix = 10,
                    ncovrt = 10,
                    nobs = 1000,
                    ntruewts = 10, 
                    ntruecovrt = 5, 
                    truewqsbeta = 0, 
                    truebeta0 = 0.1, 
                    truewts = c(0.15, 0.15, 0.15, 0.15, 0.15,
                                0.05, 0.05, 0.05, 0.05, 0.05), 
                    q = 10, 
                    seed = 16)

sim_data2 <- sim_res2$Data
```

Now we run WQS regression as well as the permutation test on the simulated data.  

```{r, eval=F}
# mixture names
mix_names2 <- colnames(sim_data2)[2:11]

# create reference wqs object
wqs_main2 <- gwqs(wqs_form, mix_name = mix_names2, data = sim_data2, q = 10, 
                  validation = 0, b = 20, b1_pos = TRUE, 
                  plan_strategy = "multicore", family = "gaussian", 
                  seed = 16)

# run permutation test
perm_test_res2 <- wqs_pt(wqs_main2, niter = 50, boots = 5, b1_pos = TRUE, 
                         seed = 16)
```

Note that the naive WQS regression produces a significant result for the 
WQS coefficient (p-value = `r round(main_sum2$coefficients[2,4],3)`).

```{r, eval = F}
main_sum2 <- summary(perm_test_res2$gwqs_main)
```
```{r}
main_sum2$coefficients
```

The permutation test, however, repudiates the significance of these plots (p = `r round(perm_test_res2$perm_test$pval, 2)`).  

```{r}
perm_test_res2$perm_test$pval
```

Here are the summary plots: 

```{r, fig.height = 6}
wqspt_plot(perm_test_res2)$FullPlot
```

## Example 3 (using `wqs_full_perm`)

Using the same data as in Example 1, we run the WQS regression with permutation
test using the full wrapper `wqs_full_perm` call.

```{r, eval = F}
perm_test_res3 <- wqs_full_perm(wqs_form,
                               data = sim_data1,
                               mix_name = mix_names1,
                               q = 10,
                               b_main = 20,
                               b_perm = 5,
                               b1_pos = TRUE,
                               niter = 50,
                               seed = 16,
                               plan_strategy = "multicore")
```
```{r, fig.height = 6}
wqspt_plot(perm_test_res3)$FullPlot
```

## Example 4 (using `wqs_full_perm` with a binary outcome variable)

This is an example in which we apply the logistic regression version of the WQS 
permutation test. 

We produce a simulated dataset with the following parameters: 

* WQS coefficient $\beta_1$: 0.4
* Mixture weights: 0.15 for first 5 components, 0.05 for remaining 5 components 

```{r, eval=F}
sim_res3 <- wqs_sim(nmix = 10,
                    ncovrt = 10,
                    nobs = 1000,
                    ntruewts = 10, 
                    ntruecovrt = 5, 
                    truewqsbeta = 0.4, 
                    truebeta0 = -2.5, 
                    truewts = c(0.15, 0.15, 0.15, 0.15, 0.15,
                                0.05, 0.05, 0.05, 0.05, 0.05), 
                    q = 10, 
                    family = "binomial",
                    seed = 16)

sim_data3 <- sim_res3$Data

perm_test_res4 <- wqs_full_perm(wqs_form,
                               data = sim_data3,
                               mix_name = mix_names1,
                               q = 10,
                               b_main = 20,
                               b_perm = 5,
                               b1_pos = TRUE,
                               niter = 50,
                               seed = 16,
                               plan_strategy = "multicore",
                               family = "binomial")
```
```{r, fig.height=6}
wqspt_plot(perm_test_res4)$FullPlot
```

# References
* Barrett, E. S., Corsetti, M., Day, D., Thurston, S. W., Loftus, C. T., 
Karr, C. J., ... & Sathyanarayana, S. (2022). Prenatal phthalate exposure in 
relation to placental corticotropin releasing hormone (pCRH) in the CANDLE 
cohort. *Environment International*, 160, 107078.

* Barrett, E. S., Day, D. B., Szpiro, A., Peng, J., Loftus, C. T., Ziausyte, U., 
... & Bush, N. R. (2024). Prenatal exposures to phthalates and life events 
stressors in relation to child behavior at age 4–6: a combined cohort analysis. 
*Environment International*, 183, 108425.

* Carrico, C., Gennings, C., Wheeler, D. C., & Factor-Litvak, P. (2015). 
Characterization of weighted quantile sum regression for highly correlated data 
in a risk analysis setting. Journal of Agricultural, Biological, and 
*Environmental Statistics*, 20(1), 100-120.

* Curtin, P., Kellogg, J., Cech, N., & Gennings, C. (2019). A random subset 
implementation of weighted quantile sum (WQSRS) regression for analysis of 
high-dimensional mixtures. 
*Communications in Statistics-Simulation and Computation*, 50(4), 1119-1134.

* Day, D. B., Collett, B. R., Barrett, E. S., Bush, N. R., Swan, S. H., Nguyen, 
R. H., ... & Sathyanarayana, S. (2021). Phthalate mixtures in pregnancy, 
autistic traits, and adverse childhood behavioral outcomes. 
*Environment International*, 147, 106330.

* Day, D. B., Sathyanarayana, S., LeWinn, K. Z., Karr, C. J., Mason, W. A., & 
Szpiro, A. A. (2022). A permutation test-based approach to strengthening 
inference on the effects of environmental mixtures: comparison between single 
index analytic methods. *Environmental Health Perspectives*, 130(8). 

* Day, D. B., LeWinn, K. Z., Karr, C. J., Loftus, C. T., Carroll, K. N., 
Bush, N. R., ... & program collaborators for Environmental influences on Child 
Health Outcomes. (2024). Subpopulations of children with multiple chronic health 
outcomes in relation to chemical exposures in the ECHO-PATHWAYS consortium. 
*Environment International*, 185, 108486.

* Freije, S. L., Enquobahrie, D. A., Day, D. B., Loftus, C., Szpiro, A. A., 
Karr, C. J., ... & Sathyanarayana, S. (2022). Prenatal exposure to polycyclic 
aromatic hydrocarbons and gestational age at birth. *Environment International*, 
164, 107246.

* Hazlehurst, M. F., Hajat, A., Szpiro, A. A., Tandon, P. S., Kaufman, J. D., 
Loftus, C. T., ... & Karr, C. J. (2024). Individual and Neighborhood Level 
Predictors of Children’s Exposure to Residential Greenspace. 
*Journal of Urban Health*, 101(2), 349-363.

* Loftus, C. T., Bush, N. R., Day, D. B., Ni, Y., Tylavsky, F. A., Karr, C. J., 
... & LeWinn, K. Z. (2021). Exposure to prenatal phthalate mixtures and 
neurodevelopment in the Conditions Affecting Neurocognitive Development and 
Learning in Early childhood (CANDLE) study. 
*Environment International*, 150, 106409.

* Loftus, C., Szpiro, A. A., Workman, T., Wallace, E. R., Hazlehurst, M. F., 
Day, D. B., ... & Karr, C. J. (2022). Maternal exposure to urinary polycyclic 
aromatic hydrocarbons (PAH) in pregnancy and childhood asthma in a pooled 
multi-cohort study. *Environment International*, 170, p.107494.

* Renzetti, S., Curtin, P., Just, A. C., Bello, G., & Gennings, C. (2021). 
‘gWQS: Generalized Weighted Quantile Sum Regression’. 
https://CRAN.R-project.org/package=gWQS.

* Sherris, A. R., Loftus, C. T., Szpiro, A. A., Dearborn, L. C., 
Hazlehurst, M. F., Carroll, K. N., ... & Karr, C. J. (2024). Prenatal polycyclic 
aromatic hydrocarbon exposure and asthma at age 8–9 years in a multi-site 
longitudinal study. *Environmental Health*, 23(1), 26.

* Sun, B., Wallace, E. R., Ni, Y., Loftus, C. T., Szpiro, A., Day, D., ... & 
LeWinn, K. Z. (2023). Prenatal exposure to polycyclic aromatic hydrocarbons and 
cognition in early childhood. *Environment International*, 178, 108009.

* Wallace, E. R., Ni, Y., Loftus, C. T., Sullivan, A., Masterson, E., 
Szpiro, A., ... & Sathyanarayana, S. (2022). Prenatal urinary metabolites of 
polycyclic aromatic hydrocarbons and toddler cognition, language, and behavior. 
*Environment International*, 159, 107039.

* Wallace, E. R., Buth, E., Szpiro, A. A., Ni, Y., Loftus, C. T., Masterson, E., 
... & Karr, C. J. (2023). Prenatal exposure to polycyclic aromatic hydrocarbons 
is not associated with behavior problems in preschool and early school-aged 
children: a prospective multi-cohort study. 
*Environmental Research*, 216, 114759.

* Yin, A., Mao, L., Zhang, C., Du, B., Xiong, X., Chen, A., ... & Jiang, H. 
(2024). Phthalate exposure and subfecundity in preconception couples: A nested 
case-control study. *Ecotoxicology and Environmental Safety*, 278, 116428.