surveyCV
This package implements cross-validation (CV) for complex survey data by accounting for strata, clusters, FPCs, and survey weights both when creating CV folds and when calculating test-set loss estimates (currently either mean squared error (MSE) for linear models or binary cross-entropy for logistic models). It is meant to work smoothly with the survey package.
In addition, the function folds.svy() (which makes the design-based CV folds) can be used to code up custom CV loops to evaluate other models besides linear and logistic regression.
To understand why we believe it’s important to account for survey design when carrying out CV, please see our paper: Wieczorek, Guerin, and McMahon (2022), “K-Fold Cross-Validation for Complex Sample Surveys,” Stat <doi:10.1002/sta4.454>.
Briefly, the problem is that standard CV assumes exchangeable data, which is not true of most complex sampling designs. We argue (1) that CV folds should mimic the sampling design (so that we’re honest about how much the model-fits will vary across “samples like this one”), and (2) that test set MSEs should be averaged and weighted in ways that generalize to the full population from which the sample was drawn, assuming that is the population for which the models will be applied.
This vignette briefly illustrates how to use surveyCV with several common types of sampling design. We provide three equivalent ways to carry out CV: direct control with cv.svy(), use of an existing survey design object with cv.svydesign(), or use of an existing survey GLM object with cv.svyglm(). We also illustrate how to write your own design-based CV loop, with a design-consistent random forest example from the rpms package.
# We set a random seed in this vignette only to ensure
# that our discussion will match the CV output;
# otherwise, each time we reran the vignette, we'd get different CV folds
set.seed(2022)
library(surveyCV)
data("NSFG_data")
library(survey)
data("api")
cv.svy, cv.svydesign, and cv.svyglm
cv.svy
First, we borrow some examples from the documentation for survey::svyglm(), based on the api data (the Academic Performance Index for California schools in the year 2000).
Say that we want to carry out complex survey CV for several linear models, and we are working with the stratified sample in apistrat. Using cv.svy(), we specify the dataset; the formulas for the models being compared; the number of folds; and any necessary information about the sampling design. For this particular stratified design, we must specify the strata variable, the weights variable, and the FPC variable.
# stratified sample
cv.svy(apistrat, c("api00~ell",
"api00~ell+meals",
"api00~ell+meals+mobility"),
nfolds = 5, strataID = "stype", weightsID = "pw", fpcID = "fpc")
#> mean SE
#> .Model_1 9101.0 872.52
#> .Model_2 5333.9 504.76
#> .Model_3 5390.0 511.97
The 2nd model (api00~ell+meals) appears to have the lowest average test MSE (in the mean column), although its performance does not differ much from the 3rd model's, and the difference between them is much smaller than either of their standard errors (in the SE column).
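If you want to work with these results programmatically rather than just reading the printout, the returned object is meant to behave like a survey::svystat object; assuming that interface, coef() and SE() should extract the two columns. A minimal sketch:

# Sketch, assuming the result behaves like a survey::svystat object
cv.results <- cv.svy(apistrat, c("api00~ell",
                                 "api00~ell+meals",
                                 "api00~ell+meals+mobility"),
                     nfolds = 5, strataID = "stype", weightsID = "pw", fpcID = "fpc")
coef(cv.results)            # the mean column: average test loss per model
SE(cv.results)              # the SE column
which.min(coef(cv.results)) # model with the lowest estimated loss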
If we were to use the single-stage cluster sample in apiclus1 instead, the command would be similar, but we would need to specify clusters instead of strata:
# one-stage cluster sample
cv.svy(apiclus1, c("api00~ell",
"api00~ell+meals",
"api00~ell+meals+mobility"),
nfolds = 5, clusterID = "dnum", weightsID = "pw", fpcID = "fpc")
#> mean SE
#> .Model_1 8541.8 2127.89
#> .Model_2 3363.1 707.12
#> .Model_3 3473.8 726.64
On the other hand, if we are working with a simple random sample as in apisrs, we could just use standard CV instead of complex survey CV to generate the folds. However, complex survey CV may still be useful with an SRS if we want to account for the finite population correction (FPC) when estimating the SEs of our CV MSEs.
# simple random sample
cv.svy(apisrs, c("api00~ell",
"api00~ell+meals",
"api00~ell+meals+mobility"),
nfolds = 5, fpcID = "fpc")
#> mean SE
#> .Model_1 9880.3 994.73
#> .Model_2 6720.3 1151.04
#> .Model_3 6869.2 1178.15
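For comparison, here is a sketch of the same SRS call with the fpcID argument omitted: the mean MSEs are computed in the same way, but the SEs no longer receive the finite-population correction. (Since apisrs samples only 200 of the population's 6194 schools, the FPC's effect is small here.)

# Sketch: SRS cross-validation ignoring the FPC;
# point estimates are computed the same way, but the SEs
# lose the finite-population shrinkage factor
cv.svy(apisrs, c("api00~ell",
                 "api00~ell+meals",
                 "api00~ell+meals+mobility"),
       nfolds = 5)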
Note that our intent in this vignette is not to compare the stratified sample results vs. the cluster sample results vs. the SRS results. We show 3 approaches only to illustrate how the code depends on the sampling design. Each of these samples was taken in a different way, and so it should be analyzed in a different way. In most cases, the data analyst will have only one dataset, and they ought to choose the right form of CV based on that one dataset’s sampling design.
(But if we do have different datasets taken under different sampling designs from the same population, it’s plausible that the “best model” might differ across sampling designs. For instance, a stratified sample will likely have more precision than a cluster sample with the same number of ultimate observation units—so it’s possible that the stratified-sample’s CV would choose a larger model than the cluster-sample’s CV. That is, the stratified sample might have enough precision to fit a larger model without overfitting, while CV might show us that the cluster sample is overfitting when it tries to fit a larger model.)
Finally, we show an example from a different dataset, the 2015-2017 National Survey of Family Growth (NSFG), which uses clusters nested within strata as well as unequal sample weights. In this case we are using 4 folds, not 5 as above, because each stratum only has 4 clusters.
(Note: We use nest = TRUE only if cluster IDs are nested within strata, i.e., if clusters in different strata might reuse the same names.)
# complex sample from NSFG
library(splines)
cv.svy(NSFG_data, c("income ~ ns(age, df = 1)",
"income ~ ns(age, df = 2)",
"income ~ ns(age, df = 3)",
"income ~ ns(age, df = 4)",
"income ~ ns(age, df = 5)",
"income ~ ns(age, df = 6)"),
nfolds = 4,
strataID = "strata", clusterID = "SECU",
nest = TRUE, weightsID = "wgt")
#> mean SE
#> .Model_1 22674 730.26
#> .Model_2 22450 733.46
#> .Model_3 22416 734.37
#> .Model_4 22468 745.27
#> .Model_5 22497 749.42
#> .Model_6 22517 739.71
Here we tend to see that the spline with 3 degrees of freedom fits a little better than the others, but the differences between models are mostly smaller than the SEs.
cv.svydesign
When using the survey package, we will often create a svydesign object in order to calculate means and totals, fit GLMs, etc. In that case, instead of using cv.svy(), it is more convenient to use cv.svydesign(), which will read the relevant information out of the svydesign object and internally pass it along to cv.svy() for us.
We repeat the api stratified, cluster, and SRS examples above using this cv.svydesign() approach.
# stratified sample
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell",
                          "api00~ell+meals",
                          "api00~ell+meals+mobility"),
             design_object = dstrat, nfolds = 5)
#> mean SE
#> .Model_1 9138.9 879.50
#> .Model_2 5332.5 498.04
#> .Model_3 5416.8 502.89
# one-stage cluster sample
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell",
                          "api00~ell+meals",
                          "api00~ell+meals+mobility"),
             design_object = dclus1, nfolds = 5)
#> mean SE
#> .Model_1 8992.7 2131.82
#> .Model_2 3351.6 648.11
#> .Model_3 3394.9 654.34
# simple random sample
dsrs <- svydesign(id = ~1, data = apisrs, fpc = ~fpc)
cv.svydesign(formulae = c("api00~ell",
                          "api00~ell+meals",
                          "api00~ell+meals+mobility"),
             design_object = dsrs, nfolds = 5)
#> mean SE
#> .Model_1 9850.6 1009.0
#> .Model_2 7224.2 1302.3
#> .Model_3 7587.6 1360.2
Next, we repeat the NSFG example using this approach.
NSFG.svydes <- svydesign(id = ~SECU, strata = ~strata, nest = TRUE,
                         weights = ~wgt, data = NSFG_data)
cv.svydesign(formulae = c("income ~ ns(age, df = 1)",
                          "income ~ ns(age, df = 2)",
                          "income ~ ns(age, df = 3)",
                          "income ~ ns(age, df = 4)",
                          "income ~ ns(age, df = 5)",
                          "income ~ ns(age, df = 6)"),
             design_object = NSFG.svydes, nfolds = 4)
#> mean SE
#> .Model_1 22668 733.45
#> .Model_2 22464 740.10
#> .Model_3 22364 748.72
#> .Model_4 22374 731.03
#> .Model_5 22452 753.67
#> .Model_6 22523 735.79
cv.svyglm
Finally, if we have already fit a svyglm object, we can use cv.svyglm(), which will read the formula as well as the survey design information and internally pass it along to cv.svy() for us. However, this approach only works for one GLM at a time, rather than comparing several models in a single function call.
We repeat the api stratified, cluster, and SRS examples above using this cv.svyglm() approach, with the help of the svydesign objects already created above.
# stratified sample
glmstrat <- svyglm(api00 ~ ell+meals+mobility, design = dstrat)
cv.svyglm(glmstrat, nfolds = 5)
#> mean SE
#> .Model_1 5632.6 548.47
# one-stage cluster sample
glmclus1 <- svyglm(api00 ~ ell+meals+mobility, design = dclus1)
cv.svyglm(glmclus1, nfolds = 5)
#> mean SE
#> .Model_1 3446.1 701.11
# simple random sample
glmsrs <- svyglm(api00 ~ ell+meals+mobility, design = dsrs)
cv.svyglm(glmsrs, nfolds = 5)
#> mean SE
#> .Model_1 6738.8 1106.1
Next, we repeat the NSFG example using this approach.
NSFG.svyglm <- svyglm(income ~ ns(age, df = 3), design = NSFG.svydes)
cv.svyglm(glm_object = NSFG.svyglm, nfolds = 4)
#> mean SE
#> .Model_1 22387 732.46
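Since cv.svyglm() handles only one fitted model at a time, comparing several existing svyglm fits takes a short loop. Here is a minimal sketch, where fit2 and fit3 are new, purely illustrative fits on the same NSFG design (not objects created earlier in this vignette):

# Sketch: apply cv.svyglm() to each of several svyglm fits in turn;
# fit2 and fit3 are illustrative fits on the same design
fit2 <- svyglm(income ~ ns(age, df = 2), design = NSFG.svydes)
fit3 <- svyglm(income ~ ns(age, df = 3), design = NSFG.svydes)
lapply(list(fit2, fit3), cv.svyglm, nfolds = 4)

Of course, when the candidate models differ only in their formulas, it is simpler to pass all the formulas to cv.svydesign() at once, as above, so that every model is evaluated on the same set of folds.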
If we have a binary response variable and wish to fit a logistic regression instead, we can use family = quasibinomial() in the call to svyglm() and feed that directly into cv.svyglm():
NSFG.svyglm.logistic <- svyglm(LBW ~ ns(age, df = 3), design = NSFG.svydes,
                               family = quasibinomial())
cv.svyglm(glm_object = NSFG.svyglm.logistic, nfolds = 4)
#> mean SE
#> .Model_1 0.34523 0.0177
In this case, the mean column shows the average of the binary cross-entropy loss \(-[y_i \log(\hat y_i) + (1-y_i)\log(1-\hat y_i)]\) rather than MSE. As with MSE, lower values of this loss indicate better-fitting models.
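To make the loss concrete, here is a minimal unweighted sketch of that formula in plain R. (surveyCV itself averages the per-observation losses using the survey weights, so this simplified version is for illustration only.)

# Illustration only: unweighted binary cross-entropy for
# predicted probabilities phat and 0/1 outcomes y
bce <- function(y, phat) {
  -mean(y * log(phat) + (1 - y) * log(1 - phat))
}
bce(y = c(0, 0, 1), phat = c(0.2, 0.3, 0.6))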
Alternatively, we can set method = "logistic" in cv.svydesign() or in cv.svy():
cv.svydesign(formulae = c("LBW ~ ns(age, df = 1)",
"LBW ~ ns(age, df = 2)",
"LBW ~ ns(age, df = 3)",
"LBW ~ ns(age, df = 4)",
"LBW ~ ns(age, df = 5)",
"LBW ~ ns(age, df = 6)"),
design_object = NSFG.svydes, nfolds = 4,
method = "logistic")
#> mean SE
#> .Model_1 0.33997 0.0173
#> .Model_2 0.33973 0.0171
#> .Model_3 0.34199 0.0173
#> .Model_4 0.34470 0.0178
#> .Model_5 0.34635 0.0179
#> .Model_6 0.34689 0.0180
cv.svy(NSFG_data, c("LBW ~ ns(age, df = 1)",
"LBW ~ ns(age, df = 2)",
"LBW ~ ns(age, df = 3)",
"LBW ~ ns(age, df = 4)",
"LBW ~ ns(age, df = 5)",
"LBW ~ ns(age, df = 6)"),
nfolds = 4,
strataID = "strata", clusterID = "SECU",
nest = TRUE, weightsID = "wgt",
method = "logistic")
#> mean SE
#> .Model_1 0.34134 0.0175
#> .Model_2 0.34081 0.0174
#> .Model_3 0.34073 0.0174
#> .Model_4 0.34242 0.0174
#> .Model_5 0.34441 0.0176
#> .Model_6 0.34539 0.0176
These two examples are carrying out the same CV over the same data and same models; they have slightly different results only because of randomness in forming the CV folds for each example.
In these examples, model 2 or model 3 has the lowest loss, though any differences between models are well within one SE.
folds.svy and folds.svydesign
The function folds.svy() generates design-based fold IDs for K-fold CV, using any specified strata and clusters.
(Briefly: For a stratified sample, each fold will contain data from each stratum. For a cluster sample, a given cluster’s rows will all be assigned to the same fold. See our Stat paper for details.)
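As a quick sanity check, you can generate fold IDs directly and verify both properties. A sketch using the NSFG data from above, assuming folds.svy() takes strataID and clusterID arguments analogous to cv.svy():

# Sketch: verify the fold properties on the NSFG data
ff <- folds.svy(NSFG_data, nfolds = 4, strataID = "strata", clusterID = "SECU")
# Every stratum should appear in every fold...
all(table(NSFG_data$strata, ff) > 0)
# ...and each cluster-within-stratum should land in exactly one fold
clus <- interaction(NSFG_data$strata, NSFG_data$SECU, drop = TRUE)
all(rowSums(table(clus, ff) > 0) == 1)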
Using these fold IDs, you can write your own CV loop for models that our package does not currently handle. These might be other models from the survey package (Poisson regression, Cox proportional hazards model, etc.) or from other design-based modeling packages such as rpms, mase, or others in the CRAN Task View on Official Statistics & Survey Statistics.
Here is an example of tuning the bin size parameter for a design-based random forest, using the rpms_forest() function from the rpms package. (We have also set the forest size to 50 trees instead of the default 500, purely to make the example run faster.)
Note that while folds.svy() accounts for the clustering in this survey design, we need to do a bit of extra work to ensure that the model training and testing steps also account for the survey design. In this particular example, we must pass the cluster IDs and survey weights to rpms_forest() for design-consistent model-fitting, and we must use the survey weights in the MSE calculations.
# Based on example("rpms"):
# model the mean of retirement account value `IRAX` among households with
# reported retirement account values > 0,
# predicted from householder education, age, and urban/rural location
library(rpms)
data(CE)
# Generate fold IDs that account for clustering in the survey design
# for the IRAX>0 subset of the CE dataset
nfolds <- 5
CEsubset <- CE[which(CE$IRAX > 0), ]
CEsubset$.foldID <- folds.svy(CEsubset, nfolds = nfolds, clusterID = "CID")
# Use CV to tune the bin_size parameter of rpms_forest()
bin_sizes <- c(10, 20, 50, 100, 250, 500)
# Create placeholder for weighted Sums of Squared Errors
SSEs <- rep(0, length(bin_sizes))
for(ff in 1:nfolds) { # For every fold...
  # Use .foldID to split data into training and testing sets
  train <- subset(CEsubset, .foldID != ff)
  test  <- subset(CEsubset, .foldID == ff)

  for(bb in 1:length(bin_sizes)) { # For every value of the tuning parameter...
    # Fit a new model
    rf <- rpms_forest(IRAX ~ EDUCA + AGE + BLS_URBN,
                      data = train,
                      weights = ~FINLWT21, clusters = ~CID,
                      bin_size = bin_sizes[bb], f_size = 50)
    # Get predictions and squared errors
    yhat <- predict(rf, newdata = test)
    res2 <- (yhat - test$IRAX)^2
    # Sum up weighted SSEs, not MSEs yet,
    # b/c cluster sizes may differ across folds and b/c of survey weights
    SSEs[bb] <- SSEs[bb] + sum(res2 * test$FINLWT21)
  }
}

# Divide entire weighted sum by the sum of weights
MSEs <- SSEs / sum(CEsubset$FINLWT21)
# Show results
cbind(bin_sizes, MSEs)
#> bin_sizes MSEs
#> [1,] 10 204246617270
#> [2,] 20 202870633392
#> [3,] 50 201393921358
#> [4,] 100 201085838446
#> [5,] 250 201825549231
#> [6,] 500 204155844501
Bin size 100 had the lowest survey-weighted CV MSE estimate, although sizes 50 and 250 were quite similar.
Using this chosen tuning parameter value, the next step could be to fit a random forest with bin size 100 on the full CEsubset dataset.
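That final fit might look like the following sketch, reusing the design arguments from the CV loop above:

# Sketch: final design-consistent forest at the chosen bin size,
# fit to the full CEsubset dataset
rf.final <- rpms_forest(IRAX ~ EDUCA + AGE + BLS_URBN,
                        data = CEsubset,
                        weights = ~FINLWT21, clusters = ~CID,
                        bin_size = 100, f_size = 50)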
In this case, rpms_forest() does not work with the survey package, so there was no need to create a svydesign object for this analysis. But if we were working with a different model that expects us to create a svydesign object from CEsubset, we could have used folds.svydesign() instead, which would read out the cluster variable from the svydesign object automatically:
CE.svydes <- svydesign(id = ~CID, weights = ~FINLWT21, data = CEsubset)
# Use update() to add variables to a svydesign object
CE.svydes <- update(CE.svydes,
                    .foldID = folds.svydesign(CE.svydes, nfolds = nfolds))
# Now the fold IDs are available as CE.svydes$variables$.foldID
table(CE.svydes$variables$.foldID)
#> 1 2 3 4 5
#> 813 923 928 885 960
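To show how those stored fold IDs could drive a custom CV loop around a svydesign-based model, here is a minimal sketch; the choice of svyglm() is purely illustrative, standing in for whatever design-based model you actually want to evaluate:

# Sketch (illustrative model): loop over folds, fitting on the held-in
# part of the design and scoring the held-out rows with survey weights
SSE <- 0
for(ff in 1:nfolds) {
  train.des <- subset(CE.svydes, .foldID != ff)   # svydesign object for training
  test.dat  <- CEsubset[CEsubset$.foldID == ff, ] # data frame for testing
  fit <- svyglm(IRAX ~ EDUCA + AGE + BLS_URBN, design = train.des)
  yhat <- as.numeric(predict(fit, newdata = test.dat))
  SSE <- SSE + sum((yhat - test.dat$IRAX)^2 * test.dat$FINLWT21)
}
SSE / sum(CEsubset$FINLWT21)  # survey-weighted CV MSE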