Summary: In a block-randomized controlled trial, the
individuals in a study are divided based on important characteristics
(for example age group, sex, or smoking status). By stratifying the data
in this way prior to randomization, the researchers can reduce the
heterogeneity of their treatment and control groups, thereby reducing
the variance of their estimate of the effect of their treatment. The
stratamatch
package is designed to extend this approach to
the observational setting by providing functions to allow users to
easily divide an observational data set into strata of observations with
similar baseline characteristics. The treated and control individuals
within each stratum of the data can then by matched (for example, based
on propensity score), in order to recapitulate the block-randomized
trial from observational data. In order to select an effective
stratification scheme for the data, stratamatch
applies a
pilot design approach to estimate a quantity called the
prognostic score, and divides the data into blocks of individuals with
similar scores. The potential benefits of such an approach are twofold.
First, stratifying the data effectively allows for more computationally
efficient matching of large data sets. Second, early studies of the
prognostic score show that using the prognostic score to inform the
matching process may help reduce the variance in the estimated treatment
effect and increase the robustness of an observational study to bias
from unmeasured confounding factors.
This document contains the following sections
After briefly introducing the goals of stratamatch and the approaches it implements on a statistical level, this document will walk through an example of how stratamatch might be used on a toy data set. We will show how to stratify a data set, how to diagnose the quality of the stratification, and how to match a data set within strata.
In a fully-randomized controlled experiment, treatment assignments are made independently of each individual’s baseline covariates, allowing for unbiased estimation of the treatment effect. In observational studies, researchers may attempt to account for differences in their subject’s baseline covariates by matching treated and control subjects with similar characteristics. In particular, the popular approach of propenisty score matching pairs individuals who had similar estimated probabilities of recieving the treatment based on their baseline characteristics. By balancing the treated and control groups in this way, the propensity score matching approach attempts to coerce the data set into a form that resembles a randomized controlled trial. Importantly, however, propensity score matching can only address bias to due measured baseline covariates. Imbalances in unmeasured baseline covariates may still bias the effect estimate after propensity score matching. For this reason, researchers often carry out a sensitivity analysis to address concerns about unmeasured confounding.
However, the fully-randomized experiment is not the only experimental study design. In a block-randomized controlled experiment, subjects are stratified based on important covariates (e.g. sex, age group, smoking status) before randomization into treatment groups. This helps reduce the heterogeneity between treatment and control groups. In the experimental context, reducing the heterogeneity between compared groups helps to reduce the variance of the effect estimator. In observational settings however, this reduction in heterogeneity has the added benefit of reducing the sensitivity of the study results to unobserved confounding (see Rosenbaum (2005), for a great discussion on this point). Moreover, if the sample size of an observational study is very large, stratifying the sample and matching separately within the strata in this way may be much faster than matching the entire sample at once. Stratamatch helps to facilitate this process by supplying tools stratify an observational dataset and match within each stratum, thereby emulating an observational version of the block-randomized controlled trial.
Once we have decided to stratify our data set, the question becomes: how should strata be determined? One option is to select the covariates on which to stratify by hand based on expert knowledge. Another option is to use knowledge from previous experiments to select the best substrata. In the experimental setting, this may be done with a pilot study. Using this approach, researchers set aside some of their resources before running the main experiment for the purpose of running a smaller, ‘pilot’ experiment. By examining the outcomes of the pilot study, they can gather information which they can use to inform the design of the main experiment. Importantly, after the pilot study is run, the individuals in the pilot experiment are not reused in the main experiment, so the observations of the outcomes from this earlier analysis are not allowed to bias the results of the main study.
Aikens, Greaves, and Baiocchi (2020) extend the idea of the pilot study to the observational setting. Using an observational pilot design, the researchers may set aside a ‘pilot set’ of their data. Outcome information on this pilot set can be observed, and the information gained can be used to inform the study design. Subsequently, in order to preserve the separation of the study design from the study analysis, the observations from the pilot set are omitted from the main analysis.
Stratamatch uses a pilot design to estimate a quantity called the prognostic score (Ben B. Hansen (2008)), defined here as an individual’s expected outcome under the control assignment, based on their baseline covariates. Balancing observational data sets based on the prognostic score can the reduce heterogeneity between matched individuals, decreasing variance and diminishing the sensitivity of the study results to unobserved confounding (See Aikens, Greaves, and Baiocchi (2020), Antonelli et al. (2018), and Leacy and Stuart (2014)). Moreover, since the prognostic score is often continuous, strata can be easily determined using prognostic score quantiles to select evenly sized bins for the data. This circumvents common problems with stratification based on expert knowledge, since that process often generates strata which are too large, too small, or too poorly balanced between treatment and control observations.
The stratamatch function, auto_stratify
, carries out the
prognostic score stratification pilot design described above. Although
there are many additional options available when running this function,
the most basic procedure does the following:
Partition the data set into a pilot data set and an analysis data set
Fit a model for the prognostic score from the observations in the pilot set
Estimate prognostic scores for the analysis set using the prognostic model
Stratify the analysis set based on prognostic score quantiles.
Once the data set has been stratified, stratamatch
suggests several diagnostic plots and tables that can be used to assess
the size and balance of the strata produced. If the strata are
satisfactory, the treatment and control individuals can then be matched.
At this point, the reader may choose to match the data set within each
stratum using the strata_match
function, or they may select
and implement their own matching scheme.
We demonstrate the functionality of stratamatch with a toy example.
The stratamatch
package contains its own function,
make_sample_data
, for just this purpose.
library(stratamatch)
set.seed(125)
# make sample data set of 5000 observations
<- make_sample_data(n = 5000)
dat
# print the first few rows of the sample_data
head(dat)
## X1 X2 B1 B2 C1 treat outcome
## 1 0.93332697 0.22256993 1 1 b 0 0
## 2 -0.52503178 1.07623706 1 0 c 1 0
## 3 1.81443979 2.83989785 1 1 b 0 0
## 4 0.08304562 0.04686229 0 0 a 1 0
## 5 0.39571880 0.42609036 0 0 b 0 0
## 6 -2.19366962 1.28452442 1 1 c 0 1
Let’s assume that the data in dat
are an observational
data set, and we would like to estimate the effect of some binary
treatment on a binary outcome. Suppose the column treat
in
this data gives the treatment assigment (where a 0
means
untreated and a 1
means treated), and the column
outcome
gives information on who had the outcome of
interest and who didn’t (similarly, let’s call 0
the
negative outcome and 1
the positive outcome). However,
since this is an observational data set, we didn’t get to choose who got
the treatment and who didn’t. Instead, individuals self-selected into
treatment or control groups, perhaps in some way which is influenced by
their background characteristics. Additionally, an individual’s
probability of having the positive outcome may be influenced by the
treatment or by other baseline factors, but we don’t know which. In
addition to treatment assignments and outcomes, we have measured five
baseline covariates for each individual: X1
,
X2
, B1
, B2
, and C1
.
X1
and X2
are continuous, B1
and
B2
are binary, and C1
is a categorical
variable which takes on possible values "a"
,
"b"
, and "c"
.
We should also note that dat
is a fairly large data set
(5000 observations), but only about 1/5 of the individuals in the data
set recieved the treatment.
Suppose we wanted to stratify the data manually based on covariates
that our experts have selected. Importantly, we can only stratify the
data based on categorical or binary variables. This means that we cannot
use X1
or X2
directly (this would cause
manual_stratify
to throw an error). Below, I stratify the
data set based on C1
, B1
, and
B2
.
# manually stratify dat based on categorial and binary variables
<- manual_stratify(data = dat, treat ~ B1 + B2 + C1)
m.strat
# try printing the result
m.strat
## manual_strata object from package stratamatch.
##
## Function call:
## manual_stratify(data = dat, strata_formula = treat ~ B1 + B2 +
## C1)
##
## Analysis set dimensions: 5000 X 8
summary(m.strat)
## Number of strata: 12
##
## Min size: 33 Max size: 1533
##
## Strata with Potential Issues: 8
Above, we see that manual_stratify
returns a
manual strata
object. The printed summary above shows that
manual_stratify
has divided our dataset according to our
instructions and produced 12 strata.
We can extract important information from m.strat
with
$
.
The most important piece of m.strat
is the analysis set.
This is the stratified data which we can later match and use for effect
estimation.
# show the first few rows of the stratified data set
head(m.strat$analysis_set)
## # A tibble: 6 × 8
## X1 X2 B1 B2 C1 treat outcome stratum
## <dbl> <dbl> <int> <int> <chr> <int> <int> <int>
## 1 0.933 0.223 1 1 b 0 0 11
## 2 -0.525 1.08 1 0 c 1 0 9
## 3 1.81 2.84 1 1 b 0 0 11
## 4 0.0830 0.0469 0 0 a 1 0 1
## 5 0.396 0.426 0 0 b 0 0 2
## 6 -2.19 1.28 1 1 c 0 1 12
Since we didn’t use a pilot design for this example, our analysis set
just has all of the same data in dat
that we put in to
manual_stratify
. The only difference is that now there is
an additional column, stratum
, which says which stratum
each individual has been sorted into.
In contrast to manual_stratify
, the
auto_stratify
function automatically creates the stratified
data set based on estimated prognostic scores. In this process, we
specify what model we want to use for the prognostic score (the
prognosis
argument) and what percent of the control reserve
we want to use to do the fitting (the pilot_fraction
)
argument. We also need to tell auto_stratify
which column
of our data set designates the treatment assignment
(treat
). The final argument, size,
specifies
about how large we would like our strata to be. I pick
size = 400
here so that we get roughly the same number of
strata as when we manually stratified our data set.
In practice, pilot_fraction
and size
have
defaults, so we don’t need to specify them every time. Instead of
pilot_fraction
, we could specify a pilot_size
to get a pilot set of approximately that number of observations. Run
help(auto_stratify)
for more information.
<- auto_stratify(dat, treat = "treat", prognosis = outcome ~ X1 + X2,
a.strat pilot_fraction = 0.1, size = 400)
## Constructing a pilot set by subsampling 10% of controls.
## Fitting prognostic model via logistic regression: outcome ~ X1 + X2
# print and summarize the result from running auto_stratify
a.strat
## auto_strata object from package stratamatch.
##
## Function call:
## auto_stratify(data = dat, treat = "treat", prognosis = outcome ~
## X1 + X2, size = 400, pilot_fraction = 0.1)
##
## Analysis set dimensions: 4619 X 8
##
## Pilot set dimensions: 381 X 7
##
## Prognostic Score Formula:
## outcome ~ X1 + X2
summary(a.strat)
## Number of strata: 12
##
## Min size: 384 Max size: 385
##
## Strata with Potential Issues: 2
Our call to auto_stratify
has created
auto_strata
object. An auto_strata
object is
much like a manual_strata
object, except that it contains
somewhat more information, since the stratification process was done
differently. Importantly, auto_strata
has partitioned 10%
of the individuals in the control group to be in the pilot set. These
individuals were used to build the prognostic score, and were extracted
from the analysis set. In order to prevent overfitting, the observations
in the pilot set should not be included in later analyses, and should
not be reused when making the final effect estimate.
We can access the pilot set and the analysis set using $
with a.strat
. The command a.strat$analysis_set
gives the analysis set and a.strat$pilot_set
gives the
pilot set. We can also view the prognostic score estimates for the
individuals in the analysis set with
a.strat$prognostic_scores
.
Suppose you’re interested in partitioning your data into a pilot set
and an analysis set, but you’d like to do other things with it after
that - for example perhaps you’d like to fit a more complicated model
for prognostic score than a logistic regression. The
split_pilot_set
will split your pilot and analysis set for
you based on specified preferences. For more information, run
help(split_pilot_set)
.
Above, we showed the default method used by
auto_stratify
. By providing other arguments, it is possible
to construct the strata using various other methods. For example, one
could fit the prognostic scores for the analysis set separately and
provide them as an argument to auto_stratify
.
Alternatively, the researcher could specify their own pilot set that
they would like to use to build the prognostic model. Run
help(auto_stratify)
to see all of the possible inputs to
this function.
Both manual_stratify
and auto_stratify
objects have a strata table, which shows the rules defining each
stratum. Just like when we extracted the analysis set, we can extract
the strata_table
with $
. Below, I show the
strata table for m.strat
, our manually stratified data.
Each row describes the definitions used to make one stratum. For
example, below we see that stratum 1 contains all individuals who have
B1 = 0
and B2 = 0
, and C1 = "a"
:
# get strata table for manually stratified data set
$strata_table m.strat
## # A tibble: 12 × 5
## # Groups: B1, B2 [4]
## B1 B2 C1 stratum size
## <int> <int> <chr> <int> <int>
## 1 0 0 a 1 359
## 2 0 0 b 2 360
## 3 0 0 c 3 1492
## 4 0 1 a 4 39
## 5 0 1 b 5 51
## 6 0 1 c 6 150
## 7 1 0 a 7 175
## 8 1 0 b 8 40
## 9 1 0 c 9 33
## 10 1 1 a 10 1533
## 11 1 1 b 11 383
## 12 1 1 c 12 385
The strata table for an automatically stratified data set is somewhat different:
# show strata table for the automatically stratified data
$strata_table a.strat
## # A tibble: 12 × 3
## stratum quantile_bin size
## <int> <chr> <int>
## 1 1 [0.0461,0.241) 385
## 2 2 [0.2409,0.307) 385
## 3 3 [0.3068,0.352) 385
## 4 4 [0.3524,0.400) 385
## 5 5 [0.4004,0.445) 385
## 6 6 [0.4454,0.482) 385
## 7 7 [0.4825,0.526) 385
## 8 8 [0.5264,0.569) 385
## 9 9 [0.5685,0.615) 385
## 10 10 [0.6149,0.668) 385
## 11 11 [0.6676,0.735) 385
## 12 12 [0.7353,0.946] 384
The strata table for a.strat
shows what prognostic score
quantiles were used to divide the strata. For example, individuals in
stratum 1 have the lowest prognostic scores. This means that, if the
individuals in stratum 1 were in the control group, we would predict
them to have a very low probability of having the positive outcome,
based solely on their values of X1
and X2
. The
a vector of the cutoff values used to divide the strata can be extracted
from a.strat
with extract_cut_points
.
Another important diagnostic for both manually and automatically
stratified data is the issue_table
.
# issue table for manually stratified data
$issue_table m.strat
## # A tibble: 12 × 6
## Stratum Treat Control Total Control_Proportion Potential_Issues
## <int> <int> <int> <int> <dbl> <chr>
## 1 1 125 234 359 0.652 none
## 2 2 105 255 360 0.708 none
## 3 3 451 1041 1492 0.698 none
## 4 4 2 37 39 0.949 Too few samples; Small treat:…
## 5 5 3 48 51 0.941 Too few samples; Small treat:…
## 6 6 12 138 150 0.92 Small treat:control ratio
## 7 7 104 71 175 0.406 none
## 8 8 25 15 40 0.375 Too few samples
## 9 9 18 15 33 0.455 Too few samples
## 10 10 241 1292 1533 0.843 Small treat:control ratio
## 11 11 51 332 383 0.867 Small treat:control ratio
## 12 12 55 330 385 0.857 Small treat:control ratio
This table shows each of the strata that we obtained through manual
stratification and some of the potential issues we may face when trying
to match within these strata. In particular, this table highlights some
strata which are too small and/or have many more treated individuals
than controls. The Total
column shows the total number of
observations in each stratum. Some of the strata here are quite a bit
larger than others. In practice, data sets of about 4000 start to take a
long time to match, so that’s about the maximum size we would like for
our strata. In data sets of fewer than 100 individuals, it can be hard
to find good matches between our treated and control individuals.
One benefit of automatic stratification is that it tends to create
strata of roughly equal size, so that no strata are extremely small or
extremely large. We can see this by comparing the Total
column in the issue tables of the manually and automatically stratified
data sets:
# issue table for automatically stratified data
$issue_table a.strat
## # A tibble: 12 × 6
## Stratum Treat Control Total Control_Proportion Potential_Issues
## <int> <int> <int> <int> <dbl> <chr>
## 1 1 133 252 385 0.655 none
## 2 2 131 254 385 0.660 none
## 3 3 110 275 385 0.714 none
## 4 4 112 273 385 0.709 none
## 5 5 106 279 385 0.725 none
## 6 6 124 261 385 0.678 none
## 7 7 78 307 385 0.797 none
## 8 8 75 310 385 0.805 Small treat:control ratio
## 9 9 97 288 385 0.748 none
## 10 10 81 304 385 0.790 none
## 11 11 84 301 385 0.782 none
## 12 12 61 323 384 0.841 Small treat:control ratio
Plots are an important way that we can check how our statification
went. There are two types of plots that we can make automatically with
either a manual_strata
object or an
auto_strata
object. The first is a size-ratio plot. This
plots each stratum in the analysis set based on its size and the
percentage of control observations. If a stratum is too small or is very
imbalanced in its treat:control ratio, finding quality matches may be
difficult (shaded yellow areas). If a stratum is too large, matching may
be very computationally taxing (shaded red areas). We can see here that
a few strata from our manual stratification are quite small, and some
have nearly entirely control individuals.
# size-ratio plot for manually stratified data
plot(m.strat)
# plot(m.strat, type = "SR") will give the same output
# plot(m.strat, label = TRUE) will allow the user to click points to label them
We can compare this with a size-ratio plot from automatic stratification:
# size-ratio plot for automatically stratified data
plot(a.strat)
This plot clearly shows that the strata generated by auto_stratify are all about the same size. However, some of the strata still have a treat:control ratio much less than 1:1. Some of this is unavoidable because our data set naturally had a treat:control ratio of about 1:5. However, we should try to avoid having very extreme treat:control ratios in any of our strata.
The second plot we can make to diagnose issues both manually and
automatically stratified data is an overlap histogram. Below, I’ve made
an overlap histogram for all strata, and an overlap histogram for
stratum 3. This shows the propensity score distributions of the treated
and control populations. In order to make this histogram, we must tell
the plot
function what the propensity scores are for our
data set by specifying the propensity
argument. We have
three options: propensity
can be a propensity score formula
(which will then be fit on the analysis set), a glm
modeling propensity score, or a vector of propensity scores. Here, I
just pass in a formula.
# propensity score histograms for all strata from manually stratified data
plot(m.strat, type = "hist", propensity = treat ~ X2 + X1 + B1 + B2)
# propensity score histograms for all strata from manually stratified data
plot(m.strat, type = "hist", propensity = treat ~ X2 + X1 + B1 + B2, stratum = 3)
A similar command works for automatically stratified data. :
# propensity score histograms for stratum 3 from automatically stratified data
plot(a.strat, type = "hist", propensity = treat ~ X2 + X1 + B1 + B2, stratum = 3)
In addition to the two plot types above, there are some plots that
can be used to visualize auto_strata
objects only:
"residual"
and "AC"
plots.
Setting type = "AC"
in the plot command produces a
Assignment-Control plot. This shows each individual in terms of their
estimated propensity score and their estimated prognostic score (for
more details, see Aikens, Greaves, and Baiocchi
(2020)). This allows us to check how well the treated and control
samples overlap not only in their probability for treatment but in their
expected outcome under the control assignment. Just as with the overlap
histograms, we need to provide information on the propensity score.
Below, I use the same propensity formula as in the histogram above. The
first plot shows all of the strata, with lines indicating the barriers
between strata, while the second plot looks specifically at stratum 3 by
specifying the stratum
argument. Apparent striations or
groupings in assignment control plots like the ones below are common;
they appear when some binary or categorical covariate has a strong
association with either the outcome or the treatment assignment.
# make a Assignment-Control plot
plot(a.strat, type = "AC", propensity = treat ~ X1 + X2 + B1 + B2)
# make a Assignment-Control plot
plot(a.strat, type = "AC", propensity = treat ~ X1 + X2 + B1 + B2, stratum = 3)
Finally, if we fit a prognostic model, then we can run diagnostics on
that prognostic model with type = "residual"
. This is
essentially just a wrapper for the glm
plotting method. It
produces all the familiar diagnostic plots that we can obtain from
calling plot
on a fitted model from glm
.
# diagnostic plots for prognostic model
plot(a.strat, type = "residual")
# plot(a.strat$prognostic_model) will do the same thing - see below
The plot
command with type = "residual"
gives us a one-line command that shows us standard diagnostic plots for
the prognostic score model that we fit on the pilot set. If we want to
run extra diagnostics on our prognostic model, we can do that too. Our
a.strat
stores a copy of the prognostic score model, which
we can extract in the usual way with $
. By extracting the
prognostic model (prognostic_model
), we can run any of the
usual diagnostics for our fitted model.
# extract prognostic model from a.strat
<- a.strat$prognostic_model
progmod
# as an example, summarize model coefficients
summary(progmod)
##
## Call:
## glm(formula = prognostic_formula, family = "binomial", data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7627 -1.0509 -0.6498 1.0808 1.9692
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.17106 0.15821 -1.081 0.280
## X1 -0.77962 0.12341 -6.318 2.66e-10 ***
## X2 0.09660 0.09498 1.017 0.309
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 527.73 on 380 degrees of freedom
## Residual deviance: 480.76 on 378 degrees of freedom
## AIC: 486.76
##
## Number of Fisher Scoring iterations: 4
Based on the coefficients from the prognostic model, we can tell that
our prognostic score estimates are primarily determined by each
individual’s X2
baseline value.
Now that we have stratified the data, we can match the individuals
within each stratum. Note that all of the examples that follow
additionally require the R package optmatch
(Ben B. Hansen and Klopfer (2006)) to be
installed.
The most flexible strategy - although occasionally more complex - is
for the researcher to do this step by hand, using the strata
designations in the analysis set returned by either one of the
stratification methods. If the researcher plans to do something very
specific or nuanced in the matching process, this may be the best
approach. For example, users proficient with the matching package
optmatch
(Ben B. Hansen and Klopfer
(2006)) will note that adding + strata(stratum)
to
the matching formula supplied to optmatch will request that optmatch
perform the matching within each stratum in the analysis set (see the optmatch
documentation for further details). Another approach is to divide
the analysis_set
into separate data frames and match on
those individually, perhaps distributing over several computing clusters
or using a different matching scheme in each stratum (for example
different \(k\) in \(1:k\) matching)
If the goal is something more simple, e.g. a 1-to-1 or 1-to-\(k\) optimal propensity score matching
within strata, the strata_match
function can do that
automatically. Below, I match two control individuals to each treatment
individual within strata in my analysis set provided by a.strat. I use
the propensity score formula treat ~ X1 + X2 + B1 + B2
.
# match the automatically stratified data
<- strata_match(a.strat, treat ~ X1 + X2 + B1 + B2, k = 1) mymatch
## Fitting propensity model: treat ~ X1 + X2 + B1 + B2
##
# summarize matching results
summary(mymatch)
## Structure of matched sets:
## 1:1 0:1
## 1192 2235
## Effective Sample Size: 1192
## (equivalent number of matched pairs).
This function in turn calls optmatch
to do the matching,
stratified by the strata assignments. It returns an optmatch matching.
In essence, each individual is given an identification code for its
match. <NA>
indicates that the individual was not
matched, and an identifier such as 3.15
indicates that this
individual was in stratum 3 and it was placed in match 15.
# add match information as a column in the data set
<- a.strat$analysis_set
matched_data $match <- as.character(mymatch)
matched_data
head(matched_data)
## X1 X2 B1 B2 C1 treat outcome stratum match
## 1 0.93332697 0.22256993 1 1 b 0 0 2 <NA>
## 2 -0.52503178 1.07623706 1 0 c 1 0 9 9.22
## 3 1.81443979 2.83989785 1 1 b 0 0 1 <NA>
## 4 0.08304562 0.04686229 0 0 a 1 0 5 5.82
## 5 -0.36031653 1.69169297 0 0 c 0 1 8 <NA>
## 6 0.14285392 0.68549042 1 1 a 0 0 6 <NA>
Effect estimation can then be done using the matched analysis set via researcher’s method of choice.