This vignette offers an in-depth walkthrough of the simulation-based
functions that are based in the conjugate linear model setting within
the bayesassurance
package. This includes
bayes_sim()
, bayes_sim_unbalanced()
, and
bayes_sim_unknownvar()
. We also review supplementary
functions that help generate design matrices when users are unable to
provide them. The following sections elaborate on the underlying
framework embedded within each of these functions followed by example
code segments.
bayes_sim()
FunctionThe bayes_sim()
function estimates the assurance by
determining the proportion of MC samples that meet a pre-specified
condition described later in this section. The underlying algorithm of
the function works in the context of conjugate Bayesian linear
regression models, in which a set of \(n\) observations denoted by \(y = (y_1, y_2, \cdots, y_n)^{\top}\) are to
be collected in the presence of \(p\)
controlled explanatory variables, \(x_1, x_2,
\cdots, x_p\). Specifically, \[y =
X_n\beta + \epsilon_n,\] where \(X_n\) is an \(n
\times p\) design matrix with the \(i\)-th row corresponding to \(x_i\), and \(\epsilon_n \sim N(0, \sigma^2V_n)\), where
\(V_n\) is a known \(n \times n\) correlation matrix. A
conjugate Bayesian linear regression model specifies the joint
distribution of parameters \(\{\beta,
\sigma^2\}\) and \(y\) as \[
IG(\sigma^2|a_{\sigma}, b_{\sigma}) \times N(\beta|\mu_{\beta}^{(a)},
\sigma^2V_{\beta}^{(a)}) \times N(y | X_n\beta, \sigma^2V_n).
\]
Analysis Stage
Depending on the objective of our study, we may either be interested
in defining a one-sided or two-sided objective. Suppose we implement a
one-sided test to assess whether the linear contrast of unknown
parameter \(\beta\) is greater than
some constant \(C\). The
bayes_sim()
function finds the assurance of the realized
data favoring \(H: u^{\top}\beta >
C\), where \(u\) is a \(p \times 1\) vector of fixed contrasts. We
start in the analysis stage and frame the condition that is to be
observed. Inference proceeds from the posterior distribution given by
\[
p(\beta, \sigma^2|y) = IG(\sigma^2|a_{\sigma}^*, b_{\sigma}^*)
\times N(\beta|M_nm_n, \sigma^2M_n),
\] where \(a^*_{\sigma} =
a_{\sigma}+\frac{n}{2}\), \(b^*_{\sigma} = b_{\sigma} +
\frac{1}{2}\left\{\mu_{\beta}^{\top}V_{\beta}^{-1}\mu_{\beta} +
y^{\top}V_n^{-1}y - m_n^{\top}M_nm_n\right\}\), \(M_n^{-1} = V_{\beta}^{-1(a)} +
X_n^{\top}V_n^{-1}X_n\) and \(m_n =
V_{\beta}^{-1(a)}\mu_{\beta}^{(a)} + X_n^{\top}V_n^{-1}y\). The
posterior distribution helps shape our analysis objective. The
superscripts (a) denote that the parameters take place in the analysis
stage.
If \(\sigma^2\) is known and fixed, then the posterior distribution of \(\beta\) is \(\displaystyle p(\beta | \sigma^2, y) = N(\beta | M_nm_n, \sigma^2M_n)\) as shown in the equation above. Referring back to the evaluation of \(H: u^{\top}\beta > C\), standardization leads to \[\begin{equation} \label{eq1} \left.\frac{u^{\top}\beta - u^{\top}M_nm_n}{\sigma \sqrt{u^{\top}M_nu}} \right| \sigma^2, y \sim N(0,1)\;. \end{equation}\]
To evaluate the credibility of \(H: u^{\top}\beta > C\), we decide in favor of \(H\) if the observed data belongs in the region \[\begin{align*} A_{\alpha}(u, \beta, C) &= \left\{y: P\left(u^{\top}\beta \leq C | y\right) < \alpha\right\} = \left\{y: \Phi \left(\frac{C - u^{\top}M_n m_n}{\sigma \sqrt{u^{\top}M_nu}}\right) < \alpha \right\}. \end{align*}\] This defines our analysis objective. Alternatively, if we wanted to evaluate the tenability of \(H: u^{\top}\beta < C\) this can easily be done by adjusting the corresponding sign in the expression. If we wanted to evaluate the tenability of \(H: u^{\top}\beta \neq C\), the observed region becomes \[ A_{\alpha}(u, \beta, C) = \left\{y: P\left(u^{\top}\beta \leq C | y\right) < \alpha/2 \right\} \text{ or } \left\{y: P\left(u^{\top}\beta > C | y\right) < \alpha/2\right\}. \]
The Design Stage
From here we shift to the design stage, whose purpose is to seek a sample size \(n\) such that the analysis objective is met 100\(\delta \%\) of the time, where \(\delta\) is the assurance. Doing this requires determining the marginal distribution of \(y\), which is assigned a separate set of priors to quantify our belief about the population from which the sample will be taken. Hence, the marginal of \(y\) under the design priors will be derived from \[\begin{align}\label{eq: ohagan_stevens_linear_regression_design_priors} y &= X_n\beta + e_n; \quad e_n \sim N(0, \sigma^2 V_n)\;;\quad \beta = \mu_{\beta}^{(d)} + \omega; \quad \omega \sim N(0, \sigma^2 V_{\beta}^{(d)})\;, \end{align}\] where \(\beta\sim N(\mu_{\beta}^{(d)},\sigma^2 V_{\beta}^{(d)})\) is the design prior on \(\beta\). Substituting the equation for \(\beta\) into the equation for \(y\) gives \(y = X\mu_{\beta}^{(d)} + (X\omega + e_n)\) and, hence, \(y \sim N\left(X\mu_{\beta}^{(d)}, \sigma^2V_n^{*}\right)\), where \(V_n^{\ast} = \left(XV_{\beta}^{(d)}X^{\top} + V_n\right)\).
Summary
To summarize our simulation strategy for estimating the Bayesian assurance, we fix sample size \(n\) and generate a sequence of \(J\) datasets \(y^{(1)}, y^{(2)}, ..., y^{(J)}\). Then, a Monte Carlo estimate of the Bayesian assurance is \[ \hat{\delta}(n) = \frac{1}{J}\sum_{j=1}^J \mathbb{I}\left(\left\{y^{(j)}: \Phi \left(\frac{C - u^{\top}M_n^{(j)}m_n^{(j)}}{\sigma \sqrt{u^{\top}M_n^{(j)}u}}\right) < \alpha \right\}\right)\;, \] where \(\mathbb{I}(\cdot)\) is the indicator function of the event in its argument, \(M_n^{(j)}\) and \(m_n^{(j)}\) are the values of \(M_n\) and \(m_n\) computed from \(y^{(j)}\).
Load in the bayesassurance
package.
library(bayesassurance)
Specify the following inputs:
n
: sample size (either scalar or vector). If vector,
each unique value corresponds to a separate study design.p
: number of explanatory variables being considered.
Also denotes the column dimension of design matrix Xn
. If
Xn = NULL
, p
must be specified for the
function to assign a default design matrix for Xn
.u
: a scalar or vector embedded in the linear contrast
that delineates our assessment of either \(u^{\top}\beta > C\), \(u^{\top}\beta < C\), or \(u^{\top}\beta \neq C\).C
: constant that linear contrast is being compared
to.Xn
: an \(n \times p\)
design matrix that characterizes the observations given by the normal
linear regression model \(y = X_n\beta +
\epsilon_n\), where \(\epsilon_n \sim
N(0, \sigma^2V_n)\). See above description for details. Default
Xn
is an \(np \times p\)
matrix comprised of \(n \times 1\) ones
vectors that run across the diagonal of the matrix.Vbeta_d
: correlation matrix that characterizes prior
information on \(\beta\) in the design
stage, i.e. \(\beta \sim N(\mu_{\beta}^{(d)},
\sigma^2 V_{\beta}^{(d)})\).Vbeta_a_inv
: inverse-correlation matrix that
characterizes prior information on \(\beta\) in the analysis stage, i.e. \(\beta \sim N(\mu_{\beta}^{(a)},
\sigma^2V_{\beta}^{(a)})\). The inverse is passed in for
convenience sake, i.e. \(V_{\beta}^{-1(a)}\)Vn
: an \(n \times n\)
correlation matrix for the marginal distribution of the sample data
\(y\). Takes on an identity matrix when
set to NULL.sigsq
: a known and fixed constant preceding all
correlation matrices Vn
, Vbeta_d
and
Vbeta_a_inv
mu_beta_d
: design stage mean, \(\mu_{\beta}^{(d)}\)mu_beta_a
: analysis stage mean, \(\mu_{\beta}^{(a)}\)alt
: specifies alternative test case, where
alt = "greater"
tests if , alt = "less"
tests
if , and alt = "two.sided"
performs a two-sided test. By
default, alt = "greater"
.alpha
: significance levelmc_iter
: number of MC samples evaluated under the
analysis objectiveExample 1: One-Dimensional Setting
We start with a simple example that considers a scalar parameter
\(\beta\) in evaluating the credibility
of \(H: u^{\top}\beta > C\). We
assign the following set of inputs for the bayes_sim()
function and save the output as assur_vals
. Some key points
to note:
p = 1
, u = 1
and
C = 0.15
implies we are evaluating the tenability of \(H: \beta > 0.15\), where \(\beta\) is a scalar.Vbeta_d
and Vbeta_a_inv
are scalars to
conform with the dimension of \(\beta\). A weak analysis prior
(Vbeta_a_inv = 0
) and a strong design prior
(Vbeta_d = 1e-8
) are assigned for the purpose of
demonstrating the overlapping scenario that takes place between the
Bayesian and frequentist settings.Xn
and Vn
are set to NULL
which means they will take on the default settings specified in the
parameter descriptions above. <- seq(100, 250, 5)
n
set.seed(10)
<- bayes_sim(n, p = 1, u = 1, C = 0.15, Xn = NULL,
assur_vals Vbeta_d = 1e-8, Vbeta_a_inv = 0,
Vn = NULL, sigsq = 0.265,
mu_beta_d = 0.25, mu_beta_a = 0,
alt = "greater", alpha = 0.05, mc_iter = 10000)
Within assur_vals
contains a list of two outputs:
assurance_table
: table of sample sizes and
corresponding assurance values.assur_plot
: plot of estimated assurance values with
sample size \(n\) running along the
x-axis.The first six entries of the resulting power table can be seen by
calling pwr_vals$pwr_table
.
head(assur_vals$assurance_table)
Observations.per.Group..n. | Assurance |
---|---|
100 | 0.6177 |
105 | 0.6305 |
110 | 0.6529 |
115 | 0.6666 |
120 | 0.6861 |
125 | 0.7084 |
The assurance plot is produced using the ggplot2
package. It displays the inputted sample sizes on the x-axis and the
resulting assurance values on the y-axis.
$assurance_plot assur_vals
Notice that if we input the following parameters into the
pwr_freq()
function, we observe an approximate overlap
between the assurance and power values.
<- seq(100, 250, 5)
n <- bayesassurance::pwr_freq(n = n, sigsq = 0.265, alt = "greater", alpha = 0.05,
y1 theta_0 = 0.15, theta_1 = 0.25)
<- assur_vals
y2
library(ggplot2)
<- ggplot2::ggplot(y1$pwr_table, alpha = 0.5,
p1 aes(x = n, y = Power, color="Power")) +
geom_line(lwd=1.2) + geom_point(data = y2$assurance_table, alpha = 0.5,
aes(y = y2$assurance_table$Assurance, color="Assurance"),lwd=1.2) +
ggtitle("Power Curve and Assurance Values Overlayed") + xlab("Sample Size n") +
ylab("Power/Assurance")
p1
Example 2: Use of Linear Contrasts
In this next example, we assume \(\beta\) is a vector of unknown components. We refer to the cost-effectiveness application discussed in O’Hagan et al. 2001 to demonstrate a real-world setting. The application considers a randomized clinical trial that compares the cost-effectiveness of two treatments, where the cost-effectiveness is evaluated using a net monetary benefit measure expressed as \[ \xi = K(\mu_2 - \mu_1) - (\gamma_2 - \gamma_1), \] where \(\mu_2 - \mu_1\) and \(\gamma_2 - \gamma_1\) corresponds to the true differences in treatment efficacy and costs, respectively, between Treatments 1 and 2. The threshold unit cost, \(K\), represents the maximum price that a health care provider is willing to pay for a unit increase in efficacy. Note that the indices of \(\mu\) and \(\gamma\) indicate whether we are looking at Treatment 1 or 2.
In this setting, we seek the tenability of \(H: \xi > 0\), which if true, indicates that Treatment 2 is more cost-effective than Treatment 1. Since the net monetary benefit formula involves assessing the cost and efficacy components conveyed within each of the two treatment groups, we set \(\beta = (\mu_1, \gamma_1, \mu_2, \gamma_2)^{\top}\), where \(\mu_i\) and \(\gamma_i\) denote the efficacy and cost for treatments \(i = 1, 2\). We therefore set \(u = (-K, 1, K, -1)^{\top}\) and \(C = 0\), which lets us evaluate an equivalent form of \(\xi > 0\) re-expressed \(u^{\top}\beta > 0\). All other inputs of this application were either specified or deduced from the paper.
The following set of inputs specified in the subsequent code chunk
should result in an assurance of approximately 0.70 according to O’Hagan
et al. 2001. The bayes_sim()
returns a similar value,
demonstrating that sampling from the posterior yields results similar to
those reported in the paper.
= 285
n = 4
p = 20000 # threshold unit cost
K <- 0
C <- as.matrix(c(-K, 1, K, -1))
u <- 4.04^2
sigsq
## Assign correlation matrices to analysis and design stage priors
<- matrix(rep(0, p^2), nrow = p, ncol = p)
Vbeta_a_inv
<- (1 / sigsq) * matrix(c(4, 0, 3, 0, 0, 10^7, 0, 0, 3, 0, 4, 0, 0, 0, 0, 10^7),
Vbeta_d nrow = 4, ncol = 4)
<- tau2 <- 8700
tau1 <- sqrt(sigsq)
sig <- matrix(0, nrow = n*p, ncol = n*p)
Vn 1:n, 1:n] <- diag(n)
Vn[2*n - (n-1)):(2*n), (2*n - (n-1)):(2*n)] <- (tau1 / sig)^2 * diag(n)
Vn[(3*n - (n-1)):(3*n), (3*n - (n-1)):(3*n)] <- diag(n)
Vn[(4*n - (n-1)):(4*n), (4*n - (n-1)):(4*n)] <- (tau2 / sig)^2 * diag(n)
Vn[(
## Assign mean parameters to analysis and design stage priors
<- as.matrix(c(5, 6000, 6.5, 7200))
mu_beta_d <- as.matrix(rep(0, p))
mu_beta_a
set.seed(10)
<- bayesassurance::bayes_sim(n = 285, p = 4, u = as.matrix(c(-K, 1, K, -1)),
assur_vals C = 0, Xn = NULL,
Vbeta_d = Vbeta_d, Vbeta_a_inv = Vbeta_a_inv,
Vn = Vn, sigsq = 4.04^2,
mu_beta_d = as.matrix(c(5, 6000, 6.5, 7200)),
mu_beta_a = as.matrix(rep(0, p)),
alt = "greater", alpha = 0.05, mc_iter = 10000)
$assur_val assur_vals
#> [1] "Assurance: 0.724"
bayes_sim()
Special Case: Longitudinal DesignWe demonstrate an additional setting embedded in the
bayes_sim()
function tailored to longitudinal data.
Referring back to the linear regression model discussed in the general
framework, we can construct a longitudinal model that utilizes the same
linear regression form, where \[
y = X_n\beta + \epsilon_n.
\]
Consider a group of subjects with an equal number of repeated measures at equally-spaced time points, i.e. a balanced longitudinal study. These subjects can be characterized by \[ y_{ij} = \beta_{0i} + \beta_{1i}t_{ij} + \epsilon_{ij}, \] where \(y_{ij}\) denotes the \(j^{th}\) observation of subject \(i\) at time \(t_{ij}\), \(\beta_{0i}\) and \(\beta_{1i}\) respectively denote the intercept and slope terms for subject \(i\), and \(\epsilon_{ij}\) is an error term characterized by \(\epsilon \sim N(0, \sigma^2_i)\).
In a simple case where we only have two subjects, each with \(n\) repeated measures, we can individually express the observations as \[\begin{align*} y_{11} &= \beta_{01} + \beta_{11}t_{11} + \epsilon_{11}\\ &\vdots\\ y_{1n} &= \beta_{01} + \beta_{11}t_{1n} + \epsilon_{1n}\\ y_{21} &= \beta_{02} + \beta_{12}t_{21} + \epsilon_{21}\\ &\vdots\\ y_{2n} &= \beta_{02} + \beta_{12}t_{2n} + \epsilon_{2n}. \end{align*}\] The model can also be expressed cohesively using matrices, \[\begin{gather} \underbrace{\begin{pmatrix} y_{11} \\ \vdots \\ y_{1n} \\ y_{21} \\ \vdots\\ y_{2n} \end{pmatrix}}_{y} = \underbrace{ \begin{pmatrix} 1 & 0 & t_{11} & 0\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & t_{1n} & 0\\ 0 & 1 & 0 & t_{21}\\ \vdots & \vdots & \vdots & \vdots \\ 0 & 1 & 0 & t_{2n} \end{pmatrix} }_{X_n} \underbrace{ \begin{pmatrix} \beta_{01}\\ \beta_{02}\\ \beta_{11}\\ \beta_{12} \end{pmatrix} }_{\beta} + \underbrace{ \begin{pmatrix} \epsilon_{11}\\ \vdots\\ \epsilon_{1n}\\ \epsilon_{21}\\ \vdots\\ \epsilon_{2n} \end{pmatrix} }_{\epsilon_n} \end{gather}\] bringing us back to the linear regression model form.
Example 3: Longitudinal Setting
The following example uses similar parameter settings as the previous cost-effectiveness example along with longitudinal specifications. We assume two subjects and want to test whether the growth rate of subject 1 is greater than that of subject 2. This could have either positive or negative implications depending on the measurement scale.
Assigning an appropriate linear contrast lets us evaluate this under the general expression, \(u^{\top}\beta > C\), where \(u = (1, -1, 1, -1)^{\top}\) and \(C = 0\). The timepoints are arbitrarily chosen to be 10 through 120- this could be days, months, or years depending on the context of the problem. The number of repeated measurements per subject to be tested includes values 10 through 100 in increments of 5. This indicates that we are evaluating the assurance for 19 study designs in total. \(n = 10\) divides the specified time interval into 10 evenly-spaced timepoints between 10 and 120.
For a more complicated study design consisting of more than two subjects that are divided into two treatment groups, the user could consider testing if the mean growth rate is higher in the first treatment group versus the second, e.g. if we have three subjects per treatment group, the linear contrast could be set as \(u = (0, 0, 0, 0, 0, 0, 1/3, 1/3, 1/3, -1/3, -1/3, -1/3)^{\top}\).
<- seq(10, 100, 5)
n <- c(1,2)
ids <- 100
sigsq <- matrix(rep(0, 16), nrow = 4, ncol = 4)
Vbeta_a_inv <- (1 / sigsq) * matrix(c(4, 0, 3, 0, 0, 6, 0, 0, 3, 0, 4, 0, 0, 0, 0, 6),
Vbeta_d nrow = 4, ncol = 4)
<- bayes_sim(n = n, p = NULL, u = c(1, -1, 1, -1), C = 0, Xn = NULL,
assur_out Vbeta_d = Vbeta_d, Vbeta_a_inv = Vbeta_a_inv,
Vn = NULL, sigsq = 100,
mu_beta_d = as.matrix(c(5, 6.5, 62, 84)),
mu_beta_a = as.matrix(rep(0, 4)), mc_iter = 5000,
alt = "two.sided", alpha = 0.05, longitudinal = TRUE, ids = ids,
from = 10, to = 120)
head(assur_out$assurance_table)
Observations.per.Group..n. | Assurance |
---|---|
10 | 0.7032 |
15 | 0.8114 |
20 | 0.8870 |
25 | 0.9260 |
30 | 0.9494 |
35 | 0.9654 |
$assurance_plot assur_out
bayes_sim_unbalanced()
FunctionThe bayes_sim_unbalanced()
function relies on the same
analysis and design stage framework as bayes_sim()
but lets
users pass in unequal pairs of sample sizes to determine the assurance
of achieving the analysis objective for a multi-sample study design
characterized by two distinct sample sizes (e.g. two-sample design with
different sized groups, four-sample design with two groups having the
same sample size while the other two groups have a different sample
size). This is unlike the bayes_sim()
function that
determines the assurance solely for balanced study designs.
The bayes_sim_unbalanced()
function is very similar to
bayes_sim()
in terms of parameter specifications. There are
a few noteworthy exceptions that are listed below.
n1
: first sample size (vector or scalar). Note that a
vector can only be passed in when Xn = NULL
. If users
choose to specify their own design matrix, n1
must be a
scalar as the specified Xn
would correspond to a single
study design for a specific sample size.n2
: second sample size (vector or scalar). Note that a
vector can only be passed in when Xn = NULL
. If users
choose to specify their own design matrix, n2
must be a
scalar as the specified Xn
would correspond to a single
study design for a specific sample size.repeats
: an integer giving the number of times to
repeat c(n1, n2)
. Applicable for study designs that have
more than two explanatory variables containing the same sample sizes
specified in n1
and n2
. Set to
repeats = 1
by default.surface_plot
: logical parameter that specifies whether
user wishes to view a contour plot. when set to TRUE
, and
n1
and n2
are vectors, a contour plot
(i.e. heat map) showcasing assurances obtained for unique combinations
of n1
and n2
is produced.The resulting outputs are as follows:
assurance_table
: table of sample size and corresponding
assurance valuescontourplot
: contour map of assurance values if
surface_plot = TRUE
mc_samples
: number of Monte Carlo samples that were
generated for evaluationExample 1: Simple Example
Similar to bayes_sim()
, we let Xn = NULL
to
allow our function to automatically generate a design matrix using
bayesassurance::gen_Xn()
, which is discussed in a later
section. Note that n1
and n2
can each take a
vector of equal length, where a vector of assurance values will be
returned corresponding to the distinct pairs of n1
and
n2
.
<- seq(20, 75, 5)
n1 <- seq(50, 160, 10)
n2
set.seed(100)
<- bayes_sim_unbalanced(n1 = n1, n2 = n2, repeats = 1, u = c(1, -1),
assur_out C = 0, Xn = NULL, Vbeta_d = matrix(c(50, 0, 0, 10),nrow = 2, ncol = 2),
Vbeta_a_inv = matrix(rep(0, 4), nrow = 2, ncol = 2),
Vn = NULL, sigsq = 100, mu_beta_d = c(1.17, 1.25),
mu_beta_a = c(0, 0), alt = "two.sided", alpha = 0.05, mc_iter = 5000,
surface_plot = TRUE)
## Outputs
head(assur_out$assurance_table)
n1 | n2 | Assurance |
---|---|---|
20 | 50 | 0.9424 |
25 | 60 | 0.9508 |
30 | 70 | 0.9580 |
35 | 80 | 0.9600 |
40 | 90 | 0.9610 |
45 | 100 | 0.9642 |
$contourplot assur_out
Example 2: Cost-effectiveness Example
We revisit the cost-effectiveness example demonstrated in the
bayes_sim()
discussion. In this example, we need to make
use of the repeats
parameter as the two sample sizes we are
defining correspond to four explanatory variables, i.e. two sets of
efficacy and cost measurements pertaining to Treatments 1 and 2.
<- c(4, 5, 15, 25, 30, 100, 200)
n1 <- c(8, 10, 20, 40, 50, 200, 250)
n2
<- as.matrix(c(5, 6000, 6.5, 7200))
mu_beta_d <- as.matrix(rep(0, 4))
mu_beta_a = 20000 # threshold unit cost
K <- 0
C <- as.matrix(c(-K, 1, K, -1))
u <- 4.04^2
sigsq <- matrix(rep(0, 16), nrow = 4, ncol = 4)
Vbeta_a_inv <- (1 / sigsq) * matrix(c(4, 0, 3, 0, 0, 10^7, 0, 0, 3, 0, 4, 0, 0, 0, 0, 10^7),
Vbeta_d nrow = 4, ncol = 4)
set.seed(12)
<- bayes_sim_unbalanced(n1 = n1, n2 = n2, repeats = 2, u = as.matrix(c(-K, 1, K, -1)),
assur_out C = 0, Xn = NULL, Vbeta_d = Vbeta_d,
Vbeta_a_inv = Vbeta_a_inv,
Vn = NULL, sigsq = 4.04^2,
mu_beta_d = as.matrix(c(5, 6000, 6.5, 7200)),
mu_beta_a = as.matrix(rep(0, 4)), alt = "greater",
alpha = 0.05, mc_iter = 5000, surface_plot = TRUE)
View the outputs.
head(assur_out$assurance_table)
n1 | n2 | Assurance |
---|---|---|
4 | 8 | 0.1468 |
5 | 10 | 0.1692 |
15 | 20 | 0.3184 |
25 | 40 | 0.4080 |
30 | 50 | 0.4322 |
100 | 200 | 0.6280 |
$contourplot assur_out
bayes_sim_unknownvar()
FunctionSimilar to the bayes_sim()
function, the
bayes_sim_unknownvar()
function takes in a set of inputs
and returns the approximate Bayesian assurance by determining the
proportion of MC samples that meet the analysis stage objective. The
bayes_sim_unknownvar()
function is used when the variance
\(\sigma^2\) is unknown. Hence, the
simulation itself relies on an additional step that involves generating
a \(\sigma^2\) prior to evaluating the
analysis objective.
Analysis Stage
Our region of interest corresponding to our analysis objective still remains as \(A_{\alpha}(u, \beta, C) = \left\{y: P\left(u^{\top}\beta \leq C | y\right) < \alpha\right\}\) when deciding whether or not we are in favor of \(H\). To implement this in a simulation setting, we rely on iterative sampling for both \(\beta\) and \(\sigma^2\) to estimate the assurance. We specify analysis priors \(\beta | \sigma^2 \sim N(\mu_{\beta}^{(a)}, \sigma^2V_{\beta}^{(a)})\) and \(\sigma^{2} \sim IG(a^{(a)}, b^{(a)})\), where the superscripts \((a)\) signify analysis priors.
We had previously derived the posterior distribution of \(\beta\) expressed as \(p(\beta | y, \sigma^{2}) = N(\beta | M_nm_n, \sigma^{2}M_n)\), where \(M_n = (V_\beta^{-1(a)} + X^{\top} V_n^{-1} X)^{-1}\) and \(m_n = V_\beta^{-1(a)}\mu_\beta^{(a)} + X^{\top}V_n^{-1}y\). The posterior distribution of \(\sigma^{2}\) is obtained by integrating out \(\beta\) from the joint posterior distribution of \(\{\beta, \sigma^2\}\), which yields \[\begin{equation}\label{eq: postbeta} \begin{split} p(\sigma^2| y) &\propto IG(\sigma^2| a^{(a)},b^{(a)}) \times \int N(\beta | \mu_{\beta}, \sigma^2 V_{\beta}) \times N(y | X\beta, \sigma^2 V_{n}) d\beta\\ &\propto \left(\frac{1}{\sigma^2}\right)^{a^{(a)}+ \frac{n}{2} + 1}\exp\left\{-\frac{1}{\sigma^2}\left(b^{(a)} + \frac{c^{\ast}}{2}\right)\right\}\;. \end{split} \end{equation}\] Therefore, \(p(\sigma^2| y) = IG\left(\sigma^2| a^{\ast}, b^{\ast}\right)\), where \(a^{\ast} = a^{(a)} + \frac{n}{2}\) and \(b^{\ast} = b^{(a)} + \frac{c^{\ast}}{2} = b^{(a)} + \frac{1}{2} \left\{\mu_{\beta}^{\top (a)} V_{\beta}^{-1(a)}\mu_{\beta}^{(a)} + y^{\top}V_n^{-1}y - m_n^{\top}M_nm_n\right\}\).
Design Stage
Recall the design stage objective aims to identify sample size \(n\) that is needed to attain the assurance
level specified by the investigator. Similar to Section~\(\ref{subsec:
ohagan_stevens_linear_regression_known_sigma}\) we will need the
marginal distribution of \(y\) with
priors placed on both \(\beta\) and
\(\sigma^2\). We denote these design
priors as \(\beta^{(d)}\) and \(\sigma^{2(d)}\), respectively, to signify
that we are working within the design stage. Derivation steps are almost
identical to those outlined in bayes_sim()
for the known
\(\sigma^2\) case. With \(\sigma^{2(d)}\) now treated as an unknown
parameter, the marginal distribution of \(y\), given \(\sigma^{2(d)}\), under the design prior is
derived from \(y = X_n\beta^{(d)} +
e_n\), $ e_n N(0, ^{2(d)} V_n)$, \(\beta^{(d)} = \mu_{\beta}^{(d)} + \omega; \quad
\omega \sim N(0, \sigma^{2(d)} V_{\beta}^{(d)})\), where \(\beta^{(d)}\sim N(\mu_{\beta}^{(d)},\sigma^{2(d)}
V_{\beta}^{(d)})\) and \(\sigma^{2(d)}
\sim IG(a^{(d)}, b^{(d)})\). Substituting \(\beta^{(d)}\) into \(y\) gives us \(y
= X_n\mu_{\beta}^{(d)} + (X_n\omega + e_n)\) such that \(X_n\omega + e_n \sim N(0, \sigma^{2(d)}(V_n + X_n
V_{\beta}^{(d)} X_n^{\top}))\). The marginal distribution of
\(p(y| \sigma^{2(d)})\) is \[\begin{equation}\label{eq:marg_yn}
y | \sigma^{2(d)} \sim N(X_n \mu_{\beta}^{(d)}, \sigma^{2(d)}
V_n^{*}); \quad
V_n^{*} = X_n V_{\beta}^{(d)} X_n^{\top} + V_n.
\end{equation}\] Equation~\(\eqref{eq:marg_yn}\) specifies our data
generation model for ascertaining sample size.
Summary
In the design stage, we draw \(\sigma^{2(d)}\) from \(IG(a_\sigma^{(d)}, b_\sigma^{(d)})\) and generate the data from our sampling distribution from \(\eqref{eq:marg_yn}\), \(y \sim N(X\mu_{\beta}^{(d)}, \sigma^{2(d)}(XV_{\beta}^{(d)}X^{\top} + V_n))\). For each such data set, \(\{y, X_n\}\), we perform Bayesian inference for \(\beta\) and \(\sigma^2\). Here, we draw \(J\) samples of \(\beta\) and \(\sigma^2\) from their respective posterior distributions and compute the proportion of these \(J\) samples that satisfy \(u^{\top} \beta_j > C\). If the proportion exceeds a certain threshold \(1-\alpha\), then the analysis objective is met for that dataset. The above steps for the design and analysis stage are repeated for \(R\) datasets and the proportion of the \(R\) datasets that meet the analysis objective, i.e., deciding in favor of \(H\), correspond to the Bayesian assurance.
Taking the exact same inputs as those in the first example of
bayes_sim()
, we modify the inputs and perform the
simulation using bayes_sim_unknownvar()
instead.
<- seq(100, 250, 5)
n <- bayesassurance::bayes_sim(n, p = 1, u = 1, C = 0.15, Xn = NULL,
assur_vals Vbeta_d = 1e-8, Vbeta_a_inv = 0,
Vn = NULL, sigsq = 0.265,
mu_beta_d = 0.25, mu_beta_a = 0,
alpha = 0.05, mc_iter = 10000)
<- bayesassurance::bayes_sim_unknownvar(n, p = 1, u = 1,
assur_vals C = 0.15, R = 150, Xn = NULL, Vn = NULL,
Vbeta_d = 1e-8, Vbeta_a_inv = 0,
mu_beta_d = 0.25, mu_beta_a = 0,
a_sig_a = 0.1, b_sig_a = 0.1,
a_sig_d = 0.1, b_sig_d = 0.1,
alpha = 0.05, mc_iter = 5000)
gen_Xn()
All simulation-based functions contained in the package do not
require users to specify their own design matrix \(X_n\). Users have the option of setting
Xn = NULL
, which prompts the function to construct a
default design matrix using bayesassurance::gen_Xn()
. The
resulting design matrix depends on the sample size parameter
n
being passed into bayes_sim()
. Each unique
sample size contained in n
results in a distinct design
matrix. The column dimension of \(X_n\)
corresponds to the number of groups being assessed, denoted by parameter
p
. Each column is comprised of ones vectors that run
diagonally across the matrix with lengths that correspond to the sample
size value.
The following examples showcase the general functionality of
gen_Xn()
as well as how gen_Xn()
is
specifically used within bayes_sim()
.
Example 1
As a general example, suppose we pass in a vector of length 4 into
gen_Xn()
. The function will interpret this as a study
design consisting of 4 explanatory variables with unequal observed
frequencies of 1, 2, 3, and 4. This will result in \(X_n\) having 4 columns containing ones
vectors of lengths 1, 2, 3, and 4 respectively.
<- c(1,2,3,4)
n ::gen_Xn(n)
bayesassurance#> [,1] [,2] [,3] [,4]
#> [1,] 1 0 0 0
#> [2,] 0 1 0 0
#> [3,] 0 1 0 0
#> [4,] 0 0 1 0
#> [5,] 0 0 1 0
#> [6,] 0 0 1 0
#> [7,] 0 0 0 1
#> [8,] 0 0 0 1
#> [9,] 0 0 0 1
#> [10,] 0 0 0 1
Example 2
The bayes_sim()
function assumes a balanced study design
such that the \(p\) explanatory
variables have an equal number of observations. When a vector of sample
sizes is passed into bayes_sim()
for parameter \(n\), each value within \(n\) corresponds to a separate study design
that results in a different design matrix. By specifying \(p\), the resulting \(X_n\) will have \(p\) columns with ones vectors of equal
length running diagonally across the matrix.
<- 3
n <- 4
p ::gen_Xn(rep(n, p))
bayesassurance#> [,1] [,2] [,3] [,4]
#> [1,] 1 0 0 0
#> [2,] 1 0 0 0
#> [3,] 1 0 0 0
#> [4,] 0 1 0 0
#> [5,] 0 1 0 0
#> [6,] 0 1 0 0
#> [7,] 0 0 1 0
#> [8,] 0 0 1 0
#> [9,] 0 0 1 0
#> [10,] 0 0 0 1
#> [11,] 0 0 0 1
#> [12,] 0 0 0 1
It is actually encouraged for users to set Xn = NULL
as
this will greatly facilitate the process of determining the assurances
for multiple sample sizes at once. This saves the trouble of having
users define their own separate design matrices for each sample
size.
gen_Xn_longitudinal()
When working in the longitudinal setting, additional parameters need
to be specified prior to running bayes_sim()
. These
include
longitudinal
: when set to TRUE
, this
informs bayes_sim()
that the simulation will be based in a
longitudinal setting. If Xn = NULL
, the function will
construct a design matrix using inputs that correspond to a balanced
longitudinal study design.ids
: vector of unique subject idsfrom
: start time of repeated measures for each
subjectto
: end time of repeated measures for each subjectThe parameter n
plays a vital role in this setting as
its value(s) correspond to the number of repeated measures each subject
has. Referring to the sequence generating function, seq()
,
in R’s base package, the time points for each subject are determined as
seq(from = from, to = to, length.out = n)
.
The longitudinal setting in bayes_sim()
implicitly
relies on the bayesassurance::gen_Xn_longitudinal()
function to construct its design matrices. The user does not need to
explicitly call on this function when setting Xn = NULL
.
Each unique sample size contained in n
corresponds to a
distinct study design.
The next example shows how gen_Xn_longitudinal()
is
used. \(X_n\) takes the form of the
design matrix highlighted in the linear regression model discussed in
the previous section, comprised of ones vectors in the first set of
columns and vectors of time points in the second half. When
Xn = NULL
and longitudinal = TRUE
, the
bayes_sim()
function will automatically generate a design
matrix using this feature.
** Example **
<- c(1,2,3,4)
ids gen_Xn_longitudinal(ids, from = 1, to = 10, num_repeated_measures = 4)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] 1 0 0 0 1 0 0 0
#> [2,] 1 0 0 0 4 0 0 0
#> [3,] 1 0 0 0 7 0 0 0
#> [4,] 1 0 0 0 10 0 0 0
#> [5,] 0 1 0 0 0 1 0 0
#> [6,] 0 1 0 0 0 4 0 0
#> [7,] 0 1 0 0 0 7 0 0
#> [8,] 0 1 0 0 0 10 0 0
#> [9,] 0 0 1 0 0 0 1 0
#> [10,] 0 0 1 0 0 0 4 0
#> [11,] 0 0 1 0 0 0 7 0
#> [12,] 0 0 1 0 0 0 10 0
#> [13,] 0 0 0 1 0 0 0 1
#> [14,] 0 0 0 1 0 0 0 4
#> [15,] 0 0 0 1 0 0 0 7
#> [16,] 0 0 0 1 0 0 0 10