The disaggregation package contains functions to run Bayesian disaggregation models. Aggregated response data over large heterogeneous regions can be used alongside fine-scale covariate information to predict fine-scale response across the region. The github page for this package can be found here.
Install disaggregation using:
or from github using
The key functions are prepare_data
,
disag_model
and predict
. The
prepare_data
function takes the aggregated data and
covariate data to be used in the model and produces an object to be use
by disag_model
. This functions runs the disaggregation
model and the out can be passed to predict
to produce
fine-scale predicted maps of the response variable.
To use the disaggregation prepare_data
function, you
must have the aggregated data as a sf
object and a
SpatRaster
of the covariate data to be used in the
model.
We will demonstrate an example of the disaggregation
package using areal data of leukemia incidence in New York, using data
from the package SpatialEpi
.
library(SpatialEpi, quietly = TRUE)
library(dplyr, quietly = TRUE)
library(disaggregation, quietly = TRUE)
library(ggplot2)
library(sf)
library(terra)
polygons <- sf::st_as_sf(NYleukemia$spatial.polygon)
df <- cbind(polygons, NYleukemia$data)
ggplot() + geom_sf(data = df, aes(fill = cases / population)) + scale_fill_viridis_c(lim = c(0, 0.003))
Now we simulate two covariate rasters for the area of interest and
make a two-layered SpatRaster
. They are simulated at the
resolution of approximately 1km2.
bbox <- sf::st_bbox(df)
extent_in_km <- 111*(bbox[c(3, 4)] - bbox[c(1, 2)])
n_pixels_x <- floor(extent_in_km[[1]])
n_pixels_y <- floor(extent_in_km[[2]])
r <- terra::rast(ncols = n_pixels_x, nrows = n_pixels_y)
terra::ext(r) <- terra::ext(df)
data_generate <- function(x){
rnorm(1, ifelse(x %% n_pixels_x != 0, x %% n_pixels_x, n_pixels_x), 3)
}
terra::values(r) <- sapply(seq(terra::ncell(r)), data_generate)
r2 <- terra::rast(ncol = n_pixels_x, nrow = n_pixels_y)
terra::ext(r2) <- terra::ext(df)
terra::values(r2) <- sapply(seq(terra::ncell(r2)),
function(x) rnorm(1, ceiling(x/n_pixels_y), 3))
cov_stack <- terra::rast(list(r, r2))
cov_stack <- terra::scale(cov_stack)
names(cov_stack) <- c('layer1', 'layer2')
We also create a population raster. This is to allow the model to correctly aggregated the pixel values to the polygon level. For this simple example we assume that the population within each polygon is uniformly distributed.
extracted <- terra::extract(r, terra::vect(df$geometry), fun = sum)
n_cells <- terra::extract(r, terra::vect(df$geometry), fun = length)
df$pop_per_cell <- df$population/n_cells$lyr.1
pop_raster <- terra::rasterize(terra::vect(df), cov_stack, field = 'pop_per_cell')
To correct small inconsistencies in the polygon geometry, we run the code below.
Now we have setup the data we can use the prepare_data
function to create the objects needed to run the disaggregation model.
The name of the response variable and id variable in the sf
object should be specified.
The user can also control the parameters of the mesh that is used to
create the spatial field. The mesh is created by finding a tight
boundary around the polygon data, and creating a fine mesh within the
boundary and a coarser mesh outside. This speeds up computation time by
only having a very fine mesh within the area of interest and having a
small region outside with a coarser mesh to avoid edge effects. The mesh
parameters: concave
, convex
and
resolution
refer to the parameters used to create the mesh
boundary using the fm_nonconvex_hull_inla
function, while the mesh parameters max.edge
,
cut
and offset
refer to the parameters used to
create the mesh using the fm_mesh_2d
function.
data_for_model <- prepare_data(polygon_shapefile = df,
covariate_rasters = cov_stack,
aggregation_raster = pop_raster,
response_var = 'cases',
id_var = 'censustract.FIPS',
mesh_args = list(cutoff = 0.01,
offset = c(0.1, 0.5),
max.edge = c(0.1, 0.2),
resolution = 250),
na_action = TRUE)
Now we have our data object we are ready to run the model. Here we can specify the likelihood function as Gaussian, binomial or poisson, and we can specify the link function as logit, log or identity. The disaggregation model makes predictions at the pixel level:
\(link(pred_i) = \beta_0 + \beta X + GP(s_i) + u_i\)
where \(X\) are the covariates, \(GP\) is the Gaussian random field and \(u_i\) is the iid random effect. The pixel predictions are then aggregated to the polygon level using the weighted sum (via the aggregation raster, \(agg_i\)):
\(cases_j = \sum_{i \epsilon j} pred_i \times agg_i\)
\(rate_j = \frac{\sum_{i \epsilon j} pred_i \times agg_i}{\sum_{i \epsilon j} agg_i}\)
The different likelihood correspond to slightly different models (\(y_j\) is the response count data):
Gaussian (\(\sigma_j\) is the dispersion of the polygon data),
\(dnorm(y_j/\sum agg_i, rate_j, \sigma_j)\)
Here \(\sigma_j = \sigma \sqrt{\sum agg_i^2} / \sum agg_i\), where \(\sigma\) is the dispersion of the pixel data, a parameter learnt by the model.
Binomial (For a survey in polygon j, \(y_j\) is the number positive and \(N_j\) is the number tested)
\(dbinom(y_j, N_j, rate_j)\)
Poisson (predicts incidence count)
\(dpois(y_j, cases_j)\)
The user can also specify the priors for the regression parameters. For the field, the user specifies the pc priors for the range, \(\rho_{min}\) and \(\rho_{prob}\), where \(P(\rho < \rho_{min}) = \rho_{prob}\), and the variation, \(\sigma_{min}\) and \(\sigma_{prob}\), where \(P(\sigma > \sigma_{min}) = \sigma_{prob}\), in the field. For the iid effect, the user also specifies pc priors.
By default the model contains a spatial field and a polygon iid
effect. These can be turned off in the disag_model
function, using field = FALSE
or
iid = FALSE
.
model_result <- disag_model(data_for_model,
iterations = 1000,
family = 'poisson',
link = 'log',
priors = list(priormean_intercept = 0,
priorsd_intercept = 2,
priormean_slope = 0.0,
priorsd_slope = 0.4,
prior_rho_min = 3,
prior_rho_prob = 0.01,
prior_sigma_max = 1,
prior_sigma_prob = 0.01,
prior_iideffect_sd_max = 0.05,
prior_iideffect_sd_prob = 0.01))
#> Fitting model. This may be slow.
Now we have the results from the model of the fitted parameters, we
can predict Leukemia incidence rate at fine-scale (the scale of the
covariate data) across New York. The predict
function takes
the model result and predicts both the mean raster surface and predicts
and summarises N
parameter draws, where N
is
set by the user (default 100). The uncertainty is summarised via the
confidence interval set by the user (default 95% CI).