---
title: "Introduction to cdid"
author: "David Benatia"
date: "2024-12-31"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to cdid}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 72
---
```{r setup, include=FALSE}
# Use the current working directory as knitr's root directory
knitr::opts_knit$set(root.dir = getwd())
```
# Introduction
Recent advances in econometrics have focused on the identification,
estimation, and inference of treatment effects in staggered designs,
i.e., where treatment timing varies across units. Despite the variety
of available approaches, none stands out as universally superior, and
many overlook key challenges such as estimator efficiency or the
frequent occurrence of unbalanced panel data.

In our recent paper [(Bellégo, Benatia, and Dortet-Bernadet, Journal of
Econometrics, 2024)](), we address these gaps by extending the
well-established method developed by [Callaway and Sant'Anna (Journal
of Econometrics, 2021)](). The **cdid** library is therefore intended
to be used in connection with the **did** library. Our approach
improves efficiency and adapts seamlessly to unbalanced panels, making
it particularly valuable for researchers facing data with missing
observations.

This page introduces the **cdid** R library that implements the
methods from our paper, showcasing their relevance for empirical
research. **For those short on time, here are the key takeaways:**
- **cdid** offers greater precision (smaller standard errors) than
  **did** with balanced panel data, by aggregating smaller ATT
  parameters efficiently.
- **cdid** outperforms **did** with unbalanced panels, particularly
  if errors are serially correlated (e.g., in the presence of
  unit-level fixed effects). Note, however, that units observed only
  once across all time periods must be dropped before estimation,
  because they are not used by **cdid**.
- **cdid** is less prone to bias in cases of attrition, notably if
  missingness is related to unobservable heterogeneity. For cases
  where missingness is related to observables, additional results
  from our paper have yet to be implemented in the library (see the
  paper).
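The second point above implies a small data-preparation step: units observed in a single period must be removed before calling **cdid**. A minimal base R sketch (the data frame `df` and its columns are a toy stand-in for your own panel, not part of the cdid API):

```r
# Toy panel: unit 2 is observed in only one period and must be dropped
# before estimation, since single-observation units are not used by cdid.
df <- data.frame(id   = c(1, 1, 2, 3, 3, 3),
                 date = c(1, 2, 1, 1, 2, 3))

obs_per_id <- table(df$id)                       # observations per unit
keep_ids   <- names(obs_per_id)[obs_per_id > 1]  # units seen at least twice
df         <- df[df$id %in% keep_ids, ]          # unit 2 is removed
```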
### Features of the cdid library
**What it supports:**
- Handles any form of missing data.
- Allows for control groups comprising only "never-treated" units or
  also including "not-yet-treated" units.
- Provides two weighting matrix options for GMM-based aggregation of
  $\Delta_k ATT(g,t)$ into $ATT(g,t)$: identity or 2-step.
  Small-sample simulations show that:
  - **Identity**: Best for smaller datasets with more missing data
    and fewer overidentifying restrictions, like rotating panel
    surveys.
  - **2-step**: Best for larger, balanced datasets with many
    overidentifying restrictions, or unbalanced datasets with
    relatively few missings.
- Implements simple propensity scores for treatment assignment, where
  predictor columns $X$ are treated as constant across time periods.
  For time-varying covariates, users can include additional columns
  ($X_t$, $X_{t+1}$, etc.) in the dataset.
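One way to build such period-specific covariate columns is to reshape a long-format covariate into wide form. This is only a sketch of the data preparation; the covariate name `Z` and the resulting column names are illustrative, not part of the cdid API:

```r
# Toy long-format covariate: one row per (unit, period) pair.
long <- data.frame(id   = rep(1:2, each = 3),
                   date = rep(1:3, times = 2),
                   Z    = c(0.1, 0.2, 0.3, 1.1, 1.2, 1.3))

# Spread Z into one column per period: Z_1, Z_2, Z_3.
wide <- reshape(long, idvar = "id", timevar = "date",
                direction = "wide", sep = "_")

# After merging these columns back onto the panel, they could enter the
# propensity score as, e.g., xformla = ~ Z_1 + Z_2 + Z_3.
```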
**Current limitations:**
- The generalized attrition model from our paper, which allows for
dynamic attrition processes (e.g., past outcomes influencing
observation status), is not yet implemented.
- The library currently supports only the IPW estimator, though
extensions to doubly-robust estimators like in the **did** library
are planned.
# Getting started
Installing the library currently requires **remotes**, because we
cannot submit the package to CRAN yet (Happy holidays!). The command is:
``` r
remotes::install_github("joelcuerrier/cdid", ref = "main", build_vignettes = TRUE, force = TRUE)
```
## Balanced data
```{r}
#Load the relevant packages
library(did) #Callaway and Sant'Anna (2021)
library(cdid) #Bellego, Benatia, and Dortet-Bernadet (2024)
set.seed(123)
#Generate a dataset: 500 units, 8 time periods, with unit fixed-effects.
# The parameter sigma_alpha controls the unit-specific time-persistent unobserved
# heterogeneity.
data0=fonction_simu_attrition(N = 500,T = 8,theta2_alpha_Gg = 0.5,
lambda1_alpha_St = 0, sigma_alpha = 2,
sigma_epsilon = 0.1, tprob = 0.5)
# The true values of the coefficients are based on time-to-treatment. The treatment
# effect is zero before treatment, 1.75 one period after, 1.5 two periods after,
# 1.25 three periods after, 1 four periods after, 0.75 five periods after, 0.5 six
# periods after, etc.
#We keep all observations, so we have a balanced dataset
data0$S <- 1
#Look at the data
head(data0,20)
```
The dataset is a dataframe with 3960 observations and 6 columns: $id$ to
keep track of unit ids, $date$ to keep track of time periods, an outcome
variable $Y$, a predictor $X$, a treatment date $date\_G$ (zero for the
control group), and a sampling dummy $S$ used to keep track of which
observations are used in the estimation. There are 495 unique ids and 8
time periods.
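These counts can be verified directly. The checks below use a small toy data frame, but the same lines apply to `data0` (assuming its `id` and `date` columns as described above):

```r
# Toy panel with 3 units and 4 periods; a balanced panel has exactly
# one row per (unit, period) pair.
df <- data.frame(id = rep(1:3, each = 4), date = rep(1:4, times = 3))

n_units   <- length(unique(df$id))
n_periods <- length(unique(df$date))
balanced  <- nrow(df) == n_units * n_periods
```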
```{r}
#run did library on balanced panel
did.results = att_gt(
yname="Y",
tname="date",
idname = "id",
gname = "date_G",
xformla = ~X,
data = data0,
weightsname = NULL,
allow_unbalanced_panel = FALSE,
panel = TRUE,
control_group = "notyettreated",
alp = 0.05,
bstrap = TRUE,
cband = TRUE,
biters = 1000,
clustervars = NULL,
est_method = "ipw",
base_period = "varying",
print_details = FALSE,
pl = FALSE,
cores = 1
)
#run cdid with 2step weighting matrix
result_2step = att_gt_cdid(yname="Y", tname="date",
idname="id",
gname="date_G",
xformla=~X,
data=data0,
control_group="notyettreated",
alp=0.05,
bstrap=TRUE,
biters=1000,
clustervars=NULL,
cband=TRUE,
est_method="2-step",
base_period="varying",
print_details=FALSE,
pl=FALSE,
cores=1)
#run cdid with identity weighting matrix
result_id = att_gt_cdid(yname="Y", tname="date",
idname="id",
gname="date_G",
xformla=~X,
data=data0,
control_group="notyettreated",
alp=0.05,
bstrap=TRUE,
biters=1000,
clustervars=NULL,
cband=TRUE,
est_method="Identity",
base_period="varying",
print_details=FALSE,
pl=FALSE,
cores=1)
```
```{r}
print(did.results)
# Remark that standard errors are smaller for most ATT(g,t) when using cdid
print(result_2step)
print(result_id)
```
Now that we have the ATT(g,t) estimates for all three methods, we can
compare aggregated parameters using the following commands:
```{r}
# Aggregation
agg.es.did <- aggte(MP = did.results, type = 'dynamic')
agg.es.2step <- aggte(MP = result_2step, type = 'dynamic')
agg.es.id <- aggte(MP = result_id, type = 'dynamic')
```
```{r}
# Print results
agg.es.did
```
```{r}
#Remark that standard errors are smaller, notably so for the 2-step estimator
agg.es.2step
```
```{r}
agg.es.id
```
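The efficiency gain of the 2-step weighting can be illustrated with a stylized minimum-distance example. This is not cdid's internal code, and all numbers are hypothetical: it only shows why an inverse-covariance (2-step) weighted average of several noisy estimates of the same parameter has lower variance than the simple average implied by identity weighting.

```r
# Hypothetical estimates Delta_k of a single target parameter, with a
# hypothetical (diagonal) covariance matrix V of those estimates.
deltas <- c(1.2, 0.9, 1.1)
V      <- diag(c(0.04, 0.01, 0.02))

# Identity weighting: simple average of the estimates.
att_identity <- mean(deltas)

# 2-step weighting: inverse-covariance weighted average, which puts more
# weight on the more precisely estimated components.
w         <- solve(V) %*% rep(1, 3)
att_2step <- sum(w * deltas) / sum(w)
```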
## Unbalanced data
Let us now do the same but with an unbalanced panel dataset.
```{r}
#Generate a dataset: 500 units, 8 time periods, with unit fixed-effects (alpha)
set.seed(123)
data0=fonction_simu_attrition(N = 500,T = 8,theta2_alpha_Gg = 0.5,
lambda1_alpha_St = 0, sigma_alpha = 2,
sigma_epsilon = 0.1, tprob = 0.5)
#We discard observations based on sampling indicator S
data0 <- data0[data0$S==1,]
#run did
did.results = att_gt(
yname="Y",
tname="date",
idname = "id",
gname = "date_G",
xformla = ~X,
data = data0,
weightsname = NULL,
allow_unbalanced_panel = FALSE,
panel = FALSE,
control_group = "notyettreated",
alp = 0.05,
bstrap = TRUE,
cband = TRUE,
biters = 1000,
clustervars = NULL,
est_method = "ipw",
base_period = "varying",
print_details = FALSE,
pl = FALSE,
cores = 1
)
#run cdid with 2step weighting matrix
result_2step = att_gt_cdid(yname="Y", tname="date",
idname="id",
gname="date_G",
xformla=~X,
data=data0,
control_group="notyettreated",
alp=0.05,
bstrap=TRUE,
biters=1000,
clustervars=NULL,
cband=TRUE,
est_method="2-step",
base_period="varying",
print_details=FALSE,
pl=FALSE,
cores=1)
#run cdid with identity weighting matrix
result_id = att_gt_cdid(yname="Y", tname="date",
idname="id",
gname="date_G",
xformla=~X,
data=data0,
control_group="notyettreated",
alp=0.05,
bstrap=TRUE,
biters=1000,
clustervars=NULL,
cband=TRUE,
est_method="Identity",
base_period="varying",
print_details=FALSE,
pl=FALSE,
cores=1)
```
```{r}
#Note the precision gains
print(did.results)
print(result_2step)
print(result_id)
```
```{r}
# Aggregation
# There are other ways to aggregate, see the did library
agg.es.did <- aggte(MP = did.results, type = 'dynamic')
agg.es.2step <- aggte(MP = result_2step, type = 'dynamic')
agg.es.id <- aggte(MP = result_id, type = 'dynamic')
```
```{r}
#Note the precision gains
agg.es.did
```
```{r}
agg.es.2step
```
```{r}
agg.es.id
```