---
title: "Introduction to cdid"
author: "David Benatia"
date: "2024-12-31"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to cdid}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 72
---
```{r setup, include=FALSE}
# Use the current working directory as knitr's root directory
knitr::opts_knit$set(root.dir = getwd())
```
# Introduction
Recent advances in econometrics have focused on the identification,
estimation, and inference of treatment effects in staggered designs,
i.e., where treatment timing varies across units. Despite the variety
of available approaches, none stands out as universally superior, and
many overlook key challenges such as estimator efficiency or the
frequent occurrence of unbalanced panel data.

In our recent paper [(Bellégo, Benatia, and Dortet-Bernadet, Journal of
Econometrics, 2024)](), we address these gaps by extending the
well-established method developed by [Callaway and Sant'Anna (Journal
of Econometrics, 2021)](). The **cdid** library is therefore intended
to be used in connection with the **did** library. Our approach
improves efficiency and adapts seamlessly to unbalanced panels, making
it particularly valuable for researchers facing data with missing
observations.

This page introduces the **cdid** R library that implements the
methods from our paper, showcasing their relevance for empirical
research. **For those short on time, here are the key takeaways:**
- **cdid** offers greater precision (smaller standard errors) than
  **did** with balanced panel data, by aggregating smaller ATT
  parameters efficiently.
- **cdid** outperforms **did** with unbalanced panels, particularly
  if errors are serially correlated (e.g., in the presence of
  unit-level fixed effects). Note, however, that units observed only
  once across all time periods must be dropped before estimation,
  because they are not used by **cdid**.
- **cdid** is less prone to bias in cases of attrition, notably if
  missingness is related to unobservable heterogeneity. For cases
  where missingness is related to observables, additional results
  from our paper have yet to be implemented in the library (see the
  paper).
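The second point above implies a small data-preparation step: units observed in a single period must be removed before calling **cdid**. A minimal base R sketch (the data frame `df` and its columns are a toy stand-in for your own panel, not part of the cdid API):

```r
# Toy panel: unit 2 is observed in only one period and must be dropped
# before estimation, since single-observation units are not used by cdid.
df <- data.frame(id   = c(1, 1, 2, 3, 3, 3),
                 date = c(1, 2, 1, 1, 2, 3))

obs_per_id <- table(df$id)                       # observations per unit
keep_ids   <- names(obs_per_id)[obs_per_id > 1]  # units seen at least twice
df         <- df[df$id %in% keep_ids, ]          # unit 2 is removed
```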
### Features of the cdid library
**What it supports:**
- Handles any form of missing data.
- Allows for control groups comprising only "never-treated" units or
  also including "not-yet-treated" units.
- Provides two weighting matrix options for GMM-based aggregation of
  $\Delta_k ATT(g,t)$ into $ATT(g,t)$: identity or 2-step.
  Small-sample simulations show that:
  - **Identity**: Best for smaller datasets with more missing data
    and fewer overidentifying restrictions, like rotating panel
    surveys.
  - **2-step**: Best for larger, balanced datasets with many
    overidentifying restrictions, or unbalanced datasets with
    relatively few missings.
- Implements simple propensity scores for treatment assignment, where
  predictor columns $X$ are treated as constant across time periods.
  For time-varying covariates, users can include additional columns
  ($X_t$, $X_{t+1}$, etc.) in the dataset.
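One way to build such period-specific covariate columns is to reshape a long-format covariate into wide form. This is only a sketch of the data preparation; the covariate name `Z` and the resulting column names are illustrative, not part of the cdid API:

```r
# Toy long-format covariate: one row per (unit, period) pair.
long <- data.frame(id   = rep(1:2, each = 3),
                   date = rep(1:3, times = 2),
                   Z    = c(0.1, 0.2, 0.3, 1.1, 1.2, 1.3))

# Spread Z into one column per period: Z_1, Z_2, Z_3.
wide <- reshape(long, idvar = "id", timevar = "date",
                direction = "wide", sep = "_")

# After merging these columns back onto the panel, they could enter the
# propensity score as, e.g., xformla = ~ Z_1 + Z_2 + Z_3.
```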
**Current limitations:**
- The generalized attrition model from our paper, which allows for
dynamic attrition processes (e.g., past outcomes influencing
observation status), is not yet implemented.
- The library currently supports only the IPW estimator, though
extensions to doubly-robust estimators like in the **did** library
are planned.
# Getting started
Installing the library currently requires **remotes**, because we
cannot submit the package to CRAN yet (Happy holidays!). The command is:
``` r
remotes::install_github("joelcuerrier/cdid", ref = "main", build_vignettes = TRUE, force = TRUE)
```
## Balanced data
```{r}
#Load the relevant packages
library(did) #Callaway and Sant'Anna (2021)
library(cdid) #Bellego, Benatia, and Dortet-Bernadet (2024)
set.seed(123)
#Generate a dataset: 500 units, 8 time periods, with unit fixed-effects.
# The parameter sigma_alpha controls the unit-specific time-persistent unobserved
# heterogeneity.
data0=fonction_simu_attrition(N = 500,T = 8,theta2_alpha_Gg = 0.5,
lambda1_alpha_St = 0, sigma_alpha = 2,
sigma_epsilon = 0.1, tprob = 0.5)
# The true values of the coefficients are based on time-to-treatment. The treatment
# effect is zero before treatment, 1.75 one period after, 1.5 two periods after,
# 1.25 three periods after, 1 four periods after, 0.75 five periods after, 0.5 six
# periods after, etc.
#We keep all observations, so we have a balanced dataset
data0$S <- 1
#Look at the data
head(data0,20)
```
The dataset is a dataframe with 3960 observations and 6 columns: $id$ to
keep track of unit ids, $date$ to keep track of time periods, an outcome
variable $Y$, a predictor $X$, a treatment date $date\_G$ (zero for the
control group), and a sampling dummy $S$ used to keep track of which
observations are used in the estimation. There are 495 unique ids and 8
time periods.
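These counts can be verified directly. The checks below use a small toy data frame, but the same lines apply to `data0` (assuming its `id` and `date` columns as described above):

```r
# Toy panel with 3 units and 4 periods; a balanced panel has exactly
# one row per (unit, period) pair.
df <- data.frame(id = rep(1:3, each = 4), date = rep(1:4, times = 3))

n_units   <- length(unique(df$id))
n_periods <- length(unique(df$date))
balanced  <- nrow(df) == n_units * n_periods
```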
```{r}
#run did library on balanced panel
did.results = att_gt(
yname="Y",
tname="date",
idname = "id",
gname = "date_G",
xformla = ~X,
data = data0,
weightsname = NULL,
allow_unbalanced_panel = FALSE,
panel = TRUE,
control_group = "notyettreated",
alp = 0.05,
bstrap = TRUE,
cband = TRUE,
biters = 1000,
clustervars = NULL,
est_method = "ipw",
base_period = "varying",
print_details = FALSE,
pl = FALSE,
cores = 1
)
#run cdid with 2step weighting matrix
result_2step = att_gt_cdid(yname="Y", tname="date",
idname="id",
gname="date_G",
xformla=~X,
data=data0,
control_group="notyettreated",
alp=0.05,
bstrap=TRUE,
biters=1000,
clustervars=NULL,
cband=TRUE,
est_method="2-step",
base_period="varying",
print_details=FALSE,
pl=FALSE,
cores=1)
#run cdid with identity weighting matrix
result_id = att_gt_cdid(yname="Y", tname="date",
idname="id",
gname="date_G",
xformla=~X,
data=data0,
control_group="notyettreated",
alp=0.05,
bstrap=TRUE,
biters=1000,
clustervars=NULL,
cband=TRUE,
est_method="Identity",
base_period="varying",
print_details=FALSE,
pl=FALSE,
cores=1)
```
```{r}
print(did.results)
# Remark that standard errors are smaller for most ATT(g,t) when using cdid
print(result_2step)
print(result_id)
```
Now that we have the ATT(g,t) estimates for all three methods, we can
compare aggregated parameters using the following commands:
```{r}
# Aggregation
agg.es.did <- aggte(MP = did.results, type = 'dynamic')
agg.es.2step <- aggte(MP = result_2step, type = 'dynamic')
agg.es.id <- aggte(MP = result_id, type = 'dynamic')
```
```{r}
# Print results
agg.es.did
```
```{r}
#Remark that standard errors are smaller, notably so for the 2-step estimator
agg.es.2step
```
```{r}
agg.es.id
```
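The efficiency gain of the 2-step weighting can be illustrated with a stylized minimum-distance example. This is not cdid's internal code, and all numbers are hypothetical: it only shows why an inverse-covariance (2-step) weighted average of several noisy estimates of the same parameter has lower variance than the simple average implied by identity weighting.

```r
# Hypothetical estimates Delta_k of a single target parameter, with a
# hypothetical (diagonal) covariance matrix V of those estimates.
deltas <- c(1.2, 0.9, 1.1)
V      <- diag(c(0.04, 0.01, 0.02))

# Identity weighting: simple average of the estimates.
att_identity <- mean(deltas)

# 2-step weighting: inverse-covariance weighted average, which puts more
# weight on the more precisely estimated components.
w         <- solve(V) %*% rep(1, 3)
att_2step <- sum(w * deltas) / sum(w)
```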
## Unbalanced data
Let us now do the same but with an unbalanced panel dataset.
```{r}
#Generate a dataset: 500 units, 8 time periods, with unit fixed-effects (alpha)
set.seed(123)
data0=fonction_simu_attrition(N = 500,T = 8,theta2_alpha_Gg = 0.5,
lambda1_alpha_St = 0, sigma_alpha = 2,
sigma_epsilon = 0.1, tprob = 0.5)
#We discard observations based on sampling indicator S
data0 <- data0[data0$S==1,]
#run did
did.results = att_gt(
yname="Y",
tname="date",
idname = "id",
gname = "date_G",
xformla = ~X,
data = data0,
weightsname = NULL,
allow_unbalanced_panel = FALSE,
panel = FALSE,
control_group = "notyettreated",
alp = 0.05,
bstrap = TRUE,
cband = TRUE,
biters = 1000,
clustervars = NULL,
est_method = "ipw",
base_period = "varying",
print_details = FALSE,
pl = FALSE,
cores = 1
)
#run cdid with 2step weighting matrix
result_2step = att_gt_cdid(yname="Y", tname="date",
idname="id",
gname="date_G",
xformla=~X,
data=data0,
control_group="notyettreated",
alp=0.05,
bstrap=TRUE,
biters=1000,
clustervars=NULL,
cband=TRUE,
est_method="2-step",
base_period="varying",
print_details=FALSE,
pl=FALSE,
cores=1)
#run cdid with identity weighting matrix
result_id = att_gt_cdid(yname="Y", tname="date",
idname="id",
gname="date_G",
xformla=~X,
data=data0,
control_group="notyettreated",
alp=0.05,
bstrap=TRUE,
biters=1000,
clustervars=NULL,
cband=TRUE,
est_method="Identity",
base_period="varying",
print_details=FALSE,
pl=FALSE,
cores=1)
```
```{r}
#Note the precision gains
print(did.results)
print(result_2step)
print(result_id)
```
```{r}
# Aggregation
# There are other ways to aggregate, see the did library
agg.es.did <- aggte(MP = did.results, type = 'dynamic')
agg.es.2step <- aggte(MP = result_2step, type = 'dynamic')
agg.es.id <- aggte(MP = result_id, type = 'dynamic')
```
```{r}
#Note the precision gains
agg.es.did
```
```{r}
agg.es.2step
```
```{r}
agg.es.id
```