A ROLLING METHOD FOR CAUSAL INFERENCE: A FLEXIBLE ESTIMATION METHOD FOR DIVERSE PANEL DATA SETUPS

By Soo Jeong Lee

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics—Doctor of Philosophy

2025

ABSTRACT

This dissertation develops new methods for estimating treatment effects using panel data, with a focus on flexible and robust identification under treatment effect heterogeneity across units and over time. Chapter 1 provides an overview of the subsequent chapters. Chapter 2 introduces a simple time-series transformation – termed the Rolling Method – for Difference-in-Differences estimation. This method converts panel data into a sequence of cross-sectional datasets through unit-specific outcome transformations, enabling consistent estimation of group-time-specific treatment effects even in the presence of treatment heterogeneity. Chapter 3 extends the Rolling Method to small-sample settings, particularly when the number of treated or control units is limited, and demonstrates improved finite-sample inference properties. Chapter 4 further generalizes the framework to accommodate dynamic treatment paths, allowing for treatment reversals and subgroup-specific moderating effects. This extension is applied in an empirical case study examining the effects of the entry and exit of chain pharmacies in rural areas.

Copyright by SOO JEONG LEE 2025

ACKNOWLEDGEMENTS

Above all, I extend my deepest gratitude to my advisor and mentor, Dr. Jeffrey M. Wooldridge. The excitement in his eyes during our research discussions has been a constant source of inspiration. I aspire to follow his example – to be a scholar who speaks with passion, wisdom, and boundless curiosity. Under his guidance, I have learned invaluable lessons that I will carry with me as I mentor future students, sharing the joy of economics and the fulfillment that comes from contributing to the world through research. I am sincerely grateful to my committee members – Dr. Antonio Galvao, Dr. Kyoo il Kim, and Dr. Haolei Weng – for their invaluable feedback and encouragement, which have been instrumental to my growth throughout this journey. I would also like to extend special thanks to Dr. Timothy Vogelsang and Dr. Hugo Freeman. Being part of the MSU econometrics group has been an extraordinary privilege, and I could not have been happier to be a part of it. I am deeply grateful to the faculty and friends in the MSU Economics Department. Your presence made the joyful moments even brighter and the challenging times easier to navigate. I also extend my heartfelt thanks to my EPIC Grad Lab friends and my Project Investigator, Dr. Tara Kilbride. A special thank you to Dr. Todd Elder, whose support as DGS made it possible for me to join the program and become part of such an inspiring community. Finally, I am thankful to my family. To my parents, Sunduk Jeong and Jinrak Lee – the most loving couple I know – thank you for your unwavering support and for always reminding me, “We believe in you. You always do well.” To my older sister, Eunjeong, who gifted me my very first flight ticket to the U.S. and took on all family responsibilities so I could devote myself fully to my studies – thank you. And to my younger sister, Minjeong – I could not love you more. To all who have stood by me through moments of struggle, happiness, and solitude – thank you! My journey as an economist will never cease, thanks to your unwavering support.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
CHAPTER 2 A SIMPLE TRANSFORMATION APPROACH TO DIFFERENCE-IN-DIFFERENCES ESTIMATION FOR PANEL DATA
 BIBLIOGRAPHY
 APPENDIX 2A PROOF OF THEOREM 2.2
 APPENDIX 2B PROOF OF THE DETRENDING PROCEDURE
 APPENDIX 2C PRE-PERIOD DYNAMICS AND EVENT STUDY PLOT
 APPENDIX 2D ESTIMATION RESULTS
CHAPTER 3 SIMPLE APPROACHES TO INFERENCE WITH DIFFERENCE-IN-DIFFERENCES ESTIMATORS WITH SMALL CROSS-SECTIONAL SAMPLE SIZES
 BIBLIOGRAPHY
CHAPTER 4 ROLLING APPROACH TO DIFFERENCE-IN-DIFFERENCES: EXPLORING TREATMENT REVERSIBILITY AND MODERATING EFFECTS
 BIBLIOGRAPHY

CHAPTER 1
INTRODUCTION

This dissertation advances estimation and inference techniques for panel data by developing novel methods for treatment effect estimation. By introducing a time-series outcome transformation, addressing heterogeneous treatment effects, and enhancing inference in small-sample settings, it offers more robust and flexible tools for policy evaluation and empirical research.

Chapter 2 introduces a time-series transformation technique – termed the Rolling Method – that can be combined with various treatment effect estimators, including regression adjustment, matching methods, and doubly robust estimators. Unlike conventional approaches that assume constant treatment effects across units and over time, our method accounts for treatment effect heterogeneity. In the common timing case, we show that applying the transformation with linear regression adjustment numerically reproduces the pooled OLS estimator in Wooldridge (2021), which is the Best Linear Unbiased Estimator (BLUE) under classical assumptions. The transformation is at the unit level, and simply requires computing the average outcome prior to an intervention, subtracting it from a post-treatment outcome, and then carefully selecting the control units for each time period. We show that, allowing for staggered entry under no anticipation and parallel trends assumptions, the cohort treatment indicators satisfy the key unconfoundedness assumption with respect to the transformed potential outcome. Given identification, any number of treatment effect estimators can be applied for each treated cohort and calendar time pair where the average treatment effects on the treated are identified. The doubly robust method of combining inverse probability weighting with linear regression (IPWRA) works particularly well in terms of bias and efficiency. Importantly, our transformation is easily modified to account for unit-specific trends. We illustrate the method using empirical data on Walmart’s entry into local labor markets. The application demonstrates how the transformation approach yields more reliable estimates of employment effects, particularly in cases where traditional methods may suffer from bias and inefficiency due to pre-existing trends.

Chapter 3 proposes simple methods for valid inference in panel data difference-in-differences (DiD) settings with a small number of treated units or a small number of control units (including a small number of both).
The approach uses a suitable transformation to collapse the panel data set into a cross-sectional data set. If the classical linear model assumptions hold in the cross-sectional regression, exact inference is available – even in cases with only two control units and one treated unit. The approach likely works best with a large number of time periods before and after the intervention so that the central limit theorem across time makes normality a better approximation. Nevertheless, under joint normality and homoskedasticity of the time-varying errors, the exact approach can be applied with few time periods – both before and after the intervention – as well as few cross-sectional units. In addition to standard DiD estimation, the approach permits removal of unit-specific trends. With large enough sample sizes, control variables may be included. If the cross-sectional sample size is not too small, one can use a particular heteroskedasticity-robust standard error. We illustrate the approach using increased smoking restrictions in California, where there is only a single treated unit, as an alternative to the synthetic control approach. In the staggered intervention case, we reexamine the expansion of so-called “castle” laws in the United States.

Chapter 4 extends the Rolling Method framework to accommodate complex treatment patterns and moderating effects. I propose a novel aggregation strategy suitable for measuring treatment effects with staggered entry and exit of treatment. Furthermore, I develop a two-stage IPWRA estimator that incorporates moderating effects, allowing for heterogeneous treatment effects across subgroups defined by observed characteristics. This extension enables a more nuanced understanding of how treatment effects evolve over time and vary across populations. To demonstrate its practical utility, I apply the method to evaluate the impact of chain pharmacy entry and exit on pharmacy access in rural areas. The results reveal substantial heterogeneity in treatment effects across three distinct trajectories – initial treatment, exit, and re-entry – that conventional approaches may overlook. This framework offers applied researchers a robust and flexible tool for analyzing dynamic and heterogeneous treatment paths.

CHAPTER 2
A SIMPLE TRANSFORMATION APPROACH TO DIFFERENCE-IN-DIFFERENCES ESTIMATION FOR PANEL DATA
(CO-AUTHORED WITH JEFFREY M. WOOLDRIDGE)

The co-author has approved the inclusion of this co-authored chapter. The co-author’s contact: Jeffrey M. Wooldridge, Department of Economics, 486 W. Circle Drive, 110 Marshall-Adams Hall, Michigan State University, East Lansing, MI 48824-1038. Email: wooldri1@msu.edu

2.1 Introduction

The two-way fixed effects (TWFE) estimator, applied to a linear panel model with a constant treatment effect, has been widely used in difference-in-differences settings. The TWFE estimator of a single effect is simple to understand and is taught in courses that cover panel data methods. Recently, several authors have pointed out shortcomings of the constant effect model under staggered interventions, including Borusyak and Jaravel (2018), Goodman-Bacon (2021), and De Chaisemartin and d’Haultfoeuille (2020), who propose alternative representations of the simple TWFE estimator when the treatment effects (TEs) are heterogeneous across treatment cohort or calendar time. Other authors have proposed more flexible estimation methods that uncover average treatment effects on the treated in the staggered intervention case. These include Callaway and Sant’Anna (2021) [CS (2021)], who propose long-differencing strategies and apply standard treatment effect estimators. Sun and Abraham (2021) [SA (2021)] propose a fixed effects estimator applied to a more flexible model.
Both SA (2021) and CS (2021) are event-study-type estimators that use only the single period prior to the first intervention time as the control period. Wooldridge (2021) shows that a pooled OLS (POLS) strategy that includes cohort and calendar time interactions, as well as interactions of cohort dummies, time period dummies, and the treatment indicators with covariates, identifies the ATTs under standard no anticipation and parallel trends assumptions. These estimators effectively use all pre-treatment periods and all not-yet-treated units in the control group. Wooldridge (2021) also shows the POLS is equivalent to a TWFE estimator on an expanded equation that includes interactions of cohort and time dummies with each other and with covariates. Borusyak et al. (2024) propose imputation estimators based on pooled OLS regressions that can include unit and time fixed effects. Wooldridge (2021) shows that, with time-constant covariates, the imputation estimates are identical to the POLS (and therefore TWFE) estimation of the flexible model using the entire sample.

An attractive feature of the CS (2021) approach, which builds on Abadie (2005) for the two-period case, is that it permits the application of treatment effects estimators beyond regression adjustment. However, as mentioned above, the CS (2021) method uses only the period just prior to the intervention in defining the control group, thereby discarding potentially useful information in earlier time periods. In fact, Wooldridge (2021) shows that, under the standard “error components” structure on the error, with a homoskedastic time-constant component and homoskedastic and serially uncorrelated idiosyncratic errors, the POLS estimator is both best linear unbiased (BLUE) and asymptotically efficient. These theoretical results imply that the CS (2021) estimators are inefficient under a standard set of assumptions. The simulations in Wooldridge (2021) bear this out, showing the CS approach can be very inefficient. Under strong forms of positive serial correlation, CS (2021) can be more efficient because it is similar to a first-differencing estimator. Balanced against the loss in precision is that the CS approach can be less biased when parallel trends are violated, although there is no guarantee. See De Chaisemartin and d’Haultfoeuille (2023) and Wooldridge (2021) for further discussion.

In this paper, we propose an alternative “rolling” approach that allows for the application of many different treatment effects estimators while maintaining much of the efficiency of regression-based methods. The idea is to use as many control observations as possible – in both the common timing and staggered cases – while permitting methods such as inverse probability weighting (IPW), doubly robust methods such as the one in Wooldridge (2007) that combines regression and IPW (IPWRA), and matching on covariates or the propensity score. Like CS (2021) in the panel data case, our approach is based on time series transformations at the unit level. Rather than using long differences, we show how to use all suitable control observations in transforming the outcome variable.
This leads to significant improvements in efficiency compared with CS (2021) and allows one substantial flexibility in the choice of treatment effects estimators.

In the case of common timing – so that there is one average treatment effect per post-treatment period – we show that applying regression adjustment to our transformed outcome variable is equivalent to the regression adjustment estimator based on levels. As mentioned above, Wooldridge (2021) shows that this estimator is both BLUE and asymptotically efficient under standard assumptions. This provides strong motivation for applying estimators other than regression adjustment to the transformed variables in order to check robustness of findings. In the case of staggered entry, our approach identifies the average treatment effects on the treated (ATTs) by cohort and calendar time under the same no anticipation and parallel trends assumptions as in Wooldridge (2021). We show this in both the common timing and staggered intervention cases. Once identification is established, various estimation methods can be applied.

The remainder of the paper is organized as follows. Section 2.2 begins with the common timing case, defining the potential outcomes and parameters of interest, and establishing identification assumptions. In Section 2.3 we propose a general approach to estimation using a transformed outcome variable. In Section 2.4 we extend the framework and identification argument to the staggered case. In Section 2.5, we show how we can account for heterogeneous trends, focusing on linear trends, to allow violation of the parallel trends assumption (even after we condition on covariates). Section 2.6 discusses how one might accommodate suspected failures of no anticipation, and how one modifies the procedure for unbalanced panels. In Section 2.7, we revisit the effect of Walmart’s opening on the local labor markets. Section 2.8 contains some concluding remarks. Supplementary Appendix 2B.1 presents Monte Carlo simulations demonstrating that the rolling methods perform well in terms of both bias and precision.

2.2 Setup and Identification with Common Timing

In this section, we assume that the date of the intervention is the same for all treated units and that the intervention then remains in place through the final period. The time periods in the population are 𝑡 = 1, 2, . . . , 𝑇 and the date of the intervention is 𝑆, where 1 < 𝑆 ≤ 𝑇; in other words, there is at least one pre-treatment period. The arguments in this section are based on an underlying population, and so we use {𝑌𝑡(0), 𝑌𝑡(1) : 𝑡 = 1, . . . , 𝑇} to denote the time series of outcomes in the control and treated states. The binary time-constant treatment indicator is 𝐷, where 𝐷 = 1 means treatment starting in period 𝑆 and lasting through period 𝑇. A time-varying treatment indicator is 𝑊𝑡 = 𝐷 · 𝑝𝑡, where 𝑝𝑡 is a post-treatment indicator equal to one if 𝑡 ≥ 𝑆 and zero otherwise. Without treated units prior to time 𝑆 we can, at most, hope to identify average treatment effects in periods 𝑆, 𝑆 + 1, . . . , 𝑇. Our focus here, like almost all of the other recent literature, is on the average treatment effect on the treated (ATT or ATET) in each treated period:

𝜏𝑟 = 𝐸[𝑌𝑟(1) − 𝑌𝑟(0) | 𝐷 = 1], 𝑟 = 𝑆, . . . , 𝑇 (2.1)

The methods we propose can, under stronger assumptions than we impose, recover the overall average treatment effects (ATEs), 𝐸[𝑌𝑟(1) − 𝑌𝑟(0)], and we will mention how that can be done.
The fundamental problem of identification of 𝜏𝑟 is that we only observe the treatment status, 𝐷, and the outcome

𝑌𝑟 = (1 − 𝐷) · 𝑌𝑟(0) + 𝐷 · 𝑌𝑟(1) (2.2)

Importantly, when 𝐷 = 1, 𝑌𝑟 = 𝑌𝑟(1), which means

𝐸[𝑌𝑟(1) | 𝐷 = 1] = 𝐸(𝑌𝑟 | 𝐷 = 1), 𝑟 = 𝑆, . . . , 𝑇 (2.3)

The expectation 𝐸(𝑌𝑟 | 𝐷 = 1) can be estimated in a consistent, even unbiased, way under various sampling schemes. Under random sampling, the average of 𝑌𝑟 across the treated subsample is unbiased and consistent. Therefore, writing 𝜏𝑟 = 𝐸[𝑌𝑟(1) | 𝐷 = 1] − 𝐸[𝑌𝑟(0) | 𝐷 = 1] = 𝐸(𝑌𝑟 | 𝐷 = 1) − 𝐸[𝑌𝑟(0) | 𝐷 = 1], it is easily seen that the challenge is in identifying 𝐸[𝑌𝑟(0) | 𝐷 = 1]. If the treatment is randomly assigned with respect to 𝑌𝑟(0), then 𝐸[𝑌𝑟(0) | 𝐷 = 1] = 𝐸[𝑌𝑟(0) | 𝐷 = 0]. Because 𝑌𝑟 = 𝑌𝑟(0) when 𝐷 = 0, 𝐸[𝑌𝑟(0) | 𝐷 = 0] = 𝐸(𝑌𝑟 | 𝐷 = 0) is consistently estimated using the control units under various sampling schemes. Under random sampling, one would use the sample average of 𝑌𝑟 across the control units. The resulting estimator of 𝜏𝑟 would be the simple difference in sample means between the treated and control units in period 𝑟.

The assumption of random assignment is too strong for most applications. To see how to relax it, use simple algebra to write

𝑌𝑟(1) − 𝑌𝑟(0) = [𝑌𝑟(1) − (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝑌𝑞(1)] − [𝑌𝑟(0) − (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝑌𝑞(0)] + (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} [𝑌𝑞(1) − 𝑌𝑞(0)]
≡ Ẏ𝑟(1) − Ẏ𝑟(0) + (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} [𝑌𝑞(1) − 𝑌𝑞(0)] (2.4)

where

Ẏ𝑟(1) ≡ 𝑌𝑟(1) − (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝑌𝑞(1) (2.5)

and similarly for Ẏ𝑟(0). Note that for each 𝑟 ∈ {𝑆, 𝑆 + 1, . . . , 𝑇}, Ẏ𝑟(1) is the time 𝑟 potential outcome with the average of the pre-treatment period outcomes removed. The third term is the average of the pre-treatment period “treatment” effects, 𝑌𝑞(1) − 𝑌𝑞(0). Given the representation in (2.4), we can write

𝜏𝑟 = 𝐸[Ẏ𝑟(1) | 𝐷 = 1] − 𝐸[Ẏ𝑟(0) | 𝐷 = 1] + (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝐸[𝑌𝑞(1) − 𝑌𝑞(0) | 𝐷 = 1] (2.6)

The first assumption, a weak version of “no anticipation”, eliminates the third term in (2.6).

Assumption NAC (No Anticipation, Common Timing): For the eventually treated indicator 𝐷,

𝐸[𝑌𝑡(1) − 𝑌𝑡(0) | 𝐷 = 1] = 0, 𝑡 = 1, . . . , 𝑆 − 1. □ (2.7)

The name of this assumption derives from the fact that 𝐸[𝑌𝑡(1) − 𝑌𝑡(0) | 𝐷 = 1] for 𝑡 < 𝑆 are average treatment effects on the treated prior to the intervention, and the assumption is that these are all zero. Assumption NAC is implied by an assumption commonly used in the literature, namely, 𝑌𝑡(1) = 𝑌𝑡(0), 𝑡 = 1, . . . , 𝑆 − 1. This assumption is implicit in Heckman et al. (1997) and made explicit in Abadie (2005) and elsewhere. Because the variable indexing 𝑌𝑡(·) is a treatment status not yet assigned, the assumption rules out anticipatory changes in the potential outcomes, on average. If one is concerned that the announcement of a policy prior to its implementation may result in some units strategically manipulating their pre-intervention outcomes, one might drop a period or two just prior to the intervention – at a minimum, as a robustness check. Naturally, this can result in less precise estimators. Given Assumption NAC, we can express 𝜏𝑟 as

𝜏𝑟 = 𝐸[Ẏ𝑟(1) | 𝐷 = 1] − 𝐸[Ẏ𝑟(0) | 𝐷 = 1]. (2.8)

Estimating the first term in (2.8) is easy because we observe Ẏ𝑟(1) when 𝐷 = 1.
More precisely, define the same transformation of the observed variable 𝑌𝑟:

Ẏ𝑟 ≡ 𝑌𝑟 − (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝑌𝑞 ≡ 𝑌𝑟 − Ȳ_pre (2.9)

When 𝐷 = 1, Ẏ𝑟 = Ẏ𝑟(1) and so 𝐸[Ẏ𝑟(1) | 𝐷 = 1] = 𝐸(Ẏ𝑟 | 𝐷 = 1), and the latter is trivially identified (as usual, under a suitable sampling scheme). Notice in the simple 𝑇 = 2 case with 𝑆 = 2, Ẏ₂ = 𝑌₂ − 𝑌₁, the difference from period one to two. The difficult term in identifying 𝜏𝑟 is 𝐸[Ẏ𝑟(0) | 𝐷 = 1]. The unconditional parallel trends assumption implies that 𝐸[Ẏ𝑟(0) | 𝐷 = 1] = 𝐸[Ẏ𝑟(0) | 𝐷 = 0]. Here we allow a weaker version of parallel trends by assuming it holds conditional on observed (pre-treatment) covariates.

Assumption CPTC (Conditional Parallel Trends, Common Timing): For observed covariates X,

𝐸[𝑌𝑡(0) − 𝑌₁(0) | 𝐷, X] = 𝐸[𝑌𝑡(0) − 𝑌₁(0) | X], 𝑡 = 2, . . . , 𝑇. □ (2.10)

Simple algebra shows that (2.10) is the same as assuming 𝐸[𝑌𝑡(0) − 𝑌𝑠(0) | 𝐷, X] = 𝐸[𝑌𝑡(0) − 𝑌𝑠(0) | X] for all 𝑡 ≠ 𝑠. Wooldridge (2021) used very similar assumptions, along with linearity of conditional means, to derive identification of the 𝜏𝑟. Here we are interested in applying methods other than regression adjustment to the transformed outcomes in (2.9). Assumption CPTC allows us to identify 𝐸[Ẏ𝑟(0) | 𝐷 = 1]. To see how, first note that, by iterated expectations,

𝐸[Ẏ𝑟(0) | 𝐷 = 1] = 𝐸{𝐸[Ẏ𝑟(0) | 𝐷 = 1, X] | 𝐷 = 1} (2.11)

Next, write

Ẏ𝑟(0) = (𝑆 − 1)⁻¹ ∑_{𝑞=1}^{𝑆−1} [𝑌𝑟(0) − 𝑌𝑞(0)]

Then, by CPTC,

𝐸[Ẏ𝑟(0) | 𝐷 = 1, X] = (𝑆 − 1)⁻¹ ∑_{𝑞=1}^{𝑆−1} 𝐸[𝑌𝑟(0) − 𝑌𝑞(0) | 𝐷 = 1, X]
= (𝑆 − 1)⁻¹ ∑_{𝑞=1}^{𝑆−1} 𝐸[𝑌𝑟(0) − 𝑌𝑞(0) | 𝐷 = 0, X]
= 𝐸[𝑌𝑟(0) − (𝑆 − 1)⁻¹ ∑_{𝑞=1}^{𝑆−1} 𝑌𝑞(0) | 𝐷 = 0, X]
= 𝐸[Ẏ𝑟(0) | 𝐷 = 0, X] (2.12)

The conclusion in equation (2.12) is simple but important. It says that, in terms of the potential outcome Ẏ𝑟(0), treatment 𝐷 is unconfounded conditional on X. Assumption NAC ensures that the ATTs can be expressed as in (2.8). This means that, for a post-intervention period 𝑟, we have turned the difference-in-differences problem into a standard problem of estimating an ATT in a cross-sectional population. Using the fact that 𝑌𝑞 = 𝑌𝑞(0) when 𝐷 = 0, (2.12) implies that

𝐸[Ẏ𝑟(0) | 𝐷 = 1, X] = 𝐸(Ẏ𝑟 | 𝐷 = 0, X)

Now the argument is the same as in the typical cross section: by iterated expectations,

𝐸[Ẏ𝑟(0) | 𝐷 = 1] = 𝐸[𝐸(Ẏ𝑟 | 𝐷 = 0, X) | 𝐷 = 1] ≡ 𝐸[ṁ₀𝑟(X) | 𝐷 = 1], (2.13)

where ṁ₀𝑟(x) ≡ 𝐸(Ẏ𝑟 | 𝐷 = 0, X = x) is the conditional mean of the observed variable Ẏ𝑟 for the control group. This function is nonparametrically identified on Supp(X | 𝐷 = 0), the support of the covariates for the control group. To ensure we can compute 𝐸[ṁ₀𝑟(X) | 𝐷 = 1] without extrapolation to covariate values outside Supp(X | 𝐷 = 0), we impose a standard overlap assumption.

Assumption OVLC (Overlap, Common Timing): Define the propensity score

𝑝(x) = 𝑃(𝐷 = 1 | X = x), x ∈ Supp(X). (2.14)

Then

𝑝(x) < 1, x ∈ Supp(X). □ (2.15)

The previous derivations and discussion prove the following.

Theorem 2.1 Under Assumption NAC, 𝜏𝑟 can be expressed as in (2.8) for 𝑟 = 𝑆, . . . , 𝑇. Under Assumption CPTC, 𝐷 is unconfounded (in the conditional mean sense) with respect to Ẏ𝑟(0) conditional on X. When we add Assumption OVLC, the parameters 𝜏𝑟, 𝑟 = 𝑆, . . . , 𝑇, are identified. □
2.3 Estimation in the Common Timing Case

Given the identification result stated in Theorem 2.1, the estimation of the 𝜏𝑟 is straightforward. We can apply any estimation method once the outcome variable has been transformed as in equation (2.9). Essentially, this is the conclusion reached in Sant’Anna and Zhao (2020) in the 𝑇 = 2 case. Earlier, Abadie (2005) proposed inverse probability weighting when 𝑇 = 2. For simplicity, assume in this section that we observe a random sample of size 𝑁 from the cross section. The observed outcome can be expressed as

𝑌𝑖𝑡 = (1 − 𝐷𝑖) · 𝑌𝑖𝑡(0) + 𝐷𝑖 · 𝑌𝑖𝑡(1) (2.16)

where we use an 𝑖 subscript to denote unit 𝑖. Under the strong form of no anticipation, 𝑌𝑖𝑡(1) = 𝑌𝑖𝑡(0) for 𝑡 < 𝑆 and all 𝑖. Given the derivations in the previous section, we only need Assumption NAC. For each unit 𝑖 = 1, 2, . . . , 𝑁, we observe the time series {(𝑌𝑖𝑡, 𝐷𝑖, X𝑖) : 𝑡 = 1, . . . , 𝑇}. To exploit the unconfoundedness and identification in Theorem 2.1, we simply need to obtain the transformed data. For each unit 𝑖, define

Ẏ𝑖𝑟 = 𝑌𝑖𝑟 − (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝑌𝑖𝑞 ≡ 𝑌𝑖𝑟 − Ȳ𝑖,pre (2.17)

Then, for any 𝑟 ∈ {𝑆, 𝑆 + 1, . . . , 𝑇}, we can apply any standard treatment effect (TE) estimator to the data {(Ẏ𝑖𝑟, 𝐷𝑖, X𝑖) : 𝑖 = 1, 2, . . . , 𝑁}. A common TE estimator is called “regression adjustment,” which means estimating separate regression functions for the control and treated units. Because Ẏ𝑖𝑟 can take on negative and positive values, linear regression adjustment (RA) makes the most sense. Linear RA is based on the conditional mean, stated in terms of population random variables,

𝐸(Ẏ𝑟 | 𝐷 = 0, X) = α̇𝑟 + Xβ𝑟 (2.18)

The parameters α̇𝑟 and β𝑟 are estimated from the cross-sectional regression

Ẏ𝑖𝑟 on 1, X𝑖 if 𝐷𝑖 = 0 (2.19)

Then, 𝜏𝑟 can be estimated using imputation:

τ̂𝑟 = Ȳ̇𝑟1 − 𝑁₁⁻¹ ∑_{𝑖=1}^{𝑁} 𝐷𝑖(α̂̇𝑟 + X𝑖β̂𝑟) = Ȳ̇𝑟1 − (α̂̇𝑟 + X̄₁β̂𝑟) (2.20)

where Ȳ̇𝑟1 = 𝑁₁⁻¹ ∑_{𝑖=1}^{𝑁} 𝐷𝑖Ẏ𝑖𝑟 and X̄₁ = 𝑁₁⁻¹ ∑_{𝑖=1}^{𝑁} 𝐷𝑖X𝑖 are the averages over the treated units. From a practical perspective, the important thing to remember is that τ̂𝑟 can be obtained from standard software that does basic regression adjustment once the Ẏ𝑖𝑟 have been obtained.

As discussed in Wooldridge (2021), the imputation estimate, τ̂𝑟, also can be obtained as the coefficient on 𝐷𝑖 in the pooled OLS regression

Ẏ𝑖𝑟 on 1, 𝐷𝑖, X𝑖, 𝐷𝑖 · (X𝑖 − X̄₁), 𝑖 = 1, 2, . . . , 𝑁, (2.21)

which uses all observations in time period 𝑟. This formulation is convenient because it leads to simple inference for τ̂𝑟, allowing easy computation of standard errors robust to any kind of heteroskedasticity. Also, it is often easy to account for the sampling variation in X̄₁ as an estimator of μ₁ ≡ 𝐸(X | 𝐷 = 1). Issues of clustering standard errors are relatively easy to deal with given we have a standard cross-sectional regression. Because of the representation of Ẏ𝑖𝑟 in (2.17), there is a simple characterization of τ̂𝑟. All of the coefficients in (2.21) are obtained by differencing the coefficients from two separate regressions. In the first, 𝑌𝑖𝑟 is regressed on all variables in (2.21). Then Ȳ𝑖,pre is regressed on the same set of variables, and these coefficients are subtracted from the first. In particular, τ̂𝑟 is obtained as the difference between two standard ATT estimators using regression adjustment. The first uses observations in period 𝑟 only, and the second uses the average of 𝑌𝑖𝑞 over the pre-treatment periods. Without covariates, the estimator would be

τ̂𝑟 = (Ȳ₁𝑟 − Ȳ₀𝑟) − (Ȳ₁,pre − Ȳ₀,pre), (2.22)

where the first subscript indicates treatment or control units. This has a clear interpretation as a DiD estimator.
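To make the mechanics concrete, the following is a minimal sketch of the common-timing rolling estimator, assuming a balanced long-format panel; the column names (unit, period, y, d), the covariate handling, and the helper itself are our illustrative choices, not part of the paper or of any package. It computes Ẏ𝑖𝑟 as in (2.17) and runs the cross-sectional regression (2.21) for a chosen post-treatment period 𝑟.

```python
# A minimal sketch, assuming a balanced long-format panel with columns
# unit, period, y, d (d = 1 for eventually treated units). All names
# here are illustrative assumptions.
import pandas as pd
import statsmodels.api as sm

def rolling_att(df, r, S, covars=()):
    """Estimate tau_r: demean y by its pre-treatment (t < S) unit average,
    eq. (2.17), then run the cross-sectional regression (2.21) in period r."""
    pre_mean = df[df["period"] < S].groupby("unit")["y"].mean()   # \bar{Y}_{i,pre}
    cross = df[df["period"] == r].set_index("unit").copy()
    cross["ydot"] = cross["y"] - pre_mean                         # transformed outcome
    X = pd.DataFrame({"d": cross["d"]})
    for c in covars:                                              # D_i * (X_i - Xbar_1)
        xbar1 = cross.loc[cross["d"] == 1, c].mean()
        X[c] = cross[c]
        X["d_x_" + c] = cross["d"] * (cross[c] - xbar1)
    res = sm.OLS(cross["ydot"], sm.add_constant(X)).fit(cov_type="HC3")
    return res.params["d"], res.bse["d"]
```

With covars=(), the coefficient on d is exactly the no-covariate estimator in (2.22), since regressing Ẏ𝑖𝑟 on a constant and 𝐷𝑖 returns the treated-control difference in means of the transformed outcome.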
Recognizing that an estimator can be obtained from (2.21) has additional benefits. For example, if 𝐷𝑖 is independent of Ẏ𝑖𝑟(0) – so that the unconditional PT assumption holds – then the covariates X𝑖 need not be included in (2.21) in order to consistently estimate 𝜏𝑟 as the coefficient on 𝐷𝑖. Remember, this allows 𝐷𝑖 to be correlated with the level, say 𝑌₁(0), the potential outcome in the first time period. If, in addition, 𝐷𝑖 is independent of X𝑖, the regression in (2.21) still can be used to improve efficiency over the simple estimator without the covariates. As discussed in Negi and Wooldridge (2021), such improvements are possible if X𝑖 helps predict Ẏ𝑖𝑟. In many cases, X𝑖 may not have much predictive power for Ẏ𝑖𝑟 even though it might predict the level, 𝑌𝑖𝑟, well. A special case is 𝑇 = 2, in which case (2.21) is simply

Δ𝑌𝑖 on 1, 𝐷𝑖, X𝑖, 𝐷𝑖 · (X𝑖 − X̄₁), 𝑖 = 1, 2, . . . , 𝑁

where Δ𝑌𝑖 = 𝑌𝑖2 − 𝑌𝑖1. In the 𝑇 = 2 case, whether including X𝑖 substantively helps precision when 𝐷𝑖 is independent of X𝑖 hinges on how well X𝑖 predicts the difference, Δ𝑌𝑖.

It turns out there is another useful algebraic equivalence. Suppose we act as if the following conditional expectation holds for all population units and time periods:

𝐸(𝑌𝑡 | 𝐷, X) = 𝛼 + X𝛽 + 𝛾𝐷 + (𝐷 · X)𝛿 + ∑_{𝑟=2}^{𝑇} 𝜃𝑟 𝑓𝑟𝑡 + ∑_{𝑟=2}^{𝑇} (𝑓𝑟𝑡 · X)𝜋𝑟 + ∑_{𝑟=𝑆}^{𝑇} 𝜏𝑟(𝐷 · 𝑓𝑟𝑡) + ∑_{𝑟=𝑆}^{𝑇} (𝐷 · 𝑓𝑟𝑡)(X − 𝜇₁)𝜂𝑟, 𝑡 = 1, . . . , 𝑇, (2.23)

where 𝑓𝑟𝑡 is a time period dummy equal to one if 𝑟 = 𝑡 and zero otherwise. The interaction 𝐷 · 𝑓𝑟𝑡 is the treatment indicator for time period 𝑟. Equation (2.23) suggests a pooled OLS regression across all 𝑖 and 𝑡:

𝑌𝑖𝑡 on 1, X𝑖, 𝐷𝑖, 𝐷𝑖 · X𝑖, 𝑓2𝑡, . . . , 𝑓𝑇𝑡, 𝑓2𝑡 · X𝑖, . . . , 𝑓𝑇𝑡 · X𝑖, 𝐷𝑖 · 𝑓𝑆𝑡, . . . , 𝐷𝑖 · 𝑓𝑇𝑡, 𝐷𝑖 · 𝑓𝑆𝑡 · (X𝑖 − X̄₁), . . . , 𝐷𝑖 · 𝑓𝑇𝑡 · (X𝑖 − X̄₁) (2.24)

The estimated treatment effects, say τ̃𝑟, are the coefficients on the treatment dummies 𝐷𝑖 · 𝑓𝑆𝑡, . . . , 𝐷𝑖 · 𝑓𝑇𝑡. Wooldridge (2021) shows that the τ̃𝑟 are numerically identical to a two-stage imputation approach based on the levels, 𝑌𝑖𝑡. It turns out that the τ̃𝑟 are also equivalent to the τ̂𝑟 obtained by using the transformed outcome variable, Ẏ𝑖𝑟, one period at a time.

Theorem 2.2 Let τ̂𝑟, 𝑟 = 𝑆, 𝑆 + 1, . . . , 𝑇 be the coefficients on 𝐷𝑖 in the separate regressions (2.21) – equivalently, from equation (2.20) – and let τ̃𝑟 be the coefficients on 𝐷𝑖 · 𝑓𝑆𝑡, . . . , 𝐷𝑖 · 𝑓𝑇𝑡 from (2.24). Then τ̃𝑟 = τ̂𝑟, 𝑟 = 𝑆, . . . , 𝑇. Moreover, the coefficient vector on 𝐷𝑖 · 𝑓𝑟𝑡 · (X𝑖 − X̄₁) in (2.24) is identical to that on 𝐷𝑖 · (X𝑖 − X̄₁) in (2.21). □

The proof of Theorem 2.2 can be found in Appendix 2A. The equivalence is valuable for a couple of reasons. First, it shows that two different ways to approach identification under the same set of assumptions – that in Wooldridge (2021) and the approach we use here – lead to the same estimation methods. Second, Wooldridge (2021, Theorem 6.2) shows that, under standard assumptions on the implied error term (which includes a unit-specific unobserved effect and a time-varying component), the estimators from (2.24) are both best linear unbiased and asymptotically efficient (with 𝑇 fixed, 𝑁 → ∞) under random sampling across 𝑖. This establishes that the transformation used in (2.21) does not discard useful information.

Given the equivalence of our transformation approach and the OLS estimator pooled across 𝑖 and 𝑡, what use is the former? Importantly, it allows us to use other treatment effects estimators beyond regression adjustment. For example, we can apply IPW or, even better, IPWRA, using the cross-sectional data {(Ẏ𝑖𝑟, 𝐷𝑖, X𝑖) : 𝑖 = 1, 2, . . . , 𝑁}. We can also apply propensity score matching, covariate matching, or nearest neighbor matching.

Procedure 3.1 (Rolling Methods, Common Timing):
Step 1. For a given time period 𝑟 ∈ {𝑆, . . . , 𝑇} and each unit 𝑖, compute Ẏ𝑖𝑟 as in (2.17).
Step 2. Using all of the units, apply standard TE methods – such as linear RA, IPW, IPWRA, matching – to the cross section {(Ẏ𝑖𝑟, 𝐷𝑖, X𝑖) : 𝑖 = 1, . . . , 𝑁}. □
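As an illustration of step 2, here is a compact sketch of a doubly robust estimator applied to the transformed cross section. It is an AIPW-type combination of a logit propensity score and a linear control-group regression – a close cousin of the IPWRA estimator referenced above rather than a replication of any particular package – and the function and argument names are our own.

```python
# A sketch of a doubly robust ATT estimator for the transformed data
# (ydot, d, X); AIPW-type, in the spirit of IPWRA but not identical to
# any packaged implementation. Inputs are NumPy arrays.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def dr_att(ydot, d, X):
    treated = d.astype(bool)
    phat = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    m0 = LinearRegression().fit(X[~treated], ydot[~treated]).predict(X)  # control mean
    resid = ydot - m0
    odds = phat / (1.0 - phat)        # reweights controls toward the treated covariates
    n1 = treated.sum()
    return (resid[treated].sum() - (odds[~treated] * resid[~treated]).sum()) / n1
```

If either the propensity score model or the control-mean regression is correctly specified, estimators of this form remain consistent for 𝜏𝑟, which is the doubly robust property motivating IPWRA.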
Inference on a single 𝜏𝑟 is simple when one uses built-in commands in step 2 of Procedure 3.1. Joint inference on multiple 𝜏𝑟 is trickier because the estimators are not independent. For estimators such as IPW and IPWRA using parametric models, a general approach is to stack all moment conditions used in estimation and use the formulas from just-identified generalized method of moments estimation. Applying the panel bootstrap – resampling all time periods from the cross-sectional units – is valid for IPW and IPWRA, and should be computationally feasible in most cases; a sketch appears at the end of this section.

It is instructive to compare the transformation in equation (2.17) to that in CS (2021). In the common timing case, the CS transformation, for 𝑟 ≥ 𝑆, is

Y̊𝑖𝑟 = 𝑌𝑖𝑟 − 𝑌𝑖,𝑆−1, (2.25)

so that Y̊𝑖𝑟 is a “long” difference. If 𝑟 = 𝑆 then Y̊𝑖𝑆 = 𝑌𝑖𝑆 − 𝑌𝑖,𝑆−1, which is differencing adjacent periods. In many cases, the CS transformation will be inefficient compared with (2.17) because the CS differencing ignores time periods other than 𝑆 − 1. However, there are cases where CS (2021) can be more efficient. When 𝑇 = 2, the transformations coincide: Ẏ₂ = Y̊₂ = Δ𝑌₂ ≡ 𝑌₂ − 𝑌₁. Thus, our approach encompasses and extends Abadie (2005) and Sant’Anna and Zhao (2020) in the panel data case.
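The panel bootstrap mentioned above resamples whole units with replacement, keeping each unit’s entire time series intact, and re-estimates the vector of ATTs on every draw. A minimal sketch, assuming the same long-format DataFrame as before and any estimator function returning the ATT vector (all names illustrative):

```python
# A minimal panel-bootstrap sketch: resample units (not unit-periods)
# with replacement and re-estimate the full vector of ATTs each draw.
import numpy as np
import pandas as pd

def panel_bootstrap(df, estimate_fn, n_boot=999, seed=0):
    rng = np.random.default_rng(seed)
    units = df["unit"].unique()
    draws = []
    for _ in range(n_boot):
        sampled = rng.choice(units, size=len(units), replace=True)
        # re-key resampled units so duplicates count as distinct draws
        boot = pd.concat(df[df["unit"] == u].assign(unit=k)
                         for k, u in enumerate(sampled))
        draws.append(estimate_fn(boot))
    return np.asarray(draws)  # e.g., np.percentile(draws, [2.5, 97.5], axis=0)
```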
2.4 Staggered Interventions

2.4.1 Some Units Never Treated

We now turn to the staggered intervention case. As in Athey and Imbens (2022) and Wooldridge (2021, 2023), the potential outcomes are denoted

𝑌𝑡(𝑔), 𝑔 ∈ {𝑆, . . . , 𝑇, ∞}, 𝑡 ∈ {1, 2, . . . , 𝑇}, (2.26)

where 𝑔 indicates the first time subjected to the intervention – defining the treatment group cohorts – and 𝑡 is calendar time. The case 𝑔 = ∞ indicates the potential outcome in the never treated state. In other words, 𝑌𝑡(∞) is the potential outcome at time 𝑡 when a unit is not subjected to the intervention over the observed stretch of time. Listing potential outcomes that vary only by cohort and calendar time reflects the assumption of no reversibility with staggered entry. We denote the group or cohort indicators by {𝐷𝑆, 𝐷𝑆+1, . . . , 𝐷𝑇, 𝐷∞}, where 𝐷∞ = 1 indicates the never treated. These dummy variables are mutually exclusive and exhaustive: 𝐷𝑆 + · · · + 𝐷𝑇 + 𝐷∞ = 1. The ATTs of interest are now written as

𝜏𝑔𝑟 = 𝐸[𝑌𝑟(𝑔) − 𝑌𝑟(∞) | 𝐷𝑔 = 1], 𝑟 = 𝑔, . . . , 𝑇; 𝑔 = 𝑆, . . . , 𝑇 (2.27)

For each treated cohort 𝑔, 𝜏𝑔𝑟 (𝑟 = 𝑔, . . . , 𝑇) denotes the ATTs in the period of first treatment and all subsequent time periods. To identify the 𝜏𝑔𝑟, we extend the trick for the common timing case by writing

𝑌𝑡(𝑔) − 𝑌𝑡(∞) = [𝑌𝑡(𝑔) − (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝑌𝑠(𝑔)] − [𝑌𝑡(∞) − (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝑌𝑠(∞)] + (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} [𝑌𝑠(𝑔) − 𝑌𝑠(∞)] (2.28)

As in the common timing case, we make a no anticipation assumption so that the third term can be dropped; this effectively allows using all available control units in each treated period. Here we condition on the covariates so that we can use not-yet-treated units as part of the control group.

Assumption CNAS (Conditional No Anticipation, Staggered): For 𝑔 ∈ {𝑆, . . . , 𝑇}, 𝑡 ∈ {1, . . . , 𝑔 − 1} and covariates X,

𝐸[𝑌𝑡(𝑔) | 𝐷𝑔 = 1, X] = 𝐸[𝑌𝑡(∞) | 𝐷𝑔 = 1, X]. □ (2.29)

As in the common timing case, this assumption means that the “treatment” effects prior to the intervention are all zero. Because 𝑠 < 𝑔 in the third sum, it follows that the expected value of the last term conditional on 𝐷𝑔 = 1 is zero. Therefore,

𝜏𝑔𝑟 = 𝐸[Ẏ𝑟𝑔(𝑔) | 𝐷𝑔 = 1] − 𝐸[Ẏ𝑟𝑔(∞) | 𝐷𝑔 = 1], (2.30)

where Ẏ𝑟𝑔(𝑔) and Ẏ𝑟𝑔(∞) are defined as the first and second terms in (2.28), respectively. Note that the subscripts on Ẏ𝑟𝑔(𝑔) and Ẏ𝑟𝑔(∞) mean that we are averaging all periods just prior to 𝑔 and subtracting that average from the outcome in the current calendar time period 𝑟. As before, the first term in equation (2.28) is easily estimated because we observe Ẏ𝑟𝑔(𝑔) when 𝐷𝑔 = 1. A parallel trends assumption, stated in terms of the never treated state, is sufficient to identify 𝐸[Ẏ𝑟𝑔(∞) | 𝐷𝑔 = 1]. We state an assumption conditional on a set of covariates, X, with no covariates as a special case.

Assumption CPTS (Conditional PT, Staggered): For D = (𝐷𝑆, . . . , 𝐷𝑇),

𝐸[𝑌𝑡(∞) − 𝑌₁(∞) | D, X] = 𝐸[𝑌𝑡(∞) − 𝑌₁(∞) | X], 𝑡 = 2, . . . , 𝑇. □ (2.31)

This assumption is used in Wooldridge (2021). Again, it is unconfoundedness of the treatment level, as given by D, with respect to the trend in the untreated state, 𝑌𝑡(∞) − 𝑌₁(∞). Wooldridge (2021) used this assumption, along with linearity of conditional means, to derive an imputation estimator and showed it was the same as a pooled OLS and TWFE estimator. Here we show how it can be used to identify the 𝜏𝑔𝑟 very generally. With Ẏ𝑟𝑔(∞) defined above,

𝐸[Ẏ𝑟𝑔(∞) | 𝐷𝑔 = 1, X] = (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝐸[𝑌𝑟(∞) − 𝑌𝑠(∞) | 𝐷𝑔 = 1, X]
= (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝐸[𝑌𝑟(∞) − 𝑌𝑠(∞) | 𝐷∞ = 1, X]
= 𝐸[Ẏ𝑟𝑔(∞) | 𝐷∞ = 1, X] (2.32)

where the second equality follows from CPTS and the third follows by taking the expectation outside the summation. We have shown the following.

Theorem 2.3 Under Assumption CNAS, equation (2.30) holds. If we add Assumption CPTS, the cohort assignments, D = (𝐷𝑆, . . . , 𝐷𝑇), are unconfounded with respect to Ẏ𝑟𝑔(∞) (in the conditional mean sense), 𝑔 ∈ {𝑆, . . . , 𝑇}, 𝑟 ∈ {𝑔, . . . , 𝑇}, conditional on X. □
Because the vector of cohort indicators is unconfounded with respect to Ẏ𝑟𝑔(∞), Theorem 2.3 implies

𝐸[Ẏ𝑟𝑔(∞) | 𝐷∞ = 1, X] = 𝐸[Ẏ𝑟𝑔(∞) | 𝐷ℎ = 1, X], ℎ = 𝑆, . . . , 𝑇 (2.33)

We can combine this implication of CPTS with Assumption CNAS because, at time 𝑟, cohorts ℎ = 𝑟 + 1, . . . , 𝑇 have yet to be treated. Therefore,

𝐸[Ẏ𝑟𝑔(∞) | 𝐷ℎ = 1, X] = 𝐸[Ẏ𝑟𝑔(ℎ) | 𝐷ℎ = 1, X], ℎ = 𝑟 + 1, . . . , 𝑇 (2.34)

Combined, (2.33) and (2.34) mean that, in addition to the never treated (NT) group, we can use treatment cohorts ℎ ∈ {𝑟 + 1, . . . , 𝑇} in estimating 𝐸[Ẏ𝑟𝑔(∞) | 𝐷∞ = 1, X]. Incidentally, this derivation shows that if we only use the NT group as the control for each (𝑔, 𝑟) pair then we can drop the conditioning on X in Assumption CNAS. Later we discuss what can be identified in period 𝑇 without a NT group (under CNAS). We have established the following. For estimating 𝐸[Ẏ𝑟𝑔(∞) | 𝐷𝑔 = 1, X] for 𝑟 ∈ {𝑔, 𝑔 + 1, . . . , 𝑇} we can use cohorts {𝑟 + 1, . . . , 𝑇, ∞} as the control group. Define the indicator for the control group as 𝐴𝑟+1 ≡ 𝐷𝑟+1 + 𝐷𝑟+2 + · · · + 𝐷𝑇 + 𝐷∞. Then, within the subpopulation 𝐷𝑔 + 𝐴𝑟+1 = 1, 𝐷𝑔 is unconfounded with respect to Ẏ𝑟𝑔(∞), conditional on X. Therefore, we can apply standard treatment effect estimators after transforming the observed outcome and conditioning on the subpopulation. Naturally, we will need an overlap assumption in order to ensure identification when using methods that condition on covariates. For 𝜏𝑔𝑟 and using all legitimate control groups under CNAS and CPTS, the overlap assumption is

Assumption OVLS (Overlap, Staggered Case): For cohorts 𝑔 ∈ {𝑆, 𝑆 + 1, . . . , 𝑇} and time periods 𝑟 ∈ {𝑔, 𝑔 + 1, . . . , 𝑇},

𝑃(𝐷𝑔 = 1 | 𝐷𝑔 + 𝐴𝑟+1 = 1, X = x) < 1 for all x ∈ Supp(X). □ (2.35)

This condition ensures that, within the subpopulation of cohort 𝑔 plus the never treated and not-yet-treated units at time 𝑟, every treated unit has a comparable control unit. Given data on units again indexed by 𝑖, the following simple steps lead to a general analysis. Assumptions CNAS, CPTS, and the overlap assumption are in force.

Procedure 4.1 (Rolling Methods, Staggered Interventions):
Step 1. For a given 𝑔 ∈ {𝑆, . . . , 𝑇} and time period 𝑟 ∈ {𝑔, 𝑔 + 1, . . . , 𝑇}, compute

Ẏ𝑖𝑟𝑔 ≡ 𝑌𝑖𝑟 − (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝑌𝑖𝑠 ≡ 𝑌𝑖𝑟 − Ȳ𝑖,pre(𝑔) (2.36)

Step 2. Choose as the control group the units with 𝐷𝑖,𝑟+1 + 𝐷𝑖,𝑟+2 + · · · + 𝐷𝑖𝑇 + 𝐷𝑖∞ = 1 (or, if desired, a subset, such as the NT group).
Step 3. Using the subset of data with

𝐷𝑖𝑔 + 𝐷𝑖,𝑟+1 + 𝐷𝑖,𝑟+2 + · · · + 𝐷𝑖𝑇 + 𝐷𝑖∞ = 1, (2.37)

apply standard TE methods – such as linear RA, IPW, IPWRA, matching – to the cross section {(Ẏ𝑖𝑟𝑔, 𝐷𝑖𝑔, X𝑖) : 𝑖 = 1, . . . , 𝑁}, with 𝐷𝑖𝑔 acting as the treatment indicator. □

When X𝑖 has high dimension, Procedure 4.1 implies that machine learning methods can be applied after obtaining the transformation in (2.36). See, for example, Belloni et al. (2014) and Chernozhukov et al. (2018).
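A compact sketch of Procedure 4.1 follows, assuming a long-format DataFrame in which a cohort column records the first treated period (with np.inf for never-treated units); the column names and the helper are illustrative, and the estimator argument can be any cross-sectional ATT routine, such as the doubly robust sketch given earlier.

```python
# A sketch of Procedure 4.1: transform as in (2.36), keep cohort g plus
# the not-yet-treated and never-treated units (cohort > r covers both,
# since never treated is coded np.inf), as in (2.37), then apply a
# standard cross-sectional ATT estimator. Names are illustrative.
import numpy as np

def rolling_att_staggered(df, g, r, covars, estimator):
    pre_mean = df[df["period"] < g].groupby("unit")["y"].mean()    # \bar{Y}_{i,pre(g)}
    cross = df[df["period"] == r].set_index("unit").copy()
    cross["ydot"] = cross["y"] - pre_mean                          # eq. (2.36)
    cross = cross[(cross["cohort"] == g) | (cross["cohort"] > r)]  # eq. (2.37)
    d = (cross["cohort"] == g).to_numpy().astype(float)
    return estimator(cross["ydot"].to_numpy(), d,
                     cross[list(covars)].to_numpy())
```

For example, the ATT of cohort 5 in period 6 would be rolling_att_staggered(df, g=5, r=6, covars=["x1", "x2"], estimator=dr_att); looping over (𝑔, 𝑟) pairs traces out the full set of 𝜏𝑔𝑟.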
Interestingly, when 𝑟 = 𝑔, so that 𝜏𝑔𝑔 is the instantaneous effect of the intervention for treatment cohort 𝑔, applying linear RA to {(Ẏ𝑖𝑔𝑔, 𝐷𝑖𝑔, X𝑖) : 𝑖 = 1, . . . , 𝑁}, with all possible control units, reproduces the POLS estimator proposed by Wooldridge (2021). When 𝑟 > 𝑔 this is not the case, which means, under standard assumptions, the rolling approach we propose is inefficient for the dynamic effects. The trade-off is that we are able to apply many different kinds of estimators, including doubly robust and matching estimators.

2.4.2 Comparison to Alternative Methods

Procedure 4.1 can be compared with the Callaway and Sant’Anna (2021) approach with staggered interventions. CS (2021) suggest using a long difference of the form

Y̊𝑖𝑟𝑔 ≡ 𝑌𝑖𝑟 − 𝑌𝑖,𝑔−1 (2.38)

and then choosing control groups from cohorts {𝑟 + 1, . . . , 𝑇, ∞}. The transformation in (2.38) ignores the control periods prior to 𝑔 − 1 and is generally inefficient under classical assumptions. Also, the default implementation in commonly used software (R and Stata) is to use only the never-treated group as controls. Several authors, including Marcus and Sant’Anna (2021) and Callaway (2023), recognize that the version of parallel trends in Assumption CPTS implies that more information is available for estimating the ATTs. However, we believe we are the first to propose the demeaning transformation at the unit level, using the pre-treatment averages, and then combining the transformation with various treatment effects estimators after establishing unconfoundedness. Our approach makes it straightforward to apply different strategies based on different transformations and different control periods and control groups. Implementing Procedure 4.1 is straightforward because it simply requires obtaining Ẏ𝑖𝑟𝑔 and then applying standard treatment effect software. Standard errors are easily obtained.

To illustrate where the efficiency and robustness of our method originate, we compare the information sets utilized by various estimators when estimating treatment effects. Suppose there are three treatment cohorts – groups 4, 5, and 6 – and our goal is to estimate the average treatment effect on the treated for group 4 at time 𝑡 = 5, 𝐴𝑇𝑇(4, 5). Table 2.1 summarizes the subset of data each estimator employs for this estimation task. For pre-treatment periods up to 𝑡 = 3, the areas highlighted in dark gray represent the data used by each estimator, while light gray regions indicate observations that are not used. Compared to the estimator proposed by Callaway and Sant’Anna (2021), our unit-specific time-series transformation (rolling) method leverages a larger set of pre-treatment observations for each unit. This broader utilization enhances efficiency, recovering some of the precision otherwise lost in the CS (2021) framework.

Table 2.1 Pre-Treatment Period Observations Used by Each Method
[Shaded grid, one panel per estimator (CS (2021), Rolling Method, Wooldridge (2021)): rows are cohorts 𝑔 = 4, 5, 6, columns are periods 𝑡 = 1, . . . , 6; the target estimate 𝐴𝑇𝑇(4, 5) is marked ⋆, dark gray cells mark utilization of information, and light gray cells mark loss of information.]

However, the rolling method uses slightly less information than Wooldridge (2021)’s approach. Specifically, estimating 𝐴𝑇𝑇(4, 5) under the rolling method requires dropping observations from units already treated by period 𝑡 = 5, including Ẏ₅₅, which retains pre-treatment history. This exclusion is necessary to avoid “bad comparisons.” A similar logic underlies the CS (2021) approach, which likewise omits already-treated units. In contrast, Wooldridge (2021)’s regression-based framework retains all pre-treatment observations by construction, achieving high efficiency under correct specification.
However, this efficiency comes at the cost of increased vulnerability to model misspecification. In this respect, our estimator can be viewed as a middle ground that combines the robustness of the CS method with the efficiency of the Wooldridge (2021) framework. For example, our method accommodates doubly robust estimation, offering greater robustness to outcome model misspecification while still preserving much of the efficiency of the Wooldridge (2021) framework – a benefit not shared by the CS (2021) framework. In the special case of a single treatment group – that is, under common timing – the rolling method and the extended TWFE estimator (Wooldridge, 2021) use an equivalent set of pre-treatment observations, resulting in comparable efficiency when the model is correctly specified.

2.4.3 All Units Eventually Treated

As in Wooldridge (2021, 2023), we can handle situations where all units are treated by 𝑡 = 𝑇 by simply modifying the parallel trends assumptions and changing the specifics of the estimation. Rather than stating the CPT assumption in terms of the NT state, 𝑌𝑡(∞), it is stated in terms of 𝑌𝑡(𝑇), the state of not being treated until the final time period. Now all of the treatment effects are, initially, defined relative to 𝑌𝑡(𝑇): 𝐸[𝑌𝑟(𝑔) − 𝑌𝑟(𝑇) | 𝐷𝑔 = 1], 𝑔 ∈ {𝑆, . . . , 𝑇 − 1}, 𝑟 ∈ {𝑔, . . . , 𝑇}. We can no longer estimate a treatment effect for the final treated cohort because there are no control units. As discussed in Wooldridge (2021), by no anticipation it follows that for 𝑟 < 𝑇, 𝐸[𝑌𝑟(𝑔) − 𝑌𝑟(𝑇) | 𝐷𝑔 = 1] = 𝐸[𝑌𝑟(𝑔) − 𝑌𝑟(∞) | 𝐷𝑔 = 1], and so, except for the final time period, the ATTs are interpreted as when we have a never treated group. In terms of estimation, the modifications to Procedure 4.1 are straightforward. In particular, Ẏ𝑖𝑟𝑔 is computed as in (2.36) but only for 𝑔 ∈ {𝑆, . . . , 𝑇 − 1}. In steps 2 and 3, we drop 𝐷𝑖∞ everywhere – which means still choosing as the control group for cohort 𝑔 in period 𝑟 those units not yet treated. Then, for each 𝑔 ∈ {𝑆, . . . , 𝑇 − 1} and for each period 𝑟 ∈ {𝑔, . . . , 𝑇}, we apply standard treatment effect estimators, as before. When 𝑟 = 𝑇, 𝐷𝑇 = 1 acts as the only control group for all cohorts first treated prior to period 𝑇.

2.5 Heterogeneous Trends

One way to test for violation of the PT assumption is to estimate placebo treatment effects prior to the intervention. Callaway and Sant’Anna (2021) take this approach using their differencing method. Here, we can apply Procedure 3.1 or 4.1 to pre-treatment periods and test for effects prior to the intervention. For cohort 𝑔, it makes sense to split the pre-treatment periods, {1, 2, . . . , 𝑔 − 1}, into subsets of roughly equal size. Then, the never treated group, or any of the groups not yet treated in {1, 2, . . . , 𝑔 − 1}, can be used as the controls. Under the null hypothesis of (conditional) PT, the tests should not find a “treatment” effect.

As motivation for adjusting Procedure 4.1 (with common timing being a special case) to allow for heterogeneous trends, express the conditional PT assumption as

𝐸[𝑌𝑡(∞) | D, X] = 𝑞∞(X) + ∑_{𝑔=𝑆}^{𝑇} 𝐷𝑔𝑞𝑔(X) + 𝑚𝑡(X), 𝑡 = 1, . . . , 𝑇, (2.39)

where 𝑞𝑔(·) and 𝑚𝑡(·) are functions of the covariates, with the first not changing across time and the second not depending on the treatment cohort. As a normalization, 𝑚₁(x) ≡ 0 for all x ∈ Supp(X). It is easily seen that

𝐸[𝑌𝑡(∞) − 𝑌₁(∞) | D, X] = 𝑚𝑡(X), 𝑡 = 2, . . . , 𝑇,

which does not depend on D.
Moreover, for 𝑟 ≥ 𝑔, it follows from the definition of Ẏ𝑟𝑔(∞) that

𝐸[Ẏ𝑟𝑔(∞) | 𝐷∞ = 1, X] = 𝑚𝑟(X) − (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝑚𝑠(X) ≡ ṁ𝑟𝑔(X) (2.40)

and so ṁ𝑟𝑔(·) is the conditional mean function implicit in the methods from Section 2.4 that use a conditional mean specification (RA, IPW, or IPWRA). Because ṁ𝑟𝑔(·) can take on positive and negative values, we essentially assumed in regression-based methods that ṁ𝑟𝑔(·) can be approximated by a function linear in parameters (allowing controls to appear flexibly, as usual).

The representation in (2.39) suggests a way to relax parallel trends for cohorts where we have at least two pre-treatment time periods. We replace (2.39) with the following assumption.

Assumption CHT (Conditional Heterogeneous Trends): For D = (𝐷𝑆, . . . , 𝐷𝑇) and 𝑡 = 1, 2, . . . , 𝑇,

𝐸[𝑌𝑡(∞) | D, X] = 𝜂𝑆(𝐷𝑆 · 𝑡) + · · · + 𝜂𝑇(𝐷𝑇 · 𝑡) + 𝑞∞(X) + ∑_{𝑔=𝑆}^{𝑇} 𝐷𝑔𝑞𝑔(X) + 𝑚𝑡(X). □ (2.41)

Assumption CHT allows a separate linear trend in the never treated state for each treatment cohort. It is easy to see that

𝐸[𝑌𝑡(∞) − 𝑌𝑡−1(∞) | D, X] = 𝜂𝑆𝐷𝑆 + · · · + 𝜂𝑇𝐷𝑇 + [𝑚𝑡(X) − 𝑚𝑡−1(X)] (2.42)

and so PT, even conditional on X, fails. Because the trend in the never treated state is systematically related to cohort, the estimation approaches in Sections 2.3 and 2.4 are no longer valid. Instead, we can use linear detrending, unit by unit, to remove the relationship between 𝑌𝑡(∞) and cohort assignment. For any 𝑖, we can write

𝑌𝑖𝑡(∞) = h(D𝑖, X𝑖) + D𝑖 · 𝑡 · 𝜂 + 𝑚𝑡(X𝑖) + 𝑈𝑖𝑡(∞), 𝐸[𝑈𝑖𝑡(∞) | D𝑖, X𝑖] = 0, (2.43)

where h(D𝑖, X𝑖) does not vary across 𝑡. For any 𝑡 ≥ 2, we can eliminate both h(D𝑖, X𝑖) and D𝑖 · 𝑡 · 𝜂 using unit-specific linear detrending. Equation (2.43) is an example of a heterogeneous (or random) trend model of the kind discussed in Wooldridge (2010, Section 11.7.2). Appendix 2B details how unit-specific linear trends are removed by regressing the outcome on a constant and a linear time trend using the pre-treatment periods. The resulting prediction errors define the detrended outcome variable, denoted Ÿ𝑖𝑟𝑔(∞). We then show that the treatment cohort indicators, D𝑖, are unconfounded with respect to these detrended variables, conditional on X𝑖. Moreover, by Assumption CNAS, cohorts with ℎ > 𝑟 (including ℎ = ∞) can be used as part of the control group. Then, the argument is as in Section 2.4: we have justified the following procedure as producing consistent estimators of 𝜏𝑔𝑟 under Assumptions CNAS, CHT, and OVLS.

Procedure 5.1 (Staggered Entry, Heterogeneous Linear Trends):
Step 1. For a specified cohort 𝑔 ∈ {𝑆, . . . , 𝑇}, run the unit-specific regressions

𝑌𝑖𝑡 on 1, 𝑡, 𝑡 = 1, . . . , 𝑔 − 1 (2.44)

For 𝑟 ∈ {𝑔, . . . , 𝑇}, compute the out-of-sample predicted value Ŷ𝑖𝑟𝑔 and the prediction error (detrended variable) Ÿ𝑖𝑟𝑔 ≡ 𝑌𝑖𝑟 − Ŷ𝑖𝑟𝑔. (One need not do this for units treated prior to period 𝑔.)
Step 2. Choose as the control group the units with 𝐷𝑖,𝑟+1 + 𝐷𝑖,𝑟+2 + · · · + 𝐷𝑖𝑇 + 𝐷𝑖∞ = 1 (or, if desired, a subset, such as the NT group).
Step 3. Using the subset of data with 𝐷𝑖𝑔 + 𝐷𝑖,𝑟+1 + 𝐷𝑖,𝑟+2 + · · · + 𝐷𝑖𝑇 + 𝐷𝑖∞ = 1, or a further subset, apply standard TE methods – such as linear RA, IPW, IPWRA, or matching – to the cross section

{(Ÿ𝑖𝑟𝑔, 𝐷𝑖𝑔, X𝑖) : 𝑖 = 1, . . . , 𝑁}, (2.45)

with 𝐷𝑖𝑔 acting as the treatment indicator. □

Procedure 5.1 is very easy to implement, requiring only many unit-specific simple regressions on a constant and a linear time trend.
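Step 1 is simple enough to code directly. A sketch, assuming the same long-format DataFrame as before (column names illustrative): each unit’s pre-𝑔 outcomes are fit on a constant and linear trend, and the period-𝑟 prediction error gives Ÿ𝑖𝑟𝑔.

```python
# A sketch of step 1 of Procedure 5.1: unit-specific linear detrending.
# Requires at least two pre-treatment periods (g - 1 >= 2). Column names
# (unit, period, y) are illustrative assumptions.
import numpy as np

def detrended_outcome(df, g, r):
    ycheck = {}
    for unit, block in df[df["period"] < g].groupby("unit"):
        t, y = block["period"].to_numpy(), block["y"].to_numpy()
        slope, intercept = np.polyfit(t, y, deg=1)    # OLS of y on (1, t), eq. (2.44)
        y_r = df.loc[(df["unit"] == unit) & (df["period"] == r), "y"].iloc[0]
        ycheck[unit] = y_r - (intercept + slope * r)  # prediction error in period r
    return ycheck
```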
The common timing case is especially easy because the regression in (2.44) is done with 𝑔 = 𝑆 only, and then the detrended outcomes Ÿ𝑖𝑟 are used in standard treatment effect estimation for 𝑟 = 𝑆, . . . , 𝑇. In the simplest case where Procedure 5.1 can be applied, with 𝑇 = 3 and common intervention at 𝑆 = 3, and without covariates, the resulting estimator of 𝜏₃ is

𝑁₁⁻¹ ∑_{𝑖=1}^{𝑁} 𝐷𝑖Ÿ𝑖3 − 𝑁₀⁻¹ ∑_{𝑖=1}^{𝑁} (1 − 𝐷𝑖)Ÿ𝑖3, (2.46)

where Ÿ𝑖3 is obtained as the prediction error in period three after the regression 𝑌𝑖𝑡 on 1, 𝑡, 𝑡 = 1, 2. After a little algebra, (2.46) can be shown to equal

[(Ȳ₁₃ − Ȳ₁₂) − (Ȳ₀₃ − Ȳ₀₂)] − [(Ȳ₁₂ − Ȳ₁₁) − (Ȳ₀₂ − Ȳ₀₁)], (2.47)

where the first subscript on the average is one for treated and zero for control, and the second subscript indicates time period. The first term in brackets is the usual two-period DiD estimator if the first time period is ignored. The second term is an estimate of the difference in trends prior to the intervention – often interpreted as estimating a placebo effect. The estimator in (2.47) is an example of a difference-in-difference-in-differences estimator. Procedure 5.1 allows one to control for covariates in case removing an estimate of the pre-intervention difference in trends is still not deemed sufficient to uncover a causal effect.
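The algebra behind the equality of (2.46) and (2.47) is easy to verify numerically: fitting a line through (1, 𝑌𝑖1) and (2, 𝑌𝑖2) predicts 2𝑌𝑖2 − 𝑌𝑖1 in period three, so Ÿ𝑖3 = (𝑌𝑖3 − 𝑌𝑖2) − (𝑌𝑖2 − 𝑌𝑖1). A quick check on made-up data (not from the paper):

```python
# Numeric check that (2.46) equals (2.47) when T = 3 and S = 3; the data
# below are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=(40, 3))                 # 40 units, periods t = 1, 2, 3
d = (np.arange(40) < 15).astype(float)       # first 15 units treated at S = 3
yddot = y[:, 2] - (2 * y[:, 1] - y[:, 0])    # prediction error from fit on t = 1, 2
lhs = yddot[d == 1].mean() - yddot[d == 0].mean()        # eq. (2.46)
m = lambda grp, t: y[d == grp, t - 1].mean()             # group-period average
rhs = ((m(1, 3) - m(1, 2)) - (m(0, 3) - m(0, 2))) - (
       (m(1, 2) - m(1, 1)) - (m(0, 2) - m(0, 1)))        # eq. (2.47)
assert np.isclose(lhs, rhs)
```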
Before ending this section, we head off a potential source of confusion. The fact that we are running unit-specific linear trend regressions in (2.44) does not mean there is an incidental parameters problem that can cause inconsistency in the τ̂𝑔𝑟 when the number of time periods is small. In fact, we are just using these regressions to eliminate unit-specific heterogeneity that can be correlated with treatment cohorts. It is substantively the same as removing the unit-specific pre-treatment means in Procedure 4.1. In fact, this kind of unit-specific detrending is the same idea prevalent in the panel data literature with heterogeneous trends. See, for example, Wooldridge (2010, Section 11.7.2).

2.6 Violations of No Anticipation. Unbalanced Panels

The no anticipation assumption requires that, prior to the first intervention period for a given treatment cohort, the potential outcomes are the same (on average) as in the never treated state. This assumption can fail if units know that a program or policy change is approaching prior to its being actually implemented. If the NA assumption is in doubt, one can leave out one or more periods prior to the intervention time and redo the analysis as a robustness check. As an example, suppose a cohort is first treated in 𝑔 = 5. In Procedure 4.1, one would average over periods {1, 2, 3, 4} in obtaining the average to remove from 𝑌𝑖5 (or, one would remove a unit-specific linear trend, as in Procedure 5.1). Instead, one might drop period four altogether, or maybe even periods three and four. Any precise recommendation is context specific. It is very easy to apply any of the procedures we have recommended to cases where time periods are skipped.

Another issue that often arises in practice is unbalanced panels. With time-constant controls, unbalancedness would typically arise because of missing data on 𝑌𝑖𝑡, possibly due to attrition. If data are missing on 𝑌𝑖𝑡 for some time periods for unit 𝑖, the demeaning or detrending is simply applied to the observed data. The mechanics of the procedure are then exactly the same. For treatment cohort 𝑔 in period 𝑟, the transformed outcome (Ẏ𝑖𝑟𝑔 or Ÿ𝑖𝑟𝑔) can only be used if there are enough observed data in the periods 𝑡 < 𝑔 to compute an average (one period) or a linear trend (two periods). Of course, 𝑌𝑖𝑟 must also be observed. It is natural to wonder when ignoring the reason the panel is unbalanced does not cause systematic bias. Because our method removes unit-specific averages in Procedure 4.1, selection is allowed to depend on unobserved time-constant heterogeneity – just like with the usual fixed effects estimator. Selection cannot be systematically related to the shocks to 𝑌𝑖𝑡(∞) – again, just as with the FE estimator. When we remove a unit-specific linear trend, selection can now be correlated with both a level heterogeneity term and a trend heterogeneity term, providing somewhat more robustness to sample selection bias.

2.7 Application

In this section, we revisit the literature examining the effects of Walmart’s entry on local labor markets (Basker, 2005, Neumark et al., 2008, Brown and Butts, 2023). Prior studies highlight concerns about potential violations of the parallel trends assumption. For example, Brown and Butts (2023) provide visual evidence suggesting that much of the estimated effect could be due to pre-existing trends rather than the impact of Walmart’s entry (Rambachan and Roth, 2023). If store opening decisions are mainly driven by county-level economic fundamentals, observed employment differences between treated and control groups may not be attributable solely to Walmart’s opening.

2.7.1 Data

To illustrate that our rolling method, as developed in Section 2.5, yields robust results even when counties exhibit disparate trends after controlling for covariates, we use the dataset constructed by Brown and Butts (2023), which follows Basker (2005) and draws on the County Business Patterns (CBP) data from 1964 and 1977–1999. We limit the dataset to counties that had more than 1,500 total employees in 1964 and had non-negative employment growth rates during 1964–1977, each of which was imputed from the 1977–1999 dataset (see Brown and Butts, 2023). The Walmart store opening indicator is derived from the geographic dataset of openings in Arcidiacono et al. (2020). We exclude counties whose first Walmart opened before 1985. As a result, we obtain a balanced panel of 1,280 counties, each observed for nine pre-treatment years and fourteen post-treatment years, for a total of 23 years. Figure 2.1 plots the sample counties and their first Walmart opening year. Darker shades indicate later openings, light green shows never-treated counties, and gray marks excluded areas.

Figure 2.1 First Year of Walmart Store Openings by County

The treatment cohort is defined by the year of first opening. We utilize covariates, including the 1980 shares of the population employed in manufacturing, above the poverty line, and with a high school education. Never-treated counties comprise 31% of the sample. For descriptive statistics of the variables used in our analysis, see Table 2D.1 in Appendix 2D. We use the dataset to assess how Walmart openings affect county-level retail and wholesale employment.

2.7.2 Estimation Results

First, we estimate cohort-year-specific average treatment effects on the treated (ATT), denoted 𝐴𝑇𝑇(𝑔, 𝑡), for cohort 𝑔 in year 𝑡. We then compute the weighted sum of ATTs by relative time (length of exposure) and calculate the 95% bootstrap confidence intervals. We define the weighted ATTs, 𝑊𝐴𝑇𝑇(𝑟), as follows:

𝑊𝐴𝑇𝑇(𝑟) = ∑_{𝑔∈𝐺𝑟} 𝑤(𝑔, 𝑟) · 𝐴𝑇𝑇(𝑔, 𝑔 + 𝑟), where 𝑟 = 𝑡 − 𝑔,

for all 𝑟 ∈ {0, 1, . . . , 𝑇 − 𝑆}, where 𝐺𝑟 = {𝑆, . . . , 𝑇 − 𝑟} and 𝑤(𝑔, 𝑟) = 𝑁𝑔/𝑁𝐺𝑟 (𝑁𝑔 is the number of counties in treated cohort 𝑔, and 𝑁𝐺𝑟 is the number of counties in 𝐺𝑟). For pre-treatment periods (𝑟 < 0), the procedure used to obtain the ATTs is detailed in Appendix 2C. For example, 𝑊𝐴𝑇𝑇(0) denotes the weighted average of the immediate impact on the log of the employment level, which averages the ATTs of the first treated year across every treated cohort; i.e., 𝐴𝑇𝑇(𝑔, 𝑔) for all 𝑔 ∈ 𝐺₀.
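This aggregation is a one-liner once the 𝐴𝑇𝑇(𝑔, 𝑡) estimates are in hand. A sketch, assuming att maps (𝑔, 𝑡) pairs to estimates and n maps cohorts to cohort sizes 𝑁𝑔 (both names illustrative):

```python
# A sketch of the WATT(r) aggregation defined above: a cohort-size
# weighted average of ATT(g, g + r) over G_r = {S, ..., T - r}.
def watt(att, n, r, S, T):
    cohorts = range(S, T - r + 1)                 # G_r
    n_total = sum(n[g] for g in cohorts)          # N_{G_r}
    return sum((n[g] / n_total) * att[(g, g + r)] for g in cohorts)
```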
We define the weighted ATTs, $WATT(r)$, as follows:
\[
WATT(r) = \sum_{g \in G_r} w(g, r) \cdot ATT(g, g + r), \quad \text{where } r = t - g,
\]
for all $r \in \{0, 1, \ldots, T - S\}$, with $G_r = \{S, \ldots, T - r\}$ and $w(g, r) = N_g / N_{G_r}$ ($N_g$ is the number of counties in treated cohort $g$, and $N_{G_r}$ is the number of counties in $G_r$). For pre-treatment periods ($r < 0$), the procedure used to obtain the ATTs is detailed in Appendix 2C. For example, $WATT(0)$ denotes the weighted average of the immediate impact on the log of the employment level, obtained by averaging the ATTs of the first treated year across every treated cohort, that is, $ATT(g, g)$ for all $g \in G_{r=0}$.

Lastly, we compare results from the CS (2021) approach, the rolling IPWRA estimator with unit-specific demeaning, and the rolling IPWRA estimator with unit-specific detrending.

2.7.2.1 Retail Employment

In this section, we estimate the impact of Walmart's entry on county-level retail employment, using the log of employment as the outcome. Figure 2.2 reports weighted ATT estimates with bootstrapped 95% confidence intervals from the three estimators. Panel (a) is an event-study plot generated using the CS (2021) approach. The blue line represents the pre-trends, which ideally should be zero; instead, it shows a clear positive slope, indicating pre-existing differences between treated and control counties. As a result, the estimated treatment effects, shown in red, likely do not reflect the true impact of Walmart's entry on retail employment. Panel (b) presents results from our rolling IPWRA estimator applied to the demeaned outcome variable. This approach smooths out much of the pre-trend bias relative to the CS estimator, but the pre-trends are still not fully addressed. Panel (c) shows results from the rolling IPWRA estimator with detrending, which effectively eliminates county-level linear trends. The flatter pre-treatment trajectory indicates better adjustment for pre-existing differences. Post-entry, we observe modest but statistically significant increases in retail employment that decline over time and eventually converge to zero.

Figure 2.2 Effects of Walmart Opening on log(Retail Employment): Panel (a) CS approach; Panel (b) rolling IPWRA estimator; Panel (c) rolling IPWRA with unit-specific detrending

For example, ATT(1) from the rolling IPWRA with heterogeneous trends is 0.032 (SE = 0.005), indicating a 3.2% increase in retail employment one year after entry (see Appendix 2D for all estimates). Given an average of 6,589 retail employees in treated counties, this translates to about 210 additional jobs. However, since Walmart stores typically employ 150–300 workers (Basker, 2005), it is difficult to attribute broader local employment gains beyond direct hiring. Results for wholesale employment appear in Appendix 2D.1.

2.8 Concluding Remarks

In this paper, we propose a flexible transformation approach for estimating treatment effects in panel data with staggered interventions and heterogeneous trends. A key advantage of this method is that once the transformed dependent variable is defined, whether in the common timing case, the staggered case, or after removing unit-specific trends, researchers can apply standard treatment effect estimators to the resulting cross-sectional data. This includes regression adjustment, inverse probability weighting, and doubly robust procedures. The transformation is easily adapted to accommodate unit-specific trends and allows for dynamic and heterogeneous treatment effects across units and time periods.
Our empirical application to Walmart's entry demonstrates how the approach improves estimation accuracy and robustness compared to existing methods. These features, combined with its simplicity of implementation, make the method a valuable tool for applied researchers.

BIBLIOGRAPHY

Abadie, A. (2005). Semiparametric difference-in-differences estimators. The Review of Economic Studies, 72(1):1–19.

Arcidiacono, P., Ellickson, P. B., Mela, C. F., and Singleton, J. D. (2020). The competitive effects of entry: Evidence from supercenter expansion. American Economic Journal: Applied Economics, 12(3):175–206.

Athey, S. and Imbens, G. W. (2022). Design-based analysis in difference-in-differences settings with staggered adoption. Journal of Econometrics, 226(1):62–79.

Basker, E. (2005). Job creation or destruction? Labor market effects of Wal-Mart expansion. Review of Economics and Statistics, 87(1):174–183.

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608–650.

Borusyak, K. and Jaravel, X. (2018). Revisiting event study designs. Available at SSRN 2826228.

Borusyak, K., Jaravel, X., and Spiess, J. (2024). Revisiting event study designs: Robust and efficient estimation. Review of Economic Studies, page rdae007.

Brown, N. and Butts, K. (2023). Dynamic treatment effect estimation with interactive fixed effects and short panels.

Callaway, B. (2023). Difference-in-differences for policy evaluation. Handbook of Labor, Human Resources and Population Economics, pages 1–61.

Callaway, B. and Sant'Anna, P. H. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200–230.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters.

De Chaisemartin, C. and d'Haultfoeuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9):2964–2996.

De Chaisemartin, C. and d'Haultfoeuille, X. (2023). Two-way fixed effects and difference-in-differences with heterogeneous treatment effects: A survey. The Econometrics Journal, 26(3):C1–C30.

Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2):254–277.

Heckman, J. J., Ichimura, H., and Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. The Review of Economic Studies, 64(4):605–654.

Marcus, M. and Sant'Anna, P. H. (2021). The role of parallel trends in event study settings: An application to environmental economics. Journal of the Association of Environmental and Resource Economists, 8(2):235–275.

Negi, A. and Wooldridge, J. M. (2021). Revisiting regression adjustment in experiments with heterogeneous treatment effects. Econometric Reviews, 40(5):504–534.

Neumark, D., Zhang, J., and Ciccarella, S. (2008). The effects of Wal-Mart on local labor markets. Journal of Urban Economics, 63(2):405–430.

Rambachan, A. and Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5):2555–2591.

Sant'Anna, P. H. and Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1):101–122.

Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects.
Journal of Econometrics, 225(2):175–199.

Wooldridge, J. M. (2007). Inverse probability weighted estimation for general missing data problems. Journal of Econometrics, 141(2):1281–1301.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.

Wooldridge, J. M. (2021). Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators. Available at SSRN 3906345.

Wooldridge, J. M. (2023). Simple approaches to nonlinear difference-in-differences with panel data. The Econometrics Journal, 26(3):C31–C66.

APPENDIX 2A PROOF OF THEOREM 2.2

We modify the argument in Wooldridge (2021, Theorem 8.1). The $\hat{\tau}_r$ are obtained from regression (2.21). Because $\dot{Y}_{ir} = Y_{ir} - \bar{Y}_{i,pre}$, basic OLS algebra shows that all coefficients from (2.21) are obtained by differencing the coefficients from the two regressions
\[
\dot{Y}_{ir} \text{ on } 1, D_i, \dot{\mathbf{X}}_i, D_i \cdot \dot{\mathbf{X}}_i, \quad i = 1, 2, \ldots, N \tag{2A.1}
\]
\[
\bar{Y}_{i,pre} \text{ on } 1, D_i, \dot{\mathbf{X}}_i, D_i \cdot \dot{\mathbf{X}}_i, \quad i = 1, 2, \ldots, N \tag{2A.2}
\]
where $\dot{\mathbf{X}}_i = \mathbf{X}_i - \bar{\mathbf{X}}_1$ are the covariates demeaned using the treated units. In particular, letting $\hat{\rho}_r$ be the coefficient on $D_i$ from (2A.1) and $\hat{\rho}_{pre}$ the coefficient on $D_i$ from (2A.2),
\[
\hat{\tau}_r = \hat{\rho}_r - \hat{\rho}_{pre} \tag{2A.3}
\]
Note also that the coefficients on the "moderating" terms, $D_i \cdot \dot{\mathbf{X}}_i$, are also obtained by differencing across the two regressions.

To show (2A.3) is the same as the coefficient on $D_i \cdot f_{rt}$ in (2.24), first note that, by Wooldridge (2021, Theorem 3.2), we can drop $(f_{qt}, f_{qt} \cdot \dot{\mathbf{X}}_i)$ for $q < S$ without affecting the estimates. In other words, the $\tilde{\tau}_r$ are the coefficients on $D_i \cdot f_{rt}$ in the regression
\[
Y_{it} \text{ on } 1, \dot{\mathbf{X}}_i, D_i, D_i \cdot \dot{\mathbf{X}}_i, f_{St}, \ldots, f_{Tt}, f_{St} \cdot \dot{\mathbf{X}}_i, \ldots, f_{Tt} \cdot \dot{\mathbf{X}}_i, D_i \cdot f_{St}, \ldots, D_i \cdot f_{Tt}, D_i \cdot f_{St} \cdot \dot{\mathbf{X}}_i, \ldots, D_i \cdot f_{Tt} \cdot \dot{\mathbf{X}}_i \tag{2A.4}
\]
Now, define $\mathbf{H}_i \equiv (1, \dot{\mathbf{X}}_i, D_i, D_i \cdot \dot{\mathbf{X}}_i)$, a $1 \times 2(K + 1)$ vector, and
\[
\mathbf{L}_{it} \equiv (f_{St}, f_{St} \cdot \dot{\mathbf{X}}_i, D_i \cdot f_{St}, D_i \cdot f_{St} \cdot \dot{\mathbf{X}}_i, \ldots, f_{Tt}, f_{Tt} \cdot \dot{\mathbf{X}}_i, D_i \cdot f_{Tt}, D_i \cdot f_{Tt} \cdot \dot{\mathbf{X}}_i), \tag{2A.5}
\]
a row vector with $2(T - S + 1)(K + 1)$ elements. Note that for $q \ne r$, $(f_{qt}, f_{qt} \cdot \dot{\mathbf{X}}_i, D_i \cdot f_{qt}, D_i \cdot f_{qt} \cdot \dot{\mathbf{X}}_i)$ and $(f_{rt}, f_{rt} \cdot \dot{\mathbf{X}}_i, D_i \cdot f_{rt}, D_i \cdot f_{rt} \cdot \dot{\mathbf{X}}_i)$ are orthogonal in sample because $f_{qt} \cdot f_{rt} = 0$. The full set of regressors in (2A.4) is simply $(\mathbf{H}_i, \mathbf{L}_{it})$. With $p_t = f_{St} + \cdots + f_{Tt}$, the post-treatment period indicator, $(1 - p_t) f_{rt} = 0$, $r = S, S + 1, \ldots, T$, which means $(1 - p_t) \mathbf{L}_{it} = \mathbf{0}$. Therefore, the objective function underlying the regression in (2A.4) can be written as
\[
\min_{\theta, \delta} \; \sum_{i=1}^{N} \sum_{t=1}^{T} (1 - p_t)(Y_{it} - \mathbf{H}_i \theta)^2 + \sum_{i=1}^{N} \sum_{t=1}^{T} p_t (Y_{it} - \mathbf{H}_i \theta - \mathbf{L}_{it} \delta)^2 \tag{2A.6}
\]
Letting $\tilde{\theta}$ and $\tilde{\delta}$ denote the POLS estimators, the first order conditions are
\[
\sum_{i=1}^{N} \sum_{t=1}^{T} (1 - p_t) \mathbf{H}_i' (Y_{it} - \mathbf{H}_i \tilde{\theta}) + \sum_{i=1}^{N} \sum_{t=1}^{T} p_t \mathbf{H}_i' (Y_{it} - \mathbf{H}_i \tilde{\theta} - \mathbf{L}_{it} \tilde{\delta}) = 0 \tag{2A.7}
\]
\[
\sum_{i=1}^{N} \sum_{t=1}^{T} p_t \mathbf{L}_{it}' (Y_{it} - \mathbf{H}_i \tilde{\theta} - \mathbf{L}_{it} \tilde{\delta}) = 0 \tag{2A.8}
\]
Next, note that because $p_t = f_{St} + \cdots + f_{Tt}$, we can write
\[
p_t \mathbf{H}_i = [p_t, \; p_t \cdot D_i, \; p_t \cdot \dot{\mathbf{X}}_i, \; p_t \cdot D_i \cdot \dot{\mathbf{X}}_i] = \sum_{q=S}^{T} [f_{qt}, \; f_{qt} \cdot D_i, \; f_{qt} \cdot \dot{\mathbf{X}}_i, \; f_{qt} \cdot D_i \cdot \dot{\mathbf{X}}_i], \tag{2A.9}
\]
which is simply the sum of the subvectors in $\mathbf{L}_{it}$ corresponding to the different time periods. It follows that $p_t \mathbf{H}_i = p_t \mathbf{L}_{it} \mathbf{A}$ for a $2(T - S + 1)(K + 1) \times 2(K + 1)$ matrix $\mathbf{A}$.
Plugging this into (2A.7) gives
\[
\sum_{i=1}^{N} \sum_{t=1}^{T} (1 - p_t) \mathbf{H}_i' (Y_{it} - \mathbf{H}_i \tilde{\theta}) + \mathbf{A}' \sum_{i=1}^{N} \sum_{t=1}^{T} p_t \mathbf{L}_{it}' (Y_{it} - \mathbf{H}_i \tilde{\theta} - \mathbf{L}_{it} \tilde{\delta}) = 0 \tag{2A.10}
\]
Along with (2A.8), (2A.10) implies that the FOCs for $(\tilde{\theta}, \tilde{\delta})$ reduce to
\[
\sum_{i=1}^{N} \sum_{t=1}^{T} (1 - p_t) \mathbf{H}_i' (Y_{it} - \mathbf{H}_i \tilde{\theta}) = 0 \tag{2A.11}
\]
But (2A.11) means that $\tilde{\theta}$ is the OLS estimator from the regression of $Y_{it}$ on $\mathbf{H}_i$ using the pre-treatment period observations. With $\mathbf{H}_i$ not varying over time, $\tilde{\theta}$ is the same as from the cross-sectional regression
\[
\bar{Y}_{i,pre} \text{ on } 1, D_i, \dot{\mathbf{X}}_i, D_i \cdot \dot{\mathbf{X}}_i \tag{2A.12}
\]
In particular, the coefficient on $D_i$ is precisely $\hat{\rho}_{pre}$ in (2A.3). Next, the FOC in (2A.8) shows that $\tilde{\delta}$ is from a POLS regression using the post-treatment periods:
\[
Y_{it} - \mathbf{H}_i \tilde{\theta} \text{ on } \mathbf{L}_{it}, \quad t = S, \ldots, T; \; i = 1, \ldots, N
\]
By the definition of $\mathbf{L}_{it}$ and the orthogonality of its elements across the subvectors representing the different time periods, each $2(K + 1)$ subvector, $\tilde{\delta}_r$, $r = S, \ldots, T$, is obtained from a separate cross-sectional regression for each post-treatment period. Namely, because $f_{rr} = 1$, the regression for period $r$ is
\[
Y_{ir} - \mathbf{H}_i \tilde{\theta} \text{ on } 1, D_i, \dot{\mathbf{X}}_i, D_i \cdot \dot{\mathbf{X}}_i, \quad i = 1, \ldots, N \tag{2A.13}
\]
The vector on the right is simply $\mathbf{H}_i$, and so
\[
\tilde{\delta}_r = \left( \sum_{i=1}^{N} \mathbf{H}_i' \mathbf{H}_i \right)^{-1} \left( \sum_{i=1}^{N} \mathbf{H}_i' Y_{ir} \right) - \tilde{\theta} \tag{2A.14}
\]
The first term gives the regression coefficients from $Y_{ir}$ on $1, D_i, \dot{\mathbf{X}}_i, D_i \cdot \dot{\mathbf{X}}_i$, $i = 1, \ldots, N$, whose coefficient on $D_i$ is $\hat{\rho}_r$ from (2A.3). We have shown that the coefficient corresponding to $D_i \cdot f_{rt}$ in the regression (2A.4) is $\hat{\rho}_r - \hat{\rho}_{pre}$, which establishes the equivalence of the pooled OLS estimator across all time periods, (2.24), and the cross-sectional OLS estimators using the transformed variable $\dot{Y}_{ir}$ for each $r \in \{S, \ldots, T\}$ separately. Essentially the same argument shows that the coefficients on the interaction terms $D_i \cdot f_{rt} \cdot \dot{\mathbf{X}}_i$ in (2.24) are the same as the coefficients on $D_i \cdot \dot{\mathbf{X}}_i$ in (2.21). □

APPENDIX 2B PROOF OF THE DETRENDING PROCEDURE

For a cohort $g$, where we require $g \ge 3$ so there are at least two pre-treatment periods, define the $(g - 1) \times 2$ matrix
\[
\mathbf{J}_{g-1} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & g - 1 \end{pmatrix} \tag{2B.1}
\]
and let $\mathbf{Y}_{i,g-1}(\infty)$ be the $(g - 1) \times 1$ vector
\[
\mathbf{Y}_{i,g-1}(\infty) = \left[ Y_{i1}(\infty), \ldots, Y_{i,g-1}(\infty) \right]'. \tag{2B.2}
\]
A similar definition holds for $\mathbf{U}_{i,g-1}(\infty)$. Also, let $\mathbf{M}_{i,g-1}$ be the $(g - 1) \times 1$ vector with elements $m_t(\mathbf{X}_i)$, $t = 1, \ldots, g - 1$.
Note that we can write
\[
\mathbf{Y}_{i,g-1}(\infty) = \mathbf{J}_{g-1} \begin{pmatrix} h(\mathbf{D}_i, \mathbf{X}_i) \\ \mathbf{D}_i \eta \end{pmatrix} + \mathbf{M}_{i,g-1} + \mathbf{U}_{i,g-1}(\infty) \equiv \mathbf{J}_{g-1} \mathbf{Q}_i + \mathbf{M}_{i,g-1} + \mathbf{U}_{i,g-1}(\infty) \tag{2B.3}
\]
Now regress $\mathbf{Y}_{i,g-1}(\infty)$ on $\mathbf{J}_{g-1}$ and obtain the coefficients,
\[
\hat{\mathbf{B}}_{i,g-1} = \left( \mathbf{J}_{g-1}' \mathbf{J}_{g-1} \right)^{-1} \mathbf{J}_{g-1}' \mathbf{Y}_{i,g-1}(\infty) = \mathbf{Q}_i + \left( \mathbf{J}_{g-1}' \mathbf{J}_{g-1} \right)^{-1} \mathbf{J}_{g-1}' \left[ \mathbf{M}_{i,g-1} + \mathbf{U}_{i,g-1}(\infty) \right] \tag{2B.4}
\]
For $r \ge g$, the prediction of $Y_{ir}(\infty)$ using the unit-specific linear trend up through period $g - 1$ is
\[
\hat{Y}_{irg}(\infty) \equiv (1, r) \hat{\mathbf{B}}_{i,g-1} \tag{2B.5}
\]
Use these predicted values to detrend $Y_{ir}(\infty)$:
\[
\begin{aligned}
\ddot{Y}_{irg}(\infty) &\equiv Y_{ir}(\infty) - \hat{Y}_{irg}(\infty) = Y_{ir}(\infty) - (1, r) \hat{\mathbf{B}}_{i,g-1} \\
&= (1, r) \mathbf{Q}_i + m_r(\mathbf{X}_i) + U_{ir}(\infty) - (1, r) \left\{ \mathbf{Q}_i + \left( \mathbf{J}_{g-1}' \mathbf{J}_{g-1} \right)^{-1} \mathbf{J}_{g-1}' \left[ \mathbf{M}_{i,g-1} + \mathbf{U}_{i,g-1}(\infty) \right] \right\} \\
&= m_r(\mathbf{X}_i) + U_{ir}(\infty) - (1, r) \left\{ \left( \mathbf{J}_{g-1}' \mathbf{J}_{g-1} \right)^{-1} \mathbf{J}_{g-1}' \left[ \mathbf{M}_{i,g-1} + \mathbf{U}_{i,g-1}(\infty) \right] \right\}
\end{aligned} \tag{2B.6}
\]
The expression in (2B.6) shows that $\ddot{Y}_{irg}(\infty)$ does not depend on $\mathbf{D}_i$. In particular,
\[
E\left[ \ddot{Y}_{irg}(\infty) \mid \mathbf{D}_i, \mathbf{X}_i \right] = E\left[ \ddot{Y}_{irg}(\infty) \mid \mathbf{X}_i \right] = m_r(\mathbf{X}_i) - (1, r) \left( \mathbf{J}_{g-1}' \mathbf{J}_{g-1} \right)^{-1} \mathbf{J}_{g-1}' \mathbf{M}_{i,g-1} \tag{2B.7}
\]
This conclusion is practically important because it shows that the vector of treatment cohort indicators, $\mathbf{D}_i$, is unconfounded with respect to the detrended variable $\ddot{Y}_{irg}(\infty)$, conditional on $\mathbf{X}_i$. Note how this extends the argument in Section 2.4 where, instead of $\mathbf{J}_{g-1}$ having rows $(1, t)$, its rows simply consisted of unity.

Now the modification to the arguments in Section 2.4 is straightforward. In place of (2.28) we have
\[
Y_{ir}(g) - Y_{ir}(\infty) = \ddot{Y}_{irg}(g) - \ddot{Y}_{irg}(\infty) + \left[ \hat{Y}_{irg}(g) - \hat{Y}_{irg}(\infty) \right], \tag{2B.8}
\]
where $\ddot{Y}_{irg}(g) \equiv Y_{ir}(g) - \hat{Y}_{irg}(g)$, with $\hat{Y}_{irg}(g)$ the predicted value from (2B.5) but with $\mathbf{Y}_{i,g-1}(g)$ and $Y_{ir}(g)$ in place of $\mathbf{Y}_{i,g-1}(\infty)$ and $Y_{ir}(\infty)$. Now take the expectation conditional on $D_{ig} = 1$:
\[
\tau_{gr} = E\left[ Y_{ir}(g) - Y_{ir}(\infty) \mid D_{ig} = 1 \right] = E\left[ \ddot{Y}_{irg}(g) - \ddot{Y}_{irg}(\infty) \mid D_{ig} = 1 \right] + E\left[ \hat{Y}_{irg}(g) - \hat{Y}_{irg}(\infty) \mid D_{ig} = 1 \right] \tag{2B.9}
\]
The second term in (2B.9) is zero by no anticipation because $\hat{Y}_{irg}(g)$ and $\hat{Y}_{irg}(\infty)$ are the same linear functions of the potential outcomes in periods $\{1, 2, \ldots, g - 1\}$. Therefore,
\[
\tau_{gr} = E\left[ \ddot{Y}_{irg}(g) - \ddot{Y}_{irg}(\infty) \mid D_{ig} = 1 \right] \tag{2B.10}
\]

2B.1 Monte Carlo Simulations

In this supplementary section, we conduct Monte Carlo simulations to study the finite-sample properties of our proposed estimators and compare them with competing approaches. We evaluate the performance of five estimators. The first is the POLS/ETWFE estimator in Wooldridge (2021), which is efficient under a commonly imposed set of assumptions (but is not doubly robust). Next are three of our rolling estimators: regression adjustment (RA), inverse-probability-weighted regression adjustment (IPWRA), and propensity score matching (PSM). The final estimator is that of CS (2021), who apply the augmented IPW (AIPW) estimator, a doubly robust estimator different from IPWRA. We use the never treated group as the control in CS (2021), whose transformation is in equation (2.38).

2B.2 Common Timing Case

We consider the common timing case. Recall from Theorem 2.1 that the POLS method in Wooldridge (2021) and regression adjustment using our rolling method are the same in the common timing case. Therefore, we have four estimators in the simulations.
We assume that Assumptions NA and CPTC hold, along with overlap. However, we consider scenarios where the functional forms of the conditional means and of the propensity score can be misspecified. For each scenario, we use $T = 6$ with the first treatment in $S = 4$, which implies three post-treatment periods. Across Monte Carlo simulations we draw random samples of sizes 100, 500, and 1,000. We report the bias, Monte Carlo standard deviation (SD), and root mean squared error (RMSE) of each estimator. All simulations use 1,000 Monte Carlo replications.

Data Generation

We generate the data as follows. Two control variables are included: $\mathbf{X} = (X_1, X_2)$, where $X_1$ and $X_2$ are independent with $X_1 \sim Gamma(2, 2)$ [and so $E(X_1) = 4$] and $X_2 \sim Bernoulli(0.6)$. The treatment indicator, $D$, has propensity score
\[
p(\mathbf{x}) = P(D = 1 \mid \mathbf{X} = \mathbf{x}) = \frac{\exp(\mathbf{Z}_1 \gamma_1)}{1 + \exp(\mathbf{Z}_1 \gamma_1)} \tag{2B.11}
\]
where the propensity score index function, $\mathbf{Z}_1 \gamma_1$, is
\[
\mathbf{Z}_1 \gamma_1 = -1.2 + \frac{(X_1 - 4)}{2} - X_2 \tag{2B.12}
\]
Our second step is generating heterogeneous treatment effects as follows:
\[
\tau_r(\mathbf{X}) = \theta \cdot \sum_{r=S}^{T} (r - S + 1)^{-1} + \lambda_r \cdot h(\mathbf{X}), \quad r \in \{S, \ldots, T\}, \tag{2B.13}
\]
where $\theta = T - S + 1$ and $\lambda_r$ is a time-varying parameter, set as $(\lambda_S, \ldots, \lambda_T) = (0.5, 0.6, 1.0)$ in each simulation. This setup allows the dynamic effects of being treated to vary across time and to increase as the length of exposure to the treatment increases. We consider two different functional forms of $h(\mathbf{X})$ in the simulations. The first is
\[
h(\mathbf{X}) = \frac{(X_1 - 4)}{2} + \frac{X_2}{3} \tag{2B.14}
\]
In the second, $h(\mathbf{X})$ includes a quadratic in $X_1$ and an interaction between $X_1$ and $X_2$:
\[
h(\mathbf{X}) = \frac{(X_1 - 4)}{2} + \frac{X_2}{3} + \frac{(X_1 - 4)^2}{4} + \frac{(X_1 - 4) \cdot X_2}{2} \tag{2B.15}
\]
We generate the potential outcomes in the untreated state as
\[
Y_t(0) = \delta_t + C + \beta_t \cdot f(\mathbf{X}) + U_t(0), \tag{2B.16}
\]
where $\delta_t = t$ is a time-specific component, $C \mid D, \mathbf{X} \sim Normal(2, 1)$ is an individual-specific component, and $U_t(0) \mid D, \mathbf{X} \sim Normal(0, 4)$ is the time-varying shock. The time-varying $\beta_t$ allows the effect of the covariates on potential outcome paths to vary across time. For each simulation, the parameters are fixed as below:
\[
\beta' = (\beta_1, \beta_2, \ldots, \beta_T) = (1.0, 1.5, 0.8, 1.5, 2, 2.5) \tag{2B.17}
\]
We consider two functional forms for $f(\mathbf{X})$:
\[
f(\mathbf{X}) = \frac{(X_1 - 4)}{3} + \frac{X_2}{2} \tag{2B.18}
\]
and
\[
f(\mathbf{X}) = \frac{(X_1 - 4)}{3} + \frac{X_2}{2} + \frac{(X_1 - 4)^2}{3} + \frac{(X_1 - 4) \cdot X_2}{4} \tag{2B.19}
\]
Finally, the post-treatment period outcome in the treated state is generated as
\[
Y_t(1) = \begin{cases} Y_t(0), & t < S \\ Y_t(0) + \tau_t + U_t(1) - U_t(0), & t \ge S \end{cases}
\]
where $U_t(1) \mid D, \mathbf{X} \sim Normal(0, 4)$.

For each simulation, all estimators that involve estimating the conditional means of $Y_t$ assume the correct model is linear in $\mathbf{X}$. Therefore, when (2B.14) and (2B.18) are used in the simulations, the conditional mean is correctly specified. However, when the data are generated as in (2B.15) and (2B.19), a quadratic term $X_1^2$ and an interaction term $X_1 \cdot X_2$ are present, and so the conditional mean is misspecified. We also consider a case where $(X_1 - 4)^2/2$ is added to the index function in the propensity score, so that the estimated logit model, which is always estimated assuming an index linear in $X_1$ and $X_2$, is misspecified:
\[
\mathbf{Z}_2 \gamma_2 = -1.2 + \frac{(X_1 - 4)}{2} - X_2 + \frac{(X_1 - 4)^2}{2} \tag{2B.20}
\]
Table 2B.1 describes the basic setup for each scenario.

Table 2B.1 Scenarios with Common Timing

Scenario | Conditional mean correctly specified? | $h(\mathbf{X})$ | $f(\mathbf{X})$ | Propensity score correctly specified? | PS index function
1C | Yes | (B.4) | (B.8) | Yes | (B.3)
2C | Yes | (B.4) | (B.8) | No | (B.10)
3C | No | (B.5) | (B.9) | Yes | (B.3)
4C | No | (B.5) | (B.9) | No | (B.10)
Simulation Results

This section reports the simulation results for two of the scenarios listed in Table 2B.1: Scenarios 1C and 3C.

The results for Scenario 1C are shown in Table 2B.2. Here the conditional means and the propensity score are correctly specified, so we expect all estimators to have little bias. This is indeed the case, with the biases being trivial as a percentage of the effect sizes. The biases are small even when $N = 100$. Because the POLS estimator, which is the same as RA on the transformed outcome, is best linear unbiased, it is not surprising that it produces notably smaller standard deviations than PSM and CS (2021). For example, with $N = 500$ and for $\tau_4$, the PSM SD is about 37 percent higher than the POLS/RA SD and the CS SD is about 25 percent higher. Because POLS/RA averages all three pre-treatment periods, its main competitor is the doubly robust IPWRA estimator, which also averages the three pre-treatment periods. When $N = 500$, the rolling IPWRA estimator has SDs that are, at most, three percent higher than those for POLS. In terms of RMSE, the POLS estimator is uniformly better in Table 2B.2; again, this is not surprising because POLS is the BLUE. When $N = 1{,}000$, the rolling IPWRA estimator has RMSEs that are just slightly larger than those of POLS. For example, the RMSE of rolling IPWRA for $\tau_6$ is 0.399, slightly higher than that of the POLS/RA estimator, 0.379.

Table 2B.2 Scenario 1C: When $E(Y_t \mid \mathbf{X} = \mathbf{x})$ and $p(\mathbf{x})$ Are Correctly Specified

N = 100 (sample ATTs: $\tau_4$ = 3.326, $\tau_5$ = 4.800, $\tau_6$ = 5.858)
Estimator | $\tau_4$ Bias / SD / RMSE | $\tau_5$ Bias / SD / RMSE | $\tau_6$ Bias / SD / RMSE
POLS/RA | −0.002 / 1.241 / 1.241 | 0.006 / 1.220 / 1.220 | 0.036 / 1.285 / 1.285
PSM | 0.020 / 1.784 / 1.784 | 0.130 / 1.803 / 1.807 | 0.195 / 1.820 / 1.831
IPWRA | −0.014 / 1.318 / 1.318 | 0.018 / 1.352 / 1.352 | 0.046 / 1.380 / 1.381
CS (2021) | 0.015 / 1.534 / 1.534 | 0.036 / 1.554 / 1.554 | 0.065 / 1.576 / 1.577

N = 500 (sample ATTs: $\tau_4$ = 3.218, $\tau_5$ = 4.809, $\tau_6$ = 5.992)
POLS/RA | 0.008 / 0.541 / 0.541 | −0.036 / 0.537 / 0.538 | −0.010 / 0.552 / 0.552
PSM | 0.002 / 0.893 / 0.893 | 0.001 / 0.931 / 0.931 | 0.084 / 0.939 / 0.943
IPWRA | 0.009 / 0.566 / 0.566 | −0.034 / 0.562 / 0.563 | −0.009 / 0.579 / 0.579
CS (2021) | 0.011 / 0.662 / 0.662 | −0.033 / 0.659 / 0.660 | −0.009 / 0.684 / 0.684

N = 1,000 (sample ATTs: $\tau_4$ = 3.220, $\tau_5$ = 4.802, $\tau_6$ = 5.959)
POLS/RA | 0.006 / 0.375 / 0.375 | 0.009 / 0.382 / 0.382 | 0.020 / 0.378 / 0.379
PSM | 0.023 / 0.710 / 0.710 | 0.055 / 0.673 / 0.676 | 0.101 / 0.679 / 0.686
IPWRA | 0.007 / 0.395 / 0.395 | 0.007 / 0.411 / 0.411 | 0.021 / 0.398 / 0.399
CS (2021) | −0.008 / 0.474 / 0.474 | −0.006 / 0.486 / 0.486 | 0.007 / 0.476 / 0.476

Note 1: The population $R$-squared values are about 0.39, 0.36, and 0.36, respectively.
Note 2: The average propensity score is about 0.26.

Table 2B.3 reports the findings for Scenario 3C, where now the conditional means are misspecified because the linear regressions omit the terms $X_1^2$ and $X_1 \cdot X_2$. Because the propensity score is correctly specified in this scenario, POLS is the only estimator that, theoretically, will exhibit systematic bias. The rolling IPWRA and CS (2021) approaches are still consistent, as is PSM. In these simulations, the POLS estimator does have the most bias, although it is fairly small as a fraction of the effect sizes. For instance, for $N = 1{,}000$, the bias of the POLS/RA estimator for $\tau_6$ is 0.154, which is at least five times larger (in absolute value) than those of the PSM, IPWRA, and CS (2021) estimators: 0.032, −0.002, and −0.008, respectively. Nevertheless, 0.154 is still small as a percentage of the effect size, 6.361.
Table 2B.3 Scenario 3C: When $E(Y_t \mid \mathbf{X} = \mathbf{x})$ Is Misspecified and $p(\mathbf{x})$ Is Correctly Specified

N = 100 (sample ATTs: $\tau_4$ = 3.550, $\tau_5$ = 4.975, $\tau_6$ = 6.295)
Estimator | $\tau_4$ Bias / SD / RMSE | $\tau_5$ Bias / SD / RMSE | $\tau_6$ Bias / SD / RMSE
POLS/RA | −0.034 / 1.412 / 1.413 | 0.104 / 1.406 / 1.410 | 0.222 / 1.568 / 1.583
PSM | −0.060 / 1.797 / 1.798 | 0.071 / 1.867 / 1.868 | 0.054 / 1.966 / 1.967
IPWRA | −0.099 / 1.431 / 1.434 | 0.025 / 1.433 / 1.433 | 0.071 / 1.561 / 1.563
CS (2021) | −0.100 / 1.751 / 1.753 | 0.009 / 1.734 / 1.734 | 0.053 / 1.863 / 1.864

N = 500 (sample ATTs: $\tau_4$ = 3.418, $\tau_5$ = 5.053, $\tau_6$ = 6.356)
POLS/RA | 0.079 / 0.608 / 0.614 | 0.085 / 0.613 / 0.619 | 0.197 / 0.670 / 0.699
PSM | 0.032 / 0.827 / 0.828 | −0.015 / 0.863 / 0.863 | 0.081 / 0.878 / 0.882
IPWRA | 0.034 / 0.618 / 0.619 | −0.013 / 0.619 / 0.619 | 0.047 / 0.665 / 0.666
CS (2021) | 0.055 / 0.764 / 0.766 | 0.006 / 0.763 / 0.763 | 0.062 / 0.830 / 0.833

N = 1,000 (sample ATTs: $\tau_4$ = 3.440, $\tau_5$ = 5.017, $\tau_6$ = 6.361)
POLS/RA | 0.044 / 0.402 / 0.404 | 0.091 / 0.426 / 0.436 | 0.154 / 0.452 / 0.477
PSM | 0.031 / 0.569 / 0.570 | 0.017 / 0.576 / 0.577 | 0.032 / 0.616 / 0.617
IPWRA | 0.000 / 0.408 / 0.408 | −0.011 / 0.423 / 0.423 | −0.002 / 0.448 / 0.448
CS (2021) | −0.003 / 0.513 / 0.513 | −0.015 / 0.523 / 0.523 | −0.008 / 0.559 / 0.559

Note 1: The population $R$-squared values are about 0.41, 0.38, and 0.38, respectively.
Note 2: The average propensity score is about 0.17.

In some cases, the smaller SD of POLS gives it the smallest RMSE even when it is biased. Among the consistent estimators, rolling IPWRA is the most efficient, and in several cases the rolling IPWRA estimator has the smallest RMSE overall. For example, the RMSEs for $\hat{\tau}_6$ with $N = 1{,}000$ are 0.477, 0.617, 0.448, and 0.559 for POLS, PSM, IPWRA, and CS (2021), respectively. Our application of IPWRA to the transformed outcome that uses all pre-treatment periods not only reduces bias compared with POLS, but also largely preserves the efficiency of the POLS estimator.

APPENDIX 2C PRE-PERIOD DYNAMICS AND EVENT STUDY PLOTS

This section describes how we compute transformed outcomes in the pre-treatment periods to estimate pre-trend treatment effects and plot them in the event study graph.

2C.1 Demeaning for Pre-treatment Periods, $\dot{Y}_{itg}$

For each unit $i$, group $g$, and pre-treatment time period $t \in \{1, 2, \ldots, g - 1\}$, we define
\[
\dot{Y}_{itg} \equiv Y_{it} - \frac{1}{g - t - 1} \sum_{q=t+1}^{g-1} Y_{iq} \tag{2C.1}
\]
This transformation removes the average of the future pre-treatment outcomes (from $t + 1$ to $g - 1$), helping to anchor pre-treatment dynamics relative to a consistent future baseline, $t = g - 1$. This is in the spirit of a "rolling" transformation. For example, to obtain the transformed outcome at time 2 for group 6, $\dot{Y}_{i26}$, we subtract the average of the future pre-treatment outcomes ($q \in \{3, 4, 5\}$) from the outcome at $t = 2$. The final pre-treatment period, $g - 1$, serves as the reference point, yielding a value of zero and anchoring the dynamics accordingly.

2C.2 Detrending for Pre-treatment Periods, $\ddot{Y}_{itg}$

To obtain the detrended outcome variable, we follow the same logic as in Appendix 2C.1, but now anchor the two most recent pre-treatment periods, $g - 2$ and $g - 1$, since at least two periods are necessary to estimate the fitted value $\hat{Y}_{itg}$. Formally, for pre-treatment periods $t \in \{1, 2, \ldots, g - 3\}$, the detrended outcome is defined as
\[
\ddot{Y}_{itg} \equiv Y_{it} - \hat{Y}_{itg} \tag{2C.2}
\]
where $\hat{Y}_{itg}$ is the fitted value from regressing $Y_{iq}$ on $q$ using the future pre-treatment periods $q \in \{t + 1, \ldots, g - 1\}$. This procedure removes potential linear (or more general) trends in the pre-treatment outcome trajectory, yielding a detrended outcome series. A small sketch of both pre-period transformations follows.
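As an illustration, here is a minimal sketch of the pre-period transformations (2C.1) and (2C.2) for a single unit, with hypothetical argument names; it is a sketch under the stated conventions, not the code used for the reported estimates.

```python
import numpy as np

def pre_period_transform(y, g, detrend=False):
    """Transform pre-treatment outcomes for one unit in cohort g.
    y[t-1] holds Y_it for t = 1, ..., T. Demeaning (2C.1) needs at least
    one future pre-period; detrending (2C.2) needs at least two."""
    need = 2 if detrend else 1
    out = {}
    for t in range(1, g - need):            # t = 1, ..., g-2 (or g-3)
        future = np.arange(t + 1, g)        # future pre-periods t+1, ..., g-1
        if detrend:
            slope, intercept = np.polyfit(future, y[future - 1], deg=1)
            out[t] = y[t - 1] - (intercept + slope * t)   # (2C.2)
        else:
            out[t] = y[t - 1] - y[future - 1].mean()      # (2C.1)
    return out
```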
For valid extrapolation, at least two future pre-treatment periods are required (i.e., $g - t - 1 \ge 2$).

2C.3 Estimation

Once the transformed outcome variables are constructed, the estimation procedure follows the same steps as Procedures 3.1, 4.1, and 5.1. For identification, we dynamically define the control group based on whether pre-treatment period $t$ for group $g$ precedes or follows the first treatment period $S$. Specifically, if $t < S$, the control group includes all not-yet-treated and never-treated units. If $t \ge S$, the control group consists only of units treated at $t + 1$ or later, as well as never-treated units. This staggered adoption rule ensures that the comparison group remains untreated at each point of comparison.

The weighted ATTs in the pre-treatment periods capture the pre-treatment effects. Under the no anticipation assumption, we expect $WATT(r)$ for $r < 0$ to be approximately zero. Significant deviations may indicate dynamic selection or violations of the identifying assumptions. When using the demeaning transformation, the pre-treatment analysis applies to periods $r \le -2$, since the transformation anchors outcomes at $t = g - 1$ (i.e., $r = -1$). Similarly, with the detrending transformation, the analysis applies to periods $r \le -3$.

APPENDIX 2D ESTIMATION RESULTS

In this section, Table 2D.1 reports the descriptive statistics of the variables used in our analysis.

Table 2D.1 Descriptive Statistics of Variables

Variable | Obs | Mean | Std. Dev. | Min | Max
log(Retail Emp) | 29,440 | 7.754502 | 1.281587 | |
log(Wholesale Emp) | 29,440 | 6.413699 | 1.482646 | |
Share of population above poverty | 29,440 | .8470385 | .0619999 | |
Share of population in manufacturing | 29,440 | .0998018 | .0501518 | |
Share of population with high school degree | 29,440 | .092258 | .0256816 | |
Treated cohort (1,280 counties) | | | | 1986 | 1999

Table 2D.2 presents estimation results corresponding to the visual representation in Figure 2.2, the retail labor market case. We estimate standard errors of the weighted ATTs with the bootstrap, using 100 repetitions. The first three columns display estimates of the weighted ATTs from the ETWFE (Wooldridge, 2021), CS (2021), and our rolling IPWRA estimator (without unit-specific detrending), each of which is biased under violations of the parallel trends assumption. The fourth and fifth columns present the point estimates from our rolling RA and rolling IPWRA estimators, which appropriately account for potential heterogeneous linear trends.
Table 2D.2 Effects of Walmart Opening on log(Retail Employment), ATT($r$) with bootstrap SE in parentheses

r | ETWFE | CS (2021) | Rolling IPWRA | Rolling RA, het. trend | Rolling IPWRA, het. trend
0 | 0.041 (0.006) | 0.023 (0.003) | 0.018 (0.004) | 0.002 (0.004) | 0.007 (0.004)
1 | 0.073 (0.007) | 0.054 (0.004) | 0.045 (0.004) | 0.035 (0.005) | 0.032 (0.005)
2 | 0.073 (0.008) | 0.054 (0.005) | 0.038 (0.004) | 0.029 (0.006) | 0.025 (0.006)
3 | 0.075 (0.009) | 0.055 (0.006) | 0.032 (0.004) | 0.014 (0.007) | 0.021 (0.007)
4 | 0.081 (0.010) | 0.059 (0.007) | 0.031 (0.004) | 0.019 (0.009) | 0.018 (0.009)
5 | 0.091 (0.011) | 0.067 (0.008) | 0.036 (0.005) | 0.015 (0.010) | 0.017 (0.010)
6 | 0.101 (0.012) | 0.077 (0.009) | 0.040 (0.005) | 0.022 (0.012) | 0.019 (0.012)
7 | 0.119 (0.013) | 0.096 (0.010) | 0.054 (0.006) | 0.005 (0.013) | 0.036 (0.013)
8 | 0.132 (0.014) | 0.110 (0.011) | 0.062 (0.008) | 0.016 (0.016) | 0.041 (0.016)
9 | 0.137 (0.016) | 0.120 (0.013) | 0.063 (0.010) | 0.015 (0.019) | 0.041 (0.019)
10 | 0.158 (0.019) | 0.138 (0.015) | 0.081 (0.013) | 0.032 (0.023) | 0.037 (0.023)
11 | 0.166 (0.023) | 0.146 (0.018) | 0.083 (0.018) | 0.007 (0.031) | 0.018 (0.030)
12 | 0.167 (0.030) | 0.153 (0.023) | 0.080 (0.026) | −0.008 (0.036) | 0.017 (0.036)
13 | 0.206 (0.043) | 0.191 (0.032) | 0.107 (0.039) | 0.046 (0.054) | 0.047 (0.053)

2D.1 Wholesale Employment

In this section, we report the estimation results for county-level wholesale employment. Figure 2D.1 presents the estimates from the CS (2021) approach and our rolling IPWRA estimators.

Figure 2D.1 Effects of Walmart Opening on log(Wholesale Employment): Panel (a) CS approach; Panel (b) rolling IPWRA with unit-specific demeaning; Panel (c) rolling IPWRA with unit-specific detrending

In the top panels of Figure 2D.1, the weighted ATT estimates from the CS (2021) approach and the rolling IPWRA estimator (without unit-specific detrending) mirror the retail employment results, showing an upward pre-treatment trend. In contrast, panel (c) of Figure 2D.1 displays almost flat pre-treatment trends, indicating no significant pre-treatment effects. Additionally, it is noteworthy that the effects of Walmart entry on wholesale employment turn negative when the rolling IPWRA estimator is applied after removing county-specific linear trends. This reversal in the sign of the coefficients aligns with the findings of Brown and Butts (2023), who use factor models to relax the parallel trends assumption. Table 2D.3 shows estimates in line with Figure 2D.1.
Table 2D.3 Effects of Walmart Opening on log(Wholesale Employment), ATT($r$) with bootstrap SE in parentheses

r | ETWFE | CS (2021) | Rolling IPWRA | Rolling RA, het. trend | Rolling IPWRA, het. trend
0 | 0.041 (0.011) | 0.008 (0.007) | 0.031 (0.007) | −0.009 (0.008) | −0.003 (0.008)
1 | 0.032 (0.012) | −0.002 (0.008) | 0.018 (0.007) | −0.018 (0.012) | −0.019 (0.012)
2 | 0.030 (0.014) | −0.005 (0.011) | 0.014 (0.007) | −0.035 (0.013) | −0.031 (0.014)
3 | 0.027 (0.016) | −0.009 (0.013) | 0.007 (0.008) | −0.052 (0.016) | −0.043 (0.017)
4 | 0.031 (0.018) | −0.007 (0.015) | 0.008 (0.009) | −0.021 (0.021) | −0.051 (0.021)
5 | 0.039 (0.020) | 0.004 (0.019) | 0.015 (0.010) | −0.071 (0.023) | −0.051 (0.024)
6 | 0.047 (0.022) | 0.013 (0.021) | 0.019 (0.011) | −0.038 (0.027) | −0.055 (0.028)
7 | 0.047 (0.024) | 0.016 (0.024) | 0.015 (0.014) | −0.096 (0.033) | −0.052 (0.034)
8 | 0.052 (0.026) | 0.019 (0.026) | 0.015 (0.016) | −0.043 (0.037) | −0.068 (0.038)
9 | 0.058 (0.029) | 0.028 (0.028) | 0.016 (0.019) | −0.120 (0.043) | −0.065 (0.044)
10 | 0.109 (0.034) | 0.073 (0.034) | 0.063 (0.023) | −0.045 (0.049) | −0.058 (0.049)
11 | 0.113 (0.040) | 0.062 (0.037) | 0.057 (0.031) | −0.149 (0.059) | −0.099 (0.058)
12 | 0.098 (0.049) | 0.057 (0.043) | 0.034 (0.040) | −0.239 (0.067) | −0.125 (0.067)
13 | 0.130 (0.062) | 0.112 (0.051) | 0.055 (0.060) | −0.219 (0.075) | −0.074 (0.078)

Our rolling estimators with county-level detrending account for heterogeneous linear trends in a straightforward manner, allowing the application of various treatment effect estimators. This underscores the importance of capturing variation in trends across counties, variation that may not be captured by observed covariates but can significantly influence estimation results. Estimators that do not explicitly account for heterogeneous trends yield substantially different results, which are less convincing as causal estimates.

CHAPTER 3 SIMPLE APPROACHES TO INFERENCE WITH DIFFERENCE-IN-DIFFERENCES ESTIMATORS WITH SMALL CROSS-SECTIONAL SAMPLE SIZES (CO-AUTHORED WITH JEFFREY M. WOOLDRIDGE)

3.1 Introduction

Difference-in-differences methods with panel data have a long history in the evaluation of policy interventions. In the case of common intervention timing, various methods are available for allowing treatment effects to vary by treatment period as well as by control variables. Wooldridge (2021, 2025) shows how to estimate flexible linear models by pooled OLS or fixed effects (they are equivalent) in a way that allows straightforward inference, provided there are enough cross-sectional units and the data are independently sampled across units. With a relatively small time series sample size ($T$) and large enough numbers of control ($N_0$) and treated ($N_1$) units, standard errors robust to arbitrary serial correlation and heteroskedasticity are readily available. Other methods, such as the Callaway and Sant'Anna (2021) long differencing methods, also rely on large cross-sectional sample sizes for inference.

As shown in Lee and Wooldridge (2023) [LW (2023)], regression-based difference-in-differences (DiD) estimators can be obtained via cross-sectional regressions after simple time-series transformations of the data at the unit level. This characterization of DiD estimators permits application of many widely used treatment effects estimators in the DiD setting, including matching, doubly robust estimators using the propensity score, and even machine learning causal methods. Here we use the representations in LW (2023) to obtain simple inference when the usual inference obtained from clustering at the individual unit may be problematical.
Our approach in the current paper combines the algebraic equivalence in LW (2023) with an idea from Donald and Lang (2007), who propose collapsing individual-level data to aggregated groups when the assignment of an intervention is at the group level. Here, the collapsing of the data is at the unit level across time in order to exploit the equivalences between panel data DiD estimators and those based on cross-sectional regressions. In addition to the standard DiD estimators, the approach extends easily to allow for unit-specific trends. Also, time-constant control variables can be added if the cross-sectional sample sizes are large enough to avoid degeneracies.

(The co-author has approved the inclusion of this co-authored chapter. The co-author's contact: Jeffrey M. Wooldridge, Department of Economics, 486 W. Circle Drive, 110 Marshall-Adams Hall, Michigan State University, East Lansing, MI 48824-1038. Email: wooldri1@msu.edu)

The method can also be useful in cases where $N_0$ and $N_1$ (the sizes of the control and treatment groups, respectively) are not particularly small but where the number of time series observations is reasonably large. In such cases, using the panel structure and clustering by cross-sectional unit to account for serial dependence can cause distortions in the inference. The approach we suggest here requires no modification for large $T$. In fact, because inference is based on a cross-sectional regression, we need not worry about strong dependence in the time series dimension.

Recently, Simonsohn (2021), commenting on Young (2019), argues that a particular version of the heteroskedasticity-robust standard error, proposed by Davidson and MacKinnon (1993) (and labeled "hc3" in the popular Stata statistical package), can produce satisfactory results even in Young's setting with $N_0 = 18$ and $N_1 = 2$. Therefore, in some cases one might feel comfortable using heteroskedasticity-robust inference.

A popular method for studying interventions with a small number of treated units (with one being the most common) is the synthetic control method (SCM), pioneered by Abadie et al. (2010), where pre-intervention data are used to obtain a weighted average of "donor" control units that serves as a single synthetic control. More recently, Arkhangelsky, Athey, Hirshberg, Imbens, and Wager (2021) propose a unification of traditional DiD and SC methods, called synthetic difference-in-differences (SDiD). The SDiD approach allows more flexibility than either DiD or SC, and inference methods are available. However, SDiD assumes that the series are weakly dependent (often called "integrated of order zero," or $I(0)$) and that $N_0$, $T_0$, and $T_1$ are suitably "large." The authors also impose normality in obtaining inference, although it is not clear how important that is in practice.

Compared with the SDiD approach, the method we propose in this paper has some advantages and disadvantages. The advantages are that one need not have a very large control group nor a large number of time periods. Because inference is based on a cross-sectional regression under the classical linear model assumptions, there are minimal restrictions on $N_0$, $N_1$, $T_0$, and $T_1$. Moreover, we can use a cross-sectional regression for each treated time period, allowing for different estimated effects along with confidence intervals. We can also aggregate to estimate a single effect, which is what the SDiD approach produces.
The SDiD approach does not explicitly allow for heterogeneous trends, something that is easily accommodated using the exact approach in this paper. The downside of the cross-sectional approach here is that it can be more biased than SDiD when differences in pre-trends are complicated, although SDiD is more biased for simple heterogeneous trends. Our cross-sectional approach can also be considerably less efficient than SDiD. We provide evidence of that via simulations. Nevertheless, sometimes SDiD will be less efficient. Moreover, the popular SDiD packages that produce the estimators and standard errors can produce confidence intervals that are too optimistic. We view the methods proposed in this paper as a complement to SDiD, providing another tool for estimating treatment effects when the number of treated or control units is small.

We start with the common timing case in Section 3.2, first summarizing the algebraic equivalences that allow one to apply cross-sectional regression to obtain the DiD estimate(s) and standard errors. Section 3.3 extends the framework to accommodate heterogeneous trends and seasonal effects, demonstrating how to implement our approach in these more complex settings. In Section 3.4, we compare our method to the Synthetic Control (SC) and Synthetic Difference-in-Differences (SDiD) approaches, highlighting differences in their underlying assumptions. Section 3.5 presents simulation results evaluating the performance of these estimators. We find that SC and SDiD, especially the former, can exhibit substantially more bias than removing heterogeneous trends. In Section 3.6, we revisit the California smoking restrictions passed in 1989, using the data in Abadie, Diamond, and Hainmueller (2010). Section 3.7 addresses the staggered rollout case and applies our method to measure the effects of castle laws on homicides. We conclude in Section 3.8.

3.2 The Common Timing Case

3.2.1 Estimating a Single ATT

Initially, the setting is that a total of $T$ time periods are available, and an intervention occurs at time $S$, $S \in \{2, \ldots, T\}$. A cross-sectional unit, $i$, is either subjected to the intervention, which remains in place from period $S$ through $T$, or it is a control unit. Treatment status is indicated by the binary variable
\[
D_i = \begin{cases} 0 & \text{if a control unit} \\ 1 & \text{if a treated unit} \end{cases} \tag{3.1}
\]
The time-varying treatment indicator is
\[
W_{it} = D_i \cdot post_t \tag{3.2}
\]
where $post_t = 1$ if $t \in \{S, S + 1, \ldots, T\}$ and zero otherwise. In what follows, we assume that the data draws are independent and identically distributed across $i$ but allow general dependence and changing distributions across $t$. Later, we provide a discussion of how one can accommodate clustering or spatial correlation if the cross-sectional sample sizes are reasonably large.

The simple DiD estimator can be obtained as the coefficient $\hat{\tau}_{DD}$ on $W_{it}$ from the pooled regression
\[
Y_{it} \text{ on } 1, D_i, post_t, W_{it}, \quad i = 1, \ldots, N; \; t = 1, \ldots, T \tag{3.3}
\]
Equivalently, one can replace $D_i$ and $post_t$ with a full set of unit and time period fixed effects, resulting in the popular two-way fixed effects (TWFE) estimator. [The equivalence follows from Wooldridge (2021, 2010).] When $N$ is small (or $N_1$ is small), inference using the TWFE estimator is tricky because one should allow serial correlation in the underlying time-varying errors, $U_{it}$, that are implicit in the equation. The usual method of clustering by unit $i$ generally results in poor performance with small $N$ (or with small $N_0$ or $N_1$). A small illustration of the pooled regression (3.3) follows.
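For concreteness, here is a minimal, self-contained sketch of the pooled regression in (3.3) on a simulated toy panel; all names (unit, t, D, post, W, y) and the data-generating numbers are hypothetical placeholders, not part of the chapter's analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy balanced panel: N units, T periods, common intervention at S
N, T, S = 10, 6, 4
rng = np.random.default_rng(1)
df = pd.DataFrame({"unit": np.repeat(np.arange(N), T),
                   "t": np.tile(np.arange(1, T + 1), N)})
df["D"] = (df["unit"] < 3).astype(int)         # first three units treated
df["post"] = (df["t"] >= S).astype(int)
df["W"] = df["D"] * df["post"]                  # W_it = D_i * post_t, as in (3.2)
df["y"] = 1 + 0.5 * df["D"] + 0.2 * df["t"] + 2.0 * df["W"] + rng.normal(size=N * T)

# tau_hat_DD is the coefficient on W in the pooled regression (3.3)
fit = smf.ols("y ~ D + post + W", data=df).fit()
print(fit.params["W"])
```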
As is fairly well known, and recently shown in Lee and Wooldridge (2023) for cases with covariates, $\hat{\tau}_{DD}$ can also be obtained as follows. First, for each unit $i$, create
\[
\Delta\bar{Y}_i \equiv \frac{1}{(T - S + 1)} \sum_{t=S}^{T} Y_{it} - \frac{1}{(S - 1)} \sum_{t=1}^{S-1} Y_{it} \equiv \bar{Y}_{i,post} - \bar{Y}_{i,pre}, \tag{3.4}
\]
which is a simple transformation of the data within each unit. This results in a simple cross-sectional data set $\{(\Delta\bar{Y}_i, D_i) : i = 1, \ldots, N\}$, where $N$ can be large or small. In the second step, $\hat{\tau}_{DD}$ is obtained as the coefficient on $D_i$ from the simple regression
\[
\Delta\bar{Y}_i \text{ on } 1, D_i, \quad i = 1, \ldots, N. \tag{3.5}
\]
Of course, this regression leads to the formula for a difference in means:
\[
\hat{\tau}_{DD} = N_1^{-1} \sum_{i=1}^{N} D_i \cdot \Delta\bar{Y}_i - N_0^{-1} \sum_{i=1}^{N} (1 - D_i) \cdot \Delta\bar{Y}_i = \Delta\bar{Y}_{treat} - \Delta\bar{Y}_{control}. \tag{3.6}
\]
The benefit of the previous characterization of the DiD estimator is that we can think of a simple underlying model,
\[
\Delta\bar{Y}_i = \alpha + \tau D_i + U_i \tag{3.7}
\]
\[
E(U_i \mid D_i) = 0, \tag{3.8}
\]
where the zero conditional mean assumption holds when the difference-in-differences design identifies the average treatment effect on the treated, $\tau$. If $N$ is reasonably large, and we have several treated and control units, we can rely on asymptotic analysis and justify a heteroskedasticity-robust standard error for $\hat{\tau}_{DD}$ for computing a confidence interval for $\tau$. But when $N$ is small, or, say, $N_0$ is large but $N_1$ is small, we cannot rely on asymptotics. Nevertheless, it is possible that (3.7) satisfies the classical linear model (CLM) assumptions:
\[
U_i \mid D_i \sim Normal\left(0, \sigma_U^2\right) \tag{3.9}
\]
Under (3.7) and (3.9), conditional on $D_i$, $\hat{\tau}_{DD}$ has an exact normal distribution, and
\[
\frac{(\hat{\tau}_{DD} - \tau)}{se(\hat{\tau}_{DD})} \sim \mathcal{T}_{N-2} \tag{3.10}
\]
This result means that we can obtain exact tests of any null hypothesis, such as $H_0: \tau = 0$. Moreover, the usual confidence intervals, obtained using percentiles of the $\mathcal{T}_{N-2}$ distribution, have exact coverage. Relatedly, Hagemann (2025) assumes normality in the context of testing for a treatment effect when a single cluster is treated with many units within the cluster, allowing for different variances in the distributions of the control and treated units. Here, we do not require any particular number of time periods (units within a cluster) provided the normality assumption (3.9) holds.

In looking at the expression for $\hat{\tau}_{DD}$ in (3.6), it is clear that asymptotic theory will not apply when, say, $N_0$ is very large if $N_1$ is still small: we need the central limit theorem to apply to both terms in (3.6), and it will not generally provide a good approximation to the distribution if $N_1$ is small. The same is true if $N_0$ is small. Assuming (3.9) holds, the approach works in any setting with $N_0 \ge 1$, $N_1 \ge 1$, and $N = N_0 + N_1 \ge 3$. In particular, we can have a single treated unit, $N_1 = 1$, and many or few control units. When there is only a single treated unit, the $t$-statistic $\hat{\tau}_{DD}/se(\hat{\tau}_{DD})$ is known as the "studentized residual," which plays a role in outlier analysis. See, for example, Wooldridge (2020, Section 9.5). Viewing the $t$ statistic for $\hat{\tau}_{DD}$ as an outlier diagnostic makes intuitive sense: we are trying to determine whether the single treated unit, with $D_i = 1$, is an outlier compared with the control units. The approach just outlined is essentially that taken by Donald and Lang (2007), but here we apply it to panel data with either a short or long stretch of time. A minimal sketch of the two-step computation appears below.
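The following sketch implements (3.4) through (3.6) and the exact $t(N-2)$ interval from (3.10), reusing the hypothetical panel df and intervention period S from the previous snippet. The exact interval is valid only under the CLM assumptions (3.7) and (3.9); the last line shows an hc3-style heteroskedasticity-robust alternative, discussed further below.

```python
import statsmodels.api as sm
from scipy import stats

pre = df["t"] < S
dY = (df[~pre].groupby("unit")["y"].mean()
      - df[pre].groupby("unit")["y"].mean())          # Delta Ybar_i, as in (3.4)
D = df.groupby("unit")["D"].first()

# Regression (3.5): the slope is tau_hat_DD, the difference in means (3.6);
# it reproduces the coefficient on W from the pooled regression (3.3)
res = sm.OLS(dY.to_numpy(), sm.add_constant(D.to_numpy())).fit()
tau, se = res.params[1], res.bse[1]

# Exact 95% CI from the t(N-2) distribution, valid under (3.7) and (3.9)
crit = stats.t.ppf(0.975, len(dY) - 2)
print(tau, (tau - crit * se, tau + crit * se))

# hc3-style heteroskedasticity-robust SE (cf. Simonsohn, 2021) as an alternative
print(sm.OLS(dY.to_numpy(), sm.add_constant(D.to_numpy())).fit(cov_type="HC3").bse[1])
```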
If $T_0$ and $T_1$ are reasonably large, the normality in (3.9) may be an acceptable assumption because, if the data are weakly dependent across time, the central limit theorem implies that $\bar{Y}_{i,post}$ and $\bar{Y}_{i,pre}$ are approximately normal. Moreover, because we only require conditional normality of $\Delta\bar{Y}_i = \bar{Y}_{i,post} - \bar{Y}_{i,pre}$, strong dependence (such as a time-constant unobserved effect) is eliminated by differencing the post-intervention and pre-intervention averages. Underlying unit root processes also need not cause $\Delta\bar{Y}_i$ to deviate substantially from normality.

In addition to (technically) maintaining exact normality, another drawback of the exact inference approach is that it maintains homoskedasticity, that is, $Var(U_i \mid D_i) = Var(U_i)$. Nevertheless, recent simulations by Simonsohn (2021) are promising in that one version of the heteroskedasticity-robust standard error, often called "hc3" in the literature, can work well even with a small number of treated and control units. Thus, one may feel comfortable using heteroskedasticity-robust inference except in cases with a very small number of treated and/or control units.

In addition, a notable advantage of our method is its compatibility with randomization inference (RI), which does not rely on the normality assumption. RI enables the computation of exact p-values for testing the null hypothesis that all treatment effects are zero. The p-value is calculated as the proportion of permutation-based test statistics that are as extreme as or more extreme than the observed test statistic. Specifically, if the number of such permutations is $c$, and the total number of permutations is $N$, then the two-sided randomization p-value is defined as $c/N$. In practice, the user-written Stata command ritest can be employed to implement this procedure and obtain exact p-values. See, for example, Heß (2017).

There is another characterization of $\hat{\tau}_{DD}$ that will be useful when we allow unit-specific pre-trends in Section 3.3.

Procedure 2.1 (Unit-Specific Demeaning):

1. Obtain $\bar{Y}_{i,pre}$, for each $i$, using the pre-intervention regression
\[
Y_{it} \text{ on } 1, \quad t = 1, \ldots, S - 1, \tag{3.11}
\]
where the coefficient on unity (the only coefficient) gives $\bar{Y}_{i,pre}$.

2. Next, in each post-intervention period, obtain the out-of-sample residuals, or prediction errors:
\[
\dot{Y}_{it} = Y_{it} - \bar{Y}_{i,pre} = Y_{it} - \frac{1}{(S - 1)} \sum_{r=1}^{S-1} Y_{ir}, \quad t = S, \ldots, T \tag{3.12}
\]

3. It is easy to see that, when these out-of-sample residuals are averaged, we obtain the same regressand as in (3.5):
\[
\bar{\dot{Y}}_i \equiv \frac{1}{(T - S + 1)} \sum_{t=S}^{T} \dot{Y}_{it} = \bar{Y}_{i,post} - \bar{Y}_{i,pre} = \Delta\bar{Y}_i \tag{3.13}
\]

4. Obtain an average effect, $\hat{\tau}_{DM}$, and its standard error and confidence interval, from
\[
\bar{\dot{Y}}_i \text{ on } 1, D_i, \quad i = 1, \ldots, N. \; \square
\]

3.2.2 Adding Control Variables

As discussed in Lee and Wooldridge (2023), Procedure 2.1 uncovers the ATT under a no anticipation (NA) assumption and a parallel trends (PT) assumption. With $Y_{it}(0)$ and $Y_{it}(1)$ denoting the potential outcomes for unit $i$ in period $t$, the weakest version of NA is
\[
E[Y_{it}(1) - Y_{it}(0) \mid D_i = 1] = 0, \quad t = 1, \ldots, S - 1, \tag{3.14}
\]
so the potential outcomes are the same, on average, for the treated units in the pre-intervention periods. Sufficient, of course, is $Y_{it}(1) = Y_{it}(0)$, $t = 1, \ldots, S - 1$. The parallel trends assumption is that, for all $t = 2, \ldots, T$,
\[
E[Y_{it}(0) - Y_{i1}(0) \mid D_i] = E[Y_{it}(0) - Y_{i1}(0)] \equiv \delta_t, \tag{3.15}
\]
where $\delta_t$ is a constant that is unrestricted over time.
This assumption allows an unrestricted trend in the control state, but that trend must be the same across control and treated units. In effect, it allows treatment assignment, $D_i$, to be correlated with the level, say $Y_{i1}(0)$, but it rules out cases where assignment is based on differential trends in the control state. As shown in Lee and Wooldridge (2023), the PT assumption implies that, for some constant $\alpha$,
\[
E\left[\Delta\bar{Y}_i(0) \mid D_i\right] = \alpha, \tag{3.16}
\]
where $\Delta\bar{Y}_i(0) = \bar{Y}_{i,post}(0) - \bar{Y}_{i,pre}(0)$ is the same transformation appearing in (3.4), but applied to the potential outcomes. Equation (3.16) means that $\Delta\bar{Y}_i(0)$ is mean independent of $D_i$; as discussed in LW (2023), along with the NA assumption this implies that $\tau$ is identified and $\hat{\tau}_{DD}$ is (conditionally) unbiased and consistent. Here, we are relying on unbiasedness because of the small $N_1$ or small $N_0$.

LW (2023) also show that the PT assumption is easily relaxed by conditioning on controls. Given time-constant controls $\mathbf{X}_i$, a $1 \times K$ vector, the conditional PT assumption is
\[
E[Y_{it}(0) - Y_{i1}(0) \mid D_i, \mathbf{X}_i] = E[Y_{it}(0) - Y_{i1}(0) \mid \mathbf{X}_i] \equiv \alpha_t + \mathbf{X}_i \beta_t, \tag{3.17}
\]
where we have imposed linearity because we have little choice in a small-$N$ setting. Assumption (3.17) is an unconfoundedness assumption on $D_i$ once we condition on $\mathbf{X}_i$, but it is in terms of the differences $Y_{it}(0) - Y_{i1}(0)$ rather than the levels $Y_{it}(0)$. (This is what gives the DiD procedure its main advantage over cross-sectional regression.) In terms of the cross-sectional regression, it is easy to adapt LW (2023) to see that we should add $\mathbf{X}_i$ to the equation. To use exact inference, we would write
\[
\Delta\bar{Y}_i = \alpha + \tau D_i + \mathbf{X}_i \beta + U_i \tag{3.18}
\]
\[
U_i \mid D_i, \mathbf{X}_i \sim Normal\left(0, \sigma_U^2\right) \tag{3.19}
\]
Provided $N > K + 2$, (3.18) can be estimated by OLS, and, under the conditional normality assumption, the $t$ statistic has an exact $\mathcal{T}_{N-K-2}$ distribution and the corresponding confidence intervals are exact. Again, it is intriguing to think that, even without $N$ being too large, we might use heteroskedasticity-robust inference.

When $N_0$ and $N_1$ are both sufficiently large, and the support of $\mathbf{X}$ for $D = 1$ is contained in that for $D = 0$, full regression adjustment is preferred on theoretical grounds. Namely, $\hat{\tau}_{DD}$ is the coefficient on $D_i$ from the regression that includes interactions:
\[
\Delta\bar{Y}_i \text{ on } 1, D_i, \mathbf{X}_i, D_i \cdot \left(\mathbf{X}_i - \bar{\mathbf{X}}_1\right), \quad i = 1, \ldots, N,
\]
which is identical to running separate regressions for $D_i = 0$ and $D_i = 1$. Here, $\bar{\mathbf{X}}_1$ is the average of the covariates over the $D_i = 1$ subsample. Unfortunately, this cannot be done with small $N_1$, as it requires $N_0 > K + 1$ and $N_1 > K + 1$.

3.2.3 Estimating Separate Effects for Each Treated Period

Rather than estimating a single ATT, it takes little additional effort to estimate ATTs separately in each treated period. Define $\dot{Y}_{it}$ as in (3.12); this is used as the dependent variable in period $t$. Then $\hat{\tau}_{t,DD}$ for $t = S, \ldots, T$ is obtained from
\[
\dot{Y}_{it} \text{ on } 1, D_i, \quad i = 1, \ldots, N \tag{3.20}
\]
In order to conduct small-$N$ (or small-$N_1$) inference, we cannot rely on the central limit theorem in the post-treatment period. Also, we cannot easily obtain standard errors for linear combinations of the $\hat{\tau}_{t,DD}$ because they will be dependent in complicated ways due to serial correlation (unless we make ideal assumptions that rule out serial correlation). Nevertheless, we can obtain a CI for each $\tau_t = E[Y_{it}(1) - Y_{it}(0) \mid D_i = 1]$. A sketch of these period-by-period regressions appears below.
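Continuing the running example (hypothetical df and S), each pass of the loop below is just the cross-sectional regression (3.20) on the demeaned outcome (3.12); the exact-$t$ p-values are valid period by period under the CLM assumptions, as discussed above. This is a sketch, not the chapter's implementation.

```python
import statsmodels.api as sm

ybar_pre = df[df["t"] < S].groupby("unit")["y"].mean()   # pre-treatment mean in (3.12)
X = sm.add_constant(df.groupby("unit")["D"].first().to_numpy())

for t in range(S, int(df["t"].max()) + 1):
    ydot = df[df["t"] == t].set_index("unit")["y"] - ybar_pre   # Ydot_it, (3.12)
    fit_t = sm.OLS(ydot.to_numpy(), X).fit()
    # tau_hat_{t,DD} with its exact-t p-value under the CLM assumptions
    print(t, fit_t.params[1], fit_t.pvalues[1])
```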
The coefficients one obtains from (3.20) are identical to those from running a standard panel DiD using the control and treated units for the time periods $\{1, 2, \ldots, S - 1, t\}$, where $S \le t \le T$. That is, we include all pre-treatment periods. Again, with $N$ large enough, we can add controls $\mathbf{X}_i$. Note that, like the treatment effects themselves, the intercept and the slope parameters on $\mathbf{X}_i$ vary freely over the treatment periods.

3.2.4 Using Different Pre-Treatment Periods

The suggestion in the previous subsection is to use all pre-treatment periods in computing the pre-treatment averages. Under NA and parallel trends, that approach uses the most information. Nevertheless, one might want to ignore information in some pre-treatment time periods. The standard event-study approach uses only the period just before the treatment. This results in the transformation
\[
\mathring{Y}_{it} = Y_{it} - Y_{i,S-1}, \quad t = S, \ldots, T, \tag{3.21}
\]
which can then be used in place of $\dot{Y}_{it}$ in (3.20). This is the transformation used by Callaway and Sant'Anna (2021). Of course, one can use any average of $\{Y_{it} : t = 1, \ldots, S - 1\}$, even a weighted average, and the method would still be valid. If there is concern about violation of no anticipation, the average $\bar{Y}_{i,pre}$ could be replaced with
\[
\bar{Y}_{i,S_0} = \frac{1}{S_0} \sum_{r=1}^{S_0} Y_{ir} \tag{3.22}
\]
where $S_0 < S - 1$. A long difference also can be used: $Y_{it} - Y_{i,S_0}$.

3.3 Heterogeneous Trends and Seasonality in the Common Timing Case

LW (2023) show that, when the units have unit-specific trends, these can be removed before applying a modification of Procedure 2.1. For concreteness, the following procedure removes linear trends; it is easily modified to remove higher-order unit-specific trends.

Procedure 3.1 (Unit-Specific Detrending):

1. For each $i$, obtain $\hat{A}_i$, $\hat{B}_i$ from the pre-treatment periods by regressing on a constant and time:
\[
Y_{it} \text{ on } 1, t, \quad t = 1, \ldots, S - 1 \tag{3.23}
\]

2. For the post-treatment periods, remove the pre-treatment trends:
\[
\ddot{Y}_{it} = Y_{it} - \hat{Y}_{it} \equiv Y_{it} - \hat{A}_i - \hat{B}_i \cdot t, \quad t = S, \ldots, T, \tag{3.24}
\]
where $\hat{Y}_{it} \equiv \hat{A}_i + \hat{B}_i \cdot t$ is the projected value of $Y_{it}$ in a treated period using a trend obtained from the pre-treatment periods.

3. For each unit, average the adjusted outcomes:
\[
\bar{\ddot{Y}}_i \equiv \frac{1}{(T - S + 1)} \sum_{t=S}^{T} \ddot{Y}_{it} \tag{3.25}
\]

4. Obtain an average effect, $\hat{\tau}_{DT}$, and its standard error and confidence interval, from
\[
\bar{\ddot{Y}}_i \text{ on } 1, D_i, \quad i = 1, \ldots, N. \; \square \tag{3.26}
\]

We use "DT" to denote that $\hat{\tau}_{DT}$ is obtained from a detrending procedure. As before, even with $N_1 = 1$, we can perform inference under
\[
\bar{\ddot{Y}}_i = \alpha + \tau_{DT} \cdot D_i + U_i, \quad U_i \mid D_i \sim Normal\left(0, \sigma^2\right)
\]
With large enough $N$ we could include covariates, as in (3.17), which would allow for confounded assignment even after removing unit-specific linear trends.

Because Procedure 3.1 uses unit-specific trend removal, it may initially appear that the final regression used to obtain $\hat{\tau}_{DT}$ (or $\hat{\tau}_{DD}$ in Procedure 2.1) might suffer from the so-called "incidental parameters" problem. This is not the case, as we are simply transforming each unit using its own time series data. That is, $\Delta\bar{Y}_i$ and $\bar{\ddot{Y}}_i$ are just linear functions of $\{Y_{it} : t = 1, 2, \ldots, T\}$. Under independence across $i$, the result is cross-sectional data where the observations are independent. That allows us to at least argue for classical linear model analysis when $N$ is small. Without control variables, Lee and Wooldridge (2023) show that a sufficient condition is independence of the treatment indicator from the transformed outcome variable, that is, the outcome after removing unit-specific pre-treatment averages or trends.
Without 57 control variables, Lee and Wooldridge (2023) show that a sufficient condition is the independence of the treatment indicator from transformed outcome variable– that is, the outcome after removing unit-specific pre-treatment averages or trends. As in the case without trends, we can easily estimate separate effects for each 𝑡: for each 𝑡 ∈ {𝑆, ..., 𝑇 } obtain (cid:98) 𝜏𝑡,𝐷𝑇 from the sequence of regressions (cid:165)𝑌𝑖𝑡 on 1, 𝐷𝑖, 𝑖 = 1, ..., 𝑁 and possibly include X𝑖. As before, under the CLM assumptions we can perform inference on the 𝜏𝑡,𝐷𝑇 . The other strategies discussed in Section 3.2, such as leaving a gap before the intervention if (cid:98) one is concerned about anticipation, apply here as well. Rather than using all of the time periods to remove a trend, one could use second differencing, say, (cid:168)𝑌𝑖𝑡 ≡ (cid:0)𝑌𝑖𝑡 − 𝑌𝑖,𝑆−1 (cid:1) − (cid:0)𝑌𝑖,𝑆−1 − 𝑌𝑖𝑅(cid:1) , where 1 ≤ 𝑅 < 𝑆 − 1. The resulting estimator is a difference-in-difference-in-differences estimator. Such an estimator does not exploit averaging across pre- and post-treatment periods, likely making it more sensitive to violations of the normality assumption. When 𝑌𝑖𝑡 is the log of a positive outcome, a linear time trend is attractive. Nevertheless, with large enough 𝑆, we can make the pre-trend removal more flexible; most obviously by including higher powers, such as 𝑡2 and 𝑡3. However, removing too much of the variation in the outcome may make it difficult to detect effects of the intervention. With quarterly or monthly data, and maybe even with weekly data, it might make sense to remove seasonality at the unit level (perhaps in addition to a trend). In the first step of Procedure 2.1 or 3.1, we would simply include quarterly, monthly, or weekly dummy variables and obtain the deseasonalized/detrended outcomes. After that, the procedure is exactly the same. 3.4 Comparison with Synthetic Control Methods With a single treated unit, ADH (2010) propose algorithms to use pre-treatment data to obtain weights on “donors” to create a synthetic control (SC) for the treated unit. The weights are chosen 58 by an optimization algorithm so that all weights are nonnegative and sum to unity. In early SC applications, inference is essentially visual, relying on taking each control unit as a placebo treated unit and comparing the post-treatment gaps with those of the actual treated unit. More recently, Arkhangelsky et al. (2021) study the SC methods, and extensions, when the target parameter is the post-treatment time average of the ATTs – rather than estimating a differ- ent parameter in each time period. They propose the Synthetic difference-in-differences (SDiD) estimator, which can be interpreted as a weighted two-way fixed effects (TWFE) estimator. Specif- ically, SDiD assigns unit weights to balance the pre-treatment trends of treated and control units, and time weights to balance pre- and post-treatment periods within control units. These weights are obtained by solving regularized optimization problems that minimize dis- crepancies in pre-treatment outcome trajectories, thereby improving robustness to violations of the parallel trends assumption and enhancing efficiency. This approach allows SDiD to blend the strengths of both SC and DID while retaining valid large-sample inference guarantees. When it comes to staggered interventions, Arkhangelsky et al. (2021) suggest splitting the sample by adoption date into subsets and applying SDiD method separately to each subset. 
3.4 Comparison with Synthetic Control Methods

With a single treated unit, ADH (2010) propose algorithms that use pre-treatment data to obtain weights on "donors" to create a synthetic control (SC) for the treated unit. The weights are chosen by an optimization algorithm so that all weights are nonnegative and sum to unity. In early SC applications, inference is essentially visual, relying on taking each control unit as a placebo treated unit and comparing the post-treatment gaps with those of the actual treated unit.

More recently, Arkhangelsky et al. (2021) study SC methods, and extensions, when the target parameter is the post-treatment time average of the ATTs – rather than a different parameter in each time period. They propose the synthetic difference-in-differences (SDiD) estimator, which can be interpreted as a weighted two-way fixed effects (TWFE) estimator. Specifically, SDiD assigns unit weights to balance the pre-treatment trends of treated and control units, and time weights to balance pre- and post-treatment periods within control units. These weights are obtained by solving regularized optimization problems that minimize discrepancies in pre-treatment outcome trajectories, thereby improving robustness to violations of the parallel trends assumption and enhancing efficiency. This approach allows SDiD to blend the strengths of both SC and DiD while retaining valid large-sample inference guarantees.

When it comes to staggered interventions, Arkhangelsky et al. (2021) suggest splitting the sample by adoption date into subsets and applying the SDiD method separately to each subset; see the appendix on staggered adoption in AAHIW (2021). For instance, if units 5 and 6 begin treatment in period 5, units 3 and 4 begin in period 3, and units 1 and 2 are never treated, then the sample can be divided into two sub-samples: one including units 1, 2, 5, and 6 and another including units 1, 2, 3, and 4.

Inference for the SDiD estimator is developed under large $N_0$, large $T$ asymptotics. The key assumptions include independence of the error terms across $i$, allowing for standard central limit arguments as the number of control units increases. Within each unit, the errors are permitted to exhibit arbitrary serial correlation over time, provided they satisfy weak dependence conditions, while independence across units is maintained. Additionally, the validity of their placebo-based standard error estimation relies on approximate normality of the residuals.

Procedures 2.1 and 3.1 of our paper use relatively less sophisticated strategies in effectively choosing a synthetic control. However, it is important to remember that the chosen control is based on the outcome after the pre-treatment averages or pre-treatment trends have been removed. The average of the resulting residuals across all included control units effectively plays the role of the synthetic control for the residualized outcomes of the treated unit. In other words, the cross-sectional average of the controls,

$\left\{ N_0^{-1} \sum_{i=1}^{N_0} \dot{Y}_{it} : t = 1, \ldots, T \right\},$  (3.27)

is the synthetic control for the cross-sectional averages of the treated units,

$\left\{ N_1^{-1} \sum_{i=N_0+1}^{N_0+N_1} \dot{Y}_{it} : t = 1, \ldots, T \right\}.$  (3.28)

If we remove pre-treatment trends, then $\ddot{Y}_{it}$ replaces $\dot{Y}_{it}$. The hope is that after transforming $Y_{it}$ in the pre-treatment periods, the control residuals do a good job of tracking the treated residuals pre-treatment. When the procedure is successful, there should be close agreement in the two average residual series over the periods $t = 1, 2, \ldots, S-1$. The estimated treatment effects are obtained from the differences for $t \ge S$, and these are exactly the coefficients in (3.5) or (3.26).
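A short sketch of the series in (3.27) and (3.28) and their gap; the function name residual_series_gap is illustrative, not from the text.

import numpy as np

def residual_series_gap(Y_dot, D):
    """Average transformed-outcome series for controls (3.27) and treated
    units (3.28), plus their gap. Y_dot is the N x T array of transformed
    outcomes (demeaned or detrended); D is the treatment indicator."""
    control_avg = Y_dot[D == 0].mean(axis=0)   # plays the synthetic control role
    treated_avg = Y_dot[D == 1].mean(axis=0)
    return control_avg, treated_avg, treated_avg - control_avg

# Example: the gap series should hover near zero for t < S; for t >= S its
# values are exactly the per-period treatment effect estimates in the text.
rng = np.random.default_rng(2)
Y_dot = rng.normal(0, 1, (10, 8))
D = np.array([0] * 6 + [1] * 4)
ctrl, trt, gap = residual_series_gap(Y_dot, D)
print(gap.round(2))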
3.5 Monte Carlo Simulations

In this section, we design simulations to assess the performance of our proposed estimator in the presence of treatment effect heterogeneity under the common timing case. Our goal is to understand how well the estimator recovers the average treatment effects under realistic data-generating processes characterized by serial correlation, unit-specific heterogeneity, and dynamic treatment effects. In each simulation, we generate data for $N = 20$ units observed over $T = 20$ time periods, with treatment beginning at a common time in period $t = 11$, resulting in 10 pre-treatment and 10 post-treatment periods for each unit.

3.5.1 Data Generating Process

This section outlines the specific steps used to generate the simulated data.

Step 1. Unit characteristics. Each unit is endowed with two latent characteristics, unit-specific heterogeneity and a unit-specific time trend slope: $c_i \sim \mathcal{N}(0, \sigma_c^2)$ and $g_i \sim \mathcal{N}(1, \sigma_g^2)$, respectively, where $\sigma_c = 2$ and $\sigma_g = 1$ in our simulation. These characteristics are constant across time.

Step 2. Idiosyncratic error term. We generate serially correlated errors $u_{it}$ using an AR(1) process,

$u_{i1} \sim \mathcal{N}\left(0, \sqrt{\frac{2}{1-\rho^2}}\right), \quad u_{it} = \rho u_{i,t-1} + \varepsilon_{it}, \quad \varepsilon_{it} \sim \mathcal{N}(0, 2), \quad \rho = 0.75.$

This structure introduces persistent shocks into the outcome evolution.

Step 3. Potential outcomes. The potential outcomes in the control and treated states are defined as

$Y_{it}(0) = \lambda_t \cdot fs_t - c_i + g_i t + u_{it},$
$Y_{it}(1) = Y_{it}(0) + \delta_t \cdot fs_t + \nu_{it}, \quad \nu_{it} \sim \mathcal{N}(0, \sqrt{2}),$

where $fs_t$ is a time dummy, $\lambda_t$ represents time fixed effects, and $\delta_t$ is a time-varying treatment effect. We assume that $\delta_t = 0$ for $t < 11$. Then, the observed outcome is determined using the potential outcomes framework:

$Y_{it} = (1 - D_i) \cdot Y_{it}(0) + D_i \cdot Y_{it}(1),$

where $D_i \in \{0, 1\}$ is a binary indicator for treatment status. Treatment is assigned based on a logistic selection rule,

$\Pr(D_i = 1) = \mathbb{I}\left(\alpha_0 - \alpha_1 \cdot c_i + \alpha_2 \cdot g_i + \epsilon_i > 0\right), \quad \epsilon_i \sim \mathrm{Logistic}(0, 1).$

The average post-treatment effect is set to approximately 2 by construction. The detailed settings for each scenario are summarized in Table 3.1. Specifically, the time fixed effects $\lambda_t$ and the time-varying treatment effects $\delta_t$ are held constant, while the parameters governing the treatment assignment $\Pr(D_i = 1)$ vary across three scenarios. The probability of being treated decreases from Scenario 1 to Scenario 3. A code sketch replicating one draw from this DGP follows below.

Table 3.1 Simulation Setup Parameters Across Three Scenarios

Parameters                                   Scenario 1              Scenario 2              Scenario 3
Treatment Rule $(\alpha_0, \alpha_1, \alpha_2)$   $(-1, -\tfrac{1}{3}, \tfrac{1}{4})$   $(-1.5, \tfrac{1}{3}, \tfrac{1}{4})$   $(-2, \tfrac{1}{3}, \tfrac{1}{4})$
Time Fixed Effects $(\lambda_1, \ldots, \lambda_{20})$   (0, 0, 0, 0, 0.2, 0.6, 0.7, 0.8, 0.6, 0.9, 0.9, 1, 1.1, 1.3, 1.2, 1.5, 0.6, 1.4, 1.8, 1.9), common to all scenarios
Treatment Effects $(\delta_{11}, \ldots, \delta_{20})$   (1, 2, 3, 3, 3, 2, 2, 2, 1, 1), common to all scenarios

Note: only the treatment rule parameters vary across scenarios.
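A minimal sketch of one draw from this DGP, under two stated assumptions: the second argument of each $\mathcal{N}(\cdot,\cdot)$ is read as a standard deviation as printed, and $\nu_{it}$ is set to zero before period 11 so that no anticipation holds exactly. The function name simulate_panel is illustrative.

import numpy as np

def simulate_panel(alpha, N=20, T=20, S=11, rho=0.75, seed=0):
    """One draw from the Section 3.5.1 DGP; lam and delta follow Table 3.1."""
    rng = np.random.default_rng(seed)
    lam = np.array([0, 0, 0, 0, .2, .6, .7, .8, .6, .9,
                    .9, 1, 1.1, 1.3, 1.2, 1.5, .6, 1.4, 1.8, 1.9])
    delta = np.r_[np.zeros(S - 1), [1, 2, 3, 3, 3, 2, 2, 2, 1, 1]]
    c = rng.normal(0, 2, N)                         # unit heterogeneity
    g = rng.normal(1, 1, N)                         # unit-specific trend slopes
    # AR(1) errors with persistent shocks
    u = np.empty((N, T))
    u[:, 0] = rng.normal(0, np.sqrt(2 / (1 - rho ** 2)), N)
    for t in range(1, T):
        u[:, t] = rho * u[:, t - 1] + rng.normal(0, np.sqrt(2), N)
    t_grid = np.arange(1, T + 1)
    Y0 = lam - c[:, None] + g[:, None] * t_grid + u      # control state
    nu = rng.normal(0, np.sqrt(2), (N, T))
    nu[:, :S - 1] = 0                                    # assumption: no pre-period noise
    Y1 = Y0 + delta + nu                                 # treated state
    a0, a1, a2 = alpha                                   # logistic selection rule
    D = (a0 - a1 * c + a2 * g + rng.logistic(0, 1, N) > 0).astype(int)
    Y = np.where(D[:, None] == 1, Y1, Y0)
    return Y, D

Y, D = simulate_panel(alpha=(-1, -1 / 3, 1 / 4))         # Scenario 1 parameters
print(Y.shape, D.mean())

With $\alpha_0 = -1, -1.5, -2$, the implied treatment probabilities are roughly 0.32, 0.24, and 0.17, matching the three scenarios reported below.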
3.5.2 Simulation Results

Table 3.2 presents the simulation results for the three scenarios outlined in Table 3.1. Our proposed detrending estimator, whether paired with the standard error (HC0) or with the heteroskedasticity-consistent standard error (HC3), outperforms alternative methods such as the Synthetic Control (SC) and Synthetic Difference-in-Differences (SDiD) methods. Across all scenarios, the detrending estimator achieves the lowest root mean squared error (RMSE), indicating superior finite-sample performance.

Specifically, in Scenario 1, where the probability of treatment is higher ($P(D_i = 1) = 0.32$), the detrending estimator exhibits a negligible bias and an RMSE of approximately 1.73, significantly outperforming SC and SDiD, which exhibit large biases (−0.375 and 0.392, respectively) and correspondingly higher RMSEs. Similar patterns are observed in Scenarios 2 and 3.

In particular, while SC and SDiD exhibit a notable gap between the empirical standard deviation (SD) of the estimates and the average estimated standard errors (Avg SE) reported within each replication, our detrending estimator maintains a close alignment between the two, leading to more reliable inference and coverage rates close to the nominal 95%. For example, in Scenario 1, the SD for SC is 1.73, while the average standard error is 2.66, indicating that the method tends to overestimate its own uncertainty and produce wider-than-necessary confidence intervals. This conservative estimation contributes to coverage rates that remain close to or slightly above the nominal 95%.

Table 3.2 Simulation Results Across Three Scenarios

Scenario 1. $P(D_i = 1) = 0.32$
                    Average Effects    Bias      SD     RMSE    Coverage Rate   Avg SE
Sample AE                1.991           –        –       –          –             –
Demeaning                3.905         1.914     5.10   5.445       0.94          4.96
Detrending               2.000         0.009     1.73   1.734       0.96          1.74
Detrending (hc3)         2.000         0.009     1.73   1.734       0.96          1.87
SC                       1.616        −0.375     2.14   2.177       0.97          3.27
SDiD                     2.383         0.392     1.77   1.808       0.96          2.56

Scenario 2. $P(D_i = 1) = 0.24$
                    Average Effects    Bias      SD     RMSE    Coverage Rate   Avg SE
Sample AE                2.011           –        –       –          –             –
Demeaning                4.264         2.253     5.67   6.101       0.93          5.48
Detrending               1.969        −0.042     1.89   1.892       0.95          1.91
Detrending (hc3)         1.969        −0.042     1.89   1.892       0.93          2.04
SC                       1.622        −0.389     2.35   2.384       0.94          2.73
SDiD                     2.395         0.384     1.89   1.925       0.95          2.25

Scenario 3. $P(D_i = 1) = 0.17$
                    Average Effects    Bias      SD     RMSE    Coverage Rate   Avg SE
Sample AE                1.996           –        –       –          –             –
Demeaning                4.566         2.570     6.78   7.254       0.94          6.43
Detrending               2.161         0.165     2.37   2.380       0.95          2.26
Detrending (hc3)         2.161         0.165     2.37   2.380       0.91          2.60
SC                       2.962         0.966     2.92   3.078       0.92          2.85
SDiD                     2.615         0.619     2.35   2.435       0.94          2.33

Note: each simulation study has 1,000 replications with $N = 20$.

Theoretically, substantial bias would be expected to shift confidence intervals away from the true treatment effect and lower coverage rates. Although SC and SDiD exhibit noticeably larger biases compared to our detrending method, in our simulations the magnitude of these biases is not sufficiently large relative to their conservative standard errors to meaningfully reduce coverage. As a result, despite their higher bias, SC and SDiD still achieve coverage rates comparable to or slightly exceeding the nominal 95% level.

Moreover, in Scenario 3, where treatment assignment becomes rarer ($P(D_i = 1) = 0.17$), all estimators face greater challenges, but our detrending method still maintains the smallest RMSE and the most reliable inference properties compared to SC and SDiD.

Overall, these simulation results suggest that the detrending approach can offer substantial improvements in bias reduction, efficiency, and inference accuracy in scenarios with heterogeneous linear trends. However, its advantages are not universal. When a unit's pre-trends exhibit more complex patterns that cannot be adequately captured by linear detrending, our method will not always work better than existing alternatives.

3.6 Application to California Smoking Restrictions

We apply our methods to the problem analyzed in ADH (2010), estimating the effect of California's tobacco control program. The first year of the program is 1989, with 19 pre-treatment years (1970–1988) and 12 treatment years (1989–2000). We use the log of per capita cigarette sales as the outcome variable; ADH (2010) used the level of this variable, but the log seems more natural – partly to obtain a percentage effect and partly to make normality of the transformed outcome a better approximation. California is the only treated state. As in ADH (2010), we use $N_0 = 38$ potential control states after eliminating states that implemented some sort of anti-smoking program over the period.

As discussed in the previous section, SCM uses pre-treatment outcomes (and sometimes other variables) to create a synthetic control – in this case, a synthetic California. The weights for the 38 controls are chosen to make the SC as close to California as possible based on pre-treatment observables (primarily $Y_{it}$). These weights are then used to project what the untreated outcome for California would have been after the policy change; this projection is then compared with the actual post-intervention outcome.

3.6.1 Results

Figure 3.1 shows the graphs of (3.27) and (3.28) using Procedure 2.1, allowing for separate effects in each treated period. The solid pink line represents the outcome for the treated unit (California), while the dashed blue line corresponds to the synthetic control group. The vertical dashed line marks the intervention year (1989).
It is evident that the tracking is not particularly good, with the average of the controls initially lying below the treated unit and then above it by 1980. Since it includes all of the controls, it is not surprising that merely removing a pre-treatment average is not enough to make them similar to California prior to the intervention – even averaged across $N_0 = 38$. One possibility is to choose a subset of the controls that seem most similar to California. In effect, that is what the SCM does in a systematic way.

Figure 3.1 Removing Pre-Treatment Averages, All Controls

On the other hand, Figure 3.2 plots the average residuals for the controls against California after removing the state-specific trends, obtained using Procedure 3.1. This provides a much better fit up until the policy change.

Figure 3.2 Removing Pre-Treatment Linear Trends, All Controls

Now, we can interpret the gaps after the intervention date as the effects of the intervention at different time horizons. Figure 3.3 shows the estimated gap between detrended California and the average of the 38 detrended controls. Prior to the intervention in 1989, the gap ranges from about −1.4% to 3.5%, with an average very close to zero. After the intervention, the treated unit exhibits a sharper decline compared to the synthetic control, indicating a potential negative effect of the policy.

Figure 3.3 Gap Between California and Average of All Controls, State-Specific Linear Trend

Similarly, both the Synthetic Control and Synthetic DiD methods, using all 38 states as controls, display a similar pre-treatment fit between California and synthetic California (the control), as visualized in Figures 3.4 and 3.5, respectively.

Figure 3.4 Synthetic Control Method, All Controls

Unlike Synthetic Control, Synthetic DiD in Figure 3.5 allows for level differences (drift) between the pre-period trends of the treated and control units.

Figure 3.5 Synthetic DiD, All Controls

Table 3.3 reports the estimated effects and associated standard errors from four different estimation methods: (i) Demeaning – Procedure 2.1, (ii) Detrending – Procedure 3.1, (iii) the Synthetic Control Method, and (iv) Synthetic DiD. The first row, Average Effect, presents a single estimated effect averaged over all treated periods.

Table 3.3 Estimated ATTs, 38 Control (Donor) States

                                          Average Effect      $\tau_{1989}$      $\tau_{1995}$      $\tau_{2000}$
Procedure 2.1 (DiD)                      −0.422*** (0.121)   −0.168* (0.096)   −0.484*** (0.137)   −0.667*** (0.164)
Procedure 3.1 (Unit-Specific Detrending) −0.227** (0.094)    −0.043 (0.059)    −0.282** (0.112)    −0.403** (0.152)
SC                                       −0.304*** (0.112)
Synthetic DiD                            −0.286*** (0.097)

Note 1: standard errors in parentheses. *** p < 0.01, ** p < 0.05, * p < 0.1.
Note 2: for Procedure 3.1, the p-value for the average effect under the normality assumption is 0.021. Based on randomization inference (RI) with 1,000 replications, the p-value is 0.041. This result is reasonably close to the conventional value, supporting the robustness of the inference.

Table 3.3 also shows the estimated effects and associated standard errors for several time horizons, since our cross-sectional regression method can easily estimate separate effects for each $t$. As is evident from the figures, the estimated effect grows over time, starting off small but ending in year 2000 at −0.403 ($t = -2.65$).
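For completeness, here is a minimal sketch of the randomization inference referenced in Note 2, assuming the unit-level averaged transformed outcomes from Procedure 2.1 or 3.1 are already computed; the helper name ri_pvalue is illustrative.

import numpy as np

def ri_pvalue(Ydd_bar, D, n_perm=1000, seed=0):
    """Randomization-inference p-value for the average effect: permute the
    treated label(s) across units and compare placebo mean differences with
    the actual one (two-sided)."""
    rng = np.random.default_rng(seed)
    def mean_diff(d):
        return Ydd_bar[d == 1].mean() - Ydd_bar[d == 0].mean()
    actual = mean_diff(D)
    placebo = np.array([mean_diff(rng.permutation(D)) for _ in range(n_perm)])
    return np.mean(np.abs(placebo) >= np.abs(actual))

# With California as the single treated unit, this amounts to comparing its
# averaged post-treatment gap with the placebo gap of each control state.
rng = np.random.default_rng(7)
Ydd_bar = rng.normal(0, 1, 39); Ydd_bar[-1] -= 2.0   # one treated unit, index 38
D = np.r_[np.zeros(38, int), 1]
print(ri_pvalue(Ydd_bar, D))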
3.6.2 Robustness Check

To further evaluate the robustness of our state-specific detrending procedure, we contrast the results from Procedure 3.1 with those from synthetic control methods when the donor pool is restricted to a small set of states that, a priori, do not seem "similar" to California. While the previous section used all 38 potential control states, here we consider two alternative donor pools. The first group consists of Alabama (AL), Arkansas (AR), Louisiana (LA), and Mississippi (MS)—four southern states that differ substantially from California in socioeconomic characteristics. The second group includes Midwestern states—Illinois (IL), Iowa (IA), Minnesota (MN), and Ohio (OH). By limiting the set of controls in this way, we assess how each estimator performs when the availability of similar donor units is limited, providing additional insight into the robustness of our procedure.

3.6.2.1 AL, AR, LA and MS as Controls

Table 3.4 summarizes the estimates using Alabama (AL), Arkansas (AR), Louisiana (LA), and Mississippi (MS) as control states across four different estimation methods. The results from the detrending procedure closely resemble those obtained when all 38 states are used as controls, both in magnitude and statistical significance. In contrast, the synthetic control (SC) and synthetic DiD methods yield larger estimates in absolute value, reflecting the poorer pre-treatment fits observed in Figure 3.6.

Table 3.4 Estimated ATTs, AL, AR, LA, MS as Controls

                                          Average             1989              1995              2000
Procedure 2.1 (DiD)                      −0.556*** (0.080)   −0.247 (0.107)    −0.611*** (0.077)   −0.839*** (0.032)
Procedure 3.1 (Unit-Specific Detrending) −0.215** (0.039)    −0.027 (0.052)    −0.259** (0.055)    −0.377** (0.115)
SC                                       −0.571*** (0.034)
Synthetic DiD                            −0.392*** (0.030)

Note: standard errors in parentheses. *** p < 0.01, ** p < 0.05, * p < 0.1.

Panel (a) of Figure 3.6 shows how removing unit-specific linear trends fits when the controls consist of AL, AR, LA, and MS. The average residuals across these four states provide a remarkably good fit to the detrended California. By contrast, Panels (b) and (c) of Figure 3.6 show that neither the SC method nor Synthetic DiD can recover a synthetic California that fits well enough to produce a convincing estimate of the causal effect.

Figure 3.6 AL, AR, LA and MS as Controls: (a) State-Specific Detrending; (b) Synthetic Control; (c) Synthetic DiD

3.6.2.2 IL, IA, MN and OH as Controls

Similarly, we next employ four Midwestern states—Illinois (IL), Iowa (IA), Minnesota (MN), and Ohio (OH)—as control units. The corresponding results are presented in Table 3.5. As in the previous case with AL, AR, LA, and MS as controls, our unit-specific detrending approach achieves a strong pre-treatment fit, while the synthetic control and synthetic DiD methods perform poorly. As shown in Panel (a) of Figure 3.7, the detrending method closely aligns with the pre-treatment trends, whereas the alternative methods exhibit clear discrepancies.

Table 3.5 Estimated ATTs, IL, IA, MN and OH as Controls

                                          Average             1989              1995              2000
Procedure 2.1 (DiD)                      −0.413** (0.118)    −0.178* (0.071)   −0.462** (0.133)    −0.655** (0.183)
Procedure 3.1 (Unit-Specific Detrending) −0.198* (0.079)     −0.040 (0.045)    −0.239* (0.088)     −0.363* (0.136)
SC                                       −0.437** (0.184)
Synthetic DiD                            −0.275* (0.154)

Note: standard errors in parentheses. *** p < 0.01, ** p < 0.05, * p < 0.1.

These findings further support the robustness of our approach, even in settings with a limited donor pool.
Notably, while both synthetic DiD and the synthetic control method typically rely on a large number of control units to construct a reliable counterfactual, our method requires only minimal conditions (e.g., at least one treated and one control unit, with a total of three or more units), making it particularly useful when the donor pool is small.

Figure 3.7 IL, IA, MN and OH as Controls: (a) State-Specific Detrending; (b) Synthetic Control; (c) Synthetic DiD

3.7 Staggered Rollouts

We can extend the previous methods to the case of staggered interventions, where units are allowed to be first treated at different times. But one must use care in choosing a suitable control group.

3.7.1 The Setup and Estimation

Suppose now that the first treatment period is $S$, but treatment can also first occur in periods $S+1, \ldots, T$. Index the treatment cohorts, or groups, by $g \in \{S, S+1, \ldots, T\}$, the time of first treatment. Let $D_g$, $g = S, S+1, \ldots, T$, be the cohort indicators. Specifically, $D_{ig} = 1$ if and only if unit $i$ was first treated in period $g$. The never-treated units are indicated by $D_{i\infty} = 1$. There are $N_g$ units in each cohort, with $N_\infty$ being the number of never-treated units. With $N_\infty \ge 2$, the other treated cohorts may have only one unit. Naturally, if there are periods without newly treated units, there will be no treatment effects estimated in those periods.

The potential outcomes are $Y_t(g)$, $g = S, \ldots, T, \infty$, where $Y_t(g)$ is the outcome in period $t$ if the first period of treatment is $g$; $Y_t(\infty)$ is the never-treated (or control) state. The treatment effects are now indexed by cohort and time:

$\tau_{gt} = E\left[Y_t(g) - Y_t(\infty) \mid D_g = 1\right], \quad t = g, \ldots, T.$  (3.29)

We are also interested in estimating the cohort-specific average effects,

$\tau_g = \frac{1}{T-g+1} \sum_{t=g}^{T} \tau_{gt}.$  (3.30)

As is common, we make a no anticipation assumption:

$Y_t(g) = Y_t(\infty), \quad t < g,$

which says that all potential outcomes are the same as in the never-treated state in periods before the treatment occurs.

For a given cohort $g$, one now removes the average or a trend using data up through $g-1$, the final pre-treatment period. It is easiest to obtain the transformation for each unit in the sample, and then sort out the valid control units afterwards. For each $i$, one runs the regressions

$Y_{it}$ on $1$, $\quad t = 1, 2, \ldots, g-1.$  (3.31)

Because we have different treatment cohorts, we add a cohort subscript when projecting out to the post-treatment periods:

$\dot{Y}_{itg} = Y_{it} - \bar{Y}_{i,pre(g)}, \quad t = g, \ldots, T,$  (3.32)

where $\bar{Y}_{i,pre(g)}$ is the average of $Y_{it}$ for $t = 1, \ldots, g-1$. To remove a linear time trend, the regressions are

$Y_{it}$ on $1, t$, $\quad t = 1, 2, \ldots, g-1,$  (3.33)

and then the out-of-sample residuals are

$\ddot{Y}_{itg} = Y_{it} - \hat{A}_{ig} - \hat{B}_{ig} \cdot t, \quad t = g, \ldots, T.$  (3.34)

Once the transformed variables have been obtained, the key is choosing a valid control group at time $t$ for treatment cohort $g$ (which could have as few as one unit). One can use only the never-treated units at time $t$, provided $N_\infty \ge 2$. However, as shown in LW (2023), under suitable NA and PT assumptions, one can use any not-yet-treated (NYT) unit. Then, for cohort $g$ in time period $t$, the units satisfying

$D_{i,t+1} + D_{i,t+2} + \cdots + D_{iT} + D_{i\infty} = 1$  (3.35)

are valid controls. For efficiency reasons, one should use all possible controls (and this tends to help make the normality assumption more realistic). Nevertheless, one can use any subset (including the NT group). A sketch of the cohort-specific transformations is given below.
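A minimal sketch of the transformations in (3.32) and (3.34), assuming a balanced panel and a vector of first-treatment dates; cohort_transforms is an illustrative helper name.

import numpy as np

def cohort_transforms(Y, first_treat):
    """Y is N x T (periods 1,...,T); first_treat[i] is unit i's first treated
    period g, or 0 for never treated. Returns dicts mapping each cohort g to
    the demeaned (Y_dot) and detrended (Y_ddot) arrays over t = g,...,T."""
    first_treat = np.asarray(first_treat)
    N, T = Y.shape
    t_all = np.arange(1, T + 1)
    cohorts = sorted(set(first_treat[first_treat > 0]))
    Y_dot, Y_ddot = {}, {}
    for g in cohorts:
        pre, post = t_all < g, t_all >= g
        # (3.32): remove each unit's pre-g average
        Y_dot[g] = Y[:, post] - Y[:, pre].mean(axis=1, keepdims=True)
        # (3.34): remove each unit's pre-g linear trend, fit by least squares
        tp = t_all[pre].astype(float)
        X = np.column_stack([np.ones_like(tp), tp])
        coef, *_ = np.linalg.lstsq(X, Y[:, pre].T, rcond=None)   # 2 x N
        fitted_post = coef[0][:, None] + coef[1][:, None] * t_all[post]
        Y_ddot[g] = Y[:, post] - fitted_post
    return Y_dot, Y_ddot

# Example: 8 units, cohorts at t = 5, 7, 9, and three never-treated units
rng = np.random.default_rng(3)
Y = rng.normal(size=(8, 10))
first = np.array([0, 0, 0, 5, 5, 7, 7, 9])
Yd, Ydd = cohort_transforms(Y, first)
print({g: v.shape for g, v in Ydd.items()})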
Let $C_{i,t+1}$ be the control group chosen for period $t$, with the subscript indicating that these units cannot have been treated prior to $t+1$. Then $\hat{\tau}_{gt}$ is the coefficient on $D_{ig}$ in the cross-sectional regression

$\dot{Y}_{itg}$ on $1, D_{ig}$, using $D_{ig} + C_{i,t+1} = 1.$  (3.36)

Or, replace $\dot{Y}_{itg}$ with the detrended version, $\ddot{Y}_{itg}$. Under a homoskedastic normality assumption, exact inference can be used in (3.36) even if $N_g = 1$ and the number of control units is small.

For many purposes, we are more interested in the $\tau_g$, which we can aggregate to obtain a single, weighted effect – more on this below. Then, because we want inference without many units in the treated cohort, we use the never-treated units as the control and run a single, aggregated regression. Specifically, define

$\bar{\dot{Y}}_{ig} \equiv \frac{1}{T-g+1} \sum_{t=g}^{T} \dot{Y}_{itg}$ and $\bar{\ddot{Y}}_{ig} \equiv \frac{1}{T-g+1} \sum_{t=g}^{T} \ddot{Y}_{itg}.$  (3.37)

Then run the regression

$\bar{\dot{Y}}_{ig}$ on $1, D_{ig}$, $\quad i = 1, \ldots, N$, with $D_{ig} + D_{i\infty} = 1,$  (3.38)

to obtain $\hat{\tau}_g$ as the coefficient on $D_{ig}$, or use $\bar{\ddot{Y}}_{ig}$ in place of $\bar{\dot{Y}}_{ig}$. Again, this is a cross-sectional regression with independent observations. If $N_g$ and $N_\infty$ are suitably large to employ asymptotics, then we can use a heteroskedasticity-robust standard error (hc3) to obtain a $t$ statistic or confidence intervals. We can appeal to exact theory if $N_g$ is small, including $N_g = 1$.

If we want to aggregate the $\hat{\tau}_g$ to obtain a single treatment effect, a challenge is to obtain a valid standard error when cohort sizes are not particularly large. A natural parameter weights the cohort ATTs, $\tau_g$, by the cohort shares. This leads to the estimator

$\hat{\tau}_\omega = \sum_{g=S}^{T} \hat{\omega}_g \hat{\tau}_g,$  (3.39)

where

$\hat{\omega}_g = \frac{N_g}{N_S + N_{S+1} + \cdots + N_T}$  (3.40)

are the cohort shares. A standard error for $\hat{\tau}_\omega$ must account for the covariances among the $\hat{\tau}_g$, which is difficult if we cannot appeal to asymptotics. Even with a somewhat larger number of total treated units, a trick is helpful. We know from simple regression analysis that $\hat{\tau}_g$ is a difference in means:

$\hat{\tau}_g = \frac{1}{N_g} \sum_{i=1}^{N} D_{ig} \bar{\dot{Y}}_{ig} - \frac{1}{N_\infty} \sum_{i=1}^{N} D_{i\infty} \bar{\dot{Y}}_{ig},$  (3.41)

and so, by simple algebra,

$\hat{\tau}_\omega = \frac{1}{N_S + N_{S+1} + \cdots + N_T} \sum_{g=S}^{T} \sum_{i=1}^{N} D_{ig} \bar{\dot{Y}}_{ig} - \frac{1}{N_\infty} \sum_{g=S}^{T} \sum_{i=1}^{N} D_{i\infty} \hat{\omega}_g \bar{\dot{Y}}_{ig}.$  (3.42)

Define a treatment indicator

$D_i = D_{iS} + \cdots + D_{iT}$  (3.43)

and the number of treated units as

$N_{treat} = N_S + N_{S+1} + \cdots + N_T.$  (3.44)

Rearrange (3.42) to obtain

$\hat{\tau}_\omega = \frac{1}{N_{treat}} \sum_{i=1}^{N} D_i \cdot \left( \sum_{g=S}^{T} D_{ig} \bar{\dot{Y}}_{ig} \right) - \frac{1}{N_\infty} \sum_{i=1}^{N} D_{i\infty} \cdot \left( \sum_{g=S}^{T} \hat{\omega}_g \bar{\dot{Y}}_{ig} \right)$
$\equiv \frac{1}{N_{treat}} \sum_{i=1}^{N} D_i \cdot \bar{\dot{Y}}_i - \frac{1}{N_\infty} \sum_{i=1}^{N} D_{i\infty} \cdot \bar{\dot{Y}}_i$
$= \frac{1}{N_{treat}} \sum_{i=1}^{N} D_i \cdot \bar{\dot{Y}}_i - \frac{1}{N_{control}} \sum_{i=1}^{N} (1 - D_i) \cdot \bar{\dot{Y}}_i,$  (3.45)

where

$\bar{\dot{Y}}_i \equiv D_{iS} \cdot \bar{\dot{Y}}_{iS} + \cdots + D_{iT} \cdot \bar{\dot{Y}}_{iT} + D_{i\infty} \cdot \left( \sum_{g=S}^{T} \hat{\omega}_g \bar{\dot{Y}}_{ig} \right).$  (3.46)

Note that the representation in (3.45) uses the fact that $D_i \cdot D_{ig} = D_{ig}$. Also, $D_i + D_{i\infty} = 1$, and so (3.45) is a simple difference in the sample means of $\bar{\dot{Y}}_i$ between the eventually treated units and the never-treated group. In other words, run the cross-sectional regression

$\bar{\dot{Y}}_i$ on $1, D_i$, $\quad i = 1, \ldots, N,$  (3.47)

and $\hat{\tau}_\omega$ is the coefficient on $D_i$.
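A minimal sketch of (3.46)-(3.47), assuming the cohort-level averages $\bar{\dot{Y}}_{ig}$ are stored in a dictionary keyed by (unit, cohort); tau_omega and this data layout are illustrative choices, not from the text.

import numpy as np
import statsmodels.api as sm

def tau_omega(Ybar, first_treat):
    """Ybar[(i, g)] -> unit i's time-averaged transformed outcome for cohort g;
    first_treat[i] -> unit i's cohort, 0 if never treated. Returns the
    aggregated effect and its HC3 standard error from regression (3.47)."""
    first_treat = np.asarray(first_treat)
    cohorts = sorted(set(first_treat[first_treat > 0]))
    Ng = {g: int(np.sum(first_treat == g)) for g in cohorts}
    w = {g: Ng[g] / sum(Ng.values()) for g in cohorts}      # cohort shares (3.40)
    N = len(first_treat)
    Y_i = np.empty(N)
    for i in range(N):
        g_i = first_treat[i]
        if g_i > 0:                    # eventually treated: own-cohort average
            Y_i[i] = Ybar[(i, g_i)]
        else:                          # never treated: share-weighted average
            Y_i[i] = sum(w[g] * Ybar[(i, g)] for g in cohorts)
    D = (first_treat > 0).astype(float)
    fit = sm.OLS(Y_i, sm.add_constant(D)).fit(cov_type="HC3")
    return fit.params[1], fit.bse[1]

# Tiny example: cohorts 2 and 3, with two never-treated units (3 and 4)
Ybar = {(0, 2): 1.0, (1, 2): 1.2, (2, 3): 2.0,
        (3, 2): 0.1, (3, 3): -0.1, (4, 2): 0.0, (4, 3): 0.2}
print(tau_omega(Ybar, [2, 2, 3, 0, 0]))

Running the single regression, rather than averaging cohort-specific estimates, is what automatically accounts for the covariances among the $\hat{\tau}_g$, as discussed next.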
Naturally, if we instead use unit-specific detrending for each cohort date $g$ to obtain the $\bar{\ddot{Y}}_{ig}$, then we replace $\bar{\dot{Y}}_{ig}$ with $\bar{\ddot{Y}}_{ig}$ everywhere.

Obtaining $\hat{\tau}_\omega$ from the regression in (3.47) has some advantages. It automatically accounts for the correlations among the $\hat{\tau}_g$, and it can be used whether $N_{treat}$ is small or large (and the same for $N_{control}$). If $N_{treat}$ and $N_{control}$ are even moderately large, a heteroskedasticity-robust standard error from (3.47) is justified via asymptotic analysis.

Arkhangelsky et al. (2021) discuss how SDiD can be applied to staggered interventions. They suggest splitting the sample by adoption date and applying the SDiD method separately to each treatment cohort, using the never-treated cohort as the control [see the appendix in Arkhangelsky et al. (2021)]. In other words, they do exactly what we propose with our unit-specific demeaning or detrending. Therefore, our proposed methods and the way SDiD is implemented for staggered designs are directly comparable.

3.7.2 Application: Estimating the Effects of Castle Laws on Homicides

We apply the previous methods to the data set used in Cunningham (2021) on the adoption of so-called "castle" laws – or "stand-your-ground" laws – on homicide rates in the United States. A castle law typically allows individuals to use force, including deadly force, to defend themselves against an intruder in their home – without a duty to retreat.

The data set covers the 50 United States from 2000 through 2010. In 2005, one state adopted a castle law. In 2006, 13 more states did. The remaining treated cohorts are 2007 (four states), 2008 (two states), and 2009 (one state). Therefore, there are $N_{treat} = 21$ eventually treated units and $N_{control} = 29$ control units. The outcome variable is the log of annual homicides.

Using the regression in (3.47) with $\bar{\dot{Y}}_i$ defined in (3.46), the estimated aggregated treatment effect, $\hat{\tau}_\omega$, is about 0.092, or about 9.2% more homicides for a state adopting a castle law. The usual OLS standard error is 0.057, which gives $t \approx 1.61$. This is not quite significant at the 10% level against a two-sided alternative. The hc3 $t$ statistic is 1.50. Replacing $\bar{\dot{Y}}_i$ with $\bar{\ddot{Y}}_i$ – obtained from linear detrending – decreases the estimate to 0.067, with an hc3 standard error of 0.055, giving $t \approx 1.21$.

The synthetic DiD estimate, obtained using the sdid package in Stata 18, is 0.099, and the standard error, using the placebo method, is 0.069 ($t = 1.41$). Both the estimate and its precision are in close agreement with the demeaning method described in Section 3.7.1.

3.8 Additional Practical Considerations

3.8.1 Choosing the Number of Time Periods

With synthetic control-type approaches and the approaches we suggest here, one can study the robustness of the findings by adjusting the number of pre-treatment time periods. For example, when using the unit-specific detrending method, we can vary the starting point of the data as a way of evaluating how the estimates change as the unit-specific trends are removed using different stretches of data. This is in the spirit of Rambachan and Roth (2023), who formalize the idea of allowing a range of violations of parallel trends and studying the sensitivity of the estimates.

In many examples, the policy intervention is likely based on past outcomes (in the untreated state). Does one need to go back, say, 20 years to remove unit-specific intercepts and trends to account for the selection into treatment?
In many cases, fewer pre-treatment periods might suffice. Because our approach does not rely on large $T_0$ or $T_1$ (but does rely on normality), the robustness of the estimates can be studied by varying $T_0$ in particular.

3.8.2 Clustering and Spatial Correlation with Larger Cross Sections

Although the previous procedures are motivated by settings with few control or treated units, or few of both, our approach also has benefits in settings where $N_{control}$ and $N_{treat}$ are large enough to rely on large-$N$ asymptotic analysis. Recall that the basic DiD estimator is obtained from

$\bar{\dot{Y}}_i$ on $1, D_i$, $\quad i = 1, \ldots, N,$

and the one that removes pre-treatment unit-specific trends is

$\bar{\ddot{Y}}_i$ on $1, D_i$, $\quad i = 1, \ldots, N.$

We can even add controls with large enough $N$. In Section 3.7 we showed that the same regressions can be used in the case of staggered assignment by defining $D_i$ to be an "ever treated" indicator – see (3.43) – and by modifying $\bar{\dot{Y}}_i$ or $\bar{\ddot{Y}}_i$ as in equation (3.46).

Because these are cross-sectional regressions, it is straightforward to compute standard errors clustered at a level higher than $i$. For example, if $i$ is a county but we are studying a policy that varies only at the state level, we can cluster at the state level, provided we have a sufficiently large number of treated and control states. Again, it does not matter how large $T_0$ and $T_1$ are. Abadie, Athey, Imbens and Wooldridge (2023) discuss why standard errors should be clustered when the intervention is at a higher level than the unit. As another example, $i$ could be a household whereas the policy is applied at the village level. Clustering can be done separately by time period, too. The key is that these are cross-sectional regressions, so the usual clustering methods apply; a sketch of state-level clustering is given at the end of this subsection.

In situations with a spatial structure – for example, when the assignment of treated units in implementing a new policy may be spatially correlated – we can obtain standard errors using a covariance matrix estimator robust to heteroskedasticity and spatial correlation. These so-called "SHAC" standard errors are proposed in Conley (1999).
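A minimal sketch of state-level clustering for the cross-sectional regression of the transformed outcome on the treatment indicator; the column names (ybar, D, state) and simulated data are illustrative, not from the text.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_states, units_per_state = 40, 25
state = np.repeat(np.arange(n_states), units_per_state)
# policy varies only at the state level; county-level outcomes share a state shock
D = np.repeat((rng.random(n_states) < 0.5).astype(int), units_per_state)
state_shock = np.repeat(rng.normal(0, 1, n_states), units_per_state)
ybar = 0.5 * D + state_shock + rng.normal(0, 1, len(D))

df = pd.DataFrame({"ybar": ybar, "D": D, "state": state})
fit = smf.ols("ybar ~ D", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["state"]}
)
print(fit.params["D"], fit.bse["D"])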
3.9 Concluding Remarks

Building on Lee and Wooldridge (2023), we have proposed a simple approach to inference when using panel data with few treated units or few control units. In the common timing case and without controls, the unit-specific demeaning reproduces the standard difference-in-differences estimator for each treated period, and also averaged across the treated periods. We also show how the unit-specific detrending method of LW (2023) can be implemented. Estimation is simple, and inference follows under normality using the classical linear model assumptions taught in introductory econometrics. Because the method uses averages across time, a central limit theorem in the time dimension can often be used to justify the normality assumption. Heteroskedasticity is always an issue in cross-sectional regressions, and we propose using what is known as the hc3 version provided there are at least a handful of treated units.

The approach here is not intended to replace the very popular synthetic DiD method of AAHIW (2021) in cases where SDiD has natural advantages. However, our simulations show that, when there is sufficient heterogeneity in, say, linear time trends, our method of unit-specific detrending can have substantially less bias. In some cases the inference is more reliable.

Moreover, SDiD is not intended to apply to situations with few time periods, or with few donor units among which to choose controls. The SDiD asymptotics assume that $N_0$, $T_0$, and $T_1$ all increase – although our simulations suggest SDiD tends to work well more generally, provided trends are not too heterogeneous. Our approach also allows the estimation of separate effects in each treated period, a feature not offered by SDiD methods. Also, we showed in Section 3.7 that allowing for staggered interventions is straightforward, and obtaining an overall weighted estimate with a valid standard error can be done with a single cross-sectional regression.

Overall, we view our approach as complementary to SDiD methods for many applications. When applied to the California smoking data, the state-specific detrending, using all 38 control states, produces estimates and inference similar to SDiD when restricting attention to the overall average effect. In applying our approach to the staggered rollout of so-called castle laws, the rolling method based on unit-specific demeaning gives a very similar point estimate and standard error to SDiD. For cases with a small number of time periods, a small number of controls, or both, exact inference in a cross-sectional regression using our transformation is appealing.

BIBLIOGRAPHY

Abadie, A., Athey, S., Imbens, G. W. and Wooldridge, J. M. (2023), "When should you adjust standard errors for clustering?", The Quarterly Journal of Economics 138(1), 1–35.

Abadie, A., Diamond, A. and Hainmueller, J. (2010), "Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program", Journal of the American Statistical Association 105(490), 493–505.

Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W. and Wager, S. (2021), "Synthetic difference-in-differences", American Economic Review 111(12), 4088–4118.

Callaway, B. and Sant'Anna, P. H. (2021), "Difference-in-differences with multiple time periods", Journal of Econometrics 225(2), 200–230.

Conley, T. G. (1999), "GMM estimation with cross sectional dependence", Journal of Econometrics 92(1), 1–45.

Cunningham, S. (2021), Causal Inference: The Mixtape, Yale University Press.

Davidson, R. and MacKinnon, J. G. (1993), Estimation and Inference in Econometrics, Oxford University Press, New York.

Donald, S. G. and Lang, K. (2007), "Inference with difference-in-differences and other panel data", The Review of Economics and Statistics 89(2), 221–233.

Hagemann, A. (2025), "Inference with a single treated cluster", Review of Economic Studies, rdaf002.

Heß, S. (2017), "Randomization inference with Stata: A guide and software", The Stata Journal 17(3), 630–651.

Lee, S. J. and Wooldridge, J. M. (2023), "A simple transformation approach to difference-in-differences estimation for panel data", available at SSRN 4516518.

Rambachan, A. and Roth, J. (2023), "A more credible approach to parallel trends", Review of Economic Studies 90(5), 2555–2591.

Simonsohn, U. (2021), "Just run robust standard errors: A commentary on Young (2019)", working paper, available at http://urisohn.com/43.

Wooldridge, J. M. (2010), Econometric Analysis of Cross Section and Panel Data, MIT Press.

Wooldridge, J. M. (2020), Introductory Econometrics: A Modern Approach, Cengage Learning, Boston, MA.

Wooldridge, J. M. (2021), "Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators", available at SSRN 3906345.
Wooldridge, J. M. (2025), "Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators", latest version of Wooldridge (2021).

Young, A. (2019), "Channeling Fisher: Randomization tests and the statistical insignificance of seemingly significant experimental results", The Quarterly Journal of Economics 134(2), 557–598.

CHAPTER 4

ROLLING APPROACH TO DIFFERENCE-IN-DIFFERENCES: EXPLORING TREATMENT REVERSIBILITY AND MODERATING EFFECTS

4.1 Introduction

Difference-in-differences (DiD) methods have become indispensable for estimating causal effects in settings with staggered interventions and heterogeneous responses. Yet recent work has highlighted a critical limitation: when treatment effects vary across groups or over time, the traditional two-way fixed effects (TWFE) estimator can produce biased estimates under staggered intervention designs. To overcome this issue, several alternative estimators have been proposed to capture treatment effect heterogeneity more faithfully; see, for example, Callaway and Sant'Anna (2021), Wooldridge (2021), Sun and Abraham (2021), Borusyak et al. (2024), and De Chaisemartin and d'Haultfoeuille (2020).

Most of these approaches, however, rely on the assumption of absorbing treatments—once a unit receives treatment, it remains treated in all subsequent periods. But in many real-world settings, treatment status is neither permanent nor monotonic. For example, firms may adopt a policy temporarily, governments may roll out and later withdraw interventions, or—as in the setting examined in this paper—pharmacy chains may open and later close locations. Most existing methods are not well suited to handle such dynamic treatment regimes, and few provide formal guidance for estimation and aggregation under this complexity.

Recently, De Chaisemartin and d'Haultfoeuille (2024) propose a framework designed to identify treatment effects in contexts with non-binary, non-absorbing, and potentially lagged treatments, building on their prior work in De Chaisemartin and d'Haultfoeuille (2020). Their method accommodates reversible treatment states, allowing units to both enter and exit treatment over time, and explicitly accounts for dynamic effects that unfold across different exposure durations.

However, their approach defines treatment relative to exposure duration and classifies units based on their initial transition into treatment. Once a unit exits treatment, its subsequent observations are excluded from the analysis. As a result, the framework does not permit the estimation of post-exit effects, and re-entry into treatment is not considered. This limitation hinders the ability to assess whether treatment effects persist after discontinuation or to evaluate the impact of repeated exposures.

In contrast, this paper introduces a more flexible framework—building on Lee and Wooldridge (2023)—that enables the estimation of treatment effects both during and after treatment episodes. By aligning event time relative to treatment exposure (e.g., $t - g$, where $g$ is the treatment start period), I incorporate both on-treatment and post-exit periods in the analysis, facilitating the identification of lingering or decaying effects over time. This extension is particularly useful for evaluating whether the impact of an intervention persists or fades after discontinuation.
Furthermore, I explicitly account for re-entry into treatment—a case ignored in prior implementations—by identifying multiple treatment paths and estimating the effects of subsequent exposures separately from the initial one. This allows for direct comparisons between first-time and repeated treatments and offers a more comprehensive perspective on dynamic, non-monotonic treatment patterns. Together, these contributions enhance the applicability of dynamic DiD methods in real-world settings where treatment status can change multiple times, allowing one to evaluate whether being treated again has the same, weaker, or stronger impact than the first treatment.

In addition, I propose an extended doubly robust estimator that incorporates moderating effects across subgroups using an augmented inverse probability weighting and regression adjustment (IPWRA) approach. This method allows for explicit estimation of treatment effect heterogeneity across subgroups while preserving the desirable double robustness property. It also enables researchers to examine whether treatment effects vary across subgroups, allowing for more nuanced evaluations of treatment or policy interventions in the presence of demographic or structural heterogeneity.

The remainder of the paper is organized as follows. Section 4.2 outlines the setup and identifying assumptions underlying the proposed framework. In Section 4.3, I demonstrate how to incorporate moderating effects while leveraging a doubly robust estimator. Section 4.4 presents Monte Carlo simulation results to evaluate the performance of the proposed method. In Section 4.5, I apply the approach to data on the entry and exit of chain pharmacy stores to examine their impact on independently owned pharmacies in rural areas. Section 4.6 concludes.

4.2 Setup and Identification

This framework is designed to accommodate treatment patterns involving both staggered adoption and reversible treatment spells, aligning with and extending the work of Lee and Wooldridge (2023). By incorporating exit and re-entry episodes, we provide a comprehensive toolkit for analyzing dynamic treatment effects where treatment status may change multiple times throughout the study period.

Our approach initially focuses on binary treatment indicators, classifying units as either treated or untreated at each time point. I define the treatment indicator $D_{ig} \in \{0, 1\}$, where $D_{ig} = 1$ if unit $i$ is treated at time $g$, and $D_{ig} = 0$ otherwise. For notational simplicity, I denote the treatment indicator as $D_g \equiv D_{ig}$, suppressing the unit index when focusing solely on treatment timing. Future work will extend this framework to discrete treatment intensities, where units can receive varying levels of treatment $D_{g,t} \in \{0, 1, \ldots, K\}$, allowing for the analysis of policies where treatment intensity (e.g., dosage or funding levels) is a crucial factor.

We consider the following treatment patterns:

• Case 1: Binary Treatment with Staggered Adoption — This case mirrors the structure in Lee and Wooldridge (2023), where treatment is introduced at different times across units, without accounting for exit or re-entry.

• Case 2: Binary Treatment with Exit and Re-entry — Units may exit treatment and potentially re-enter at a later time, resulting in multiple treatment spells. This setting allows for robust comparisons by aligning units based on their treatment histories and adjusting for pre-treatment dynamics.
By accommodating a wider array of treatment patterns—including reversible and intermittent treatments—our framework provides a unified, flexible approach for researchers aiming to assess the dynamic and heterogeneous effects of policy interventions in realistic settings.

4.2.1 Identification Assumptions

To identify treatment effects under these settings, we introduce the following assumptions, which follow Lee and Wooldridge (2023). Following the potential outcomes framework, let $Y_t(g)$ denote the potential outcome at time $t$ for a unit treated at time $g$. For units that are never treated, we define the untreated state using $\infty$, such that $Y_t(\infty)$ represents the potential outcome under the control state at time $t$.

Assumption 1. Conditional No Anticipation (CNA): For $g \in \{S, \ldots, T\}$, $t \in \{1, \ldots, g-1\}$, and covariates $\mathbf{X}$,

$E\left[Y_t(g) \mid D_g = 1, \mathbf{X}\right] = E\left[Y_t(\infty) \mid D_g = 1, \mathbf{X}\right].$ □  (4.1)

This implies that treatment effects prior to the intervention are all zero.

Assumption 2. Conditional Parallel Trends for Treatment, Exit and Re-entry Groups (CPT for T.E.R): For $\mathbf{D} = (D_S, \ldots, D_T)$,

$E\left[Y_t(\infty) - Y_1(\infty) \mid \mathbf{D}, \mathbf{X}\right] = E\left[Y_t(\infty) - Y_1(\infty) \mid \mathbf{X}\right], \quad t = 2, \ldots, T.$ □  (4.2)

Units with identical outcome paths up to $t = g - 1$ are assumed to follow the same expected trends in the evolution of control-state outcomes, regardless of whether they exit or re-enter treatment after their initial treatment at time $g$. This allows for constructing a control group based on pre-treatment outcome paths, facilitating identification of treatment, exit, and re-entry effects.

Assumption 3. Overlap: For cohorts $g \in \{S, \ldots, T\}$ and time periods $r \in \{g, g+1, \ldots, T\}$,

$P\left(D_g = 1 \mid D_g + A_{r+1} = 1, \mathbf{X} = \mathbf{x}\right) < 1$ for all $\mathbf{x} \in \mathrm{Supp}(\mathbf{X}).$ □  (4.3)

This condition ensures that, within the subpopulation consisting of cohort $g$ together with the never-treated and not-yet-treated units at time $r$ (denoted by $A_{r+1} = 1$), every treated unit has a comparable control unit with the same values of the covariates.

4.2.2 Estimation Strategy: Treatment, Exit, and Re-entry Effects

The estimation procedure is structured in two main stages.

First Stage: General Effect Estimation. For each treatment group $g$, i.e., units whose initial treatment occurs at time $g$, we define a control group at time $t$ based on units with the same outcome paths up until time $g-1$ that are not yet treated at time $t$, which includes the never-treated units. At this stage, I focus solely on each group's first treatment timing $g$, without accounting for subsequent treatment, exit, or re-entry status. That is, we estimate the treatment effects on the treated, ATT(g, t), as if each group were treated at $g$ and then stayed treated up to $T$, so that its realized potential outcome is taken to be $Y_{it}(g)$. However, these units could "exit" or "re-enter" down the road. Thus, I define this estimated group-time-specific effect as a general treatment effect on the treated (GTT) at time $t$ for group $g$, as follows:

$GTT(g, t) = E\left[Y_{it}(g) - Y_{it}(\infty) \mid D_g = 1\right].$  (4.4)

To obtain $GTT(g, t)$, I simply apply Procedure 4.1 (Rolling Methods, Staggered Interventions) in Lee and Wooldridge (2023):

Step 1. For a given $g \in \{S, \ldots, T\}$ and time period $r \in \{g, g+1, \ldots, T\}$, compute

$\dot{Y}_{irg} \equiv Y_{ir} - \frac{1}{g-1} \sum_{s=1}^{g-1} Y_{is} \equiv Y_{ir} - \bar{Y}_{i,pre(g)}.$  (4.5)

Step 2. Choose as the control group the units with $D_{i,r+1} + D_{i,r+2} + \cdots + D_{iT} + D_{i\infty} = 1$ (or, if desired, a subset, such as the never-treated (NT) group).
Step 3. Using the subset of data with

$D_{ig} + D_{i,r+1} + D_{i,r+2} + \cdots + D_{iT} + D_{i\infty} = 1,$  (4.6)

apply standard treatment effect (TE) methods—such as linear regression adjustment (RA), inverse probability weighting (IPW), IPWRA, and matching—to each cross-sectional dataset $\{(\dot{Y}_{irg}, D_{ig}, \mathbf{X}_i) : i = 1, \ldots, N\}$ separately, with $D_{ig}$ acting as the treatment indicator. □

Lee and Wooldridge (2023) establish that the coefficient on $D_{ig}$ from these cross-sectional regressions identifies the average treatment effect on the treated (ATT) for group $g$ at time $t$, denoted $ATT(g, t)$, under the assumption of treatment irreversibility. In addition, if one is concerned about unit-specific heterogeneous linear trends, one can apply Procedure 5.1 in Lee and Wooldridge (2023), which allows for such trends.

However, in the present framework, I allow for the possibility that group $g$ may exit treatment at certain periods $t$ ($> g$). Thus, while I initially follow the Lee and Wooldridge (2023) approach to estimate the effect for these periods, I subsequently reclassify these periods as exit periods in the second stage of the estimation procedure. Consequently, instead of defining the effect as $ATT(g, t)$, I adopt a broader concept, denoted $GTT(g, t)$, representing the generalized treatment effect for group $g$ at time $t$, which can encompass not only the standard treatment effect but also exit and re-entry effects.

Second Stage: Classification and Aggregation by Treatment Path. After estimating $GTT(g, t)$ for all $g \in \{S, \ldots, T\}$ and $t \in \{g, \ldots, T\}$, we classify each period based on the treatment path, allowing for a more nuanced interpretation of treatment effects. The classification includes three distinct categories based on the observed treatment status:

$\alpha \in \{\text{Treated}, \text{Exit}, \text{Re-entry}\}.$

In this context, Treated refers to time periods immediately following the initial treatment, indicated by a positive treatment status relative to the first treatment timing. Exit represents periods after treatment cessation, during which the treatment status returns to zero following the initial treatment. Re-entry denotes periods after the re-initiation of treatment following an exit episode, allowing for multiple treatment spells.

Table 4.1 illustrates the classification for two groups, $g = 4$ and $g = 3$, each representing a distinct treatment path. The table presents the relative time for each treatment spell, distinguishing periods of treatment, exit, and re-entry based on the specific treatment path of each group; a code sketch of this classification follows the table.

Table 4.1 Classification of Treatment Paths for Groups $g = 4$ and $g = 3$

Group $g = 4$
Calendar Time (t)           1    2    3    4    5    6    7    8    9
Treatment Path              0    0    0    1    1    1    0    0    0
1. Treatment Period (r)    −3   −2   −1    0    1    2    –    –    –
2. Exit Period (r)          –    –    –    –    –    –    0    1    2
3. Re-entry Period (r)      –    –    –    –    –    –    –    –    –

Group $g = 3$
Calendar Time (t)           1    2    3    4    5    6    7    8    9
Treatment Path              0    0    1    1    0    0    0    1    1
1. Treatment Period (r)    −2   −1    0    1    –    –    –    –    –
2. Exit Period (r)          –    –    –    –    0    1    2    –    –
3. Re-entry Period (r)      –    –    –    –    –    –    –    0    1

In Group $g = 4$, treatment is initiated at $t = 4$, continues for three periods, and then exits at $t = 7$. Exit effects are observed from $t = 7$ to $t = 9$. No re-entry occurs in this group. In contrast, Group $g = 3$ initiates treatment earlier, at $t = 3$, exits at $t = 5$, and then re-enters at $t = 8$. The re-entry period spans $t = 8$ to $t = 9$, allowing for the estimation of re-treatment effects distinct from the initial treatment effects.
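A minimal sketch of the second-stage classification for one unit's binary treatment path; classify_path is an illustrative helper name.

import numpy as np

def classify_path(d):
    """d is one unit's binary treatment path over t = 1,...,T. Returns the
    first treatment period g (0 if never treated) and a label per period in
    {'pre', 'treated', 'exit', 're-entry'}."""
    d = np.asarray(d)
    labels, state = [], "pre"
    g = int(np.argmax(d) + 1) if d.any() else 0
    for t, dt in enumerate(d, start=1):
        if g == 0 or t < g:
            labels.append("pre")
        elif dt == 1:
            # a 1 after an exit spell starts a re-entry spell
            state = "treated" if state in ("pre", "treated") else "re-entry"
            labels.append(state)
        else:
            state = "exit"
            labels.append("exit")
    return g, labels

# Group g = 3 from Table 4.1: treated t = 3-4, exit t = 5-7, re-entry t = 8-9
print(classify_path([0, 0, 1, 1, 0, 0, 0, 1, 1]))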
4.2.3 Aggregation Strategy

Define relative time $r = t - g$. Then, we compute the weighted average of the group-time general treatment effects $GTT(g, t)$ based on relative time $r$ for each $\alpha \in \{\text{Treated}, \text{Exit}, \text{Re-entry}\}$:

$WGTT(\alpha, r) = \sum_{g \in G_{r,\alpha}} w(g, r) \cdot GTT(g, g + r), \quad \text{where } r = t - g,$  (4.7)

where $G_{r,\alpha}$ denotes the set of groups with relative time $r$ and treatment state $\alpha$ at time $t$, and $w(g, r)$ represents the weight assigned to group $g$. While various weighting strategies could be considered, this paper adopts a simple proportional weighting scheme based on group sizes:

$w(g, r) = \frac{N_g}{N_{G_{r,\alpha}}},$

where $N_g$ is the number of units in group $g$, and $N_{G_{r,\alpha}}$ is the total number of units across all groups in $G_{r,\alpha}$ used for estimating $WGTT(\alpha, r)$. For example, if only groups $b$ and $c$, each with 10 units, contribute to the estimation at $r = 1$, then $w(b, 1) = w(c, 1) = 10/20 = 0.5$.

This aggregation yields three distinct WGTT series, which can be visualized through event-study plots to analyze dynamic treatment effects across the different treatment states; a computational sketch appears at the end of this subsection.

De Chaisemartin and d'Haultfoeuille (2024) also propose an estimation method for settings with dynamic treatment patterns, such as treatment entry and exit. However, their approach addresses this complexity by focusing exclusively on the "treatment period" and explicitly excluding observations during "exit" and "re-entry" periods. This restriction is driven by their no-crossing condition, which requires treatment levels to remain strictly above or below the period-one status ($t = 1$), effectively dropping post-exit observations from the analysis. As a result, their framework does not accommodate re-entry into treatment or estimate effects beyond the treatment spell.

In contrast, I leverage control groups to estimate effects across all three treatment states. By explicitly classifying periods according to treatment status, the proposed framework enables a comprehensive analysis of how treatment effects evolve across multiple spells and interruptions. This extension provides researchers with a deeper understanding of treatment dynamics over time, including potential lingering effects after discontinuation and resurgent effects following re-entry.

In addition, a key strength of the framework proposed by De Chaisemartin and d'Haultfoeuille (2024) lies in its ability to accommodate discrete treatment intensities and its flexibility regarding the initial treatment status. Specifically, even when units are already treated in the initial period, their approach allows for the estimation of treatment effects for "switchers"—units whose treatment intensity strictly increases or decreases relative to the baseline level. By using status-quo units as the control group, their method enables identification of dynamic treatment effects for such transitions under certain assumptions.

While my current study focuses on binary treatment effects, the proposed classification-based framework is naturally extensible to settings with discrete or continuous treatment intensities. Incorporating these variations into the treatment-exit-re-entry structure is a key direction for future research. Such an extension would enable a more comprehensive evaluation of dynamic policy interventions, where treatment may vary not only in timing but also in dosage or intensity.
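Returning to the aggregation in (4.7), here is a minimal sketch with size-proportional weights. Following (4.7), relative time is measured from the initial treatment date $g$; the spell-specific clocks of Table 4.1 would require only a different key. The data layout (dictionaries keyed by group and period) is an illustrative assumption.

import numpy as np

def wgtt(gtt, sizes, state):
    """gtt[(g, t)] -> estimated GTT(g, t); sizes[g] -> N_g;
    state[(g, t)] -> 'treated', 'exit', or 're-entry'.
    Returns WGTT(alpha, r) for every populated (alpha, r) cell."""
    cells = {}
    for (g, t), a in state.items():                 # groups contributing to each cell
        cells.setdefault((a, t - g), []).append(g)
    out = {}
    for (a, r), groups in cells.items():
        n_total = sum(sizes[g] for g in groups)     # N_{G_{r,alpha}}
        out[(a, r)] = sum(sizes[g] / n_total * gtt[(g, g + r)] for g in groups)
    return out

# Tiny example: two cohorts, 10 units each, both 'treated' at relative time 0
gtt = {(3, 3): 1.0, (4, 4): 2.0}
state = {(3, 3): "treated", (4, 4): "treated"}
print(wgtt(gtt, {3: 10, 4: 10}, state))   # {('treated', 0): 1.5}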
4.2.4 Subgroup-Specific Generalized Treatment Effects (SGTTs)

In the previous discussion, the treatment group $g$ was assumed to exit treatment uniformly, implying that all units within the group followed an identical treatment trajectory over time. However, in many empirical settings, units within the same treatment group may experience heterogeneous treatment patterns. Specifically, within group $g$, some units may remain treated throughout, others may exit treatment at certain periods, and some may re-enter treatment after a period of exit.

To account for these varied treatment patterns, I define three subgroup-specific generalized treatment effects (SGTTs) for each treatment group $g$ at time $t$, as follows:

$SGTT(g, t, \alpha) = E\left[Y_{it}(g) - Y_{it}(\infty) \mid D_g = 1, \alpha\right], \quad \alpha \in \{\text{Treated}, \text{Exit}, \text{Re-entry}\}.$  (4.8)

Each SGTT captures the generalized treatment effect for units that remain treated at time $t$, for units that have exited treatment by time $t$, or for units that have re-entered treatment at time $t$. As in the first stage above, following Lee and Wooldridge (2023, Procedure 4.1), we estimate the subgroup-specific generalized treatment effects $SGTT(g, t, \alpha)$. These SGTTs are subsequently utilized in the second stage of the estimation procedure, where the effects are aggregated based on the observed treatment status:

$WGTT(\alpha, r) = \sum_{g_\alpha \in G_{r,\alpha}} w(g_\alpha, r) \cdot SGTT(g, g + r, \alpha), \quad \text{where } r = t - g,$  (4.9)

where $w(g_\alpha, r) = N_{g_\alpha} / N_{G_{r,\alpha}}$ denotes the weight assigned to group $g$ in state $\alpha \in \{\text{Treated}, \text{Exit}, \text{Re-entry}\}$ at relative time $r$.

4.3 Moderating Effects

This section extends the rolling transformation framework in Lee and Wooldridge (2023) to incorporate moderating effects—treatment effect heterogeneity driven by observable unit characteristics. Such heterogeneity arises when the impact of a treatment varies across subgroups defined by factors such as income, race, gender, or baseline access levels. For instance, in the empirical application section, I consider two subgroups: $b_i = 1$ if a town has more than 20% of residents aged 65 or older, and $b_i = 0$ otherwise. The treatment effect may differ across these subgroups.

To address treatment effect heterogeneity, I outline a two-stage estimation procedure for a generalized version of the inverse probability weighted regression adjustment (IPWRA) method, accommodating staggered intervention settings with multiple treatment groups.

Stage 1: Propensity Score Estimation. To account for staggered treatment timing, we estimate the propensity score using a multinomial logit model, where the treatment status $D_g$ is defined by the first treatment period $g \in \{S, S+1, \ldots, T\}$. The propensity score for each unit is given by

$p_g(\mathbf{X}) = \Pr(D_g = 1 \mid \mathbf{X}) = \frac{\exp(\mathbf{X}'\delta_g)}{1 + \sum_{k=1}^{G} \exp(\mathbf{X}'\delta_k)},$  (4.10)

where the $\delta_g$ are the parameters estimated for each treatment group $g$. The control group's (i.e., $g = \infty$) coefficient vector is normalized to zero.

Stage 2: IPWRA Estimation with Moderating Effects. After obtaining the estimated propensity scores $\hat{p}_g(\mathbf{X}_i)$, I proceed to the IPWRA estimation that incorporates moderating effects. Specifically, the treatment effect for group $g$ at time $t$ is estimated by solving the following weighted least squares problem:

$\arg\min_{\tau_{g,t}, \alpha, \beta, \gamma_{g,t}} \sum_{i=1}^{N} \left( D_g + \frac{\hat{p}_g(\mathbf{X})(1 - D_g)}{1 - \hat{p}_g(\mathbf{X})} \right) \left( \dot{Y}_{g,t} - \tau_{g,t} \cdot D_g - \alpha - \mathbf{X}_i'\beta - D_g \left(\mathbf{X}_i - \bar{\mathbf{X}}_g\right)'\gamma_{g,t} \right)^2.$  (4.11)

Here, $\gamma_{g,t}$ captures the moderating effects—heterogeneous treatment effects that vary with deviations from the treated group's average covariates $\bar{\mathbf{X}}_g$. This formulation allows for a flexible and interpretable estimation of treatment effect heterogeneity across subgroups; a sketch of the two-stage procedure is given below.
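A minimal sketch of the two stages in the common timing case with a single treated group, so that Stage 1 reduces to a binary logit; ipwra_moderated and the simulated data are illustrative assumptions, not the paper's implementation.

import numpy as np
import statsmodels.api as sm

def ipwra_moderated(Ydot, D, X):
    """(i) logit propensity score; (ii) weighted least squares as in (4.11),
    with D * (X - Xbar_treated) interactions as the moderators."""
    X1 = sm.add_constant(X)
    # Stage 1: propensity scores from a logit of D on X
    p = sm.Logit(D, X1).fit(disp=0).predict(X1)
    w = D + p * (1 - D) / (1 - p)                  # ATT-style IPW weights
    # Stage 2: weighted regression with treated-centered interactions
    Xc = X - X[D == 1].mean(axis=0)                # deviations from treated means
    design = sm.add_constant(np.column_stack([D, X, D[:, None] * Xc]))
    fit = sm.WLS(Ydot, design, weights=w).fit()
    k = X.shape[1]
    tau = fit.params[1]                            # ATT for this (g, t)
    gamma = fit.params[2 + k:]                     # moderating effects
    return tau, gamma

# Usage with simulated data: true ATT = 2, moderating effect of x1 = 0.5
rng = np.random.default_rng(5)
N = 2000
X = rng.normal(0, 1, (N, 2))
p_true = 1 / (1 + np.exp(-(X @ [0.5, -0.5] - 0.5)))
D = (rng.random(N) < p_true).astype(int)
Xc = X - X[D == 1].mean(axis=0)
Ydot = X @ [1.0, 1.0] + D * (2 + 0.5 * Xc[:, 0]) + rng.normal(0, 1, N)
print(ipwra_moderated(Ydot, D, X))

Centering the interactions at the treated group's covariate means keeps the coefficient on $D$ interpretable as the ATT while the $\gamma$ coefficients pick up the heterogeneity.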
Importantly, this structure reproduces the same point estimate for $\tau_{g,t}$ as the baseline IPWRA estimator in Lee and Wooldridge (2023), given by

$\arg\min_{\tau_{g,t}, \alpha, \beta} \sum_{i=1}^{N} \left( D_g + \frac{\hat{p}_g(\mathbf{X})(1 - D_g)}{1 - \hat{p}_g(\mathbf{X})} \right) \left( \dot{Y}_{g,t} - \tau_{g,t} \cdot D_g - \alpha - \mathbf{X}_i'\beta \right)^2.$  (4.12)

However, the inclusion of the $\gamma$ coefficients provides additional insight by quantifying the extent to which the treatment effect differs across levels of the baseline covariates, thereby offering a more comprehensive understanding of policy impacts across heterogeneous subgroups.

This framework offers a practical advantage over standard built-in commands in Stata such as teffects ipwra, which does not support estimation of moderating effects. By applying the transformed-outcome approach and the two-stage IPWRA estimation procedure, we can estimate moderating effects while leveraging the efficiency of doubly robust estimation.

4.3.1 Comparison with the Pooled OLS Estimator

Wooldridge (2021) proposes a pooled OLS method for estimating treatment effect heterogeneity over time, while also allowing for the identification of moderating effects. For illustration, consider the common timing case in which all treated units received their initial treatment in the same period $t = q$. The specification includes a full set of time fixed effects and interaction terms between the treatment indicator and covariates, including demeaned covariates (centered at the treated group mean). As shown in Equation (5.46) of Wooldridge (2021), the pooled OLS regression includes terms of the form

$y_{it}$ on $1, d_i, \dot{\mathbf{x}}_i, d_i \cdot \dot{\mathbf{x}}_i, f_{q,t}, \ldots, f_{T,t}, f_{q,t} \cdot \dot{\mathbf{x}}_i, \ldots, f_{T,t} \cdot \dot{\mathbf{x}}_i, d_i \cdot f_{q,t}, \ldots, d_i \cdot f_{T,t}, d_i \cdot f_{q,t} \cdot \dot{\mathbf{x}}_i, \ldots, d_i \cdot f_{T,t} \cdot \dot{\mathbf{x}}_i,$  (4.13)

where $f_{q,t}, \ldots, f_{T,t}$ denote time indicators such that $f_{t',t} = 1$ if $t = t'$ and 0 otherwise, for $t' \in \{q, \ldots, T\}$, the post-treatment periods. The covariates $\dot{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}_1$ represent deviations from the treated group's mean covariate values. The terms $d_i \cdot f_{t',t} \cdot \dot{\mathbf{x}}_i$ capture period-specific moderating effects, allowing the treatment effect to vary flexibly with covariates across time. Each coefficient on these triple interaction terms identifies $\gamma(t)$, the marginal effect of the covariates on the treatment effect in period $t$.

While straightforward and easy to implement, this regression-based approach relies on correct specification of the outcome model. If relevant interaction or nonlinear terms from the true model—which are rarely known to researchers in practice—are omitted, the estimator may suffer from bias due to model misspecification.

In contrast, my proposed two-stage IPWRA estimator provides greater robustness through its doubly robust structure. It remains consistent as long as either the outcome model or the propensity score model is correctly specified. Importantly, it also enables the estimation of moderating effects even when the outcome model is misspecified. In Section 4.4, I demonstrate through simulation studies that the two-stage IPWRA approach successfully recovers both the average treatment effects on the treated (ATTs) and the corresponding moderating effects.
Compared to POLS, the two-stage IPWRA yields substantially lower bias and RMSE for heterogeneous treatment effects under outcome model misspecification, while maintaining accurate estimation of average treatment effects. These findings underscore the practical value of the proposed method for evaluating differential policy impacts across heterogeneous populations, especially in settings where model misspecification is a concern.

4.4 Monte Carlo Simulations

This section outlines the data generating process (DGP) employed in the simulation study to evaluate the performance of the proposed two-stage IPWRA estimator described in Section 4.3. Although the study primarily focuses on staggered intervention settings, the simulation design is restricted to a common timing case for simplicity. The findings demonstrate that the proposed two-stage IPWRA estimator successfully recovers both moderating effects and treatment effects. Regarding treatment effects, the estimator exactly replicates the baseline IPWRA estimates in Lee and Wooldridge (2023).

4.4.1 Data Generating Process (DGP)

The data generating process (DGP) is structured to simulate treatment effects under a common timing framework, involving a single treatment group and a control group. The DGP replicates the procedure outlined in Appendix 2D of Lee and Wooldridge (2023), and the relevant variables are defined as described below.

The dataset spans six time periods ($T = 6$), representing the years 2001 to 2006, with treatment initiated at time $S = 4$. The control group ($D_i = 0$) consists of never-treated units, while the treatment group ($D_i = 1$) includes units treated at $S = 4$. The detailed structure of the variables is as follows. I generate two time-invariant covariates $\mathbf{X} = (X_1, X_2)$:

$$X_1 \sim Gamma(2, 2) \quad \text{and} \quad X_2 \sim Bernoulli(0.6)$$

The treatment indicator is defined through a logistic propensity score model:

$$p(\mathbf{X}) = \Pr(D = 1 \mid \mathbf{X}) = \frac{\exp(\mathbf{X}'\mathbf{z})}{1 + \exp(\mathbf{X}'\mathbf{z})}, \tag{4.14}$$

where the index function is specified as $\mathbf{X}'\mathbf{z} = -1.2 + \frac{X_1 - 4}{2} - X_2$.

To investigate the moderating effects, the treatment effects are generated to exhibit heterogeneity not only across time but also over the covariates $\mathbf{X}$. The treatment effect function is specified as follows:

$$\tau_r(\mathbf{X}) = \theta \cdot \sum_{s=S}^{r} (s - S + 1)^{-1} + \lambda_r \cdot h(\mathbf{X}), \tag{4.15}$$

where $\theta = T - S + 1$ and $\lambda_r = (0.5, 0.6, 0.8)$. The functional form of $h(\mathbf{X})$ is:

$$h(\mathbf{X}) = \frac{X_1 - 4}{2} + \frac{X_2}{3}$$

In this specification, $\theta \cdot \sum_{s=S}^{r} (s - S + 1)^{-1}$ captures the baseline treatment effect accumulated through period $r$, while $\lambda_r \cdot h(\mathbf{X})$ represents the period-specific moderating effect, allowing treatment effects to vary with covariates across time.

Next, I define the potential outcome in the control state:

$$Y_t(0) = \delta_t + \alpha_i + \beta_t \cdot f(\mathbf{X}) + U_t(0), \tag{4.16}$$

where $\delta_t = t$, $\alpha_i \sim \mathcal{N}(2, 1)$, and $U_t(0) \sim \mathcal{N}(0, 4)$. The coefficient vector $\beta'$ is set as $\beta' = (1.0, 1.5, 0.8, 1.5, 2, 2.5)$. $f(\mathbf{X})$ takes two functional forms:

$$f(\mathbf{X}) = \frac{X_1 - 4}{3} + \frac{X_2}{2}, \tag{4.17}$$

$$f(\mathbf{X}) = \frac{X_1 - 4}{3} + \frac{X_2}{2} + \frac{(X_1 - 4)^2}{3} + \frac{(X_1 - 4) \cdot X_2}{4} \tag{4.18}$$

For the simulation studies, I assume that the outcome model is linear in $\mathbf{X}$. Under this specification, the conditional mean function $E(Y \mid \mathbf{X})$ is correctly specified when using equation (4.17), but becomes misspecified under equation (4.18) due to the inclusion of interaction and quadratic terms in the true data-generating process. However, since the IPWRA estimator is doubly robust, it remains consistent as long as the propensity score model is correctly specified. Therefore, even under outcome model misspecification, I expect the IPWRA estimator to accurately recover the treatment effect estimates.

Finally, the potential outcome in the treated state is

$$Y_t(1) = \begin{cases} Y_t(0), & t < S \\ Y_t(0) + \tau_t + U_t(1) - U_t(0), & t \geq S \end{cases}$$

where $U_t(1) \sim \mathcal{N}(0, 4)$.
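For replication purposes, the DGP above can be coded directly. The sketch below is one assumed reading of the design: $Gamma(2, 2)$ is taken as shape 2 and scale 2 (so $E[X_1] = 4$, matching the centering at 4), $\mathcal{N}(0, 4)$ is read as variance 4, and the baseline effect cumulates through period $r$, which is consistent with the sample ATTs of 3.220, 4.753, and 5.838 reported below.

    import numpy as np

    def simulate_panel(N, rng, misspec=False, T=6, S=4):
        # Covariates: X1 ~ Gamma(2, 2) (shape 2, scale 2), X2 ~ Bernoulli(0.6).
        x1 = rng.gamma(2.0, 2.0, size=N)
        x2 = rng.binomial(1, 0.6, size=N)

        # Treatment assignment from the logistic model in eq. (4.14).
        index = -1.2 + (x1 - 4) / 2 - x2
        d = rng.binomial(1, 1 / (1 + np.exp(-index)))

        alpha = rng.normal(2, 1, size=N)            # unit heterogeneity
        beta = [1.0, 1.5, 0.8, 1.5, 2.0, 2.5]
        lam = {4: 0.5, 5: 0.6, 6: 0.8}
        theta = T - S + 1
        f_x = (x1 - 4) / 3 + x2 / 2                 # eq. (4.17)
        if misspec:                                 # eq. (4.18)
            f_x = f_x + (x1 - 4) ** 2 / 3 + (x1 - 4) * x2 / 4
        h_x = (x1 - 4) / 2 + x2 / 3

        y = np.empty((N, T))
        for t in range(1, T + 1):
            u0 = rng.normal(0, 2, size=N)           # U_t(0), variance 4
            y0 = t + alpha + beta[t - 1] * f_x + u0 # eq. (4.16)
            if t < S:
                y[:, t - 1] = y0
            else:
                tau = theta * sum(1.0 / (s - S + 1) for s in range(S, t + 1)) \
                      + lam[t] * h_x                # eq. (4.15)
                u1 = rng.normal(0, 2, size=N)       # U_t(1), variance 4
                y[:, t - 1] = y0 + d * (tau + u1 - u0)
        return y, d, np.column_stack([x1, x2])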
4.4.2 Simulation Results

Table 4.2 presents the simulation results under correct specification of the outcome model, where the conditional mean function $E(Y \mid \mathbf{X})$ is linear in covariates. I compare the performance of four estimators: pooled OLS (POLS) following Wooldridge (2021), the baseline IPWRA proposed in Lee and Wooldridge (2023), the doubly robust estimator developed by Callaway and Sant'Anna (2021) (denoted CS), and the proposed two-stage IPWRA. All four are designed to estimate average treatment effects on the treated (ATTs), particularly in the presence of treatment effect heterogeneity across units and time. Estimator performance is evaluated based on bias, standard deviation (SD), and root mean squared error (RMSE), focusing on both average treatment effects ($\tau_4$, $\tau_5$, $\tau_6$) and moderating effects associated with covariates $x_1$ and $x_2$.

Under correct specification, all estimators perform reasonably well in recovering the ATTs. Both the baseline IPWRA and two-stage IPWRA estimators yield identical estimates with virtually zero bias and low RMSE across all time periods. This confirms that the two-stage extension preserves the desirable statistical properties of the standard IPWRA estimator when estimating group-time average treatment effects.

Table 4.2 Simulation Results Under Correct Specification of $E(Y \mid \mathbf{X})$

ATT (sample value)        τ4 = 3.220               τ5 = 4.753               τ6 = 5.838
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS (RA)           -0.002  0.395   0.395    -0.013  0.410   0.410    -0.008  0.413   0.413
CS                  -0.003  0.503   0.503    -0.015  0.503   0.504    -0.009  0.518   0.518
IPWRA               -0.001  0.403   0.403    -0.014  0.413   0.413    -0.007  0.421   0.421
Two-Stage IPWRA     -0.001  0.403   0.403    -0.014  0.413   0.413    -0.007  0.421   0.421

Moderating Effects      γ(x1, 4) = 0.250         γ(x1, 5) = 0.300         γ(x1, 6) = 0.400
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS                -0.006  0.324   0.325     0.000  0.328   0.328     0.001  0.332   0.332
Two-Stage IPWRA     -0.006  0.370   0.370    -0.002  0.361   0.361     0.003  0.365   0.365

Moderating Effects      γ(x2, 4) = 0.167         γ(x2, 5) = 0.200         γ(x2, 6) = 0.267
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS                 0.038  0.803   0.804     0.037  0.811   0.811    -0.023  0.836   0.837
Two-Stage IPWRA      0.030  0.848   0.849     0.045  0.850   0.851    -0.019  0.869   0.869

Importantly, both POLS and the two-stage IPWRA also demonstrate strong performance in estimating the moderating effects $\gamma(x_1, t)$ and $\gamma(x_2, t)$. For example, when the true moderating effect $\gamma(x_1, 4)$ is 0.25, the two-stage IPWRA achieves negligible bias and an RMSE of 0.370, while POLS performs comparably well. This is expected, as the outcome model is correctly specified in both frameworks.
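For reference, the performance measures in Tables 4.2 and 4.3 are computed across Monte Carlo replications as in the small sketch below; note that RMSE² ≈ bias² + SD² (up to the degrees-of-freedom correction), which provides a useful internal consistency check on the tables.

    import numpy as np

    def mc_summary(estimates, truth):
        # Bias, SD, and RMSE of an estimator across Monte Carlo replications.
        est = np.asarray(estimates, dtype=float)
        bias = est.mean() - truth
        sd = est.std(ddof=1)
        rmse = np.sqrt(((est - truth) ** 2).mean())
        return bias, sd, rmse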
Table 4.3 presents simulation results under misspecification of the outcome model, where the true conditional mean function $E(Y \mid \mathbf{X})$ includes nonlinear and interaction terms that are omitted in estimation. In this setting, the propensity score model remains correctly specified, but the outcome model deviates from the true data-generating process. This setup provides a useful test for evaluating the robustness of each estimator.

The upper panel of the table reports estimates for the average treatment effects on the treated ($\tau_4$, $\tau_5$, $\tau_6$). Since the propensity score model is correctly specified, both the baseline IPWRA and the two-stage IPWRA recover the treatment effects with negligible bias and low RMSE—consistent with the doubly robust property of IPWRA. Notably, this property holds even when the outcome model is misspecified, confirming that the two-stage extension maintains the validity and consistency of the original IPWRA framework in estimating ATTs.

Table 4.3 Simulation Results under Misspecification of $E(Y \mid \mathbf{X})$

ATT (sample value)        τ4 = 3.220               τ5 = 4.753               τ6 = 5.838
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS (RA)            0.044  0.397   0.400     0.091  0.418   0.428     0.154  0.429   0.456
CS                  -0.003  0.509   0.510    -0.015  0.517   0.517    -0.008  0.543   0.543
IPWRA                0.000  0.405   0.405    -0.011  0.417   0.417    -0.002  0.431   0.431
Two-Stage IPWRA      0.000  0.405   0.405    -0.011  0.417   0.417    -0.002  0.431   0.431

Moderating Effects      γ(x1, 4) = 0.250         γ(x1, 5) = 0.300         γ(x1, 6) = 0.400
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS                 0.155  0.329   0.364     0.362  0.344   0.500     0.564  0.380   0.681
Two-Stage IPWRA      0.002  0.374   0.374     0.016  0.391   0.391     0.031  0.438   0.439

Moderating Effects      γ(x2, 4) = 0.167         γ(x2, 5) = 0.200         γ(x2, 6) = 0.267
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS                 0.114  0.807   0.815     0.208  0.819   0.845     0.245  0.855   0.889
Two-Stage IPWRA      0.028  0.853   0.853     0.040  0.866   0.867    -0.027  0.905   0.905

The distinction between estimators becomes more pronounced when examining moderating effects. The two-stage IPWRA continues to estimate $\gamma(x_1, t)$ and $\gamma(x_2, t)$ with minimal bias and relatively low RMSE. In contrast, the POLS estimator suffers from substantial bias in nearly every case, particularly when the true moderating effect is large. For example, when the true value of $\gamma(x_1, 6)$ is 0.4, POLS overestimates it by more than 0.56, yielding an RMSE of 0.681, whereas the two-stage IPWRA produces a much smaller bias of 0.031 and an RMSE of 0.439. This pattern holds across the other periods and covariates.

As a regression-based estimator, POLS is sensitive to outcome model misspecification—particularly when covariate interactions or nonlinear terms from the true model are omitted. In contrast, the two-stage IPWRA leverages its doubly robust property and remains consistent as long as the propensity score is correctly specified. By incorporating covariate–treatment interactions within a weighted least squares framework, it flexibly recovers heterogeneous effects without requiring correct specification of the outcome model.

In summary, the two-stage IPWRA estimator not only replicates the average treatment effect estimates of the standard IPWRA under outcome model misspecification, but also provides a reliable and robust framework for estimating heterogeneous treatment effects—something the standard IPWRA is not equipped to do. This makes it particularly valuable in empirical settings where treatment effect heterogeneity is of interest but the outcome model's functional form is uncertain or potentially misspecified.

4.5 Empirical Application

In this section, I revisit Kim (2023), who studies the impact of chain pharmacy entry on the number of local independent pharmacies in rural townships. While the primary contribution of that paper lies in modeling the strategic interaction between chain and independent pharmacies using a static game framework, the author first presents reduced-form evidence using recent staggered difference-in-differences (DID) methods.
Although Kim (2023) notes that chain pharmacies frequently enter and exit markets, the analysis ultimately treats entry as a one-time event, citing the absence of a suitable framework for handling dynamic paths of treatment exposure. My estimation strategy directly addresses this limitation by allowing treatment status to vary over time—specifically, to turn on, off, and back on—offering a more realistic depiction of competitive exposure in local markets. Through a two-stage estimation procedure, I estimate separate effects for three distinct observed treatment states: Treated, Exit, and Re-entry.

I also focus on a key source of treatment effect heterogeneity: the share of the elderly population in a township. My framework extends the standard doubly robust estimator—combining inverse probability weighting with regression adjustment (IPWRA)—to incorporate moderating effects. Following Kim (2023), I classify townships into two groups based on their elderly population share in the year 2000: (i) high elderly population towns (those with ≥ 20% of residents aged 65 or older), and (ii) non-high elderly population towns (those with < 20% aged 65 or older). This approach enables the estimation of subgroup-specific dynamic treatment effects, facilitating a more nuanced understanding of how the competitive effects of chain entry vary across demographic contexts. The method not only assesses whether average effects differ across subgroups but also tracks how those effects evolve over time within each group.

4.5.1 Data and Setting

The empirical analysis employs the dataset constructed by Kim (2023), which combines the Data Axle Historical Business Database (1997–2021)—a comprehensive record of business establishments, including pharmacy operations across the United States—with demographic variables from the U.S. Census and the American Community Survey (ACS). The data are matched at the township-year level and focus on rural townships in the Midwestern United States, where access to pharmacies is a key healthcare concern.

Independent pharmacies in rural areas often serve as more than just retail outlets; they function as essential community hubs, providing prescription fulfillment, over-the-counter medications, and informal health advice for minor ailments. Since the 1970s, however, the pharmacy landscape has been reshaped by the rapid expansion of chain pharmacies such as Walgreens, CVS, and Rite Aid, as well as mass merchandisers like Walmart and Target. These developments have threatened the survival of locally owned pharmacies, especially in underserved rural areas where the closure of a single pharmacy can have substantial consequences for healthcare access.

The geographic unit of analysis is the township. Each township is characterized by the number of active independent pharmacies, the presence of chain pharmacies within a 15-mile radius, and market characteristics such as population size, elderly share, and poverty rate. The outcome of interest is the number of independent pharmacies operating in each township-year. The treatment variable is a binary indicator equal to one if at least one chain pharmacy is present within 15 miles of the township in year $t$.
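Concretely, each township-period after initial chain entry can be assigned to one of the three observed states used in the WGTT aggregation. The sketch below is an assumed reading of that classification from a township's binary exposure path; the function name and state labels are illustrative.

    def classify_states(w):
        # w[t] = 1 if a chain pharmacy is present within 15 miles in period t.
        # Periods before initial entry are left unclassified (None).
        states = [None] * len(w)
        ever_treated = False
        ever_exited = False
        for t, wt in enumerate(w):
            if wt == 1:
                states[t] = "After Re-entry" if ever_exited else "Treated"
                ever_treated = True
            elif ever_treated:
                states[t] = "After Exit"
                ever_exited = True
        return states

    # Example: entry in period 2, exit in period 4, re-entry in period 6.
    # classify_states([0, 0, 1, 1, 0, 0, 1])
    # -> [None, None, 'Treated', 'Treated', 'After Exit', 'After Exit',
    #     'After Re-entry']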
While Kim (2023) includes 802 towns in the original dataset, I restrict the sample to 596 townships by excluding those that were already exposed to chain pharmacies in the initial year. This restriction ensures that treatment does not occur at baseline: chain presence (within 15 miles) is zero for all units in the initial year. The resulting panel covers the years 2000 through 2019. For further details on township definitions and sampling criteria, see Section 2.2, Final Sample, in Kim (2023).

Table 4.4 presents summary statistics for the 596 townships included in the analysis, covering a total of 11,920 township-year observations from 2000 to 2019. All summary measures, unless otherwise noted, reflect township characteristics in the initial year of the panel, 2000. Again, townships are classified as "High Elderly" if 20% or more of the population was aged 65 or older in 2000.

Table 4.4 Summary Statistics of the Full Sample in Year 2000

                                               Full Sample   Non-High Elderly   High Elderly
Number of Townships                                    596                181            415
Number of Observations                              11,920              3,820          8,100
Independent Pharmacies (mean)                        0.865              0.674          0.949
Chain Presence (within 15 mi, mean, in 2000)             0                  0              0
Chain Presence (averaged over 2000–2019)              0.26               0.37           0.22
Log(Population) (mean)                                7.16               7.29           7.10
Elderly Share (aged 65 or older, mean)               22.9%              16.1%          26.0%
Poverty Share (mean)                                 12.4%              13.9%          11.6%

Based on their elderly population share in 2000, 415 townships (70%) are classified as high elderly population and 181 (30%) as non-high elderly population. On average, high elderly population townships have smaller populations than their non-high elderly counterparts, which is often associated with lower market demand. Since chain pharmacies are more likely to enter markets with greater demand and population density, chain presence is more common in non-high elderly population townships over the full sample period (2000–2019), with an average presence rate of 37%, compared with 22% in high elderly townships. Nevertheless, in the initial year of the panel (2000), high elderly population townships already exhibited a higher average number of independent pharmacies. This highlights the essential role that independent pharmacies have historically played in these communities—particularly in serving older populations and addressing healthcare needs in underserved areas.

4.5.2 Results

Table 4.5 presents the estimated weighted generalized treatment effects, $WGTT(\alpha, r)$, over relative time $r$ for each treatment state $\alpha \in \{\text{Treated}, \text{After Exit}, \text{After Re-entry}\}$. Each estimate is accompanied by a 95% bootstrap confidence interval. While the proposed framework allows for various treatment effect estimators, the results reported here are obtained using the rolling regression adjustment (RA) estimator, following Lee and Wooldridge (2023, Procedure 4.1), with each subsample categorized according to its treatment state.
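The confidence intervals in Table 4.5 are bootstrapped; the exact resampling scheme is not spelled out here, so the following is a minimal sketch assuming a nonparametric percentile bootstrap over townships, with `estimator` standing in (hypothetically) for the full two-stage WGTT computation on a resampled panel.

    import numpy as np

    def bootstrap_ci(unit_ids, estimator, n_boot=999, level=0.95, seed=42):
        # Percentile bootstrap CI: resample townships with replacement and
        # re-estimate WGTT(alpha, r) on each pseudo-sample.
        rng = np.random.default_rng(seed)
        ids = np.asarray(unit_ids)
        draws = np.array([estimator(rng.choice(ids, size=ids.size, replace=True))
                          for _ in range(n_boot)])
        tail = (1 - level) / 2
        return np.quantile(draws, [tail, 1 - tail])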
Table 4.5 Estimated Weighted Generalized Treatment Effects $WGTT(\alpha, r)$ (95% bootstrap CIs in parentheses)

Relative Time, r    α = Treated                α = After Re-entry         α = After Exit
0                   -0.355 (-0.427, -0.283)    -0.291 (-0.418, -0.165)    -0.253 (-0.364, -0.143)
1                   -0.368 (-0.449, -0.287)    -0.330 (-0.448, -0.212)    -0.257 (-0.374, -0.141)
2                   -0.389 (-0.487, -0.290)    -0.307 (-0.459, -0.155)    -0.212 (-0.368, -0.057)
3                   -0.339 (-0.441, -0.237)    -0.317 (-0.470, -0.164)    -0.247 (-0.420, -0.074)
4                   -0.340 (-0.452, -0.229)    -0.286 (-0.424, -0.148)    -0.166 (-0.354,  0.022)
5                   -0.344 (-0.472, -0.215)    -0.283 (-0.435, -0.132)    -0.249 (-0.454, -0.044)
6                   -0.347 (-0.482, -0.212)    -0.319 (-0.476, -0.163)    -0.236 (-0.521,  0.050)
7                   -0.375 (-0.538, -0.212)    -0.291 (-0.467, -0.114)    -0.116 (-0.556,  0.325)
8                   -0.389 (-0.531, -0.246)    -0.367 (-0.571, -0.162)    -0.161 (-0.652,  0.331)
9                   -0.324 (-0.527, -0.121)    -0.355 (-0.538, -0.171)    -0.275 (-0.710,  0.159)
10                  -0.339 (-0.547, -0.131)    -0.315 (-0.506, -0.124)    -0.313 (-0.875,  0.250)

Figure 4.1 visualizes these results, showing the dynamic treatment effect patterns by treatment state over time using the rolling regression adjustment (RA) estimator of Lee and Wooldridge (2023). The y-axis measures $WGTT(\alpha, r)$, and the x-axis denotes relative time since initial treatment, exit, or re-entry.

Figure 4.1 WGTT Estimates Using the Rolling Regression Adjustment Estimator
[Figure: event-study plot of pre-treatment, first-treatment, after-exit (lingering), and after-re-entry effects, each with 95% confidence intervals, against time to treatment, exit, or re-entry.]

Overall, the effects are consistently negative and statistically significant for the Treated group, indicating a persistent adverse impact on local pharmacy access. In contrast, the effects for the After Exit group diminish over time and become statistically insignificant, suggesting potential recovery. For the After Re-entry group, the effects remain negative but are somewhat attenuated, implying that re-entry still harms access, though less severely than initial exposure.

These patterns likely reflect underlying market dynamics. An initial closure may involve substantial financial losses and signal poor profitability, discouraging future entrants. In small rural markets with thin margins, the departure of a chain pharmacy may not be enough to restore viable conditions for independent pharmacies. Thus, exit does not necessarily reverse the decline in access, highlighting the persistent and asymmetric nature of competitive disruption. The attenuation observed for the After Re-entry group may reflect adaptive responses by independent pharmacies, market saturation, or reduced vulnerability after prior exposure.

Together, these results underscore the importance of modeling treatment as a dynamic, evolving process. By distinguishing between initial exposure, exit, and re-entry phases, the proposed estimator captures nuanced, heterogeneous treatment effects that would be obscured under static or binary frameworks.

4.6 Concluding Remarks

Building on the rolling approach developed by Lee and Wooldridge (2023), this paper extends the framework to accommodate dynamic treatment paths—including not only initial treatment but also treatment exit and re-entry—thereby offering a generalized method for evaluating complex intervention patterns over time.
In addition, I propose a two-stage IPWRA estimator that allows for the identification of moderating effects, enabling researchers to assess how treatment effects vary across subpopulations defined by covariates. By incorporating covariate–treatment interactions into a weighted least squares framework and leveraging the doubly robust structure of IPWRA, the proposed method recovers moderating effects even when the conditional outcome model is misspecified, as long as the propensity score is correctly specified.

Simulation results demonstrate that the two-stage IPWRA estimator performs well in recovering not only average treatment effects on the treated (ATTs) but also moderating effects, under both correct and misspecified outcome models. In particular, for moderating effects, it outperforms the pooled OLS estimator when the outcome model is misspecified. These findings underscore the robustness and flexibility of the proposed approach in evaluating differential policy impacts across heterogeneous populations.

To illustrate its empirical relevance, I apply the proposed method to examine the effects of pharmacy chain entry on independently owned pharmacies. Beyond estimating the average treatment effect of initial chain entry, the analysis also uncovers dynamic treatment patterns following exit and re-entry. Specifically, treatment effects in the After Exit phase remain persistently negative, indicating that the departure of chain pharmacies does not reverse the decline in access to independent pharmacies. These lingering effects suggest long-term disruptions in local markets, particularly in rural areas with limited demand. In contrast, treatment effects in the After Re-entry phase—reflecting repeated exposure to chain competition—are more attenuated, potentially due to adaptive responses or market saturation. These results highlight the importance of modeling treatment as a dynamic process and demonstrate the value of the proposed framework in capturing complex and evolving treatment effects that would be overlooked in static or binary settings.

While the current paper focuses on binary treatment settings, the framework is naturally extensible to discrete or continuous treatment intensities, including dynamic transitions such as treatment exit and re-entry. Future work will build on this foundation to further expand the applicability of the estimator in complex policy environments.

BIBLIOGRAPHY

Borusyak, K., Jaravel, X., and Spiess, J. (2024). Revisiting event study designs: Robust and efficient estimation. Review of Economic Studies, rdae007.

Callaway, B. and Sant'Anna, P. H. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200–230.

De Chaisemartin, C. and d'Haultfoeuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9):2964–2996.

De Chaisemartin, C. and d'Haultfoeuille, X. (2024). Difference-in-differences estimators of intertemporal treatment effects. Review of Economics and Statistics, 1–45.

Kim, H. (2023). Rural pharmacy access and competition: Static games with machine learning. Available at SSRN 4377695.

Lee, S. J. and Wooldridge, J. M. (2023). A simple transformation approach to difference-in-differences estimation for panel data. Available at SSRN 4516518.

Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2):175–199.

Wooldridge, J. M. (2021).
Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators. Available at SSRN 3906345.