A ROLLING METHOD FOR CAUSAL INFERENCE: A FLEXIBLE ESTIMATION METHOD FOR DIVERSE PANEL DATA SETUPS

By Soo Jeong Lee

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics—Doctor of Philosophy

2025

ABSTRACT

This dissertation develops new methods for estimating treatment effects using panel data, with a focus on flexible and robust identification under treatment effect heterogeneity across units and over time. Chapter 1 provides an overview of the subsequent chapters. Chapter 2 introduces a simple time-series transformation – termed the Rolling Method – for Difference-in-Differences estimation. This method converts panel data into a sequence of cross-sectional datasets through unit-specific outcome transformations, enabling consistent estimation of group-time-specific treatment effects even in the presence of treatment heterogeneity. Chapter 3 extends the Rolling Method to small-sample settings, particularly when the number of treated or control units is limited, and demonstrates improved finite-sample inference properties. Chapter 4 further generalizes the framework to accommodate dynamic treatment paths, allowing for treatment reversals and subgroup-specific moderating effects. This extension is applied in an empirical case study examining the effects of the entry and exit of chain pharmacies in rural areas.

Copyright by SOO JEONG LEE 2025

ACKNOWLEDGEMENTS

Above all, I extend my deepest gratitude to my advisor and mentor, Dr. Jeffrey M. Wooldridge. The excitement in his eyes during our research discussions has been a constant source of inspiration. I aspire to follow his example – to be a scholar who speaks with passion, wisdom, and boundless curiosity. Under his guidance, I have learned invaluable lessons that I will carry with me as I mentor future students, sharing the joy of economics and the fulfillment that comes from contributing to the world through research. I am sincerely grateful to my committee members – Dr. Antonio Galvao, Dr. Kyoo il Kim, and Dr. Haolei Weng – for their invaluable feedback and encouragement, which have been instrumental to my growth throughout this journey. I would also like to extend special thanks to Dr. Timothy Vogelsang and Dr. Hugo Freeman. Being part of the MSU econometrics group has been an extraordinary privilege, and I could not have been happier to be a part of it. I am deeply grateful to the faculty and friends in the MSU Economics Department. Your presence made the joyful moments even brighter and the challenging times easier to navigate. I also extend my heartfelt thanks to my EPIC Grad Lab friends and my Project Investigator, Dr. Tara Kilbride. A special thank you to Dr. Todd Elder, whose support as DGS made it possible for me to join the program and become part of such an inspiring community. Finally, I am thankful to my family. To my parents, Sunduk Jeong and Jinrak Lee – the most loving couple I know – thank you for your unwavering support and for always reminding me, “We believe in you. You always do well.” To my older sister, Eunjeong, who gifted me my very first flight ticket to the U.S. and took on all family responsibilities so I could devote myself fully to my studies – thank you. And to my younger sister, Minjeong – I could not love you more. To all who have stood by me through moments of struggle, happiness, and solitude – thank you! My journey as an economist will never cease, thanks to your unwavering support.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
CHAPTER 2 A SIMPLE TRANSFORMATION APPROACH TO DIFFERENCE-IN-DIFFERENCES ESTIMATION FOR PANEL DATA
 BIBLIOGRAPHY
 APPENDIX 2A PROOF OF THEOREM 2.2
 APPENDIX 2B PROOF OF THE DETRENDING PROCEDURE
 APPENDIX 2C PRE-PERIOD DYNAMICS AND EVENT STUDY PLOT
 APPENDIX 2D ESTIMATION RESULTS
CHAPTER 3 SIMPLE APPROACHES TO INFERENCE WITH DIFFERENCE-IN-DIFFERENCES ESTIMATORS WITH SMALL CROSS-SECTIONAL SAMPLE SIZES
 BIBLIOGRAPHY
CHAPTER 4 ROLLING APPROACH TO DIFFERENCE-IN-DIFFERENCES: EXPLORING TREATMENT REVERSIBILITY AND MODERATING EFFECTS
 BIBLIOGRAPHY

CHAPTER 1
INTRODUCTION

This dissertation advances estimation and inference techniques for panel data by developing novel methods for treatment effect estimation. By introducing a time-series outcome transformation, addressing heterogeneous treatment effects, and enhancing inference in small-sample settings, it offers more robust and flexible tools for policy evaluation and empirical research.

Chapter 2 introduces a time-series transformation technique – termed the Rolling Method – that can be combined with various treatment effect estimators, including regression adjustment, matching methods, and doubly robust estimators. Unlike conventional approaches that assume constant treatment effects across units and over time, our method accounts for treatment effect heterogeneity. In the common timing case, we show that applying the transformation with linear regression adjustment numerically reproduces the pooled OLS estimator in Wooldridge (2021), which is the Best Linear Unbiased Estimator (BLUE) under classical assumptions. The transformation is at the unit level, and simply requires computing the average outcome prior to an intervention, subtracting it from a post-treatment outcome, and then carefully selecting the control units for each time period. We show that, allowing for staggered entry under no anticipation and parallel trends assumptions, the cohort treatment indicators satisfy the key unconfoundedness assumption with respect to the transformed potential outcome. Given identification, any number of treatment effect estimators can be applied for each treated cohort and calendar time pair where the average treatment effects on the treated are identified. The doubly robust method of combining inverse probability weighting with linear regression (IPWRA) works particularly well in terms of bias and efficiency. Importantly, our transformation is easily modified to account for unit-specific trends. We illustrate the method using empirical data on Walmart’s entry into local labor markets. The application demonstrates how the transformation approach yields more reliable estimates of employment effects, particularly in cases where traditional methods may suffer from bias and inefficiency due to pre-existing trends.

Chapter 3 proposes simple methods for valid inference in panel data difference-in-differences (DiD) settings with a small number of treated units or a small number of control units (including a small number of both).
The approach uses a suitable transformation to collapse the panel data set into a cross-sectional data set. If the classical linear model assumptions hold in the cross-sectional regression, exact inference is available – even in cases with only two control units and one treated unit. The approach likely works best with a large number of time periods before and after the intervention so that the central limit theorem across time makes normality a better approximation. Nevertheless, under joint normality and homoskedasticity of the time-varying errors, the exact approach can be applied with few time periods – both before and after the intervention – as well as few cross-sectional units. In addition to standard DiD estimation, the approach permits removal of unit-specific trends. With large enough sample sizes, control variables may be included. If the cross-sectional sample size is not too small, one can use a particular heteroskedasticity-robust standard error. We illustrate the approach using increased smoking restrictions in California, where there is only a single treated unit, as an alternative to the synthetic control approach. In the staggered intervention case, we reexamine the expansion of so-called “castle” laws in the United States.

Chapter 4 extends the Rolling Method framework to accommodate complex treatment patterns and moderating effects. I propose a novel aggregation strategy suitable for measuring treatment effects with staggered entry and exit of treatment. Furthermore, I develop a two-stage IPWRA estimator that incorporates moderating effects, allowing for heterogeneous treatment effects across subgroups defined by observed characteristics. This extension enables a more nuanced understanding of how treatment effects evolve over time and vary across populations. To demonstrate its practical utility, I apply the method to evaluate the impact of chain pharmacy entry and exit on pharmacy access in rural areas. The results reveal substantial heterogeneity in treatment effects across three distinct trajectories – initial treatment, exit, and re-entry – that conventional approaches may overlook. This framework offers applied researchers a robust and flexible tool for analyzing dynamic and heterogeneous treatment paths.

CHAPTER 2
A SIMPLE TRANSFORMATION APPROACH TO DIFFERENCE-IN-DIFFERENCES ESTIMATION FOR PANEL DATA
(CO-AUTHORED WITH JEFFREY M. WOOLDRIDGE)

The co-author has approved the inclusion of this co-authored chapter. The co-author’s contact: Jeffrey M. Wooldridge, Department of Economics, 486 W. Circle Drive, 110 Marshall-Adams Hall, Michigan State University, East Lansing, MI 48824-1038. Email: wooldri1@msu.edu

2.1 Introduction

The two-way fixed effects (TWFE) estimator, applied to a linear panel model with a constant treatment effect, has been widely used in difference-in-differences settings. The TWFE estimator of a single effect is simple to understand and is taught in courses that cover panel data methods. Recently, several authors have pointed out shortcomings of the constant effect model under staggered interventions, including Borusyak and Jaravel (2018), Goodman-Bacon (2021), and De Chaisemartin and d’Haultfoeuille (2020), who propose alternative representations of the simple TWFE estimator when the treatment effects (TEs) are heterogeneous across treatment cohort or calendar time. Other authors have proposed more flexible estimation methods that uncover average treatment effects on the treated in the staggered intervention case. These include Callaway and Sant’Anna (2021) [CS (2021)], who propose long-differencing strategies and apply standard treatment effect estimators. Sun and Abraham (2021) [SA (2021)] propose a fixed effects estimator applied to a more flexible model.
Both SA (2021) and CS (2021) are event-study-type estimators that use only the single period prior to the first intervention time as the control period. Wooldridge (2021) shows that a pooled OLS (POLS) strategy that includes cohort and calendar time interactions, as well as interactions of cohort dummies, time period dummies, and the treatment indicators with covariates, identifies the ATTs under standard no anticipation and parallel trends assumptions. These estimators effectively use all pre-treatment periods and all not-yet-treated units in the control group. Wooldridge (2021) also shows the POLS is equivalent to a TWFE estimator on an expanded equation that includes interactions of cohort and time dummies with each other and with covariates. Borusyak et al. (2024) propose imputation estimators based on pooled OLS regressions that can include unit and time fixed effects. Wooldridge (2021) shows that, with time-constant covariates, the imputation estimates are identical to the POLS (and therefore TWFE) estimation of the flexible model using the entire sample.

An attractive feature of the CS (2021) approach, which builds on Abadie (2005) for the two-period case, is that it permits the application of treatment effects estimators beyond regression adjustment. However, as mentioned above, the CS (2021) method uses only the period just prior to the intervention in defining the control group, thereby discarding potentially useful information in earlier time periods. In fact, Wooldridge (2021) shows that, under the standard “error components” structure on the error, with a homoskedastic time-constant component and homoskedastic and serially uncorrelated idiosyncratic errors, the POLS estimator is both best linear unbiased (BLUE) and asymptotically efficient. These theoretical results imply that the CS (2021) estimators are inefficient under a standard set of assumptions. The simulations in Wooldridge (2021) bear this out, showing the CS approach can be very inefficient. Under strong forms of positive serial correlation, CS (2021) can be more efficient because it is similar to a first-differencing estimator. Balanced against the loss in precision is that the CS approach can be less biased when parallel trends are violated, although there is no guarantee. See De Chaisemartin and d’Haultfoeuille (2023) and Wooldridge (2021) for further discussion.

In this paper, we propose an alternative “rolling” approach that allows for the application of many different treatment effects estimators while maintaining much of the efficiency of regression-based methods. The idea is to use as many control observations as possible – in both the common timing and staggered cases – while permitting methods such as inverse probability weighting (IPW), doubly robust methods such as the one in Wooldridge (2007) that combines regression and IPW (IPWRA), and matching on covariates or the propensity score. Like CS (2021) in the panel data case, our approach is based on time series transformations at the unit level. Rather than using long differences, we show how to use all suitable control observations in transforming the outcome variable.
This leads to significant improvements in efficiency compared with CS (2021) and allows one substantial flexibility in the choice of treatment effects estimators.

In the case of common timing – so that there is one average treatment effect per post-treatment period – we show that applying regression adjustment to our transformed outcome variable is equivalent to the regression adjustment estimator based on levels. As mentioned above, Wooldridge (2021) shows that this estimator is both BLUE and asymptotically efficient under standard assumptions. This provides strong motivation for applying estimators other than regression adjustment to the transformed variables in order to check robustness of findings. In the case of staggered entry, our approach identifies the average treatment effects on the treated (ATTs) by cohort and calendar time under the same no anticipation and parallel trends assumptions as in Wooldridge (2021). We show this in both the common timing and staggered intervention cases. Once identification is established, various estimation methods can be applied.

The remainder of the paper is organized as follows. Section 2.2 begins with the common timing case, defining the potential outcomes and parameters of interest, and establishing identification assumptions. In Section 2.3 we propose a general approach to estimation using a transformed outcome variable. In Section 2.4 we extend the framework and identification argument to the staggered case. In Section 2.5, we show how we can account for heterogeneous trends, focusing on linear trends, to allow violation of the parallel trends assumption (even after we condition on covariates). Section 2.6 discusses how one might accommodate suspected failures of no anticipation, and how one modifies the procedure for unbalanced panels. In Section 2.7, we revisit the effect of Walmart’s opening on the local labor markets. Section 2.8 contains some concluding remarks. Supplementary Appendix 2B.1 presents Monte Carlo simulations demonstrating that the rolling methods perform well in terms of both bias and precision.

2.2 Setup and Identification with Common Timing

In this section, we assume that the date of the intervention is the same for all treated units and that the intervention then remains in place through the final period. The time periods in the population are 𝑡 = 1, 2, . . . , 𝑇 and the date of the intervention is 𝑆, where 1 < 𝑆 ≤ 𝑇; in other words, there is at least one pre-treatment period. The arguments in this section are based on an underlying population, and so we use {𝑌𝑡(0), 𝑌𝑡(1) : 𝑡 = 1, . . . , 𝑇} to denote the time series of outcomes in the control and treated states. The binary time-constant treatment indicator is 𝐷, where 𝐷 = 1 means treatment starting in period 𝑆 and lasting through period 𝑇. A time-varying treatment indicator is 𝑊𝑡 = 𝐷 · 𝑝𝑡, where 𝑝𝑡 is a post-treatment indicator equal to one if 𝑡 ≥ 𝑆 and zero otherwise. Without treated units prior to time 𝑆 we can, at most, hope to identify average treatment effects in periods 𝑆, 𝑆 + 1, . . . , 𝑇. Our focus here, like almost all of the other recent literature, is on the average treatment effect on the treated (ATT or ATET) in each treated period:

𝜏𝑟 = 𝐸[𝑌𝑟(1) − 𝑌𝑟(0) | 𝐷 = 1], 𝑟 = 𝑆, . . . , 𝑇 (2.1)

The methods we propose can, under stronger assumptions than we impose, recover the overall average treatment effects (ATEs), 𝐸[𝑌𝑟(1) − 𝑌𝑟(0)], and we will mention how that can be done.
The fundamental problem of identification of 𝜏𝑟 is that we only observe the treatment status, 𝐷, and the outcome

𝑌𝑟 = (1 − 𝐷) · 𝑌𝑟(0) + 𝐷 · 𝑌𝑟(1) (2.2)

Importantly, when 𝐷 = 1, 𝑌𝑟 = 𝑌𝑟(1), which means

𝐸[𝑌𝑟(1) | 𝐷 = 1] = 𝐸(𝑌𝑟 | 𝐷 = 1), 𝑟 = 𝑆, . . . , 𝑇 (2.3)

The expectation 𝐸(𝑌𝑟 | 𝐷 = 1) can be estimated in a consistent, even unbiased, way under various sampling schemes. Under random sampling, the average of 𝑌𝑟 across the treated subsample is unbiased and consistent. Therefore, writing 𝜏𝑟 = 𝐸[𝑌𝑟(1) | 𝐷 = 1] − 𝐸[𝑌𝑟(0) | 𝐷 = 1] = 𝐸(𝑌𝑟 | 𝐷 = 1) − 𝐸[𝑌𝑟(0) | 𝐷 = 1], it is easily seen that the challenge is in identifying 𝐸[𝑌𝑟(0) | 𝐷 = 1]. If the treatment is randomly assigned with respect to 𝑌𝑟(0), then 𝐸[𝑌𝑟(0) | 𝐷 = 1] = 𝐸[𝑌𝑟(0) | 𝐷 = 0]. Because 𝑌𝑟 = 𝑌𝑟(0) when 𝐷 = 0, 𝐸[𝑌𝑟(0) | 𝐷 = 0] = 𝐸(𝑌𝑟 | 𝐷 = 0) is consistently estimated using the control units under various sampling schemes. Under random sampling, one would use the sample average of 𝑌𝑟 across the control units. The resulting estimator of 𝜏𝑟 would be the simple difference in sample means between the treated and control units in period 𝑟.

The assumption of random assignment is too strong for most applications. To see how to relax it, use simple algebra to write

𝑌𝑟(1) − 𝑌𝑟(0) = [𝑌𝑟(1) − (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝑌𝑞(1)] − [𝑌𝑟(0) − (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝑌𝑞(0)] + (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} [𝑌𝑞(1) − 𝑌𝑞(0)]
≡ Ẏ𝑟(1) − Ẏ𝑟(0) + (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} [𝑌𝑞(1) − 𝑌𝑞(0)] (2.4)

where

Ẏ𝑟(1) ≡ 𝑌𝑟(1) − (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝑌𝑞(1) (2.5)

and similarly for Ẏ𝑟(0). Note that for each 𝑟 ∈ {𝑆, 𝑆 + 1, . . . , 𝑇}, Ẏ𝑟(1) is the time 𝑟 potential outcome with the average of the pre-treatment period outcomes removed. The third term is the average of the pre-treatment period “treatment” effects, 𝑌𝑞(1) − 𝑌𝑞(0). Given the representation in (2.4), we can write

𝜏𝑟 = 𝐸[Ẏ𝑟(1) | 𝐷 = 1] − 𝐸[Ẏ𝑟(0) | 𝐷 = 1] + (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝐸[𝑌𝑞(1) − 𝑌𝑞(0) | 𝐷 = 1] (2.6)

The first assumption, a weak version of “no anticipation”, eliminates the third term in (2.6).

Assumption NAC (No Anticipation, Common Timing): For the eventually treated indicator 𝐷,

𝐸[𝑌𝑡(1) − 𝑌𝑡(0) | 𝐷 = 1] = 0, 𝑡 = 1, . . . , 𝑆 − 1. □ (2.7)

The name of this assumption derives from the fact that 𝐸[𝑌𝑡(1) − 𝑌𝑡(0) | 𝐷 = 1] for 𝑡 < 𝑆 are average treatment effects on the treated prior to the intervention, and the assumption is that these are all zero. Assumption NAC is implied by an assumption commonly used in the literature, namely, 𝑌𝑡(1) = 𝑌𝑡(0), 𝑡 = 1, . . . , 𝑆 − 1. This assumption is implicit in Heckman et al. (1997) and made explicit in Abadie (2005) and elsewhere. Because the variable indexing 𝑌𝑡(·) is a treatment status not yet assigned, the assumption rules out anticipatory changes in the potential outcomes, on average. If one is concerned that the announcement of a policy prior to its implementation may result in some units strategically manipulating their pre-intervention outcomes, one might drop a period or two just prior to the intervention – at a minimum, as a robustness check. Naturally, this can result in less precise estimators. Given Assumption NAC, we can express 𝜏𝑟 as

𝜏𝑟 = 𝐸[Ẏ𝑟(1) | 𝐷 = 1] − 𝐸[Ẏ𝑟(0) | 𝐷 = 1]. (2.8)

Estimating the first term in (2.8) is easy because we observe Ẏ𝑟(1) when 𝐷 = 1.
More precisely, define the same transformation of the observed variable 𝑌𝑟:

Ẏ𝑟 ≡ 𝑌𝑟 − (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝑌𝑞 ≡ 𝑌𝑟 − Ȳ_pre (2.9)

When 𝐷 = 1, Ẏ𝑟 = Ẏ𝑟(1) and so 𝐸[Ẏ𝑟(1) | 𝐷 = 1] = 𝐸(Ẏ𝑟 | 𝐷 = 1), and the latter is trivially identified (as usual, under a suitable sampling scheme). Notice in the simple 𝑇 = 2 case with 𝑆 = 2, Ẏ₂ = 𝑌₂ − 𝑌₁, the difference from period one to two. The difficult term in identifying 𝜏𝑟 is 𝐸[Ẏ𝑟(0) | 𝐷 = 1]. The unconditional parallel trends assumption implies that 𝐸[Ẏ𝑟(0) | 𝐷 = 1] = 𝐸[Ẏ𝑟(0) | 𝐷 = 0]. Here we allow a weaker version of parallel trends by assuming it holds conditional on observed (pre-treatment) covariates.

Assumption CPTC (Conditional Parallel Trends, Common Timing): For observed covariates X,

𝐸[𝑌𝑡(0) − 𝑌₁(0) | 𝐷, X] = 𝐸[𝑌𝑡(0) − 𝑌₁(0) | X], 𝑡 = 2, . . . , 𝑇. □ (2.10)

Simple algebra shows that (2.10) is the same as assuming 𝐸[𝑌𝑡(0) − 𝑌𝑠(0) | 𝐷, X] = 𝐸[𝑌𝑡(0) − 𝑌𝑠(0) | X] for all 𝑡 ≠ 𝑠. Wooldridge (2021) used very similar assumptions, along with linearity of conditional means, to derive identification of the 𝜏𝑟. Here we are interested in applying methods other than regression adjustment to the transformed outcomes in (2.9). Assumption CPTC allows us to identify 𝐸[Ẏ𝑟(0) | 𝐷 = 1]. To see how, first note that, by iterated expectations,

𝐸[Ẏ𝑟(0) | 𝐷 = 1] = 𝐸{𝐸[Ẏ𝑟(0) | 𝐷 = 1, X] | 𝐷 = 1} (2.11)

Next, write

Ẏ𝑟(0) = (𝑆 − 1)⁻¹ ∑_{𝑞=1}^{𝑆−1} [𝑌𝑟(0) − 𝑌𝑞(0)]

Then, by CPTC,

𝐸[Ẏ𝑟(0) | 𝐷 = 1, X] = (𝑆 − 1)⁻¹ ∑_{𝑞=1}^{𝑆−1} 𝐸[𝑌𝑟(0) − 𝑌𝑞(0) | 𝐷 = 1, X]
= (𝑆 − 1)⁻¹ ∑_{𝑞=1}^{𝑆−1} 𝐸[𝑌𝑟(0) − 𝑌𝑞(0) | 𝐷 = 0, X]
= 𝐸[𝑌𝑟(0) − (𝑆 − 1)⁻¹ ∑_{𝑞=1}^{𝑆−1} 𝑌𝑞(0) | 𝐷 = 0, X]
= 𝐸[Ẏ𝑟(0) | 𝐷 = 0, X] (2.12)

The conclusion in equation (2.12) is simple but important. It says that, in terms of the potential outcome Ẏ𝑟(0), treatment 𝐷 is unconfounded conditional on X. Assumption NAC ensures that the ATTs can be expressed as in (2.8). This means that, for a post-intervention period 𝑟, we have turned the difference-in-differences problem into a standard problem of estimating an ATT in a cross-sectional population. Using the fact that 𝑌𝑞 = 𝑌𝑞(0) when 𝐷 = 0, (2.12) implies that

𝐸[Ẏ𝑟(0) | 𝐷 = 1, X] = 𝐸(Ẏ𝑟 | 𝐷 = 0, X)

Now the argument is the same as in the typical cross section: by iterated expectations,

𝐸[Ẏ𝑟(0) | 𝐷 = 1] = 𝐸[𝐸(Ẏ𝑟 | 𝐷 = 0, X) | 𝐷 = 1] ≡ 𝐸[ṁ₀𝑟(X) | 𝐷 = 1], (2.13)

where ṁ₀𝑟(x) ≡ 𝐸(Ẏ𝑟 | 𝐷 = 0, X = x) is the conditional mean of the observed variable Ẏ𝑟 for the control group. This function is nonparametrically identified on Supp(X | 𝐷 = 0), the support of the covariates for the control group. To ensure we can compute 𝐸[ṁ₀𝑟(X) | 𝐷 = 1] without extrapolation to covariate values outside Supp(X | 𝐷 = 0), we impose a standard overlap assumption.

Assumption OVLC (Overlap, Common Timing): Define the propensity score

𝑝(x) = 𝑃(𝐷 = 1 | X = x), x ∈ Supp(X). (2.14)

Then

𝑝(x) < 1, x ∈ Supp(X). □ (2.15)

The previous derivations and discussion prove the following.

Theorem 2.1 Under Assumption NAC, 𝜏𝑟 can be expressed as in (2.8) for 𝑟 = 𝑆, . . . , 𝑇. Under Assumption CPTC, 𝐷 is unconfounded (in the conditional mean sense) with respect to Ẏ𝑟(0) conditional on X. When we add Assumption OVLC, the parameters 𝜏𝑟, 𝑟 = 𝑆, . . . , 𝑇, are identified. □
2.3 Estimation in the Common Timing Case

Given the identification result stated in Theorem 2.1, the estimation of the 𝜏𝑟 is straightforward. We can apply any estimation method once the outcome variable has been transformed as in equation (2.9). Essentially, this is the conclusion reached in Sant’Anna and Zhao (2020) in the 𝑇 = 2 case. Earlier, Abadie (2005) proposed inverse probability weighting when 𝑇 = 2. For simplicity, assume in this section that we observe a random sample of size 𝑁 from the cross section. The observed outcome can be expressed as

𝑌𝑖𝑡 = (1 − 𝐷𝑖) · 𝑌𝑖𝑡(0) + 𝐷𝑖 · 𝑌𝑖𝑡(1) (2.16)

where we use an 𝑖 subscript to denote unit 𝑖. Under the strong form of no anticipation, 𝑌𝑖𝑡(1) = 𝑌𝑖𝑡(0) for 𝑡 < 𝑆 and all 𝑖. Given the derivations in the previous section, we only need Assumption NAC. For each unit 𝑖 = 1, 2, . . . , 𝑁, we observe the time series {(𝑌𝑖𝑡, 𝐷𝑖, X𝑖) : 𝑡 = 1, . . . , 𝑇}. To exploit the unconfoundedness and identification in Theorem 2.1, we simply need to obtain the transformed data. For each unit 𝑖, define

Ẏ𝑖𝑟 = 𝑌𝑖𝑟 − (1/(𝑆−1)) ∑_{𝑞=1}^{𝑆−1} 𝑌𝑖𝑞 ≡ 𝑌𝑖𝑟 − Ȳ𝑖,pre (2.17)

Then, for any 𝑟 ∈ {𝑆, 𝑆 + 1, . . . , 𝑇}, we can apply any standard treatment effect (TE) estimator to the data {(Ẏ𝑖𝑟, 𝐷𝑖, X𝑖) : 𝑖 = 1, 2, . . . , 𝑁}. A common TE estimator is called “regression adjustment,” which means estimating separate regression functions for the control and treated units. Because Ẏ𝑖𝑟 can take on negative and positive values, linear regression adjustment (RA) makes the most sense. Linear RA is based on the conditional mean, stated in terms of population random variables,

𝐸(Ẏ𝑟 | 𝐷 = 0, X) = α̇𝑟 + Xβ𝑟 (2.18)

The parameters α̇𝑟 and β𝑟 are estimated from the cross-sectional regression

Ẏ𝑖𝑟 on 1, X𝑖 if 𝐷𝑖 = 0 (2.19)

Then, 𝜏𝑟 can be estimated using imputation:

τ̂𝑟 = Ȳ̇𝑟1 − 𝑁₁⁻¹ ∑_{𝑖=1}^{𝑁} 𝐷𝑖(α̂̇𝑟 + X𝑖β̂𝑟) = Ȳ̇𝑟1 − (α̂̇𝑟 + X̄₁β̂𝑟) (2.20)

where Ȳ̇𝑟1 = 𝑁₁⁻¹ ∑_{𝑖=1}^{𝑁} 𝐷𝑖Ẏ𝑖𝑟 and X̄₁ = 𝑁₁⁻¹ ∑_{𝑖=1}^{𝑁} 𝐷𝑖X𝑖 are the averages over the treated units. From a practical perspective, the important thing to remember is that τ̂𝑟 can be obtained from standard software that does basic regression adjustment once the Ẏ𝑖𝑟 have been obtained.

As discussed in Wooldridge (2021), the imputation estimate, τ̂𝑟, also can be obtained as the coefficient on 𝐷𝑖 in the pooled OLS regression

Ẏ𝑖𝑟 on 1, 𝐷𝑖, X𝑖, 𝐷𝑖 · (X𝑖 − X̄₁), 𝑖 = 1, 2, . . . , 𝑁, (2.21)

which uses all observations in time period 𝑟. This formulation is convenient because it leads to simple inference for τ̂𝑟, allowing easy computation of standard errors robust to any kind of heteroskedasticity. Also, it is often easy to account for the sampling variation in X̄₁ as an estimator of μ₁ ≡ 𝐸(X | 𝐷 = 1). Issues of clustering standard errors are relatively easy to deal with given we have a standard cross-sectional regression. Because of the representation of Ẏ𝑖𝑟 in (2.17), there is a simple characterization of τ̂𝑟. All of the coefficients in (2.21) are obtained by differencing the coefficients from two separate regressions. In the first, 𝑌𝑖𝑟 is regressed on all variables in (2.21). Then Ȳ𝑖,pre is regressed on the same set of variables, and these coefficients are subtracted from the first. In particular, τ̂𝑟 is obtained as the difference between two standard ATT estimators using regression adjustment. The first uses observations in period 𝑟 only, and the second uses the average of 𝑌𝑖𝑞 over the pre-treatment periods. Without covariates, the estimator would be

τ̂𝑟 = (Ȳ₁𝑟 − Ȳ₀𝑟) − (Ȳ₁,pre − Ȳ₀,pre), (2.22)

where the first subscript indicates treatment or control units. This has a clear interpretation as a DiD estimator.
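To make the mechanics concrete, the following is a minimal sketch of the common-timing rolling estimator, assuming a balanced long-format panel; the column names (unit, period, y, d), the covariate handling, and the helper itself are our illustrative choices, not part of the paper or of any package. It computes Ẏ𝑖𝑟 as in (2.17) and runs the cross-sectional regression (2.21) for a chosen post-treatment period 𝑟.

```python
# A minimal sketch, assuming a balanced long-format panel with columns
# unit, period, y, d (d = 1 for eventually treated units). All names
# here are illustrative assumptions.
import pandas as pd
import statsmodels.api as sm

def rolling_att(df, r, S, covars=()):
    """Estimate tau_r: demean y by its pre-treatment (t < S) unit average,
    eq. (2.17), then run the cross-sectional regression (2.21) in period r."""
    pre_mean = df[df["period"] < S].groupby("unit")["y"].mean()   # \bar{Y}_{i,pre}
    cross = df[df["period"] == r].set_index("unit").copy()
    cross["ydot"] = cross["y"] - pre_mean                         # transformed outcome
    X = pd.DataFrame({"d": cross["d"]})
    for c in covars:                                              # D_i * (X_i - Xbar_1)
        xbar1 = cross.loc[cross["d"] == 1, c].mean()
        X[c] = cross[c]
        X["d_x_" + c] = cross["d"] * (cross[c] - xbar1)
    res = sm.OLS(cross["ydot"], sm.add_constant(X)).fit(cov_type="HC3")
    return res.params["d"], res.bse["d"]
```

With covars=(), the coefficient on d is exactly the no-covariate estimator in (2.22), since regressing Ẏ𝑖𝑟 on a constant and 𝐷𝑖 returns the treated-control difference in means of the transformed outcome.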
Recognizing that an estimator can be obtained from (2.21) has additional benefits. For example, if 𝐷𝑖 is independent of Ẏ𝑖𝑟(0) – so that the unconditional PT assumption holds – then the covariates X𝑖 need not be included in (2.21) in order to consistently estimate 𝜏𝑟 as the coefficient on 𝐷𝑖. Remember, this allows 𝐷𝑖 to be correlated with the level, say 𝑌₁(0), the potential outcome in the first time period. If, in addition, 𝐷𝑖 is independent of X𝑖, the regression in (2.21) still can be used to improve efficiency over the simple estimator without the covariates. As discussed in Negi and Wooldridge (2021), such improvements are possible if X𝑖 helps predict Ẏ𝑖𝑟. In many cases, X𝑖 may not have much predictive power for Ẏ𝑖𝑟 even though it might predict the level, 𝑌𝑖𝑟, well. A special case is 𝑇 = 2, in which case (2.21) is simply

Δ𝑌𝑖 on 1, 𝐷𝑖, X𝑖, 𝐷𝑖 · (X𝑖 − X̄₁), 𝑖 = 1, 2, . . . , 𝑁

where Δ𝑌𝑖 = 𝑌𝑖2 − 𝑌𝑖1. In the 𝑇 = 2 case, whether including X𝑖 substantively helps precision when 𝐷𝑖 is independent of X𝑖 hinges on how well X𝑖 predicts the difference, Δ𝑌𝑖.

It turns out there is another useful algebraic equivalence. Suppose we act as if the following conditional expectation holds for all population units and time periods:

𝐸(𝑌𝑡 | 𝐷, X) = 𝛼 + X𝛽 + 𝛾𝐷 + (𝐷 · X)𝛿 + ∑_{𝑟=2}^{𝑇} 𝜃𝑟 𝑓𝑟𝑡 + ∑_{𝑟=2}^{𝑇} (𝑓𝑟𝑡 · X)𝜋𝑟 + ∑_{𝑟=𝑆}^{𝑇} 𝜏𝑟(𝐷 · 𝑓𝑟𝑡) + ∑_{𝑟=𝑆}^{𝑇} (𝐷 · 𝑓𝑟𝑡)(X − 𝜇₁)𝜂𝑟, 𝑡 = 1, . . . , 𝑇, (2.23)

where 𝑓𝑟𝑡 is a time period dummy equal to one if 𝑟 = 𝑡 and zero otherwise. The interaction 𝐷 · 𝑓𝑟𝑡 is the treatment indicator for time period 𝑟. Equation (2.23) suggests a pooled OLS regression across all 𝑖 and 𝑡:

𝑌𝑖𝑡 on 1, X𝑖, 𝐷𝑖, 𝐷𝑖 · X𝑖, 𝑓2𝑡, . . . , 𝑓𝑇𝑡, 𝑓2𝑡 · X𝑖, . . . , 𝑓𝑇𝑡 · X𝑖, 𝐷𝑖 · 𝑓𝑆𝑡, . . . , 𝐷𝑖 · 𝑓𝑇𝑡, 𝐷𝑖 · 𝑓𝑆𝑡 · (X𝑖 − X̄₁), . . . , 𝐷𝑖 · 𝑓𝑇𝑡 · (X𝑖 − X̄₁) (2.24)

The estimated treatment effects, say τ̃𝑟, are the coefficients on the treatment dummies 𝐷𝑖 · 𝑓𝑆𝑡, . . . , 𝐷𝑖 · 𝑓𝑇𝑡. Wooldridge (2021) shows that the τ̃𝑟 are numerically identical to a two-stage imputation approach based on the levels, 𝑌𝑖𝑡. It turns out that the τ̃𝑟 are also equivalent to the τ̂𝑟 obtained by using the transformed outcome variable, Ẏ𝑖𝑟, one period at a time.

Theorem 2.2 Let τ̂𝑟, 𝑟 = 𝑆, 𝑆 + 1, . . . , 𝑇 be the coefficients on 𝐷𝑖 in the separate regressions (2.21) – equivalently, from equation (2.20) – and let τ̃𝑟 be the coefficients on 𝐷𝑖 · 𝑓𝑆𝑡, . . . , 𝐷𝑖 · 𝑓𝑇𝑡 from (2.24). Then τ̃𝑟 = τ̂𝑟, 𝑟 = 𝑆, . . . , 𝑇. Moreover, the coefficient vector on 𝐷𝑖 · 𝑓𝑟𝑡 · (X𝑖 − X̄₁) in (2.24) is identical to that on 𝐷𝑖 · (X𝑖 − X̄₁) in (2.21). □

The proof of Theorem 2.2 can be found in Appendix 2A. The equivalence is valuable for a couple of reasons. First, it shows that two different ways to approach identification under the same set of assumptions – that in Wooldridge (2021) and the approach we use here – lead to the same estimation methods. Second, Wooldridge (2021, Theorem 6.2) shows that, under standard assumptions on the implied error term (which includes a unit-specific unobserved effect and a time-varying component), the estimators from (2.24) are both best linear unbiased and asymptotically efficient (with 𝑇 fixed, 𝑁 → ∞) under random sampling across 𝑖. This establishes that the transformation used in (2.21) does not discard useful information.

Given the equivalence of our transformation approach and the OLS estimator pooled across 𝑖 and 𝑡, what use is the former? Importantly, it allows us to use other treatment effects estimators beyond regression adjustment. For example, we can apply IPW or, even better, IPWRA, using the cross-sectional data {(Ẏ𝑖𝑟, 𝐷𝑖, X𝑖) : 𝑖 = 1, 2, . . . , 𝑁}. We can also apply propensity score matching, covariate matching, or nearest neighbor matching.

Procedure 3.1 (Rolling Methods, Common Timing):
Step 1. For a given time period 𝑟 ∈ {𝑆, . . . , 𝑇} and each unit 𝑖, compute Ẏ𝑖𝑟 as in (2.17).
Step 2. Using all of the units, apply standard TE methods – such as linear RA, IPW, IPWRA, matching – to the cross section {(Ẏ𝑖𝑟, 𝐷𝑖, X𝑖) : 𝑖 = 1, . . . , 𝑁}. □
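As an illustration of step 2, here is a compact sketch of a doubly robust estimator applied to the transformed cross section. It is an AIPW-type combination of a logit propensity score and a linear control-group regression – a close cousin of the IPWRA estimator referenced above rather than a replication of any particular package – and the function and argument names are our own.

```python
# A sketch of a doubly robust ATT estimator for the transformed data
# (ydot, d, X); AIPW-type, in the spirit of IPWRA but not identical to
# any packaged implementation. Inputs are NumPy arrays.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def dr_att(ydot, d, X):
    treated = d.astype(bool)
    phat = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    m0 = LinearRegression().fit(X[~treated], ydot[~treated]).predict(X)  # control mean
    resid = ydot - m0
    odds = phat / (1.0 - phat)        # reweights controls toward the treated covariates
    n1 = treated.sum()
    return (resid[treated].sum() - (odds[~treated] * resid[~treated]).sum()) / n1
```

If either the propensity score model or the control-mean regression is correctly specified, estimators of this form remain consistent for 𝜏𝑟, which is the doubly robust property motivating IPWRA.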
Inference on a single 𝜏𝑟 is simple when one uses built-in commands in step 2 of Procedure 3.1. Joint inference on multiple 𝜏𝑟 is trickier because the estimators are not independent. For estimators such as IPW and IPWRA using parametric models, a general approach is to stack all moment conditions used in estimation and use the formulas from just-identified generalized method of moments estimation. Applying the panel bootstrap – resampling all time periods from the cross-sectional units – is valid for IPW and IPWRA, and should be computationally feasible in most cases; a sketch appears at the end of this section.

It is instructive to compare the transformation in equation (2.17) to that in CS (2021). In the common timing case, the CS transformation, for 𝑟 ≥ 𝑆, is

Y̊𝑖𝑟 = 𝑌𝑖𝑟 − 𝑌𝑖,𝑆−1, (2.25)

so that Y̊𝑖𝑟 is a “long” difference. If 𝑟 = 𝑆 then Y̊𝑖𝑆 = 𝑌𝑖𝑆 − 𝑌𝑖,𝑆−1, which is differencing adjacent periods. In many cases, the CS transformation will be inefficient compared with (2.17) because the CS differencing ignores time periods other than 𝑆 − 1. However, there are cases where CS (2021) can be more efficient. When 𝑇 = 2, the transformations coincide: Ẏ₂ = Y̊₂ = Δ𝑌₂ ≡ 𝑌₂ − 𝑌₁. Thus, our approach encompasses and extends Abadie (2005) and Sant’Anna and Zhao (2020) in the panel data case.
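The panel bootstrap mentioned above resamples whole units with replacement, keeping each unit’s entire time series intact, and re-estimates the vector of ATTs on every draw. A minimal sketch, assuming the same long-format DataFrame as before and any estimator function returning the ATT vector (all names illustrative):

```python
# A minimal panel-bootstrap sketch: resample units (not unit-periods)
# with replacement and re-estimate the full vector of ATTs each draw.
import numpy as np
import pandas as pd

def panel_bootstrap(df, estimate_fn, n_boot=999, seed=0):
    rng = np.random.default_rng(seed)
    units = df["unit"].unique()
    draws = []
    for _ in range(n_boot):
        sampled = rng.choice(units, size=len(units), replace=True)
        # re-key resampled units so duplicates count as distinct draws
        boot = pd.concat(df[df["unit"] == u].assign(unit=k)
                         for k, u in enumerate(sampled))
        draws.append(estimate_fn(boot))
    return np.asarray(draws)  # e.g., np.percentile(draws, [2.5, 97.5], axis=0)
```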
2.4 Staggered Interventions

2.4.1 Some Units Never Treated

We now turn to the staggered intervention case. As in Athey and Imbens (2022) and Wooldridge (2021, 2023), the potential outcomes are denoted

𝑌𝑡(𝑔), 𝑔 ∈ {𝑆, . . . , 𝑇, ∞}, 𝑡 ∈ {1, 2, . . . , 𝑇}, (2.26)

where 𝑔 indicates the first time subjected to the intervention – defining the treatment group cohorts – and 𝑡 is calendar time. The case 𝑔 = ∞ indicates the potential outcome in the never treated state. In other words, 𝑌𝑡(∞) is the potential outcome at time 𝑡 when a unit is not subjected to the intervention over the observed stretch of time. Listing potential outcomes that vary only by cohort and calendar time reflects the assumption of no reversibility with staggered entry. We denote the group or cohort indicators by {𝐷𝑆, 𝐷𝑆+1, . . . , 𝐷𝑇, 𝐷∞}, where 𝐷∞ = 1 indicates the never treated. These dummy variables are mutually exclusive and exhaustive: 𝐷𝑆 + · · · + 𝐷𝑇 + 𝐷∞ = 1. The ATTs of interest are now written as

𝜏𝑔𝑟 = 𝐸[𝑌𝑟(𝑔) − 𝑌𝑟(∞) | 𝐷𝑔 = 1], 𝑟 = 𝑔, . . . , 𝑇; 𝑔 = 𝑆, . . . , 𝑇 (2.27)

For each treated cohort 𝑔, 𝜏𝑔𝑟 (𝑟 = 𝑔, . . . , 𝑇) denotes the ATTs in the period of first treatment and all subsequent time periods. To identify the 𝜏𝑔𝑟, we extend the trick for the common timing case by writing

𝑌𝑡(𝑔) − 𝑌𝑡(∞) = [𝑌𝑡(𝑔) − (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝑌𝑠(𝑔)] − [𝑌𝑡(∞) − (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝑌𝑠(∞)] + (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} [𝑌𝑠(𝑔) − 𝑌𝑠(∞)] (2.28)

As in the common timing case, we make a no anticipation assumption so that the third term can be dropped; this effectively allows using all available control units in each treated period. Here we condition on the covariates so that we can use not-yet-treated units as part of the control group.

Assumption CNAS (Conditional No Anticipation, Staggered): For 𝑔 ∈ {𝑆, . . . , 𝑇}, 𝑡 ∈ {1, . . . , 𝑔 − 1} and covariates X,

𝐸[𝑌𝑡(𝑔) | 𝐷𝑔 = 1, X] = 𝐸[𝑌𝑡(∞) | 𝐷𝑔 = 1, X]. □ (2.29)

As in the common timing case, this assumption means that the “treatment” effects prior to the intervention are all zero. Because 𝑠 < 𝑔 in the third sum, it follows that the expected value of the last term conditional on 𝐷𝑔 = 1 is zero. Therefore,

𝜏𝑔𝑟 = 𝐸[Ẏ𝑟𝑔(𝑔) | 𝐷𝑔 = 1] − 𝐸[Ẏ𝑟𝑔(∞) | 𝐷𝑔 = 1], (2.30)

where Ẏ𝑟𝑔(𝑔) and Ẏ𝑟𝑔(∞) are defined as the first and second terms in (2.28), respectively. Note that the subscripts on Ẏ𝑟𝑔(𝑔) and Ẏ𝑟𝑔(∞) mean that we are averaging all periods just prior to 𝑔 and subtracting that average from the outcome in the current calendar time period 𝑟. As before, the first term in equation (2.28) is easily estimated because we observe Ẏ𝑟𝑔(𝑔) when 𝐷𝑔 = 1. A parallel trends assumption, stated in terms of the never treated state, is sufficient to identify 𝐸[Ẏ𝑟𝑔(∞) | 𝐷𝑔 = 1]. We state an assumption conditional on a set of covariates, X, with no covariates as a special case.

Assumption CPTS (Conditional PT, Staggered): For D = (𝐷𝑆, . . . , 𝐷𝑇),

𝐸[𝑌𝑡(∞) − 𝑌₁(∞) | D, X] = 𝐸[𝑌𝑡(∞) − 𝑌₁(∞) | X], 𝑡 = 2, . . . , 𝑇. □ (2.31)

This assumption is used in Wooldridge (2021). Again, it is unconfoundedness of the treatment level, as given by D, with respect to the trend in the untreated state, 𝑌𝑡(∞) − 𝑌₁(∞). Wooldridge (2021) used this assumption, along with linearity of conditional means, to derive an imputation estimator and showed it was the same as a pooled OLS and TWFE estimator. Here we show how it can be used to identify the 𝜏𝑔𝑟 very generally. With Ẏ𝑟𝑔(∞) defined above,

𝐸[Ẏ𝑟𝑔(∞) | 𝐷𝑔 = 1, X] = (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝐸[𝑌𝑟(∞) − 𝑌𝑠(∞) | 𝐷𝑔 = 1, X]
= (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝐸[𝑌𝑟(∞) − 𝑌𝑠(∞) | 𝐷∞ = 1, X]
= 𝐸[Ẏ𝑟𝑔(∞) | 𝐷∞ = 1, X] (2.32)

where the second equality follows from CPTS and the third follows by taking the expectation outside the summation. We have shown the following.

Theorem 2.3 Under Assumption CNAS, equation (2.30) holds. If we add Assumption CPTS, the cohort assignments, D = (𝐷𝑆, . . . , 𝐷𝑇), are unconfounded with respect to Ẏ𝑟𝑔(∞) (in the conditional mean sense), 𝑔 ∈ {𝑆, . . . , 𝑇}, 𝑟 ∈ {𝑔, . . . , 𝑇}, conditional on X. □
Because the vector of cohort indicators is unconfounded with respect to Ẏ𝑟𝑔(∞), Theorem 2.3 implies

𝐸[Ẏ𝑟𝑔(∞) | 𝐷∞ = 1, X] = 𝐸[Ẏ𝑟𝑔(∞) | 𝐷ℎ = 1, X], ℎ = 𝑆, . . . , 𝑇 (2.33)

We can combine this implication of CPTS with Assumption CNAS because, at time 𝑟, cohorts ℎ = 𝑟 + 1, . . . , 𝑇 have yet to be treated. Therefore,

𝐸[Ẏ𝑟𝑔(∞) | 𝐷ℎ = 1, X] = 𝐸[Ẏ𝑟𝑔(ℎ) | 𝐷ℎ = 1, X], ℎ = 𝑟 + 1, . . . , 𝑇 (2.34)

Combined, (2.33) and (2.34) mean that, in addition to the never treated (NT) group, we can use treatment cohorts ℎ ∈ {𝑟 + 1, . . . , 𝑇} in estimating 𝐸[Ẏ𝑟𝑔(∞) | 𝐷∞ = 1, X]. Incidentally, this derivation shows that if we only use the NT group as the control for each (𝑔, 𝑟) pair then we can drop the conditioning on X in Assumption CNAS. Later we discuss what can be identified in period 𝑇 without a NT group (under CNAS). We have established the following. For estimating 𝐸[Ẏ𝑟𝑔(∞) | 𝐷𝑔 = 1, X] for 𝑟 ∈ {𝑔, 𝑔 + 1, . . . , 𝑇} we can use cohorts {𝑟 + 1, . . . , 𝑇, ∞} as the control group. Define the indicator for the control group as 𝐴𝑟+1 ≡ 𝐷𝑟+1 + 𝐷𝑟+2 + · · · + 𝐷𝑇 + 𝐷∞. Then, within the subpopulation 𝐷𝑔 + 𝐴𝑟+1 = 1, 𝐷𝑔 is unconfounded with respect to Ẏ𝑟𝑔(∞), conditional on X. Therefore, we can apply standard treatment effect estimators after transforming the observed outcome and conditioning on the subpopulation. Naturally, we will need an overlap assumption in order to ensure identification when using methods that condition on covariates. For 𝜏𝑔𝑟 and using all legitimate control groups under CNAS and CPTS, the overlap assumption is

Assumption OVLS (Overlap, Staggered Case): For cohorts 𝑔 ∈ {𝑆, 𝑆 + 1, . . . , 𝑇} and time periods 𝑟 ∈ {𝑔, 𝑔 + 1, . . . , 𝑇},

𝑃(𝐷𝑔 = 1 | 𝐷𝑔 + 𝐴𝑟+1 = 1, X = x) < 1 for all x ∈ Supp(X). □ (2.35)

This condition ensures that, within the subpopulation of cohort 𝑔 plus the never treated and not-yet-treated units at time 𝑟, every treated unit has a comparable control unit. Given data on units again indexed by 𝑖, the following simple steps lead to a general analysis. Assumptions CNAS, CPTS, and the overlap assumption are in force.

Procedure 4.1 (Rolling Methods, Staggered Interventions):
Step 1. For a given 𝑔 ∈ {𝑆, . . . , 𝑇} and time period 𝑟 ∈ {𝑔, 𝑔 + 1, . . . , 𝑇}, compute

Ẏ𝑖𝑟𝑔 ≡ 𝑌𝑖𝑟 − (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝑌𝑖𝑠 ≡ 𝑌𝑖𝑟 − Ȳ𝑖,pre(𝑔) (2.36)

Step 2. Choose as the control group the units with 𝐷𝑖,𝑟+1 + 𝐷𝑖,𝑟+2 + · · · + 𝐷𝑖𝑇 + 𝐷𝑖∞ = 1 (or, if desired, a subset, such as the NT group).
Step 3. Using the subset of data with

𝐷𝑖𝑔 + 𝐷𝑖,𝑟+1 + 𝐷𝑖,𝑟+2 + · · · + 𝐷𝑖𝑇 + 𝐷𝑖∞ = 1, (2.37)

apply standard TE methods – such as linear RA, IPW, IPWRA, matching – to the cross section {(Ẏ𝑖𝑟𝑔, 𝐷𝑖𝑔, X𝑖) : 𝑖 = 1, . . . , 𝑁}, with 𝐷𝑖𝑔 acting as the treatment indicator. □

When X𝑖 has high dimension, Procedure 4.1 implies that machine learning methods can be applied after obtaining the transformation in (2.36). See, for example, Belloni et al. (2014) and Chernozhukov et al. (2018).
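A compact sketch of Procedure 4.1 follows, assuming a long-format DataFrame in which a cohort column records the first treated period (with np.inf for never-treated units); the column names and the helper are illustrative, and the estimator argument can be any cross-sectional ATT routine, such as the doubly robust sketch given earlier.

```python
# A sketch of Procedure 4.1: transform as in (2.36), keep cohort g plus
# the not-yet-treated and never-treated units (cohort > r covers both,
# since never treated is coded np.inf), as in (2.37), then apply a
# standard cross-sectional ATT estimator. Names are illustrative.
import numpy as np

def rolling_att_staggered(df, g, r, covars, estimator):
    pre_mean = df[df["period"] < g].groupby("unit")["y"].mean()    # \bar{Y}_{i,pre(g)}
    cross = df[df["period"] == r].set_index("unit").copy()
    cross["ydot"] = cross["y"] - pre_mean                          # eq. (2.36)
    cross = cross[(cross["cohort"] == g) | (cross["cohort"] > r)]  # eq. (2.37)
    d = (cross["cohort"] == g).to_numpy().astype(float)
    return estimator(cross["ydot"].to_numpy(), d,
                     cross[list(covars)].to_numpy())
```

For example, the ATT of cohort 5 in period 6 would be rolling_att_staggered(df, g=5, r=6, covars=["x1", "x2"], estimator=dr_att); looping over (𝑔, 𝑟) pairs traces out the full set of 𝜏𝑔𝑟.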
Interestingly, when 𝑟 = 𝑔, so that 𝜏𝑔𝑔 is the instantaneous effect of the intervention for treatment cohort 𝑔, applying linear RA to {(Ẏ𝑖𝑔𝑔, 𝐷𝑖𝑔, X𝑖) : 𝑖 = 1, . . . , 𝑁}, with all possible control units, reproduces the POLS estimator proposed by Wooldridge (2021). When 𝑟 > 𝑔 this is not the case, which means, under standard assumptions, the rolling approach we propose is inefficient for the dynamic effects. The trade-off is that we are able to apply many different kinds of estimators, including doubly robust and matching estimators.

2.4.2 Comparison to Alternative Methods

Procedure 4.1 can be compared with the Callaway and Sant’Anna (2021) approach with staggered interventions. CS (2021) suggest using a long difference of the form

Y̊𝑖𝑟𝑔 ≡ 𝑌𝑖𝑟 − 𝑌𝑖,𝑔−1 (2.38)

and then choosing control groups from cohorts {𝑟 + 1, . . . , 𝑇, ∞}. The transformation in (2.38) ignores the control periods prior to 𝑔 − 1 and is generally inefficient under classical assumptions. Also, the default implementation in commonly used software (R and Stata) is to use only the never-treated group as controls. Several authors, including Marcus and Sant’Anna (2021) and Callaway (2023), recognize that the version of parallel trends in Assumption CPTS implies that more information is available for estimating the ATTs. However, we believe we are the first to propose the demeaning transformation at the unit level, using the pre-treatment averages, and then combining the transformation with various treatment effects estimators after establishing unconfoundedness. Our approach makes it straightforward to apply different strategies based on different transformations and different control periods and control groups. Implementing Procedure 4.1 is straightforward because it simply requires obtaining Ẏ𝑖𝑟𝑔 and then applying standard treatment effect software. Standard errors are easily obtained.

To illustrate where the efficiency and robustness of our method originate, we compare the information sets utilized by various estimators when estimating treatment effects. Suppose there are three treatment cohorts – groups 4, 5, and 6 – and our goal is to estimate the average treatment effect on the treated for group 4 at time 𝑡 = 5, 𝐴𝑇𝑇(4, 5). Table 2.1 summarizes the subset of data each estimator employs for this estimation task. For pre-treatment periods up to 𝑡 = 3, the areas highlighted in dark gray represent the data used by each estimator, while light gray regions indicate observations that are not used. Compared to the estimator proposed by Callaway and Sant’Anna (2021), our unit-specific time-series transformation (rolling) method leverages a larger set of pre-treatment observations for each unit. This broader utilization enhances efficiency, recovering some of the precision otherwise lost in the CS (2021) framework.

Table 2.1 Pre-Treatment Period Observations Used by Each Method
[Shaded grid, one panel per estimator (CS (2021), Rolling Method, Wooldridge (2021)): rows are cohorts 𝑔 = 4, 5, 6, columns are periods 𝑡 = 1, . . . , 6; the target estimate 𝐴𝑇𝑇(4, 5) is marked ⋆, dark gray cells mark utilization of information, and light gray cells mark loss of information.]

However, the rolling method uses slightly less information than Wooldridge (2021)’s approach. Specifically, estimating 𝐴𝑇𝑇(4, 5) under the rolling method requires dropping observations from units already treated by period 𝑡 = 5, including Ẏ₅₅, which retains pre-treatment history. This exclusion is necessary to avoid “bad comparisons.” A similar logic underlies the CS (2021) approach, which likewise omits already-treated units. In contrast, Wooldridge (2021)’s regression-based framework retains all pre-treatment observations by construction, achieving high efficiency under correct specification.
However, this efficiency comes at the cost of increased vulnerability to model misspecification. In this respect, our estimator can be viewed as a middle ground that combines the robustness of the CS method with the efficiency of the Wooldridge (2021) framework. For example, our method accommodates doubly robust estimation, offering greater robustness to outcome model misspecification while still preserving much of the efficiency of the Wooldridge (2021) framework – a benefit not shared by the CS (2021) framework. In the special case of a single treatment group – that is, under common timing – the rolling method and the extended TWFE estimator (Wooldridge, 2021) use an equivalent set of pre-treatment observations, resulting in comparable efficiency when the model is correctly specified.

2.4.3 All Units Eventually Treated

As in Wooldridge (2021, 2023), we can handle situations where all units are treated by 𝑡 = 𝑇 by simply modifying the parallel trends assumptions and changing the specifics of the estimation. Rather than stating the CPT assumption in terms of the NT state, 𝑌𝑡(∞), it is stated in terms of 𝑌𝑡(𝑇), the state of not being treated until the final time period. Now all of the treatment effects are, initially, defined relative to 𝑌𝑡(𝑇): 𝐸[𝑌𝑟(𝑔) − 𝑌𝑟(𝑇) | 𝐷𝑔 = 1], 𝑔 ∈ {𝑆, . . . , 𝑇 − 1}, 𝑟 ∈ {𝑔, . . . , 𝑇}. We can no longer estimate a treatment effect for the final treated cohort because there are no control units. As discussed in Wooldridge (2021), by no anticipation it follows that for 𝑟 < 𝑇, 𝐸[𝑌𝑟(𝑔) − 𝑌𝑟(𝑇) | 𝐷𝑔 = 1] = 𝐸[𝑌𝑟(𝑔) − 𝑌𝑟(∞) | 𝐷𝑔 = 1], and so, except for the final time period, the ATTs are interpreted as when we have a never treated group. In terms of estimation, the modifications to Procedure 4.1 are straightforward. In particular, Ẏ𝑖𝑟𝑔 is computed as in (2.36) but only for 𝑔 ∈ {𝑆, . . . , 𝑇 − 1}. In steps 2 and 3, we drop 𝐷𝑖∞ everywhere – which means still choosing as the control group for cohort 𝑔 in period 𝑟 those units not yet treated. Then, for each 𝑔 ∈ {𝑆, . . . , 𝑇 − 1} and for each period 𝑟 ∈ {𝑔, . . . , 𝑇}, we apply standard treatment effect estimators, as before. When 𝑟 = 𝑇, 𝐷𝑇 = 1 acts as the only control group for all cohorts first treated prior to period 𝑇.

2.5 Heterogeneous Trends

One way to test for violation of the PT assumption is to estimate placebo treatment effects prior to the intervention. Callaway and Sant’Anna (2021) take this approach using their differencing method. Here, we can apply Procedure 3.1 or 4.1 to pre-treatment periods and test for effects prior to the intervention. For cohort 𝑔, it makes sense to split the pre-treatment periods, {1, 2, . . . , 𝑔 − 1}, into subsets of roughly equal size. Then, the never treated group, or any of the groups not yet treated in {1, 2, . . . , 𝑔 − 1}, can be used as the controls. Under the null hypothesis of (conditional) PT, the tests should not find a “treatment” effect.

As motivation for adjusting Procedure 4.1 (with common timing being a special case) to allow for heterogeneous trends, express the conditional PT assumption as

𝐸[𝑌𝑡(∞) | D, X] = 𝑞∞(X) + ∑_{𝑔=𝑆}^{𝑇} 𝐷𝑔𝑞𝑔(X) + 𝑚𝑡(X), 𝑡 = 1, . . . , 𝑇, (2.39)

where 𝑞𝑔(·) and 𝑚𝑡(·) are functions of the covariates, with the first not changing across time and the second not depending on the treatment cohort. As a normalization, 𝑚₁(x) ≡ 0 for all x ∈ Supp(X). It is easily seen that

𝐸[𝑌𝑡(∞) − 𝑌₁(∞) | D, X] = 𝑚𝑡(X), 𝑡 = 2, . . . , 𝑇,

which does not depend on D.
Moreover, for 𝑟 ≥ 𝑔, it follows from the definition of Ẏ𝑟𝑔(∞) that

𝐸[Ẏ𝑟𝑔(∞) | 𝐷∞ = 1, X] = 𝑚𝑟(X) − (1/(𝑔−1)) ∑_{𝑠=1}^{𝑔−1} 𝑚𝑠(X) ≡ ṁ𝑟𝑔(X) (2.40)

and so ṁ𝑟𝑔(·) is the conditional mean function implicit in the methods from Section 2.4 that use a conditional mean specification (RA, IPW, or IPWRA). Because ṁ𝑟𝑔(·) can take on positive and negative values, we essentially assumed in regression-based methods that ṁ𝑟𝑔(·) can be approximated by a function linear in parameters (allowing controls to appear flexibly, as usual).

The representation in (2.39) suggests a way to relax parallel trends for cohorts where we have at least two pre-treatment time periods. We replace (2.39) with the following assumption.

Assumption CHT (Conditional Heterogeneous Trends): For D = (𝐷𝑆, . . . , 𝐷𝑇) and 𝑡 = 1, 2, . . . , 𝑇,

𝐸[𝑌𝑡(∞) | D, X] = 𝜂𝑆(𝐷𝑆 · 𝑡) + · · · + 𝜂𝑇(𝐷𝑇 · 𝑡) + 𝑞∞(X) + ∑_{𝑔=𝑆}^{𝑇} 𝐷𝑔𝑞𝑔(X) + 𝑚𝑡(X). □ (2.41)

Assumption CHT allows a separate linear trend in the never treated state for each treatment cohort. It is easy to see that

𝐸[𝑌𝑡(∞) − 𝑌𝑡−1(∞) | D, X] = 𝜂𝑆𝐷𝑆 + · · · + 𝜂𝑇𝐷𝑇 + [𝑚𝑡(X) − 𝑚𝑡−1(X)] (2.42)

and so PT, even conditional on X, fails. Because the trend in the never treated state is systematically related to cohort, the estimation approaches in Sections 2.3 and 2.4 are no longer valid. Instead, we can use linear detrending, unit by unit, to remove the relationship between 𝑌𝑡(∞) and cohort assignment. For any 𝑖, we can write

𝑌𝑖𝑡(∞) = h(D𝑖, X𝑖) + D𝑖 · 𝑡 · 𝜂 + 𝑚𝑡(X𝑖) + 𝑈𝑖𝑡(∞), 𝐸[𝑈𝑖𝑡(∞) | D𝑖, X𝑖] = 0, (2.43)

where h(D𝑖, X𝑖) does not vary across 𝑡. For any 𝑡 ≥ 2, we can eliminate both h(D𝑖, X𝑖) and D𝑖 · 𝑡 · 𝜂 using unit-specific linear detrending. Equation (2.43) is an example of a heterogeneous (or random) trend model of the kind discussed in Wooldridge (2010, Section 11.7.2). Appendix 2B details how unit-specific linear trends are removed by regressing the outcome on a constant and a linear time trend using the pre-treatment periods. The resulting prediction errors define the detrended outcome variable, denoted Ÿ𝑖𝑟𝑔(∞). We then show that the treatment cohort indicators, D𝑖, are unconfounded with respect to these detrended variables, conditional on X𝑖. Moreover, by Assumption CNAS, cohorts with ℎ > 𝑟 (including ℎ = ∞) can be used as part of the control group. Then, the argument is as in Section 2.4: we have justified the following procedure as producing consistent estimators of 𝜏𝑔𝑟 under Assumptions CNAS, CHT, and OVLS.

Procedure 5.1 (Staggered Entry, Heterogeneous Linear Trends):
Step 1. For a specified cohort 𝑔 ∈ {𝑆, . . . , 𝑇}, run the unit-specific regressions

𝑌𝑖𝑡 on 1, 𝑡, 𝑡 = 1, . . . , 𝑔 − 1 (2.44)

For 𝑟 ∈ {𝑔, . . . , 𝑇}, compute the out-of-sample predicted value Ŷ𝑖𝑟𝑔 and the prediction error (detrended variable) Ÿ𝑖𝑟𝑔 ≡ 𝑌𝑖𝑟 − Ŷ𝑖𝑟𝑔. (One need not do this for units treated prior to period 𝑔.)
Step 2. Choose as the control group the units with 𝐷𝑖,𝑟+1 + 𝐷𝑖,𝑟+2 + · · · + 𝐷𝑖𝑇 + 𝐷𝑖∞ = 1 (or, if desired, a subset, such as the NT group).
Step 3. Using the subset of data with 𝐷𝑖𝑔 + 𝐷𝑖,𝑟+1 + 𝐷𝑖,𝑟+2 + · · · + 𝐷𝑖𝑇 + 𝐷𝑖∞ = 1, or a further subset, apply standard TE methods – such as linear RA, IPW, IPWRA, or matching – to the cross section

{(Ÿ𝑖𝑟𝑔, 𝐷𝑖𝑔, X𝑖) : 𝑖 = 1, . . . , 𝑁}, (2.45)

with 𝐷𝑖𝑔 acting as the treatment indicator. □

Procedure 5.1 is very easy to implement, requiring only many unit-specific simple regressions on a constant and a linear time trend.
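Step 1 is simple enough to code directly. A sketch, assuming the same long-format DataFrame as before (column names illustrative): each unit’s pre-𝑔 outcomes are fit on a constant and linear trend, and the period-𝑟 prediction error gives Ÿ𝑖𝑟𝑔.

```python
# A sketch of step 1 of Procedure 5.1: unit-specific linear detrending.
# Requires at least two pre-treatment periods (g - 1 >= 2). Column names
# (unit, period, y) are illustrative assumptions.
import numpy as np

def detrended_outcome(df, g, r):
    ycheck = {}
    for unit, block in df[df["period"] < g].groupby("unit"):
        t, y = block["period"].to_numpy(), block["y"].to_numpy()
        slope, intercept = np.polyfit(t, y, deg=1)    # OLS of y on (1, t), eq. (2.44)
        y_r = df.loc[(df["unit"] == unit) & (df["period"] == r), "y"].iloc[0]
        ycheck[unit] = y_r - (intercept + slope * r)  # prediction error in period r
    return ycheck
```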
The common timing case is especially easy because the regression in (2.44) is done with 𝑔 = 𝑆 only, and then the detrended outcomes Ÿ𝑖𝑟 are used in standard treatment effect estimation for 𝑟 = 𝑆, . . . , 𝑇. In the simplest case where Procedure 5.1 can be applied, with 𝑇 = 3 and common intervention at 𝑆 = 3, and without covariates, the resulting estimator of 𝜏₃ is

𝑁₁⁻¹ ∑_{𝑖=1}^{𝑁} 𝐷𝑖Ÿ𝑖3 − 𝑁₀⁻¹ ∑_{𝑖=1}^{𝑁} (1 − 𝐷𝑖)Ÿ𝑖3, (2.46)

where Ÿ𝑖3 is obtained as the prediction error in period three after the regression 𝑌𝑖𝑡 on 1, 𝑡, 𝑡 = 1, 2. After a little algebra, (2.46) can be shown to equal

[(Ȳ₁₃ − Ȳ₁₂) − (Ȳ₀₃ − Ȳ₀₂)] − [(Ȳ₁₂ − Ȳ₁₁) − (Ȳ₀₂ − Ȳ₀₁)], (2.47)

where the first subscript on the average is one for treated and zero for control, and the second subscript indicates time period. The first term in brackets is the usual two-period DiD estimator if the first time period is ignored. The second term is an estimate of the difference in trends prior to the intervention – often interpreted as estimating a placebo effect. The estimator in (2.47) is an example of a difference-in-difference-in-differences estimator. Procedure 5.1 allows one to control for covariates in case removing an estimate of the pre-intervention difference in trends is still not deemed sufficient to uncover a causal effect.
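The algebra behind the equality of (2.46) and (2.47) is easy to verify numerically: fitting a line through (1, 𝑌𝑖1) and (2, 𝑌𝑖2) predicts 2𝑌𝑖2 − 𝑌𝑖1 in period three, so Ÿ𝑖3 = (𝑌𝑖3 − 𝑌𝑖2) − (𝑌𝑖2 − 𝑌𝑖1). A quick check on made-up data (not from the paper):

```python
# Numeric check that (2.46) equals (2.47) when T = 3 and S = 3; the data
# below are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=(40, 3))                 # 40 units, periods t = 1, 2, 3
d = (np.arange(40) < 15).astype(float)       # first 15 units treated at S = 3
yddot = y[:, 2] - (2 * y[:, 1] - y[:, 0])    # prediction error from fit on t = 1, 2
lhs = yddot[d == 1].mean() - yddot[d == 0].mean()        # eq. (2.46)
m = lambda grp, t: y[d == grp, t - 1].mean()             # group-period average
rhs = ((m(1, 3) - m(1, 2)) - (m(0, 3) - m(0, 2))) - (
       (m(1, 2) - m(1, 1)) - (m(0, 2) - m(0, 1)))        # eq. (2.47)
assert np.isclose(lhs, rhs)
```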
Before ending this section, we head off a potential source of confusion. The fact that we are running unit-specific linear trend regressions in (2.44) does not mean there is an incidental parameters problem that can cause inconsistency in the τ̂𝑔𝑟 when the number of time periods is small. In fact, we are just using these regressions to eliminate unit-specific heterogeneity that can be correlated with treatment cohorts. It is substantively the same as removing the unit-specific pre-treatment means in Procedure 4.1. In fact, this kind of unit-specific detrending is the same idea prevalent in the panel data literature with heterogeneous trends. See, for example, Wooldridge (2010, Section 11.7.2).

2.6 Violations of No Anticipation. Unbalanced Panels

The no anticipation assumption requires that, prior to the first intervention period for a given treatment cohort, the potential outcomes are the same (on average) as in the never treated state. This assumption can fail if units know that a program or policy change is approaching prior to its being actually implemented. If the NA assumption is in doubt, one can leave out one or more periods prior to the intervention time and redo the analysis as a robustness check. As an example, suppose a cohort is first treated in 𝑔 = 5. In Procedure 4.1, one would average over periods {1, 2, 3, 4} in obtaining the average to remove from 𝑌𝑖5 (or, one would remove a unit-specific linear trend, as in Procedure 5.1). Instead, one might drop period four altogether, or maybe even periods three and four. Any precise recommendation is context specific. It is very easy to apply any of the procedures we have recommended to cases where time periods are skipped.

Another issue that often arises in practice is unbalanced panels. With time-constant controls, unbalancedness would typically arise because of missing data on 𝑌𝑖𝑡, possibly due to attrition. If data are missing on 𝑌𝑖𝑡 for some time periods for unit 𝑖, the demeaning or detrending is simply applied to the observed data. The mechanics of the procedure are then exactly the same. For treatment cohort 𝑔 in period 𝑟, the transformed outcome (Ẏ𝑖𝑟𝑔 or Ÿ𝑖𝑟𝑔) can only be used if there are enough observed data in the periods 𝑡 < 𝑔 to compute an average (one period) or a linear trend (two periods). Of course, 𝑌𝑖𝑟 must also be observed. It is natural to wonder when ignoring the reason the panel is unbalanced does not cause systematic bias. Because our method removes unit-specific averages in Procedure 4.1, selection is allowed to depend on unobserved time-constant heterogeneity – just like with the usual fixed effects estimator. Selection cannot be systematically related to the shocks to 𝑌𝑖𝑡(∞) – again, just as with the FE estimator. When we remove a unit-specific linear trend, selection can now be correlated with both a level heterogeneity term and a trend heterogeneity term, providing somewhat more robustness to sample selection bias.

2.7 Application

In this section, we revisit the literature examining the effects of Walmart’s entry on local labor markets (Basker, 2005, Neumark et al., 2008, Brown and Butts, 2023). Prior studies highlight concerns about potential violations of the parallel trends assumption. For example, Brown and Butts (2023) provide visual evidence suggesting that much of the estimated effect could be due to pre-existing trends rather than the impact of Walmart’s entry (Rambachan and Roth, 2023). If store opening decisions are mainly driven by county-level economic fundamentals, observed employment differences between treated and control groups may not be attributable solely to Walmart’s opening.

2.7.1 Data

To illustrate that our rolling method, as developed in Section 2.5, yields robust results even when counties exhibit disparate trends after controlling for covariates, we use the dataset constructed by Brown and Butts (2023), which follows Basker (2005) and draws on the County Business Patterns (CBP) data from 1964 and 1977–1999. We limit the dataset to counties that had more than 1,500 total employees in 1964 and had non-negative employment growth rates during 1964–1977, each of which was imputed from the 1977–1999 dataset (see Brown and Butts, 2023). The Walmart store opening indicator is derived from the geographic dataset of openings in Arcidiacono et al. (2020). We exclude counties whose first Walmart opened before 1985. As a result, we obtain a balanced panel of 1,280 counties, each observed for nine pre-treatment years and fourteen post-treatment years, for a total of 23 years. Figure 2.1 plots the sample counties and their first Walmart opening year. Darker shades indicate later openings, light green shows never-treated counties, and gray marks excluded areas.

Figure 2.1 First Year of Walmart Store Openings by County

The treatment cohort is defined by the year of first opening. We utilize covariates, including the 1980 shares of the population employed in manufacturing, above the poverty line, and with a high school education. Never-treated counties comprise 31% of the sample. For descriptive statistics of the variables used in our analysis, see Table 2D.1 in Appendix 2D. We use the dataset to assess how Walmart openings affect county-level retail and wholesale employment.

2.7.2 Estimation Results

First, we estimate cohort-year-specific average treatment effects on the treated (ATT), denoted 𝐴𝑇𝑇(𝑔, 𝑡), for cohort 𝑔 in year 𝑡. We then compute the weighted sum of ATTs by relative time (length of exposure) and calculate the 95% bootstrap confidence intervals. We define the weighted ATTs, 𝑊𝐴𝑇𝑇(𝑟), as follows:

𝑊𝐴𝑇𝑇(𝑟) = ∑_{𝑔∈𝐺𝑟} 𝑤(𝑔, 𝑟) · 𝐴𝑇𝑇(𝑔, 𝑔 + 𝑟), where 𝑟 = 𝑡 − 𝑔,

for all 𝑟 ∈ {0, 1, . . . , 𝑇 − 𝑆}, where 𝐺𝑟 = {𝑆, . . . , 𝑇 − 𝑟} and 𝑤(𝑔, 𝑟) = 𝑁𝑔/𝑁𝐺𝑟 (𝑁𝑔 is the number of counties in treated cohort 𝑔, and 𝑁𝐺𝑟 is the number of counties in 𝐺𝑟). For pre-treatment periods (𝑟 < 0), the procedure used to obtain the ATTs is detailed in Appendix 2C. For example, 𝑊𝐴𝑇𝑇(0) denotes the weighted average of the immediate impact on the log of the employment level, which averages the ATTs of the first treated year across every treated cohort; i.e., 𝐴𝑇𝑇(𝑔, 𝑔) for all 𝑔 ∈ 𝐺₀.
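This aggregation is a one-liner once the 𝐴𝑇𝑇(𝑔, 𝑡) estimates are in hand. A sketch, assuming att maps (𝑔, 𝑡) pairs to estimates and n maps cohorts to cohort sizes 𝑁𝑔 (both names illustrative):

```python
# A sketch of the WATT(r) aggregation defined above: a cohort-size
# weighted average of ATT(g, g + r) over G_r = {S, ..., T - r}.
def watt(att, n, r, S, T):
    cohorts = range(S, T - r + 1)                 # G_r
    n_total = sum(n[g] for g in cohorts)          # N_{G_r}
    return sum((n[g] / n_total) * att[(g, g + r)] for g in cohorts)
```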
We define the weighted ATTs, $WATT(r)$, as follows:
\[
WATT(r) = \sum_{g \in G_r} w(g, r) \cdot ATT(g, g + r), \quad \text{where } r = t - g,
\]
for all $r \in \{0, 1, \ldots, T - S\}$, with $G_r = \{S, \ldots, T - r\}$ and $w(g, r) = N_g / N_{G_r}$ ($N_g$ is the number of counties in treated cohort $g$, and $N_{G_r}$ is the number of counties in $G_r$). For pre-treatment periods ($r < 0$), the procedure used to obtain the ATTs is detailed in Appendix 2C. For example, $WATT(0)$ denotes the weighted average of the immediate impact on the log of the employment level, obtained by averaging the ATTs of the first treated year across every treated cohort, that is, $ATT(g, g)$ for all $g \in G_{r=0}$.

Lastly, we compare results from the CS (2021) approach, the rolling IPWRA estimator with unit-specific demeaning, and the rolling IPWRA estimator with unit-specific detrending.

2.7.2.1 Retail Employment

In this section, we estimate the impact of Walmart's entry on county-level retail employment, using the log of employment as the outcome. Figure 2.2 reports weighted ATT estimates with bootstrapped 95% confidence intervals from the three estimators. Panel (a) is an event-study plot generated using the CS (2021) approach. The blue line represents the pre-trends, which ideally should be zero; instead, it shows a clear positive slope, indicating pre-existing differences between treated and control counties. As a result, the estimated treatment effects, shown in red, likely do not reflect the true impact of Walmart's entry on retail employment. Panel (b) presents results from our rolling IPWRA estimator applied to the demeaned outcome variable. This approach smooths out much of the pre-trend bias relative to the CS estimator, but the pre-trends are still not fully addressed. Panel (c) shows results from the rolling IPWRA estimator with detrending, which effectively eliminates county-level linear trends. The flatter pre-treatment trajectory indicates better adjustment for pre-existing differences. Post-entry, we observe modest but statistically significant increases in retail employment that decline over time and eventually converge to zero.

Figure 2.2 Effects of Walmart Opening on log(Retail Employment): Panel (a) CS approach; Panel (b) rolling IPWRA estimator; Panel (c) rolling IPWRA with unit-specific detrending

For example, ATT(1) from the rolling IPWRA with heterogeneous trends is 0.032 (SE = 0.005), indicating a 3.2% increase in retail employment one year after entry (see Appendix 2D for all estimates). Given an average of 6,589 retail employees in treated counties, this translates to about 210 additional jobs. However, since Walmart stores typically employ 150–300 workers (Basker, 2005), it is difficult to attribute broader local employment gains beyond direct hiring. Results for wholesale employment appear in Appendix 2D.1.

2.8 Concluding Remarks

In this paper, we propose a flexible transformation approach for estimating treatment effects in panel data with staggered interventions and heterogeneous trends. A key advantage of this method is that once the transformed dependent variable is defined, whether in the common timing case, the staggered case, or after removing unit-specific trends, researchers can apply standard treatment effect estimators to the resulting cross-sectional data. This includes regression adjustment, inverse probability weighting, and doubly robust procedures. The transformation is easily adapted to accommodate unit-specific trends and allows for dynamic and heterogeneous treatment effects across units and time periods.
Our empirical application to Walmart's entry demonstrates how the approach improves estimation accuracy and robustness compared to existing methods. These features, combined with its simplicity of implementation, make the method a valuable tool for applied researchers.

BIBLIOGRAPHY

Abadie, A. (2005). Semiparametric difference-in-differences estimators. The Review of Economic Studies, 72(1):1–19.

Arcidiacono, P., Ellickson, P. B., Mela, C. F., and Singleton, J. D. (2020). The competitive effects of entry: Evidence from supercenter expansion. American Economic Journal: Applied Economics, 12(3):175–206.

Athey, S. and Imbens, G. W. (2022). Design-based analysis in difference-in-differences settings with staggered adoption. Journal of Econometrics, 226(1):62–79.

Basker, E. (2005). Job creation or destruction? Labor market effects of Wal-Mart expansion. Review of Economics and Statistics, 87(1):174–183.

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608–650.

Borusyak, K. and Jaravel, X. (2018). Revisiting event study designs. Available at SSRN 2826228.

Borusyak, K., Jaravel, X., and Spiess, J. (2024). Revisiting event study designs: Robust and efficient estimation. Review of Economic Studies, page rdae007.

Brown, N. and Butts, K. (2023). Dynamic treatment effect estimation with interactive fixed effects and short panels.

Callaway, B. (2023). Difference-in-differences for policy evaluation. Handbook of Labor, Human Resources and Population Economics, pages 1–61.

Callaway, B. and Sant'Anna, P. H. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200–230.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters.

De Chaisemartin, C. and d'Haultfoeuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9):2964–2996.

De Chaisemartin, C. and d'Haultfoeuille, X. (2023). Two-way fixed effects and difference-in-differences with heterogeneous treatment effects: A survey. The Econometrics Journal, 26(3):C1–C30.

Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2):254–277.

Heckman, J. J., Ichimura, H., and Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. The Review of Economic Studies, 64(4):605–654.

Marcus, M. and Sant'Anna, P. H. (2021). The role of parallel trends in event study settings: An application to environmental economics. Journal of the Association of Environmental and Resource Economists, 8(2):235–275.

Negi, A. and Wooldridge, J. M. (2021). Revisiting regression adjustment in experiments with heterogeneous treatment effects. Econometric Reviews, 40(5):504–534.

Neumark, D., Zhang, J., and Ciccarella, S. (2008). The effects of Wal-Mart on local labor markets. Journal of Urban Economics, 63(2):405–430.

Rambachan, A. and Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5):2555–2591.

Sant'Anna, P. H. and Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1):101–122.

Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects.
Journal of Econometrics, 225(2):175–199.

Wooldridge, J. M. (2007). Inverse probability weighted estimation for general missing data problems. Journal of Econometrics, 141(2):1281–1301.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.

Wooldridge, J. M. (2021). Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators. Available at SSRN 3906345.

Wooldridge, J. M. (2023). Simple approaches to nonlinear difference-in-differences with panel data. The Econometrics Journal, 26(3):C31–C66.

APPENDIX 2A PROOF OF THEOREM 2.2

We modify the argument in Wooldridge (2021, Theorem 8.1). The $\hat{\tau}_r$ are obtained from regression (2.21). Because $\dot{Y}_{ir} = Y_{ir} - \bar{Y}_{i,pre}$, basic OLS algebra shows that all coefficients from (2.21) are obtained by differencing the coefficients from the two regressions
\[
\dot{Y}_{ir} \text{ on } 1, D_i, \dot{\mathbf{X}}_i, D_i \cdot \dot{\mathbf{X}}_i, \quad i = 1, 2, \ldots, N \tag{2A.1}
\]
\[
\bar{Y}_{i,pre} \text{ on } 1, D_i, \dot{\mathbf{X}}_i, D_i \cdot \dot{\mathbf{X}}_i, \quad i = 1, 2, \ldots, N \tag{2A.2}
\]
where $\dot{\mathbf{X}}_i = \mathbf{X}_i - \bar{\mathbf{X}}_1$ are the covariates demeaned using the treated units. In particular, letting $\hat{\rho}_r$ be the coefficient on $D_i$ from (2A.1) and $\hat{\rho}_{pre}$ the coefficient on $D_i$ from (2A.2),
\[
\hat{\tau}_r = \hat{\rho}_r - \hat{\rho}_{pre} \tag{2A.3}
\]
Note also that the coefficients on the "moderating" terms, $D_i \cdot \dot{\mathbf{X}}_i$, are also obtained by differencing across the two regressions.

To show (2A.3) is the same as the coefficient on $D_i \cdot f_{rt}$ in (2.24), first note that, by Wooldridge (2021, Theorem 3.2), we can drop $(f_{qt}, f_{qt} \cdot \dot{\mathbf{X}}_i)$ for $q < S$ without affecting the estimates. In other words, the $\tilde{\tau}_r$ are the coefficients on $D_i \cdot f_{rt}$ in the regression
\[
Y_{it} \text{ on } 1, \dot{\mathbf{X}}_i, D_i, D_i \cdot \dot{\mathbf{X}}_i, f_{St}, \ldots, f_{Tt}, f_{St} \cdot \dot{\mathbf{X}}_i, \ldots, f_{Tt} \cdot \dot{\mathbf{X}}_i, D_i \cdot f_{St}, \ldots, D_i \cdot f_{Tt}, D_i \cdot f_{St} \cdot \dot{\mathbf{X}}_i, \ldots, D_i \cdot f_{Tt} \cdot \dot{\mathbf{X}}_i \tag{2A.4}
\]
Now, define $\mathbf{H}_i \equiv (1, \dot{\mathbf{X}}_i, D_i, D_i \cdot \dot{\mathbf{X}}_i)$, a $1 \times 2(K + 1)$ vector, and
\[
\mathbf{L}_{it} \equiv (f_{St}, f_{St} \cdot \dot{\mathbf{X}}_i, D_i \cdot f_{St}, D_i \cdot f_{St} \cdot \dot{\mathbf{X}}_i, \ldots, f_{Tt}, f_{Tt} \cdot \dot{\mathbf{X}}_i, D_i \cdot f_{Tt}, D_i \cdot f_{Tt} \cdot \dot{\mathbf{X}}_i), \tag{2A.5}
\]
a row vector with $2(T - S + 1)(K + 1)$ elements. Note that for $q \ne r$, $(f_{qt}, f_{qt} \cdot \dot{\mathbf{X}}_i, D_i \cdot f_{qt}, D_i \cdot f_{qt} \cdot \dot{\mathbf{X}}_i)$ and $(f_{rt}, f_{rt} \cdot \dot{\mathbf{X}}_i, D_i \cdot f_{rt}, D_i \cdot f_{rt} \cdot \dot{\mathbf{X}}_i)$ are orthogonal in sample because $f_{qt} \cdot f_{rt} = 0$. The full set of regressors in (2A.4) is simply $(\mathbf{H}_i, \mathbf{L}_{it})$. With $p_t = f_{St} + \cdots + f_{Tt}$, the post-treatment period indicator, $(1 - p_t) f_{rt} = 0$, $r = S, S + 1, \ldots, T$, which means $(1 - p_t) \mathbf{L}_{it} = \mathbf{0}$. Therefore, the objective function underlying the regression in (2A.4) can be written as
\[
\min_{\theta, \delta} \; \sum_{i=1}^{N} \sum_{t=1}^{T} (1 - p_t)(Y_{it} - \mathbf{H}_i \theta)^2 + \sum_{i=1}^{N} \sum_{t=1}^{T} p_t (Y_{it} - \mathbf{H}_i \theta - \mathbf{L}_{it} \delta)^2 \tag{2A.6}
\]
Letting $\tilde{\theta}$ and $\tilde{\delta}$ denote the POLS estimators, the first order conditions are
\[
\sum_{i=1}^{N} \sum_{t=1}^{T} (1 - p_t) \mathbf{H}_i' (Y_{it} - \mathbf{H}_i \tilde{\theta}) + \sum_{i=1}^{N} \sum_{t=1}^{T} p_t \mathbf{H}_i' (Y_{it} - \mathbf{H}_i \tilde{\theta} - \mathbf{L}_{it} \tilde{\delta}) = 0 \tag{2A.7}
\]
\[
\sum_{i=1}^{N} \sum_{t=1}^{T} p_t \mathbf{L}_{it}' (Y_{it} - \mathbf{H}_i \tilde{\theta} - \mathbf{L}_{it} \tilde{\delta}) = 0 \tag{2A.8}
\]
Next, note that because $p_t = f_{St} + \cdots + f_{Tt}$, we can write
\[
p_t \mathbf{H}_i = [p_t, \; p_t \cdot D_i, \; p_t \cdot \dot{\mathbf{X}}_i, \; p_t \cdot D_i \cdot \dot{\mathbf{X}}_i] = \sum_{q=S}^{T} [f_{qt}, \; f_{qt} \cdot D_i, \; f_{qt} \cdot \dot{\mathbf{X}}_i, \; f_{qt} \cdot D_i \cdot \dot{\mathbf{X}}_i], \tag{2A.9}
\]
which is simply the sum of the subvectors in $\mathbf{L}_{it}$ corresponding to the different time periods. It follows that $p_t \mathbf{H}_i = p_t \mathbf{L}_{it} \mathbf{A}$ for a $2(T - S + 1)(K + 1) \times 2(K + 1)$ matrix $\mathbf{A}$.
Plugging this into (2A.7) gives
\[
\sum_{i=1}^{N} \sum_{t=1}^{T} (1 - p_t) \mathbf{H}_i' (Y_{it} - \mathbf{H}_i \tilde{\theta}) + \mathbf{A}' \sum_{i=1}^{N} \sum_{t=1}^{T} p_t \mathbf{L}_{it}' (Y_{it} - \mathbf{H}_i \tilde{\theta} - \mathbf{L}_{it} \tilde{\delta}) = 0 \tag{2A.10}
\]
Along with (2A.8), (2A.10) implies that the FOCs for $(\tilde{\theta}, \tilde{\delta})$ reduce to
\[
\sum_{i=1}^{N} \sum_{t=1}^{T} (1 - p_t) \mathbf{H}_i' (Y_{it} - \mathbf{H}_i \tilde{\theta}) = 0 \tag{2A.11}
\]
But (2A.11) means that $\tilde{\theta}$ is the OLS estimator from the regression of $Y_{it}$ on $\mathbf{H}_i$ using the pre-treatment period observations. With $\mathbf{H}_i$ not varying over time, $\tilde{\theta}$ is the same as from the cross-sectional regression
\[
\bar{Y}_{i,pre} \text{ on } 1, D_i, \dot{\mathbf{X}}_i, D_i \cdot \dot{\mathbf{X}}_i \tag{2A.12}
\]
In particular, the coefficient on $D_i$ is precisely $\hat{\rho}_{pre}$ in (2A.3). Next, the FOC in (2A.8) shows that $\tilde{\delta}$ is from a POLS regression using the post-treatment periods:
\[
Y_{it} - \mathbf{H}_i \tilde{\theta} \text{ on } \mathbf{L}_{it}, \quad t = S, \ldots, T; \; i = 1, \ldots, N
\]
By the definition of $\mathbf{L}_{it}$ and the orthogonality of its elements across the subvectors representing the different time periods, each $2(K + 1)$ subvector, $\tilde{\delta}_r$, $r = S, \ldots, T$, is obtained from a separate cross-sectional regression for each post-treatment period. Namely, because $f_{rr} = 1$, the regression for period $r$ is
\[
Y_{ir} - \mathbf{H}_i \tilde{\theta} \text{ on } 1, D_i, \dot{\mathbf{X}}_i, D_i \cdot \dot{\mathbf{X}}_i, \quad i = 1, \ldots, N \tag{2A.13}
\]
The vector on the right is simply $\mathbf{H}_i$, and so
\[
\tilde{\delta}_r = \left( \sum_{i=1}^{N} \mathbf{H}_i' \mathbf{H}_i \right)^{-1} \left( \sum_{i=1}^{N} \mathbf{H}_i' Y_{ir} \right) - \tilde{\theta} \tag{2A.14}
\]
The first term gives the regression coefficients from $Y_{ir}$ on $1, D_i, \dot{\mathbf{X}}_i, D_i \cdot \dot{\mathbf{X}}_i$, $i = 1, \ldots, N$, whose coefficient on $D_i$ is $\hat{\rho}_r$ from (2A.3). We have shown that the coefficient corresponding to $D_i \cdot f_{rt}$ in the regression (2A.4) is $\hat{\rho}_r - \hat{\rho}_{pre}$, which establishes the equivalence of the pooled OLS estimator across all time periods, (2.24), and the cross-sectional OLS estimators using the transformed variable $\dot{Y}_{ir}$ for each $r \in \{S, \ldots, T\}$ separately. Essentially the same argument shows that the coefficients on the interaction terms $D_i \cdot f_{rt} \cdot \dot{\mathbf{X}}_i$ in (2.24) are the same as the coefficients on $D_i \cdot \dot{\mathbf{X}}_i$ in (2.21). □

APPENDIX 2B PROOF OF THE DETRENDING PROCEDURE

For a cohort $g$, where we require $g \ge 3$ so there are at least two pre-treatment periods, define the $(g - 1) \times 2$ matrix
\[
\mathbf{J}_{g-1} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & g - 1 \end{pmatrix} \tag{2B.1}
\]
and let $\mathbf{Y}_{i,g-1}(\infty)$ be the $(g - 1) \times 1$ vector
\[
\mathbf{Y}_{i,g-1}(\infty) = \left[ Y_{i1}(\infty), \ldots, Y_{i,g-1}(\infty) \right]'. \tag{2B.2}
\]
A similar definition holds for $\mathbf{U}_{i,g-1}(\infty)$. Also, let $\mathbf{M}_{i,g-1}$ be the $(g - 1) \times 1$ vector with elements $m_t(\mathbf{X}_i)$, $t = 1, \ldots, g - 1$.
Note that we can write
\[
\mathbf{Y}_{i,g-1}(\infty) = \mathbf{J}_{g-1} \begin{pmatrix} h(\mathbf{D}_i, \mathbf{X}_i) \\ \mathbf{D}_i \eta \end{pmatrix} + \mathbf{M}_{i,g-1} + \mathbf{U}_{i,g-1}(\infty) \equiv \mathbf{J}_{g-1} \mathbf{Q}_i + \mathbf{M}_{i,g-1} + \mathbf{U}_{i,g-1}(\infty) \tag{2B.3}
\]
Now regress $\mathbf{Y}_{i,g-1}(\infty)$ on $\mathbf{J}_{g-1}$ and obtain the coefficients,
\[
\hat{\mathbf{B}}_{i,g-1} = \left( \mathbf{J}_{g-1}' \mathbf{J}_{g-1} \right)^{-1} \mathbf{J}_{g-1}' \mathbf{Y}_{i,g-1}(\infty) = \mathbf{Q}_i + \left( \mathbf{J}_{g-1}' \mathbf{J}_{g-1} \right)^{-1} \mathbf{J}_{g-1}' \left[ \mathbf{M}_{i,g-1} + \mathbf{U}_{i,g-1}(\infty) \right] \tag{2B.4}
\]
For $r \ge g$, the prediction of $Y_{ir}(\infty)$ using the unit-specific linear trend up through period $g - 1$ is
\[
\hat{Y}_{irg}(\infty) \equiv (1, r) \hat{\mathbf{B}}_{i,g-1} \tag{2B.5}
\]
Use these predicted values to detrend $Y_{ir}(\infty)$:
\[
\begin{aligned}
\ddot{Y}_{irg}(\infty) &\equiv Y_{ir}(\infty) - \hat{Y}_{irg}(\infty) = Y_{ir}(\infty) - (1, r) \hat{\mathbf{B}}_{i,g-1} \\
&= (1, r) \mathbf{Q}_i + m_r(\mathbf{X}_i) + U_{ir}(\infty) - (1, r) \left\{ \mathbf{Q}_i + \left( \mathbf{J}_{g-1}' \mathbf{J}_{g-1} \right)^{-1} \mathbf{J}_{g-1}' \left[ \mathbf{M}_{i,g-1} + \mathbf{U}_{i,g-1}(\infty) \right] \right\} \\
&= m_r(\mathbf{X}_i) + U_{ir}(\infty) - (1, r) \left\{ \left( \mathbf{J}_{g-1}' \mathbf{J}_{g-1} \right)^{-1} \mathbf{J}_{g-1}' \left[ \mathbf{M}_{i,g-1} + \mathbf{U}_{i,g-1}(\infty) \right] \right\}
\end{aligned} \tag{2B.6}
\]
The expression in (2B.6) shows that $\ddot{Y}_{irg}(\infty)$ does not depend on $\mathbf{D}_i$. In particular,
\[
E\left[ \ddot{Y}_{irg}(\infty) \mid \mathbf{D}_i, \mathbf{X}_i \right] = E\left[ \ddot{Y}_{irg}(\infty) \mid \mathbf{X}_i \right] = m_r(\mathbf{X}_i) - (1, r) \left( \mathbf{J}_{g-1}' \mathbf{J}_{g-1} \right)^{-1} \mathbf{J}_{g-1}' \mathbf{M}_{i,g-1} \tag{2B.7}
\]
This conclusion is practically important because it shows that the vector of treatment cohort indicators, $\mathbf{D}_i$, is unconfounded with respect to the detrended variable $\ddot{Y}_{irg}(\infty)$, conditional on $\mathbf{X}_i$. Note how this extends the argument in Section 2.4 where, instead of $\mathbf{J}_{g-1}$ having rows $(1, t)$, its rows simply consisted of unity.

Now the modification to the arguments in Section 2.4 is straightforward. In place of (2.28) we have
\[
Y_{ir}(g) - Y_{ir}(\infty) = \ddot{Y}_{irg}(g) - \ddot{Y}_{irg}(\infty) + \left[ \hat{Y}_{irg}(g) - \hat{Y}_{irg}(\infty) \right], \tag{2B.8}
\]
where $\ddot{Y}_{irg}(g) \equiv Y_{ir}(g) - \hat{Y}_{irg}(g)$, with $\hat{Y}_{irg}(g)$ the predicted value from (2B.5) but with $\mathbf{Y}_{i,g-1}(g)$ and $Y_{ir}(g)$ in place of $\mathbf{Y}_{i,g-1}(\infty)$ and $Y_{ir}(\infty)$. Now take the expectation conditional on $D_{ig} = 1$:
\[
\tau_{gr} = E\left[ Y_{ir}(g) - Y_{ir}(\infty) \mid D_{ig} = 1 \right] = E\left[ \ddot{Y}_{irg}(g) - \ddot{Y}_{irg}(\infty) \mid D_{ig} = 1 \right] + E\left[ \hat{Y}_{irg}(g) - \hat{Y}_{irg}(\infty) \mid D_{ig} = 1 \right] \tag{2B.9}
\]
The second term in (2B.9) is zero by no anticipation because $\hat{Y}_{irg}(g)$ and $\hat{Y}_{irg}(\infty)$ are the same linear functions of the potential outcomes in periods $\{1, 2, \ldots, g - 1\}$. Therefore,
\[
\tau_{gr} = E\left[ \ddot{Y}_{irg}(g) - \ddot{Y}_{irg}(\infty) \mid D_{ig} = 1 \right] \tag{2B.10}
\]

2B.1 Monte Carlo Simulations

In this supplementary section, we conduct Monte Carlo simulations to study the finite-sample properties of our proposed estimators and compare them with competing approaches. We evaluate the performance of five estimators. The first is the POLS/ETWFE estimator in Wooldridge (2021), which is efficient under a commonly imposed set of assumptions (but is not doubly robust). Next are three of our rolling estimators: regression adjustment (RA), inverse-probability-weighted regression adjustment (IPWRA), and propensity score matching (PSM). The final estimator is that of CS (2021), who apply the augmented IPW (AIPW) estimator, a doubly robust estimator different from IPWRA. We use the never treated group as the control in CS (2021), whose transformation is in equation (2.38).

2B.2 Common Timing Case

We consider the common timing case. Recall from Theorem 2.1 that the POLS method in Wooldridge (2021) and regression adjustment using our rolling method are the same in the common timing case. Therefore, we have four estimators in the simulations.
We assume that Assumptions NA and CPTC hold, along with overlap. However, we consider scenarios where the functional forms of the conditional means and of the propensity score can be misspecified. For each scenario, we use $T = 6$ with the first treatment in $S = 4$, which implies three post-treatment periods. Across Monte Carlo simulations we draw random samples of sizes 100, 500, and 1,000. We report the bias, Monte Carlo standard deviation (SD), and root mean squared error (RMSE) of each estimator. All simulations use 1,000 Monte Carlo replications.

Data Generation

We generate the data as follows. Two control variables are included: $\mathbf{X} = (X_1, X_2)$, where $X_1$ and $X_2$ are independent with $X_1 \sim Gamma(2, 2)$ [and so $E(X_1) = 4$] and $X_2 \sim Bernoulli(0.6)$. The treatment indicator, $D$, has propensity score
\[
p(\mathbf{x}) = P(D = 1 \mid \mathbf{X} = \mathbf{x}) = \frac{\exp(\mathbf{Z}_1 \gamma_1)}{1 + \exp(\mathbf{Z}_1 \gamma_1)} \tag{2B.11}
\]
where the propensity score index function, $\mathbf{Z}_1 \gamma_1$, is
\[
\mathbf{Z}_1 \gamma_1 = -1.2 + \frac{(X_1 - 4)}{2} - X_2 \tag{2B.12}
\]
Our second step is generating heterogeneous treatment effects as follows:
\[
\tau_r(\mathbf{X}) = \theta \cdot \sum_{r=S}^{T} (r - S + 1)^{-1} + \lambda_r \cdot h(\mathbf{X}), \quad r \in \{S, \ldots, T\}, \tag{2B.13}
\]
where $\theta = T - S + 1$ and $\lambda_r$ is a time-varying parameter, set as $(\lambda_S, \ldots, \lambda_T) = (0.5, 0.6, 1.0)$ in each simulation. This setup allows the dynamic effects of being treated to vary across time and to increase as the length of exposure to the treatment increases. We consider two different functional forms of $h(\mathbf{X})$ in the simulations. The first is
\[
h(\mathbf{X}) = \frac{(X_1 - 4)}{2} + \frac{X_2}{3} \tag{2B.14}
\]
In the second, $h(\mathbf{X})$ includes a quadratic in $X_1$ and an interaction between $X_1$ and $X_2$:
\[
h(\mathbf{X}) = \frac{(X_1 - 4)}{2} + \frac{X_2}{3} + \frac{(X_1 - 4)^2}{4} + \frac{(X_1 - 4) \cdot X_2}{2} \tag{2B.15}
\]
We generate the potential outcomes in the untreated state as
\[
Y_t(0) = \delta_t + C + \beta_t \cdot f(\mathbf{X}) + U_t(0), \tag{2B.16}
\]
where $\delta_t = t$ is a time-specific component, $C \mid D, \mathbf{X} \sim Normal(2, 1)$ is an individual-specific component, and $U_t(0) \mid D, \mathbf{X} \sim Normal(0, 4)$ is the time-varying shock. The time-varying $\beta_t$ allows the effect of the covariates on potential outcome paths to vary across time. For each simulation, the parameters are fixed as below:
\[
\beta' = (\beta_1, \beta_2, \ldots, \beta_T) = (1.0, 1.5, 0.8, 1.5, 2, 2.5) \tag{2B.17}
\]
We consider two functional forms for $f(\mathbf{X})$:
\[
f(\mathbf{X}) = \frac{(X_1 - 4)}{3} + \frac{X_2}{2} \tag{2B.18}
\]
and
\[
f(\mathbf{X}) = \frac{(X_1 - 4)}{3} + \frac{X_2}{2} + \frac{(X_1 - 4)^2}{3} + \frac{(X_1 - 4) \cdot X_2}{4} \tag{2B.19}
\]
Finally, the post-treatment period outcome in the treated state is generated as
\[
Y_t(1) = \begin{cases} Y_t(0), & t < S \\ Y_t(0) + \tau_t + U_t(1) - U_t(0), & t \ge S \end{cases}
\]
where $U_t(1) \mid D, \mathbf{X} \sim Normal(0, 4)$.

For each simulation, all estimators that involve estimating the conditional means of $Y_t$ assume the correct model is linear in $\mathbf{X}$. Therefore, when (2B.14) and (2B.18) are used in the simulations, the conditional mean is correctly specified. However, when the data are generated as in (2B.15) and (2B.19), a quadratic term $X_1^2$ and an interaction term $X_1 \cdot X_2$ are present, and so the conditional mean is misspecified. We also consider a case where $(X_1 - 4)^2/2$ is added to the index function in the propensity score, so that the estimated logit model, which is always estimated assuming an index linear in $X_1$ and $X_2$, is misspecified:
\[
\mathbf{Z}_2 \gamma_2 = -1.2 + \frac{(X_1 - 4)}{2} - X_2 + \frac{(X_1 - 4)^2}{2} \tag{2B.20}
\]
Table 2B.1 describes the basic setup for each scenario.

Table 2B.1 Scenarios with Common Timing

Scenario | Conditional mean correctly specified? | $h(\mathbf{X})$ | $f(\mathbf{X})$ | Propensity score correctly specified? | PS index function
1C | Yes | (B.4) | (B.8) | Yes | (B.3)
2C | Yes | (B.4) | (B.8) | No | (B.10)
3C | No | (B.5) | (B.9) | Yes | (B.3)
4C | No | (B.5) | (B.9) | No | (B.10)
Simulation Results

This section reports the simulation results for two of the scenarios listed in Table 2B.1: Scenarios 1C and 3C.

The results for Scenario 1C are shown in Table 2B.2. Here the conditional means and the propensity score are correctly specified, so we expect all estimators to have little bias. This is indeed the case, with the biases being trivial as a percentage of the effect sizes. The biases are small even when $N = 100$. Because the POLS estimator, which is the same as RA on the transformed outcome, is best linear unbiased, it is not surprising that it produces notably smaller standard deviations than PSM and CS (2021). For example, with $N = 500$ and for $\tau_4$, the PSM SD is about 37 percent higher than the POLS/RA SD and the CS SD is about 25 percent higher. Because POLS/RA averages all three pre-treatment periods, its main competitor is the doubly robust IPWRA estimator, which also averages the three pre-treatment periods. When $N = 500$, the rolling IPWRA estimator has SDs that are, at most, three percent higher than those for POLS. In terms of RMSE, the POLS estimator is uniformly better in Table 2B.2; again, this is not surprising because POLS is the BLUE. When $N = 1{,}000$, the rolling IPWRA estimator has RMSEs that are just slightly larger than those of POLS. For example, the RMSE of rolling IPWRA for $\tau_6$ is 0.399, slightly higher than that of the POLS/RA estimator, 0.379.

Table 2B.2 Scenario 1C: When $E(Y_t \mid \mathbf{X} = \mathbf{x})$ and $p(\mathbf{x})$ Are Correctly Specified

N = 100 (sample ATTs: $\tau_4$ = 3.326, $\tau_5$ = 4.800, $\tau_6$ = 5.858)
Estimator | $\tau_4$ Bias / SD / RMSE | $\tau_5$ Bias / SD / RMSE | $\tau_6$ Bias / SD / RMSE
POLS/RA | −0.002 / 1.241 / 1.241 | 0.006 / 1.220 / 1.220 | 0.036 / 1.285 / 1.285
PSM | 0.020 / 1.784 / 1.784 | 0.130 / 1.803 / 1.807 | 0.195 / 1.820 / 1.831
IPWRA | −0.014 / 1.318 / 1.318 | 0.018 / 1.352 / 1.352 | 0.046 / 1.380 / 1.381
CS (2021) | 0.015 / 1.534 / 1.534 | 0.036 / 1.554 / 1.554 | 0.065 / 1.576 / 1.577

N = 500 (sample ATTs: $\tau_4$ = 3.218, $\tau_5$ = 4.809, $\tau_6$ = 5.992)
POLS/RA | 0.008 / 0.541 / 0.541 | −0.036 / 0.537 / 0.538 | −0.010 / 0.552 / 0.552
PSM | 0.002 / 0.893 / 0.893 | 0.001 / 0.931 / 0.931 | 0.084 / 0.939 / 0.943
IPWRA | 0.009 / 0.566 / 0.566 | −0.034 / 0.562 / 0.563 | −0.009 / 0.579 / 0.579
CS (2021) | 0.011 / 0.662 / 0.662 | −0.033 / 0.659 / 0.660 | −0.009 / 0.684 / 0.684

N = 1,000 (sample ATTs: $\tau_4$ = 3.220, $\tau_5$ = 4.802, $\tau_6$ = 5.959)
POLS/RA | 0.006 / 0.375 / 0.375 | 0.009 / 0.382 / 0.382 | 0.020 / 0.378 / 0.379
PSM | 0.023 / 0.710 / 0.710 | 0.055 / 0.673 / 0.676 | 0.101 / 0.679 / 0.686
IPWRA | 0.007 / 0.395 / 0.395 | 0.007 / 0.411 / 0.411 | 0.021 / 0.398 / 0.399
CS (2021) | −0.008 / 0.474 / 0.474 | −0.006 / 0.486 / 0.486 | 0.007 / 0.476 / 0.476

Note 1: The population $R$-squared values are about 0.39, 0.36, and 0.36, respectively.
Note 2: The average propensity score is about 0.26.

Table 2B.3 reports the findings for Scenario 3C, where now the conditional means are misspecified because the linear regressions omit the terms $X_1^2$ and $X_1 \cdot X_2$. Because the propensity score is correctly specified in this scenario, POLS is the only estimator that, theoretically, will exhibit systematic bias. The rolling IPWRA and CS (2021) approaches are still consistent, as is PSM. In these simulations, the POLS estimator does have the most bias, although it is fairly small as a fraction of the effect sizes. For instance, for $N = 1{,}000$, the bias of the POLS/RA estimator for $\tau_6$ is 0.154, which is at least five times larger (in absolute value) than those of the PSM, IPWRA, and CS (2021) estimators: 0.032, −0.002, and −0.008, respectively. Nevertheless, 0.154 is still small as a percentage of the effect size, 6.361.
Table 2B.3 Scenario 3C: When $E(Y_t \mid \mathbf{X} = \mathbf{x})$ Is Misspecified and $p(\mathbf{x})$ Is Correctly Specified

N = 100 (sample ATTs: $\tau_4$ = 3.550, $\tau_5$ = 4.975, $\tau_6$ = 6.295)
Estimator | $\tau_4$ Bias / SD / RMSE | $\tau_5$ Bias / SD / RMSE | $\tau_6$ Bias / SD / RMSE
POLS/RA | −0.034 / 1.412 / 1.413 | 0.104 / 1.406 / 1.410 | 0.222 / 1.568 / 1.583
PSM | −0.060 / 1.797 / 1.798 | 0.071 / 1.867 / 1.868 | 0.054 / 1.966 / 1.967
IPWRA | −0.099 / 1.431 / 1.434 | 0.025 / 1.433 / 1.433 | 0.071 / 1.561 / 1.563
CS (2021) | −0.100 / 1.751 / 1.753 | 0.009 / 1.734 / 1.734 | 0.053 / 1.863 / 1.864

N = 500 (sample ATTs: $\tau_4$ = 3.418, $\tau_5$ = 5.053, $\tau_6$ = 6.356)
POLS/RA | 0.079 / 0.608 / 0.614 | 0.085 / 0.613 / 0.619 | 0.197 / 0.670 / 0.699
PSM | 0.032 / 0.827 / 0.828 | −0.015 / 0.863 / 0.863 | 0.081 / 0.878 / 0.882
IPWRA | 0.034 / 0.618 / 0.619 | −0.013 / 0.619 / 0.619 | 0.047 / 0.665 / 0.666
CS (2021) | 0.055 / 0.764 / 0.766 | 0.006 / 0.763 / 0.763 | 0.062 / 0.830 / 0.833

N = 1,000 (sample ATTs: $\tau_4$ = 3.440, $\tau_5$ = 5.017, $\tau_6$ = 6.361)
POLS/RA | 0.044 / 0.402 / 0.404 | 0.091 / 0.426 / 0.436 | 0.154 / 0.452 / 0.477
PSM | 0.031 / 0.569 / 0.570 | 0.017 / 0.576 / 0.577 | 0.032 / 0.616 / 0.617
IPWRA | 0.000 / 0.408 / 0.408 | −0.011 / 0.423 / 0.423 | −0.002 / 0.448 / 0.448
CS (2021) | −0.003 / 0.513 / 0.513 | −0.015 / 0.523 / 0.523 | −0.008 / 0.559 / 0.559

Note 1: The population $R$-squared values are about 0.41, 0.38, and 0.38, respectively.
Note 2: The average propensity score is about 0.17.

In some cases, the smaller SD of POLS gives it the smallest RMSE even when it is biased. Among the consistent estimators, rolling IPWRA is the most efficient, and in several cases the rolling IPWRA estimator has the smallest RMSE overall. For example, the RMSEs for $\hat{\tau}_6$ with $N = 1{,}000$ are 0.477, 0.617, 0.448, and 0.559 for POLS, PSM, IPWRA, and CS (2021), respectively. Our application of IPWRA to the transformed outcome that uses all pre-treatment periods not only reduces bias compared with POLS, but also largely preserves the efficiency of the POLS estimator.

APPENDIX 2C PRE-PERIOD DYNAMICS AND EVENT STUDY PLOTS

This section describes how we compute transformed outcomes in the pre-treatment periods to estimate pre-trend treatment effects and plot them in the event study graph.

2C.1 Demeaning for Pre-treatment Periods, $\dot{Y}_{itg}$

For each unit $i$, group $g$, and pre-treatment time period $t \in \{1, 2, \ldots, g - 1\}$, we define
\[
\dot{Y}_{itg} \equiv Y_{it} - \frac{1}{g - t - 1} \sum_{q=t+1}^{g-1} Y_{iq} \tag{2C.1}
\]
This transformation removes the average of the future pre-treatment outcomes (from $t + 1$ to $g - 1$), helping to anchor pre-treatment dynamics relative to a consistent future baseline, $t = g - 1$. This is in the spirit of a "rolling" transformation. For example, to obtain the transformed outcome at time 2 for group 6, $\dot{Y}_{i26}$, we subtract the average of the future pre-treatment outcomes ($q \in \{3, 4, 5\}$) from the outcome at $t = 2$. The final pre-treatment period, $g - 1$, serves as the reference point, yielding a value of zero and anchoring the dynamics accordingly.

2C.2 Detrending for Pre-treatment Periods, $\ddot{Y}_{itg}$

To obtain the detrended outcome variable, we follow the same logic as in Appendix 2C.1, but now anchor the two most recent pre-treatment periods, $g - 2$ and $g - 1$, since at least two periods are necessary to estimate the fitted value $\hat{Y}_{itg}$. Formally, for pre-treatment periods $t \in \{1, 2, \ldots, g - 3\}$, the detrended outcome is defined as
\[
\ddot{Y}_{itg} \equiv Y_{it} - \hat{Y}_{itg} \tag{2C.2}
\]
where $\hat{Y}_{itg}$ is the fitted value from regressing $Y_{iq}$ on $q$ using the future pre-treatment periods $q \in \{t + 1, \ldots, g - 1\}$. This procedure removes potential linear (or more general) trends in the pre-treatment outcome trajectory, yielding a detrended outcome series. A small sketch of both pre-period transformations follows.
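As an illustration, here is a minimal sketch of the pre-period transformations (2C.1) and (2C.2) for a single unit, with hypothetical argument names; it is a sketch under the stated conventions, not the code used for the reported estimates.

```python
import numpy as np

def pre_period_transform(y, g, detrend=False):
    """Transform pre-treatment outcomes for one unit in cohort g.
    y[t-1] holds Y_it for t = 1, ..., T. Demeaning (2C.1) needs at least
    one future pre-period; detrending (2C.2) needs at least two."""
    need = 2 if detrend else 1
    out = {}
    for t in range(1, g - need):            # t = 1, ..., g-2 (or g-3)
        future = np.arange(t + 1, g)        # future pre-periods t+1, ..., g-1
        if detrend:
            slope, intercept = np.polyfit(future, y[future - 1], deg=1)
            out[t] = y[t - 1] - (intercept + slope * t)   # (2C.2)
        else:
            out[t] = y[t - 1] - y[future - 1].mean()      # (2C.1)
    return out
```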
For valid extrapolation, at least two future pre-treatment periods are required (i.e., $g - t - 1 \ge 2$).

2C.3 Estimation

Once the transformed outcome variables are constructed, the estimation procedure follows the same steps as Procedures 3.1, 4.1, and 5.1. For identification, we dynamically define the control group based on whether pre-treatment period $t$ for group $g$ precedes or follows the first treatment period $S$. Specifically, if $t < S$, the control group includes all not-yet-treated and never-treated units. If $t \ge S$, the control group consists only of units treated at $t + 1$ or later, as well as never-treated units. This staggered adoption rule ensures that the comparison group remains untreated at each point of comparison.

The weighted ATTs in the pre-treatment periods capture the pre-treatment effects. Under the no anticipation assumption, we expect $WATT(r)$ for $r < 0$ to be approximately zero. Significant deviations may indicate dynamic selection or violations of the identifying assumptions. When using the demeaning transformation, the pre-treatment analysis applies to periods $r \le -2$, since the transformation anchors outcomes at $t = g - 1$ (i.e., $r = -1$). Similarly, with the detrending transformation, the analysis applies to periods $r \le -3$.

APPENDIX 2D ESTIMATION RESULTS

In this section, Table 2D.1 reports the descriptive statistics of the variables used in our analysis.

Table 2D.1 Descriptive Statistics of Variables

Variable | Obs | Mean | Std. Dev. | Min | Max
log(Retail Emp) | 29,440 | 7.754502 | 1.281587 | |
log(Wholesale Emp) | 29,440 | 6.413699 | 1.482646 | |
Share of population above poverty | 29,440 | .8470385 | .0619999 | |
Share of population in manufacturing | 29,440 | .0998018 | .0501518 | |
Share of population with high school degree | 29,440 | .092258 | .0256816 | |
Treated cohort (1,280 counties) | | | | 1986 | 1999

Table 2D.2 presents estimation results corresponding to the visual representation in Figure 2.2, the retail labor market case. We estimate standard errors of the weighted ATTs with the bootstrap, using 100 repetitions. The first three columns display estimates of the weighted ATTs from the ETWFE (Wooldridge, 2021), CS (2021), and our rolling IPWRA estimator (without unit-specific detrending), each of which is biased under violations of the parallel trends assumption. The fourth and fifth columns present the point estimates from our rolling RA and rolling IPWRA estimators, which appropriately account for potential heterogeneous linear trends.
Table 2D.2 Effects of Walmart Opening on log(Retail Employment), ATT($r$) with bootstrap SE in parentheses

r | ETWFE | CS (2021) | Rolling IPWRA | Rolling RA, het. trend | Rolling IPWRA, het. trend
0 | 0.041 (0.006) | 0.023 (0.003) | 0.018 (0.004) | 0.002 (0.004) | 0.007 (0.004)
1 | 0.073 (0.007) | 0.054 (0.004) | 0.045 (0.004) | 0.035 (0.005) | 0.032 (0.005)
2 | 0.073 (0.008) | 0.054 (0.005) | 0.038 (0.004) | 0.029 (0.006) | 0.025 (0.006)
3 | 0.075 (0.009) | 0.055 (0.006) | 0.032 (0.004) | 0.014 (0.007) | 0.021 (0.007)
4 | 0.081 (0.010) | 0.059 (0.007) | 0.031 (0.004) | 0.019 (0.009) | 0.018 (0.009)
5 | 0.091 (0.011) | 0.067 (0.008) | 0.036 (0.005) | 0.015 (0.010) | 0.017 (0.010)
6 | 0.101 (0.012) | 0.077 (0.009) | 0.040 (0.005) | 0.022 (0.012) | 0.019 (0.012)
7 | 0.119 (0.013) | 0.096 (0.010) | 0.054 (0.006) | 0.005 (0.013) | 0.036 (0.013)
8 | 0.132 (0.014) | 0.110 (0.011) | 0.062 (0.008) | 0.016 (0.016) | 0.041 (0.016)
9 | 0.137 (0.016) | 0.120 (0.013) | 0.063 (0.010) | 0.015 (0.019) | 0.041 (0.019)
10 | 0.158 (0.019) | 0.138 (0.015) | 0.081 (0.013) | 0.032 (0.023) | 0.037 (0.023)
11 | 0.166 (0.023) | 0.146 (0.018) | 0.083 (0.018) | 0.007 (0.031) | 0.018 (0.030)
12 | 0.167 (0.030) | 0.153 (0.023) | 0.080 (0.026) | −0.008 (0.036) | 0.017 (0.036)
13 | 0.206 (0.043) | 0.191 (0.032) | 0.107 (0.039) | 0.046 (0.054) | 0.047 (0.053)

2D.1 Wholesale Employment

In this section, we report the estimation results for county-level wholesale employment. Figure 2D.1 presents the estimates from the CS (2021) approach and our rolling IPWRA estimators.

Figure 2D.1 Effects of Walmart Opening on log(Wholesale Employment): Panel (a) CS approach; Panel (b) rolling IPWRA with unit-specific demeaning; Panel (c) rolling IPWRA with unit-specific detrending

In the top panels of Figure 2D.1, the weighted ATT estimates from the CS (2021) approach and the rolling IPWRA estimator (without unit-specific detrending) mirror the retail employment results, showing an upward pre-treatment trend. In contrast, panel (c) of Figure 2D.1 displays almost flat pre-treatment trends, indicating no significant pre-treatment effects. Additionally, it is noteworthy that the effects of Walmart entry on wholesale employment turn negative when the rolling IPWRA estimator is applied after removing county-specific linear trends. This reversal in the sign of the coefficients aligns with the findings of Brown and Butts (2023), who use factor models to relax the parallel trends assumption. Table 2D.3 shows estimates in line with Figure 2D.1.
Table 2D.3 Effects of Walmart Opening on log(Wholesale Employment), ATT($r$) with bootstrap SE in parentheses

r | ETWFE | CS (2021) | Rolling IPWRA | Rolling RA, het. trend | Rolling IPWRA, het. trend
0 | 0.041 (0.011) | 0.008 (0.007) | 0.031 (0.007) | −0.009 (0.008) | −0.003 (0.008)
1 | 0.032 (0.012) | −0.002 (0.008) | 0.018 (0.007) | −0.018 (0.012) | −0.019 (0.012)
2 | 0.030 (0.014) | −0.005 (0.011) | 0.014 (0.007) | −0.035 (0.013) | −0.031 (0.014)
3 | 0.027 (0.016) | −0.009 (0.013) | 0.007 (0.008) | −0.052 (0.016) | −0.043 (0.017)
4 | 0.031 (0.018) | −0.007 (0.015) | 0.008 (0.009) | −0.021 (0.021) | −0.051 (0.021)
5 | 0.039 (0.020) | 0.004 (0.019) | 0.015 (0.010) | −0.071 (0.023) | −0.051 (0.024)
6 | 0.047 (0.022) | 0.013 (0.021) | 0.019 (0.011) | −0.038 (0.027) | −0.055 (0.028)
7 | 0.047 (0.024) | 0.016 (0.024) | 0.015 (0.014) | −0.096 (0.033) | −0.052 (0.034)
8 | 0.052 (0.026) | 0.019 (0.026) | 0.015 (0.016) | −0.043 (0.037) | −0.068 (0.038)
9 | 0.058 (0.029) | 0.028 (0.028) | 0.016 (0.019) | −0.120 (0.043) | −0.065 (0.044)
10 | 0.109 (0.034) | 0.073 (0.034) | 0.063 (0.023) | −0.045 (0.049) | −0.058 (0.049)
11 | 0.113 (0.040) | 0.062 (0.037) | 0.057 (0.031) | −0.149 (0.059) | −0.099 (0.058)
12 | 0.098 (0.049) | 0.057 (0.043) | 0.034 (0.040) | −0.239 (0.067) | −0.125 (0.067)
13 | 0.130 (0.062) | 0.112 (0.051) | 0.055 (0.060) | −0.219 (0.075) | −0.074 (0.078)

Our rolling estimators with county-level detrending account for heterogeneous linear trends in a straightforward manner, allowing the application of various treatment effect estimators. This underscores the importance of capturing variation in trends across counties, variation that may not be captured by observed covariates but can significantly influence estimation results. Estimators that do not explicitly account for heterogeneous trends yield substantially different results, which are less convincing as causal estimates.

CHAPTER 3 SIMPLE APPROACHES TO INFERENCE WITH DIFFERENCE-IN-DIFFERENCES ESTIMATORS WITH SMALL CROSS-SECTIONAL SAMPLE SIZES (CO-AUTHORED WITH JEFFREY M. WOOLDRIDGE)

3.1 Introduction

Difference-in-differences methods with panel data have a long history in the evaluation of policy interventions. In the case of common intervention timing, various methods are available for allowing treatment effects to vary by treatment period as well as by control variables. Wooldridge (2021, 2025) shows how to estimate flexible linear models by pooled OLS or fixed effects (they are equivalent) in a way that allows straightforward inference, provided there are enough cross-sectional units and the data are independently sampled across units. With a relatively small time series sample size ($T$) and large enough numbers of control ($N_0$) and treated ($N_1$) units, standard errors robust to arbitrary serial correlation and heteroskedasticity are readily available. Other methods, such as the Callaway and Sant'Anna (2021) long differencing methods, also rely on large cross-sectional sample sizes for inference.

As shown in Lee and Wooldridge (2023) [LW (2023)], regression-based difference-in-differences (DiD) estimators can be obtained via cross-sectional regressions after simple time-series transformations of the data at the unit level. This characterization of DiD estimators permits application of many widely used treatment effects estimators in the DiD setting, including matching, doubly robust estimators using the propensity score, and even machine learning causal methods. Here we use the representations in LW (2023) to obtain simple inference when the usual inference obtained from clustering at the individual unit may be problematical.
Our approach in the current paper combines the algebraic equivalence in LW (2023) with an idea from Donald and Lang (2007), who propose collapsing individual-level data to aggregated groups when the assignment of an intervention is at the group level. Here, the collapsing of the data is at the unit level across time in order to exploit the equivalences between panel data DiD estimators and those based on cross-sectional regressions. In addition to the standard DiD estimators, the approach extends easily to allow for unit-specific trends. Also, time-constant control variables can be added if the cross-sectional sample sizes are large enough to avoid degeneracies.

(The co-author has approved the inclusion of this co-authored chapter. The co-author's contact: Jeffrey M. Wooldridge, Department of Economics, 486 W. Circle Drive, 110 Marshall-Adams Hall, Michigan State University, East Lansing, MI 48824-1038. Email: wooldri1@msu.edu)

The method can also be useful in cases where $N_0$ and $N_1$ (the sizes of the control and treatment groups, respectively) are not particularly small but where the number of time series observations is reasonably large. In such cases, using the panel structure and clustering by cross-sectional unit to account for serial dependence can cause distortions in the inference. The approach we suggest here requires no modification for large $T$. In fact, because inference is based on a cross-sectional regression, we need not worry about strong dependence in the time series dimension.

Recently, Simonsohn (2021), commenting on Young (2019), argues that a particular version of the heteroskedasticity-robust standard error, proposed by Davidson and MacKinnon (1993) (and labeled "hc3" in the popular Stata statistical package), can produce satisfactory results even in Young's setting with $N_0 = 18$ and $N_1 = 2$. Therefore, in some cases one might feel comfortable using heteroskedasticity-robust inference.

A popular method for studying interventions with a small number of treated units (with one being the most common) is the synthetic control method (SCM), pioneered by Abadie et al. (2010), where pre-intervention data are used to obtain a weighted average of "donor" control units that serves as a single synthetic control. More recently, Arkhangelsky, Athey, Hirshberg, Imbens, and Wager (2021) propose a unification of traditional DiD and SC methods, called synthetic difference-in-differences (SDiD). The SDiD approach allows more flexibility than either DiD or SC, and inference methods are available. However, SDiD assumes that the series are weakly dependent (often called "integrated of order zero," or $I(0)$) and that $N_0$, $T_0$, and $T_1$ are suitably "large." The authors also impose normality in obtaining inference, although it is not clear how important that is in practice.

Compared with the SDiD approach, the method we propose in this paper has some advantages and disadvantages. The advantages are that one need not have a very large control group nor a large number of time periods. Because inference is based on a cross-sectional regression under the classical linear model assumptions, there are minimal restrictions on $N_0$, $N_1$, $T_0$, and $T_1$. Moreover, we can use a cross-sectional regression for each treated time period, allowing for different estimated effects along with confidence intervals. We can also aggregate to estimate a single effect, which is what the SDiD approach produces.
The SDiD approach does not explicitly allow for heterogeneous trends, something that is easily accommodated using the exact approach in this paper. The downside of the cross-sectional approach here is that it can be more biased than SDiD when differences in pre-trends are complicated, although SDiD is more biased for simple heterogeneous trends. Our cross-sectional approach can also be considerably less efficient than SDiD. We provide evidence of that via simulations. Nevertheless, sometimes SDiD will be less efficient. Moreover, the popular SDiD packages that produce the estimators and standard errors can produce confidence intervals that are too optimistic. We view the methods proposed in this paper as a complement to SDiD, providing another tool for estimating treatment effects when the number of treated or control units is small.

We start with the common timing case in Section 3.2, first summarizing the algebraic equivalences that allow one to apply cross-sectional regression to obtain the DiD estimate(s) and standard errors. Section 3.3 extends the framework to accommodate heterogeneous trends and seasonal effects, demonstrating how to implement our approach in these more complex settings. In Section 3.4, we compare our method to the Synthetic Control (SC) and Synthetic Difference-in-Differences (SDiD) approaches, highlighting differences in their underlying assumptions. Section 3.5 presents simulation results evaluating the performance of these estimators. We find that SC and SDiD, especially the former, can exhibit substantially more bias than removing heterogeneous trends. In Section 3.6, we revisit the California smoking restrictions passed in 1989, using the data in Abadie, Diamond, and Hainmueller (2010). Section 3.7 addresses the staggered rollout case and applies our method to measure the effects of castle laws on homicides. We conclude in Section 3.8.

3.2 The Common Timing Case

3.2.1 Estimating a Single ATT

Initially, the setting is that a total of $T$ time periods are available, and an intervention occurs at time $S$, $S \in \{2, \ldots, T\}$. A cross-sectional unit, $i$, is either subjected to the intervention, which remains in place from period $S$ through $T$, or it is a control unit. Treatment status is indicated by the binary variable
\[
D_i = \begin{cases} 0 & \text{if a control unit} \\ 1 & \text{if a treated unit} \end{cases} \tag{3.1}
\]
The time-varying treatment indicator is
\[
W_{it} = D_i \cdot post_t \tag{3.2}
\]
where $post_t = 1$ if $t \in \{S, S + 1, \ldots, T\}$ and zero otherwise. In what follows, we assume that the data draws are independent and identically distributed across $i$ but allow general dependence and changing distributions across $t$. Later, we provide a discussion of how one can accommodate clustering or spatial correlation if the cross-sectional sample sizes are reasonably large.

The simple DiD estimator can be obtained as the coefficient $\hat{\tau}_{DD}$ on $W_{it}$ from the pooled regression
\[
Y_{it} \text{ on } 1, D_i, post_t, W_{it}, \quad i = 1, \ldots, N; \; t = 1, \ldots, T \tag{3.3}
\]
Equivalently, one can replace $D_i$ and $post_t$ with a full set of unit and time period fixed effects, resulting in the popular two-way fixed effects (TWFE) estimator. [The equivalence follows from Wooldridge (2021, 2010).] When $N$ is small (or $N_1$ is small), inference using the TWFE estimator is tricky because one should allow serial correlation in the underlying time-varying errors, $U_{it}$, that are implicit in the equation. The usual method of clustering by unit $i$ generally results in poor performance with small $N$ (or with small $N_0$ or $N_1$). A small illustration of the pooled regression (3.3) follows.
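For concreteness, here is a minimal, self-contained sketch of the pooled regression in (3.3) on a simulated toy panel; all names (unit, t, D, post, W, y) and the data-generating numbers are hypothetical placeholders, not part of the chapter's analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy balanced panel: N units, T periods, common intervention at S
N, T, S = 10, 6, 4
rng = np.random.default_rng(1)
df = pd.DataFrame({"unit": np.repeat(np.arange(N), T),
                   "t": np.tile(np.arange(1, T + 1), N)})
df["D"] = (df["unit"] < 3).astype(int)         # first three units treated
df["post"] = (df["t"] >= S).astype(int)
df["W"] = df["D"] * df["post"]                  # W_it = D_i * post_t, as in (3.2)
df["y"] = 1 + 0.5 * df["D"] + 0.2 * df["t"] + 2.0 * df["W"] + rng.normal(size=N * T)

# tau_hat_DD is the coefficient on W in the pooled regression (3.3)
fit = smf.ols("y ~ D + post + W", data=df).fit()
print(fit.params["W"])
```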
As is fairly well known, and recently shown in Lee and Wooldridge (2023) for cases with covariates, $\hat{\tau}_{DD}$ can also be obtained as follows. First, for each unit $i$, create
\[
\Delta\bar{Y}_i \equiv \frac{1}{(T - S + 1)} \sum_{t=S}^{T} Y_{it} - \frac{1}{(S - 1)} \sum_{t=1}^{S-1} Y_{it} \equiv \bar{Y}_{i,post} - \bar{Y}_{i,pre}, \tag{3.4}
\]
which is a simple transformation of the data within each unit. This results in a simple cross-sectional data set $\{(\Delta\bar{Y}_i, D_i) : i = 1, \ldots, N\}$, where $N$ can be large or small. In the second step, $\hat{\tau}_{DD}$ is obtained as the coefficient on $D_i$ from the simple regression
\[
\Delta\bar{Y}_i \text{ on } 1, D_i, \quad i = 1, \ldots, N. \tag{3.5}
\]
Of course, this regression leads to the formula for a difference in means:
\[
\hat{\tau}_{DD} = N_1^{-1} \sum_{i=1}^{N} D_i \cdot \Delta\bar{Y}_i - N_0^{-1} \sum_{i=1}^{N} (1 - D_i) \cdot \Delta\bar{Y}_i = \Delta\bar{Y}_{treat} - \Delta\bar{Y}_{control}. \tag{3.6}
\]
The benefit of the previous characterization of the DiD estimator is that we can think of a simple underlying model,
\[
\Delta\bar{Y}_i = \alpha + \tau D_i + U_i \tag{3.7}
\]
\[
E(U_i \mid D_i) = 0, \tag{3.8}
\]
where the zero conditional mean assumption holds when the difference-in-differences design identifies the average treatment effect on the treated, $\tau$. If $N$ is reasonably large, and we have several treated and control units, we can rely on asymptotic analysis and justify a heteroskedasticity-robust standard error for $\hat{\tau}_{DD}$ for computing a confidence interval for $\tau$. But when $N$ is small, or, say, $N_0$ is large but $N_1$ is small, we cannot rely on asymptotics. Nevertheless, it is possible that (3.7) satisfies the classical linear model (CLM) assumptions:
\[
U_i \mid D_i \sim Normal\left(0, \sigma_U^2\right) \tag{3.9}
\]
Under (3.7) and (3.9), conditional on $D_i$, $\hat{\tau}_{DD}$ has an exact normal distribution, and
\[
\frac{(\hat{\tau}_{DD} - \tau)}{se(\hat{\tau}_{DD})} \sim \mathcal{T}_{N-2} \tag{3.10}
\]
This result means that we can obtain exact tests of any null hypothesis, such as $H_0: \tau = 0$. Moreover, the usual confidence intervals, obtained using percentiles of the $\mathcal{T}_{N-2}$ distribution, have exact coverage. Relatedly, Hagemann (2025) assumes normality in the context of testing for a treatment effect when a single cluster is treated with many units within the cluster, allowing for different variances in the distributions of the control and treated units. Here, we do not require any particular number of time periods (units within a cluster) provided the normality assumption (3.9) holds.

In looking at the expression for $\hat{\tau}_{DD}$ in (3.6), it is clear that asymptotic theory will not apply when, say, $N_0$ is very large if $N_1$ is still small: we need the central limit theorem to apply to both terms in (3.6), and it will not generally provide a good approximation to the distribution if $N_1$ is small. The same is true if $N_0$ is small. Assuming (3.9) holds, the approach works in any setting with $N_0 \ge 1$, $N_1 \ge 1$, and $N = N_0 + N_1 \ge 3$. In particular, we can have a single treated unit, $N_1 = 1$, and many or few control units. When there is only a single treated unit, the $t$-statistic $\hat{\tau}_{DD}/se(\hat{\tau}_{DD})$ is known as the "studentized residual," which plays a role in outlier analysis. See, for example, Wooldridge (2020, Section 9.5). Viewing the $t$ statistic for $\hat{\tau}_{DD}$ as an outlier diagnostic makes intuitive sense: we are trying to determine whether the single treated unit, with $D_i = 1$, is an outlier compared with the control units. The approach just outlined is essentially that taken by Donald and Lang (2007), but here we apply it to panel data with either a short or long stretch of time. A minimal sketch of the two-step computation appears below.
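The following sketch implements (3.4) through (3.6) and the exact $t(N-2)$ interval from (3.10), reusing the hypothetical panel df and intervention period S from the previous snippet. The exact interval is valid only under the CLM assumptions (3.7) and (3.9); the last line shows an hc3-style heteroskedasticity-robust alternative, discussed further below.

```python
import statsmodels.api as sm
from scipy import stats

pre = df["t"] < S
dY = (df[~pre].groupby("unit")["y"].mean()
      - df[pre].groupby("unit")["y"].mean())          # Delta Ybar_i, as in (3.4)
D = df.groupby("unit")["D"].first()

# Regression (3.5): the slope is tau_hat_DD, the difference in means (3.6);
# it reproduces the coefficient on W from the pooled regression (3.3)
res = sm.OLS(dY.to_numpy(), sm.add_constant(D.to_numpy())).fit()
tau, se = res.params[1], res.bse[1]

# Exact 95% CI from the t(N-2) distribution, valid under (3.7) and (3.9)
crit = stats.t.ppf(0.975, len(dY) - 2)
print(tau, (tau - crit * se, tau + crit * se))

# hc3-style heteroskedasticity-robust SE (cf. Simonsohn, 2021) as an alternative
print(sm.OLS(dY.to_numpy(), sm.add_constant(D.to_numpy())).fit(cov_type="HC3").bse[1])
```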
If $T_0$ and $T_1$ are reasonably large, the normality in (3.9) may be an acceptable assumption because, if the data are weakly dependent across time, the central limit theorem implies that $\bar{Y}_{i,post}$ and $\bar{Y}_{i,pre}$ are approximately normal. Moreover, because we only require conditional normality of $\Delta\bar{Y}_i = \bar{Y}_{i,post} - \bar{Y}_{i,pre}$, strong dependence (such as a time-constant unobserved effect) is eliminated by differencing the post-intervention and pre-intervention averages. Underlying unit root processes also need not cause $\Delta\bar{Y}_i$ to deviate substantially from normality.

In addition to (technically) maintaining exact normality, another drawback of the exact inference approach is that it maintains homoskedasticity, that is, $Var(U_i \mid D_i) = Var(U_i)$. Nevertheless, recent simulations by Simonsohn (2021) are promising in that one version of the heteroskedasticity-robust standard error, often called "hc3" in the literature, can work well even with a small number of treated and control units. Thus, one may feel comfortable using heteroskedasticity-robust inference except in cases with a very small number of treated and/or control units.

In addition, a notable advantage of our method is its compatibility with randomization inference (RI), which does not rely on the normality assumption. RI enables the computation of exact p-values for testing the null hypothesis that all treatment effects are zero. The p-value is calculated as the proportion of permutation-based test statistics that are as extreme as or more extreme than the observed test statistic. Specifically, if the number of such permutations is $c$, and the total number of permutations is $N$, then the two-sided randomization p-value is defined as $c/N$. In practice, the user-written Stata command ritest can be employed to implement this procedure and obtain exact p-values. See, for example, Heß (2017).

There is another characterization of $\hat{\tau}_{DD}$ that will be useful when we allow unit-specific pre-trends in Section 3.3.

Procedure 2.1 (Unit-Specific Demeaning):

1. Obtain $\bar{Y}_{i,pre}$, for each $i$, using the pre-intervention regression
\[
Y_{it} \text{ on } 1, \quad t = 1, \ldots, S - 1, \tag{3.11}
\]
where the coefficient on unity (the only coefficient) gives $\bar{Y}_{i,pre}$.

2. Next, in each post-intervention period, obtain the out-of-sample residuals, or prediction errors:
\[
\dot{Y}_{it} = Y_{it} - \bar{Y}_{i,pre} = Y_{it} - \frac{1}{(S - 1)} \sum_{r=1}^{S-1} Y_{ir}, \quad t = S, \ldots, T \tag{3.12}
\]

3. It is easy to see that, when these out-of-sample residuals are averaged, we obtain the same regressand as in (3.5):
\[
\bar{\dot{Y}}_i \equiv \frac{1}{(T - S + 1)} \sum_{t=S}^{T} \dot{Y}_{it} = \bar{Y}_{i,post} - \bar{Y}_{i,pre} = \Delta\bar{Y}_i \tag{3.13}
\]

4. Obtain an average effect, $\hat{\tau}_{DM}$, and its standard error and confidence interval, from
\[
\bar{\dot{Y}}_i \text{ on } 1, D_i, \quad i = 1, \ldots, N. \; \square
\]

3.2.2 Adding Control Variables

As discussed in Lee and Wooldridge (2023), Procedure 2.1 uncovers the ATT under a no anticipation (NA) assumption and a parallel trends (PT) assumption. With $Y_{it}(0)$ and $Y_{it}(1)$ denoting the potential outcomes for unit $i$ in period $t$, the weakest version of NA is
\[
E[Y_{it}(1) - Y_{it}(0) \mid D_i = 1] = 0, \quad t = 1, \ldots, S - 1, \tag{3.14}
\]
so the potential outcomes are the same, on average, for the treated units in the pre-intervention periods. Sufficient, of course, is $Y_{it}(1) = Y_{it}(0)$, $t = 1, \ldots, S - 1$. The parallel trends assumption is that, for all $t = 2, \ldots, T$,
\[
E[Y_{it}(0) - Y_{i1}(0) \mid D_i] = E[Y_{it}(0) - Y_{i1}(0)] \equiv \delta_t, \tag{3.15}
\]
where $\delta_t$ is a constant that is unrestricted over time.
This assumption allows an unrestricted trend in the control state, but that trend must be the same across control and treated units. In effect, it allows treatment assignment, $D_i$, to be correlated with the level, say $Y_{i1}(0)$, but it rules out cases where assignment is based on differential trends in the control state. As shown in Lee and Wooldridge (2023), the PT assumption implies that, for some constant $\alpha$,
\[
E\left[\Delta\bar{Y}_i(0) \mid D_i\right] = \alpha, \tag{3.16}
\]
where $\Delta\bar{Y}_i(0) = \bar{Y}_{i,post}(0) - \bar{Y}_{i,pre}(0)$ is the same transformation appearing in (3.4), but applied to the potential outcomes. Equation (3.16) means that $\Delta\bar{Y}_i(0)$ is mean independent of $D_i$; as discussed in LW (2023), along with the NA assumption this implies that $\tau$ is identified and $\hat{\tau}_{DD}$ is (conditionally) unbiased and consistent. Here, we are relying on unbiasedness because of the small $N_1$ or small $N_0$.

LW (2023) also show that the PT assumption is easily relaxed by conditioning on controls. Given time-constant controls $\mathbf{X}_i$, a $1 \times K$ vector, the conditional PT assumption is
\[
E[Y_{it}(0) - Y_{i1}(0) \mid D_i, \mathbf{X}_i] = E[Y_{it}(0) - Y_{i1}(0) \mid \mathbf{X}_i] \equiv \alpha_t + \mathbf{X}_i \beta_t, \tag{3.17}
\]
where we have imposed linearity because we have little choice in a small-$N$ setting. Assumption (3.17) is an unconfoundedness assumption on $D_i$ once we condition on $\mathbf{X}_i$, but it is in terms of the differences $Y_{it}(0) - Y_{i1}(0)$ rather than the levels $Y_{it}(0)$. (This is what gives the DiD procedure its main advantage over cross-sectional regression.) In terms of the cross-sectional regression, it is easy to adapt LW (2023) to see that we should add $\mathbf{X}_i$ to the equation. To use exact inference, we would write
\[
\Delta\bar{Y}_i = \alpha + \tau D_i + \mathbf{X}_i \beta + U_i \tag{3.18}
\]
\[
U_i \mid D_i, \mathbf{X}_i \sim Normal\left(0, \sigma_U^2\right) \tag{3.19}
\]
Provided $N > K + 2$, (3.18) can be estimated by OLS, and, under the conditional normality assumption, the $t$ statistic has an exact $\mathcal{T}_{N-K-2}$ distribution and the corresponding confidence intervals are exact. Again, it is intriguing to think that, even without $N$ being too large, we might use heteroskedasticity-robust inference.

When $N_0$ and $N_1$ are both sufficiently large, and the support of $\mathbf{X}$ for $D = 1$ is contained in that for $D = 0$, full regression adjustment is preferred on theoretical grounds. Namely, $\hat{\tau}_{DD}$ is the coefficient on $D_i$ from the regression that includes interactions:
\[
\Delta\bar{Y}_i \text{ on } 1, D_i, \mathbf{X}_i, D_i \cdot \left(\mathbf{X}_i - \bar{\mathbf{X}}_1\right), \quad i = 1, \ldots, N,
\]
which is identical to running separate regressions for $D_i = 0$ and $D_i = 1$. Here, $\bar{\mathbf{X}}_1$ is the average of the covariates over the $D_i = 1$ subsample. Unfortunately, this cannot be done with small $N_1$, as it requires $N_0 > K + 1$ and $N_1 > K + 1$.

3.2.3 Estimating Separate Effects for Each Treated Period

Rather than estimating a single ATT, it takes little additional effort to estimate ATTs separately in each treated period. Define $\dot{Y}_{it}$ as in (3.12); this is used as the dependent variable in period $t$. Then $\hat{\tau}_{t,DD}$ for $t = S, \ldots, T$ is obtained from
\[
\dot{Y}_{it} \text{ on } 1, D_i, \quad i = 1, \ldots, N \tag{3.20}
\]
In order to conduct small-$N$ (or small-$N_1$) inference, we cannot rely on the central limit theorem in the post-treatment period. Also, we cannot easily obtain standard errors for linear combinations of the $\hat{\tau}_{t,DD}$ because they will be dependent in complicated ways due to serial correlation (unless we make ideal assumptions that rule out serial correlation). Nevertheless, we can obtain a CI for each $\tau_t = E[Y_{it}(1) - Y_{it}(0) \mid D_i = 1]$. A sketch of these period-by-period regressions appears below.
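Continuing the running example (hypothetical df and S), each pass of the loop below is just the cross-sectional regression (3.20) on the demeaned outcome (3.12); the exact-$t$ p-values are valid period by period under the CLM assumptions, as discussed above. This is a sketch, not the chapter's implementation.

```python
import statsmodels.api as sm

ybar_pre = df[df["t"] < S].groupby("unit")["y"].mean()   # pre-treatment mean in (3.12)
X = sm.add_constant(df.groupby("unit")["D"].first().to_numpy())

for t in range(S, int(df["t"].max()) + 1):
    ydot = df[df["t"] == t].set_index("unit")["y"] - ybar_pre   # Ydot_it, (3.12)
    fit_t = sm.OLS(ydot.to_numpy(), X).fit()
    # tau_hat_{t,DD} with its exact-t p-value under the CLM assumptions
    print(t, fit_t.params[1], fit_t.pvalues[1])
```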
The coefficients one obtains from (3.20) are identical to those from running a standard panel DiD using the control and treated units for the time periods $\{1, 2, \ldots, S - 1, t\}$, where $S \le t \le T$. That is, we include all pre-treatment periods. Again, with $N$ large enough, we can add controls $\mathbf{X}_i$. Note that, like the treatment effects themselves, the intercept and the slope parameters on $\mathbf{X}_i$ vary freely over the treatment periods.

3.2.4 Using Different Pre-Treatment Periods

The suggestion in the previous subsection is to use all pre-treatment periods in computing the pre-treatment averages. Under NA and parallel trends, that approach uses the most information. Nevertheless, one might want to ignore information in some pre-treatment time periods. The standard event-study approach uses only the period just before the treatment. This results in the transformation
\[
\mathring{Y}_{it} = Y_{it} - Y_{i,S-1}, \quad t = S, \ldots, T, \tag{3.21}
\]
which can then be used in place of $\dot{Y}_{it}$ in (3.20). This is the transformation used by Callaway and Sant'Anna (2021). Of course, one can use any average of $\{Y_{it} : t = 1, \ldots, S - 1\}$, even a weighted average, and the method would still be valid. If there is concern about violation of no anticipation, the average $\bar{Y}_{i,pre}$ could be replaced with
\[
\bar{Y}_{i,S_0} = \frac{1}{S_0} \sum_{r=1}^{S_0} Y_{ir} \tag{3.22}
\]
where $S_0 < S - 1$. A long difference also can be used: $Y_{it} - Y_{i,S_0}$.

3.3 Heterogeneous Trends and Seasonality in the Common Timing Case

LW (2023) show that, when the units have unit-specific trends, these can be removed before applying a modification of Procedure 2.1. For concreteness, the following procedure removes linear trends; it is easily modified to remove higher-order unit-specific trends.

Procedure 3.1 (Unit-Specific Detrending):

1. For each $i$, obtain $\hat{A}_i$, $\hat{B}_i$ from the pre-treatment periods by regressing on a constant and time:
\[
Y_{it} \text{ on } 1, t, \quad t = 1, \ldots, S - 1 \tag{3.23}
\]

2. For the post-treatment periods, remove the pre-treatment trends:
\[
\ddot{Y}_{it} = Y_{it} - \hat{Y}_{it} \equiv Y_{it} - \hat{A}_i - \hat{B}_i \cdot t, \quad t = S, \ldots, T, \tag{3.24}
\]
where $\hat{Y}_{it} \equiv \hat{A}_i + \hat{B}_i \cdot t$ is the projected value of $Y_{it}$ in a treated period using a trend obtained from the pre-treatment periods.

3. For each unit, average the adjusted outcomes:
\[
\bar{\ddot{Y}}_i \equiv \frac{1}{(T - S + 1)} \sum_{t=S}^{T} \ddot{Y}_{it} \tag{3.25}
\]

4. Obtain an average effect, $\hat{\tau}_{DT}$, and its standard error and confidence interval, from
\[
\bar{\ddot{Y}}_i \text{ on } 1, D_i, \quad i = 1, \ldots, N. \; \square \tag{3.26}
\]

We use "DT" to denote that $\hat{\tau}_{DT}$ is obtained from a detrending procedure. As before, even with $N_1 = 1$, we can perform inference under
\[
\bar{\ddot{Y}}_i = \alpha + \tau_{DT} \cdot D_i + U_i, \quad U_i \mid D_i \sim Normal\left(0, \sigma^2\right)
\]
With large enough $N$ we could include covariates, as in (3.17), which would allow for confounded assignment even after removing unit-specific linear trends.

Because Procedure 3.1 uses unit-specific trend removal, it may initially appear that the final regression used to obtain $\hat{\tau}_{DT}$ (or $\hat{\tau}_{DD}$ in Procedure 2.1) might suffer from the so-called "incidental parameters" problem. This is not the case, as we are simply transforming each unit using its own time series data. That is, $\Delta\bar{Y}_i$ and $\bar{\ddot{Y}}_i$ are just linear functions of $\{Y_{it} : t = 1, 2, \ldots, T\}$. Under independence across $i$, the result is cross-sectional data where the observations are independent. That allows us to at least argue for classical linear model analysis when $N$ is small. Without control variables, Lee and Wooldridge (2023) show that a sufficient condition is independence of the treatment indicator from the transformed outcome variable, that is, the outcome after removing unit-specific pre-treatment averages or trends.
Without 57 control variables, Lee and Wooldridge (2023) show that a sufficient condition is the independence of the treatment indicator from transformed outcome variable– that is, the outcome after removing unit-specific pre-treatment averages or trends. As in the case without trends, we can easily estimate separate effects for each 𝑡: for each 𝑡 ∈ {𝑆, ..., 𝑇 } obtain (cid:98) 𝜏𝑡,𝐷𝑇 from the sequence of regressions (cid:165)𝑌𝑖𝑡 on 1, 𝐷𝑖, 𝑖 = 1, ..., 𝑁 and possibly include X𝑖. As before, under the CLM assumptions we can perform inference on the 𝜏𝑡,𝐷𝑇 . The other strategies discussed in Section 3.2, such as leaving a gap before the intervention if (cid:98) one is concerned about anticipation, apply here as well. Rather than using all of the time periods to remove a trend, one could use second differencing, say, (cid:168)𝑌𝑖𝑡 ≡ (cid:0)𝑌𝑖𝑡 − 𝑌𝑖,𝑆−1 (cid:1) − (cid:0)𝑌𝑖,𝑆−1 − 𝑌𝑖𝑅(cid:1) , where 1 ≤ 𝑅 < 𝑆 − 1. The resulting estimator is a difference-in-difference-in-differences estimator. Such an estimator does not exploit averaging across pre- and post-treatment periods, likely making it more sensitive to violations of the normality assumption. When 𝑌𝑖𝑡 is the log of a positive outcome, a linear time trend is attractive. Nevertheless, with large enough 𝑆, we can make the pre-trend removal more flexible; most obviously by including higher powers, such as 𝑡2 and 𝑡3. However, removing too much of the variation in the outcome may make it difficult to detect effects of the intervention. With quarterly or monthly data, and maybe even with weekly data, it might make sense to remove seasonality at the unit level (perhaps in addition to a trend). In the first step of Procedure 2.1 or 3.1, we would simply include quarterly, monthly, or weekly dummy variables and obtain the deseasonalized/detrended outcomes. After that, the procedure is exactly the same. 3.4 Comparison with Synthetic Control Methods With a single treated unit, ADH (2010) propose algorithms to use pre-treatment data to obtain weights on “donors” to create a synthetic control (SC) for the treated unit. The weights are chosen 58 by an optimization algorithm so that all weights are nonnegative and sum to unity. In early SC applications, inference is essentially visual, relying on taking each control unit as a placebo treated unit and comparing the post-treatment gaps with those of the actual treated unit. More recently, Arkhangelsky et al. (2021) study the SC methods, and extensions, when the target parameter is the post-treatment time average of the ATTs – rather than estimating a differ- ent parameter in each time period. They propose the Synthetic difference-in-differences (SDiD) estimator, which can be interpreted as a weighted two-way fixed effects (TWFE) estimator. Specif- ically, SDiD assigns unit weights to balance the pre-treatment trends of treated and control units, and time weights to balance pre- and post-treatment periods within control units. These weights are obtained by solving regularized optimization problems that minimize dis- crepancies in pre-treatment outcome trajectories, thereby improving robustness to violations of the parallel trends assumption and enhancing efficiency. This approach allows SDiD to blend the strengths of both SC and DID while retaining valid large-sample inference guarantees. When it comes to staggered interventions, Arkhangelsky et al. (2021) suggest splitting the sample by adoption date into subsets and applying SDiD method separately to each subset. 
3.4 Comparison with Synthetic Control Methods

With a single treated unit, ADH (2010) propose algorithms that use pre-treatment data to obtain weights on "donors" to create a synthetic control (SC) for the treated unit. The weights are chosen by an optimization algorithm so that all weights are nonnegative and sum to unity. In early SC applications, inference is essentially visual, relying on taking each control unit as a placebo treated unit and comparing the post-treatment gaps with those of the actual treated unit.

More recently, Arkhangelsky et al. (2021) study SC methods, and extensions, when the target parameter is the post-treatment time average of the ATTs – rather than a different parameter in each time period. They propose the synthetic difference-in-differences (SDiD) estimator, which can be interpreted as a weighted two-way fixed effects (TWFE) estimator. Specifically, SDiD assigns unit weights to balance the pre-treatment trends of treated and control units, and time weights to balance pre- and post-treatment periods within control units. These weights are obtained by solving regularized optimization problems that minimize discrepancies in pre-treatment outcome trajectories, thereby improving robustness to violations of the parallel trends assumption and enhancing efficiency. This approach allows SDiD to blend the strengths of both SC and DiD while retaining valid large-sample inference guarantees.

When it comes to staggered interventions, Arkhangelsky et al. (2021) suggest splitting the sample by adoption date into subsets and applying the SDiD method separately to each subset; see the appendix on staggered adoption in AAHIW (2021). For instance, if units 5 and 6 begin treatment in period 5, units 3 and 4 begin in period 3, and units 1 and 2 are never treated, then the sample can be divided into two sub-samples: one including units 1, 2, 5, and 6 and another including units 1, 2, 3, and 4.

Inference for the SDiD estimator is developed under large $N_0$, large $T$ asymptotics. The key assumptions include independence of the error terms across $i$, allowing for standard central limit arguments as the number of control units increases. Within each unit, the errors are permitted to exhibit arbitrary serial correlation over time, provided they satisfy weak dependence conditions, while independence across units is maintained. Additionally, the validity of their placebo-based standard error estimation relies on approximate normality of the residuals.

Procedures 2.1 and 3.1 of our paper use relatively less sophisticated strategies in effectively choosing a synthetic control. However, it is important to remember that the chosen control is based on the outcome after the pre-treatment averages or pre-treatment trends have been removed. The average of the resulting residuals across all included control units effectively plays the role of the synthetic control for the residualized outcomes of the treated unit. In other words, the cross-sectional average of the controls,

$\left\{ N_0^{-1} \sum_{i=1}^{N_0} \dot{Y}_{it} : t = 1, \ldots, T \right\},$  (3.27)

is the synthetic control for the cross-sectional averages of the treated units,

$\left\{ N_1^{-1} \sum_{i=N_0+1}^{N_0+N_1} \dot{Y}_{it} : t = 1, \ldots, T \right\}.$  (3.28)

If we remove pre-treatment trends, then $\ddot{Y}_{it}$ replaces $\dot{Y}_{it}$. The hope is that after transforming $Y_{it}$ in the pre-treatment periods, the control residuals do a good job of tracking the treated residuals pre-treatment. When the procedure is successful, there should be close agreement in the two average residual series over the periods $t = 1, 2, \ldots, S-1$. The estimated treatment effects are obtained from the differences for $t \ge S$, and these are exactly the coefficients in (3.5) or (3.26).
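A short sketch of the series in (3.27) and (3.28) and their gap; the function name residual_series_gap is illustrative, not from the text.

import numpy as np

def residual_series_gap(Y_dot, D):
    """Average transformed-outcome series for controls (3.27) and treated
    units (3.28), plus their gap. Y_dot is the N x T array of transformed
    outcomes (demeaned or detrended); D is the treatment indicator."""
    control_avg = Y_dot[D == 0].mean(axis=0)   # plays the synthetic control role
    treated_avg = Y_dot[D == 1].mean(axis=0)
    return control_avg, treated_avg, treated_avg - control_avg

# Example: the gap series should hover near zero for t < S; for t >= S its
# values are exactly the per-period treatment effect estimates in the text.
rng = np.random.default_rng(2)
Y_dot = rng.normal(0, 1, (10, 8))
D = np.array([0] * 6 + [1] * 4)
ctrl, trt, gap = residual_series_gap(Y_dot, D)
print(gap.round(2))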
3.5 Monte Carlo Simulations

In this section, we design simulations to assess the performance of our proposed estimator in the presence of treatment effect heterogeneity under the common timing case. Our goal is to understand how well the estimator recovers the average treatment effects under realistic data-generating processes characterized by serial correlation, unit-specific heterogeneity, and dynamic treatment effects. In each simulation, we generate data for $N = 20$ units observed over $T = 20$ time periods, with treatment beginning at a common time in period $t = 11$, resulting in 10 pre-treatment and 10 post-treatment periods for each unit.

3.5.1 Data Generating Process

This section outlines the specific steps used to generate the simulated data.

Step 1. Unit characteristics. Each unit is endowed with two latent characteristics, unit-specific heterogeneity and a unit-specific time trend slope: $c_i \sim \mathcal{N}(0, \sigma_c^2)$ and $g_i \sim \mathcal{N}(1, \sigma_g^2)$, respectively, where $\sigma_c = 2$ and $\sigma_g = 1$ in our simulation. These characteristics are constant across time.

Step 2. Idiosyncratic error term. We generate serially correlated errors $u_{it}$ using an AR(1) process,

$u_{i1} \sim \mathcal{N}\left(0, \sqrt{\frac{2}{1-\rho^2}}\right), \quad u_{it} = \rho u_{i,t-1} + \varepsilon_{it}, \quad \varepsilon_{it} \sim \mathcal{N}(0, 2), \quad \rho = 0.75.$

This structure introduces persistent shocks into the outcome evolution.

Step 3. Potential outcomes. The potential outcomes in the control and treated states are defined as

$Y_{it}(0) = \lambda_t \cdot fs_t - c_i + g_i t + u_{it},$
$Y_{it}(1) = Y_{it}(0) + \delta_t \cdot fs_t + \nu_{it}, \quad \nu_{it} \sim \mathcal{N}(0, \sqrt{2}),$

where $fs_t$ is a time dummy, $\lambda_t$ represents time fixed effects, and $\delta_t$ is a time-varying treatment effect. We assume that $\delta_t = 0$ for $t < 11$. Then, the observed outcome is determined using the potential outcomes framework:

$Y_{it} = (1 - D_i) \cdot Y_{it}(0) + D_i \cdot Y_{it}(1),$

where $D_i \in \{0, 1\}$ is a binary indicator for treatment status. Treatment is assigned based on a logistic selection rule,

$\Pr(D_i = 1) = \mathbb{I}\left(\alpha_0 - \alpha_1 \cdot c_i + \alpha_2 \cdot g_i + \epsilon_i > 0\right), \quad \epsilon_i \sim \mathrm{Logistic}(0, 1).$

The average post-treatment effect is set to approximately 2 by construction. The detailed settings for each scenario are summarized in Table 3.1. Specifically, the time fixed effects $\lambda_t$ and the time-varying treatment effects $\delta_t$ are held constant, while the parameters governing the treatment assignment $\Pr(D_i = 1)$ vary across three scenarios. The probability of being treated decreases from Scenario 1 to Scenario 3. A code sketch replicating one draw from this DGP follows below.

Table 3.1 Simulation Setup Parameters Across Three Scenarios

Parameters                                   Scenario 1              Scenario 2              Scenario 3
Treatment Rule $(\alpha_0, \alpha_1, \alpha_2)$   $(-1, -\tfrac{1}{3}, \tfrac{1}{4})$   $(-1.5, \tfrac{1}{3}, \tfrac{1}{4})$   $(-2, \tfrac{1}{3}, \tfrac{1}{4})$
Time Fixed Effects $(\lambda_1, \ldots, \lambda_{20})$   (0, 0, 0, 0, 0.2, 0.6, 0.7, 0.8, 0.6, 0.9, 0.9, 1, 1.1, 1.3, 1.2, 1.5, 0.6, 1.4, 1.8, 1.9), common to all scenarios
Treatment Effects $(\delta_{11}, \ldots, \delta_{20})$   (1, 2, 3, 3, 3, 2, 2, 2, 1, 1), common to all scenarios

Note: only the treatment rule parameters vary across scenarios.
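A minimal sketch of one draw from this DGP, under two stated assumptions: the second argument of each $\mathcal{N}(\cdot,\cdot)$ is read as a standard deviation as printed, and $\nu_{it}$ is set to zero before period 11 so that no anticipation holds exactly. The function name simulate_panel is illustrative.

import numpy as np

def simulate_panel(alpha, N=20, T=20, S=11, rho=0.75, seed=0):
    """One draw from the Section 3.5.1 DGP; lam and delta follow Table 3.1."""
    rng = np.random.default_rng(seed)
    lam = np.array([0, 0, 0, 0, .2, .6, .7, .8, .6, .9,
                    .9, 1, 1.1, 1.3, 1.2, 1.5, .6, 1.4, 1.8, 1.9])
    delta = np.r_[np.zeros(S - 1), [1, 2, 3, 3, 3, 2, 2, 2, 1, 1]]
    c = rng.normal(0, 2, N)                         # unit heterogeneity
    g = rng.normal(1, 1, N)                         # unit-specific trend slopes
    # AR(1) errors with persistent shocks
    u = np.empty((N, T))
    u[:, 0] = rng.normal(0, np.sqrt(2 / (1 - rho ** 2)), N)
    for t in range(1, T):
        u[:, t] = rho * u[:, t - 1] + rng.normal(0, np.sqrt(2), N)
    t_grid = np.arange(1, T + 1)
    Y0 = lam - c[:, None] + g[:, None] * t_grid + u      # control state
    nu = rng.normal(0, np.sqrt(2), (N, T))
    nu[:, :S - 1] = 0                                    # assumption: no pre-period noise
    Y1 = Y0 + delta + nu                                 # treated state
    a0, a1, a2 = alpha                                   # logistic selection rule
    D = (a0 - a1 * c + a2 * g + rng.logistic(0, 1, N) > 0).astype(int)
    Y = np.where(D[:, None] == 1, Y1, Y0)
    return Y, D

Y, D = simulate_panel(alpha=(-1, -1 / 3, 1 / 4))         # Scenario 1 parameters
print(Y.shape, D.mean())

With $\alpha_0 = -1, -1.5, -2$, the implied treatment probabilities are roughly 0.32, 0.24, and 0.17, matching the three scenarios reported below.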
3.5.2 Simulation Results

Table 3.2 presents the simulation results for the three scenarios outlined in Table 3.1. Our proposed detrending estimator, whether paired with the standard error (HC0) or with the heteroskedasticity-consistent standard error (HC3), outperforms alternative methods such as the Synthetic Control (SC) and Synthetic Difference-in-Differences (SDiD) methods. Across all scenarios, the detrending estimator achieves the lowest root mean squared error (RMSE), indicating superior finite-sample performance.

Specifically, in Scenario 1, where the probability of treatment is higher ($P(D_i = 1) = 0.32$), the detrending estimator exhibits a negligible bias and an RMSE of approximately 1.73, significantly outperforming SC and SDiD, which exhibit large biases (−0.375 and 0.392, respectively) and correspondingly higher RMSEs. Similar patterns are observed in Scenarios 2 and 3.

In particular, while SC and SDiD exhibit a notable gap between the empirical standard deviation (SD) of the estimates and the average estimated standard errors (Avg SE) reported within each replication, our detrending estimator maintains a close alignment between the two, leading to more reliable inference and coverage rates close to the nominal 95%. For example, in Scenario 1, the SD for SC is 1.73, while the average standard error is 2.66, indicating that the method tends to overestimate its own uncertainty and produce wider-than-necessary confidence intervals. This conservative estimation contributes to coverage rates that remain close to or slightly above the nominal 95%.

Table 3.2 Simulation Results Across Three Scenarios

Scenario 1. $P(D_i = 1) = 0.32$
                    Average Effects    Bias      SD     RMSE    Coverage Rate   Avg SE
Sample AE                1.991           –        –       –          –             –
Demeaning                3.905         1.914     5.10   5.445       0.94          4.96
Detrending               2.000         0.009     1.73   1.734       0.96          1.74
Detrending (hc3)         2.000         0.009     1.73   1.734       0.96          1.87
SC                       1.616        −0.375     2.14   2.177       0.97          3.27
SDiD                     2.383         0.392     1.77   1.808       0.96          2.56

Scenario 2. $P(D_i = 1) = 0.24$
                    Average Effects    Bias      SD     RMSE    Coverage Rate   Avg SE
Sample AE                2.011           –        –       –          –             –
Demeaning                4.264         2.253     5.67   6.101       0.93          5.48
Detrending               1.969        −0.042     1.89   1.892       0.95          1.91
Detrending (hc3)         1.969        −0.042     1.89   1.892       0.93          2.04
SC                       1.622        −0.389     2.35   2.384       0.94          2.73
SDiD                     2.395         0.384     1.89   1.925       0.95          2.25

Scenario 3. $P(D_i = 1) = 0.17$
                    Average Effects    Bias      SD     RMSE    Coverage Rate   Avg SE
Sample AE                1.996           –        –       –          –             –
Demeaning                4.566         2.570     6.78   7.254       0.94          6.43
Detrending               2.161         0.165     2.37   2.380       0.95          2.26
Detrending (hc3)         2.161         0.165     2.37   2.380       0.91          2.60
SC                       2.962         0.966     2.92   3.078       0.92          2.85
SDiD                     2.615         0.619     2.35   2.435       0.94          2.33

Note: each simulation study has 1,000 replications with $N = 20$.

Theoretically, substantial bias would be expected to shift confidence intervals away from the true treatment effect and lower coverage rates. Although SC and SDiD exhibit noticeably larger biases compared to our detrending method, in our simulations the magnitude of these biases is not sufficiently large relative to their conservative standard errors to meaningfully reduce coverage. As a result, despite their higher bias, SC and SDiD still achieve coverage rates comparable to or slightly exceeding the nominal 95% level.

Moreover, in Scenario 3, where treatment assignment becomes rarer ($P(D_i = 1) = 0.17$), all estimators face greater challenges, but our detrending method still maintains the smallest RMSE and the most reliable inference properties compared to SC and SDiD.

Overall, these simulation results suggest that the detrending approach can offer substantial improvements in bias reduction, efficiency, and inference accuracy in scenarios with heterogeneous linear trends. However, its advantages are not universal. When a unit's pre-trends exhibit more complex patterns that cannot be adequately captured by linear detrending, our method will not always work better than existing alternatives.

3.6 Application to California Smoking Restrictions

We apply our methods to the problem analyzed in ADH (2010), estimating the effect of California's tobacco control program. The first year of the program is 1989, with 19 pre-treatment years (1970–1988) and 12 treatment years (1989–2000). We use the log of per capita cigarette sales as the outcome variable; ADH (2010) used the level of this variable, but the log seems more natural – partly to obtain a percentage effect and partly to make normality of the transformed outcome a better approximation. California is the only treated state. As in ADH (2010), we use $N_0 = 38$ potential control states after eliminating states that implemented some sort of anti-smoking program over the period.

As discussed in the previous section, SCM uses pre-treatment outcomes (and sometimes other variables) to create a synthetic control – in this case, a synthetic California. The weights for the 38 controls are chosen to make the SC as close to California as possible based on pre-treatment observables (primarily $Y_{it}$). These weights are then used to project what the untreated outcome for California would have been after the policy change; this projection is then compared with the actual post-intervention outcome.

3.6.1 Results

Figure 3.1 shows the graphs of (3.27) and (3.28) using Procedure 2.1, allowing for separate effects in each treated period. The solid pink line represents the outcome for the treated unit (California), while the dashed blue line corresponds to the synthetic control group. The vertical dashed line marks the intervention year (1989).
It is evident that the tracking is not particularly good, with the average of the controls initially lying below the treated unit and then above it by 1980. Since it includes all of the controls, it is not surprising that merely removing a pre-treatment average is not enough to make them similar to California prior to the intervention – even averaged across $N_0 = 38$. One possibility is to choose a subset of the controls that seem most similar to California. In effect, that is what the SCM does in a systematic way.

Figure 3.1 Removing Pre-Treatment Averages, All Controls

On the other hand, Figure 3.2 plots the average residuals for the controls against California after removing the state-specific trends, obtained using Procedure 3.1. This provides a much better fit up until the policy change.

Figure 3.2 Removing Pre-Treatment Linear Trends, All Controls

Now, we can interpret the gaps after the intervention date as the effects of the intervention at different time horizons. Figure 3.3 shows the estimated gap between detrended California and the average of the 38 detrended controls. Prior to the intervention in 1989, the gap ranges from about −1.4% to 3.5%, with an average very close to zero. After the intervention, the treated unit exhibits a sharper decline compared to the synthetic control, indicating a potential negative effect of the policy.

Figure 3.3 Gap Between California and Average of All Controls, State-Specific Linear Trend

Similarly, both the Synthetic Control and Synthetic DiD methods, using all 38 states as controls, display a similar pre-treatment fit between California and synthetic California (the control), as visualized in Figures 3.4 and 3.5, respectively.

Figure 3.4 Synthetic Control Method, All Controls

Unlike Synthetic Control, Synthetic DiD in Figure 3.5 allows for level differences (drift) between the pre-period trends of the treated and control units.

Figure 3.5 Synthetic DiD, All Controls

Table 3.3 reports the estimated effects and associated standard errors from four different estimation methods: (i) Demeaning – Procedure 2.1, (ii) Detrending – Procedure 3.1, (iii) the Synthetic Control Method, and (iv) Synthetic DiD. The first row, Average Effect, presents a single estimated effect averaged over all treated periods.

Table 3.3 Estimated ATTs, 38 Control (Donor) States

                                          Average Effect      $\tau_{1989}$      $\tau_{1995}$      $\tau_{2000}$
Procedure 2.1 (DiD)                      −0.422*** (0.121)   −0.168* (0.096)   −0.484*** (0.137)   −0.667*** (0.164)
Procedure 3.1 (Unit-Specific Detrending) −0.227** (0.094)    −0.043 (0.059)    −0.282** (0.112)    −0.403** (0.152)
SC                                       −0.304*** (0.112)
Synthetic DiD                            −0.286*** (0.097)

Note 1: standard errors in parentheses. *** p < 0.01, ** p < 0.05, * p < 0.1.
Note 2: for Procedure 3.1, the p-value for the average effect under the normality assumption is 0.021. Based on randomization inference (RI) with 1,000 replications, the p-value is 0.041. This result is reasonably close to the conventional value, supporting the robustness of the inference.

Table 3.3 also shows the estimated effects and associated standard errors for several time horizons, since our cross-sectional regression method can easily estimate separate effects for each $t$. As is evident from the figures, the estimated effect grows over time, starting off small but ending in year 2000 at −0.403 ($t = -2.65$).
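For completeness, here is a minimal sketch of the randomization inference referenced in Note 2, assuming the unit-level averaged transformed outcomes from Procedure 2.1 or 3.1 are already computed; the helper name ri_pvalue is illustrative.

import numpy as np

def ri_pvalue(Ydd_bar, D, n_perm=1000, seed=0):
    """Randomization-inference p-value for the average effect: permute the
    treated label(s) across units and compare placebo mean differences with
    the actual one (two-sided)."""
    rng = np.random.default_rng(seed)
    def mean_diff(d):
        return Ydd_bar[d == 1].mean() - Ydd_bar[d == 0].mean()
    actual = mean_diff(D)
    placebo = np.array([mean_diff(rng.permutation(D)) for _ in range(n_perm)])
    return np.mean(np.abs(placebo) >= np.abs(actual))

# With California as the single treated unit, this amounts to comparing its
# averaged post-treatment gap with the placebo gap of each control state.
rng = np.random.default_rng(7)
Ydd_bar = rng.normal(0, 1, 39); Ydd_bar[-1] -= 2.0   # one treated unit, index 38
D = np.r_[np.zeros(38, int), 1]
print(ri_pvalue(Ydd_bar, D))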
3.6.2 Robustness Check

To further evaluate the robustness of our state-specific detrending procedure, we contrast the results from Procedure 3.1 with those from synthetic control methods when the donor pool is restricted to a small set of states that, a priori, do not seem "similar" to California. While the previous section used all 38 potential control states, here we consider two alternative donor pools. The first group consists of Alabama (AL), Arkansas (AR), Louisiana (LA), and Mississippi (MS)—four southern states that differ substantially from California in socioeconomic characteristics. The second group includes Midwestern states—Illinois (IL), Iowa (IA), Minnesota (MN), and Ohio (OH). By limiting the set of controls in this way, we assess how each estimator performs when the availability of similar donor units is limited, providing additional insight into the robustness of our procedure.

3.6.2.1 AL, AR, LA and MS as Controls

Table 3.4 summarizes the estimates using Alabama (AL), Arkansas (AR), Louisiana (LA), and Mississippi (MS) as control states across four different estimation methods. The results from the detrending procedure closely resemble those obtained when all 38 states are used as controls, both in magnitude and statistical significance. In contrast, the synthetic control (SC) and synthetic DiD methods yield larger estimates in absolute value, reflecting the poorer pre-treatment fits observed in Figure 3.6.

Table 3.4 Estimated ATTs, AL, AR, LA, MS as Controls

                                          Average             1989              1995              2000
Procedure 2.1 (DiD)                      −0.556*** (0.080)   −0.247 (0.107)    −0.611*** (0.077)   −0.839*** (0.032)
Procedure 3.1 (Unit-Specific Detrending) −0.215** (0.039)    −0.027 (0.052)    −0.259** (0.055)    −0.377** (0.115)
SC                                       −0.571*** (0.034)
Synthetic DiD                            −0.392*** (0.030)

Note: standard errors in parentheses. *** p < 0.01, ** p < 0.05, * p < 0.1.

Panel (a) of Figure 3.6 shows how removing unit-specific linear trends fits when the controls consist of AL, AR, LA, and MS. The average residuals across these four states provide a remarkably good fit to the detrended California. By contrast, Panels (b) and (c) of Figure 3.6 show that neither the SC method nor Synthetic DiD can recover a synthetic California that fits well enough to produce a convincing estimate of the causal effect.

Figure 3.6 AL, AR, LA and MS as Controls: (a) State-Specific Detrending; (b) Synthetic Control; (c) Synthetic DiD

3.6.2.2 IL, IA, MN and OH as Controls

Similarly, we next employ four Midwestern states—Illinois (IL), Iowa (IA), Minnesota (MN), and Ohio (OH)—as control units. The corresponding results are presented in Table 3.5. As in the previous case with AL, AR, LA, and MS as controls, our unit-specific detrending approach achieves a strong pre-treatment fit, while the synthetic control and synthetic DiD methods perform poorly. As shown in Panel (a) of Figure 3.7, the detrending method closely aligns with the pre-treatment trends, whereas the alternative methods exhibit clear discrepancies.

Table 3.5 Estimated ATTs, IL, IA, MN and OH as Controls

                                          Average             1989              1995              2000
Procedure 2.1 (DiD)                      −0.413** (0.118)    −0.178* (0.071)   −0.462** (0.133)    −0.655** (0.183)
Procedure 3.1 (Unit-Specific Detrending) −0.198* (0.079)     −0.040 (0.045)    −0.239* (0.088)     −0.363* (0.136)
SC                                       −0.437** (0.184)
Synthetic DiD                            −0.275* (0.154)

Note: standard errors in parentheses. *** p < 0.01, ** p < 0.05, * p < 0.1.

These findings further support the robustness of our approach, even in settings with a limited donor pool.
Notably, while both synthetic DiD and the synthetic control method typically rely on a large number of control units to construct a reliable counterfactual, our method requires only minimal conditions (e.g., at least one treated and one control unit, with a total of three or more units), making it particularly useful when the donor pool is small.

Figure 3.7 IL, IA, MN and OH as Controls: (a) State-Specific Detrending; (b) Synthetic Control; (c) Synthetic DiD

3.7 Staggered Rollouts

We can extend the previous methods to the case of staggered interventions, where units are allowed to be first treated at different times. But one must use care in choosing a suitable control group.

3.7.1 The Setup and Estimation

Suppose now that the first treatment period is $S$, but treatment can also first occur in periods $S+1, \ldots, T$. Index the treatment cohorts, or groups, by $g \in \{S, S+1, \ldots, T\}$, the time of first treatment. Let $D_g$, $g = S, S+1, \ldots, T$, be the cohort indicators. Specifically, $D_{ig} = 1$ if and only if unit $i$ was first treated in period $g$. The never-treated units are indicated by $D_{i\infty} = 1$. There are $N_g$ units in each cohort, with $N_\infty$ being the number of never-treated units. With $N_\infty \ge 2$, the other treated cohorts may have only one unit. Naturally, if there are periods without newly treated units, there will be no treatment effects estimated in those periods.

The potential outcomes are $Y_t(g)$, $g = S, \ldots, T, \infty$, where $Y_t(g)$ is the outcome in period $t$ if the first period of treatment is $g$; $Y_t(\infty)$ is the never-treated (or control) state. The treatment effects are now indexed by cohort and time:

$\tau_{gt} = E\left[Y_t(g) - Y_t(\infty) \mid D_g = 1\right], \quad t = g, \ldots, T.$  (3.29)

We are also interested in estimating the cohort-specific average effects,

$\tau_g = \frac{1}{T-g+1} \sum_{t=g}^{T} \tau_{gt}.$  (3.30)

As is common, we make a no anticipation assumption:

$Y_t(g) = Y_t(\infty), \quad t < g,$

which says that all potential outcomes are the same as in the never-treated state in periods before the treatment occurs.

For a given cohort $g$, one now removes the average or a trend using data up through $g-1$, the final pre-treatment period. It is easiest to obtain the transformation for each unit in the sample, and then sort out the valid control units afterwards. For each $i$, one runs the regressions

$Y_{it}$ on $1$, $\quad t = 1, 2, \ldots, g-1.$  (3.31)

Because we have different treatment cohorts, we add a cohort subscript when projecting out to the post-treatment periods:

$\dot{Y}_{itg} = Y_{it} - \bar{Y}_{i,pre(g)}, \quad t = g, \ldots, T,$  (3.32)

where $\bar{Y}_{i,pre(g)}$ is the average of $Y_{it}$ for $t = 1, \ldots, g-1$. To remove a linear time trend, the regressions are

$Y_{it}$ on $1, t$, $\quad t = 1, 2, \ldots, g-1,$  (3.33)

and then the out-of-sample residuals are

$\ddot{Y}_{itg} = Y_{it} - \hat{A}_{ig} - \hat{B}_{ig} \cdot t, \quad t = g, \ldots, T.$  (3.34)

Once the transformed variables have been obtained, the key is choosing a valid control group at time $t$ for treatment cohort $g$ (which could have as few as one unit). One can use only the never-treated units at time $t$, provided $N_\infty \ge 2$. However, as shown in LW (2023), under suitable NA and PT assumptions, one can use any not-yet-treated (NYT) unit. Then, for cohort $g$ in time period $t$, the units satisfying

$D_{i,t+1} + D_{i,t+2} + \cdots + D_{iT} + D_{i\infty} = 1$  (3.35)

are valid controls. For efficiency reasons, one should use all possible controls (and this tends to help make the normality assumption more realistic). Nevertheless, one can use any subset (including the NT group). A sketch of the cohort-specific transformations is given below.
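A minimal sketch of the transformations in (3.32) and (3.34), assuming a balanced panel and a vector of first-treatment dates; cohort_transforms is an illustrative helper name.

import numpy as np

def cohort_transforms(Y, first_treat):
    """Y is N x T (periods 1,...,T); first_treat[i] is unit i's first treated
    period g, or 0 for never treated. Returns dicts mapping each cohort g to
    the demeaned (Y_dot) and detrended (Y_ddot) arrays over t = g,...,T."""
    first_treat = np.asarray(first_treat)
    N, T = Y.shape
    t_all = np.arange(1, T + 1)
    cohorts = sorted(set(first_treat[first_treat > 0]))
    Y_dot, Y_ddot = {}, {}
    for g in cohorts:
        pre, post = t_all < g, t_all >= g
        # (3.32): remove each unit's pre-g average
        Y_dot[g] = Y[:, post] - Y[:, pre].mean(axis=1, keepdims=True)
        # (3.34): remove each unit's pre-g linear trend, fit by least squares
        tp = t_all[pre].astype(float)
        X = np.column_stack([np.ones_like(tp), tp])
        coef, *_ = np.linalg.lstsq(X, Y[:, pre].T, rcond=None)   # 2 x N
        fitted_post = coef[0][:, None] + coef[1][:, None] * t_all[post]
        Y_ddot[g] = Y[:, post] - fitted_post
    return Y_dot, Y_ddot

# Example: 8 units, cohorts at t = 5, 7, 9, and three never-treated units
rng = np.random.default_rng(3)
Y = rng.normal(size=(8, 10))
first = np.array([0, 0, 0, 5, 5, 7, 7, 9])
Yd, Ydd = cohort_transforms(Y, first)
print({g: v.shape for g, v in Ydd.items()})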
Let $C_{i,t+1}$ be the control group chosen for period $t$, with the subscript indicating that these units cannot have been treated prior to $t+1$. Then $\hat{\tau}_{gt}$ is the coefficient on $D_{ig}$ in the cross-sectional regression

$\dot{Y}_{itg}$ on $1, D_{ig}$, using $D_{ig} + C_{i,t+1} = 1.$  (3.36)

Or, replace $\dot{Y}_{itg}$ with the detrended version, $\ddot{Y}_{itg}$. Under a homoskedastic normality assumption, exact inference can be used in (3.36) even if $N_g = 1$ and the number of control units is small.

For many purposes, we are more interested in the $\tau_g$, which we can aggregate to obtain a single, weighted effect – more on this below. Then, because we want inference without many units in the treated cohort, we use the never-treated units as the control and run a single, aggregated regression. Specifically, define

$\bar{\dot{Y}}_{ig} \equiv \frac{1}{T-g+1} \sum_{t=g}^{T} \dot{Y}_{itg}$ and $\bar{\ddot{Y}}_{ig} \equiv \frac{1}{T-g+1} \sum_{t=g}^{T} \ddot{Y}_{itg}.$  (3.37)

Then run the regression

$\bar{\dot{Y}}_{ig}$ on $1, D_{ig}$, $\quad i = 1, \ldots, N$, with $D_{ig} + D_{i\infty} = 1,$  (3.38)

to obtain $\hat{\tau}_g$ as the coefficient on $D_{ig}$, or use $\bar{\ddot{Y}}_{ig}$ in place of $\bar{\dot{Y}}_{ig}$. Again, this is a cross-sectional regression with independent observations. If $N_g$ and $N_\infty$ are suitably large to employ asymptotics, then we can use a heteroskedasticity-robust standard error (hc3) to obtain a $t$ statistic or confidence intervals. We can appeal to exact theory if $N_g$ is small, including $N_g = 1$.

If we want to aggregate the $\hat{\tau}_g$ to obtain a single treatment effect, a challenge is to obtain a valid standard error when cohort sizes are not particularly large. A natural parameter weights the cohort ATTs, $\tau_g$, by the cohort shares. This leads to the estimator

$\hat{\tau}_\omega = \sum_{g=S}^{T} \hat{\omega}_g \hat{\tau}_g,$  (3.39)

where

$\hat{\omega}_g = \frac{N_g}{N_S + N_{S+1} + \cdots + N_T}$  (3.40)

are the cohort shares. A standard error for $\hat{\tau}_\omega$ must account for the covariances among the $\hat{\tau}_g$, which is difficult if we cannot appeal to asymptotics. Even with a somewhat larger number of total treated units, a trick is helpful. We know from simple regression analysis that $\hat{\tau}_g$ is a difference in means:

$\hat{\tau}_g = \frac{1}{N_g} \sum_{i=1}^{N} D_{ig} \bar{\dot{Y}}_{ig} - \frac{1}{N_\infty} \sum_{i=1}^{N} D_{i\infty} \bar{\dot{Y}}_{ig},$  (3.41)

and so, by simple algebra,

$\hat{\tau}_\omega = \frac{1}{N_S + N_{S+1} + \cdots + N_T} \sum_{g=S}^{T} \sum_{i=1}^{N} D_{ig} \bar{\dot{Y}}_{ig} - \frac{1}{N_\infty} \sum_{g=S}^{T} \sum_{i=1}^{N} D_{i\infty} \hat{\omega}_g \bar{\dot{Y}}_{ig}.$  (3.42)

Define a treatment indicator

$D_i = D_{iS} + \cdots + D_{iT}$  (3.43)

and the number of treated units as

$N_{treat} = N_S + N_{S+1} + \cdots + N_T.$  (3.44)

Rearrange (3.42) to obtain

$\hat{\tau}_\omega = \frac{1}{N_{treat}} \sum_{i=1}^{N} D_i \cdot \left( \sum_{g=S}^{T} D_{ig} \bar{\dot{Y}}_{ig} \right) - \frac{1}{N_\infty} \sum_{i=1}^{N} D_{i\infty} \cdot \left( \sum_{g=S}^{T} \hat{\omega}_g \bar{\dot{Y}}_{ig} \right)$
$\equiv \frac{1}{N_{treat}} \sum_{i=1}^{N} D_i \cdot \bar{\dot{Y}}_i - \frac{1}{N_\infty} \sum_{i=1}^{N} D_{i\infty} \cdot \bar{\dot{Y}}_i$
$= \frac{1}{N_{treat}} \sum_{i=1}^{N} D_i \cdot \bar{\dot{Y}}_i - \frac{1}{N_{control}} \sum_{i=1}^{N} (1 - D_i) \cdot \bar{\dot{Y}}_i,$  (3.45)

where

$\bar{\dot{Y}}_i \equiv D_{iS} \cdot \bar{\dot{Y}}_{iS} + \cdots + D_{iT} \cdot \bar{\dot{Y}}_{iT} + D_{i\infty} \cdot \left( \sum_{g=S}^{T} \hat{\omega}_g \bar{\dot{Y}}_{ig} \right).$  (3.46)

Note that the representation in (3.45) uses the fact that $D_i \cdot D_{ig} = D_{ig}$. Also, $D_i + D_{i\infty} = 1$, and so (3.45) is a simple difference in the sample means of $\bar{\dot{Y}}_i$ between the eventually treated units and the never-treated group. In other words, run the cross-sectional regression

$\bar{\dot{Y}}_i$ on $1, D_i$, $\quad i = 1, \ldots, N,$  (3.47)

and $\hat{\tau}_\omega$ is the coefficient on $D_i$.
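A minimal sketch of (3.46)-(3.47), assuming the cohort-level averages $\bar{\dot{Y}}_{ig}$ are stored in a dictionary keyed by (unit, cohort); tau_omega and this data layout are illustrative choices, not from the text.

import numpy as np
import statsmodels.api as sm

def tau_omega(Ybar, first_treat):
    """Ybar[(i, g)] -> unit i's time-averaged transformed outcome for cohort g;
    first_treat[i] -> unit i's cohort, 0 if never treated. Returns the
    aggregated effect and its HC3 standard error from regression (3.47)."""
    first_treat = np.asarray(first_treat)
    cohorts = sorted(set(first_treat[first_treat > 0]))
    Ng = {g: int(np.sum(first_treat == g)) for g in cohorts}
    w = {g: Ng[g] / sum(Ng.values()) for g in cohorts}      # cohort shares (3.40)
    N = len(first_treat)
    Y_i = np.empty(N)
    for i in range(N):
        g_i = first_treat[i]
        if g_i > 0:                    # eventually treated: own-cohort average
            Y_i[i] = Ybar[(i, g_i)]
        else:                          # never treated: share-weighted average
            Y_i[i] = sum(w[g] * Ybar[(i, g)] for g in cohorts)
    D = (first_treat > 0).astype(float)
    fit = sm.OLS(Y_i, sm.add_constant(D)).fit(cov_type="HC3")
    return fit.params[1], fit.bse[1]

# Tiny example: cohorts 2 and 3, with two never-treated units (3 and 4)
Ybar = {(0, 2): 1.0, (1, 2): 1.2, (2, 3): 2.0,
        (3, 2): 0.1, (3, 3): -0.1, (4, 2): 0.0, (4, 3): 0.2}
print(tau_omega(Ybar, [2, 2, 3, 0, 0]))

Running the single regression, rather than averaging cohort-specific estimates, is what automatically accounts for the covariances among the $\hat{\tau}_g$, as discussed next.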
Naturally, if we instead use unit-specific detrending for each cohort date $g$ to obtain the $\bar{\ddot{Y}}_{ig}$, then we replace $\bar{\dot{Y}}_{ig}$ with $\bar{\ddot{Y}}_{ig}$ everywhere.

Obtaining $\hat{\tau}_\omega$ from the regression in (3.47) has some advantages. It automatically accounts for the correlations among the $\hat{\tau}_g$, and it can be used whether $N_{treat}$ is small or large (and the same for $N_{control}$). If $N_{treat}$ and $N_{control}$ are even moderately large, a heteroskedasticity-robust standard error from (3.47) is justified via asymptotic analysis.

Arkhangelsky et al. (2021) discuss how SDiD can be applied to staggered interventions. They suggest splitting the sample by adoption date and applying the SDiD method separately to each treatment cohort, using the never-treated cohort as the control [see the appendix in Arkhangelsky et al. (2021)]. In other words, they do exactly what we propose with our unit-specific demeaning or detrending. Therefore, our proposed methods and the way SDiD is implemented for staggered designs are directly comparable.

3.7.2 Application: Estimating the Effects of Castle Laws on Homicides

We apply the previous methods to the data set used in Cunningham (2021) on the adoption of so-called "castle" laws – or "stand-your-ground" laws – on homicide rates in the United States. A castle law typically allows individuals to use force, including deadly force, to defend themselves against an intruder in their home – without a duty to retreat.

The data set covers the 50 United States from 2000 through 2010. In 2005, one state adopted a castle law. In 2006, 13 more states did. The remaining treated cohorts are 2007 (four states), 2008 (two states), and 2009 (one state). Therefore, there are $N_{treat} = 21$ eventually treated units and $N_{control} = 29$ control units. The outcome variable is the log of annual homicides.

Using the regression in (3.47) with $\bar{\dot{Y}}_i$ defined in (3.46), the estimated aggregated treatment effect, $\hat{\tau}_\omega$, is about 0.092, or about 9.2% more homicides for a state adopting a castle law. The usual OLS standard error is 0.057, which gives $t \approx 1.61$. This is not quite significant at the 10% level against a two-sided alternative. The hc3 $t$ statistic is 1.50. Replacing $\bar{\dot{Y}}_i$ with $\bar{\ddot{Y}}_i$ – obtained from linear detrending – decreases the estimate to 0.067, with an hc3 standard error of 0.055, giving $t \approx 1.21$.

The synthetic DiD estimate, obtained using the sdid package in Stata 18, is 0.099, and the standard error, using the placebo method, is 0.069 ($t = 1.41$). Both the estimate and its precision are in close agreement with the demeaning method described in Section 3.7.1.

3.8 Additional Practical Considerations

3.8.1 Choosing the Number of Time Periods

With synthetic control-type approaches and the approaches we suggest here, one can study the robustness of the findings by adjusting the number of pre-treatment time periods. For example, when using the unit-specific detrending method, we can vary the starting point of the data as a way of evaluating how the estimates change as the unit-specific trends are removed using different stretches of data. This is in the spirit of Rambachan and Roth (2023), who formalize the idea of allowing a range of violations of parallel trends and studying the sensitivity of the estimates.

In many examples, the policy intervention is likely based on past outcomes (in the untreated state). Does one need to go back, say, 20 years to remove unit-specific intercepts and trends to account for the selection into treatment?
In many cases, fewer pre-treatment periods might suffice. Because our approach does not rely on large $T_0$ or $T_1$ (but does rely on normality), the robustness of the estimates can be studied by varying $T_0$ in particular.

3.8.2 Clustering and Spatial Correlation with Larger Cross Sections

Although the previous procedures are motivated by settings with few control or treated units, or few of both, our approach also has benefits in settings where $N_{control}$ and $N_{treat}$ are large enough to rely on large-$N$ asymptotic analysis. Recall that the basic DiD estimator is obtained from

$\bar{\dot{Y}}_i$ on $1, D_i$, $\quad i = 1, \ldots, N,$

and the one that removes pre-treatment unit-specific trends is

$\bar{\ddot{Y}}_i$ on $1, D_i$, $\quad i = 1, \ldots, N.$

We can even add controls with large enough $N$. In Section 3.7 we showed that the same regressions can be used in the case of staggered assignment by defining $D_i$ to be an "ever treated" indicator – see (3.43) – and by modifying $\bar{\dot{Y}}_i$ or $\bar{\ddot{Y}}_i$ as in equation (3.46).

Because these are cross-sectional regressions, it is straightforward to compute standard errors clustered at a level higher than $i$. For example, if $i$ is a county but we are studying a policy that varies only at the state level, we can cluster at the state level, provided we have a sufficiently large number of treated and control states. Again, it does not matter how large $T_0$ and $T_1$ are. Abadie, Athey, Imbens and Wooldridge (2023) discuss why standard errors should be clustered when the intervention is at a higher level than the unit. As another example, $i$ could be a household whereas the policy is applied at the village level. Clustering can be done separately by time period, too. The key is that these are cross-sectional regressions, so the usual clustering methods apply; a sketch of state-level clustering is given at the end of this subsection.

In situations with a spatial structure – for example, when the assignment of treated units in implementing a new policy may be spatially correlated – we can obtain standard errors using a covariance matrix estimator robust to heteroskedasticity and spatial correlation. These so-called "SHAC" standard errors are proposed in Conley (1999).
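A minimal sketch of state-level clustering for the cross-sectional regression of the transformed outcome on the treatment indicator; the column names (ybar, D, state) and simulated data are illustrative, not from the text.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_states, units_per_state = 40, 25
state = np.repeat(np.arange(n_states), units_per_state)
# policy varies only at the state level; county-level outcomes share a state shock
D = np.repeat((rng.random(n_states) < 0.5).astype(int), units_per_state)
state_shock = np.repeat(rng.normal(0, 1, n_states), units_per_state)
ybar = 0.5 * D + state_shock + rng.normal(0, 1, len(D))

df = pd.DataFrame({"ybar": ybar, "D": D, "state": state})
fit = smf.ols("ybar ~ D", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["state"]}
)
print(fit.params["D"], fit.bse["D"])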
3.9 Concluding Remarks

Building on Lee and Wooldridge (2023), we have proposed a simple approach to inference when using panel data with few treated units or few control units. In the common timing case and without controls, the unit-specific demeaning reproduces the standard difference-in-differences estimator for each treated period, and also averaged across the treated periods. We also show how the unit-specific detrending method of LW (2023) can be implemented. Estimation is simple, and inference follows under normality using the classical linear model assumptions taught in introductory econometrics. Because the method uses averages across time, a central limit theorem in the time dimension can often be used to justify the normality assumption. Heteroskedasticity is always an issue in cross-sectional regressions, and we propose using what is known as the hc3 version provided there are at least a handful of treated units.

The approach here is not intended to replace the very popular synthetic DiD method of AAHIW (2021) in cases where SDiD has natural advantages. However, our simulations show that, when there is sufficient heterogeneity in, say, linear time trends, our method of unit-specific detrending can have substantially less bias. In some cases the inference is more reliable.

Moreover, SDiD is not intended to apply to situations with few time periods, or with few donor units among which to choose controls. The SDiD asymptotics assume that $N_0$, $T_0$, and $T_1$ all increase – although our simulations suggest SDiD tends to work well more generally, provided trends are not too heterogeneous. Our approach also allows the estimation of separate effects in each treated period, a feature not offered by SDiD methods. Also, we showed in Section 3.7 that allowing for staggered interventions is straightforward, and obtaining an overall weighted estimate with a valid standard error can be done with a single cross-sectional regression.

Overall, we view our approach as complementary to SDiD methods for many applications. When applied to the California smoking data, the state-specific detrending, using all 38 control states, produces estimates and inference similar to SDiD when restricting attention to the overall average effect. In applying our approach to the staggered rollout of so-called castle laws, the rolling method based on unit-specific demeaning gives a very similar point estimate and standard error to SDiD. For cases with a small number of time periods, a small number of controls, or both, exact inference in a cross-sectional regression using our transformation is appealing.

BIBLIOGRAPHY

Abadie, A., Athey, S., Imbens, G. W. and Wooldridge, J. M. (2023), "When should you adjust standard errors for clustering?", The Quarterly Journal of Economics 138(1), 1–35.

Abadie, A., Diamond, A. and Hainmueller, J. (2010), "Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program", Journal of the American Statistical Association 105(490), 493–505.

Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W. and Wager, S. (2021), "Synthetic difference-in-differences", American Economic Review 111(12), 4088–4118.

Callaway, B. and Sant'Anna, P. H. (2021), "Difference-in-differences with multiple time periods", Journal of Econometrics 225(2), 200–230.

Conley, T. G. (1999), "GMM estimation with cross sectional dependence", Journal of Econometrics 92(1), 1–45.

Cunningham, S. (2021), Causal Inference: The Mixtape, Yale University Press.

Davidson, R. and MacKinnon, J. G. (1993), Estimation and Inference in Econometrics, Oxford University Press, New York.

Donald, S. G. and Lang, K. (2007), "Inference with difference-in-differences and other panel data", The Review of Economics and Statistics 89(2), 221–233.

Hagemann, A. (2025), "Inference with a single treated cluster", Review of Economic Studies, rdaf002.

Heß, S. (2017), "Randomization inference with Stata: A guide and software", The Stata Journal 17(3), 630–651.

Lee, S. J. and Wooldridge, J. M. (2023), "A simple transformation approach to difference-in-differences estimation for panel data", available at SSRN 4516518.

Rambachan, A. and Roth, J. (2023), "A more credible approach to parallel trends", Review of Economic Studies 90(5), 2555–2591.

Simonsohn, U. (2021), "Just run robust standard errors: A commentary on Young (2019)", working paper, available at http://urisohn.com/43.

Wooldridge, J. M. (2010), Econometric Analysis of Cross Section and Panel Data, MIT Press.

Wooldridge, J. M. (2020), Introductory Econometrics: A Modern Approach, Cengage Learning, Boston, MA.

Wooldridge, J. M. (2021), "Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators", available at SSRN 3906345.
Wooldridge, J. M. (2025), "Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators", latest version of Wooldridge (2021).

Young, A. (2019), "Channeling Fisher: Randomization tests and the statistical insignificance of seemingly significant experimental results", The Quarterly Journal of Economics 134(2), 557–598.

CHAPTER 4

ROLLING APPROACH TO DIFFERENCE-IN-DIFFERENCES: EXPLORING TREATMENT REVERSIBILITY AND MODERATING EFFECTS

4.1 Introduction

Difference-in-differences (DiD) methods have become indispensable for estimating causal effects in settings with staggered interventions and heterogeneous responses. Yet recent work has highlighted a critical limitation: when treatment effects vary across groups or over time, the traditional two-way fixed effects (TWFE) estimator can produce biased estimates under staggered intervention designs. To overcome this issue, several alternative estimators have been proposed to capture treatment effect heterogeneity more faithfully; see, for example, Callaway and Sant'Anna (2021), Wooldridge (2021), Sun and Abraham (2021), Borusyak et al. (2024), and De Chaisemartin and d'Haultfoeuille (2020).

Most of these approaches, however, rely on the assumption of absorbing treatments—once a unit receives treatment, it remains treated in all subsequent periods. But in many real-world settings, treatment status is neither permanent nor monotonic. For example, firms may adopt a policy temporarily, governments may roll out and later withdraw interventions, or—as in the setting examined in this paper—pharmacy chains may open and later close locations. Most existing methods are not well suited to handle such dynamic treatment regimes, and few provide formal guidance for estimation and aggregation under this complexity.

Recently, De Chaisemartin and d'Haultfoeuille (2024) propose a framework designed to identify treatment effects in contexts with non-binary, non-absorbing, and potentially lagged treatments, building on their prior work in De Chaisemartin and d'Haultfoeuille (2020). Their method accommodates reversible treatment states, allowing units to both enter and exit treatment over time, and explicitly accounts for dynamic effects that unfold across different exposure durations.

However, their approach defines treatment relative to exposure duration and classifies units based on their initial transition into treatment. Once a unit exits treatment, its subsequent observations are excluded from the analysis. As a result, the framework does not permit the estimation of post-exit effects, and re-entry into treatment is not considered. This limitation hinders the ability to assess whether treatment effects persist after discontinuation or to evaluate the impact of repeated exposures.

In contrast, this paper introduces a more flexible framework—building on Lee and Wooldridge (2023)—that enables the estimation of treatment effects both during and after treatment episodes. By aligning event time relative to treatment exposure (e.g., $t - g$, where $g$ is the treatment start period), I incorporate both on-treatment and post-exit periods in the analysis, facilitating the identification of lingering or decaying effects over time. This extension is particularly useful for evaluating whether the impact of an intervention persists or fades after discontinuation.
Furthermore, I explicitly account for re-entry into treatment—a case ignored in prior implementations—by identifying multiple treatment paths and estimating the effects of subsequent exposures separately from the initial one. This allows for direct comparisons between first-time and repeated treatments and offers a more comprehensive perspective on dynamic, non-monotonic treatment patterns. Together, these contributions enhance the applicability of dynamic DiD methods in real-world settings where treatment status can change multiple times, allowing one to evaluate whether being treated again has the same, weaker, or stronger impact than the first treatment.

In addition, I propose an extended doubly robust estimator that incorporates moderating effects across subgroups using an augmented inverse probability weighting and regression adjustment (IPWRA) approach. This method allows for explicit estimation of treatment effect heterogeneity across subgroups while preserving the desirable double robustness property. It also enables researchers to examine whether treatment effects vary across subgroups, allowing for more nuanced evaluations of treatment or policy interventions in the presence of demographic or structural heterogeneity.

The remainder of the paper is organized as follows. Section 4.2 outlines the setup and identifying assumptions underlying the proposed framework. In Section 4.3, I demonstrate how to incorporate moderating effects while leveraging a doubly robust estimator. Section 4.4 presents Monte Carlo simulation results to evaluate the performance of the proposed method. In Section 4.5, I apply the approach to data on the entry and exit of chain pharmacy stores to examine their impact on independently owned pharmacies in rural areas. Section 4.6 concludes.

4.2 Setup and Identification

This framework is designed to accommodate treatment patterns involving both staggered adoption and reversible treatment spells, aligning with and extending the work of Lee and Wooldridge (2023). By incorporating exit and re-entry episodes, we provide a comprehensive toolkit for analyzing dynamic treatment effects where treatment status may change multiple times throughout the study period.

Our approach initially focuses on binary treatment indicators, classifying units as either treated or untreated at each time point. I define the treatment indicator $D_{ig} \in \{0, 1\}$, where $D_{ig} = 1$ if unit $i$ is treated at time $g$, and $D_{ig} = 0$ otherwise. For notational simplicity, I denote the treatment indicator as $D_g \equiv D_{ig}$, suppressing the unit index when focusing solely on treatment timing. Future work will extend this framework to discrete treatment intensities, where units can receive varying levels of treatment $D_{g,t} \in \{0, 1, \ldots, K\}$, allowing for the analysis of policies where treatment intensity (e.g., dosage or funding levels) is a crucial factor.

We consider the following treatment patterns:

• Case 1: Binary Treatment with Staggered Adoption — This case mirrors the structure in Lee and Wooldridge (2023), where treatment is introduced at different times across units, without accounting for exit or re-entry.

• Case 2: Binary Treatment with Exit and Re-entry — Units may exit treatment and potentially re-enter at a later time, resulting in multiple treatment spells. This setting allows for robust comparisons by aligning units based on their treatment histories and adjusting for pre-treatment dynamics.
By accommodating a wider array of treatment patterns—including reversible and intermittent treatments—our framework provides a unified, flexible approach for researchers aiming to assess the dynamic and heterogeneous effects of policy interventions in realistic settings.

4.2.1 Identification Assumptions

To identify treatment effects under these settings, we introduce the following assumptions, which follow Lee and Wooldridge (2023). Following the potential outcomes framework, let $Y_t(g)$ denote the potential outcome at time $t$ for a unit treated at time $g$. For units that are never treated, we define the untreated state using $\infty$, such that $Y_t(\infty)$ represents the potential outcome under the control state at time $t$.

Assumption 1. Conditional No Anticipation (CNA): For $g \in \{S, \ldots, T\}$, $t \in \{1, \ldots, g-1\}$, and covariates $\mathbf{X}$,

$E\left[Y_t(g) \mid D_g = 1, \mathbf{X}\right] = E\left[Y_t(\infty) \mid D_g = 1, \mathbf{X}\right].$ □  (4.1)

This implies that treatment effects prior to the intervention are all zero.

Assumption 2. Conditional Parallel Trends for Treatment, Exit and Re-entry Groups (CPT for T.E.R): For $\mathbf{D} = (D_S, \ldots, D_T)$,

$E\left[Y_t(\infty) - Y_1(\infty) \mid \mathbf{D}, \mathbf{X}\right] = E\left[Y_t(\infty) - Y_1(\infty) \mid \mathbf{X}\right], \quad t = 2, \ldots, T.$ □  (4.2)

Units with identical outcome paths up to $t = g - 1$ are assumed to follow the same expected trends in the evolution of control-state outcomes, regardless of whether they exit or re-enter treatment after their initial treatment at time $g$. This allows for constructing a control group based on pre-treatment outcome paths, facilitating identification of treatment, exit, and re-entry effects.

Assumption 3. Overlap: For cohorts $g \in \{S, \ldots, T\}$ and time periods $r \in \{g, g+1, \ldots, T\}$,

$P\left(D_g = 1 \mid D_g + A_{r+1} = 1, \mathbf{X} = \mathbf{x}\right) < 1$ for all $\mathbf{x} \in \mathrm{Supp}(\mathbf{X}).$ □  (4.3)

This condition ensures that, within the subpopulation consisting of cohort $g$ together with the never-treated and not-yet-treated units at time $r$ (denoted by $A_{r+1} = 1$), every treated unit has a comparable control unit with the same values of the covariates.

4.2.2 Estimation Strategy: Treatment, Exit, and Re-entry Effects

The estimation procedure is structured in two main stages.

First Stage: General Effect Estimation. For each treatment group $g$, i.e., units whose initial treatment occurs at time $g$, we define a control group at time $t$ based on units with the same outcome paths up until time $g-1$ that are not yet treated at time $t$, which includes the never-treated units. At this stage, I focus solely on each group's first treatment timing $g$, without accounting for subsequent treatment, exit, or re-entry status. That is, we estimate the treatment effects on the treated, ATT(g, t), as if each group were treated at $g$ and then stayed treated up to $T$, so that its realized potential outcome is taken to be $Y_{it}(g)$. However, these units could "exit" or "re-enter" down the road. Thus, I define this estimated group-time-specific effect as a general treatment effect on the treated (GTT) at time $t$ for group $g$, as follows:

$GTT(g, t) = E\left[Y_{it}(g) - Y_{it}(\infty) \mid D_g = 1\right].$  (4.4)

To obtain $GTT(g, t)$, I simply apply Procedure 4.1 (Rolling Methods, Staggered Interventions) in Lee and Wooldridge (2023):

Step 1. For a given $g \in \{S, \ldots, T\}$ and time period $r \in \{g, g+1, \ldots, T\}$, compute

$\dot{Y}_{irg} \equiv Y_{ir} - \frac{1}{g-1} \sum_{s=1}^{g-1} Y_{is} \equiv Y_{ir} - \bar{Y}_{i,pre(g)}.$  (4.5)

Step 2. Choose as the control group the units with $D_{i,r+1} + D_{i,r+2} + \cdots + D_{iT} + D_{i\infty} = 1$ (or, if desired, a subset, such as the never-treated (NT) group).
Step 3. Using the subset of data with

$D_{ig} + D_{i,r+1} + D_{i,r+2} + \cdots + D_{iT} + D_{i\infty} = 1,$  (4.6)

apply standard treatment effect (TE) methods—such as linear regression adjustment (RA), inverse probability weighting (IPW), IPWRA, and matching—to each cross-sectional dataset $\{(\dot{Y}_{irg}, D_{ig}, \mathbf{X}_i) : i = 1, \ldots, N\}$ separately, with $D_{ig}$ acting as the treatment indicator. □

Lee and Wooldridge (2023) establish that the coefficient on $D_{ig}$ from these cross-sectional regressions identifies the average treatment effect on the treated (ATT) for group $g$ at time $t$, denoted $ATT(g, t)$, under the assumption of treatment irreversibility. In addition, if one is concerned about unit-specific heterogeneous linear trends, one can apply Procedure 5.1 in Lee and Wooldridge (2023), which allows for such trends.

However, in the present framework, I allow for the possibility that group $g$ may exit treatment at certain periods $t$ ($> g$). Thus, while I initially follow the Lee and Wooldridge (2023) approach to estimate the effect for these periods, I subsequently reclassify these periods as exit periods in the second stage of the estimation procedure. Consequently, instead of defining the effect as $ATT(g, t)$, I adopt a broader concept, denoted $GTT(g, t)$, representing the generalized treatment effect for group $g$ at time $t$, which can encompass not only the standard treatment effect but also exit and re-entry effects.

Second Stage: Classification and Aggregation by Treatment Path. After estimating $GTT(g, t)$ for all $g \in \{S, \ldots, T\}$ and $t \in \{g, \ldots, T\}$, we classify each period based on the treatment path, allowing for a more nuanced interpretation of treatment effects. The classification includes three distinct categories based on the observed treatment status:

$\alpha \in \{\text{Treated}, \text{Exit}, \text{Re-entry}\}.$

In this context, Treated refers to time periods immediately following the initial treatment, indicated by a positive treatment status relative to the first treatment timing. Exit represents periods after treatment cessation, during which the treatment status returns to zero following the initial treatment. Re-entry denotes periods after the re-initiation of treatment following an exit episode, allowing for multiple treatment spells.

Table 4.1 illustrates the classification for two groups, $g = 4$ and $g = 3$, each representing a distinct treatment path. The table presents the relative time for each treatment spell, distinguishing periods of treatment, exit, and re-entry based on the specific treatment path of each group; a code sketch of this classification follows the table.

Table 4.1 Classification of Treatment Paths for Groups $g = 4$ and $g = 3$

Group $g = 4$
Calendar Time (t)           1    2    3    4    5    6    7    8    9
Treatment Path              0    0    0    1    1    1    0    0    0
1. Treatment Period (r)    −3   −2   −1    0    1    2    –    –    –
2. Exit Period (r)          –    –    –    –    –    –    0    1    2
3. Re-entry Period (r)      –    –    –    –    –    –    –    –    –

Group $g = 3$
Calendar Time (t)           1    2    3    4    5    6    7    8    9
Treatment Path              0    0    1    1    0    0    0    1    1
1. Treatment Period (r)    −2   −1    0    1    –    –    –    –    –
2. Exit Period (r)          –    –    –    –    0    1    2    –    –
3. Re-entry Period (r)      –    –    –    –    –    –    –    0    1

In Group $g = 4$, treatment is initiated at $t = 4$, continues for three periods, and then exits at $t = 7$. Exit effects are observed from $t = 7$ to $t = 9$. No re-entry occurs in this group. In contrast, Group $g = 3$ initiates treatment earlier, at $t = 3$, exits at $t = 5$, and then re-enters at $t = 8$. The re-entry period spans $t = 8$ to $t = 9$, allowing for the estimation of re-treatment effects distinct from the initial treatment effects.
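A minimal sketch of the second-stage classification for one unit's binary treatment path; classify_path is an illustrative helper name.

import numpy as np

def classify_path(d):
    """d is one unit's binary treatment path over t = 1,...,T. Returns the
    first treatment period g (0 if never treated) and a label per period in
    {'pre', 'treated', 'exit', 're-entry'}."""
    d = np.asarray(d)
    labels, state = [], "pre"
    g = int(np.argmax(d) + 1) if d.any() else 0
    for t, dt in enumerate(d, start=1):
        if g == 0 or t < g:
            labels.append("pre")
        elif dt == 1:
            # a 1 after an exit spell starts a re-entry spell
            state = "treated" if state in ("pre", "treated") else "re-entry"
            labels.append(state)
        else:
            state = "exit"
            labels.append("exit")
    return g, labels

# Group g = 3 from Table 4.1: treated t = 3-4, exit t = 5-7, re-entry t = 8-9
print(classify_path([0, 0, 1, 1, 0, 0, 0, 1, 1]))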
4.2.3 Aggregation Strategy

Define relative time $r = t - g$. Then, we compute the weighted average of the group-time general treatment effects $GTT(g, t)$ based on relative time $r$ for each $\alpha \in \{\text{Treated}, \text{Exit}, \text{Re-entry}\}$:

$WGTT(\alpha, r) = \sum_{g \in G_{r,\alpha}} w(g, r) \cdot GTT(g, g + r), \quad \text{where } r = t - g,$  (4.7)

where $G_{r,\alpha}$ denotes the set of groups with relative time $r$ and treatment state $\alpha$ at time $t$, and $w(g, r)$ represents the weight assigned to group $g$. While various weighting strategies could be considered, this paper adopts a simple proportional weighting scheme based on group sizes:

$w(g, r) = \frac{N_g}{N_{G_{r,\alpha}}},$

where $N_g$ is the number of units in group $g$, and $N_{G_{r,\alpha}}$ is the total number of units across all groups in $G_{r,\alpha}$ used for estimating $WGTT(\alpha, r)$. For example, if only groups $b$ and $c$, each with 10 units, contribute to the estimation at $r = 1$, then $w(b, 1) = w(c, 1) = 10/20 = 0.5$.

This aggregation yields three distinct WGTT series, which can be visualized through event-study plots to analyze dynamic treatment effects across the different treatment states; a computational sketch appears at the end of this subsection.

De Chaisemartin and d'Haultfoeuille (2024) also propose an estimation method for settings with dynamic treatment patterns, such as treatment entry and exit. However, their approach addresses this complexity by focusing exclusively on the "treatment period" and explicitly excluding observations during "exit" and "re-entry" periods. This restriction is driven by their no-crossing condition, which requires treatment levels to remain strictly above or below the period-one status ($t = 1$), effectively dropping post-exit observations from the analysis. As a result, their framework does not accommodate re-entry into treatment or estimate effects beyond the treatment spell.

In contrast, I leverage control groups to estimate effects across all three treatment states. By explicitly classifying periods according to treatment status, the proposed framework enables a comprehensive analysis of how treatment effects evolve across multiple spells and interruptions. This extension provides researchers with a deeper understanding of treatment dynamics over time, including potential lingering effects after discontinuation and resurgent effects following re-entry.

In addition, a key strength of the framework proposed by De Chaisemartin and d'Haultfoeuille (2024) lies in its ability to accommodate discrete treatment intensities and its flexibility regarding the initial treatment status. Specifically, even when units are already treated in the initial period, their approach allows for the estimation of treatment effects for "switchers"—units whose treatment intensity strictly increases or decreases relative to the baseline level. By using status-quo units as the control group, their method enables identification of dynamic treatment effects for such transitions under certain assumptions.

While my current study focuses on binary treatment effects, the proposed classification-based framework is naturally extensible to settings with discrete or continuous treatment intensities. Incorporating these variations into the treatment-exit-re-entry structure is a key direction for future research. Such an extension would enable a more comprehensive evaluation of dynamic policy interventions, where treatment may vary not only in timing but also in dosage or intensity.
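Returning to the aggregation in (4.7), here is a minimal sketch with size-proportional weights. Following (4.7), relative time is measured from the initial treatment date $g$; the spell-specific clocks of Table 4.1 would require only a different key. The data layout (dictionaries keyed by group and period) is an illustrative assumption.

import numpy as np

def wgtt(gtt, sizes, state):
    """gtt[(g, t)] -> estimated GTT(g, t); sizes[g] -> N_g;
    state[(g, t)] -> 'treated', 'exit', or 're-entry'.
    Returns WGTT(alpha, r) for every populated (alpha, r) cell."""
    cells = {}
    for (g, t), a in state.items():                 # groups contributing to each cell
        cells.setdefault((a, t - g), []).append(g)
    out = {}
    for (a, r), groups in cells.items():
        n_total = sum(sizes[g] for g in groups)     # N_{G_{r,alpha}}
        out[(a, r)] = sum(sizes[g] / n_total * gtt[(g, g + r)] for g in groups)
    return out

# Tiny example: two cohorts, 10 units each, both 'treated' at relative time 0
gtt = {(3, 3): 1.0, (4, 4): 2.0}
state = {(3, 3): "treated", (4, 4): "treated"}
print(wgtt(gtt, {3: 10, 4: 10}, state))   # {('treated', 0): 1.5}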
4.2.4 Subgroup-Specific Generalized Treatment Effects (SGTTs)

In the previous discussion, the treatment group $g$ was assumed to exit treatment uniformly, implying that all units within the group followed an identical treatment trajectory over time. However, in many empirical settings, units within the same treatment group may experience heterogeneous treatment patterns. Specifically, within group $g$, some units may remain treated throughout, others may exit treatment at certain periods, and some may re-enter treatment after a period of exit.

To account for these varied treatment patterns, I define three subgroup-specific generalized treatment effects (SGTTs) for each treatment group $g$ at time $t$, as follows:

$SGTT(g, t, \alpha) = E\left[Y_{it}(g) - Y_{it}(\infty) \mid D_g = 1, \alpha\right], \quad \alpha \in \{\text{Treated}, \text{Exit}, \text{Re-entry}\}.$  (4.8)

Each SGTT captures the generalized treatment effect for units that remain treated at time $t$, for units that have exited treatment by time $t$, or for units that have re-entered treatment at time $t$. As in the first stage above, following Lee and Wooldridge (2023, Procedure 4.1), we estimate the subgroup-specific generalized treatment effects $SGTT(g, t, \alpha)$. These SGTTs are subsequently utilized in the second stage of the estimation procedure, where the effects are aggregated based on the observed treatment status:

$WGTT(\alpha, r) = \sum_{g_\alpha \in G_{r,\alpha}} w(g_\alpha, r) \cdot SGTT(g, g + r, \alpha), \quad \text{where } r = t - g,$  (4.9)

where $w(g_\alpha, r) = N_{g_\alpha} / N_{G_{r,\alpha}}$ denotes the weight assigned to group $g$ in state $\alpha \in \{\text{Treated}, \text{Exit}, \text{Re-entry}\}$ at relative time $r$.

4.3 Moderating Effects

This section extends the rolling transformation framework in Lee and Wooldridge (2023) to incorporate moderating effects—treatment effect heterogeneity driven by observable unit characteristics. Such heterogeneity arises when the impact of a treatment varies across subgroups defined by factors such as income, race, gender, or baseline access levels. For instance, in the empirical application section, I consider two subgroups: $b_i = 1$ if a town has more than 20% of residents aged 65 or older, and $b_i = 0$ otherwise. The treatment effect may differ across these subgroups.

To address treatment effect heterogeneity, I outline a two-stage estimation procedure for a generalized version of the inverse probability weighted regression adjustment (IPWRA) method, accommodating staggered intervention settings with multiple treatment groups.

Stage 1: Propensity Score Estimation. To account for staggered treatment timing, we estimate the propensity score using a multinomial logit model, where the treatment status $D_g$ is defined by the first treatment period $g \in \{S, S+1, \ldots, T\}$. The propensity score for each unit is given by

$p_g(\mathbf{X}) = \Pr(D_g = 1 \mid \mathbf{X}) = \frac{\exp(\mathbf{X}'\delta_g)}{1 + \sum_{k=1}^{G} \exp(\mathbf{X}'\delta_k)},$  (4.10)

where the $\delta_g$ are the parameters estimated for each treatment group $g$. The control group's (i.e., $g = \infty$) coefficient vector is normalized to zero.

Stage 2: IPWRA Estimation with Moderating Effects. After obtaining the estimated propensity scores $\hat{p}_g(\mathbf{X}_i)$, I proceed to the IPWRA estimation that incorporates moderating effects. Specifically, the treatment effect for group $g$ at time $t$ is estimated by solving the following weighted least squares problem:

$\arg\min_{\tau_{g,t}, \alpha, \beta, \gamma_{g,t}} \sum_{i=1}^{N} \left( D_g + \frac{\hat{p}_g(\mathbf{X})(1 - D_g)}{1 - \hat{p}_g(\mathbf{X})} \right) \left( \dot{Y}_{g,t} - \tau_{g,t} \cdot D_g - \alpha - \mathbf{X}_i'\beta - D_g \left(\mathbf{X}_i - \bar{\mathbf{X}}_g\right)'\gamma_{g,t} \right)^2.$  (4.11)

Here, $\gamma_{g,t}$ captures the moderating effects—heterogeneous treatment effects that vary with deviations from the treated group's average covariates $\bar{\mathbf{X}}_g$. This formulation allows for a flexible and interpretable estimation of treatment effect heterogeneity across subgroups; a sketch of the two-stage procedure is given below.
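A minimal sketch of the two stages in the common timing case with a single treated group, so that Stage 1 reduces to a binary logit; ipwra_moderated and the simulated data are illustrative assumptions, not the paper's implementation.

import numpy as np
import statsmodels.api as sm

def ipwra_moderated(Ydot, D, X):
    """(i) logit propensity score; (ii) weighted least squares as in (4.11),
    with D * (X - Xbar_treated) interactions as the moderators."""
    X1 = sm.add_constant(X)
    # Stage 1: propensity scores from a logit of D on X
    p = sm.Logit(D, X1).fit(disp=0).predict(X1)
    w = D + p * (1 - D) / (1 - p)                  # ATT-style IPW weights
    # Stage 2: weighted regression with treated-centered interactions
    Xc = X - X[D == 1].mean(axis=0)                # deviations from treated means
    design = sm.add_constant(np.column_stack([D, X, D[:, None] * Xc]))
    fit = sm.WLS(Ydot, design, weights=w).fit()
    k = X.shape[1]
    tau = fit.params[1]                            # ATT for this (g, t)
    gamma = fit.params[2 + k:]                     # moderating effects
    return tau, gamma

# Usage with simulated data: true ATT = 2, moderating effect of x1 = 0.5
rng = np.random.default_rng(5)
N = 2000
X = rng.normal(0, 1, (N, 2))
p_true = 1 / (1 + np.exp(-(X @ [0.5, -0.5] - 0.5)))
D = (rng.random(N) < p_true).astype(int)
Xc = X - X[D == 1].mean(axis=0)
Ydot = X @ [1.0, 1.0] + D * (2 + 0.5 * Xc[:, 0]) + rng.normal(0, 1, N)
print(ipwra_moderated(Ydot, D, X))

Centering the interactions at the treated group's covariate means keeps the coefficient on $D$ interpretable as the ATT while the $\gamma$ coefficients pick up the heterogeneity.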
Importantly, this structure reproduces the same point estimate for $\tau_{g,t}$ as the baseline IPWRA estimator in Lee and Wooldridge (2023), given by

$\arg\min_{\tau_{g,t}, \alpha, \beta} \sum_{i=1}^{N} \left( D_g + \frac{\hat{p}_g(\mathbf{X})(1 - D_g)}{1 - \hat{p}_g(\mathbf{X})} \right) \left( \dot{Y}_{g,t} - \tau_{g,t} \cdot D_g - \alpha - \mathbf{X}_i'\beta \right)^2.$  (4.12)

However, the inclusion of the $\gamma$ coefficients provides additional insight by quantifying the extent to which the treatment effect differs across levels of the baseline covariates, thereby offering a more comprehensive understanding of policy impacts across heterogeneous subgroups.

This framework offers a practical advantage over standard built-in commands in Stata such as teffects ipwra, which does not support estimation of moderating effects. By applying the transformed-outcome approach and the two-stage IPWRA estimation procedure, we can estimate moderating effects while leveraging the efficiency of doubly robust estimation.

4.3.1 Comparison with the Pooled OLS Estimator

Wooldridge (2021) proposes a pooled OLS method for estimating treatment effect heterogeneity over time, while also allowing for the identification of moderating effects. For illustration, consider the common timing case in which all treated units received their initial treatment in the same period $t = q$. The specification includes a full set of time fixed effects and interaction terms between the treatment indicator and covariates, including demeaned covariates (centered at the treated group mean). As shown in Equation (5.46) of Wooldridge (2021), the pooled OLS regression includes terms of the form

$y_{it}$ on $1, d_i, \dot{\mathbf{x}}_i, d_i \cdot \dot{\mathbf{x}}_i, f_{q,t}, \ldots, f_{T,t}, f_{q,t} \cdot \dot{\mathbf{x}}_i, \ldots, f_{T,t} \cdot \dot{\mathbf{x}}_i, d_i \cdot f_{q,t}, \ldots, d_i \cdot f_{T,t}, d_i \cdot f_{q,t} \cdot \dot{\mathbf{x}}_i, \ldots, d_i \cdot f_{T,t} \cdot \dot{\mathbf{x}}_i,$  (4.13)

where $f_{q,t}, \ldots, f_{T,t}$ denote time indicators such that $f_{t',t} = 1$ if $t = t'$ and 0 otherwise, for $t' \in \{q, \ldots, T\}$, the post-treatment periods. The covariates $\dot{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}_1$ represent deviations from the treated group's mean covariate values. The terms $d_i \cdot f_{t',t} \cdot \dot{\mathbf{x}}_i$ capture period-specific moderating effects, allowing the treatment effect to vary flexibly with covariates across time. Each coefficient on these triple interaction terms identifies $\gamma(t)$, the marginal effect of the covariates on the treatment effect in period $t$.

While straightforward and easy to implement, this regression-based approach relies on correct specification of the outcome model. If relevant interaction or nonlinear terms from the true model—which are rarely known to researchers in practice—are omitted, the estimator may suffer from bias due to model misspecification.

In contrast, my proposed two-stage IPWRA estimator provides greater robustness through its doubly robust structure. It remains consistent as long as either the outcome model or the propensity score model is correctly specified. Importantly, it also enables the estimation of moderating effects even when the outcome model is misspecified. In Section 4.4, I demonstrate through simulation studies that the two-stage IPWRA approach successfully recovers both the average treatment effects on the treated (ATTs) and the corresponding moderating effects.
Compared to POLS, the two-stage IPWRA yields substantially lower bias and RMSE for heterogeneous treatment effects under outcome model misspecification, while maintaining accurate estimation of average treatment effects. These findings underscore the practical value of the proposed method for evaluating differential policy impacts across heterogeneous populations, especially in settings where model misspecification is a concern.

4.4 Monte Carlo Simulations

This section outlines the data generating process (DGP) employed in the simulation study to evaluate the performance of the proposed two-stage IPWRA estimator described in Section 4.3. Although the study primarily focuses on staggered intervention settings, the simulation design is restricted to a common timing case for simplicity. The findings demonstrate that the proposed two-stage IPWRA estimator successfully recovers both moderating effects and treatment effects. Regarding treatment effects, the estimator exactly replicates the baseline IPWRA estimates in Lee and Wooldridge (2023).

4.4.1 Data Generating Process (DGP)

The data generating process (DGP) is structured to simulate treatment effects under a common timing framework, involving a single treatment group and a control group. The DGP replicates the procedure outlined in Appendix 2D of Lee and Wooldridge (2023), and the relevant variables are defined as described below.

The dataset spans six time periods ($T = 6$), representing the years 2001 to 2006, with treatment initiated at time $S = 4$. The control group ($D_i = 0$) consists of never-treated units, while the treatment group ($D_i = 1$) includes units treated at $S = 4$. The detailed structure of the variables is as follows. I generate two time-invariant covariates $\mathbf{X} = (X_1, X_2)$:

$$X_1 \sim Gamma(2, 2) \quad \text{and} \quad X_2 \sim Bernoulli(0.6)$$

The treatment indicator is defined through a logistic propensity score model:

$$p(\mathbf{X}) = \Pr(D = 1 \mid \mathbf{X}) = \frac{\exp(\mathbf{X}'\mathbf{z})}{1 + \exp(\mathbf{X}'\mathbf{z})}, \tag{4.14}$$

where the index function is specified as $\mathbf{X}'\mathbf{z} = -1.2 + \frac{X_1 - 4}{2} - X_2$.

To investigate the moderating effects, the treatment effects are generated to exhibit heterogeneity not only across time but also over the covariates $\mathbf{X}$. The treatment effect function is specified as follows:

$$\tau_r(\mathbf{X}) = \theta \cdot \sum_{s=S}^{r} (s - S + 1)^{-1} + \lambda_r \cdot h(\mathbf{X}), \tag{4.15}$$

where $\theta = T - S + 1$ and $\lambda_r = (0.5, 0.6, 0.8)$. The functional form of $h(\mathbf{X})$ is:

$$h(\mathbf{X}) = \frac{X_1 - 4}{2} + \frac{X_2}{3}$$

In this specification, $\theta \cdot \sum_{s=S}^{r} (s - S + 1)^{-1}$ captures the baseline treatment effect accumulated through period $r$, while $\lambda_r \cdot h(\mathbf{X})$ represents the period-specific moderating effect, allowing treatment effects to vary with covariates across time.

Next, I define the potential outcome in the control state:

$$Y_t(0) = \delta_t + \alpha_i + \beta_t \cdot f(\mathbf{X}) + U_t(0), \tag{4.16}$$

where $\delta_t = t$, $\alpha_i \sim \mathcal{N}(2, 1)$, and $U_t(0) \sim \mathcal{N}(0, 4)$. The coefficient vector $\beta'$ is set as $\beta' = (1.0, 1.5, 0.8, 1.5, 2, 2.5)$. $f(\mathbf{X})$ takes two functional forms:

$$f(\mathbf{X}) = \frac{X_1 - 4}{3} + \frac{X_2}{2}, \tag{4.17}$$

$$f(\mathbf{X}) = \frac{X_1 - 4}{3} + \frac{X_2}{2} + \frac{(X_1 - 4)^2}{3} + \frac{(X_1 - 4) \cdot X_2}{4} \tag{4.18}$$

For the simulation studies, I assume that the outcome model is linear in $\mathbf{X}$. Under this specification, the conditional mean function $E(Y \mid \mathbf{X})$ is correctly specified when using equation (4.17), but becomes misspecified under equation (4.18) due to the inclusion of interaction and quadratic terms in the true data-generating process. However, since the IPWRA estimator is doubly robust, it remains consistent as long as the propensity score model is correctly specified. Therefore, even under outcome model misspecification, I expect the IPWRA estimator to accurately recover the treatment effect estimates.

Finally, the potential outcome in the treated state is

$$Y_t(1) = \begin{cases} Y_t(0), & t < S \\ Y_t(0) + \tau_t + U_t(1) - U_t(0), & t \geq S \end{cases}$$

where $U_t(1) \sim \mathcal{N}(0, 4)$.
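For replication purposes, the DGP above can be coded directly. The sketch below is one assumed reading of the design: $Gamma(2, 2)$ is taken as shape 2 and scale 2 (so $E[X_1] = 4$, matching the centering at 4), $\mathcal{N}(0, 4)$ is read as variance 4, and the baseline effect cumulates through period $r$, which is consistent with the sample ATTs of 3.220, 4.753, and 5.838 reported below.

    import numpy as np

    def simulate_panel(N, rng, misspec=False, T=6, S=4):
        # Covariates: X1 ~ Gamma(2, 2) (shape 2, scale 2), X2 ~ Bernoulli(0.6).
        x1 = rng.gamma(2.0, 2.0, size=N)
        x2 = rng.binomial(1, 0.6, size=N)

        # Treatment assignment from the logistic model in eq. (4.14).
        index = -1.2 + (x1 - 4) / 2 - x2
        d = rng.binomial(1, 1 / (1 + np.exp(-index)))

        alpha = rng.normal(2, 1, size=N)            # unit heterogeneity
        beta = [1.0, 1.5, 0.8, 1.5, 2.0, 2.5]
        lam = {4: 0.5, 5: 0.6, 6: 0.8}
        theta = T - S + 1
        f_x = (x1 - 4) / 3 + x2 / 2                 # eq. (4.17)
        if misspec:                                 # eq. (4.18)
            f_x = f_x + (x1 - 4) ** 2 / 3 + (x1 - 4) * x2 / 4
        h_x = (x1 - 4) / 2 + x2 / 3

        y = np.empty((N, T))
        for t in range(1, T + 1):
            u0 = rng.normal(0, 2, size=N)           # U_t(0), variance 4
            y0 = t + alpha + beta[t - 1] * f_x + u0 # eq. (4.16)
            if t < S:
                y[:, t - 1] = y0
            else:
                tau = theta * sum(1.0 / (s - S + 1) for s in range(S, t + 1)) \
                      + lam[t] * h_x                # eq. (4.15)
                u1 = rng.normal(0, 2, size=N)       # U_t(1), variance 4
                y[:, t - 1] = y0 + d * (tau + u1 - u0)
        return y, d, np.column_stack([x1, x2])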
4.4.2 Simulation Results

Table 4.2 presents the simulation results under correct specification of the outcome model, where the conditional mean function $E(Y \mid \mathbf{X})$ is linear in covariates. I compare the performance of four estimators: pooled OLS (POLS) following Wooldridge (2021), the baseline IPWRA proposed in Lee and Wooldridge (2023), the doubly robust estimator developed by Callaway and Sant'Anna (2021) (denoted CS), and the proposed two-stage IPWRA. All four are designed to estimate average treatment effects on the treated (ATTs), particularly in the presence of treatment effect heterogeneity across units and time. Estimator performance is evaluated based on bias, standard deviation (SD), and root mean squared error (RMSE), focusing on both average treatment effects ($\tau_4$, $\tau_5$, $\tau_6$) and moderating effects associated with covariates $x_1$ and $x_2$.

Under correct specification, all estimators perform reasonably well in recovering the ATTs. Both the baseline IPWRA and two-stage IPWRA estimators yield identical estimates with virtually zero bias and low RMSE across all time periods. This confirms that the two-stage extension preserves the desirable statistical properties of the standard IPWRA estimator when estimating group-time average treatment effects.

Table 4.2 Simulation Results Under Correct Specification of $E(Y \mid \mathbf{X})$

ATT (sample value)        τ4 = 3.220               τ5 = 4.753               τ6 = 5.838
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS (RA)           -0.002  0.395   0.395    -0.013  0.410   0.410    -0.008  0.413   0.413
CS                  -0.003  0.503   0.503    -0.015  0.503   0.504    -0.009  0.518   0.518
IPWRA               -0.001  0.403   0.403    -0.014  0.413   0.413    -0.007  0.421   0.421
Two-Stage IPWRA     -0.001  0.403   0.403    -0.014  0.413   0.413    -0.007  0.421   0.421

Moderating Effects      γ(x1, 4) = 0.250         γ(x1, 5) = 0.300         γ(x1, 6) = 0.400
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS                -0.006  0.324   0.325     0.000  0.328   0.328     0.001  0.332   0.332
Two-Stage IPWRA     -0.006  0.370   0.370    -0.002  0.361   0.361     0.003  0.365   0.365

Moderating Effects      γ(x2, 4) = 0.167         γ(x2, 5) = 0.200         γ(x2, 6) = 0.267
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS                 0.038  0.803   0.804     0.037  0.811   0.811    -0.023  0.836   0.837
Two-Stage IPWRA      0.030  0.848   0.849     0.045  0.850   0.851    -0.019  0.869   0.869

Importantly, both POLS and the two-stage IPWRA also demonstrate strong performance in estimating the moderating effects $\gamma(x_1, t)$ and $\gamma(x_2, t)$. For example, when the true moderating effect $\gamma(x_1, 4)$ is 0.25, the two-stage IPWRA achieves negligible bias and an RMSE of 0.370, while POLS performs comparably well. This is expected, as the outcome model is correctly specified in both frameworks.
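For reference, the performance measures in Tables 4.2 and 4.3 are computed across Monte Carlo replications as in the small sketch below; note that RMSE² ≈ bias² + SD² (up to the degrees-of-freedom correction), which provides a useful internal consistency check on the tables.

    import numpy as np

    def mc_summary(estimates, truth):
        # Bias, SD, and RMSE of an estimator across Monte Carlo replications.
        est = np.asarray(estimates, dtype=float)
        bias = est.mean() - truth
        sd = est.std(ddof=1)
        rmse = np.sqrt(((est - truth) ** 2).mean())
        return bias, sd, rmse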
Table 4.3 presents simulation results under misspecification of the outcome model, where the true conditional mean function $E(Y \mid \mathbf{X})$ includes nonlinear and interaction terms that are omitted in estimation. In this setting, the propensity score model remains correctly specified, but the outcome model deviates from the true data-generating process. This setup provides a useful test for evaluating the robustness of each estimator.

The upper panel of the table reports estimates for the average treatment effects on the treated ($\tau_4$, $\tau_5$, $\tau_6$). Since the propensity score model is correctly specified, both the baseline IPWRA and the two-stage IPWRA recover the treatment effects with negligible bias and low RMSE—consistent with the doubly robust property of IPWRA. Notably, this property holds even when the outcome model is misspecified, confirming that the two-stage extension maintains the validity and consistency of the original IPWRA framework in estimating ATTs.

Table 4.3 Simulation Results under Misspecification of $E(Y \mid \mathbf{X})$

ATT (sample value)        τ4 = 3.220               τ5 = 4.753               τ6 = 5.838
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS (RA)            0.044  0.397   0.400     0.091  0.418   0.428     0.154  0.429   0.456
CS                  -0.003  0.509   0.510    -0.015  0.517   0.517    -0.008  0.543   0.543
IPWRA                0.000  0.405   0.405    -0.011  0.417   0.417    -0.002  0.431   0.431
Two-Stage IPWRA      0.000  0.405   0.405    -0.011  0.417   0.417    -0.002  0.431   0.431

Moderating Effects      γ(x1, 4) = 0.250         γ(x1, 5) = 0.300         γ(x1, 6) = 0.400
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS                 0.155  0.329   0.364     0.362  0.344   0.500     0.564  0.380   0.681
Two-Stage IPWRA      0.002  0.374   0.374     0.016  0.391   0.391     0.031  0.438   0.439

Moderating Effects      γ(x2, 4) = 0.167         γ(x2, 5) = 0.200         γ(x2, 6) = 0.267
                      Bias     SD    RMSE      Bias     SD    RMSE      Bias     SD    RMSE
POLS                 0.114  0.807   0.815     0.208  0.819   0.845     0.245  0.855   0.889
Two-Stage IPWRA      0.028  0.853   0.853     0.040  0.866   0.867    -0.027  0.905   0.905

The distinction between estimators becomes more pronounced when examining moderating effects. The two-stage IPWRA continues to estimate $\gamma(x_1, t)$ and $\gamma(x_2, t)$ with minimal bias and relatively low RMSE. In contrast, the POLS estimator suffers from substantial bias in nearly every case, particularly when the true moderating effect is large. For example, when the true value of $\gamma(x_1, 6)$ is 0.4, POLS overestimates it by more than 0.56, yielding an RMSE of 0.681, whereas the two-stage IPWRA produces a much smaller bias of 0.031 and an RMSE of 0.439. This pattern holds across the other periods and covariates.

As a regression-based estimator, POLS is sensitive to outcome model misspecification—particularly when covariate interactions or nonlinear terms from the true model are omitted. In contrast, the two-stage IPWRA leverages its doubly robust property and remains consistent as long as the propensity score is correctly specified. By incorporating covariate–treatment interactions within a weighted least squares framework, it flexibly recovers heterogeneous effects without requiring correct specification of the outcome model.

In summary, the two-stage IPWRA estimator not only replicates the average treatment effect estimates of the standard IPWRA under outcome model misspecification, but also provides a reliable and robust framework for estimating heterogeneous treatment effects—something the standard IPWRA is not equipped to do. This makes it particularly valuable in empirical settings where treatment effect heterogeneity is of interest but the outcome model's functional form is uncertain or potentially misspecified.

4.5 Empirical Application

In this section, I revisit Kim (2023), who studies the impact of chain pharmacy entry on the number of local independent pharmacies in rural townships. While the primary contribution of that paper lies in modeling the strategic interaction between chain and independent pharmacies using a static game framework, the author first presents reduced-form evidence using recent staggered difference-in-differences (DID) methods.
Although Kim (2023) notes that chain pharmacies frequently enter and exit markets, the analysis ultimately treats entry as a one-time event, citing the absence of a suitable framework for handling dynamic paths of treatment exposure. My estimation strategy directly addresses this limitation by allowing treatment status to vary over time—specifically, to turn on, off, and back on—offering a more realistic depiction of competitive exposure in local markets. Through a two-stage estimation procedure, I estimate separate effects for three distinct observed treatment states: Treated, Exit, and Re-entry.

I also focus on a key source of treatment effect heterogeneity: the share of the elderly population in a township. My framework extends the standard doubly robust estimator—combining inverse probability weighting with regression adjustment (IPWRA)—to incorporate moderating effects. Following Kim (2023), I classify townships into two groups based on their elderly population share in the year 2000: (i) high elderly population towns (those with ≥ 20% of residents aged 65 or older), and (ii) non-high elderly population towns (those with < 20% aged 65 or older). This approach enables the estimation of subgroup-specific dynamic treatment effects, facilitating a more nuanced understanding of how the competitive effects of chain entry vary across demographic contexts. The method not only assesses whether average effects differ across subgroups but also tracks how those effects evolve over time within each group.

4.5.1 Data and Setting

The empirical analysis employs the dataset constructed by Kim (2023), which combines the Data Axle Historical Business Database (1997–2021)—a comprehensive record of business establishments, including pharmacy operations across the United States—with demographic variables from the U.S. Census and the American Community Survey (ACS). The data are matched at the township-year level and focus on rural townships in the Midwestern United States, where access to pharmacies is a key healthcare concern.

Independent pharmacies in rural areas often serve as more than just retail outlets; they function as essential community hubs, providing prescription fulfillment, over-the-counter medications, and informal health advice for minor ailments. Since the 1970s, however, the pharmacy landscape has been reshaped by the rapid expansion of chain pharmacies such as Walgreens, CVS, and Rite Aid, as well as mass merchandisers like Walmart and Target. These developments have threatened the survival of locally owned pharmacies, especially in underserved rural areas where the closure of a single pharmacy can have substantial consequences for healthcare access.

The geographic unit of analysis is the township. Each township is characterized by the number of active independent pharmacies, the presence of chain pharmacies within a 15-mile radius, and market characteristics such as population size, elderly share, and poverty rate. The outcome of interest is the number of independent pharmacies operating in each township-year. The treatment variable is a binary indicator equal to one if at least one chain pharmacy is present within 15 miles of the township in year $t$.
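Concretely, each township-period after initial chain entry can be assigned to one of the three observed states used in the WGTT aggregation. The sketch below is an assumed reading of that classification from a township's binary exposure path; the function name and state labels are illustrative.

    def classify_states(w):
        # w[t] = 1 if a chain pharmacy is present within 15 miles in period t.
        # Periods before initial entry are left unclassified (None).
        states = [None] * len(w)
        ever_treated = False
        ever_exited = False
        for t, wt in enumerate(w):
            if wt == 1:
                states[t] = "After Re-entry" if ever_exited else "Treated"
                ever_treated = True
            elif ever_treated:
                states[t] = "After Exit"
                ever_exited = True
        return states

    # Example: entry in period 2, exit in period 4, re-entry in period 6.
    # classify_states([0, 0, 1, 1, 0, 0, 1])
    # -> [None, None, 'Treated', 'Treated', 'After Exit', 'After Exit',
    #     'After Re-entry']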
While Kim (2023) includes 802 towns in the original dataset, I restrict the sample to 596 townships by excluding those that were already exposed to chain pharmacies in the initial year. This restriction ensures that treatment does not occur at baseline: chain presence (within 15 miles) is zero for all units in the initial year. The resulting panel covers the years 2000 through 2019. For further details on township definitions and sampling criteria, see Section 2.2, Final Sample, in Kim (2023).

Table 4.4 presents summary statistics for the 596 townships included in the analysis, covering a total of 11,920 township-year observations from 2000 to 2019. All summary measures, unless otherwise noted, reflect township characteristics in the initial year of the panel, 2000. Again, townships are classified as "High Elderly" if 20% or more of the population was aged 65 or older in 2000.

Table 4.4 Summary Statistics of the Full Sample in Year 2000

                                               Full Sample   Non-High Elderly   High Elderly
Number of Townships                                    596                181            415
Number of Observations                              11,920              3,820          8,100
Independent Pharmacies (mean)                        0.865              0.674          0.949
Chain Presence (within 15 mi, mean, in 2000)             0                  0              0
Chain Presence (averaged over 2000–2019)              0.26               0.37           0.22
Log(Population) (mean)                                7.16               7.29           7.10
Elderly Share (aged 65 or older, mean)               22.9%              16.1%          26.0%
Poverty Share (mean)                                 12.4%              13.9%          11.6%

Based on their elderly population share in 2000, 415 townships (70%) are classified as high elderly population and 181 (30%) as non-high elderly population. On average, high elderly population townships have smaller populations than their non-high elderly counterparts, which is often associated with lower market demand. Since chain pharmacies are more likely to enter markets with greater demand and population density, chain presence is more common in non-high elderly population townships over the full sample period (2000–2019), with an average presence rate of 37%, compared with 22% in high elderly townships. Nevertheless, in the initial year of the panel (2000), high elderly population townships already exhibited a higher average number of independent pharmacies. This highlights the essential role that independent pharmacies have historically played in these communities—particularly in serving older populations and addressing healthcare needs in underserved areas.

4.5.2 Results

Table 4.5 presents the estimated weighted generalized treatment effects, $WGTT(\alpha, r)$, over relative time $r$ for each treatment state $\alpha \in \{\text{Treated}, \text{After Exit}, \text{After Re-entry}\}$. Each estimate is accompanied by a 95% bootstrap confidence interval. While the proposed framework allows for various treatment effect estimators, the results reported here are obtained using the rolling regression adjustment (RA) estimator, following Lee and Wooldridge (2023, Procedure 4.1), with each subsample categorized according to its treatment state.
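The confidence intervals in Table 4.5 are bootstrapped; the exact resampling scheme is not spelled out here, so the following is a minimal sketch assuming a nonparametric percentile bootstrap over townships, with `estimator` standing in (hypothetically) for the full two-stage WGTT computation on a resampled panel.

    import numpy as np

    def bootstrap_ci(unit_ids, estimator, n_boot=999, level=0.95, seed=42):
        # Percentile bootstrap CI: resample townships with replacement and
        # re-estimate WGTT(alpha, r) on each pseudo-sample.
        rng = np.random.default_rng(seed)
        ids = np.asarray(unit_ids)
        draws = np.array([estimator(rng.choice(ids, size=ids.size, replace=True))
                          for _ in range(n_boot)])
        tail = (1 - level) / 2
        return np.quantile(draws, [tail, 1 - tail])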
Table 4.5 Estimated Weighted Generalized Treatment Effects $WGTT(\alpha, r)$ (95% bootstrap CIs in parentheses)

Relative Time, r    α = Treated                α = After Re-entry         α = After Exit
0                   -0.355 (-0.427, -0.283)    -0.291 (-0.418, -0.165)    -0.253 (-0.364, -0.143)
1                   -0.368 (-0.449, -0.287)    -0.330 (-0.448, -0.212)    -0.257 (-0.374, -0.141)
2                   -0.389 (-0.487, -0.290)    -0.307 (-0.459, -0.155)    -0.212 (-0.368, -0.057)
3                   -0.339 (-0.441, -0.237)    -0.317 (-0.470, -0.164)    -0.247 (-0.420, -0.074)
4                   -0.340 (-0.452, -0.229)    -0.286 (-0.424, -0.148)    -0.166 (-0.354,  0.022)
5                   -0.344 (-0.472, -0.215)    -0.283 (-0.435, -0.132)    -0.249 (-0.454, -0.044)
6                   -0.347 (-0.482, -0.212)    -0.319 (-0.476, -0.163)    -0.236 (-0.521,  0.050)
7                   -0.375 (-0.538, -0.212)    -0.291 (-0.467, -0.114)    -0.116 (-0.556,  0.325)
8                   -0.389 (-0.531, -0.246)    -0.367 (-0.571, -0.162)    -0.161 (-0.652,  0.331)
9                   -0.324 (-0.527, -0.121)    -0.355 (-0.538, -0.171)    -0.275 (-0.710,  0.159)
10                  -0.339 (-0.547, -0.131)    -0.315 (-0.506, -0.124)    -0.313 (-0.875,  0.250)

Figure 4.1 visualizes these results, showing the dynamic treatment effect patterns by treatment state over time using the rolling regression adjustment (RA) estimator of Lee and Wooldridge (2023). The y-axis measures $WGTT(\alpha, r)$, and the x-axis denotes relative time since initial treatment, exit, or re-entry.

Figure 4.1 WGTT Estimates Using the Rolling Regression Adjustment Estimator
[Figure: event-study plot of pre-treatment, first-treatment, after-exit (lingering), and after-re-entry effects, each with 95% confidence intervals, against time to treatment, exit, or re-entry.]

Overall, the effects are consistently negative and statistically significant for the Treated group, indicating a persistent adverse impact on local pharmacy access. In contrast, the effects for the After Exit group diminish over time and become statistically insignificant, suggesting potential recovery. For the After Re-entry group, the effects remain negative but are somewhat attenuated, implying that re-entry still harms access, though less severely than initial exposure.

These patterns likely reflect underlying market dynamics. An initial closure may involve substantial financial losses and signal poor profitability, discouraging future entrants. In small rural markets with thin margins, the departure of a chain pharmacy may not be enough to restore viable conditions for independent pharmacies. Thus, exit does not necessarily reverse the decline in access, highlighting the persistent and asymmetric nature of competitive disruption. The attenuation observed for the After Re-entry group may reflect adaptive responses by independent pharmacies, market saturation, or reduced vulnerability after prior exposure.

Together, these results underscore the importance of modeling treatment as a dynamic, evolving process. By distinguishing between initial exposure, exit, and re-entry phases, the proposed estimator captures nuanced, heterogeneous treatment effects that would be obscured under static or binary frameworks.

4.6 Concluding Remarks

Building on the rolling approach developed by Lee and Wooldridge (2023), this paper extends the framework to accommodate dynamic treatment paths—including not only initial treatment but also treatment exit and re-entry—thereby offering a generalized method for evaluating complex intervention patterns over time.
In addition, I propose a two-stage IPWRA estimator that allows for the identification of moderating effects, enabling researchers to assess how treatment effects vary across subpopulations defined by covariates. By incorporating covariate–treatment interactions into a weighted least squares framework and leveraging the doubly robust structure of IPWRA, the proposed method recovers moderating effects even when the conditional outcome model is misspecified, as long as the propensity score is correctly specified.

Simulation results demonstrate that the two-stage IPWRA estimator performs well in recovering not only average treatment effects on the treated (ATTs) but also moderating effects, under both correct and misspecified outcome models. In particular, for moderating effects, it outperforms the pooled OLS estimator when the outcome model is misspecified. These findings underscore the robustness and flexibility of the proposed approach in evaluating differential policy impacts across heterogeneous populations.

To illustrate its empirical relevance, I apply the proposed method to examine the effects of pharmacy chain entry on independently owned pharmacies. Beyond estimating the average treatment effect of initial chain entry, the analysis also uncovers dynamic treatment patterns following exit and re-entry. Specifically, treatment effects in the After Exit phase remain persistently negative, indicating that the departure of chain pharmacies does not reverse the decline in access to independent pharmacies. These lingering effects suggest long-term disruptions in local markets, particularly in rural areas with limited demand. In contrast, treatment effects in the After Re-entry phase—reflecting repeated exposure to chain competition—are more attenuated, potentially due to adaptive responses or market saturation. These results highlight the importance of modeling treatment as a dynamic process and demonstrate the value of the proposed framework in capturing complex and evolving treatment effects that would be overlooked in static or binary settings.

While the current paper focuses on binary treatment settings, the framework is naturally extensible to discrete or continuous treatment intensities, including dynamic transitions such as treatment exit and re-entry. Future work will build on this foundation to further expand the applicability of the estimator in complex policy environments.

BIBLIOGRAPHY

Borusyak, K., Jaravel, X., and Spiess, J. (2024). Revisiting event study designs: Robust and efficient estimation. Review of Economic Studies, rdae007.

Callaway, B. and Sant'Anna, P. H. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200–230.

De Chaisemartin, C. and d'Haultfoeuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9):2964–2996.

De Chaisemartin, C. and d'Haultfoeuille, X. (2024). Difference-in-differences estimators of intertemporal treatment effects. Review of Economics and Statistics, 1–45.

Kim, H. (2023). Rural pharmacy access and competition: Static games with machine learning. Available at SSRN 4377695.

Lee, S. J. and Wooldridge, J. M. (2023). A simple transformation approach to difference-in-differences estimation for panel data. Available at SSRN 4516518.

Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2):175–199.

Wooldridge, J. M. (2021).
Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators. Available at SSRN 3906345.