AN ECLECTIC COLLECTION OF ESSAYS ON ECONOMETRIC METHODS

By

Kaicheng Chen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics—Doctor of Philosophy

2025

ABSTRACT

This dissertation develops and extends methodologies in panel, nonlinear, and high-dimensional econometrics. Chapter 1 of the dissertation introduces and summarizes the following chapters. Chapter 2 proposes a fixed-𝑏 approximation method for inference in a linear panel model with two-way clustering. Chapter 3 considers the same dependence structure in panel models with nonlinearity and high dimensionality. Chapter 4 revisits the use of linear models for binary responses, focusing on average partial effects. Chapter 5 considers models with endogenous controls and provides alternative identification methods for a large class of models.

Copyright by KAICHENG CHEN 2025

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest gratitude to my advisor, friend, and lifetime role model, Professor Tim Vogelsang. The devoted guidance and persistent support from Tim have guided me through my Ph.D. journey and will remain a lifelong treasure. I would also like to extend my sincere appreciation to the members of my dissertation committee, Professor Jeff Wooldridge, Professor Antonio Galvao, Professor Kyoo il Kim, and Professor Shlomo Levental, for their constructive feedback and support. Special thanks go to Professor Hugo Freeman for valuable discussions on research and for support throughout the job market.

I am grateful to Michigan State University for admitting me to the doctoral program and for the generous support provided throughout my graduate studies. I would also like to express my gratitude to my master's thesis advisor, Professor Jeffrey Zabel at Tufts University, for his valuable support and lessons, as well as to Professor Todd Elder and my friend Xiaoxin Zhang, whose recognition and recommendations made my admission to this excellent graduate program possible. Additionally, I extend my gratitude to the other faculty and staff of the MSU Economics Department. It is their important work that makes this journey fruitful in every aspect. I am also thankful to my colleagues, the graduate students in the department, for their camaraderie and shared experiences.

I am especially grateful to my partner, Saera Oh, who has shared this journey with me as a fellow Ph.D. student from an early stage. My life and this journey would never have been as colorful and meaningful without her. Finally, and as always, I am profoundly indebted to my parents for their unwavering support. Thank you for your constant love and understanding.

To all who have been part of this journey—thank you.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION . . . 1
CHAPTER 2 FIXED-B ASYMPTOTICS FOR PANEL MODELS WITH TWO-WAY CLUSTERING . . . 3
  BIBLIOGRAPHY . . . 40
  APPENDIX 2A PROOFS FOR CHAPTER 2 . . . 43
CHAPTER 3 INFERENCE IN HIGH-DIMENSIONAL PANEL MODELS: TWO-WAY DEPENDENCE AND UNOBSERVED HETEROGENEITY . . . 56
  BIBLIOGRAPHY . . . 104
  APPENDIX 3A PROOFS FOR CHAPTER 3.2 . . . 111
  APPENDIX 3B PROOFS FOR CHAPTER 3.3 . . . 122
  APPENDIX 3C PROOFS FOR CHAPTER 3.4 . . . 143
CHAPTER 4 ANOTHER LOOK AT THE LINEAR PROBABILITY MODEL AND NONLINEAR INDEX MODELS . . . 148
  BIBLIOGRAPHY . . . 173
  APPENDIX 4A PROOFS FOR CHAPTER 4 . . . 174
  APPENDIX 4B THE RAMP MODEL WITH VARIABLE SUPPORT . . . 176
CHAPTER 5 IDENTIFICATION OF PARTIAL EFFECTS WITH ENDOGENOUS CONTROLS . . . 179
  BIBLIOGRAPHY . . . 193
  APPENDIX 5A PROOFS FOR CHAPTER 5 . . . 194

CHAPTER 1
INTRODUCTION

This dissertation aims to develop and extend estimation and inference methodologies for panel data models, nonlinear models, and high-dimensional models.

Chapter 2 of this dissertation studies a cluster robust variance estimator proposed by Chiang, Hansen, and Sasaki (2024) for linear panels. First, algebraically, it is shown that this variance estimator (CHS estimator, hereafter) is a linear combination of three common variance estimators: the one-way unit cluster estimator, the "HAC of averages" estimator, and the "average of HACs" estimator. Based on this finding, a fixed-𝑏 asymptotic result is obtained for the CHS estimator and corresponding test statistics as the cross-section and time sample sizes jointly go to infinity. Furthermore, two simple bias-corrected versions of the variance estimator are proposed. In a simulation study, it is shown that the two bias-corrected variance estimators along with fixed-𝑏 critical values provide improvements in finite sample coverage probabilities. To illustrate the impact of bias correction and the use of the fixed-𝑏 critical values on inference, an empirical example on the relationship between industry profitability and market concentration is presented.

Chapter 3 studies estimation and inference methods for panel data models with high dimensionality and unobserved heterogeneous effects. Panel data allows for the modeling of unobserved heterogeneity, which significantly increases the number of nuisance parameters, making high dimensionality a practical issue rather than just a theoretical concern. However, unobserved heterogeneity, along with potential temporal and cross-sectional dependence in panel data, further complicates estimation and inference for high-dimensional models. This chapter proposes a toolkit for robust estimation and inference in high-dimensional panel models with large cross-sectional and time sample sizes. To reduce the dimensionality, I propose a weighted LASSO using two-way cluster-robust penalty weights. Due to the cluster dependence, the rate of convergence is slow even in an oracle case. Nevertheless, by leveraging a clustered-panel cross-fitting approach for bias correction, asymptotic normality can be established for the low-dimensional vector of the estimated parameters. As a special case, inferential theories are also established using the full sample in a partial linear model with unobserved time and unit effects. In a panel estimation of the government spending multiplier, I demonstrate how high dimensionality can be hidden and how the proposed toolkit enables flexible modeling and robust inference.

Chapter 4 reassesses the use of linear models for binary responses, focusing on average partial effects (APEs).
It is confirmed that under certain conditions, linear projection parameters correspond to APEs even when the true model is nonlinear. Simulations demonstrate that a large fraction of fitted values in [0, 1] is neither necessary nor sufficient for OLS to approximate the APEs. To reduce bias, excluding observations with fitted values outside [0, 1] has been proposed. It is shown that iteratively trimming the sample is equivalent to nonlinear least squares estimation of a piecewise linear (ramp) model, for which consistency and asymptotic normality results are established.

Chapter 5 of this dissertation focuses on the endogenous control issue. Exogeneity of the treatment needed for identification is often achieved by conditioning. While control variables are explicitly or implicitly assumed to be exogenous, it is common to encounter endogenous controls in practice. This brings a dilemma: without controlling, the treatment may be endogenous; with controlling, the endogeneity of the controls may pollute the identification. The problem is not solved by an instrumental variable when the instrument is only conditionally valid and the controls are endogenous. We provide identification results for the local average response under an additional measurable separability condition between the treatment and the controls. Notably, this condition permits the controls to be dependent on the treatment. The results apply to a wide class of models ranging from linear to non-separable ones. Monte Carlo simulations exemplify this prevalent issue and demonstrate the performance of the proposed methods in finite samples.

CHAPTER 2
FIXED-B ASYMPTOTICS FOR PANEL MODELS WITH TWO-WAY CLUSTERING
(CO-AUTHORED WITH TIM VOGELSANG)

This chapter has been published as Chen and Vogelsang (2024) and is permitted to be included in the author's own dissertation per the licensing and copyright contract with Elsevier. The co-author has approved the inclusion of the co-authored chapter. The co-author's contact: Tim Vogelsang, Department of Economics, 486 W. Circle Drive, 110 Marshall-Adams Hall, Michigan State University, East Lansing, MI 48824-1038. Email: tjv@msu.edu

2.1 Introduction

When carrying out inference in a linear panel model, it is well known that failing to adjust the variance estimator of estimated parameters to allow for different dependence structures in the data can cause over-rejection/under-rejection problems under null hypotheses, which in turn can give misleading empirical findings (see Bertrand et al., 2004). To study different dependence structures and robust variance estimators in panel settings, it is now common to use a component structure model $y_{it} = f(\alpha_i, \gamma_t, \varepsilon_{it})$ where the observable data, $y_{it}$, is a function of an individual component, $\alpha_i$, a time component, $\gamma_t$, and an idiosyncratic component, $\varepsilon_{it}$. See, for example, Davezies et al. (2021), MacKinnon et al. (2021), Menzel (2021), and Chiang et al. (2024).

As a concrete example, suppose $y_{it} = \alpha_i + \varepsilon_{it}$ for $i = 1, \ldots, N$ and $t = 1, \ldots, T$, where $\alpha_i$ and $\varepsilon_{it}$ are assumed to be i.i.d. random variables. The existence of $\alpha_i$ generates serial correlation within group $i$, which is also known as the individual clustering effect. This dependence structure is well captured by the cluster variance estimator proposed by Liang and Zeger (1986) and Arellano (1987). One can also use the "average of HACs" variance estimator that uses cross-section averages of the heteroskedasticity and autocorrelation (HAC) robust variance estimator proposed by Newey and West (1987). On the other hand, suppose $y_{it} = \gamma_t + \varepsilon_{it}$ where $\gamma_t$ is assumed to be an i.i.d. sequence of random variables. Cross-sectional/spatial dependence is generated in $y_{it}$ by $\gamma_t$ through the time clustering effect. In this case one can use a variance estimator that clusters over time or use the spatial dependence robust variance estimator proposed by Driscoll and Kraay (1998). Furthermore, if both $\alpha_i$ and $\gamma_t$ are assumed to be present, e.g., $y_{it} = \alpha_i + \gamma_t + \varepsilon_{it}$, then the dependence of $\{y_{it}\}$ exists in both the time and cross-section dimensions, also known as two-way clustering effects. Correspondingly, the two-way/multi-way robust variance estimator proposed by Cameron et al. (2011) is suitable for this case.
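To make the two-way clustering mechanism concrete, the following minimal sketch (not from the paper; all components drawn as standard normals) simulates $y_{it} = \alpha_i + \gamma_t + \varepsilon_{it}$ and checks the three covariance patterns just described:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 200
alpha = rng.normal(size=(N, 1))      # individual component alpha_i
gamma = rng.normal(size=(1, T))      # time component gamma_t
eps = rng.normal(size=(N, T))        # idiosyncratic component eps_it
y = alpha + gamma + eps              # y_it = alpha_i + gamma_t + eps_it

# Same unit, different periods: Cov(y_i1, y_i2) = Var(alpha_i) = 1
print(np.mean(y[:, 0] * y[:, 1]))    # close to 1
# Same period, different units: Cov(y_1t, y_2t) = Var(gamma_t) = 1
print(np.mean(y[0, :] * y[1, :]))    # close to 1
# Different unit and different period: zero covariance
print(np.mean(y[0, 1:] * y[1, :-1])) # close to 0
```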
In macroeconomics, the time effects, $\gamma_t$, can be regarded as common shocks, which are usually serially correlated. Allowing persistence in $\gamma_t$ up to a known lag structure, Thompson (2011) proposed a truncated variance estimator that is robust to dependence in both the cross-section and time dimensions. Because of the unsatisfactory finite sample performance of this rectangular-truncated estimator, Chiang et al. (2024) propose a Bartlett kernel variant (CHS variance estimator, hereafter) and establish the validity of tests based on this variance estimator using asymptotics with the cross-section sample size, $N$, and the time sample size, $T$, jointly going to infinity.

The asymptotic results for the CHS variance estimator rely on the assumption that the bandwidth, $M$, goes to infinity as $T$ goes to infinity while the bandwidth sample size ratio, $b = M/T$, is of a small order. As pointed out by Neave (1970) and Kiefer and Vogelsang (2005), the value of $b$ in a given application is a non-zero number that matters for the sampling distribution of the variance estimator. Treating $b$ as shrinking to zero in the asymptotics may miss some important features of the finite sample behavior of the variance estimator and test statistics. As noted by Andrews (1991), Kiefer and Vogelsang (2005), and many others, HAC robust tests tend to over-reject in finite samples when standard critical values are used. This is especially true when time dependence is persistent and large bandwidths are used. We document similar findings for tests based on the CHS variance estimator in our simulations.

To improve the performance of tests based on the CHS variance estimator, we derive fixed-$b$ asymptotic results (see Kiefer and Vogelsang, 2005, Sun et al., 2008, Vogelsang, 2012, Zhang and Shao, 2013, Sun, 2014, Bester et al., 2016, and Lazarus et al., 2021). Fixed-$b$ asymptotics captures some important effects of the bandwidth and kernel choices on the finite sample behavior of the variance estimator and tests and provides reference distributions that can be used to obtain critical values that depend on the bandwidth (and kernel). Our asymptotic results are obtained for $N$ and $T$ jointly going to infinity and leverage the joint asymptotic framework developed by Phillips and Moon (1999). The limiting distributions of tests based on the CHS or BCCHS estimator are not asymptotically pivotal, so we propose a plug-in method of simulating fixed-$b$ critical values. One key finding is that the CHS variance estimator has a multiplicative bias given by $1 - b + \frac{1}{3}b^2 \leq 1$, resulting in a downward bias that becomes more pronounced as the bandwidth increases.
By simply dividing the CHS variance estimator by $1 - b + \frac{1}{3}b^2$ we obtain a simple bias-corrected variance estimator that improves the performance of tests based on the CHS variance estimator even without using plug-in fixed-$b$ critical values. We label this bias-corrected CHS variance estimator as BCCHS.

As a purely algebraic result, we show that the CHS variance estimator is the sum of the Arellano cluster and Driscoll-Kraay variance estimators minus the "average of HACs" variance estimator. We show that dropping the "average of HACs" component in conjunction with bias correcting the Driscoll-Kraay component removes the asymptotic bias in the CHS variance estimator and has the same fixed-$b$ limit as the BCCHS variance estimator. We label the resulting variance estimator of this second bias correction approach as the DKA (Driscoll-Kraay+Arellano) variance estimator. Similar ideas are also used by Davezies et al. (2018) and MacKinnon et al. (2021), who argue that the removal of the negative and small order component in the variance estimator brings a computational advantage in the sense that the variance estimates are ensured to be positive semi-definite. In our simulations we find that negative CHS variance estimates can occur up to 6.4% of the time. An advantage of the DKA variance estimator is guaranteed positive semi-definiteness. The DKA variance estimator also tends to deliver tests with better finite sample coverage probabilities, although there are exceptions: when the data is independent and identically distributed (i.i.d.) in both the cross-section and time dimensions, we show the DKA estimator has a different fixed-$b$ limit and results in tests that are conservative, including the case where the bandwidth is small.¹ The fixed-$b$ limit of the CHS variance estimator is also different in the i.i.d. case, but tests based on it remain robust when the bandwidth is small.

¹ In the small bandwidth case we find that the limit of the DKA variance estimator is twice as big as the population variance for i.i.d. data, a finding similar to Theorem 2 in MacKinnon et al. (2021) in a multiway clustering setting. We thank a referee for pointing out the similarity between our results for the DKA variance estimator and the results in MacKinnon et al. (2021) for multiway cluster variance estimators when the data is i.i.d.

In a finite sample simulation study, we compare sample coverage probabilities of confidence intervals based on the CHS, BCCHS, and DKA variance estimators using critical values from both the standard normal distribution and the fixed-$b$ limits. The fixed-$b$ limits of the test statistics constructed with these three variance estimators are not pivotal, so we use a simulation method to obtain the critical values via a plug-in estimator approach to handle asymptotic nuisance parameters. While the plug-in fixed-$b$ critical values can substantially improve coverage rates relative to using standard critical values when using the CHS variance estimator, improvements from simply using the bias corrections are impressive. In the case of data-dependent bandwidths, the plug-in fixed-$b$ critical values provide further improvements in finite sample coverage probabilities when neither $T$ nor $N$ is large. Conversely, when both $N$ and $T$ are very small, bias correction alone can give more accurate finite sample coverage probabilities than bias correction with plug-in fixed-$b$ critical values. Similar results hold for tests based on the DKA variance estimator.
Overall, four different approaches for within- and across-cluster dependence robust tests are proposed: two are simple bias corrections using the BCCHS and DKA estimators, and the other two combine the BCCHS and DKA estimators with plug-in fixed-$b$ critical values. Even though tests based on BCCHS and DKA are asymptotically equivalent under the main assumptions, their finite sample performance is distinguishable. Based on theory and simulation results, we provide comprehensive empirical guidance that hinges on the researcher's assessment of the data, the model, and the priorities of the test.

The rest of the paper is organized as follows. In Section 2.2 we sketch the algebra of the CHS estimator and rewrite it as a linear combination of three well-known variance estimators. In Section 2.3 we derive fixed-$b$ limiting distributions of CHS-based tests for pooled ordinary least squares (POLS) estimators in a simple location panel model and a linear panel regression model. In Section 2.4 we derive the fixed-$b$ asymptotic bias of the CHS estimator and propose two bias-corrected variance estimators. We also derive fixed-$b$ limits for tests based on the bias-corrected variance estimators. When the data is i.i.d. a key assumption for our asymptotic results no longer holds, and we show that the asymptotic limits change in this case. Section 2.5 presents finite sample simulation results that illustrate the relative performance of $t$-tests based on the variance estimators. Some theoretical results for the two-way-fixed-effects (TWFE) estimator are also discussed along with the simulation. In Section 2.6 we illustrate the practical implications of the bias corrections and the use of fixed-$b$ critical values in an empirical example. Section 2.7 concludes the paper with guidance for empirical practice and a discussion of the limitations of the proposed approaches.

2.2 A Variance Estimator Robust to Two-Way Clustering

We first motivate the estimator of the asymptotic variance of the pooled ordinary least squares (POLS) estimator under arbitrary dependence in both the time and cross-section dimensions. Consider the linear panel model
$$y_{it} = x_{it}'\beta + u_{it}, \quad i = 1, \ldots, N, \quad t = 1, \ldots, T, \tag{2.1}$$
where $y_{it}$ is the dependent variable, $x_{it}$ is a $k \times 1$ vector of covariates, $u_{it}$ is the error term, and $\beta$ is the coefficient vector. Let $\hat{\beta}$ be the POLS estimator of $\beta$. For illustrative purposes the variance of $\hat{\beta}$ can be approximated as
$$\mathrm{Var}(\hat{\beta}) \approx \hat{Q}^{-1}\Omega_{NT}\hat{Q}^{-1},$$
where $\hat{Q} := \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}x_{it}'$ and $\Omega_{NT} := \mathrm{Var}\left(\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} v_{it}\right)$ with $v_{it} := x_{it}u_{it}$.

Without imposing assumptions on the dependence structure of $v_{it}$, it has been shown, algebraically, that $\Omega_{NT}$ has the following form (see Thompson, 2011 and Chiang et al., 2024):
$$\Omega_{NT} = \frac{1}{N^2T^2}\Bigg[\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}\mathrm{E}\big(v_{it}v_{is}'\big) + \sum_{t=1}^{T}\sum_{i=1}^{N}\sum_{j=1}^{N}\mathrm{E}\big(v_{it}v_{jt}'\big) - \sum_{t=1}^{T}\sum_{i=1}^{N}\mathrm{E}\big(v_{it}v_{it}'\big)$$
$$+ \sum_{m=1}^{T-1}\sum_{t=1}^{T-m}\Bigg\{\mathrm{E}\Bigg[\bigg(\sum_{i=1}^{N}v_{it}\bigg)\bigg(\sum_{j=1}^{N}v_{j,t+m}'\bigg)\Bigg] + \mathrm{E}\Bigg[\bigg(\sum_{i=1}^{N}v_{i,t+m}\bigg)\bigg(\sum_{j=1}^{N}v_{j,t}'\bigg)\Bigg] - \sum_{i=1}^{N}\mathrm{E}\big(v_{it}v_{i,t+m}'\big) - \sum_{i=1}^{N}\mathrm{E}\big(v_{i,t+m}v_{i,t}'\big)\Bigg\}\Bigg].$$

Based on this decomposition of $\Omega_{NT}$, Thompson (2011) and Chiang et al. (2024) each propose a truncation-type variance estimator.
In particular, Chiang et al. (2024) replace the Thompson (2011) truncation scheme with a Bartlett kernel and establish the consistency of their variance estimator while allowing two-way clustering effects with serially correlated stationary time effects. As an asymptotic approximation, appealing to consistency of the estimated variance allows the asymptotic variance to be treated as known when generating asymptotic critical values for inference. While convenient, such a consistency result does not capture the impact of the choice of $M$ and kernel function on the finite sample behavior of the variance estimator and any resulting size distortions of test statistics. To capture some of the finite sample impacts of the choice of $M$ and kernel, we apply the fixed-$b$ approach of Kiefer and Vogelsang (2005).

Notably, the CHS variance estimator can be decomposed into three well-known variance estimators, which will be helpful when we apply the fixed-$b$ approximation. Using straightforward algebra, one can show that the CHS variance estimator defined in equation (2.12) of Chiang et al. (2024) can be rewritten as
$$\hat{V}_{\mathrm{CHS}} := \hat{Q}^{-1}\hat{\Omega}_{\mathrm{CHS}}\hat{Q}^{-1}, \tag{2.2}$$
$$\hat{\Omega}_{\mathrm{CHS}} := \hat{\Omega}_{\mathrm{A}} + \hat{\Omega}_{\mathrm{DK}} - \hat{\Omega}_{\mathrm{NW}}, \tag{2.3}$$
where, with the Bartlett kernel defined as $k\big(\frac{m}{M}\big) = 1 - \frac{m}{M}$ and $M$ being the truncation parameter,
$$\hat{\Omega}_{\mathrm{A}} := \frac{1}{N^2T^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}\hat{v}_{it}\hat{v}_{is}', \tag{2.4}$$
$$\hat{\Omega}_{\mathrm{DK}} := \frac{1}{N^2T^2}\sum_{t=1}^{T}\sum_{s=1}^{T}k\bigg(\frac{|t-s|}{M}\bigg)\bigg(\sum_{i=1}^{N}\hat{v}_{it}\bigg)\bigg(\sum_{j=1}^{N}\hat{v}_{js}'\bigg), \tag{2.5}$$
$$\hat{\Omega}_{\mathrm{NW}} := \frac{1}{N^2T^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}k\bigg(\frac{|t-s|}{M}\bigg)\hat{v}_{it}\hat{v}_{is}'. \tag{2.6}$$
Notice that (2.4) is the "cluster by individuals" estimator proposed by Liang and Zeger (1986) and Arellano (1987), (2.5) is the "HAC of cross-section averages" estimator proposed by Driscoll and Kraay (1998), and (2.6) is the "average of HACs" estimator (see Petersen, 2009 and Vogelsang, 2012). In other words, $\hat{\Omega}_{\mathrm{CHS}}$ is a linear combination of three well-known variance estimators that have been proposed to handle particular forms of dependence structure. While there are some existing asymptotic results for the components in (2.3) that are potentially relevant (e.g. Hansen, 2007, Vogelsang, 2012, and Chiang et al., 2024), these results are derived either under one-way dependence or are not sufficiently comprehensive to directly obtain a fixed-$b$ result for $\hat{\Omega}_{\mathrm{CHS}}$. Some new theoretical results are needed.
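The decomposition (2.3)-(2.6) is straightforward to compute directly. The following sketch (an illustration under the assumption of a scalar score $\hat{v}_{it}$; the helper name `omega_components` is hypothetical, not from the paper) forms the three components with the Bartlett kernel:

```python
import numpy as np

def bartlett_weights(T, M):
    # w[t, s] = max(0, 1 - |t - s| / M), the Bartlett kernel k(|t-s|/M)
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return np.maximum(0.0, 1.0 - lags / M)

def omega_components(vhat, M):
    # vhat[i, t]: scalar score for unit i at time t
    N, T = vhat.shape
    w = bartlett_weights(T, M)
    scale = 1.0 / (N**2 * T**2)
    row_sums = vhat.sum(axis=1)                 # sum over t for each i
    col_sums = vhat.sum(axis=0)                 # sum over i for each t
    omega_A = scale * np.sum(row_sums**2)       # cluster by individuals, (2.4)
    omega_DK = scale * col_sums @ w @ col_sums  # Driscoll-Kraay, (2.5)
    omega_NW = scale * np.einsum('it,ts,is->', vhat, w, vhat)  # avg of HACs, (2.6)
    return omega_A, omega_DK, omega_NW

vhat = np.random.default_rng(1).normal(size=(25, 25))
oA, oDK, oNW = omega_components(vhat, M=5)
omega_CHS = oA + oDK - oNW                      # the combination in (2.3)
```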
2.3 Fixed-𝑏 Asymptotic Results

2.3.1 The Multivariate Mean Case

To set ideas and intuition we first focus on a simple panel mean model (panel location model) of a $k \times 1$ random vector $y_{it}$ and then extend the analysis to the linear regression case. We use a large-$N$ and large-$T$ framework where $N/T \to c$ for some constant $c$ such that $0 < c < \infty$. As a natural way to model panel data with two-way effects, we follow Chiang et al. (2024) and assume that $y_{it}$ is generated as follows.

Assumption 2.1 $y_{it} = \theta + f(\alpha_i, \gamma_t, \varepsilon_{it})$, where $\theta = \mathrm{E}(y_{it})$ and $f$ is an unknown Borel-measurable function, the sequences $\{\alpha_i\}$, $\{\gamma_t\}$, and $\{\varepsilon_{it}\}$ are mutually independent, $\alpha_i$ is i.i.d. across $i$, $\varepsilon_{it}$ is i.i.d. across $i$ and $t$, and $\gamma_t$ is a strictly stationary serially correlated process.

The time component, $\gamma_t$, is allowed to have serial correlation given that panel data typically has serial correlation beyond that induced by individual effects. As pointed out by Chiang et al. (2024) at the beginning of their Section 3, the data-generating process in Assumption 2.1 is a strict generalization of the representation developed by Hoover (1979), Aldous (1981) and Kallenberg (1989) (the so-called AHK representation) precisely because $\gamma_t$ is allowed to have serial correlation. The AHK representation is not sufficient here because it was developed for data drawn from an infinite array of jointly exchangeable random variables, in which case $\gamma_t$ would not have serial correlation.

Using the representation in Assumption 2.1, Chiang et al. (2024) develop the following decomposition of $y_{it}$. Denoting $a_i = \mathrm{E}(y_{it} - \theta \mid \alpha_i)$, $g_t = \mathrm{E}(y_{it} - \theta \mid \gamma_t)$, and $e_{it} = (y_{it} - \theta) - a_i - g_t$, one can decompose $y_{it} - \theta$ as
$$y_{it} - \theta = a_i + g_t + e_{it} =: v_{it}.$$
Chiang et al. (2024) show that the components are mean zero and that $e_{it}$ is also mean zero conditional on $a_i$ and conditional on $g_t$. The individual component, $a_i$, is i.i.d. across $i$, and the time component, $g_t$, is stationary. Conditional on $\gamma_t$, the $e_{it}$ component is independent across $i$. Finally, the three components are uncorrelated with each other, with $a_i$ and $g_t$ being independent of each other. See Section 3.1 of Chiang et al. (2024) for details.

We can estimate $\theta$ using the pooled sample mean estimator given by $\hat{\theta} = (NT)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} y_{it}$. Rewriting the sample mean using the component structure representation for $y_{it}$ gives
$$\hat{\theta} - \theta = \frac{1}{N}\sum_{i=1}^{N} a_i + \frac{1}{T}\sum_{t=1}^{T} g_t + \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} e_{it} =: \bar{a} + \bar{g} + \bar{e}. \tag{2.7}$$
The Chiang et al. (2024) variance estimator of $\hat{\theta}$ is given by (2.3) with $\hat{v}_{it} = y_{it} - \hat{\theta}$ used in (2.4)-(2.6). To obtain fixed-$b$ results for $\hat{\Omega}_{\mathrm{CHS}}$ we rewrite the formula for $\hat{\Omega}_{\mathrm{CHS}}$ in terms of the following two partial sum processes of $\hat{v}_{it}$:
$$\hat{S}_{it} = \sum_{j=1}^{t}\hat{v}_{ij} = t(a_i - \bar{a}) + \sum_{j=1}^{t}\big(g_j - \bar{g}\big) + \sum_{j=1}^{t}\big(e_{ij} - \bar{e}\big), \tag{2.8}$$
$$\hat{\bar{S}}_t = \sum_{i=1}^{N}\hat{S}_{it} = \sum_{i=1}^{N}\sum_{j=1}^{t}\hat{v}_{ij} = N\sum_{j=1}^{t}\big(g_j - \bar{g}\big) + \sum_{i=1}^{N}\sum_{j=1}^{t}\big(e_{ij} - \bar{e}\big). \tag{2.9}$$
Note that the $a_i$ component drops from (2.9) because $\sum_{i=1}^{N}(a_i - \bar{a}) = 0$. The Arellano component (2.4) of $\hat{\Omega}_{\mathrm{CHS}}$ is obviously a simple function of (2.8) with $t = T$. The HAC components (2.5) and (2.6) can be written in terms of (2.9) and (2.8) using fixed-$b$ algebra (see Vogelsang, 2012). Therefore, the Chiang et al. (2024) variance estimator has the following equivalent formula:
$$\hat{\Omega}_{\mathrm{CHS}} = \frac{1}{N^2T^2}\sum_{i=1}^{N}\hat{S}_{iT}\hat{S}_{iT}' + \frac{1}{N^2T^2}\Bigg\{\frac{2}{M}\sum_{t=1}^{T-1}\hat{\bar{S}}_t\hat{\bar{S}}_t' - \frac{1}{M}\sum_{t=1}^{T-M-1}\Big(\hat{\bar{S}}_t\hat{\bar{S}}_{t+M}' + \hat{\bar{S}}_{t+M}\hat{\bar{S}}_t'\Big)\Bigg\} \tag{2.10}$$
$$- \frac{1}{N^2T^2}\sum_{i=1}^{N}\Bigg\{\frac{2}{M}\sum_{t=1}^{T-1}\hat{S}_{it}\hat{S}_{it}' - \frac{1}{M}\sum_{t=1}^{T-M-1}\Big(\hat{S}_{it}\hat{S}_{i,t+M}' + \hat{S}_{i,t+M}\hat{S}_{i,t}'\Big) - \frac{1}{M}\sum_{t=T-M}^{T-1}\Big(\hat{S}_{it}\hat{S}_{iT}' + \hat{S}_{iT}\hat{S}_{it}'\Big) + \hat{S}_{iT}\hat{S}_{iT}'\Bigg\}. \tag{2.11}$$
Define three $k \times k$ matrices $\Lambda_a$, $\Lambda_g$, and $\Lambda_e$ such that
$$\Lambda_a\Lambda_a' = \mathrm{E}(a_ia_i'), \quad \Lambda_g\Lambda_g' = \sum_{\ell=-\infty}^{\infty}\mathrm{E}[g_tg_{t+\ell}'], \quad \Lambda_e\Lambda_e' = \sum_{\ell=-\infty}^{\infty}\mathrm{E}[e_{it}e_{i,t+\ell}']. \tag{2.12}$$
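The partial-sum rewrite in (2.10)-(2.11) rests on a purely algebraic identity for the Bartlett kernel. A numerical sanity check of that identity for a single scalar series (a sketch under the stated parameterization $k(m/M) = 1 - m/M$ with integer $M < T$):

```python
import numpy as np

rng = np.random.default_rng(2)
T, M = 50, 7
v = rng.normal(size=T)
S = np.cumsum(v)                              # partial sums S_t

lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
k = np.maximum(0.0, 1.0 - lags / M)
lhs = v @ k @ v                               # sum_{t,s} k(|t-s|/M) v_t v_s

rhs = (2 / M) * np.sum(S[:-1] ** 2) \
    - (2 / M) * np.sum(S[: T - M - 1] * S[M : T - 1]) \
    - (2 / M) * np.sum(S[T - M - 1 : T - 1] * S[-1]) \
    + S[-1] ** 2                              # end terms vanish when S_T = 0
print(np.isclose(lhs, rhs))                   # True
```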
The following assumption is used to obtain an asymptotic result for (2.7) and a fixed-$b$ asymptotic result for $\hat{\Omega}_{\mathrm{CHS}}$. Throughout the paper, let $\|\cdot\|$ denote the Euclidean norm for matrices, and let $\lambda_{min}[\cdot]$ denote the smallest eigenvalue of a square matrix.

Assumption 2.2 For some $s > 1$ and $\delta > 0$,
(i) $\mathrm{E}[y_{it}] = \theta$ and $\mathrm{E}[\|y_{it}\|^{4(s+\delta)}] < \infty$.
(ii) $\gamma_t$ is an $\alpha$-mixing sequence with size $2s/(s-1)$, i.e., $\alpha_\gamma(\ell) = O(\ell^{-\lambda})$ for a $\lambda > 2s/(s-1)$.
(iii) $\lambda_{min}[\Lambda_a\Lambda_a'] > 0$ and/or $\lambda_{min}[\Lambda_g\Lambda_g'] > 0$, and $N/T \to c$ as $(N, T) \to \infty$ for some constant $c$.
(iv) $M = [bT]$ where $b \in (0, 1]$.

Assumption 2.2(i) assumes the mean of $y_{it}$ exists and $y_{it}$ has finite fourth moments. Assumption 2.2(ii) assumes weak dependence of $\gamma_t$ using a mixing condition. Assumption 2.2(i)-(ii) follow Chiang et al. (2024). Assumption 2.2(iii) is a non-degeneracy restriction on the projected individual and time components. Clearly, when data is i.i.d. over both the cross-section and time dimensions, this condition does not hold. Because the fixed-$b$ limits of $\hat{\Omega}_{\mathrm{CHS}}$ and its associated test statistics turn out to be different in the i.i.d. case, we discuss it separately in Section 2.4. Assumption 2.2(iii) also rules out the pathological case described in Example 1.7 of Menzel (2021): when $y_{it} = \alpha_i\gamma_t + \varepsilon_{it}$ with $\mathrm{E}(\alpha_i) = \mathrm{E}(\gamma_t) = 0$, one can easily verify that $a_i = g_t = 0$, in which case the limiting distribution of appropriately scaled $\hat{\theta}$ is non-Gaussian. Assumption 2.2(iv) uses the fixed-$b$ asymptotic nesting for the bandwidth.

The following theorem gives an asymptotic result for appropriately scaled $\hat{\theta}$ and a fixed-$b$ asymptotic result for appropriately scaled $\hat{\Omega}_{\mathrm{CHS}}$.

Theorem 2.1 Let $z_k$ be a $k \times 1$ vector of independent standard normal random variables, and let $W_k(r)$, $r \in (0,1]$, be a $k \times 1$ vector of independent standard Wiener processes independent of $z_k$. Suppose Assumptions 2.1 and 2.2 hold; then as $(N, T) \to \infty$,
$$\sqrt{N}\big(\hat{\theta} - \theta\big) \Rightarrow \Lambda_a z_k + \sqrt{c}\,\Lambda_g W_k(1),$$
$$N\hat{\Omega}_{\mathrm{CHS}} \Rightarrow h(b)\Lambda_a\Lambda_a' + c\,\Lambda_g P\big(b, \widetilde{W}_k(r)\big)\Lambda_g', \tag{2.13}$$
where $h(b) := 1 - b + \frac{1}{3}b^2$, $\widetilde{W}_k(r) := W_k(r) - rW_k(1)$, i.e., a Brownian bridge, and
$$P\big(b, \widetilde{W}_k(r)\big) := \frac{2}{b}\int_0^1 \widetilde{W}_k(r)\widetilde{W}_k(r)'\,dr - \frac{1}{b}\int_0^{1-b}\big[\widetilde{W}_k(r)\widetilde{W}_k(r+b)' + \widetilde{W}_k(r+b)\widetilde{W}_k(r)'\big]\,dr.$$

The proof of Theorem 2.1 is given in Appendix 2A. The limit of $\sqrt{N}(\hat{\theta} - \theta)$ was obtained by Chiang et al. (2024). Because $z_k$ and $W_k(1)$ are vectors of independent standard normals that are independent of each other, $\Lambda_a z_k + \sqrt{c}\Lambda_g W_k(1)$ is a vector of normal random variables with variance-covariance matrix $\Lambda_a\Lambda_a' + c\Lambda_g\Lambda_g'$. The $\Lambda_g P\big(b, \widetilde{W}_k(r)\big)\Lambda_g'$ component of (2.13) is equivalent to the fixed-$b$ limit obtained by Kiefer and Vogelsang (2005) in stationary time series settings. Obviously, (2.13) is different than the limit obtained by Kiefer and Vogelsang (2005) because of the $h(b)\Lambda_a\Lambda_a'$ term. As the proof illustrates, this term is the limit of the "cluster by individuals" (2.4) and "average of HACs" (2.6) components, whereas the $c\Lambda_g P\big(b, \widetilde{W}_k(r)\big)\Lambda_g'$ term is the limit of the "HAC of averages" (2.5). Interestingly, Kiefer and Vogelsang (2005) showed that
$$\mathrm{E}\Big(P\big(b, \widetilde{W}_k(r)\big)\Big) = \Big(1 - b + \frac{1}{3}b^2\Big)I_k = h(b)I_k, \tag{2.14}$$
where $I_k$ is a $k \times k$ identity matrix. The fact that both terms in the limit of $N\hat{\Omega}_{\mathrm{CHS}}$ are proportional to $h(b)$ suggests a simple bias correction that is discussed in Section 2.4.1. Because of the component structure of (2.13), the fixed-$b$ limits of $t$ and Wald statistics based on $\hat{\Omega}_{\mathrm{CHS}}$ are not pivotal. We provide details on test statistics after extending our results to the case of a linear panel regression.
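The moment formula (2.14) can be confirmed numerically. A sketch for $k = 1$, approximating the Wiener process by scaled partial sums as is standard in the fixed-$b$ literature:

```python
import numpy as np

def P_draw(b, n=1000, rng=None):
    # One draw of P(b, W~) for k = 1 with the Bartlett kernel.
    rng = rng or np.random.default_rng()
    W = np.cumsum(rng.normal(size=n)) / np.sqrt(n)     # W(r) on a grid
    r = np.arange(1, n + 1) / n
    Wt = W - r * W[-1]                                 # Brownian bridge W~(r)
    m = int(b * n)
    return (2 / b) * np.mean(Wt**2) \
         - (2 / b) * np.sum(Wt[: n - m] * Wt[m:]) / n  # discretized integrals

b = 0.4
rng = np.random.default_rng(3)
draws = [P_draw(b, rng=rng) for _ in range(2000)]
print(np.mean(draws), 1 - b + b**2 / 3)                # both close to 0.653
```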
2.3.2 The Linear Panel Regression Case

It is straightforward to extend our results to the case of a linear panel regression given by (2.1). The POLS estimator of $\beta$ is
$$\hat{\beta} = \beta + \hat{Q}^{-1}\Bigg(\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}u_{it}\Bigg), \tag{2.15}$$
where $\hat{Q} := \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}x_{it}'$ as in Section 2.2. Following Chiang et al. (2024), we assume the components of the panel regression are generated from the component structure: $(y_{it}, x_{it}', u_{it})' = f(\alpha_i, \gamma_t, \varepsilon_{it})$ where $f$ is an unknown Borel-measurable function, the sequences $\{\alpha_i\}$, $\{\gamma_t\}$, and $\{\varepsilon_{it}\}$ are mutually independent, $\alpha_i$ is i.i.d. across $i$, $\varepsilon_{it}$ is i.i.d. across $i$ and $t$, and $\gamma_t$ is a strictly stationary serially correlated process. Define the vector $v_{it} = x_{it}u_{it}$. Similar to the simple mean model we can write $a_i = \mathrm{E}(v_{it} \mid \alpha_i)$, $g_t = \mathrm{E}(v_{it} \mid \gamma_t)$, $e_{it} = v_{it} - a_i - g_t$, giving the decomposition $v_{it} = a_i + g_t + e_{it}$. The Chiang et al. (2024) variance estimator of $\hat{\beta}$ is given by (2.2) with $\hat{v}_{it}$ in (2.4)-(2.6) now defined as $\hat{v}_{it} = x_{it}\hat{u}_{it}$ where $\hat{u}_{it} = y_{it} - x_{it}'\hat{\beta}$ are the POLS residuals.

The following assumption is used to obtain an asymptotic result for (2.15) and a fixed-$b$ asymptotic result for $\hat{\Omega}_{\mathrm{CHS}}$ in the linear panel case.

Assumption 2.3 For some $s > 1$ and $\delta > 0$,
(i) $(y_{it}, x_{it}', u_{it})' = f(\alpha_i, \gamma_t, \varepsilon_{it})$ where $\{\alpha_i\}$, $\{\gamma_t\}$, and $\{\varepsilon_{it}\}$ are mutually independent sequences, $\alpha_i$ is i.i.d. across $i$, $\varepsilon_{it}$ is i.i.d. across $i$ and $t$, and $\gamma_t$ is strictly stationary.
(ii) $\mathrm{E}[x_{it}u_{it}] = 0$, $\lambda_{min}\big[\mathrm{E}[x_{it}x_{it}']\big] > 0$, $\mathrm{E}[\|x_{it}\|^{8(s+\delta)}] < \infty$, and $\mathrm{E}[\|u_{it}\|^{8(s+\delta)}] < \infty$.
(iii) $\gamma_t$ is an $\alpha$-mixing sequence with size $2s/(s-1)$, i.e., $\alpha_\gamma(\ell) = O(\ell^{-\lambda})$ for a $\lambda > 2s/(s-1)$.
(iv) $\lambda_{min}[\Lambda_a\Lambda_a'] > 0$ and/or $\lambda_{min}[\Lambda_g\Lambda_g'] > 0$, and $N/T \to c$ as $(N, T) \to \infty$ for some constant $c$.
(v) $M = [bT]$ where $b \in (0, 1]$.

Assumption 2.3 can be regarded as a counterpart of Assumptions 2.1 and 2.2 with Assumption 2.3(ii) being strengthened. It is very similar to its counterpart in Chiang et al. (2024), with a main difference being the use of the fixed-$b$ asymptotic nesting for the bandwidth, $M$. For the same reason mentioned in the previous section, we discuss the case where $(x_{it}, u_{it})$ are i.i.d. separately in Section 2.4. The next theorem presents the joint limit of the POLS estimator and the fixed-$b$ joint limit of the CHS variance estimator.

Theorem 2.2 Let $z_k$, $W_k(r)$, $\widetilde{W}_k(r)$, $P\big(b, \widetilde{W}_k(r)\big)$ and $h(b)$ be defined as in Theorem 2.1. Suppose Assumption 2.3 holds for model (2.1); then as $(N, T) \to \infty$,
$$\sqrt{N}\big(\hat{\beta} - \beta\big) \Rightarrow Q^{-1}B_k(c), \quad \text{where } B_k(c) := \Lambda_a z_k + \sqrt{c}\,\Lambda_g W_k(1),$$
and
$$N\hat{V}_{\mathrm{CHS}}(\hat{\beta}) \Rightarrow Q^{-1}V_k(b, c)Q^{-1}, \tag{2.16}$$
where $V_k(b, c) := h(b)\Lambda_a\Lambda_a' + c\Lambda_g P\big(b, \widetilde{W}_k(r)\big)\Lambda_g'$.

The proof of Theorem 2.2 is given in Appendix 2A. We can see that the limiting random variable, $V_k(b, c)$, depends on the choice of truncation parameter, $M$, through $b$. The use of the Bartlett kernel is reflected in the functional form of $P\big(b, \widetilde{W}_k(r)\big)$ as well as the scaling term $h(b)$ on $\Lambda_a\Lambda_a'$. Use of a different kernel would result in different functional forms for these limits. Because of (2.14), it follows that
$$\mathrm{E}\big(V_k(b, c)\big) = h(b)\Lambda_a\Lambda_a' + c\Lambda_g\mathrm{E}\Big[P\big(b, \widetilde{W}_k(r)\big)\Big]\Lambda_g' = h(b)\big(\Lambda_a\Lambda_a' + c\Lambda_g\Lambda_g'\big). \tag{2.17}$$
The scalar $h(b)$ can be viewed as a multiplicative bias term that depends on the bandwidth sample size ratio, $b = M/T$.
We leverage this fact to implement a simple feasible bias correction for the CHS variance estimator that is explored below.

Using the theoretical results developed in this section, we next examine the properties of test statistics based on the POLS estimator and the CHS variance estimator. We also analyze tests based on two variants of the CHS variance estimator. One is a bias-corrected estimator. The other is a variance estimator guaranteed to be positive semi-definite that is also bias-corrected.

2.4 Inference

In regression model (2.1) we focus on tests of linear hypotheses of the form
$$H_0: R\beta = r, \qquad H_1: R\beta \neq r,$$
where $R$ is a $q \times k$ matrix ($q \leq k$) with full rank equal to $q$, and $r$ is a $q \times 1$ vector. Using $\hat{V}_{\mathrm{CHS}}(\hat{\beta})$ as given by (2.2), define a Wald statistic as
$$W_{\mathrm{CHS}} = \big(R\hat{\beta} - r\big)'\Big(R\hat{V}_{\mathrm{CHS}}(\hat{\beta})R'\Big)^{-1}\big(R\hat{\beta} - r\big).$$
When $q = 1$, we can define a $t$-statistic as
$$t_{\mathrm{CHS}} = \frac{R\hat{\beta} - r}{\sqrt{R\hat{V}_{\mathrm{CHS}}(\hat{\beta})R'}}.$$
Appropriately scaling the numerators and denominators of the test statistics and applying Theorem 2.2, we obtain under $H_0$:
$$W_{\mathrm{CHS}} = \sqrt{N}\big(R\hat{\beta} - r\big)'\Big(RN\hat{V}_{\mathrm{CHS}}(\hat{\beta})R'\Big)^{-1}\sqrt{N}\big(R\hat{\beta} - r\big) \Rightarrow \big(RQ^{-1}B_k(c)\big)'\big(RQ^{-1}V_k(b,c)Q^{-1}R'\big)^{-1}\big(RQ^{-1}B_k(c)\big) =: W^{\infty}_{\mathrm{CHS}}, \tag{2.18}$$
$$t_{\mathrm{CHS}} = \frac{\sqrt{N}\big(R\hat{\beta} - r\big)}{\sqrt{RN\hat{V}_{\mathrm{CHS}}(\hat{\beta})R'}} \Rightarrow \frac{RQ^{-1}B_k(c)}{\sqrt{RQ^{-1}V_k(b,c)Q^{-1}R'}} =: t^{\infty}_{\mathrm{CHS}}. \tag{2.19}$$
The limits of $W_{\mathrm{CHS}}$ and $t_{\mathrm{CHS}}$ are similar to the fixed-$b$ limits obtained by Kiefer and Vogelsang (2005) but have distinct differences. First, the form of $V_k(b, c)$ depends on two variance matrices rather than one. Second, the variance matrices do not scale out of the statistics. Therefore, the fixed-$b$ limits given by (2.18) and (2.19) are not pivotal. We propose a plug-in method for the simulation of critical values from these asymptotic random variables.

For the case where $b$ is small, the fixed-$b$ critical values are close to $\chi^2_q$ and $N(0,1)$ critical values respectively. This can be seen by computing the probability limits of the asymptotic distributions as $b \to 0$. In particular, using the fact that $\operatorname{plim}_{b\to0} P\big(b, \widetilde{W}_k(r)\big) = I_k$ (see Kiefer and Vogelsang, 2005), it follows that
$$\operatorname*{plim}_{b\to0} V_k(b,c) = \operatorname*{plim}_{b\to0}\Big[h(b)\Lambda_a\Lambda_a' + c\Lambda_g P\big(b, \widetilde{W}_k(r)\big)\Lambda_g'\Big] = \Lambda_a\Lambda_a' + c\Lambda_g\Lambda_g' = \mathrm{Var}\big(B_k(c)\big),$$
where $h(\cdot)$ and $P(\cdot)$ are defined in Theorem 2.1. Therefore, it follows that
$$\operatorname*{plim}_{b\to0}\Big[\big(RQ^{-1}B_k(c)\big)'\big(RQ^{-1}V_k(b,c)Q^{-1}R'\big)^{-1}\big(RQ^{-1}B_k(c)\big)\Big] = \big(RQ^{-1}B_k(c)\big)'\big(RQ^{-1}\mathrm{Var}(B_k(c))Q^{-1}R'\big)^{-1}\big(RQ^{-1}B_k(c)\big) \sim \chi^2_q,$$
and
$$\operatorname*{plim}_{b\to0}\Bigg[\frac{RQ^{-1}B_k(c)}{\sqrt{RQ^{-1}V_k(b,c)Q^{-1}R'}}\Bigg] = \frac{RQ^{-1}B_k(c)}{\sqrt{RQ^{-1}\mathrm{Var}(B_k(c))Q^{-1}R'}} \sim N(0,1).$$
In practice, there will not be a substantial difference between using $\chi^2_q$ and $N(0,1)$ critical values and fixed-$b$ critical values for small bandwidths. However, for larger bandwidths more reliable inference can be obtained with fixed-$b$ critical values.

2.4.1 Bias-Corrected CHS Variance Estimator

We now leverage the form of the mean of the fixed-$b$ limit of the CHS variance estimator as given by (2.17) to propose a bias-corrected version of the CHS variance estimator. The idea is simple.
We can scale out the $h(b)$ multiplicative term evaluated at $b = M/T$ to make the CHS variance estimator an asymptotically unbiased estimator of $\Lambda_a\Lambda_a' + c\Lambda_g\Lambda_g'$, the variance of $B_k(c) = \Lambda_a z_k + \sqrt{c}\Lambda_g W_k(1)$. Define the bias-corrected CHS variance estimator as
$$\hat{V}_{\mathrm{BCCHS}}\big(\hat{\beta}\big) = \hat{Q}^{-1}\hat{\Omega}_{\mathrm{BCCHS}}\hat{Q}^{-1}, \qquad \hat{\Omega}_{\mathrm{BCCHS}} = \Big(h\big(\tfrac{M}{T}\big)\Big)^{-1}\hat{\Omega}_{\mathrm{CHS}},$$
and the corresponding Wald and $t$-statistics under the null hypothesis $R\beta = r$ as
$$W_{\mathrm{BCCHS}} = \big(R\hat{\beta} - r\big)'\Big(R\hat{V}_{\mathrm{BCCHS}}\big(\hat{\beta}\big)R'\Big)^{-1}\big(R\hat{\beta} - r\big), \qquad t_{\mathrm{BCCHS}} = \frac{R\hat{\beta} - r}{\sqrt{R\hat{V}_{\mathrm{BCCHS}}\big(\hat{\beta}\big)R'}}.$$
Because $\hat{\Omega}_{\mathrm{BCCHS}}$ is a simple scalar multiple of $\hat{\Omega}_{\mathrm{CHS}}$, we easily obtain the fixed-$b$ limits
$$W_{\mathrm{BCCHS}} \Rightarrow h(b)W^{\infty}_{\mathrm{CHS}} =: W^{\infty}_{\mathrm{BCCHS}}, \tag{2.20}$$
$$t_{\mathrm{BCCHS}} \Rightarrow h(b)^{1/2}t^{\infty}_{\mathrm{CHS}} =: t^{\infty}_{\mathrm{BCCHS}}. \tag{2.21}$$
Notice that while the fixed-$b$ limits are different when using the bias-corrected CHS variance estimator, they are scalar multiples of the fixed-$b$ limits when using the original CHS variance estimator. Therefore, the fixed-$b$ critical values of $W_{\mathrm{BCCHS}}$ and $t_{\mathrm{BCCHS}}$ are proportional to the fixed-$b$ critical values of $W_{\mathrm{CHS}}$ and $t_{\mathrm{CHS}}$. As long as fixed-$b$ critical values are used, there is no practical effect on inference from using the bias-corrected CHS variance estimator. Where the bias correction matters is when $\chi^2_q$ and $N(0,1)$ critical values are used. In this case, the bias-corrected CHS variance estimator can provide more accurate finite sample inference. This will be illustrated by our finite sample simulations.

2.4.2 An Alternative Bias-Corrected Variance Estimator

As noted by Chiang et al. (2024), the CHS variance estimator does not ensure positive-definiteness, which is also the case for the clustered estimator proposed by Cameron et al. (2011). Davezies et al. (2018) and MacKinnon et al. (2021) point out that the double-counting adjustment term in the estimator of Cameron et al. (2011) is of small order, and removing the adjustment term has the computational advantage of guaranteeing positive semi-definiteness. Analogously, we can think of $\hat{\Omega}_{\mathrm{NW}}$, as given by (2.6), as a double-counting adjustment term. If we exclude this term, the variance estimator becomes the sum of two positive semi-definite terms and is guaranteed to be positive semi-definite. Another motivation for dropping (2.6) is that, under fixed-$b$ asymptotics, (2.6) simply contributes downward bias in the estimation of the $\Lambda_a\Lambda_a'$ term of $\mathrm{Var}(B_k(c))$ through the $-b + \frac{1}{3}b^2$ part of $h(b)$ in the $h(b)\Lambda_a\Lambda_a'$ portion of $V_k(b, c)$. Intuitively, the Arellano cluster estimator takes care of the serial correlation introduced by $a_i$, and the DK estimator takes care of the cross-section and time dependence introduced by $g_t$. From this perspective, $\hat{\Omega}_{\mathrm{NW}}$ is not needed. Accordingly, we propose a variance estimator that is the sum of the Arellano variance estimator and the bias-corrected DK variance estimator (labeled DKA hereafter), defined as
$$\hat{\Omega}_{\mathrm{DKA}} := \hat{\Omega}_{\mathrm{A}} + h(b)^{-1}\hat{\Omega}_{\mathrm{DK}},$$
where $\hat{\Omega}_{\mathrm{A}}$ and $\hat{\Omega}_{\mathrm{DK}}$ are defined in (2.4) and (2.5). Notice that we bias correct the DK component so that the resulting variance estimator is asymptotically unbiased under fixed-$b$ asymptotics. This can improve inference should $\chi^2_q$ or $N(0,1)$ critical values be used in practice.
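Both bias corrections are one-line adjustments once the components of (2.3) are in hand. A sketch for the scalar case, reusing the hypothetical `omega_components` helper sketched in Section 2.2:

```python
def h(b):
    # Multiplicative bias factor from Theorem 2.1 and (2.17)
    return 1.0 - b + b**2 / 3.0

def bias_corrected(vhat, M):
    N, T = vhat.shape
    oA, oDK, oNW = omega_components(vhat, M)   # helper sketched in Section 2.2
    b = M / T
    omega_bcchs = (oA + oDK - oNW) / h(b)      # BCCHS: rescale the CHS estimator
    omega_dka = oA + oDK / h(b)                # DKA: drop NW, bias-correct DK only
    return omega_bcchs, omega_dka
```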
The following theorem gives the fixed-$b$ limit of the scaled DKA variance estimator.

Theorem 2.3 Suppose Assumption 2.3 holds for model (2.1); then as $(N, T) \to \infty$,
$$N\hat{\Omega}_{\mathrm{DKA}} \Rightarrow \Lambda_a\Lambda_a' + c\Lambda_g h(b)^{-1}P\big(b, \widetilde{W}_k(r)\big)\Lambda_g' = h(b)^{-1}V_k(b, c). \tag{2.22}$$

The proof of Theorem 2.3 can be found in Appendix 2A. Define the statistics $W_{\mathrm{DKA}}$ and $t_{\mathrm{DKA}}$ analogous to $W_{\mathrm{BCCHS}}$ and $t_{\mathrm{BCCHS}}$ using the variance estimator for $\hat{\beta}$ given by $\hat{V}_{\mathrm{DKA}}\big(\hat{\beta}\big) = \hat{Q}^{-1}\hat{\Omega}_{\mathrm{DKA}}\hat{Q}^{-1}$. Applying Theorems 2.2 and 2.3, we obtain the fixed-$b$ limits of the Wald and $t$ statistics associated with the DKA variance estimator under the null:
$$W_{\mathrm{DKA}} \Rightarrow W^{\infty}_{\mathrm{BCCHS}}, \qquad t_{\mathrm{DKA}} \Rightarrow t^{\infty}_{\mathrm{BCCHS}},$$
which are the same as the limits of $W_{\mathrm{BCCHS}}$ and $t_{\mathrm{BCCHS}}$ given by (2.20) and (2.21).

2.4.3 Results for i.i.d. Data

While the DKA variance estimator is guaranteed to be positive semi-definite, this useful property comes with a potential cost. As is shown in Theorem 2 of MacKinnon et al. (2021), if the score $x_{it}u_{it}$ is i.i.d. over $i$ and $t$, or if clusters are formed at the intersection between individuals and time,² the probability limit of two-way cluster-robust variance estimators that drop the double-counting adjustment term, referred to as two-term variance estimators, is twice the size of the true variance. In other words, if the researcher believes there is clustering when there is none, the use of a two-term estimator would overestimate the asymptotic variance. The associated Wald and $t$-statistics will be scaled down, causing over-coverage (under-rejection) problems under the null hypothesis. The following assumption and theorem give fixed-$b$ results for the CHS, BCCHS and DKA statistics for the case of i.i.d. data.

² In our setting where the clustering only happens at individual and time levels, clustering at the intersection is the same as independence across individuals and times.

Assumption 2.4 For some $s > 1$ and $\delta > 0$,
(i) $(x_{it}, u_{it})$ are independent and identically distributed over $i$ and $t$.
(ii) $\mathrm{E}[x_{it}u_{it}] = 0$, $\lambda_{min}\big[\mathrm{E}[x_{it}x_{it}']\big] > 0$, $\mathrm{E}[\|x_{it}\|^{8(s+\delta)}] < \infty$, and $\mathrm{E}[\|u_{it}\|^{8(s+\delta)}] < \infty$.
(iii) $N/T \to c$ as $(N, T) \to \infty$ for some constant $c$.
(iv) $M = [bT]$ for $b \in (0, 1]$.

Theorem 2.4 Suppose Assumption 2.4 holds for model (2.1); then as $(N, T) \to \infty$,
$$W_{\mathrm{CHS}} \Rightarrow W_q(1)'\Big\{P\big(b, \widetilde{W}_q(r)\big)\Big\}^{-1}W_q(1) =: W^{\infty,iid}_{\mathrm{CHS}}, \qquad W_{\mathrm{BCCHS}} \Rightarrow h(b)W^{\infty,iid}_{\mathrm{CHS}} =: W^{\infty,iid}_{\mathrm{BCCHS}},$$
$$W_{\mathrm{DKA}} \Rightarrow W_q(1)'\Big\{I_q + h(b)^{-1}P\big(b, \widetilde{W}_q(r)\big)\Big\}^{-1}W_q(1) =: W^{\infty,iid}_{\mathrm{DKA}},$$
$$t_{\mathrm{CHS}} \Rightarrow \frac{W_1(1)}{\sqrt{P\big(b, \widetilde{W}_1(r)\big)}} =: t^{\infty,iid}_{\mathrm{CHS}}, \qquad t_{\mathrm{BCCHS}} \Rightarrow h(b)^{1/2}t^{\infty,iid}_{\mathrm{CHS}} =: t^{\infty,iid}_{\mathrm{BCCHS}}, \qquad t_{\mathrm{DKA}} \Rightarrow \frac{W_1(1)}{\sqrt{1 + h(b)^{-1}P\big(b, \widetilde{W}_1(r)\big)}} =: t^{\infty,iid}_{\mathrm{DKA}}.$$

Theorem 2.4 shows that the fixed-$b$ limits in the i.i.d. case are different for all three test statistics than the limits given by (2.18), (2.19) for CHS and (2.20), (2.21) for BCCHS and DKA. Suppose tests are carried out using $\chi^2_q$ and $N(0,1)$ critical values. The limits in Theorem 2.4 can be used to compute asymptotic null rejection probabilities, or equivalently, asymptotic coverage probabilities for the case of i.i.d. data. For a two-tailed 5% $t$-test, the coverage probabilities are given by
$$P\Big(\big|t^{\infty,iid}_{\mathrm{CHS}}\big| \leq 1.96\Big), \qquad P\Big(\big|t^{\infty,iid}_{\mathrm{BCCHS}}\big| \leq 1.96\Big), \qquad P\Big(\big|t^{\infty,iid}_{\mathrm{DKA}}\big| \leq 1.96\Big).$$
For small bandwidths, we can analytically compute these asymptotic coverage probabilities.
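For example, the next paragraph shows that as $b \to 0$ the limit of $t^{\infty,iid}_{\mathrm{DKA}}$ is $N(0, 1/2)$; a two-line check (a sketch using only the standard normal cdf) of the implied coverage:

```python
from math import erf, sqrt

Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))        # standard normal cdf
# If t ~ N(0, 1/2), then P(|t| <= 1.96) = P(|Z| <= 1.96 * sqrt(2)) for Z ~ N(0, 1)
print(round(100 * (2 * Phi(1.96 * sqrt(2.0)) - 1), 1))  # 99.4
```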
As $b \to 0$, the limits of $t^{\infty,iid}_{\mathrm{CHS}}$ and $t^{\infty,iid}_{\mathrm{BCCHS}}$ converge to $N(0,1)$ random variables, giving asymptotic coverage of 95%. In contrast, as $b \to 0$, the limit of $t^{\infty,iid}_{\mathrm{DKA}}$ is a $N(0, \frac{1}{2})$ random variable; the asymptotic coverage is 99.4%, and DKA over-covers and is conservative. This result for DKA tests is similar to Corollary 1 of MacKinnon et al. (2021). For non-small bandwidths the limiting random variables are non-standard. We used simulation methods to compute these probabilities. We approximated the Wiener processes using scaled partial sums of 1,000 i.i.d. $N(0,1)$ random increments and used 50,000 replications to simulate the percentiles.

Table 2.4.1: Asymptotic Critical Values and Coverage Probabilities (%)

            97.5% Asymptotic Critical Values           Coverage, N(0,1) & t̂∞,iid_BCCHS Critical Values
   b    t∞,iid_CHS  t̂∞,iid_BCCHS  t∞,iid_BCCHS  t∞,iid_DKA   CHS,N(0,1)  BCCHS,N(0,1)  BCCHS,t̂  DKA,N(0,1)  DKA,t̂
  0.00     1.960        1.960         1.960        1.386        95.0         95.0        95.0       99.4       99.4
  0.08     2.191        1.972         2.104        1.411        92.5         93.5        93.7       99.4       99.4
  0.12     2.298        1.991         2.162        1.416        91.2         92.8        93.2       99.3       99.4
  0.16     2.421        2.006         2.230        1.425        89.8         92.2        92.7       99.2       99.3
  0.20     2.546        2.019         2.296        1.438        88.6         91.5        92.3       99.1       99.3
  0.40     3.181        2.070         2.571        1.470        82.2         89.0        90.5       98.9       99.3
  0.80     4.300        2.100         2.764        1.497        71.2         87.3        89.2       98.8       99.2
  1.00     4.791        2.099         2.766        1.497        66.7         87.2        89.2       98.8       99.2

Note: Asymptotic critical values and coverage probabilities based on 50,000 replications using 1,000 increments for the Wiener process. The random variables $\hat{t}^{\infty,iid}_{\mathrm{BCCHS}}$ and $\hat{t}^{\infty,iid}_{\mathrm{DKA}}$ are the same. The nominal coverage probability is 95%.

Table 2.4.1 reports 97.5% critical values for $t^{\infty,iid}_{\mathrm{CHS}}$, $t^{\infty,iid}_{\mathrm{BCCHS}}$, and $t^{\infty,iid}_{\mathrm{DKA}}$ for a range of values of $b$ that will be used in our finite sample simulations. The critical values of $t^{\infty,iid}_{\mathrm{CHS}}$ and $t^{\infty,iid}_{\mathrm{BCCHS}}$ equal 1.96 when $b = 0$ and increase as $b$ increases. This suggests that CHS and BCCHS tests will under-cover when the data is i.i.d. and bandwidths are not small. In contrast, the critical values of $t^{\infty,iid}_{\mathrm{DKA}}$ are always smaller than 1.96 and remain smaller as $b$ increases. Thus, DKA tests over-cover regardless of the bandwidth. Table 2.4.1 also reports asymptotic coverage probabilities using the $N(0,1)$ critical value. We see that as $b$ goes from 0 to 1.0, coverage decreases from 95% to 66.7% for CHS, 95% to 87.2% for BCCHS, and is always close to 99% for DKA. These asymptotic calculations predict that CHS and BCCHS will over-reject (be liberal) when data is i.i.d. and non-small bandwidths are used. DKA is predicted to be conservative regardless of bandwidth. The table also reports some results for a random variable, $\hat{t}^{\infty,iid}_{\mathrm{BCCHS}}$, that is discussed in the next section.
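The entries of Table 2.4.1 can be reproduced along the following lines; this sketch simulates the 97.5% critical value of $t^{\infty,iid}_{\mathrm{CHS}}$ at $b = 0.4$ with fewer replications than the 50,000 used for the table, so the output is only approximately the tabulated 3.181:

```python
import numpy as np

rng = np.random.default_rng(4)
b, n, reps = 0.4, 500, 10_000
m = int(b * n)
W = np.cumsum(rng.normal(size=(reps, n)), axis=1) / np.sqrt(n)
r = np.arange(1, n + 1) / n
Wt = W - r * W[:, [-1]]                                   # Brownian bridges
P = (2 / b) * np.mean(Wt**2, axis=1) \
  - (2 / b) * np.sum(Wt[:, : n - m] * Wt[:, m:], axis=1) / n
t = W[:, -1] / np.sqrt(P)                                 # draws of t_CHS^{inf,iid}
print(np.quantile(np.abs(t), 0.95))                       # roughly 3.18
```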
2.4.4 Simulated Fixed-𝑏 Critical Values

As we have noted, the fixed-$b$ limits of the test statistics given by (2.18), (2.19) and (2.20), (2.21) are not pivotal due to the nuisance parameters $\Lambda_a$ and $\Lambda_g$. A feasible method for obtaining asymptotic critical values is to use simulation methods with unknown nuisance parameters replaced with estimators, i.e., to use a plug-in simulation method. To estimate $\Lambda_a$ and $\Lambda_g$ we use the estimators
$$\widehat{\Lambda_a\Lambda_a'} := \frac{1}{NT^2}\sum_{i=1}^{N}\Bigg(\sum_{t=1}^{T}\hat{v}_{it}\Bigg)\Bigg(\sum_{s=1}^{T}\hat{v}_{is}'\Bigg),$$
$$\widehat{\Lambda_g\Lambda_g'} := \Big(1 - b_{\mathrm{dk}} + \tfrac{1}{3}b_{\mathrm{dk}}^2\Big)^{-1}\frac{1}{N^2T}\sum_{t=1}^{T}\sum_{s=1}^{T}k\bigg(\frac{|t-s|}{M_{\mathrm{dk}}}\bigg)\Bigg(\sum_{i=1}^{N}\hat{v}_{it}\Bigg)\Bigg(\sum_{j=1}^{N}\hat{v}_{js}'\Bigg),$$
where $b_{\mathrm{dk}} = M_{\mathrm{dk}}/T$ and $M_{\mathrm{dk}}$ is the truncation parameter for the Driscoll-Kraay variance estimator.³

³ Note that, in principle, $b_{\mathrm{dk}}$ can be different from the $b$ used for the CHS variance estimator. For simulating asymptotic critical values we used the data-dependent rule of Andrews (1991) to obtain $b_{\mathrm{dk}}$.

The consistency of $\widehat{\Lambda_a\Lambda_a'}$ is given by (2A.20) in the proof of Theorem 2.3:
$$\widehat{\Lambda_a\Lambda_a'} = \Lambda_a\Lambda_a' + o_p(1), \tag{2.23}$$
and by (2A.10) in the proof of Theorem 2.2, we have
$$\widehat{\Lambda_g\Lambda_g'} \Rightarrow \Lambda_g\frac{P\big(b_{\mathrm{dk}}, \widetilde{W}(r)\big)}{1 - b_{\mathrm{dk}} + \tfrac{1}{3}b_{\mathrm{dk}}^2}\Lambda_g'. \tag{2.24}$$
Therefore, $\widehat{\Lambda_a\Lambda_a'}$ is a consistent estimator for $\Lambda_a\Lambda_a'$, and $\widehat{\Lambda_g\Lambda_g'}$ is a bias-corrected estimator of $\Lambda_g\Lambda_g'$ with the mean of the limit equal to $\Lambda_g\Lambda_g'$; the limit converges to $\Lambda_g\Lambda_g'$ as $b_{\mathrm{dk}} \to 0$. The matrices $\hat{\Lambda}_a$ and $\hat{\Lambda}_g$ are matrix square roots of $\widehat{\Lambda_a\Lambda_a'}$ and $\widehat{\Lambda_g\Lambda_g'}$ respectively, such that $\hat{\Lambda}_a\hat{\Lambda}_a' = \widehat{\Lambda_a\Lambda_a'}$ and $\hat{\Lambda}_g\hat{\Lambda}_g' = \widehat{\Lambda_g\Lambda_g'}$.

We propose the following plug-in method for simulating the asymptotic critical values of the fixed-$b$ limits. Details are given for a $t$-test with the modifications needed for a Wald test being obvious.

1. For a given data set with sample sizes $N$ and $T$, calculate $\hat{Q}$, $\hat{\Lambda}_a$ and $\hat{\Lambda}_g$. Let $b = M/T$ where $M$ is the bandwidth used for $\hat{\Omega}_{\mathrm{CHS}}$. Let $c = N/T$.

2. Taking $\hat{Q}$, $\hat{\Lambda}_a$, $\hat{\Lambda}_g$, $b$, $c$, and $R$ as given, use Monte Carlo methods to simulate critical values for the distributions
$$\hat{t}_{\mathrm{CHS}} = \frac{R\hat{Q}^{-1}\hat{b}_k(c)}{\sqrt{R\hat{Q}^{-1}\hat{v}_k(b, c)\hat{Q}^{-1}R'}}, \tag{2.25}$$
$$\hat{t}_{\mathrm{BCCHS}} = \hat{t}_{\mathrm{DKA}} = h(b)^{1/2}\hat{t}_{\mathrm{CHS}}, \tag{2.26}$$
where $\hat{b}_k(c) = \hat{\Lambda}_a z_k + \sqrt{c}\,\hat{\Lambda}_g W_k(1)$ and $\hat{v}_k(b, c) = h(b)\hat{\Lambda}_a\hat{\Lambda}_a' + c\hat{\Lambda}_g P\big(b, \widetilde{W}_k(r)\big)\hat{\Lambda}_g'$.

3. Typically the process $W_k(r)$ is approximated using scaled partial sums of a large number of i.i.d. $N(0, I_k)$ realizations (increments) for each replication of the Monte Carlo simulation.
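A sketch of steps 1-3 for the scalar-regressor case ($k = q = 1$, in which $\hat{Q}$ cancels from the ratio); the function name is illustrative, and the defaults of 1,000 replications and 500 increments follow the simulation settings described in Section 2.5:

```python
import numpy as np

def plugin_fixed_b_cv(lam_a, lam_g, b, c, reps=1000, n=500, seed=None):
    # 97.5% critical value of t_hat_CHS for k = q = 1.
    rng = np.random.default_rng(seed)
    m = int(b * n)
    r = np.arange(1, n + 1) / n
    stats = np.empty(reps)
    for j in range(reps):
        W = np.cumsum(rng.normal(size=n)) / np.sqrt(n)
        Wt = W - r * W[-1]
        P = (2 / b) * np.mean(Wt**2) - (2 / b) * np.sum(Wt[: n - m] * Wt[m:]) / n
        B = lam_a * rng.normal() + np.sqrt(c) * lam_g * W[-1]   # b_hat_k(c)
        V = (1 - b + b**2 / 3) * lam_a**2 + c * lam_g**2 * P    # v_hat_k(b, c)
        stats[j] = B / np.sqrt(V)
    return np.quantile(np.abs(stats), 0.95)

# Critical values for t_hat_BCCHS and t_hat_DKA scale by h(b)^{1/2}, as in (2.26):
# cv_bcchs = (1 - b + b**2 / 3)**0.5 * plugin_fixed_b_cv(lam_a, lam_g, b, c)
```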
In contrast, the limit of 𝑡DKA in Theorem 2.4 is approximately 𝑁 (0, 1 2) when 𝑏 is small whereas (cid:98) 𝑡DKA is simulating from a 𝑁 (0, 1) random variable. Therefore, simulated critical values for 𝑡DKA do not adapt to i.i.d. data when 𝑏 is small and 𝑡DKA over-covers and is conservative. When the plug-in critical values are used, we can make theoretical predictions for coverage probabilities in the i.i.d. case for bandwidths that are not small (𝑏 > 0) by computing coverage probabilities of the limiting random variables 𝑡∞,𝑖𝑖𝑑 DKA using critical values from the BCCHS and 𝑡∞,𝑖𝑖𝑑 asymptotic random variable (cid:98) 𝑡∞,𝑖𝑖𝑑 BCCHS (same as (cid:98) 𝑡∞,𝑖𝑖𝑑 DKA )5. Results are given in Table 2.4.1 in the (cid:98) 𝑡∞,𝑖𝑖𝑑 BCCHS columns. As shown in the table, the critical values of (cid:98) 𝑡∞,𝑖𝑖𝑑 BCCHS increase with 𝑏 but slowly. This helps reduce the under-rejection problems of BCCHS but does not remove them as we see in the coverage probability column for BCCHS that uses critical values from (cid:98) critical values from (cid:98) not vary much across 𝑏 because the critical values of 𝑡∞,𝑖𝑖𝑑 𝑡∞,𝑖𝑖𝑑 BCCHS. When DKA uses 𝑡∞,𝑖𝑖𝑑 BCCHS, coverage probabilities are similar to the 𝑁 (0, 1), and the coverages do 𝑡∞,𝑖𝑖𝑑 BCCHS roughly move together as 𝑏 DKA and (cid:98) increases. (cid:16) 𝑏2(cid:17) 1 − 𝑏 + 1 4Obviously 3 5Coverage probabilities are the same for 𝑡∞,𝑖𝑖𝑑 → 1 as 𝑏 → 0, and recall that 𝑝 lim𝑏→0 𝑃 CHS and 𝑡∞,𝑖𝑖𝑑 = 𝐼𝑞. BCCHS using critical values from (cid:98) 𝑏, (cid:101)𝑊𝑞 (𝑟) (cid:17) (cid:16) 𝑡∞,𝑖𝑖𝑑 CHS and (cid:98) 𝑡∞,𝑖𝑖𝑑 BCCHS given the common scaling factor ℎ(𝑏)1/2. 22 The asymptotic calculations in Table 2.4.1 predict that CHS and BCCHS will tend to under- cover (liberal) when the data is i.i.d. with the coverage approaching the nominal level for small bandwidths. DKA is predicted to have over-coverage (conservative) when the data is i.i.d. regardless of the bandwidth. 2.5 Monte Carlo Simulations To illustrate the finite sample performance of the various variance estimators and corresponding test statistics, we present a Monte Carlo simulation study with 10,000 replications in all cases. We focus on a simple linear panel model: 𝑦𝑖𝑡 = 𝛽0 + 𝛽1𝑥𝑖𝑡 + 𝑢𝑖𝑡, (2.27) where the true parameters are (𝛽0, 𝛽1) = (1, 1). To allow direct comparisons with Table 1 of Chiang et al. (2022), we consider a data generating process (DGP) that is linear in the components: DGP(1) : 𝑥𝑖𝑡 = 𝜔𝛼𝛼𝑥 𝑖 + 𝜔𝛾𝛾𝑥 𝑡 + 𝜔𝜀𝜀𝑥 𝑖𝑡, 𝑢𝑖𝑡 = 𝜔𝛼𝛼𝑢 𝑖 + 𝜔𝛾𝛾𝑢 𝑡 + 𝜔𝜀𝜀𝑢 𝑖𝑡, 𝛾 ( 𝑗) 𝑡 = 𝜌𝛾𝛾 ( 𝑗) 𝑡−1 𝛾 ( 𝑗) + (cid:101) 𝑡 for 𝑗 = 𝑥, 𝑢, where the latent components {𝛼𝑥 𝑖 , 𝛼𝑢 AR(1) processes are i.i.d 𝑁 (0, 1 − 𝜌2 𝑖𝑡, 𝜀𝑢 𝑖 , 𝜀𝑥 𝑖𝑡 } are each i.i.d 𝑁 (0, 1), and the error terms (cid:101) 𝛾) for 𝑗 = 𝑥, 𝑢. The component weights (𝜔𝛼, 𝜔𝛾, 𝜔𝜀) are used for the 𝛾 ( 𝑗) 𝑡 to adjust the relative importance of those components. To further explore the role played by the component structure representation, we consider a second DGP where the latent components enter 𝑥𝑖𝑡 and 𝑢𝑖𝑡 in a non-linear way: DGP(2) : 𝑥𝑖𝑡 = 𝑙𝑜𝑔( 𝑝 (𝑥) 𝑖𝑡 )), 𝑖𝑡 /(1 − 𝑝 (𝑥) 𝑖𝑡 /(1 − 𝑝 (𝑢) 𝑖 + 𝜔𝛾𝛾 ( 𝑗) 𝑖𝑡 )), 𝑡 + 𝜔𝜀𝜀( 𝑗) 𝑢𝑖𝑡 = 𝑙𝑜𝑔( 𝑝 (𝑢) 𝑖𝑡 = Φ(𝜔𝛼𝛼( 𝑗) 𝑝 ( 𝑗) 𝑖𝑡 ) for 𝑗 = 𝑥, 𝑢, where Φ(·) is the cumulative distribution function of a standard normal distribution and the latent components are generated in the same way as DGP(1). 
Sample coverage probabilities of 95% confidence intervals for $\hat{\beta}_1$, the OLS estimator of the slope parameter from (2.27), are provided for the following variance estimators: Eicker-Huber-White (EHW), cluster-by-$i$ (Ci), cluster-by-$t$ (Ct), DK, CHS, BCCHS, and DKA. For the variance estimators that require a bandwidth choice (DK, CHS, BCCHS, and DKA) we report results using the Andrews (1991) AR(1) plug-in data-dependent bandwidth, labeled $\hat{M}$, designed to minimize the approximate mean square error of a variance estimator (same formula for all four variance estimators). In the case of a scalar $x_{it}$, the formula is given by⁶
$$\hat{M} = 1.8171\Bigg(\frac{\hat{\rho}^2}{\big(1 - \hat{\rho}^2\big)^2}\Bigg)^{1/3} T^{1/3} + 1,$$
where $\hat{\rho}$ is the OLS estimator from the regression $\hat{\bar{v}}_t = \rho\hat{\bar{v}}_{t-1} + \eta_t$, where $\hat{\bar{v}}_t = \frac{1}{N}\sum_{i=1}^{N}\hat{v}_{it}$, $\hat{v}_{it} = x_{it}\hat{u}_{it}$, and $\hat{u}_{it}$ are the OLS residuals from (2.27). We label the ratio of $\hat{M}$ relative to the time sample size as $\hat{b} = \hat{M}/T$. In some cases $\hat{M}$ can exceed $T$, especially when the time dependence is strong relative to $T$. Therefore, we truncate $\hat{M}$ at $T$ whenever $\hat{M} > T$. We also report results for a grid of bandwidth choices.

⁶ Using equation (6.4) from Andrews (1991), we use 0 weight for the constant regressor and weights equal to the inverse of the squared innovation variances for other regressors. Because Chiang et al. (2024) parameterize the Bartlett kernel as $1 - \frac{m}{M+1}$ whereas we use $1 - \frac{m}{M}$, we add 1 to the data-dependent formula so that our Bartlett weights match those used by Chiang et al. (2024).

For tests based on CHS and DKA, we use both the standard normal critical values and the plug-in fixed-$b$ critical values. The simulated critical values use 1,000 replications with the Wiener process approximated by scaled partial sums of 500 independent increments drawn from a standard normal distribution. While these are relatively small numbers of replications and increments for an asymptotic critical value simulation, this was necessitated by computational considerations given the need to run an asymptotic critical value simulation for each replication of the finite sample simulation.

2.5.1 Main Simulation Results

We first focus on DGP(1) to make direct comparisons to the simulation results of Table 1 of Chiang et al. (2022), a working paper version of Chiang et al. (2024).⁷ Empirical null coverage probabilities of the confidence intervals for $\hat{\beta}_1$ are presented in Table 2.5.1. We start with both the cross-section and time sample sizes equal to 25. The weights on the latent components are $\omega_\alpha = 0.25$, $\omega_\gamma = 0.5$, $\omega_\varepsilon = 0.25$. Because of the relatively large weight on the common time effect, $\gamma_t$, the cross-section dependence dominates the time dependence. We can see that the confidence intervals using EHW, Ci, and Ct⁸ suffer from severe under-coverage problems as they fail to capture both cross-section and time dependence.

⁷ The reason we refer to the 2022 working paper version of Chiang et al. (2024) is that their results for small sample sizes $(N, T) = (25, 25)$ are not included in the published paper, Chiang et al. (2024).

⁸ Finite sample adjustments are applied to these three variance estimators. HC1 is used for the EHW estimator. The "cluster-by-$i$" and "cluster-by-$t$" estimators are also adjusted by the usual degrees-of-freedom factor.
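Before turning to the results, a sketch of the data-dependent bandwidth rule described above (the rounding to an integer is illustrative, not from the paper):

```python
import numpy as np

def andrews_bandwidth(x, uhat):
    # vbar_t = cross-section average of the scores vhat_it = x_it * uhat_it
    vbar = (x * uhat).mean(axis=0)
    rho = np.sum(vbar[1:] * vbar[:-1]) / np.sum(vbar[:-1] ** 2)  # AR(1) OLS slope
    T = x.shape[1]
    M = 1.8171 * (rho**2 / (1 - rho**2) ** 2) ** (1 / 3) * T ** (1 / 3) + 1
    return min(int(M), T)        # truncate at T, as described in the text
```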
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, 0.12, 0.16, 0.20, 0.40, 0.80, 1.00$).
EHW 37.4; Ci 38.7; Ct 83.6.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
83.6 84.0 83.3 82.0 80.6 73.9 62.6 57.9
84.1 90.3 84.4 89.7 83.7 90.1 82.3 90.5 80.8 90.3 74.4 91.2 63.0 91.1 58.4 90.9
86.2 86.0 85.8 85.3 84.8 82.2 80.9 80.8
88.1 87.9 87.9 87.5 87.3 85.4 84.0 84.0
88.3 88.1 87.8 88.2 88.1 88.5 87.9 88.2
0 0 0 0 0 0 0 0
Note: $\widehat{M}$ ranged from 1 to 21, with an average of 2.6 and a median of 2.

With the time effect, $\gamma_t$, being mildly persistent ($\rho_\gamma = 0.425$), the DK and CHS confidence intervals using the normal approximation under-cover with small bandwidths, with empirical coverage probabilities mostly below 0.85. The under-coverage problem becomes more severe as $M$ increases because of the well-known downward bias in kernel variance estimators that reflects the need to estimate $\beta_0$ and $\beta_1$. Coverages of DK and CHS using $\widehat{M}$ are similar to the smaller bandwidth cases, e.g. $M = 2$ or 3, which makes sense given that the average $\widehat{M}$ across replications is 2.6 (about 0.1 in terms of $\widehat{b}$). However, as the note to the table indicates, large values of $\widehat{M}$ can occur, in which case $\widehat{b}$ is not close to zero. Because they are bias corrected, the BCCHS and DKA variance estimators provide coverage that is less sensitive to the bandwidth. This is particularly true for DKA. If the plug-in fixed-$b$ critical values are used, coverages are closest to 0.95 and very stable across bandwidths, with DKA having the best coverage. Because the CHS variance estimator is not guaranteed to be positive definite, we report the number of times that CHS/BCCHS estimates are negative out of the 10,000 replications. In Table 2.5.1 there were no cases where CHS/BCCHS estimates are negative.

⁸Finite sample adjustments are applied to these three variance estimators. HC1 is used for the EHW estimator. The cluster-by-$i$ and cluster-by-$t$ estimators are also adjusted by the usual degrees-of-freedom factor.

Table 2.5.2: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.25$; POLS.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 39.9; Ci 40.7; Ct 87.1.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
85.3 86.1 84.9 83.5 82.2 75.6 64.9 60.7
85.9 86.6 85.5 84.1 82.8 76.2 65.5 61.3
87.8 88.1 87.6 87.1 86.4 84.1 82.5 82.6
89.6 90.0 89.5 89.2 88.6 86.5 85.4 85.4
89.1 89.5 89.5 89.6 89.7 89.7 89.4 89.2
91.1 91.3 91.2 91.5 91.4 91.6 91.5 91.5
0 0 0 0 0 0 0 0
Note: $\widehat{M}$ ranged from 1 to 12, with an average of 2.5 and a median of 2.

Tables 2.5.2 - 2.5.5 give results for DGP(2), where the latent components enter in a non-linear way. Tables 2.5.2 - 2.5.4 have both sample sizes equal to 25 with weights across latent components the same as in DGP(1) ($\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$). Table 2.5.2 has mild persistence in $\gamma_t$ ($\rho_\gamma = 0.25$), Table 2.5.3 has moderate persistence ($\rho_\gamma = 0.5$), and Table 2.5.4 has strong persistence ($\rho_\gamma = 0.75$). Tables 2.5.2-2.5.4 show patterns similar to Table 2.5.1: confidence intervals with variance estimators that are not robust to individual or time components under-cover, with the under-coverage problem increasing with $\rho_\gamma$. With $\rho_\gamma = 0.25$, CHS has reasonable coverage (about 0.86) with small bandwidths but under-covers severely with large bandwidths. BCCHS performs much better because of the bias correction, and fixed-$b$ critical values provide some additional modest improvements.
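For reference, here is a minimal sketch of the Andrews (1991) AR(1) plug-in rule described above (Python/NumPy; the function name is ours), for the scalar-regressor case and with the truncation at $T$ applied:

```python
import numpy as np

def andrews_ar1_bandwidth(x, uhat):
    """Plug-in bandwidth Mhat for the Bartlett kernel, scalar x_it case:
    regress vbar_t on vbar_{t-1} (no intercept), where vbar_t is the
    cross-sectional average of the scores x_it * uhat_it, then
    Mhat = 1.8171 * (rho^2 / (1 - rho^2)^2)^(1/3) * T^(1/3) + 1."""
    vbar = (x * uhat).mean(axis=0)    # length-T series of score averages
    rho = (vbar[1:] @ vbar[:-1]) / (vbar[:-1] @ vbar[:-1])
    T = vbar.shape[0]
    M = 1.8171 * (rho**2 / (1.0 - rho**2) ** 2) ** (1.0 / 3.0) * T ** (1.0 / 3.0) + 1.0
    return min(M, T)                  # truncate at T whenever Mhat > T
```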
DKA has better coverage, especially when fixed-$b$ critical values are used with large bandwidths. As $\rho_\gamma$ increases, all approaches have increasing under-coverage problems, with DKA continuing to perform best. Table 2.5.5 has the same configuration as Table 2.5.4 but with both sample sizes increased to 50. Both BCCHS and DKA show some improvements in coverage. This illustrates the well-known trade-off between the sample size and the magnitude of persistence for the accuracy of asymptotic approximations with dependent data. Regarding bandwidth choice, the data-dependent bandwidth performs reasonably well for CHS, BCCHS, and DKA. Finally, the chances of CHS/BCCHS being negative are very small but not zero, and the chances decrease as both $N$ and $T$ increase.

Table 2.5.3: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.5$; POLS.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 35.2; Ci 37.6; Ct 80.7.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
81.3 81.6 80.7 79.8 78.4 71.5 60.4 55.9
81.9 82.3 81.3 80.1 78.6 71.7 60.6 56.1
84.0 83.8 83.7 83.2 82.8 80.5 78.7 78.5
86.4 86.0 86.1 85.8 85.6 83.5 82.0 81.9
86.0 85.4 85.9 86.3 86.5 86.8 86.2 86.1
88.2 87.5 88.1 88.4 88.8 89.3 89.2 89.2
0 0 0 0 0 0 0 0
Note: $\widehat{M}$ ranged from 1 to 25, with an average of 2.8 and a median of 3.

To show that large values of $\widehat{M}$ are not unusual in DGP(2), we report in Figure 1 the frequency of $\widehat{b}$ among the 10,000 Monte Carlo replications used in Table 2.5.4. In this case, more than 21% of replications have $\widehat{b} \geq 0.2$. This explains why bias correction and fixed-$b$ critical values noticeably reduce the under-coverage problem when $\widehat{M}$ is used.

Table 2.5.4: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.75$; POLS.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 28.9; Ci 35.1; Ct 66.7.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
71.4 70.9 71.4 71.0 70.0 63.8 53.2 48.8
72.2 72.3 72.5 71.8 70.6 64.2 53.6 49.1
76.0 74.3 75.5 75.8 75.3 72.8 71.5 71.5
79.0 76.9 78.4 79.0 78.7 77.1 76.3 76.3
79.2 75.7 78.0 78.9 79.7 79.4 79.5 79.3
82.0 78.6 80.9 82.0 82.5 83.4 83.8 83.8
0 0 0 0 0 1 13 13
Note: $\widehat{M}$ ranged from 1 to 25, with an average of 3.9 and a median of 4.

To show how the relative values of $N$ and $T$ can matter in practice, we provide additional results for the same DGP as Tables 2.5.4 and 2.5.5 for $N$ and $T$ over a range of values⁹. The results are given in Table 2.5.6. There are two main takeaways from the table: i) bias correction, with and without fixed-$b$ critical values, always improves coverage probabilities relative to the original CHS test; ii) bias correction alone does slightly better than bias correction with fixed-$b$ critical values when both $N$ and $T$ are extremely small ($N = T = 10$).

⁹We thank a referee for this suggestion.

Table 2.5.5: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 50$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.75$; POLS.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 17.3; Ci 25.9; Ct 66.0.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
78.8 78.5 78.4 77.5 76.2 69.7 58.5 54.4
79.4 79.1 79.2 78.2 76.9 70.0 59.0 54.7
Note: $\widehat{M}$ ranged from 1 to 26, with an average of 5.4 and a median of 5.
81.5 80.8 81.5 81.7 81.0 79.2 77.7 77.5
82.5 81.8 82.8 83.0 82.5 81.0 79.4 79.2
84.2 82.7 83.9 85.2 86.0 86.2 86.0 86.0
85.2 83.9 85.5 86.5 87.1 87.8 87.8 87.7
0 0 0 0 0 0 0 0

The middle row gives results for both $N$ and $T$ equal to 10. Here the under-coverage problem is substantial for the original CHS test. Bias correction helps, especially if the DKA estimator is used. Interestingly, fixed-$b$ critical values help relative to CHS but less than bias correction alone. This is not surprising because the simulated critical values are functions of variance estimators based on small sample sizes. Going up the rows maintains $N = 10$ with $T$ increasing to 160. As expected, coverages approach 0.95 as $T$ increases. The top four rows have $T = 160$ with $N$ increasing from 10 to 80. With $T$ fixed and $N$ increasing, the first three tests, which fail to capture within-time/cross-sectional dependence, have deteriorating coverage (under-coverage). In contrast, the DK and CHS tests perform well in those cases; bias correction with fixed-$b$ critical values continues to provide further improvements. Going down from the middle rows shows what happens as $N$ increases when $T$ is small ($T = 10$). Coverage of the original CHS test remains quite low as $N$ increases. Bias correction without fixed-$b$ critical values improves coverage rates, but coverage does not improve as $N$ increases. Bias correction with fixed-$b$ critical values performs best and improves as $N$ increases. The results for DKA are interesting. As $N$ increases, under-coverage becomes more severe when normal critical values are used, whereas with fixed-$b$ critical values coverage is best and stable across $N$. The bottom four rows hold $N$ fixed at 160 and show what happens as $T$ increases from 10 to 80. CHS and the bias-corrected versions show better coverage as $T$ increases. CHS and DKA with fixed-$b$ critical values perform best in these cases.

Figure 1: The frequency of $\widehat{b}$ for Table 2.5.4.

2.5.2 Simulation Results for the i.i.d. Case

In Theorems 2.1, 2.2, and 2.3, a non-degeneracy assumption on the components is imposed. A special case that violates this assumption is i.i.d. data in both the individual and time dimensions (random sampling). As we showed in Theorem 2.4, the fixed-$b$ limit of the test statistics is different in the i.i.d. case. By setting $\omega_\alpha = 0$, $\omega_\gamma = 0$ in DGP(1), we present coverage probabilities for the i.i.d. case in Table 2.5.7. There are some important differences between the coverage probabilities in Table 2.5.7 and those in the previous tables. First, notice that the coverages using EHW, Ci, and Ct are close to the nominal level, as one would expect. The patterns of coverage probabilities for CHS, BCCHS, and DKA are as predicted by Theorem 2.4 and the asymptotic calculations given in Table 2.4.1. Coverages of CHS are close to 0.89 for small bandwidths, and under-coverage problems occur with larger bandwidths. BCCHS is less prone to under-coverage as the bandwidth increases, and plug-in fixed-$b$ critical values help to reduce, but do not eliminate, the under-coverage problem.

Table 2.5.6: Sample Coverage Probabilities (%), Nominal Coverage 95%
$M = \widehat{M}$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.75$; POLS.
$N$: 80 40 20 10 10 10 10 10 20 40 80 160 160 160 160
$T$: 160 160 160 160 80 40 20 10 10 10 10 10 20 40 80
EHW / Ci: 28.8 12.8, 39.7 19.1, 48.9 25.6, 54.5 33.4, 47.0 35.0, 42.7 37.3, 45.2 43.6, 51.9 53.7, 45.3 43.2, 35.6 32.3, 27.6 24.5, 20.8 17.9, 16.8 13.3, 14.9 10.3, 18.5 10.7
Ct: 67.7 66.8 65.7 64.7 64.8 65.7 67.2 68.2 68.0 67.6 68.1 68.4 66.9 66.5 66.9
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0 | Med $\widehat{b}$:
86.4 85.2 84.9 83.2 79.3 74.6 69.5 63.1 63.5 63.3 63.8 64.0 70.3 76.9 81.5
Med $\widehat{b}$: 0.08 0.06 0.05 0.05 0.08 0.13 0.15 0.30 0.30 0.30 0.30 0.30 0.15 0.15 0.09
88.1 87.8 88.2 86.6 82.3 77.2 73.1 68.9 71.5 71.0 71.6 71.8 75.0 79.8 83.9
90.3 89.9 90.4 90.1 87.3 83.7 81.1 77.2 79.4 79.3 79.2 79.5 80.7 83.7 86.8
87.1 86.7 87.1 85.9 80.6 74.8 69.1 62.8 64.4 64.4 64.7 64.5 70.7 77.2 81.9
88.7 88.7 89.9 89.6 86.3 82.5 80.1 79.7 78.0 75.4 73.9 73.0 75.6 80.4 84.4
90.0 88.8 88.7 87.3 83.5 78.7 74.5 68.0 73.1 75.4 77.0 78.2 80.1 83.3 86.3
Count: 0 0 0 0 0 1 15 172 26 3 0 0 0 0 0

In contrast, DKA over-covers regardless of the bandwidth and whether or not fixed-$b$ critical values are used. As $N$ and $T$ get larger, we would expect the coverages of CHS/BCCHS to approach 95% in the i.i.d. case (assuming a small bandwidth), but not for DKA, where over-coverage would persist.

Table 2.5.7: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$, i.i.d.: DGP(1) with $\omega_\alpha = \omega_\gamma = 0$ and $\omega_\varepsilon = 1$; POLS.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 94.7; Ci 93.1; Ct 93.2.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
91.0 92.1 91.0 89.6 88.4 82.0 70.9 66.4
88.8 90.2 88.8 87.5 86.1 80.8 70.3 66.0
90.7 91.3 90.7 90.0 89.3 87.3 86.1 85.9
98.9 99.0 99.0 98.9 98.9 98.7 98.5 98.4
99.0 99.1 99.1 99.0 99.1 98.9 98.9 98.9
91.1 91.9 91.0 90.7 90.2 88.8 87.9 87.7
73 35 63 117 174 425 639 641
Note: $\widehat{M}$ ranged from 1 to 8, with an average of 2.5 and a median of 2.

To gauge the extent to which the under-coverage of CHS/BCCHS and the over-coverage of DKA is caused by the mis-match between the plug-in fixed-$b$ critical values and the i.i.d. fixed-$b$ limits, we report in Table 2.5.8 simulated coverage probabilities using fixed-$b$ critical values based on the limits in Theorem 2.4. We see that, regardless of the bandwidth, coverages are much closer to 95%. Therefore, a significant portion of the size distortions in Table 2.5.7 is due to the mis-match.

Table 2.5.8: Sample Coverage Probabilities (%)
CHS and DKA rows over $M$ (with $\widehat{b}$), as extracted: 92.1 CHS DKA 94.1 3 0.12 92.0 94.0 2 0.08 92.3 94.0 20 0.80 92.8 94.0 10 5 0.20 0.40 92.0 92.5 94.0 93.9 25 4 1.00 0.16 92.9 92.0 94.0 94.1
Nominal coverage probability: 95%. Sample size: $N = T = 25$. DGP: i.i.d.: DGP(1) with $\omega_\alpha = \omega_\gamma = 0$ and $\omega_\varepsilon = 1$ (same as Table 2.5.7). Slope estimator: POLS. Fixed-$b$ critical values from Theorem 2.4 are used for confidence intervals constructed with both the CHS and DKA variance estimators.

These results raise a practical question: how does a researcher know which case is being dealt with? In panel data settings, random sampling is almost never a reasonable assumption in the time dimension, and clustering dependence often exists due to unobserved heterogeneity in both the individual and time dimensions.
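To fix ideas on how the fixed-$b$ critical values used above are simulated, the functional $P(b, \widetilde{W})$ that appears in the fixed-$b$ limits (see the proofs in Appendix 2A) can be drawn exactly as described in Section 2.5: approximate the Wiener process by scaled partial sums of i.i.d. standard normal increments. Here is a minimal scalar-case sketch (Python/NumPy; the function name is ours):

```python
import numpy as np

def P_fixed_b(b, n_steps=500, rng=None):
    """One draw of the scalar functional
    P(b, Wtilde) = (2/b) * int_0^1 Wtilde(r)^2 dr
                 - (2/b) * int_0^{1-b} Wtilde(r) * Wtilde(r + b) dr,
    with the Wiener process approximated by scaled partial sums of
    n_steps i.i.d. N(0,1) increments."""
    rng = np.random.default_rng() if rng is None else rng
    W = np.cumsum(rng.normal(size=n_steps)) / np.sqrt(n_steps)
    r = np.arange(1, n_steps + 1) / n_steps
    Wt = W - r * W[-1]              # Brownian bridge Wtilde(r) = W(r) - r * W(1)
    M = int(b * n_steps)
    dr = 1.0 / n_steps
    term1 = (2.0 / b) * np.sum(Wt**2) * dr
    term2 = (2.0 / b) * np.sum(Wt[: n_steps - M] * Wt[M:]) * dr
    return term1 - term2
```

Quantiles of the limiting $t$-ratios built from repeated draws of this functional, combined with the plug-in estimates, then serve as the simulated fixed-$b$ critical values.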
Some concern may arise because it is common in practice for empirical researchers to include fixed-effect dummy variables, also known as the two-way fixed-effects estimator (TWFE, hereafter), to remove at least some of the dependence generated by individual and time unobserved heterogeneity; in some cases, such as DGP(1), all of the dependence structure would be removed and we are back to the i.i.d. case. However, it is important to note that DGP(1) is a very special case. In general, fixed-effect approaches do not guarantee that the resulting scores are free from clustering dependence. Indeed, other data generating mechanisms exist where TWFE will not completely remove the dependence caused by individual and time components in the score, as shown in Chiang et al. (2024). We discuss an example and its implications in the next subsection.

One should also note that if we compare absolute size distortions, DKA-based tests are not necessarily more concerning than BCCHS-based tests: while in opposite directions, the magnitudes of the size distortions are mostly comparable, and DKA-based tests do a better job as $M$ increases. Moreover, given that DKA-based tests tend to be more conservative, a rejection using a DKA-based test delivers strong evidence against the null hypothesis. For a researcher who wants to avoid spurious null rejections (relative to the desired significance level), DKA-based tests are preferred. On the other hand, if a rejection is not obtained with BCCHS tests, this is strong evidence that the null cannot be rejected. Suppose a rejection is obtained with BCCHS but not with DKA. In this case a researcher has to balance potential over-rejections from BCCHS against the potentially lower power of DKA, which depends on the extent to which the researcher thinks two-way clustering is present in the model.

2.5.3 Additional Results for TWFE

A popular alternative to the pooled OLS estimator is the additive TWFE estimator, where individual and time period dummies are included in (2.1). It is well known that individual and time dummies will project out any latent individual or time components that enter $x_{it}$ and $u_{it}$ only linearly and individually (as would be the case in DGP(1)), leaving only variation from the idiosyncratic component $e_{it}$. In this case, we would expect the sample coverages of CHS and DKA to be similar to the i.i.d. case in Table 2.5.7. However, under the general component structure representation, the TWFE transformation may not fully remove the individual and time components if they enter in a nonlinear manner, and we would expect results for CHS and DKA similar to Tables 2.5.1-2.5.6.

As an illustration, in Table 2.5.9 we report results for the TWFE estimator using the same configuration as Table 2.5.4 for DGP(2). The sample coverage probabilities are different from Table 2.5.4 but are very similar to the results in Table 2.5.7 for the i.i.d. case. Therefore, for DGP(2), the TWFE dummy variables remove the bulk of the variation from the individual and time components. In contrast, Chiang et al. (2024) provide an example where the TWFE dummy variables do not remove the component structure. Consider a third DGP given by
$$\mathrm{DGP(3)}: \quad x_{it} = \alpha_{1i}\gamma_{2t} + \alpha_{2i}\gamma_{1t} + \varepsilon^x_{it}, \qquad u_{it} = \alpha_{1i}\gamma_{3t} + \alpha_{3i}\gamma_{1t} + \varepsilon^u_{it},$$
where the latent components $\{\alpha_{1i}, \alpha_{2i}, \alpha_{3i}, \gamma_{1t}, \gamma_{2t}, \gamma_{3t}, \varepsilon^x_{it}, \varepsilon^u_{it}\}$ are $N(0,1)$ random variables that are independent across $i$ and $t$ and independent of each other.
As Chiang et al. (2024) argue, there is no endogeneity between $x_{it}$ and $u_{it}$, and it is not difficult to show that $E(x_{it}|\alpha_i) = E(x_{it}|\gamma_t) = E(u_{it}|\alpha_i) = E(u_{it}|\gamma_t) = 0$. While $x_{it}$ and $u_{it}$ do not have the component structure, the score, $x_{it}u_{it}$, does, because $E(x_{it}u_{it}|\alpha_i) = \alpha_{2i}\alpha_{3i}$ and $E(x_{it}u_{it}|\gamma_t) = \gamma_{2t}\gamma_{3t}$. Therefore, the TWFE dummy variables will not remove the component structure from $x_{it}u_{it}$.

Table 2.5.9: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.75$; TWFE.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 94.2; Ci 93.3; Ct 93.0.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
91.0 92.2 90.9 89.5 88.2 81.5 70.3 66.1
90.2 91.0 89.7 88.4 87.4 81.7 71.3 67.2
91.5 92.1 91.1 90.5 89.5 88.3 86.7 86.3
98.9 99.0 99.0 98.9 98.9 98.5 98.2 98.2
91.6 92.3 91.7 91.1 90.4 89.5 88.7 88.4
99.1 99.1 99.0 99.2 99.1 99.1 98.8 98.9
57 26 53 88 137 331 517 523
Note: $\widehat{M}$ ranged from 1 to 8, with an average of 2.5 and a median of 2.

Table 2.5.10 gives results for DGP(3) for TWFE with $N = T = 25$. We see that tests based on variance estimators that are not robust to two-way cluster dependence have substantial under-coverage problems. The original CHS does a better job but tends to under-cover with large bandwidths. BCCHS works better, and plug-in fixed-$b$ critical values provide additional improvements in coverage. DKA works quite well, with small improvements from using plug-in fixed-$b$ critical values, and coverage probabilities are close to 95%.

Table 2.5.10: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$; DGP(3); TWFE.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 61.5; Ci 80.1; Ct 80.0.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
77.4 78.6 77.1 75.8 74.5 67.3 55.9 51.7
89.2 89.8 88.8 87.8 86.9 81.8 71.4 66.2
93.9 93.9 93.7 93.6 93.4 92.9 92.3 92.4
90.7 90.9 90.8 90.7 90.4 89.3 88.7 88.6
91.2 91.1 91.4 91.4 91.2 91.1 90.8 90.8
94.2 94.2 94.0 94.3 94.3 94.2 94.2 94.2
0 0 0 0 0 0 0 0
Note: $\widehat{M}$ ranged from 1 to 25, with an average of 2.6 and a median of 2.

The results for BCCHS and DKA in Table 2.5.10 suggest that the fixed-$b$ limits given by (2.20) and (2.21) for POLS can continue to hold for tests based on the TWFE estimator of $\beta$. Let $\ddot{x}_{it}$ and $\ddot{u}_{it}$ denote the individual and time dummy demeaned versions of $x_{it}$ and $u_{it}$, respectively (see the sketch below). Suppose that the demeaned score, $\ddot{x}_{it}\ddot{u}_{it}$, has the individual and time component structure. Because Chiang et al. (2024) show that $\ddot{x}_{it}\ddot{u}_{it} = \widetilde{x}_{it}\widetilde{u}_{it} + o_p(1)$, where $\widetilde{x}_{it} = x_{it} - \mathrm{E}[x_{it}|\alpha_i] - \mathrm{E}[x_{it}|\gamma_t] + \mathrm{E}[x_{it}]$ and $\widetilde{u}_{it}$ is similarly defined, equivalent versions of Theorem 2.2, (2.20), (2.21), and Theorem 2.3 are easily established for the TWFE estimator provided the stronger exogeneity assumption, $\mathrm{E}[\widetilde{x}_{it} u_{it}] = 0$, holds¹⁰.

¹⁰Strict exogeneity over time, $E(u_{it}|x_{i1}, x_{i2}, \ldots, x_{iT}) = 0$, is sufficient for $E[\widetilde{x}_{it} u_{it}] = 0$ to hold.
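For completeness, the two-way within transformation underlying the TWFE results can be sketched as follows (Python/NumPy, balanced panel; the helper name is ours). Under DGP(3), applying it to $x_{it}$ and $u_{it}$ and then forming the score leaves the products $\alpha_{2i}\alpha_{3i}$ and $\gamma_{2t}\gamma_{3t}$ in the conditional means of the score, consistent with the discussion above.

```python
import numpy as np

def twfe_demean(z):
    """Two-way within transformation for a balanced N x T array:
    z_it - mean over t - mean over i + grand mean, which projects out
    additive individual and time components."""
    return (z - z.mean(axis=1, keepdims=True)
              - z.mean(axis=0, keepdims=True)
              + z.mean())
```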
2.6 Empirical Application

We illustrate how the choice of variance estimator affects $t$-tests and confidence intervals using an empirical example from Thompson (2011). We test the predictive power of market concentration on the profitability of industries, where market concentration is measured by the Herfindahl-Hirschman Index (HHI, hereafter). This example features data where dependence exists in both the cross-section and time dimensions, with common shocks being correlated across time. Specifically, consider the following linear regression model of profitability measured by $\mathrm{ROA}_{m,t}$, the ratio of return on total assets for industry $m$ at time $t$:
$$\mathrm{ROA}_{m,t} = \beta_0 + \beta_1 \ln(\mathrm{HHI}_{m,t-1}) + \beta_2 \mathrm{PB}_{m,t-1} + \beta_3 \mathrm{DB}_{m,t-1} + \beta_4 \overline{\mathrm{ROA}}_{t-1} + u_{m,t},$$
where PB is the price-to-book ratio, DB is the dividend-to-book ratio, and $\overline{\mathrm{ROA}}$ is the market average ROA ratio. The data set used to estimate the model is composed of 234 industries in the US from 1972 to 2021. We obtain the annual firm-level data from Compustat and aggregate it to the industry level based on Standard Industry Classification (SIC) codes. The details of the data construction can be found in Section 6 and Appendix B of Thompson (2011).

Table 2.6.1: Industry Profitability, 1972-2021: POLS Estimates and t-statistics
Regressors: ln(HHI$_{m,t-1}$), Price/Book$_{m,t-1}$, DIV/Book$_{m,t-1}$, Market ROA$_{t-1}$, Intercept.
Estimates: 0.0097, -0.0001, 0.0167, 0.6129, -0.0564.
t-statistics, by column block as extracted: EHW Ci 3.93 -0.09 3.93 14.47 -2.76; 12.42 -0.15 6.89 32.31 -8.94; CHS BCCHS DKA DK 3.30 3.58 3.76 6.40, -0.05 -0.06 -0.07 -0.07, 1.74 1.79 2.04 1.89, 8.99 9.76 12.06 10.27, -2.35 -2.53 -2.67 -4.69; Ct 10.57 -0.13 3.81 12.05 -7.52.
Notes: $R^2 = 0.117$, $\widehat{M} = 5$.

Table 2.6.2: Industry Profitability, 1972-2021: POLS, 95% Confidence Intervals
Regressors: ln(HHI$_{m,t-1}$), Price/Book$_{m,t-1}$, DIV/Book$_{m,t-1}$, Market ROA$_{t-1}$, Intercept.
EHW: (0.0082, 0.0112) (-0.0017, 0.0014) (0.0119, 0.0214) (0.5757, 0.6500) (-0.0687, -0.0440)
CHS: (0.0046, 0.0147) (-0.0037, 0.0035) (-0.0006, 0.0340) (0.4959, 0.7299) (-0.0978, -0.0149)
BCCHS: (0.0044, 0.0150) (-0.0039, 0.0036) (-0.0015, 0.0349) (0.4898, 0.7360) (-0.0999, -0.0128)
DKA: (0.0039, 0.0154) (-0.0044, 0.0041) (-0.0022, 0.0355) (0.4792, 0.7466) (-0.1034, -0.0093)
CHS (fixed-b critical values): (0.0043, 0.0149) (-0.0038, 0.0037) (-0.0040, 0.0353) (0.4844, 0.7352) (-0.1000, -0.0124)
DKA (fixed-b critical values): (0.0039, 0.0153) (-0.0043, 0.0041) (-0.0047, 0.0360) (0.4734, 0.7459) (-0.1036, -0.0088)
Note: $\widehat{M} = 5$.

In Table 2.6.1, we present the POLS estimates for the five parameters and $t$-statistics (with the null $H_0: \beta_j = 0$ for each $j = 1, 2, \ldots, 5$) based on the various variance estimators. We use the data-dependent bandwidth, $\widehat{M}$, in all relevant cases. We can see that the $t$-statistics vary non-trivially across the different variance estimators. The estimated coefficient of $\ln(\mathrm{HHI}_{m,t-1})$ is significant at the 1% level based on two-sided $t$-tests using any of the standard errors under comparison, including the DKA standard error. As discussed in Section 2.5.3, a rejection using DKA is strong evidence of market concentration being powerful in predicting the profitability of industries. On the other hand, the estimated coefficient of DIV/Book is significant at the 5% significance level in a two-sided test when the EHW, cluster-by-industry, cluster-by-time, and DK variances are used, while it is only marginally significant when CHS is used and marginally insignificant when BCCHS and DKA are used.

In Table 2.6.2 we present 95% confidence intervals. For CHS/BCCHS and DKA we give confidence intervals using both normal and plug-in fixed-$b$ critical values. For the bias-corrected variance estimators (BCCHS and DKA) the differences in confidence intervals between normal and fixed-$b$ critical values are not large, consistent with our simulation results.
Table 2.6.3: Industry Profitability, 1972-2021: TWFE Estimates and t-statistics
Regressors: ln(HHI$_{m,t-1}$), Price/Book$_{m,t-1}$, DIV/Book$_{m,t-1}$.
Estimates: 0.0050, 0.0015, 0.0056.
t-statistics, by column block as extracted: EHW Ci 1.84 1.41 2.33 4.27 1.73 2.69; Ct 1.33 2.04 3.54; DK CHS BCCHS DKA 0.78 1.53 1.00, 0.95 1.98 1.11, 1.46 0.97 1.03, 1.55 1.03 1.10.
Notes: $R^2 = 0.27$, $\widehat{M} = 6$.

Table 2.6.4: Industry Profitability, 1972-2021: TWFE, 95% Confidence Intervals
Regressors: ln(HHI$_{m,t-1}$), Price/Book$_{m,t-1}$, DIV/Book$_{m,t-1}$.
EHW: (0.0027, 0.0736) (-0.0020, 0.0032) (0.0015, 0.0096)
CHS: (-0.0013, 0.0114) (-0.0013, 0.0043) (-0.0044, 0.0155)
BCCHS: (-0.0017, 0.0118) (-0.0015, 0.0045) (-0.0050, 0.0161)
DKA: (-0.0024, 0.0125) (-0.0022, 0.0052) (-0.0059, 0.0170)
CHS (fixed-b critical values): (-0.0019, 0.0126) (-0.0021, 0.0045) (-0.0048, 0.0166)
DKA (fixed-b critical values): (-0.0025, 0.0133) (-0.0029, 0.0052) (-0.0057, 0.0175)
Note: $\widehat{M} = 6$.

In Table 2.6.3, we include the results for the TWFE estimator to see how the inclusion of industry and time period dummies matters in practice. The presence of the dummies results in the intercept and $\overline{\mathrm{ROA}}_{t-1}$ being dropped from the regression. Overall, test statistics based on CHS, BCCHS, and DKA agree with each other in magnitude, and they are much smaller relative to EHW-based test statistics. As we saw in Table 2.5.7, when the scores are independent in both the cross-section and time dimensions, test statistics based on the non-two-way-robust standard errors tend to be smaller (higher coverage) on average, except for DKA-based tests. The fact that the non-two-way test statistics in Table 2.6.3 are larger than the CHS/BCCHS statistics suggests that TWFE does not fully remove the two-way dependence, so two-way cluster-robust standard errors are appropriate. The 95% confidence intervals for the TWFE case are presented in Table 2.6.4. Confidence intervals tend to be wider with fixed-$b$ critical values. This is expected given that fixed-$b$ critical values are larger in magnitude than standard normal critical values.

2.7 Conclusion

2.7.1 Summary

This paper investigates the fixed-$b$ asymptotic properties of the CHS variance estimator and tests. An important algebraic observation is that the CHS variance estimator can be expressed as a linear combination of the cluster variance estimator, the "HAC of averages" estimator, and the "average of HACs" estimator. Building upon this observation, we derive fixed-$b$ asymptotic results for the CHS variance estimator when both the sample sizes $N$ and $T$ tend to infinity. Our analysis reveals the presence of an asymptotic bias in the CHS variance estimator which depends on the ratio of the bandwidth parameter, $M$, to the time sample size, $T$. This bias is multiplicative and leads to a simple feasible bias-corrected version of the CHS variance estimator (BCCHS). We propose a second bias-corrected variance estimator, DKA, obtained by dropping the "average of HACs" term; the resulting estimator is guaranteed to be positive semi-definite. We show that the fixed-$b$ limiting distributions of tests based on CHS, BCCHS, and DKA are not asymptotically pivotal, and we propose a straightforward plug-in method for simulating fixed-$b$ asymptotic critical values. Overall, we propose four test statistics that build on the CHS test: BCCHS and DKA tests using chi-square/standard normal critical values, and BCCHS and DKA tests using plug-in fixed-$b$ critical values¹¹. Extensive simulation studies are reported that compare the finite sample performance of the proposed approaches with existing approaches in terms of finite sample null coverage probabilities.

¹¹CHS tests that use simulated fixed-$b$ critical values are exactly equivalent to BCCHS tests based on simulated fixed-$b$ critical values because the fixed-$b$ limits explicitly capture the bias in the CHS variance estimator.
The simple bias-correction approaches provide non-trivial improvements in coverage probabilities, and bias correction with plug-in fixed-$b$ critical values provides additional improvements except in the i.i.d. case and when both $N$ and $T$ are very small.

2.7.2 Empirical Recommendations

Our results clearly suggest that the bias-corrected variance estimators, BCCHS and DKA, provide more reliable inference in practice with or without plug-in fixed-$b$ critical values. While plug-in fixed-$b$ critical values involve some computational cost in practice, we can generally recommend that fixed-$b$ critical values be used in practice given that i) fixed-$b$ critical values improve finite sample coverage probabilities when large bandwidths are used, ii) data-dependent bandwidths can be large, and iii) coverage probabilities with or without fixed-$b$ critical values are similar when bandwidths are small. However, there are important exceptions. When both the cross-section and time sample sizes are very small, BCCHS and DKA based tests using plug-in fixed-$b$ critical values could yield slightly worse empirical null coverages than using chi-square/standard normal critical values because the plug-in estimators are noisy. Therefore, the choice between using fixed-$b$ or chi-square/standard normal critical values for BCCHS and DKA tests depends on the sample sizes in addition to any relevant computational costs.

The choice between tests based on BCCHS and DKA is nuanced. While DKA ensures positive definiteness and usually provides tests with better empirical null coverage probabilities, these benefits do not come without a cost. Although rare in panel settings, if the scores, $x_{it}u_{it}$, are i.i.d. over both the individual and time dimensions, the DKA estimator has a different fixed-$b$ limiting distribution, and tests based on the DKA estimator can be conservative. In contrast, while the BCCHS estimator also has a different fixed-$b$ limiting distribution in the i.i.d. case, it has correct asymptotic coverage probabilities when the bandwidth is small. However, if the bandwidth is not small, CHS and BCCHS tests under-cover in the i.i.d. case. Therefore, the practical choice between DKA and BCCHS depends on a researcher's assessment of the data, the model, and the priority of inference. If the data is thought to be independent in both dimensions, then one should not consider cluster-robust variance estimators in the first place. If the data is thought to have individual and serially correlated time cluster dependence, and the researcher places higher priority on controlling over-rejections while accepting a conservative test (with the cost of lower power) should there be no cluster dependence, the DKA estimator is preferred. BCCHS would be preferred if the additional under-coverage relative to DKA is viewed as a reasonable price for higher power should there be no cluster dependence.

2.7.3 Further Discussion

It is important to acknowledge some limitations of our analysis and to highlight areas for future research. We found that finite sample coverage probabilities of all confidence intervals exhibit under-coverage problems when the autocorrelation of the time effects becomes strong relative to the time sample size. In such cases, the potential improvements resulting from the fixed-$b$ adjustment are limited.
Part of this limitation arises because the test statistics are not asymptotically pivotal, necessitating plug-in simulation of critical values. The estimation uncertainty in the plug-in estimators can introduce sampling errors into the simulated critical values that can be acute when persistence is strong. Finding a variance estimator that results in a pivotal fixed-$b$ limit would help address this problem, although doing so appears to be challenging.

An empirically relevant question is whether the component structure is a good approximation when the component representation in Assumption 2.1 is not exact. Ideally, inferential theory should be studied under a DGP where the dependence is generated not only through individual and time components but also through the idiosyncratic component. Obtaining fixed-$b$ results for this generalization appears challenging. Some unreported simulation results point to some theoretical conjectures, but a formal analysis is beyond the scope of this paper and is left for future research.

A second empirically relevant case we do not address in this paper is the unbalanced panel data case. There are several challenges in establishing formal fixed-$b$ asymptotic results for unbalanced panels. Unbalanced panels have time sample sizes that are potentially different across individuals, and this potentially complicates the choice of bandwidths for the individual-by-individual variance estimators in the average-of-HACs component of the variance. For the Driscoll-Kraay component, the averaging by time would have potentially different cross-section sample sizes for each period. Theoretically, obtaining fixed-$b$ results for unbalanced panels also depends on how the missing data is modeled. For example, one might conjecture that if missing observations in the panel occur randomly (missing at random), then extending the fixed-$b$ theory would be straightforward. While that is true in pure time series settings (see Rho and Vogelsang, 2019), the presence of the individual and time random components in the panel setting complicates things because the asymptotic behavior of the components in the partial sums is very different from the balanced panel case. Obtaining useful results for the unbalanced panel case is challenging and is a focus of ongoing research.

BIBLIOGRAPHY

Aldous, D. J. (1981). Representations for partially exchangeable arrays of random variables. J. Multivar. Anal., 11(4):581–598.

Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica, 59:817–858.

Arellano, M. (1987). Computing robust standard errors for within-groups estimators. Oxf. B. Econ. Stat., 49:431–434.

Bertrand, M., Duflo, E., and Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? Q. J. Econ., 119:249–275.

Bester, C. A., Conley, T. G., Hansen, C. B., and Vogelsang, T. J. (2016). Fixed-b asymptotics for spatially dependent robust nonparametric covariance matrix estimators. Econom. Theory, 32(1):154–186.

Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2011). Robust inference with multiway clustering. J. Bus. Econ. Stat., 29:238–249.

Chen, K. and Vogelsang, T. J. (2024). Fixed-b asymptotics for panel models with two-way clustering. J. Econom., 244(1):105831.

Chiang, H. D., Hansen, B. E., and Sasaki, Y. (2022). Standard errors for two-way clustering with serially correlated time effects. Working Paper, Department of Economics, U. Wisconsin Madison, arXiv:2201.11304v2.
Chiang, H. D., Hansen, B. E., and Sasaki, Y. (2024). Standard errors for two-way clustering with serially correlated time effects. Rev. Econ. Stat., pages 1–40.

Davezies, L., D'Haultfoeuille, X., and Guyonvarch, Y. (2018). Asymptotic results under multiway clustering. Working Paper, arXiv preprint arXiv:1807.07925.

Davezies, L., D'Haultfoeuille, X., and Guyonvarch, Y. (2021). Empirical process results for exchangeable arrays. Ann. Stat., 49:845–862.

Driscoll, J. C. and Kraay, A. C. (1998). Consistent covariance matrix estimation with spatially dependent panel data. Rev. Econ. Stat., 80:549–559.

Hansen, B. E. (2022). Econometrics. Princeton University Press, Princeton, NJ.

Hansen, C. B. (2007). Asymptotic properties of a robust variance matrix estimator for panel data when T is large. J. Econom., 141:597–620.

Hoover, D. N. (1979). Relations on probability spaces and arrays of random variables. Preprint, Institute for Advanced Study.

Kallenberg, O. (1989). On the representation theorem for exchangeable arrays. J. Multivar. Anal., 30(1):137–154.

Kiefer, N. M. and Vogelsang, T. J. (2005). A new asymptotic theory for heteroskedasticity-autocorrelation robust tests. Econom. Theory, 21:1130–1164.

Lazarus, E., Lewis, D. J., and Stock, J. H. (2021). The size-power tradeoff in HAR inference. Econometrica, 89(5):2497–2516.

Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22.

MacKinnon, J. G., Nielsen, M. Ø., and Webb, M. D. (2021). Wild bootstrap and asymptotic inference with multiway clustering. J. Bus. Econ. Stat., 39(2):505–519.

Menzel, K. (2021). Bootstrap with cluster-dependence in two or more dimensions. Econometrica, 89:2143–2188.

Neave, H. R. (1970). An improved formula for the asymptotic variance of spectrum estimates. Ann. Math. Stat., 41(1):70–77.

Newey, W. K. and West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55:703–708.

Petersen, M. A. (2009). Estimating standard errors in finance panel data sets: Comparing approaches. Rev. Financ. Stud., 22:435–480.

Phillips, P. C. B. and Moon, H. R. (1999). Linear regression limit theory for nonstationary panel data. Econometrica, 67:1057–1111.

Rho, S.-H. and Vogelsang, T. J. (2019). Heteroskedasticity autocorrelation robust inference in time series regressions with missing data. Econom. Theory, 35(3):601–629.

Sun, Y. (2014). Let's fix it: Fixed-b asymptotics versus small-b asymptotics in heteroskedasticity and autocorrelation robust inference. J. Econom., 178:659–677.

Sun, Y., Phillips, P. C., and Jin, S. (2008). Optimal bandwidth selection in heteroskedasticity–autocorrelation robust testing. Econometrica, 76(1):175–194.

Thompson, S. B. (2011). Simple formulas for standard errors that cluster by both firm and time. J. Financ. Econ., 99:1–10.

Vogelsang, T. J. (2012). Heteroskedasticity, autocorrelation, and spatial correlation robust inference in linear panel models with fixed-effects. J. Econom., 166:303–319.

Zhang, X. and Shao, X. (2013). Fixed-smoothing asymptotics for time series. Ann. Stat., 41(3):1329–1349.

APPENDIX 2A
PROOFS FOR CHAPTER 2

Proof of Theorem 2.1: Consider $\sqrt{N}(\widehat{\theta} - \theta)$ with the component structure representation:
$$\sqrt{N}\big(\widehat{\theta} - \theta\big) = \frac{1}{\sqrt{N}} \sum_{i=1}^N a_i + \sqrt{\frac{N}{T}}\, \frac{1}{\sqrt{T}} \sum_{t=1}^T g_t + \frac{1}{\sqrt{NT}} \sum_{i=1}^N \sum_{t=1}^T e_{it}. \qquad (2\mathrm{A}.1)$$
Under the same set of assumptions, we can apply Theorem 1 of Chiang et al.
(2024) giving (cid:32) 1 √ 𝑁 Var (cid:32) Var (cid:32) Var √ 1 𝑁𝑇 1 √ 𝑇 𝑁 ∑︁ (cid:33) (cid:33) (cid:33) 𝑁 ∑︁ 𝑖=1 𝑇 ∑︁ 𝑎𝑖 𝑔𝑡 𝑡=1 𝑇 ∑︁ 𝑒𝑖𝑡 = Λ𝑎Λ′ 𝑎, (cid:13) (cid:13)Λ𝑎Λ′ 𝑎 (cid:13) (cid:13) < ∞, → Λ𝑔Λ′ 𝑔, (cid:13) (cid:13)Λ𝑔Λ′ 𝑔 (cid:13) (cid:13) < ∞, → Λ𝑒Λ′ 𝑒, (cid:13) (cid:13)Λ𝑒Λ′ 𝑒 (cid:13) (cid:13) < ∞. 𝑖=1 𝑡=1 (2A.1) (2A.2) (2A.3) (2A.4) Thus, 𝑎𝑖 = E[𝑦𝑖𝑡 − 𝜃|𝛼𝑖] is a sequence of i.i.d random vectors with zero mean and finite variance Λ𝑎Λ′ 𝑎. Then, the Lindeberg-Lévy CLT applies to the first sum in (2A.1): as 𝑁 → ∞, 1 √ 𝑁 𝑁 ∑︁ 𝑖=1 𝑎𝑖 𝑑 → 𝑁 (0, Λ𝑎Λ′ 𝑎). (2A.5) Consider the second sum in (2A.1) where 𝑔𝑡 = E[𝑦𝑖𝑡 − 𝜃|𝛾𝑡] is strictly stationary and is an 𝛼-mixing sequence with mixing coefficients 𝛼𝑔 (ℓ) ≤ 𝛼𝛾 (ℓ) by Theorem 14.12 of Hansen (2022), and so 𝛼𝑔 (ℓ) satisfies a summation condition as follows: for some 𝑠 > 1 and 𝛿 > 0, and for 𝐾 ∈ (0, ∞), there exists integer 𝑁𝐾 such that ∞ ∑︁ ℓ=1 𝛼𝑔 (ℓ)1−1/2(𝑠+𝛿) ≤ ∞ ∑︁ ℓ=1 𝛼𝛾 (ℓ)1−1/2(𝑠+𝛿) = 𝑁𝐾∑︁ 𝛼𝛾 (ℓ)1−1/2(𝑠+𝛿) + ∞ ∑︁ ℓ=𝑁𝐾 +1 (cid:18) 𝑂 (ℓ−𝜆) ℓ−𝜆 ℓ−𝜆 (cid:19) 1−1/2(𝑠+𝛿) < 𝑁𝐾 + 𝐾 ∞ ∑︁ (cid:16) ℓ− 2𝑠 𝑠−1 ℓ=1 ℓ=1 (cid:17) 1−1/2(𝑠+𝛿) < ∞. Then, by Theorem 16.4 of Hansen (2022) we have as 𝑁, 𝑇 → ∞, for 𝑟 ∈ (0, 1], 1 √ 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑔𝑡 ⇒ Λ𝑔𝑊𝑘 (𝑟) , (2A.6) where [𝑟𝑇] denotes the integer part of 𝑟𝑇 and 𝑊𝑘 (𝑟) is a 𝑘 × 1 vector of standard Wiener process. 43 As for the third sum, we have Var (cid:16) 1√ 𝑁𝑇 (cid:205)𝑁 𝑖=1 (cid:205)𝑇 𝑡=1 𝑒𝑖𝑡 (cid:17) → 0 by (2A.4). Then, we can apply Chebyshev’s inequality for random variables to show that each component of the random vector 1√ 𝑁𝑇 (cid:205)𝑁 𝑖=1 (cid:205)𝑇 𝑡=1 𝑒𝑖𝑡 converges to 0 in probability as 𝑁, 𝑇 → ∞ and so √ 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑒𝑖𝑡 𝑝 → 0. (2A.7) Combining (2A.5), (2A.6), (2A.7), we have √ (cid:16) 𝑁 (cid:17) (cid:98)𝜃 − 𝜃 ⇒ Λ𝑎𝑧𝑘 + √ 𝑐Λ𝑔𝑊𝑘 (1) 𝑎𝑠 𝑁, 𝑇 → ∞, and we conclude that 𝑧𝑘 is independent from 𝑊𝑘 (𝑟) since {𝑎𝑖} and {𝑔𝑡 } are independent to each other, proving (i) of Theorem 2.1. Next, consider (2.9), scaled by 1√ 𝑁𝑇 : √ 1 𝑁𝑇 (cid:98)¯𝑆 [𝑟𝑇] = √︂ 𝑁 𝑇 1 √ 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 (𝑔𝑡 − ¯𝑔) + √ 1 𝑁𝑇 𝑁 ∑︁ [𝑟𝑇] ∑︁ 𝑖=1 𝑡=1 (𝑒𝑖𝑡 − ¯𝑒) . (2A.8) By (2A.7), we have the second partial sum of (2A.8) converges to 0 in probability. Combined with the results from (2A.6) we obtain, √ (cid:98)¯𝑆 [𝑟𝑇] ⇒ √ 1 𝑁𝑇 𝑐Λ𝑔 (𝑊𝑘 (𝑟) − 𝑟𝑊𝑘 (1)) = √ 𝑐Λ𝑔 (cid:101)𝑊𝑘 (𝑟), (2A.9) as 𝑁, 𝑇 → ∞. Note that for each 𝑡 = 1, ..., 𝑇 − 1, we can map 𝑡 to [𝑟𝑡𝑇] for some 𝑟𝑡 ∈ (cid:2) 𝑡 𝑇 , 𝑡+1 𝑇 (cid:17) . Similarly, we can map 𝑡+𝑀 to [(𝑟𝑡+𝑏)𝑇] where 𝑏 = 𝑀/𝑇 and [𝑟𝑡𝑇] = 𝑡 for 𝑡 = 1, ..., 𝑇 −𝑀−1. Using (2A.9), we have √ 𝑁𝑇 (cid:98)¯𝑆𝑡 ⇒ 1√ 𝑐Λ𝑔 (cid:101)𝑊𝑘 (𝑟𝑡) for each 𝑡 = 1, ..., 𝑇 − 1 and 1√ 𝑁𝑇 (cid:98)¯𝑆𝑡+𝑀 ⇒ √ 𝑐Λ𝑔 (cid:101)𝑊𝑘 (𝑟𝑡 + 𝑏) for each 𝑡 = 1, ..., 𝑇 − 𝑀 − 1. Note that we can take 𝑟𝑡 = 𝑡/𝑇, then as 𝑁, 𝑇 → ∞, we have 1 𝑁𝑇 3 𝑇−1 ∑︁ 𝑡=1 ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 = (𝑇−1)/𝑇 ∑︁ √ 1 𝑁𝑇 3 𝑇−𝑀−1 ∑︁ 𝑡=1 ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡+𝑀 = 𝑟𝑡 =1/𝑇 (𝑇−𝑀−1)/𝑇 ∑︁ 𝑟𝑡 =1/𝑇 1 𝑁𝑇 (cid:98)¯𝑆 [𝑟𝑡𝑇] √ ′ (cid:98)¯𝑆 [𝑟𝑡𝑇] ⇒ 𝑐Λ𝑔 1 𝑁𝑇 ∫ 1 0 (cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊𝑘 (𝑟)′𝑑𝑟Λ𝑔, √ 1 𝑁𝑇 (cid:98)¯𝑆 [𝑟𝑡𝑇] √ 1 𝑁𝑇 ′ (cid:98)¯𝑆 [(𝑟𝑡 +𝑏)𝑇] ⇒ 𝑐Λ𝑔 ∫ 1−𝑏 0 (cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊𝑘 (𝑟 + 𝑏)′𝑑𝑟Λ𝑔. 44 Using the results above, we obtain the fixed-𝑏 joint limit of (2.11): 𝑁 𝑁 2𝑇 2 (cid:40) 2 𝑀 𝑇−1 ∑︁ ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 − 1 𝑀 𝑇−𝑀−1 ∑︁ 𝑡=1 (cid:16) ′ ′ 𝑡+𝑀 + (cid:98)¯𝑆𝑡+𝑀(cid:98)¯𝑆 (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 (cid:17) (cid:41) 𝑡=1 ∫ 1 (cid:26) 2 𝑏 (cid:16) ⇒ 𝑐Λ𝑔 = 𝑐Λ𝑔𝑃 0 𝑏, (cid:101)𝑊𝑘 (𝑟) (cid:17) Λ′ 𝑔. 
(cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊 𝑘 (𝑟)′𝑑𝑟 − 1 𝑏 ∫ 1−𝑏 0 (cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊𝑘 (𝑟 + 𝑏)′ + (cid:101)𝑊𝑘 (𝑟 + 𝑏) (cid:101)𝑊𝑘 (𝑟)′(cid:3) 𝑑𝑟 (cid:2) (cid:27) Λ′ 𝑔 (2A.10) Note that the last term of (2.12) is canceled out with (2.10). The rest of the terms of (2.12) are functions of the partial sums defined in (2.8). Consider (2.8) evaluated at 𝑡 = [𝑟𝑇] and scaled by 1 𝑇 : 1 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] = [𝑟𝑇] 𝑇 (𝑎𝑖 − ¯𝑎) + 1 𝑇 [𝑟𝑇] ∑︁ (𝑔𝑡 − ¯𝑔) + [𝑟𝑇] ∑︁ (𝑒𝑖𝑡 − ¯𝑒), 1 𝑇 𝑡=1 We first consider fixed-𝑁 and large-𝑇 asymptotic results. As 𝑇 → ∞ while fixing 𝑁, 𝑡=1 [𝑟𝑇] 𝑇 (𝑎𝑖 − ¯𝑎) 𝑝 → 𝑟 (𝑎𝑖 − ¯𝑎) and 1 𝑇 (𝑔𝑡 − ¯𝑔) 𝑝 → 0 by (2A.6). Note that Var (cid:32) 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 (cid:33) 𝑒𝑖𝑡 = 1 𝑇 (cid:205)[𝑟𝑇] 𝑡=1 [𝑟𝑇]−1 ∑︁ 𝑙=−([𝑟𝑇]−1) (cid:18) [𝑟𝑇] 𝑇 (cid:19) |𝑙 | 𝑇 − E(𝑒𝑖𝑡𝑒𝑖,𝑡+𝑙) = 𝑟 𝑇 Λ𝑒Λ′ 𝑒 (1 + 𝑜(1)). By Chebyshev’s inequality, we have Therefore, we conclude that 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑒𝑖𝑡 𝑝 → 0 as 𝑇 → ∞. 1 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] 𝑝 → 𝑟 (𝑎𝑖 − ¯𝑎) as 𝑇 → ∞, which in turn gives that 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡 𝑝 → 2 𝑏 ∫ 1 0 𝑟 2 (𝑎𝑖 − ¯𝑎) (𝑎𝑖 − ¯𝑎)′ 𝑑𝑟 = 2 3𝑏 So, if we let 𝑇 → ∞ and then 𝑁 → ∞ sequentially, we have (𝑎𝑖 − ¯𝑎) (𝑎𝑖 − ¯𝑎)′ as 𝑇 → ∞. 1 𝑁𝑇 2 𝑁 ∑︁ 𝑖=1 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡 𝑝 → 2 3𝑏 1 𝑁 𝑁 ∑︁ 𝑖=1 (𝑎𝑖 − ¯𝑎) (𝑎𝑖 − ¯𝑎)′ as 𝑇 → ∞ = 2 3𝑏 1 𝑁 𝑁 ∑︁ 𝑖=1 (cid:0)𝑎𝑖𝑎′ 𝑖 − 𝑎𝑖 ¯𝑎′ − ¯𝑎𝑎′ 𝑖 + ¯𝑎 ¯𝑎′(cid:1) 𝑝 → 2 3𝑏 E(𝑎𝑖𝑎′ 𝑖) = 2 3𝑏 Λ𝑎Λ′ 𝑎 as 𝑁 → ∞, 45 where the last convergence follows from the WLLN. What we obtain here is the sequential limit of the first term in (2.12). However, the sequential limit is not necessarily equal to the joint limit. Phillips and Moon (1999) provide a framework to obtain joint convergence results through sequential convergence results under certain conditions. Following their approach, we first define sequential convergence and joint convergence for random matrices defined below and then introduce a lemma which gives a sufficient condition for sequential convergence to imply joint convergence. Definition 2A.1 Let 𝐺 𝑁𝑇 be defined as 𝐺 𝑁𝑇 := 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖𝑇 , where 𝐺𝑖𝑇 := 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡 𝑝 → 2 3𝑏 (𝑎𝑖 − ¯𝑎) (𝑎𝑖 − ¯𝑎)′ =: 𝐺𝑖, 𝑎𝑠 𝑇 → ∞. Further, define 𝐺 𝑁 := 1 𝑁 𝑁 ∑︁ 𝑖=1 𝑝 → 𝐺𝑖 2 3𝑏 Λ𝑎Λ𝑎 =: 𝐺, 𝑎𝑠 𝑁 → ∞. Definition 2A.2 (a) A sequence of 𝑘×𝑘 matrices 𝐺 𝑁𝑇 on (Ω, F , 𝑃) is said to converge in probability sequentially to 𝐺, if lim 𝑁→∞ lim 𝑇→∞ 𝑃 (∥𝐺 𝑁𝑇 − 𝐺 ∥ > 𝜀) = 0 ∀𝜀 > 0. (b) Suppose that the 𝑘 ×𝑘 random matrices 𝐺 𝑁𝑇 and 𝐺 are defined on a probability space (Ω, F , 𝑃). 𝐺 𝑁𝑇 is said to converge in probability jointly to 𝐺, if lim 𝑁,𝑇→∞ 𝑃 (∥𝐺 𝑁𝑇 − 𝐺 ∥ > 𝜀) = 0 ∀𝜀 > 0. Lemma 2A.1 Suppose there exist random matrices 𝐺 𝑁 and 𝐺 on the same probability space as 𝐺 𝑁𝑇 satisfying that, for all 𝑁, 𝐺 𝑁𝑇 𝑝 → 𝐺 𝑁 as 𝑇 → ∞ and 𝐺 𝑁 𝑝 → 𝐺 as 𝑁 → ∞. Then, 𝐺 𝑁𝑇 𝑝 → 𝐺 jointly if lim sup 𝑁,𝑇 𝑃 (∥𝐺 𝑁𝑇 − 𝐺 𝑁 ∥ > 𝜀) = 0 ∀𝜀 > 0. (2A.11) 46 Lemma 1 can be proved the same way Lemma 6 of Phillips and Moon (1999) is proved, with the only difference that the vector norm is replaced by a matrix norm, so the proof is omitted here. Now we verify condition (2A.11). By Markov’s inequality, Minkowski inequality (for infinite sum), and the fact that there is no heterogeneity of 𝐺𝑖𝑇 and 𝐺𝑖 across 𝑖, we have lim sup 𝑁,𝑇 𝑃 (∥𝐺 𝑁𝑇 − 𝐺 𝑁 ∥ > 𝜀) ≤ lim sup 𝑁,𝑇 1 𝜀 E (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖𝑇 − 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) ≤ lim sup 𝑁,𝑇 1 𝜀 E ∥𝐺𝑖𝑇 − 𝐺𝑖 ∥ . Because 𝐺𝑖𝑇 converges to 𝐺𝑖 in probability as 𝑇 → ∞, it suffices to show that for each 𝑖, {𝐺𝑖𝑇 }∞ is uniformly integrable for the last term to converge to 0. 
Let 𝜁 > 0 and consider E ∥𝐺𝑖𝑇 ∥1+𝜁 : 𝑇=1 E ∥𝐺𝑖𝑇 ∥1+𝜁 = E (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡 1+𝜁 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:32) ≤ 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ (cid:18) E 𝑡=1 (cid:32) ≤ 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ (cid:18) 𝑡=1 (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:13) (cid:13) (cid:13) (cid:13) E 2(1+𝜁)(cid:19) 1 2(1+𝜁 ) (cid:18) (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:13) 2(1+𝜁)(cid:19) (cid:13) (cid:13) (cid:13) E 1+𝜁 (cid:19) 1 1+𝜁 (cid:33) 1+𝜁 (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ (cid:13) 𝑖𝑡 (cid:13) (cid:13) (cid:13) 2(1+𝜁 ) (cid:33) 1+𝜁 1 , where the first and second inequalities follows from Minkowski’s and Hölder’s inequalities respec- tively. Let 𝜁 = 𝛿 4 with 𝛿 > 0 from Assumption 2.1 and consider (cid:18) E (cid:13) (cid:13) (cid:13) 𝑇 (cid:98)𝑆𝑖𝑡 1 (cid:13) (cid:13) (cid:13) 2(1+𝜁)(cid:19) 1 2(1+𝜁 ) : (cid:32) E (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑇 (cid:98)𝑆𝑖,𝑡=[𝑟𝑇] 2(1+𝜁)(cid:33) (cid:13) (cid:13) (cid:13) (cid:13) 1 2(1+𝜁 ) ≤ 1 𝑇 [𝑟𝑇] ∑︁ 𝑗=1 (E∥𝑦𝑖 𝑗 ∥2(1+𝜁)) 1 2(1+𝜁 ) + [𝑟𝑇] 𝑇 1 𝑇 (cid:16) E∥𝑦𝑖𝑡 ∥2(1+𝜁)(cid:17) 1 2(1+𝜁 ) < ∞, by Minkowski’s inequality and Assumption 2.2(i). Now we conclude that E∥𝐺𝑖𝑡 ∥1+𝜁 < ∞, i.e. 𝐺𝑖𝑡 is uniformly integrable by Theorem 6.13 of Hansen (2022). By uniform integrability of 𝐺𝑖𝑇 and convergence in probability of 𝐺𝑖𝑇 to 𝐺𝑖, we have 𝐿1 convergence: lim sup E ∥𝐺𝑖𝑇 − 𝐺𝑖 ∥ = 0. Then, 𝑁,𝑇 condition (2A.11) follows and we obtain the joint limit of 1 𝑁 (cid:205)𝑁 𝑖=1 𝐺𝑖𝑇 as 𝑁, 𝑇 → ∞. Specifically, we have 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖𝑇 = 1 𝑁𝑇 2 𝑁 ∑︁ 𝑖=1 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡 = 2 3𝑏 Λ𝑎Λ′ 𝑎 + 𝑜 𝑝 (1), as 𝑁, 𝑇 → ∞. Following similar steps, we obtain joint limits for the rest of the terms in (2.12): 1 𝑁𝑇 2 1 𝑀 𝑁 ∑︁ 𝑇−𝑀−1 ∑︁ (cid:16) (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖,𝑡+𝑀 + (cid:98)𝑆𝑖,𝑡+𝑀 (cid:98)𝑆′ 𝑖,𝑡 𝑡=1 𝑁 ∑︁ 𝑖=1 1 𝑁𝑇 2 1 𝑀 𝑇−1 ∑︁ (cid:16) (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑇 + (cid:98)𝑆𝑖𝑇 (cid:98)𝑆′ 𝑖𝑡 𝑖=1 𝑡=𝑇−𝑀 47 (cid:17) (cid:17) = (cid:18) 2 3𝑏 + (cid:19) 1 3 (1 − 𝑏)2Λ𝑎Λ′ 𝑎 + 𝑜 𝑝 (1), (2A.12) = (2 − 𝑏)Λ𝑎Λ′ 𝑎 + 𝑜 𝑝 (1). (2A.13) Combining the partial-sum representation in (2.10), (2.11), (2.12) and the results above, we obtain (2.13). Proof of Theorem 2.2: First, rewrite √ 𝑁 ( (cid:98)𝛽 − 𝛽) = (cid:98)𝑄−1 = (cid:98)𝑄−1 (cid:32) √ 𝑁 𝑁𝑇 (cid:34) 1 √ 𝑁 √ 𝑁 ( (cid:98)𝛽 − 𝛽) using the component structure representation: 𝑁 ∑︁ 𝑇 ∑︁ (cid:33) (𝑎𝑖 + 𝑔𝑡 + 𝑒𝑖𝑡) 𝑡=1 𝑎𝑖 + 𝑖=1 𝑁 ∑︁ 𝑖=1 √︂ 𝑁 𝑇 1 √ 𝑇 𝑇 ∑︁ 𝑡=1 𝑔𝑡 + 1 √ 𝑇 √ 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ (cid:35) 𝑒𝑖𝑡 . 𝑖=1 𝑡=1 Next, by Assumption 2.3(ii) and Hölder’s inequality we have, for some 𝑠 > 1 and 𝛿 > 0, (cid:16) (cid:16) ∥𝑥𝑖𝑡𝑢𝑖𝑡 ∥4(𝑠+𝛿)(cid:17) 𝑖𝑡 ∥4(𝑠+𝛿)(cid:17) ∥𝑥𝑖𝑡𝑥′ E E (cid:16) (cid:16) ∥𝑥𝑖𝑡 ∥8(𝑠+𝛿)(cid:17) 1/2 ∥𝑥𝑖𝑡 ∥8(𝑠+𝛿)(cid:17) 1/2 E E (cid:16) (cid:16) ∥𝑢𝑖𝑡 ∥8(𝑠+𝛿)(cid:17) 1/2 ∥𝑥𝑖𝑡 ∥8(𝑠+𝛿)(cid:17) 1/2 ≤ E ≤ E < ∞, < ∞, Now, we are back to the case of Assumption 2.2(ii). Then, by similar steps as in the proof of (cid:13) < ∞, (cid:13) Theorem 2.1, we have (cid:13) (cid:13) (cid:13) (cid:13) < ∞, and (cid:13) < ∞, (cid:13) (cid:13) (cid:13)Λ𝑎Λ′ 𝑎 (cid:13)Λ𝑔Λ′ 𝑔 (cid:13)Λ𝑒Λ′ 𝑒 1 √ 𝑁 1 √ 𝑇 𝑁 ∑︁ 𝑁 ∑︁ 𝑖=1 [𝑟𝑇] ∑︁ 𝑎𝑖 𝑑 → 𝑁 (0, Λ𝑎Λ′ 𝑎), 𝑔𝑡 ⇒ Λ𝑔𝑊𝑘 (𝑟), 𝑡=1 𝑇 ∑︁ 𝑒𝑖𝑡 𝑝 → 0, 𝑖=1 𝑡=1 √ 1 𝑁𝑇 (2A.14) (2A.15) (2A.16) as 𝑁, 𝑇 → ∞. As for (cid:98)𝑄, we can vectorize it and then decompose it in the same manner as the multivariate mean case: 𝑣𝑒𝑐(𝑥𝑖𝑡𝑥′ 𝑖𝑡) − 𝑣𝑒𝑐(𝑄) = 𝑎𝑥 𝑡 + 𝑒𝑥 𝑖𝑡, 𝑣𝑒𝑐( (cid:98)𝑄) − 𝑣𝑒𝑐(𝑄) = 𝑖𝑡) − 𝑣𝑒𝑐(𝑄)|𝛼𝑖], 𝑔𝑥 𝑎𝑥 𝑖 + 1 𝑇 𝑇 ∑︁ 𝑔𝑥 𝑡 + 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑒𝑖𝑡, 𝑖=1 𝑡=1 𝑖=1 𝑖𝑡) − 𝑣𝑒𝑐(𝑄)|𝛾𝑡], and 𝑒𝑥 𝑡 = E[𝑣𝑒𝑐(𝑥𝑖𝑡𝑥′ 𝑡=1 𝑖𝑡 = 𝑣𝑒𝑐(𝑥𝑖𝑡𝑥′ 𝑖𝑡) − 𝑡 . 
Then, we can apply the results of (2A.1) - (2A.4) and the fact that the sums where 𝑎𝑥 𝑖 = E[𝑣𝑒𝑐(𝑥𝑖𝑡𝑥′ 𝑖 − 𝑔𝑥 𝑣𝑒𝑐(𝑄) − 𝑎𝑥 𝑖 + 𝑔𝑥 𝑁 ∑︁ 1 𝑁 48 in (2A.1) are mutually uncorrelated to conclude that Var(𝑣𝑒𝑐( (cid:98)𝑄)) → 0. Then, by Chebyshev’s inequality we obtain 𝑣𝑒𝑐( (cid:98)𝑄) 𝑝 → 𝑣𝑒𝑐(𝑄), i.e. as 𝑁, 𝑇 → ∞, 𝑝 → 𝑄. (cid:98)𝑄 (2A.17) Therefore, as 𝑁, 𝑇 → ∞, we have √ (cid:16) 𝑁 (cid:17) (cid:98)𝛽 − 𝛽 ⇒ 𝑄−1 (cid:2)Λ𝑎𝑧 + √ 𝑐Λ𝑔𝑊 (1)(cid:3) as claimed for the first part of Theorem 2.2. Next, for the second part we define the partial sums in the same fashion as (2.8) and (2.9): [𝑟𝑇] ∑︁ 𝑡=1 𝑁 ∑︁ (cid:98)𝑆𝑖,[𝑟𝑇] = (cid:98)¯𝑆 [𝑟𝑇] = 𝑣𝑖𝑡, 𝑥𝑖𝑡(cid:98) [𝑟𝑇] ∑︁ 𝑖=1 𝑡=1 𝑣𝑖𝑡 . 𝑥𝑖𝑡(cid:98) With similar steps as in proving (2A.17), we can show 1 𝑁𝑇 𝑁 ∑︁ [𝑟𝑇] ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 = 𝑁 [𝑟𝑇] 𝑁𝑇 1 𝑁 [𝑟𝑇] 𝑁 ∑︁ [𝑟𝑇] ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 𝑝 → 𝑟𝑄. and Therefore, as 𝑁, 𝑇 → ∞, we have 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 𝑝 → 𝑟𝑄. (cid:98)¯𝑆 [𝑟𝑇] ⇒ 𝑟Λ𝑎𝑊𝑘 (1) + √ 𝑐Λ𝑔𝑊𝑘 (𝑟) − 𝑟𝑄 (cid:16) 𝑄−1 (cid:2)Λ𝑎𝑊𝑘 (1) + √ 𝑐Λ𝑔𝑊𝑘 (1)(cid:3) (cid:17) √ 1 𝑁𝑇 √ = 𝑐Λ𝑔 (cid:101)𝑊𝑘 (𝑟). Then, similarly as it is shown in (2A.10) we have 𝑁 𝑁 2𝑇 2 (cid:40) 2 𝑀 𝑇−1 ∑︁ ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 − 1 𝑀 𝑇−𝑀−1 ∑︁ 𝑡=1 (cid:16) ′ ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡+𝑀 + (cid:98)¯𝑆𝑡+𝑀(cid:98)¯𝑆 𝑡 (cid:17) (cid:41) 𝑡=1 ∫ 1 (cid:26) 2 𝑏 (cid:16) ⇒ 𝑐Λ𝑔 = 𝑐Λ𝑔𝑃 0 𝑏, (cid:101)𝑊𝑘 (𝑟) (cid:17) Λ′ 𝑔. (cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊 𝑘 (𝑟)′𝑑𝑟 − 1 𝑏 ∫ 1−𝑏 0 (cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊𝑘 (𝑟 + 𝑏)′ + (cid:101)𝑊𝑘 (𝑟 + 𝑏) (cid:101)𝑊𝑘 (𝑟)′(cid:3) 𝑑𝑟 (cid:2) (cid:27) Λ′ 𝑔 (2A.18) 49 For the rest of the terms of (cid:98)ΩCHS, we again apply Lemma 1 to obtain the joint limit through the sequential limit. Consider 1 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] using the component representation: 1 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] = [𝑟𝑇] 𝑇 𝑎𝑖 + 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑔𝑡 + 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑒𝑖𝑡 − 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 (cid:16) (cid:17) . (cid:98)𝛽 − 𝛽 Note that 1 𝑇 (cid:205)[𝑟𝑇] 𝑡=1 𝑒𝑖𝑡 𝑝 → 0 as 𝑇 → ∞ due to Var( 1 𝑇 (cid:205)[𝑟𝑇] 𝑡=1 𝑒𝑖𝑡) = 𝑂 (1/𝑇) and Chebyshev’s inequality. Then, given fixed 𝑁 and as 𝑇 → ∞, we have (cid:98)𝛽 − 𝛽 = (cid:98)𝑄−1 (cid:34) 1 𝑁 𝑁 ∑︁ 𝑖=1 𝑎𝑖 + 1 𝑇 𝑇 ∑︁ 𝑡=1 𝑔𝑡 + 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ (cid:35) 𝑒𝑖𝑡 𝑖=1 𝑡=1 𝑝 → 𝑄−1 ¯𝑎𝑖 and so 1 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] 𝑝 → 𝑟 (𝑎𝑖 − ¯𝑎𝑖), which is the same as the sample mean estimator case. Define 𝐺 𝑁𝑇 := 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖𝑇 , 𝐺𝑖𝑇 := 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡, 𝐺𝑖 := 2 3𝑏 where (cid:98)𝑆𝑖𝑡 = (cid:98)𝑆𝑖,[𝑟𝑇] = (cid:205)[𝑟𝑇] 𝑡=1 𝑣𝑖𝑡. 𝑥𝑖𝑡(cid:98) (𝑎𝑖 − ¯𝑎) (𝑎𝑖 − ¯𝑎)′ , 𝐺 𝑁 := 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖, 𝐺 := 2 3𝑏 Λ𝑎Λ𝑎 From the Proof of Theorem 2.1, we know that to prove condition (2A.11) it suffices to show the uniform integrability of {𝐺𝑖𝑇 } for any 𝑖. For some 𝜁 > 0, we have lim 𝑀→∞ sup 𝑇 E (∥𝐺𝑖𝑇 ∥ ; ∥𝐺𝑖𝑇 ∥ > 𝑀) ≤ lim 𝑀→∞ E sup 𝑇 (cid:32) ∥𝐺𝑖𝑇 ∥ (cid:19) 𝜁 (cid:18) ∥𝐺𝑖𝑇 ∥ 𝑀 (cid:33) ; ∥𝐺𝑖𝑇 ∥ > 𝑀 ≤ lim 𝑀→∞ sup 𝑇 1 𝑀 𝜁 E (cid:16) ∥𝐺𝑖𝑇 ∥1+𝜁 (cid:17) ≤ lim 𝑀→∞ sup 𝑇 (cid:32) 1 𝑀 𝜁 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ (cid:18) 𝑡=1 (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:13) (cid:13) (cid:13) (cid:13) E 2(1+𝜁)(cid:19) 1 2(1+𝜁 ) (cid:18) (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:13) 2(1+𝜁)(cid:19) (cid:13) (cid:13) (cid:13) E 2(1+𝜁 ) (cid:33) 1+𝜁 1 . (cid:18) Now consider E 1 2(1+𝜁 ) (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:13) 2(1+𝜁)(cid:19) (cid:13) (cid:13) (cid:13) . 
By Minkowski’s inequality, we have 2(1+𝜁)(cid:33) 1 2(1+𝜁 ) (cid:32) E (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑇 (cid:98)𝑆𝑖,𝑡=[𝑟𝑇] (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 − 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 ( (cid:98)𝛽 − 𝛽) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) = (cid:169) E (cid:173) (cid:171) 1 2(1+𝜁 ) 2(1+𝜁) 1 2(1+𝜁 ) (cid:170) (cid:174) (cid:172) 1 2(1+𝜁 ) 2(1+𝜁) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) E ≤ (cid:169) (cid:173) (cid:171) (cid:170) (cid:174) (cid:172) E + (cid:169) (cid:173) (cid:171) (cid:13) (cid:13) (cid:98)𝑄𝑇 (cid:98)𝑄−1 (cid:13) 𝑁𝑇 (cid:13) (cid:13) 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 2(1+𝜁) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:170) (cid:174) (cid:172) 50 where we denote (cid:98)𝑄𝑇 = 1 𝑇 uniformly over 𝑇 by applying Minkowski’s inequality for infinite sums and Hölder’s inequality under 𝑖𝑡. The first term is easily bounded 𝑖𝑡 and (cid:98)𝑄 𝑁𝑇 = 1 𝑁𝑇 𝑥𝑖𝑡𝑥′ 𝑥𝑖𝑡𝑥′ 𝑡=1 𝑡=1 (cid:205)𝑁 𝑖=1 (cid:205)𝑇 (cid:205)𝑇 Assumption 2.3. By applying Hölder’s inequality to the second term twice, we have (cid:13) (cid:13) (cid:98)𝑄𝑇 (cid:98)𝑄−1 (cid:13) 𝑁𝑇 (cid:13) (cid:13) E 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 2(1+𝜁) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:18) E ≤ (cid:13) (cid:13) (cid:98)𝑄𝑇 (cid:13) (cid:13) (cid:13) (cid:13) 4(1+𝜁)(cid:19) 1/2 (cid:18) (cid:13) (cid:13) (cid:98)𝑄−1 (cid:13) 𝑁𝑇 (cid:13) (cid:13) (cid:13) E 4𝑝(1+𝜁)(cid:19) 1/𝑝 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) E (cid:169) (cid:173) (cid:171) 4𝑞(1+𝜁) 1/𝑞 (cid:170) (cid:174) (cid:172) (2A.19) where 1 𝑝 + 1 𝑞 = 1 and 𝑝, 𝑞 ∈ [1, ∞]. The first term in (2A.19) is bounded with a straightforward application of Minkowski’s inequality for infinite sums and Hölder’s inequality under Assumption 2.3. Note that we have shown (cid:98)𝑄 𝑁𝑇 4𝑝(1+𝜁) (cid:13) 𝑁𝑇 ∥ < ∞ and so E that ∥ (cid:98)𝑄−1 (cid:13) (cid:13) (cid:13) (cid:13) (cid:98)𝑄−1 (cid:13) 𝑁𝑇 𝑝 → 𝑄. By Assumption 2.3(ii), we have ∥𝑄−1∥ < ∞. It follows < ∞ given that 𝑝 and 𝜁 are finite. To determine 𝑝 and 𝜁, observe that (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) E (cid:169) (cid:173) (cid:171) 4𝑞(1+𝜁) 1 4𝑞 (1+𝜁 ) (cid:170) (cid:174) (cid:172) ≤ ≤ 1 𝑁𝑇 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑁 ∑︁ 𝑡=1 𝑇 ∑︁ 𝑖=1 𝑡=1 (cid:16) E ∥𝑥𝑖𝑡𝑢𝑖𝑡 ∥4𝑞(1+𝜁)(cid:17) 1 4𝑞 (1+𝜁 ) , (cid:16) E ∥𝑥𝑖𝑡 ∥8𝑞(1+𝜁) E ∥𝑢𝑖𝑡 ∥8𝑞(1+𝜁)(cid:17) 1 8𝑞 (1+𝜁 ) where the first inequality follows from Minkowski’s inequality for infinite sums and the second line follows from Hölder’s inequality. Let 𝑞 = 𝑠 and 𝜁 = 𝛿/𝑞 = 𝛿/𝑠 where 𝑠 and 𝛿 are from Assumption 2.3, then 𝑝 = 𝑠 𝑠−1 and it follows that all three terms in (2A.19) are bounded uniformly over 𝑇. Therefore, we conclude that E∥𝐺𝑖𝑡 ∥1+𝜁 < ∞ and so 𝐺𝑖𝑡 is uniformly integrable. Then, condition (2A.11) is established and we obtain the joint limit of 𝐺 𝑁𝑇 as 𝑁, 𝑇 → ∞: 𝐺 𝑁𝑇 𝑝 → 2 3𝑏 Λ𝑎Λ′ 𝑎. Following the same steps for the rest of terms in (2.12) leads to the same results as (2A.12) and (2A.13). 
Proof of Theorem 2.3: Following the proof of Theorem 2.2, the (cid:98)ΩDK part of the variance estimator can be rewritten as a function of partial sums where the function is defined in (2.11) and so the 51 joint limit follows from (2A.18): 𝑁 (cid:98)ΩDK = 1 𝑁𝑇 2 = 1 𝑁𝑇 2 𝑇 ∑︁ 𝑇 ∑︁ 𝑘 𝑠=1 𝑇−1 ∑︁ 𝑡=1 (cid:40) 2 𝑀 (cid:18) |𝑡 − 𝑠| 𝑀 (cid:19) (cid:32) 𝑁 ∑︁ (cid:33) 𝑁 ∑︁ 𝑣𝑖𝑡 (cid:98) 𝑗=1 𝑣′ (cid:169) 𝑗 𝑠(cid:170) (cid:98) (cid:173) (cid:174) (cid:171) (cid:172) (cid:16) ′ ′ 𝑡+𝑀 + (cid:98)¯𝑆𝑡+𝑀(cid:98)¯𝑆 (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 𝑖=1 𝑇−𝑀−1 ∑︁ (cid:41) (cid:17) ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 − 1 𝑀 𝑡=1 𝑏, (cid:101)𝑊𝑘 (𝑟) (cid:17) Λ′ 𝑔. 𝑡=1 ⇒ 𝑐Λ−1 𝑔 𝑃 (cid:16) For the (cid:98)ΩA part of the variance estimator, under Assumption 2.3 we can apply Lemma 2 of Chiang et al. (2024) giving 𝑁 (cid:98)ΩA = 1 𝑁𝑇 2 𝑁 ∑︁ (cid:32) 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑣𝑖𝑡 (cid:98) (cid:33) (cid:32) 𝑇 ∑︁ 𝑠=1 (cid:33) 𝑣′ 𝑖𝑠 (cid:98) 𝑝 → Λ𝑎Λ′ 𝑎. (2A.20) Proof of Theorem 2.4: Define Λ𝑥𝑢Λ′ 𝑥𝑢 = Var(𝑥𝑖𝑡𝑢𝑖𝑡) and Λ𝑥𝑥Λ′ 𝑥𝑥 = E(𝑥𝑖𝑡𝑥′ 𝑖𝑡). By Jensen’s inequality, Hölder’s inequality, and Assumption 2.4(ii), we have (cid:13) (cid:13)Λ𝑥𝑢Λ′ 𝑥𝑢 (cid:13) (cid:13)Λ𝑥𝑥Λ′ 𝑥𝑥 (cid:13) (cid:13) = ∥E [(𝑥𝑖𝑡𝑢𝑖𝑡)(𝑥𝑖𝑡𝑢𝑖𝑡)′] ∥ ≤ E ∥𝑥𝑖𝑡𝑢𝑖𝑡 ∥2 ≤ (cid:13) = (cid:13) (cid:13) (cid:13) (cid:13) = E ∥𝑥𝑖𝑡 ∥2 < ∞. (cid:13) ≤ E (cid:13) (cid:13)E(𝑥𝑖𝑡𝑥′ (cid:13)𝑥𝑖𝑡𝑥′ 𝑖𝑡 𝑖𝑡)(cid:13) (cid:16) E ∥𝑥𝑖𝑡 ∥4(cid:17) 1/2 (cid:16) E ∥𝑢𝑖𝑡 ∥4(cid:17) 1/2 < ∞, Then, by the WLLN, the functional central limit theorem for i.i.d random vectors, and Slutsky’s Theorem, we have √ 𝑁𝑇 ( (cid:98)𝛽 − 𝛽) = (cid:32) 1 𝑁𝑇 √ 1 𝑁𝑇 (cid:98)¯𝑆 [𝑟𝑇] = √ 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑥𝑖𝑡𝑥′ 𝑖𝑡 (cid:33) −1 (cid:32) √ 1 𝑁𝑇 𝑖=1 𝑁 ∑︁ 𝑡=1 [𝑟𝑇] ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 − 1 𝑁𝑇 𝑖=1 𝑡=1 𝑁 ∑︁ 𝑇 ∑︁ (cid:33) 𝑥𝑖𝑡𝑢𝑖𝑡 ⇒ 𝑄−1Λ𝑥𝑢𝑊𝑘 (1), (2A.21) 𝑁 ∑︁ 𝑖=1 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 √ (cid:16) 𝑁𝑇 (cid:17) (cid:98)𝛽 − 𝛽 ⇒ Λ𝑥𝑢 (cid:101)𝑊𝑘 (𝑟), (2A.22) as 𝑁, 𝑇 → ∞. Then, due to the partial sum representation of (cid:98)ΩDK, we have 𝑁𝑇 (cid:98)ΩDK ⇒ Λ𝑥𝑢𝑃 (cid:16) 𝑏, (cid:101)𝑊𝑘 (𝑟) (cid:17) Λ′ 𝑥𝑢, 52 where 𝑃 (cid:16) 𝑏, (cid:101)𝑊𝑘 (𝑟) (cid:17) is defined the same way as (2.13). The probability limit of (2.10) scaled by 𝑁𝑇 follows from Lemma 3 of Chiang et al. (2024): 1 𝑁𝑇 𝑁 ∑︁ 𝑖=1 (cid:98)𝑆𝑖𝑇 (cid:98)𝑆′ 𝑖𝑇 𝑝 → Λ𝑥𝑢Λ′ 𝑥𝑢. (2A.23) To derive the joint asymptotic limit of (2.12), we first obtain its sequential limit and then apply Theorem 1 of Phillips and Moon (1999) to show the joint limit is given by the sequential limit. By the WLLN and functional CLT for i.i.d. random vectors with finite variance, for given 𝑁 and as 𝑇 → ∞, we have 1 𝑇 1 √ 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 𝑝 →𝑟𝑄, 𝑎𝑠 𝑇 → ∞ 𝑥𝑖𝑡𝑢𝑖𝑡 ⇒Λ𝑥𝑢𝑊𝑖,𝑘 (𝑟), where 𝑊𝑖,𝑘 (𝑟) is a 𝑘 × 1 vector of standard Wiener process for each 𝑖. Therefore, for given 𝑁 and as 𝑇 → ∞, we have (cid:32) √ 𝑇 ( (cid:98)𝛽 − 𝛽) = 𝑁 ∑︁ 𝑇 ∑︁ 𝑥𝑖𝑡𝑥′ 𝑖𝑡 (cid:33) −1 (cid:32) 1 𝑁𝑇 𝑡=1 1 √ 𝑇 𝑁 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 (cid:33) 𝑥𝑖𝑡𝑢𝑖𝑡 ⇒ 𝑄−1Λ𝑥𝑢 ¯𝑍, 1 √ 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] = 1 √ 𝑇 𝑥𝑖𝑡𝑢𝑖𝑡 − 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 √ 𝑇 (cid:16) (cid:17) (cid:98)𝛽 − 𝛽 ⇒ Λ𝑥𝑢 (cid:0)𝑊𝑖,𝑘 (𝑟) − 𝑟 ¯𝑍 (cid:1) , 𝑖=1 [𝑟𝑇] ∑︁ 𝑡=1 for each 𝑖 and ¯𝑍 = 1 𝑁 (cid:205)𝑁 𝑖=1 𝑍𝑖 where 𝑍𝑖 is a 𝑘 × 1 vector of standard normal random variables. Because the convergence of a sequence of matrices 𝐴𝑛 to some matrix 𝐴0 holds if and only if 𝑒′𝐴𝑛𝑒 converges to 𝑒 𝐴0𝑒 for any comfortable constant vector vector 𝑒, we can assume without loss of generality that 𝑘 = 1. 
The sequential limit of the first term of (2.12), scaled by $NT$, is obtained as follows:

$$Y_{i,T} := \frac{2}{[bT]} \frac{1}{T} \sum_{t=1}^{T-1} \widehat{S}_{it} \widehat{S}_{it} \Rightarrow \Lambda_{xu} \frac{2}{b} \int_0^1 \left( W_{i,k}(r) - r\bar{Z} \right)^2 dr \, \Lambda_{xu} =: Y_i, \quad \text{as } T \to \infty,$$
$$\frac{1}{N} \sum_{i=1}^{N} Y_i \xrightarrow{p} \frac{2}{b} \int_0^1 \Lambda_{xu} \, r \, dr \, \Lambda_{xu} = \frac{\Lambda_{xu}\Lambda_{xu}}{b}, \quad \text{as } N \to \infty,$$

where the equality in the second line follows from the Tonelli Theorem. Noting that there is no heterogeneity across $i$ due to the i.i.d. sequences, the conditions needed for Theorem 1 of Phillips and Moon (1999) reduce to the following: (i) $\limsup_{T\to\infty} E|Y_{i,T}| < \infty$; (ii) $\limsup_{T\to\infty} |EY_{i,T} - EY_i| = 0$; (iii) $\limsup_{N,T\to\infty} E \left( |Y_{i,T}|; |Y_{i,T}| > N\varepsilon \right) = 0$ for all $\varepsilon > 0$; and (iv) $\limsup_{N\to\infty} E \left( |Y_i|; |Y_i| > N\varepsilon \right) = 0$ for all $\varepsilon > 0$. Therefore, it suffices to show uniform integrability of $Y_{i,T}$ and $Y_i$. Uniform integrability of $Y_i$ is trivial since it is equivalent to showing $E|Y_i| < \infty$. To show uniform integrability of $Y_{i,T}$, fix $\varepsilon > 0$. We want to show that $\sup_{N,T} E|Y_{i,T}| < \infty$ and that there exists $\delta$ such that if $P(A) < \delta$ then $\sup_{N,T} E(|Y_{i,T}|; A) < \varepsilon$. By Hölder’s inequalities, we have

$$E(|Y_{i,T}|; A) \leq \frac{2}{[bT]} \sum_{t=1}^{T-1} E \left( \left| \frac{1}{T} \widehat{S}_{it} \widehat{S}_{it} \right|; A \right) \leq \frac{2}{[bT]} \sum_{t=1}^{T-1} E \left( \frac{1}{T} \widehat{S}_{it}^2; A \right), \qquad E|Y_{i,T}| \leq \frac{2}{[bT]} \sum_{t=1}^{T-1} E \left( \frac{1}{T} \widehat{S}_{it}^2 \right).$$

Thus, it is equivalent to show uniform integrability of $\frac{1}{T}\widehat{S}_{it}^2$. Notice that $\frac{1}{T}\widehat{S}_{it}^2 = \frac{1}{T}S_{it}^2 + o_p(1)$ under the consistency of $\widehat{\beta}$, and so by the asymptotic equivalence lemma we have

$$E \left| \frac{1}{T} \widehat{S}_{it}^2 \right| = E \left| \frac{1}{T} S_{it}^2 \right| \quad \text{as } N, T \to \infty.$$

Observe that

$$E \left| \frac{1}{T} S_{it}^2 \right| = \frac{1}{T} E \left( \sum_{t=1}^{T} x_{it}u_{it} \right)^2 = E(x_{it}u_{it}u_{it}x_{it}) \quad \forall T,$$

where the second equality follows from the fact that $\{x_{it}u_{it}\}$ are i.i.d. across $t$. Then, under Assumption 2.4(ii), there exists some constant $C < \infty$ such that $E \left| \frac{1}{T} S_{it}^2 \right| < C$. Integrating both sides of this inequality over $A$ gives

$$CP(A) > \int_A \frac{1}{T} S_{it}^2 \, dP = E \left( \frac{1}{T} S_{it}^2; A \right) \quad \forall T \in \mathbb{N},$$

where the equality follows from the fact that $\frac{1}{T}S_{it}^2 \geq 0$. So, if we take $\delta = \varepsilon/C$, then

$$\sup_{N,T} E \left( \frac{1}{T} \widehat{S}_{it}^2; A \right) = \sup_{N,T} E \left( \frac{1}{T} S_{it}^2; A \right) < \varepsilon.$$

It follows that $\{\frac{1}{T}\widehat{S}_{it}^2\}$ is uniformly integrable and so $Y_{i,T}$ is uniformly integrable. Therefore, Theorem 1 of Phillips and Moon (1999) applies and we obtain $Y_{i,T} \xrightarrow{p} \frac{1}{b}\Lambda_{xu}\Lambda_{xu}$. Similarly, the joint fixed-$b$ limits of the rest of the terms of (2.12) are obtained as follows:

$$\frac{1}{NT} \frac{2}{[bT]} \sum_{i=1}^{N} \sum_{t=1}^{T-M-1} \widehat{S}_{it} \widehat{S}_{i,t+M} \xrightarrow{p} \frac{(1-b)^2}{b} \Lambda_{xu}\Lambda_{xu} \quad \text{as } N, T \to \infty,$$
$$\frac{1}{NT} \frac{2}{[bT]} \sum_{i=1}^{N} \sum_{t=T-M}^{T} \widehat{S}_{it} \widehat{S}_{iT} \xrightarrow{p} \frac{1-(1-b)^2}{b} \Lambda_{xu}\Lambda_{xu} \quad \text{as } N, T \to \infty.$$

Arranging the joint limits obtained above, we find that $NT(\widehat{\Omega}_{A} - \widehat{\Omega}_{NW}) = o_p(1)$ under Assumption 2.4, which combined with (2A.21)-(2A.23) delivers the desired result.
CHAPTER 3
INFERENCE IN HIGH-DIMENSIONAL PANEL MODELS: TWO-WAY DEPENDENCE AND UNOBSERVED HETEROGENEITY

3.1 Introduction

In economic research, high dimensionality typically refers to a large number of unknown parameters relative to the sample size, under which traditional estimation methods are either infeasible or tend to yield estimates too noisy to be informative. The issue of high dimensionality becomes more relevant as data availability grows and economic modeling involves more flexibility. Commonly, the problem of high dimensionality appears in at least the following three scenarios:

• The dimension of observable and potentially relevant variables can be large relative to the sample. In the trade literature, preferential trade agreements (PTAs) usually involve a large number of provisions even though most policy analysis focuses only on the effect of a small subset of the provisions.[1] In demand analysis, even if the focus is on the own-price elasticity, the prices of relevant goods should also be included, unless strong assumptions for aggregation are imposed (see Chernozhukov et al., 2019).

• With nonparametric or semiparametric modeling, the unknown functions are viewed as infinite-dimensional parameters regardless of the dimension of the observable variables. If the unknown function $g(X)$ is approximately sparse and can be well approximated by a linear combination of the 3rd-order polynomial transformation of $X$, then it would involve 285 transformed regressors when the dimension of $X$ is 10 and 1770 when we start with a dimension of 20.[2]

• The modeling of heterogeneity can raise the number of nuisance parameters drastically. In demand analysis, income effects are specific to products if the homothetic preference assumption fails. For difference-in-differences analysis, allowing unit-specific trends and heterogeneous trends across the covariates can relax/test the parallel trend assumption. For models with unobserved heterogeneity that appears in a nonlinear way, either treating the heterogeneity as parameters to be estimated (fixed effects) or modeling it in a flexible way (correlated random effects) contributes to high dimensionality.[3]

[1] Based on data from Mattoo et al. (2020), 282 PTAs were signed and notified to the WTO between 1958 and 2017, encompassing 937 provisions across 17 policy areas. See Breinlich et al. (2022).
[2] For a vector $X$ with dimension $k$, it is easy to show that the 2nd-order polynomial transformation generates $\frac{k^2 + 3k}{2}$ terms and the 3rd-order polynomial transformation generates $k + \frac{1}{2}k(k+1) + \frac{1}{2}\sum_{l=1}^{k} l(l+1) = \frac{1}{6}k^3 + k^2 + \frac{11}{6}k$ terms.
[3] This is particularly relevant in the trade literature, where the unobserved heterogeneity derived from the gravity model takes a pairwise form among the importers, the exporters, and time. As each of these three dimensions expands, the number of nuisance parameters explodes quickly. See Correia et al. (2020), Chiang et al. (2021), and Chiang et al. (2023b), for example.

Particularly, the modeling of heterogeneity in panel models makes high dimensionality more of a practical issue rather than just a theoretical concern. As a concrete example, let us consider a panel model where all three sources of high dimensionality are involved:

$$Y_{it} = D_{it}\theta_0 + g_0(X_{it}, c_i, d_t) + U_{it}, \quad (3.1)$$

where $D_{it}$ is a vector of low-dimensional treatment or policy variables and $X_{it}$ is a vector of potentially high-dimensional control variables. $D_{it}$ can also contain some higher-order effects and interactive effects with a subset of the controls to allow for nonlinear and heterogeneous effects in a parametric way. When $D_{it}$ is not conditionally unconfounded, instrumental variables may be used for identification of $\theta_0$; $g_0(\cdot)$ is an unknown function, e.g., an infinite-dimensional parameter; $c_i$ and $d_t$ are unobserved heterogeneous effects, either as fixed-effect parameters or correlated random variables.
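To see concretely how fast the dictionary for $g_0$ grows, the counts in footnote [2] can be checked with a few lines of Python (an illustrative sketch added here; it uses the standard closed form that the number of non-constant monomials of degree at most $d$ in $k$ variables is $\binom{k+d}{d} - 1$):

```python
# Quick check of the dictionary sizes in footnote [2]: the number of
# non-constant polynomial terms of degree <= d in k variables.
from math import comb

def n_poly_terms(k: int, d: int) -> int:
    """Number of monomials of degree between 1 and d in k variables."""
    return comb(k + d, d) - 1

print(n_poly_terms(10, 3))  # 285, matching (1/6)k^3 + k^2 + (11/6)k at k = 10
print(n_poly_terms(20, 3))  # 1770, the count at k = 20
print(n_poly_terms(10, 2))  # 65 = (k^2 + 3k)/2 at k = 10
```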
The interest lies in the inference on the low-dimensional parameters $\theta_0$. Without considering the features of panel data and the unobserved heterogeneity, this is a classic partial linear model that has been well studied in the previous semiparametric literature. To reduce the dimensionality, sparse approximation and regularization approaches have been widely employed. Essentially, regularization, also known as the machine learning approach, trades off bias for smaller variance to achieve desirable rates of convergence. However, due to the bias introduced by regularization and overfitting, inference can be challenging. Typically, bias correction is involved to obtain estimators with better statistical properties and to conduct valid inference.

In the case of panel data, one soon realizes that at least three challenges appear if researchers attempt to apply the existing high-dimensional approaches directly. First, the statistical properties of many regularized estimators remain unknown with panel data, where the observations are potentially dependent across space/unit and time. Second, some bias-correction procedures for inference, such as sample-splitting/cross-fitting, are very particular about the sampling assumption, and existing approaches are not valid under two-way dependence in panels. Third, panel data is often leveraged to model unobserved individual and time effects, which may introduce another source of high dimensionality and further complicate estimation and inference.

To address the first challenge, I propose a variant of LASSO that uses regressor-specific penalty weights robust to two-way cluster dependence and weak temporal dependence across clusters. This LASSO approach is labeled the two-way cluster-LASSO, corresponding to the heteroskedasticity-robust LASSO in Belloni et al. (2012) and the cluster-LASSO in Belloni et al. (2016). The approach theoretically derives the common penalty level $\lambda$ up to a constant and a small-order sequence that do not vary across different data-generating processes. Therefore, data-driven tuning, such as cross-validation, is not needed, which makes the approach more computationally efficient and avoids non-trivial theories that take data-driven tuning into account.

A common and important condition for obtaining the desirable statistical properties of LASSO selection/estimation is the so-called "regularization event", which states that the overall penalty level is sufficiently large to dominate the "noise" in the high-dimensional estimation (but not too large at the same time, to avoid under-selection and a slow rate of convergence). However, existing approaches for ensuring such an event with probability approaching one are not applicable here due to the two-way cluster dependence. Instead, by considering the component structure characterization of the two-way dependence and decomposing the error terms using Hajek projections, I am able to leverage the moderate deviation theorems by Peña et al. (2009) and Gao et al. (2022) and the concentration inequality by Fuk and Nagaev (1971) for bounding the tail probability of the "noise" term.
Combining with existing non-asymptotic bounds for the LASSO approach in Belloni et al. (2012), I derive the rate of convergence for the (post) two-way cluster-LASSO. According to the rate-of-convergence results, the proposed (post) LASSO estimator is consistent for the slope coefficients of the sparse model. However, it is also revealed that the convergence rate is not as fast as the common rates for LASSO estimation, due to the two-way cluster dependence. The problem lies in the underlying component structure. To illustrate, consider the simplest multivariate mean model through a component structure representation:

$$Y_{it} = \theta_0 + f(\alpha_i, \gamma_t, \varepsilon_{it}), \quad (3.2)$$

where $Y_{it}$ is a high-dimensional vector with dimension $s = o(NT)$ and $\theta_0 = E[Y_{it}]$; $\alpha_i$, $\gamma_t$, and $\varepsilon_{it}$ are unobserved random elements. This is a common characterization of cluster dependence in the literature on cluster-robust inference. We notice that $\alpha_i$ introduces cluster/temporal dependence within group $i$ and $\gamma_t$ introduces cluster/cross-sectional dependence within group $t$. To estimate the high-dimensional vector $\theta_0$, we consider the sample mean estimator $\widehat{\theta} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} Y_{it}$. We can rewrite the estimator through a Hajek projection:

$$\widehat{\theta} - \theta_0 = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} (a_i + g_t + e_{it}) = \frac{1}{N} \sum_{i=1}^{N} a_i + \frac{1}{T} \sum_{t=1}^{T} g_t + \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} e_{it}, \quad (3.3)$$

where $a_i := E[Y_{it} - \theta_0 | \alpha_i]$, $g_t := E[Y_{it} - \theta_0 | \gamma_t]$, and $e_{it} := Y_{it} - \theta_0 - a_i - g_t$. For illustration purposes, suppose those components are i.i.d. sequences and independent of each other. Then it can be shown that, under some regularity conditions, for each $j = 1, ..., s$,

$$\widehat{\theta}_j - \theta_{0j} = O_P \left( \frac{1}{\sqrt{N \wedge T}} \right) \quad \text{and} \quad \|\widehat{\theta} - \theta_0\|_2 = \left( \sum_{j=1}^{s} \left( \widehat{\theta}_j - \theta_{0j} \right)^2 \right)^{1/2} = O_P \left( \sqrt{\frac{s}{N \wedge T}} \right).$$

While $\widehat{\theta}$ is still consistent when $N, T$ diverge at the same rate, $\|\widehat{\theta} - \theta_0\|_2$ converges more slowly than $o_P\left((NT)^{-1/4}\right)$, which is a common rate requirement for inferential theory.

This is where the second challenge arises: if a faster rate of convergence is not achievable due to the two-way cluster dependence, some bias-correction approach is needed to relax the rate requirement for valid inference. One common approach in the semiparametric literature is to add a correction term to the original identifying moment function. This results in an orthogonalized moment condition that often features multiplicative error terms, closely related to doubly robust estimators. Although the orthogonality property allows the nuisance estimations to be noisier, it generally does not ensure valid inference when there is growing dimensionality in the unknown parameters. An extra bias-correction approach, sample splitting or its generalization, cross-fitting, has been proposed for inference in high-dimensional regression models. The idea of sample splitting in a two-step estimation is to split the sample in a proper way and use the sub-samples separately for each step. If the sub-samples are independent of each other, then the first-step estimates will be independent of the sample used for the second-step estimation. With this property, the error term that causes the bias can vanish under a less stringent rate requirement on the first step. Intuitively, the dependence between the two steps is eliminated, so that a potentially over-fitted nuisance estimate from the first step does not pollute the second-step estimator as much as it otherwise would.
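The slow rate in 3.3 is easy to see in a small simulation. The sketch below (added for illustration; it assumes independent standard Gaussian components, a special case of the decomposition above) shows that the Monte Carlo standard deviation of $\widehat{\theta}_j - \theta_{0j}$ tracks $\sqrt{1/N + 1/T}$, i.e., the $1/\sqrt{N \wedge T}$ order, rather than $1/\sqrt{NT}$:

```python
# Minimal simulation of the component-structure mean model (3.2)-(3.3):
# Y_it = theta0 + a_i + g_t + e_it with independent Gaussian components.
# The Monte Carlo sd of theta_hat tracks sqrt(1/N + 1/T), not 1/sqrt(N*T).
import numpy as np

rng = np.random.default_rng(0)

def mc_sd(N, T, reps=2000):
    errs = np.empty(reps)
    for r in range(reps):
        a = rng.normal(size=(N, 1))   # unit components a_i
        g = rng.normal(size=(1, T))   # time components g_t (i.i.d. here)
        e = rng.normal(size=(N, T))   # idiosyncratic part e_it
        errs[r] = (a + g + e).mean()  # theta_hat - theta0
    return errs.std()

for N, T in [(25, 25), (100, 100), (400, 400)]:
    print(N, T, round(mc_sd(N, T), 4), round(np.sqrt(1 / N + 1 / T), 4))
```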
However, sample-splitting, as well as cross-fitting, is very sensitive to the sampling assumption. Building upon recent developments of cross-fitting approaches for dependent data (Chiang et al. (2022); Semenova et al. (2023a)), I propose a clustered-panel cross-fitting scheme, and I show that the constructed main and auxiliary samples are “approximately” independent of each other. Effectively, this inferential procedure extends the double/debiased machine learning (DML, hereafter) approach by Chernozhukov et al. (2018a) to panel data models, and it is labeled the panel DML. Asymptotic normality for the panel DML estimator is established given high-level assumptions on the convergence rates of the first-step estimator. It is shown that the crude requirement on the rate of convergence can be relaxed to $o\left((N \wedge T)^{-1/4}\right)$ in the $L_2$ norm, which admits first-step estimation through the two-way cluster-LASSO.

For the third challenge, caused by the unobserved heterogeneity, the existing literature proposes to use the fixed-effect approach, assuming the unknown function $g_0$ in 3.1 is linear in $(c_i, d_t)$ (Belloni et al., 2016; Kock and Tang, 2019), or to use the high-dimensional common correlated effects approach, assuming $g_0$ is linear in the interactive fixed effects (Vogt et al., 2022). To allow for flexible functional forms while remaining tractable, I propose to model $(c_i, d_t)$ as correlated random effects through a generalized Mundlak device while assuming the unknown function to be approximately sparse. In that way, a very rich form of heterogeneity is permitted. Not all of those nonlinear and heterogeneous effects are relevant, and the identities of the truly relevant effects are unknown to the researcher, so suitable machine learning approaches, e.g., the two-way cluster-LASSO, can be used to select the relevant effects. However, there is one more subtle issue: common approaches, including the Mundlak device, that deal with the unobserved heterogeneous effects introduce cross-sectional and temporal sample averages, which in turn bring dependence across the cross-fitting sub-samples. Furthermore, even if it remains valid under extra conditions, cross-fitting often causes a loss of efficiency due to the exclusion of observations. On the other hand, without cross-fitting, valid inference remains challenging for high-dimensional panel models in general. Nevertheless, in the case of the partial linear panel model, I further show that inferential theory can be established using the full sample.

In the empirical application, I re-examine the effect of government spending on the output of an open economy following the framework of Nakamura and Steinsson (2014), a well-cited empirical-macro paper. Although they study it using a panel data approach with unobserved heterogeneous effects whose number grows with the sample size, the baseline setting is not considered a high-dimensional problem: a linear panel model with only a few covariates and additive unobserved heterogeneous effects, identified through an instrumental variable. However, even in a conventionally low-dimensional setting, high dimensionality can be hidden, because the true model can be highly nonlinear in the covariates and the unobserved heterogeneous effects can enter the model in a flexible way. To avoid the endogeneity caused by potential misspecification of the functional form, I consider extending the baseline model in a more flexible way as in 3.1.
Due to potential two-way cluster dependence, existing high-dimensional methods designed for independent or weakly dependent data may not be valid. This is where the proposed dependence-robust estimation and inference methods for high-dimensional models can be leveraged, and the results can be used as a robustness check. It is shown that the estimates are consistent with the baseline results, which indicates that the nonlinear and interactive effects may not be very relevant in this model. Existing estimation and inference methods that are not robust to either high dimensionality or two-way cluster dependence tend to over-fit and result in noisy estimates and inaccurate inference.

The rest of the paper is outlined as follows. The next sub-section reviews the relevant literature and summarizes the differences and contributions of this paper relative to existing work. Section 3.2 presents the two-way cluster-LASSO estimator and the investigation of its statistical properties under two-way cluster dependence. Section 3.3 introduces a sub-sampling scheme designed for cross-fitting that allows within-cluster dependence and weak dependence across clusters. It is then used as a bias-correction approach for valid inference on the low-dimensional parameter, accounting for the effect of the high-dimensional nuisance estimation. In Section 3.4, the partial linear model with unobserved heterogeneity is studied in detail as a leading example. Simulation evidence is given in Section 3.5, where the proposed approaches compete with existing ones. In Section 3.6, the empirical estimation of the government spending multiplier is used as an illustration of hidden high dimensionality and an application of the proposed toolkit. Section 3.7 concludes the paper with a discussion of limitations and detailed empirical recommendations.

3.1.1 Relation to the Literature

This paper builds upon the literature on $l_1$-regularization methods in high-dimensional regression. Bickel et al. (2009) derive the convergence rate of the prediction error in terms of the empirical norm under homogeneous Gaussian error, restricted eigenvalue, and sparsity conditions. Bühlmann and Van De Geer (2011) instead assume a sub-Gaussian tail property to derive similar convergence-rate results. See Section 29.11 of Hansen (2022) for an illustration and extension of Bickel et al. (2009)’s analysis under heteroskedasticity. Under Gaussian or sub-Gaussian errors, Basu and Michailidis (2015), Kock and Callot (2015), and Lin and Michailidis (2017) study LASSO-based approaches for dependent data. To allow for both non-Gaussian errors and dependent data, Wu and Wu (2016), Chernozhukov et al. (2021a), Babii et al. (2022, 2023), and Gao et al. (2024) derive Nagaev-type concentration inequalities to bound the tail probability assuming a proper order of the penalty level. However, all the aforementioned LASSO-based approaches require delicate tuning of the penalty level to ensure desirable finite sample performance. The common cross-validation approaches and the bootstrap in Chernozhukov et al. (2021a) for choosing the penalty level are computationally costly and very sensitive to the sampling assumption. Moreover, the statistical analysis accounting for a data-driven penalty level is highly non-trivial (see Chetverikov et al., 2021 for the validity of cross-validated LASSO under random sampling). As another strand, Belloni et al.
(2011, 2012, 2016) propose other variants of LASSO approaches and leverage (self-normalized) moderate deviation theorems to derive theoretically-driven penalty levels. However, their methodologies cannot be easily extended to settings with two-way dependence. The proposed variant of LASSO builds upon the aforementioned literature and employs both Nagaev-type inequalities (Fuk and Nagaev, 1971) and moderate deviation theorems for self-normalized sums (Peña et al., 2009; Gao et al., 2022). To my knowledge, it is the first LASSO-based high-dimensional estimator that is robust to two-way cluster dependence.

The inferential theory in high-dimensional regression models typically relies on some bias-correction method, and such methods are particularly important here due to the two-way cluster dependence that results in a slow rate of convergence. Bias-correction approaches for inference purposes take various forms in the literature: for example, the low-dimensional projection adjustment in Zhang and Zhang (2014), the de-sparsification procedure in Van de Geer et al. (2014), the decorrelating matrix adjustment in Javanmard and Montanari (2014), the double selection approach in Belloni et al. (2014), the decorrelated score construction in Ning and Liu (2017), and the Neyman orthogonal moment construction in Chernozhukov et al. (2018a, 2022a). The last strand of the literature is often labeled DML, which is closely related to the earlier semiparametric literature, including Ichimura (1987), Robinson (1988), Powell et al. (1989), Newey (1994), and Andrews (1994). The idea of the orthogonalization is to add a correction term to the original identifying moment function so that the second-step estimator is less sensitive to the plug-in of noisy first steps. Due to the resulting multiplicative error term in the orthogonal moment condition, it is closely related to doubly-robust methods. Newey (1994) provides a general construction of the orthogonal moment condition through influence functions. It is further developed by Ichimura and Newey (2022) for identifying moment conditions satisfying certain restrictions. See Chernozhukov et al. (2018a) and Chernozhukov et al. (2022a) for a summary of such constructions and known orthogonal moment functions. More recently, Chernozhukov et al. (2018b, 2021b, 2022b,c) and Jordan et al. (2023) provide an alternative approach by estimating the correction term without knowing its analytical form. For the inferential theory in high-dimensional panel models, this paper takes the orthogonalization step as given and focuses on nuisance estimation and cross-fitting.

Sample-splitting, or cross-fitting, serving as another bias-correction approach, has been widely employed in other two-step estimations. The role of cross-fitting in high-dimensional inferential theory is to remove the dependence between the nuisance estimation and the second-step estimation so that the over-fitting bias from the first step has less impact on the second step. Technically, it allows for a slower rate of convergence in the first step, which in turn relaxes the sparsity condition (e.g., Belloni et al., 2014). Chernozhukov et al. (2018a) generalize the sample-splitting procedure as a cross-fitting scheme, which further improves finite sample performance by reducing the noise due to arbitrary splitting of the sample. Chiang et al. (2021, 2022) propose a cross-fitting scheme robust to separately and jointly exchangeable arrays. Semenova et al.
(2023a) propose a leave-one-neighborhood-out cross-fitting and introduce a coupling approach (due to Strassen, 1965 and Berbee, 1987) to prove the validity of cross-fitting under temporal dependence. The idea of the leave-one-neighborhood-out sub-sampling scheme is also shared by h-block cross-validation (Burman et al., 1994; Racine, 2000) and the big-block-small-block technique in the time series literature (e.g., Gao et al., 2022). Building upon this literature, I propose a more robust cross-fitting scheme that is valid under not only cluster dependence but also weak temporal dependence across clusters.

This paper also belongs to the cluster-robust inference literature. The characterization of the two-way cluster dependence is based on the Aldous-Hoover-Kallenberg (AHK) type representation, which is common in this literature (e.g., Djogbenou et al., 2019, Roodman et al., 2019, Davezies et al., 2019, and Menzel, 2021). The original representation only applies to exchangeable arrays, and exchangeability is violated in panel data settings with autocorrelation over time. Chiang et al. (2024) generalize this representation by allowing the time factor to be correlated over time, and Chen and Vogelsang (2024) also consider this representation when deriving fixed-b asymptotic results for inference. Although it differs from the original representation theorem, it is a fairly general characterization of two-way dependence in panels. Such a characterization of the dependence structure is common in economic studies (e.g., Rajan and Zingales, 1998, Fama and French, 2000, Li et al., 2004, Larrain, 2006, Thompson, 2011, Nakamura and Steinsson, 2014, Guvenen et al., 2017, and Ellison et al., 2024, among many others).

3.1.2 Notation

Here is a collection of frequently used notation in this paper; some extra notation is defined within the context. $E$ and $P$ are generic expectation and probability operators. $\mathcal{P}_{NT}$ is an expanding collection of all data-generating processes $P$ that satisfy certain conditions. $P_{NT}$ is a sequence of probability laws such that $P_{NT} \in \mathcal{P}_{NT}$ for each $(N, T)$. The dependence on $(N, T)$ and $P_{NT}$ will be suppressed whenever it is clear from the context. $\|\cdot\|$ is the Euclidean (Frobenius) norm for a matrix. Let $x$ be a generic $k \times 1$ real vector; the $l_q$ norm is denoted as $\|x\|_q := \left( \sum_{j=1}^{k} |x_j|^q \right)^{1/q}$ for $1 \leq q < \infty$, and $\|x\|_\infty := \max_{1 \leq j \leq k} |x_j|$. The $L_q(P)$ norm is denoted as $\|f\|_{P,q} := \left( \int \|f(\omega)\|^q dP(\omega) \right)^{1/q}$, where $f$ is a random element with probability law $P$. I denote the empirical average of $f_{it}$ over $i = 1, ..., N$ and $t = 1, ..., T$ as $\mathbb{E}_{NT}[f_{it}] = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} f_{it}$ and the empirical $L_2$ norm as $\|f_{it}\|_{NT,2} = \left( \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \|f_{it}\|^2 \right)^{1/2}$. Correspondingly, I denote the empirical average of $f_{it}$ over the sub-sample $i \in I_k$ and $t \in S_l$ as $\mathbb{E}_{kl}[f_{it}] = \frac{1}{N_k T_l} \sum_{i \in I_k, t \in S_l} f_{it}$ and the empirical $L_2$ norm over the sub-sample as $\|f_{it}\|_{kl,2} = \left( \frac{1}{N_k T_l} \sum_{i \in I_k} \sum_{t \in S_l} \|f_{it}\|^2 \right)^{1/2}$, where $I_k, S_l$ are sub-sample index sets and $N_k, T_l$ are sub-sample sizes that will be introduced in the next section.

3.2 Two-Way Cluster LASSO

In the existing literature, little is known about the statistical properties of high-dimensional methods under cluster dependence in both the cross-section and time. In this section, a variant of the $l_1$-regularization methods, also known as the LASSO, is proposed and examined.
To focus on the LASSO approach under two-way dependence, I consider a simple conditional expectation model of a scalar outcome given a potentially high-dimensional vector of covariates. Let $(Y_{it}, X_{it})$ be a sample with $i = 1, ..., N$ and $t = 1, ..., T$. The conditional expectation model can be expressed as follows:

$$Y_{it} = f(X_{it}) + V_{it}, \quad E[V_{it}|X_{it}] = 0, \quad (3.4)$$

where $f(X_{it}) := E[Y_{it}|X_{it}]$ is an unknown conditional expectation function of the potentially high-dimensional covariates $X_{it}$, and $V_{it}$ is the associated stochastic error. To characterize the two-way cluster dependence in the panel, I assume the random elements $W_{it} := (Y_{it}, X_{it}, V_{it})$ are generated by the following process:

Assumption 3.1 (Aldous-Hoover-Kallenberg Component Structure Characterization)

$$W_{it} = \mu + f(\alpha_i, \gamma_t, \varepsilon_{it}), \quad \forall i \geq 1, t \geq 1, \quad (3.5)$$

where $\mu = E_P[W_{it}]$ and $f$ is some unknown measurable function; $(\alpha_i)_{i \geq 1}$, $(\gamma_t)_{t \geq 1}$, and $(\varepsilon_{it})_{i \geq 1, t \geq 1}$ are mutually independent sequences, $\alpha_i$ is i.i.d. across $i$, $\varepsilon_{it}$ is i.i.d. across $i$ and $t$, and $\gamma_t$ is strictly stationary.

Assumption AHK is motivated by a representation theorem for exchangeable arrays, named after Aldous-Hoover-Kallenberg (AHK, hereafter), which states that if an array of random variables $(X_{ij})_{i \geq 1, j \geq 1}$ is separately or jointly exchangeable,[4] then $X_{ij} = f(\xi_i, \zeta_j, \iota_{ij})$, where $(\xi_i)_{i \geq 1}$, $(\zeta_j)_{j \geq 1}$, and $(\iota_{ij})_{i \geq 1, j \geq 1}$ are mutually independent, uniformly distributed i.i.d. random variables.[5] However, exchangeability is not likely to hold for arrays with a temporal dimension, since time is naturally ordered. In macroeconomics, for instance, we can interpret the time components $(\gamma_t)_{t \geq 1}$ as unobserved common time shocks, which are naturally correlated over time, implying that exchangeability is violated. Therefore, allowing $\gamma_t$ to be correlated introduces temporal dependence across all clusters and makes the characterization more sensible. The relaxation of the independence condition on $(\gamma_t)_{t \geq 1}$ can be viewed as a generalization of the component structure representation, as argued by Chiang et al. (2024). It is clear that under Assumption AHK, $W_{it}$ and $W_{is}$ are correlated for any $i, t, s$ due to sharing the same cross-sectional cluster. Similarly, due to sharing the same temporal cluster, $W_{it}$ and $W_{jt}$ are dependent for any $t, i, j$. Furthermore, even when sharing neither the cross-sectional nor the temporal cluster, observations can still be dependent due to the correlated time effects $\gamma_t$. It is important to note that the components in 3.5 simply characterize the dependence in panel data. Differing from factor models or models with unobserved heterogeneity, they do not affect the identification of the regression model in any way.

[4] An array $(X_{ij})_{i \geq 1, j \geq 1}$ is separately exchangeable if $\left( X_{\pi(i), \pi'(j)} \right) \stackrel{d}{=} \left( X_{ij} \right)$ for any permutations $\pi, \pi'$ of the indices, and jointly exchangeable if the same condition holds with $\pi = \pi'$.
[5] This is first proved in Aldous (1981) and independently proved and generalized to higher-dimensional arrays in Hoover (1979). It is further studied in Kallenberg (1989). For a formal statement of the theorem, see, for example, Theorem 7.22 in Kallenberg (2005).

Throughout the paper, the time effects $\gamma_t$ are weakly dependent, with a regularity condition introduced as follows. Before that, a few more concepts and notations are needed. Let $(X, Y)$ be random elements taking values in a Euclidean space $S = (S_1 \times S_2)$ with probability laws $P_X$ and $P_Y$, respectively.
Let $\|\nu\|_{TV}$ denote the total variation norm of a signed measure $\nu$ on a measurable space $(S, \Sigma)$, where $\Sigma$ is a $\sigma$-algebra on $S$:

$$\|\nu\|_{TV} = \sup_{A \in \Sigma} \left( \nu(A) - \nu(A^c) \right).$$

Define the dependence coefficient of $X$ and $Y$ as

$$\beta(X, Y) = \frac{1}{2} \|P_{X,Y} - P_X \times P_Y\|_{TV}.$$

The next assumption regulates the dependence of $\gamma_t$ using the beta-mixing coefficient:

Assumption 3.2 (Absolute Regularity) The sequence $\{\gamma_t\}_{t \geq 1}$ is beta-mixing at a geometric rate:

$$\beta_\gamma(m) = \sup_{s \leq T} \beta \left( \{\gamma_t\}_{t \leq s}, \{\gamma_t\}_{t \geq s+m} \right) \leq c_\kappa \exp(-\kappa m), \quad \forall m \in \mathbb{Z}_+, \quad (3.6)$$

for some constants $\kappa > 0$ and $c_\kappa \geq 0$.

Condition AR, also known as the beta-mixing condition, restricts the temporal dependence of the common time effects to decay at an exponential rate. It is common in the literature (see, for example, Hahn and Kuersteiner, 2011 and Fernández-Val and Lee, 2013) and can be generated by common autoregressive models, as in Baraud et al. (2001).

Due to the potential high dimensionality in $X$, traditional nonparametric methods are not feasible for estimating the unknown function $f$. To reduce the dimensionality, a common approach is to consider sparsity in the model and reduce the dimension through regularization. However, the unknown function $f$ is an infinite-dimensional parameter, which is not exactly sparse. Therefore, I take a sparse approximation approach following Belloni et al. (2012):

Assumption 3.3 (Approximate Sparse Model) The unknown function $f$ can be well approximated by a dictionary of transformations $f_{it} = F(X_{it})$, where $f_{it}$ is a $p \times 1$ vector and $F$ is a measurable map, such that $f(X_{it}) = f_{it}\zeta_0 + r_{it}$, where the coefficients $\zeta_0$ and the approximation error $r_{it}$ satisfy

$$\|\zeta_0\|_0 \leq s = o(N \wedge T), \qquad \|r_{it}\|_{NT,2} = O_P \left( \sqrt{\frac{s}{N \wedge T}} \right).$$

Assumption ASM views the high-dimensional linear regression as an approximation. It requires a subset of the parameters $\zeta_0$ to be zero while controlling the size of the approximation error. Compared to the sparsity conditions in the previous literature, it imposes a slower rate of growth on the number of non-zero slope coefficients. For example, $s = o(NT)$ corresponds to the case of the heteroskedasticity-robust LASSO under i.i.d. data in Belloni et al. (2012); $s = o(N l_T)$ corresponds to the cluster-robust LASSO under temporally dependent panel data in Belloni et al. (2016), where $l_T \in [1, T]$ is an information index that equals $T$ when there is no temporal dependence and equals 1 when there is cross-sectional independence and perfect temporal dependence. In other words, the underlying component structure restricts the growth of the number of nonzero slope coefficients of the model in a similar way to the perfect temporal dependence case. Under Assumption ASM, we can rewrite the model 3.4 as

$$Y_{it} = f_{it}\zeta_0 + r_{it} + V_{it}, \quad E[V_{it}|X_{it}] = 0. \quad (3.7)$$

Using 3.7, we can apply $l_1$ regularization to the least squares problem. Let $\lambda$ be some non-negative common penalty level and $\omega$ be some non-negative $p \times p$ diagonal matrix of regressor-specific penalty weights. Consider the following generic weighted LASSO estimator:

$$\widehat{\zeta} = \arg\min_{\zeta} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} (Y_{it} - f_{it}\zeta)^2 + \frac{\lambda}{NT} \|\omega^{1/2}\zeta\|_1. \quad (3.8)$$

To obtain the desirable properties of LASSO estimation, one needs to choose $\lambda$ and $\omega$ in an optimal way, so that the penalty level is large enough to avoid noisy estimation due to overfitting but also as small as possible, since the size of the penalty determines the performance bound of the LASSO estimation and too large a penalty level can cause missing variable bias.
In other words, the overall penalty level, given by both $\lambda$ and $\omega$, decides the trade-off between overfitting variance and regularization bias. For example, let $\ddot{f}_{it}$ be the demeaned $f_{it}$ using the sample mean,[6] and one common choice of $\omega$ is the empirical Gram matrix $E[\ddot{f}_{it}'\ddot{f}_{it}]$, which standardizes the regressors so that the model selection is not affected by the scale of the regressors. The common penalty level $\lambda$ is often chosen by some cross-validation algorithm. If the chosen $\lambda$ satisfies a certain asymptotic order, then the key condition that regularizes the tail behavior of the error term can be established under conditional Gaussian or sub-Gaussian error conditions (see Bickel et al., 2009, Bühlmann and Van De Geer, 2011, and Theorem 29.3 of Hansen, 2022):

$$\max_{j=1,...,p} \left| \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \omega_j^{-1/2} f_{it,j} V_{it} \right| \leq \frac{\lambda}{2 c_1 NT}. \quad (3.9)$$

[6] The demeaning is done because of the inclusion of the intercept term. To avoid penalizing the intercept, it is usually projected out first.

Condition 3.9 is referred to as the “regularization event” in the literature. Combined with some non-asymptotic bounds for LASSO, the rate of convergence for $\widehat{\zeta}$ can be derived. This approach, however, is not applicable when the error terms are considered to exhibit heavy tails. Alternatively, Fuk-Nagaev-type concentration inequalities have been established to verify Condition 3.9 without relying on the Gaussian or sub-Gaussian assumption (e.g., Babii et al., 2023, 2024; Gao et al., 2024). These alternative approaches, again, rely on cross-validation for choosing penalty levels, which is computationally costly and sensitive to the tuning of the cross-validation. It is further complicated when the cross-validation needs to be adjusted for dependent data. Belloni et al. (2012) propose to self-normalize the regressors through regressor-specific penalty weights and to leverage moderate deviation theorems for self-normalized sums (see Jing et al., 2003 and Peña et al., 2009) to verify Condition 3.9. The common penalty level $\lambda$ of this approach is theoretically derived and determined only by the sample size, the number of regressors, a small-order sequence, and some constants that do not vary across data-generating processes. When the dependence is present only in the temporal dimension, the existing approach for independent data can be extended by clustering over the cross-sectional dimension (see Belloni et al. (2016)). However, there is no simple extension if the dependence is present in both the temporal and cross-sectional dimensions. Instead, I utilize the component structure characterization of the dependence and decompose the high-dimensional error term $f_{it}V_{it}$ using a Hajek projection into three parts:

$$a_i = E[f_{it}V_{it}|\alpha_i], \quad g_t = E[f_{it}V_{it}|\gamma_t], \quad e_{it} = f_{it}V_{it} - a_i - g_t,$$

where $a_i$ are i.i.d. over $i$, $g_t$ are weakly dependent over $t$, and the remainder can be shown to be of small order and well-behaved too. With this observation, appropriate regressor-specific penalty weights can be constructed, and existing moderate deviation theorems and concentration inequalities can be leveraged. I propose the following common penalty level $\lambda$ and (infeasible) penalty weights:

$$\lambda = C_\lambda \frac{NT}{(N \wedge T)^{1/2}} \Phi^{-1} \left( 1 - \frac{\gamma}{2p} \right), \quad (3.10)$$
$$\omega_j = \frac{N \wedge T}{N^2} \sum_{i=1}^{N} a_{i,j}^2 + \frac{N \wedge T}{T^2} \sum_{b=1}^{B} \left( \sum_{t \in H_b} g_{t,j} \right)^2, \quad (3.11)$$
where $a_{i,j} = E[f_{it,j}V_{it}|\alpha_i]$ and $g_{t,j} = E[f_{it,j}V_{it}|\gamma_t]$ for $j = 1, ..., p$; $C_\lambda$ is some sufficiently large constant and $\gamma$ is a small-order sequence. The convergence rate of $\gamma$ affects the convergence rate of the LASSO estimator: as is revealed later, $\gamma$ should be $o(1)$ for LASSO to be consistent, while a larger $\gamma$ is necessary for a faster convergence rate. Both $C_\lambda$ and $\gamma$ do not vary across different data-generating processes. While there is some guidance about choosing $C_\lambda$ and $\gamma$, the exact choice is not given by the theoretical results. In practice, it is found that $C_\lambda = 2$ and $\gamma = 0.1/\log(p \vee N \vee T)$ deliver desirable finite sample performance.

Looking at the definition of $\omega$ in 3.11, we notice that the first term in 3.11 is a variance estimator for i.i.d. random variables and the second term is a cluster variance estimator, as in Bester et al., 2008, where $B$ is the number of clusters/blocks, $h$ is the block length, and $H_b$ is the associated index set. Technically, they are chosen as $B = \mathrm{round}(T/h)$, $h = \mathrm{round}(T^{1/5}) + 1$, and, for $b = 1, ..., B$, $H_b = \{t : h(b-1) + 1 \leq t \leq hb\}$.

To implement the penalty weights in 3.11, however, we need to estimate $a_{i,j} = E[f_{it,j}V_{it}|\alpha_i]$ and $g_{t,j} = E[f_{it,j}V_{it}|\gamma_t]$, which brings two challenges. With some initial estimation, we can replace $V_{it}$ with the initial residual $\widetilde{V}_{it}$; $\widetilde{V}_{it}$ can then be updated iteratively by the residuals from the estimation in 3.8 until it converges, meaning that $\widetilde{V}_{it}$ no longer updates up to a small difference. A common estimator for $a_{i,j}$ is then $\widehat{a}_{i,j} = \frac{1}{T}\sum_{t=1}^{T} f_{it,j}\widetilde{V}_{it}$. Similarly, we use $\widehat{g}_{t,j} = \frac{1}{N}\sum_{i=1}^{N} f_{it,j}\widetilde{V}_{it}$ for estimating $g_{t,j}$. Observe that this choice of implementing $\widehat{a}_{i,j}^2$ is equivalent to the feasible penalty weights of the cluster-LASSO in Belloni et al. (2016). It shows that the first term of $\omega$ clusters over time, so it adjusts for the temporal dependence within each unit cluster. The second term clusters over individuals first and then clusters over time within each block, so it adjusts for cross-sectional dependence and weak temporal dependence across time. The validity of estimating those components through cross-sectional and temporal averages is given in Menzel (2021) for exchangeable arrays. Extending the consistency results to non-exchangeable arrays is not trivial, and establishing the uniform convergence result, required due to the high dimensionality, is rather challenging and not the focus of this paper. Following Belloni et al. (2012) and Belloni et al. (2016), the statistical analysis of the weighted LASSO approach is based on high-level assumptions on the feasible penalty weights: let $\widehat{\omega}$ be the feasible diagonal weights and suppose there exist $0 < 1/c_1 < l \leq 1$ and $1 \leq u < \infty$ such that $l \to 1$ and

$$l \, \omega_j^{1/2} \leq \widehat{\omega}_j^{1/2} \leq u \, \omega_j^{1/2}, \quad \text{uniformly over } j = 1, ..., p, \quad (3.12)$$

where $\{\omega_j\}$ and $\{\widehat{\omega}_j\}$ are the diagonal entries of $\omega$ and $\widehat{\omega}$, respectively.

Algorithm: Implementation of the Two-Way Cluster-LASSO

i Obtain the initial residuals $\widetilde{V}$: estimate a model with a certain (user-specified) number of the most correlated regressors.[7]

ii Set $\lambda$ according to 3.10 with $C_\lambda = 2$ and $\gamma = 0.1/\log(p \vee N \vee T)$. Calculate $\widetilde{\omega}$ according to 3.11.

iii Use $\widetilde{\omega}$ for the LASSO estimation as in 3.8 and update the residuals $\widetilde{V}$ using the (post) LASSO estimates.[8]

[7] This step is for better convergence of the iterative estimation of the penalty weights.
A small number of initially included regressors can cause failure to converge.
[8] While they are asymptotically equivalent, post-LASSO suffers from less shrinkage bias in finite samples.

iv Repeat steps ii-iii until convergence. Obtain the (post) LASSO estimates from the last iteration.

In the low-dimensional case, a key identifying condition is that the population Gram matrix $E_P[f_{it}'f_{it}]$ is non-singular, so that the empirical Gram matrix is also non-singular with high probability. However, as we allow the dimension of $f_{it}$ to be larger than the sample size, the empirical Gram matrix $\mathbb{E}_{NT}[f_{it}'f_{it}]$ is singular. Fortunately, it turns out that we only need certain sub-matrices to be well-behaved for identification. Define

$$\phi_{min}(m)(M_f) := \min_{\delta \in \Delta(m)} \delta' M_f \delta \quad \text{and} \quad \phi_{max}(m)(M_f) := \max_{\delta \in \Delta(m)} \delta' M_f \delta,$$

where $\Delta(m) = \{\delta : \|\delta\|_0 = m, \|\delta\|_2 = 1\}$ and $M_f = \mathbb{E}_{NT}[f_{it}'f_{it}]$.

Assumption 3.4 (Sparse Eigenvalues) For any $C > 0$, there exist constants $0 < \kappa_1 < \kappa_2 < \infty$ such that, with probability approaching one as $(N, T) \to \infty$ jointly,

$$\kappa_1 \leq \phi_{min}(Cs)(M_f) < \phi_{max}(Cs)(M_f) \leq \kappa_2.$$

The sparse eigenvalue assumption follows from Belloni et al. (2012). It implies a restricted eigenvalue condition, which represents a modulus of continuity between the prediction norm and the norm of $\delta$ within a restricted set. More primitive sufficient conditions are discussed in Bickel et al. (2009) and Belloni et al. (2012).

Assumption 3.5 (Regularity Conditions) (i) $\log(p/\gamma) = o\left( T^{1/6}/(\log T)^2 \right)$ and $p = o\left( T^{7/6}/(\log T)^2 \right)$. (ii) For some $\mu > 1$ and $\delta > 0$, $\max_{j \leq p} E[|f_{it,j}|^{8(\mu+\delta)}] < \infty$ and $E[|V_{it}|^{8(\mu+\delta)}] < \infty$. (iii) $\min_{j \leq p} E(a_{i,j}^2) > 0$ and $\min_{j \leq p} E(g_{t,j}^2) > 0$.

Assumption REG(i) restricts the dimension of $f_{it}$ relative to the sample size. Although the number of regressors is constrained to be of a small order relative to the overall sample size $NT$ as $N, T \to \infty$ jointly, it is still allowed to be greater than the sample size in finite samples. Note that this requirement is more of a technical constraint and may be further relaxed with refined concentration inequalities for two-way dependent arrays. The moment conditions in Assumption REG(ii) are common in the literature. REG(iii) is a non-degeneracy condition, which is the main case of interest.

A common way to mitigate the shrinkage bias of LASSO is to apply least squares estimation based on the model selected by LASSO, which is named Post-LASSO. The next theorem covers this estimator as well. Let $\widehat{\Gamma} = \{j \in 1, ..., p : |\widehat{\zeta}_j| > 0\}$, where $\widehat{\zeta}_j$ are the two-way LASSO estimates. The next theorem gives convergence rates for both the two-way cluster-LASSO and its associated Post-LASSO.

Theorem 3.1 Suppose Assumptions AHK, ASM, AR, and REG hold for model 3.4 as $N, T \to \infty$ jointly with $N/T \to c$. Then, by setting $\lambda$ as in 3.10 with some sufficiently large $C_\lambda$, we have (i) the event 3.9 happens with probability approaching one. Additionally, suppose that Assumption SE holds and $\widehat{\omega}$ satisfies condition 3.12. Let $\widehat{\zeta}$ be the two-way cluster-LASSO estimator or the post-LASSO estimator based on the two-way cluster-LASSO selection. Then, (ii) $\|\widehat{\zeta}\|_0 = O_P(s)$, and (iii)

$$\frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( f_{it}\widehat{\zeta} - f_{it}\zeta_0 \right)^2 = O_P \left( \frac{s \log(p/\gamma)}{N \wedge T} \right), \quad \left\| \widehat{\zeta} - \zeta_0 \right\|_1 = O_P \left( s \sqrt{\frac{\log(p/\gamma)}{N \wedge T}} \right), \quad \left\| \widehat{\zeta} - \zeta_0 \right\|_2 = O_P \left( \sqrt{\frac{s \log(p/\gamma)}{N \wedge T}} \right).$$
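As a concrete illustration of step ii of the algorithm above, the following sketch computes the penalty level 3.10 and the feasible weights 3.11 from the dictionary and a current residual (an illustrative sketch only: the function name and array layout are my own, and the iterative updating of the residual and the LASSO step itself are omitted):

```python
# Sketch of the two-way cluster-LASSO penalty level (3.10) and feasible
# weights (3.11); f has shape (N, T, p), V has shape (N, T).
import numpy as np
from scipy.stats import norm

def twoway_penalty(f, V, C_lambda=2.0):
    N, T, p = f.shape
    gamma = 0.1 / np.log(max(p, N, T))
    lam = C_lambda * N * T / np.sqrt(min(N, T)) * norm.ppf(1 - gamma / (2 * p))

    fV = f * V[:, :, None]      # f_{it,j} * V_it, shape (N, T, p)
    a_hat = fV.mean(axis=1)     # a_{i,j} estimated by time averages, (N, p)
    g_hat = fV.mean(axis=0)     # g_{t,j} estimated by unit averages, (T, p)

    # Block sums of g_hat over H_b with h = round(T^{1/5}) + 1 and
    # B = round(T/h); any tail periods beyond h*B are dropped for simplicity.
    h = int(round(T ** 0.2)) + 1
    B = int(round(T / h))
    blocks = [g_hat[h * b:min(h * (b + 1), T)].sum(axis=0) for b in range(B)]
    block_sums = np.stack(blocks)   # shape (B, p)

    m = min(N, T)
    omega = (m / N**2) * (a_hat**2).sum(axis=0) \
          + (m / T**2) * (block_sums**2).sum(axis=0)
    return lam, omega
```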
Theorem 3.1 establishes convergence rates in terms of the prediction, $l_1$, and $l_2$ norms for the (post) two-way cluster-LASSO estimator in an approximately sparse model. These results are the first to give convergence rates for a LASSO-based estimator allowing for two-way cluster dependence. It is shown that under two-way cluster dependence, the two-way cluster-LASSO is consistent but has a convergence rate slower than those of LASSO-based methods under random sampling or weak dependence. Without loss of generality, let $N = N \wedge T$; then, by choosing $\gamma$ according to $\log(1/\gamma) \simeq \log(p \vee NT)$, we have

$$\left\| \widehat{\zeta} - \zeta_0 \right\|_2 = O_P \left( \sqrt{\frac{s \log(p \vee NT)}{N}} \right).$$

As a comparison, the rate of convergence in terms of the $l_2$ norm is $O_P\left(\sqrt{\frac{s \log p}{NT}}\right)$ under the random sampling and homoskedastic Gaussian error assumptions in Bickel et al. (2009) or the heteroskedastic Gaussian errors in Theorem 19.3 of Hansen (2022), $O_P\left(\sqrt{\frac{s \log(p \vee NT)}{NT}}\right)$ under random sampling in Belloni et al. (2012), and $O_P\left(\sqrt{\frac{s \log(p \vee NT)}{N l_T}}\right)$ under cross-sectional independence in Belloni et al. (2016), where the information index $l_T = 1$ when there is perfect dependence within the cross-sectional cluster. As illustrated in the Introduction, the slow rate of convergence is due to the underlying component structure. It is unclear whether valid inference is possible under the rate-of-convergence results in Theorem 3.1 or whether it is possible to relax the requirement through a cross-fitting procedure. These questions are addressed in the next section.

3.3 Clustered-Panel Cross-Fitting and Inference

In this section, I first propose a sub-sampling scheme for cross-fitting in a two-way clustered panel and then propose a general inference procedure using cross-fitting for a high-dimensional panel model. The idea of the sub-sampling scheme is to split the sample in a proper way so that the two resulting sub-samples are independent or, at least, “approximately” independent. With such properties, the sub-sampling scheme can be used for various purposes. For example, it can be used for cross-fitting in a two-step estimation, since it effectively eliminates the dependence between the two steps, which in turn relaxes the rate-of-convergence requirement on the first step for valid inference. It can also be used for cross-validation when choosing tuning parameters in panel data models. In this paper, we will focus on its application in cross-fitting.

Let $\{W_{it} : i = 1, ..., N \text{ and } t = 1, ..., T\}$ denote a sample of sizes $(N, T)$ from a sequence of random elements $(W_{it})_{i \geq 1, t \geq 1}$ defined on a common measurable space $(\Omega, \mathcal{F})$ and taking values in Euclidean spaces. To allow the dimension of $W_{it}$ to grow with $N, T$, we denote $(\mathcal{P}_{NT})_{N \geq 1, T \geq 1}$ as an expanding class of probability laws of $\{W_{it} : i = 1, ..., N \text{ and } t = 1, ..., T\}$ and denote $P \in \mathcal{P}_{NT}$ as a generic probability law for the sample with sizes $(N, T)$. Under the AHK characterization in Assumption AHK, $W_{it}$ are cluster-dependent over both the cross-section and time. Importantly, the cluster dependence does not vanish as the distance between observations (if there is any ordering) increases. If $\gamma_t$ is weakly dependent, which is the focus of this paper, then the dependence between observations that do not share the same cluster in either dimension dies out as the temporal distance grows.
In that case, intuitively, one can split the sample so that the sub-samples do not share the same cluster and are away from each other in temporal distance. This is exactly how the scheme works:

Definition 3.1 (Two-Way Clustered-Panel Cross-Fitting) (i) Select some positive integers $(K, L)$. Randomly partition the cross-sectional index set $\{1, 2, ..., N\}$ into $K$ folds $\{I_1, I_2, ..., I_K\}$ and partition the temporal index set $\{1, 2, ..., T\}$ into $L$ adjacent folds $\{S_1, S_2, ..., S_L\}$ so that $\bigcup_{k=1}^{K} I_k = \{1, ..., N\}$ and $\bigcup_{l=1}^{L} S_l = \{1, ..., T\}$.[9]

(ii) For each $k = 1, ..., K$ and $l = 1, ..., L$, construct the main sample

$$W(k, l) = \{W_{it} : i \in I_k, t \in S_l\},$$

and the auxiliary sample (typically larger)

$$W(-k, -l) = \left\{ W_{it} : i \in \bigcup_{k' \neq k} I_{k'}, \ t \in \bigcup_{l' \neq l, l \pm 1} S_{l'} \right\}.$$

[9] For simplicity, I assume $N$ and $T$ are divisible by $K$ and $L$, respectively. In practice, if $N$ is not divisible by $K$, the sizes of the cross-sectional blocks can be chosen differently, with some equal to floor$(N/K)$ and others equal to ceil$(N/K)$; the same applies to the temporal dimension.

Later on, we also use $I_{-k}$ and $S_{-l}$ to denote the index sets for the auxiliary sample $W(-k, -l)$. Similarly, we denote $N_{-k}$ and $T_{-l}$ as the cross-sectional and temporal sample sizes for the auxiliary sample $W(-k, -l)$. Figure 3.1 illustrates the cross-fitting with $K = 4$ and $L = 8$.

Since the sub-samples $W(k, l)$ and $W(-k, -l)$ do not share any cluster, they are free from cluster dependence, and what remains is the weak dependence over time. Unless one imposes $m$-dependence, the sub-samples above will not be independent. However, under certain regularity conditions regarding the weak dependence, it can be shown through a coupling technique that, as long as the temporal distance between the sub-samples diverges at a certain rate, there exist coupling sub-samples that are independent of each other while having the same marginal distributions as the constructed sub-samples with probability converging to 1. Such coupling techniques are common in the time series context. The following lemma delivers such a result:

Lemma 3.1 (Independent Coupling) Consider the sub-samples $W(k, l)$ and $W(-k, -l)$ for $k = 1, ..., K$ and $l = 1, ..., L$. Suppose Assumptions AHK and AR hold and $\log(N)/T = o(1)$ as $T \to \infty$. Then, we can construct $\widetilde{W}(k, l)$ and $\widetilde{W}(-k, -l)$ such that (i) they are independent of each other; (ii) they have the same marginal distributions as $W(k, l)$ and $W(-k, -l)$, respectively; and (iii)

$$P \left\{ (W(k, l), W(-k, -l)) \neq \left( \widetilde{W}(k, l), \widetilde{W}(-k, -l) \right), \ \text{for some } (k, l) \right\} = o(1).$$

Lemma 3.1 shows that the main and auxiliary samples from the proposed clustered-panel cross-fitting scheme are approximately independent as $N, T$ diverge. Note that the hypothetical samples $\widetilde{W}(k, l)$ and $\widetilde{W}(-k, -l)$ do not matter in practice, but they allow us to treat $W(k, l)$ and $W(-k, -l)$ as $\widetilde{W}(k, l)$ and $\widetilde{W}(-k, -l)$ with probability approaching 1. The proof of Lemma 3.1 is based on independence coupling results (Strassen, 1965, Dudley and Philipp, 1983, and Berbee, 1987) introduced in Semenova et al. (2023a).

Figure 3.1: Clustered-panel cross-fitting with $K = 4$ and $L = 8$. The three graphs from left to right correspond to the main and auxiliary sample constructions with $(k, l) = (1, 1)$, $(k, l) = (2, 2)$, and $(k, l) = (3, 3)$.
For a simple illustration, observations in the main sample are all adjacent in the cross-sectional dimension, but this is not necessary in practice; the same applies to the auxiliary sample.

As mentioned at the beginning of the section, one of the primary uses of the sub-sampling scheme is cross-fitting in a two-step estimation. To be concrete, I will define a two-step estimator using the cross-fitting algorithm in the context of a semiparametric moment restriction model. The theoretical properties of the estimator will be studied in Section 3.3.1. Let $\varphi(W_{it}; \theta, \eta)$ denote some identifying moment function, where $\theta$ is a low-dimensional vector of parameters of interest and $\eta$ are nuisance functions. For example, $\eta = g_0$ in 3.1. Let $\psi(W_{it}; \theta, \eta)$ denote some orthogonalized moment function based on $\varphi(W_{it}; \theta, \eta)$. The formal definition of the orthogonality will be given in the next subsection. For now, it suffices to be aware that both functions are mean zero, while $\psi(W_{it}; \theta, \eta)$ is adjusted for the fact that $\eta_0$ needs to be estimated. In model 3.1, $\varphi(W_{it}; \theta, \eta) = D_{it}U_{it}$ and $\psi(W_{it}; \theta, \eta) = (D_{it} - E[D_{it}|X_{it}, c_i, d_t])(Y_{it} - D_{it}\theta - g(X_{it}, c_i, d_t))$. In the treatment effect model with unconfoundedness conditional on covariates and unobserved heterogeneous effects, $\varphi(W_{it}; \theta, \eta) = E[Y_{it}|D_{it} = 1, X_{it}, c_i, d_t] - E[Y_{it}|D_{it} = 0, X_{it}, c_i, d_t] - \theta_{ATE}$, and $\psi(W_{it}; \theta, \eta)$ is the moment function corresponding to the well-known augmented inverse probability weighting estimator, which is doubly robust.

The panel cross-fitting procedure goes as follows. For each $k$ and $l$, we use the sub-sample $W(-k, -l)$ to estimate $\eta$, with the estimator denoted as $\widehat{\eta}_{kl}$. For each $i \in I_k$ and $t \in S_l$, we plug $\widehat{\eta}_{kl}$ into the orthogonal moment function, $\psi(W_{it}; \theta, \widehat{\eta}_{kl})$. By averaging $\psi(W_{it}; \theta, \widehat{\eta}_{kl})$ across all $i \in I_k$ and $t \in S_l$, for each $k = 1, ..., K$ and $l = 1, ..., L$ we obtain

$$\bar{\psi}_{kl} := \mathbb{E}_{kl} \left[ \psi(W_{it}; \theta, \widehat{\eta}_{kl}) \right],$$

which is a sample analog of the population orthogonal moment condition used for estimation. Note that the larger sub-sample $W(-k, -l)$, instead of the smaller sub-sample $W(k, l)$, is used for the first-step nuisance estimation because it involves the estimation of high-dimensional unknown parameters. For reference, $W(k, l)$ is referred to as the main sample, and $W(-k, -l)$ is referred to as the auxiliary sample. The next definition summarizes the panel DML estimation and inference procedures for a semiparametric moment restriction model:

Definition 3.2 (Panel DML Algorithm) (i) Given the identifying moment functions $\varphi(W; \theta, \eta)$ such that $E_P[\varphi(W; \theta_0, \eta_0)] = 0$, find the orthogonalized moment function $\psi(W; \theta, \eta)$.

(ii) Obtain the cross-fitting sub-samples $W(k, l)$ and $W(-k, -l)$ as in Definition 3.1.

(iii) For each $k$ and $l$, use the sample $W(-k, -l)$ for the first-step estimation to obtain $\widehat{\eta}_{kl}$, and then construct $\bar{\psi}_{kl}(\theta) = \mathbb{E}_{kl}[\psi(W_{it}; \theta, \widehat{\eta}_{kl})]$ for each $(k, l)$. Finally, obtain the DML estimator $\widehat{\theta}$ as the solution to

$$\frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \bar{\psi}_{kl}(\theta) = 0. \quad (3.13)$$

Remark 3.1 (The Choice of $K$ and $L$) Notice there is a trade-off in setting $(K, L)$ between first-step and second-step accuracy: the bigger the values of $(K, L)$, the bigger the sample size of the auxiliary sample $W(-k, -l)$, which is beneficial for the high-dimensional first step but comes at the cost of a noisier parametric second step. Due to leaving out the temporal neighborhood, feasible implementation requires $L \geq 4$ (if $L = 3$, for example, any main sample $W(k, l)$ with $l = 2$ does not have a well-defined auxiliary sample).
On the other hand, it is computationally costly to set the values of $(K, L)$ too large. In practice, $K = 2$ to $4$ and $L = 4$ to $8$ work well in simulations.

3.3.1 Panel DML: Inferential Theory

To investigate the convergence rate required of a high-dimensional estimator for valid inference, I study a general inference procedure for a high-dimensional panel model characterized by a semiparametric moment restriction. This inference procedure is based on the panel cross-fitting approach proposed in Section 3.3 and the prototypical DML approach proposed in Chernozhukov et al. (2018a). With the same notation as above, the model is characterized by a semiparametric moment condition $E[\varphi(W_{it}; \theta_0, \eta_0)] = 0$, where $W_{it}$ are again characterized by an underlying component structure as in Assumption AHK. Let $\psi(W; \theta, \eta)$ be the orthogonalized moment function. Formally, the orthogonality means that it is mean zero and that its pathwise or Gateaux derivative with respect to the nuisance parameter is 0 when evaluated at the true values:

$$E_P[\psi(W_{it}; \theta_0, \eta_0)] = 0, \quad (3.14)$$
$$\partial_r E_P[\psi(W_{it}; \theta_0, \eta_0 + r(\eta - \eta_0))]|_{r=0} = 0. \quad (3.15)$$

In other words, the nuisance functions have no first-order local effect on the orthogonalized moment conditions, based on which the estimation of $\theta_0$ is therefore robust to the plug-in of noisy estimates of $\eta_0$. In contrast, the original identifying moment conditions do not possess such a property.

Differing from the existing literature, the approach in this paper focuses on estimation and inference robust to the two-way cluster dependence characterized by Assumption AHK. Note that Assumption AHK also includes i.i.d. data as a special case. Although the panel DML procedure is also robust to the i.i.d. case or, more generally, the case of degeneracy in the components, the theoretical properties are not formally given in this paper. The rates of convergence for both the nuisance estimator and the second-step estimator are different and faster in the i.i.d. case, but that is not surprising and is not the focus of this paper. To restrict the focus, I will assume a non-degeneracy condition in terms of the Hajek projection components. First, I define the Hajek components and their corresponding (long-run) variance-covariance matrices as follows:

$$a_i := E_P[\psi(W_{it}; \theta_0, \eta_0)|\alpha_i], \quad \Sigma_a := E_P[a_i a_i'],$$
$$g_t := E_P[\psi(W_{it}; \theta_0, \eta_0)|\gamma_t], \quad \Sigma_g := \sum_{l=-\infty}^{\infty} E_P[g_t g_{t+l}'],$$
$$e_{it} := \psi(W_{it}; \theta_0, \eta_0) - a_i - g_t, \quad \Sigma_e := \sum_{l=-\infty}^{\infty} E_P[e_{it} e_{i,t+l}'].$$

Let $\lambda_{min}[\cdot]$ denote the smallest eigenvalue of a square matrix. The next assumption specifies the non-degeneracy condition; it implies that at least one of the components drives the cluster dependence.

Assumption 3.6 (Non-Degeneracy) Either $\lambda_{min}[\Sigma_a] > 0$ or $\lambda_{min}[\Sigma_g] > 0$.

The next two assumptions follow the same format as Chernozhukov et al. (2018a) but, importantly, they characterize some different rates of convergence required for the inferential theory. Let $(\delta_{NT})$ and $(\Delta_{NT})$ be sequences of positive constants converging to 0 as $N, T \to \infty$. Let $\mathcal{T}_{NT}$ be a nuisance realization set such that it contains $\eta_0$ and $\widehat{\eta}_{kl}$ belongs to $\mathcal{T}_{NT}$ with probability $1 - \Delta_{NT}$ for each $(k, l)$.

Assumption 3.7 (Linear Moment Conditions, Smoothness, and Identification) (i) $\psi(W; \theta, \eta)$ is linear in $\theta$:

$$\psi(w; \theta, \eta) = \psi^a(w; \eta)\theta + \psi^b(w; \eta), \quad \forall w \in \mathcal{W}, \ \theta \in \Theta, \ \eta \in \mathcal{T}.$$
(ii) $\psi(W; \theta, \eta)$ satisfies the Neyman orthogonality conditions 3.14 and 3.15 with respect to the probability measure $P$; or, more generally, 3.15 can be replaced by the $\lambda_{NT}$ near-orthogonality condition
$$\lambda_{NT} := \sup_{\eta \in \mathcal{T}_{NT}} \left\| \partial_r E_P[\psi(W; \theta_0, \eta_0 + r(\eta - \eta_0))]\big|_{r=0} \right\| \leq \delta_{NT}/\sqrt{N}.$$
(iii) The map $\eta \mapsto E_P[\psi(W_{it}; \theta, \eta)]$ is twice continuously Gateaux-differentiable on $\mathcal{T}$.
(iv) The singular values of the matrix $A_0 := E_P[\psi_a(W_{it}; \eta_0)]$ are bounded below by $c_a > 0$.

Assumption DML1(i) restricts the focus of this paper to models with linear orthogonal moment conditions, which covers the model in Section 3.4. For nonlinear orthogonal moment conditions, Chernozhukov et al. (2018a) have shown that the DML estimator has the same desirable properties under more complicated regularity conditions. Focusing on the linear case allows us to pay more attention to issues specifically attributable to panel data. Assumption DML1(ii) slightly relaxes the orthogonality condition 3.15 to a near-orthogonality condition, which is useful for approximately sparse models with approximation errors. Assumption DML1(iii) imposes a mild smoothness assumption on the orthogonal moment condition, and Assumption DML1(iv) is a common identification condition.

Assumption 3.8 (Moment Regularity and First-Steps)
(i) For all $i \geq 1$, $t \geq 1$, and some $q > 2$, $c_m < \infty$, the following moment conditions hold:
$$m_{NT} := \sup_{\eta \in \mathcal{T}_{NT}} \left( E_P\|\psi(W_{it}; \theta_0, \eta)\|^q \right)^{1/q} \leq c_m,$$
$$m'_{NT} := \sup_{\eta \in \mathcal{T}_{NT}} \left( E_P\|\psi_a(W_{it}; \eta)\|^q \right)^{1/q} \leq c_m.$$
(ii) The following conditions on the statistical rates $r_{NT}$, $r'_{NT}$, and $\lambda'_{NT}$ hold for all $i \geq 1$, $t \geq 1$:
$$r_{NT} := \sup_{\eta \in \mathcal{T}_{NT}} \left\| E_P[\psi_a(W_{it}; \eta) - \psi_a(W_{it}; \eta_0)] \right\| \leq \delta_{NT},$$
$$r'_{NT} := \sup_{\eta \in \mathcal{T}_{NT}} \left( E_P\|\psi(W_{it}; \theta_0, \eta) - \psi(W_{it}; \theta_0, \eta_0)\|^2 \right)^{1/2} \leq \delta_{NT},$$
$$\lambda'_{NT} := \sup_{r \in (0,1),\, \eta \in \mathcal{T}_{NT}} \left\| \partial_r^2 E_P[\psi(W_{it}; \theta_0, \eta_0 + r(\eta - \eta_0))] \right\| \leq \delta_{NT}/\sqrt{N}.$$

Assumption DML2 regulates the quality of the first-step nuisance estimators. It follows Chernozhukov et al. (2018a), and it can be verified under primitive conditions in the next section. Observe that, if the orthogonal moment function $\psi(W; \theta, \eta)$ is smooth in $\eta$, then $\lambda'_{NT}$ is the dominant rate, and it imposes a crude rate requirement of order $\varepsilon_{NT} = o(N^{-1/4})$ on the first-step nuisance estimator in the $L^2(P)$ norm, which the two-way cluster LASSO estimator can achieve under proper sparsity assumptions. Furthermore, in some models, including the partial linear model, $\lambda'_{NT}$ can be exactly $0$; then it is possible to obtain the weakest possible rate requirement for the first-step estimator, i.e., $\varepsilon_{NT} = o(1)$.

Theorem 3.2 (Asymptotic Normality and Variance) Suppose Assumptions AHK, AR, ND, DML1, and DML2 hold for any $P \in \mathcal{P}_{NT}$; then, for some $\delta_{NT} \geq N^{-1/2}$, as $(N, T) \to \infty$ jointly,
$$\sqrt{N}\left(\widehat{\theta} - \theta_0\right) = -\sqrt{N}\, A_0^{-1} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \psi(W_{it}; \theta_0, \eta_0) + o_P(1) \Rightarrow \mathcal{N}(0, V),$$
where
$$V := A_0^{-1}\, \Omega\, A_0^{-1\prime}, \qquad \Omega := \Sigma_a + c\, \Sigma_g.$$

We observe that the convergence rate of the two-step estimator $\widehat{\theta}$ resulting from the panel DML procedure is non-standard: it is $\sqrt{N}$-consistent instead of $\sqrt{NT}$-consistent. This is because the cluster dependence introduced by the unit and time components does not decay over time or space. Intuitively, with more persistence, the information carried by the data accumulates more slowly. This is a common feature in the literature on robust inference with cluster dependence,10 and it is also related to inferential theory under strong cross-sectional dependence (e.g., Gonçalves, 2011).

10For example, see Hansen (2007), MacKinnon et al. (2021), Menzel (2021), Chiang et al. (2022), Chiang et al. (2023a), Chiang et al. (2024), and Chen and Vogelsang (2024), among many others.
Due to the presence of unit and time components, the asymptotic variance is made up of the (long-run) variance-covariance matrices of both factors. I consider a two-way cluster robust variance estimator similar to that of Chiang et al. (2024) (the CHS estimator), with an adjustment due to cross-fitting. The variance estimator is motivated under arbitrary dependence in panel data and is shown to be robust to two-way clustering with correlated time effects in linear panel models. As is shown in Chen and Vogelsang (2024), such a variance estimator can be written as an affine combination of three well-known robust variance estimators: the Liang-Zeger-Arellano estimator, the Driscoll-Kraay estimator, and the "average of HACs" estimator. Applying this result, we can define the CHS-type variance estimator as follows:
$$\widehat{V}_{CHS} = \widehat{\bar{A}}^{-1}\, \widehat{\Omega}_{CHS}\, \widehat{\bar{A}}^{-1\prime}, \qquad \widehat{\Omega}_{CHS} = \widehat{\Omega}_A + \widehat{\Omega}_{DK} - \widehat{\Omega}_{NW},$$
where $\widehat{\bar{A}} := \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{1}{N_k T_l} \sum_{i \in I_k, t \in S_l} \psi_a(W_{it}; \widehat{\eta}_{kl})$ and, with the Bartlett kernel $k\left(\frac{m}{M}\right) := 1 - \frac{m}{M}$ for $m = 0, 1, \ldots, M-1$ and $0$ otherwise, and the bandwidth parameter $M$ chosen from $1$ to $T_l$,
$$\widehat{\Omega}_A := \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{1}{N_k T_l^2} \sum_{i \in I_k} \sum_{t \in S_l} \sum_{r \in S_l} \psi(W_{it}; \widehat{\theta}, \widehat{\eta}_{kl})\, \psi(W_{ir}; \widehat{\theta}, \widehat{\eta}_{kl})',$$
$$\widehat{\Omega}_{DK} := \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{K/L}{N_k T_l^2} \sum_{t \in S_l} \sum_{r \in S_l} k\!\left(\frac{|t - r|}{M}\right) \sum_{i \in I_k} \sum_{j \in I_k} \psi(W_{it}; \widehat{\theta}, \widehat{\eta}_{kl})\, \psi(W_{jr}; \widehat{\theta}, \widehat{\eta}_{kl})',$$
$$\widehat{\Omega}_{NW} := \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{K/L}{N_k T_l^2} \sum_{i \in I_k} \sum_{t \in S_l} \sum_{r \in S_l} k\!\left(\frac{|t - r|}{M}\right) \psi(W_{it}; \widehat{\theta}, \widehat{\eta}_{kl})\, \psi(W_{ir}; \widehat{\theta}, \widehat{\eta}_{kl})'.$$

It is noted that the variance estimator under cross-fitting is equivalent to estimating the variance in each sub-sample and then averaging across all sub-samples. Since $K$ and $L$ are fixed, the asymptotic analysis is done at the sub-sample level. The next theorem establishes the consistency of this variance estimator under the conventional small-bandwidth assumption.

Theorem 3.3 (Consistent Variance Estimator) Suppose Assumptions AHK, AR, ND, DML1, and DML2 hold for any $P \in \mathcal{P}_{NT}$ with some $q > 4$ (defined in Assumption DML2), and $M/T^{1/2} = o(1)$. Then, as $N, T \to \infty$ and $N/T \to c$ where $0 < c < \infty$,
$$\widehat{V}_{CHS} = V + o_P(1).$$

Theorem 3.3 can be seen as a generalization of the consistency result for the CHS variance estimator in Chiang et al. (2024), allowing for estimated nuisance parameters in the moment functions. A remaining practical issue is that $\widehat{V}_{CHS}$ is not guaranteed to be positive semi-definite. It has been shown in Chen and Vogelsang (2024) that negative variance estimates occur a non-trivial fraction of the time under certain data-generating processes. Accordingly, an alternative two-term variance estimator was proposed in Chen and Vogelsang (2024). Following the same idea, I propose an alternative variance estimator by dropping the double-counting term $\widehat{\Omega}_{NW}$:
$$\widehat{V}_{DKA} = \widehat{\bar{A}}^{-1}\, \widehat{\Omega}_{DKA}\, \widehat{\bar{A}}^{-1\prime}, \qquad \widehat{\Omega}_{DKA} = \widehat{\Omega}_A + \widehat{\Omega}_{DK}.$$
The estimator is referred to as the DKA variance estimator because it is a sum of the Driscoll-Kraay and Arellano variance estimators.11 Similar approaches can be found in MacKinnon et al. (2021). It relies on the fact that the double-counting term is of small order asymptotically when the panel features two-way clustering.

11Note that the DKA estimator defined in Chen and Vogelsang (2024) differs from the DKA estimator here by a constant term based on fixed-b asymptotic analysis. Such a bias correction is not considered here since the fixed-b properties are not directly applicable in this setting. The conjecture is that the same form of bias correction can be applied here, but formally establishing the fixed-b asymptotic results in the presence of estimated nuisance parameters is challenging and out of the scope of this paper, and so is left for future research.
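For a scalar moment, the three building blocks can be computed directly from an $(N, T)$ array of estimated scores. The helper below is a sketch of the full-sample versions (anticipating 3.23 and 3.24 in Section 3.4); since the cross-fitted estimator is an average of sub-sample variance estimates, the cross-fitted versions only change the index sets and scaling. The function names are mine.

```python
import numpy as np

def bartlett(m, M):
    # k(m/M) = 1 - m/M for m = 0, ..., M-1 and 0 otherwise
    return np.maximum(1.0 - np.abs(m) / M, 0.0)

def omega_pieces(scores, M):
    """Variance building blocks from an (N, T) array of scalar scores
    psi_hat_it.  Returns (Omega_A, Omega_DK, Omega_NW), so that
    Omega_CHS = A + DK - NW and Omega_DKA = A + DK."""
    N, T = scores.shape
    K = bartlett(np.subtract.outer(np.arange(T), np.arange(T)), M)  # kernel weights k(|t-r|/M)
    # Arellano ("cluster by unit"): sum over t, r within each unit i
    omega_a = np.einsum('it,ir->', scores, scores) / (N * T**2)
    # Driscoll-Kraay: kernel-weighted outer sums of cross-sectional totals
    s_t = scores.sum(axis=0)                       # sum over i for each t
    omega_dk = s_t @ K @ s_t / (N * T**2)
    # "average of HACs" (Newey-West within each unit): the double-counted term
    omega_nw = np.einsum('it,tr,ir->', scores, K, scores) / (N * T**2)
    return omega_a, omega_dk, omega_nw
```

With a scalar $\widehat{A}$, the two variance estimates are then simply $(\widehat{\Omega}_A + \widehat{\Omega}_{DK} - \widehat{\Omega}_{NW})/\widehat{A}^2$ and $(\widehat{\Omega}_A + \widehat{\Omega}_{DK})/\widehat{A}^2$; the latter is positive by construction, mirroring the positive semi-definiteness discussion above.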
Similar to other two-term cluster-robust variance estimators, it has the computational advantage of guaranteed positive semi-definiteness, but at the cost of inconsistency in the case of no clustering or clustering at the intersection. For theoretical results and more detailed discussions of the trade-off between ensured positive semi-definiteness and the risk of being too conservative or losing power, readers are referred to MacKinnon et al. (2021) and Chen and Vogelsang (2024).

Theorem 3.4 (Alternative Consistent Variance Estimator) Under the same conditions as Theorem 3.3, we have, as $N, T \to \infty$ and $N/T \to c$ where $0 < c < \infty$,
$$\widehat{V}_{DKA} = \widehat{V}_{CHS} + o_P(1).$$

Theorem 3.4 formally shows that the double-counting term is of small order under two-way clustering, and it implies that $\widehat{V}_{DKA}$ is also consistent for $V$ under two-way clustering.

To conclude, in this section the inferential theory is established for the panel DML estimator under high-level assumptions on the first-step estimator. Even though the rate of convergence of the nuisance estimators can be slow due to the two-way cluster dependence, the cross-fitting approach for panel models allows for valid inference in a general moment restriction model with growing dimensions in the nuisance parameters. In the next section, I study a special case of the semiparametric moment restriction model but consider the complication due to unobserved heterogeneity.

3.4 Partial Linear Model with Unobserved Heterogeneity

In this section, a partial linear model with non-additive unobserved heterogeneous effects is considered. The proposed toolkit is flexible enough to allow for models with instrumental variables used for identification, so I consider the following model: for $i = 1, \ldots, N$ and $t = 1, \ldots, T$,
$$Y_{it} = D_{it}\theta_0 + g(X_{it}, c_i, d_t) + U_{it}, \qquad E[U_{it} | X_{it}, c_i, d_t] = 0, \quad (3.16)$$
where $D_{it}$ is a low-dimensional vector of endogenous variables and $g$ is an unknown function of potentially high-dimensional control variables $X_{it}$ and unobserved heterogeneous effects $(c_i, d_t)$. For clearer presentation, $D_{it}$ is treated as a scalar variable. In practice, $D_{it}$ can contain higher-order terms and interactions with a low-dimensional vector of controls. If the lags or leads of $D_{it}$ are considered exogenous, they can also be included in $X_{it}$. Doing so would not change the theory for estimation and inference but could change the interpretation of $\theta_0$. Consider an excludable instrumental variable $Z_{it}$ such that $E[Z_{it} U_{it}] = 0$, which gives the identifying moment condition. To apply the estimation and inference methods proposed in the previous sections, $g$ is again assumed to be approximately sparse. However, this does not suffice, since $(c_i, d_t)$ are not observed.
To deal with the unobserved heterogeneous effects that cause the endogeneity, I take a correlated random-effects approach through the generalized Mundlak device:

Assumption 3.9 (Generalized Mundlak Device) For each $i = 1, \ldots, N$ and $t = 1, \ldots, T$,
$$c_i = h_c(\bar{F}_i, \epsilon^c_i), \quad (3.17)$$
$$d_t = h_d(\bar{F}_t, \epsilon^d_t), \quad (3.18)$$
where $\bar{F}_i = \frac{1}{T}\sum_{t=1}^{T} F_{it}$, $\bar{F}_t = \frac{1}{N}\sum_{i=1}^{N} F_{it}$, and $F_{it} := (D_{it}, X_{it}')'$; $h_c$ and $h_d$ are unknown measurable functions; the stochastic errors $(\epsilon^c_i, \epsilon^d_t)$ are independent of $(\bar{F}_i, \bar{F}_t, X_{it}, Z_{it}, U^D_{it}, U^Y_{it})$; and $(c_i, d_t, \epsilon_i, \epsilon_t)$ are independent of $U_{it}$.

To justify its use, we shall recall the idea of the conventional Mundlak device. Because of the correlation between $(c_i, d_t)$ and the covariates, an endogeneity issue arises if we do not control for the unobserved heterogeneity. To explicitly model the correlation between the random effects and the covariates, Mundlak (1978) proposes an auxiliary regression of the random effects on the cross-sectional sample average and shows that, if the random effects enter the model linearly, the resulting GLS estimator is equivalent to the common within-estimator. Wooldridge (2021) further shows that the same equivalence holds among the POLS estimators resulting from the Mundlak device, the within-transformation, and fixed-effects dummies. Therefore, if the within-transformation and fixed-effects dummies are sensible ways of dealing with unobserved heterogeneity, then allowing the Mundlak device to have a more flexible functional form should also be reasonable, and more robust. A similar assumption is considered in Wooldridge and Zhu (2020).

It may seem that one can simply apply the panel DML approach from Section 3.3.1 with the two-way cluster LASSO estimator employed as the first-step estimator, except that there is a subtle issue: the Mundlak device uses the full history of the covariates, which potentially generates dependence across the cross-fitting sub-samples. Similar issues appear in a simple linear panel model with additive unobserved effects, where the within-transformation also introduces sample averages. Therefore, cross-fitting may not be compatible with approaches dealing with unobserved heterogeneity, including the proposed generalized Mundlak device. Without cross-fitting, however, it is in general challenging to establish an inferential theory with growing dimensionality in the unknown parameters. Nevertheless, as is shown below, it is possible to establish the asymptotic normality of the panel DML estimator using the full sample in both the first and second steps for the partial linear model under a strengthened sparsity condition. This is helpful not only because of the presence of unobserved heterogeneous effects but also because cross-fitting can be computationally costly and comes at a cost of efficiency loss.

Under model 3.16, $g(X_{it}, c_i, d_t) = E[Y_{it} - D_{it}\theta_0 | X_{it}, c_i, d_t]$. We can rewrite 3.16 as follows:
$$Y_{it} = \left(D_{it} - g_D(X_{it}, c_i, d_t)\right)\theta_0 + g_Y(X_{it}, c_i, d_t) + U_{it},$$
where $g_D(X_{it}, c_i, d_t) := E[D_{it} | X_{it}, c_i, d_t]$ and $g_Y(X_{it}, c_i, d_t) := E[Y_{it} | X_{it}, c_i, d_t]$.
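As an illustration of Assumption GMD in estimable form, the sketch below constructs the averages $\bar{F}_i$ and $\bar{F}_t$ from observed panels and builds the observable part of the polynomial dictionary, $L_{1,it} = L_\tau(X_{it}, \bar{F}_i, \bar{F}_t)$, used below. Representing the $\tau = 2$ transformation by squares and pairwise interactions is my own simplification.

```python
import numpy as np
from itertools import combinations_with_replacement

def mundlak_dictionary(D, X, tau=2):
    """Build f_it = (L_tau(X_it, F_bar_i, F_bar_t), 1) observation by
    observation.  D: (N, T); X: (N, T, p).  Returns an (N*T, dim) matrix.
    Only tau <= 2 is implemented in this sketch."""
    N, T, p = X.shape
    F = np.concatenate([D[..., None], X], axis=2)     # F_it = (D_it, X_it')'
    F_i = F.mean(axis=1, keepdims=True)               # F_bar_i: average over t
    F_t = F.mean(axis=0, keepdims=True)               # F_bar_t: average over i
    base = np.concatenate([X,
                           np.broadcast_to(F_i, (N, T, p + 1)),
                           np.broadcast_to(F_t, (N, T, p + 1))],
                          axis=2).reshape(N * T, -1)
    cols = [np.ones(N * T)] + [base[:, j] for j in range(base.shape[1])]
    if tau >= 2:                                      # squares and interactions
        for j, k in combinations_with_replacement(range(base.shape[1]), 2):
            cols.append(base[:, j] * base[:, k])
    return np.column_stack(cols)
```

The unobserved arguments $(\epsilon^c_i, \epsilon^d_t)$ of course cannot enter the dictionary; as shown next, the terms involving them are absorbed into the error and the intercept.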
Under Assumption GMD, $g_D(X_{it}, c_i, d_t)$ and $g_Y(X_{it}, c_i, d_t)$ can be rewritten as compound functions, which are assumed to be well approximated by a linear combination of a $\tau$-th order polynomial transformation $L_\tau$ as follows:
$$g^*_D(X_{it}, \bar{F}_i, \epsilon^c_i, \bar{F}_t, \epsilon^d_t) := g_D(X_{it}, h_c(\bar{F}_i, \epsilon^c_i), h_d(\bar{F}_t, \epsilon^d_t)) = L_\tau\!\left(X_{it}, \bar{F}_i, \bar{F}_t, \epsilon^c_i, \epsilon^d_t\right)\eta_D + r^D_{it}, \quad (3.19)$$
$$g^*_Y(X_{it}, \bar{F}_i, \epsilon^c_i, \bar{F}_t, \epsilon^d_t) := g_Y(X_{it}, h_c(\bar{F}_i, \epsilon^c_i), h_d(\bar{F}_t, \epsilon^d_t)) = L_\tau\!\left(X_{it}, \bar{F}_i, \bar{F}_t, \epsilon^c_i, \epsilon^d_t\right)\eta_Y + r^Y_{it}, \quad (3.20)$$
where $(\eta_D, \eta_Y)$ are slope coefficients and $(r^D_{it}, r^Y_{it})$ are the approximation errors. Furthermore, we can define a vector of transformed regressors $L_{1,it} = L_\tau(X_{it}, \bar{F}_i, \bar{F}_t)$ and a vector of unobserved regressors $L_{2,it} = L_\tau(X_{it}, \bar{F}_i, \bar{F}_t, \epsilon^c_i, \epsilon^d_t) \setminus L_\tau(X_{it}, \bar{F}_i, \bar{F}_t)$. Let $(\eta_{D,1}, \eta_{D,2})$ be such that $\eta_D = \eta_{D,1} \oplus \eta_{D,2}$ and
$$L_\tau\!\left(X_{it}, \bar{F}_i, \bar{F}_t, \epsilon^c_i, \epsilon^d_t\right)\eta_D = L_{1,it}\,\eta_{D,1} + L_{2,it}\,\eta_{D,2};$$
$(\eta_{Y,1}, \eta_{Y,2})$ are defined in the same way. Under the sparse approximation and Assumption GMD, we can rewrite model 3.16 as follows:
$$Y_{it} = \left(D_{it} - L_{1,it}\eta_{D,1} - L_{2,it}\eta_{D,2} - r^D_{it}\right)\theta_0 + L_{1,it}\eta_{Y,1} + L_{2,it}\eta_{Y,2} + r^Y_{it} + U_{it}.$$
By defining a new error term $V^g_{it} := \left(L_{2,it} - E[L_{2,it}]\right)\left(\eta_{Y,2} - \eta_{D,2}\theta_0\right) + U_{it}$, a new approximation error $r_{it} = r^Y_{it} - r^D_{it}\theta_0$, the vector of observables $f_{it} := (L_{1,it}, 1)$ with dimension denoted by $p$, and the nuisance vectors $\beta_0 := \left(\eta_{Y,1}, E[L_{2,it}]\eta_{Y,2}\right)$ and $\pi_0 := \left(\eta_{D,1}, E[L_{2,it}]\eta_{D,2}\right)$, we can rewrite the model above as
$$Y_{it} = \left(D_{it} - f_{it}\pi_0\right)\theta_0 + f_{it}\beta_0 + r_{it} + V^g_{it}. \quad (3.21)$$
Noticeably, in this case, the parameters associated with the unobservables $L_{2,it}$ can be arbitrarily non-sparse. Given $E[Z_{it}U_{it}] = 0$ and the independence between $Z_{it}$ and $(\epsilon^c_i, \epsilon^d_t)$, we have the identifying moment condition $E[Z_{it}V^g_{it}] = 0$. Let $\zeta_0$ be the linear projection parameter of $Z_{it}$ onto $f_{it}$, and let $V^Z_{it}$ be the corresponding linear projection errors. By Chernozhukov et al. (2018a, equation (2.18)), the (near) Neyman-orthogonal moment function is given by
$$\psi_{it}(\theta_0, \eta_0) := \left(Z_{it} - f_{it}\zeta_0\right)\left(Y_{it} - f_{it}\beta_0 - \left(D_{it} - f_{it}\pi_0\right)\theta_0\right), \quad (3.22)$$
where we denote $\eta_0 = (\zeta_0, \beta_0, \pi_0)$. Under the sparse approximation, we can also rewrite the conditional expectation models for $Y$ and $D$ as
$$Y_{it} = E[Y_{it} | X_{it}, c_i, d_t] + U^Y_{it} = f_{it}\beta_0 + r^Y_{it} + V^Y_{it},$$
$$D_{it} = E[D_{it} | X_{it}, c_i, d_t] + U^D_{it} = f_{it}\pi_0 + r^D_{it} + V^D_{it},$$
where $V^Y_{it} = \left(L_{2,it} - E[L_{2,it}]\right)\eta_{Y,2} + U^Y_{it}$ and $V^D_{it} = \left(L_{2,it} - E[L_{2,it}]\right)\eta_{D,2} + U^D_{it}$. For $l = Z, Y, D$, let $\omega_l$ be the infeasible penalty weights for the two-way cluster LASSO, as defined in 3.11 with $V_{it}$ replaced by $V^l_{it}$. Correspondingly, let $\widehat{V}^l_{it}$ be the residuals and $\widehat{\omega}_l$ be the feasible penalty weights of the two-way cluster LASSO estimations of $(\zeta_0, \beta_0, \pi_0)$. The two-step debiased estimator $\widehat{\theta}$ of $\theta_0$ using the full sample is defined as the solution of $\mathbb{E}_{NT}[\psi_{it}(\theta, \widehat{\eta})] = 0$, where $\widehat{\eta}$ are the (post) two-way cluster LASSO estimators of $\eta_0$ obtained in the first step using the full sample.

The additional notation introduced below is used in the statistical analysis and in delivering the main results:
$$a_i = E[V^Z_{it}V^g_{it} | \alpha_i], \qquad g_t = E[V^Z_{it}V^g_{it} | \gamma_t], \qquad \Sigma_a = E[a_i a_i'], \qquad \Sigma_g = \sum_{l=-\infty}^{\infty} E[g_t g_{t+l}'],$$
$$A_0 = E_P[V^Z_{it}V^D_{it}], \qquad \Omega_0 = \Sigma_a + c\,\Sigma_g,$$
$$a_{i,j,l} = E[f_{it,j}V^l_{it} | \alpha_i], \qquad g_{t,j,l} = E[f_{it,j}V^l_{it} | \gamma_t], \qquad l = Z, Y, D.$$

Assumption 3.10 (Regularity Conditions for the Partial Linear Model)
(i) $A_0$ is non-singular.
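Given the dictionary $f_{it}$ and any first-step fitter, solving $\mathbb{E}_{NT}[\psi_{it}(\theta, \widehat{\eta})] = 0$ for the linear moment 3.22 reduces to a ratio of inner products of first-step residuals; in this sketch, `lasso_fit` is a placeholder for the (post) two-way cluster LASSO.

```python
import numpy as np

def debiased_theta(Y, D, Z, f, lasso_fit):
    """Full-sample two-step estimator for the partial linear IV model:
    solve E_NT[(Z - f zeta)(Y - f beta - (D - f pi) theta)] = 0 in theta.
    Y, D, Z: (N*T,) vectors; f: (N*T, dim) dictionary; lasso_fit(y, f)
    returns a coefficient vector (a stand-in for the two-way cluster LASSO)."""
    vz = Z - f @ lasso_fit(Z, f)   # first-step residuals V_hat^Z
    vy = Y - f @ lasso_fit(Y, f)   # Y - f beta_hat
    vd = D - f @ lasso_fit(D, f)   # first-step residuals V_hat^D
    return (vz @ vy) / (vz @ vd)
```

Standard errors then follow by feeding the estimated scores $\widehat{\psi}_{it} = \widehat{V}^Z_{it}(\widehat{V}^Y_{it} - \widehat{V}^D_{it}\widehat{\theta})$ into the variance formulas 3.23 and 3.24 below.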
(ii) For any $\epsilon$, $h_c(F, \epsilon)$ and $h_d(F, \epsilon)$ are invertible in $F$.
(iii) For some $\mu > 1$ and $\delta > 0$, $\max_{j \leq p} E[|f_{it,j}|^{8(\mu+\delta)}] < \infty$ and $E[|V^l_{it}|^{8(\mu+\delta)}] < \infty$ for $l = g, D, Y, Z$.
(iv) Either $\lambda_{min}[\Sigma_a] > 0$ or $\lambda_{min}[\Sigma_g] > 0$; and $\min_{j \leq p} \mathbb{E}[(a^l_{i,j})^2] > 0$ and $\min_{j \leq p} \mathbb{E}[(g^l_{t,j})^2] > 0$ for $l = D, Y, Z$.
(v) $\log(p/\gamma) = o\left(T^{1/6}/(\log T)^2\right)$ and $p = o\left(T^{7/6}/(\log T)^2\right)$.
(vi) The feasible penalty weights $\widehat{\omega}_l$ satisfy condition 3.12 for $l = D, Y, Z$.

This set of regularity conditions follows from the assumptions for the two-way cluster LASSO and the panel DML inference. The only extra condition is Assumption REG-P(ii), a smoothness condition that ensures the exogeneity properties of $\bar{F}_i$ and $\bar{F}_t$ inherited from $(c_i, \epsilon_i)$ and $(d_t, \epsilon_t)$.

Theorem 3.5 Suppose, for $P = P_{NT}$ for each $(N, T)$, the following conditions hold for model 3.16 and $W_{it} = (Y_{it}, D_{it}, X_{it}, Z_{it}, U_{it}, c_i, d_t, \epsilon_i, \epsilon_t)$: (i) Assumptions AHK, AR, SE, GMD, and REG-P; (ii) the sparse approximation in 3.19 and 3.20 with $s = o\left(\frac{\sqrt{N \wedge T}}{\log(p/\gamma)}\right)$ and $\|r^l_{it}\|_{NT,2} = o_P\left(\sqrt{\tfrac{1}{N \wedge T}}\right)$ for $l = Y, D$. Then, as $N, T \to \infty$ and $N/T \to c$ where $0 < c < \infty$,
$$\sqrt{N}\left(\widehat{\theta} - \theta_0\right) \xrightarrow{d} \mathcal{N}(0, V),$$
where $V := A_0^{-1}\Omega_0 A_0^{-1}$.

Theorem 3.5 establishes the validity of the proposed inference procedure using the full sample. Note that the sparsity condition and the condition on the approximation errors are stronger than those needed for the two-way cluster LASSO estimation itself. To estimate the asymptotic variance, the following variance estimators, adapted from Chiang et al. (2024) and Chen and Vogelsang (2024), are computed using the full sample:
$$\widehat{V}_{CHS} = \widehat{A}_{NT}^{-1}\, \widehat{\Omega}_{CHS}\, \widehat{A}_{NT}^{-1\prime}, \qquad \widehat{\Omega}_{CHS} = \widehat{\Omega}_A + \widehat{\Omega}_{DK} - \widehat{\Omega}_{NW}, \quad (3.23)$$
$$\widehat{V}_{DKA} = \widehat{A}_{NT}^{-1}\, \widehat{\Omega}_{DKA}\, \widehat{A}_{NT}^{-1\prime}, \qquad \widehat{\Omega}_{DKA} = \widehat{\Omega}_A + \widehat{\Omega}_{DK}, \quad (3.24)$$
where
$$\widehat{A}_{NT} := \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} \left(Z_{it} - f_{it}\widehat{\zeta}\right)\left(D_{it} - f_{it}\widehat{\pi}\right),$$
$$\widehat{\Omega}_A := \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T} \psi_{it}(\widehat{\theta}, \widehat{\eta})\,\psi_{ir}(\widehat{\theta}, \widehat{\eta})',$$
$$\widehat{\Omega}_{DK} := \frac{1}{NT^2}\sum_{t=1}^{T}\sum_{r=1}^{T} k\!\left(\frac{|t - r|}{M}\right) \sum_{i=1}^{N}\sum_{j=1}^{N} \psi_{it}(\widehat{\theta}, \widehat{\eta})\,\psi_{jr}(\widehat{\theta}, \widehat{\eta})',$$
$$\widehat{\Omega}_{NW} := \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T} k\!\left(\frac{|t - r|}{M}\right) \psi_{it}(\widehat{\theta}, \widehat{\eta})\,\psi_{ir}(\widehat{\theta}, \widehat{\eta})'.$$

For simplicity, we deliver the consistency results for the variance estimators assuming the approximation is exact. Allowing for approximation errors does not change the main idea but only requires more regularity conditions on the approximation error and lengthier derivations.

Theorem 3.6 Suppose the assumptions for Theorem 3.5 hold for $P = P_{NT}$ for each $(N, T)$ with $r^D_{it} = r^Y_{it} = 0$ a.s., and $M/T^{1/2} = o(1)$. Then, as $(N, T) \to \infty$ and $N/T \to c$ where $0 < c < \infty$,
$$\widehat{V}_{CHS} = V + o_P(1), \qquad \widehat{V}_{DKA} = \widehat{V}_{CHS} + o_P(1).$$

3.5 Monte Carlo Simulation

In this section, the finite-sample performance of the panel DML estimation and inference procedures is examined in a Monte Carlo simulation study. We start with an exactly sparse linear model without approximation errors or unobserved heterogeneous effects, and then further consider the partial linear model with correlated random effects.
First, the linear model with high-dimensional covariates and exact sparsity is specified as follows:
$$\text{DGP(i) - Linear Model:} \qquad Y_{it} = D_{it}\theta_0 + X_{it}\beta_0 + U_{it}, \qquad D_{it} = X_{it}\pi_0 + V_{it},$$
where $\theta_0 = 1/2$ is the true parameter of interest, and $\beta_0 = c_\beta \times (1, 1, \ldots, 1, 0, \ldots, 0)'$ and $\pi_0 = c_\pi \times (1, 1, \ldots, 1, 0, \ldots, 0)'$ are $p$-dimensional nuisance parameters whose first $s$ entries are $1$ and whose remaining entries are $0$; $c_\beta$ and $c_\pi$ are constants that control the relevance of the covariates.

Second, the partial linear model with correlated random effects is specified as follows:
$$\text{DGP(ii) - Partial Linear Model:} \qquad Y_{it} = D_{it}\theta_0 + \left(X_{it}\beta_0 + c_i + d_t\right)^2 + U_{it},$$
$$D_{it} = \frac{\exp(X_{it}\pi_0)}{1 + \exp(X_{it}\pi_0)} + V_{it}, \qquad c_i = \bar{D}_i + \bar{X}_i\xi_0 + \epsilon^c_i, \qquad d_t = \bar{D}_t + \bar{X}_t\zeta_0 + \epsilon^d_t,$$
where $\beta_0 = c_\beta \times \left(1/2^2, 1/2^3, \ldots, 1/2^{p+1}\right)'$, $\pi_0 = c_\pi \times \left(1/2^2, 1/2^3, \ldots, 1/2^{p+1}\right)'$, $\xi_0 = c_\xi \left(1/2^2, 1/2^3, \ldots, 1/2^{p+1}\right)'$, and $\zeta_0 = c_\zeta \left(1/2^2, 1/2^3, \ldots, 1/2^{p+1}\right)'$; $\epsilon^c_i$ and $\epsilon^d_t$ are each random draws from the uniform distribution $U(0, 1)$. The nuisance functions in both $Y$ and $D$ are taken as unknown. Although these nuisance functions are not exactly sparse, they are smooth enough to be well approximated by a polynomial series. The correlated random effects are generated by the Mundlak device, which is taken as known and is used for estimation.

For the linear model, to feature the two-way dependence in $V_{it}U_{it}$ as well as in $X_{it}U_{it}$ and $X_{it}V_{it}$, $(X_{it}, U_{it}, V_{it})$ are generated from underlying components as follows: for each $j = 1, \ldots, p$,
$$\text{DGP(i) - Additive Components:} \qquad X_{it,j} = w_1\alpha_{i,j} + w_2\gamma_{t,j} + w_3\varepsilon_{it,j},$$
$$U_{it} = w_1\alpha^u_i + w_2\gamma^u_t + w_3\varepsilon^u_{it}, \qquad V_{it} = w_1\alpha^v_i + w_2\gamma^v_t + w_3\varepsilon^v_{it},$$
where the components $\alpha^u_i$, $\alpha^v_i$, $\varepsilon^u_{it}$, $\varepsilon^v_{it}$, $\alpha_{i,j}$, and $\gamma_{t,j}$ are each random draws from the uniform distribution $U(-\sqrt{3}, \sqrt{3})$ for each $j$; $\varepsilon_{it} = (\varepsilon_{it,1}, \ldots, \varepsilon_{it,p})'$ is a random draw from a joint normal distribution with mean $1$ and variance-covariance matrix with $\iota^{|j-k|}$, $\iota \in [0, 1)$, in the $(j, k)$ entry; and the components $\gamma^u_t$ and $\gamma^v_t$ each follow an AR(1) process with coefficient $\rho$ and initial values randomly drawn from the normal distribution with mean $0$ and variance $1 - \rho^2$, for some $\rho \in [0, 1)$. The weights $(w_1, w_2, w_3)$ are non-negative with $w_1^2 + w_2^2 + w_3^2 = 1$. The default weights are $w_1 = w_2 = w_3 = 1/\sqrt{3}$.

For the partial linear model, the Mundlak device is used for estimation. It is well known that the Mundlak device is mechanically equivalent to the within-transformation in a linear panel model, in which case the within-transformation would also remove the additive components in DGP(i) and eliminate the two-way dependence in the within-transformed random variables. When the true model is partially linear in the covariates, the Mundlak device also projects out many underlying components and removes most of the dependence driven by the additive components. To illustrate that this is not necessarily the case in general, a multiplicative component structure is considered as follows:
$$\text{DGP(ii) - Multiplicative Components:} \qquad X_{it,j} = w_1\alpha_{i,j} + w_2\gamma_{t,j} + w_3\varepsilon_{it,j},$$
$$U_{it} = \frac{w_4}{c_p}\sum_{j=1}^{p}\left[\alpha^u_i\gamma_{t,j} + \alpha_{i,j}\gamma^u_t\right] + w_5\varepsilon^u_{it}, \qquad V_{it} = \frac{w_4}{c_p}\sum_{j=1}^{p}\left[\alpha^v_i\gamma_{t,j} + \alpha_{i,j}\gamma^v_t\right] + w_5\varepsilon^v_{it},$$
where the components are generated in the same way as in DGP(i) - Additive Components. The weights are non-negative with $w_1^2 + w_2^2 + w_3^2 = 1$ and $w_4^2 + w_5^2 = 1$. The default weights are $w_1 = w_2 = (2/5)^{0.5}$, $w_3 = w_5 = (1/5)^{0.5}$, and $w_4 = (4/5)^{0.5}$.
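A compact simulation of DGP(i)'s additive component structure may clarify the construction. Treating the AR(1) innovations as $N(0, 1-\rho^2)$, so that the time factors have variance (asymptotically) equal to one, is an assumption made for this sketch.

```python
import numpy as np

def dgp_i_components(N, T, p, rho=0.5, iota=0.5, w=(3**-0.5, 3**-0.5, 3**-0.5), seed=0):
    """Sketch of DGP(i): generate (X, U, V) with the additive two-way
    component structure described above."""
    rng = np.random.default_rng(seed)
    u3 = 3**0.5                                    # U(-sqrt(3), sqrt(3)): mean 0, var 1
    a_u, a_v = rng.uniform(-u3, u3, (2, N, 1))     # alpha_i^u, alpha_i^v
    a_x = rng.uniform(-u3, u3, (N, 1, p))          # alpha_{i,j}
    g_x = rng.uniform(-u3, u3, (1, T, p))          # gamma_{t,j}
    e_u, e_v = rng.uniform(-u3, u3, (2, N, T))     # eps_it^u, eps_it^v
    cov = iota ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    e_x = rng.multivariate_normal(np.ones(p), cov, (N, T))  # eps_it: mean 1, Toeplitz cov

    def ar1():                                     # gamma_t^u, gamma_t^v
        g = np.empty(T)
        g[0] = rng.normal(0.0, (1 - rho**2) ** 0.5)
        for t in range(1, T):
            g[t] = rho * g[t - 1] + rng.normal(0.0, (1 - rho**2) ** 0.5)
        return g

    w1, w2, w3 = w
    X = w1 * a_x + w2 * g_x + w3 * e_x             # (N, T, p)
    U = w1 * a_u + w2 * ar1() + w3 * e_u           # broadcasts to (N, T)
    V = w1 * a_v + w2 * ar1() + w3 * e_v
    return X, U, V
```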
$c_p$ is a scaling factor that ensures the sums of multiplicative components in both $U_{it}$ and $V_{it}$ have variance around $1$. With the default weights, $c_p$ is set to $3/2$. The multiplicative components construction here is a generalization of the example in Chiang et al. (2024). To see why $U_{it}V_{it}$ features a component structure, we can expand the product and observe that it includes terms such as $\alpha^u_i\alpha^v_i\gamma^2_{t,j}$ for $j = 1, \ldots, p$, whose conditional expectations given $\alpha = (\alpha^u_i, \alpha^v_i, \alpha_{i,1}, \ldots, \alpha_{i,p})$ are $\alpha^u_i\alpha^v_i$, since $\gamma_{t,j}$ has variance $1$ and is independent of $\alpha$. Likewise, the product also includes terms like $\gamma^u_t\gamma^v_t\alpha^2_{i,j}$, whose conditional expectations given $\gamma = (\gamma^u_t, \gamma^v_t, \gamma_{t,1}, \ldots, \gamma_{t,p})$ are $\gamma^u_t\gamma^v_t$. Importantly, these underlying common factors do not introduce endogeneity, as they may seem to.

The simulation study examines the Monte Carlo bias (Bias), standard deviation (SD), mean square error (MSE), and coverage probability of estimators of $\theta_0$. All estimations are based on the orthogonal moment condition given by 3.22 with $Z_{it} = D_{it}$ ($f_{it} = X_{it}$ in DGP(i)). The comparison is among procedures with and without cross-fitting. The first-step estimations are based on the POLS estimator (if feasible), the post heteroskedasticity-robust LASSO from Belloni et al. (2012), the post square-root LASSO from Belloni et al. (2011), the post cluster-robust LASSO from Belloni et al. (2016), and the post two-way cluster-LASSO. The CHS-type and DKA-type variance estimators (with different formulas for estimation with and without cross-fitting) are used to obtain sample coverage probabilities. In some unreported simulations, I also compared the CHS/DKA-type variance estimators with the Eicker-Huber-White-type estimators in Chernozhukov et al. (2018a) for random-sampling data and the Cameron-Gelbach-Miller-type estimator from Chiang et al. (2022) for multiway clustered data. Since it is well known that inference based on variance estimators that do not sufficiently account for the dependence causes over-rejection, those results are omitted here.

The simulation results are based on 1000 Monte Carlo replications. This is a relatively small number of replications, but it is necessitated by the high computational cost of the multiple high-dimensional estimation and inference procedures, particularly with cross-fitting. Results are obtained across DGPs that vary in the sample sizes $(N, T)$, the dimension of the covariates $p$, the number of non-zero slope coefficients $s$, the other sparsity parameter $b$, the common coefficient $a$, the multicollinearity parameter $\iota$, and the temporal correlation parameter $\rho$. For the panel DML inferential procedure with cross-fitting, the tuning parameters $(K, L)$, the numbers of cross-fitting blocks, need to be chosen. For variance estimation, the bandwidth parameter $M$ of the Bartlett kernel is required. I use the min-MSE rule from Andrews (1991) for both purposes. For a generic scalar score $v_{it}$, the formula is given as follows:
$$\widehat{M} = 1.8171\left(\frac{\widehat{\rho}^2}{\left(1 - \widehat{\rho}^2\right)^2}\right)^{1/3} T^{1/3} + 1,$$
where $\widehat{\rho}$ is the OLS estimator from the regression $\bar{v}_t = \rho\bar{v}_{t-1} + \eta_t$ with $\bar{v}_t = \frac{1}{N}\sum_{i=1}^{N}\widehat{v}_{it}$ and $\widehat{v}_{it} = \widehat{U}_{it}\widehat{V}_{it}$.
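A direct transcription of the rule follows; reading the AR(1) regression as a no-intercept OLS fit is my interpretation of the formula above.

```python
import numpy as np

def min_mse_bandwidth(v_hat):
    """Andrews (1991) min-MSE rule for the Bartlett kernel, applied to the
    cross-sectional averages of a generic (N, T) score array v_hat."""
    v_bar = v_hat.mean(axis=0)                   # v_bar_t = (1/N) sum_i v_hat_it
    rho = np.sum(v_bar[1:] * v_bar[:-1]) / np.sum(v_bar[:-1] ** 2)  # AR(1) slope
    T = v_bar.shape[0]
    M = 1.8171 * (rho**2 / (1 - rho**2) ** 2) ** (1 / 3) * T ** (1 / 3) + 1
    return int(M)                                # truncate to an integer bandwidth
```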
Table 3.5.1: DGP(i) with $N = T = 25$, $s = 5$, $p = 200$, $\iota = 0.5$, $\rho = 0.5$, $c_\beta = c_\pi = 0.5$

Cross    First-Step   First-Step Ave.     Second-Step               Coverage (%)
Fitting  Estimator    Sel. Y    Sel. D    Bias    SD     RMSE       CHS    DKA
No       POLS         200       200       0.003   0.053  0.053      78.9   95.1
No       H LASSO      26.0      26.0      0.062   0.065  0.090      58.5   78.7
No       R LASSO      17.6      17.6      0.070   0.067  0.097      65.2   79.5
No       C LASSO      8.9       8.6       0.036   0.095  0.101      80.0   87.5
No       TW LASSO     6.9       6.7       0.023   0.096  0.099      84.3   90.4
Yes      POLS         200       200       0.006   0.113  0.113      98.2   99.4
Yes      H LASSO      16.6      16.9      0.053   0.131  0.141      96.0   97.6
Yes      R LASSO      9.5       9.5       0.054   0.130  0.141      96.0   98.2
Yes      C LASSO      8.1       8.0       0.041   0.130  0.136      96.2   97.4
Yes      TW LASSO     6.4       6.7       0.057   0.126  0.138      95.8   97.2

Note: Simulation results are based on 1000 replications. Tuning parameters: $(K, L) = (4, 8)$, $C_\lambda = 2$, and $\gamma = 0.1/\log(p \vee N \vee T)$. The 10 most relevant regressors (based on the sample correlation with the outcome) are used for initial estimation, and at most 10 iterations are used in calculating the penalty weights. H: heteroskedastic-LASSO; R: square-root-LASSO; C: cluster-LASSO; TW: two-way cluster-LASSO. Post-LASSO POLS is performed in all first steps. Nominal coverage probability: 0.95.

Table 3.5.1 presents a set of baseline results obtained for a decent number of regressors ($p = 200$), among which 5 are associated with non-zero slope coefficients. The number of covariates is much larger than either the cross-sectional or the temporal dimension. On the other hand, the number of non-zero coefficients can be regarded as of small order relative to the sample sizes, approximately satisfying the sparsity condition. In the first step, model selection is done using the different LASSO approaches reported in the second column. The numbers of selected regressors for $Y$ and $D$ are reported in the third and fourth columns. First, comparing the results obtained without cross-fitting, it is shown that, when the number of regressors is not extremely large relative to the sample size, the POLS estimator dominates the sparse methods based on the different LASSOs in terms of Monte Carlo bias, standard deviation, and coverage probability obtained using DKA standard errors, even though the true model is sparse. Among the sparse methods, the proposed two-way cluster-LASSO exhibits the smallest bias and the best coverage, while its standard deviation is slightly larger than those of the heteroskedastic-robust LASSO and the square-root-LASSO. In terms of selection, the proposed method selects the number of regressors closest to the true number of relevant regressors, while the other sparse methods over-select to different extents.

Table 3.5.2: DGP(i) with $N = T = 25$, $s = 5$, $p = 600$, $\iota = 0.5$, $\rho = 0.5$, $c_\beta = c_\pi = 0.5$

Cross    First-Step   First-Step Ave.     Second-Step               Coverage (%)
Fitting  Estimator    Sel. Y    Sel. D    Bias    SD     RMSE       CHS    DKA
No       POLS         600       600       0.008   0.221  0.221      26.6   38.6
No       H LASSO      39.8      39.5      0.073   0.049  0.087      51.2   78.9
No       R LASSO      25.3      25.1      0.079   0.055  0.097      52.4   79.1
No       C LASSO      15.2      14.0      0.058   0.096  0.112      68.8   78.4
No       TW LASSO     7.5       6.9       0.033   0.098  0.103      81.6   88.1
Yes      H LASSO      24.7      24.8      0.056   0.134  0.146      94.5   98.4
Yes      R LASSO      12.1      12.1      0.054   0.137  0.147      94.5   96.1
Yes      C LASSO      11.6      10.7      0.043   0.139  0.145      95.1   96.1
Yes      TW LASSO     7.6       6.8       0.065   0.140  0.154      90.7   95.1

Note: Simulation results are based on 1000 replications. Tuning parameters: $(K, L) = (4, 8)$, $C_\lambda = 2$, and $\gamma = 0.1/\log(p \vee N \vee T)$. The 10 most relevant regressors (based on the sample correlation with the outcome) are used for initial estimation, and at most 10 iterations are used in calculating the penalty weights. H: heteroskedastic-LASSO; R: square-root-LASSO; C: cluster-LASSO; TW: two-way cluster-LASSO. Post-LASSO POLS is performed in all first steps. Nominal coverage probability: 0.95.
When cross-fitting is employed, all methods see a significant improvement in sample coverage. This is particularly true for the LASSO-based methods that are not designed for dependent data. This is not too surprising, because those non-robust sparse methods tend to over-select, and cross-fitting is designed to remove the overfitting bias and to restore asymptotic normality. As a cost of cross-fitting, the Monte Carlo standard deviation increases, indicating the efficiency loss due to the exclusion of sub-samples in the first-step estimation. It is also worth emphasizing that the CHS- and DKA-type variance estimators designed for the cross-fitting approaches play an important role in the desirable sample coverage. In some unreported simulations, it is shown that inference based on the cross-fitting variance estimators proposed in Chernozhukov et al. (2018a) and Chiang et al. (2022) suffers from severe under-coverage. This is not surprising, but the implication is more subtle: while two-way dependence potentially affects both estimation and inference, its negative impact on the inference is more salient.

As the dimension of the covariates significantly increases and becomes as large as the overall sample size (so that POLS remains in the competition), a different pattern is revealed. Table 3.5.2 reports simulation results under DGP(i) except that the dimension $p$ now increases to 600, slightly smaller than the overall sample size of 625. First, we compare the results obtained without cross-fitting. The simulation results demonstrate that the methods based on POLS with no selection and those based on the existing LASSO approaches with over-selection all suffer from severe under-coverage. The proposed method, in contrast, continues to select the number of relevant regressors closest to the true number, regardless of the increased number of irrelevant regressors. When cross-fitting is performed, there is again a significant improvement across all approaches in terms of sample coverage, but it again comes at the cost of an efficiency loss measured by the increase in SD.

We have seen the case with exact sparsity in Tables 3.5.1 and 3.5.2. As claimed in the theory, the proposed estimation and inference procedures are also valid under approximate sparsity.

Table 3.5.3: DGP(ii) with $N = T = 25$, $s = p = 10$, $\iota = 0.5$, $\rho = 0.5$, $c_\beta = 1$, $c_\pi = 4$, $c_\xi = c_\zeta = 1/4$; 2nd-order polynomial series are used for approximation

Cross    First-Step   First-Step Ave.     Second-Step               Coverage (%)
Fitting  Estimator    Sel. Y    Sel. D    Bias    SD     RMSE       CHS    DKA
No       POLS         560       560       0.012   0.173  0.173      54.4   67.4
No       H LASSO      3.4       12.2      0.032   0.126  0.130      87.2   90.8
No       R LASSO      3.3       11.0      0.030   0.127  0.130      86.2   91.0
No       C LASSO      24.7      12.3      0.030   0.127  0.130      87.8   91.8
No       TW LASSO     3.1       9.3       0.023   0.127  0.129      87.8   93.6
Yes      H LASSO      2.6       9.0       0.005   0.157  0.157      94.8   98.8
Yes      R LASSO      2.0       6.9       0.001   0.159  0.159      95.1   98.8
Yes      C LASSO      3.1       9.0       0.014   0.154  0.155      96.3   99.0
Yes      TW LASSO     1.2       6.8       0.030   0.150  0.153      97.3   98.8

Note: Simulation results are based on 1000 replications. Tuning parameters: $(K, L) = (4, 8)$, $C_\lambda = 2$, and $\gamma = 0.1/\log(p \vee N \vee T)$. The 10 most relevant regressors (based on the sample correlation with the outcome) are used for initial estimation, and at most 10 iterations are used in calculating the penalty weights. H: heteroskedastic-LASSO; R: square-root-LASSO; C: cluster-LASSO; TW: two-way cluster-LASSO. Post-LASSO POLS is performed in all first steps. Nominal coverage probability: 0.95.
Table 3.5.3 reports the simulation results under DGP(ii), where the true model is nonlinear in the control variables and the correlated random effects. The functional form of the nonlinearity is not given and is approximated by a second-order polynomial series. While only 10 observable covariates are considered, the Mundlak device and the polynomial transformation generate 560 regressors that are included in the approximately sparse linear model. Due to the large number of regressors relative to the overall sample size, the approach based on POLS estimation has the largest Monte Carlo standard deviation and root mean square error, and it suffers from severe under-coverage. Compared to POLS and the other sparse methods, the proposed two-way cluster-LASSO method selects the sparsest model while having the smallest bias and root mean square error, and it also achieves the best coverage. When the clustered-panel cross-fitting is employed in the inference procedure, we find that the Monte Carlo coverage probability for confidence intervals based on CHS-type standard errors improves significantly, and the confidence intervals based on DKA-type standard errors switch from slight under-coverage to over-coverage. As the correlated random effects are used for estimation, they project out most of the components that drive the two-way cluster dependence under DGP(ii). In that case, the adjustment in the standard error formulas due to cross-fitting can be conservative.

3.6 Empirical Application

In this section, I re-examine the effects of government spending on the output of an open economy following the framework of Nakamura and Steinsson (2014). It is one of the most cited empirical-macro papers in the American Economic Review, and it investigates one classic quantity of interest in economics: the government spending multiplier. The question is: can we improve on the estimation and inference through more robust and flexible methods? As I will show, this is made possible by the toolkit proposed in this paper.

This framework utilizes the regional variation in military spending in the US to estimate the percentage increase in output that results from an increase of government spending by 1 percent of GDP, i.e., the government spending multiplier. It is referred to as the "open economy relative multiplier" because this framework takes advantage of the uniform monetary and tax policies across US regions to difference out their effects on government spending and output. The parameter of interest is a scalar, and the baseline model is identified without considering control variables, so why is high dimensionality relevant here? As will be revealed very soon, the high dimensionality from heterogeneity and flexible modeling can indeed be hidden.

Due to the endogeneity in the variation of regional military procurement, Nakamura and Steinsson (2014) achieves identification through an instrumental variable (IV) approach. As argued by the authors, national military spending is largely determined by geopolitical events, so it is likely exogenous to the unobserved factors of regional military spending, and it affects regional military spending disproportionately. In other words, the identifying assumption is that the buildups and drawdowns in national military spending are not due to unbalanced military development across regions.
Based on this observation, a shift-share type IV is considered, where the share is estimated by regressing regional military spending on national military spending, allowing for region-specific constant slope coefficients.12 To focus on the main idea, the shares are taken as given, and the resulting instrumental variable is treated as observable rather than as a generated regressor, to avoid further complication.

In this paper, to avoid the endogeneity caused by misspecification of the functional form, I extend the linear model with additive unobserved heterogeneous effects to a partial linear model with non-additive unobserved heterogeneous effects. Let $D_{it}$ be the percentage change in per capita regional military spending in state $i$ and time $t$, and let $Z_{it}$ be the IV. Specifically, the baseline model from the original study and the one from this paper differ as follows:
$$\text{Baseline model:} \qquad Y_{it} = \theta_0 D_{it} + \pi_i W_t + c_i + d_t + U_{it}.$$
$$\text{Partial linear model:} \qquad Y_{it} = \theta_0 D_{it} + g(X_{it}, W_t, c_i, d_t) + U_{it}.$$

12All quantities, unless specifically defined, are in terms of the two-year growth rate of the real per capita values. Per capita is in terms of total population. Nakamura and Steinsson (2014) also presents results when per capita is calculated using the working-age population as a robustness check.

Here $\theta_0$ is the parameter of interest, i.e., the true multiplier; $X_{it}$ and $W_t$ are exogenous control variables, with the latter being only time-varying; $\pi_i$ are non-random unit-specific slope coefficients on $W_t$; and $(c_i, d_t)$ are unobserved heterogeneous effects. In the original study, the linear model is estimated by two-stage least squares (2SLS) with two-way fixed effects. In the extended model, I model the unobserved heterogeneous effects as correlated random effects and take a sparse approximation approach for the infinite-dimensional nuisance parameters, as in Section 3.4. Specifically, $c_i$ is assumed to be a function of $(\bar{D}_i, \bar{X}_i)$ and $d_t$ is assumed to be a function of $(\bar{D}_t, \bar{X}_t, W_t)$. Then, through sparse approximation, the feasible (near) Neyman-orthogonal moment function is given by 3.22 with $f_{it} = (L_\tau(X_{it}, W_t, \bar{D}_i, \bar{D}_t, \bar{X}_i, \bar{X}_t), 1)$.

In the baseline specification of Nakamura and Steinsson (2014), $W_t$ is not included. In their alternative specifications, $W_t$ is chosen as the real interest rate or the change in the national oil price. These two variables are never included together in the original study. Note that allowing unit-specific slope coefficients for controls generates many nuisance parameters: with 51 state groups,13 one control adds 51 parameters and two controls would generate 102 parameters, before considering interactions or higher-order terms. With a sample size of less than 2000, the high dimensionality in the nuisance parameters could result in a noisy estimate of $\theta_0$. In this paper, to obtain a more precise estimate and make the excludability assumption of the IV more plausible, besides the controls from the original study, I also consider additional controls. As is shown in Table 3 of Nakamura and Steinsson (2014), the change in state population is likely not affected by the treatment (the regional military spending), so it is immune to the "bad control" problem;14 but it could affect the treatment and the outcome, so it is included in $X_{it}$. By considering more flexible functional forms and additional exogenous control variables, the excludability condition of the instruments is made more plausible.
On the other hand, the high dimensionality arising from the flexible functional form and the unobserved heterogeneity necessitates the use of high-dimensional methods. Moreover, state-level yearly variables of those macroeconomic characteristics are often considered cluster-dependent in both the cross-sectional and the time dimensions, due to correlated time shocks and unobserved state factors. These concerns justify the use of the robust estimation and inference methods proposed in this paper.

13The regions in this analysis are defined by the states. Nakamura and Steinsson (2014) also presents results on regions defined as clusters of states.
14Angrist and Pischke (2009) and Chen and Kim (2024) provide detailed discussions of how endogenous controls can pollute the identification/estimation.

Table 3.6.1: Multiplier estimates from the original model

(1)        (2)        (3)    (4)              (5)          (6)     (7)       (8)
Real Int.  Oil Price  Pop.   Unobs. Heterog.  First Stage  Est.    CHS s.e.  DKA s.e.
No         No         No     Fixed Effects    POLS         1.43    0.81      0.68
No         Yes        No     Fixed Effects    POLS         1.30    0.72      0.56
No         No         Yes    Fixed Effects    POLS         1.40    0.70      0.57
Yes        Yes        No     Fixed Effects    POLS         1.27    0.71      0.45
Yes        Yes        Yes    Fixed Effects    POLS         1.36    0.56      0.43

Note: Standard errors are calculated with the truncation parameter $M$ chosen by the min-MSE rule given in Section 3.5. The data are available through Nakamura and Steinsson (2014). It is a balanced (after trimming) state-level yearly panel of 51 states over 1971-2005. The military spending data are collected from the electronic database of DD-350 military procurement forms of the US Department of Defense. State output is measured by state GDP collected from the US Bureau of Economic Analysis (BEA). The state population data are from the Census Bureau. Data on oil prices are from West Texas Intermediate. The federal funds rate is from the FRED database of the St. Louis Federal Reserve. The state inflation measures are constructed from several sources. For more details on data construction, readers are referred to Nakamura and Steinsson (2014).

Table 3.6.1 provides benchmark results for the original model with different choices of control variables. All estimates (column 6) are given by 2SLS with two-way fixed effects, and the standard errors (s.e.) are calculated using the CHS and DKA formulas given in Section 3.4. The estimates of the multiplier replicate those given in Nakamura and Steinsson (2014), with significant differences in the standard errors. This is because the variance estimates here account for the potential two-way dependence, while the variance estimator used in Nakamura and Steinsson (2014) assumes cross-sectional independence.

The main comparisons are done in Tables 3.6.2 and 3.6.3. In Table 3.6.2, no cross-fitting is performed in the first stage.

Table 3.6.2: Estimates of the open economy relative multiplier from the extended model

(1)       (2)       (3)     (4)      (5)          (6)          (7)     (8)       (9)
Cross-    Unobs.    Poly.   Param.   First        Z: Param.    Est.    CHS s.e.  DKA s.e.
Fitting   Heterog.  Trans.  Gen.     Stage        Sel.
No        Mundlak   None    7        POLS         7            1.51    0.66      0.82
No        Mundlak   None    7        H LASSO      2            1.43    0.66      0.81
No        Mundlak   None    7        C LASSO      4            1.43    0.66      0.81
No        Mundlak   None    7        TW LASSO     2            1.43    0.70      0.84
No        Mundlak   2nd     35       POLS         35           1.73    0.99      1.15
No        Mundlak   2nd     35       H LASSO      6            1.73    1.01      1.17
No        Mundlak   2nd     35       C LASSO      5            1.75    1.02      1.19
No        Mundlak   2nd     35       TW LASSO     3            1.47    0.62      0.77
No        Mundlak   3rd     119      POLS         119          2.20    1.19      1.37
No        Mundlak   3rd     119      H LASSO      10           1.97    1.16      1.38
No        Mundlak   3rd     119      C LASSO      6            0.98    0.66      0.82
No        Mundlak   3rd     119      TW LASSO     5            1.47    0.61      0.76

Note: Tuning parameters are chosen as $C_\lambda = 2$ and $\gamma = 0.1/\log(p \vee N \vee T)$. The 7 most relevant regressors (based on the sample correlation with the outcome) are used for initial estimation, and at most 10 iterations are used in calculating the penalty weights. H: heteroskedastic-LASSO; R: square-root-LASSO; C: cluster-LASSO; TW: two-way cluster-LASSO. The number of predictors generated by the polynomial transformation and the number of selected predictors for $Z$ are reported in columns (4) and (6). Standard errors are calculated with the truncation parameter $M$ chosen by the min-MSE rule given in Section 3.5.
The number of parameters associated with regressors generated by the polynomial transformations is reported in column (4), and the number of selected parameters associated with $Z$ is reported in column (6).15 Overall, with more controls and the polynomial transformation of the observables, the standard errors are generally larger than those in Table 3.6.1. With no transformations of the original regressors, the estimates obtained by the four different methods are similar, and they are consistent with the baseline results. It is noticeable that the proposed TW LASSO approach using the DKA-type penalty weights achieves an estimate that is consistent with the baseline results and has the least variability. As the flexibility and the number of nuisance parameters increase with the higher-order polynomial transformations, the number of selected regressors increases across all methods. While the standard errors of most approaches become larger and the estimates deviate from the baseline results, the proposed approach remains less noisy. This indicates that many of the higher-order polynomials included in the extended model for functional-form robustness may not matter much and merely contribute noise; while the existing approaches tend to over-select those terms under potential two-way dependence, the proposed method is robust against over-selection.

15Across all first-step LASSO approaches, more parameters associated with $Z$ are selected than parameters associated with $Y$ and $X$. The difference in LASSO selection is less evident for $Y$ and $X$, while the pattern is similar.

Table 3.6.3: Estimates of the open economy relative multiplier from the extended model

(1)       (2)       (3)     (4)      (5)          (6)          (7)     (8)       (9)
Cross-    Unobs.    Poly.   Param.   First        Z: Param.    Est.    CHS s.e.  DKA s.e.
Fitting   Heterog.  Trans.  Gen.     Stage        Ave. Sel.
Yes       Mundlak   None    7        H LASSO      2.0          1.28    1.73      2.00
Yes       Mundlak   None    7        C LASSO      2.0          1.32    1.75      2.03
Yes       Mundlak   None    7        TW LASSO     2.6          1.18    1.77      2.05
Yes       Mundlak   2nd     35       H LASSO      5.2          1.12    2.18      2.52
Yes       Mundlak   2nd     35       C LASSO      5.8          1.46    1.95      2.24
Yes       Mundlak   2nd     35       TW LASSO     4.1          1.20    1.42      1.70
Yes       Mundlak   3rd     119      H LASSO      8.3          1.81    3.17      3.47
Yes       Mundlak   3rd     119      C LASSO      6.5          1.25    1.59      1.91
Yes       Mundlak   3rd     119      TW LASSO     5.3          1.50    1.18      1.44

Note: The tuning parameters are chosen as $(K, L) = (4, 8)$, $C_\lambda = 2$, and $\gamma = 0.1/\log(p \vee N \vee T)$. The 7 most relevant regressors (based on the sample correlation with the outcome) are used for initial estimation, and at most 10 iterations are used in calculating the penalty weights. H: heteroskedastic-LASSO; R: square-root-LASSO; C: cluster-LASSO; TW: two-way cluster-LASSO. The number of predictors generated by the polynomial transformation and the average number of selected predictors for $Z$ are reported in columns (4) and (6). Standard errors are calculated with the truncation parameter $M$ chosen by the min-MSE rule given in Section 3.5.

As in the Monte Carlo simulation, the results obtained with cross-fitting are also examined.
Although theoretical results for the inference procedure based on cross-fitting in the presence of the Mundlak device are not formally given in this paper, the conjecture is that it remains valid under the same set of conditions given in Section 3.4. Table 3.6.3 reports the comparison among the various sparse methods with the clustered-panel cross-fitting.16 It reveals a similar pattern as in Table 3.6.2: the variability of the different methods increases as the model is approximated by higher-order polynomial series, except for the proposed approach, which becomes more accurate as the approximation is made more flexible.

16Due to the smaller sample used in the first-step estimation and multicollinearity among the polynomial terms, methods based on the POLS first step are too noisy, and so they are omitted from the comparison here.

To conclude, the empirical study of the government spending multiplier using a flexible model and sparse methods illustrates the issue of hidden dimensionality. In the current example, the estimates obtained through the high-dimensional methods do not deviate much from the baseline results, which implies that the nonlinear effects omitted from the original model may not be very relevant. While the proposed two-way cluster-LASSO and the inference procedures with or without cross-fitting remain relatively accurate and provide results as a robustness check, the other sparse methods tend to over-select and become too noisy to be interpretable.

3.7 Conclusion and Discussion

The inferential theory for high-dimensional models is particularly relevant in panel data settings, where the modeling of unobserved heterogeneity commonly leads to high-dimensional nuisance parameters. This paper enriches the toolbox of researchers dealing with high-dimensional panel models. In particular, I propose a package of tools for estimation and inference in high-dimensional panel models that feature two-way cluster dependence and unobserved heterogeneity. I first develop a weighted LASSO approach that is robust to two-way cluster dependence in panel data. As is shown in the statistical analysis of the two-way cluster LASSO, the convergence rates are slow because of the cluster dependence, which makes inference challenging. However, by utilizing a cross-fitting method designed for a two-way clustered panel, the rate requirement for the first step can be substantially relaxed, making the proposed two-way cluster-LASSO a feasible first-step estimator for the panel DML inference procedure in a high-dimensional semiparametric model. I further consider unobserved heterogeneity in panel models. Due to the potential incompatibility of cross-fitting with common fixed-effects and random-effects methods, I study the statistical properties of the proposed estimation and inference procedures using the full sample in both the first and the second steps. The validity is established, under a slightly stronger sparsity condition, in a partial linear panel model as a special case.

The estimation and inferential theory are empirically relevant. I illustrate the proposed approaches in an empirical example and show that high dimensionality can be hidden in questions not traditionally considered high-dimensional. In practice, when the question is naturally high-dimensional and is answered with panel data, the proposed approaches are natural solutions.
When the questions are originally not high-dimensional, it is reasonable to start with a simple model as a baseline and then extend it to a more general and flexible model as a robustness check. While both the theoretical and the simulation results support the proposed approaches, some limitations remain in certain scenarios. The feasible penalty weight estimation is highly non-trivial due to two-way cluster dependence and high dimensionality, and the statistical analysis of the two-way cluster LASSO relies on high-level assumptions on the feasible penalty weights. Even though the iterative feasible weight estimation possesses desirable finite-sample properties in the scenarios considered in the Monte Carlo simulation, many subtle issues lack a theoretical guarantee. A devoted exploration of such issues requires a more comprehensive treatment and is an important direction for future research.

BIBLIOGRAPHY

Aldous, D. J. (1981). Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4):581–598.

Andrews, D. W. (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica, 62(1):43–72.

Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica, 59(3):817–858.

Angrist, J. D. and Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist's companion. Princeton University Press.

Babii, A., Ball, R. T., Ghysels, E., and Striaukas, J. (2023). Machine learning panel data regressions with heavy-tailed dependent data: Theory and application. Journal of Econometrics, 237(2):105315.

Babii, A., Ghysels, E., and Striaukas, J. (2022). Machine learning time series regressions with an application to nowcasting. Journal of Business & Economic Statistics, 40(3):1094–1106.

Babii, A., Ghysels, E., and Striaukas, J. (2024). High-dimensional Granger causality tests with an application to VIX and news. Journal of Financial Econometrics, 22(3):605–635.

Baraud, Y., Comte, F., and Viennet, G. (2001). Adaptive estimation in autoregression or β-mixing regression via model selection. The Annals of Statistics, 29(3):839–875.

Basu, S. and Michailidis, G. (2015). Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535–1567.

Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608–650.

Belloni, A., Chernozhukov, V., Hansen, C., and Kozbur, D. (2016). Inference in high-dimensional panel models with an application to gun control. Journal of Business & Economic Statistics, 34(4):590–605.

Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806.

Berbee, H. (1987). Convergence rates in the strong law for bounded mixing sequences. Probability Theory and Related Fields, 74(2):255–270.

Bester, C. A., Conley, T. G., and Hansen, C. B. (2008). Inference with dependent data using cluster covariance estimators. Working paper.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.

Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J., and Zylkin, T. (2022). Machine learning in international trade research - evaluating the impact of trade agreements.
Bühlmann, P. and Van De Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer Science & Business Media.

Burman, P., Chow, E., and Nolan, D. (1994). A cross-validatory method for dependent data. Biometrika, 81(2):351–358.

Chen, K. and Kim, K. i. (2024). Identification of nonseparable models with endogenous control variables. arXiv preprint arXiv:2401.14395.

Chen, K. and Vogelsang, T. J. (2024). Fixed-b asymptotics for panel models with two-way clustering. Journal of Econometrics, 244(1):105831.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018a). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1):C1–C68.

Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. (2022a). Locally robust semiparametric estimation. Econometrica, 90(4):1501–1535.

Chernozhukov, V., Hausman, J. A., and Newey, W. K. (2019). Demand analysis with many prices. Technical report, National Bureau of Economic Research.

Chernozhukov, V., Karl Härdle, W., Huang, C., and Wang, W. (2021a). Lasso-driven inference in time and space. The Annals of Statistics, 49(3):1702–1735.

Chernozhukov, V., Newey, W., Quintas-Martínez, V. M., and Syrgkanis, V. (2022b). RieszNet and ForestRiesz: Automatic debiased machine learning with neural nets and random forests. In International Conference on Machine Learning, pages 3901–3914. PMLR.

Chernozhukov, V., Newey, W. K., Quintas-Martínez, V., and Syrgkanis, V. (2021b). Automatic debiased machine learning via neural nets for generalized linear regression. arXiv preprint arXiv:2104.14737.

Chernozhukov, V., Newey, W. K., and Robins, J. (2018b). Double/de-biased machine learning using regularized Riesz representers. Technical report, cemmap working paper.

Chernozhukov, V., Newey, W. K., and Singh, R. (2022c). Automatic debiased machine learning of causal and structural effects. Econometrica, 90(3):967–1027.

Chetverikov, D., Liao, Z., and Chernozhukov, V. (2021). On cross-validated lasso in high dimensions. The Annals of Statistics, 49(3):1300–1317.

Chiang, H. D., Hansen, B. E., and Sasaki, Y. (2024). Standard errors for two-way clustering with serially correlated time effects. Review of Economics and Statistics, pages 1–40.

Chiang, H. D., Kato, K., Ma, Y., and Sasaki, Y. (2022). Multiway cluster robust double/debiased machine learning. Journal of Business & Economic Statistics, 40(3):1046–1056.

Chiang, H. D., Kato, K., and Sasaki, Y. (2023a). Inference for high-dimensional exchangeable arrays. Journal of the American Statistical Association, 118(543):1595–1605.

Chiang, H. D., Ma, Y., Rodrigue, J., and Sasaki, Y. (2021). Dyadic double/debiased machine learning for analyzing determinants of free trade agreements. arXiv preprint arXiv:2110.04365.

Chiang, H. D., Rodrigue, J., and Sasaki, Y. (2023b). Post-selection inference in three-dimensional panel data. Econometric Theory, 39(3):623–658.

Correia, S., Guimarães, P., and Zylkin, T. (2020). Fast Poisson estimation with high-dimensional fixed effects. The Stata Journal, 20(1):95–115.

Davezies, L., D'Haultfœuille, X., and Guyonvarch, Y. (2019). Empirical process results for exchangeable arrays. arXiv preprint arXiv:1906.11293.

Davidson, J. (1994). Stochastic limit theory: An introduction for econometricians. OUP Oxford.

Dehling, H. and Wendler, M. (2010). Central limit theorem and the bootstrap for U-statistics of strongly mixing data. Journal of Multivariate Analysis, 101(1):126–137.
Dehling, H. and Wendler, M. (2010). Central limit theorem and the bootstrap for U-statistics of strongly mixing data. Journal of Multivariate Analysis, 101(1):126-137.

Djogbenou, A. A., MacKinnon, J. G., and Nielsen, M. Ø. (2019). Asymptotic theory and wild bootstrap inference with clustered errors. Journal of Econometrics, 212(2):393-412.

Dudley, R. M. and Philipp, W. (1983). Invariance principles for sums of Banach space valued random elements and empirical processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 62(4):509-552.

Ellison, M., Lee, S. S., and O'Rourke, K. H. (2024). The ends of 27 big depressions. American Economic Review, 114(1):134-168.

Fama, E. F. and French, K. R. (2000). Forecasting profitability and earnings. The Journal of Business, 73(2):161-175.

Fernández-Val, I. and Lee, J. (2013). Panel data models with nonadditive unobserved heterogeneity: Estimation and inference. Quantitative Economics, 4(3):453-481.

Fuk, D. K. and Nagaev, S. V. (1971). Probability inequalities for sums of independent random variables. Theory of Probability & Its Applications, 16(4):643-660.

Gao, J., Peng, B., and Yan, Y. (2024). Robust inference for high-dimensional panel data models. Available at SSRN 4825772.

Gao, L., Shao, Q.-M., and Shi, J. (2022). Refined Cramér-type moderate deviation theorems for general self-normalized sums with applications to dependent random variables and winsorized mean. The Annals of Statistics, 50(2):673-697.

Gonçalves, S. (2011). The moving blocks bootstrap for panel linear regression models with individual fixed effects. Econometric Theory, 27(5):1048-1082.

Guvenen, F., Schulhofer-Wohl, S., Song, J., and Yogo, M. (2017). Worker betas: Five facts about systematic earnings risk. American Economic Review, 107(5):398-403.

Hahn, J. and Kuersteiner, G. (2011). Bias reduction for dynamic nonlinear panel models with fixed effects. Econometric Theory, 27(6):1152-1191.

Hansen, B. (2022). Econometrics. Princeton University Press.

Hansen, B. E. (1992). Consistent covariance matrix estimation for dependent heterogeneous processes. Econometrica, 60(4):967-972.

Hansen, C. B. (2007). Asymptotic properties of a robust variance matrix estimator for panel data when T is large. Journal of Econometrics, 141(2):597-620.

Hoover, D. N. (1979). Relations on probability spaces and arrays of random variables. Technical report, Institute for Advanced Study.

Ichimura, H. (1987). Estimation of single index models. PhD thesis, Massachusetts Institute of Technology.

Ichimura, H. and Newey, W. K. (2022). The influence function of semiparametric estimators. Quantitative Economics, 13(1):29-61.

Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869-2909.

Jing, B.-Y., Shao, Q.-M., and Wang, Q. (2003). Self-normalized Cramér-type large deviations for independent random variables. The Annals of Probability, 31(4):2167-2215.

Jordan, M. I., Wang, Y., and Zhou, A. (2023). Data-driven influence functions for optimization-based causal inference.

Kallenberg, O. (1989). On the representation theorem for exchangeable arrays. Journal of Multivariate Analysis, 30(1):137-154.

Kallenberg, O. (2005). Probabilistic Symmetries and Invariance Principles, volume 9. Springer.

Kock, A. B. and Callot, L. (2015). Oracle inequalities for high dimensional vector autoregressions. Journal of Econometrics, 186(2):325-344.
Kock, A. B. and Tang, H. (2019). Uniform inference in high-dimensional dynamic panel data models with approximately sparse fixed effects. Econometric Theory, 35(2):295-359.

Larrain, B. (2006). Do banks affect the level and composition of industrial volatility? The Journal of Finance, 61(4):1897-1925.

Li, K., Morck, R., Yang, F., and Yeung, B. (2004). Firm-specific variation and openness in emerging markets. Review of Economics and Statistics, 86(3):658-669.

Lin, J. and Michailidis, G. (2017). Regularized estimation and testing for high-dimensional multi-block vector-autoregressive models. Journal of Machine Learning Research, 18(117):1-49.

MacKinnon, J. G., Nielsen, M. Ø., and Webb, M. D. (2021). Wild bootstrap and asymptotic inference with multiway clustering. Journal of Business & Economic Statistics, 39(2):505-519.

Mattoo, A., Rocha, N., and Ruta, M. (2020). Handbook of Deep Trade Agreements. World Bank Publications.

Menzel, K. (2021). Bootstrap with cluster-dependence in two or more dimensions. Econometrica, 89(5):2143-2188.

Mundlak, Y. (1978). On the pooling of time series and cross section data. Econometrica, 46(1):69-85.

Nakamura, E. and Steinsson, J. (2014). Fiscal stimulus in a monetary union: Evidence from US regions. American Economic Review, 104(3):753-792.

Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica, 62(6):1349-1382.

Ning, Y. and Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158-195.

Peña, V. H., Lai, T. L., and Shao, Q.-M. (2009). Self-Normalized Processes: Limit Theory and Statistical Applications. Springer.

Powell, J. L., Stock, J. H., and Stoker, T. M. (1989). Semiparametric estimation of index coefficients. Econometrica, 57(6):1403-1430.

Racine, J. (2000). Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics, 99(1):39-61.

Rajan, R. G. and Zingales, L. (1998). Financial dependence and growth. The American Economic Review, 88(3):559-586.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica, 56(4):931-954.

Roodman, D., Nielsen, M. Ø., MacKinnon, J. G., and Webb, M. D. (2019). Fast and wild: Bootstrap inference in Stata using boottest. The Stata Journal, 19(1):4-60.

Semenova, V., Goldman, M., Chernozhukov, V., and Taddy, M. (2023a). Inference on heterogeneous treatment effects in high-dimensional dynamic panels under weak dependence. Quantitative Economics, 14(2):471-510.

Semenova, V., Goldman, M., Chernozhukov, V., and Taddy, M. (2023b). Supplement to "Inference on heterogeneous treatment effects in high-dimensional dynamic panels under weak dependence". Quantitative Economics, 14(2):471-510.

Strassen, V. (1965). The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 36(2):423-439.

Thompson, S. B. (2011). Simple formulas for standard errors that cluster by both firm and time. Journal of Financial Economics, 99(1):1-10.

Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166-1202.

Vogt, M., Walsh, C., and Linton, O. (2022). CCE estimation of high-dimensional panel data models with interactive fixed effects. arXiv preprint arXiv:2206.12152.

Wooldridge, J. M. (2021). Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators. Available at SSRN 3906345.
Wooldridge, J. M. and Zhu, Y. (2020). Inference in approximately sparse correlated random effects probit models with panel data. Journal of Business & Economic Statistics, 38(1):1-18.

Wu, W.-B. and Wu, Y. N. (2016). Performance bounds for parameter estimates of high-dimensional linear models with correlated errors.

Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):217-242.

APPENDIX 3A

PROOFS FOR CHAPTER 3.2

We first introduce two lemmas regarding the law of large numbers (LLN) and the central limit theorem (CLT) for two-way clustered arrays with correlated time effects. They are restated and generalized from Theorems 1 and 2 in Chiang et al. (2024). The following notation will be used frequently throughout the appendices. Let $\{W_{it} : i = 1, \ldots, N;\ t = 1, \ldots, T\}$ be an array of random vectors taking values in $\mathbb{R}^p$. Let $F : \mathbb{R}^p \to \mathbb{R}^k$ be a measurable function where $k$ is a constant. We define the Hajek projection terms
\[
a_i = E\big[F(W_{it}) - E[F(W_{it})] \mid \alpha_i\big], \quad g_t = E\big[F(W_{it}) - E[F(W_{it})] \mid \gamma_t\big], \quad e_{it} = F(W_{it}) - E[F(W_{it})] - a_i - g_t,
\]
and their corresponding (long-run) variance-covariance matrices:
\[
\Sigma_a = E[a_i a_i'], \qquad \Sigma_g = \sum_{l=-\infty}^{\infty} E[g_t g_{t+l}'], \qquad \Sigma_e = \sum_{l=-\infty}^{\infty} E[e_{it} e_{i,t+l}'].
\]
We can rewrite $F(W_{it}) - E[F(W_{it})] = a_i + g_t + e_{it}$. Suppose that $W_{it}$ satisfies Assumptions AHK and AR; then the decomposition has the following properties: (i) $\{a_i\}_{i\ge1}$ is a sequence of i.i.d. random vectors; $\{g_t\}_{t\ge1}$ is strictly stationary and $\beta$-mixing with mixing coefficient $\beta_g(m) \le \beta_\gamma(m)$ for all $m\ge1$; for each $i$, $\{e_{it}\}_{t\ge1}$ is also strictly stationary; and $a_i$ is independent of $g_t$. (ii) $a_i$, $g_t$, and $e_{it}$ are mean zero. (iii) Conditional on $(\gamma_t, \gamma_r)$, $e_{it}$ and $e_{jr}$ are independent for $j \ne i$. (iv) The sequences $\{a_i\}$, $\{g_t\}$, and $\{e_{it}\}$ are mutually uncorrelated.

Properties (i) and (ii) are straightforward. Property (iii) is due to the assumption that $\{\alpha_i\}$ and $\{\varepsilon_{it}\}$ are each i.i.d. sequences and independent of each other. Property (iv) is less obvious. One can show $E_P[e_{it}|\gamma_r] = 0$ and $E_P[e_{it}|\alpha_j] = 0$ for any $i, t, j, r$. The least obvious case is $E_P[e_{it}|\gamma_r] = 0$ for $r \ne t$:
\[
E_P[e_{it}|\gamma_r] = E_P\big[F(W_{it}) - E[F(W_{it})]\,\big|\,\gamma_r\big] - E_P[a_i|\gamma_r] - E_P[g_t|\gamma_r]
\]
\[
= E_P\Big\{E_P\big[F(f(\alpha_i,\gamma_t,\varepsilon_{it})) - E[F(W_{it})]\,\big|\,\gamma_t,\gamma_r\big]\,\Big|\,\gamma_r\Big\} - E_P[a_i] - E_P[g_t|\gamma_r]
\]
\[
= E_P\Big\{E_P\big[F(f(\alpha_i,\gamma_t,\varepsilon_{it})) - E[F(W_{it})]\,\big|\,\gamma_t\big]\,\Big|\,\gamma_r\Big\} - E_P[a_i] - E_P[g_t|\gamma_r]
= E_P[g_t|\gamma_r] - E_P[g_t|\gamma_r] = 0,
\]
where the second equality follows from iterated expectations and the independence of $\alpha_i$ and $\gamma_r$, and the third equality follows from the fact that, given $\gamma_t$, $\gamma_r$ is independent of $(\alpha_i, \varepsilon_{it})$. Using the properties above, one can derive the LLN and CLT for two-way clustered panel data. The following lemma gives the LLN.

Lemma A.1 Suppose that $W_{it}$ satisfies Assumptions AHK and AR and $E\big[\|F(W_{it})\|^{4(r+\delta)}\big] < \infty$. Then, (i) $\|\Sigma_a\| < \infty$, $\|\Sigma_g\| < \infty$, and $\|\Sigma_e\| < \infty$; (ii) $\mathrm{Var}\big(\mathbb{E}_{NT}[F(W_{it})]\big) = \frac{1}{N}\Sigma_a + \frac{1}{T}\Sigma_g(1+o(1)) + \frac{1}{NT}\Sigma_e(1+o(1))$ as $N, T \to \infty$; (iii) $\mathbb{E}_{NT}[F(W_{it})] \xrightarrow{p} E[F(W_{it})]$ as $N, T \to \infty$.

Lemma A.2 With the same setting as in Lemma A.1, further assume that either $\lambda_{\min}[\Sigma_a] > 0$ or $\lambda_{\min}[\Sigma_g] > 0$. Then, as $N, T \to \infty$ and $N/T \to c$,
\[
\sqrt{N}\big(\mathbb{E}_{NT}[F(W_{it})] - E[F(W_{it})]\big) \xrightarrow{d} \mathcal{N}(0, \Sigma_a + c\Sigma_g).
\]
Lemmas A.1 and A.2 are the same as Theorems 1 and 2 in Chiang et al. (2024) except that $W_{it}$ is replaced by $F(W_{it})$ and we do not consider the i.i.d. case here. The proofs with $W_{it}$ replaced by $F(W_{it})$ still go through, so they are not repeated here.
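The variance decomposition in Lemma A.1(ii) is easy to visualize numerically. The following is a minimal simulation sketch, not part of the formal argument: it assumes a simple AHK-type data-generating process with standard normal individual effects, an AR(1) time effect (which is $\beta$-mixing), and i.i.d. idiosyncratic noise, and compares the Monte Carlo variance of the two-way sample mean with the approximation $\frac{1}{N}\Sigma_a + \frac{1}{T}\Sigma_g + \frac{1}{NT}\Sigma_e$.

```python
# Minimal simulation sketch of Lemma A.1(ii); the DGP below (normal effects,
# AR(1) time effect with coefficient rho) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
N, T, R, rho = 50, 50, 5000, 0.5

means = np.empty(R)
for r in range(R):
    a = rng.normal(size=(N, 1))                  # i.i.d. alpha_i, so Sigma_a = 1
    g = np.empty(T)                              # AR(1) gamma_t (beta-mixing)
    g[0] = rng.normal() / np.sqrt(1 - rho**2)    # stationary initialization
    for t in range(1, T):
        g[t] = rho * g[t - 1] + rng.normal()
    e = rng.normal(size=(N, T))                  # i.i.d. idiosyncratic part
    means[r] = (a + g[None, :] + e).mean()       # two-way sample mean

Sigma_a = 1.0                                    # Var(a_i)
Sigma_g = 1.0 / (1 - rho) ** 2                   # long-run variance of AR(1)
Sigma_e = 1.0                                    # no serial correlation in e_it
approx = Sigma_a / N + Sigma_g / T + Sigma_e / (N * T)
print(f"Monte Carlo Var: {means.var():.5f}  vs  approximation: {approx:.5f}")
```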
The following lemma provides a probability limit of the infeasible penalty weights.

Lemma A.3 Let $\omega_j$ be as defined in (3.11) with bandwidth $M$ such that $M/T^{0.5} = o(1)$. With the same setting as in Lemma A.2 for $F(W_{it}) = f_{it,j}V_{it}$, we have $\omega_j \xrightarrow{p} \frac{N\wedge T}{N}\Sigma_a + \frac{N\wedge T}{T}\Sigma_g$ as $N, T \to \infty$ and $N/T \to c$.

Proof of Lemma A.3 Since $a_{i,j}$ is independent over $i$, we can apply the weak law of large numbers and obtain
\[
\frac{N\wedge T}{N^2}\sum_{i=1}^N a_{i,j}^2 = \frac{N\wedge T}{N}\Sigma_a + o_P(1).
\]
To show the convergence of the second term, we can apply Proposition 2 of Bester et al. (2008) by verifying its Assumption 7. Since the block size here is $h = \mathrm{round}(T^{1/5}) + 1$, it diverges with the time-series sample size and $h/T \to 0$ as $T \to \infty$, so Assumption 7(i) follows. Note that the $\beta$-mixing property of $g_{t,j}$ implies that it is also $\alpha$-mixing with mixing coefficient $\alpha_g(q) \le \beta_g(q) \le \beta_\gamma(q) = c_\kappa\exp(-\kappa q)$ for all $q \ge 1$. Let $\zeta$ be some positive constant; then we have
\[
\sum_{q=1}^\infty q^2\alpha_g(q)^{\zeta/(4+\zeta)} \le c_\kappa^{\zeta/(4+\zeta)}\sum_{q=1}^\infty q^2\exp\big(-\kappa\zeta q/(4+\zeta)\big) = c_\kappa^{\zeta/(4+\zeta)}\sum_{q=1}^\infty q^2\exp(-aq),
\]
where $a := \kappa\zeta/(4+\zeta)$. We can use the ratio test to examine the convergence of the sum:
\[
\lim_{q\to\infty}\frac{(q+1)^2\exp(-a(q+1))}{q^2\exp(-aq)} = \lim_{q\to\infty}\Big(\frac{q+1}{q}\Big)^2\exp(-a) = \exp(-a).
\]
Since $\kappa > 0$ and $\zeta > 0$, we have $a > 0$ and so $\exp(-a) < 1$. Thus we conclude that the infinite sum does not diverge. The third condition is ensured directly by our assumptions. Thus, by Proposition 2 of Bester et al. (2008), we have
\[
\frac{N\wedge T}{T^2}\sum_{b=1}^B\Big(\sum_{t\in H_b} g_{t,j}\Big)^2 = \frac{N\wedge T}{T}\Sigma_g + o_P(1).
\]
The conclusion follows. □

The following notation and lemma are used to derive the performance bounds for the post-LASSO estimator. Corresponding to $\widehat\Gamma$ defined above Theorem 3.1, we define $\Gamma_0$ as the support of $\zeta_0$ and $\widehat m = \|\widehat\Gamma\setminus\Gamma_0\|_0$. Define $P_\Gamma$ as the projection matrix that projects an $NT\times1$ vector onto the linear span of the $NT\times1$ vectors $f_j$ with $j\in\Gamma$. The post-LASSO estimator $\widehat\zeta_{PL}$ is defined as the OLS estimator of the linear projection of $Y_{it}$ onto $\{f_{it,j} : j\in\widehat\Gamma\}$.

Lemma A.4 Under Assumption ASM, if $S_{\max} := \max_{1\le j\le p}\big|\mathbb{E}_{NT}[\omega_j^{-1/2}f_{it,j}V_{it}]\big| \le \frac{\lambda}{2c_1NT}$, $0 < a = \min_j\omega_j^{1/2} \le \max_j\omega_j^{1/2} = b < \infty$, and $u \ge 1 \ge l \ge 1/c_1$, then
\[
\|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} = \Bigg(\sqrt{\frac{s}{\phi_{\min}(s)(M_f)}} + \sqrt{\frac{\widehat m}{\phi_{\min}(\widehat m)(M_f)}}\Bigg)O_P\Big(\frac{\lambda}{NT}\Big) + O_P\big(\|f(X_{it}) - (P_{\widehat\Gamma}f)_{it}\|_{NT,2}\big).
\]

Proof of Lemma A.4 We can decompose $f(X_{it}) - f_{it}\widehat\zeta_{PL}$ as follows:
\[
f(X_{it}) - f_{it}\widehat\zeta_{PL} = f(X_{it}) - (P_{\widehat\Gamma}Y)_{it} = \big((I_{NT} - P_{\widehat\Gamma})f(X)\big)_{it} - \big((P_{\Gamma_0} + P_{\widehat\Gamma\setminus\Gamma_0})V\big)_{it},
\]
so that
\[
\|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} \le \big\|\big((I_{NT}-P_{\widehat\Gamma})f\big)_{it}\big\|_{NT,2} + \big\|(P_{\Gamma_0}V)_{it}\big\|_{NT,2} + \big\|(P_{\widehat\Gamma\setminus\Gamma_0}V)_{it}\big\|_{NT,2},
\]
where the last equality follows from the properties of the linear projection and the inequality follows from Minkowski's inequality.
By Hölder's inequality and the properties of the spectral norm, we have
\[
\big\|(P_{\widehat\Gamma\setminus\Gamma_0}V)_{it}\big\|_{NT,2} = \frac{1}{\sqrt{NT}}\big\|P_{\widehat\Gamma\setminus\Gamma_0}V\big\|_2 \le \frac{1}{\sqrt{NT}}\Big\|f_{\widehat\Gamma\setminus\Gamma_0}\big(f_{\widehat\Gamma\setminus\Gamma_0}'f_{\widehat\Gamma\setminus\Gamma_0}\big)^{-1}\Big\|_\infty\Big\|f_{\widehat\Gamma\setminus\Gamma_0}'V\Big\|_2
\]
\[
\le \frac{1}{\sqrt{NT}}\sqrt{\frac{1}{NT\,\phi_{\min}(\widehat m)(M_f)}}\Bigg(\sum_{j\in\widehat\Gamma\setminus\Gamma_0}\Big(\sum_{i=1}^N\sum_{t=1}^T f_{it,j}V_{it}\Big)^2\Bigg)^{1/2} \le \sqrt{\frac{\widehat m}{\phi_{\min}(\widehat m)(M_f)}}\,S_{\max} = \sqrt{\frac{\widehat m}{\phi_{\min}(\widehat m)(M_f)}}\,O_P\Big(\frac{\lambda}{NT}\Big),
\]
where the second-to-last inequality uses $\min_j\omega_j^{1/2} = a > 0$ and the last equality follows from $S_{\max} \le \frac{\lambda}{2c_1NT}$. By similar arguments, we have
\[
\big\|(P_{\Gamma_0}V)_{it}\big\|_{NT,2} = \frac{1}{\sqrt{NT}}\big\|P_{\Gamma_0}V\big\|_2 \le \frac{1}{\sqrt{NT}}\Big\|f_{\Gamma_0}\big(f_{\Gamma_0}'f_{\Gamma_0}\big)^{-1}\Big\|_\infty\Big\|f_{\Gamma_0}'V\Big\|_2 \le \sqrt{\frac{s}{\phi_{\min}(s)(M_f)}}\,O_P\Big(\frac{\lambda}{NT}\Big). \ \square
\]

Proof of Theorem 3.1 In this proof, we show the $\ell_1$ and $\ell_2$ convergence rates for $\widehat\zeta$. We first show the regularization event in terms of the infeasible penalty weights $\omega$ defined in (3.11). Due to the AHK representation in Assumption AHK, we can decompose $f_{it,j}V_{it}$ as
\[
f_{it,j}V_{it} = a_{i,j} + g_{t,j} + e_{it,j},
\]
where $a_{i,j} := E[f_{it,j}V_{it}|\alpha_i]$, $g_{t,j} := E[f_{it,j}V_{it}|\gamma_t]$, and $e_{it,j} := f_{it,j}V_{it} - a_{i,j} - g_{t,j}$, for $j = 1, \ldots, p$. To show that the regularization event holds with probability approaching one, we bound the probability of the following event for each $j = 1, \ldots, p$:
\[
P\Bigg(\bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\omega_j^{-1/2}f_{it,j}V_{it}\bigg| > \frac{\lambda}{2c_1NT}\Bigg) = P\Bigg(\omega_j^{-1/2}\bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\big(a_{i,j}+g_{t,j}+e_{it,j}\big)\bigg| > \frac{\lambda}{2c_1NT}\Bigg)
\]
\[
\le P\Bigg(\bigg|\frac{1}{N}\sum_{i=1}^N\omega_{a,j}^{-1/2}a_{i,j}\bigg| + \bigg|\frac{1}{T}\sum_{t=1}^T\omega_{g,j}^{-1/2}g_{t,j}\bigg| + \bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\omega_j^{-1/2}e_{it,j}\bigg| > \frac{\lambda}{2c_1NT}\Bigg)
\]
\[
\le P\Bigg(\bigg|\frac{1}{\sqrt N}\sum_{i=1}^N\omega_{a,j}^{-1/2}a_{i,j}\bigg| > \frac{\sqrt N\lambda}{6c_1NT}\Bigg) + P\Bigg(\bigg|\frac{1}{\sqrt T}\sum_{t=1}^T\omega_{g,j}^{-1/2}g_{t,j}\bigg| > \frac{\sqrt T\lambda}{6c_1NT}\Bigg) + P\Bigg(\bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\omega_j^{-1/2}e_{it,j}\bigg| > \frac{\lambda}{6c_1NT}\Bigg)
\]
\[
=: p_{1,j}(\lambda) + p_{2,j}(\lambda) + p_{3,j}(\lambda),
\]
where $\omega_{a,j} := \frac{N\wedge T}{N^2}\sum_{i=1}^N a_{i,j}^2$ and $\omega_{g,j} := \frac{N\wedge T}{T^2}\sum_{b=1}^B\big(\sum_{t\in H_b}g_{t,j}\big)^2$. The first inequality follows from the triangle inequality and the fact that $\omega_j^{1/2} = (\omega_{a,j}+\omega_{g,j})^{1/2} \ge \max\{\omega_{a,j}^{1/2},\omega_{g,j}^{1/2}\}$; the second inequality follows from a union bound.
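For concreteness, the following is a schematic sketch of how the weight components $\omega_{a,j}$ and $\omega_{g,j}$ above, and the penalty level $\lambda = 6c_1\frac{NT}{\sqrt{N\wedge T}}\Phi^{-1}(1-\frac{\gamma}{2p})$ that the proof arrives at below, could be computed. It is illustrative only: the arrays `a_ij` and `g_tj` stand in for the unobserved Hajek components $a_{i,j}$ and $g_{t,j}$, and the default value of `c1` is a placeholder for the slack constant $c_1$.

```python
# Schematic sketch of the infeasible weight components and penalty level used
# in the proof; a_ij (length N) and g_tj (length T) stand in for the
# unobserved Hajek components of f_{it,j} V_it, so this is illustrative only.
import numpy as np
from statistics import NormalDist

def penalty_weight_j(a_ij: np.ndarray, g_tj: np.ndarray) -> float:
    """omega_j = omega_{a,j} + omega_{g,j}, with block sums over H_b."""
    N, T = a_ij.size, g_tj.size
    m = min(N, T)
    omega_a = m / N**2 * np.sum(a_ij**2)
    h = round(T ** (1 / 5)) + 1                       # block length round(T^{1/5}) + 1
    blocks = [g_tj[s:s + h] for s in range(0, T, h)]  # non-overlapping blocks H_b
    omega_g = m / T**2 * sum(b.sum() ** 2 for b in blocks)
    return omega_a + omega_g

def penalty_level(N: int, T: int, p: int, gamma: float, c1: float = 1.1) -> float:
    """lambda = 6 c1 NT / sqrt(N ^ T) * Phi^{-1}(1 - gamma/(2p)); c1 is a placeholder."""
    return 6 * c1 * N * T / np.sqrt(min(N, T)) * NormalDist().inv_cdf(1 - gamma / (2 * p))
```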
Applying the union bound again, we obtain
\[
P\Bigg(\max_{j=1,\ldots,p}\bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\omega_j^{-1/2}f_{it,j}V_{it}\bigg| > \frac{\lambda}{2c_1NT}\Bigg) \le \sum_{j=1}^p\big[p_{1,j}(\lambda) + p_{2,j}(\lambda) + p_{3,j}(\lambda)\big].
\]
To bound $p_{1,j}(\lambda)$, we apply a moderate deviation theorem for self-normalized sums of independent random variables. For $j = 1, \ldots, p$, define
\[
\Xi_{a,j} = \frac{\big[E(a_{i,j})^2\big]^{1/2}}{\big[E|a_{i,j}|^3\big]^{1/3}}.
\]
Under Assumption REG(i), $\max_{j\le p}E|a_{i,j}|^3 < \infty$ by Hölder's inequality and Jensen's inequality. By Assumption REG(ii), $\min_{j\le p}E|a_{i,j}|^2 > 0$. Therefore, $\min_j\Xi_{a,j} > 0$. By Theorem 7.4 of Peña et al. (2009) with $\delta = 1$, we have for any $x\in[0, N^{1/6}\Xi_{a,j}]$ that
\[
P\Bigg(\bigg|\frac{1}{\sqrt N}\sum_{i=1}^N\omega_{a,j}^{-1/2}a_{i,j}\bigg| > x\Bigg) \le 2(1-\Phi(x))\Bigg[1 + O(1)\bigg(\frac{1+x}{N^{1/6}\Xi_{a,j}}\bigg)^3\Bigg].
\]
Let $l_{a,N}$ be some positive increasing sequence. If $N^{1/6}\Xi_{a,j}/l_{a,N} - 1 > 0$ and $x\in[0, N^{1/6}\Xi_{a,j}/l_{a,N} - 1]$, then
\[
P\Bigg(\bigg|\frac{1}{\sqrt N}\sum_{i=1}^N\omega_{a,j}^{-1/2}a_{i,j}\bigg| > x\Bigg) \le 2(1-\Phi(x))\Bigg[1 + O(1)\bigg(\frac{1}{l_{a,N}}\bigg)^3\Bigg].
\]
Then, setting $\lambda = 6c_1\frac{NT}{\sqrt N}\Phi^{-1}\big(1-\frac{\gamma}{2p}\big)$ gives
\[
\sum_{j=1}^p p_{1,j}(\lambda) \le 2p\big(1-\Phi(\Phi^{-1}(1-\gamma/2p))\big)\big[1 + O(1)(1/l_{a,N})^3\big] \le \gamma\big[1 + O(1)(1/l_{a,N})^3\big],
\]
given that $\Phi^{-1}\big(1-\frac{\gamma}{2p}\big)\in\big[0, N^{1/6}\min_j\Xi_{a,j}/l_{a,N} - 1\big]$ and $N^{1/6}\min_j\Xi_{a,j}/l_{a,N} - 1 > 0$. Note that, under Assumption REG(i) and $N/T\to c$ as $N, T\to\infty$, $\Phi^{-1}\big(1-\frac{\gamma}{2p}\big) \lesssim \sqrt{\log(p/\gamma)} = o\big(N^{1/12}/\log N\big)$. Therefore, it suffices to take $l_{a,N} = O(\log N)$, and it follows that $\sum_{j=1}^p p_{1,j}(\lambda) \to 0$ as $\gamma\to0$ and $(N,T)\to\infty$.

To bound $p_{2,j}(\lambda)$, we utilize a moderate deviation theorem for self-normalized sums of weakly dependent random variables. Observe that $g_{t,j} = E[f_{it,j}V_{it}|\gamma_t]$ is $\beta$-mixing with coefficient $\beta_g(q)$ satisfying
\[
\beta_g(q) \le \beta_\gamma(q) \le c_\kappa\exp(-\kappa q) \quad \forall q\in\mathbb{Z}_+.
\]
Furthermore, by strict stationarity and the non-degeneracy condition in Assumption REG(iii), one can verify that for some $\nu > 0$, $E\big[\sum_{t=r}^{r+m}g_{t,j}\big]^2 \ge \nu^2m$ for all $t\ge1$, $r\ge0$, $m\ge1$. By Assumption REG(ii) and Hölder's inequality, we have $E|f_{it,j}V_{it}|^{4(\mu+\delta)} < \infty$ for some $\mu > 1$, $\delta > 0$. Then, by Theorem 3.2 of Gao et al. (2022) with $\tau = 1$ and $\alpha = \frac{1}{1+2\tau}$, we have
\[
\sum_{j=1}^p P\Bigg(\bigg|\frac{1}{\sqrt T}\sum_{t=1}^T\omega_{g,j}^{-1/2}g_{t,j}\bigg| > x\Bigg) \le 2p(1-\Phi(x))\Bigg[1 + O(1)\bigg(\frac{1}{l_{g,T}}\bigg)^2\Bigg]
\]
uniformly for $x\in\big(0,\ d_0(\log T)^{-1/2}T^{1/12}/l_{g,T}\big)$, where $d_0$ is some positive constant and $l_{g,T}$ is some positive increasing sequence. Then, setting $\lambda = 6c_1\frac{NT}{\sqrt T}\Phi^{-1}\big(1-\frac{\gamma}{2p}\big)$ gives
\[
\sum_{j=1}^p p_{2,j}(\lambda) \le \gamma\Bigg[1 + O(1)\bigg(\frac{1}{l_{g,T}}\bigg)^2\Bigg],
\]
given that $\Phi^{-1}(1-\frac{\gamma}{2p})\in\big(0,\ d_0(\log T)^{-1/2}T^{1/12}/l_{g,T}\big)$. Under Assumption REG(i), we have $\log(p/\gamma) = o\big(T^{1/6}/(\log T)^2\big)$ and so $\Phi^{-1}\big(1-\frac{\gamma}{2p}\big) \lesssim \sqrt{\log(p/\gamma)} = o\big(T^{1/12}/\log T\big)$. Therefore, by taking $l_{g,T} = O\big((\log T)^{1/2}\big)$, it follows that $\sum_{j=1}^p p_{2,j}(\lambda)\to0$ as $\gamma\to0$ and $(N,T)\to\infty$.

Consider $p_{3,j}(\lambda)$. Define $\bar e_{i,j} := \frac1T\sum_{t=1}^T e_{it,j}$. Observe that $E[\bar e_{i,j}] = 0$ by iterated expectations and that, conditional on $\{\gamma_t\}_{t=1}^T$, $\bar e_{i,j}$ is independent over $i$.
We have shown previously that $E|f_{it,j}V_{it}|^{4(\mu+\delta)} < \infty$ for some $\mu > 1$, $\delta > 0$. Given that $e_{it,j} = f_{it,j}V_{it} - a_{i,j} - g_{t,j}$ and that $E|a_{i,j}|^{4(\mu+\delta)} < \infty$ and $E|g_{t,j}|^{4(\mu+\delta)} < \infty$ by Jensen's inequality and iterated expectations, we have $E|e_{it,j}|^{4(\mu+\delta)} < \infty$ and so $E|\bar e_{i,j}|^{4(\mu+\delta)} < \infty$ by Minkowski's inequality. Note that
\[
\mathrm{Var}(\bar e_{i,j}) = \frac1T\sum_{l=-(T-1)}^{T-1}\Big(1-\frac{|l|}{T}\Big)E(e_{it,j}e_{i,t+l,j}) = \frac1T\Sigma_e(1+o(1)),
\]
where $\Sigma_e$ is defined at the beginning of Appendix 3A with $k = 1$ in this case. By Lemma A.1, $|\Sigma_{e,j}| < \infty$. Furthermore, as shown below, $\omega_j^{1/2}$ is bounded from below by some constant $a > 0$. Now, by the conditional version of Corollary 4 of Fuk and Nagaev (1971), there exist constants $a_1$ and $a_2$ such that
\[
P\Bigg(\bigg|\sum_{i=1}^N\omega_j^{-1/2}\bar e_{i,j}\bigg| > \frac{\lambda}{6c_1T}\,\bigg|\,\{\gamma_t\}_{t=1}^T\Bigg) \le P\Bigg(\bigg|\sum_{i=1}^N\bar e_{i,j}\bigg| > \frac{a\lambda}{6c_1T}\,\bigg|\,\{\gamma_t\}_{t=1}^T\Bigg)
\le a_1(\lambda/T)^{-4}\sum_{i=1}^N E\big(|\bar e_{i,j}|^4\,\big|\,\{\gamma_t\}_{t=1}^T\big) + \exp\Bigg(-\frac{a_2(\lambda/T)^2}{\sum_{i=1}^N\mathrm{Var}\big(\bar e_{i,j}|\{\gamma_t\}_{t=1}^T\big)}\Bigg).
\]
Note that $\exp(-1/z)$ is not globally concave, but it is concave for $z > 1/2$ and bounded by $z/e^2$ for $z\in(0,1/2)$, where $e$ is Euler's number. Denote $z = \frac{(T/\lambda)^2}{a_2}\sum_{i=1}^N\mathrm{Var}(\bar e_{i,j}|\{\gamma_t\}_{t=1}^T)$. Then we have
\[
\exp\Bigg(-\frac{a_2(\lambda/T)^2}{\sum_{i=1}^N\mathrm{Var}(\bar e_{i,j}|\{\gamma_t\}_{t=1}^T)}\Bigg) = \exp(-1/z) \le (z/e^2)\,\mathbf{1}\{z\in(0,1/2)\} + \exp(-1/z)\,\mathbf{1}\{z > 1/2\}.
\]
By the Fubini theorem, Jensen's inequality, and the bounded moments, we have
\[
p_{3,j}(\lambda) = P\Bigg(\bigg|\sum_{i=1}^N\omega_j^{-1/2}\bar e_{i,j}\bigg| > \frac{\lambda}{6c_1T}\Bigg)
\le a_1(\lambda/T)^{-4}\sum_{i=1}^N E\big(|\bar e_{i,j}|^4\big) + \frac{(T/\lambda)^2}{a_2}\sum_{i=1}^N\mathrm{Var}(\bar e_{i,j})/e^2 + \exp\Bigg(-\frac{a_2(\lambda/T)^2}{\sum_{i=1}^N\mathrm{Var}(\bar e_{i,j})}\Bigg)
\]
\[
= O\big((\lambda/T)^{-4}N\big) + O\Big(\frac{TN}{\lambda^2}\Big) + \exp\Bigg(-\frac{a_2(\lambda/T)^2}{(N/T)\,\Sigma_e}\Bigg).
\]
Therefore, $\sum_{j=1}^p p_{3,j}(\lambda) = O\big(p(\lambda/T)^{-4}N\big) + O\big(\frac{pTN}{\lambda^2}\big) + p\exp\big(-\frac{a_2(\lambda/T)^2}{(N/T)\Sigma_e}\big)$. Given that $N/T\to c$ with $0 < c < \infty$, by taking
\[
\lambda_e = \frac{p^{1/4}TN^{1/4}}{\varepsilon^{1/4}} \vee \frac{\sqrt{pNT}}{\varepsilon^{1/2}} \vee \sqrt{NT\log(p/\gamma)\,\Sigma_e/a_2}
\]
for some $\varepsilon = o(1)$, we have $\sum_{j=1}^p p_{3,j}(\lambda_e) = o(1)$. Under REG(i), $\log(p/\gamma) = o\big(T^{1/6}/(\log T)^2\big)$ and $p = o\big(T^{7/6}/(\log T)^2\big)$, so $\lambda_e = O(\lambda)$ where $\lambda = 6c_1\frac{NT}{\sqrt{N\wedge T}}\Phi^{-1}\big(1-\frac{\gamma}{2p}\big)$. Therefore, we have shown $\sum_{j=1}^p p_{3,j}(\lambda)\to0$ for $\lambda = 6c_1\frac{NT}{\sqrt{N\wedge T}}\Phi^{-1}(1-\frac{\gamma}{2p})$.

Put together, we have shown that
\[
P\Bigg(\max_{j=1,\ldots,p}\bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\omega_j^{-1/2}f_{it,j}V_{it}\bigg| \le \frac{\lambda}{2c_1NT}\Bigg) \to 1. \tag{3A.1}
\]
Now we can apply Lemma 6 of Belloni et al. (2012) to obtain finite-sample bounds on $\|f_{it}(\widehat\zeta-\zeta_0)\|_{NT,2}$ and $|\omega^{1/2}(\widehat\zeta-\zeta_0)|_1$. Let $J_p^1$ be a subset of the index set $J_p = \{1,\ldots,p\}$ and $J_p^0 = J_p\setminus J_p^1$. Let $\delta$ be a generic vector of nuisance parameters, let $\delta^1$ be a copy of $\delta$ with its $j$-th element replaced by 0 for all $j\in J_p^0$, and similarly let $\delta^0$ be a copy of $\delta$ with its $j$-th element replaced by 0 for all $j\in J_p^1$.
Define the restricted eigenvalue and the Gram matrix as follows:
\[
K_C(M_f) = \min_{\delta:\,\|\delta^0\|_1\le C\|\delta^1\|_1,\ \|\delta\|\ne0,\ |J_p^1|\le s}\frac{\sqrt{s}\sqrt{\delta'M_f\delta}}{\|\delta^1\|_1}, \qquad M_f = \mathbb{E}_{NT}[f_{it}'f_{it}].
\]
Define the weighted restricted eigenvalue as
\[
K_C^\omega(M_f) = \min_{\delta:\,\|\omega^{1/2}\delta^0\|_1\le C\|\omega^{1/2}\delta^1\|_1,\ \|\delta\|\ne0,\ |J_p^1|\le s}\frac{\sqrt{s}\sqrt{\delta'M_f\delta}}{\|\omega^{1/2}\delta^1\|_1}.
\]
Let $a := \min_{j=1,\ldots,p}\omega_j^{1/2}$ and $b := \max_{j=1,\ldots,p}\omega_j^{1/2}$. As shown in Belloni et al. (2016),
\[
K_C^\omega(M_f) \ge \frac1b K_{bC/a}(M_f). \tag{3A.2}
\]
Denote $\Sigma_{a,j} = \big(E[a_{i,j}^2]\big)^{1/2}$ and $\Sigma_{g,j} = \big(\sum_{l=-\infty}^\infty E[g_{t,j}g_{t+l,j}]\big)^{1/2}$. By Lemma A.3 above, we have $\omega_j \xrightarrow{p} \frac{N\wedge T}{N}\Sigma_{a,j}^2 + \frac{N\wedge T}{T}\Sigma_{g,j}^2$. By Lemma A.1, $|\Sigma_{a,j}| < \infty$ and $|\Sigma_{g,j}| < \infty$. Assumption REG(iii) implies that $\min_{j\le p}\Sigma_{a,j}^2 > 0$. Therefore, $\omega_j$ is bounded away from zero and bounded from above for each $j = 1,\ldots,p$ with probability approaching one as $N, T\to\infty$.

Under Assumption ASM, condition (3.12), and (3A.1), Lemma 6 of Belloni et al. (2012) implies that
\[
\big\|f_{it}(\widehat\zeta-\zeta_0)\big\|_{NT,2} \le \Big(u+\frac{1}{c_1}\Big)\frac{\sqrt s\,\lambda}{NT\,K_{c_0}^\omega(M_f)} + 2\|r\|_{NT,2} = O_P\Bigg(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Bigg),
\]
\[
\big|\omega^{1/2}(\widehat\zeta-\zeta_0)\big|_1 \le \frac{3c_0\sqrt s}{K_{2c_0}^\omega(M_f)}\Bigg[\Big(u+\frac{1}{c_1}\Big)\frac{\sqrt s\,\lambda}{NT\,K_{c_0}^\omega(M_f)} + 2\|r\|_{NT,2}\Bigg] + 3c_0\frac{NT}{\lambda}\|r\|_{NT,2}^2 = O_P\Bigg(s\sqrt{\frac{\log(p/\gamma)}{N\wedge T}} + \frac{s}{\sqrt{(N\wedge T)\log(p/\gamma)}}\Bigg),
\]
where $c_0 := \frac{uc+1}{lc-1} > 1$. By (3A.2), we have $1/K_{c_0}^\omega(M_f) \le b/K_{\bar C}(M_f)$, where $\bar C := bc_0/a$. By the arguments given in Bickel et al. (2009), Assumption SE implies that $1/K_C(M_f) = O_P(1)$ for any $C > 0$. Therefore,
\[
\big\|f_{it}(\widehat\zeta-\zeta_0)\big\|_{NT,2} = O_P\Bigg(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Bigg), \qquad \big|\omega^{1/2}(\widehat\zeta-\zeta_0)\big|_1 = O_P\Bigg(s\sqrt{\frac{\log(p/\gamma)}{N\wedge T}}\Bigg).
\]
By Hölder's inequality and the fact that $\min_j\omega_j^{1/2} \ge a > 0$,
\[
\|\widehat\zeta-\zeta_0\|_1 \le \|\omega^{-1/2}\|_\infty\,\big|\omega^{1/2}(\widehat\zeta-\zeta_0)\big|_1 = O_P\Bigg(s\sqrt{\frac{\log(p/\gamma)}{N\wedge T}}\Bigg) = O_P\Bigg(s\sqrt{\frac{\log(p\vee NT)}{N\wedge T}}\Bigg).
\]
The $\ell_2$ rate of convergence will be derived after the sparsity bounds.

We now turn to the post-LASSO. By the finite-sample bounds of Lemma A.4, we have
\[
\|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} = \Bigg(\sqrt{\frac{s}{\phi_{\min}(s)(M_f)}} + \sqrt{\frac{\widehat m}{\phi_{\min}(\widehat m)(M_f)}}\Bigg)O_P\Big(\frac{\lambda}{NT}\Big) + O_P\big(\|f(X_{it}) - (P_{\widehat\Gamma}f)_{it}\|_{NT,2}\big). \tag{3A.3}
\]
By the finite-sample bounds of Lemma 7 of Belloni et al. (2012), we have
\[
\big\|\omega^{1/2}(\widehat\zeta_{PL}-\zeta_0)\big\|_1 \le \frac{b\sqrt{\widehat m+s}}{\sqrt{\phi_{\min}(\widehat m+s)(M_f)}}\,\big\|f_{it}(\widehat\zeta_{PL}-\zeta_0)\big\|_{NT,2}, \tag{3A.4}
\]
\[
\big\|f_{it}(\widehat\zeta_{PL}-\zeta_0)\big\|_{NT,2} \le \|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} + \|r_{it}\|_{NT,2}, \tag{3A.5}
\]
\[
\|f(X_{it}) - P_{\widehat\Gamma}f(X_{it})\|_{NT,2} \le \Big(u+\frac{1}{c_1}\Big)\frac{\sqrt s\,\lambda}{NT\,K_{c_0}^\omega(M_f)} + 3\|r_{it}\|_{NT,2}. \tag{3A.6}
\]
The finite-sample bound of Lemma 8 of Belloni et al. (2012) gives
\[
\widehat m \le \phi_{\max}(\widehat m)(M_f)\,a^{-2}\Bigg(\frac{2c_0\sqrt s}{K_{c_0}^\omega(M_f)} + \frac{6c_0NT\|r_{it}\|_{NT,2}}{\lambda}\Bigg)^2,
\]
where $a > 0$ has been shown previously. Let
\[
\mathcal M = \Bigg\{m\in\mathbb N : m > 2\phi_{\max}(m)(M_f)\,a^{-2}\Bigg(\frac{2c_0\sqrt s}{K_{c_0}^\omega(M_f)} + \frac{6c_0NT\|r_{it}\|_{NT,2}}{\lambda}\Bigg)^2\Bigg\}.
\]
Lemma 10 of Belloni et al. (2012) gives
\[
\widehat m \le \min_{m\in\mathcal M}\phi_{\max}(m\wedge NT)(M_f)\,a^{-2}\Bigg(\frac{2c_0\sqrt s}{K_{c_0}^\omega(M_f)} + \frac{6c_0NT\|r_{it}\|_{NT,2}}{\lambda}\Bigg)^2. \tag{3A.7}
\]
Note that $\frac{6c_0NT\|r_{it}\|_{NT,2}}{\sqrt s\,\lambda} = O_P\big(1/\log(p\wedge NT)\big)\xrightarrow{p}0$. Recall that $1/K_{c_0}^\omega(M_f) \le b/K_{\bar C}(M_f) < \infty$. Let
\[
\mu := \min_m\Big\{\sqrt{\phi_{\max}(m)(M_f)/\phi_{\min}(m)(M_f)}\ :\ m > 18\bar C^2s\,\phi_{\max}(m)(M_f)/K_{\bar C}^2(M_f)\Big\},
\]
and let $\bar m$ be the integer associated with $\mu$. By the definition of $\mathcal M$, $\bar m\in\mathcal M$ with probability approaching one, which implies $\bar m > \widehat m$ due to (3A.7). By Lemma 9 (the sub-linearity of sparse eigenvalues) of Belloni et al. (2012) and (3A.7), we have
\[
\widehat m \lesssim_P s\mu^2\phi_{\min}(\bar m+s)/K_{\bar C}^2 \lesssim s\mu^2\phi_{\min}(\widehat m+s)/K_{\bar C}^2.
\]
Combining the results above with (3A.3) and (3A.6) gives
\[
\|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} = O_P\Bigg(\sqrt{\frac{s\mu^2\log(p/\gamma)}{(N\wedge T)K_{\bar C}^2}} + \|r_{it}\|_{NT,2} + \frac{\sqrt s\,\lambda}{NT\,K_{c_0}^\omega(M_f)}\Bigg).
\]
Recall that $b < \infty$ and Condition SE imply $1/K_{c_0}^\omega(M_f) \le b/K_{\bar C}(M_f) < \infty$. Then Condition SE, Condition ASM, and the choice of $\lambda$ together imply
\[
\|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} = O_P\Bigg(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Bigg).
\]
For the $\ell_1$ convergence rate, note that $\|\widehat\zeta_{PL}-\zeta_0\|_0 \le \widehat m + s$. Applying the Cauchy-Schwarz inequality to $\|\widehat\zeta_{PL}-\zeta_0\|_1 = \sum_{j=1}^p|\widehat\zeta_{PL,j}-\zeta_{0,j}| = \sum_{j\in\widehat\Gamma\cup\Gamma_0}|\widehat\zeta_{PL,j}-\zeta_{0,j}|$ gives
\[
\|\widehat\zeta_{PL}-\zeta_0\|_1 \le \sqrt{\widehat m+s}\,\|\widehat\zeta_{PL}-\zeta_0\|_2.
\]
To derive the convergence rate in the $\ell_2$-norm of the post-LASSO estimator (the $\ell_2$ rate for the LASSO estimator is obtained similarly), we utilize the sparse eigenvalue condition and the prediction norm. If $\widehat\zeta_{PL}-\zeta_0 = 0$, the conclusion holds trivially. Otherwise, define $b = (\widehat\zeta_{PL}-\zeta_0)/\|\widehat\zeta_{PL}-\zeta_0\|_2$ and $\Delta(\widehat m+s) = \{\delta : \|\delta\|_0 = \widehat m+s,\ \|\delta\|_2 = 1\}$. Then $\|b\|_2 = 1$ and so $b\in\Delta(\widehat m+s)$. By Assumption SE, we have
\[
0 < \kappa_1 \le \phi_{\min}(\widehat m+s)(M_f) \le \frac{(b'M_fb)^{1/2}}{\|b\|_2} = \frac{\big\|f_{it}(\widehat\zeta_{PL}-\zeta_0)\big\|_{NT,2}}{\|\widehat\zeta_{PL}-\zeta_0\|_2}.
\]
Therefore, using the bound on the prediction norm above, we conclude that
\[
\|\widehat\zeta_{PL}-\zeta_0\|_2 \le \frac{\big\|f_{it}(\widehat\zeta_{PL}-\zeta_0)\big\|_{NT,2}}{\kappa_1} = O_P\Bigg(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Bigg).
\]
It implies that $\|\widehat\zeta_{PL}-\zeta_0\|_1 = \sqrt{\widehat m+s}\,O_P\Big(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Big) = O_P\Big(\sqrt{\frac{s^2\log(p/\gamma)}{N\wedge T}}\Big)$. □

APPENDIX 3B

PROOFS FOR CHAPTER 3.3

The following lemma, quoted from Semenova et al. (2023a) (Lemma A.3), follows from the weak form of Strassen's coupling (Strassen, 1965) and the strong form of Strassen's coupling via Lemma 2.11 of Dudley and Philipp (1983):

Lemma B.1 Let $(X, Y)$ be a random element taking values in a Polish space $S = S_1\times S_2$ with marginal laws $P_X$ and $P_Y$, respectively. Then we can construct $(\widetilde X, \widetilde Y)$ taking values in $(S_1, S_2)$ such that (i) $\widetilde X$ and $\widetilde Y$ are independent of each other; (ii) their laws satisfy $\mathcal L(\widetilde X) = P_X$ and $\mathcal L(\widetilde Y) = P_Y$; and (iii)
\[
P\big\{(X, Y)\ne(\widetilde X, \widetilde Y)\big\} = \tfrac12\big\|P_{X,Y} - P_X\times P_Y\big\|_{TV}.
\]
The proof is provided in Semenova et al. (2023b). To apply the independence coupling result to cross-fitting in panel data, we introduce another lemma:

Lemma B.2 Let $X_1,\ldots,X_m$ and $Y$ be random elements taking values in a Polish space $S = S_1\times\cdots\times S_m\times S_y$. Then
\[
\beta\big((X_1,\ldots,X_m),\ Y\big) \le \sum_{i=1}^m\beta(X_i, Y).
\]
Proof of Lemma B.2 By Lemma B.1, we have
\[
\beta\big((X_1,\ldots,X_m),Y\big) = \tfrac12\big\|P_{(X_1,\ldots,X_m),Y} - P_{(X_1,\ldots,X_m)}\times P_Y\big\|_{TV} = P\big((X_1,\ldots,X_m,Y)\ne(\widetilde X_1,\ldots,\widetilde X_m,\widetilde Y)\big) \le \sum_{i=1}^m P\big((X_i,Y)\ne(\widetilde X_i,\widetilde Y)\big) = \sum_{i=1}^m\beta(X_i,Y),
\]
where the inequality follows from the union bound. □

Now we can prove Lemma 3.1 from the main body of the chapter.

Proof of Lemma 3.1 By Lemma B.1, for each $(k,l)$ we have
\[
P\big\{(W(k,l), W(-k,-l))\ne(\widetilde W(k,l),\widetilde W(-k,-l))\big\} = \beta\big(W(k,l), W(-k,-l)\big) = \beta\Bigg(\{W_{it}\}_{i\in I_k,\,t\in S_l},\ \bigcup_{k'\ne k,\ l'\ne l,l\pm1}\{W_{it}\}_{i\in I_{k'},\,t\in S_{l'}}\Bigg)
\]
\[
\le \sum_{i\in I_k}\beta\Bigg(\{W_{it}\}_{t\in S_l},\ \bigcup_{k'\ne k,\ l'\ne l,l\pm1}\{W_{it}\}_{i\in I_{k'},\,t\in S_{l'}}\Bigg) \le \sum_{k'\ne k,\ l'\ne l,l\pm1}\ \sum_{j\in I_{k'}}\ \sum_{i\in I_k}\beta\big(\{W_{it}\}_{t\in S_l},\ \{W_{jt}\}_{t\in S_{l'}}\big),
\]
where the last two inequalities follow from Lemma B.2. Note that for $s, m \ge 1$ we have
\[
\beta\big(\{W_{it}\}_{t\le s},\ \{W_{jt}\}_{t\ge s+m}\big) = \big\|P_{\{W_{it}\}_{t\le s},\{W_{jt}\}_{t\ge s+m}} - P_{\{W_{it}\}_{t\le s}}\times P_{\{W_{jt}\}_{t\ge s+m}}\big\|_{TV} \le \sup_{A\in\sigma(\{W_{jt}\}_{t\ge s+m})}E_P\big|P\big(A\,\big|\,\sigma(\{W_{it}\}_{t\le s})\big) - P(A)\big|
\]
\[
= \sup_{A\in\sigma(\{W_{jt}\}_{t\ge s+m})}E_P\Big|P\Big(P\big(A\,\big|\,\sigma(\alpha_i,\{\gamma_t\}_{t\le s},\{\varepsilon_{it}\}_{t\le s})\big)\,\Big|\,\sigma(\{W_{it}\}_{t\le s})\Big) - P(A)\Big| = \sup_{A\in\sigma(\{W_{jt}\}_{t\ge s+m})}E_P\big|P\big(A\,\big|\,\sigma(\{\gamma_t\}_{t\le s})\big) - P(A)\big|
\]
\[
= \sup_{A\in\sigma(\{\gamma_t\}_{t\ge s+m})}E_P\big|P\big(A\,\big|\,\sigma(\{\gamma_t\}_{t\le s})\big) - P(A)\big| \le c_\kappa\exp(-\kappa m),
\]
where the last inequality follows from Assumption 3.2. Therefore,
\[
P\big\{(W(k,l),W(-k,-l))\ne(\widetilde W(k,l),\widetilde W(-k,-l))\big\} \le KLN^2c_\kappa\exp(-\kappa T_l),
\]
which in turn gives
\[
P\big\{(W(k,l),W(-k,-l))\ne(\widetilde W(k,l),\widetilde W(-k,-l))\ \text{for some}\ (k,l)\big\} \le K^2L^2N^2c_\kappa\exp(-\kappa T_l),
\]
where $T_l = T/L$. Given that $\log(N)/T = o(1)$ and $(K, L)$ are finite, it follows that
\[
P\big\{(W(k,l),W(-k,-l))\ne(\widetilde W(k,l),\widetilde W(-k,-l))\ \text{for some}\ (k,l)\big\} = o(1). \ \square
\]

Proof of Theorem 3.2 By Assumption DML2(i), $\widehat\eta_{kl}\in\mathcal T_{NT}$ with probability $1-\Delta_{NT}$, so
\[
P\big(\widehat\eta_{kl}\in\mathcal T_{NT},\ \forall(k,l)\big) \ge 1 - KL\Delta_{NT} = 1 - o(1).
\]
Denote the event $\{\widehat\eta_{kl}\in\mathcal T_{NT},\ \forall(k,l)\}$ by $\mathcal E_\eta$ and the event $\{(W(k,l),W(-k,-l)) = (\widetilde W(k,l),\widetilde W(-k,-l))\ \text{for all}\ (k,l)\}$ by $\mathcal E_{cp}$. By Lemma 3.1, we have $P(\mathcal E_{cp}) = 1-o(1)$. By the union bound inequality, we have $P(\mathcal E_\eta^c\cup\mathcal E_{cp}^c) \le P(\mathcal E_\eta^c) + P(\mathcal E_{cp}^c) = o(1)$. So $P(\mathcal E_\eta\cap\mathcal E_{cp}) = 1 - P(\mathcal E_\eta^c\cup\mathcal E_{cp}^c) \ge 1-o(1)$.

Let $\widehat\theta$ be a solution of equation (3.13). To simplify the notation, we denote
\[
\widehat A_{kl} = \mathbb E_{kl}[\psi^a(W_{it},\widehat\eta_{kl})], \quad \widehat{\bar A} = \frac{1}{KL}\sum_{k=1}^K\sum_{l=1}^L\widehat A_{kl}, \quad A_0 = E_P[\psi^a(W_{it};\eta_0)],
\]
\[
\widehat B_{kl} = \mathbb E_{kl}[\psi^b(W_{it},\widehat\eta_{kl})], \quad \widehat{\bar B} = \frac{1}{KL}\sum_{k=1}^K\sum_{l=1}^L\widehat B_{kl}, \quad B_0 = E_P[\psi^b(W_{it};\eta_0)],
\]
\[
\widehat{\bar\psi}(\theta) = \widehat{\bar A}\theta + \widehat{\bar B}, \qquad \bar\psi(\theta,\eta) = \mathbb E_{NT}\,\psi(W_{it};\theta,\eta).
\]
Claim B.1. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\|\widehat{\bar A} - A_0\| = O_P(N^{-1/2} + r_{NT})$.

By Claim B.1 and Assumption 3(iii), which requires all singular values of $A_0$ to be bounded away from zero, it follows that all singular values of $\widehat{\bar A}$ are also bounded away from zero on the event $\mathcal E_\eta$. Then, by the linearity in Assumption 3(i), we can write $\widehat\theta = -\widehat{\bar A}^{-1}\widehat{\bar B}$ and $\theta_0 = -A_0^{-1}B_0$. By basic algebra, we have
\[
\sqrt N(\widehat\theta-\theta_0) = \sqrt N\big(-\widehat{\bar A}^{-1}\widehat{\bar B} - \theta_0\big) = -\sqrt N\,\widehat{\bar A}^{-1}\big(\widehat{\bar B} + \widehat{\bar A}\theta_0\big) = -\sqrt N\,\widehat{\bar A}^{-1}\widehat{\bar\psi}(\theta_0)
\]
\[
= -\sqrt N A_0^{-1}\bar\psi(\theta_0,\eta_0) - \sqrt N A_0^{-1}\big(\widehat{\bar\psi}(\theta_0)-\bar\psi(\theta_0,\eta_0)\big) - \sqrt N\Big[\big(A_0+(\widehat{\bar A}-A_0)\big)^{-1}-A_0^{-1}\Big]\big(\bar\psi(\theta_0,\eta_0)+\widehat{\bar\psi}(\theta_0)-\bar\psi(\theta_0,\eta_0)\big).
\]
Claim B.2. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\|\widehat{\bar\psi}(\theta_0) - \bar\psi(\theta_0,\eta_0)\| = O_P\big(r_{NT}'/\sqrt N + \lambda_{NT} + \lambda_{NT}'\big)$.
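As a complement to the algebra above, the following is a schematic sketch (not the chapter's code) of how the cross-fitted quantities $\widehat A_{kl}$ and $\widehat B_{kl}$ and the point estimate $\widehat\theta = -\widehat{\bar A}^{-1}\widehat{\bar B}$ could be assembled for a scalar parameter. The functions `psi_a`, `psi_b`, and `fit_nuisance` are hypothetical placeholders for the score components and the nuisance estimator; the partition into $(I_k, S_l)$ blocks, with the time blocks $S_l$ and $S_{l\pm1}$ removed from the auxiliary sample, follows the coupling argument of Lemma 3.1.

```python
# Schematic sketch of the cross-fitted linear-score estimator for scalar theta;
# psi_a, psi_b, and fit_nuisance are hypothetical placeholders. The auxiliary
# sample W(-k,-l) drops fold I_k and the time blocks S_l and S_{l+-1}.
import numpy as np

def panel_dml(W, K, L, psi_a, psi_b, fit_nuisance):
    """W has shape (N, T, ...); assumes L >= 4 so W(-k,-l) is non-empty."""
    N, T = W.shape[0], W.shape[1]
    I = np.array_split(np.arange(N), K)      # cross-sectional folds I_k
    S = np.array_split(np.arange(T), L)      # consecutive time blocks S_l
    A_sum = B_sum = 0.0
    for k in range(K):
        for l in range(L):
            keep_i = np.concatenate([I[kk] for kk in range(K) if kk != k])
            keep_t = np.concatenate([S[ll] for ll in range(L)
                                     if ll not in (l - 1, l, l + 1)])
            eta_kl = fit_nuisance(W[np.ix_(keep_i, keep_t)])  # auxiliary sample
            block = W[np.ix_(I[k], S[l])]                      # main block (k,l)
            A_sum += psi_a(block, eta_kl).mean()               # A_hat_{kl}
            B_sum += psi_b(block, eta_kl).mean()               # B_hat_{kl}
    A_bar, B_bar = A_sum / (K * L), B_sum / (K * L)
    return -B_bar / A_bar                                      # theta_hat
```

Dropping the neighboring time blocks $S_{l\pm1}$ acts as a buffer that weakens the serial dependence between the main block and the auxiliary sample, which is exactly what makes the coupling probability in Lemma 3.1 vanish at rate $\exp(-\kappa T_l)$.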
By Assumption DML2(i) and Jensen's inequality, we have $\|A_0\| \le m_{NT}' \le c_m$. Then Claim B.2 implies that
\[
\big\|\sqrt N A_0^{-1}\big(\widehat{\bar\psi}(\theta_0)-\bar\psi(\theta_0,\eta_0)\big)\big\| = O_P(1)\,O_P\big(\sqrt N\,r_{NT}'/\sqrt N + \sqrt N\lambda_{NT} + \sqrt N\lambda_{NT}'\big) = O_P\big(r_{NT}' + \sqrt N\lambda_{NT} + \sqrt N\lambda_{NT}'\big).
\]
Since $E[\bar\psi(\theta_0,\eta_0)] = 0$, Lemma A.2 yields $\sqrt N\,\bar\psi(\theta_0,\eta_0)\xrightarrow{d}\mathcal N(0,\Omega)$, where $\Omega = \Sigma_a + c\Sigma_g$ and $\|\Omega\| < \infty$. By Claims B.1 and B.2 and the asymptotic normality of $\sqrt N\,\bar\psi(\theta_0,\eta_0)$, we have
\[
\Big\|\sqrt N\Big[\big(A_0+(\widehat{\bar A}-A_0)\big)^{-1}-A_0^{-1}\Big]\big(\bar\psi(\theta_0,\eta_0)+\widehat{\bar\psi}(\theta_0)-\bar\psi(\theta_0,\eta_0)\big)\Big\|
\le \big\|\widehat{\bar A}^{-1}\big\|\,\big\|\widehat{\bar A}-A_0\big\|\,\big\|A_0^{-1}\big\|\,\big\|\sqrt N\big(\bar\psi(\theta_0,\eta_0)+\widehat{\bar\psi}(\theta_0)-\bar\psi(\theta_0,\eta_0)\big)\big\|
\]
\[
= O_P(1)\,O_P\big(N^{-1/2}+r_{NT}\big)\,O_P(1)\Big(O_P(1) + O_P\big(r_{NT}'+\sqrt N\lambda_{NT}+\sqrt N\lambda_{NT}'\big)\Big) = O_P\big(N^{-1/2}+r_{NT}\big),
\]
and therefore
\[
\sqrt N\big(\widehat\theta-\theta_0\big) = -A_0^{-1}\sqrt N\,\bar\psi(\theta_0,\eta_0) + O_P\big(N^{-1/2}+r_{NT}+r_{NT}'+\sqrt N\lambda_{NT}+\sqrt N\lambda_{NT}'\big) \xrightarrow{d} A_0^{-1}\mathcal N(0,\Omega).
\]

Proof of Claim B.1. Fix any $(k,l)$. We have
\[
\big\|\widehat A_{kl}-A_0\big\| \le \big\|\widehat A_{kl} - E_P[\widehat A_{kl}|W(-k,-l)]\big\| + \big\|E_P[\widehat A_{kl}|W(-k,-l)] - A_0\big\| =: \|\Delta_{A,1}\| + \|\Delta_{A,2}\|.
\]
On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, we have $\widehat\eta_{kl}\in\mathcal T_{NT}$ and independence between $W(-k,-l)$ and $W(k,l)$. So, by Assumption DML2, we have $\|\Delta_{A,2}\| \le r_{NT}$. By iterated expectations, $E_P[\Delta_{A,1}] = 0$. To simplify the notation, we denote $\ddot\psi_{it}^{a,kl} := \psi^a(W_{it},\widehat\eta_{kl}) - E_P[\psi^a(W_{it},\widehat\eta_{kl})|W(-k,-l)]$. Consider $\|\Delta_{A,1}\|$:
\[
(N_kT_l)^2\,E\big(\|\Delta_{A,1}\|^2\,\big|\,W(-k,-l)\big) = E_P\Bigg[\bigg\|\sum_{i\in I_k,\,t\in S_l}\ddot\psi_{it}^{a,kl}\bigg\|^2\,\Bigg|\,W(-k,-l)\Bigg]
\]
\[
\le \sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{ir}^{a,kl}\rangle|W(-k,-l)\big]\Big| + \sum_{t\in S_l,\,i\in I_k,\,j\in I_k}\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{jt}^{a,kl}\rangle|W(-k,-l)\big]\Big| + \sum_{t\in S_l,\,i\in I_k}\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{it}^{a,kl}\rangle|W(-k,-l)\big]\Big|
\]
\[
+ 2\sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i,j\in I_k}\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{j,t+m}^{a,kl}\rangle|W(-k,-l)\big]\Big| + 2\sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i\in I_k}\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{i,t+m}^{a,kl}\rangle|W(-k,-l)\big]\Big|
\]
\[
=: a(1) + a(2) + a(3) + 2a(4) + 2a(5).
\]
By the conditional Cauchy-Schwarz inequality, for any $i, t, j, s$, we have
\[
\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{js}^{a,kl}\rangle|W(-k,-l)\big]\Big| \le \Big(E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big]\,E_P\big[\|\ddot\psi_{js}^{a,kl}\|^2|W(-k,-l)\big]\Big)^{1/2} = E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big].
\]
Therefore, we have
\[
a(1) \le N_kT_l^2\,E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big], \quad a(2) \le N_k^2T_l\,E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big],
\]
\[
a(3) \le N_kT_l\,E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big], \quad a(5) \le N_kT_l^2\,E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big].
\]
On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, we have, for $i\in I_k$, $t\in S_l$,
\[
\Big(E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big]\Big)^{1/2} \lesssim \Big(E_P\big[\|\psi^a(W_{it},\widehat\eta_{kl})\|^2|W(-k,-l)\big]\Big)^{1/2} < \infty,
\]
where the first inequality follows from expanding the term and applying Jensen's inequality, and the second from Assumption DML2(i). Let $D$ denote the dimension of $\psi^a(W,\eta)$. Then we have
\[
a(4) = a(5) + \sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i,j\in I_k,\,i\ne j}\sum_{d=1}^D E_P\big[\ddot\psi_{d,i,t}^{a,kl}\,\ddot\psi_{d,j,t+m}^{a,kl}\,\big|\,W(-k,-l)\big].
\]
For each $i\in I_k$, $t\in S_l$, we can decompose $\ddot\psi_{d,i,t}^{a,kl} = a_i^{kl} + g_t^{kl} + e_{it}^{kl}$, where $a_i^{kl} = E[\ddot\psi_{d,i,t}^{a,kl}|\alpha_i]$, $g_t^{kl} = E[\ddot\psi_{d,i,t}^{a,kl}|\gamma_t]$, and $e_{it}^{kl} = \ddot\psi_{d,i,t}^{a,kl} - a_i^{kl} - g_t^{kl}$. Conditional on $W(-k,-l)$, $(a_i^{kl}, g_t^{kl}, e_{it}^{kl})$ are mutually uncorrelated, $a_i^{kl}\perp\!\!\!\perp a_j^{kl}$ for $i\ne j$, and $g_t^{kl}$ is also $\beta$-mixing with $\beta_g(m)\le\beta_\gamma(m)$. Therefore, we have
\[
E_P\big[\ddot\psi_{d,i,t}^{a,kl}\,\ddot\psi_{d,j,t+m}^{a,kl}\,\big|\,W(-k,-l)\big] = E_P\big[g_t^{kl}g_{t+m}^{kl} + e_{it}^{kl}e_{j,t+m}^{kl}\,\big|\,W(-k,-l)\big]
= E_P\big[g_t^{kl}g_{t+m}^{kl}|W(-k,-l)\big] + E_P\Big[E_P\big[e_{it}^{kl}e_{j,t+m}^{kl}\,\big|\,\alpha_i,\alpha_j,W(-k,-l)\big]\,\Big|\,W(-k,-l)\Big].
\]
Note that $\beta$-mixing of $\gamma_t$ implies $\alpha$-mixing with mixing coefficient $\alpha_\gamma(m)\le\beta_\gamma(m)$ for all $m\in\mathbb Z_+$, and, conditional on $W(-k,-l)$ and $\alpha_i$, $e_{it}^{kl}$ is also $\alpha$-mixing with mixing coefficient not larger than $\alpha_\gamma(m)$ by Theorem 14.12 of Hansen (2022). Then we have
\[
E_P\Big[\big|E_P\big[e_{it}^{kl}e_{j,t+m}^{kl}\,\big|\,\alpha_i,\alpha_j,W(-k,-l)\big]\big|\,\Big|\,W(-k,-l)\Big]
\lesssim 8\,\alpha_\gamma(m)^{1-2/q}\Big(E_P\big[|\ddot\psi_{d,i,t}^{a,kl}|^q|W(-k,-l)\big]\Big)^{1/q}\Big(E_P\big[|\ddot\psi_{d,j,t+m}^{a,kl}|^q|W(-k,-l)\big]\Big)^{1/q} \lesssim 32\,\alpha_\gamma(m)^{1-2/q}c_m^2,
\]
where the first inequality follows from Jensen's inequality together with the fact that $E[e_{it}^{kl}|\alpha_i, W(-k,-l)] = 0$ and Theorem 14.13(ii) of Hansen (2022), and the last inequality follows from the moment conditions in Assumption DML2 and the independence of $W(-k,-l)$ and $W(k,l)$ on $\mathcal E_{cp}$. Similarly,
\[
\big|E_P\big[g_t^{kl}g_{t+m}^{kl}|W(-k,-l)\big]\big| \lesssim \alpha_\gamma(m)^{1-2/q}c_m^2.
\]
Then we have
\[
\frac{1}{N_k^2T_l}\sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i,j\in I_k,\,i\ne j}\sum_{d=1}^D E_P\big[\ddot\psi_{d,i,t}^{a,kl}\ddot\psi_{d,j,t+m}^{a,kl}|W(-k,-l)\big] \lesssim c_m^2D\sum_{m=1}^\infty\alpha_\gamma(m)^{1-2/q} \le c_m^2D\sum_{m=1}^\infty c_\kappa\exp(-\kappa m)^{1-2/q} \le \frac{c_m^2Dc_\kappa}{\exp(\kappa(1-2/q))-1} < \infty,
\]
where the last inequality follows from the geometric sum. Thus, as $(N_k, T_l)\to\infty$, we have
\[
E\big(\|\Delta_{A,1}\|^2|W(-k,-l)\big) = \Big(\frac{1}{N_kT_l}\Big)^2\big[a(1)+a(2)+a(3)+2a(4)+2a(5)\big] = O_P(1/T_l) = O_P(1/N),
\]
where the last step follows from the fact that $L$ is constant and $N/T\to c$ as $N, T\to\infty$.
By Markov's inequality, we conclude that, conditional on $W(-k,-l)$, $\|\Delta_{A,1}\| = O_P(1/\sqrt N)$. By Lemma 6.1 of Chernozhukov et al. (2018a), conditional convergence implies unconditional convergence, so $\|\Delta_{A,1}\| = O_P(1/\sqrt N)$. To summarize, we have $\|\widehat A_{kl}-A_0\| = O_P(N^{-1/2}+r_{NT})$, which implies $\|\widehat{\bar A}-A_0\| = O_P(N^{-1/2}+r_{NT})$.

Proof of Claim B.2. Since $K$ and $L$ are finite, it suffices to show, for any $(k,l)$,
\[
\big\|\mathbb E_{kl}\big[\psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0)\big]\big\| = O_P\big(r_{NT}'/\sqrt{N_k} + \lambda_{NT} + \lambda_{NT}'\big).
\]
To simplify the notation, we denote
\[
\ddot\psi_{it}^{kl} = \psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0), \qquad \widetilde{\ddot\psi}_{it}^{kl} = \ddot\psi_{it}^{kl} - E_P[\ddot\psi_{it}^{kl}|W(-k,-l)],
\]
\[
b(1) = \frac{\sqrt{N_k}}{N_kT_l}\bigg\|\sum_{i\in I_k,\,t\in S_l}\big[\ddot\psi_{it}^{kl} - E_P[\ddot\psi_{it}^{kl}|W(-k,-l)]\big]\bigg\|, \qquad b(2) = \big\|E_P[\psi(W_{it};\theta_0,\widehat\eta_{kl})|W(-k,-l)] - E_P[\psi(W_{it};\theta_0,\eta_0)]\big\|.
\]
We also denote $\widetilde{\ddot\psi}_{d,it}$ as each element of the vector $\widetilde{\ddot\psi}_{it}^{kl}$ for $d = 1,\ldots,D$, suppressing the subscripts $k, l$ for convenience. By the triangle inequality, we have
\[
\big\|\mathbb E_{kl}[\psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0)]\big\| \le b(1)/\sqrt{N_k} + b(2).
\]
To bound $b(1)$, first note that it is mean zero by the iterated expectation argument. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, we have
\[
E_P[b(1)^2|W(-k,-l)] \le \frac{1}{N_kT_l^2}\Bigg\{\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{ir}^{kl}\rangle|W(-k,-l)\big]\Big| + \sum_{t\in S_l,\,i\in I_k,\,j\in I_k}\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{jt}^{kl}\rangle|W(-k,-l)\big]\Big| + \sum_{t\in S_l,\,i\in I_k}\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{it}^{kl}\rangle|W(-k,-l)\big]\Big|
\]
\[
+ 2\sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i,j\in I_k}\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{j,t+m}^{kl}\rangle|W(-k,-l)\big]\Big| + 2\sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i\in I_k}\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{i,t+m}^{kl}\rangle|W(-k,-l)\big]\Big|\Bigg\}
=: c(1)+c(2)+c(3)+2c(4)+2c(5).
\]
By the conditional Cauchy-Schwarz inequality, for any $i, t, j, s$, we have
\[
\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{js}^{kl}\rangle|W(-k,-l)\big]\Big| \le \Big(E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big]\,E_P\big[\|\widetilde{\ddot\psi}_{js}^{kl}\|^2|W(-k,-l)\big]\Big)^{1/2}.
\]
Applying Minkowski's inequality and Jensen's inequality on the event $\mathcal E_\eta\cap\mathcal E_{cp}$, we have, for $i\in I_k$, $t\in S_l$,
\[
\Big(E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big]\Big)^{1/2} \le \Big(E_P\big[\|\ddot\psi_{it}^{kl}\|^2|W(-k,-l)\big]\Big)^{1/2} + \Big(E_P\big[\|E_P[\ddot\psi_{it}^{kl}|W(-k,-l)]\|^2\,\big|\,W(-k,-l)\big]\Big)^{1/2} \le 2\Big(E_P\big[\|\ddot\psi_{it}^{kl}\|^2|W(-k,-l)\big]\Big)^{1/2} \le 2r_{NT}'.
\]
Therefore, we have
\[
c(1) \le E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big] = O\big((r_{NT}')^2\big), \quad c(2) \le c\,E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big] = O\big((r_{NT}')^2\big),
\]
\[
c(3) \le \frac{1}{N_k}E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big] = O\big((r_{NT}')^2/N\big), \quad c(5) \le E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big] = O\big((r_{NT}')^2\big).
\]
Following similar arguments as for bounding $a(4)$, $c(4)$ is of order $O\big((r_{NT}')^2\big)$. So we have shown $E_P[b(1)^2|W(-k,-l)] = O_P\big((r_{NT}')^2\big)$, which implies $b(1) = O_P(r_{NT}')$ by Markov's inequality and Lemma 6.1 of Chernozhukov et al. (2018a).

To bound $b(2)$, we first define
\[
f_{kl}(r) := E_P\big[\psi\big(W_{it};\theta_0,\eta_0 + r(\widehat\eta_{kl}-\eta_0)\big)\,\big|\,W(-k,-l)\big] - E_P[\psi(W_{it};\theta_0,\eta_0)], \qquad r\in[0,1],
\]
for some $i\in I_k$, $t\in S_l$, so that $b(2) = \|f_{kl}(1)\|$. Expanding $f_{kl}(r)$ around 0 using the mean value theorem and evaluating at $r = 1$, we have
\[
f_{kl}(1) = f_{kl}(0) + f_{kl}'(0) + f_{kl}''(\widetilde r)/2, \qquad \widetilde r\in(0,1).
\]
We note that $f_{kl}(0) = 0$ on the event $\mathcal E_{cp}$. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$ and under Assumption DML1(ii) (near-orthogonality), we have $\|f_{kl}'(0)\| \le \lambda_{NT}$ and $\|f_{kl}''(\widetilde r)\| \le \lambda_{NT}'$. Therefore, we have shown that $b(2) = O_P(\lambda_{NT}) + O_P(\lambda_{NT}')$. Combining the bounds for $b(1)$ and $b(2)$ completes the proof of Claim B.2. □

Proof of Theorem 3.3 By the same arguments as for Theorem 3.2, we have
\[
P(\mathcal E_\eta\cap\mathcal E_{cp}) = 1 - P(\mathcal E_\eta^c\cup\mathcal E_{cp}^c) \ge 1-o(1).
\]
By Claim B.1, we have $\|\widehat{\bar A}-A_0\| = O_P(N^{-1/2}+r_{NT})$ on the event $\mathcal E_\eta\cap\mathcal E_{cp}$. Therefore, since $\|A_0^{-1}\| \le a_0^{-1}$ is ensured by Assumption DML1(iv) and $\|\Omega\| < \infty$ as shown in the proof of Theorem 3.2, it suffices to show $\|\widehat\Omega_{CHS}-\Omega\| = o_P(1)$.
Furthermore, since $K$ and $L$ are fixed constants, it suffices to show for each $(k,l)$ that $\|\widehat\Omega_{CHS,kl}-\Omega\| = o_P(1)$, where
\[
\widehat\Omega_{CHS,kl} := \widehat\Omega_{a,kl} + \widehat\Omega_{b,kl} - \widehat\Omega_{c,kl} + \widehat\Omega_{d,kl} + \widehat\Omega_{d,kl}',
\]
\[
\widehat\Omega_{a,kl} := \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\psi(W_{ir};\widehat\theta,\widehat\eta_{kl})',
\]
\[
\widehat\Omega_{b,kl} := \frac{K/L}{N_kT_l^2}\sum_{t\in S_l,\,i\in I_k,\,j\in I_k}\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\psi(W_{jt};\widehat\theta,\widehat\eta_{kl})',
\]
\[
\widehat\Omega_{c,kl} := \frac{K/L}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l}\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\psi(W_{it};\widehat\theta,\widehat\eta_{kl})',
\]
\[
\widehat\Omega_{d,kl} := \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}k\Big(\frac{m}{M}\Big)\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i,j\in I_k,\,j\ne i}\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\psi(W_{j,t+m};\widehat\theta,\widehat\eta_{kl})'.
\]
Since a sequence of symmetric matrices $\Omega_n$ converges to a symmetric matrix $\Omega_0$ if and only if $e'\Omega_ne\to e'\Omega_0e$ for all conformable $e$, we may assume without loss of generality that $\psi$ is scalar. To simplify the expressions, we denote
\[
\psi_{it}^{(0)} = \psi(W_{it};\theta_0,\eta_0), \qquad \widehat\psi_{it}^{(kl)} = \psi(W_{it};\widehat\theta,\widehat\eta_{kl}).
\]
Claim B.3. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\big|\widehat\Omega_{a,kl}-\Sigma_a\big| = O_P\big(N^{-1/2}+r_{NT}'\big)$.

Claim B.4. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\big|\widehat\Omega_{b,kl}-cE_P[g_tg_t']\big| = O_P\big(N^{-1/2}+r_{NT}'\big)$.

Claim B.5. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\big|\widehat\Omega_{c,kl}\big| = O_P(T^{-1})$.

Claim B.6. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\big|\widehat\Omega_{d,kl}-c\sum_{m=1}^\infty E_P[g_tg_{t+m}]\big| = o_P(1)$.

The decomposition techniques used in the proofs of Claims B.3, B.4, and B.6 follow the proofs of Lemmas 1 and 2 in Appendix E of Chiang et al. (2024). Combining Claims B.3-B.6 completes the proof of Theorem 3.3.

Proof of Claim B.3. By the triangle inequality, we have
\[
\big|\widehat\Omega_{a,kl}-\Sigma_a\big| \le \big|I_{a,1}^{(kl)}\big| + \big|I_{a,2}^{(kl)}\big| + \big|I_{a,3}^{(kl)}\big|,
\]
where
\[
I_{a,1}^{(kl)} := \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big\{\widehat\psi_{it}^{(kl)}\widehat\psi_{ir}^{(kl)} - \psi_{it}^{(0)}\psi_{ir}^{(0)}\Big\}, \qquad
I_{a,2}^{(kl)} := \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big\{\psi_{it}^{(0)}\psi_{ir}^{(0)} - E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}]\Big\},
\]
\[
I_{a,3}^{(kl)} := \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}] - E_P[a_ia_i].
\]
By the law of total covariance and the mean-zero property of $\psi_{it}^{(0)}$, we have
\[
E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}\big] = E_P\big[\mathrm{cov}(\psi_{it}^{(0)},\psi_{ir}^{(0)}|\alpha_i)\big] + E_P\big(E_P[\psi_{it}^{(0)}|\alpha_i]\,E_P[\psi_{ir}^{(0)}|\alpha_i]\big).
\]
Due to the identical distribution of $\alpha_i$ and the mean-zero property, we have
\[
\frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}] = \frac{1}{T_l^2}\sum_{t\in S_l,\,r\in S_l}\Big\{E_P\big[\mathrm{cov}(\psi_{it}^{(0)},\psi_{ir}^{(0)}|\alpha_i)\big] + E_P\big(E_P[\psi_{it}^{(0)}|\alpha_i]\,E_P[\psi_{ir}^{(0)}|\alpha_i]\big)\Big\},
\]
and the second term equals $E_P[a_ia_i]$. Conditional on $\alpha_i$, $\{\psi_{it}^{(0)}\}_{t\ge1}$ is $\beta$-mixing with the same mixing coefficient as $\gamma_t$. Therefore, we can apply Theorem 14.13(ii) of Hansen (2022) and Jensen's inequality:
\[
\big|E_P\big[\mathrm{cov}(\psi_{it}^{(0)},\psi_{ir}^{(0)}|\alpha_i)\big]\big| \le 8\big(E_P|\psi_{it}^{(0)}|^q\big)^{2/q}\beta_\gamma(|t-r|)^{1-2/q}.
\]
Note that $\frac{1}{T_l^2}\sum_{t\in S_l,\,r\in S_l}\beta_\gamma(|t-r|)^{1-2/q} = O(1/T_l)$ under Assumption 3.2. So $I_{a,3}^{(kl)} = O(1/T_l) = O(T^{-1})$.

To bound $I_{a,2}^{(kl)}$, we can rewrite it by the triangle inequality as
\[
\big|I_{a,2}^{(kl)}\big| \le \bigg|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\bigg| + \bigg|\frac{1}{N_k}\sum_{i\in I_k}\widetilde I_{a,2,i}^{(kl)}\bigg|,
\]
where
\[
I_{a,2,i}^{(kl)} := \frac{1}{T_l^2}\sum_{t,r\in S_l}\Big\{\psi_{it}^{(0)}\psi_{ir}^{(0)} - E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}|\{\gamma_t\}_{t\in S_l}\big]\Big\}, \qquad
\widetilde I_{a,2,i}^{(kl)} := \frac{1}{T_l^2}\sum_{t,r\in S_l}\Big\{E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}|\{\gamma_t\}_{t\in S_l}\big] - E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}]\Big\}.
\]
Due to the identical distribution of $\alpha_i$, $\widetilde I_{a,2,i}^{(kl)}$ does not vary over $i$, so that $E_P\big|\frac{1}{N_k}\sum_{i\in I_k}\widetilde I_{a,2,i}^{(kl)}\big|^2 = E_P\big|\widetilde I_{a,2,i}^{(kl)}\big|^2$. Denote $h_i(\gamma_t,\gamma_r) = E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}|\gamma_t,\gamma_r] - E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}]$. By direct calculation, we have
\[
E_P\big|\widetilde I_{a,2,i}^{(kl)}\big|^2 = \frac{1}{T_l^4}\sum_{t,r,t',r'\in S_l}E_P\big[h_i(\gamma_t,\gamma_r)h_i(\gamma_{t'},\gamma_{r'})\big].
\]
To bound the right-hand side, we can apply Lemma 3.4 of Dehling and Wendler (2010) by verifying the following two conditions:
\[
E_P|h_i(\gamma_t,\gamma_r)|^{2+\delta} < \infty, \tag{3B.1}
\]
\[
\int\!\!\int|h_i(u,v)|^{2+\delta}\,dF(u)\,dF(v) < \infty, \tag{3B.2}
\]
for some $\delta > 0$, where $F(\cdot)$ is the common CDF of $\gamma_t$. Consider condition (3B.1).
By Minkowski's inequality, Jensen's inequality, and the law of iterated expectations, we have
\[
\big(E_P|h_i(\gamma_t,\gamma_r)|^{2+\delta}\big)^{\frac{1}{2+\delta}} \le \Big(E_P\big|\psi_{it}^{(0)}\psi_{ir}^{(0)}\big|^{2+\delta}\Big)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\psi_{ir}^{(0)}\big| \le \Big(E_P\big|\psi_{it}^{(0)}\big|^{4+2\delta}\Big)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\big|^2,
\]
where the second inequality follows from Hölder's inequality and the identical distribution of $\gamma_t$. Let $\delta = \frac{q-4}{2}$; then $\big(E_P|\psi_{it}^{(0)}|^{4+2\delta}\big)^{\frac{1}{2+\delta}} \le c_m^2$ and $E_P|\psi_{it}^{(0)}|^2 < c_m^2$ follow from Assumption DML2(i). Therefore, condition (3B.1) is satisfied. Consider condition (3B.2). By Minkowski's inequality and Jensen's inequality, we have
\[
\bigg(\int\!\!\int\Big|E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}\,\big|\,\gamma_t=u,\gamma_r=v\big] - E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}\big]\Big|^{2+\delta}dF(u)\,dF(v)\bigg)^{\frac{1}{2+\delta}}
\le \bigg(\int\!\!\int\Big|E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}\,\big|\,\gamma_t=u,\gamma_r=v\big]\Big|^{2+\delta}dF(u)\,dF(v)\bigg)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\psi_{ir}^{(0)}\big|
\]
\[
\le \bigg(\int\!\!\int\Big(E_P\big[|\psi_{it}^{(0)}|^2\,\big|\,\gamma_t=u\big]\Big)^{\frac{2+\delta}{2}}\Big(E_P\big[|\psi_{ir}^{(0)}|^2\,\big|\,\gamma_r=v\big]\Big)^{\frac{2+\delta}{2}}dF(u)\,dF(v)\bigg)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\big|^2
\]
\[
\le \bigg(\int\!\!\int E_P\big[|\psi_{it}^{(0)}|^{2+\delta}\,\big|\,\gamma_t=u\big]\,E_P\big[|\psi_{ir}^{(0)}|^{2+\delta}\,\big|\,\gamma_r=v\big]\,dF(u)\,dF(v)\bigg)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\big|^2
= \Big(E_P\big|\psi_{it}^{(0)}\big|^{4+2\delta}\Big)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\big|^2 < \infty,
\]
where the second inequality follows from the (conditional) Hölder's inequality and the identical distribution of $\gamma_t$; the third inequality follows from Jensen's inequality; and the last equality follows from the law of iterated expectations and the identical distribution of $\gamma_t$. Therefore, condition (3B.2) is also satisfied with $\delta = \frac{q-4}{2}$. By Lemma 3.4 of Dehling and Wendler (2010), we conclude
\[
E_P\big|\widetilde I_{a,2,i}^{(kl)}\big|^2 = \frac{1}{T_l^4}\sum_{t,r,t',r'\in S_l}E_P\big[h_i(\gamma_t,\gamma_r)h_i(\gamma_{t'},\gamma_{r'})\big] = o(T_l^{-1}) = o(T^{-1}).
\]
Therefore, by Markov's inequality, we have $\widetilde I_{a,2,i}^{(kl)} = o_P(T^{-1/2})$. Next, consider $\big|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\big|$.
Note that, conditional on $\{\gamma_t\}_{t\in S_l}$, $I_{a,2,i}^{(kl)}$ is i.i.d. over $i$. So we have
\[
E_P\Bigg[\bigg|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\bigg|^2\,\Bigg|\,\{\gamma_t\}_{t\in S_l}\Bigg] = \frac{1}{N_k^2}\sum_{i\in I_k}E_P\Big[\big|I_{a,2,i}^{(kl)}\big|^2\,\Big|\,\{\gamma_t\}_{t\in S_l}\Big] = \frac{1}{N_k}E_P\Big[\big|I_{a,2,i}^{(kl)}\big|^2\,\Big|\,\{\gamma_t\}_{t\in S_l}\Big].
\]
By the conditional Markov inequality, we have
\[
P\Bigg(\bigg|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\bigg| > \varepsilon\,\Bigg|\,\{\gamma_t\}_{t\in S_l}\Bigg) = O\bigg(\frac{1}{N_k}E_P\Big[\big|I_{a,2,i}^{(kl)}\big|^2\,\Big|\,\{\gamma_t\}_{t\in S_l}\Big]\bigg).
\]
By Minkowski's inequality for infinite sums, Jensen's inequality, and Hölder's inequality, we have
\[
\Big(E_P\big[\big|I_{a,2,i}^{(kl)}\big|^2\big]\Big)^{1/2} \lesssim \frac{1}{T_l^2}\sum_{t,r\in S_l}\Big(E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}\big]^2\Big)^{1/2} \le \frac{1}{T_l^2}\sum_{t,r\in S_l}\Big(E_P\big|\psi_{it}^{(0)}\big|^4\Big)^{1/2} \le c_m^2,
\]
where the last inequality follows from Assumption DML2(i). Then, by the law of iterated expectations, we have
\[
P\Bigg(\bigg|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\bigg| > \varepsilon\Bigg) = O\big(N_k^{-1}\big),
\]
and $\big|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\big| = O_P\big(N_k^{-1/2}\big) = O_P\big(N^{-1/2}\big)$. Therefore, we have shown $\big|I_{a,2}^{(kl)}\big| = O_P\big(N^{-1/2}\big) + o_P\big(T^{-1/2}\big)$.

Next, consider $I_{a,1}^{(kl)}$. By the product decomposition, the triangle inequality, and the Cauchy-Schwarz inequality, we have
\[
\big|I_{a,1}^{(kl)}\big| \le \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big|\widehat\psi_{it}^{(kl)}\widehat\psi_{ir}^{(kl)} - \psi_{it}^{(0)}\psi_{ir}^{(0)}\Big|
\le \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big\{\big|\widehat\psi_{it}^{(kl)}-\psi_{it}^{(0)}\big|\,\big|\widehat\psi_{ir}^{(kl)}-\psi_{ir}^{(0)}\big| + \big|\psi_{it}^{(0)}\big|\,\big|\widehat\psi_{ir}^{(kl)}-\psi_{ir}^{(0)}\big| + \big|\widehat\psi_{it}^{(kl)}-\psi_{it}^{(0)}\big|\,\big|\psi_{ir}^{(0)}\big|\Big\}
\]
\[
\lesssim R_{kl}\Big\{\big\|\psi_{it}^{(0)}\big\|_{kl,2} + R_{kl}\Big\},
\]
where $R_{kl} = \big\|\widehat\psi_{it}^{(kl)}-\psi_{it}^{(0)}\big\|_{kl,2}$. By Markov's inequality and Assumption DML2(i), we have
\[
E_P\Bigg[\frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\big(\psi_{it}^{(0)}\big)^2\Bigg] = E_P\big[|\psi(W_{it};\theta_0,\eta_0)|^2\big] \le c_m^2.
\]
Therefore, $\frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\big(\psi_{it}^{(0)}\big)^2 = O_P(1)$. To bound $R_{kl}$, note that by Assumption DML1(i) (linearity) we have
\[
R_{kl}^2 = \frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\Big(\psi^a(W_{it};\widehat\eta_{kl})(\widehat\theta-\theta_0) + \psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0)\Big)^2
\]
\[
\lesssim \frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\big|\psi^a(W_{it};\widehat\eta_{kl})\big|^2\,\big|\widehat\theta-\theta_0\big|^2 + \frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\big|\psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0)\big|^2.
\]
By Markov's inequality and Assumption DML2(i), we have $\frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}|\psi^a(W_{it};\widehat\eta_{kl})|^2 = O_P(1)$.
By Theorem 3.2, $|\widehat\theta-\theta_0|^2 = O_P(N^{-1})$. Therefore, the first term on the right-hand side is $O_P(N^{-1})$. For the second term on the right-hand side, consider its conditional expectation given the auxiliary sample $W(-k,-l)$ on the event $\mathcal{E}_\eta\cap\mathcal{E}^c_p$:
$$E_P\Big[\big|\psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0)\big|^2\,\Big|\,W(-k,-l)\Big] \le \delta^2_{NT},$$
where the inequality follows from Assumption DML2(ii). Then, by the Markov inequality and Lemma 6.1 from Chernozhukov et al. (2018a), we have $R^2_{kl} = O_P\big(N^{-1}+(r'_{NT})^2\big)$ and so $\big|I^{kl}_{a,1}\big| = O_P\big(N^{-1/2}+r'_{NT}\big)$. To summarize, we have shown
$$\big|\widehat\Omega_{a,kl} - \Sigma_a\big| = O_P\big(N^{-1/2}+r'_{NT}\big) + O_P\big(N^{-1/2}\big) + o_P\big(T^{-1/2}\big) + O\big(T^{-2}\big) = O_P\big(N^{-1/2}+r'_{NT}\big).$$

Proof of Claim B.4. By the triangle inequality, we have
$$\big|\widehat\Omega_{b,kl} - cE_P[g_tg_t']\big| \le \big|I^{(kl)}_{b,1}\big| + \big|I^{(kl)}_{b,2}\big| + \big|I^{(kl)}_{b,3}\big|,$$
where
$$\begin{aligned}
I^{(kl)}_{b,1} &:= \frac{K/L}{N_kT_l^2}\sum_{t\in S_l,\,i\in I_k,\,j\in I_k}\Big\{\widehat\psi^{(kl)}_{it}\widehat\psi^{(kl)}_{jt} - \psi^{(0)}_{it}\psi^{(0)}_{jt}\Big\},\\
I^{(kl)}_{b,2} &:= \frac{K/L}{N_kT_l^2}\sum_{t\in S_l,\,i\in I_k,\,j\in I_k}\Big\{\psi^{(0)}_{it}\psi^{(0)}_{jt} - E_P[\psi^{(0)}_{it}\psi^{(0)}_{jt}]\Big\},\\
I^{(kl)}_{b,3} &:= \frac{K/L}{N_kT_l^2}\sum_{t\in S_l,\,i\in I_k,\,j\in I_k} E_P[\psi^{(0)}_{it}\psi^{(0)}_{jt}] - cE_P[g_tg_t'],
\end{aligned}$$
and $\frac{K/L}{N_kT_l^2}\,N_k^2T_l = c$.

Consider $I^{(kl)}_{b,3}$. By the law of total covariance, we have
$$E_P[\psi^{(0)}_{it}\psi^{(0)}_{jt}] = \mathrm{cov}\big(\psi^{(0)}_{it},\psi^{(0)}_{jt}\big) = E_P\big[\mathrm{cov}\big(\psi^{(0)}_{it},\psi^{(0)}_{jt}\,\big|\,\gamma_t\big)\big] + \mathrm{cov}\big(E_P[\psi^{(0)}_{it}\,|\,\gamma_t],\,E_P[\psi^{(0)}_{jt}\,|\,\gamma_t]\big) = 0 + E_P[g_tg_t'].$$
Due to the identical distribution of $\gamma_t$, $E_P[g_tg_t']$ does not vary over $t$, and so $I^{(kl)}_{b,3} = 0$.

To bound $I^{(kl)}_{b,2}$, we can rewrite it by the triangle inequality as follows:
$$\frac1c\big|I^{kl}_{b,2}\big| \le \Bigg|\frac{1}{T_l}\sum_{t\in S_l} I^{(kl)}_{b,2,t}\Bigg| + \Bigg|\frac{1}{T_l}\sum_{t\in S_l} \widetilde I^{(kl)}_{b,2,t}\Bigg|,$$
where
$$\begin{aligned}
I^{(kl)}_{b,2,t} &:= \frac{1}{N_k^2}\sum_{i,j\in I_k}\Big\{\psi^{(0)}_{it}\psi^{(0)}_{jt} - E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{jt}\,\big|\,\{\alpha_i\}_{i\in I_k}\big]\Big\},\\
\widetilde I^{(kl)}_{b,2,t} &:= \frac{1}{N_k^2}\sum_{i,j\in I_k}\Big\{E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{jt}\,\big|\,\{\alpha_i\}_{i\in I_k}\big] - E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{jt}\big]\Big\}.
\end{aligned}$$
Due to the identical distribution of $\gamma_t$, $\widetilde I^{(kl)}_{b,2,t}$ does not vary over $t$, so that $E_P\big|\frac{1}{T_l}\sum_{t\in S_l}\widetilde I^{(kl)}_{b,2,t}\big|^2 = E_P\big|\widetilde I^{(kl)}_{b,2,t}\big|^2$. Denote $\zeta_{ij,t} = \psi^{(0)}_{it}\psi^{(0)}_{jt}$. By direct calculation, we have
$$E_P\big|\widetilde I^{(kl)}_{b,2,t}\big|^2 = \frac{1}{N_k^4}\sum_{i,j\in I_k}\sum_{i',j'\in I_k} E_P\Big[\big(E_P[\zeta_{ij,t}|\alpha_i,\alpha_j]-E_P[\zeta_{ij,t}]\big)\big(E_P[\zeta_{i'j',t}|\alpha_{i'},\alpha_{j'}]-E_P[\zeta_{i'j',t}]\big)\Big] \lesssim \frac{1}{N_k}E_P\big[\zeta_{ij,t}^2\big] < \frac{1}{N_k}E_P\big[(\psi^{(0)}_{it})^4\big] = O(1/N_k),$$
where the first inequality follows from the assumption that $\alpha_i$ is independent over $i$ together with an application of Hölder's inequality and Jensen's inequality; the second inequality follows from Hölder's inequality; and the last equality follows from Assumption DML2(i) with some $q>4$. Therefore, by the Markov inequality, we have $\big|\frac{1}{T_l}\sum_{t\in S_l}\widetilde I^{(kl)}_{b,2,t}\big| = O_P(N_k^{-1/2}) = O_P(N^{-1/2})$.
Now consider $\big|\frac{1}{T_l}\sum_{t\in S_l} I^{(kl)}_{b,2,t}\big|$. Note that conditional on $\{\alpha_i\}$, $I^{(kl)}_{b,2,t}$ is also $\beta$-mixing, with the same mixing coefficient as $\gamma_t$. Then, with an application of the conditional version of Theorem 14.2 from Davidson (1994), we have
$$\left(E_P\Big[\big|E_P[I^{(kl)}_{b,2,t}\,|\,\{\alpha_i\}_{i\in I_k},\mathcal{F}^{t-l}_{-\infty}]\big|^2\,\Big|\,\{\alpha_i\}_{i\in I_k}\Big]\right)^{1/2} \le 2\big(2^{1/2}+1\big)\,\beta(l)^{\frac12-\frac2q}\left(E_P\Big[\big|I^{(kl)}_{b,2,t}\big|^{\frac q2}\,\Big|\,\{\alpha_i\}_{i\in I_k}\Big]\right)^{\frac2q}.$$
Then, we can apply the conditional version of Lemma A from Hansen (1992) to show that
$$\left(E_P\Bigg[\bigg|\frac{1}{T_l}\sum_{t\in S_l} I^{(kl)}_{b,2,t}\bigg|^2\,\Bigg|\,\{\alpha_i\}_{i\in I_k}\Bigg]\right)^{1/2} \lesssim \frac{1}{T_l}\sum_{l=1}^{\infty}\beta(l)^{\frac12-\frac2q}\left(\sum_{t\in S_l}\Big(E_P\Big[\big|I^{(kl)}_{b,2,t}\big|^{\frac q2}\,\Big|\,\{\alpha_i\}_{i\in I_k}\Big]\Big)^{\frac4q}\right)^{1/2} \lesssim \frac{1}{\sqrt{T_l}}\left(E_P\Big[\big|I^{(kl)}_{b,2,t}\big|^{\frac q2}\,\Big|\,\{\alpha_i\}_{i\in I_k}\Big]\right)^{\frac2q}.$$
By the conditional Markov inequality, we have
$$P\left(\bigg|\frac{1}{T_l}\sum_{t\in S_l} I^{(kl)}_{b,2,t}\bigg| > \varepsilon\,\Bigg|\,\{\alpha_i\}_{i\in I_k}\right) = O\left(T_l^{-1}\,E_P\Big[\big|I^{(kl)}_{b,2,t}\big|^{\frac q2}\,\Big|\,\{\alpha_i\}_{i\in I_k}\Big]\right).$$
By Minkowski's inequality for infinite sums, Jensen's inequality, and Hölder's inequality, we have
$$\left(E_P\Big[\big|I^{(kl)}_{b,2,t}\big|^{\frac q2}\Big]\right)^{\frac2q} \lesssim \frac{1}{N_k^2}\sum_{i,j\in I_k}\left(E_P\Big[\big(\psi^{(0)}_{it}\psi^{(0)}_{jt}\big)^{\frac q2}\Big]\right)^{\frac2q} \le \frac{1}{N_k^2}\sum_{i,j\in I_k}\Big(E_P\big[(\psi^{(0)}_{it})^q\big]\Big)^{\frac2q} \le c_m^2,$$
where the last inequality follows from Assumption DML2(i). Then, by the law of iterated expectations, we have
$$P\left(\bigg|\frac{1}{T_l}\sum_{t\in S_l} I^{(kl)}_{b,2,t}\bigg| > \varepsilon\right) = O\big(T_l^{-1/2}\big).$$
Therefore, we have shown $\big|I^{kl}_{b,2}\big| = O_P\big(N_k^{-1/2}\big) + O_P\big(T_l^{-1/2}\big)$.

Consider $I^{kl}_{b,1}$. By the same type of inequality as for $\big|I^{kl}_{a,1}\big|$, we have
$$\frac1c\big|I^{kl}_{b,1}\big| \lesssim R_{kl}\left\{\Bigg(\frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\big(\psi^{(0)}_{it}\big)^2\Bigg)^{1/2} + R_{kl}\right\},$$
where $R_{kl} = \big\|\widehat\psi^{(kl)}_{it}-\psi^{(0)}_{it}\big\|_{kl,2}$. We have shown in the proof of Claim B.3 that $\big\|\psi^{(0)}_{it}\big\|_{kl,2} = O_P(1)$ and $R^2_{kl} = O_P\big(N^{-1}+(r'_{NT})^2\big)$. So $\big|I^{kl}_{b,1}\big| = O_P\big(N^{-1/2}+r'_{NT}\big)$. To summarize,
$$\big|\widehat\Omega_{b,kl} - cE_P[g_tg_t']\big| = O_P\big(N^{-1/2}\big) + O_P\big(T^{-1/2}\big) + O_P\big(N^{-1/2}+r'_{NT}\big) = O_P\big(N^{-1/2}+r'_{NT}\big),$$
which completes the proof of Claim B.4.

Proof of Claim B.5.
By the triangle inequality, we have
$$\big|\widehat\Omega_{c,kl}\big| \le \big|I^{(kl)}_{c,1}\big| + \big|I^{(kl)}_{c,2}\big| + \big|I^{(kl)}_{c,3}\big|,$$
where
$$\begin{aligned}
I^{(kl)}_{c,1} &:= \frac{K/L}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l}\Big\{\widehat\psi^{(kl)}_{it}\widehat\psi^{(kl)}_{it} - \psi^{(0)}_{it}\psi^{(0)}_{it}\Big\},\\
I^{(kl)}_{c,2} &:= \frac{K/L}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l}\Big\{\psi^{(0)}_{it}\psi^{(0)}_{it} - E_P[\psi^{(0)}_{it}\psi^{(0)}_{it}]\Big\},\\
I^{(kl)}_{c,3} &:= \frac{K/L}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l} E_P[\psi^{(0)}_{it}\psi^{(0)}_{it}].
\end{aligned}$$
Consider $I^{(kl)}_{c,3}$. Note that under Assumption DML2(i), we have $E_P[\psi^{(0)}_{it}\psi^{(0)}_{it}] \le c_m^2$. Thus, $I^{(kl)}_{c,3} = O(1/T_l) = O(T^{-1})$.

Consider $I^{kl}_{c,2}$. We denote $\xi_{it} = \psi^{(0)}_{it}\psi^{(0)}_{it} - E_P[\psi^{(0)}_{it}\psi^{(0)}_{it}]$. By expanding $E\big|I^{kl}_{c,2}\big|^2$ and applying Hölder's inequality to the cross terms, we have
$$E\big|I^{kl}_{c,2}\big|^2 \le \left(\frac{K/L}{N_kT_l^2}\right)^2\left\{\sum_{t\in S_l,\,i\in I_k,\,j\in I_k} E_P|\xi_{it}|^2 + 2\sum_{m=1}^{T_l-1}\;\sum_{t=\min(S_l)}^{\max(S_l)-m}\;\sum_{i,j\in I_k} E_P|\xi_{it}|^2\right\}.$$
Note that for each $(i,t)$, by Hölder's inequality and Assumption DML2(i), we have $E_P|\xi_{it}|^2 \lesssim E_P\big[\psi(W_{it};\theta_0,\eta_0)^4\big] \le c_m^4$. Thus, $E\big|I^{(kl)}_{c,2}\big|^2 = O(T^{-2})$ and so $I^{(kl)}_{c,2} = O_P(T^{-1})$.

Now consider $I^{(kl)}_{c,1}$. Following the same steps as for $I^{(kl)}_{b,1}$, we have
$$\big|I^{(kl)}_{c,1}\big| \lesssim \frac{K/L}{T_l}\,R_{kl}\Big\{\big\|\psi^{(0)}_{it}\big\|_{kl,2} + R_{kl}\Big\},$$
where $R_{kl} = \big\|\widehat\psi^{(kl)}_{it}-\psi^{(0)}_{it}\big\|_{kl,2}$. We have shown in the proof of Claim B.3 that $\big\|\psi^{(0)}_{it}\big\|_{kl,2} = O_P(1)$ and $R^2_{kl} = O_P\big(N^{-1}+(r'_{NT})^2\big)$. So, $\big|I^{(kl)}_{c,1}\big| = O_P\big(N^{-1/2}/T + r'_{NT}/T\big)$. To summarize,
$$\big|\widehat\Omega_{c,kl}\big| = O_P\big(T^{-1}\big) + O_P\big(N^{-1/2}/T + r'_{NT}/T\big) = O_P\big(T^{-1}\big),$$
which completes the proof of Claim B.5.

Proof of Claim B.6.
By the triangle inequality, we have
$$\Bigg|\widehat\Omega_{d,kl} - c\sum_{m=1}^{\infty}E_P[g_tg'_{t+m}]\Bigg| \le \big|I^{(kl)}_{d,1}\big| + \big|I^{(kl)}_{d,2}\big| + \big|I^{(kl)}_{d,3}\big| + \big|I^{(kl)}_{d,4}\big| + \big|I^{(kl)}_{d,5}\big| + \big|I^{(kl)}_{d,6}\big|,$$
where
$$\begin{aligned}
I^{(kl)}_{d,1} &:= \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i}\Big\{\widehat\psi^{(kl)}_{it}\widehat\psi^{(kl)}_{j,t+m} - \psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\Big\},\\
I^{(kl)}_{d,2} &:= \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i}\Big\{\psi^{(0)}_{it}\psi^{(0)}_{j,t+m} - E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big]\Big\},\\
I^{(kl)}_{d,3} &:= \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}\Big(k\Big(\frac mM\Big)-1\Big)\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i} E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big],\\
I^{(kl)}_{d,4} &:= \frac{K/L}{N_kT_l^2}\sum_{m=M}^{\infty}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i} E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big],\\
I^{(kl)}_{d,5} &:= \frac{K/L}{N_kT_l^2}\sum_{m=1}^{\infty}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i} E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big] - c\sum_{m=1}^{\infty}E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big],\\
I^{(kl)}_{d,6} &:= c\sum_{m=1}^{\infty}E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big] - c\sum_{m=1}^{\infty}E_P[g_tg_{t+m}],
\end{aligned}$$
and $\frac{K/L}{N_kT_l^2}\,N_k^2T_l = c$.

Consider $I^{(kl)}_{d,6}$. By the law of total covariance, we have
$$E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big] = \mathrm{cov}\big(\psi^{(0)}_{it},\psi^{(0)}_{j,t+m}\big) = E_P\big[\mathrm{cov}\big(\psi^{(0)}_{it},\psi^{(0)}_{j,t+m}\,\big|\,\gamma_t,\gamma_{t+m}\big)\big] + \mathrm{cov}\big(E_P[\psi^{(0)}_{it}\,|\,\gamma_t],\,E_P[\psi^{(0)}_{j,t+m}\,|\,\gamma_{t+m}]\big) = 0 + E_P[g_tg'_{t+m}],$$
where the last equality follows from the properties of the Hajek projection components, as discussed in the beginning of Appendix A. Therefore, $I^{(kl)}_{d,6} = 0$.

Consider $I^{(kl)}_{d,5}$. The strict stationarity of $\gamma_t$ implies that $\psi^{(0)}_{it}$ is also strictly stationary over $t$, and under Assumption 3.1 there is no heterogeneity across $i$. Then, as $M, T\to\infty$, we have $I^{(kl)}_{d,5} = o(1)$.

Consider $I^{(kl)}_{d,4}$. Under Assumption DML2(i), $\big(E_P|\psi^{(0)}_{it}|^q\big)^{1/q} \le c_m$ for $q>4$. And conditional on $\alpha_i$, $\psi^{(0)}_{it}$ is $\beta$-mixing with a mixing coefficient not larger than that of $\gamma_t$. Then, by Theorem 14.13(ii) in Hansen (2022), we have
$$\Big|E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\,\big|\,\{\alpha_i\}_{i\in I_k}\big]\Big| \le 8\Big(E_P\big[|\psi^{(0)}_{it}|^q\,\big|\,\alpha_i\big]\Big)^{1/q}\Big(E_P\big[|\psi^{(0)}_{j,t+m}|^q\,\big|\,\alpha_j\big]\Big)^{1/q}\alpha_\gamma(m)^{1-2/q}.$$
By iterated expectations and Jensen's inequality, we have
$$\begin{aligned}
\Big|E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big]\Big| &\le E_P\Big[\Big|E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\,\big|\,\{\alpha_i\}_{i\in I_k}\big]\Big|\Big]\\
&\le 8\,E_P\Big[\big(E_P[|\psi^{(0)}_{it}|^q\,|\,\alpha_i]\big)^{1/q}\big(E_P[|\psi^{(0)}_{j,t+m}|^q\,|\,\alpha_j]\big)^{1/q}\Big]\,\alpha_\gamma(m)^{1-2/q}\\
&\le 8\,E_P\Big[\big(E_P[|\psi^{(0)}_{it}|^q\,|\,\alpha_i]\big)^{1/q}\Big]\,E_P\Big[\big(E_P[|\psi^{(0)}_{j,t+m}|^q\,|\,\alpha_j]\big)^{1/q}\Big]\,\alpha_\gamma(m)^{1-2/q}\\
&\lesssim c_m^2\,\alpha_\gamma(m)^{1-2/q},
\end{aligned}$$
where the third inequality follows from the independence of the $\alpha_i$ over $i$.
Then, as $M\to\infty$,
$$\big|I^{(kl)}_{d,4}\big| \le \frac{K/L}{N_kT_l^2}\sum_{m=M}^{\infty}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i}\Big|E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big]\Big| \lesssim \sum_{m=M}^{\infty}\alpha_\gamma(m)^{1-2/q} \le \sum_{m=M}^{\infty}\beta_\gamma(m)^{1-2/q} \le c_\kappa\sum_{m=M}^{\infty}\exp(-\kappa m) = c_\kappa\left(\frac{1}{1-e^{-\kappa}} - \frac{1-e^{-\kappa M}}{1-e^{-\kappa}}\right) = O\big(e^{-\kappa M}\big).$$

Consider $I^{(kl)}_{d,3}$:
$$\big|I^{(kl)}_{d,3}\big| \le \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}\Big|k\Big(\frac mM\Big)-1\Big|\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i}\Big|E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big]\Big| \le c\,c_m^2\sum_{m=1}^{M-1}\Big|k\Big(\frac mM\Big)-1\Big|\,\alpha_\gamma(m)^{1-2/q}.$$
Note that for each $m$, $\big|k\big(\frac mM\big)-1\big|\to0$ as $M\to\infty$. Since $\big|k\big(\frac mM\big)-1\big| \le 1$ and $\alpha_\gamma(m)^{1-2/q}$ is summable, we can apply the dominated convergence theorem to conclude that $I^{(kl)}_{d,3} = o(1)$.

To bound $I^{(kl)}_{d,2}$, we can rewrite it by the triangle inequality as follows:
$$\frac1c\big|I^{(kl)}_{d,2}\big| \le \Bigg|\sum_{m=1}^{M-1}\frac{k\big(\frac mM\big)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} I^{(kl)}_{d,2,tm}\Bigg| + \Bigg|\sum_{m=1}^{M-1}\frac{k\big(\frac mM\big)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} \widetilde I^{(kl)}_{d,2,tm}\Bigg|,$$
where
$$\begin{aligned}
I^{(kl)}_{d,2,tm} &:= \frac{1}{N_k^2}\sum_{i,j\in I_k,\,i\ne j}\Big\{\psi^{(0)}_{it}\psi^{(0)}_{j,t+m} - E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\,\big|\,\{\alpha_i\}_{i\in I_k}\big]\Big\},\\
\widetilde I^{(kl)}_{d,2,tm} &:= \frac{1}{N_k^2}\sum_{i,j\in I_k,\,i\ne j}\Big\{E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\,\big|\,\{\alpha_i\}_{i\in I_k}\big] - E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big]\Big\}.
\end{aligned}$$
Due to the identical distribution of $\gamma_t$, $\widetilde I^{(kl)}_{d,2,tm}$ does not vary over $t$, so that
$$E_P\Bigg|\sum_{m=1}^{M-1}\frac{k\big(\frac mM\big)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\widetilde I^{(kl)}_{d,2,tm}\Bigg|^2 \le E_P\Bigg|\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\widetilde I^{(kl)}_{d,2,tm}\Bigg|^2.$$
And by Minkowski's inequality, we have
$$\left(E_P\Bigg|\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\widetilde I^{(kl)}_{d,2,tm}\Bigg|^2\right)^{1/2} \le \sum_{m=1}^{M-1}k\Big(\frac mM\Big)\Big(E_P\big[\big(\widetilde I^{(kl)}_{d,2,tm}\big)^2\big]\Big)^{1/2}.$$
Denote $\zeta_{ijm} = \psi^{(0)}_{it}\psi^{(0)}_{j,t+m}$. By direct calculation, we have
$$E_P\big|\widetilde I^{(kl)}_{d,2,tm}\big|^2 = \frac{1}{N_k^4}\sum_{i,j\in I_k}\sum_{i',j'\in I_k}E_P\Big[\big(E_P[\zeta_{ijm}|\alpha_i,\alpha_j]-E_P[\zeta_{ijm}]\big)\big(E_P[\zeta_{i'j'm}|\alpha_{i'},\alpha_{j'}]-E_P[\zeta_{i'j'm}]\big)\Big] \lesssim \frac{1}{N_k}E_P\big[\zeta_{ijm}^2\big] < \frac{1}{N_k}E_P\big[(\psi^{(0)}_{it})^4\big] = O(1/N_k),$$
where the first inequality follows from the assumption that $\alpha_i$ is independent over $i$ together with an application of Hölder's inequality and Jensen's inequality; the second inequality follows from Hölder's inequality; and the last equality follows from Assumption DML2(i) with some $q>4$. Therefore, we have
$$\left(E_P\Bigg|\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\widetilde I^{(kl)}_{d,2,tm}\Bigg|^2\right)^{1/2} = O\Big(\frac{M}{N^{1/2}}\Big) = O\Big(\frac{M}{T^{1/2}}\Big).$$
By the Markov inequality, we have $\Big|\sum_{m=1}^{M-1}\frac{k(m/M)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\widetilde I^{(kl)}_{d,2,tm}\Big| = O_P\big(MT^{-1/2}\big)$.

Now consider $\Big|\sum_{m=1}^{M-1}\frac{k(m/M)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} I^{(kl)}_{d,2,tm}\Big|$. By Minkowski's inequality, we have
$$\left(E_P\Bigg|\sum_{m=1}^{M-1}\frac{k\big(\frac mM\big)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} I^{(kl)}_{d,2,tm}\Bigg|^2\right)^{1/2} \le \sum_{m=1}^{M-1}k\Big(\frac mM\Big)\left(E_P\Bigg|\frac{1}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} I^{(kl)}_{d,2,tm}\Bigg|^2\right)^{1/2}.$$
Following the same steps as for $I^{(kl)}_{b,2,t}$, we can show
$$E_P\Bigg|\frac{1}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} I^{(kl)}_{d,2,tm}\Bigg|^2 = O\big(T_l^{-1}\big).$$
Therefore, $\Big|\sum_{m=1}^{M-1}\frac{k(m/M)}{T_l}\sum_t I^{(kl)}_{d,2,tm}\Big| = O_P\big(MT_l^{-1/2}\big) = O_P\big(MT^{-1/2}\big)$. We have shown $\big|I^{(kl)}_{d,2}\big| = O_P(1/N_k) + O_P\big(MT^{-1/2}\big) = O_P\big(MT^{-1/2}\big)$.

Consider $I^{(kl)}_{d,1}$. Denote
$$I^{(kl)}_{d,1,m} = \frac{K/L}{N_kT_l^2}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i}\Big\{\widehat\psi^{(kl)}_{it}\widehat\psi^{(kl)}_{j,t+m} - \psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\Big\}$$
for each $m$. Then, $I^{(kl)}_{d,1} = \sum_{m=1}^{M-1}k\big(\frac mM\big)I^{(kl)}_{d,1,m}$. Following the same steps as for $I^{(kl)}_{a,1}$, we can show $\big|I^{(kl)}_{d,1,m}\big| = O_P\big(T^{-1/2}+r'_{NT}\big)$ for each $m$. Therefore, $\big|I^{(kl)}_{d,1}\big| = O_P\big(MT^{-1/2}+Mr'_{NT}\big)$, where
$$Mr'_{NT} \le M\delta_{NT}N^{-1/2} = \frac{M}{T^{1/2}}\cdot\frac{T^{1/2}}{N^{1/2}}\,\delta_{NT} = o(1).$$
To summarize,
$$\Bigg|\widehat\Omega_{d,kl} - c\sum_{m=1}^{\infty}E_P[g_tg'_{t+m}]\Bigg| = O_P\big(MT^{-1/2}+Mr'_{NT}\big) + O_P\big(MT^{-1/2}\big) + o(1) + O\big(e^{-\kappa M}\big) + o(1) + 0 = o_P(1),$$
which completes the proof of Claim B.6. □

Proof of Theorem 3.4 Since $(K,L)$ are fixed constants, it suffices to show for each $(k,l)$ that
$$\widehat\Omega_{NW,kl} := \frac{K/L}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}k\left(\frac{|t-r|}{M}\right)\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\,\psi(W_{ir};\widehat\theta,\widehat\eta_{kl})' = o_P(1).$$
Note that we can rewrite $\widehat\Omega_{NW,kl}$ as $\widehat\Omega_{NW,kl} = \widehat\Omega_{c,kl} + \widehat\Omega_{e,kl} - \widehat\Omega_{d,kl}$, where $\widehat\Omega_{c,kl}$ and $\widehat\Omega_{d,kl}$ are defined in the beginning of the proof of Theorem 3.3, and $\widehat\Omega_{e,kl}$ is defined as follows:
$$\widehat\Omega_{e,kl} := \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k}\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\,\psi(W_{j,t+m};\widehat\theta,\widehat\eta_{kl})'.$$
Observe that, replacing $\widehat\Omega_{d,kl}$ by $\widehat\Omega_{e,kl}$, each step in the proof of Claim B.6 goes through. It implies that $\widehat\Omega_{e,kl} = \widehat\Omega_{d,kl} + o_P(1)$. By Lemma A.6, we have $\widehat\Omega_{c,kl} = O_P(T^{-1})$. Therefore, we conclude that $\widehat\Omega_{NW,kl} = o_P(1)$. □
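To make the object analyzed in Theorem 3.4 concrete, the following sketch computes a sample version of the kernel-weighted term $\widehat\Omega_{NW}$ for a scalar score, without the sample splitting used in the formal construction. It is our own illustration: the array name `psi_hat`, the Bartlett kernel, and the bandwidth `M` are assumptions made for the example.

    import numpy as np

    def omega_nw(psi_hat, M):
        """Illustrative kernel-weighted term (1 / (N T^2)) * sum_{i,t,r} k(|t-r|/M) psi_it psi_ir
        for a scalar score stored in an N x T array; Bartlett kernel k(x) = max(1 - |x|, 0)."""
        N, T = psi_hat.shape
        total = 0.0
        for m in range(-(M - 1), M):          # lags with nonzero Bartlett weight
            w = 1.0 - abs(m) / M              # kernel weight k(m/M)
            lag = abs(m)
            # sum over i and t of psi_{i,t} * psi_{i,t+lag}
            total += w * np.sum(psi_hat[:, : T - lag] * psi_hat[:, lag:])
        return total / (N * T**2)

    # toy usage: with serially independent scores this term vanishes as T grows
    rng = np.random.default_rng(0)
    psi = rng.normal(size=(50, 200))
    print(omega_nw(psi, M=10))

The loop over signed lags counts each pair $(t, r)$ with $|t-r| = m \ge 1$ twice, once for $+m$ and once for $-m$, which matches the double sum over $(t, r)$ in the definition.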
APPENDIX 3C

PROOFS FOR CHAPTER 3.4

Proof of Theorem 3.5 Let $P\in\mathcal{P}_{NT}$ for each $(N,T)$. We denote
$$A_{NT} = \frac{1}{NT}(V^Z)'V^D, \qquad \widehat A_{NT} = \frac{1}{NT}(Z-f\widehat\zeta)'(D-f\widehat\pi),$$
$$\psi_{NT} = \frac{1}{NT}(V^Z)'V^g, \qquad \widehat\psi_{NT} = \frac{1}{NT}(Z-f\widehat\zeta)'\Big(Y-f\widehat\beta-(D-f\widehat\pi)\theta_0\Big).$$
We can write $\widehat\theta-\theta_0 = \widehat A_{NT}^{-1}\widehat\psi_{NT}$. By product decomposition, we have
$$\widehat\theta-\theta_0 = A_{NT}^{-1}\psi_{NT} + A_{NT}^{-1}\big[\widehat\psi_{NT}-\psi_{NT}\big] + \big[\widehat A_{NT}^{-1}-A_{NT}^{-1}\big]\big[\widehat\psi_{NT}-\psi_{NT}\big] + \big[\widehat A_{NT}^{-1}-A_{NT}^{-1}\big]\psi_{NT}.$$
For the asymptotic normality of $\sqrt{N\wedge T}\,\big(\widehat\theta-\theta_0\big)$, we need to show the following statements: (i) $A_{NT}\overset{p}{\to}A_0 = E[V^Z_{it}V^D_{it}]$; (ii) $\sqrt{N\wedge T}\,\psi_{NT}\overset{d}{\to}\mathcal N(0,\Omega_0)$; (iii) $\sqrt{N\wedge T}\,\big[\widehat\psi_{NT}-\psi_{NT}\big] = o_P(1)$; (iv) $\widehat A_{NT}-A_{NT} = o_P(1)$. With Statements (i)-(iv) and the identification condition in Assumption REG-P(i) that $A_0$ is non-singular, $\sqrt{N\wedge T}\,\big(\widehat\theta-\theta_0\big)\overset{d}{\to}\mathcal N\big(0,\,A_0^{-1}\Omega_0A_0^{-1\prime}\big)$. Then, the conclusion of the theorem follows.

Before we show Statements (i)-(iv), we note that Assumptions REG-P(ii) and AHK imply that $(\bar F_i,\bar F_t)$ are functions of only $(\alpha_i,\gamma_t,\varepsilon_{it})$, and so are $f_{it}$ and $V^l_{it}$ for $l = g, D, Y, Z$. Therefore, the results based on the Hajek projection are still applicable. Also, due to Assumption REG-P(ii), $\bar F_i$ is a function of only $(c_i,\epsilon_i)$ and $\bar F_t$ is a function of only $(d_t,\epsilon_t)$, so $f_{it}$ is a function of $(X_{it},c_i,\epsilon_i,d_t,\epsilon_t)$, which are mean independent of $U^D_{it}$. Therefore, $E_P[f_{it}V^D_{it}] = E_P\big[f_{it}\big[(L_{2,it}-E[L_{2,it}])\eta_{D,2}+U^D_{it}\big]\big] = 0$, given that $f_{it}$ is uncorrelated with $L_{2,it}$ as discussed in the main text; similar arguments apply to $V^l_{it}$ for the other $l$.

Statement (i) follows from Lemma A.1 under Assumptions AHK, AR, and REG-P(iii). For Statement (ii), we first observe that $V^Z_{it} = Z_{it}(1-\zeta_0)$, where $\zeta_0 = \big(E[f'_{it}f_{it}]\big)^{-1}E[f'_{it}Z_{it}]$. Due to the exogeneity condition $E_P[Z_{it}U^g_{it}] = 0$ and the independence between $(\bar F_i,\bar F_t,Z_{it},X_{it})$ and $(\epsilon_i,\epsilon_t)$, we have $E_P[V^Z_{it}V^g_{it}] = 0$. With the additional Assumption REG-P(iv), Statement (ii) follows from Lemma A.2.

Consider Statement (iii). By product decomposition and the triangle inequality, we have
$$\begin{aligned}
NT\,\big|\widehat\psi_{NT}-\psi_{NT}\big| &\le \Big|\big(f(\zeta_0-\widehat\zeta)\big)'\Big(f(\beta_0-\widehat\beta)+V^Y+r^Y-\theta_0\big(f(\pi_0-\widehat\pi)+V^D+r^D\big)\Big)\Big| + \Big|(Z-f\zeta_0)'\Big(\theta_0f(\widehat\pi-\pi_0)-f(\beta_0-\widehat\beta)+r^g\Big)\Big|\\
&\lesssim \big|(f(\zeta_0-\widehat\zeta))'f(\beta_0-\widehat\beta)\big| + \big|(f(\zeta_0-\widehat\zeta))'V^Y\big| + \big|(f(\zeta_0-\widehat\zeta))'r^Y\big| + \big|(f(\zeta_0-\widehat\zeta))'f(\pi_0-\widehat\pi)\big| + \big|(f(\zeta_0-\widehat\zeta))'V^D\big| + \big|(f(\zeta_0-\widehat\zeta))'r^D\big|\\
&\quad + \big|(V^Z)'f(\widehat\pi-\pi_0)\big| + \big|(V^Z)'f(\beta_0-\widehat\beta)\big| + \big|(V^Z)'r^g\big|. \tag{3C.1}
\end{aligned}$$
Under Assumptions AHK and AR, the sparse approximation conditions, as well as Assumptions REG-P(ii)-(vii), we can apply Theorem 3.1 with the penalty level $\lambda = \frac{6c_{1NT}}{\sqrt{N\wedge T}}\Phi^{-1}(1-\gamma/2p)$ to obtain that
$$\big\|f_{it}(\eta_0-\widehat\eta)\big\|_{NT,2} = O_P\left(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\right), \qquad \|\eta_0-\widehat\eta\|_1 = O_P\left(s\sqrt{\frac{\log(p/\gamma)}{N\wedge T}}\right)$$
for $\eta = \zeta,\pi,\beta$ and $l = Z, D, Y$. This choice of $\lambda$ is valid because $\max_{j\le p}\big|\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}f_{it,j}V^l_{it}\big| = O_P\big(\Phi^{-1}(1-\gamma/2p)/\sqrt{N\wedge T}\big)$, and by Lemma A.2, $\omega_{j,l}\overset{p}{\to}\Sigma_{a,j,l}+\frac{N\wedge T}{T}\Sigma_{g,j,l}$, where $\min_{j\le p}\omega_{j,l} > 0$ by Assumption REG-P(iv) and Lemma A.1.
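To see why these rates suffice for Statement (iii), note the following arithmetic, which is a restatement of the bounds above rather than an additional assumption:
$$\sqrt{N\wedge T}\,\big\|f_{it}(\zeta_0-\widehat\zeta)\big\|_{NT,2}\,\big\|f_{it}(\beta_0-\widehat\beta)\big\|_{NT,2} = O_P\left(\sqrt{N\wedge T}\cdot\frac{s\log(p/\gamma)}{N\wedge T}\right) = O_P\left(\frac{s\log(p/\gamma)}{\sqrt{N\wedge T}}\right),$$
which is $o_P(1)$ exactly when $s = o\big(\sqrt{N\wedge T}/\log(p/\gamma)\big)$, the sparsity condition invoked at the end of the argument below.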
Consider the first term in (3C.1). By the Cauchy-Schwarz inequality, we have
$$\frac{\sqrt{N\wedge T}}{NT}\big|(f(\zeta_0-\widehat\zeta))'f(\beta_0-\widehat\beta)\big| \le \sqrt{N\wedge T}\,\big\|f_{it}(\zeta_0-\widehat\zeta)\big\|_{NT,2}\big\|f_{it}(\beta_0-\widehat\beta)\big\|_{NT,2} = O_P\left(\frac{s\log(p/\gamma)}{\sqrt{N\wedge T}}\right).$$
Consider the second term in (3C.1). By Hölder's inequality, we have
$$\frac{\sqrt{N\wedge T}}{NT}\big|(f(\zeta_0-\widehat\zeta))'V^Y\big| \le \sqrt{N\wedge T}\,\|\zeta_0-\widehat\zeta\|_1\,\frac{\|f'V^Y\|_\infty}{NT} = O_P\left(\frac{s\log(p/\gamma)}{\sqrt{N\wedge T}}\right).$$
Consider the third term in (3C.1). By the Cauchy-Schwarz inequality, we have
$$\frac{\sqrt{N\wedge T}}{NT}\big|(f(\zeta_0-\widehat\zeta))'r^Y\big| \le \sqrt{N\wedge T}\,\big\|f_{it}(\zeta_0-\widehat\zeta)\big\|_{NT,2}\big\|r^Y_{it}\big\|_{NT,2} = O_P\left(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\right).$$
For the last term of (3C.1), the Cauchy-Schwarz inequality implies that
$$\frac{\sqrt{N\wedge T}}{NT}\big|(V^Z)'r^g\big| \le \sqrt{N\wedge T}\,\big\|V^Z_{it}\big\|_{NT,2}\big\|r^g_{it}\big\|_{NT,2}.$$
By Assumption REG-P(ii), we have $E\big[|V^Z_{it}|^{4(\mu+\delta)}\big] < \infty$. Then we can apply Lemma A.1 and obtain that $\|V^Z_{it}\|_{NT,2}\overset{p}{\to}\big(E[(V^Z_{it})^2]\big)^{1/2}$. Therefore, we have $\frac{\sqrt{N\wedge T}}{NT}\big|(V^Z)'r^g\big| = o_P(1)$. The arguments for the rest of the terms in (3C.1) are similar. Under the sparsity condition $s = o\big(\sqrt{N\wedge T}/\log(p/\gamma)\big)$, we conclude that $\sqrt{N\wedge T}\,\big|\widehat\psi_{NT}-\psi_{NT}\big| = o_P(1)$.

Consider Statement (iv). By product decomposition, we have
$$\begin{aligned}
\big\|\widehat A_{NT}-A_{NT}\big\|_1 &= \frac{1}{NT}\Big\|\big(f(\zeta_0-\widehat\zeta)\big)'(D-f\pi_0) + (Z-f\zeta_0)'f(\pi_0-\widehat\pi) + \big(f(\zeta_0-\widehat\zeta)\big)'f(\pi_0-\widehat\pi)\Big\|_1\\
&\le \frac{1}{NT}\Big\{\big\|(f(\zeta_0-\widehat\zeta))'f(\pi_0-\widehat\pi)\big\|_1 + \big\|(f(\zeta_0-\widehat\zeta))'(r^D+V^D)\big\|_1 + \big\|(V^Z)'f(\pi_0-\widehat\pi)\big\|_1\Big\}.
\end{aligned}$$
We observe that, by similar arguments as for Statement (iii), $\big\|\widehat A_{NT}-A_{NT}\big\|_1 = o_P(1)$. We have shown Statements (i)-(iv), completing the proof. □

Proof of Theorem 3.6 We have shown in the proof of Theorem 3.5 that $\widehat A_{NT}-A_{NT} = o_P(1)$ and $A_{NT}-A_0 = o_P(1)$. By the triangle inequality, we have $\widehat A_{NT}-A_0 = o_P(1)$. Then, it suffices to show $\widehat\Omega_{CHS}-\Omega = o_P(1)$. We decompose $\widehat\Omega_{CHS}$ as follows:
$$\widehat\Omega_{CHS} := \widehat\Omega_a + \widehat\Omega_b - \widehat\Omega_c + \widehat\Omega_d + \widehat\Omega'_d,$$
where
$$\begin{aligned}
\widehat\Omega_a &:= \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T}\psi_{it}(\widehat\theta,\widehat\eta)\,\psi_{ir}(\widehat\theta,\widehat\eta)', \qquad
\widehat\Omega_b := \frac{1}{NT^2}\sum_{t=1}^{T}\sum_{i=1}^{N}\sum_{j=1}^{N}\psi_{it}(\widehat\theta,\widehat\eta)\,\psi_{jt}(\widehat\theta,\widehat\eta)',\\
\widehat\Omega_c &:= \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\psi_{it}(\widehat\theta,\widehat\eta)\,\psi_{it}(\widehat\theta,\widehat\eta)', \qquad
\widehat\Omega_d := \frac{1}{NT^2}\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\sum_{t=1}^{T-m}\sum_{i=1}^{N}\sum_{j=1,\,j\ne i}^{N}\psi_{it}(\widehat\theta,\widehat\eta)\,\psi_{j,t+m}(\widehat\theta,\widehat\eta)',
\end{aligned}$$
where $\psi_{it}(\theta,\eta) = (Z_{it}-f_{it}\zeta)\big(Y_{it}-f_{it}\beta-\theta(D_{it}-f_{it}\pi)\big)$ and $\eta = (\zeta,\beta,\pi)$. We need to show $\widehat\Omega_a\overset{p}{\to}\Sigma_a = E_P[a_i^2]$, $\widehat\Omega_b\overset{p}{\to}cE[g_t^2]$, $\widehat\Omega_c = o_P(1)$, and $\widehat\Omega_d\overset{p}{\to}c\sum_{m=1}^{\infty}E_P[g_tg_{t+m}]$. First, consider $\widehat\Omega_a - E_P[a_i^2]$.
By the triangle inequality, we have
$$\big|\widehat\Omega_a - E_P[a_i^2]\big| \le |I_{a,1}| + |I_{a,2}| + |I_{a,3}|,$$
where
$$\begin{aligned}
I_{a,1} &:= \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T}\Big\{\psi_{it}(\widehat\theta,\widehat\eta)\psi_{ir}(\widehat\theta,\widehat\eta) - \psi_{it}(\theta_0,\eta_0)\psi_{ir}(\theta_0,\eta_0)\Big\},\\
I_{a,2} &:= \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T}\Big\{\psi_{it}(\theta_0,\eta_0)\psi_{ir}(\theta_0,\eta_0) - E[\psi_{it}(\theta_0,\eta_0)\psi_{ir}(\theta_0,\eta_0)]\Big\},\\
I_{a,3} &:= \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T}\Big\{E[\psi_{it}(\theta_0,\eta_0)\psi_{ir}(\theta_0,\eta_0)] - E[a_i^2]\Big\}.
\end{aligned}$$
Note that in proving Claim B.3, the cross-fitting device is only used to show that $I_{a,1}$ is of small order. Since the arguments for showing $I_{a,2}$ and $I_{a,3}$ to be of small order are essentially the same as those in the proof of Claim B.3, they are not repeated here.

Consider $I_{a,1}$. By product decomposition, the triangle inequality, and the Cauchy-Schwarz inequality, we have
$$|I_{a,1}| \lesssim R_{NT}\Big\{\big\|\psi_{it}(\theta_0,\eta_0)\big\|_{NT,2} + R_{NT}\Big\}, \qquad R_{NT} := \big\|\psi_{it}(\widehat\theta,\widehat\eta) - \psi_{it}(\theta_0,\eta_0)\big\|_{NT,2}.$$
By Minkowski's inequality, we have
$$\begin{aligned}
R_{NT} &= \Big\|\psi^a_{it}(\eta_0)(\widehat\theta-\theta_0) + \big(\psi^a_{it}(\widehat\eta)-\psi^a_{it}(\eta_0)\big)(\widehat\theta-\theta_0) + \psi_{it}(\theta_0,\widehat\eta) - \psi_{it}(\theta_0,\eta_0)\Big\|_{NT,2}\\
&\le \big\|\psi^a_{it}(\eta_0)(\widehat\theta-\theta_0)\big\|_{NT,2} + \big\|\big(\psi^a_{it}(\widehat\eta)-\psi^a_{it}(\eta_0)\big)(\widehat\theta-\theta_0)\big\|_{NT,2} + \big\|\psi_{it}(\theta_0,\widehat\eta)-\psi_{it}(\theta_0,\eta_0)\big\|_{NT,2}\\
&=: R_{a,1} + R_{a,2} + R_{a,3},
\end{aligned}$$
where $\psi^a_{it}(\eta) := (Z_{it}-f_{it}\zeta)(D_{it}-f_{it}\pi)$. Under Assumption REG-P(ii), we have $E_P\big[\psi^a_{it}(\eta_0)\big]^2 = E_P\big[V^Z_{it}(V^D_{it}+r^D_{it})\big]^2 = O(1)$, and the Markov inequality implies that $\big\|\psi^a_{it}(\eta_0)\big\|_{NT,2} = O_P(1)$. By Theorem 3.5, we have $\widehat\theta-\theta_0 = O_P\big(\frac{1}{\sqrt{N\wedge T}}\big)$. Therefore, $R_{a,1} \le \big\|\psi^a_{it}(\eta_0)\big\|_{NT,2}\,\big|\widehat\theta-\theta_0\big| = O_P\big(\frac{1}{\sqrt{N\wedge T}}\big)$.

To bound $R_{a,2}$, we note
$$\big\|\psi^a_{it}(\eta_0)-\psi^a_{it}(\widehat\eta)\big\|_{NT,2} = \Big\|f_{it}(\widehat\zeta-\zeta_0)(D_{it}-f_{it}\pi_0) + f_{it}(\widehat\zeta-\zeta_0)f_{it}(\widehat\pi-\pi_0) + (Z_{it}-f_{it}\zeta_0)f_{it}(\widehat\pi-\pi_0)\Big\|_{NT,2}.$$
Under Assumption REG-P(iii), we have $E_P|V^D_{it}|^{8(\mu+\delta)} < \infty$, which implies
$$E_P\Big[\max_{i\le N,\,t\le T}|V^D_{it}|^2\Big] \lesssim (NT)^{\frac{1}{4(\mu+\delta)}}.$$
By the Markov inequality, we have $\max_{i\le N,t\le T}|V^D_{it}|^2 = O_P\big((NT)^{\frac{1}{4(\mu+\delta)}}\big)$. As in the proof of Theorem 3.5, Theorem 3.1 can be applied to obtain $\big\|f_{it}(\widehat\zeta-\zeta_0)\big\|_{NT,2} = O_P\Big(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Big)$. Then, we have
$$\big\|f_{it}(\widehat\zeta-\zeta_0)V^D_{it}\big\|_{NT,2} \le \Big(\max_{i\le N,\,t\le T}|V^D_{it}|^2\Big)^{1/2}\big\|f_{it}(\widehat\zeta-\zeta_0)\big\|_{NT,2} = O_P\big((NT)^{\frac{1}{8(\mu+\delta)}}\big)\,O_P\left(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\right) = O_P\big((NT)^{\frac{1}{8(\mu+\delta)}}\big)\,o_P\left(\frac{1}{(N\wedge T)^{1/4}}\right) = o_P(1).$$
Similar arguments can be made to bound $R_{a,3}$. Therefore, we have $R_{NT} = o_P(1)$ and so $\widehat\Omega_a\overset{p}{\to}\Sigma_a$.

It remains to show that $\widehat\Omega_b\overset{p}{\to}cE[g_t^2]$, $\widehat\Omega_c = o_P(1)$, and $\widehat\Omega_d\overset{p}{\to}c\sum_{m=1}^{\infty}E_P[g_tg_{t+m}]$. As shown in the proof of Theorem 3.3 (Lemmas A.5-A.7), the only step in establishing these claims that involves the cross-fitting technique is showing that the same term $R_{NT}$ converges to zero in probability.
Otherwise, the arguments are essentially the same and are not repeated here. Combining these results, we obtain
$$\widehat\Omega \overset{p}{\to} E_P(a_i^2) + cE_P(g_t^2) + c\sum_{m=1}^{\infty}E_P(g_tg_{t+m}) = \Sigma_a + c\Sigma_g.$$
To show $\widehat V_{DKA} = \widehat V_{CHS} + o_P(1)$, it suffices to show $\widehat\Omega_{NW} = o_P(1)$. We decompose $\widehat\Omega_{NW}$ as follows: $\widehat\Omega_{NW} = \widehat\Omega_c + \widehat\Omega_e - \widehat\Omega_d$, where $\widehat\Omega_c$ and $\widehat\Omega_d$ are defined as above and $\widehat\Omega_e$ is defined as follows:
$$\widehat\Omega_e := \frac{1}{NT^2}\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\sum_{t=1}^{T-m}\sum_{i=1}^{N}\sum_{j=1}^{N}\psi(W_{it};\widehat\theta,\widehat\eta)\,\psi(W_{j,t+m};\widehat\theta,\widehat\eta).$$
Following the same arguments as in the proof of Claim B.6, we have $\widehat\Omega_e = \widehat\Omega_d + o_P(1)$. We have shown $\widehat\Omega_c = o_P(1)$. Therefore, we conclude that $\widehat\Omega_{NW} = o_P(1)$, which completes the proof. □

CHAPTER 4

ANOTHER LOOK AT THE LINEAR PROBABILITY MODEL AND NONLINEAR INDEX MODELS (CO-AUTHORED WITH ROBERT S. MARTIN and JEFFREY M. WOOLDRIDGE)

4.1 Introduction

When an outcome variable, $y$, is binary, empirical researchers usually choose between two general strategies given a vector of (exogenous) explanatory variables, x: (i) approximate the response probability, $P(y=1|\mathbf{x})$, using a model linear in parameters, or (ii) use a nonlinear model, such as logit or probit. The first strategy is commonly known as using a linear probability model (LPM). The benefits of the LPM are well known and include ease of interpretation and simple estimation. The shortcomings of the LPM are also well known and discussed in most introductory econometrics texts; see, for example, Wooldridge, 2019, Section 7.5. More advanced discussions of the LPM recognize that one should not take the linear model for $P(y=1|\mathbf{x})$ literally but only as an approximation. The approximation can be exact in special cases—such as when x consists of binary indicators that are exhaustive and mutually exclusive—and it may be poor in other cases. However, for the most part, prediction is not the primary use of LPMs specifically or binary response models generally. Rather, researchers are largely interested in using binary response models to measure ceteris paribus or causal effects, and it is from this perspective that the LPM approximation should be evaluated. Angrist and Pischke, 2009, Section 3.4.1 and Wooldridge, 2010, Section 15.2 take this perspective. Wooldridge, 2010, Section 15.6, p. 579 shows how the results of Stoker (1986) can be applied to OLS estimation of the parameters in a LPM. Remarkably, there are situations where the linear projection exactly recovers the average partial effects (APEs) across a broad range of binary response models.1 Even though it is natural to study the LPM from the linear projection perspective, this opinion is not universally held.

The co-authors have approved the inclusion of this co-authored chapter. Co-author contact: Robert S. Martin, Division of Price and Index Number Research, Bureau of Labor Statistics. Email: martin.robert@bls.gov. Jeffrey M. Wooldridge, Department of Economics, Michigan State University. Email: wooldri1@msu.edu

1 Note that extensions of the LPM do not necessarily recover the APE. For instance, see Li et al. (2022) for the case of the LPM with endogenous x and two-stage least squares estimation.
In an influential paper, Horrace and Oaxaca (2006) study both the bias and inconsistency of the OLS estimator for the parameters of an underlying piecewise linear model for the response probability that ensures the probabilities are in the unit interval.2 The Horrace and Oaxaca paper is regularly cited in empirical research,3 sometimes as a cautionary tale in using the LPM and sometimes as support for using the LPM when relatively few fitted values lie outside the unit interval. While Horrace and Oaxaca take the piecewise linear model seriously, much if not most of the citing literature seeks to use their results to choose between the LPM and an alternative like probit or logit.4 In the current paper, we revisit the Horrace and Oaxaca framework but, rather than focus on parameters, we focus on APEs. We show that Horrace and Oaxaca set up the problem so that, in general, the response probability is nonlinear in the underlying linear index, x𝛽 = 𝛽1 + 𝛽2𝑥2 + · · · + 𝛽𝐾𝑥𝐾: 𝑃(𝑦 = 1|x) = 𝑅(x𝛽) = 0, 𝑥 𝛽 ≤ 0 𝑥 𝛽, 𝑥 𝛽 ∈ (0, 1). 1, 𝑥 𝛽 ≥ 1    (4.1) The nonlinear function 𝑅(·)—sometimes called the ramp function—is piecewise linear and continu- ous, but it is not strictly increasing, and it is nondifferentiable at two inflection points. Nevertheless, under fairly weak assumptions, one can define the APEs. For continuous variables, the associated APEs are necessarily smaller in magnitude than the index slope coefficients in the underlying non- linear model. Consequently, Horrace and Oaxaca’s focus on index parameters rather than APEs is essentially the same as focusing on parameters in smooth response probabilities such as the logit and probit functions. Therefore, any conclusions about the usefulness of the LPM should be reexamined from the perspective of identifying APEs rather than coefficients. 2Horrace and Oaxaca (2006) defines the LPM as the piecewise linear ramp model. However, in this paper, we differentiate between the “ramp model” and the “LPM” (which is linear everywhere). 3In recent years (2020-2024), Horrace and Oaxaca (2006) has more than 300 Google Scholar citations. 4See, for example, Footnote 20 of van den Berg and Siflinger (2022). 149 It is important to understand that we are not advocating the ramp function as an especially sensible model of the response probability. Rather, we primarily study that specification from the perspective of average partial effects to determine how the Horrace and Oaxaca conclusions hold up. Briefly, in some cases, the linear projection parameters do a very good job of approximating the APEs even when a large percentage of the fitted values are outside the unit interval. Conversely, in other cases, the linear projection parameters do a very poor job of approximating the APEs even when a high percentage of the fitted values are within the unit interval. A practical implication is that there is little justification for how the Horrace and Oaxaca paper is cited in empirical research. We compare the OLS estimation of the LPM to a few nonlinear competitors, including probit and logit quasi-maximum likelihood estimation (QMLE), as natural benchmarks. Horrace and Oaxaca cite a few theoretical rationalizations for the ramp model, so it also makes sense to see if a consistent estimator exists that takes it seriously. Horrace and Oaxaca suggest trimming the sample of fitted values outside the unit interval and re-estimating using OLS, but do not present any theoretical or simulation results. 
In unreported simulations, we found that trimming the sample once did not necessarily improve performance over OLS for estimating the APEs. Interestingly, by iteratively trimming the sample and performing OLS estimation (referred to as the ITO procedure hereafter), we show that the procedure produces results equivalent to those from numerically minimizing the nonlinear least squares (NLS) objective function with the ramp model. In Section 4.3, we show that the NLS estimator of the ramp model is consistent and asymptotically normal under mild assumptions, which in turn justifies trimming procedures in practice. For estimating the APEs, we find that NLS estimation of the ramp function performs comparably to quasi-MLE estimation of the logit and probit models and has good finite sample properties even when OLS estimation of the LPM does not.

Section 4.2 delivers our main theoretical arguments. Starting with a linear index model as the response probability of a binary outcome, we define and contrast the parameters of interest: the index slope coefficients, the average partial effects, and the linear projection parameters. By leveraging results from Stoker (1986), we describe scenarios where the linear projection parameters recover the APEs. In particular, we extend the discussion in Wooldridge, 2010, Section 15.6 and show that, when the covariates have a multivariate normal distribution, the linear projection identifies the APEs. Section 4.4 continues this line of inquiry by conducting simulations, both in scenarios where the theory makes sharp predictions and in scenarios where the theory is suggestive but does not fully pin down the relationships. Related to our main theoretical arguments, we show that a large fraction of fitted values in $[0,1]$ is neither a sufficient nor a necessary condition for the LPM to approximate the APEs well. We revisit an empirical study of mortgage lending decisions in Section 4.5. The LPM estimated by OLS, with a full set of interactions between the variable of interest and the control variables, delivers a notably smaller and only marginally statistically significant estimate of the effect of being white on the approval probability. The NLS estimator of the ramp function, probit QMLE, and logit QMLE are very similar and all statistically significant at the 0.2% level—both because the estimated effects are larger and because the (robust) standard errors are notably smaller. In Section 4.6, we conclude with some implications for empirical research.

4.2 Relevant Parameters of Binary Response Models

Let $y$ be the binary outcome variable and x the $1\times K$ vector of explanatory variables, where $x_1\equiv1$ allows for an intercept in the index. Consider a linear index model of the response probability for $y$:
$$P(y=1|\mathbf{x}) \equiv p(\mathbf{x}) = G(\mathbf{x}\beta) = G(\beta_1+\beta_2x_2+\cdots+\beta_Kx_K), \tag{4.2}$$
where $G:\mathbb{R}\to[0,1]$. This embeds the probit/logit model by setting $G(\cdot)$ to the standard normal CDF/standard logistic function, and it includes the ramp model in Horrace and Oaxaca (2006) by setting $G(\cdot) = R(\cdot)$ as in (4.1). The following subsection compares different parameters of interest for the linear index model generally and for the ramp model specifically. The ramp model for the response probability was suggested by Horowitz and Savin (2001) as being suitable when one starts with a linear model for $p(\mathbf{x})$ but wants to ensure that the probabilities are within the unit interval. While not necessarily advocating this view, our purpose is to show that Horrace and Oaxaca's conclusions about one set of parameters ($\beta$) do not necessarily apply to the most interesting set of parameters (the APEs).
While not necessarily advocating this view, our purpose is to show that Horrace and Oaxaca’s conclusions about one set 151 of parameters (𝛽) do not necessarily apply to the most interesting set of parameters (the APEs). 4.2.1 APEs, Index Slopes, and Linear Projection Parameters We will first consider partial effects for continuous variables. Let 𝑥 𝑗 be a continuously distributed explanatory variable. For simplicity, the discussion here assumes that 𝑥 𝑗 appears only by itself. If the model includes quadratics, interactions, and so on then the details become more complicated but the conclusions do not change substantively. Assume 𝐺 (·) is differentiable almost everywhere, with its derivative denoted by 𝑔(·). Then, the partial effects and average partial effects of 𝑥 𝑗 on the response probability of 𝑦 can be defined as: 𝑃𝐸 𝑗 (x) ≡ 𝛽 𝑗 𝑔 (x𝛽) , 𝐴𝑃𝐸 𝑗 ≡ 𝛽 𝑗 𝐸 [𝑔 (x𝛽)] In the case of the ramp function, even though 𝑅 (𝑧) is non-differentiable at 𝑧 = 0 and 𝑧 = 1, it is still differentiable with probability one as long as x𝛽 is continuous, and so 𝑃 (x𝛽 = 0) = 𝑃 (x𝛽 = 1) = 0. This holds true provided that at least one element of x is continuous, and that element has a nonzero coefficient, which is a very common assumption imposed in the semiparametric literature on binary response models. In what follows, we maintain that x𝛽 is continuous so that partial effects are well-defined with probability one. As a result, we can define a partial effect function as the derivative of 𝑅 (x𝛽) and ignore points where the derivative does not exist: 𝑃𝐸 𝑗 (x) = 𝜕 𝑝 𝜕𝑥 𝑗 (x) = 𝛽 𝑗 1 [0 ≤ x𝛽 ≤ 1] , where 1 [·] is the indicator function. Therefore, under the ramp model, the APE is 𝐴𝑃𝐸 𝑗 ≡ 𝐸 (cid:2)𝑃𝐸 𝑗 (x)(cid:3) = 𝛽 𝑗 𝑃 (0 ≤ x𝛽 ≤ 1) . (4.3) There are some simple but useful observations about (4.3). First, similar to the probit or logit cases, 𝐴𝑃𝐸 𝑗 always has the same sign as 𝛽 𝑗 . Second, because 𝑃 (0 ≤ x𝛽 ≤ 1) ≤ 1, (cid:12) (cid:12)𝛽 𝑗 with wide support for x𝛽, 𝐴𝑃𝐸 𝑗 can be much smaller in magnitude than 𝛽 𝑗 . Moreover, 𝐴𝑃𝐸 𝑗 = 𝛽 𝑗 (cid:12)𝐴𝑃𝐸 𝑗 (cid:12) ≤ (cid:12) (cid:12) (cid:12) (cid:12); if and only if 𝑃 (0 ≤ x𝛽 ≤ 1) = 1, which means the support of x𝛽 is inside the unit interval. This is essentially the condition used by Horrace and Oaxaca (2006) to conclude that the OLS estimator in 152 linear regression is unbiased and consistent for 𝛽. Our goal here is to compare the OLS estimators with the APEs in the general case where 𝑃 (0 ≤ x𝛽 ≤ 1) < 1; the Horrace and Oaxaca condition is then a special case where the index coefficient, 𝛽 𝑗 , is identical to 𝐴𝑃𝐸 𝑗 . In order to understand the behavior of the OLS estimator under a linear index model, it is important to introduce a third set of parameters: the linear projection parameters, denoted as 𝛾. Assume that the 𝑥 𝑗 have finite second moments and that the 𝐾 × 𝐾 matrix 𝐸 (x′x) is nonsingular. Then we can always define the 𝐾 × 1 vector 𝛾 as 𝛾 = [𝐸 (x′x)]−1 𝐸 (x′𝑦) . We then write the linear projection of 𝑦 on (1, 𝑥2, ..., 𝑥𝐾) as. 𝐿 (𝑦|x) = 𝐿 (𝑦|1, 𝑥2, ..., 𝑥𝐾) = 𝛾1 + 𝛾2𝑥2 + · · · + 𝛾𝐾𝑥𝐾 = x𝛾. In understanding the findings in Horrace and Oaxaca, and their limitations, it is important to know that, given the model of 4.2, 𝐴𝑃𝐸 𝑗 , 𝛽 𝑗 , and 𝛾 𝑗 are all well-defined parameters and, in general, they are all different. Defining 𝛽 and the APEs require an underlying model for the response probability whereas defining 𝛾 does not. 
In order to understand the behavior of the OLS estimator under a linear index model, it is important to introduce a third set of parameters: the linear projection parameters, denoted $\gamma$. Assume that the $x_j$ have finite second moments and that the $K\times K$ matrix $E(\mathbf{x}'\mathbf{x})$ is nonsingular. Then we can always define the $K\times1$ vector $\gamma$ as
$$\gamma = [E(\mathbf{x}'\mathbf{x})]^{-1}E(\mathbf{x}'y).$$
We then write the linear projection of $y$ on $(1,x_2,\dots,x_K)$ as
$$L(y|\mathbf{x}) = L(y|1,x_2,\dots,x_K) = \gamma_1+\gamma_2x_2+\cdots+\gamma_Kx_K = \mathbf{x}\gamma.$$
In understanding the findings in Horrace and Oaxaca, and their limitations, it is important to know that, given the model in (4.2), $APE_j$, $\beta_j$, and $\gamma_j$ are all well-defined parameters and, in general, they are all different. Defining $\beta$ and the APEs requires an underlying model for the response probability, whereas defining $\gamma$ does not.

As is well known, under random sampling the OLS estimator consistently estimates the parameters of the linear projection; see, for example, Wooldridge (2010, Chapter 4.2). In other words, if we run the OLS regression underlying LPM estimation, $y_i$ on $1,x_{i2},\dots,x_{iK}$, $i=1,\dots,N$, and obtain the $\widehat\gamma_j$, then $\widehat\gamma_j\overset{p}{\to}\gamma_j$. Again, this result holds free of any kind of underlying model. Under the ramp model, Horrace and Oaxaca study the consistency of the $\widehat\gamma_j$ when considered as estimators of $\beta_j$—the coefficients in the index. In other words, their asymptotic analysis is the same as comparing the linear projection parameters $\gamma_j$ to the index parameters $\beta_j$. Our view is that this usually does not make much sense—for the same reason, we do not study the consistency of the OLS estimator for the index parameters in, say, probit or logit. If one explicitly models the response probability as a nonlinear function of $\mathbf{x}\beta$, then one must recognize that nonlinearity when defining the parameters of interest. When interest is in the effects of the explanatory variables on the response probability—which describes almost all modern usages of the LPM—it only makes sense to compare the linear projection parameters to the APEs. In other words, we should ask: When is $\gamma_j$ "close" to $APE_j$? This is not the same as studying when $\gamma_j$ is "close" to $\beta_j$ (except in the special case where $P(0\le\mathbf{x}\beta\le1)=1$).

Under the ramp model we can write
$$E(y|\mathbf{x}) = p(\mathbf{x}) = 1[0\le\mathbf{x}\beta\le1]\,\mathbf{x}\beta + 1[\mathbf{x}\beta>1].$$
If $P(0\le\mathbf{x}\beta\le1)=1$ then, with probability one, $E(y|\mathbf{x}) = \mathbf{x}\beta = L(y|\mathbf{x})$, in which case $APE_j = \beta_j = \gamma_j$, and so the OLS estimators, $\widehat\gamma_j$, are consistent for $\beta_j$ and $APE_j$. If, for a random sample of size $N$, $\mathbf{x}_i\beta\in[0,1]$ for all $i$, then $E(y_i|\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_N) = \mathbf{x}_i\beta$, and it follows that the OLS estimators are conditionally unbiased for the $\beta_j$—the conclusion reached in Horrace and Oaxaca. If $P(0\le\mathbf{x}\beta\le1)<1$, then the $\beta_j$ measure the partial effects when $0\le\mathbf{x}\beta\le1$, but this restriction depends on the unknown parameter vector, and the $\beta_j$ need not be very useful as summary measures of the partial effects. In the next subsection, we discuss more generally when the linear projection parameters identify the APEs.

4.2.2 When Linear Projection Recovers the APEs

In addition to being easy to interpret, empirically, the OLS estimates of the LPM are often similar to the corresponding APEs from nonlinear index models—particularly logit or probit. Wooldridge, 2010, Section 15.6 provides a discussion based on a result of Stoker (1986) that helps one understand these empirical findings. Here we expand that discussion to allow for an extension to the ramp model.

As argued in Wooldridge, 2010, Section 15.6, the results of Stoker (1986) imply that, if $(x_2,\dots,x_K)$ has a multivariate normal distribution and $G(\cdot)$ is differentiable almost everywhere on $\mathbb{R}$ (with respect to Lebesgue measure), then
$$\gamma_j = \beta_j\,E[g(\mathbf{x}\beta)] = APE_j, \qquad j = 2,\dots,K,$$
where $\gamma_j$ is the slope coefficient on $x_j$ in $L(y|\mathbf{x}) = \mathbf{x}\gamma$ and $g(\cdot)$ is the almost-everywhere derivative of $G(\cdot)$. The ramp function $R(\cdot)$ is differentiable everywhere except at zero and one, and so it satisfies Stoker's (1986) assumptions. The result is that OLS consistently estimates $APE_j$, even though the $APE_j$ are attenuated versions of the $\beta_j$. This equality holds even when $P(0\le\mathbf{x}\beta\le1)$ is very close to zero.
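A quick way to see this result numerically is to simulate a ramp-model outcome with a normal covariate and compare the OLS slope with the APE. The sketch below is our own illustration (the coefficient values are arbitrary and are not from the designs in Section 4.4):

    import numpy as np

    rng = np.random.default_rng(42)
    n = 1_000_000
    b0, b1 = 0.5, 1.5                      # arbitrary index parameters

    x = rng.normal(size=n)                 # normal covariate: Stoker's condition holds
    index = b0 + b1 * x
    p = np.clip(index, 0.0, 1.0)           # ramp response probability R(x*beta)
    y = (rng.uniform(size=n) < p).astype(float)

    # true APE of x: beta_1 * P(0 <= x*beta <= 1)
    ape = b1 * np.mean((index >= 0) & (index <= 1))

    # OLS slope of the linear probability model (linear projection coefficient)
    X = np.column_stack([np.ones(n), x])
    gamma = np.linalg.lstsq(X, y, rcond=None)[0]

    print(f"APE = {ape:.4f}, OLS slope = {gamma[1]:.4f}")  # nearly identical

Both numbers settle near 0.39 in this design, while the index slope is 1.5, illustrating that OLS recovers the APE rather than $\beta_1$.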
Horrace and Oaxaca (2006), and many papers citing their findings, focus on the inconsistency of OLS for $\beta_j$, failing to recognize that the OLS estimators from the linear model could be consistent for the more interesting quantities, the $APE_j$. This point is key to our argument: if the model of the response probability is nonlinear so that $0\le p(\mathbf{x})\le1$ is ensured, one should study estimation of APEs, not underlying index parameters.

Other than the case of multivariate normality of $(x_2,\dots,x_K)$, there is another case where the linear projection parameters, $\gamma_j$, $j=2,\dots,K$, equal the APEs: $x_2,\dots,x_K$ are mutually exclusive binary indicators that, along with a base group given by $x_2=x_3=\cdots=x_K=0$, are exhaustive. See Angrist and Pischke, 2009, Section 3.1.4 and Wooldridge, 2010, Section 15.2. If $x_1=1$ denotes the base group, then the APEs are simply
$$APE_j = E(y|x_j=1) - E(y|x_1=1), \qquad j=2,\dots,K,$$
and these are identical to the corresponding LPM coefficients.

4.2.3 More General Cases

Clearly the assumption of multivariate normality of x is too restrictive to be widely applicable. Nevertheless, the results of Stoker (1986) are suggestive, especially when combined with Ruud (1983). Ruud studies smooth nonlinear functional forms that never hit the endpoints of the unit interval, like probit and logit. In these cases, quasi-MLE identifies the index coefficients up to scale.5 If x has a centrally symmetric distribution—of which the multivariate normal is a special case—then Ruud's (1983) conditions hold. In Section 4.4, we will find several cases where the covariates are symmetrically distributed (but not multivariate normal) and the APEs are still approximated well by the linear projection parameters.

Beyond the extreme cases described here, there appears to be no general theory to determine when the linear projection coefficients will be the same as, or "close" to, the APEs. Many empirical applications include a combination of continuous, discrete, and even mixed explanatory variables. Rarely do these all have symmetric marginal distributions, let alone a symmetric joint distribution. Moreover, such explanatory variables often appear in quadratics, interactions, and other functional forms—which also do not have symmetric distributions. In Section 4.4, we use simulations to shed light on when the LPM coefficients closely approximate the APEs—and when they do not. When evaluating the performance of the LPM as an approximation to Horrace and Oaxaca's ramp model, it makes sense to consider an estimator which takes such a model seriously. To that end, the next section describes such an estimator.

4.3 Asymptotically Valid Estimators of the Ramp Model

4.3.1 Nonlinear Least Squares Estimation

We have already seen that if $P(0\le\mathbf{x}\beta\le1)=1$, then OLS is consistent for the $\beta_j$, which are equal to the $APE_j$ in the case of a continuous covariate $x_j$ under model (4.1). If the probability that $\mathbf{x}\beta$ lies outside the unit interval is nonzero, then OLS is no longer consistent for the $\beta_j$, and it may or may not approximate the $APE_j$, depending on the distribution of x. In addition to probit and logit quasi-MLE, it makes sense to consider an estimator which is consistent if the ramp model is true. Of course, Bernoulli MLE in the usual fashion using the ramp model as the conditional response probability is not feasible because the log-likelihood is not necessarily defined for $\mathbf{x}\beta\notin(0,1)$.

5 Li et al. (2022) discuss this further and show that in the case of a single normal covariate, logit quasi-MLE identifies the APE, but probit quasi-MLE does not.
Instead, we consider nonlinear least squares (NLS) using the piecewise-linear ramp function $R(\mathbf{x}\beta)$ from (4.1) as the conditional mean. In addition, since there may not be much justification to think the ramp function is the true response probability, we allow for general misspecification. Therefore, we define $\beta_o$ as the pseudo-true value in the sense that $\beta_o$ is the unique solution to
$$\min_\beta E\big[(y-R(\mathbf{x}\beta))^2\big] \equiv \min_\beta Q(\beta). \tag{4.4}$$
We say that the model is misspecified if there is no $\beta$ such that $E[y|\mathbf{x}] = R(\mathbf{x}\beta)$. By construction, $\beta_o$ is the true coefficient when the model is correctly specified; otherwise, we view $R(\mathbf{x}\beta_o)$ as the best mean squared error approximation to $E[y|\mathbf{x}]$ over all ramp functions $R(\mathbf{x}\beta)$.

Assume a random sample indexed by $i=1,\dots,N$. As a sample analogue of (4.4), we define the objective function $Q_N(\beta)$ as
$$Q_N(\beta) \equiv \frac1N\sum_{i=1}^{N}\big(y_i - R(\mathbf{x}_i\beta)\big)^2 = \frac1N\sum_{i=1}^{N}\Big(y_i^2\,1\{\mathbf{x}_i\beta\le0\} + (y_i-\mathbf{x}_i\beta)^2\,1\{\mathbf{x}_i\beta\in(0,1)\} + (y_i-1)^2\,1\{\mathbf{x}_i\beta\ge1\}\Big),$$
where $N$ is the sample size. We define the NLS estimator $\widehat\beta$ as
$$\widehat\beta \equiv \operatorname*{argmin}_\beta\, Q_N(\beta).$$
The following theorem gives the consistency of the NLS estimator for the pseudo-true value, allowing for misspecification of the conditional mean model.

Theorem 4.1 Let $\{y_i,\mathbf{x}_i\}_{i=1}^{\infty}$ be an i.i.d. sequence with $y$ only taking on values zero and one, and let $R:\mathbb{R}\to[0,1]$ be the ramp function defined in (4.1). Suppose $\beta\in\mathcal{B}$ such that $\mathcal{B}\subset\mathbb{R}^K$ is compact, and $\beta_o$ is identified in the sense that for all $\beta\in\mathcal{B}$ with $\beta\ne\beta_o$,
$$E\big[(y_i - R(\mathbf{x}_i\beta_o))^2\big] < E\big[(y_i - R(\mathbf{x}_i\beta))^2\big].$$
Then, $\widehat\beta\overset{p}{\to}\beta_o$ as $N\to\infty$.

The consistency result of Theorem 4.1 follows directly from Theorem 12.2 of Wooldridge (2010). If x contains a continuously distributed $x_j$ and $\beta_{jo}$ is nonzero, then the probability that $\mathbf{x}_i\beta_o$ equals 0 or 1 is zero. Then, with suitable moment conditions on x (so the Leibniz integral rule applies), the first-order condition of (4.4) is well defined with probability one as follows:
$$E\big[\mathbf{x}'_iu_i\,1\{\mathbf{x}_i\beta_o\in(0,1)\}\big] = 0, \tag{4.5}$$
where $u_i(\beta) = y_i - R(\mathbf{x}_i\beta)$ and $u_i \equiv u_i(\beta_o)$. Define the score function for random draw $i$:
$$\mathbf{s}_i(\beta) = -\mathbf{x}'_i\,u_i(\beta)\,1\{\mathbf{x}_i\beta\in(0,1)\}.$$
Then, $\beta_o$ solves $E[\mathbf{s}_i(\beta_o)] = 0$. The variance-covariance matrix of $\mathbf{s}_i(\beta)$ is
$$\mathbf{\Omega}(\beta) = E\big[\mathbf{x}'_i\mathbf{x}_i\,u_i(\beta)^2\,1\{\mathbf{x}_i\beta\in(0,1)\}\big]. \tag{4.6}$$
The natural definition of the Jacobian of $\mathbf{s}_i(\beta)$ is
$$\mathbf{A}_i(\beta) = \mathbf{x}'_i\mathbf{x}_i\,1\{\mathbf{x}_i\beta\in(0,1)\}.$$
For a similar reason as for (4.5), the Hessian of $Q(\beta)$ is well defined with probability one at $\beta_o$ as follows:
$$\mathbf{A}(\beta_o) = E\big[\mathbf{x}'_i\mathbf{x}_i\,1\{\mathbf{x}_i\beta_o\in(0,1)\}\big]. \tag{4.7}$$
Note that (4.6) and (4.7) are the same whether the conditional mean model is correctly specified or not. Therefore, the following asymptotic distribution result allows for misspecification of the model.

Theorem 4.2 Suppose that the assumptions from Theorem 4.1 hold, and (i) $\beta_o$ is an interior point of $\mathcal{B}$; (ii) $\mathbf{x}_i$ contains a continuously distributed random variable with a nonzero coefficient; (iii) $E\|\mathbf{x}_i\|^2<\infty$ and $E\big[\mathbf{x}'_i\mathbf{x}_i1\{\mathbf{x}_i\beta_o\in(0,1)\}\big] > 0$, where $\|\cdot\|$ denotes the $\ell_2$-norm. Then, as $N\to\infty$,
$$\sqrt N\big(\widehat\beta-\beta_o\big)\overset{d}{\to}\mathcal N\big(0,\,\mathbf{A}(\beta_o)^{-1}\mathbf{\Omega}(\beta_o)\mathbf{A}(\beta_o)^{-1}\big).$$
The proof of Theorem 4.2 is given in Appendix 4A. The asymptotic normality result does not follow directly from standard M-estimator theory due to the non-smoothness of the objective function. We therefore leverage an asymptotic normality result for estimators with non-smooth objective functions from Newey and McFadden (1994).
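The estimator itself is straightforward to compute with a generic optimizer. The following sketch is our own illustration in Python (the chapter's simulations use Stata's "nl" command); it minimizes $Q_N(\beta)$ directly, with a derivative-free method since the objective has kinks:

    import numpy as np
    from scipy.optimize import minimize

    def ramp(z):
        # R(z) from (4.1): 0 below 0, the identity on (0, 1), and 1 above 1
        return np.clip(z, 0.0, 1.0)

    def nls_ramp(X, y):
        """Minimize Q_N(beta) = mean((y - R(X @ beta))^2), starting from OLS."""
        beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS starting values
        qn = lambda b: np.mean((y - ramp(X @ b)) ** 2)
        res = minimize(qn, beta_ols, method="Nelder-Mead",
                       options={"xatol": 1e-8, "fatol": 1e-10})
        return res.x

    # toy usage with a correctly specified ramp model
    rng = np.random.default_rng(1)
    n = 5000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta_true = np.array([0.5, 0.5])
    y = (rng.uniform(size=n) < ramp(X @ beta_true)).astype(float)
    print(nls_ramp(X, y))   # should be close to (0.5, 0.5)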
Taking the sample analogue of the asymptotic variance from Theorem 4.2, we define a variance estimator of $\sqrt N(\widehat\beta-\beta_o)$ as
$$\widehat{\mathbf V} = \mathbf{A}_N(\widehat\beta)^{-1}\mathbf{\Omega}_N(\widehat\beta)\mathbf{A}_N(\widehat\beta)^{-1},$$
where $\mathbf{A}_N(\widehat\beta) = N^{-1}\sum_{i=1}^{N}\mathbf{x}'_i\mathbf{x}_i1\{\mathbf{x}_i\widehat\beta\in(0,1)\}$, $\mathbf{\Omega}_N(\widehat\beta) = N^{-1}\sum_{i=1}^{N}\mathbf{x}'_i\mathbf{x}_i\widehat u_i^21\{\mathbf{x}_i\widehat\beta\in(0,1)\}$, and $\widehat u_i = y_i - R(\mathbf{x}_i\widehat\beta)$. Standard errors are obtained in the usual way from $\widehat{\mathbf V}/N$. The next theorem gives the consistency result for the variance estimator.

Theorem 4.3 Under the same assumptions as Theorem 4.2 and $E\|\mathbf{x}\|^4 < \infty$, as $N\to\infty$, $\widehat{\mathbf V}\overset{p}{\to}\mathbf{A}(\beta_o)^{-1}\mathbf{\Omega}(\beta_o)\mathbf{A}(\beta_o)^{-1}$.

The proof of Theorem 4.3 is given in Appendix 4A. As before, we are interested in the APE. Considering the best ramp approximation in (4.4), the APE of a continuous random variable $x_k$ is defined as
$$APE_k = E\left[\frac{\partial R(\mathbf{x}_i\beta_o)}{\partial x_k}\right] = \beta_{ko}\,P\big(\mathbf{x}_i\beta_o\in(0,1)\big).$$
A sample-analogue estimator of the APE is then given by
$$\widehat{APE}_k = \widehat\beta_k\,\frac1N\sum_{i=1}^{N}1\big\{\mathbf{x}_i\widehat\beta\in(0,1)\big\}.$$
Define $g(\mathbf{x}_i,\beta) = \beta_k1\{\mathbf{x}_i\beta\in(0,1)\}$, $\delta_o = E[g(\mathbf{x}_i,\beta_o)]$, and $\mathbf{G}_o = \nabla_\beta g(\mathbf{x}_i,\beta_o)$. Following Problem 12.17 of Wooldridge (2010), the asymptotic variance of the estimated APE is given by
$$\mathrm{Avar}\Big(\sqrt N\big(\widehat{APE}_k - APE_k\big)\Big) = \mathrm{Var}\Big(g(\mathbf{x}_i,\beta_o) - \delta_o - \mathbf{G}_o\mathbf{A}(\beta_o)^{-1}\mathbf{s}_i(\beta_o)\Big),$$
where $\mathbf{G}_o$ is a $1\times K$ vector with the $k$th element equal to $p_o \equiv P(\mathbf{x}_i\beta_o\in(0,1))$ and all other elements zero. The asymptotic variance can be estimated by the sample variance of $g(\mathbf{x}_i,\widehat\beta)-\widehat\delta-\widehat{\mathbf G}\mathbf{A}_N(\widehat\beta)^{-1}\mathbf{s}_i(\widehat\beta)$, where $\widehat\delta = \frac1N\sum_{i=1}^{N}g(\mathbf{x}_i,\widehat\beta)$ and $\widehat{\mathbf G}$ is a $1\times K$ vector with the $k$th element equal to $\widehat p = \frac1N\sum_{i=1}^{N}1\{\mathbf{x}_i\widehat\beta\in(0,1)\}$.

The APE for a discrete random variable $x_k$ can be defined as
$$APE_k = E\big[R(\mathbf{x}_{i,-k}\beta_{-k,o}+\beta_{ko}) - R(\mathbf{x}_{i,-k}\beta_{-k,o})\big].$$
A sample-analogue estimator of $APE_k$ is given by
$$\widehat{APE}_k = \frac1N\sum_{i=1}^{N}\Big(R(\mathbf{x}_{i,-k}\widehat\beta_{-k}+\widehat\beta_k) - R(\mathbf{x}_{i,-k}\widehat\beta_{-k})\Big).$$
The asymptotic variance can be found and estimated in a similar manner as in the continuous case.

4.3.2 Iterative Trimming OLS Estimation

To estimate $\beta_o$, Horrace and Oaxaca suggest running OLS on a trimmed sample (i.e., those observations for which the initial OLS fitted values are inside the unit interval) to reduce bias. We find in practice that a single round of trimming does not necessarily reduce the bias for the APEs in the cases where OLS is not consistent for them. However, we find that an iterative trimming OLS procedure (ITO) does reduce the bias for estimating the APEs, as well as $\beta_o$. The procedure goes: 1) estimate the LPM by OLS; 2) compute fitted values; 3) drop observations with fitted values outside the unit interval; and 4) repeat starting at 1) until no further observations are dropped (see the sketch following this list).
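A minimal version of the loop, assuming the same NumPy setup as the NLS sketch above (again our own illustration, not the Stata implementation used in the chapter):

    import numpy as np

    def ito(X, y, max_iter=100):
        """Iterative trimming OLS: re-run OLS on the observations whose current
        fitted values lie strictly inside the unit interval, until the kept set
        stabilizes."""
        keep = np.ones(len(y), dtype=bool)
        beta = np.linalg.lstsq(X, y, rcond=None)[0]          # initial full-sample OLS
        for _ in range(max_iter):
            fitted = X @ beta                                # evaluated on the full sample
            new_keep = (fitted > 0.0) & (fitted < 1.0)
            if np.array_equal(new_keep, keep):
                break                                        # no further observations dropped
            keep = new_keep
            beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        return beta

Note that the kept set is recomputed over the full sample at every iteration, so observations can re-enter if a later iterate moves their fitted values back inside the unit interval; this matches the Newton-Raphson interpretation derived next.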
In fact, we find in simulations that the NLS estimates are numerically the same as the ITO estimates up to machine precision.6 It turns out that ITO implicitly minimizes the NLS sample objective function using the OLS estimates as starting values and following the Newton-Raphson numerical method, which is iterative (see Wooldridge, 2010, Section 12.7.1). Given an estimate $\beta^{\{g\}}$, the next iteration is given (using our notation) by
$$\begin{aligned}
\beta^{\{g+1\}} &= \beta^{\{g\}} - \Bigg[N^{-1}\sum_{i=1}^{N}\mathbf{A}_i(\beta^{\{g\}})\Bigg]^{-1}N^{-1}\sum_{i=1}^{N}\mathbf{s}_i(\beta^{\{g\}})\\
&= \beta^{\{g\}} + \Bigg[N^{-1}\sum_{i=1}^{N}\mathbf{x}'_i\mathbf{x}_i1\big\{\mathbf{x}_i\beta^{\{g\}}\in(0,1)\big\}\Bigg]^{-1}N^{-1}\sum_{i=1}^{N}\big(y_i-\mathbf{x}_i\beta^{\{g\}}\big)\mathbf{x}'_i\,1\big\{\mathbf{x}_i\beta^{\{g\}}\in(0,1)\big\}\\
&= \Bigg[N^{-1}\sum_{i=1}^{N}\mathbf{x}'_i\mathbf{x}_i1\big\{\mathbf{x}_i\beta^{\{g\}}\in(0,1)\big\}\Bigg]^{-1}N^{-1}\sum_{i=1}^{N}\mathbf{x}'_iy_i1\big\{\mathbf{x}_i\beta^{\{g\}}\in(0,1)\big\}.
\end{aligned}$$
The second equality above substitutes our expressions for $\mathbf{s}_i(\beta)$ and $\mathbf{A}_i(\beta)$ and uses the fact that $R(\mathbf{x}_i\beta) = \mathbf{x}_i\beta$ for $\mathbf{x}_i\beta\in(0,1)$. This shows that $\beta^{\{g+1\}}$ is simply the OLS estimator on the sample with $\mathbf{x}_i\beta^{\{g\}}\in(0,1)$.

6 With some DGPs, it was occasionally necessary to specify OLS starting values for the NLS function evaluator for the NLS and ITO estimates to match to machine precision. The two were still otherwise very close.

As a consequence, the preceding consistency and asymptotic normality results for the NLS estimator justify using the ITO procedure to reduce the OLS bias. However, it is worth mentioning that, at least in Stata, the pre-loaded NLS solver (the "nl" command) may have a performance advantage over ITO in practice. We find in simulations that ITO can fail to terminate when only a very small portion of observations is left for estimation after iterative trimming. The pre-loaded NLS algorithm continues to work well in those cases.

4.4 Simulations

In this section, we present several Monte Carlo simulations that provide insights into the behavior of the different modeling and estimation approaches. The LPM is estimated by OLS, and the ramp function is estimated by NLS. For the LPM, the APE estimates come directly from the linear projection (e.g., the estimated slope coefficient for a non-interacted variable). For the ramp model, the APEs are estimated using averages of derivatives and differences of the ramp function, as discussed in Section 4.2. These resemble the familiar formulas for the linear model, though the individual unit partial effects need to be scaled by $1\big[0\le\mathbf{x}\widehat\beta_{NLS}\le1\big]$ before averaging, where $\widehat\beta_{NLS}$ corresponds to the NLS slope estimate. The logit and probit parameters are estimated by the (quasi-) maximum likelihood estimator, and then the APEs are estimated using the usual APE formulas. We used Stata® 17 for the simulations.7

7 The Stata code is available via the repository https://kaichengchen.github.io/lpm_simulation_post.rar

To better evaluate the findings from Horrace and Oaxaca (2006), we generate the responses to follow the ramp model for their true conditional probabilities. We also show that our main arguments hold when the true responses are probit. It is useful to observe that the response probability can be derived from a latent variable formulation:
$$y^* = \mathbf{x}\beta - u, \tag{4.8}$$
$$y = 1[y^* > 0]. \tag{4.9}$$
For the ramp model in (4.1), suppose that
$$u|\mathbf{x}\sim\mathrm{Uniform}(0,1). \tag{4.10}$$
Under (4.10), the CDF of $u$ is identical to the ramp function $R(\cdot)$; it follows immediately that (4.8), (4.9), and (4.10) lead to the response probability in (4.1). In Appendix 4B, we show an extension of the above model where $u$ has variable support, which is another way to represent the role of the unit interval bounds for response probabilities.
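The latent-variable representation translates directly into simulation code. A minimal sketch, assuming the same Python setup as the earlier sketches (the chapter's simulations themselves are in Stata):

    import numpy as np

    def simulate_ramp_dgp(n, beta, rng):
        """Generate (y, X) from y = 1[x*beta - u > 0] with u ~ Uniform(0, 1),
        so that P(y = 1 | x) = R(x*beta) as in (4.1)."""
        x = rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        u = rng.uniform(size=n)                 # CDF of u is the ramp function
        y = (X @ beta - u > 0).astype(float)
        return y, X

    rng = np.random.default_rng(7)
    y, X = simulate_ramp_dgp(100_000, np.array([0.5, 0.25]), rng)
    print(y.mean())   # empirical P(y = 1)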
Initially, the true models take the form (we drop the $o$ subscript on $\beta$ here)

$$y = 1[\beta_0 + \beta_1 x_1 + \beta_2 x_2 - u > 0].$$

For a given choice of $(\beta_0, \beta_1, \beta_2) = (b_0, b_1, b_2)$, we can scale $(\beta_1, \beta_2)$ by a positive constant $c$, $(\beta_0, \beta_1, \beta_2) = (b_0, cb_1, cb_2)$, to govern how close to linear the response probability is. When $u \sim \text{Uniform}(0,1)$, the ramp model is correctly specified, but the LPM is misspecified to varying degrees. For given initial values $(b_0, b_1, b_2)$, a larger scaling factor $c$ makes the kinks in the ramp function more likely to be binding, and the LPM can then provide a poor approximation to the response probability. Naturally, the logit and probit models are always misspecified in this case. As stated before, we focus here on the APEs rather than on the underlying parameters or on how well the models approximate the true response probability.

The sample size is $N = 1{,}000$ and 10,000 replications are used. The population (or true) APEs are not available in closed form, so we simulate them along with the estimators. In the tables that follow, the columns labeled "Simulated Truth" contain the empirical means and standard deviations of the sample APEs evaluated at the true parameter values. We also simulate the probabilities $P(y = 1)$ and $P(0 \le x\beta \le 1)$, where the first quantity is the (Monte Carlo) population response probability and the second tells us how binding the kinks of the ramp function are. We also simulate the fraction of OLS fitted values in the unit interval, $P(0 \le x\hat{\beta}_{OLS} \le 1)$. This is practically relevant because researchers often check the fraction of fitted values outside the unit interval to judge the adequacy of the LPM.

4.4.1 Symmetrically Distributed Explanatory Variables

In the first design, $(x_1, x_2)$ are generated as

$$x_1 = v/(2\sqrt{2}) + e/(2\sqrt{2}), \qquad x_2 = 1[v + r > 0],$$

where $v$, $e$, and $r$ are independent standard normals. The initial choice of parameters is $(b_0, b_1, b_2) = (1/2, 1/4, 1/4)$.

Table 4.4.1 reports the findings when $c = 1$. There is a small probability that $x\beta \notin [0,1]$, roughly 0.021. Moreover, across all simulations, about 2.0% of the OLS fitted values are outside the unit interval. The pattern is clear: all of the estimators of the APEs show very little bias and have essentially the same precision. This is true for the continuous variable, $x_1$, and the binary variable, $x_2$. Note that this is not predicted by an application of the Stoker results, because $x_2$ is a discrete variable.8 Nevertheless, this table illustrates what is often observed in practice: the LPM coefficients estimated by OLS are often close to the probit and logit APEs.

8 Admittedly, when the LPM is used for approximation, the bias for APE2 is slightly larger compared with APE1, but it is still reasonably small and comparable to that of the probit and logit approximations.

Table 4.4.1. $u \sim$ Uniform(0,1), $x_1$ normal, $x_2$ binary; $c = 1$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.2448      0.2444   0.2450   0.2483   0.2452
      sd      0.0011      0.0288   0.0292   0.0287   0.0290
APE2  mean    0.2489      0.2506   0.2493   0.2454   0.2449
      sd      0.0003      0.0325   0.0326   0.0323   0.0324

$P(y=1) = 0.6238$, $P(0 \le x\beta \le 1) = 0.9792$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9806$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9774$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.
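For concreteness, one replication of this first design can be generated as follows; this is a minimal sketch (the seed is arbitrary), with $u$ uniform so that the ramp model is the true response probability.

```stata
* Sketch of one replication of the symmetric design with c = 1.
clear
set obs 1000
set seed 12345
generate v  = rnormal()
generate e  = rnormal()
generate r  = rnormal()
generate x1 = v/(2*sqrt(2)) + e/(2*sqrt(2))
generate byte x2 = (v + r > 0)
generate byte y  = (0.5 + 0.25*x1 + 0.25*x2 - runiform() > 0)
```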
The story does not change when the constraints of the ramp function are strongly binding. In Table 4.4.2, we scale the initial coefficients by $c = 2$; now $P(0 \le x\beta \le 1)$ is only about 0.67, and about 28% of the OLS fitted values are outside $[0,1]$. And yet, for estimating the APEs, the LPM does essentially as well as probit and logit, with the bias being slightly larger for APE2. This delivers the first argument: having a large fraction of observations with fitted values inside $[0,1]$ is not a necessary condition for the OLS estimator to produce a good estimate of the APE.

Table 4.4.2. $u \sim$ Uniform(0,1), $x_1$ normal, $x_2$ binary; $c = 2$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.3219      0.3200   0.3221   0.3242   0.3220
      sd      0.0075      0.0237   0.0237   0.0226   0.0236
APE2  mean    0.4003      0.4186   0.4006   0.4051   0.4036
      sd      0.0044      0.0270   0.0274   0.0263   0.0270

$P(y=1) = 0.6769$, $P(0 \le x\beta \le 1) = 0.6738$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.8155$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.6406$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

Table 4.4.3 shows the case where $P(0 \le x\beta \le 1)$ is very close to one (the consistency result for the OLS estimator of the index coefficients in Horrace and Oaxaca (2006) applies when $P(0 \le x\beta \le 1)$ is exactly one). We would expect the LPM to work very well in this case, and it does. What is perhaps more surprising is that probit and logit work just as well, even though the true response probability is largely linear over the support of $x\beta$. These findings are a good reminder of why statements such as "the linear probability model is preferred to probit because the latter assumes normality" are not just misleading: they are wrong. In the end, what we care about is how well each approach approximates the partial effects on $P(y = 1|x)$. When we consider the APEs, all methods do well, even when the response probability has the peculiar ramp shape.

Table 4.4.3. $u \sim$ Uniform(0,1), $x_1$ normal, $x_2$ binary; $c = 0.75$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.1874      0.1873   0.1875   0.1886   0.1877
      sd      0.0001      0.0312   0.0314   0.0313   0.0313
APE2  mean    0.1875      0.1881   0.1880   0.1859   0.1856
      sd      0.0000      0.0334   0.0334   0.0333   0.0334

$P(y=1) = 0.5937$, $P(0 \le x\beta \le 1) = 0.9996$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9991$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9990$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.
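For reference, the probit and logit columns in these tables can be produced with standard commands; a minimal sketch with hypothetical variable names:

```stata
* Probit QMLE and its APEs via margins; logit is analogous.
* Factor-variable notation treats x2 as discrete for the APE.
probit y c.x1 i.x2, vce(robust)
margins, dydx(*)
logit y c.x1 i.x2, vce(robust)
margins, dydx(*)
```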
We next consider the response probability resulting from (4.8), (4.9), and (4.11), under which only the probit model is correctly specified. Tables 4.4.4 and 4.4.5 maintain the same true index slopes as Tables 4.4.1 and 4.4.2, but due to the scaling from the standard normal PDF, the true APEs are lower. Nevertheless, a similar pattern emerges. Table 4.4.4 can be compared with Table 4.4.1, where $P(0 \le x\beta \le 1)$ is close to one. In this case, OLS does an even better job of fitting the response probability inside the unit interval and, not surprisingly, the LPM estimated by OLS performs just as well as the correctly specified probit model in producing estimated APEs. As $c$ increases from 1 to 2 in Table 4.4.5, $P(0 \le x\beta \le 1)$ drops to 0.64. Due to the better-behaved Gaussian error, a large fraction of OLS fitted values still lie within $[0,1]$, and there is not much difference across methods. To better compare with the true APEs of Table 4.4.1, we increase $c$ even further in Table 4.4.6. In this case, the support of the linear index becomes quite wide, and $P(0 \le x\beta \le 1)$ is as small as 0.37. Correspondingly, only 86% of observations have OLS fitted values within $[0,1]$. However, OLS still produces estimates of the APEs as good as those produced by the nonlinear methods. Not surprisingly, probit and logit QMLE have low bias, while NLS of the ramp model has slightly higher bias in the cases (e.g., Table 4.4.6) where the true APEs are larger.

Table 4.4.4. $u \sim N(0,1)$, $x_1$ normal, $x_2$ binary; $c = 1$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.0810      0.0814   0.0814   0.0814   0.0814
      sd      0.0003      0.0302   0.0303   0.0302   0.0302
APE2  mean    0.0815      0.0817   0.0817   0.0815   0.0816
      sd      0.0002      0.0306   0.0306   0.0306   0.0306

$P(y=1) = 0.7296$, $P(0 \le x\beta \le 1) = 0.9793$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9999$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9999$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

We also generated the outcome $y$ using an interaction between $x_1$ and $x_2$, with $u$ having a uniform distribution. Specifically,

$$y = 1[\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \cdot x_2) - u > 0],$$

with initial parameters $(b_0, b_1, b_2, b_3) = (1/2, 1/4, 1/4, 1/8)$ and scaled parameters $(\beta_0, \beta_1, \beta_2, \beta_3) = (b_0, cb_1, cb_2, cb_3)$.

Table 4.4.5. $u \sim N(0,1)$, $x_1$ normal, $x_2$ binary; $c = 2$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.1448      0.1450   0.1484   0.1451   0.1450
      sd      0.0013      0.0281   0.0305   0.0280   0.0281
APE2  mean    0.1478      0.1488   0.1484   0.1480   0.1483
      sd      0.0008      0.0288   0.0288   0.0287   0.0288

$P(y=1) = 0.7550$, $P(0 \le x\beta \le 1) = 0.6438$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9890$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9845$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

Tables 4.4.7 and 4.4.8 display simulation results with uniformly distributed $u$ and normally distributed $u$, respectively. The scaling factor $c$ is set to 2 to focus on scenarios with small $P(0 \le x\beta \le 1)$ and potentially small $P(0 \le x\hat{\beta}_{OLS} \le 1)$. Recall that both $x_1$ and $x_2$ have symmetric distributions, but this functional form falls outside Stoker's results because $x_2$ is discrete, and so is $x_1 \cdot x_2$: it has a mass point at zero and is otherwise continuously distributed. However, the four approaches, with the interaction term included in estimation, delivered similar estimated APEs that were close to the simulated "true" APEs (as before, the probit, logit, and LPM approaches use a misspecified response probability).

Table 4.4.6. $u \sim N(0,1)$, $x_1$ normal, $x_2$ binary; $c = 4$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.2296      0.2295   0.2368   0.2298   0.2296
      sd      0.0043      0.0253   0.0257   0.0247   0.0249
APE2  mean    0.2375      0.2396   0.2420   0.2375   0.2393
      sd      0.0029      0.0257   0.0279   0.0255   0.0256

$P(y=1) = 0.7733$, $P(0 \le x\beta \le 1) = 0.3725$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.8605$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.7028$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

Table 4.4.7. $u \sim$ Uniform(0,1), $x_1$ normal, $x_2$ binary; $c = 2$; with interaction ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.3634      0.3606   0.3638   0.3664   0.3641
      sd      0.0089      0.0245   0.0249   0.0241   0.0246
APE2  mean    0.3509      0.3777   0.3512   0.3471   0.3456
      sd      0.0040      0.0281   0.0289   0.0275   0.0278

$P(y=1) = 0.6645$, $P(0 \le x\beta \le 1) = 0.6436$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.8554$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.6403$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.
Table 4.4.8. $u \sim N(0,1)$, $x_1$ normal, $x_2$ binary; $c = 2$; with interaction ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.1685      0.1684   0.1717   0.1689   0.1689
      sd      0.0013      0.0280   0.0295   0.0279   0.0280
APE2  mean    0.1393      0.1427   0.1418   0.1392   0.1384
      sd      0.0005      0.0290   0.0292   0.0292   0.0293

$P(y=1) = 0.7566$, $P(0 \le x\beta \le 1) = 0.6437$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9842$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9728$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

4.4.2 Asymmetrically Distributed Explanatory Variables

The story changes markedly when the distributions of $x_1$ and $x_2$ are asymmetric. With $v$, $e$, and $r$ generated as before, $x_1$ and $x_2$ are now generated as

$$x_1 = \exp\left(-1/4 + v/(2\sqrt{2}) + e/(2\sqrt{2})\right), \qquad x_2 = 1[-1/4 + v + e > 0],$$

so that $x_1$ has a lognormal distribution. The variable $x_2$ is still binary, but $P(x_2 = 1)$ is below 0.5. The unscaled parameter values are, again, $(b_0, b_1, b_2) = (1/2, 1/4, 1/4)$.
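A minimal sketch of this asymmetric design (one replication, hypothetical variable names) follows; only the construction of $x_1$ and $x_2$ changes relative to the earlier sketch.

```stata
* Sketch of the asymmetric design: lognormal x1 and a binary x2 with
* success probability below one half; c = 1 here.
clear
set obs 1000
generate v  = rnormal()
generate e  = rnormal()
generate x1 = exp(-0.25 + v/(2*sqrt(2)) + e/(2*sqrt(2)))
generate byte x2 = (-0.25 + v + e > 0)
generate byte y  = (0.5 + 0.25*x1 + 0.25*x2 - runiform() > 0)
```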
Table 4.5.1 repeats the same experiment as Table 4.4.1, with scaling factor $c = 1$, except that the explanatory variables are now asymmetrically distributed. We observe that the OLS-estimated APE for $x_1$ under the LPM is severely biased. The misspecified probit and logit models estimated by QMLE appear to be slightly biased, too. The ramp model is correctly specified and, as predicted by the asymptotic properties given in Section 4.3, the NLS estimator continues to perform well. The relative biases of probit and logit are also higher than in the previous tables, but not to as high a degree as the LPM.

Table 4.5.1. $u \sim$ Uniform(0,1), $x_1$ lognormal, $x_2$ asym. binary; $c = 1$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.1975      0.1299   0.1988   0.2225   0.2203
      sd      0.0032      0.0220   0.0350   0.0361   0.0383
APE2  mean    0.2203      0.2354   0.2211   0.2291   0.2298
      sd      0.0020      0.0226   0.0233   0.0226   0.0230

$P(y=1) = 0.8024$, $P(0 \le x\beta \le 1) = 0.7900$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9011$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.7857$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

The findings in Table 4.5.2 are striking. Even though $P(0 \le x\beta \le 1)$ is high (around 0.95) and the OLS fitted values are very rarely outside the unit interval (only about 3.3 percent of the time), LPM/OLS is badly biased for the APEs and notably worse than the other methods. This goes against the conventional wisdom of checking the proportion of fitted values within $[0,1]$ and confirms the second argument: having a large fraction of OLS fitted values within the unit interval is not sufficient. Among the other estimators, Ramp/NLS has a smaller bias for both APEs, while probit and logit appear to have small bias for the discrete APE but higher bias for the continuous APE. With respect to the performance of the LPM, the results with a normally distributed $u$ and with an interaction term are similar and so are skipped for brevity.

Table 4.5.2. $u \sim$ Uniform(0,1), $x_1$ lognormal, $x_2$ asym. binary; $c = 0.75$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.1776      0.1486   0.1796   0.2110   0.2111
      sd      0.0013      0.0248   0.0345   0.0358   0.0378
APE2  mean    0.1828      0.1910   0.1829   0.1835   0.1835
      sd      0.0007      0.0278   0.0282   0.0281   0.0285

$P(y=1) = 0.7413$, $P(0 \le x\beta \le 1) = 0.9471$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9671$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9446$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

4.5 Mortgage Approval Probabilities

As an illustration of linear and nonlinear estimators for binary response models, we revisit the analysis of mortgage lending decisions from Hunter and Walker (1996).9 We compare linear and nonlinear estimates of the average effect of being white on the probability of loan approval, holding constant a number of loan, property, and borrower characteristics. Table 4.5.1 presents basic summary statistics for the dependent variable "approve" and 23 covariates.

9 We use a version of the loan applications dataset provided by Mary Beth Walker for Wooldridge (2019).

Table 4.5.1: Loan Approval Summary Statistics ($N = 1989$)

Variable   Description
approve    =1 if loan approved
white      =1 if white
loanamt    Loan amount, $1000s
suffolk    =1 if in Suffolk County
appinc     Applicant income, $1000s
unit       Number of units in property
married    =1 if applicant married
dep        Number of dependents
emp        Years employed in line of work
yjob       Years at this job
atotinc    Total monthly income
self       =1 if self-employed
other      Other financing, $1000s
rep        Number of credit reports
pubrec     =1 if filed bankruptcy
hrat       Housing expense, % of total income
obrat      Other obligations, % of total income
cosign     =1 if there is a cosigner
sch        =1 if > 12 years schooling
mortno     =1 if no mortgage history
mortlat1   =1 if one or two late payments
mortlat2   =1 if more than two late payments
chist      =0 if accounts are delinquent 60 or more days
loanprc    Loan amount / purchase price

The table reports the mean, standard deviation, skewness, and kurtosis of each variable; for example, 88% of the sampled loans were approved and 85% of applicants were white.

For our index model, we include interactions between "white" and all other explanatory variables to allow factors like loan amount and credit history to have a differential impact on approval probability by group. Let $w$ denote "white" and $z$ be a vector of the 22 other covariates, so that $x = (1, z, w, wz)$ and $\beta = (\beta_0, \beta_z, \beta_w, \beta_{wz})$, where $\beta_0$ is the intercept, $\beta_z$ and $\beta_w$ are the coefficients on $z$ and $w$, respectively, and $\beta_{wz}$ is the vector of coefficients on $wz$. The partial effects we average are formed as the difference between the probabilities evaluated at $w = 1$ and $w = 0$:

$$APE_w = E\left[G(\beta_0 + \beta_w + z(\beta_z + \beta_{wz})) - G(\beta_0 + z\beta_z)\right],$$

where $G(\cdot)$ is the identity function (for the LPM), the probit CDF, the logit CDF, or the ramp function.

Table 4.5.2 presents the results. Using the LPM estimated by OLS, about 18% of observations have predicted probabilities outside the unit interval.10 The Horrace and Oaxaca results then clearly imply that OLS is inconsistent for the slope parameters if the ramp model is correct. There is little reason to expect the LPM to approximate this APE either, based on the theoretical results of Stoker (1986) or our simulation study.

10 Within this 18%, 98% of observations had predicted values greater than 1.
Many of the explanatory variables are binary, and the continuous variables (e.g., income) tend to be skewed. For each variable, normality is strongly rejected by a Jarque-Bera test (a joint test of skewness and kurtosis), with p-values well below 1%. The model also includes interactions between the continuous variables and a binary variable. Using the LPM estimates, the APE of white is 5.3 percentage points, and it is only marginally significant. Using the nonlinear estimators, the APEs are each a bit larger, at about 7.0 percentage points, and all are significant at the 1% level.

Table 4.5.2: Estimates of the APE of "White" on Loan Approval

                       LPM      Ramp     Probit   Logit
                       (OLS)    (NLS)    (QMLE)   (QMLE)
Estimate               0.0532   0.0706   0.0695   0.0712
Robust SE              0.0278   0.0227   0.0220   0.0219
Mean Squared Error     0.1171   0.0839   0.0840   0.0837

Notes: There were only 1976 complete cases out of 1989 total observations. All robust standard errors were computed using the sandwich form and the delta method. The fraction of predicted linear indexes within the unit interval is 0.8173 by OLS and 0.6027 by NLS.
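For reference, the logit column can be sketched with standard commands; the covariate list below is abbreviated and hypothetical, whereas the model in the text interacts white with all 22 other covariates.

```stata
* Hypothetical sketch of the logit APE of white with interactions;
* margins computes the average discrete change in P(approve = 1).
logit approve i.white##(c.loanamt c.appinc c.obrat i.married), vce(robust)
margins, dydx(white)
```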
Interestingly, OLS predicts only 18% of observations with indexes outside the unit interval, whereas NLS predicts nearly 40%, which follows the pattern of many of our simulations from the previous section and suggests that trimming the sample once is not sufficient to consistently estimate the parameters or the APEs under the piecewise linear model.11 Of this 40%, 99% had NLS-predicted linear indexes greater than 1, and most had high predicted probabilities of approval regardless of the model or counterfactual race.12 Model selection by the minimum mean squared error favors logit, though the other nonlinear models are very similar.

11 The reason we report the fraction of NLS-predicted linear indexes outside 0 and 1 here is to illustrate what proportion of observations would have been trimmed by the iterative trimming OLS procedure. We note that this quantity is not of essential interest, just as the linear indexes in probit and logit models are not.

12 NLS drops these observations because they have predicted indexes outside the unit interval, not necessarily because they have high leverage. In fact, under the logit model, the average Pregibon (1981) leverage statistic for the 40% ("predict lev, hat" in Stata following logit estimation) was lower (0.008) than the average for the included observations (0.033).

4.6 Implications for Empirical Research

We have revisited the conclusions reached by Horrace and Oaxaca (2006) concerning the ability of the linear projection parameters, consistently estimated by OLS, to recover interesting parameters. We argue that Horrace and Oaxaca's focus on the parameters of the underlying index model is misguided; instead, one should focus on the APEs. Focusing on the APEs is hardly controversial, as almost every modern study that employs a model nonlinear in the explanatory variables reports estimated APEs.

Once the focus is on the APEs, a few useful conclusions emerge. First, having a high fraction of estimated response probabilities in $[0,1]$ is neither necessary nor sufficient for good performance of the LPM. Notably, when the explanatory variables have a multivariate normal distribution, the linear projection parameters are identical to the population APEs under a general index model, and this is true even when the flat parts of the ramp function occur with high probability, i.e., when $P(0 \le x\beta \le 1)$ is small. In this case, the linear projection parameters, $\gamma_j$, will be greatly attenuated toward zero compared with the index parameters, $\beta_j$. We find that OLS estimation of the LPM continues to have good finite-sample properties for the APEs in many cases when the covariates are symmetrically distributed. When the explanatory variables have asymmetric distributions, however, the conclusions for the LPM are not as sanguine, unless the support of $x\beta$ is contained entirely in the interval $[0,1]$. Some simulations show that even when $x\beta$ falls in the unit interval with high probability and roughly 97% of the OLS fitted values are inside $[0,1]$ (Table 4.5.2), the linear projection parameters are not very close to the true APEs.

For the DGPs we study, we also find that the logit and probit models, estimated by quasi-MLE (because the response probabilities are misspecified), tend to approximate the APEs very well. Especially when the support of $x\beta$ is wide relative to $[0,1]$, the logit and probit approximations to the APEs may be notably better than those of the LPM when the covariates have asymmetric distributions, though this is not guaranteed. Although the ramp function may not be particularly realistic as a model for the response probability, we have shown that NLS estimation based on it is consistent (for the best MSE approximation to the true response probability) and asymptotically normal. A nonlinear model, of course, offers other advantages over the LPM, such as more realistic response probabilities and nonconstant partial effects. Especially given the ease of modern computation, an implication of our simulation findings is that researchers should generally try a nonlinear estimator, as it may be more robust to covariate asymmetry and variance than OLS estimation of the LPM.

To summarize: in evaluating different strategies, one needs to carefully define the population quantities of interest and then make proper comparisons across approaches. We find that probit, logit, and the ramp model have the best finite-sample properties for estimating the APEs across the DGPs we study. However, when the APEs are of interest, we also find that the LPM is more widely applicable than a simple reading of Horrace and Oaxaca might suggest.

The conclusions drawn here are easily extended to the case where $y$ is a fractional response, where the limit values zero and one can occur with positive probability. In particular, the results of Stoker (1986) apply to $E(y|x)$. If this conditional mean follows the same ramp function, the qualitative conclusions obtained in the binary case remain.

BIBLIOGRAPHY

Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

Horowitz, J. L. and Savin, N. (2001). Binary response models: Logits, probits and semiparametrics. Journal of Economic Perspectives, 15(4):43–56.

Horrace, W. C. and Oaxaca, R. L. (2006). Results on the bias and inconsistency of ordinary least squares for the linear probability model. Economics Letters, 90(3):321–327.

Hunter, W. C. and Walker, M. B. (1996). The cultural affinity hypothesis and mortgage lending decisions. Journal of Real Estate Finance and Economics, 13:57–70.

Li, C., Poskitt, D. S., Windmeijer, F., and Zhao, X. (2022). Binary outcomes, OLS, 2SLS and IV probit. Econometric Reviews, 41(8):859–876.
Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2113–2245.

Pregibon, D. (1981). Logistic regression diagnostics. The Annals of Statistics, 9(4):705–724.

Ruud, P. A. (1983). Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models. Econometrica, 51(1):225–228.

Stoker, T. M. (1986). Consistent estimation of scaled coefficients. Econometrica, 54(6):1461–1481.

van den Berg, G. J. and Siflinger, B. M. (2022). The effects of a daycare reform on health in childhood: Evidence from Sweden. Journal of Health Economics, 81:102577.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.

Wooldridge, J. M. (2019). Introductory Econometrics: A Modern Approach. Cengage Learning.

APPENDIX 4A
PROOFS FOR CHAPTER 4

Proof of Theorem 4.2: We obtain the asymptotic normality of the NLS estimator by applying Theorem 7.1 of Newey and McFadden (1994). Conditions (i) and (ii) of Theorem 7.1 follow from our assumptions. As discussed in the main text, condition (iii) is satisfied as long as $x$ contains a continuous variable $x_j$ with nonzero $\beta_{jo}$, so that $P(x_i\beta_o = 0 \text{ or } x_i\beta_o = 1) = 0$. For condition (iv), notice that the first derivative of the objective function is well defined at $\beta_o$ with probability 1:

$$D_N(\beta_o) = \nabla_\beta Q_N(\beta_o) = \frac{1}{N}\sum_{i=1}^{N} x_i'(y_i - x_i\beta_o)1\{x_i\beta_o \in (0,1)\} = \frac{1}{N}\sum_{i=1}^{N} x_i' u_i 1\{x_i\beta_o \in (0,1)\},$$

where $u_i = y_i - R(x_i\beta_o)$. Since $E\left[\|x_i' u_i\| 1\{x_i\beta_o \in (0,1)\}\right] < \infty$ under the assumption $E\|x_i\|^2 < \infty$, the vector Lindeberg-Levy CLT applies:

$$\sqrt{N} D_N(\beta_o) \overset{d}{\to} N(0, \Omega(\beta_o)),$$

giving condition (iv). Lastly, for condition (v), following Newey and McFadden (1994), we can rewrite

$$\sqrt{N}\left[Q_N(\beta) - Q_N(\beta_o)\right] = \sqrt{N}\left[D_N(\beta_o)(\beta - \beta_o) + Q(\beta) - Q(\beta_o)\right] + \|\beta - \beta_o\| M_N(\beta),$$

where $M_N(\beta)$ is the remainder term, defined as

$$M_N(\beta) = \frac{\sqrt{N}\left[Q_N(\beta) - Q_N(\beta_o) - D_N(\beta_o)'(\beta - \beta_o) - (Q(\beta) - Q(\beta_o))\right]}{\|\beta - \beta_o\|}.$$

Let $U_N$ be a neighborhood of $\beta_o$: $U_N = \{\beta \in \mathcal{B} : \|\beta - \beta_o\| < \varepsilon_N\}$, where $\varepsilon_N \to 0$, and consider any $\beta \in U_N$. Since $D_N(\beta_o)$ is the gradient of $Q_N(\beta)$ at $\beta_o$, $Q_N(\beta) - Q_N(\beta_o) - D_N(\beta_o)(\beta - \beta_o)$ goes to zero faster than $\|\beta - \beta_o\|$ as $\beta$ goes to $\beta_o$, by the definition of the gradient. Similarly, because $\nabla_\beta Q(\beta_o) = E(s_i(\beta_o)) = 0$, $Q(\beta) - Q(\beta_o)$ goes to zero faster than $\|\beta - \beta_o\|$ as $\beta$ goes to $\beta_o$. Under the moment conditions, we can easily show that $Q_N(\beta) - Q(\beta) \to 0$ in probability for each $\beta$, so $\sqrt{N}[Q_N(\beta) - Q(\beta)]$ is bounded in probability for each $\beta$. Also note that $\sqrt{N} D_N(\beta_o)$ is bounded in probability due to asymptotic normality. Since the numerator is bounded in probability and converges to zero faster than the denominator, we conclude that $\sup_{\beta \in U_N} |M_N(\beta)| \to 0$ in probability as $N \to \infty$, which implies condition (v).

Proof of Theorem 4.3: Consider $\Omega_N(\hat{\beta})$:

$$\Omega_N(\hat{\beta}) = \frac{1}{N}\sum_{i=1}^{N} x_i' x_i \left(y_i - R(x_i\hat{\beta})\right)^2 1\{x_i\hat{\beta} \in (0,1)\} \equiv \frac{1}{N}\sum_{i=1}^{N} a(x_i, \hat{\beta}).$$

Note that $E|y_i - R(x_i\beta)|^4 \le 1$ for any $\beta \in \mathcal{B}$, since both $y_i$ and $R(\cdot)$ are naturally bounded in $[0,1]$ with probability 1. Then we have

$$E \sup_{\beta \in \mathcal{B}} \|a(x_i, \beta)\| \le \left(E\|x_i\|^4\, E|y_i - R(x_i\beta)|^4\right)^{1/2} < \infty,$$

where the first inequality follows from Hölder's inequality. Also note that $a(x_i, \beta)$ is continuous at $\beta_o$ with probability one, given that $P(x_i\beta_o = 0) = P(x_i\beta_o = 1) = 0$. Then we can apply Lemma 4.3 of Newey and McFadden (1994):

$$\Omega_N(\hat{\beta}) = \frac{1}{N}\sum_{i=1}^{N} a(x_i, \hat{\beta}) \overset{p}{\to} E\left[a(x_i, \beta_o)\right] = \Omega(\beta_o).$$
Similarly, Lemma 4.3 also applies to $A_N(\hat{\beta}) = \frac{1}{N}\sum_{i=1}^{N} x_i' x_i 1\{x_i\hat{\beta} \in (0,1)\}$:

$$\operatorname*{plim}_{N \to \infty} \frac{1}{N}\sum_{i=1}^{N} x_i' x_i 1\{x_i\hat{\beta} \in (0,1)\} = A(\beta_o).$$

So we conclude that

$$\hat{V} = A_N(\hat{\beta})^{-1}\Omega_N(\hat{\beta})A_N(\hat{\beta})^{-1} \overset{p}{\to} A(\beta_o)^{-1}\Omega(\beta_o)A(\beta_o)^{-1}.$$

APPENDIX 4B
THE RAMP MODEL WITH VARIABLE SUPPORT

In this appendix, we modify and extend the Horrace and Oaxaca setup to show how the constraint on the linear index through a ramp function can be interpreted in terms of the support of the latent model error term. In particular, write

$$y^* = x\beta - u, \quad u|x \sim \text{Uniform}(-a, a), \quad y = 1[y^* > 0] \quad (4B.1)$$

for some $a > 0$. Compared with Horrace and Oaxaca, we have shifted the intercept so that $u$ has a symmetric distribution about its mean of zero. Also, we allow $u$ to have narrow or wide support, depending on $a$. The CDF for the Uniform$(-\sqrt{3}, \sqrt{3})$ distribution, which has unit variance, is graphed in Figure 1.

[Figure 1: The CDF of $y$ with $u|x \sim$ Uniform$(-\sqrt{3}, \sqrt{3})$.]

Given the latent variable model in (4B.1), we can derive the response probability:

$$p(x) \equiv P(y = 1|x) = P(y^* > 0|x) = P(u \le x\beta|x) = F_u(x\beta) = \begin{cases} 0, & x\beta < -a \\ \frac{x\beta + a}{2a}, & -a \le x\beta \le a \\ 1, & x\beta > a. \end{cases}$$

We write this function as $F_u(x\beta) \equiv R_a(x\beta)$, a ramp function that is nondifferentiable at $-a$ and $a$. For an $x_j$ with a positive coefficient, the response probability has the same shape as in Figure 1. As $a$ increases relative to $\beta$, the response probability is linear over more of the support of $x$. If

$$P(-a \le x\beta \le a) = 1, \quad (4B.2)$$

then, with probability one, $R_a(x\beta) = (x\beta + a)/(2a)$, a linear function of $x$. In this case, the partial effects are constant and equal to $\beta_j/(2a)$, $j = 2, \ldots, K$. These are also the linear projection parameters $\gamma_j$, and so OLS consistently estimates the APEs under (4B.2).

If $x_j$ is a continuous variable, we are interested in the APE defined as a derivative, which exists with probability one when $x\beta$ is continuously distributed. At $x\beta \in \{-a, a\}$ the definition of the partial effect is immaterial. To be concrete, take

$$PE_j(x) = \frac{\beta_j}{2a} \cdot 1[-a \le x\beta \le a].$$

Notice that $PE_j(x) = 0$ if $x\beta < -a$ or $x\beta > a$, because we are on one of the flat parts of the ramp. This feature of $PE_j(x)$ is taken into account in computing the APE:

$$APE_j = E[PE_j(x)] = \frac{\beta_j}{2a} \cdot P(-a \le x\beta \le a).$$

The case that aligns with Horrace and Oaxaca is $a = 1/2$, so that the Uniform$(0,1)$ distribution has just been shifted to have zero mean, in which case $|APE_j| \le |\beta_j|$. It is easily seen that $|APE_j| \le |\beta_j|$ for any $a \ge 1/2$, and the difference between $APE_j$ and $\beta_j$ can be large. In the extended model (4B.1), depending on the values of $a$ and $P(-a \le x\beta \le a)$, $|APE_j|$ need not be smaller than $|\beta_j|$. While the latent error support parameter $a$ is not separately identified from $\beta$, this model is a convenient device for generating data where the unit-interval bounds on probabilities are binding to varying degrees.
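A minimal sketch of data generation from (4B.1) follows; the slope values here are hypothetical, and larger $a$ makes the unit-interval bounds less binding.

```stata
* Sketch: one draw from the variable-support ramp model (4B.1)
* with a = sqrt(3), so u has unit variance.
clear
set obs 1000
local a = sqrt(3)
generate x = rnormal()
generate u = -`a' + 2*`a'*runiform()      // Uniform(-a, a)
generate byte y = (0.25 + 0.5*x - u > 0)  // hypothetical slopes
```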
CHAPTER 5
IDENTIFICATION OF PARTIAL EFFECTS WITH ENDOGENOUS CONTROLS
(CO-AUTHORED WITH KYOO IL KIM)

5.1 Introduction

In models with endogenous treatment, to obtain consistent estimates of treatment effects, researchers commonly impose conditional (mean) independence or use instrumental variables (IV) for the treatment, while rather casually assuming the other observable control variables are exogenous. In reality, however, empirical researchers often end up with control variables that may be subject to additional endogeneity concerns, while finding instruments for every endogenous control is challenging or impossible. In this note, we demonstrate that if the objects of interest are limited to parameters associated with the treatment, then we can get around the endogeneity of the control variables in certain settings.

To illustrate the problem, consider the following linear model:

$$Y = D\tau + X\beta + \varepsilon, \quad E[\varepsilon|D, X] = E[\varepsilon|X],$$

where $D$ is a treatment variable of interest, $X$ is another observable determinant of the outcome $Y$, and $\varepsilon$ is an unobserved determinant. A common scenario in which the endogenous control problem arises is that the treatment $D$ depends on $X$ while, at the same time, $X$ depends on $\varepsilon$, too. As a result, (1) without $X$ in the specification, the dependence between $D$ and $X$ can cause bias in the OLS estimation; (2) with $X$ included, the dependence between $X$ and $\varepsilon$ can also pollute the estimation of the partial effect $\tau$, even if $D$ is conditionally independent of $\varepsilon$. In either case (1) or (2), $\tau$ is not identified by the linear projection parameters, so the OLS estimator is biased. In case (1), it is simply an omitted variable bias. To see the bias in case (2), let $W = (D, X)$ and $\theta = (\tau', \beta')'$; the linear projection parameters are defined as

$$\gamma_{LP} \equiv E[W'W]^{-1}E[W'Y] = \theta + E[W'W]^{-1}E[W'E[\varepsilon|X]].$$

Therefore, without further restriction on $E[\varepsilon|X]$, OLS does not produce a consistent estimate of $\theta$, or $\tau$ in particular.

In a worse scenario, $X$ may itself be an outcome of $D$, in which case $X$ is described as a bad control, as in Angrist and Pischke (2009). It is well known that a bad control can cause problems for identification, and the problem is present even if we start with a randomly assigned treatment (see Wooldridge, 2005; Lechner, 2008). However, as is shown below, $\tau$ can still be identified even when $X$ is affected by $D$, as long as $X$ is not solely a function of $D$. More formally, this extra condition is referred to as measurable separability, first introduced in Florens et al. (1990); it will be defined formally later. At its essence, this assumption ensures that we can vary the value of $D = d$ while holding $X = x$ at a particular value $x$. Note that this still allows the distribution of $X$ to depend on $D$, and vice versa.

In the case of a continuous random variable $D$, $\tau$ is nonparametrically identified as follows:

$$\partial_d E[Y|D = d, X = x] = \tau + \partial_d (x\beta + E[\varepsilon|D = d, X = x]) = \tau + \partial_d (x\beta + E[\varepsilon|X = x]) = \tau,$$

where the last equality holds due to the measurable separability of $D$ and $X$. To see this, suppose measurable separability does not hold; for example, $X = f(D)$ almost surely, with neither being constant. Then conditioning on $D = d$ and $X = x$ necessitates $x = f(d)$.

The co-authors have approved the inclusion of this co-authored chapter. Co-author's contact: Kyoo il Kim, Department of Economics, Michigan State University. Email: kyookim@msu.edu
In that case, we would have

$$\partial_d E[Y|D = d, X = x] = \tau + \partial_d \left(f(d)\beta + E[\varepsilon|X = f(d)]\right) \neq \tau.$$

In the case of a binary $D$, measurable separability between $D$ and $X$ allows for conditioning on $D = 1$ and $D = 0$ at different realized values of $X$. Combined with the conditional independence condition, identification of $\tau$ is achieved as follows:

$$\begin{aligned}
E[Y|D = 1, X = x] - E[Y|D = 0, X = x] &= \tau + E[\varepsilon|D = 1, X = x] - E[\varepsilon|D = 0, X = x] \\
&= \tau + E[\varepsilon|X = x] - E[\varepsilon|X = x] = \tau.
\end{aligned}$$

EXAMPLE 5.1 As a concrete example, consider the linear regression model relating district-level average test scores (avgscore) to district-level educational expenditure per student (expend) and average family income (avginc), from Wooldridge (2019, Chapter 3):

$$avgscore = \alpha + \tau \cdot expend + \beta \cdot avginc + \varepsilon.$$

Suppose we are interested in the partial effect of expend, $\tau$. Since avginc is relevant for expend at the district level and avginc can also affect avgscore through other channels (e.g., private tutoring), including avginc as a control variable, or as a proxy for those unobserved determinants, is sensible. However, avginc may also be correlated with other unobserved determinants that affect both avginc and avgscore; the endogeneity of avginc then pollutes the identification of $(\tau, \beta)$ by the linear projection, and OLS fails to produce consistent estimates. Nevertheless, $\tau$ is nonparametrically identified as long as (1) expend is independent of $\varepsilon$ conditional on avginc, and (2) expend and avginc are measurably separated. Given these two conditions and the identification results above, a consistent estimate of $\tau$ is available through standard nonparametric estimators.

Does the problem go away with an excludable instrumental variable? When the instrumental variable is truly exogenous and affects the outcome only through the treatment, the answer is yes, because no control is needed. However, in practice, control variables are commonly included in models with instrumental variables, sometimes out of concern that the excludability condition holds only after controlling for certain observable variables. In those cases, again, the endogeneity of those controls can cause problems for identification. To illustrate, consider the same outcome equation with an excludable instrumental variable in a linear triangular model:

$$Y = D\tau + X\beta + \varepsilon, \qquad D = Z\pi_Z + X\pi_X + \eta.$$

In this model, the instrumental variable $Z$ is needed because $D$ is not conditionally independent of $\varepsilon$ even after conditioning on $X$, so the nonparametric identification method introduced above is no longer valid. However, without controlling for $X$, the instrument $Z$ itself may not suffice for identification because $Z$ may affect $Y$ through $X$ as well. Again, the endogeneity of $X$ poses a dilemma: (1) without controlling for $X$, $Z$ is not a valid IV; (2) with $X$ included, neither $\tau$ nor $\beta$ is identified by the usual IV or 2SLS projection without further restriction on $E[\varepsilon|X]$. Nevertheless, $\tau$ is nonparametrically identified given (i) measurable separability between $Z$ and $X$ and (ii) $\pi_Z \neq 0$:

$$\begin{aligned}
E[Y|X = x, Z = z] &= (z\pi_Z + x\pi_X + E[\eta|X = x])\tau + x\beta + E[\varepsilon|X = x], \\
E[D|X = x, Z = z] &= z\pi_Z + x\pi_X + E[\eta|X = x], \\
\frac{\partial_z E[Y|X = x, Z = z]}{\partial_z E[D|X = x, Z = z]} &= \tau. \quad (5.1)
\end{aligned}$$

Again, at its essence, the extra measurable separability condition ensures that we can vary the value of $Z = z$ while holding $X = x$ fixed. Alternatively, $\tau$ can also be nonparametrically identified through a control function approach.
In this case, because $D = Z\pi_Z + X\pi_X + \eta$ with $\pi_Z \neq 0$, the measurable separability between $Z$ and $X$ also implies the measurable separability between $(X, \eta)$ and $D$. It will be shown later that, given the exogeneity of $Z$, $D$ is independent of $\varepsilon$ conditional on $(\eta, X)$. Therefore, $\tau$ can also be identified as follows:

$$\partial_d E[Y|D = d, X = x, \eta = e] = \partial_d \left[d\tau + x\beta + E[\varepsilon|X = x, \eta = e]\right] = \tau. \quad (5.2)$$

EXAMPLE 5.2 Consider a linear triangular model relating individual wages to school attendance. In an influential paper studying the causal impact of compulsory school attendance on earnings, Angrist and Krueger (1991) use quarter of birth (qbirth) as an instrument for educational attainment (totaledu) in wage equations, based on the observation that school-entry requirements and compulsory schooling laws compel students born at the end of the year to attend school longer than students born in other months. Suppose we include parents' income (parinc) as a control because parents' income may also affect the birth quarters of their children, so its inclusion makes the exogeneity condition of the instrument more likely to hold. The heuristic model can be specified as follows:

$$\begin{aligned}
\log(wage) &= \alpha + \tau \cdot totaledu + \beta \cdot parinc + \varepsilon, \\
totaledu &= \pi_0 + \pi_1 \cdot qbirth + \pi_2 \cdot parinc + \eta.
\end{aligned}$$

However, students from high-income families may be able to access social resources that are positively correlated with earnings, making parinc an endogenous control. In that case, IV or 2SLS does not yield consistent estimates of $(\tau, \beta)$, as discussed above. Nevertheless, $\tau$ is identified nonparametrically as long as the usual restrictions for IV hold and the measurable separability condition is justified.

The heuristic exposition above focuses on linear models. When the true data-generating process is nonlinear in the parameters or nonparametric, it is not clear whether the same idea is still applicable. In particular, when the unobserved determinant is not separable from other observable covariates, it is not clear whether the dependence between the controls and the unobserved determinant could substantially change the exposition. Ideally, we would like our approach to apply to a large class of data-generating processes. We therefore derive the main results under nonparametric, nonseparable models, considering cases with and without instrumental variables. The basic idea of how nonparametric estimation helps alleviate the bias due to endogenous controls is introduced in Frölich (2008). However, our work is the first to provide a formal identification result, which not only justifies the use of nonparametric methods in the presence of endogenous controls but also delineates the boundary of this method through the measurable separability condition.

The issue of endogenous controls is prevalent in empirical research but is not well studied in the econometrics literature. One exception, outside our setting, concerns the regression discontinuity (RD) design, where Kim (2013) finds that endogenous control variables yield asymptotic bias in the RD estimator, while the inclusion of these relevant controls may offset this bias and improve some higher-order properties of the estimator. Diegert et al. (2022) assess the omitted variable bias when the controls are potentially correlated with the omitted variables in a sensitivity-analysis framework.

The rest of the paper is outlined as follows. In Section 5.2, we establish the main identification results for nonseparable models under conditional independence.
In Section 5.3, we consider methods based on an instrumental variable that is only conditionally exogenous. In Section 5.4, Monte Carlo simulations demonstrate the issue of endogenous controls and the performance of the proposed methods in finite samples. Section 5.5 concludes the note with recommendations for empirical practice.

5.2 Nonseparable Models with Conditional Independence

Identification results for a nonseparable model with an endogenous treatment, $Y = m(D, \varepsilon)$, are given in Altonji and Matzkin (2005), assuming there exists some vector $X$ such that, conditional on $X$, the treatment variable $D$ is independent of the stochastic error $\varepsilon$. However, in many empirical applications with nonparametric, semiparametric, or parametric models, the vector of control variables usually appears in the model for the outcome $Y$. The question is: can we still identify, for example, the local average response (LAR) and ATE of $D$ on $Y$ under the conditional independence assumption when $X$ is endogenous? In this section, we show the answer is positive. To focus on our main point, for convenience, we assume all relevant (conditional) probability density functions are well defined below and throughout the paper.

Consider a nonseparable nonparametric model:

$$Y = m(D, X, \varepsilon). \quad (5.3)$$

We are interested in identifying the conditional LAR (CLAR) and the unconditional LAR, denoted by $\beta(d, x)$ and $\beta(d)$, respectively. For now, the focus is on a continuous outcome $Y$, but the results can be extended to binary choice models, as shown in Altonji and Matzkin (2005). Assume $m(\cdot)$ is differentiable with respect to its first argument and $D$ is a continuous treatment; then $\beta(d, x)$ and $\beta(d)$ are defined as

$$\begin{aligned}
\beta(d, x) &= \int \frac{\partial m(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x}(\epsilon)\, d\epsilon, \\
\beta(d) &= \iint \frac{\partial m(d, x, \epsilon)}{\partial d} f_{X,\varepsilon|D=d}(x, \epsilon)\, dx\, d\epsilon,
\end{aligned}$$

where $f_{\varepsilon|D=d,X=x}(\epsilon)$ and $f_{X,\varepsilon|D=d}(x, \epsilon)$ denote the relevant conditional density functions. If $D$ is a binary random variable (or if we are interested in a discrete change in $D$), we define the CLAR and LAR as

$$\begin{aligned}
\tilde{\beta}(d, x) &= \int \left(m(1, x, \epsilon) - m(0, x, \epsilon)\right) f_{\varepsilon|D=d,X=x}(\epsilon)\, d\epsilon, \\
\tilde{\beta}(d) &= \iint \left(m(1, x, \epsilon) - m(0, x, \epsilon)\right) f_{X,\varepsilon|D=d}(x, \epsilon)\, dx\, d\epsilon.
\end{aligned}$$

Assumption 5.1 $f_{\varepsilon|D,X}(\epsilon) = f_{\varepsilon|X}(\epsilon)$ for all $\epsilon \in \mathbb{R}$.

Assumption 5.1 is the conditional independence assumption also imposed in Altonji and Matzkin (2005). We note that Assumption 5.1 does not rule out $X$ being endogenous (i.e., not independent of $\varepsilon$). If $D$ is a continuous random variable, we can represent $D$ by some function $h$ as

$$D = h(X, U), \quad (5.4)$$

where $X$ is independent of a continuous error term $U$ and $h(X, u)$ is strictly monotonic in $u$ almost surely (see Matzkin (2003)). To identify the CLAR and LAR in the continuous-treatment case while allowing for endogeneity in $X$, we need an extra rank condition:

Assumption 5.2 $D$ and $X$ are measurably separated; that is, any function of $D$ almost surely equal to a function of $X$ must be almost surely equal to a constant.

To see why this is a type of rank condition, consider a case where the condition is violated at some point in the interior of the support of $(D, X)$, i.e., $l(D) = q(X)$ for some measurable, nonconstant functions $l(\cdot)$ and $q(\cdot)$. Then $l(h(X, U)) = q(X)$. Differentiating both sides with respect to $U$, we have

$$\frac{\partial l}{\partial h}\frac{\partial h}{\partial U} = \frac{\partial q}{\partial U} = 0.$$

Given that measurable separability fails, we have $\partial l/\partial h \neq 0$, and so $\partial h/\partial U = 0$. Therefore, Assumption 5.2 requires $U$ to affect $D$.
Following Theorem 3 in Florens et al. (2008), we give primitive conditions for Assumption 5.2 as follows:

Assumption 5.3 (i) $D$ is determined by (5.4), where $X$ is continuously distributed and independent of $U$, and $h(x, u)$ is continuous in $x$. (ii) For any fixed $x$, the support of the distribution of $h(x, U)$ contains an open interval.

In Appendix 5A, we give a lemma, which follows from Theorem 3 in Florens et al. (2008), under which the conditions in Assumption 5.3 are sufficient for Assumption 5.2 to hold. Note that Assumption 5.3(i) implicitly restricts the treatment $D$ to be continuous, so it is not appropriate to impose this restriction for a binary treatment $D$. Assumption 5.3(ii) requires $U$ to be continuously distributed and $h(x, U)$ to be a continuous monotonic function of $U$ for any fixed $x$.

We now give identification results for the CLAR and LAR for both the continuous-treatment and binary-treatment cases in the following theorem:

Theorem 5.1 Consider the model defined in (5.3) and (5.4). (i) For a continuous random variable $D$, suppose that Assumptions 5.1 and 5.3 hold and $E\left[\left|\frac{\partial m(d,x,\varepsilon)}{\partial d}\right| \,\middle|\, D = d, X = x\right] < \infty$. Then the LAR and CLAR are identified for all $d \in \text{Supp}(D)$ and $x \in \text{Supp}(X|D = d)$ as

$$\beta(d, x) = \frac{\partial E[Y|D = d, X = x]}{\partial d}, \qquad \beta(d) = \int \frac{\partial E[Y|D = d, X = x]}{\partial d} f_{X|D=d}(x)\, dx.$$

(ii) For a binary random variable $D$, suppose Assumptions 5.1 and 5.2 hold and, for all $d \in \text{Supp}(D)$ and $x \in \text{Supp}(X|D = d)$, $E\left[|m(1, x, \varepsilon) - m(0, x, \varepsilon)| \,\middle|\, D = d, X = x\right] < \infty$. Then the LAR and CLAR are identified for all $d \in \text{Supp}(D)$ and $x \in \text{Supp}(X|D = d)$ as

$$\begin{aligned}
\tilde{\beta}(d, x) &= E[Y|D = 1, X = x] - E[Y|D = 0, X = x], \\
\tilde{\beta}(d) &= \int \left(E[Y|D = 1, X = x] - E[Y|D = 0, X = x]\right) f_{X|D=d}(x)\, dx.
\end{aligned}$$

The proof of Theorem 5.1 is given in Appendix 5A.

5.3 Nonseparable Triangular Models

As an alternative to the conditional independence assumption, another useful identifying restriction for addressing the endogeneity of the treatment is an excluded IV, from which we can construct a control variable that accounts for the endogeneity in the treatment equation. In applications, other observable control variables are included to make the exogeneity condition of the IV more likely to hold. Commonly, these observable controls are assumed to be exogenous and are included in both the outcome equation and the reduced-form equation. We caution that these control variables may be endogenous, too, while finding an IV for every endogenous control is not possible. In this section, we study a nonseparable triangular model similar to the one in Imbens and Newey (2009), where we explicitly include potentially endogenous control variables in the model and provide identification results for the LAR and treatment effects.

Consider the nonseparable triangular model

$$Y = g(D, X, \varepsilon), \quad (5.5)$$
$$D = q(Z, X, \eta), \quad (5.6)$$

where $D$ is a continuously distributed random variable that is endogenous to the stochastic error, $X$ is a vector of observable control variables potentially endogenous to unobservable determinants of $Y$, and $Z$ is an exogenous variable excluded from the outcome equation (5.5) that is conditionally independent of $(\varepsilon, \eta)$:

Assumption 5.4 $Z \perp\!\!\!\perp (\varepsilon, \eta) \mid X$.

This differs from Imbens and Newey (2009) in that the endogenous variables have now been separated into two vectors, $D$ and $X$, and we are only interested in identifying parameters associated with $D$. Note that $X$ is allowed to be correlated with both $D$ and $Z$, which also motivates the inclusion of $X$ in the model, as it makes Assumption 5.4 more likely to hold.
If there exists a control variable $V$ such that

$$f_{\varepsilon|D,X,V}(\epsilon) = f_{\varepsilon|X,V}(\epsilon), \quad \forall \epsilon \in \mathbb{R}, \quad (5.7)$$

and both $X$ and $V$ are measurably separated from $D$, then we can apply an approach similar to that of Section 5.2 to identify the CLAR and LAR, which in this case are defined as

$$\begin{aligned}
\beta(d, x) &= \int \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x}(\epsilon)\, d\epsilon = \iint \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x,V=v}(\epsilon) f_{V|D=d,X=x}(v)\, dv\, d\epsilon, \\
\beta(d) &= \iint \frac{\partial g(d, x, \epsilon)}{\partial d} f_{X,\varepsilon|D=d}(x, \epsilon)\, dx\, d\epsilon = \iiint \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x,V=v}(\epsilon) f_{V|D=d,X=x}(v) f_{X|D=d}(x)\, dx\, dv\, d\epsilon.
\end{aligned}$$

Under condition (5.7), measurable separability, and appropriate regularity conditions allowing the derivative to pass through the expectation, we have

$$\frac{\partial E[Y|D = d, X = x, V = v]}{\partial d} = \frac{\partial}{\partial d}\int g(d, x, \epsilon) f_{\varepsilon|D=d,X=x,V=v}(\epsilon)\, d\epsilon = \int \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x,V=v}(\epsilon)\, d\epsilon.$$

Then the CLAR and LAR are identified for all $d \in \text{Supp}(D)$ and $x \in \text{Supp}(X|D = d)$ as

$$\begin{aligned}
\int \frac{\partial E[Y|D = d, X = x, V = v]}{\partial d} f_{V|D=d,X=x}(v)\, dv &= \iint \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x,V=v}(\epsilon) f_{V|D=d,X=x}(v)\, dv\, d\epsilon = \beta(d, x), \\
\iint \frac{\partial E[Y|D = d, X = x, V = v]}{\partial d} f_{V|D=d,X=x}(v) f_{X|D=d}(x)\, dv\, dx &= \iiint \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x,V=v}(\epsilon) f_{V|D=d,X=x}(v) f_{X|D=d}(x)\, dx\, dv\, d\epsilon = \beta(d).
\end{aligned}$$

We construct such a control variable $V$ satisfying condition (5.7) in a fashion similar to Imbens and Newey (2009): $V = F_{D|Z,X}(D)$, i.e., the conditional CDF of $D$ given $(Z, X)$. The following assumption is essential for the construction of $V$ and for ensuring that the information contained in $V$ is the same as that in $\eta$.

Assumption 5.5 (i) $q(z, x, e)$ is strictly monotonic in $e$ for any fixed $(z, x)$; (ii) $\eta$ is continuously distributed with its CDF $F_\eta(e)$ strictly increasing on the support of $\eta$.

Assumption 5.5(i) allows the inverse function of $q(z, x, e)$ with respect to $e$ to exist. Assumption 5.5(ii) implies that $F_\eta(e)$ is a one-to-one function of $e$.

We further discuss the measurable separability conditions. First note that if $\eta$ is independent of $(Z, X)$ in (5.6), we can fix $\eta$ and see that $D$ and $X$ are measurably separated (i) if $Z$ and $X$ are measurably separated, which would hold under sufficient conditions similar to Assumption 5.3, and (ii) if $q(z, \cdot, \cdot)$ is continuous in $z$. Again, this essentially means that, given fixed values of $(X, \eta) = (x, e)$, we can vary $Z = z$, and hence $D = d$, because $d = q(z, x, e)$. We also provide primitive conditions for the measurable separability between $D$ and $\eta$:

Assumption 5.6 (i) $D$ is determined by (5.6), where $\eta$ is continuously distributed and independent of $(Z, X)$, and $q(z, x, e)$ is continuous in $e$. (ii) For any fixed $e$, the support of the distribution of $q(Z, X, e)$ contains an open interval.

This is the counterpart of Assumption 5.3, and it implies the measurable separability of $D$ and $\eta$ by Lemma 5A.1 in Appendix 5A. The difference is that Assumption 5.6 does impose some restrictions on $X$ and $Z$: Assumption 5.6(i) requires $X$ to be independent of the unobservable determinant of $D$, and Assumption 5.6(ii) requires $(Z, X)$ to contain a continuous element, with $q$ continuous in that element for any fixed $e$.

In the next theorem, we show that the constructed control variable $V$ satisfies condition (5.7) and is measurably separated from $D$.

Theorem 5.2 Suppose Assumption 5.4 holds for the nonseparable model in (5.5) and (5.6). Then: (i) $D$ is independent of $\varepsilon$ conditional on $(\eta, X)$. (ii) If, additionally, Assumptions 5.5 and 5.6 hold, then condition (5.7) is satisfied with $V = F_\eta(\eta) = F_{D|Z,X}(D)$, and $D$ is measurably separated from $V$.

The proof of Theorem 5.2 is given in Appendix 5A.
5.4 Simulation

In this section, we use Monte Carlo simulations to demonstrate the bias due to endogenous controls and how the proposed methods perform in finite samples. We consider two DGPs: (1) the first covers the scenario where the control is endogenous and the treatment is conditionally independent of the unobserved determinants, as in Section 5.2; (2) the second covers the scenario where the IV is conditionally valid but the control is endogenous, as in Section 5.3. First, consider DGP(1):

$$\text{DGP(1):} \quad Y = \beta_0 + \beta_1 D + \beta_2 X + U, \quad D = a^2 + N(0,1), \quad X = a, \quad U = b^2 + N(0,1),$$

where $(a, b)$ are jointly normal with mean zero, variance one, and covariance 0.75, and $N(0,1)$ denotes an independent standard normal draw (independent of $a$ and $b$). We observe that (1) $X$ is relevant for both $Y$ and $D$; (2) $U$ affects both $D$ and $X$; (3) conditional on $X$, $D$ is independent of $U$; and (4) $D$ and $X$ are measurably separated. As a result, the linear projection parameters do not identify $\beta_1$, which is the local average response and our parameter of interest. As shown in Section 5.2, $\beta_1$ is nonparametrically identified.

In Table 5.1, we compare estimates of $\beta_1$ using (i) OLS without the control, (ii) OLS with the control, (iii) nonparametric estimation via local linear regression, and (iv) third-order polynomial series regression.

Table 5.1: Simulation for DGP(1)

Methods   (i)     (ii)    (iii)   (iv)
Bias      0.374   0.374   0.114   0.001
SD        0.055   0.044   0.444   0.048

Note: Simulation results are based on 1,000 replications and random samples of size $n = 1{,}000$. The series regression uses third-order polynomials.

The results are clear: while the conventional methods that assume exogeneity of the control are severely biased, both nonparametric methods perform much better in terms of bias. The local linear kernel method is less efficient than the series approximation method. Note that we can rewrite $X = (D - N(0,1))^{1/2}$, a function of the treatment $D$, so $X$ is subject to the bad-control critique. However, the proposed methods still work when the target is the average partial effect or the local average response.
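A minimal Stata sketch of DGP(1) and estimators (ii) and (iv) follows; the true coefficients $(\beta_0, \beta_1, \beta_2) = (0, 1, 1)$ are hypothetical, since the text does not report them. Under this design, $E[U|X = x] = 0.5625x^2 + 0.4375$ is quadratic in $x$, so the cubic series absorbs it and the coefficient on $D$ is consistent for $\beta_1$.

```stata
* Sketch of DGP(1) with hypothetical coefficient values (0, 1, 1).
clear
set obs 1000
matrix C = (1, 0.75 \ 0.75, 1)
drawnorm a b, corr(C)
generate x = a
generate d = a^2 + rnormal()
generate u = b^2 + rnormal()
generate y = d + x + u
regress y d x                      // (ii): biased for beta1
regress y d c.x##c.x##c.x          // (iv): cubic series in x
```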
Next, consider DGP(2):

$$\text{DGP(2):} \quad Y = \beta_0 + \beta_1 D + \beta_2 X + U, \quad D = X + Z + \eta, \quad X = a, \quad Z = a^2 + N(0,1), \quad \eta = c, \quad U = b^2 + d^2 + N(0,1),$$

where $(c, d)$ are also jointly normal with mean zero, variance one, and covariance 0.75, and $(c, d)$ are independent of $(a, b)$. We observe that (i) $D$ is not conditionally independent of $U$ given $X$; (ii) $Z$ is not a valid IV except conditional on $X$, but $X$ is endogenous; (iii) $Z$ and $X$ are measurably separated, so the local average response parameter $\beta_1$ is nonparametrically identified as in (5.1); (iv) the measurable separability between $Z$ and $X$ here also implies the measurable separability between $D$ and $(X, \eta)$, so $\beta_1$ can also be nonparametrically identified as in (5.2), and we can use a two-step control function approach. Note that the model is linear, a special case of the nonseparable model in Section 5.3, so we can also resort to Theorem 5.2 for identification.

Table 5.2 compares the following five methods, corresponding to the column numbers in the table: (i) IV without the control; (ii) IV with the control; (iii) the nonparametric approach based on (5.1), using local linear regression; (iv) the nonparametric approach based on (5.1), using series regression; (v) the two-step control function approach based on Theorem 5.2, using series regression in the second step.

Table 5.2: Simulation for DGP(2)

Methods   (i)     (ii)    (iii)   (iv)    (v)
Bias      0.374   0.375   0.139   0.001   0.001
SD        0.057   0.052   0.161   0.066   0.068

Note: Simulation results are based on 1,000 replications and random samples of size $n = 1{,}000$.

Although (v) is the most general approach, allowing for nonseparable models, it is computationally costly because we need to estimate the conditional CDF $F_{D|Z,X}(D_i)$ for each realized $D_i$ in the sample. In practice, method (v) is implemented as follows: (a) for a given $i = 1, \ldots, n$, generate the indicator variable $1\{D < D_i\}$; (b) use local linear kernel methods to regress $1\{D < D_i\}$ on $X$ and $Z$ and retain the fitted value for the $i$-th observation only. Repeating steps (a)-(b) for each $i = 1, \ldots, n$ yields the estimated $F_{D|Z,X}(D_i)$ for every $i$. Finally, we nonparametrically regress $Y$ on $D$, $X$, and the estimated $F_{D|Z,X}(D_i)$ using polynomial series regression.

The results of this comparison are as theory predicts. While the IV-based methods (i) and (ii), which implicitly impose exogeneity of the control, fail to produce consistent estimates of the partial effect of interest, the alternative identification methods paired with standard nonparametric estimators in (iii)-(v) perform well in finite samples.
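Because the first stage in DGP(2) is linear with an additive $\eta$, a simplified control function sketch can use the first-stage residual in place of $F_{D|Z,X}(D)$; under Assumption 5.5 the two generate the same sigma-algebra. The coefficient values are hypothetical, as above, and the second-step standard errors ignore the generated regressor.

```stata
* Simplified control function sketch for DGP(2): with a linear first
* stage, the residual etahat carries the same information as V.
regress d x z
predict double etahat, residuals
regress y d c.x##c.x##c.x c.etahat##c.etahat##c.etahat
```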
5.5 Conclusion

This note addresses a critical, prevalent, yet often overlooked problem in empirical research: the endogeneity of control variables. Building on the insightful observation and discussion in Frölich (2008) that nonparametric estimation can help with the endogenous control problem, we provide formal identification results in a simple linear model with and without the presence of instrumental variables, and we extend the results to a general class of nonseparable models, focusing on identifying local average responses.

For empirical practice, this note provides a more flexible framework for dealing with endogenous controls. Using the primitive conditions provided in this note, researchers can evaluate whether the inclusion of potentially endogenous control variables is desirable and whether estimation and inference can be made robust to potential endogeneity in the control. Estimation based on our identification results is also standard in common empirical settings.

BIBLIOGRAPHY

Altonji, J. G. and Matzkin, R. L. (2005). Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica, 73(4):1053–1102.

Angrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics, 106(4):979–1014.

Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

Diegert, P., Masten, M. A., and Poirier, A. (2022). Assessing omitted variable bias when the controls are endogenous. arXiv preprint arXiv:2206.02303.

Durrett, R. (2019). Probability: Theory and Examples. Cambridge University Press.

Florens, J.-P., Heckman, J. J., Meghir, C., and Vytlacil, E. (2008). Identification of treatment effects using control functions in models with continuous, endogenous treatment and heterogeneous effects. Econometrica, 76(5):1191–1206.

Florens, J.-P., Mouchart, M., and Rolin, J.-M. (1990). Elements of Bayesian Statistics. CRC Press.

Frölich, M. (2008). Parametric and nonparametric regression in the presence of endogenous control variables. International Statistical Review, 76(2):214–227.

Imbens, G. W. and Newey, W. K. (2009). Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481–1512.

Kim, K. (2013). Regression discontinuity design with endogenous covariates. Journal of Economic Theory and Econometrics, 24(4):320–337.

Lechner, M. (2008). A note on the common support problem in applied evaluation studies. Annales d'Économie et de Statistique, pages 217–235.

Matzkin, R. L. (2003). Nonparametric estimation of nonadditive random functions. Econometrica, 71(5):1339–1375.

Wooldridge, J. M. (2005). Violating ignorability of treatment by controlling for too many factors. Econometric Theory, 21(5):1026–1028.

Wooldridge, J. M. (2019). Introductory Econometrics: A Modern Approach. Cengage Learning.

APPENDIX 5A PROOFS FOR CHAPTER 5

We first restate Theorem 3 of Florens et al. (2008) as Lemma 5A.1 below, which gives primitive conditions for measurable separability.

Lemma 5A.1 Suppose 𝐷 is determined by 𝐷 = ℎ(𝑍, 𝑉), where 𝑉 is continuously distributed and independent of 𝑍, and ℎ(𝑧, 𝑣) is continuous in 𝑣. Further, for any fixed 𝑣, the support of the distribution of ℎ(𝑍, 𝑣) contains an open interval. Then, 𝐷 and 𝑉 are measurably separated.

Proof of Theorem 5.1: First, note that Assumption 5.3 implies Assumption 5.2 by Lemma 5A.1 with 𝑍 = 𝑈 and 𝑉 = 𝑋. For continuous 𝐷, Assumptions 5.1 and 5.2 imply that

∂f(ε | D = d, X = x)/∂d = ∂f(ε | X = x)/∂d = 0.

Then, we have

∂E[Y | D = d, X = x]/∂d = ∂/∂d ∫ m(d, x, ε) f_{ε|D=d,X=x}(ε) dε
                        = ∫ (∂m(d, x, ε)/∂d) f_{ε|D=d,X=x}(ε) dε,

where the last equality follows from the Leibniz integral rule and the chain rule. Therefore, 𝛽(𝑑, 𝑥) is identified by ∂E[Y | D = d, X = x]/∂d for all 𝑑 ∈ Supp(𝐷) and 𝑥 ∈ Supp(𝑋 | 𝐷 = 𝑑). Furthermore, integrating both sides with respect to 𝑋 given 𝐷 = 𝑑 gives

∫ (∂E[Y | D = d, X = x]/∂d) f_{X|D=d}(x) dx = ∫∫ (∂m(d, x, ε)/∂d) f_{X,ε|D=d}(x, ε) dx dε.

So, 𝛽(𝑑) is identified by ∫ (∂E[Y | D = d, X = x]/∂d) f_{X|D=d}(x) dx.

In the case of binary 𝐷, Assumption 5.1 implies that f_{ε|D=1,X=x}(ε) = f_{ε|D=0,X=x}(ε). Assumption 5.2 allows for conditioning on 𝐷 = 1 and 𝐷 = 0 at different realized values of 𝑋, so we have

E[Y | D = 1, X = x] − E[Y | D = 0, X = x]
= ∫ m(1, x, ε) f_{ε|D=1,X=x}(ε) dε − ∫ m(0, x, ε) f_{ε|D=0,X=x}(ε) dε
= ∫ (m(1, x, ε) − m(0, x, ε)) f_{ε|D=d,X=x}(ε) dε.

So, β̃(𝑑, 𝑥) is identified for all 𝑑 ∈ Supp(𝐷) and 𝑥 ∈ Supp(𝑋 | 𝐷 = 𝑑), and integrating both sides with respect to 𝑋 given 𝐷 = 𝑑 gives

β̃(𝑑) = ∫ (E[Y | D = 1, X = x] − E[Y | D = 0, X = x]) f_{X|D=d}(x) dx
      = ∫∫ (m(1, x, ε) − m(0, x, ε)) f_{X,ε|D=d}(x, ε) dε dx.

Proof of Theorem 5.2: The proof of statement (i) and part of statement (ii) follows closely the proof of Theorem 1 in Imbens and Newey (2009). For statement (i), let 𝑙 be any continuous and bounded real function. Due to the independence of 𝑍 and (𝜀, 𝜂) conditional on 𝑋, we first obtain conditional mean independence as an intermediate result:

E[l(D) | ε, η, X] = E[l(q(Z, X, η)) | ε, η, X]
                  = ∫ l(q(z, X, η)) dF_{Z|ε,η,X}(z)
                  = ∫ l(q(z, X, η)) dF_{Z|X}(z)
                  = E[l(D) | η, X].

Then, we can check the conditional independence of 𝐷 and 𝜀 given (𝜂, 𝑋) using a conditional version of Theorem 2.1.12 of Durrett (2019).
Let 𝑎(·) and 𝑏(·) be any continuous and bounded real functions. Then

E[a(D)b(ε) | η, X] = E[E[a(D)b(ε) | ε, η, X] | η, X]
                   = E[E[a(D) | ε, η, X] b(ε) | η, X]
                   = E[E[a(D) | η, X] b(ε) | η, X]
                   = E[a(D) | η, X] E[b(ε) | η, X].

Consider statement (ii). The measurable separability between 𝐷 and 𝜂 is implied by Assumption 5.6 using Lemma 5A.1 with 𝑍 = (𝑍, 𝑋) and 𝑉 = 𝜂. So it suffices to show that the sigma-algebra generated by 𝑉 is the same as that generated by 𝜂. By the strict monotonicity of 𝑞(𝑧, 𝑥, 𝑒) in 𝑒 for any fixed (𝑧, 𝑥), there exists an inverse function 𝑞⁻¹(𝑧, 𝑥, 𝑑) = 𝑒. Then, we have

F_{D|Z=z,X=x}(d) = Pr(D ≤ d | Z = z, X = x)
                 = Pr(q(z, x, η) ≤ d | Z = z, X = x)
                 = Pr(η ≤ q⁻¹(z, x, d) | Z = z, X = x)
                 = Pr(η ≤ q⁻¹(z, x, d))
                 = F_η(q⁻¹(z, x, d)),

where the second-to-last equality follows from the independence of (𝑍, 𝑋) and 𝜂 under Assumption 5.6. Note that 𝜂 = 𝑞⁻¹(𝑍, 𝑋, 𝐷) a.s., so we have 𝑉 = F_{D|Z,X}(D) = F_η(η). Under Assumption 5.5, F_η(e) is a one-to-one function of 𝑒, which implies that the sigma-algebra generated by F_η(η) is the same as that generated by 𝜂. Furthermore, combining this with the independence of 𝜂 and 𝑋 under Assumption 5.6, we have

E[a(D)b(ε) | V, X] = E[a(D)b(ε) | η, X]
                   = E[a(D) | η, X] E[b(ε) | η, X]
                   = E[a(D) | V, X] E[b(ε) | V, X],

which implies condition (5.7).
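As a numerical illustration of this construction, note that in DGP(2) of Section 5.4 we have q(Z, X, η) = X + Z + η with η ~ N(0, 1), so the control function has the closed form V = F_{D|Z,X}(D) = Φ(D − X − Z) = F_η(η), where Φ is the standard normal CDF. The sketch below, with the same placeholder coefficients (𝛽₀, 𝛽₁, 𝛽₂) = (1, 1, 1) as before, feeds this oracle V into a second-step polynomial series regression; the coefficient on D should be close to 𝛽₁ up to series approximation and sampling error.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1, beta2 = 1.0, 1.0, 1.0           # hypothetical true values
cov = [[1.0, 0.75], [0.75, 1.0]]

# DGP(2)
a, b = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
c, d = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
X, Z, eta = a, a**2 + rng.standard_normal(n), c
D = X + Z + eta
U = b**2 + d**2 + rng.standard_normal(n)
Y = beta0 + beta1 * D + beta2 * X + U

# oracle control function: V = F_{D|Z,X}(D) = F_eta(D - X - Z) = Phi(eta)
V = norm.cdf(D - X - Z)

# second step: series regression of Y on D and 3rd-order polynomials in (X, V)
W = np.column_stack([np.ones(n), D, X, X**2, X**3, V, V**2, V**3])
print(f"estimate of beta1: {np.linalg.lstsq(W, Y, rcond=None)[0][1]:.3f}")
```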