AN ECLECTIC COLLECTION OF ESSAYS ON ECONOMETRIC METHODS

By

Kaicheng Chen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics—Doctor of Philosophy

2025

ABSTRACT

This dissertation develops and extends methodologies in panel, nonlinear, and high-dimensional econometrics. Chapter 1 of the dissertation introduces and summarizes the following chapters. Chapter 2 proposes a fixed-𝑏 approximation method for inference in a linear panel model with two-way clustering. Chapter 3 considers the same dependence structure in panel models with nonlinearity and high dimensionality. Chapter 4 revisits the use of linear models for binary responses, focusing on average partial effects. Chapter 5 considers models with endogenous controls and provides alternative identification methods for a large class of models.

Copyright by KAICHENG CHEN 2025

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest gratitude to my advisor, friend, and lifetime role model, Professor Tim Vogelsang. The devoted guidance and persistent support from Tim have guided me through my Ph.D. journey and will remain a lifelong treasure. I would also like to extend my sincere appreciation to the members of my dissertation committee, Professor Jeff Wooldridge, Professor Antonio Galvao, Professor Kyoo il Kim, and Professor Shlomo Levental, for their constructive feedback and support. Special thanks go to Professor Hugo Freeman for valuable discussions on research and for support throughout the job market.

I am grateful to Michigan State University for admitting me to the doctoral program and for the generous support provided throughout my graduate studies. I would also like to express my gratitude to my master's thesis advisor, Professor Jeffrey Zabel at Tufts University, for his valuable support and lessons, as well as to Professor Todd Elder and my friend Xiaoxin Zhang, whose recognition and recommendations made my admission to this excellent graduate program possible. Additionally, I extend my gratitude to the other faculty and staff of the MSU Economics Department. It is their important work that makes this journey fruitful in every aspect. I am also thankful to my colleagues, the graduate students in the department, for their camaraderie and shared experiences.

I am especially grateful to my partner, Saera Oh, who has shared this journey with me as a fellow Ph.D. student from an early stage. My life and this journey would never have been as colorful and meaningful without her. Finally, and as always, I am profoundly indebted to my parents for their unwavering support. Thank you for your constant love and understanding.

To all who have been part of this journey—thank you.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION . . . 1
CHAPTER 2 FIXED-B ASYMPTOTICS FOR PANEL MODELS WITH TWO-WAY CLUSTERING . . . 3
  BIBLIOGRAPHY . . . 40
  APPENDIX 2A PROOFS FOR CHAPTER 2 . . . 43
CHAPTER 3 INFERENCE IN HIGH-DIMENSIONAL PANEL MODELS: TWO-WAY DEPENDENCE AND UNOBSERVED HETEROGENEITY . . . 56
  BIBLIOGRAPHY . . . 104
  APPENDIX 3A PROOFS FOR CHAPTER 3.2 . . . 111
  APPENDIX 3B PROOFS FOR CHAPTER 3.3 . . . 122
  APPENDIX 3C PROOFS FOR CHAPTER 3.4 . . . 143
CHAPTER 4 ANOTHER LOOK AT THE LINEAR PROBABILITY MODEL AND NONLINEAR INDEX MODELS . . . 148
  BIBLIOGRAPHY . . . 173
  APPENDIX 4A PROOFS FOR CHAPTER 4 . . . 174
  APPENDIX 4B THE RAMP MODEL WITH VARIABLE SUPPORT . . . 176
CHAPTER 5 IDENTIFICATION OF PARTIAL EFFECTS WITH ENDOGENOUS CONTROLS . . . 179
  BIBLIOGRAPHY . . . 193
  APPENDIX 5A PROOFS FOR CHAPTER 5 . . . 194

CHAPTER 1
INTRODUCTION

This dissertation aims to develop and extend estimation and inference methodologies for panel data models, nonlinear models, and high-dimensional models.

Chapter 2 of this dissertation studies a cluster robust variance estimator proposed by Chiang, Hansen, and Sasaki (2024) for linear panels. First, algebraically, it is shown that this variance estimator (CHS estimator, hereafter) is a linear combination of three common variance estimators: the one-way unit cluster estimator, the "HAC of averages" estimator, and the "average of HACs" estimator. Based on this finding, a fixed-𝑏 asymptotic result is obtained for the CHS estimator and corresponding test statistics as the cross-section and time sample sizes jointly go to infinity. Furthermore, two simple bias-corrected versions of the variance estimator are proposed. In a simulation study, it is shown that the two bias-corrected variance estimators along with fixed-𝑏 critical values provide improvements in finite sample coverage probabilities. To illustrate the impact of bias correction and the use of the fixed-𝑏 critical values on inference, an empirical example on the relationship between industry profitability and market concentration is presented.

Chapter 3 studies estimation and inference methods for panel data models with high dimensionality and unobserved heterogeneous effects. Panel data allows for the modeling of unobserved heterogeneity, which significantly increases the number of nuisance parameters, making high dimensionality a practical issue rather than just a theoretical concern. However, unobserved heterogeneity, along with potential temporal and cross-sectional dependence in panel data, further complicates estimation and inference for high-dimensional models. This chapter proposes a toolkit for robust estimation and inference in high-dimensional panel models with large cross-sectional and time sample sizes. To reduce the dimensionality, I propose a weighted LASSO using two-way cluster-robust penalty weights. Due to the cluster dependence, the rate of convergence is slow even in an oracle case. Nevertheless, by leveraging a clustered-panel cross-fitting approach for bias correction, asymptotic normality can be established for the low-dimensional vector of the estimated parameters. As a special case, inferential theories are also established using the full sample in a partial linear model with unobserved time and unit effects. In a panel estimation of the government spending multiplier, I demonstrate how high dimensionality can be hidden and how the proposed toolkit enables flexible modeling and robust inference.

Chapter 4 reassesses the use of linear models for binary responses, focusing on average partial effects (APEs).
It is confirmed that under certain conditions, linear projection parameters correspond to APEs even when the true model is nonlinear. Simulations demonstrate that a large fraction of fitted values in [0, 1] is neither necessary nor sufficient for OLS to approximate the APEs. To reduce bias, excluding observations with fitted values outside [0, 1] has been proposed. It is shown that iteratively trimming the sample is equivalent to nonlinear least squares estimation of a piecewise linear (ramp) model, for which consistency and asymptotic normality results are established.

Chapter 5 of this dissertation focuses on the endogenous control issue. Exogeneity of the treatment needed for identification is often achieved by conditioning. While control variables are explicitly or implicitly assumed to be exogenous, it is common to encounter endogenous controls in practice. This brings a dilemma: without controlling, the treatment may be endogenous; with controlling, the endogeneity of the controls may pollute the identification. The problem is not solved by an instrumental variable when the instrument is only conditionally valid and the controls are endogenous. We provide identification results for the local average response under an additional measurable separability condition between the treatment and the controls. Notably, this condition permits the controls to be dependent on the treatment. The results apply to a wide class of models ranging from linear to non-separable ones. Monte Carlo simulations exemplify this prevalent issue and demonstrate the performance of the proposed methods in finite samples.

CHAPTER 2
FIXED-B ASYMPTOTICS FOR PANEL MODELS WITH TWO-WAY CLUSTERING
(CO-AUTHORED WITH TIM VOGELSANG)

This chapter has been published as Chen and Vogelsang (2024) and is permitted to be included in the author's own dissertation per the licensing and copyright contract with Elsevier. The co-author has approved the inclusion of the co-authored chapter. The co-author's contact: Tim Vogelsang, Department of Economics, 486 W. Circle Drive, 110 Marshall-Adams Hall, Michigan State University, East Lansing, MI 48824-1038. Email: tjv@msu.edu

2.1 Introduction

When carrying out inference in a linear panel model, it is well known that failing to adjust the variance estimator of estimated parameters to allow for different dependence structures in the data can cause over-rejection/under-rejection problems under null hypotheses, which in turn can give misleading empirical findings (see Bertrand et al., 2004). To study different dependence structures and robust variance estimators in panel settings, it is now common to use a component structure model $y_{it} = f(\alpha_i, \gamma_t, \varepsilon_{it})$ where the observable data, $y_{it}$, is a function of an individual component, $\alpha_i$, a time component, $\gamma_t$, and an idiosyncratic component, $\varepsilon_{it}$. See, for example, Davezies et al. (2021), MacKinnon et al. (2021), Menzel (2021), and Chiang et al. (2024).

As a concrete example, suppose $y_{it} = \alpha_i + \varepsilon_{it}$ for $i = 1, \ldots, N$ and $t = 1, \ldots, T$, where $\alpha_i$ and $\varepsilon_{it}$ are assumed to be i.i.d. random variables. The existence of $\alpha_i$ generates serial correlation within group $i$, which is also known as the individual clustering effect. This dependence structure is well captured by the cluster variance estimator proposed by Liang and Zeger (1986) and Arellano (1987). One can also use the "average of HACs" variance estimator that uses cross-section averages of the heteroskedasticity and autocorrelation (HAC) robust variance estimator proposed by Newey and West (1987). On the other hand, suppose $y_{it} = \gamma_t + \varepsilon_{it}$ where $\gamma_t$ is assumed to be an i.i.d. sequence of random variables. Cross-sectional/spatial dependence is generated in $y_{it}$ by $\gamma_t$ through the time clustering effect. In this case one can use a variance estimator that clusters over time or use the spatial dependence robust variance estimator proposed by Driscoll and Kraay (1998). Furthermore, if both $\alpha_i$ and $\gamma_t$ are assumed to be present, e.g., $y_{it} = \alpha_i + \gamma_t + \varepsilon_{it}$, then the dependence of $\{y_{it}\}$ exists in both the time and cross-section dimensions, also known as two-way clustering effects. Correspondingly, the two-way/multi-way robust variance estimator proposed by Cameron et al. (2011) is suitable for this case.
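To make the two-way clustering mechanism concrete, the following minimal sketch (not from the paper; all components drawn as standard normals) simulates $y_{it} = \alpha_i + \gamma_t + \varepsilon_{it}$ and checks the three covariance patterns just described:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 200, 200
alpha = rng.normal(size=(N, 1))      # individual component alpha_i
gamma = rng.normal(size=(1, T))      # time component gamma_t
eps = rng.normal(size=(N, T))        # idiosyncratic component eps_it
y = alpha + gamma + eps              # y_it = alpha_i + gamma_t + eps_it

# Same unit, different periods: Cov(y_i1, y_i2) = Var(alpha_i) = 1
print(np.mean(y[:, 0] * y[:, 1]))    # close to 1
# Same period, different units: Cov(y_1t, y_2t) = Var(gamma_t) = 1
print(np.mean(y[0, :] * y[1, :]))    # close to 1
# Different unit and different period: zero covariance
print(np.mean(y[0, 1:] * y[1, :-1])) # close to 0
```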
In macroeconomics, the time effects, $\gamma_t$, can be regarded as common shocks, which are usually serially correlated. Allowing persistence in $\gamma_t$ up to a known lag structure, Thompson (2011) proposed a truncated variance estimator that is robust to dependence in both the cross-section and time dimensions. Because of the unsatisfactory finite sample performance of this rectangular-truncated estimator, Chiang et al. (2024) propose a Bartlett kernel variant (CHS variance estimator, hereafter) and establish the validity of tests based on this variance estimator using asymptotics with the cross-section sample size, $N$, and the time sample size, $T$, jointly going to infinity.

The asymptotic results for the CHS variance estimator rely on the assumption that the bandwidth, $M$, goes to infinity as $T$ goes to infinity while the bandwidth sample size ratio, $b = M/T$, is of a small order. As pointed out by Neave (1970) and Kiefer and Vogelsang (2005), the value of $b$ in a given application is a non-zero number that matters for the sampling distribution of the variance estimator. Treating $b$ as shrinking to zero in the asymptotics may miss some important features of the finite sample behavior of the variance estimator and test statistics. As noted by Andrews (1991), Kiefer and Vogelsang (2005), and many others, HAC robust tests tend to over-reject in finite samples when standard critical values are used. This is especially true when time dependence is persistent and large bandwidths are used. We document similar findings for tests based on the CHS variance estimator in our simulations.

To improve the performance of tests based on the CHS variance estimator, we derive fixed-$b$ asymptotic results (see Kiefer and Vogelsang, 2005, Sun et al., 2008, Vogelsang, 2012, Zhang and Shao, 2013, Sun, 2014, Bester et al., 2016, and Lazarus et al., 2021). Fixed-$b$ asymptotics captures some important effects of the bandwidth and kernel choices on the finite sample behavior of the variance estimator and tests and provides reference distributions that can be used to obtain critical values that depend on the bandwidth (and kernel). Our asymptotic results are obtained for $N$ and $T$ jointly going to infinity and leverage the joint asymptotic framework developed by Phillips and Moon (1999). The limiting distributions of tests based on the CHS or BCCHS estimator are not asymptotically pivotal, so we propose a plug-in method of simulating fixed-$b$ critical values. One key finding is that the CHS variance estimator has a multiplicative bias given by $1 - b + \frac{1}{3}b^2 \leq 1$, resulting in a downward bias that becomes more pronounced as the bandwidth increases.
By simply dividing the CHS variance estimator by $1 - b + \frac{1}{3}b^2$ we obtain a simple bias-corrected variance estimator that improves the performance of tests based on the CHS variance estimator even without using plug-in fixed-$b$ critical values. We label this bias-corrected CHS variance estimator as BCCHS.

As a purely algebraic result, we show that the CHS variance estimator is the sum of the Arellano cluster and Driscoll-Kraay variance estimators minus the "average of HACs" variance estimator. We show that dropping the "average of HACs" component in conjunction with bias correcting the Driscoll-Kraay component removes the asymptotic bias in the CHS variance estimator and has the same fixed-$b$ limit as the BCCHS variance estimator. We label the resulting variance estimator of this second bias correction approach as the DKA (Driscoll-Kraay+Arellano) variance estimator. Similar ideas are also used by Davezies et al. (2018) and MacKinnon et al. (2021), who argue that the removal of the negative and small order component in the variance estimator brings a computational advantage in the sense that the variance estimates are ensured to be positive semi-definite. In our simulations we find that negative CHS variance estimates can occur up to 6.4% of the time. An advantage of the DKA variance estimator is guaranteed positive semi-definiteness. The DKA variance estimator also tends to deliver tests with better finite sample coverage probabilities, although there are exceptions: when the data is independent and identically distributed (i.i.d.) in both the cross-section and time dimensions, we show the DKA estimator has a different fixed-$b$ limit and results in tests that are conservative, including the case where the bandwidth is small.¹ The fixed-$b$ limit of the CHS variance estimator is also different in the i.i.d. case, but tests based on it remain robust when the bandwidth is small.

¹ In the small bandwidth case we find that the limit of the DKA variance estimator is twice as big as the population variance for i.i.d. data, a finding similar to Theorem 2 in MacKinnon et al. (2021) in a multiway clustering setting. We thank a referee for pointing out the similarity between our results for the DKA variance estimator and the results in MacKinnon et al. (2021) for multiway cluster variance estimators when the data is i.i.d.

In a finite sample simulation study, we compare sample coverage probabilities of confidence intervals based on the CHS, BCCHS, and DKA variance estimators using critical values from both the standard normal distribution and the fixed-$b$ limits. The fixed-$b$ limits of the test statistics constructed with these three variance estimators are not pivotal, so we use a simulation method to obtain the critical values via a plug-in estimator approach to handle asymptotic nuisance parameters. While the plug-in fixed-$b$ critical values can substantially improve coverage rates relative to using standard critical values when using the CHS variance estimator, improvements from simply using the bias corrections are impressive. In the case of data-dependent bandwidths, the plug-in fixed-$b$ critical values provide further improvements in finite sample coverage probabilities when neither $T$ nor $N$ is large. Conversely, when both $N$ and $T$ are very small, bias correction alone can give more accurate finite sample coverage probabilities than bias correction with plug-in fixed-$b$ critical values. Similar results hold for tests based on the DKA variance estimator.
Overall, four different approaches for within- and across-cluster dependence robust tests are proposed: two are simple bias corrections using the BCCHS and DKA estimators, and the other two combine the BCCHS and DKA estimators with plug-in fixed-$b$ critical values. Even though tests based on BCCHS and DKA are asymptotically equivalent under the main assumptions, their finite sample performance is distinguishable. Based on theory and simulation results, we provide comprehensive empirical guidance that hinges on the researcher's assessment of the data, the model, and the priorities of the test.

The rest of the paper is organized as follows. In Section 2.2 we sketch the algebra of the CHS estimator and rewrite it as a linear combination of three well-known variance estimators. In Section 2.3 we derive fixed-$b$ limiting distributions of CHS-based tests for pooled ordinary least squares (POLS) estimators in a simple location panel model and a linear panel regression model. In Section 2.4 we derive the fixed-$b$ asymptotic bias of the CHS estimator and propose two bias-corrected variance estimators. We also derive fixed-$b$ limits for tests based on the bias-corrected variance estimators. When the data is i.i.d. a key assumption for our asymptotic results no longer holds, and we show that the asymptotic limits change in this case. Section 2.5 presents finite sample simulation results that illustrate the relative performance of $t$-tests based on the variance estimators. Some theoretical results for the two-way-fixed-effects (TWFE) estimator are also discussed along with the simulation. In Section 2.6 we illustrate the practical implications of the bias corrections and the use of fixed-$b$ critical values in an empirical example. Section 2.7 concludes the paper with guidance for empirical practice and a discussion of the limitations of the proposed approaches.

2.2 A Variance Estimator Robust to Two-Way Clustering

We first motivate the estimator of the asymptotic variance of the pooled ordinary least squares (POLS) estimator under arbitrary dependence in both the time and cross-section dimensions. Consider the linear panel model
$$y_{it} = x_{it}'\beta + u_{it}, \quad i = 1, \ldots, N, \quad t = 1, \ldots, T, \tag{2.1}$$
where $y_{it}$ is the dependent variable, $x_{it}$ is a $k \times 1$ vector of covariates, $u_{it}$ is the error term, and $\beta$ is the coefficient vector. Let $\hat{\beta}$ be the POLS estimator of $\beta$. For illustrative purposes the variance of $\hat{\beta}$ can be approximated as
$$\mathrm{Var}(\hat{\beta}) \approx \hat{Q}^{-1}\Omega_{NT}\hat{Q}^{-1},$$
where $\hat{Q} := \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}x_{it}'$ and $\Omega_{NT} := \mathrm{Var}\left(\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} v_{it}\right)$ with $v_{it} := x_{it}u_{it}$.

Without imposing assumptions on the dependence structure of $v_{it}$, it has been shown, algebraically, that $\Omega_{NT}$ has the following form (see Thompson, 2011 and Chiang et al., 2024):
$$\Omega_{NT} = \frac{1}{N^2T^2}\Bigg[\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}\mathrm{E}\big(v_{it}v_{is}'\big) + \sum_{t=1}^{T}\sum_{i=1}^{N}\sum_{j=1}^{N}\mathrm{E}\big(v_{it}v_{jt}'\big) - \sum_{t=1}^{T}\sum_{i=1}^{N}\mathrm{E}\big(v_{it}v_{it}'\big)$$
$$+ \sum_{m=1}^{T-1}\sum_{t=1}^{T-m}\Bigg\{\mathrm{E}\Bigg[\bigg(\sum_{i=1}^{N}v_{it}\bigg)\bigg(\sum_{j=1}^{N}v_{j,t+m}'\bigg)\Bigg] + \mathrm{E}\Bigg[\bigg(\sum_{i=1}^{N}v_{i,t+m}\bigg)\bigg(\sum_{j=1}^{N}v_{j,t}'\bigg)\Bigg] - \sum_{i=1}^{N}\mathrm{E}\big(v_{it}v_{i,t+m}'\big) - \sum_{i=1}^{N}\mathrm{E}\big(v_{i,t+m}v_{i,t}'\big)\Bigg\}\Bigg].$$

Based on this decomposition of $\Omega_{NT}$, Thompson (2011) and Chiang et al. (2024) each propose a truncation-type variance estimator.
In particular, Chiang et al. (2024) replace the Thompson (2011) truncation scheme with a Bartlett kernel and establish the consistency of their variance estimator while allowing two-way clustering effects with serially correlated stationary time effects. As an asymptotic approximation, appealing to consistency of the estimated variance allows the asymptotic variance to be treated as known when generating asymptotic critical values for inference. While convenient, such a consistency result does not capture the impact of the choice of $M$ and kernel function on the finite sample behavior of the variance estimator and any resulting size distortions of test statistics. To capture some of the finite sample impacts of the choice of $M$ and kernel, we apply the fixed-$b$ approach of Kiefer and Vogelsang (2005).

Notably, the CHS variance estimator can be decomposed into three well-known variance estimators, which will be helpful when we apply the fixed-$b$ approximation. Using straightforward algebra, one can show that the CHS variance estimator defined in equation (2.12) of Chiang et al. (2024) can be rewritten as
$$\hat{V}_{\mathrm{CHS}} := \hat{Q}^{-1}\hat{\Omega}_{\mathrm{CHS}}\hat{Q}^{-1}, \tag{2.2}$$
$$\hat{\Omega}_{\mathrm{CHS}} := \hat{\Omega}_{\mathrm{A}} + \hat{\Omega}_{\mathrm{DK}} - \hat{\Omega}_{\mathrm{NW}}, \tag{2.3}$$
where, with the Bartlett kernel defined as $k\big(\frac{m}{M}\big) = 1 - \frac{m}{M}$ and $M$ being the truncation parameter,
$$\hat{\Omega}_{\mathrm{A}} := \frac{1}{N^2T^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}\hat{v}_{it}\hat{v}_{is}', \tag{2.4}$$
$$\hat{\Omega}_{\mathrm{DK}} := \frac{1}{N^2T^2}\sum_{t=1}^{T}\sum_{s=1}^{T}k\bigg(\frac{|t-s|}{M}\bigg)\bigg(\sum_{i=1}^{N}\hat{v}_{it}\bigg)\bigg(\sum_{j=1}^{N}\hat{v}_{js}'\bigg), \tag{2.5}$$
$$\hat{\Omega}_{\mathrm{NW}} := \frac{1}{N^2T^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}k\bigg(\frac{|t-s|}{M}\bigg)\hat{v}_{it}\hat{v}_{is}'. \tag{2.6}$$
Notice that (2.4) is the "cluster by individuals" estimator proposed by Liang and Zeger (1986) and Arellano (1987), (2.5) is the "HAC of cross-section averages" estimator proposed by Driscoll and Kraay (1998), and (2.6) is the "average of HACs" estimator (see Petersen, 2009 and Vogelsang, 2012). In other words, $\hat{\Omega}_{\mathrm{CHS}}$ is a linear combination of three well-known variance estimators that have been proposed to handle particular forms of dependence structure. While there are some existing asymptotic results for the components in (2.3) that are potentially relevant (e.g. Hansen, 2007, Vogelsang, 2012, and Chiang et al., 2024), these results are derived either under one-way dependence or are not sufficiently comprehensive to directly obtain a fixed-$b$ result for $\hat{\Omega}_{\mathrm{CHS}}$. Some new theoretical results are needed.
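The decomposition (2.3)-(2.6) is straightforward to compute directly. The following sketch (an illustration under the assumption of a scalar score $\hat{v}_{it}$; the helper name `omega_components` is hypothetical, not from the paper) forms the three components with the Bartlett kernel:

```python
import numpy as np

def bartlett_weights(T, M):
    # w[t, s] = max(0, 1 - |t - s| / M), the Bartlett kernel k(|t-s|/M)
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return np.maximum(0.0, 1.0 - lags / M)

def omega_components(vhat, M):
    # vhat[i, t]: scalar score for unit i at time t
    N, T = vhat.shape
    w = bartlett_weights(T, M)
    scale = 1.0 / (N**2 * T**2)
    row_sums = vhat.sum(axis=1)                 # sum over t for each i
    col_sums = vhat.sum(axis=0)                 # sum over i for each t
    omega_A = scale * np.sum(row_sums**2)       # cluster by individuals, (2.4)
    omega_DK = scale * col_sums @ w @ col_sums  # Driscoll-Kraay, (2.5)
    omega_NW = scale * np.einsum('it,ts,is->', vhat, w, vhat)  # avg of HACs, (2.6)
    return omega_A, omega_DK, omega_NW

vhat = np.random.default_rng(1).normal(size=(25, 25))
oA, oDK, oNW = omega_components(vhat, M=5)
omega_CHS = oA + oDK - oNW                      # the combination in (2.3)
```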
2.3 Fixed-𝑏 Asymptotic Results

2.3.1 The Multivariate Mean Case

To set ideas and intuition we first focus on a simple panel mean model (panel location model) of a $k \times 1$ random vector $y_{it}$ and then extend the analysis to the linear regression case. We use a large-$N$ and large-$T$ framework where $N/T \to c$ for some constant $c$ such that $0 < c < \infty$. As a natural way to model panel data with two-way effects, we follow Chiang et al. (2024) and assume that $y_{it}$ is generated as follows.

Assumption 2.1 $y_{it} = \theta + f(\alpha_i, \gamma_t, \varepsilon_{it})$, where $\theta = \mathrm{E}(y_{it})$ and $f$ is an unknown Borel-measurable function, the sequences $\{\alpha_i\}$, $\{\gamma_t\}$, and $\{\varepsilon_{it}\}$ are mutually independent, $\alpha_i$ is i.i.d. across $i$, $\varepsilon_{it}$ is i.i.d. across $i$ and $t$, and $\gamma_t$ is a strictly stationary serially correlated process.

The time component, $\gamma_t$, is allowed to have serial correlation given that panel data typically has serial correlation beyond that induced by individual effects. As pointed out by Chiang et al. (2024) at the beginning of their Section 3, the data-generating process in Assumption 2.1 is a strict generalization of the representation developed by Hoover (1979), Aldous (1981) and Kallenberg (1989) (the so-called AHK representation) precisely because $\gamma_t$ is allowed to have serial correlation. The AHK representation is not sufficient here because it was developed for data drawn from an infinite array of jointly exchangeable random variables, in which case $\gamma_t$ would not have serial correlation.

Using the representation in Assumption 2.1, Chiang et al. (2024) develop the following decomposition of $y_{it}$. Denoting $a_i = \mathrm{E}(y_{it} - \theta \mid \alpha_i)$, $g_t = \mathrm{E}(y_{it} - \theta \mid \gamma_t)$, and $e_{it} = (y_{it} - \theta) - a_i - g_t$, one can decompose $y_{it} - \theta$ as
$$y_{it} - \theta = a_i + g_t + e_{it} =: v_{it}.$$
Chiang et al. (2024) show that the components are mean zero and that $e_{it}$ is also mean zero conditional on $a_i$ and conditional on $g_t$. The individual component, $a_i$, is i.i.d. across $i$, and the time component, $g_t$, is stationary. Conditional on $\gamma_t$, the $e_{it}$ component is independent across $i$. Finally, the three components are uncorrelated with each other, with $a_i$ and $g_t$ being independent of each other. See Section 3.1 of Chiang et al. (2024) for details.

We can estimate $\theta$ using the pooled sample mean estimator given by $\hat{\theta} = (NT)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T} y_{it}$. Rewriting the sample mean using the component structure representation for $y_{it}$ gives
$$\hat{\theta} - \theta = \frac{1}{N}\sum_{i=1}^{N} a_i + \frac{1}{T}\sum_{t=1}^{T} g_t + \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} e_{it} =: \bar{a} + \bar{g} + \bar{e}. \tag{2.7}$$
The Chiang et al. (2024) variance estimator of $\hat{\theta}$ is given by (2.3) with $\hat{v}_{it} = y_{it} - \hat{\theta}$ used in (2.4)-(2.6). To obtain fixed-$b$ results for $\hat{\Omega}_{\mathrm{CHS}}$ we rewrite the formula for $\hat{\Omega}_{\mathrm{CHS}}$ in terms of the following two partial sum processes of $\hat{v}_{it}$:
$$\hat{S}_{it} = \sum_{j=1}^{t}\hat{v}_{ij} = t(a_i - \bar{a}) + \sum_{j=1}^{t}\big(g_j - \bar{g}\big) + \sum_{j=1}^{t}\big(e_{ij} - \bar{e}\big), \tag{2.8}$$
$$\hat{\bar{S}}_t = \sum_{i=1}^{N}\hat{S}_{it} = \sum_{i=1}^{N}\sum_{j=1}^{t}\hat{v}_{ij} = N\sum_{j=1}^{t}\big(g_j - \bar{g}\big) + \sum_{i=1}^{N}\sum_{j=1}^{t}\big(e_{ij} - \bar{e}\big). \tag{2.9}$$
Note that the $a_i$ component drops from (2.9) because $\sum_{i=1}^{N}(a_i - \bar{a}) = 0$. The Arellano component (2.4) of $\hat{\Omega}_{\mathrm{CHS}}$ is obviously a simple function of (2.8) with $t = T$. The HAC components (2.5) and (2.6) can be written in terms of (2.9) and (2.8) using fixed-$b$ algebra (see Vogelsang, 2012). Therefore, the Chiang et al. (2024) variance estimator has the following equivalent formula:
$$\hat{\Omega}_{\mathrm{CHS}} = \frac{1}{N^2T^2}\sum_{i=1}^{N}\hat{S}_{iT}\hat{S}_{iT}' + \frac{1}{N^2T^2}\Bigg\{\frac{2}{M}\sum_{t=1}^{T-1}\hat{\bar{S}}_t\hat{\bar{S}}_t' - \frac{1}{M}\sum_{t=1}^{T-M-1}\Big(\hat{\bar{S}}_t\hat{\bar{S}}_{t+M}' + \hat{\bar{S}}_{t+M}\hat{\bar{S}}_t'\Big)\Bigg\} \tag{2.10}$$
$$- \frac{1}{N^2T^2}\sum_{i=1}^{N}\Bigg\{\frac{2}{M}\sum_{t=1}^{T-1}\hat{S}_{it}\hat{S}_{it}' - \frac{1}{M}\sum_{t=1}^{T-M-1}\Big(\hat{S}_{it}\hat{S}_{i,t+M}' + \hat{S}_{i,t+M}\hat{S}_{i,t}'\Big) - \frac{1}{M}\sum_{t=T-M}^{T-1}\Big(\hat{S}_{it}\hat{S}_{iT}' + \hat{S}_{iT}\hat{S}_{it}'\Big) + \hat{S}_{iT}\hat{S}_{iT}'\Bigg\}. \tag{2.11}$$
Define three $k \times k$ matrices $\Lambda_a$, $\Lambda_g$, and $\Lambda_e$ such that
$$\Lambda_a\Lambda_a' = \mathrm{E}(a_ia_i'), \quad \Lambda_g\Lambda_g' = \sum_{\ell=-\infty}^{\infty}\mathrm{E}[g_tg_{t+\ell}'], \quad \Lambda_e\Lambda_e' = \sum_{\ell=-\infty}^{\infty}\mathrm{E}[e_{it}e_{i,t+\ell}']. \tag{2.12}$$
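The partial-sum rewrite in (2.10)-(2.11) rests on a purely algebraic identity for the Bartlett kernel. A numerical sanity check of that identity for a single scalar series (a sketch under the stated parameterization $k(m/M) = 1 - m/M$ with integer $M < T$):

```python
import numpy as np

rng = np.random.default_rng(2)
T, M = 50, 7
v = rng.normal(size=T)
S = np.cumsum(v)                              # partial sums S_t

lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
k = np.maximum(0.0, 1.0 - lags / M)
lhs = v @ k @ v                               # sum_{t,s} k(|t-s|/M) v_t v_s

rhs = (2 / M) * np.sum(S[:-1] ** 2) \
    - (2 / M) * np.sum(S[: T - M - 1] * S[M : T - 1]) \
    - (2 / M) * np.sum(S[T - M - 1 : T - 1] * S[-1]) \
    + S[-1] ** 2                              # end terms vanish when S_T = 0
print(np.isclose(lhs, rhs))                   # True
```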
The following assumption is used to obtain an asymptotic result for (2.7) and a fixed-$b$ asymptotic result for $\hat{\Omega}_{\mathrm{CHS}}$. Throughout the paper, let $\|\cdot\|$ denote the Euclidean norm for matrices, and let $\lambda_{min}[\cdot]$ denote the smallest eigenvalue of a square matrix.

Assumption 2.2 For some $s > 1$ and $\delta > 0$,
(i) $\mathrm{E}[y_{it}] = \theta$ and $\mathrm{E}[\|y_{it}\|^{4(s+\delta)}] < \infty$.
(ii) $\gamma_t$ is an $\alpha$-mixing sequence with size $2s/(s-1)$, i.e., $\alpha_\gamma(\ell) = O(\ell^{-\lambda})$ for a $\lambda > 2s/(s-1)$.
(iii) $\lambda_{min}[\Lambda_a\Lambda_a'] > 0$ and/or $\lambda_{min}[\Lambda_g\Lambda_g'] > 0$, and $N/T \to c$ as $(N, T) \to \infty$ for some constant $c$.
(iv) $M = [bT]$ where $b \in (0, 1]$.

Assumption 2.2(i) assumes the mean of $y_{it}$ exists and $y_{it}$ has finite fourth moments. Assumption 2.2(ii) assumes weak dependence of $\gamma_t$ using a mixing condition. Assumption 2.2(i)-(ii) follow Chiang et al. (2024). Assumption 2.2(iii) is a non-degeneracy restriction on the projected individual and time components. Clearly, when data is i.i.d. over both the cross-section and time dimensions, this condition does not hold. Because the fixed-$b$ limits of $\hat{\Omega}_{\mathrm{CHS}}$ and its associated test statistics turn out to be different in the i.i.d. case, we discuss it separately in Section 2.4. Assumption 2.2(iii) also rules out the pathological case described in Example 1.7 of Menzel (2021): when $y_{it} = \alpha_i\gamma_t + \varepsilon_{it}$ with $\mathrm{E}(\alpha_i) = \mathrm{E}(\gamma_t) = 0$, one can easily verify that $a_i = g_t = 0$, in which case the limiting distribution of appropriately scaled $\hat{\theta}$ is non-Gaussian. Assumption 2.2(iv) uses the fixed-$b$ asymptotic nesting for the bandwidth.

The following theorem gives an asymptotic result for appropriately scaled $\hat{\theta}$ and a fixed-$b$ asymptotic result for appropriately scaled $\hat{\Omega}_{\mathrm{CHS}}$.

Theorem 2.1 Let $z_k$ be a $k \times 1$ vector of independent standard normal random variables, and let $W_k(r)$, $r \in (0,1]$, be a $k \times 1$ vector of independent standard Wiener processes independent of $z_k$. Suppose Assumptions 2.1 and 2.2 hold; then as $(N, T) \to \infty$,
$$\sqrt{N}\big(\hat{\theta} - \theta\big) \Rightarrow \Lambda_a z_k + \sqrt{c}\,\Lambda_g W_k(1),$$
$$N\hat{\Omega}_{\mathrm{CHS}} \Rightarrow h(b)\Lambda_a\Lambda_a' + c\,\Lambda_g P\big(b, \widetilde{W}_k(r)\big)\Lambda_g', \tag{2.13}$$
where $h(b) := 1 - b + \frac{1}{3}b^2$, $\widetilde{W}_k(r) := W_k(r) - rW_k(1)$, i.e., a Brownian bridge, and
$$P\big(b, \widetilde{W}_k(r)\big) := \frac{2}{b}\int_0^1 \widetilde{W}_k(r)\widetilde{W}_k(r)'\,dr - \frac{1}{b}\int_0^{1-b}\big[\widetilde{W}_k(r)\widetilde{W}_k(r+b)' + \widetilde{W}_k(r+b)\widetilde{W}_k(r)'\big]\,dr.$$

The proof of Theorem 2.1 is given in Appendix 2A. The limit of $\sqrt{N}(\hat{\theta} - \theta)$ was obtained by Chiang et al. (2024). Because $z_k$ and $W_k(1)$ are vectors of independent standard normals that are independent of each other, $\Lambda_a z_k + \sqrt{c}\Lambda_g W_k(1)$ is a vector of normal random variables with variance-covariance matrix $\Lambda_a\Lambda_a' + c\Lambda_g\Lambda_g'$. The $\Lambda_g P\big(b, \widetilde{W}_k(r)\big)\Lambda_g'$ component of (2.13) is equivalent to the fixed-$b$ limit obtained by Kiefer and Vogelsang (2005) in stationary time series settings. Obviously, (2.13) is different than the limit obtained by Kiefer and Vogelsang (2005) because of the $h(b)\Lambda_a\Lambda_a'$ term. As the proof illustrates, this term is the limit of the "cluster by individuals" (2.4) and "average of HACs" (2.6) components, whereas the $c\Lambda_g P\big(b, \widetilde{W}_k(r)\big)\Lambda_g'$ term is the limit of the "HAC of averages" (2.5). Interestingly, Kiefer and Vogelsang (2005) showed that
$$\mathrm{E}\Big(P\big(b, \widetilde{W}_k(r)\big)\Big) = \Big(1 - b + \frac{1}{3}b^2\Big)I_k = h(b)I_k, \tag{2.14}$$
where $I_k$ is a $k \times k$ identity matrix. The fact that both terms in the limit of $N\hat{\Omega}_{\mathrm{CHS}}$ are proportional to $h(b)$ suggests a simple bias correction that is discussed in Section 2.4.1. Because of the component structure of (2.13), the fixed-$b$ limits of $t$ and Wald statistics based on $\hat{\Omega}_{\mathrm{CHS}}$ are not pivotal. We provide details on test statistics after extending our results to the case of a linear panel regression.
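The moment formula (2.14) can be confirmed numerically. A sketch for $k = 1$, approximating the Wiener process by scaled partial sums as is standard in the fixed-$b$ literature:

```python
import numpy as np

def P_draw(b, n=1000, rng=None):
    # One draw of P(b, W~) for k = 1 with the Bartlett kernel.
    rng = rng or np.random.default_rng()
    W = np.cumsum(rng.normal(size=n)) / np.sqrt(n)     # W(r) on a grid
    r = np.arange(1, n + 1) / n
    Wt = W - r * W[-1]                                 # Brownian bridge W~(r)
    m = int(b * n)
    return (2 / b) * np.mean(Wt**2) \
         - (2 / b) * np.sum(Wt[: n - m] * Wt[m:]) / n  # discretized integrals

b = 0.4
rng = np.random.default_rng(3)
draws = [P_draw(b, rng=rng) for _ in range(2000)]
print(np.mean(draws), 1 - b + b**2 / 3)                # both close to 0.653
```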
2.3.2 The Linear Panel Regression Case

It is straightforward to extend our results to the case of a linear panel regression given by (2.1). The POLS estimator of $\beta$ is
$$\hat{\beta} = \beta + \hat{Q}^{-1}\Bigg(\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}u_{it}\Bigg), \tag{2.15}$$
where $\hat{Q} := \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}x_{it}'$ as in Section 2.2. Following Chiang et al. (2024), we assume the components of the panel regression are generated from the component structure: $(y_{it}, x_{it}', u_{it})' = f(\alpha_i, \gamma_t, \varepsilon_{it})$ where $f$ is an unknown Borel-measurable function, the sequences $\{\alpha_i\}$, $\{\gamma_t\}$, and $\{\varepsilon_{it}\}$ are mutually independent, $\alpha_i$ is i.i.d. across $i$, $\varepsilon_{it}$ is i.i.d. across $i$ and $t$, and $\gamma_t$ is a strictly stationary serially correlated process. Define the vector $v_{it} = x_{it}u_{it}$. Similar to the simple mean model we can write $a_i = \mathrm{E}(v_{it} \mid \alpha_i)$, $g_t = \mathrm{E}(v_{it} \mid \gamma_t)$, $e_{it} = v_{it} - a_i - g_t$, giving the decomposition $v_{it} = a_i + g_t + e_{it}$. The Chiang et al. (2024) variance estimator of $\hat{\beta}$ is given by (2.2) with $\hat{v}_{it}$ in (2.4)-(2.6) now defined as $\hat{v}_{it} = x_{it}\hat{u}_{it}$ where $\hat{u}_{it} = y_{it} - x_{it}'\hat{\beta}$ are the POLS residuals.

The following assumption is used to obtain an asymptotic result for (2.15) and a fixed-$b$ asymptotic result for $\hat{\Omega}_{\mathrm{CHS}}$ in the linear panel case.

Assumption 2.3 For some $s > 1$ and $\delta > 0$,
(i) $(y_{it}, x_{it}', u_{it})' = f(\alpha_i, \gamma_t, \varepsilon_{it})$ where $\{\alpha_i\}$, $\{\gamma_t\}$, and $\{\varepsilon_{it}\}$ are mutually independent sequences, $\alpha_i$ is i.i.d. across $i$, $\varepsilon_{it}$ is i.i.d. across $i$ and $t$, and $\gamma_t$ is strictly stationary.
(ii) $\mathrm{E}[x_{it}u_{it}] = 0$, $\lambda_{min}\big[\mathrm{E}[x_{it}x_{it}']\big] > 0$, $\mathrm{E}[\|x_{it}\|^{8(s+\delta)}] < \infty$, and $\mathrm{E}[\|u_{it}\|^{8(s+\delta)}] < \infty$.
(iii) $\gamma_t$ is an $\alpha$-mixing sequence with size $2s/(s-1)$, i.e., $\alpha_\gamma(\ell) = O(\ell^{-\lambda})$ for a $\lambda > 2s/(s-1)$.
(iv) $\lambda_{min}[\Lambda_a\Lambda_a'] > 0$ and/or $\lambda_{min}[\Lambda_g\Lambda_g'] > 0$, and $N/T \to c$ as $(N, T) \to \infty$ for some constant $c$.
(v) $M = [bT]$ where $b \in (0, 1]$.

Assumption 2.3 can be regarded as a counterpart of Assumptions 2.1 and 2.2 with Assumption 2.3(ii) being strengthened. It is very similar to its counterpart in Chiang et al. (2024), with a main difference being the use of the fixed-$b$ asymptotic nesting for the bandwidth, $M$. For the same reason mentioned in the previous section, we discuss the case where $(x_{it}, u_{it})$ are i.i.d. separately in Section 2.4. The next theorem presents the joint limit of the POLS estimator and the fixed-$b$ joint limit of the CHS variance estimator.

Theorem 2.2 Let $z_k$, $W_k(r)$, $\widetilde{W}_k(r)$, $P\big(b, \widetilde{W}_k(r)\big)$ and $h(b)$ be defined as in Theorem 2.1. Suppose Assumption 2.3 holds for model (2.1); then as $(N, T) \to \infty$,
$$\sqrt{N}\big(\hat{\beta} - \beta\big) \Rightarrow Q^{-1}B_k(c), \quad \text{where } B_k(c) := \Lambda_a z_k + \sqrt{c}\,\Lambda_g W_k(1),$$
and
$$N\hat{V}_{\mathrm{CHS}}(\hat{\beta}) \Rightarrow Q^{-1}V_k(b, c)Q^{-1}, \tag{2.16}$$
where $V_k(b, c) := h(b)\Lambda_a\Lambda_a' + c\Lambda_g P\big(b, \widetilde{W}_k(r)\big)\Lambda_g'$.

The proof of Theorem 2.2 is given in Appendix 2A. We can see that the limiting random variable, $V_k(b, c)$, depends on the choice of truncation parameter, $M$, through $b$. The use of the Bartlett kernel is reflected in the functional form of $P\big(b, \widetilde{W}_k(r)\big)$ as well as the scaling term $h(b)$ on $\Lambda_a\Lambda_a'$. Use of a different kernel would result in different functional forms for these limits. Because of (2.14), it follows that
$$\mathrm{E}\big(V_k(b, c)\big) = h(b)\Lambda_a\Lambda_a' + c\Lambda_g\mathrm{E}\Big[P\big(b, \widetilde{W}_k(r)\big)\Big]\Lambda_g' = h(b)\big(\Lambda_a\Lambda_a' + c\Lambda_g\Lambda_g'\big). \tag{2.17}$$
The scalar $h(b)$ can be viewed as a multiplicative bias term that depends on the bandwidth sample size ratio, $b = M/T$.
We leverage this fact to implement a simple feasible bias correction for the CHS variance estimator that is explored below.

Using the theoretical results developed in this section, we next examine the properties of test statistics based on the POLS estimator and the CHS variance estimator. We also analyze tests based on two variants of the CHS variance estimator. One is a bias-corrected estimator. The other is a variance estimator guaranteed to be positive semi-definite that is also bias-corrected.

2.4 Inference

In regression model (2.1) we focus on tests of linear hypotheses of the form
$$H_0: R\beta = r, \qquad H_1: R\beta \neq r,$$
where $R$ is a $q \times k$ matrix ($q \leq k$) with full rank equal to $q$, and $r$ is a $q \times 1$ vector. Using $\hat{V}_{\mathrm{CHS}}(\hat{\beta})$ as given by (2.2), define a Wald statistic as
$$W_{\mathrm{CHS}} = \big(R\hat{\beta} - r\big)'\Big(R\hat{V}_{\mathrm{CHS}}(\hat{\beta})R'\Big)^{-1}\big(R\hat{\beta} - r\big).$$
When $q = 1$, we can define a $t$-statistic as
$$t_{\mathrm{CHS}} = \frac{R\hat{\beta} - r}{\sqrt{R\hat{V}_{\mathrm{CHS}}(\hat{\beta})R'}}.$$
Appropriately scaling the numerators and denominators of the test statistics and applying Theorem 2.2, we obtain under $H_0$:
$$W_{\mathrm{CHS}} = \sqrt{N}\big(R\hat{\beta} - r\big)'\Big(RN\hat{V}_{\mathrm{CHS}}(\hat{\beta})R'\Big)^{-1}\sqrt{N}\big(R\hat{\beta} - r\big) \Rightarrow \big(RQ^{-1}B_k(c)\big)'\big(RQ^{-1}V_k(b,c)Q^{-1}R'\big)^{-1}\big(RQ^{-1}B_k(c)\big) =: W^{\infty}_{\mathrm{CHS}}, \tag{2.18}$$
$$t_{\mathrm{CHS}} = \frac{\sqrt{N}\big(R\hat{\beta} - r\big)}{\sqrt{RN\hat{V}_{\mathrm{CHS}}(\hat{\beta})R'}} \Rightarrow \frac{RQ^{-1}B_k(c)}{\sqrt{RQ^{-1}V_k(b,c)Q^{-1}R'}} =: t^{\infty}_{\mathrm{CHS}}. \tag{2.19}$$
The limits of $W_{\mathrm{CHS}}$ and $t_{\mathrm{CHS}}$ are similar to the fixed-$b$ limits obtained by Kiefer and Vogelsang (2005) but have distinct differences. First, the form of $V_k(b, c)$ depends on two variance matrices rather than one. Second, the variance matrices do not scale out of the statistics. Therefore, the fixed-$b$ limits given by (2.18) and (2.19) are not pivotal. We propose a plug-in method for the simulation of critical values from these asymptotic random variables.

For the case where $b$ is small, the fixed-$b$ critical values are close to $\chi^2_q$ and $N(0,1)$ critical values respectively. This can be seen by computing the probability limits of the asymptotic distributions as $b \to 0$. In particular, using the fact that $\operatorname{plim}_{b\to0} P\big(b, \widetilde{W}_k(r)\big) = I_k$ (see Kiefer and Vogelsang, 2005), it follows that
$$\operatorname*{plim}_{b\to0} V_k(b,c) = \operatorname*{plim}_{b\to0}\Big[h(b)\Lambda_a\Lambda_a' + c\Lambda_g P\big(b, \widetilde{W}_k(r)\big)\Lambda_g'\Big] = \Lambda_a\Lambda_a' + c\Lambda_g\Lambda_g' = \mathrm{Var}\big(B_k(c)\big),$$
where $h(\cdot)$ and $P(\cdot)$ are defined in Theorem 2.1. Therefore, it follows that
$$\operatorname*{plim}_{b\to0}\Big[\big(RQ^{-1}B_k(c)\big)'\big(RQ^{-1}V_k(b,c)Q^{-1}R'\big)^{-1}\big(RQ^{-1}B_k(c)\big)\Big] = \big(RQ^{-1}B_k(c)\big)'\big(RQ^{-1}\mathrm{Var}(B_k(c))Q^{-1}R'\big)^{-1}\big(RQ^{-1}B_k(c)\big) \sim \chi^2_q,$$
and
$$\operatorname*{plim}_{b\to0}\Bigg[\frac{RQ^{-1}B_k(c)}{\sqrt{RQ^{-1}V_k(b,c)Q^{-1}R'}}\Bigg] = \frac{RQ^{-1}B_k(c)}{\sqrt{RQ^{-1}\mathrm{Var}(B_k(c))Q^{-1}R'}} \sim N(0,1).$$
In practice, there will not be a substantial difference between using $\chi^2_q$ and $N(0,1)$ critical values and fixed-$b$ critical values for small bandwidths. However, for larger bandwidths more reliable inference can be obtained with fixed-$b$ critical values.

2.4.1 Bias-Corrected CHS Variance Estimator

We now leverage the form of the mean of the fixed-$b$ limit of the CHS variance estimator as given by (2.17) to propose a bias-corrected version of the CHS variance estimator. The idea is simple.
We can scale out the $h(b)$ multiplicative term evaluated at $b = M/T$ to make the CHS variance estimator an asymptotically unbiased estimator of $\Lambda_a\Lambda_a' + c\Lambda_g\Lambda_g'$, the variance of $B_k(c) = \Lambda_a z_k + \sqrt{c}\Lambda_g W_k(1)$. Define the bias-corrected CHS variance estimator as
$$\hat{V}_{\mathrm{BCCHS}}\big(\hat{\beta}\big) = \hat{Q}^{-1}\hat{\Omega}_{\mathrm{BCCHS}}\hat{Q}^{-1}, \qquad \hat{\Omega}_{\mathrm{BCCHS}} = \Big(h\big(\tfrac{M}{T}\big)\Big)^{-1}\hat{\Omega}_{\mathrm{CHS}},$$
and the corresponding Wald and $t$-statistics under the null hypothesis $R\beta = r$ as
$$W_{\mathrm{BCCHS}} = \big(R\hat{\beta} - r\big)'\Big(R\hat{V}_{\mathrm{BCCHS}}\big(\hat{\beta}\big)R'\Big)^{-1}\big(R\hat{\beta} - r\big), \qquad t_{\mathrm{BCCHS}} = \frac{R\hat{\beta} - r}{\sqrt{R\hat{V}_{\mathrm{BCCHS}}\big(\hat{\beta}\big)R'}}.$$
Because $\hat{\Omega}_{\mathrm{BCCHS}}$ is a simple scalar multiple of $\hat{\Omega}_{\mathrm{CHS}}$, we easily obtain the fixed-$b$ limits
$$W_{\mathrm{BCCHS}} \Rightarrow h(b)W^{\infty}_{\mathrm{CHS}} =: W^{\infty}_{\mathrm{BCCHS}}, \tag{2.20}$$
$$t_{\mathrm{BCCHS}} \Rightarrow h(b)^{1/2}t^{\infty}_{\mathrm{CHS}} =: t^{\infty}_{\mathrm{BCCHS}}. \tag{2.21}$$
Notice that while the fixed-$b$ limits are different when using the bias-corrected CHS variance estimator, they are scalar multiples of the fixed-$b$ limits when using the original CHS variance estimator. Therefore, the fixed-$b$ critical values of $W_{\mathrm{BCCHS}}$ and $t_{\mathrm{BCCHS}}$ are proportional to the fixed-$b$ critical values of $W_{\mathrm{CHS}}$ and $t_{\mathrm{CHS}}$. As long as fixed-$b$ critical values are used, there is no practical effect on inference from using the bias-corrected CHS variance estimator. Where the bias correction matters is when $\chi^2_q$ and $N(0,1)$ critical values are used. In this case, the bias-corrected CHS variance estimator can provide more accurate finite sample inference. This will be illustrated by our finite sample simulations.

2.4.2 An Alternative Bias-Corrected Variance Estimator

As noted by Chiang et al. (2024), the CHS variance estimator does not ensure positive-definiteness, which is also the case for the clustered estimator proposed by Cameron et al. (2011). Davezies et al. (2018) and MacKinnon et al. (2021) point out that the double-counting adjustment term in the estimator of Cameron et al. (2011) is of small order, and removing the adjustment term has the computational advantage of guaranteeing positive semi-definiteness. Analogously, we can think of $\hat{\Omega}_{\mathrm{NW}}$, as given by (2.6), as a double-counting adjustment term. If we exclude this term, the variance estimator becomes the sum of two positive semi-definite terms and is guaranteed to be positive semi-definite. Another motivation for dropping (2.6) is that, under fixed-$b$ asymptotics, (2.6) simply contributes downward bias in the estimation of the $\Lambda_a\Lambda_a'$ term of $\mathrm{Var}(B_k(c))$ through the $-b + \frac{1}{3}b^2$ part of $h(b)$ in the $h(b)\Lambda_a\Lambda_a'$ portion of $V_k(b, c)$. Intuitively, the Arellano cluster estimator takes care of the serial correlation introduced by $a_i$, and the DK estimator takes care of the cross-section and time dependence introduced by $g_t$. From this perspective, $\hat{\Omega}_{\mathrm{NW}}$ is not needed. Accordingly, we propose a variance estimator that is the sum of the Arellano variance estimator and the bias-corrected DK variance estimator (labeled DKA hereafter), defined as
$$\hat{\Omega}_{\mathrm{DKA}} := \hat{\Omega}_{\mathrm{A}} + h(b)^{-1}\hat{\Omega}_{\mathrm{DK}},$$
where $\hat{\Omega}_{\mathrm{A}}$ and $\hat{\Omega}_{\mathrm{DK}}$ are defined in (2.4) and (2.5). Notice that we bias correct the DK component so that the resulting variance estimator is asymptotically unbiased under fixed-$b$ asymptotics. This can improve inference should $\chi^2_q$ or $N(0,1)$ critical values be used in practice.
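Both bias corrections are one-line adjustments once the components of (2.3) are in hand. A sketch for the scalar case, reusing the hypothetical `omega_components` helper sketched in Section 2.2:

```python
def h(b):
    # Multiplicative bias factor from Theorem 2.1 and (2.17)
    return 1.0 - b + b**2 / 3.0

def bias_corrected(vhat, M):
    N, T = vhat.shape
    oA, oDK, oNW = omega_components(vhat, M)   # helper sketched in Section 2.2
    b = M / T
    omega_bcchs = (oA + oDK - oNW) / h(b)      # BCCHS: rescale the CHS estimator
    omega_dka = oA + oDK / h(b)                # DKA: drop NW, bias-correct DK only
    return omega_bcchs, omega_dka
```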
The following theorem gives the fixed-$b$ limit of the scaled DKA variance estimator.

Theorem 2.3 Suppose Assumption 2.3 holds for model (2.1); then as $(N, T) \to \infty$,
$$N\hat{\Omega}_{\mathrm{DKA}} \Rightarrow \Lambda_a\Lambda_a' + c\Lambda_g h(b)^{-1}P\big(b, \widetilde{W}_k(r)\big)\Lambda_g' = h(b)^{-1}V_k(b, c). \tag{2.22}$$

The proof of Theorem 2.3 can be found in Appendix 2A. Define the statistics $W_{\mathrm{DKA}}$ and $t_{\mathrm{DKA}}$ analogous to $W_{\mathrm{BCCHS}}$ and $t_{\mathrm{BCCHS}}$ using the variance estimator for $\hat{\beta}$ given by $\hat{V}_{\mathrm{DKA}}\big(\hat{\beta}\big) = \hat{Q}^{-1}\hat{\Omega}_{\mathrm{DKA}}\hat{Q}^{-1}$. Applying Theorems 2.2 and 2.3, we obtain the fixed-$b$ limits of the Wald and $t$ statistics associated with the DKA variance estimator under the null:
$$W_{\mathrm{DKA}} \Rightarrow W^{\infty}_{\mathrm{BCCHS}}, \qquad t_{\mathrm{DKA}} \Rightarrow t^{\infty}_{\mathrm{BCCHS}},$$
which are the same as the limits of $W_{\mathrm{BCCHS}}$ and $t_{\mathrm{BCCHS}}$ given by (2.20) and (2.21).

2.4.3 Results for i.i.d. Data

While the DKA variance estimator is guaranteed to be positive semi-definite, this useful property comes with a potential cost. As is shown in Theorem 2 of MacKinnon et al. (2021), if the score $x_{it}u_{it}$ is i.i.d. over $i$ and $t$, or if clusters are formed at the intersection between individuals and time,² the probability limit of two-way cluster-robust variance estimators that drop the double-counting adjustment term, referred to as two-term variance estimators, is twice the size of the true variance. In other words, if the researcher believes there is clustering when there is none, the use of a two-term estimator would overestimate the asymptotic variance. The associated Wald and $t$-statistics will be scaled down, causing over-coverage (under-rejection) problems under the null hypothesis. The following assumption and theorem give fixed-$b$ results for the CHS, BCCHS and DKA statistics for the case of i.i.d. data.

² In our setting where the clustering only happens at individual and time levels, clustering at the intersection is the same as independence across individuals and times.

Assumption 2.4 For some $s > 1$ and $\delta > 0$,
(i) $(x_{it}, u_{it})$ are independent and identically distributed over $i$ and $t$.
(ii) $\mathrm{E}[x_{it}u_{it}] = 0$, $\lambda_{min}\big[\mathrm{E}[x_{it}x_{it}']\big] > 0$, $\mathrm{E}[\|x_{it}\|^{8(s+\delta)}] < \infty$, and $\mathrm{E}[\|u_{it}\|^{8(s+\delta)}] < \infty$.
(iii) $N/T \to c$ as $(N, T) \to \infty$ for some constant $c$.
(iv) $M = [bT]$ for $b \in (0, 1]$.

Theorem 2.4 Suppose Assumption 2.4 holds for model (2.1); then as $(N, T) \to \infty$,
$$W_{\mathrm{CHS}} \Rightarrow W_q(1)'\Big\{P\big(b, \widetilde{W}_q(r)\big)\Big\}^{-1}W_q(1) =: W^{\infty,iid}_{\mathrm{CHS}}, \qquad W_{\mathrm{BCCHS}} \Rightarrow h(b)W^{\infty,iid}_{\mathrm{CHS}} =: W^{\infty,iid}_{\mathrm{BCCHS}},$$
$$W_{\mathrm{DKA}} \Rightarrow W_q(1)'\Big\{I_q + h(b)^{-1}P\big(b, \widetilde{W}_q(r)\big)\Big\}^{-1}W_q(1) =: W^{\infty,iid}_{\mathrm{DKA}},$$
$$t_{\mathrm{CHS}} \Rightarrow \frac{W_1(1)}{\sqrt{P\big(b, \widetilde{W}_1(r)\big)}} =: t^{\infty,iid}_{\mathrm{CHS}}, \qquad t_{\mathrm{BCCHS}} \Rightarrow h(b)^{1/2}t^{\infty,iid}_{\mathrm{CHS}} =: t^{\infty,iid}_{\mathrm{BCCHS}}, \qquad t_{\mathrm{DKA}} \Rightarrow \frac{W_1(1)}{\sqrt{1 + h(b)^{-1}P\big(b, \widetilde{W}_1(r)\big)}} =: t^{\infty,iid}_{\mathrm{DKA}}.$$

Theorem 2.4 shows that the fixed-$b$ limits in the i.i.d. case are different for all three test statistics than the limits given by (2.18), (2.19) for CHS and (2.20), (2.21) for BCCHS and DKA. Suppose tests are carried out using $\chi^2_q$ and $N(0,1)$ critical values. The limits in Theorem 2.4 can be used to compute asymptotic null rejection probabilities, or equivalently, asymptotic coverage probabilities for the case of i.i.d. data. For a two-tailed 5% $t$-test, the coverage probabilities are given by
$$P\Big(\big|t^{\infty,iid}_{\mathrm{CHS}}\big| \leq 1.96\Big), \qquad P\Big(\big|t^{\infty,iid}_{\mathrm{BCCHS}}\big| \leq 1.96\Big), \qquad P\Big(\big|t^{\infty,iid}_{\mathrm{DKA}}\big| \leq 1.96\Big).$$
For small bandwidths, we can analytically compute these asymptotic coverage probabilities.
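For example, the next paragraph shows that as $b \to 0$ the limit of $t^{\infty,iid}_{\mathrm{DKA}}$ is $N(0, 1/2)$; a two-line check (a sketch using only the standard normal cdf) of the implied coverage:

```python
from math import erf, sqrt

Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))        # standard normal cdf
# If t ~ N(0, 1/2), then P(|t| <= 1.96) = P(|Z| <= 1.96 * sqrt(2)) for Z ~ N(0, 1)
print(round(100 * (2 * Phi(1.96 * sqrt(2.0)) - 1), 1))  # 99.4
```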
As $b \to 0$, the limits of $t^{\infty,iid}_{\mathrm{CHS}}$ and $t^{\infty,iid}_{\mathrm{BCCHS}}$ converge to $N(0,1)$ random variables, giving asymptotic coverage of 95%. In contrast, as $b \to 0$, the limit of $t^{\infty,iid}_{\mathrm{DKA}}$ is a $N(0, \frac{1}{2})$ random variable; the asymptotic coverage is 99.4%, and DKA over-covers and is conservative. This result for DKA tests is similar to Corollary 1 of MacKinnon et al. (2021). For non-small bandwidths the limiting random variables are non-standard. We used simulation methods to compute these probabilities. We approximated the Wiener processes using scaled partial sums of 1,000 i.i.d. $N(0,1)$ random increments and used 50,000 replications to simulate the percentiles.

Table 2.4.1: Asymptotic Critical Values and Coverage Probabilities (%)

            97.5% Asymptotic Critical Values           Coverage, N(0,1) & t̂∞,iid_BCCHS Critical Values
   b    t∞,iid_CHS  t̂∞,iid_BCCHS  t∞,iid_BCCHS  t∞,iid_DKA   CHS,N(0,1)  BCCHS,N(0,1)  BCCHS,t̂  DKA,N(0,1)  DKA,t̂
  0.00     1.960        1.960         1.960        1.386        95.0         95.0        95.0       99.4       99.4
  0.08     2.191        1.972         2.104        1.411        92.5         93.5        93.7       99.4       99.4
  0.12     2.298        1.991         2.162        1.416        91.2         92.8        93.2       99.3       99.4
  0.16     2.421        2.006         2.230        1.425        89.8         92.2        92.7       99.2       99.3
  0.20     2.546        2.019         2.296        1.438        88.6         91.5        92.3       99.1       99.3
  0.40     3.181        2.070         2.571        1.470        82.2         89.0        90.5       98.9       99.3
  0.80     4.300        2.100         2.764        1.497        71.2         87.3        89.2       98.8       99.2
  1.00     4.791        2.099         2.766        1.497        66.7         87.2        89.2       98.8       99.2

Note: Asymptotic critical values and coverage probabilities based on 50,000 replications using 1,000 increments for the Wiener process. The random variables $\hat{t}^{\infty,iid}_{\mathrm{BCCHS}}$ and $\hat{t}^{\infty,iid}_{\mathrm{DKA}}$ are the same. The nominal coverage probability is 95%.

Table 2.4.1 reports 97.5% critical values for $t^{\infty,iid}_{\mathrm{CHS}}$, $t^{\infty,iid}_{\mathrm{BCCHS}}$, and $t^{\infty,iid}_{\mathrm{DKA}}$ for a range of values of $b$ that will be used in our finite sample simulations. The critical values of $t^{\infty,iid}_{\mathrm{CHS}}$ and $t^{\infty,iid}_{\mathrm{BCCHS}}$ equal 1.96 when $b = 0$ and increase as $b$ increases. This suggests that CHS and BCCHS tests will under-cover when the data is i.i.d. and bandwidths are not small. In contrast, the critical values of $t^{\infty,iid}_{\mathrm{DKA}}$ are always smaller than 1.96 and remain smaller as $b$ increases. Thus, DKA tests over-cover regardless of the bandwidth. Table 2.4.1 also reports asymptotic coverage probabilities using the $N(0,1)$ critical value. We see that as $b$ goes from 0 to 1.0, coverage decreases from 95% to 66.7% for CHS, 95% to 87.2% for BCCHS, and is always close to 99% for DKA. These asymptotic calculations predict that CHS and BCCHS will over-reject (be liberal) when data is i.i.d. and non-small bandwidths are used. DKA is predicted to be conservative regardless of bandwidth. The table also reports some results for a random variable, $\hat{t}^{\infty,iid}_{\mathrm{BCCHS}}$, that is discussed in the next section.
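The entries of Table 2.4.1 can be reproduced along the following lines; this sketch simulates the 97.5% critical value of $t^{\infty,iid}_{\mathrm{CHS}}$ at $b = 0.4$ with fewer replications than the 50,000 used for the table, so the output is only approximately the tabulated 3.181:

```python
import numpy as np

rng = np.random.default_rng(4)
b, n, reps = 0.4, 500, 10_000
m = int(b * n)
W = np.cumsum(rng.normal(size=(reps, n)), axis=1) / np.sqrt(n)
r = np.arange(1, n + 1) / n
Wt = W - r * W[:, [-1]]                                   # Brownian bridges
P = (2 / b) * np.mean(Wt**2, axis=1) \
  - (2 / b) * np.sum(Wt[:, : n - m] * Wt[:, m:], axis=1) / n
t = W[:, -1] / np.sqrt(P)                                 # draws of t_CHS^{inf,iid}
print(np.quantile(np.abs(t), 0.95))                       # roughly 3.18
```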
2.4.4 Simulated Fixed-𝑏 Critical Values

As we have noted, the fixed-$b$ limits of the test statistics given by (2.18), (2.19) and (2.20), (2.21) are not pivotal due to the nuisance parameters $\Lambda_a$ and $\Lambda_g$. A feasible method for obtaining asymptotic critical values is to use simulation methods with unknown nuisance parameters replaced with estimators, i.e., to use a plug-in simulation method. To estimate $\Lambda_a$ and $\Lambda_g$ we use the estimators
$$\widehat{\Lambda_a\Lambda_a'} := \frac{1}{NT^2}\sum_{i=1}^{N}\Bigg(\sum_{t=1}^{T}\hat{v}_{it}\Bigg)\Bigg(\sum_{s=1}^{T}\hat{v}_{is}'\Bigg),$$
$$\widehat{\Lambda_g\Lambda_g'} := \Big(1 - b_{\mathrm{dk}} + \tfrac{1}{3}b_{\mathrm{dk}}^2\Big)^{-1}\frac{1}{N^2T}\sum_{t=1}^{T}\sum_{s=1}^{T}k\bigg(\frac{|t-s|}{M_{\mathrm{dk}}}\bigg)\Bigg(\sum_{i=1}^{N}\hat{v}_{it}\Bigg)\Bigg(\sum_{j=1}^{N}\hat{v}_{js}'\Bigg),$$
where $b_{\mathrm{dk}} = M_{\mathrm{dk}}/T$ and $M_{\mathrm{dk}}$ is the truncation parameter for the Driscoll-Kraay variance estimator.³

³ Note that, in principle, $b_{\mathrm{dk}}$ can be different from the $b$ used for the CHS variance estimator. For simulating asymptotic critical values we used the data-dependent rule of Andrews (1991) to obtain $b_{\mathrm{dk}}$.

The consistency of $\widehat{\Lambda_a\Lambda_a'}$ is given by (2A.20) in the proof of Theorem 2.3:
$$\widehat{\Lambda_a\Lambda_a'} = \Lambda_a\Lambda_a' + o_p(1), \tag{2.23}$$
and by (2A.10) in the proof of Theorem 2.2, we have
$$\widehat{\Lambda_g\Lambda_g'} \Rightarrow \Lambda_g\frac{P\big(b_{\mathrm{dk}}, \widetilde{W}(r)\big)}{1 - b_{\mathrm{dk}} + \tfrac{1}{3}b_{\mathrm{dk}}^2}\Lambda_g'. \tag{2.24}$$
Therefore, $\widehat{\Lambda_a\Lambda_a'}$ is a consistent estimator for $\Lambda_a\Lambda_a'$, and $\widehat{\Lambda_g\Lambda_g'}$ is a bias-corrected estimator of $\Lambda_g\Lambda_g'$ with the mean of the limit equal to $\Lambda_g\Lambda_g'$; the limit converges to $\Lambda_g\Lambda_g'$ as $b_{\mathrm{dk}} \to 0$. The matrices $\hat{\Lambda}_a$ and $\hat{\Lambda}_g$ are matrix square roots of $\widehat{\Lambda_a\Lambda_a'}$ and $\widehat{\Lambda_g\Lambda_g'}$ respectively, such that $\hat{\Lambda}_a\hat{\Lambda}_a' = \widehat{\Lambda_a\Lambda_a'}$ and $\hat{\Lambda}_g\hat{\Lambda}_g' = \widehat{\Lambda_g\Lambda_g'}$.

We propose the following plug-in method for simulating the asymptotic critical values of the fixed-$b$ limits. Details are given for a $t$-test with the modifications needed for a Wald test being obvious.

1. For a given data set with sample sizes $N$ and $T$, calculate $\hat{Q}$, $\hat{\Lambda}_a$ and $\hat{\Lambda}_g$. Let $b = M/T$ where $M$ is the bandwidth used for $\hat{\Omega}_{\mathrm{CHS}}$. Let $c = N/T$.

2. Taking $\hat{Q}$, $\hat{\Lambda}_a$, $\hat{\Lambda}_g$, $b$, $c$, and $R$ as given, use Monte Carlo methods to simulate critical values for the distributions
$$\hat{t}_{\mathrm{CHS}} = \frac{R\hat{Q}^{-1}\hat{b}_k(c)}{\sqrt{R\hat{Q}^{-1}\hat{v}_k(b, c)\hat{Q}^{-1}R'}}, \tag{2.25}$$
$$\hat{t}_{\mathrm{BCCHS}} = \hat{t}_{\mathrm{DKA}} = h(b)^{1/2}\hat{t}_{\mathrm{CHS}}, \tag{2.26}$$
where $\hat{b}_k(c) = \hat{\Lambda}_a z_k + \sqrt{c}\,\hat{\Lambda}_g W_k(1)$ and $\hat{v}_k(b, c) = h(b)\hat{\Lambda}_a\hat{\Lambda}_a' + c\hat{\Lambda}_g P\big(b, \widetilde{W}_k(r)\big)\hat{\Lambda}_g'$.

3. Typically the process $W_k(r)$ is approximated using scaled partial sums of a large number of i.i.d. $N(0, I_k)$ realizations (increments) for each replication of the Monte Carlo simulation.
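A sketch of steps 1-3 for the scalar-regressor case ($k = q = 1$, in which $\hat{Q}$ cancels from the ratio); the function name is illustrative, and the defaults of 1,000 replications and 500 increments follow the simulation settings described in Section 2.5:

```python
import numpy as np

def plugin_fixed_b_cv(lam_a, lam_g, b, c, reps=1000, n=500, seed=None):
    # 97.5% critical value of t_hat_CHS for k = q = 1.
    rng = np.random.default_rng(seed)
    m = int(b * n)
    r = np.arange(1, n + 1) / n
    stats = np.empty(reps)
    for j in range(reps):
        W = np.cumsum(rng.normal(size=n)) / np.sqrt(n)
        Wt = W - r * W[-1]
        P = (2 / b) * np.mean(Wt**2) - (2 / b) * np.sum(Wt[: n - m] * Wt[m:]) / n
        B = lam_a * rng.normal() + np.sqrt(c) * lam_g * W[-1]   # b_hat_k(c)
        V = (1 - b + b**2 / 3) * lam_a**2 + c * lam_g**2 * P    # v_hat_k(b, c)
        stats[j] = B / np.sqrt(V)
    return np.quantile(np.abs(stats), 0.95)

# Critical values for t_hat_BCCHS and t_hat_DKA scale by h(b)^{1/2}, as in (2.26):
# cv_bcchs = (1 - b + b**2 / 3)**0.5 * plugin_fixed_b_cv(lam_a, lam_g, b, c)
```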
In contrast, the limit of 𝑡DKA in Theorem 2.4 is approximately 𝑁 (0, 1 2) when 𝑏 is small whereas (cid:98) 𝑡DKA is simulating from a 𝑁 (0, 1) random variable. Therefore, simulated critical values for 𝑡DKA do not adapt to i.i.d. data when 𝑏 is small and 𝑡DKA over-covers and is conservative. When the plug-in critical values are used, we can make theoretical predictions for coverage probabilities in the i.i.d. case for bandwidths that are not small (𝑏 > 0) by computing coverage probabilities of the limiting random variables 𝑡∞,𝑖𝑖𝑑 DKA using critical values from the BCCHS and 𝑡∞,𝑖𝑖𝑑 asymptotic random variable (cid:98) 𝑡∞,𝑖𝑖𝑑 BCCHS (same as (cid:98) 𝑡∞,𝑖𝑖𝑑 DKA )5. Results are given in Table 2.4.1 in the (cid:98) 𝑡∞,𝑖𝑖𝑑 BCCHS columns. As shown in the table, the critical values of (cid:98) 𝑡∞,𝑖𝑖𝑑 BCCHS increase with 𝑏 but slowly. This helps reduce the under-rejection problems of BCCHS but does not remove them as we see in the coverage probability column for BCCHS that uses critical values from (cid:98) critical values from (cid:98) not vary much across 𝑏 because the critical values of 𝑡∞,𝑖𝑖𝑑 𝑡∞,𝑖𝑖𝑑 BCCHS. When DKA uses 𝑡∞,𝑖𝑖𝑑 BCCHS, coverage probabilities are similar to the 𝑁 (0, 1), and the coverages do 𝑡∞,𝑖𝑖𝑑 BCCHS roughly move together as 𝑏 DKA and (cid:98) increases. (cid:16) 𝑏2(cid:17) 1 − 𝑏 + 1 4Obviously 3 5Coverage probabilities are the same for 𝑡∞,𝑖𝑖𝑑 → 1 as 𝑏 → 0, and recall that 𝑝 lim𝑏→0 𝑃 CHS and 𝑡∞,𝑖𝑖𝑑 = 𝐼𝑞. BCCHS using critical values from (cid:98) 𝑏, (cid:101)𝑊𝑞 (𝑟) (cid:17) (cid:16) 𝑡∞,𝑖𝑖𝑑 CHS and (cid:98) 𝑡∞,𝑖𝑖𝑑 BCCHS given the common scaling factor ℎ(𝑏)1/2. 22 The asymptotic calculations in Table 2.4.1 predict that CHS and BCCHS will tend to under- cover (liberal) when the data is i.i.d. with the coverage approaching the nominal level for small bandwidths. DKA is predicted to have over-coverage (conservative) when the data is i.i.d. regardless of the bandwidth. 2.5 Monte Carlo Simulations To illustrate the finite sample performance of the various variance estimators and corresponding test statistics, we present a Monte Carlo simulation study with 10,000 replications in all cases. We focus on a simple linear panel model: 𝑦𝑖𝑡 = 𝛽0 + 𝛽1𝑥𝑖𝑡 + 𝑢𝑖𝑡, (2.27) where the true parameters are (𝛽0, 𝛽1) = (1, 1). To allow direct comparisons with Table 1 of Chiang et al. (2022), we consider a data generating process (DGP) that is linear in the components: DGP(1) : 𝑥𝑖𝑡 = 𝜔𝛼𝛼𝑥 𝑖 + 𝜔𝛾𝛾𝑥 𝑡 + 𝜔𝜀𝜀𝑥 𝑖𝑡, 𝑢𝑖𝑡 = 𝜔𝛼𝛼𝑢 𝑖 + 𝜔𝛾𝛾𝑢 𝑡 + 𝜔𝜀𝜀𝑢 𝑖𝑡, 𝛾 ( 𝑗) 𝑡 = 𝜌𝛾𝛾 ( 𝑗) 𝑡−1 𝛾 ( 𝑗) + (cid:101) 𝑡 for 𝑗 = 𝑥, 𝑢, where the latent components {𝛼𝑥 𝑖 , 𝛼𝑢 AR(1) processes are i.i.d 𝑁 (0, 1 − 𝜌2 𝑖𝑡, 𝜀𝑢 𝑖 , 𝜀𝑥 𝑖𝑡 } are each i.i.d 𝑁 (0, 1), and the error terms (cid:101) 𝛾) for 𝑗 = 𝑥, 𝑢. The component weights (𝜔𝛼, 𝜔𝛾, 𝜔𝜀) are used for the 𝛾 ( 𝑗) 𝑡 to adjust the relative importance of those components. To further explore the role played by the component structure representation, we consider a second DGP where the latent components enter 𝑥𝑖𝑡 and 𝑢𝑖𝑡 in a non-linear way: DGP(2) : 𝑥𝑖𝑡 = 𝑙𝑜𝑔( 𝑝 (𝑥) 𝑖𝑡 )), 𝑖𝑡 /(1 − 𝑝 (𝑥) 𝑖𝑡 /(1 − 𝑝 (𝑢) 𝑖 + 𝜔𝛾𝛾 ( 𝑗) 𝑖𝑡 )), 𝑡 + 𝜔𝜀𝜀( 𝑗) 𝑢𝑖𝑡 = 𝑙𝑜𝑔( 𝑝 (𝑢) 𝑖𝑡 = Φ(𝜔𝛼𝛼( 𝑗) 𝑝 ( 𝑗) 𝑖𝑡 ) for 𝑗 = 𝑥, 𝑢, where Φ(·) is the cumulative distribution function of a standard normal distribution and the latent components are generated in the same way as DGP(1). 
Sample coverage probabilities of 95% confidence intervals for $\hat{\beta}_1$, the OLS estimator of the slope parameter from (2.27), are provided for the following variance estimators: Eicker-Huber-White (EHW), cluster-by-$i$ (Ci), cluster-by-$t$ (Ct), DK, CHS, BCCHS, and DKA. For the variance estimators that require a bandwidth choice (DK, CHS, BCCHS, and DKA) we report results using the Andrews (1991) AR(1) plug-in data-dependent bandwidth, labeled $\hat{M}$, designed to minimize the approximate mean square error of a variance estimator (same formula for all four variance estimators). In the case of a scalar $x_{it}$, the formula is given by⁶
$$\hat{M} = 1.8171\Bigg(\frac{\hat{\rho}^2}{\big(1 - \hat{\rho}^2\big)^2}\Bigg)^{1/3} T^{1/3} + 1,$$
where $\hat{\rho}$ is the OLS estimator from the regression $\hat{\bar{v}}_t = \rho\hat{\bar{v}}_{t-1} + \eta_t$, where $\hat{\bar{v}}_t = \frac{1}{N}\sum_{i=1}^{N}\hat{v}_{it}$, $\hat{v}_{it} = x_{it}\hat{u}_{it}$, and $\hat{u}_{it}$ are the OLS residuals from (2.27). We label the ratio of $\hat{M}$ relative to the time sample size as $\hat{b} = \hat{M}/T$. In some cases $\hat{M}$ can exceed $T$, especially when the time dependence is strong relative to $T$. Therefore, we truncate $\hat{M}$ at $T$ whenever $\hat{M} > T$. We also report results for a grid of bandwidth choices.

⁶ Using equation (6.4) from Andrews (1991), we use 0 weight for the constant regressor and weights equal to the inverse of the squared innovation variances for other regressors. Because Chiang et al. (2024) parameterize the Bartlett kernel as $1 - \frac{m}{M+1}$ whereas we use $1 - \frac{m}{M}$, we add 1 to the data-dependent formula so that our Bartlett weights match those used by Chiang et al. (2024).

For tests based on CHS and DKA, we use both the standard normal critical values and the plug-in fixed-$b$ critical values. The simulated critical values use 1,000 replications with the Wiener process approximated by scaled partial sums of 500 independent increments drawn from a standard normal distribution. While these are relatively small numbers of replications and increments for an asymptotic critical value simulation, this was necessitated by computational considerations given the need to run an asymptotic critical value simulation for each replication of the finite sample simulation.

2.5.1 Main Simulation Results

We first focus on DGP(1) to make direct comparisons to the simulation results of Table 1 of Chiang et al. (2022), a working paper version of Chiang et al. (2024).⁷ Empirical null coverage probabilities of the confidence intervals for $\hat{\beta}_1$ are presented in Table 2.5.1. We start with both the cross-section and time sample sizes equal to 25. The weights on the latent components are $\omega_\alpha = 0.25$, $\omega_\gamma = 0.5$, $\omega_\varepsilon = 0.25$. Because of the relatively large weight on the common time effect, $\gamma_t$, the cross-section dependence dominates the time dependence. We can see that the confidence intervals using EHW, Ci, and Ct⁸ suffer from severe under-coverage problems as they fail to capture both cross-section and time dependence.

⁷ The reason we refer to the 2022 working paper version of Chiang et al. (2024) is that their results for small sample sizes $(N, T) = (25, 25)$ are not included in the published paper, Chiang et al. (2024).

⁸ Finite sample adjustments are applied to these three variance estimators. HC1 is used for the EHW estimator. The "cluster-by-$i$" and "cluster-by-$t$" estimators are also adjusted by the usual degrees-of-freedom factor.
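Before turning to the results, a sketch of the data-dependent bandwidth rule described above (the rounding to an integer is illustrative, not from the paper):

```python
import numpy as np

def andrews_bandwidth(x, uhat):
    # vbar_t = cross-section average of the scores vhat_it = x_it * uhat_it
    vbar = (x * uhat).mean(axis=0)
    rho = np.sum(vbar[1:] * vbar[:-1]) / np.sum(vbar[:-1] ** 2)  # AR(1) OLS slope
    T = x.shape[1]
    M = 1.8171 * (rho**2 / (1 - rho**2) ** 2) ** (1 / 3) * T ** (1 / 3) + 1
    return min(int(M), T)        # truncate at T, as described in the text
```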
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, 0.12, 0.16, 0.20, 0.40, 0.80, 1.00$).
EHW 37.4; Ci 38.7; Ct 83.6.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
83.6 84.0 83.3 82.0 80.6 73.9 62.6 57.9
84.1 90.3 84.4 89.7 83.7 90.1 82.3 90.5 80.8 90.3 74.4 91.2 63.0 91.1 58.4 90.9
86.2 86.0 85.8 85.3 84.8 82.2 80.9 80.8
88.1 87.9 87.9 87.5 87.3 85.4 84.0 84.0
88.3 88.1 87.8 88.2 88.1 88.5 87.9 88.2
0 0 0 0 0 0 0 0
Note: $\widehat{M}$ ranged from 1 to 21, with an average of 2.6 and a median of 2.

With the time effect, $\gamma_t$, being mildly persistent ($\rho_\gamma = 0.425$), the DK and CHS confidence intervals using the normal approximation under-cover with small bandwidths, with empirical coverage probabilities mostly below 0.85. The under-coverage problem becomes more severe as $M$ increases because of the well-known downward bias in kernel variance estimators that reflects the need to estimate $\beta_0$ and $\beta_1$. Coverages of DK and CHS using $\widehat{M}$ are similar to the smaller bandwidth cases, e.g. $M = 2$ or 3, which makes sense given that the average $\widehat{M}$ across replications is 2.6 (about 0.1 in terms of $\widehat{b}$). However, as the note to the table indicates, large values of $\widehat{M}$ can occur, in which case $\widehat{b}$ is not close to zero. Because they are bias corrected, the BCCHS and DKA variance estimators provide coverage that is less sensitive to the bandwidth. This is particularly true for DKA. If the plug-in fixed-$b$ critical values are used, coverages are closest to 0.95 and very stable across bandwidths, with DKA having the best coverage. Because the CHS variance estimator is not guaranteed to be positive definite, we report the number of times that CHS/BCCHS estimates are negative out of the 10,000 replications. In Table 2.5.1 there were no cases where CHS/BCCHS estimates are negative.

⁸Finite sample adjustments are applied to these three variance estimators. HC1 is used for the EHW estimator. The cluster-by-$i$ and cluster-by-$t$ estimators are also adjusted by the usual degrees-of-freedom factor.

Table 2.5.2: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.25$; POLS.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 39.9; Ci 40.7; Ct 87.1.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
85.3 86.1 84.9 83.5 82.2 75.6 64.9 60.7
85.9 86.6 85.5 84.1 82.8 76.2 65.5 61.3
87.8 88.1 87.6 87.1 86.4 84.1 82.5 82.6
89.6 90.0 89.5 89.2 88.6 86.5 85.4 85.4
89.1 89.5 89.5 89.6 89.7 89.7 89.4 89.2
91.1 91.3 91.2 91.5 91.4 91.6 91.5 91.5
0 0 0 0 0 0 0 0
Note: $\widehat{M}$ ranged from 1 to 12, with an average of 2.5 and a median of 2.

Tables 2.5.2 - 2.5.5 give results for DGP(2), where the latent components enter in a non-linear way. Tables 2.5.2 - 2.5.4 have both sample sizes equal to 25 with weights across latent components the same as in DGP(1) ($\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$). Table 2.5.2 has mild persistence in $\gamma_t$ ($\rho_\gamma = 0.25$), Table 2.5.3 has moderate persistence ($\rho_\gamma = 0.5$), and Table 2.5.4 has strong persistence ($\rho_\gamma = 0.75$). Tables 2.5.2-2.5.4 show patterns similar to Table 2.5.1: confidence intervals with variance estimators that are not robust to individual or time components under-cover, with the under-coverage problem increasing with $\rho_\gamma$. With $\rho_\gamma = 0.25$, CHS has reasonable coverage (about 0.86) with small bandwidths but under-covers severely with large bandwidths. BCCHS performs much better because of the bias correction, and fixed-$b$ critical values provide some additional modest improvements.
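For reference, here is a minimal sketch of the Andrews (1991) AR(1) plug-in rule described above (Python/NumPy; the function name is ours), for the scalar-regressor case and with the truncation at $T$ applied:

```python
import numpy as np

def andrews_ar1_bandwidth(x, uhat):
    """Plug-in bandwidth Mhat for the Bartlett kernel, scalar x_it case:
    regress vbar_t on vbar_{t-1} (no intercept), where vbar_t is the
    cross-sectional average of the scores x_it * uhat_it, then
    Mhat = 1.8171 * (rho^2 / (1 - rho^2)^2)^(1/3) * T^(1/3) + 1."""
    vbar = (x * uhat).mean(axis=0)    # length-T series of score averages
    rho = (vbar[1:] @ vbar[:-1]) / (vbar[:-1] @ vbar[:-1])
    T = vbar.shape[0]
    M = 1.8171 * (rho**2 / (1.0 - rho**2) ** 2) ** (1.0 / 3.0) * T ** (1.0 / 3.0) + 1.0
    return min(M, T)                  # truncate at T whenever Mhat > T
```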
DKA has better coverage, especially when fixed-$b$ critical values are used with large bandwidths. As $\rho_\gamma$ increases, all approaches have increasing under-coverage problems, with DKA continuing to perform best. Table 2.5.5 has the same configuration as Table 2.5.4 but with both sample sizes increased to 50. Both BCCHS and DKA show some improvements in coverage. This illustrates the well-known trade-off between the sample size and the magnitude of persistence for the accuracy of asymptotic approximations with dependent data. Regarding bandwidth choice, the data-dependent bandwidth performs reasonably well for CHS, BCCHS, and DKA. Finally, the chances of CHS/BCCHS being negative are very small but not zero, and the chances decrease as both $N$ and $T$ increase.

Table 2.5.3: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.5$; POLS.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 35.2; Ci 37.6; Ct 80.7.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
81.3 81.6 80.7 79.8 78.4 71.5 60.4 55.9
81.9 82.3 81.3 80.1 78.6 71.7 60.6 56.1
84.0 83.8 83.7 83.2 82.8 80.5 78.7 78.5
86.4 86.0 86.1 85.8 85.6 83.5 82.0 81.9
86.0 85.4 85.9 86.3 86.5 86.8 86.2 86.1
88.2 87.5 88.1 88.4 88.8 89.3 89.2 89.2
0 0 0 0 0 0 0 0
Note: $\widehat{M}$ ranged from 1 to 25, with an average of 2.8 and a median of 3.

To show that large values of $\widehat{M}$ are not unusual in DGP(2), we report in Figure 1 the frequency of $\widehat{b}$ among the 10,000 Monte Carlo replications used in Table 2.5.4. In this case, more than 21% of replications have $\widehat{b} \geq 0.2$. This explains why bias correction and fixed-$b$ critical values noticeably reduce the under-coverage problem when $\widehat{M}$ is used.

Table 2.5.4: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.75$; POLS.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 28.9; Ci 35.1; Ct 66.7.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
71.4 70.9 71.4 71.0 70.0 63.8 53.2 48.8
72.2 72.3 72.5 71.8 70.6 64.2 53.6 49.1
76.0 74.3 75.5 75.8 75.3 72.8 71.5 71.5
79.0 76.9 78.4 79.0 78.7 77.1 76.3 76.3
79.2 75.7 78.0 78.9 79.7 79.4 79.5 79.3
82.0 78.6 80.9 82.0 82.5 83.4 83.8 83.8
0 0 0 0 0 1 13 13
Note: $\widehat{M}$ ranged from 1 to 25, with an average of 3.9 and a median of 4.

To show how the relative values of $N$ and $T$ can matter in practice, we provide additional results for the same DGP as Tables 2.5.4 and 2.5.5 for $N$ and $T$ over a range of values⁹. The results are given in Table 2.5.6. There are two main takeaways from the table: i) bias correction, with and without fixed-$b$ critical values, always improves coverage probabilities relative to the original CHS test; ii) bias correction alone does slightly better than bias correction with fixed-$b$ critical values when both $N$ and $T$ are extremely small ($N = T = 10$).

⁹We thank a referee for this suggestion.

Table 2.5.5: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 50$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.75$; POLS.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 17.3; Ci 25.9; Ct 66.0.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
78.8 78.5 78.4 77.5 76.2 69.7 58.5 54.4
79.4 79.1 79.2 78.2 76.9 70.0 59.0 54.7
Note: $\widehat{M}$ ranged from 1 to 26, with an average of 5.4 and a median of 5.
81.5 80.8 81.5 81.7 81.0 79.2 77.7 77.5
82.5 81.8 82.8 83.0 82.5 81.0 79.4 79.2
84.2 82.7 83.9 85.2 86.0 86.2 86.0 86.0
85.2 83.9 85.5 86.5 87.1 87.8 87.8 87.7
0 0 0 0 0 0 0 0

The middle row gives results for both $N$ and $T$ equal to 10. Here the under-coverage problem is substantial for the original CHS test. Bias correction helps, especially if the DKA estimator is used. Interestingly, fixed-$b$ critical values help relative to CHS but less than bias correction alone. This is not surprising because the simulated critical values are functions of variance estimators based on small sample sizes. Going up the rows maintains $N = 10$ with $T$ increasing to 160. As expected, coverages approach 0.95 as $T$ increases. The top four rows have $T = 160$ with $N$ increasing from 10 to 80. With $T$ fixed and $N$ increasing, the first three tests, which fail to capture within-time/cross-sectional dependence, have deteriorating coverage (under-coverage). In contrast, the DK and CHS tests perform well in those cases; bias correction with fixed-$b$ critical values continues to provide further improvements. Going down from the middle rows shows what happens as $N$ increases when $T$ is small ($T = 10$). Coverage of the original CHS test remains quite low as $N$ increases. Bias correction without fixed-$b$ critical values improves coverage rates, but coverage does not improve as $N$ increases. Bias correction with fixed-$b$ critical values performs best and improves as $N$ increases. The results for DKA are interesting. As $N$ increases, under-coverage becomes more severe when normal critical values are used, whereas with fixed-$b$ critical values coverage is best and stable across $N$. The bottom four rows hold $N$ fixed at 160 and show what happens as $T$ increases from 10 to 80. CHS and the bias-corrected versions show better coverage as $T$ increases. CHS and DKA with fixed-$b$ critical values perform best in these cases.

Figure 1: The frequency of $\widehat{b}$ for Table 2.5.4.

2.5.2 Simulation Results for the i.i.d. Case

In Theorems 2.1, 2.2, and 2.3, a non-degeneracy assumption on the components is imposed. A special case that violates this assumption is i.i.d. data in both the individual and time dimensions (random sampling). As we showed in Theorem 2.4, the fixed-$b$ limit of the test statistics is different in the i.i.d. case. By setting $\omega_\alpha = 0$, $\omega_\gamma = 0$ in DGP(1), we present coverage probabilities for the i.i.d. case in Table 2.5.7. There are some important differences between the coverage probabilities in Table 2.5.7 and those in the previous tables. First, notice that the coverages using EHW, Ci, and Ct are close to the nominal level, as one would expect. The patterns of coverage probabilities for CHS, BCCHS, and DKA are as predicted by Theorem 2.4 and the asymptotic calculations given in Table 2.4.1. Coverages of CHS are close to 0.89 for small bandwidths, and under-coverage problems occur with larger bandwidths. BCCHS is less prone to under-coverage as the bandwidth increases, and plug-in fixed-$b$ critical values help to reduce, but do not eliminate, the under-coverage problem.

Table 2.5.6: Sample Coverage Probabilities (%), Nominal Coverage 95%
$M = \widehat{M}$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.75$; POLS.
$N$: 80 40 20 10 10 10 10 10 20 40 80 160 160 160 160
$T$: 160 160 160 160 80 40 20 10 10 10 10 10 20 40 80
EHW / Ci: 28.8 12.8, 39.7 19.1, 48.9 25.6, 54.5 33.4, 47.0 35.0, 42.7 37.3, 45.2 43.6, 51.9 53.7, 45.3 43.2, 35.6 32.3, 27.6 24.5, 20.8 17.9, 16.8 13.3, 14.9 10.3, 18.5 10.7
Ct: 67.7 66.8 65.7 64.7 64.8 65.7 67.2 68.2 68.0 67.6 68.1 68.4 66.9 66.5 66.9
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0 | Med $\widehat{b}$:
86.4 85.2 84.9 83.2 79.3 74.6 69.5 63.1 63.5 63.3 63.8 64.0 70.3 76.9 81.5
Med $\widehat{b}$: 0.08 0.06 0.05 0.05 0.08 0.13 0.15 0.30 0.30 0.30 0.30 0.30 0.15 0.15 0.09
88.1 87.8 88.2 86.6 82.3 77.2 73.1 68.9 71.5 71.0 71.6 71.8 75.0 79.8 83.9
90.3 89.9 90.4 90.1 87.3 83.7 81.1 77.2 79.4 79.3 79.2 79.5 80.7 83.7 86.8
87.1 86.7 87.1 85.9 80.6 74.8 69.1 62.8 64.4 64.4 64.7 64.5 70.7 77.2 81.9
88.7 88.7 89.9 89.6 86.3 82.5 80.1 79.7 78.0 75.4 73.9 73.0 75.6 80.4 84.4
90.0 88.8 88.7 87.3 83.5 78.7 74.5 68.0 73.1 75.4 77.0 78.2 80.1 83.3 86.3
Count: 0 0 0 0 0 1 15 172 26 3 0 0 0 0 0

In contrast, DKA over-covers regardless of the bandwidth and whether or not fixed-$b$ critical values are used. As $N$ and $T$ get larger, we would expect the coverages of CHS/BCCHS to approach 95% in the i.i.d. case (assuming a small bandwidth), but not for DKA, where over-coverage would persist.

Table 2.5.7: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$, i.i.d.: DGP(1) with $\omega_\alpha = \omega_\gamma = 0$ and $\omega_\varepsilon = 1$; POLS.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 94.7; Ci 93.1; Ct 93.2.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
91.0 92.1 91.0 89.6 88.4 82.0 70.9 66.4
88.8 90.2 88.8 87.5 86.1 80.8 70.3 66.0
90.7 91.3 90.7 90.0 89.3 87.3 86.1 85.9
98.9 99.0 99.0 98.9 98.9 98.7 98.5 98.4
99.0 99.1 99.1 99.0 99.1 98.9 98.9 98.9
91.1 91.9 91.0 90.7 90.2 88.8 87.9 87.7
73 35 63 117 174 425 639 641
Note: $\widehat{M}$ ranged from 1 to 8, with an average of 2.5 and a median of 2.

To gauge the extent to which the under-coverage of CHS/BCCHS and the over-coverage of DKA is caused by the mis-match between the plug-in fixed-$b$ critical values and the i.i.d. fixed-$b$ limits, we report in Table 2.5.8 simulated coverage probabilities using fixed-$b$ critical values based on the limits in Theorem 2.4. We see that, regardless of the bandwidth, coverages are much closer to 95%. Therefore, a significant portion of the size distortions in Table 2.5.7 is due to the mis-match.

Table 2.5.8: Sample Coverage Probabilities (%)
CHS and DKA rows over $M$ (with $\widehat{b}$), as extracted: 92.1 CHS DKA 94.1 3 0.12 92.0 94.0 2 0.08 92.3 94.0 20 0.80 92.8 94.0 10 5 0.20 0.40 92.0 92.5 94.0 93.9 25 4 1.00 0.16 92.9 92.0 94.0 94.1
Nominal coverage probability: 95%. Sample size: $N = T = 25$. DGP: i.i.d.: DGP(1) with $\omega_\alpha = \omega_\gamma = 0$ and $\omega_\varepsilon = 1$ (same as Table 2.5.7). Slope estimator: POLS. Fixed-$b$ critical values from Theorem 2.4 are used for confidence intervals constructed with both the CHS and DKA variance estimators.

These results raise a practical question: how does a researcher know which case is being dealt with? In panel data settings, random sampling is almost never a reasonable assumption in the time dimension, and clustering dependence often exists due to unobserved heterogeneity in both the individual and time dimensions.
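To fix ideas on how the fixed-$b$ critical values used above are simulated, the functional $P(b, \widetilde{W})$ that appears in the fixed-$b$ limits (see the proofs in Appendix 2A) can be drawn exactly as described in Section 2.5: approximate the Wiener process by scaled partial sums of i.i.d. standard normal increments. Here is a minimal scalar-case sketch (Python/NumPy; the function name is ours):

```python
import numpy as np

def P_fixed_b(b, n_steps=500, rng=None):
    """One draw of the scalar functional
    P(b, Wtilde) = (2/b) * int_0^1 Wtilde(r)^2 dr
                 - (2/b) * int_0^{1-b} Wtilde(r) * Wtilde(r + b) dr,
    with the Wiener process approximated by scaled partial sums of
    n_steps i.i.d. N(0,1) increments."""
    rng = np.random.default_rng() if rng is None else rng
    W = np.cumsum(rng.normal(size=n_steps)) / np.sqrt(n_steps)
    r = np.arange(1, n_steps + 1) / n_steps
    Wt = W - r * W[-1]              # Brownian bridge Wtilde(r) = W(r) - r * W(1)
    M = int(b * n_steps)
    dr = 1.0 / n_steps
    term1 = (2.0 / b) * np.sum(Wt**2) * dr
    term2 = (2.0 / b) * np.sum(Wt[: n_steps - M] * Wt[M:]) * dr
    return term1 - term2
```

Quantiles of the limiting $t$-ratios built from repeated draws of this functional, combined with the plug-in estimates, then serve as the simulated fixed-$b$ critical values.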
Some concern may arise because it is common in practice for empirical researchers to include fixed-effect dummy variables, also known as the two-way fixed-effects estimator (TWFE, hereafter), to remove at least some of the dependence generated by individual and time unobserved heterogeneity; in some cases, such as DGP(1), all of the dependence structure would be removed and we are back to the i.i.d. case. However, it is important to note that DGP(1) is a very special case. In general, fixed-effect approaches do not guarantee that the resulting scores are free from clustering dependence. Indeed, other data generating mechanisms exist where TWFE will not completely remove the dependence caused by individual and time components in the score, as shown in Chiang et al. (2024). We discuss an example and its implications in the next subsection.

One should also note that if we compare absolute size distortions, DKA-based tests are not necessarily more concerning than BCCHS-based tests: while in opposite directions, the magnitudes of the size distortions are mostly comparable, and DKA-based tests do a better job as $M$ increases. Moreover, given that DKA-based tests tend to be more conservative, a rejection using a DKA-based test delivers strong evidence against the null hypothesis. For a researcher who wants to avoid spurious null rejections (relative to the desired significance level), DKA-based tests are preferred. On the other hand, if a rejection is not obtained with BCCHS tests, this is strong evidence that the null cannot be rejected. Suppose a rejection is obtained with BCCHS but not with DKA. In this case a researcher has to balance potential over-rejections from BCCHS against the potentially lower power of DKA, which depends on the extent to which the researcher thinks two-way clustering is present in the model.

2.5.3 Additional Results for TWFE

A popular alternative to the pooled OLS estimator is the additive TWFE estimator, where individual and time period dummies are included in (2.1). It is well known that individual and time dummies will project out any latent individual or time components that enter $x_{it}$ and $u_{it}$ only linearly and individually (as would be the case in DGP(1)), leaving only variation from the idiosyncratic component $e_{it}$. In this case, we would expect the sample coverages of CHS and DKA to be similar to the i.i.d. case in Table 2.5.7. However, under the general component structure representation, the TWFE transformation may not fully remove the individual and time components if they enter in a nonlinear manner, and we would expect results for CHS and DKA similar to Tables 2.5.1-2.5.6.

As an illustration, in Table 2.5.9 we report results for the TWFE estimator using the same configuration as Table 2.5.4 for DGP(2). The sample coverage probabilities are different from Table 2.5.4 but are very similar to the results in Table 2.5.7 for the i.i.d. case. Therefore, for DGP(2), the TWFE dummy variables remove the bulk of the variation from the individual and time components. In contrast, Chiang et al. (2024) provide an example where the TWFE dummy variables do not remove the component structure. Consider a third DGP given by
$$\mathrm{DGP(3)}: \quad x_{it} = \alpha_{1i}\gamma_{2t} + \alpha_{2i}\gamma_{1t} + \varepsilon^x_{it}, \qquad u_{it} = \alpha_{1i}\gamma_{3t} + \alpha_{3i}\gamma_{1t} + \varepsilon^u_{it},$$
where the latent components $\{\alpha_{1i}, \alpha_{2i}, \alpha_{3i}, \gamma_{1t}, \gamma_{2t}, \gamma_{3t}, \varepsilon^x_{it}, \varepsilon^u_{it}\}$ are $N(0,1)$ random variables that are independent across $i$ and $t$ and independent of each other.
As Chiang et al. (2024) argue, there is no endogeneity between $x_{it}$ and $u_{it}$, and it is not difficult to show that $E(x_{it}|\alpha_i) = E(x_{it}|\gamma_t) = E(u_{it}|\alpha_i) = E(u_{it}|\gamma_t) = 0$. While $x_{it}$ and $u_{it}$ do not have the component structure, the score, $x_{it}u_{it}$, does, because $E(x_{it}u_{it}|\alpha_i) = \alpha_{2i}\alpha_{3i}$ and $E(x_{it}u_{it}|\gamma_t) = \gamma_{2t}\gamma_{3t}$. Therefore, the TWFE dummy variables will not remove the component structure from $x_{it}u_{it}$.

Table 2.5.9: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$; DGP(2): $\omega_\alpha = \omega_\varepsilon = 0.25$, $\omega_\gamma = 0.5$, $\rho_\gamma = 0.75$; TWFE.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 94.2; Ci 93.3; Ct 93.0.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
91.0 92.2 90.9 89.5 88.2 81.5 70.3 66.1
90.2 91.0 89.7 88.4 87.4 81.7 71.3 67.2
91.5 92.1 91.1 90.5 89.5 88.3 86.7 86.3
98.9 99.0 99.0 98.9 98.9 98.5 98.2 98.2
91.6 92.3 91.7 91.1 90.4 89.5 88.7 88.4
99.1 99.1 99.0 99.2 99.1 99.1 98.8 98.9
57 26 53 88 137 331 517 523
Note: $\widehat{M}$ ranged from 1 to 8, with an average of 2.5 and a median of 2.

Table 2.5.10 gives results for DGP(3) for TWFE with $N = T = 25$. We see that tests based on variance estimators that are not robust to two-way cluster dependence have substantial under-coverage problems. The original CHS does a better job but tends to under-cover with large bandwidths. BCCHS works better, and plug-in fixed-$b$ critical values provide additional improvements in coverage. DKA works quite well, with small improvements from using plug-in fixed-$b$ critical values, and coverage probabilities are close to 95%.

Table 2.5.10: Sample Coverage Probabilities (%), Nominal Coverage 95%
$N = T = 25$; DGP(3); TWFE.
Rows: $M = \widehat{M}$ (no truncation), then $M = 2, 3, 4, 5, 10, 20, 25$ ($\widehat{b} = 0.08, \ldots, 1.00$).
EHW 61.5; Ci 80.1; Ct 80.0.
Columns DK, CHS | BC- CHS, DKA | fixed-b c.v. CHS, DKA | Count CHS<0:
77.4 78.6 77.1 75.8 74.5 67.3 55.9 51.7
89.2 89.8 88.8 87.8 86.9 81.8 71.4 66.2
93.9 93.9 93.7 93.6 93.4 92.9 92.3 92.4
90.7 90.9 90.8 90.7 90.4 89.3 88.7 88.6
91.2 91.1 91.4 91.4 91.2 91.1 90.8 90.8
94.2 94.2 94.0 94.3 94.3 94.2 94.2 94.2
0 0 0 0 0 0 0 0
Note: $\widehat{M}$ ranged from 1 to 25, with an average of 2.6 and a median of 2.

The results for BCCHS and DKA in Table 2.5.10 suggest that the fixed-$b$ limits given by (2.20) and (2.21) for POLS can continue to hold for tests based on the TWFE estimator of $\beta$. Let $\ddot{x}_{it}$ and $\ddot{u}_{it}$ denote the individual and time dummy demeaned versions of $x_{it}$ and $u_{it}$, respectively (see the sketch below). Suppose that the demeaned score, $\ddot{x}_{it}\ddot{u}_{it}$, has the individual and time component structure. Because Chiang et al. (2024) show that $\ddot{x}_{it}\ddot{u}_{it} = \widetilde{x}_{it}\widetilde{u}_{it} + o_p(1)$, where $\widetilde{x}_{it} = x_{it} - \mathrm{E}[x_{it}|\alpha_i] - \mathrm{E}[x_{it}|\gamma_t] + \mathrm{E}[x_{it}]$ and $\widetilde{u}_{it}$ is similarly defined, equivalent versions of Theorem 2.2, (2.20), (2.21), and Theorem 2.3 are easily established for the TWFE estimator provided the stronger exogeneity assumption, $\mathrm{E}[\widetilde{x}_{it} u_{it}] = 0$, holds¹⁰.

¹⁰Strict exogeneity over time, $E(u_{it}|x_{i1}, x_{i2}, \ldots, x_{iT}) = 0$, is sufficient for $E[\widetilde{x}_{it} u_{it}] = 0$ to hold.
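For completeness, the two-way within transformation underlying the TWFE results can be sketched as follows (Python/NumPy, balanced panel; the helper name is ours). Under DGP(3), applying it to $x_{it}$ and $u_{it}$ and then forming the score leaves the products $\alpha_{2i}\alpha_{3i}$ and $\gamma_{2t}\gamma_{3t}$ in the conditional means of the score, consistent with the discussion above.

```python
import numpy as np

def twfe_demean(z):
    """Two-way within transformation for a balanced N x T array:
    z_it - mean over t - mean over i + grand mean, which projects out
    additive individual and time components."""
    return (z - z.mean(axis=1, keepdims=True)
              - z.mean(axis=0, keepdims=True)
              + z.mean())
```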
2.6 Empirical Application

We illustrate how the choice of variance estimator affects $t$-tests and confidence intervals using an empirical example from Thompson (2011). We test the predictive power of market concentration on the profitability of industries, where market concentration is measured by the Herfindahl-Hirschman Index (HHI, hereafter). This example features data where dependence exists in both the cross-section and time dimensions, with common shocks being correlated across time. Specifically, consider the following linear regression model of profitability measured by $\mathrm{ROA}_{m,t}$, the ratio of return on total assets for industry $m$ at time $t$:
$$\mathrm{ROA}_{m,t} = \beta_0 + \beta_1 \ln(\mathrm{HHI}_{m,t-1}) + \beta_2 \mathrm{PB}_{m,t-1} + \beta_3 \mathrm{DB}_{m,t-1} + \beta_4 \overline{\mathrm{ROA}}_{t-1} + u_{m,t},$$
where PB is the price-to-book ratio, DB is the dividend-to-book ratio, and $\overline{\mathrm{ROA}}$ is the market average ROA ratio. The data set used to estimate the model is composed of 234 industries in the US from 1972 to 2021. We obtain the annual firm-level data from Compustat and aggregate it to the industry level based on Standard Industry Classification (SIC) codes. The details of the data construction can be found in Section 6 and Appendix B of Thompson (2011).

Table 2.6.1: Industry Profitability, 1972-2021: POLS Estimates and t-statistics
Regressors: ln(HHI$_{m,t-1}$), Price/Book$_{m,t-1}$, DIV/Book$_{m,t-1}$, Market ROA$_{t-1}$, Intercept.
Estimates: 0.0097, -0.0001, 0.0167, 0.6129, -0.0564.
t-statistics, by column block as extracted: EHW Ci 3.93 -0.09 3.93 14.47 -2.76; 12.42 -0.15 6.89 32.31 -8.94; CHS BCCHS DKA DK 3.30 3.58 3.76 6.40, -0.05 -0.06 -0.07 -0.07, 1.74 1.79 2.04 1.89, 8.99 9.76 12.06 10.27, -2.35 -2.53 -2.67 -4.69; Ct 10.57 -0.13 3.81 12.05 -7.52.
Notes: $R^2 = 0.117$, $\widehat{M} = 5$.

Table 2.6.2: Industry Profitability, 1972-2021: POLS, 95% Confidence Intervals
Regressors: ln(HHI$_{m,t-1}$), Price/Book$_{m,t-1}$, DIV/Book$_{m,t-1}$, Market ROA$_{t-1}$, Intercept.
EHW: (0.0082, 0.0112) (-0.0017, 0.0014) (0.0119, 0.0214) (0.5757, 0.6500) (-0.0687, -0.0440)
CHS: (0.0046, 0.0147) (-0.0037, 0.0035) (-0.0006, 0.0340) (0.4959, 0.7299) (-0.0978, -0.0149)
BCCHS: (0.0044, 0.0150) (-0.0039, 0.0036) (-0.0015, 0.0349) (0.4898, 0.7360) (-0.0999, -0.0128)
DKA: (0.0039, 0.0154) (-0.0044, 0.0041) (-0.0022, 0.0355) (0.4792, 0.7466) (-0.1034, -0.0093)
CHS (fixed-b critical values): (0.0043, 0.0149) (-0.0038, 0.0037) (-0.0040, 0.0353) (0.4844, 0.7352) (-0.1000, -0.0124)
DKA (fixed-b critical values): (0.0039, 0.0153) (-0.0043, 0.0041) (-0.0047, 0.0360) (0.4734, 0.7459) (-0.1036, -0.0088)
Note: $\widehat{M} = 5$.

In Table 2.6.1, we present the POLS estimates for the five parameters and $t$-statistics (with the null $H_0: \beta_j = 0$ for each $j = 1, 2, \ldots, 5$) based on the various variance estimators. We use the data-dependent bandwidth, $\widehat{M}$, in all relevant cases. We can see that the $t$-statistics vary non-trivially across the different variance estimators. The estimated coefficient of $\ln(\mathrm{HHI}_{m,t-1})$ is significant at the 1% level based on two-sided $t$-tests using any of the standard errors under comparison, including the DKA standard error. As discussed in Section 2.5.3, a rejection using DKA is strong evidence of market concentration being powerful in predicting the profitability of industries. On the other hand, the estimated coefficient of DIV/Book is significant at the 5% significance level in a two-sided test when the EHW, cluster-by-industry, cluster-by-time, and DK variances are used, while it is only marginally significant when CHS is used and marginally insignificant when BCCHS and DKA are used.

In Table 2.6.2 we present 95% confidence intervals. For CHS/BCCHS and DKA we give confidence intervals using both normal and plug-in fixed-$b$ critical values. For the bias-corrected variance estimators (BCCHS and DKA) the differences in confidence intervals between normal and fixed-$b$ critical values are not large, consistent with our simulation results.
Table 2.6.3: Industry Profitability, 1972-2021: TWFE Estimates and t-statistics
Regressors: ln(HHI$_{m,t-1}$), Price/Book$_{m,t-1}$, DIV/Book$_{m,t-1}$.
Estimates: 0.0050, 0.0015, 0.0056.
t-statistics, by column block as extracted: EHW Ci 1.84 1.41 2.33 4.27 1.73 2.69; Ct 1.33 2.04 3.54; DK CHS BCCHS DKA 0.78 1.53 1.00, 0.95 1.98 1.11, 1.46 0.97 1.03, 1.55 1.03 1.10.
Notes: $R^2 = 0.27$, $\widehat{M} = 6$.

Table 2.6.4: Industry Profitability, 1972-2021: TWFE, 95% Confidence Intervals
Regressors: ln(HHI$_{m,t-1}$), Price/Book$_{m,t-1}$, DIV/Book$_{m,t-1}$.
EHW: (0.0027, 0.0736) (-0.0020, 0.0032) (0.0015, 0.0096)
CHS: (-0.0013, 0.0114) (-0.0013, 0.0043) (-0.0044, 0.0155)
BCCHS: (-0.0017, 0.0118) (-0.0015, 0.0045) (-0.0050, 0.0161)
DKA: (-0.0024, 0.0125) (-0.0022, 0.0052) (-0.0059, 0.0170)
CHS (fixed-b critical values): (-0.0019, 0.0126) (-0.0021, 0.0045) (-0.0048, 0.0166)
DKA (fixed-b critical values): (-0.0025, 0.0133) (-0.0029, 0.0052) (-0.0057, 0.0175)
Note: $\widehat{M} = 6$.

In Table 2.6.3, we include the results for the TWFE estimator to see how the inclusion of industry and time period dummies matters in practice. The presence of the dummies results in the intercept and $\overline{\mathrm{ROA}}_{t-1}$ being dropped from the regression. Overall, test statistics based on CHS, BCCHS, and DKA agree with each other in magnitude, and they are much smaller relative to EHW-based test statistics. As we saw in Table 2.5.7, when the scores are independent in both the cross-section and time dimensions, test statistics based on the non-two-way-robust standard errors tend to be smaller (higher coverage) on average, except for DKA-based tests. The fact that the non-two-way test statistics in Table 2.6.3 are larger than the CHS/BCCHS statistics suggests that TWFE does not fully remove the two-way dependence, so two-way cluster-robust standard errors are appropriate. The 95% confidence intervals for the TWFE case are presented in Table 2.6.4. Confidence intervals tend to be wider with fixed-$b$ critical values. This is expected given that fixed-$b$ critical values are larger in magnitude than standard normal critical values.

2.7 Conclusion

2.7.1 Summary

This paper investigates the fixed-$b$ asymptotic properties of the CHS variance estimator and tests. An important algebraic observation is that the CHS variance estimator can be expressed as a linear combination of the cluster variance estimator, the "HAC of averages" estimator, and the "average of HACs" estimator. Building upon this observation, we derive fixed-$b$ asymptotic results for the CHS variance estimator when both the sample sizes $N$ and $T$ tend to infinity. Our analysis reveals the presence of an asymptotic bias in the CHS variance estimator which depends on the ratio of the bandwidth parameter, $M$, to the time sample size, $T$. This bias is multiplicative and leads to a simple feasible bias-corrected version of the CHS variance estimator (BCCHS). We propose a second bias-corrected variance estimator, DKA, obtained by dropping the "average of HACs" term; the resulting estimator is guaranteed to be positive semi-definite. We show that the fixed-$b$ limiting distributions of tests based on CHS, BCCHS, and DKA are not asymptotically pivotal, and we propose a straightforward plug-in method for simulating fixed-$b$ asymptotic critical values. Overall, we propose four test statistics that build on the CHS test: BCCHS and DKA tests using chi-square/standard normal critical values, and BCCHS and DKA tests using plug-in fixed-$b$ critical values¹¹. Extensive simulation studies are reported that compare the finite sample performance of the proposed approaches with existing approaches in terms of finite sample null coverage probabilities.

¹¹CHS tests that use simulated fixed-$b$ critical values are exactly equivalent to BCCHS tests based on simulated fixed-$b$ critical values because the fixed-$b$ limits explicitly capture the bias in the CHS variance estimator.
The simple bias-correction approaches provide non-trivial improvements in coverage probabilities, and bias correction with plug-in fixed-$b$ critical values provides additional improvements except in the i.i.d. case and when both $N$ and $T$ are very small.

2.7.2 Empirical Recommendations

Our results clearly suggest that the bias-corrected variance estimators, BCCHS and DKA, provide more reliable inference in practice with or without plug-in fixed-$b$ critical values. While plug-in fixed-$b$ critical values involve some computational cost in practice, we can generally recommend that fixed-$b$ critical values be used in practice given that i) fixed-$b$ critical values improve finite sample coverage probabilities when large bandwidths are used, ii) data-dependent bandwidths can be large, and iii) coverage probabilities with or without fixed-$b$ critical values are similar when bandwidths are small. However, there are important exceptions. When both the cross-section and time sample sizes are very small, BCCHS and DKA based tests using plug-in fixed-$b$ critical values could yield slightly worse empirical null coverages than using chi-square/standard normal critical values because the plug-in estimators are noisy. Therefore, the choice between using fixed-$b$ or chi-square/standard normal critical values for BCCHS and DKA tests depends on the sample sizes in addition to any relevant computational costs.

The choice between tests based on BCCHS and DKA is nuanced. While DKA ensures positive definiteness and usually provides tests with better empirical null coverage probabilities, these benefits do not come without a cost. Although rare in panel settings, if the scores, $x_{it}u_{it}$, are i.i.d. over both the individual and time dimensions, the DKA estimator has a different fixed-$b$ limiting distribution, and tests based on the DKA estimator can be conservative. In contrast, while the BCCHS estimator also has a different fixed-$b$ limiting distribution in the i.i.d. case, it has correct asymptotic coverage probabilities when the bandwidth is small. However, if the bandwidth is not small, CHS and BCCHS tests under-cover in the i.i.d. case. Therefore, the practical choice between DKA and BCCHS depends on a researcher's assessment of the data, the model, and the priority of inference. If the data is thought to be independent in both dimensions, then one should not consider cluster-robust variance estimators in the first place. If the data is thought to have individual and serially correlated time cluster dependence, and the researcher places higher priority on controlling over-rejections while accepting a conservative test (with the cost of lower power) should there be no cluster dependence, the DKA estimator is preferred. BCCHS would be preferred if the additional under-coverage relative to DKA is viewed as a reasonable price for higher power should there be no cluster dependence.

2.7.3 Further Discussion

It is important to acknowledge some limitations of our analysis and to highlight areas for future research. We found that finite sample coverage probabilities of all confidence intervals exhibit under-coverage problems when the autocorrelation of the time effects becomes strong relative to the time sample size. In such cases, the potential improvements resulting from the fixed-$b$ adjustment are limited.
Part of this limitation arises because the test statistics are not asymptotically pivotal, necessitating plug-in simulation of critical values. The estimation uncertainty in the plug-in estimators can introduce sampling errors into the simulated critical values that can be acute when persistence is strong. Finding a variance estimator that results in a pivotal fixed-$b$ limit would help address this problem, although doing so appears to be challenging.

An empirically relevant question is whether the component structure is a good approximation when the component representation in Assumption 2.1 is not exact. Ideally, inferential theory should be studied under a DGP where the dependence is generated not only through individual and time components but also through the idiosyncratic component. Obtaining fixed-$b$ results for this generalization appears challenging. Some unreported simulation results point to some theoretical conjectures, but a formal analysis is beyond the scope of this paper and is left for future research.

A second empirically relevant case we do not address in this paper is the unbalanced panel data case. There are several challenges in establishing formal fixed-$b$ asymptotic results for unbalanced panels. Unbalanced panels have time sample sizes that are potentially different across individuals, and this potentially complicates the choice of bandwidths for the individual-by-individual variance estimators in the average-of-HACs component of the variance. For the Driscoll-Kraay component, the averaging by time would have potentially different cross-section sample sizes for each period. Theoretically, obtaining fixed-$b$ results for unbalanced panels also depends on how the missing data is modeled. For example, one might conjecture that if missing observations in the panel occur randomly (missing at random), then extending the fixed-$b$ theory would be straightforward. While that is true in pure time series settings (see Rho and Vogelsang, 2019), the presence of the individual and time random components in the panel setting complicates things because the asymptotic behavior of the components in the partial sums is very different from the balanced panel case. Obtaining useful results for the unbalanced panel case is challenging and is a focus of ongoing research.

BIBLIOGRAPHY

Aldous, D. J. (1981). Representations for partially exchangeable arrays of random variables. J. Multivar. Anal., 11(4):581–598.

Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica, 59:817–858.

Arellano, M. (1987). Computing robust standard errors for within-groups estimators. Oxf. B. Econ. Stat., 49:431–434.

Bertrand, M., Duflo, E., and Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? Q. J. Econ., 119:249–275.

Bester, C. A., Conley, T. G., Hansen, C. B., and Vogelsang, T. J. (2016). Fixed-b asymptotics for spatially dependent robust nonparametric covariance matrix estimators. Econom. Theory, 32(1):154–186.

Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2011). Robust inference with multiway clustering. J. Bus. Econ. Stat., 29:238–249.

Chen, K. and Vogelsang, T. J. (2024). Fixed-b asymptotics for panel models with two-way clustering. J. Econom., 244(1):105831.

Chiang, H. D., Hansen, B. E., and Sasaki, Y. (2022). Standard errors for two-way clustering with serially correlated time effects. Working Paper, Department of Economics, U. Wisconsin Madison, arXiv:2201.11304v2.
Chiang, H. D., Hansen, B. E., and Sasaki, Y. (2024). Standard errors for two-way clustering with serially correlated time effects. Rev. Econ. Stat., pages 1–40.

Davezies, L., D'Haultfoeuille, X., and Guyonvarch, Y. (2018). Asymptotic results under multiway clustering. Working Paper, arXiv preprint arXiv:1807.07925.

Davezies, L., D'Haultfoeuille, X., and Guyonvarch, Y. (2021). Empirical process results for exchangeable arrays. Ann. Stat., 49:845–862.

Driscoll, J. C. and Kraay, A. C. (1998). Consistent covariance matrix estimation with spatially dependent panel data. Rev. Econ. Stat., 80:549–559.

Hansen, B. E. (2022). Econometrics. Princeton University Press, Princeton, NJ.

Hansen, C. B. (2007). Asymptotic properties of a robust variance matrix estimator for panel data when T is large. J. Econom., 141:597–620.

Hoover, D. N. (1979). Relations on probability spaces and arrays of random variables. Preprint, Institute for Advanced Study.

Kallenberg, O. (1989). On the representation theorem for exchangeable arrays. J. Multivar. Anal., 30(1):137–154.

Kiefer, N. M. and Vogelsang, T. J. (2005). A new asymptotic theory for heteroskedasticity-autocorrelation robust tests. Econom. Theory, 21:1130–1164.

Lazarus, E., Lewis, D. J., and Stock, J. H. (2021). The size-power tradeoff in HAR inference. Econometrica, 89(5):2497–2516.

Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22.

MacKinnon, J. G., Nielsen, M. Ø., and Webb, M. D. (2021). Wild bootstrap and asymptotic inference with multiway clustering. J. Bus. Econ. Stat., 39(2):505–519.

Menzel, K. (2021). Bootstrap with cluster-dependence in two or more dimensions. Econometrica, 89:2143–2188.

Neave, H. R. (1970). An improved formula for the asymptotic variance of spectrum estimates. Ann. Math. Stat., 41(1):70–77.

Newey, W. K. and West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55:703–708.

Petersen, M. A. (2009). Estimating standard errors in finance panel data sets: Comparing approaches. Rev. Financ. Stud., 22:435–480.

Phillips, P. C. B. and Moon, H. R. (1999). Linear regression limit theory for nonstationary panel data. Econometrica, 67:1057–1111.

Rho, S.-H. and Vogelsang, T. J. (2019). Heteroskedasticity autocorrelation robust inference in time series regressions with missing data. Econom. Theory, 35(3):601–629.

Sun, Y. (2014). Let's fix it: Fixed-b asymptotics versus small-b asymptotics in heteroskedasticity and autocorrelation robust inference. J. Econom., 178:659–677.

Sun, Y., Phillips, P. C., and Jin, S. (2008). Optimal bandwidth selection in heteroskedasticity–autocorrelation robust testing. Econometrica, 76(1):175–194.

Thompson, S. B. (2011). Simple formulas for standard errors that cluster by both firm and time. J. Financ. Econ., 99:1–10.

Vogelsang, T. J. (2012). Heteroskedasticity, autocorrelation, and spatial correlation robust inference in linear panel models with fixed-effects. J. Econom., 166:303–319.

Zhang, X. and Shao, X. (2013). Fixed-smoothing asymptotics for time series. Ann. Stat., 41(3):1329–1349.

APPENDIX 2A
PROOFS FOR CHAPTER 2

Proof of Theorem 2.1: Consider $\sqrt{N}(\widehat{\theta} - \theta)$ with the component structure representation:
$$\sqrt{N}\big(\widehat{\theta} - \theta\big) = \frac{1}{\sqrt{N}} \sum_{i=1}^N a_i + \sqrt{\frac{N}{T}}\, \frac{1}{\sqrt{T}} \sum_{t=1}^T g_t + \frac{1}{\sqrt{NT}} \sum_{i=1}^N \sum_{t=1}^T e_{it}. \qquad (2\mathrm{A}.1)$$
Under the same set of assumptions, we can apply Theorem 1 of Chiang et al.
(2024) giving (cid:32) 1 √ 𝑁 Var (cid:32) Var (cid:32) Var √ 1 𝑁𝑇 1 √ 𝑇 𝑁 ∑︁ (cid:33) (cid:33) (cid:33) 𝑁 ∑︁ 𝑖=1 𝑇 ∑︁ 𝑎𝑖 𝑔𝑡 𝑡=1 𝑇 ∑︁ 𝑒𝑖𝑡 = Λ𝑎Λ′ 𝑎, (cid:13) (cid:13)Λ𝑎Λ′ 𝑎 (cid:13) (cid:13) < ∞, → Λ𝑔Λ′ 𝑔, (cid:13) (cid:13)Λ𝑔Λ′ 𝑔 (cid:13) (cid:13) < ∞, → Λ𝑒Λ′ 𝑒, (cid:13) (cid:13)Λ𝑒Λ′ 𝑒 (cid:13) (cid:13) < ∞. 𝑖=1 𝑡=1 (2A.1) (2A.2) (2A.3) (2A.4) Thus, 𝑎𝑖 = E[𝑦𝑖𝑡 − 𝜃|𝛼𝑖] is a sequence of i.i.d random vectors with zero mean and finite variance Λ𝑎Λ′ 𝑎. Then, the Lindeberg-Lévy CLT applies to the first sum in (2A.1): as 𝑁 → ∞, 1 √ 𝑁 𝑁 ∑︁ 𝑖=1 𝑎𝑖 𝑑 → 𝑁 (0, Λ𝑎Λ′ 𝑎). (2A.5) Consider the second sum in (2A.1) where 𝑔𝑡 = E[𝑦𝑖𝑡 − 𝜃|𝛾𝑡] is strictly stationary and is an 𝛼-mixing sequence with mixing coefficients 𝛼𝑔 (ℓ) ≤ 𝛼𝛾 (ℓ) by Theorem 14.12 of Hansen (2022), and so 𝛼𝑔 (ℓ) satisfies a summation condition as follows: for some 𝑠 > 1 and 𝛿 > 0, and for 𝐾 ∈ (0, ∞), there exists integer 𝑁𝐾 such that ∞ ∑︁ ℓ=1 𝛼𝑔 (ℓ)1−1/2(𝑠+𝛿) ≤ ∞ ∑︁ ℓ=1 𝛼𝛾 (ℓ)1−1/2(𝑠+𝛿) = 𝑁𝐾∑︁ 𝛼𝛾 (ℓ)1−1/2(𝑠+𝛿) + ∞ ∑︁ ℓ=𝑁𝐾 +1 (cid:18) 𝑂 (ℓ−𝜆) ℓ−𝜆 ℓ−𝜆 (cid:19) 1−1/2(𝑠+𝛿) < 𝑁𝐾 + 𝐾 ∞ ∑︁ (cid:16) ℓ− 2𝑠 𝑠−1 ℓ=1 ℓ=1 (cid:17) 1−1/2(𝑠+𝛿) < ∞. Then, by Theorem 16.4 of Hansen (2022) we have as 𝑁, 𝑇 → ∞, for 𝑟 ∈ (0, 1], 1 √ 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑔𝑡 ⇒ Λ𝑔𝑊𝑘 (𝑟) , (2A.6) where [𝑟𝑇] denotes the integer part of 𝑟𝑇 and 𝑊𝑘 (𝑟) is a 𝑘 × 1 vector of standard Wiener process. 43 As for the third sum, we have Var (cid:16) 1√ 𝑁𝑇 (cid:205)𝑁 𝑖=1 (cid:205)𝑇 𝑡=1 𝑒𝑖𝑡 (cid:17) → 0 by (2A.4). Then, we can apply Chebyshev’s inequality for random variables to show that each component of the random vector 1√ 𝑁𝑇 (cid:205)𝑁 𝑖=1 (cid:205)𝑇 𝑡=1 𝑒𝑖𝑡 converges to 0 in probability as 𝑁, 𝑇 → ∞ and so √ 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑒𝑖𝑡 𝑝 → 0. (2A.7) Combining (2A.5), (2A.6), (2A.7), we have √ (cid:16) 𝑁 (cid:17) (cid:98)𝜃 − 𝜃 ⇒ Λ𝑎𝑧𝑘 + √ 𝑐Λ𝑔𝑊𝑘 (1) 𝑎𝑠 𝑁, 𝑇 → ∞, and we conclude that 𝑧𝑘 is independent from 𝑊𝑘 (𝑟) since {𝑎𝑖} and {𝑔𝑡 } are independent to each other, proving (i) of Theorem 2.1. Next, consider (2.9), scaled by 1√ 𝑁𝑇 : √ 1 𝑁𝑇 (cid:98)¯𝑆 [𝑟𝑇] = √︂ 𝑁 𝑇 1 √ 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 (𝑔𝑡 − ¯𝑔) + √ 1 𝑁𝑇 𝑁 ∑︁ [𝑟𝑇] ∑︁ 𝑖=1 𝑡=1 (𝑒𝑖𝑡 − ¯𝑒) . (2A.8) By (2A.7), we have the second partial sum of (2A.8) converges to 0 in probability. Combined with the results from (2A.6) we obtain, √ (cid:98)¯𝑆 [𝑟𝑇] ⇒ √ 1 𝑁𝑇 𝑐Λ𝑔 (𝑊𝑘 (𝑟) − 𝑟𝑊𝑘 (1)) = √ 𝑐Λ𝑔 (cid:101)𝑊𝑘 (𝑟), (2A.9) as 𝑁, 𝑇 → ∞. Note that for each 𝑡 = 1, ..., 𝑇 − 1, we can map 𝑡 to [𝑟𝑡𝑇] for some 𝑟𝑡 ∈ (cid:2) 𝑡 𝑇 , 𝑡+1 𝑇 (cid:17) . Similarly, we can map 𝑡+𝑀 to [(𝑟𝑡+𝑏)𝑇] where 𝑏 = 𝑀/𝑇 and [𝑟𝑡𝑇] = 𝑡 for 𝑡 = 1, ..., 𝑇 −𝑀−1. Using (2A.9), we have √ 𝑁𝑇 (cid:98)¯𝑆𝑡 ⇒ 1√ 𝑐Λ𝑔 (cid:101)𝑊𝑘 (𝑟𝑡) for each 𝑡 = 1, ..., 𝑇 − 1 and 1√ 𝑁𝑇 (cid:98)¯𝑆𝑡+𝑀 ⇒ √ 𝑐Λ𝑔 (cid:101)𝑊𝑘 (𝑟𝑡 + 𝑏) for each 𝑡 = 1, ..., 𝑇 − 𝑀 − 1. Note that we can take 𝑟𝑡 = 𝑡/𝑇, then as 𝑁, 𝑇 → ∞, we have 1 𝑁𝑇 3 𝑇−1 ∑︁ 𝑡=1 ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 = (𝑇−1)/𝑇 ∑︁ √ 1 𝑁𝑇 3 𝑇−𝑀−1 ∑︁ 𝑡=1 ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡+𝑀 = 𝑟𝑡 =1/𝑇 (𝑇−𝑀−1)/𝑇 ∑︁ 𝑟𝑡 =1/𝑇 1 𝑁𝑇 (cid:98)¯𝑆 [𝑟𝑡𝑇] √ ′ (cid:98)¯𝑆 [𝑟𝑡𝑇] ⇒ 𝑐Λ𝑔 1 𝑁𝑇 ∫ 1 0 (cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊𝑘 (𝑟)′𝑑𝑟Λ𝑔, √ 1 𝑁𝑇 (cid:98)¯𝑆 [𝑟𝑡𝑇] √ 1 𝑁𝑇 ′ (cid:98)¯𝑆 [(𝑟𝑡 +𝑏)𝑇] ⇒ 𝑐Λ𝑔 ∫ 1−𝑏 0 (cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊𝑘 (𝑟 + 𝑏)′𝑑𝑟Λ𝑔. 44 Using the results above, we obtain the fixed-𝑏 joint limit of (2.11): 𝑁 𝑁 2𝑇 2 (cid:40) 2 𝑀 𝑇−1 ∑︁ ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 − 1 𝑀 𝑇−𝑀−1 ∑︁ 𝑡=1 (cid:16) ′ ′ 𝑡+𝑀 + (cid:98)¯𝑆𝑡+𝑀(cid:98)¯𝑆 (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 (cid:17) (cid:41) 𝑡=1 ∫ 1 (cid:26) 2 𝑏 (cid:16) ⇒ 𝑐Λ𝑔 = 𝑐Λ𝑔𝑃 0 𝑏, (cid:101)𝑊𝑘 (𝑟) (cid:17) Λ′ 𝑔. 
(cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊 𝑘 (𝑟)′𝑑𝑟 − 1 𝑏 ∫ 1−𝑏 0 (cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊𝑘 (𝑟 + 𝑏)′ + (cid:101)𝑊𝑘 (𝑟 + 𝑏) (cid:101)𝑊𝑘 (𝑟)′(cid:3) 𝑑𝑟 (cid:2) (cid:27) Λ′ 𝑔 (2A.10) Note that the last term of (2.12) is canceled out with (2.10). The rest of the terms of (2.12) are functions of the partial sums defined in (2.8). Consider (2.8) evaluated at 𝑡 = [𝑟𝑇] and scaled by 1 𝑇 : 1 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] = [𝑟𝑇] 𝑇 (𝑎𝑖 − ¯𝑎) + 1 𝑇 [𝑟𝑇] ∑︁ (𝑔𝑡 − ¯𝑔) + [𝑟𝑇] ∑︁ (𝑒𝑖𝑡 − ¯𝑒), 1 𝑇 𝑡=1 We first consider fixed-𝑁 and large-𝑇 asymptotic results. As 𝑇 → ∞ while fixing 𝑁, 𝑡=1 [𝑟𝑇] 𝑇 (𝑎𝑖 − ¯𝑎) 𝑝 → 𝑟 (𝑎𝑖 − ¯𝑎) and 1 𝑇 (𝑔𝑡 − ¯𝑔) 𝑝 → 0 by (2A.6). Note that Var (cid:32) 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 (cid:33) 𝑒𝑖𝑡 = 1 𝑇 (cid:205)[𝑟𝑇] 𝑡=1 [𝑟𝑇]−1 ∑︁ 𝑙=−([𝑟𝑇]−1) (cid:18) [𝑟𝑇] 𝑇 (cid:19) |𝑙 | 𝑇 − E(𝑒𝑖𝑡𝑒𝑖,𝑡+𝑙) = 𝑟 𝑇 Λ𝑒Λ′ 𝑒 (1 + 𝑜(1)). By Chebyshev’s inequality, we have Therefore, we conclude that 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑒𝑖𝑡 𝑝 → 0 as 𝑇 → ∞. 1 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] 𝑝 → 𝑟 (𝑎𝑖 − ¯𝑎) as 𝑇 → ∞, which in turn gives that 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡 𝑝 → 2 𝑏 ∫ 1 0 𝑟 2 (𝑎𝑖 − ¯𝑎) (𝑎𝑖 − ¯𝑎)′ 𝑑𝑟 = 2 3𝑏 So, if we let 𝑇 → ∞ and then 𝑁 → ∞ sequentially, we have (𝑎𝑖 − ¯𝑎) (𝑎𝑖 − ¯𝑎)′ as 𝑇 → ∞. 1 𝑁𝑇 2 𝑁 ∑︁ 𝑖=1 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡 𝑝 → 2 3𝑏 1 𝑁 𝑁 ∑︁ 𝑖=1 (𝑎𝑖 − ¯𝑎) (𝑎𝑖 − ¯𝑎)′ as 𝑇 → ∞ = 2 3𝑏 1 𝑁 𝑁 ∑︁ 𝑖=1 (cid:0)𝑎𝑖𝑎′ 𝑖 − 𝑎𝑖 ¯𝑎′ − ¯𝑎𝑎′ 𝑖 + ¯𝑎 ¯𝑎′(cid:1) 𝑝 → 2 3𝑏 E(𝑎𝑖𝑎′ 𝑖) = 2 3𝑏 Λ𝑎Λ′ 𝑎 as 𝑁 → ∞, 45 where the last convergence follows from the WLLN. What we obtain here is the sequential limit of the first term in (2.12). However, the sequential limit is not necessarily equal to the joint limit. Phillips and Moon (1999) provide a framework to obtain joint convergence results through sequential convergence results under certain conditions. Following their approach, we first define sequential convergence and joint convergence for random matrices defined below and then introduce a lemma which gives a sufficient condition for sequential convergence to imply joint convergence. Definition 2A.1 Let 𝐺 𝑁𝑇 be defined as 𝐺 𝑁𝑇 := 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖𝑇 , where 𝐺𝑖𝑇 := 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡 𝑝 → 2 3𝑏 (𝑎𝑖 − ¯𝑎) (𝑎𝑖 − ¯𝑎)′ =: 𝐺𝑖, 𝑎𝑠 𝑇 → ∞. Further, define 𝐺 𝑁 := 1 𝑁 𝑁 ∑︁ 𝑖=1 𝑝 → 𝐺𝑖 2 3𝑏 Λ𝑎Λ𝑎 =: 𝐺, 𝑎𝑠 𝑁 → ∞. Definition 2A.2 (a) A sequence of 𝑘×𝑘 matrices 𝐺 𝑁𝑇 on (Ω, F , 𝑃) is said to converge in probability sequentially to 𝐺, if lim 𝑁→∞ lim 𝑇→∞ 𝑃 (∥𝐺 𝑁𝑇 − 𝐺 ∥ > 𝜀) = 0 ∀𝜀 > 0. (b) Suppose that the 𝑘 ×𝑘 random matrices 𝐺 𝑁𝑇 and 𝐺 are defined on a probability space (Ω, F , 𝑃). 𝐺 𝑁𝑇 is said to converge in probability jointly to 𝐺, if lim 𝑁,𝑇→∞ 𝑃 (∥𝐺 𝑁𝑇 − 𝐺 ∥ > 𝜀) = 0 ∀𝜀 > 0. Lemma 2A.1 Suppose there exist random matrices 𝐺 𝑁 and 𝐺 on the same probability space as 𝐺 𝑁𝑇 satisfying that, for all 𝑁, 𝐺 𝑁𝑇 𝑝 → 𝐺 𝑁 as 𝑇 → ∞ and 𝐺 𝑁 𝑝 → 𝐺 as 𝑁 → ∞. Then, 𝐺 𝑁𝑇 𝑝 → 𝐺 jointly if lim sup 𝑁,𝑇 𝑃 (∥𝐺 𝑁𝑇 − 𝐺 𝑁 ∥ > 𝜀) = 0 ∀𝜀 > 0. (2A.11) 46 Lemma 1 can be proved the same way Lemma 6 of Phillips and Moon (1999) is proved, with the only difference that the vector norm is replaced by a matrix norm, so the proof is omitted here. Now we verify condition (2A.11). By Markov’s inequality, Minkowski inequality (for infinite sum), and the fact that there is no heterogeneity of 𝐺𝑖𝑇 and 𝐺𝑖 across 𝑖, we have lim sup 𝑁,𝑇 𝑃 (∥𝐺 𝑁𝑇 − 𝐺 𝑁 ∥ > 𝜀) ≤ lim sup 𝑁,𝑇 1 𝜀 E (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖𝑇 − 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) ≤ lim sup 𝑁,𝑇 1 𝜀 E ∥𝐺𝑖𝑇 − 𝐺𝑖 ∥ . Because 𝐺𝑖𝑇 converges to 𝐺𝑖 in probability as 𝑇 → ∞, it suffices to show that for each 𝑖, {𝐺𝑖𝑇 }∞ is uniformly integrable for the last term to converge to 0. 
Let 𝜁 > 0 and consider E ∥𝐺𝑖𝑇 ∥1+𝜁 : 𝑇=1 E ∥𝐺𝑖𝑇 ∥1+𝜁 = E (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡 1+𝜁 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:32) ≤ 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ (cid:18) E 𝑡=1 (cid:32) ≤ 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ (cid:18) 𝑡=1 (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:13) (cid:13) (cid:13) (cid:13) E 2(1+𝜁)(cid:19) 1 2(1+𝜁 ) (cid:18) (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:13) 2(1+𝜁)(cid:19) (cid:13) (cid:13) (cid:13) E 1+𝜁 (cid:19) 1 1+𝜁 (cid:33) 1+𝜁 (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ (cid:13) 𝑖𝑡 (cid:13) (cid:13) (cid:13) 2(1+𝜁 ) (cid:33) 1+𝜁 1 , where the first and second inequalities follows from Minkowski’s and Hölder’s inequalities respec- tively. Let 𝜁 = 𝛿 4 with 𝛿 > 0 from Assumption 2.1 and consider (cid:18) E (cid:13) (cid:13) (cid:13) 𝑇 (cid:98)𝑆𝑖𝑡 1 (cid:13) (cid:13) (cid:13) 2(1+𝜁)(cid:19) 1 2(1+𝜁 ) : (cid:32) E (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑇 (cid:98)𝑆𝑖,𝑡=[𝑟𝑇] 2(1+𝜁)(cid:33) (cid:13) (cid:13) (cid:13) (cid:13) 1 2(1+𝜁 ) ≤ 1 𝑇 [𝑟𝑇] ∑︁ 𝑗=1 (E∥𝑦𝑖 𝑗 ∥2(1+𝜁)) 1 2(1+𝜁 ) + [𝑟𝑇] 𝑇 1 𝑇 (cid:16) E∥𝑦𝑖𝑡 ∥2(1+𝜁)(cid:17) 1 2(1+𝜁 ) < ∞, by Minkowski’s inequality and Assumption 2.2(i). Now we conclude that E∥𝐺𝑖𝑡 ∥1+𝜁 < ∞, i.e. 𝐺𝑖𝑡 is uniformly integrable by Theorem 6.13 of Hansen (2022). By uniform integrability of 𝐺𝑖𝑇 and convergence in probability of 𝐺𝑖𝑇 to 𝐺𝑖, we have 𝐿1 convergence: lim sup E ∥𝐺𝑖𝑇 − 𝐺𝑖 ∥ = 0. Then, 𝑁,𝑇 condition (2A.11) follows and we obtain the joint limit of 1 𝑁 (cid:205)𝑁 𝑖=1 𝐺𝑖𝑇 as 𝑁, 𝑇 → ∞. Specifically, we have 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖𝑇 = 1 𝑁𝑇 2 𝑁 ∑︁ 𝑖=1 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡 = 2 3𝑏 Λ𝑎Λ′ 𝑎 + 𝑜 𝑝 (1), as 𝑁, 𝑇 → ∞. Following similar steps, we obtain joint limits for the rest of the terms in (2.12): 1 𝑁𝑇 2 1 𝑀 𝑁 ∑︁ 𝑇−𝑀−1 ∑︁ (cid:16) (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖,𝑡+𝑀 + (cid:98)𝑆𝑖,𝑡+𝑀 (cid:98)𝑆′ 𝑖,𝑡 𝑡=1 𝑁 ∑︁ 𝑖=1 1 𝑁𝑇 2 1 𝑀 𝑇−1 ∑︁ (cid:16) (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑇 + (cid:98)𝑆𝑖𝑇 (cid:98)𝑆′ 𝑖𝑡 𝑖=1 𝑡=𝑇−𝑀 47 (cid:17) (cid:17) = (cid:18) 2 3𝑏 + (cid:19) 1 3 (1 − 𝑏)2Λ𝑎Λ′ 𝑎 + 𝑜 𝑝 (1), (2A.12) = (2 − 𝑏)Λ𝑎Λ′ 𝑎 + 𝑜 𝑝 (1). (2A.13) Combining the partial-sum representation in (2.10), (2.11), (2.12) and the results above, we obtain (2.13). Proof of Theorem 2.2: First, rewrite √ 𝑁 ( (cid:98)𝛽 − 𝛽) = (cid:98)𝑄−1 = (cid:98)𝑄−1 (cid:32) √ 𝑁 𝑁𝑇 (cid:34) 1 √ 𝑁 √ 𝑁 ( (cid:98)𝛽 − 𝛽) using the component structure representation: 𝑁 ∑︁ 𝑇 ∑︁ (cid:33) (𝑎𝑖 + 𝑔𝑡 + 𝑒𝑖𝑡) 𝑡=1 𝑎𝑖 + 𝑖=1 𝑁 ∑︁ 𝑖=1 √︂ 𝑁 𝑇 1 √ 𝑇 𝑇 ∑︁ 𝑡=1 𝑔𝑡 + 1 √ 𝑇 √ 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ (cid:35) 𝑒𝑖𝑡 . 𝑖=1 𝑡=1 Next, by Assumption 2.3(ii) and Hölder’s inequality we have, for some 𝑠 > 1 and 𝛿 > 0, (cid:16) (cid:16) ∥𝑥𝑖𝑡𝑢𝑖𝑡 ∥4(𝑠+𝛿)(cid:17) 𝑖𝑡 ∥4(𝑠+𝛿)(cid:17) ∥𝑥𝑖𝑡𝑥′ E E (cid:16) (cid:16) ∥𝑥𝑖𝑡 ∥8(𝑠+𝛿)(cid:17) 1/2 ∥𝑥𝑖𝑡 ∥8(𝑠+𝛿)(cid:17) 1/2 E E (cid:16) (cid:16) ∥𝑢𝑖𝑡 ∥8(𝑠+𝛿)(cid:17) 1/2 ∥𝑥𝑖𝑡 ∥8(𝑠+𝛿)(cid:17) 1/2 ≤ E ≤ E < ∞, < ∞, Now, we are back to the case of Assumption 2.2(ii). Then, by similar steps as in the proof of (cid:13) < ∞, (cid:13) Theorem 2.1, we have (cid:13) (cid:13) (cid:13) (cid:13) < ∞, and (cid:13) < ∞, (cid:13) (cid:13) (cid:13)Λ𝑎Λ′ 𝑎 (cid:13)Λ𝑔Λ′ 𝑔 (cid:13)Λ𝑒Λ′ 𝑒 1 √ 𝑁 1 √ 𝑇 𝑁 ∑︁ 𝑁 ∑︁ 𝑖=1 [𝑟𝑇] ∑︁ 𝑎𝑖 𝑑 → 𝑁 (0, Λ𝑎Λ′ 𝑎), 𝑔𝑡 ⇒ Λ𝑔𝑊𝑘 (𝑟), 𝑡=1 𝑇 ∑︁ 𝑒𝑖𝑡 𝑝 → 0, 𝑖=1 𝑡=1 √ 1 𝑁𝑇 (2A.14) (2A.15) (2A.16) as 𝑁, 𝑇 → ∞. As for (cid:98)𝑄, we can vectorize it and then decompose it in the same manner as the multivariate mean case: 𝑣𝑒𝑐(𝑥𝑖𝑡𝑥′ 𝑖𝑡) − 𝑣𝑒𝑐(𝑄) = 𝑎𝑥 𝑡 + 𝑒𝑥 𝑖𝑡, 𝑣𝑒𝑐( (cid:98)𝑄) − 𝑣𝑒𝑐(𝑄) = 𝑖𝑡) − 𝑣𝑒𝑐(𝑄)|𝛼𝑖], 𝑔𝑥 𝑎𝑥 𝑖 + 1 𝑇 𝑇 ∑︁ 𝑔𝑥 𝑡 + 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑒𝑖𝑡, 𝑖=1 𝑡=1 𝑖=1 𝑖𝑡) − 𝑣𝑒𝑐(𝑄)|𝛾𝑡], and 𝑒𝑥 𝑡 = E[𝑣𝑒𝑐(𝑥𝑖𝑡𝑥′ 𝑡=1 𝑖𝑡 = 𝑣𝑒𝑐(𝑥𝑖𝑡𝑥′ 𝑖𝑡) − 𝑡 . 
Then, we can apply the results of (2A.1) - (2A.4) and the fact that the sums where 𝑎𝑥 𝑖 = E[𝑣𝑒𝑐(𝑥𝑖𝑡𝑥′ 𝑖 − 𝑔𝑥 𝑣𝑒𝑐(𝑄) − 𝑎𝑥 𝑖 + 𝑔𝑥 𝑁 ∑︁ 1 𝑁 48 in (2A.1) are mutually uncorrelated to conclude that Var(𝑣𝑒𝑐( (cid:98)𝑄)) → 0. Then, by Chebyshev’s inequality we obtain 𝑣𝑒𝑐( (cid:98)𝑄) 𝑝 → 𝑣𝑒𝑐(𝑄), i.e. as 𝑁, 𝑇 → ∞, 𝑝 → 𝑄. (cid:98)𝑄 (2A.17) Therefore, as 𝑁, 𝑇 → ∞, we have √ (cid:16) 𝑁 (cid:17) (cid:98)𝛽 − 𝛽 ⇒ 𝑄−1 (cid:2)Λ𝑎𝑧 + √ 𝑐Λ𝑔𝑊 (1)(cid:3) as claimed for the first part of Theorem 2.2. Next, for the second part we define the partial sums in the same fashion as (2.8) and (2.9): [𝑟𝑇] ∑︁ 𝑡=1 𝑁 ∑︁ (cid:98)𝑆𝑖,[𝑟𝑇] = (cid:98)¯𝑆 [𝑟𝑇] = 𝑣𝑖𝑡, 𝑥𝑖𝑡(cid:98) [𝑟𝑇] ∑︁ 𝑖=1 𝑡=1 𝑣𝑖𝑡 . 𝑥𝑖𝑡(cid:98) With similar steps as in proving (2A.17), we can show 1 𝑁𝑇 𝑁 ∑︁ [𝑟𝑇] ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 = 𝑁 [𝑟𝑇] 𝑁𝑇 1 𝑁 [𝑟𝑇] 𝑁 ∑︁ [𝑟𝑇] ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 𝑝 → 𝑟𝑄. and Therefore, as 𝑁, 𝑇 → ∞, we have 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 𝑝 → 𝑟𝑄. (cid:98)¯𝑆 [𝑟𝑇] ⇒ 𝑟Λ𝑎𝑊𝑘 (1) + √ 𝑐Λ𝑔𝑊𝑘 (𝑟) − 𝑟𝑄 (cid:16) 𝑄−1 (cid:2)Λ𝑎𝑊𝑘 (1) + √ 𝑐Λ𝑔𝑊𝑘 (1)(cid:3) (cid:17) √ 1 𝑁𝑇 √ = 𝑐Λ𝑔 (cid:101)𝑊𝑘 (𝑟). Then, similarly as it is shown in (2A.10) we have 𝑁 𝑁 2𝑇 2 (cid:40) 2 𝑀 𝑇−1 ∑︁ ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 − 1 𝑀 𝑇−𝑀−1 ∑︁ 𝑡=1 (cid:16) ′ ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡+𝑀 + (cid:98)¯𝑆𝑡+𝑀(cid:98)¯𝑆 𝑡 (cid:17) (cid:41) 𝑡=1 ∫ 1 (cid:26) 2 𝑏 (cid:16) ⇒ 𝑐Λ𝑔 = 𝑐Λ𝑔𝑃 0 𝑏, (cid:101)𝑊𝑘 (𝑟) (cid:17) Λ′ 𝑔. (cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊 𝑘 (𝑟)′𝑑𝑟 − 1 𝑏 ∫ 1−𝑏 0 (cid:101)𝑊𝑘 (𝑟) (cid:101)𝑊𝑘 (𝑟 + 𝑏)′ + (cid:101)𝑊𝑘 (𝑟 + 𝑏) (cid:101)𝑊𝑘 (𝑟)′(cid:3) 𝑑𝑟 (cid:2) (cid:27) Λ′ 𝑔 (2A.18) 49 For the rest of the terms of (cid:98)ΩCHS, we again apply Lemma 1 to obtain the joint limit through the sequential limit. Consider 1 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] using the component representation: 1 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] = [𝑟𝑇] 𝑇 𝑎𝑖 + 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑔𝑡 + 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑒𝑖𝑡 − 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 (cid:16) (cid:17) . (cid:98)𝛽 − 𝛽 Note that 1 𝑇 (cid:205)[𝑟𝑇] 𝑡=1 𝑒𝑖𝑡 𝑝 → 0 as 𝑇 → ∞ due to Var( 1 𝑇 (cid:205)[𝑟𝑇] 𝑡=1 𝑒𝑖𝑡) = 𝑂 (1/𝑇) and Chebyshev’s inequality. Then, given fixed 𝑁 and as 𝑇 → ∞, we have (cid:98)𝛽 − 𝛽 = (cid:98)𝑄−1 (cid:34) 1 𝑁 𝑁 ∑︁ 𝑖=1 𝑎𝑖 + 1 𝑇 𝑇 ∑︁ 𝑡=1 𝑔𝑡 + 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ (cid:35) 𝑒𝑖𝑡 𝑖=1 𝑡=1 𝑝 → 𝑄−1 ¯𝑎𝑖 and so 1 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] 𝑝 → 𝑟 (𝑎𝑖 − ¯𝑎𝑖), which is the same as the sample mean estimator case. Define 𝐺 𝑁𝑇 := 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖𝑇 , 𝐺𝑖𝑇 := 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ 𝑡=1 (cid:98)𝑆𝑖𝑡 (cid:98)𝑆′ 𝑖𝑡, 𝐺𝑖 := 2 3𝑏 where (cid:98)𝑆𝑖𝑡 = (cid:98)𝑆𝑖,[𝑟𝑇] = (cid:205)[𝑟𝑇] 𝑡=1 𝑣𝑖𝑡. 𝑥𝑖𝑡(cid:98) (𝑎𝑖 − ¯𝑎) (𝑎𝑖 − ¯𝑎)′ , 𝐺 𝑁 := 1 𝑁 𝑁 ∑︁ 𝑖=1 𝐺𝑖, 𝐺 := 2 3𝑏 Λ𝑎Λ𝑎 From the Proof of Theorem 2.1, we know that to prove condition (2A.11) it suffices to show the uniform integrability of {𝐺𝑖𝑇 } for any 𝑖. For some 𝜁 > 0, we have lim 𝑀→∞ sup 𝑇 E (∥𝐺𝑖𝑇 ∥ ; ∥𝐺𝑖𝑇 ∥ > 𝑀) ≤ lim 𝑀→∞ E sup 𝑇 (cid:32) ∥𝐺𝑖𝑇 ∥ (cid:19) 𝜁 (cid:18) ∥𝐺𝑖𝑇 ∥ 𝑀 (cid:33) ; ∥𝐺𝑖𝑇 ∥ > 𝑀 ≤ lim 𝑀→∞ sup 𝑇 1 𝑀 𝜁 E (cid:16) ∥𝐺𝑖𝑇 ∥1+𝜁 (cid:17) ≤ lim 𝑀→∞ sup 𝑇 (cid:32) 1 𝑀 𝜁 1 𝑇 2 2 [𝑏𝑇] 𝑇−1 ∑︁ (cid:18) 𝑡=1 (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:13) (cid:13) (cid:13) (cid:13) E 2(1+𝜁)(cid:19) 1 2(1+𝜁 ) (cid:18) (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:13) 2(1+𝜁)(cid:19) (cid:13) (cid:13) (cid:13) E 2(1+𝜁 ) (cid:33) 1+𝜁 1 . (cid:18) Now consider E 1 2(1+𝜁 ) (cid:13) (cid:13)(cid:98)𝑆𝑖𝑡 (cid:13) 2(1+𝜁)(cid:19) (cid:13) (cid:13) (cid:13) . 
By Minkowski’s inequality, we have 2(1+𝜁)(cid:33) 1 2(1+𝜁 ) (cid:32) E (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑇 (cid:98)𝑆𝑖,𝑡=[𝑟𝑇] (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 − 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 ( (cid:98)𝛽 − 𝛽) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) = (cid:169) E (cid:173) (cid:171) 1 2(1+𝜁 ) 2(1+𝜁) 1 2(1+𝜁 ) (cid:170) (cid:174) (cid:172) 1 2(1+𝜁 ) 2(1+𝜁) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) E ≤ (cid:169) (cid:173) (cid:171) (cid:170) (cid:174) (cid:172) E + (cid:169) (cid:173) (cid:171) (cid:13) (cid:13) (cid:98)𝑄𝑇 (cid:98)𝑄−1 (cid:13) 𝑁𝑇 (cid:13) (cid:13) 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 2(1+𝜁) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:170) (cid:174) (cid:172) 50 where we denote (cid:98)𝑄𝑇 = 1 𝑇 uniformly over 𝑇 by applying Minkowski’s inequality for infinite sums and Hölder’s inequality under 𝑖𝑡. The first term is easily bounded 𝑖𝑡 and (cid:98)𝑄 𝑁𝑇 = 1 𝑁𝑇 𝑥𝑖𝑡𝑥′ 𝑥𝑖𝑡𝑥′ 𝑡=1 𝑡=1 (cid:205)𝑁 𝑖=1 (cid:205)𝑇 (cid:205)𝑇 Assumption 2.3. By applying Hölder’s inequality to the second term twice, we have (cid:13) (cid:13) (cid:98)𝑄𝑇 (cid:98)𝑄−1 (cid:13) 𝑁𝑇 (cid:13) (cid:13) E 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 2(1+𝜁) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) (cid:18) E ≤ (cid:13) (cid:13) (cid:98)𝑄𝑇 (cid:13) (cid:13) (cid:13) (cid:13) 4(1+𝜁)(cid:19) 1/2 (cid:18) (cid:13) (cid:13) (cid:98)𝑄−1 (cid:13) 𝑁𝑇 (cid:13) (cid:13) (cid:13) E 4𝑝(1+𝜁)(cid:19) 1/𝑝 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) E (cid:169) (cid:173) (cid:171) 4𝑞(1+𝜁) 1/𝑞 (cid:170) (cid:174) (cid:172) (2A.19) where 1 𝑝 + 1 𝑞 = 1 and 𝑝, 𝑞 ∈ [1, ∞]. The first term in (2A.19) is bounded with a straightforward application of Minkowski’s inequality for infinite sums and Hölder’s inequality under Assumption 2.3. Note that we have shown (cid:98)𝑄 𝑁𝑇 4𝑝(1+𝜁) (cid:13) 𝑁𝑇 ∥ < ∞ and so E that ∥ (cid:98)𝑄−1 (cid:13) (cid:13) (cid:13) (cid:13) (cid:98)𝑄−1 (cid:13) 𝑁𝑇 𝑝 → 𝑄. By Assumption 2.3(ii), we have ∥𝑄−1∥ < ∞. It follows < ∞ given that 𝑝 and 𝜁 are finite. To determine 𝑝 and 𝜁, observe that (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) E (cid:169) (cid:173) (cid:171) 4𝑞(1+𝜁) 1 4𝑞 (1+𝜁 ) (cid:170) (cid:174) (cid:172) ≤ ≤ 1 𝑁𝑇 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑁 ∑︁ 𝑡=1 𝑇 ∑︁ 𝑖=1 𝑡=1 (cid:16) E ∥𝑥𝑖𝑡𝑢𝑖𝑡 ∥4𝑞(1+𝜁)(cid:17) 1 4𝑞 (1+𝜁 ) , (cid:16) E ∥𝑥𝑖𝑡 ∥8𝑞(1+𝜁) E ∥𝑢𝑖𝑡 ∥8𝑞(1+𝜁)(cid:17) 1 8𝑞 (1+𝜁 ) where the first inequality follows from Minkowski’s inequality for infinite sums and the second line follows from Hölder’s inequality. Let 𝑞 = 𝑠 and 𝜁 = 𝛿/𝑞 = 𝛿/𝑠 where 𝑠 and 𝛿 are from Assumption 2.3, then 𝑝 = 𝑠 𝑠−1 and it follows that all three terms in (2A.19) are bounded uniformly over 𝑇. Therefore, we conclude that E∥𝐺𝑖𝑡 ∥1+𝜁 < ∞ and so 𝐺𝑖𝑡 is uniformly integrable. Then, condition (2A.11) is established and we obtain the joint limit of 𝐺 𝑁𝑇 as 𝑁, 𝑇 → ∞: 𝐺 𝑁𝑇 𝑝 → 2 3𝑏 Λ𝑎Λ′ 𝑎. Following the same steps for the rest of terms in (2.12) leads to the same results as (2A.12) and (2A.13). 
Proof of Theorem 2.3: Following the proof of Theorem 2.2, the (cid:98)ΩDK part of the variance estimator can be rewritten as a function of partial sums where the function is defined in (2.11) and so the 51 joint limit follows from (2A.18): 𝑁 (cid:98)ΩDK = 1 𝑁𝑇 2 = 1 𝑁𝑇 2 𝑇 ∑︁ 𝑇 ∑︁ 𝑘 𝑠=1 𝑇−1 ∑︁ 𝑡=1 (cid:40) 2 𝑀 (cid:18) |𝑡 − 𝑠| 𝑀 (cid:19) (cid:32) 𝑁 ∑︁ (cid:33) 𝑁 ∑︁ 𝑣𝑖𝑡 (cid:98) 𝑗=1 𝑣′ (cid:169) 𝑗 𝑠(cid:170) (cid:98) (cid:173) (cid:174) (cid:171) (cid:172) (cid:16) ′ ′ 𝑡+𝑀 + (cid:98)¯𝑆𝑡+𝑀(cid:98)¯𝑆 (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 𝑖=1 𝑇−𝑀−1 ∑︁ (cid:41) (cid:17) ′ (cid:98)¯𝑆𝑡(cid:98)¯𝑆 𝑡 − 1 𝑀 𝑡=1 𝑏, (cid:101)𝑊𝑘 (𝑟) (cid:17) Λ′ 𝑔. 𝑡=1 ⇒ 𝑐Λ−1 𝑔 𝑃 (cid:16) For the (cid:98)ΩA part of the variance estimator, under Assumption 2.3 we can apply Lemma 2 of Chiang et al. (2024) giving 𝑁 (cid:98)ΩA = 1 𝑁𝑇 2 𝑁 ∑︁ (cid:32) 𝑇 ∑︁ 𝑖=1 𝑡=1 𝑣𝑖𝑡 (cid:98) (cid:33) (cid:32) 𝑇 ∑︁ 𝑠=1 (cid:33) 𝑣′ 𝑖𝑠 (cid:98) 𝑝 → Λ𝑎Λ′ 𝑎. (2A.20) Proof of Theorem 2.4: Define Λ𝑥𝑢Λ′ 𝑥𝑢 = Var(𝑥𝑖𝑡𝑢𝑖𝑡) and Λ𝑥𝑥Λ′ 𝑥𝑥 = E(𝑥𝑖𝑡𝑥′ 𝑖𝑡). By Jensen’s inequality, Hölder’s inequality, and Assumption 2.4(ii), we have (cid:13) (cid:13)Λ𝑥𝑢Λ′ 𝑥𝑢 (cid:13) (cid:13)Λ𝑥𝑥Λ′ 𝑥𝑥 (cid:13) (cid:13) = ∥E [(𝑥𝑖𝑡𝑢𝑖𝑡)(𝑥𝑖𝑡𝑢𝑖𝑡)′] ∥ ≤ E ∥𝑥𝑖𝑡𝑢𝑖𝑡 ∥2 ≤ (cid:13) = (cid:13) (cid:13) (cid:13) (cid:13) = E ∥𝑥𝑖𝑡 ∥2 < ∞. (cid:13) ≤ E (cid:13) (cid:13)E(𝑥𝑖𝑡𝑥′ (cid:13)𝑥𝑖𝑡𝑥′ 𝑖𝑡 𝑖𝑡)(cid:13) (cid:16) E ∥𝑥𝑖𝑡 ∥4(cid:17) 1/2 (cid:16) E ∥𝑢𝑖𝑡 ∥4(cid:17) 1/2 < ∞, Then, by the WLLN, the functional central limit theorem for i.i.d random vectors, and Slutsky’s Theorem, we have √ 𝑁𝑇 ( (cid:98)𝛽 − 𝛽) = (cid:32) 1 𝑁𝑇 √ 1 𝑁𝑇 (cid:98)¯𝑆 [𝑟𝑇] = √ 1 𝑁𝑇 𝑁 ∑︁ 𝑇 ∑︁ 𝑥𝑖𝑡𝑥′ 𝑖𝑡 (cid:33) −1 (cid:32) √ 1 𝑁𝑇 𝑖=1 𝑁 ∑︁ 𝑡=1 [𝑟𝑇] ∑︁ 𝑖=1 𝑡=1 𝑥𝑖𝑡𝑢𝑖𝑡 − 1 𝑁𝑇 𝑖=1 𝑡=1 𝑁 ∑︁ 𝑇 ∑︁ (cid:33) 𝑥𝑖𝑡𝑢𝑖𝑡 ⇒ 𝑄−1Λ𝑥𝑢𝑊𝑘 (1), (2A.21) 𝑁 ∑︁ 𝑖=1 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 √ (cid:16) 𝑁𝑇 (cid:17) (cid:98)𝛽 − 𝛽 ⇒ Λ𝑥𝑢 (cid:101)𝑊𝑘 (𝑟), (2A.22) as 𝑁, 𝑇 → ∞. Then, due to the partial sum representation of (cid:98)ΩDK, we have 𝑁𝑇 (cid:98)ΩDK ⇒ Λ𝑥𝑢𝑃 (cid:16) 𝑏, (cid:101)𝑊𝑘 (𝑟) (cid:17) Λ′ 𝑥𝑢, 52 where 𝑃 (cid:16) 𝑏, (cid:101)𝑊𝑘 (𝑟) (cid:17) is defined the same way as (2.13). The probability limit of (2.10) scaled by 𝑁𝑇 follows from Lemma 3 of Chiang et al. (2024): 1 𝑁𝑇 𝑁 ∑︁ 𝑖=1 (cid:98)𝑆𝑖𝑇 (cid:98)𝑆′ 𝑖𝑇 𝑝 → Λ𝑥𝑢Λ′ 𝑥𝑢. (2A.23) To derive the joint asymptotic limit of (2.12), we first obtain its sequential limit and then apply Theorem 1 of Phillips and Moon (1999) to show the joint limit is given by the sequential limit. By the WLLN and functional CLT for i.i.d. random vectors with finite variance, for given 𝑁 and as 𝑇 → ∞, we have 1 𝑇 1 √ 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 𝑝 →𝑟𝑄, 𝑎𝑠 𝑇 → ∞ 𝑥𝑖𝑡𝑢𝑖𝑡 ⇒Λ𝑥𝑢𝑊𝑖,𝑘 (𝑟), where 𝑊𝑖,𝑘 (𝑟) is a 𝑘 × 1 vector of standard Wiener process for each 𝑖. Therefore, for given 𝑁 and as 𝑇 → ∞, we have (cid:32) √ 𝑇 ( (cid:98)𝛽 − 𝛽) = 𝑁 ∑︁ 𝑇 ∑︁ 𝑥𝑖𝑡𝑥′ 𝑖𝑡 (cid:33) −1 (cid:32) 1 𝑁𝑇 𝑡=1 1 √ 𝑇 𝑁 𝑁 ∑︁ 𝑇 ∑︁ 𝑖=1 𝑡=1 (cid:33) 𝑥𝑖𝑡𝑢𝑖𝑡 ⇒ 𝑄−1Λ𝑥𝑢 ¯𝑍, 1 √ 𝑇 (cid:98)𝑆𝑖,[𝑟𝑇] = 1 √ 𝑇 𝑥𝑖𝑡𝑢𝑖𝑡 − 1 𝑇 [𝑟𝑇] ∑︁ 𝑡=1 𝑥𝑖𝑡𝑥′ 𝑖𝑡 √ 𝑇 (cid:16) (cid:17) (cid:98)𝛽 − 𝛽 ⇒ Λ𝑥𝑢 (cid:0)𝑊𝑖,𝑘 (𝑟) − 𝑟 ¯𝑍 (cid:1) , 𝑖=1 [𝑟𝑇] ∑︁ 𝑡=1 for each 𝑖 and ¯𝑍 = 1 𝑁 (cid:205)𝑁 𝑖=1 𝑍𝑖 where 𝑍𝑖 is a 𝑘 × 1 vector of standard normal random variables. Because the convergence of a sequence of matrices 𝐴𝑛 to some matrix 𝐴0 holds if and only if 𝑒′𝐴𝑛𝑒 converges to 𝑒 𝐴0𝑒 for any comfortable constant vector vector 𝑒, we can assume without loss of generality that 𝑘 = 1. 
The sequential limit of the first term of (2.12), scaled by $NT$, is obtained as follows:

$$Y_{i,T} := \frac{2}{[bT]} \frac{1}{T} \sum_{t=1}^{T-1} \widehat{S}_{it} \widehat{S}_{it} \Rightarrow \Lambda_{xu} \frac{2}{b} \int_0^1 \left( W_{i,k}(r) - r\bar{Z} \right)^2 dr \, \Lambda_{xu} =: Y_i, \quad \text{as } T \to \infty,$$
$$\frac{1}{N} \sum_{i=1}^{N} Y_i \xrightarrow{p} \frac{2}{b} \int_0^1 \Lambda_{xu} \, r \, dr \, \Lambda_{xu} = \frac{\Lambda_{xu}\Lambda_{xu}}{b}, \quad \text{as } N \to \infty,$$

where the equality in the second line follows from the Tonelli Theorem. Noting that there is no heterogeneity across $i$ due to the i.i.d. sequences, the conditions needed for Theorem 1 of Phillips and Moon (1999) reduce to the following: (i) $\limsup_{T\to\infty} E|Y_{i,T}| < \infty$; (ii) $\limsup_{T\to\infty} |EY_{i,T} - EY_i| = 0$; (iii) $\limsup_{N,T\to\infty} E \left( |Y_{i,T}|; |Y_{i,T}| > N\varepsilon \right) = 0$ for all $\varepsilon > 0$; and (iv) $\limsup_{N\to\infty} E \left( |Y_i|; |Y_i| > N\varepsilon \right) = 0$ for all $\varepsilon > 0$. Therefore, it suffices to show uniform integrability of $Y_{i,T}$ and $Y_i$. Uniform integrability of $Y_i$ is trivial since it is equivalent to showing $E|Y_i| < \infty$. To show uniform integrability of $Y_{i,T}$, fix $\varepsilon > 0$. We want to show that $\sup_{N,T} E|Y_{i,T}| < \infty$ and that there exists $\delta$ such that if $P(A) < \delta$ then $\sup_{N,T} E(|Y_{i,T}|; A) < \varepsilon$. By Hölder’s inequalities, we have

$$E(|Y_{i,T}|; A) \leq \frac{2}{[bT]} \sum_{t=1}^{T-1} E \left( \left| \frac{1}{T} \widehat{S}_{it} \widehat{S}_{it} \right|; A \right) \leq \frac{2}{[bT]} \sum_{t=1}^{T-1} E \left( \frac{1}{T} \widehat{S}_{it}^2; A \right), \qquad E|Y_{i,T}| \leq \frac{2}{[bT]} \sum_{t=1}^{T-1} E \left( \frac{1}{T} \widehat{S}_{it}^2 \right).$$

Thus, it is equivalent to show uniform integrability of $\frac{1}{T}\widehat{S}_{it}^2$. Notice that $\frac{1}{T}\widehat{S}_{it}^2 = \frac{1}{T}S_{it}^2 + o_p(1)$ under the consistency of $\widehat{\beta}$, and so by the asymptotic equivalence lemma we have

$$E \left| \frac{1}{T} \widehat{S}_{it}^2 \right| = E \left| \frac{1}{T} S_{it}^2 \right| \quad \text{as } N, T \to \infty.$$

Observe that

$$E \left| \frac{1}{T} S_{it}^2 \right| = \frac{1}{T} E \left( \sum_{t=1}^{T} x_{it}u_{it} \right)^2 = E(x_{it}u_{it}u_{it}x_{it}) \quad \forall T,$$

where the second equality follows from the fact that $\{x_{it}u_{it}\}$ are i.i.d. across $t$. Then, under Assumption 2.4(ii), there exists some constant $C < \infty$ such that $E \left| \frac{1}{T} S_{it}^2 \right| < C$. Integrating both sides of this inequality over $A$ gives

$$CP(A) > \int_A \frac{1}{T} S_{it}^2 \, dP = E \left( \frac{1}{T} S_{it}^2; A \right) \quad \forall T \in \mathbb{N},$$

where the equality follows from the fact that $\frac{1}{T}S_{it}^2 \geq 0$. So, if we take $\delta = \varepsilon/C$, then

$$\sup_{N,T} E \left( \frac{1}{T} \widehat{S}_{it}^2; A \right) = \sup_{N,T} E \left( \frac{1}{T} S_{it}^2; A \right) < \varepsilon.$$

It follows that $\{\frac{1}{T}\widehat{S}_{it}^2\}$ is uniformly integrable and so $Y_{i,T}$ is uniformly integrable. Therefore, Theorem 1 of Phillips and Moon (1999) applies and we obtain $Y_{i,T} \xrightarrow{p} \frac{1}{b}\Lambda_{xu}\Lambda_{xu}$. Similarly, the joint fixed-$b$ limits of the rest of the terms of (2.12) are obtained as follows:

$$\frac{1}{NT} \frac{2}{[bT]} \sum_{i=1}^{N} \sum_{t=1}^{T-M-1} \widehat{S}_{it} \widehat{S}_{i,t+M} \xrightarrow{p} \frac{(1-b)^2}{b} \Lambda_{xu}\Lambda_{xu} \quad \text{as } N, T \to \infty,$$
$$\frac{1}{NT} \frac{2}{[bT]} \sum_{i=1}^{N} \sum_{t=T-M}^{T} \widehat{S}_{it} \widehat{S}_{iT} \xrightarrow{p} \frac{1-(1-b)^2}{b} \Lambda_{xu}\Lambda_{xu} \quad \text{as } N, T \to \infty.$$

Arranging the joint limits obtained above, we find that $NT(\widehat{\Omega}_{A} - \widehat{\Omega}_{NW}) = o_p(1)$ under Assumption 2.4, which combined with (2A.21)-(2A.23) delivers the desired result.
CHAPTER 3
INFERENCE IN HIGH-DIMENSIONAL PANEL MODELS: TWO-WAY DEPENDENCE AND UNOBSERVED HETEROGENEITY

3.1 Introduction

In economic research, high dimensionality typically refers to a large number of unknown parameters relative to the sample size, under which traditional estimation methods are either infeasible or tend to yield estimates too noisy to be informative. The issue of high dimensionality becomes more relevant as data availability grows and economic modeling involves more flexibility. Commonly, the problem of high dimensionality appears in at least the following three scenarios:

• The dimension of observable and potentially relevant variables can be large relative to the sample. In the trade literature, preferential trade agreements (PTAs) usually involve a large number of provisions even though most policy analysis focuses only on the effect of a small subset of the provisions.[1] In demand analysis, even if the focus is on the own-price elasticity, the prices of relevant goods should also be included, unless strong assumptions for aggregation are imposed (see Chernozhukov et al., 2019).

• With nonparametric or semiparametric modeling, the unknown functions are viewed as infinite-dimensional parameters regardless of the dimension of the observable variables. If the unknown function $g(X)$ is approximately sparse and can be well approximated by a linear combination of the 3rd-order polynomial transformation of $X$, then it would involve 285 transformed regressors when the dimension of $X$ is 10 and 1770 when we start with a dimension of 20.[2]

• The modeling of heterogeneity can raise the number of nuisance parameters drastically. In demand analysis, income effects are specific to products if the homothetic preference assumption fails. For difference-in-differences analysis, allowing unit-specific trends and heterogeneous trends across the covariates can relax/test the parallel trend assumption. For models with unobserved heterogeneity that appears in a nonlinear way, either treating the heterogeneity as parameters to be estimated (fixed effects) or modeling it in a flexible way (correlated random effects) contributes to high dimensionality.[3]

[1] Based on data from Mattoo et al. (2020), 282 PTAs were signed and notified to the WTO between 1958 and 2017, encompassing 937 provisions across 17 policy areas. See Breinlich et al. (2022).
[2] For a vector $X$ with dimension $k$, it is easy to show that the 2nd-order polynomial transformation generates $\frac{k^2 + 3k}{2}$ terms and the 3rd-order polynomial transformation generates $k + \frac{1}{2}k(k+1) + \frac{1}{2}\sum_{l=1}^{k} l(l+1) = \frac{1}{6}k^3 + k^2 + \frac{11}{6}k$ terms.
[3] This is particularly relevant in the trade literature, where the unobserved heterogeneity derived from the gravity model takes a pairwise form among the importers, the exporters, and time. As each of these three dimensions expands, the number of nuisance parameters explodes quickly. See Correia et al. (2020), Chiang et al. (2021), and Chiang et al. (2023b), for example.

Particularly, the modeling of heterogeneity in panel models makes high dimensionality more of a practical issue rather than just a theoretical concern. As a concrete example, let us consider a panel model where all three sources of high dimensionality are involved:

$$Y_{it} = D_{it}\theta_0 + g_0(X_{it}, c_i, d_t) + U_{it}, \quad (3.1)$$

where $D_{it}$ is a vector of low-dimensional treatment or policy variables and $X_{it}$ is a vector of potentially high-dimensional control variables. $D_{it}$ can also contain some higher-order effects and interactive effects with a subset of the controls to allow for nonlinear and heterogeneous effects in a parametric way. When $D_{it}$ is not conditionally unconfounded, instrumental variables may be used for identification of $\theta_0$; $g_0(\cdot)$ is an unknown function, e.g., an infinite-dimensional parameter; $c_i$ and $d_t$ are unobserved heterogeneous effects, either as fixed-effect parameters or correlated random variables.
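To see concretely how fast the dictionary for $g_0$ grows, the counts in footnote [2] can be checked with a few lines of Python (an illustrative sketch added here; it uses the standard closed form that the number of non-constant monomials of degree at most $d$ in $k$ variables is $\binom{k+d}{d} - 1$):

```python
# Quick check of the dictionary sizes in footnote [2]: the number of
# non-constant polynomial terms of degree <= d in k variables.
from math import comb

def n_poly_terms(k: int, d: int) -> int:
    """Number of monomials of degree between 1 and d in k variables."""
    return comb(k + d, d) - 1

print(n_poly_terms(10, 3))  # 285, matching (1/6)k^3 + k^2 + (11/6)k at k = 10
print(n_poly_terms(20, 3))  # 1770, the count at k = 20
print(n_poly_terms(10, 2))  # 65 = (k^2 + 3k)/2 at k = 10
```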
The interest lies in the inference on the low-dimensional parameters $\theta_0$. Without considering the features of panel data and the unobserved heterogeneity, this is a classic partial linear model that has been well studied in the previous semiparametric literature. To reduce the dimensionality, sparse approximation and regularization approaches have been widely employed. Essentially, regularization, also known as the machine learning approach, trades off bias for smaller variance to achieve desirable rates of convergence. However, due to the bias introduced by regularization and overfitting, inference can be challenging. Typically, bias correction is involved to obtain estimators with better statistical properties and to conduct valid inference.

In the case of panel data, one soon realizes that at least three challenges appear if researchers attempt to apply the existing high-dimensional approaches directly. First, the statistical properties of many regularized estimators remain unknown with panel data, where the observations are potentially dependent across space/unit and time. Second, some bias-correction procedures for inference, such as sample-splitting/cross-fitting, are very particular about the sampling assumption, and existing approaches are not valid under two-way dependence in panels. Third, panel data is often leveraged to model unobserved individual and time effects, which may introduce another source of high dimensionality and further complicate estimation and inference.

To address the first challenge, I propose a variant of LASSO that uses regressor-specific penalty weights robust to two-way cluster dependence and weak temporal dependence across clusters. This LASSO approach is labeled the two-way cluster-LASSO, corresponding to the heteroskedasticity-robust LASSO in Belloni et al. (2012) and the cluster-LASSO in Belloni et al. (2016). The approach theoretically derives the common penalty level $\lambda$ up to a constant and a small-order sequence that do not vary across different data-generating processes. Therefore, data-driven tuning, such as cross-validation, is not needed, which makes the approach more computationally efficient and avoids non-trivial theories that take data-driven tuning into account.

A common and important condition for obtaining the desirable statistical properties of LASSO selection/estimation is the so-called "regularization event", which states that the overall penalty level is sufficiently large to dominate the "noise" in the high-dimensional estimation (but not too large at the same time, to avoid under-selection and a slow rate of convergence). However, existing approaches for ensuring such an event with probability approaching one are not applicable here due to the two-way cluster dependence. Instead, by considering the component structure characterization of the two-way dependence and decomposing the error terms using Hajek projections, I am able to leverage the moderate deviation theorems by Peña et al. (2009) and Gao et al. (2022) and the concentration inequality by Fuk and Nagaev (1971) for bounding the tail probability of the "noise" term.
Combining with existing non-asymptotic bounds for the LASSO approach in Belloni et al. (2012), I derive the rate of convergence for the (post) two-way cluster-LASSO. According to the rate-of-convergence results, the proposed (post) LASSO estimator is consistent for the slope coefficients of the sparse model. However, it is also revealed that the convergence rate is not as fast as the common rates for LASSO estimation, due to the two-way cluster dependence. The problem lies in the underlying component structure. To illustrate, consider the simplest multivariate mean model through a component structure representation:

$$Y_{it} = \theta_0 + f(\alpha_i, \gamma_t, \varepsilon_{it}), \quad (3.2)$$

where $Y_{it}$ is a high-dimensional vector with dimension $s = o(NT)$ and $\theta_0 = E[Y_{it}]$; $\alpha_i$, $\gamma_t$, and $\varepsilon_{it}$ are unobserved random elements. This is a common characterization of cluster dependence in the literature on cluster-robust inference. We notice that $\alpha_i$ introduces cluster/temporal dependence within group $i$ and $\gamma_t$ introduces cluster/cross-sectional dependence within group $t$. To estimate the high-dimensional vector $\theta_0$, we consider the sample mean estimator $\widehat{\theta} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} Y_{it}$. We can rewrite the estimator through a Hajek projection:

$$\widehat{\theta} - \theta_0 = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} (a_i + g_t + e_{it}) = \frac{1}{N} \sum_{i=1}^{N} a_i + \frac{1}{T} \sum_{t=1}^{T} g_t + \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} e_{it}, \quad (3.3)$$

where $a_i := E[Y_{it} - \theta_0 | \alpha_i]$, $g_t := E[Y_{it} - \theta_0 | \gamma_t]$, and $e_{it} := Y_{it} - \theta_0 - a_i - g_t$. For illustration purposes, suppose those components are i.i.d. sequences and independent of each other. Then it can be shown that, under some regularity conditions, for each $j = 1, ..., s$,

$$\widehat{\theta}_j - \theta_{0j} = O_P \left( \frac{1}{\sqrt{N \wedge T}} \right) \quad \text{and} \quad \|\widehat{\theta} - \theta_0\|_2 = \left( \sum_{j=1}^{s} \left( \widehat{\theta}_j - \theta_{0j} \right)^2 \right)^{1/2} = O_P \left( \sqrt{\frac{s}{N \wedge T}} \right).$$

While $\widehat{\theta}$ is still consistent when $N, T$ diverge at the same rate, $\|\widehat{\theta} - \theta_0\|_2$ converges more slowly than $o_P\left((NT)^{-1/4}\right)$, which is a common rate requirement for inferential theory.

This is where the second challenge arises: if a faster rate of convergence is not achievable due to the two-way cluster dependence, some bias-correction approach is needed to relax the rate requirement for valid inference. One common approach in the semiparametric literature is to add a correction term to the original identifying moment function. This results in an orthogonalized moment condition that often features multiplicative error terms, closely related to doubly robust estimators. Although the orthogonality property allows the nuisance estimations to be noisier, it generally does not ensure valid inference when there is growing dimensionality in the unknown parameters. An extra bias-correction approach, sample splitting or its generalization, cross-fitting, has been proposed for inference in high-dimensional regression models. The idea of sample splitting in a two-step estimation is to split the sample in a proper way and use the sub-samples separately for each step. If the sub-samples are independent of each other, then the first-step estimates will be independent of the sample used for the second-step estimation. With this property, the error term that causes the bias can vanish under a less stringent rate requirement on the first step. Intuitively, the dependence between the two steps is eliminated, so that a potentially over-fitted nuisance estimate from the first step does not pollute the second-step estimator as much as it otherwise would.
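The slow rate in 3.3 is easy to see in a small simulation. The sketch below (added for illustration; it assumes independent standard Gaussian components, a special case of the decomposition above) shows that the Monte Carlo standard deviation of $\widehat{\theta}_j - \theta_{0j}$ tracks $\sqrt{1/N + 1/T}$, i.e., the $1/\sqrt{N \wedge T}$ order, rather than $1/\sqrt{NT}$:

```python
# Minimal simulation of the component-structure mean model (3.2)-(3.3):
# Y_it = theta0 + a_i + g_t + e_it with independent Gaussian components.
# The Monte Carlo sd of theta_hat tracks sqrt(1/N + 1/T), not 1/sqrt(N*T).
import numpy as np

rng = np.random.default_rng(0)

def mc_sd(N, T, reps=2000):
    errs = np.empty(reps)
    for r in range(reps):
        a = rng.normal(size=(N, 1))   # unit components a_i
        g = rng.normal(size=(1, T))   # time components g_t (i.i.d. here)
        e = rng.normal(size=(N, T))   # idiosyncratic part e_it
        errs[r] = (a + g + e).mean()  # theta_hat - theta0
    return errs.std()

for N, T in [(25, 25), (100, 100), (400, 400)]:
    print(N, T, round(mc_sd(N, T), 4), round(np.sqrt(1 / N + 1 / T), 4))
```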
However, sample-splitting, as well as cross-fitting, is very sensitive to the sampling assumption. Building upon recent developments of cross-fitting approaches for dependent data (Chiang et al. (2022); Semenova et al. (2023a)), I propose a clustered-panel cross-fitting scheme, and I show that the constructed main and auxiliary samples are “approximately” independent of each other. Effectively, this inferential procedure extends the double/debiased machine learning (DML, hereafter) approach by Chernozhukov et al. (2018a) to panel data models, and it is labeled the panel DML. Asymptotic normality for the panel DML estimator is established given high-level assumptions on the convergence rates of the first-step estimator. It is shown that the crude requirement on the rate of convergence can be relaxed to $o\left((N \wedge T)^{-1/4}\right)$ in the $L_2$ norm, which admits first-step estimation through the two-way cluster-LASSO.

For the third challenge, caused by the unobserved heterogeneity, the existing literature proposes to use the fixed-effect approach, assuming the unknown function $g_0$ in 3.1 is linear in $(c_i, d_t)$ (Belloni et al., 2016; Kock and Tang, 2019), or to use the high-dimensional common correlated effects approach, assuming $g_0$ is linear in the interactive fixed effects (Vogt et al., 2022). To allow for flexible functional forms while remaining tractable, I propose to model $(c_i, d_t)$ as correlated random effects through a generalized Mundlak device while assuming the unknown function to be approximately sparse. In that way, a very rich form of heterogeneity is permitted. Not all of those nonlinear and heterogeneous effects are relevant, and the identities of the truly relevant effects are unknown to the researcher, so suitable machine learning approaches, e.g., the two-way cluster-LASSO, can be used to select the relevant effects. However, there is one more subtle issue: common approaches, including the Mundlak device, that deal with the unobserved heterogeneous effects introduce cross-sectional and temporal sample averages, which in turn bring dependence across the cross-fitting sub-samples. Furthermore, even if it remains valid under extra conditions, cross-fitting often causes a loss of efficiency due to the exclusion of observations. On the other hand, without cross-fitting, valid inference remains challenging for high-dimensional panel models in general. Nevertheless, in the case of the partial linear panel model, I further show that inferential theory can be established using the full sample.

In the empirical application, I re-examine the effect of government spending on the output of an open economy following the framework of Nakamura and Steinsson (2014), a well-cited empirical-macro paper. Although they study it using a panel data approach with unobserved heterogeneous effects whose number grows with the sample size, the baseline setting is not considered a high-dimensional problem: a linear panel model with only a few covariates and additive unobserved heterogeneous effects, identified through an instrumental variable. However, even in a conventionally low-dimensional setting, high dimensionality can be hidden, because the true model can be highly nonlinear in the covariates and the unobserved heterogeneous effects can enter the model in a flexible way. To avoid the endogeneity caused by potential misspecification of the functional form, I consider extending the baseline model in a more flexible way as in 3.1.
Due to potential two-way cluster dependence, existing high-dimensional methods designed for independent or weakly dependent data may not be valid. This is where the proposed dependence-robust estimation and inference methods for high-dimensional models can be leveraged, and the results can be used as a robustness check. It is shown that the estimates are consistent with the baseline results, which indicates that the nonlinear and interactive effects may not be very relevant in this model. Existing estimation and inference methods that are not robust to either high dimensionality or two-way cluster dependence tend to over-fit and result in noisy estimates and inaccurate inference.

The rest of the paper is outlined as follows. The next sub-section reviews the relevant literature and summarizes the differences and contributions of this paper relative to existing work. Section 3.2 presents the two-way cluster-LASSO estimator and the investigation of its statistical properties under two-way cluster dependence. Section 3.3 introduces a sub-sampling scheme designed for cross-fitting that allows within-cluster dependence and weak dependence across clusters. It is then used as a bias-correction approach for valid inference on the low-dimensional parameter, accounting for the effect of the high-dimensional nuisance estimation. In Section 3.4, the partial linear model with unobserved heterogeneity is studied in detail as a leading example. Simulation evidence is given in Section 3.5, where the proposed approaches compete with existing ones. In Section 3.6, the empirical estimation of the government spending multiplier is used as an illustration of hidden high dimensionality and an application of the proposed toolkit. Section 3.7 concludes the paper with a discussion of limitations and detailed empirical recommendations.

3.1.1 Relation to the Literature

This paper builds upon the literature on $l_1$-regularization methods in high-dimensional regression. Bickel et al. (2009) derive the convergence rate of the prediction error in terms of the empirical norm under homogeneous Gaussian error, restricted eigenvalue, and sparsity conditions. Bühlmann and Van De Geer (2011) instead assume a sub-Gaussian tail property to derive similar convergence-rate results. See Section 29.11 of Hansen (2022) for an illustration and extension of Bickel et al. (2009)’s analysis under heteroskedasticity. Under Gaussian or sub-Gaussian errors, Basu and Michailidis (2015), Kock and Callot (2015), and Lin and Michailidis (2017) study LASSO-based approaches for dependent data. To allow for both non-Gaussian errors and dependent data, Wu and Wu (2016), Chernozhukov et al. (2021a), Babii et al. (2022, 2023), and Gao et al. (2024) derive Nagaev-type concentration inequalities to bound the tail probability assuming a proper order of the penalty level. However, all the aforementioned LASSO-based approaches require delicate tuning of the penalty level to ensure desirable finite sample performance. The common cross-validation approaches and the bootstrap in Chernozhukov et al. (2021a) for choosing the penalty level are computationally costly and very sensitive to the sampling assumption. Moreover, the statistical analysis accounting for a data-driven penalty level is highly non-trivial (see Chetverikov et al., 2021 for the validity of cross-validated LASSO under random sampling). As another strand, Belloni et al.
(2011, 2012, 2016) propose other variants of LASSO approaches and leverage (self-normalized) moderate deviation theorems to derive theoretically-driven penalty levels. However, their methodologies cannot be easily extended to settings with two-way dependence. The proposed variant of LASSO builds upon the aforementioned literature and employs both Nagaev-type inequalities (Fuk and Nagaev, 1971) and moderate deviation theorems for self-normalized sums (Peña et al., 2009; Gao et al., 2022). To my knowledge, it is the first LASSO-based high-dimensional estimator that is robust to two-way cluster dependence.

The inferential theory in high-dimensional regression models typically relies on some bias-correction method, and such methods are particularly important here due to the two-way cluster dependence that results in a slow rate of convergence. Bias-correction approaches for inference purposes take various forms in the literature: for example, the low-dimensional projection adjustment in Zhang and Zhang (2014), the de-sparsification procedure in Van de Geer et al. (2014), the decorrelating matrix adjustment in Javanmard and Montanari (2014), the double selection approach in Belloni et al. (2014), the decorrelated score construction in Ning and Liu (2017), and the Neyman orthogonal moment construction in Chernozhukov et al. (2018a, 2022a). The last strand of the literature is often labeled DML, which is closely related to the earlier semiparametric literature, including Ichimura (1987), Robinson (1988), Powell et al. (1989), Newey (1994), and Andrews (1994). The idea of the orthogonalization is to add a correction term to the original identifying moment function so that the second-step estimator is less sensitive to the plug-in of noisy first steps. Due to the resulting multiplicative error term in the orthogonal moment condition, it is closely related to doubly-robust methods. Newey (1994) provides a general construction of the orthogonal moment condition through influence functions. It is further developed by Ichimura and Newey (2022) for identifying moment conditions satisfying certain restrictions. See Chernozhukov et al. (2018a) and Chernozhukov et al. (2022a) for a summary of such constructions and known orthogonal moment functions. More recently, Chernozhukov et al. (2018b, 2021b, 2022b,c) and Jordan et al. (2023) provide an alternative approach by estimating the correction term without knowing its analytical form. For the inferential theory in high-dimensional panel models, this paper takes the orthogonalization step as given and focuses on nuisance estimation and cross-fitting.

Sample-splitting, or cross-fitting, serving as another bias-correction approach, has been widely employed in other two-step estimations. The role of cross-fitting in high-dimensional inferential theory is to remove the dependence between the nuisance estimation and the second-step estimation so that the over-fitting bias from the first step has less impact on the second step. Technically, it allows for a slower rate of convergence in the first step, which in turn relaxes the sparsity condition (e.g., Belloni et al., 2014). Chernozhukov et al. (2018a) generalize the sample-splitting procedure as a cross-fitting scheme, which further improves finite sample performance by reducing the noise due to arbitrary splitting of the sample. Chiang et al. (2021, 2022) propose a cross-fitting scheme robust to separately and jointly exchangeable arrays. Semenova et al.
(2023a) propose a leave-one-neighborhood-out cross-fitting and introduce a coupling approach (due to Strassen, 1965 and Berbee, 1987) to prove the validity of cross-fitting under temporal dependence. The idea of the leave-one-neighborhood-out sub-sampling scheme is also shared by h-block cross-validation (Burman et al., 1994; Racine, 2000) and the big-block-small-block technique in the time series literature (e.g., Gao et al., 2022). Building upon this literature, I propose a more robust cross-fitting scheme that is valid under not only cluster dependence but also weak temporal dependence across clusters.

This paper also belongs to the cluster-robust inference literature. The characterization of the two-way cluster dependence is based on the Aldous-Hoover-Kallenberg (AHK) type representation, which is common in this literature (e.g., Djogbenou et al., 2019, Roodman et al., 2019, Davezies et al., 2019, and Menzel, 2021). The original representation only applies to exchangeable arrays, and exchangeability is violated in panel data settings with autocorrelation over time. Chiang et al. (2024) generalize this representation by allowing the time factor to be correlated over time, and Chen and Vogelsang (2024) also consider this representation when deriving fixed-b asymptotic results for inference. Although it differs from the original representation theorem, it is a fairly general characterization of two-way dependence in panels. Such a characterization of the dependence structure is common in economic studies (e.g., Rajan and Zingales, 1998, Fama and French, 2000, Li et al., 2004, Larrain, 2006, Thompson, 2011, Nakamura and Steinsson, 2014, Guvenen et al., 2017, and Ellison et al., 2024, among many others).

3.1.2 Notation

Here is a collection of frequently used notation in this paper; some extra notation is defined within the context. $E$ and $P$ are generic expectation and probability operators. $\mathcal{P}_{NT}$ is an expanding collection of all data-generating processes $P$ that satisfy certain conditions. $P_{NT}$ is a sequence of probability laws such that $P_{NT} \in \mathcal{P}_{NT}$ for each $(N, T)$. The dependence on $(N, T)$ and $P_{NT}$ will be suppressed whenever it is clear from the context. $\|\cdot\|$ is the Euclidean (Frobenius) norm for a matrix. Let $x$ be a generic $k \times 1$ real vector; the $l_q$ norm is denoted as $\|x\|_q := \left( \sum_{j=1}^{k} |x_j|^q \right)^{1/q}$ for $1 \leq q < \infty$, and $\|x\|_\infty := \max_{1 \leq j \leq k} |x_j|$. The $L_q(P)$ norm is denoted as $\|f\|_{P,q} := \left( \int \|f(\omega)\|^q dP(\omega) \right)^{1/q}$, where $f$ is a random element with probability law $P$. I denote the empirical average of $f_{it}$ over $i = 1, ..., N$ and $t = 1, ..., T$ as $\mathbb{E}_{NT}[f_{it}] = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} f_{it}$ and the empirical $L_2$ norm as $\|f_{it}\|_{NT,2} = \left( \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \|f_{it}\|^2 \right)^{1/2}$. Correspondingly, I denote the empirical average of $f_{it}$ over the sub-sample $i \in I_k$ and $t \in S_l$ as $\mathbb{E}_{kl}[f_{it}] = \frac{1}{N_k T_l} \sum_{i \in I_k, t \in S_l} f_{it}$ and the empirical $L_2$ norm over the sub-sample as $\|f_{it}\|_{kl,2} = \left( \frac{1}{N_k T_l} \sum_{i \in I_k} \sum_{t \in S_l} \|f_{it}\|^2 \right)^{1/2}$, where $I_k, S_l$ are sub-sample index sets and $N_k, T_l$ are sub-sample sizes that will be introduced in the next section.

3.2 Two-Way Cluster LASSO

In the existing literature, little is known about the statistical properties of high-dimensional methods under cluster dependence in both the cross-section and time. In this section, a variant of the $l_1$-regularization methods, also known as the LASSO, is proposed and examined.
To focus on the LASSO approach under two-way dependence, I consider a simple conditional expectation model of a scalar outcome given a potentially high-dimensional vector of covariates. Let $(Y_{it}, X_{it})$ be a sample with $i = 1, ..., N$ and $t = 1, ..., T$. The conditional expectation model can be expressed as follows:

$$Y_{it} = f(X_{it}) + V_{it}, \quad E[V_{it}|X_{it}] = 0, \quad (3.4)$$

where $f(X_{it}) := E[Y_{it}|X_{it}]$ is an unknown conditional expectation function of the potentially high-dimensional covariates $X_{it}$, and $V_{it}$ is the associated stochastic error. To characterize the two-way cluster dependence in the panel, I assume the random elements $W_{it} := (Y_{it}, X_{it}, V_{it})$ are generated by the following process:

Assumption 3.1 (Aldous-Hoover-Kallenberg Component Structure Characterization)

$$W_{it} = \mu + f(\alpha_i, \gamma_t, \varepsilon_{it}), \quad \forall i \geq 1, t \geq 1, \quad (3.5)$$

where $\mu = E_P[W_{it}]$ and $f$ is some unknown measurable function; $(\alpha_i)_{i \geq 1}$, $(\gamma_t)_{t \geq 1}$, and $(\varepsilon_{it})_{i \geq 1, t \geq 1}$ are mutually independent sequences, $\alpha_i$ is i.i.d. across $i$, $\varepsilon_{it}$ is i.i.d. across $i$ and $t$, and $\gamma_t$ is strictly stationary.

Assumption AHK is motivated by a representation theorem for exchangeable arrays, named after Aldous-Hoover-Kallenberg (AHK, hereafter), which states that if an array of random variables $(X_{ij})_{i \geq 1, j \geq 1}$ is separately or jointly exchangeable,[4] then $X_{ij} = f(\xi_i, \zeta_j, \iota_{ij})$, where $(\xi_i)_{i \geq 1}$, $(\zeta_j)_{j \geq 1}$, and $(\iota_{ij})_{i \geq 1, j \geq 1}$ are mutually independent, uniformly distributed i.i.d. random variables.[5] However, exchangeability is not likely to hold for arrays with a temporal dimension, since time is naturally ordered. In macroeconomics, for instance, we can interpret the time components $(\gamma_t)_{t \geq 1}$ as unobserved common time shocks, which are naturally correlated over time, implying that exchangeability is violated. Therefore, allowing $\gamma_t$ to be correlated introduces temporal dependence across all clusters and makes the characterization more sensible. The relaxation of the independence condition on $(\gamma_t)_{t \geq 1}$ can be viewed as a generalization of the component structure representation, as argued by Chiang et al. (2024). It is clear that under Assumption AHK, $W_{it}$ and $W_{is}$ are correlated for any $i, t, s$ due to sharing the same cross-sectional cluster. Similarly, due to sharing the same temporal cluster, $W_{it}$ and $W_{jt}$ are dependent for any $t, i, j$. Furthermore, even when sharing neither the cross-sectional nor the temporal cluster, observations can still be dependent due to the correlated time effects $\gamma_t$. It is important to note that the components in 3.5 simply characterize the dependence in panel data. Differing from factor models or models with unobserved heterogeneity, they do not affect the identification of the regression model in any way.

[4] An array $(X_{ij})_{i \geq 1, j \geq 1}$ is separately exchangeable if $\left( X_{\pi(i), \pi'(j)} \right) \stackrel{d}{=} \left( X_{ij} \right)$ for any permutations $\pi, \pi'$ of the indices, and jointly exchangeable if the same condition holds with $\pi = \pi'$.
[5] This is first proved in Aldous (1981) and independently proved and generalized to higher-dimensional arrays in Hoover (1979). It is further studied in Kallenberg (1989). For a formal statement of the theorem, see, for example, Theorem 7.22 in Kallenberg (2005).

Throughout the paper, the time effects $\gamma_t$ are weakly dependent, with a regularity condition introduced as follows. Before that, a few more concepts and notations are needed. Let $(X, Y)$ be random elements taking values in a Euclidean space $S = (S_1 \times S_2)$ with probability laws $P_X$ and $P_Y$, respectively.
Let $\|\nu\|_{TV}$ denote the total variation norm of a signed measure $\nu$ on a measurable space $(S, \Sigma)$, where $\Sigma$ is a $\sigma$-algebra on $S$:

$$\|\nu\|_{TV} = \sup_{A \in \Sigma} \left( \nu(A) - \nu(A^c) \right).$$

Define the dependence coefficient of $X$ and $Y$ as

$$\beta(X, Y) = \frac{1}{2} \|P_{X,Y} - P_X \times P_Y\|_{TV}.$$

The next assumption regulates the dependence of $\gamma_t$ using the beta-mixing coefficient:

Assumption 3.2 (Absolute Regularity) The sequence $\{\gamma_t\}_{t \geq 1}$ is beta-mixing at a geometric rate:

$$\beta_\gamma(m) = \sup_{s \leq T} \beta \left( \{\gamma_t\}_{t \leq s}, \{\gamma_t\}_{t \geq s+m} \right) \leq c_\kappa \exp(-\kappa m), \quad \forall m \in \mathbb{Z}_+, \quad (3.6)$$

for some constants $\kappa > 0$ and $c_\kappa \geq 0$.

Condition AR, also known as the beta-mixing condition, restricts the temporal dependence of the common time effects to decay at an exponential rate. It is common in the literature (see, for example, Hahn and Kuersteiner, 2011 and Fernández-Val and Lee, 2013) and can be generated by common autoregressive models, as in Baraud et al. (2001).

Due to the potential high dimensionality in $X$, traditional nonparametric methods are not feasible for estimating the unknown function $f$. To reduce the dimensionality, a common approach is to consider sparsity in the model and reduce the dimension through regularization. However, the unknown function $f$ is an infinite-dimensional parameter, which is not exactly sparse. Therefore, I take a sparse approximation approach following Belloni et al. (2012):

Assumption 3.3 (Approximate Sparse Model) The unknown function $f$ can be well approximated by a dictionary of transformations $f_{it} = F(X_{it})$, where $f_{it}$ is a $p \times 1$ vector and $F$ is a measurable map, such that $f(X_{it}) = f_{it}\zeta_0 + r_{it}$, where the coefficients $\zeta_0$ and the approximation error $r_{it}$ satisfy

$$\|\zeta_0\|_0 \leq s = o(N \wedge T), \qquad \|r_{it}\|_{NT,2} = O_P \left( \sqrt{\frac{s}{N \wedge T}} \right).$$

Assumption ASM views the high-dimensional linear regression as an approximation. It requires a subset of the parameters $\zeta_0$ to be zero while controlling the size of the approximation error. Compared to the sparsity conditions in the previous literature, it imposes a slower rate of growth on the number of non-zero slope coefficients. For example, $s = o(NT)$ corresponds to the case of the heteroskedasticity-robust LASSO under i.i.d. data in Belloni et al. (2012); $s = o(N l_T)$ corresponds to the cluster-robust LASSO under temporally dependent panel data in Belloni et al. (2016), where $l_T \in [1, T]$ is an information index that equals $T$ when there is no temporal dependence and equals 1 when there is cross-sectional independence and perfect temporal dependence. In other words, the underlying component structure restricts the growth of the number of nonzero slope coefficients of the model in a similar way to the perfect temporal dependence case. Under Assumption ASM, we can rewrite the model 3.4 as

$$Y_{it} = f_{it}\zeta_0 + r_{it} + V_{it}, \quad E[V_{it}|X_{it}] = 0. \quad (3.7)$$

Using 3.7, we can apply $l_1$ regularization to the least squares problem. Let $\lambda$ be some non-negative common penalty level and $\omega$ be some non-negative $p \times p$ diagonal matrix of regressor-specific penalty weights. Consider the following generic weighted LASSO estimator:

$$\widehat{\zeta} = \arg\min_{\zeta} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} (Y_{it} - f_{it}\zeta)^2 + \frac{\lambda}{NT} \|\omega^{1/2}\zeta\|_1. \quad (3.8)$$

To obtain the desirable properties of LASSO estimation, one needs to choose $\lambda$ and $\omega$ in an optimal way, so that the penalty level is large enough to avoid noisy estimation due to overfitting but also as small as possible, since the size of the penalty determines the performance bound of the LASSO estimation and too large a penalty level can cause missing variable bias.
In other words, the overall penalty level, given by both $\lambda$ and $\omega$, decides the trade-off between overfitting variance and regularization bias. For example, let $\ddot{f}_{it}$ be the demeaned $f_{it}$ using the sample mean,[6] and one common choice of $\omega$ is the empirical Gram matrix $E[\ddot{f}_{it}'\ddot{f}_{it}]$, which standardizes the regressors so that the model selection is not affected by the scale of the regressors. The common penalty level $\lambda$ is often chosen by some cross-validation algorithm. If the chosen $\lambda$ satisfies a certain asymptotic order, then the key condition that regularizes the tail behavior of the error term can be established under conditional Gaussian or sub-Gaussian error conditions (see Bickel et al., 2009, Bühlmann and Van De Geer, 2011, and Theorem 29.3 of Hansen, 2022):

$$\max_{j=1,...,p} \left| \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \omega_j^{-1/2} f_{it,j} V_{it} \right| \leq \frac{\lambda}{2 c_1 NT}. \quad (3.9)$$

[6] The demeaning is done because of the inclusion of the intercept term. To avoid penalizing the intercept, it is usually projected out first.

Condition 3.9 is referred to as the “regularization event” in the literature. Combined with some non-asymptotic bounds for LASSO, the rate of convergence for $\widehat{\zeta}$ can be derived. This approach, however, is not applicable when the error terms are considered to exhibit heavy tails. Alternatively, Fuk-Nagaev-type concentration inequalities have been established to verify Condition 3.9 without relying on the Gaussian or sub-Gaussian assumption (e.g., Babii et al., 2023, 2024; Gao et al., 2024). These alternative approaches, again, rely on cross-validation for choosing penalty levels, which is computationally costly and sensitive to the tuning of the cross-validation. It is further complicated when the cross-validation needs to be adjusted for dependent data. Belloni et al. (2012) propose to self-normalize the regressors through regressor-specific penalty weights and to leverage moderate deviation theorems for self-normalized sums (see Jing et al., 2003 and Peña et al., 2009) to verify Condition 3.9. The common penalty level $\lambda$ of this approach is theoretically derived and determined only by the sample size, the number of regressors, a small-order sequence, and some constants that do not vary across data-generating processes. When the dependence is present only in the temporal dimension, the existing approach for independent data can be extended by clustering over the cross-sectional dimension (see Belloni et al. (2016)). However, there is no simple extension if the dependence is present in both the temporal and cross-sectional dimensions. Instead, I utilize the component structure characterization of the dependence and decompose the high-dimensional error term $f_{it}V_{it}$ using a Hajek projection into three parts:

$$a_i = E[f_{it}V_{it}|\alpha_i], \quad g_t = E[f_{it}V_{it}|\gamma_t], \quad e_{it} = f_{it}V_{it} - a_i - g_t,$$

where $a_i$ are i.i.d. over $i$, $g_t$ are weakly dependent over $t$, and the remainder can be shown to be of small order and well-behaved too. With this observation, appropriate regressor-specific penalty weights can be constructed, and existing moderate deviation theorems and concentration inequalities can be leveraged. I propose the following common penalty level $\lambda$ and (infeasible) penalty weights:

$$\lambda = C_\lambda \frac{NT}{(N \wedge T)^{1/2}} \Phi^{-1} \left( 1 - \frac{\gamma}{2p} \right), \quad (3.10)$$
$$\omega_j = \frac{N \wedge T}{N^2} \sum_{i=1}^{N} a_{i,j}^2 + \frac{N \wedge T}{T^2} \sum_{b=1}^{B} \left( \sum_{t \in H_b} g_{t,j} \right)^2, \quad (3.11)$$
where $a_{i,j} = E[f_{it,j}V_{it}|\alpha_i]$ and $g_{t,j} = E[f_{it,j}V_{it}|\gamma_t]$ for $j = 1, ..., p$; $C_\lambda$ is some sufficiently large constant and $\gamma$ is a small-order sequence. The convergence rate of $\gamma$ affects the convergence rate of the LASSO estimator: as is revealed later, $\gamma$ should be $o(1)$ for LASSO to be consistent, while a larger $\gamma$ is necessary for a faster convergence rate. Both $C_\lambda$ and $\gamma$ do not vary across different data-generating processes. While there is some guidance about choosing $C_\lambda$ and $\gamma$, the exact choice is not given by the theoretical results. In practice, it is found that $C_\lambda = 2$ and $\gamma = 0.1/\log(p \vee N \vee T)$ deliver desirable finite sample performance.

Looking at the definition of $\omega$ in 3.11, we notice that the first term in 3.11 is a variance estimator for i.i.d. random variables and the second term is a cluster variance estimator, as in Bester et al., 2008, where $B$ is the number of clusters/blocks, $h$ is the block length, and $H_b$ is the associated index set. Technically, they are chosen as $B = \mathrm{round}(T/h)$, $h = \mathrm{round}(T^{1/5}) + 1$, and, for $b = 1, ..., B$, $H_b = \{t : h(b-1) + 1 \leq t \leq hb\}$.

To implement the penalty weights in 3.11, however, we need to estimate $a_{i,j} = E[f_{it,j}V_{it}|\alpha_i]$ and $g_{t,j} = E[f_{it,j}V_{it}|\gamma_t]$, which brings two challenges. With some initial estimation, we can replace $V_{it}$ with the initial residual $\widetilde{V}_{it}$; $\widetilde{V}_{it}$ can then be updated iteratively by the residuals from the estimation in 3.8 until it converges, meaning that $\widetilde{V}_{it}$ no longer updates up to a small difference. A common estimator for $a_{i,j}$ is then $\widehat{a}_{i,j} = \frac{1}{T}\sum_{t=1}^{T} f_{it,j}\widetilde{V}_{it}$. Similarly, we use $\widehat{g}_{t,j} = \frac{1}{N}\sum_{i=1}^{N} f_{it,j}\widetilde{V}_{it}$ for estimating $g_{t,j}$. Observe that this choice of implementing $\widehat{a}_{i,j}^2$ is equivalent to the feasible penalty weights of the cluster-LASSO in Belloni et al. (2016). It shows that the first term of $\omega$ clusters over time, so it adjusts for the temporal dependence within each unit cluster. The second term clusters over individuals first and then clusters over time within each block, so it adjusts for cross-sectional dependence and weak temporal dependence across time. The validity of estimating those components through cross-sectional and temporal averages is given in Menzel (2021) for exchangeable arrays. Extending the consistency results to non-exchangeable arrays is not trivial, and establishing the uniform convergence result, required due to the high dimensionality, is rather challenging and not the focus of this paper. Following Belloni et al. (2012) and Belloni et al. (2016), the statistical analysis of the weighted LASSO approach is based on high-level assumptions on the feasible penalty weights: let $\widehat{\omega}$ be the feasible diagonal weights and suppose there exist $0 < 1/c_1 < l \leq 1$ and $1 \leq u < \infty$ such that $l \to 1$ and

$$l \, \omega_j^{1/2} \leq \widehat{\omega}_j^{1/2} \leq u \, \omega_j^{1/2}, \quad \text{uniformly over } j = 1, ..., p, \quad (3.12)$$

where $\{\omega_j\}$ and $\{\widehat{\omega}_j\}$ are the diagonal entries of $\omega$ and $\widehat{\omega}$, respectively.

Algorithm: Implementation of the Two-Way Cluster-LASSO

i Obtain the initial residuals $\widetilde{V}$: estimate a model with a certain (user-specified) number of the most correlated regressors.[7]

ii Set $\lambda$ according to 3.10 with $C_\lambda = 2$ and $\gamma = 0.1/\log(p \vee N \vee T)$. Calculate $\widetilde{\omega}$ according to 3.11.

iii Use $\widetilde{\omega}$ for the LASSO estimation as in 3.8 and update the residuals $\widetilde{V}$ using the (post) LASSO estimates.[8]

[7] This step is for better convergence of the iterative estimation of the penalty weights.
A small number of initially included regressors can cause failure to converge.
[8] While they are asymptotically equivalent, post-LASSO suffers from less shrinkage bias in finite samples.

iv Repeat steps ii-iii until convergence. Obtain the (post) LASSO estimates from the last iteration.

In the low-dimensional case, a key identifying condition is that the population Gram matrix $E_P[f_{it}'f_{it}]$ is non-singular, so that the empirical Gram matrix is also non-singular with high probability. However, as we allow the dimension of $f_{it}$ to be larger than the sample size, the empirical Gram matrix $\mathbb{E}_{NT}[f_{it}'f_{it}]$ is singular. Fortunately, it turns out that we only need certain sub-matrices to be well-behaved for identification. Define

$$\phi_{min}(m)(M_f) := \min_{\delta \in \Delta(m)} \delta' M_f \delta \quad \text{and} \quad \phi_{max}(m)(M_f) := \max_{\delta \in \Delta(m)} \delta' M_f \delta,$$

where $\Delta(m) = \{\delta : \|\delta\|_0 = m, \|\delta\|_2 = 1\}$ and $M_f = \mathbb{E}_{NT}[f_{it}'f_{it}]$.

Assumption 3.4 (Sparse Eigenvalues) For any $C > 0$, there exist constants $0 < \kappa_1 < \kappa_2 < \infty$ such that, with probability approaching one as $(N, T) \to \infty$ jointly,

$$\kappa_1 \leq \phi_{min}(Cs)(M_f) < \phi_{max}(Cs)(M_f) \leq \kappa_2.$$

The sparse eigenvalue assumption follows from Belloni et al. (2012). It implies a restricted eigenvalue condition, which represents a modulus of continuity between the prediction norm and the norm of $\delta$ within a restricted set. More primitive sufficient conditions are discussed in Bickel et al. (2009) and Belloni et al. (2012).

Assumption 3.5 (Regularity Conditions) (i) $\log(p/\gamma) = o\left( T^{1/6}/(\log T)^2 \right)$ and $p = o\left( T^{7/6}/(\log T)^2 \right)$. (ii) For some $\mu > 1$ and $\delta > 0$, $\max_{j \leq p} E[|f_{it,j}|^{8(\mu+\delta)}] < \infty$ and $E[|V_{it}|^{8(\mu+\delta)}] < \infty$. (iii) $\min_{j \leq p} E(a_{i,j}^2) > 0$ and $\min_{j \leq p} E(g_{t,j}^2) > 0$.

Assumption REG(i) restricts the dimension of $f_{it}$ relative to the sample size. Although the number of regressors is constrained to be of a small order relative to the overall sample size $NT$ as $N, T \to \infty$ jointly, it is still allowed to be greater than the sample size in finite samples. Note that this requirement is more of a technical constraint and may be further relaxed with refined concentration inequalities for two-way dependent arrays. The moment conditions in Assumption REG(ii) are common in the literature. REG(iii) is a non-degeneracy condition, which is the main case of interest.

A common way to mitigate the shrinkage bias of LASSO is to apply least squares estimation based on the model selected by LASSO, which is named Post-LASSO. The next theorem covers this estimator as well. Let $\widehat{\Gamma} = \{j \in 1, ..., p : |\widehat{\zeta}_j| > 0\}$, where $\widehat{\zeta}_j$ are the two-way LASSO estimates. The next theorem gives convergence rates for both the two-way cluster-LASSO and its associated Post-LASSO.

Theorem 3.1 Suppose Assumptions AHK, ASM, AR, and REG hold for model 3.4 as $N, T \to \infty$ jointly with $N/T \to c$. Then, by setting $\lambda$ as in 3.10 with some sufficiently large $C_\lambda$, we have (i) the event 3.9 happens with probability approaching one. Additionally, suppose that Assumption SE holds and $\widehat{\omega}$ satisfies condition 3.12. Let $\widehat{\zeta}$ be the two-way cluster-LASSO estimator or the post-LASSO estimator based on the two-way cluster-LASSO selection. Then, (ii) $\|\widehat{\zeta}\|_0 = O_P(s)$, and (iii)

$$\frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( f_{it}\widehat{\zeta} - f_{it}\zeta_0 \right)^2 = O_P \left( \frac{s \log(p/\gamma)}{N \wedge T} \right), \quad \left\| \widehat{\zeta} - \zeta_0 \right\|_1 = O_P \left( s \sqrt{\frac{\log(p/\gamma)}{N \wedge T}} \right), \quad \left\| \widehat{\zeta} - \zeta_0 \right\|_2 = O_P \left( \sqrt{\frac{s \log(p/\gamma)}{N \wedge T}} \right).$$
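As a concrete illustration of step ii of the algorithm above, the following sketch computes the penalty level 3.10 and the feasible weights 3.11 from the dictionary and a current residual (an illustrative sketch only: the function name and array layout are my own, and the iterative updating of the residual and the LASSO step itself are omitted):

```python
# Sketch of the two-way cluster-LASSO penalty level (3.10) and feasible
# weights (3.11); f has shape (N, T, p), V has shape (N, T).
import numpy as np
from scipy.stats import norm

def twoway_penalty(f, V, C_lambda=2.0):
    N, T, p = f.shape
    gamma = 0.1 / np.log(max(p, N, T))
    lam = C_lambda * N * T / np.sqrt(min(N, T)) * norm.ppf(1 - gamma / (2 * p))

    fV = f * V[:, :, None]      # f_{it,j} * V_it, shape (N, T, p)
    a_hat = fV.mean(axis=1)     # a_{i,j} estimated by time averages, (N, p)
    g_hat = fV.mean(axis=0)     # g_{t,j} estimated by unit averages, (T, p)

    # Block sums of g_hat over H_b with h = round(T^{1/5}) + 1 and
    # B = round(T/h); any tail periods beyond h*B are dropped for simplicity.
    h = int(round(T ** 0.2)) + 1
    B = int(round(T / h))
    blocks = [g_hat[h * b:min(h * (b + 1), T)].sum(axis=0) for b in range(B)]
    block_sums = np.stack(blocks)   # shape (B, p)

    m = min(N, T)
    omega = (m / N**2) * (a_hat**2).sum(axis=0) \
          + (m / T**2) * (block_sums**2).sum(axis=0)
    return lam, omega
```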
Theorem 3.1 establishes convergence rates in terms of the prediction, $l_1$, and $l_2$ norms for the (post) two-way cluster-LASSO estimator in an approximately sparse model. These results are the first to give convergence rates for a LASSO-based estimator allowing for two-way cluster dependence. It is shown that under two-way cluster dependence, the two-way cluster-LASSO is consistent but has a convergence rate slower than those of LASSO-based methods under random sampling or weak dependence. Without loss of generality, let $N = N \wedge T$; then, by choosing $\gamma$ according to $\log(1/\gamma) \simeq \log(p \vee NT)$, we have

$$\left\| \widehat{\zeta} - \zeta_0 \right\|_2 = O_P \left( \sqrt{\frac{s \log(p \vee NT)}{N}} \right).$$

As a comparison, the rate of convergence in terms of the $l_2$ norm is $O_P\left(\sqrt{\frac{s \log p}{NT}}\right)$ under the random sampling and homoskedastic Gaussian error assumptions in Bickel et al. (2009) or the heteroskedastic Gaussian errors in Theorem 19.3 of Hansen (2022), $O_P\left(\sqrt{\frac{s \log(p \vee NT)}{NT}}\right)$ under random sampling in Belloni et al. (2012), and $O_P\left(\sqrt{\frac{s \log(p \vee NT)}{N l_T}}\right)$ under cross-sectional independence in Belloni et al. (2016), where the information index $l_T = 1$ when there is perfect dependence within the cross-sectional cluster. As illustrated in the Introduction, the slow rate of convergence is due to the underlying component structure. It is unclear whether valid inference is possible under the rate-of-convergence results in Theorem 3.1 or whether it is possible to relax the requirement through a cross-fitting procedure. These questions are addressed in the next section.

3.3 Clustered-Panel Cross-Fitting and Inference

In this section, I first propose a sub-sampling scheme for cross-fitting in a two-way clustered panel and then propose a general inference procedure using cross-fitting for a high-dimensional panel model. The idea of the sub-sampling scheme is to split the sample in a proper way so that the two resulting sub-samples are independent or, at least, “approximately” independent. With such properties, the sub-sampling scheme can be used for various purposes. For example, it can be used for cross-fitting in a two-step estimation, since it effectively eliminates the dependence between the two steps, which in turn relaxes the rate-of-convergence requirement on the first step for valid inference. It can also be used for cross-validation when choosing tuning parameters in panel data models. In this paper, we will focus on its application in cross-fitting.

Let $\{W_{it} : i = 1, ..., N \text{ and } t = 1, ..., T\}$ denote a sample of sizes $(N, T)$ from a sequence of random elements $(W_{it})_{i \geq 1, t \geq 1}$ defined on a common measurable space $(\Omega, \mathcal{F})$ and taking values in Euclidean spaces. To allow the dimension of $W_{it}$ to grow with $N, T$, we denote $(\mathcal{P}_{NT})_{N \geq 1, T \geq 1}$ as an expanding class of probability laws of $\{W_{it} : i = 1, ..., N \text{ and } t = 1, ..., T\}$ and denote $P \in \mathcal{P}_{NT}$ as a generic probability law for the sample with sizes $(N, T)$. Under the AHK characterization in Assumption AHK, $W_{it}$ are cluster-dependent over both the cross-section and time. Importantly, the cluster dependence does not vanish as the distance between observations (if there is any ordering) increases. If $\gamma_t$ is weakly dependent, which is the focus of this paper, then the dependence between observations that do not share the same cluster in either dimension dies out as the temporal distance grows.
In that case, intuitively, one can split the sample so that the sub-samples do not share the same cluster and are away from each other in temporal distance. This is exactly how the scheme works:

Definition 3.1 (Two-Way Clustered-Panel Cross-Fitting) (i) Select some positive integers $(K, L)$. Randomly partition the cross-sectional index set $\{1, 2, ..., N\}$ into $K$ folds $\{I_1, I_2, ..., I_K\}$ and partition the temporal index set $\{1, 2, ..., T\}$ into $L$ adjacent folds $\{S_1, S_2, ..., S_L\}$ so that $\bigcup_{k=1}^{K} I_k = \{1, ..., N\}$ and $\bigcup_{l=1}^{L} S_l = \{1, ..., T\}$.[9]

(ii) For each $k = 1, ..., K$ and $l = 1, ..., L$, construct the main sample

$$W(k, l) = \{W_{it} : i \in I_k, t \in S_l\},$$

and the auxiliary sample (typically larger)

$$W(-k, -l) = \left\{ W_{it} : i \in \bigcup_{k' \neq k} I_{k'}, \ t \in \bigcup_{l' \neq l, l \pm 1} S_{l'} \right\}.$$

[9] For simplicity, I assume $N$ and $T$ are divisible by $K$ and $L$, respectively. In practice, if $N$ is not divisible by $K$, the sizes of the cross-sectional blocks can be chosen differently, with some equal to floor$(N/K)$ and others equal to ceil$(N/K)$; the same applies to the temporal dimension.

Later on, we also use $I_{-k}$ and $S_{-l}$ to denote the index sets for the auxiliary sample $W(-k, -l)$. Similarly, we denote $N_{-k}$ and $T_{-l}$ as the cross-sectional and temporal sample sizes for the auxiliary sample $W(-k, -l)$. Figure 3.1 illustrates the cross-fitting with $K = 4$ and $L = 8$.

Since the sub-samples $W(k, l)$ and $W(-k, -l)$ do not share any cluster, they are free from cluster dependence, and what remains is the weak dependence over time. Unless one imposes $m$-dependence, the sub-samples above will not be independent. However, under certain regularity conditions regarding the weak dependence, it can be shown through a coupling technique that, as long as the temporal distance between the sub-samples diverges at a certain rate, there exist coupling sub-samples that are independent of each other while having the same marginal distributions as the constructed sub-samples with probability converging to 1. Such coupling techniques are common in the time series context. The following lemma delivers such a result:

Lemma 3.1 (Independent Coupling) Consider the sub-samples $W(k, l)$ and $W(-k, -l)$ for $k = 1, ..., K$ and $l = 1, ..., L$. Suppose Assumptions AHK and AR hold and $\log(N)/T = o(1)$ as $T \to \infty$. Then, we can construct $\widetilde{W}(k, l)$ and $\widetilde{W}(-k, -l)$ such that (i) they are independent of each other; (ii) they have the same marginal distributions as $W(k, l)$ and $W(-k, -l)$, respectively; and (iii)

$$P \left\{ (W(k, l), W(-k, -l)) \neq \left( \widetilde{W}(k, l), \widetilde{W}(-k, -l) \right), \ \text{for some } (k, l) \right\} = o(1).$$

Lemma 3.1 shows that the main and auxiliary samples from the proposed clustered-panel cross-fitting scheme are approximately independent as $N, T$ diverge. Note that the hypothetical samples $\widetilde{W}(k, l)$ and $\widetilde{W}(-k, -l)$ do not matter in practice, but they allow us to treat $W(k, l)$ and $W(-k, -l)$ as $\widetilde{W}(k, l)$ and $\widetilde{W}(-k, -l)$ with probability approaching 1. The proof of Lemma 3.1 is based on independence coupling results (Strassen, 1965, Dudley and Philipp, 1983, and Berbee, 1987) introduced in Semenova et al. (2023a).

Figure 3.1: Clustered-panel cross-fitting with $K = 4$ and $L = 8$. The three graphs from left to right correspond to the main and auxiliary sample constructions with $(k, l) = (1, 1)$, $(k, l) = (2, 2)$, and $(k, l) = (3, 3)$.
For a simple illustration, observations in the main sample are all adjacent in the cross-sectional dimension, but this is not necessary in practice; the same applies to the auxiliary sample.

As mentioned at the beginning of the section, one of the primary uses of the sub-sampling scheme is cross-fitting in a two-step estimation. To be concrete, I will define a two-step estimator using the cross-fitting algorithm in the context of a semiparametric moment restriction model. The theoretical properties of the estimator will be studied in Section 3.3.1. Let $\varphi(W_{it}; \theta, \eta)$ denote some identifying moment function, where $\theta$ is a low-dimensional vector of parameters of interest and $\eta$ are nuisance functions. For example, $\eta = g_0$ in 3.1. Let $\psi(W_{it}; \theta, \eta)$ denote some orthogonalized moment function based on $\varphi(W_{it}; \theta, \eta)$. The formal definition of the orthogonality will be given in the next subsection. For now, it suffices to be aware that both functions are mean zero, while $\psi(W_{it}; \theta, \eta)$ is adjusted for the fact that $\eta_0$ needs to be estimated. In model 3.1, $\varphi(W_{it}; \theta, \eta) = D_{it}U_{it}$ and $\psi(W_{it}; \theta, \eta) = (D_{it} - E[D_{it}|X_{it}, c_i, d_t])(Y_{it} - D_{it}\theta - g(X_{it}, c_i, d_t))$. In the treatment effect model with unconfoundedness conditional on covariates and unobserved heterogeneous effects, $\varphi(W_{it}; \theta, \eta) = E[Y_{it}|D_{it} = 1, X_{it}, c_i, d_t] - E[Y_{it}|D_{it} = 0, X_{it}, c_i, d_t] - \theta_{ATE}$, and $\psi(W_{it}; \theta, \eta)$ is the moment function corresponding to the well-known augmented inverse probability weighting estimator, which is doubly robust.

The panel cross-fitting procedure goes as follows. For each $k$ and $l$, we use the sub-sample $W(-k, -l)$ to estimate $\eta$, with the estimator denoted as $\widehat{\eta}_{kl}$. For each $i \in I_k$ and $t \in S_l$, we plug $\widehat{\eta}_{kl}$ into the orthogonal moment function, $\psi(W_{it}; \theta, \widehat{\eta}_{kl})$. By averaging $\psi(W_{it}; \theta, \widehat{\eta}_{kl})$ across all $i \in I_k$ and $t \in S_l$, for each $k = 1, ..., K$ and $l = 1, ..., L$ we obtain

$$\bar{\psi}_{kl} := \mathbb{E}_{kl} \left[ \psi(W_{it}; \theta, \widehat{\eta}_{kl}) \right],$$

which is a sample analog of the population orthogonal moment condition used for estimation. Note that the larger sub-sample $W(-k, -l)$, instead of the smaller sub-sample $W(k, l)$, is used for the first-step nuisance estimation because it involves the estimation of high-dimensional unknown parameters. For reference, $W(k, l)$ is referred to as the main sample, and $W(-k, -l)$ is referred to as the auxiliary sample. The next definition summarizes the panel DML estimation and inference procedures for a semiparametric moment restriction model:

Definition 3.2 (Panel DML Algorithm) (i) Given the identifying moment functions $\varphi(W; \theta, \eta)$ such that $E_P[\varphi(W; \theta_0, \eta_0)] = 0$, find the orthogonalized moment function $\psi(W; \theta, \eta)$.

(ii) Obtain the cross-fitting sub-samples $W(k, l)$ and $W(-k, -l)$ as in Definition 3.1.

(iii) For each $k$ and $l$, use the sample $W(-k, -l)$ for the first-step estimation to obtain $\widehat{\eta}_{kl}$, and then construct $\bar{\psi}_{kl}(\theta) = \mathbb{E}_{kl}[\psi(W_{it}; \theta, \widehat{\eta}_{kl})]$ for each $(k, l)$. Finally, obtain the DML estimator $\widehat{\theta}$ as the solution to

$$\frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \bar{\psi}_{kl}(\theta) = 0. \quad (3.13)$$

Remark 3.1 (The Choice of $K$ and $L$) Notice there is a trade-off in setting $(K, L)$ between first-step and second-step accuracy: the bigger the values of $(K, L)$, the bigger the sample size of the auxiliary sample $W(-k, -l)$, which is beneficial for the high-dimensional first step but comes at the cost of a noisier parametric second step. Due to leaving out the temporal neighborhood, feasible implementation requires $L \geq 4$ (if $L = 3$, for example, any main sample $W(k, l)$ with $l = 2$ does not have a well-defined auxiliary sample).
On the other hand, it is computationally costly to set the values of $(K, L)$ too large. In practice, $K = 2$ to $4$ and $L = 4$ to $8$ work well in simulations.

3.3.1 Panel DML: Inferential Theory

To investigate the convergence rate required of a high-dimensional estimator for valid inference, I study a general inference procedure for a high-dimensional panel model characterized by a semiparametric moment restriction. This inference procedure is based on the panel cross-fitting approach proposed in Section 3.3 and the prototypical DML approach proposed in Chernozhukov et al. (2018a). With the same notation as above, the model is characterized by a semiparametric moment condition $E[\varphi(W_{it}; \theta_0, \eta_0)] = 0$, where $W_{it}$ are again characterized by an underlying component structure as in Assumption AHK. Let $\psi(W; \theta, \eta)$ be the orthogonalized moment function. Formally, the orthogonality means that it is mean zero and that its pathwise or Gateaux derivative with respect to the nuisance parameter is 0 when evaluated at the true values:

$$E_P[\psi(W_{it}; \theta_0, \eta_0)] = 0, \quad (3.14)$$
$$\partial_r E_P[\psi(W_{it}; \theta_0, \eta_0 + r(\eta - \eta_0))]|_{r=0} = 0. \quad (3.15)$$

In other words, the nuisance functions have no first-order local effect on the orthogonalized moment conditions, based on which the estimation of $\theta_0$ is therefore robust to the plug-in of noisy estimates of $\eta_0$. In contrast, the original identifying moment conditions do not possess such a property.

Differing from the existing literature, the approach in this paper focuses on estimation and inference robust to the two-way cluster dependence characterized by Assumption AHK. Note that Assumption AHK also includes i.i.d. data as a special case. Although the panel DML procedure is also robust to the i.i.d. case or, more generally, the case of degeneracy in the components, the theoretical properties are not formally given in this paper. The rates of convergence for both the nuisance estimator and the second-step estimator are different and faster in the i.i.d. case, but that is not surprising and is not the focus of this paper. To restrict the focus, I will assume a non-degeneracy condition in terms of the Hajek projection components. First, I define the Hajek components and their corresponding (long-run) variance-covariance matrices as follows:

$$a_i := E_P[\psi(W_{it}; \theta_0, \eta_0)|\alpha_i], \quad \Sigma_a := E_P[a_i a_i'],$$
$$g_t := E_P[\psi(W_{it}; \theta_0, \eta_0)|\gamma_t], \quad \Sigma_g := \sum_{l=-\infty}^{\infty} E_P[g_t g_{t+l}'],$$
$$e_{it} := \psi(W_{it}; \theta_0, \eta_0) - a_i - g_t, \quad \Sigma_e := \sum_{l=-\infty}^{\infty} E_P[e_{it} e_{i,t+l}'].$$

Let $\lambda_{min}[\cdot]$ denote the smallest eigenvalue of a square matrix. The next assumption specifies the non-degeneracy condition; it implies that at least one of the components drives the cluster dependence.

Assumption 3.6 (Non-Degeneracy) Either $\lambda_{min}[\Sigma_a] > 0$ or $\lambda_{min}[\Sigma_g] > 0$.

The next two assumptions follow the same format as Chernozhukov et al. (2018a) but, importantly, they characterize some different rates of convergence required for the inferential theory. Let $(\delta_{NT})$ and $(\Delta_{NT})$ be sequences of positive constants converging to 0 as $N, T \to \infty$. Let $\mathcal{T}_{NT}$ be a nuisance realization set such that it contains $\eta_0$ and $\widehat{\eta}_{kl}$ belongs to $\mathcal{T}_{NT}$ with probability $1 - \Delta_{NT}$ for each $(k, l)$.

Assumption 3.7 (Linear Moment Conditions, Smoothness, and Identification) (i) $\psi(W; \theta, \eta)$ is linear in $\theta$:

$$\psi(w; \theta, \eta) = \psi^a(w; \eta)\theta + \psi^b(w; \eta), \quad \forall w \in \mathcal{W}, \ \theta \in \Theta, \ \eta \in \mathcal{T}.$$
(ii) $\psi(W; \theta, \eta)$ satisfies the Neyman orthogonality conditions 3.14 and 3.15 with respect to the probability measure $P$; or, more generally, 3.15 can be replaced by the $\lambda_{NT}$ near-orthogonality condition
$$\lambda_{NT} := \sup_{\eta \in \mathcal{T}_{NT}} \left\| \partial_r E_P[\psi(W; \theta_0, \eta_0 + r(\eta - \eta_0))]\big|_{r=0} \right\| \leq \delta_{NT}/\sqrt{N}.$$
(iii) The map $\eta \mapsto E_P[\psi(W_{it}; \theta, \eta)]$ is twice continuously Gateaux-differentiable on $\mathcal{T}$.
(iv) The singular values of the matrix $A_0 := E_P[\psi_a(W_{it}; \eta_0)]$ are bounded below by $c_a > 0$.

Assumption DML1(i) restricts the focus of this paper to models with linear orthogonal moment conditions, which covers the model in Section 3.4. For nonlinear orthogonal moment conditions, Chernozhukov et al. (2018a) have shown that the DML estimator has the same desirable properties under more complicated regularity conditions. Focusing on the linear case allows us to pay more attention to issues specifically attributable to panel data. Assumption DML1(ii) slightly relaxes the orthogonality condition 3.15 to a near-orthogonality condition, which is useful for approximately sparse models with approximation errors. Assumption DML1(iii) imposes a mild smoothness assumption on the orthogonal moment condition, and Assumption DML1(iv) is a common identification condition.

Assumption 3.8 (Moment Regularity and First-Steps)
(i) For all $i \geq 1$, $t \geq 1$, and some $q > 2$, $c_m < \infty$, the following moment conditions hold:
$$m_{NT} := \sup_{\eta \in \mathcal{T}_{NT}} \left( E_P\|\psi(W_{it}; \theta_0, \eta)\|^q \right)^{1/q} \leq c_m,$$
$$m'_{NT} := \sup_{\eta \in \mathcal{T}_{NT}} \left( E_P\|\psi_a(W_{it}; \eta)\|^q \right)^{1/q} \leq c_m.$$
(ii) The following conditions on the statistical rates $r_{NT}$, $r'_{NT}$, and $\lambda'_{NT}$ hold for all $i \geq 1$, $t \geq 1$:
$$r_{NT} := \sup_{\eta \in \mathcal{T}_{NT}} \left\| E_P[\psi_a(W_{it}; \eta) - \psi_a(W_{it}; \eta_0)] \right\| \leq \delta_{NT},$$
$$r'_{NT} := \sup_{\eta \in \mathcal{T}_{NT}} \left( E_P\|\psi(W_{it}; \theta_0, \eta) - \psi(W_{it}; \theta_0, \eta_0)\|^2 \right)^{1/2} \leq \delta_{NT},$$
$$\lambda'_{NT} := \sup_{r \in (0,1),\, \eta \in \mathcal{T}_{NT}} \left\| \partial_r^2 E_P[\psi(W_{it}; \theta_0, \eta_0 + r(\eta - \eta_0))] \right\| \leq \delta_{NT}/\sqrt{N}.$$

Assumption DML2 regulates the quality of the first-step nuisance estimators. It follows Chernozhukov et al. (2018a), and it can be verified under primitive conditions in the next section. Observe that, if the orthogonal moment function $\psi(W; \theta, \eta)$ is smooth in $\eta$, then $\lambda'_{NT}$ is the dominant rate, and it imposes a crude rate requirement of order $\varepsilon_{NT} = o(N^{-1/4})$ on the first-step nuisance estimator in the $L^2(P)$ norm, which the two-way cluster LASSO estimator can achieve under proper sparsity assumptions. Furthermore, in some models, including the partial linear model, $\lambda'_{NT}$ can be exactly $0$; then it is possible to obtain the weakest possible rate requirement for the first-step estimator, i.e., $\varepsilon_{NT} = o(1)$.

Theorem 3.2 (Asymptotic Normality and Variance) Suppose Assumptions AHK, AR, ND, DML1, and DML2 hold for any $P \in \mathcal{P}_{NT}$; then, for some $\delta_{NT} \geq N^{-1/2}$, as $(N, T) \to \infty$ jointly,
$$\sqrt{N}\left(\widehat{\theta} - \theta_0\right) = -\sqrt{N}\, A_0^{-1} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \psi(W_{it}; \theta_0, \eta_0) + o_P(1) \Rightarrow \mathcal{N}(0, V),$$
where
$$V := A_0^{-1}\, \Omega\, A_0^{-1\prime}, \qquad \Omega := \Sigma_a + c\, \Sigma_g.$$

We observe that the convergence rate of the two-step estimator $\widehat{\theta}$ resulting from the panel DML procedure is non-standard: it is $\sqrt{N}$-consistent instead of $\sqrt{NT}$-consistent. This is because the cluster dependence introduced by the unit and time components does not decay over time or space. Intuitively, with more persistence, the information carried by the data accumulates more slowly. This is a common feature in the literature on robust inference with cluster dependence,10 and it is also related to inferential theory under strong cross-sectional dependence (e.g., Gonçalves, 2011).

10For example, see Hansen (2007), MacKinnon et al. (2021), Menzel (2021), Chiang et al. (2022), Chiang et al. (2023a), Chiang et al. (2024), and Chen and Vogelsang (2024), among many others.
Due to the presence of unit and time components, the asymptotic variance is made up of the (long-run) variance-covariance matrices of both factors. I consider a two-way cluster robust variance estimator similar to that of Chiang et al. (2024) (the CHS estimator), with an adjustment due to cross-fitting. The variance estimator is motivated under arbitrary dependence in panel data and is shown to be robust to two-way clustering with correlated time effects in linear panel models. As is shown in Chen and Vogelsang (2024), such a variance estimator can be written as an affine combination of three well-known robust variance estimators: the Liang-Zeger-Arellano estimator, the Driscoll-Kraay estimator, and the "average of HACs" estimator. Applying this result, we can define the CHS-type variance estimator as follows:
$$\widehat{V}_{CHS} = \widehat{\bar{A}}^{-1}\, \widehat{\Omega}_{CHS}\, \widehat{\bar{A}}^{-1\prime}, \qquad \widehat{\Omega}_{CHS} = \widehat{\Omega}_A + \widehat{\Omega}_{DK} - \widehat{\Omega}_{NW},$$
where $\widehat{\bar{A}} := \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{1}{N_k T_l} \sum_{i \in I_k, t \in S_l} \psi_a(W_{it}; \widehat{\eta}_{kl})$ and, with the Bartlett kernel $k\left(\frac{m}{M}\right) := 1 - \frac{m}{M}$ for $m = 0, 1, \ldots, M-1$ and $0$ otherwise, and the bandwidth parameter $M$ chosen from $1$ to $T_l$,
$$\widehat{\Omega}_A := \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{1}{N_k T_l^2} \sum_{i \in I_k} \sum_{t \in S_l} \sum_{r \in S_l} \psi(W_{it}; \widehat{\theta}, \widehat{\eta}_{kl})\, \psi(W_{ir}; \widehat{\theta}, \widehat{\eta}_{kl})',$$
$$\widehat{\Omega}_{DK} := \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{K/L}{N_k T_l^2} \sum_{t \in S_l} \sum_{r \in S_l} k\!\left(\frac{|t - r|}{M}\right) \sum_{i \in I_k} \sum_{j \in I_k} \psi(W_{it}; \widehat{\theta}, \widehat{\eta}_{kl})\, \psi(W_{jr}; \widehat{\theta}, \widehat{\eta}_{kl})',$$
$$\widehat{\Omega}_{NW} := \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{K/L}{N_k T_l^2} \sum_{i \in I_k} \sum_{t \in S_l} \sum_{r \in S_l} k\!\left(\frac{|t - r|}{M}\right) \psi(W_{it}; \widehat{\theta}, \widehat{\eta}_{kl})\, \psi(W_{ir}; \widehat{\theta}, \widehat{\eta}_{kl})'.$$

It is noted that the variance estimator under cross-fitting is equivalent to estimating the variance in each sub-sample and then averaging across all sub-samples. Since $K$ and $L$ are fixed, the asymptotic analysis is done at the sub-sample level. The next theorem establishes the consistency of this variance estimator under the conventional small-bandwidth assumption.

Theorem 3.3 (Consistent Variance Estimator) Suppose Assumptions AHK, AR, ND, DML1, and DML2 hold for any $P \in \mathcal{P}_{NT}$ with some $q > 4$ (defined in Assumption DML2), and $M/T^{1/2} = o(1)$. Then, as $N, T \to \infty$ and $N/T \to c$ where $0 < c < \infty$,
$$\widehat{V}_{CHS} = V + o_P(1).$$

Theorem 3.3 can be seen as a generalization of the consistency result for the CHS variance estimator in Chiang et al. (2024), allowing for estimated nuisance parameters in the moment functions. A remaining practical issue is that $\widehat{V}_{CHS}$ is not guaranteed to be positive semi-definite. It has been shown in Chen and Vogelsang (2024) that negative variance estimates occur a non-trivial fraction of the time under certain data-generating processes. Accordingly, an alternative two-term variance estimator was proposed in Chen and Vogelsang (2024). Following the same idea, I propose an alternative variance estimator by dropping the double-counting term $\widehat{\Omega}_{NW}$:
$$\widehat{V}_{DKA} = \widehat{\bar{A}}^{-1}\, \widehat{\Omega}_{DKA}\, \widehat{\bar{A}}^{-1\prime}, \qquad \widehat{\Omega}_{DKA} = \widehat{\Omega}_A + \widehat{\Omega}_{DK}.$$
The estimator is referred to as the DKA variance estimator because it is a sum of the Driscoll-Kraay and Arellano variance estimators.11 Similar approaches can be found in MacKinnon et al. (2021). It relies on the fact that the double-counting term is of small order asymptotically when the panel features two-way clustering.

11Note that the DKA estimator defined in Chen and Vogelsang (2024) differs from the DKA estimator here by a constant term based on fixed-b asymptotic analysis. Such a bias correction is not considered here since the fixed-b properties are not directly applicable in this setting. The conjecture is that the same form of bias correction can be applied here, but formally establishing the fixed-b asymptotic results in the presence of estimated nuisance parameters is challenging and out of the scope of this paper, and so is left for future research.
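For a scalar moment, the three building blocks can be computed directly from an $(N, T)$ array of estimated scores. The helper below is a sketch of the full-sample versions (anticipating 3.23 and 3.24 in Section 3.4); since the cross-fitted estimator is an average of sub-sample variance estimates, the cross-fitted versions only change the index sets and scaling. The function names are mine.

```python
import numpy as np

def bartlett(m, M):
    # k(m/M) = 1 - m/M for m = 0, ..., M-1 and 0 otherwise
    return np.maximum(1.0 - np.abs(m) / M, 0.0)

def omega_pieces(scores, M):
    """Variance building blocks from an (N, T) array of scalar scores
    psi_hat_it.  Returns (Omega_A, Omega_DK, Omega_NW), so that
    Omega_CHS = A + DK - NW and Omega_DKA = A + DK."""
    N, T = scores.shape
    K = bartlett(np.subtract.outer(np.arange(T), np.arange(T)), M)  # kernel weights k(|t-r|/M)
    # Arellano ("cluster by unit"): sum over t, r within each unit i
    omega_a = np.einsum('it,ir->', scores, scores) / (N * T**2)
    # Driscoll-Kraay: kernel-weighted outer sums of cross-sectional totals
    s_t = scores.sum(axis=0)                       # sum over i for each t
    omega_dk = s_t @ K @ s_t / (N * T**2)
    # "average of HACs" (Newey-West within each unit): the double-counted term
    omega_nw = np.einsum('it,tr,ir->', scores, K, scores) / (N * T**2)
    return omega_a, omega_dk, omega_nw
```

With a scalar $\widehat{A}$, the two variance estimates are then simply $(\widehat{\Omega}_A + \widehat{\Omega}_{DK} - \widehat{\Omega}_{NW})/\widehat{A}^2$ and $(\widehat{\Omega}_A + \widehat{\Omega}_{DK})/\widehat{A}^2$; the latter is positive by construction, mirroring the positive semi-definiteness discussion above.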
Similar to other two-term cluster-robust variance estimators, it has the computational advantage of guaranteed positive semi-definiteness, but at the cost of inconsistency in the case of no clustering or clustering at the intersection. For theoretical results and more detailed discussions of the trade-off between ensured positive semi-definiteness and the risk of being too conservative or losing power, readers are referred to MacKinnon et al. (2021) and Chen and Vogelsang (2024).

Theorem 3.4 (Alternative Consistent Variance Estimator) Under the same conditions as Theorem 3.3, we have, as $N, T \to \infty$ and $N/T \to c$ where $0 < c < \infty$,
$$\widehat{V}_{DKA} = \widehat{V}_{CHS} + o_P(1).$$

Theorem 3.4 formally shows that the double-counting term is of small order under two-way clustering, and it implies that $\widehat{V}_{DKA}$ is also consistent for $V$ under two-way clustering.

To conclude, in this section the inferential theory is established for the panel DML estimator under high-level assumptions on the first-step estimator. Even though the rate of convergence of the nuisance estimators can be slow due to the two-way cluster dependence, the cross-fitting approach for panel models allows for valid inference in a general moment restriction model with growing dimensions in the nuisance parameters. In the next section, I study a special case of the semiparametric moment restriction model but consider the complication due to unobserved heterogeneity.

3.4 Partial Linear Model with Unobserved Heterogeneity

In this section, a partial linear model with non-additive unobserved heterogeneous effects is considered. The proposed toolkit is flexible enough to allow for models with instrumental variables used for identification, so I consider the following model: for $i = 1, \ldots, N$ and $t = 1, \ldots, T$,
$$Y_{it} = D_{it}\theta_0 + g(X_{it}, c_i, d_t) + U_{it}, \qquad E[U_{it} | X_{it}, c_i, d_t] = 0, \quad (3.16)$$
where $D_{it}$ is a low-dimensional vector of endogenous variables and $g$ is an unknown function of potentially high-dimensional control variables $X_{it}$ and unobserved heterogeneous effects $(c_i, d_t)$. For clearer presentation, $D_{it}$ is treated as a scalar variable. In practice, $D_{it}$ can contain higher-order terms and interactions with a low-dimensional vector of controls. If the lags or leads of $D_{it}$ are considered exogenous, they can also be included in $X_{it}$. Doing so would not change the theory for estimation and inference but could change the interpretation of $\theta_0$. Consider an excludable instrumental variable $Z_{it}$ such that $E[Z_{it} U_{it}] = 0$, which gives the identifying moment condition. To apply the estimation and inference methods proposed in the previous sections, $g$ is again assumed to be approximately sparse. However, this does not suffice, since $(c_i, d_t)$ are not observed.
To deal with the unobserved heterogeneous effects that cause the endogeneity, I take a correlated random-effects approach through the generalized Mundlak device:

Assumption 3.9 (Generalized Mundlak Device) For each $i = 1, \ldots, N$ and $t = 1, \ldots, T$,
$$c_i = h_c(\bar{F}_i, \epsilon^c_i), \quad (3.17)$$
$$d_t = h_d(\bar{F}_t, \epsilon^d_t), \quad (3.18)$$
where $\bar{F}_i = \frac{1}{T}\sum_{t=1}^{T} F_{it}$, $\bar{F}_t = \frac{1}{N}\sum_{i=1}^{N} F_{it}$, and $F_{it} := (D_{it}, X_{it}')'$; $h_c$ and $h_d$ are unknown measurable functions; the stochastic errors $(\epsilon^c_i, \epsilon^d_t)$ are independent of $(\bar{F}_i, \bar{F}_t, X_{it}, Z_{it}, U^D_{it}, U^Y_{it})$; and $(c_i, d_t, \epsilon_i, \epsilon_t)$ are independent of $U_{it}$.

To justify its use, we shall recall the idea of the conventional Mundlak device. Because of the correlation between $(c_i, d_t)$ and the covariates, an endogeneity issue arises if we do not control for the unobserved heterogeneity. To explicitly model the correlation between the random effects and the covariates, Mundlak (1978) proposes an auxiliary regression of the random effects on the cross-sectional sample average and shows that, if the random effects enter the model linearly, the resulting GLS estimator is equivalent to the common within-estimator. Wooldridge (2021) further shows that the same equivalence holds among the POLS estimators resulting from the Mundlak device, the within-transformation, and fixed-effects dummies. Therefore, if the within-transformation and fixed-effects dummies are sensible ways of dealing with unobserved heterogeneity, then allowing the Mundlak device to have a more flexible functional form should also be reasonable, and more robust. A similar assumption is considered in Wooldridge and Zhu (2020).

It may seem that one can simply apply the panel DML approach from Section 3.3.1 with the two-way cluster LASSO estimator employed as the first-step estimator, except that there is a subtle issue: the Mundlak device uses the full history of the covariates, which potentially generates dependence across the cross-fitting sub-samples. Similar issues appear in a simple linear panel model with additive unobserved effects, where the within-transformation also introduces sample averages. Therefore, cross-fitting may not be compatible with approaches dealing with unobserved heterogeneity, including the proposed generalized Mundlak device. Without cross-fitting, however, it is in general challenging to establish an inferential theory with growing dimensionality in the unknown parameters. Nevertheless, as is shown below, it is possible to establish the asymptotic normality of the panel DML estimator using the full sample in both the first and second steps for the partial linear model under a strengthened sparsity condition. This is helpful not only because of the presence of unobserved heterogeneous effects but also because cross-fitting can be computationally costly and comes at a cost of efficiency loss.

Under model 3.16, $g(X_{it}, c_i, d_t) = E[Y_{it} - D_{it}\theta_0 | X_{it}, c_i, d_t]$. We can rewrite 3.16 as follows:
$$Y_{it} = \left(D_{it} - g_D(X_{it}, c_i, d_t)\right)\theta_0 + g_Y(X_{it}, c_i, d_t) + U_{it},$$
where $g_D(X_{it}, c_i, d_t) := E[D_{it} | X_{it}, c_i, d_t]$ and $g_Y(X_{it}, c_i, d_t) := E[Y_{it} | X_{it}, c_i, d_t]$.
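As an illustration of Assumption GMD in estimable form, the sketch below constructs the averages $\bar{F}_i$ and $\bar{F}_t$ from observed panels and builds the observable part of the polynomial dictionary, $L_{1,it} = L_\tau(X_{it}, \bar{F}_i, \bar{F}_t)$, used below. Representing the $\tau = 2$ transformation by squares and pairwise interactions is my own simplification.

```python
import numpy as np
from itertools import combinations_with_replacement

def mundlak_dictionary(D, X, tau=2):
    """Build f_it = (L_tau(X_it, F_bar_i, F_bar_t), 1) observation by
    observation.  D: (N, T); X: (N, T, p).  Returns an (N*T, dim) matrix.
    Only tau <= 2 is implemented in this sketch."""
    N, T, p = X.shape
    F = np.concatenate([D[..., None], X], axis=2)     # F_it = (D_it, X_it')'
    F_i = F.mean(axis=1, keepdims=True)               # F_bar_i: average over t
    F_t = F.mean(axis=0, keepdims=True)               # F_bar_t: average over i
    base = np.concatenate([X,
                           np.broadcast_to(F_i, (N, T, p + 1)),
                           np.broadcast_to(F_t, (N, T, p + 1))],
                          axis=2).reshape(N * T, -1)
    cols = [np.ones(N * T)] + [base[:, j] for j in range(base.shape[1])]
    if tau >= 2:                                      # squares and interactions
        for j, k in combinations_with_replacement(range(base.shape[1]), 2):
            cols.append(base[:, j] * base[:, k])
    return np.column_stack(cols)
```

The unobserved arguments $(\epsilon^c_i, \epsilon^d_t)$ of course cannot enter the dictionary; as shown next, the terms involving them are absorbed into the error and the intercept.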
Under Assumption GMD, $g_D(X_{it}, c_i, d_t)$ and $g_Y(X_{it}, c_i, d_t)$ can be rewritten as compound functions, which are assumed to be well approximated by a linear combination of a $\tau$-th order polynomial transformation $L_\tau$ as follows:
$$g^*_D(X_{it}, \bar{F}_i, \epsilon^c_i, \bar{F}_t, \epsilon^d_t) := g_D(X_{it}, h_c(\bar{F}_i, \epsilon^c_i), h_d(\bar{F}_t, \epsilon^d_t)) = L_\tau\!\left(X_{it}, \bar{F}_i, \bar{F}_t, \epsilon^c_i, \epsilon^d_t\right)\eta_D + r^D_{it}, \quad (3.19)$$
$$g^*_Y(X_{it}, \bar{F}_i, \epsilon^c_i, \bar{F}_t, \epsilon^d_t) := g_Y(X_{it}, h_c(\bar{F}_i, \epsilon^c_i), h_d(\bar{F}_t, \epsilon^d_t)) = L_\tau\!\left(X_{it}, \bar{F}_i, \bar{F}_t, \epsilon^c_i, \epsilon^d_t\right)\eta_Y + r^Y_{it}, \quad (3.20)$$
where $(\eta_D, \eta_Y)$ are slope coefficients and $(r^D_{it}, r^Y_{it})$ are the approximation errors. Furthermore, we can define a vector of transformed regressors $L_{1,it} = L_\tau(X_{it}, \bar{F}_i, \bar{F}_t)$ and a vector of unobserved regressors $L_{2,it} = L_\tau(X_{it}, \bar{F}_i, \bar{F}_t, \epsilon^c_i, \epsilon^d_t) \setminus L_\tau(X_{it}, \bar{F}_i, \bar{F}_t)$. Let $(\eta_{D,1}, \eta_{D,2})$ be such that $\eta_D = \eta_{D,1} \oplus \eta_{D,2}$ and
$$L_\tau\!\left(X_{it}, \bar{F}_i, \bar{F}_t, \epsilon^c_i, \epsilon^d_t\right)\eta_D = L_{1,it}\,\eta_{D,1} + L_{2,it}\,\eta_{D,2};$$
$(\eta_{Y,1}, \eta_{Y,2})$ are defined in the same way. Under the sparse approximation and Assumption GMD, we can rewrite model 3.16 as follows:
$$Y_{it} = \left(D_{it} - L_{1,it}\eta_{D,1} - L_{2,it}\eta_{D,2} - r^D_{it}\right)\theta_0 + L_{1,it}\eta_{Y,1} + L_{2,it}\eta_{Y,2} + r^Y_{it} + U_{it}.$$
By defining a new error term $V^g_{it} := \left(L_{2,it} - E[L_{2,it}]\right)\left(\eta_{Y,2} - \eta_{D,2}\theta_0\right) + U_{it}$, a new approximation error $r_{it} = r^Y_{it} - r^D_{it}\theta_0$, the vector of observables $f_{it} := (L_{1,it}, 1)$ with dimension denoted by $p$, and the nuisance vectors $\beta_0 := \left(\eta_{Y,1}, E[L_{2,it}]\eta_{Y,2}\right)$ and $\pi_0 := \left(\eta_{D,1}, E[L_{2,it}]\eta_{D,2}\right)$, we can rewrite the model above as
$$Y_{it} = \left(D_{it} - f_{it}\pi_0\right)\theta_0 + f_{it}\beta_0 + r_{it} + V^g_{it}. \quad (3.21)$$
Noticeably, in this case, the parameters associated with the unobservables $L_{2,it}$ can be arbitrarily non-sparse. Given $E[Z_{it}U_{it}] = 0$ and the independence between $Z_{it}$ and $(\epsilon^c_i, \epsilon^d_t)$, we have the identifying moment condition $E[Z_{it}V^g_{it}] = 0$. Let $\zeta_0$ be the linear projection parameter of $Z_{it}$ onto $f_{it}$, and let $V^Z_{it}$ be the corresponding linear projection errors. By Chernozhukov et al. (2018a, equation (2.18)), the (near) Neyman-orthogonal moment function is given by
$$\psi_{it}(\theta_0, \eta_0) := \left(Z_{it} - f_{it}\zeta_0\right)\left(Y_{it} - f_{it}\beta_0 - \left(D_{it} - f_{it}\pi_0\right)\theta_0\right), \quad (3.22)$$
where we denote $\eta_0 = (\zeta_0, \beta_0, \pi_0)$. Under the sparse approximation, we can also rewrite the conditional expectation models for $Y$ and $D$ as
$$Y_{it} = E[Y_{it} | X_{it}, c_i, d_t] + U^Y_{it} = f_{it}\beta_0 + r^Y_{it} + V^Y_{it},$$
$$D_{it} = E[D_{it} | X_{it}, c_i, d_t] + U^D_{it} = f_{it}\pi_0 + r^D_{it} + V^D_{it},$$
where $V^Y_{it} = \left(L_{2,it} - E[L_{2,it}]\right)\eta_{Y,2} + U^Y_{it}$ and $V^D_{it} = \left(L_{2,it} - E[L_{2,it}]\right)\eta_{D,2} + U^D_{it}$. For $l = Z, Y, D$, let $\omega_l$ be the infeasible penalty weights for the two-way cluster LASSO, as defined in 3.11 with $V_{it}$ replaced by $V^l_{it}$. Correspondingly, let $\widehat{V}^l_{it}$ be the residuals and $\widehat{\omega}_l$ be the feasible penalty weights of the two-way cluster LASSO estimations of $(\zeta_0, \beta_0, \pi_0)$. The two-step debiased estimator $\widehat{\theta}$ of $\theta_0$ using the full sample is defined as the solution of $\mathbb{E}_{NT}[\psi_{it}(\theta, \widehat{\eta})] = 0$, where $\widehat{\eta}$ are the (post) two-way cluster LASSO estimators of $\eta_0$ obtained in the first step using the full sample.

The additional notation introduced below is used in the statistical analysis and in delivering the main results:
$$a_i = E[V^Z_{it}V^g_{it} | \alpha_i], \qquad g_t = E[V^Z_{it}V^g_{it} | \gamma_t], \qquad \Sigma_a = E[a_i a_i'], \qquad \Sigma_g = \sum_{l=-\infty}^{\infty} E[g_t g_{t+l}'],$$
$$A_0 = E_P[V^Z_{it}V^D_{it}], \qquad \Omega_0 = \Sigma_a + c\,\Sigma_g,$$
$$a_{i,j,l} = E[f_{it,j}V^l_{it} | \alpha_i], \qquad g_{t,j,l} = E[f_{it,j}V^l_{it} | \gamma_t], \qquad l = Z, Y, D.$$

Assumption 3.10 (Regularity Conditions for the Partial Linear Model)
(i) $A_0$ is non-singular.
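Given the dictionary $f_{it}$ and any first-step fitter, solving $\mathbb{E}_{NT}[\psi_{it}(\theta, \widehat{\eta})] = 0$ for the linear moment 3.22 reduces to a ratio of inner products of first-step residuals; in this sketch, `lasso_fit` is a placeholder for the (post) two-way cluster LASSO.

```python
import numpy as np

def debiased_theta(Y, D, Z, f, lasso_fit):
    """Full-sample two-step estimator for the partial linear IV model:
    solve E_NT[(Z - f zeta)(Y - f beta - (D - f pi) theta)] = 0 in theta.
    Y, D, Z: (N*T,) vectors; f: (N*T, dim) dictionary; lasso_fit(y, f)
    returns a coefficient vector (a stand-in for the two-way cluster LASSO)."""
    vz = Z - f @ lasso_fit(Z, f)   # first-step residuals V_hat^Z
    vy = Y - f @ lasso_fit(Y, f)   # Y - f beta_hat
    vd = D - f @ lasso_fit(D, f)   # first-step residuals V_hat^D
    return (vz @ vy) / (vz @ vd)
```

Standard errors then follow by feeding the estimated scores $\widehat{\psi}_{it} = \widehat{V}^Z_{it}(\widehat{V}^Y_{it} - \widehat{V}^D_{it}\widehat{\theta})$ into the variance formulas 3.23 and 3.24 below.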
(ii) For any $\epsilon$, $h_c(F, \epsilon)$ and $h_d(F, \epsilon)$ are invertible in $F$.
(iii) For some $\mu > 1$ and $\delta > 0$, $\max_{j \leq p} E[|f_{it,j}|^{8(\mu+\delta)}] < \infty$ and $E[|V^l_{it}|^{8(\mu+\delta)}] < \infty$ for $l = g, D, Y, Z$.
(iv) Either $\lambda_{min}[\Sigma_a] > 0$ or $\lambda_{min}[\Sigma_g] > 0$; and $\min_{j \leq p} \mathbb{E}[(a^l_{i,j})^2] > 0$ and $\min_{j \leq p} \mathbb{E}[(g^l_{t,j})^2] > 0$ for $l = D, Y, Z$.
(v) $\log(p/\gamma) = o\left(T^{1/6}/(\log T)^2\right)$ and $p = o\left(T^{7/6}/(\log T)^2\right)$.
(vi) The feasible penalty weights $\widehat{\omega}_l$ satisfy condition 3.12 for $l = D, Y, Z$.

This set of regularity conditions follows from the assumptions for the two-way cluster LASSO and the panel DML inference. The only extra condition is Assumption REG-P(ii), a smoothness condition that ensures the exogeneity properties of $\bar{F}_i$ and $\bar{F}_t$ inherited from $(c_i, \epsilon_i)$ and $(d_t, \epsilon_t)$.

Theorem 3.5 Suppose, for $P = P_{NT}$ for each $(N, T)$, the following conditions hold for model 3.16 and $W_{it} = (Y_{it}, D_{it}, X_{it}, Z_{it}, U_{it}, c_i, d_t, \epsilon_i, \epsilon_t)$: (i) Assumptions AHK, AR, SE, GMD, and REG-P; (ii) the sparse approximation in 3.19 and 3.20 with $s = o\left(\frac{\sqrt{N \wedge T}}{\log(p/\gamma)}\right)$ and $\|r^l_{it}\|_{NT,2} = o_P\left(\sqrt{\tfrac{1}{N \wedge T}}\right)$ for $l = Y, D$. Then, as $N, T \to \infty$ and $N/T \to c$ where $0 < c < \infty$,
$$\sqrt{N}\left(\widehat{\theta} - \theta_0\right) \xrightarrow{d} \mathcal{N}(0, V),$$
where $V := A_0^{-1}\Omega_0 A_0^{-1}$.

Theorem 3.5 establishes the validity of the proposed inference procedure using the full sample. Note that the sparsity condition and the condition on the approximation errors are stronger than those needed for the two-way cluster LASSO estimation itself. To estimate the asymptotic variance, the following variance estimators, adapted from Chiang et al. (2024) and Chen and Vogelsang (2024), are computed using the full sample:
$$\widehat{V}_{CHS} = \widehat{A}_{NT}^{-1}\, \widehat{\Omega}_{CHS}\, \widehat{A}_{NT}^{-1\prime}, \qquad \widehat{\Omega}_{CHS} = \widehat{\Omega}_A + \widehat{\Omega}_{DK} - \widehat{\Omega}_{NW}, \quad (3.23)$$
$$\widehat{V}_{DKA} = \widehat{A}_{NT}^{-1}\, \widehat{\Omega}_{DKA}\, \widehat{A}_{NT}^{-1\prime}, \qquad \widehat{\Omega}_{DKA} = \widehat{\Omega}_A + \widehat{\Omega}_{DK}, \quad (3.24)$$
where
$$\widehat{A}_{NT} := \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} \left(Z_{it} - f_{it}\widehat{\zeta}\right)\left(D_{it} - f_{it}\widehat{\pi}\right),$$
$$\widehat{\Omega}_A := \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T} \psi_{it}(\widehat{\theta}, \widehat{\eta})\,\psi_{ir}(\widehat{\theta}, \widehat{\eta})',$$
$$\widehat{\Omega}_{DK} := \frac{1}{NT^2}\sum_{t=1}^{T}\sum_{r=1}^{T} k\!\left(\frac{|t - r|}{M}\right) \sum_{i=1}^{N}\sum_{j=1}^{N} \psi_{it}(\widehat{\theta}, \widehat{\eta})\,\psi_{jr}(\widehat{\theta}, \widehat{\eta})',$$
$$\widehat{\Omega}_{NW} := \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T} k\!\left(\frac{|t - r|}{M}\right) \psi_{it}(\widehat{\theta}, \widehat{\eta})\,\psi_{ir}(\widehat{\theta}, \widehat{\eta})'.$$

For simplicity, we deliver the consistency results for the variance estimators assuming the approximation is exact. Allowing for approximation errors does not change the main idea but only requires more regularity conditions on the approximation error and lengthier derivations.

Theorem 3.6 Suppose the assumptions for Theorem 3.5 hold for $P = P_{NT}$ for each $(N, T)$ with $r^D_{it} = r^Y_{it} = 0$ a.s., and $M/T^{1/2} = o(1)$. Then, as $(N, T) \to \infty$ and $N/T \to c$ where $0 < c < \infty$,
$$\widehat{V}_{CHS} = V + o_P(1), \qquad \widehat{V}_{DKA} = \widehat{V}_{CHS} + o_P(1).$$

3.5 Monte Carlo Simulation

In this section, the finite-sample performance of the panel DML estimation and inference procedures is examined in a Monte Carlo simulation study. We start with an exactly sparse linear model without approximation errors or unobserved heterogeneous effects, and then further consider the partial linear model with correlated random effects.
First, the linear model with high-dimensional covariates and exact sparsity is specified as follows:
$$\text{DGP(i) - Linear Model:} \qquad Y_{it} = D_{it}\theta_0 + X_{it}\beta_0 + U_{it}, \qquad D_{it} = X_{it}\pi_0 + V_{it},$$
where $\theta_0 = 1/2$ is the true parameter of interest, and $\beta_0 = c_\beta \times (1, 1, \ldots, 1, 0, \ldots, 0)'$ and $\pi_0 = c_\pi \times (1, 1, \ldots, 1, 0, \ldots, 0)'$ are $p$-dimensional nuisance parameters whose first $s$ entries are $1$ and whose remaining entries are $0$; $c_\beta$ and $c_\pi$ are constants that control the relevance of the covariates.

Second, the partial linear model with correlated random effects is specified as follows:
$$\text{DGP(ii) - Partial Linear Model:} \qquad Y_{it} = D_{it}\theta_0 + \left(X_{it}\beta_0 + c_i + d_t\right)^2 + U_{it},$$
$$D_{it} = \frac{\exp(X_{it}\pi_0)}{1 + \exp(X_{it}\pi_0)} + V_{it}, \qquad c_i = \bar{D}_i + \bar{X}_i\xi_0 + \epsilon^c_i, \qquad d_t = \bar{D}_t + \bar{X}_t\zeta_0 + \epsilon^d_t,$$
where $\beta_0 = c_\beta \times \left(1/2^2, 1/2^3, \ldots, 1/2^{p+1}\right)'$, $\pi_0 = c_\pi \times \left(1/2^2, 1/2^3, \ldots, 1/2^{p+1}\right)'$, $\xi_0 = c_\xi \left(1/2^2, 1/2^3, \ldots, 1/2^{p+1}\right)'$, and $\zeta_0 = c_\zeta \left(1/2^2, 1/2^3, \ldots, 1/2^{p+1}\right)'$; $\epsilon^c_i$ and $\epsilon^d_t$ are each random draws from the uniform distribution $U(0, 1)$. The nuisance functions in both $Y$ and $D$ are taken as unknown. Although these nuisance functions are not exactly sparse, they are smooth enough to be well approximated by a polynomial series. The correlated random effects are generated by the Mundlak device, which is taken as known and is used for estimation.

For the linear model, to feature the two-way dependence in $V_{it}U_{it}$ as well as in $X_{it}U_{it}$ and $X_{it}V_{it}$, $(X_{it}, U_{it}, V_{it})$ are generated from underlying components as follows: for each $j = 1, \ldots, p$,
$$\text{DGP(i) - Additive Components:} \qquad X_{it,j} = w_1\alpha_{i,j} + w_2\gamma_{t,j} + w_3\varepsilon_{it,j},$$
$$U_{it} = w_1\alpha^u_i + w_2\gamma^u_t + w_3\varepsilon^u_{it}, \qquad V_{it} = w_1\alpha^v_i + w_2\gamma^v_t + w_3\varepsilon^v_{it},$$
where the components $\alpha^u_i$, $\alpha^v_i$, $\varepsilon^u_{it}$, $\varepsilon^v_{it}$, $\alpha_{i,j}$, and $\gamma_{t,j}$ are each random draws from the uniform distribution $U(-\sqrt{3}, \sqrt{3})$ for each $j$; $\varepsilon_{it} = (\varepsilon_{it,1}, \ldots, \varepsilon_{it,p})'$ is a random draw from a joint normal distribution with mean $1$ and variance-covariance matrix with $\iota^{|j-k|}$, $\iota \in [0, 1)$, in the $(j, k)$ entry; and the components $\gamma^u_t$ and $\gamma^v_t$ each follow an AR(1) process with coefficient $\rho$ and initial values randomly drawn from the normal distribution with mean $0$ and variance $1 - \rho^2$, for some $\rho \in [0, 1)$. The weights $(w_1, w_2, w_3)$ are non-negative with $w_1^2 + w_2^2 + w_3^2 = 1$. The default weights are $w_1 = w_2 = w_3 = 1/\sqrt{3}$.

For the partial linear model, the Mundlak device is used for estimation. It is well known that the Mundlak device is mechanically equivalent to the within-transformation in a linear panel model, in which case the within-transformation would also remove the additive components in DGP(i) and eliminate the two-way dependence in the within-transformed random variables. When the true model is partially linear in the covariates, the Mundlak device also projects out many underlying components and removes most of the dependence driven by the additive components. To illustrate that this is not necessarily the case in general, a multiplicative component structure is considered as follows:
$$\text{DGP(ii) - Multiplicative Components:} \qquad X_{it,j} = w_1\alpha_{i,j} + w_2\gamma_{t,j} + w_3\varepsilon_{it,j},$$
$$U_{it} = \frac{w_4}{c_p}\sum_{j=1}^{p}\left[\alpha^u_i\gamma_{t,j} + \alpha_{i,j}\gamma^u_t\right] + w_5\varepsilon^u_{it}, \qquad V_{it} = \frac{w_4}{c_p}\sum_{j=1}^{p}\left[\alpha^v_i\gamma_{t,j} + \alpha_{i,j}\gamma^v_t\right] + w_5\varepsilon^v_{it},$$
where the components are generated in the same way as in DGP(i) - Additive Components. The weights are non-negative with $w_1^2 + w_2^2 + w_3^2 = 1$ and $w_4^2 + w_5^2 = 1$. The default weights are $w_1 = w_2 = (2/5)^{0.5}$, $w_3 = w_5 = (1/5)^{0.5}$, and $w_4 = (4/5)^{0.5}$.
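A compact simulation of DGP(i)'s additive component structure may clarify the construction. Treating the AR(1) innovations as $N(0, 1-\rho^2)$, so that the time factors have variance (asymptotically) equal to one, is an assumption made for this sketch.

```python
import numpy as np

def dgp_i_components(N, T, p, rho=0.5, iota=0.5, w=(3**-0.5, 3**-0.5, 3**-0.5), seed=0):
    """Sketch of DGP(i): generate (X, U, V) with the additive two-way
    component structure described above."""
    rng = np.random.default_rng(seed)
    u3 = 3**0.5                                    # U(-sqrt(3), sqrt(3)): mean 0, var 1
    a_u, a_v = rng.uniform(-u3, u3, (2, N, 1))     # alpha_i^u, alpha_i^v
    a_x = rng.uniform(-u3, u3, (N, 1, p))          # alpha_{i,j}
    g_x = rng.uniform(-u3, u3, (1, T, p))          # gamma_{t,j}
    e_u, e_v = rng.uniform(-u3, u3, (2, N, T))     # eps_it^u, eps_it^v
    cov = iota ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    e_x = rng.multivariate_normal(np.ones(p), cov, (N, T))  # eps_it: mean 1, Toeplitz cov

    def ar1():                                     # gamma_t^u, gamma_t^v
        g = np.empty(T)
        g[0] = rng.normal(0.0, (1 - rho**2) ** 0.5)
        for t in range(1, T):
            g[t] = rho * g[t - 1] + rng.normal(0.0, (1 - rho**2) ** 0.5)
        return g

    w1, w2, w3 = w
    X = w1 * a_x + w2 * g_x + w3 * e_x             # (N, T, p)
    U = w1 * a_u + w2 * ar1() + w3 * e_u           # broadcasts to (N, T)
    V = w1 * a_v + w2 * ar1() + w3 * e_v
    return X, U, V
```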
$c_p$ is a scaling factor that ensures the sums of multiplicative components in both $U_{it}$ and $V_{it}$ have variance around $1$. With the default weights, $c_p$ is set to $3/2$. The multiplicative components construction here is a generalization of the example in Chiang et al. (2024). To see why $U_{it}V_{it}$ features a component structure, we can expand the product and observe that it includes terms such as $\alpha^u_i\alpha^v_i\gamma^2_{t,j}$ for $j = 1, \ldots, p$, whose conditional expectations given $\alpha = (\alpha^u_i, \alpha^v_i, \alpha_{i,1}, \ldots, \alpha_{i,p})$ are $\alpha^u_i\alpha^v_i$, since $\gamma_{t,j}$ has variance $1$ and is independent of $\alpha$. Likewise, the product also includes terms like $\gamma^u_t\gamma^v_t\alpha^2_{i,j}$, whose conditional expectations given $\gamma = (\gamma^u_t, \gamma^v_t, \gamma_{t,1}, \ldots, \gamma_{t,p})$ are $\gamma^u_t\gamma^v_t$. Importantly, these underlying common factors do not introduce endogeneity, as they may seem to.

The simulation study examines the Monte Carlo bias (Bias), standard deviation (SD), mean square error (MSE), and coverage probability of estimators of $\theta_0$. All estimations are based on the orthogonal moment condition given by 3.22 with $Z_{it} = D_{it}$ ($f_{it} = X_{it}$ in DGP(i)). The comparison is among procedures with and without cross-fitting. The first-step estimations are based on the POLS estimator (if feasible), the post heteroskedasticity-robust LASSO from Belloni et al. (2012), the post square-root LASSO from Belloni et al. (2011), the post cluster-robust LASSO from Belloni et al. (2016), and the post two-way cluster-LASSO. The CHS-type and DKA-type variance estimators (with different formulas for estimation with and without cross-fitting) are used to obtain sample coverage probabilities. In some unreported simulations, I also compared the CHS/DKA-type variance estimators with the Eicker-Huber-White-type estimators in Chernozhukov et al. (2018a) for random-sampling data and the Cameron-Gelbach-Miller-type estimator from Chiang et al. (2022) for multiway clustered data. Since it is well known that inference based on variance estimators that do not sufficiently account for the dependence causes over-rejection, those results are omitted here.

The simulation results are based on 1000 Monte Carlo replications. This is a relatively small number of replications, but it is necessitated by the high computational cost of the multiple high-dimensional estimation and inference procedures, particularly with cross-fitting. Results are obtained across DGPs that vary in the sample sizes $(N, T)$, the dimension of the covariates $p$, the number of non-zero slope coefficients $s$, the other sparsity parameter $b$, the common coefficient $a$, the multicollinearity parameter $\iota$, and the temporal correlation parameter $\rho$. For the panel DML inferential procedure with cross-fitting, the tuning parameters $(K, L)$, the numbers of cross-fitting blocks, need to be chosen. For variance estimation, the bandwidth parameter $M$ of the Bartlett kernel is required. I use the min-MSE rule from Andrews (1991) for both purposes. For a generic scalar score $v_{it}$, the formula is given as follows:
$$\widehat{M} = 1.8171\left(\frac{\widehat{\rho}^2}{\left(1 - \widehat{\rho}^2\right)^2}\right)^{1/3} T^{1/3} + 1,$$
where $\widehat{\rho}$ is the OLS estimator from the regression $\bar{v}_t = \rho\bar{v}_{t-1} + \eta_t$ with $\bar{v}_t = \frac{1}{N}\sum_{i=1}^{N}\widehat{v}_{it}$ and $\widehat{v}_{it} = \widehat{U}_{it}\widehat{V}_{it}$.
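A direct transcription of the rule follows; reading the AR(1) regression as a no-intercept OLS fit is my interpretation of the formula above.

```python
import numpy as np

def min_mse_bandwidth(v_hat):
    """Andrews (1991) min-MSE rule for the Bartlett kernel, applied to the
    cross-sectional averages of a generic (N, T) score array v_hat."""
    v_bar = v_hat.mean(axis=0)                   # v_bar_t = (1/N) sum_i v_hat_it
    rho = np.sum(v_bar[1:] * v_bar[:-1]) / np.sum(v_bar[:-1] ** 2)  # AR(1) slope
    T = v_bar.shape[0]
    M = 1.8171 * (rho**2 / (1 - rho**2) ** 2) ** (1 / 3) * T ** (1 / 3) + 1
    return int(M)                                # truncate to an integer bandwidth
```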
Table 3.5.1: DGP(i) with $N = T = 25$, $s = 5$, $p = 200$, $\iota = 0.5$, $\rho = 0.5$, $c_\beta = c_\pi = 0.5$

Cross    First-Step   First-Step Ave.     Second-Step               Coverage (%)
Fitting  Estimator    Sel. Y    Sel. D    Bias    SD     RMSE       CHS    DKA
No       POLS         200       200       0.003   0.053  0.053      78.9   95.1
No       H LASSO      26.0      26.0      0.062   0.065  0.090      58.5   78.7
No       R LASSO      17.6      17.6      0.070   0.067  0.097      65.2   79.5
No       C LASSO      8.9       8.6       0.036   0.095  0.101      80.0   87.5
No       TW LASSO     6.9       6.7       0.023   0.096  0.099      84.3   90.4
Yes      POLS         200       200       0.006   0.113  0.113      98.2   99.4
Yes      H LASSO      16.6      16.9      0.053   0.131  0.141      96.0   97.6
Yes      R LASSO      9.5       9.5       0.054   0.130  0.141      96.0   98.2
Yes      C LASSO      8.1       8.0       0.041   0.130  0.136      96.2   97.4
Yes      TW LASSO     6.4       6.7       0.057   0.126  0.138      95.8   97.2

Note: Simulation results are based on 1000 replications. Tuning parameters: $(K, L) = (4, 8)$, $C_\lambda = 2$, and $\gamma = 0.1/\log(p \vee N \vee T)$. The 10 most relevant regressors (based on the sample correlation with the outcome) are used for initial estimation, and at most 10 iterations are used in calculating the penalty weights. H: heteroskedastic-LASSO; R: square-root-LASSO; C: cluster-LASSO; TW: two-way cluster-LASSO. Post-LASSO POLS is performed in all first steps. Nominal coverage probability: 0.95.

Table 3.5.1 presents a set of baseline results obtained for a decent number of regressors ($p = 200$), among which 5 are associated with non-zero slope coefficients. The number of covariates is much larger than either the cross-sectional or the temporal dimension. On the other hand, the number of non-zero coefficients can be regarded as of small order relative to the sample sizes, approximately satisfying the sparsity condition. In the first step, model selection is done using the different LASSO approaches reported in the second column. The numbers of selected regressors for $Y$ and $D$ are reported in the third and fourth columns. First, comparing the results obtained without cross-fitting, it is shown that, when the number of regressors is not extremely large relative to the sample size, the POLS estimator dominates the sparse methods based on the different LASSOs in terms of Monte Carlo bias, standard deviation, and coverage probability obtained using DKA standard errors, even though the true model is sparse. Among the sparse methods, the proposed two-way cluster-LASSO exhibits the smallest bias and the best coverage, while its standard deviation is slightly larger than those of the heteroskedastic-robust LASSO and the square-root-LASSO. In terms of selection, the proposed method selects the number of regressors closest to the true number of relevant regressors, while the other sparse methods over-select to different extents.

Table 3.5.2: DGP(i) with $N = T = 25$, $s = 5$, $p = 600$, $\iota = 0.5$, $\rho = 0.5$, $c_\beta = c_\pi = 0.5$

Cross    First-Step   First-Step Ave.     Second-Step               Coverage (%)
Fitting  Estimator    Sel. Y    Sel. D    Bias    SD     RMSE       CHS    DKA
No       POLS         600       600       0.008   0.221  0.221      26.6   38.6
No       H LASSO      39.8      39.5      0.073   0.049  0.087      51.2   78.9
No       R LASSO      25.3      25.1      0.079   0.055  0.097      52.4   79.1
No       C LASSO      15.2      14.0      0.058   0.096  0.112      68.8   78.4
No       TW LASSO     7.5       6.9       0.033   0.098  0.103      81.6   88.1
Yes      H LASSO      24.7      24.8      0.056   0.134  0.146      94.5   98.4
Yes      R LASSO      12.1      12.1      0.054   0.137  0.147      94.5   96.1
Yes      C LASSO      11.6      10.7      0.043   0.139  0.145      95.1   96.1
Yes      TW LASSO     7.6       6.8       0.065   0.140  0.154      90.7   95.1

Note: Simulation results are based on 1000 replications. Tuning parameters: $(K, L) = (4, 8)$, $C_\lambda = 2$, and $\gamma = 0.1/\log(p \vee N \vee T)$. The 10 most relevant regressors (based on the sample correlation with the outcome) are used for initial estimation, and at most 10 iterations are used in calculating the penalty weights. H: heteroskedastic-LASSO; R: square-root-LASSO; C: cluster-LASSO; TW: two-way cluster-LASSO. Post-LASSO POLS is performed in all first steps. Nominal coverage probability: 0.95.
When cross-fitting is employed, all methods see a significant improvement in sample coverage. This is particularly true for the LASSO-based methods that are not designed for dependent data. This is not too surprising, because those non-robust sparse methods tend to over-select, and cross-fitting is designed to remove the overfitting bias and to restore asymptotic normality. As a cost of cross-fitting, the Monte Carlo standard deviation increases, indicating the efficiency loss due to the exclusion of sub-samples in the first-step estimation. It is also worth emphasizing that the CHS- and DKA-type variance estimators designed for the cross-fitting approaches play an important role in the desirable sample coverage. In some unreported simulations, it is shown that inference based on the cross-fitting variance estimators proposed in Chernozhukov et al. (2018a) and Chiang et al. (2022) suffers from severe under-coverage. This is not surprising, but the implication is more subtle: while two-way dependence potentially affects both estimation and inference, its negative impact on the inference is more salient.

As the dimension of the covariates significantly increases and becomes as large as the overall sample size (so that POLS remains in the competition), a different pattern is revealed. Table 3.5.2 reports simulation results under DGP(i) except that the dimension $p$ now increases to 600, slightly smaller than the overall sample size of 625. First, we compare the results obtained without cross-fitting. The simulation results demonstrate that the methods based on POLS with no selection and those based on the existing LASSO approaches with over-selection all suffer from severe under-coverage. The proposed method, in contrast, continues to select the number of relevant regressors closest to the true number, regardless of the increased number of irrelevant regressors. When cross-fitting is performed, there is again a significant improvement across all approaches in terms of sample coverage, but it again comes at the cost of an efficiency loss measured by the increase in SD.

We have seen the case with exact sparsity in Tables 3.5.1 and 3.5.2. As claimed in the theory, the proposed estimation and inference procedures are also valid under approximate sparsity.

Table 3.5.3: DGP(ii) with $N = T = 25$, $s = p = 10$, $\iota = 0.5$, $\rho = 0.5$, $c_\beta = 1$, $c_\pi = 4$, $c_\xi = c_\zeta = 1/4$; 2nd-order polynomial series are used for approximation

Cross    First-Step   First-Step Ave.     Second-Step               Coverage (%)
Fitting  Estimator    Sel. Y    Sel. D    Bias    SD     RMSE       CHS    DKA
No       POLS         560       560       0.012   0.173  0.173      54.4   67.4
No       H LASSO      3.4       12.2      0.032   0.126  0.130      87.2   90.8
No       R LASSO      3.3       11.0      0.030   0.127  0.130      86.2   91.0
No       C LASSO      24.7      12.3      0.030   0.127  0.130      87.8   91.8
No       TW LASSO     3.1       9.3       0.023   0.127  0.129      87.8   93.6
Yes      H LASSO      2.6       9.0       0.005   0.157  0.157      94.8   98.8
Yes      R LASSO      2.0       6.9       0.001   0.159  0.159      95.1   98.8
Yes      C LASSO      3.1       9.0       0.014   0.154  0.155      96.3   99.0
Yes      TW LASSO     1.2       6.8       0.030   0.150  0.153      97.3   98.8

Note: Simulation results are based on 1000 replications. Tuning parameters: $(K, L) = (4, 8)$, $C_\lambda = 2$, and $\gamma = 0.1/\log(p \vee N \vee T)$. The 10 most relevant regressors (based on the sample correlation with the outcome) are used for initial estimation, and at most 10 iterations are used in calculating the penalty weights. H: heteroskedastic-LASSO; R: square-root-LASSO; C: cluster-LASSO; TW: two-way cluster-LASSO. Post-LASSO POLS is performed in all first steps. Nominal coverage probability: 0.95.
Table 3.5.3 reports the simulation results under DGP(ii), where the true model is nonlinear in the control variables and the correlated random effects. The functional form of the nonlinearity is not given and is approximated by a second-order polynomial series. While only 10 observable covariates are considered, the Mundlak device and the polynomial transformation generate 560 regressors that are included in the approximately sparse linear model. Due to the large number of regressors relative to the overall sample size, the approach based on POLS estimation has the largest Monte Carlo standard deviation and root mean square error, and it suffers from severe under-coverage. Compared to POLS and the other sparse methods, the proposed two-way cluster-LASSO method selects the sparsest model while having the smallest bias and root mean square error, and it also achieves the best coverage. When the clustered-panel cross-fitting is employed in the inference procedure, we find that the Monte Carlo coverage probability for confidence intervals based on CHS-type standard errors improves significantly, and the confidence intervals based on DKA-type standard errors switch from slight under-coverage to over-coverage. As the correlated random effects are used for estimation, they project out most of the components that drive the two-way cluster dependence under DGP(ii). In that case, the adjustment in the standard error formulas due to cross-fitting can be conservative.

3.6 Empirical Application

In this section, I re-examine the effects of government spending on the output of an open economy following the framework of Nakamura and Steinsson (2014). It is one of the most cited empirical-macro papers in the American Economic Review, and it investigates one classic quantity of interest in economics: the government spending multiplier. The question is: can we improve on the estimation and inference through more robust and flexible methods? As I will show, this is made possible by the toolkit proposed in this paper.

This framework utilizes the regional variation in military spending in the US to estimate the percentage increase in output that results from an increase of government spending by 1 percent of GDP, i.e., the government spending multiplier. It is referred to as the "open economy relative multiplier" because this framework takes advantage of the uniform monetary and tax policies across US regions to difference out their effects on government spending and output. The parameter of interest is a scalar, and the baseline model is identified without considering control variables, so why is high dimensionality relevant here? As will be revealed very soon, the high dimensionality from heterogeneity and flexible modeling can indeed be hidden.

Due to the endogeneity in the variation of regional military procurement, Nakamura and Steinsson (2014) achieves identification through an instrumental variable (IV) approach. As argued by the authors, national military spending is largely determined by geopolitical events, so it is likely exogenous to the unobserved factors of regional military spending, and it affects regional military spending disproportionately. In other words, the identifying assumption is that the buildups and drawdowns in national military spending are not due to unbalanced military development across regions.
Based on this observation, a shift-share type IV is considered, where the share is estimated by regressing regional military spending on national military spending, allowing for region-specific constant slope coefficients.12 To focus on the main idea, the shares are taken as given, and the resulting instrumental variable is treated as observable rather than as a generated regressor, to avoid further complication.

In this paper, to avoid the endogeneity caused by misspecification of the functional form, I extend the linear model with additive unobserved heterogeneous effects to a partial linear model with non-additive unobserved heterogeneous effects. Let $D_{it}$ be the percentage change in per capita regional military spending in state $i$ and time $t$, and let $Z_{it}$ be the IV. Specifically, the baseline model from the original study and the one from this paper differ as follows:
$$\text{Baseline model:} \qquad Y_{it} = \theta_0 D_{it} + \pi_i W_t + c_i + d_t + U_{it}.$$
$$\text{Partial linear model:} \qquad Y_{it} = \theta_0 D_{it} + g(X_{it}, W_t, c_i, d_t) + U_{it}.$$

12All quantities, unless specifically defined, are in terms of the two-year growth rate of the real per capita values. Per capita is in terms of total population. Nakamura and Steinsson (2014) also presents results when per capita is calculated using the working-age population as a robustness check.

Here $\theta_0$ is the parameter of interest, i.e., the true multiplier; $X_{it}$ and $W_t$ are exogenous control variables, with the latter being only time-varying; $\pi_i$ are non-random unit-specific slope coefficients on $W_t$; and $(c_i, d_t)$ are unobserved heterogeneous effects. In the original study, the linear model is estimated by two-stage least squares (2SLS) with two-way fixed effects. In the extended model, I model the unobserved heterogeneous effects as correlated random effects and take a sparse approximation approach for the infinite-dimensional nuisance parameters, as in Section 3.4. Specifically, $c_i$ is assumed to be a function of $(\bar{D}_i, \bar{X}_i)$ and $d_t$ is assumed to be a function of $(\bar{D}_t, \bar{X}_t, W_t)$. Then, through sparse approximation, the feasible (near) Neyman-orthogonal moment function is given by 3.22 with $f_{it} = (L_\tau(X_{it}, W_t, \bar{D}_i, \bar{D}_t, \bar{X}_i, \bar{X}_t), 1)$.

In the baseline specification of Nakamura and Steinsson (2014), $W_t$ is not included. In their alternative specifications, $W_t$ is chosen as the real interest rate or the change in the national oil price. These two variables are never included together in the original study. Note that allowing unit-specific slope coefficients for controls generates many nuisance parameters: with 51 state groups,13 one control adds 51 parameters and two controls would generate 102 parameters, before considering interactions or higher-order terms. With a sample size of less than 2000, the high dimensionality in the nuisance parameters could result in a noisy estimate of $\theta_0$. In this paper, to obtain a more precise estimate and make the excludability assumption of the IV more plausible, besides the controls from the original study, I also consider additional controls. As is shown in Table 3 of Nakamura and Steinsson (2014), the change in state population is likely not affected by the treatment (the regional military spending), so it is immune to the "bad control" problem;14 but it could affect the treatment and the outcome, so it is included in $X_{it}$. By considering more flexible functional forms and additional exogenous control variables, the excludability condition of the instruments is made more plausible.
On the other hand, the high dimensionality arising from the flexible functional form and the unobserved heterogeneity necessitates the use of high-dimensional methods. Moreover, state-level yearly variables of those macroeconomic characteristics are often considered cluster-dependent in both the cross-sectional and the time dimensions, due to correlated time shocks and unobserved state factors. These concerns justify the use of the robust estimation and inference methods proposed in this paper.

13The regions in this analysis are defined by the states. Nakamura and Steinsson (2014) also presents results on regions defined as clusters of states.
14Angrist and Pischke (2009) and Chen and Kim (2024) provide detailed discussions of how endogenous controls can pollute the identification/estimation.

Table 3.6.1: Multiplier estimates from the original model

(1)        (2)        (3)    (4)              (5)          (6)     (7)       (8)
Real Int.  Oil Price  Pop.   Unobs. Heterog.  First Stage  Est.    CHS s.e.  DKA s.e.
No         No         No     Fixed Effects    POLS         1.43    0.81      0.68
No         Yes        No     Fixed Effects    POLS         1.30    0.72      0.56
No         No         Yes    Fixed Effects    POLS         1.40    0.70      0.57
Yes        Yes        No     Fixed Effects    POLS         1.27    0.71      0.45
Yes        Yes        Yes    Fixed Effects    POLS         1.36    0.56      0.43

Note: Standard errors are calculated with the truncation parameter $M$ chosen by the min-MSE rule given in Section 3.5. The data are available through Nakamura and Steinsson (2014). It is a balanced (after trimming) state-level yearly panel of 51 states over 1971-2005. The military spending data are collected from the electronic database of DD-350 military procurement forms of the US Department of Defense. State output is measured by state GDP collected from the US Bureau of Economic Analysis (BEA). The state population data are from the Census Bureau. Data on oil prices are from West Texas Intermediate. The federal funds rate is from the FRED database of the St. Louis Federal Reserve. The state inflation measures are constructed from several sources. For more details on data construction, readers are referred to Nakamura and Steinsson (2014).

Table 3.6.1 provides benchmark results for the original model with different choices of control variables. All estimates (column 6) are given by 2SLS with two-way fixed effects, and the standard errors (s.e.) are calculated using the CHS and DKA formulas given in Section 3.4. The estimates of the multiplier replicate those given in Nakamura and Steinsson (2014), with significant differences in the standard errors. This is because the variance estimates here account for the potential two-way dependence, while the variance estimator used in Nakamura and Steinsson (2014) assumes cross-sectional independence.

The main comparisons are done in Tables 3.6.2 and 3.6.3. In Table 3.6.2, no cross-fitting is performed in the first stage.

Table 3.6.2: Estimates of the open economy relative multiplier from the extended model

(1)       (2)       (3)     (4)      (5)          (6)          (7)     (8)       (9)
Cross-    Unobs.    Poly.   Param.   First        Z: Param.    Est.    CHS s.e.  DKA s.e.
Fitting   Heterog.  Trans.  Gen.     Stage        Sel.
No        Mundlak   None    7        POLS         7            1.51    0.66      0.82
No        Mundlak   None    7        H LASSO      2            1.43    0.66      0.81
No        Mundlak   None    7        C LASSO      4            1.43    0.66      0.81
No        Mundlak   None    7        TW LASSO     2            1.43    0.70      0.84
No        Mundlak   2nd     35       POLS         35           1.73    0.99      1.15
No        Mundlak   2nd     35       H LASSO      6            1.73    1.01      1.17
No        Mundlak   2nd     35       C LASSO      5            1.75    1.02      1.19
No        Mundlak   2nd     35       TW LASSO     3            1.47    0.62      0.77
No        Mundlak   3rd     119      POLS         119          2.20    1.19      1.37
No        Mundlak   3rd     119      H LASSO      10           1.97    1.16      1.38
No        Mundlak   3rd     119      C LASSO      6            0.98    0.66      0.82
No        Mundlak   3rd     119      TW LASSO     5            1.47    0.61      0.76

Note: Tuning parameters are chosen as $C_\lambda = 2$ and $\gamma = 0.1/\log(p \vee N \vee T)$. The 7 most relevant regressors (based on the sample correlation with the outcome) are used for initial estimation, and at most 10 iterations are used in calculating the penalty weights. H: heteroskedastic-LASSO; R: square-root-LASSO; C: cluster-LASSO; TW: two-way cluster-LASSO. The number of predictors generated by the polynomial transformation and the number of selected predictors for $Z$ are reported in columns (4) and (6). Standard errors are calculated with the truncation parameter $M$ chosen by the min-MSE rule given in Section 3.5.
The number of parameters associated with regressors generated by the polynomial transformations is reported in column (4), and the number of selected parameters associated with $Z$ is reported in column (6).15 Overall, with more controls and the polynomial transformation of the observables, the standard errors are generally larger than those in Table 3.6.1. With no transformations of the original regressors, the estimates obtained by the four different methods are similar, and they are consistent with the baseline results. It is noticeable that the proposed TW LASSO approach using the DKA-type penalty weights achieves an estimate that is consistent with the baseline results and has the least variability. As the flexibility and the number of nuisance parameters increase with the higher-order polynomial transformations, the number of selected regressors increases across all methods. While the standard errors of most approaches become larger and the estimates deviate from the baseline results, the proposed approach remains less noisy. This indicates that many of the higher-order polynomials included in the extended model for functional-form robustness may not matter much and merely contribute noise; while the existing approaches tend to over-select those terms under potential two-way dependence, the proposed method is robust against over-selection.

15Across all first-step LASSO approaches, more parameters associated with $Z$ are selected than parameters associated with $Y$ and $X$. The difference in LASSO selection is less evident for $Y$ and $X$, while the pattern is similar.

Table 3.6.3: Estimates of the open economy relative multiplier from the extended model

(1)       (2)       (3)     (4)      (5)          (6)          (7)     (8)       (9)
Cross-    Unobs.    Poly.   Param.   First        Z: Param.    Est.    CHS s.e.  DKA s.e.
Fitting   Heterog.  Trans.  Gen.     Stage        Ave. Sel.
Yes       Mundlak   None    7        H LASSO      2.0          1.28    1.73      2.00
Yes       Mundlak   None    7        C LASSO      2.0          1.32    1.75      2.03
Yes       Mundlak   None    7        TW LASSO     2.6          1.18    1.77      2.05
Yes       Mundlak   2nd     35       H LASSO      5.2          1.12    2.18      2.52
Yes       Mundlak   2nd     35       C LASSO      5.8          1.46    1.95      2.24
Yes       Mundlak   2nd     35       TW LASSO     4.1          1.20    1.42      1.70
Yes       Mundlak   3rd     119      H LASSO      8.3          1.81    3.17      3.47
Yes       Mundlak   3rd     119      C LASSO      6.5          1.25    1.59      1.91
Yes       Mundlak   3rd     119      TW LASSO     5.3          1.50    1.18      1.44

Note: The tuning parameters are chosen as $(K, L) = (4, 8)$, $C_\lambda = 2$, and $\gamma = 0.1/\log(p \vee N \vee T)$. The 7 most relevant regressors (based on the sample correlation with the outcome) are used for initial estimation, and at most 10 iterations are used in calculating the penalty weights. H: heteroskedastic-LASSO; R: square-root-LASSO; C: cluster-LASSO; TW: two-way cluster-LASSO. The number of predictors generated by the polynomial transformation and the average number of selected predictors for $Z$ are reported in columns (4) and (6). Standard errors are calculated with the truncation parameter $M$ chosen by the min-MSE rule given in Section 3.5.

As in the Monte Carlo simulation, the results obtained with cross-fitting are also examined.
Although theoretical results for the inference procedure based on cross-fitting in the presence of the Mundlak device are not formally given in this paper, the conjecture is that it remains valid under the same set of conditions given in Section 3.4. Table 3.6.3 reports the comparison among the various sparse methods with the clustered-panel cross-fitting.16 It reveals a similar pattern as in Table 3.6.2: the variability of the different methods increases as the model is approximated by higher-order polynomial series, except for the proposed approach, which becomes more accurate as the approximation is made more flexible.

16Due to the smaller sample used in the first-step estimation and multicollinearity among the polynomial terms, methods based on the POLS first step are too noisy, and so they are omitted from the comparison here.

To conclude, the empirical study of the government spending multiplier using a flexible model and sparse methods illustrates the issue of hidden dimensionality. In the current example, the estimates obtained through the high-dimensional methods do not deviate much from the baseline results, which implies that the nonlinear effects omitted from the original model may not be very relevant. While the proposed two-way cluster-LASSO and the inference procedures with or without cross-fitting remain relatively accurate and provide results as a robustness check, the other sparse methods tend to over-select and become too noisy to be interpretable.

3.7 Conclusion and Discussion

The inferential theory for high-dimensional models is particularly relevant in panel data settings, where the modeling of unobserved heterogeneity commonly leads to high-dimensional nuisance parameters. This paper enriches the toolbox of researchers dealing with high-dimensional panel models. In particular, I propose a package of tools for estimation and inference in high-dimensional panel models that feature two-way cluster dependence and unobserved heterogeneity. I first develop a weighted LASSO approach that is robust to two-way cluster dependence in panel data. As is shown in the statistical analysis of the two-way cluster LASSO, the convergence rates are slow because of the cluster dependence, which makes inference challenging. However, by utilizing a cross-fitting method designed for a two-way clustered panel, the rate requirement for the first step can be substantially relaxed, making the proposed two-way cluster-LASSO a feasible first-step estimator for the panel DML inference procedure in a high-dimensional semiparametric model. I further consider unobserved heterogeneity in panel models. Due to the potential incompatibility of cross-fitting with common fixed-effects and random-effects methods, I study the statistical properties of the proposed estimation and inference procedures using the full sample in both the first and the second steps. The validity is established, under a slightly stronger sparsity condition, in a partial linear panel model as a special case.

The estimation and inferential theory are empirically relevant. I illustrate the proposed approaches in an empirical example and show that high dimensionality can be hidden in questions not traditionally considered high-dimensional. In practice, when the question is naturally high-dimensional and is answered with panel data, the proposed approaches are natural solutions.
When the questions are originally not high-dimensional, it is reasonable to start with a simple model as a baseline and then extend it to a more general and flexible model as a robustness check. While both the theoretical and the simulation results support the proposed approaches, some limitations remain in certain scenarios. The feasible penalty weight estimation is highly non-trivial due to two-way cluster dependence and high dimensionality, and the statistical analysis of the two-way cluster LASSO relies on high-level assumptions on the feasible penalty weights. Even though the iterative feasible weight estimation possesses desirable finite-sample properties in the scenarios considered in the Monte Carlo simulation, many subtle issues lack a theoretical guarantee. A devoted exploration of such issues requires a more comprehensive treatment and is an important direction for future research.

BIBLIOGRAPHY

Aldous, D. J. (1981). Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4):581–598.

Andrews, D. W. (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica, 62(1):43–72.

Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica, 59(3):817–858.

Angrist, J. D. and Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist's companion. Princeton University Press.

Babii, A., Ball, R. T., Ghysels, E., and Striaukas, J. (2023). Machine learning panel data regressions with heavy-tailed dependent data: Theory and application. Journal of Econometrics, 237(2):105315.

Babii, A., Ghysels, E., and Striaukas, J. (2022). Machine learning time series regressions with an application to nowcasting. Journal of Business & Economic Statistics, 40(3):1094–1106.

Babii, A., Ghysels, E., and Striaukas, J. (2024). High-dimensional Granger causality tests with an application to VIX and news. Journal of Financial Econometrics, 22(3):605–635.

Baraud, Y., Comte, F., and Viennet, G. (2001). Adaptive estimation in autoregression or β-mixing regression via model selection. The Annals of Statistics, 29(3):839–875.

Basu, S. and Michailidis, G. (2015). Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535–1567.

Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608–650.

Belloni, A., Chernozhukov, V., Hansen, C., and Kozbur, D. (2016). Inference in high-dimensional panel models with an application to gun control. Journal of Business & Economic Statistics, 34(4):590–605.

Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806.

Berbee, H. (1987). Convergence rates in the strong law for bounded mixing sequences. Probability Theory and Related Fields, 74(2):255–270.

Bester, C. A., Conley, T. G., and Hansen, C. B. (2008). Inference with dependent data using cluster covariance estimators. Working paper.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.

Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J., and Zylkin, T. (2022). Machine learning in international trade research - evaluating the impact of trade agreements.
Bühlmann, P. and Van De Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer Science & Business Media.

Burman, P., Chow, E., and Nolan, D. (1994). A cross-validatory method for dependent data. Biometrika, 81(2):351–358.

Chen, K. and Kim, K. i. (2024). Identification of nonseparable models with endogenous control variables. arXiv preprint arXiv:2401.14395.

Chen, K. and Vogelsang, T. J. (2024). Fixed-b asymptotics for panel models with two-way clustering. Journal of Econometrics, 244(1):105831.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018a). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1):C1–C68.

Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. (2022a). Locally robust semiparametric estimation. Econometrica, 90(4):1501–1535.

Chernozhukov, V., Hausman, J. A., and Newey, W. K. (2019). Demand analysis with many prices. Technical report, National Bureau of Economic Research.

Chernozhukov, V., Karl Härdle, W., Huang, C., and Wang, W. (2021a). Lasso-driven inference in time and space. The Annals of Statistics, 49(3):1702–1735.

Chernozhukov, V., Newey, W., Quintas-Martínez, V. M., and Syrgkanis, V. (2022b). RieszNet and ForestRiesz: Automatic debiased machine learning with neural nets and random forests. In International Conference on Machine Learning, pages 3901–3914. PMLR.

Chernozhukov, V., Newey, W. K., Quintas-Martínez, V., and Syrgkanis, V. (2021b). Automatic debiased machine learning via neural nets for generalized linear regression. arXiv preprint arXiv:2104.14737.

Chernozhukov, V., Newey, W. K., and Robins, J. (2018b). Double/de-biased machine learning using regularized Riesz representers. Technical report, cemmap working paper.

Chernozhukov, V., Newey, W. K., and Singh, R. (2022c). Automatic debiased machine learning of causal and structural effects. Econometrica, 90(3):967–1027.

Chetverikov, D., Liao, Z., and Chernozhukov, V. (2021). On cross-validated lasso in high dimensions. The Annals of Statistics, 49(3):1300–1317.

Chiang, H. D., Hansen, B. E., and Sasaki, Y. (2024). Standard errors for two-way clustering with serially correlated time effects. Review of Economics and Statistics, pages 1–40.

Chiang, H. D., Kato, K., Ma, Y., and Sasaki, Y. (2022). Multiway cluster robust double/debiased machine learning. Journal of Business & Economic Statistics, 40(3):1046–1056.

Chiang, H. D., Kato, K., and Sasaki, Y. (2023a). Inference for high-dimensional exchangeable arrays. Journal of the American Statistical Association, 118(543):1595–1605.

Chiang, H. D., Ma, Y., Rodrigue, J., and Sasaki, Y. (2021). Dyadic double/debiased machine learning for analyzing determinants of free trade agreements. arXiv preprint arXiv:2110.04365.

Chiang, H. D., Rodrigue, J., and Sasaki, Y. (2023b). Post-selection inference in three-dimensional panel data. Econometric Theory, 39(3):623–658.

Correia, S., Guimarães, P., and Zylkin, T. (2020). Fast Poisson estimation with high-dimensional fixed effects. The Stata Journal, 20(1):95–115.

Davezies, L., D'Haultfœuille, X., and Guyonvarch, Y. (2019). Empirical process results for exchangeable arrays. arXiv preprint arXiv:1906.11293.

Davidson, J. (1994). Stochastic limit theory: An introduction for econometricians. OUP Oxford.

Dehling, H. and Wendler, M. (2010). Central limit theorem and the bootstrap for U-statistics of strongly mixing data. Journal of Multivariate Analysis, 101(1):126–137.
Dehling, H. and Wendler, M. (2010). Central limit theorem and the bootstrap for U-statistics of strongly mixing data. Journal of Multivariate Analysis, 101(1):126-137.

Djogbenou, A. A., MacKinnon, J. G., and Nielsen, M. Ø. (2019). Asymptotic theory and wild bootstrap inference with clustered errors. Journal of Econometrics, 212(2):393-412.

Dudley, R. M. and Philipp, W. (1983). Invariance principles for sums of Banach space valued random elements and empirical processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 62(4):509-552.

Ellison, M., Lee, S. S., and O'Rourke, K. H. (2024). The ends of 27 big depressions. American Economic Review, 114(1):134-168.

Fama, E. F. and French, K. R. (2000). Forecasting profitability and earnings. The Journal of Business, 73(2):161-175.

Fernández-Val, I. and Lee, J. (2013). Panel data models with nonadditive unobserved heterogeneity: Estimation and inference. Quantitative Economics, 4(3):453-481.

Fuk, D. K. and Nagaev, S. V. (1971). Probability inequalities for sums of independent random variables. Theory of Probability & Its Applications, 16(4):643-660.

Gao, J., Peng, B., and Yan, Y. (2024). Robust inference for high-dimensional panel data models. Available at SSRN 4825772.

Gao, L., Shao, Q.-M., and Shi, J. (2022). Refined Cramér-type moderate deviation theorems for general self-normalized sums with applications to dependent random variables and winsorized mean. The Annals of Statistics, 50(2):673-697.

Gonçalves, S. (2011). The moving blocks bootstrap for panel linear regression models with individual fixed effects. Econometric Theory, 27(5):1048-1082.

Guvenen, F., Schulhofer-Wohl, S., Song, J., and Yogo, M. (2017). Worker betas: Five facts about systematic earnings risk. American Economic Review, 107(5):398-403.

Hahn, J. and Kuersteiner, G. (2011). Bias reduction for dynamic nonlinear panel models with fixed effects. Econometric Theory, 27(6):1152-1191.

Hansen, B. (2022). Econometrics. Princeton University Press.

Hansen, B. E. (1992). Consistent covariance matrix estimation for dependent heterogeneous processes. Econometrica, 60(4):967-972.

Hansen, C. B. (2007). Asymptotic properties of a robust variance matrix estimator for panel data when T is large. Journal of Econometrics, 141(2):597-620.

Hoover, D. N. (1979). Relations on probability spaces and arrays of random variables. Technical report, Institute for Advanced Study.

Ichimura, H. (1987). Estimation of single index models. PhD thesis, Massachusetts Institute of Technology.

Ichimura, H. and Newey, W. K. (2022). The influence function of semiparametric estimators. Quantitative Economics, 13(1):29-61.

Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869-2909.

Jing, B.-Y., Shao, Q.-M., and Wang, Q. (2003). Self-normalized Cramér-type large deviations for independent random variables. The Annals of Probability, 31(4):2167-2215.

Jordan, M. I., Wang, Y., and Zhou, A. (2023). Data-driven influence functions for optimization-based causal inference.

Kallenberg, O. (1989). On the representation theorem for exchangeable arrays. Journal of Multivariate Analysis, 30(1):137-154.

Kallenberg, O. (2005). Probabilistic Symmetries and Invariance Principles, volume 9. Springer.

Kock, A. B. and Callot, L. (2015). Oracle inequalities for high dimensional vector autoregressions. Journal of Econometrics, 186(2):325-344.
Kock, A. B. and Tang, H. (2019). Uniform inference in high-dimensional dynamic panel data models with approximately sparse fixed effects. Econometric Theory, 35(2):295-359.

Larrain, B. (2006). Do banks affect the level and composition of industrial volatility? The Journal of Finance, 61(4):1897-1925.

Li, K., Morck, R., Yang, F., and Yeung, B. (2004). Firm-specific variation and openness in emerging markets. Review of Economics and Statistics, 86(3):658-669.

Lin, J. and Michailidis, G. (2017). Regularized estimation and testing for high-dimensional multi-block vector-autoregressive models. Journal of Machine Learning Research, 18(117):1-49.

MacKinnon, J. G., Nielsen, M. Ø., and Webb, M. D. (2021). Wild bootstrap and asymptotic inference with multiway clustering. Journal of Business & Economic Statistics, 39(2):505-519.

Mattoo, A., Rocha, N., and Ruta, M. (2020). Handbook of Deep Trade Agreements. World Bank Publications.

Menzel, K. (2021). Bootstrap with cluster-dependence in two or more dimensions. Econometrica, 89(5):2143-2188.

Mundlak, Y. (1978). On the pooling of time series and cross section data. Econometrica, 46(1):69-85.

Nakamura, E. and Steinsson, J. (2014). Fiscal stimulus in a monetary union: Evidence from US regions. American Economic Review, 104(3):753-792.

Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica, 62(6):1349-1382.

Ning, Y. and Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158-195.

Peña, V. H., Lai, T. L., and Shao, Q.-M. (2009). Self-Normalized Processes: Limit Theory and Statistical Applications. Springer.

Powell, J. L., Stock, J. H., and Stoker, T. M. (1989). Semiparametric estimation of index coefficients. Econometrica, 57(6):1403-1430.

Racine, J. (2000). Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics, 99(1):39-61.

Rajan, R. G. and Zingales, L. (1998). Financial dependence and growth. The American Economic Review, 88(3):559-586.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica, 56(4):931-954.

Roodman, D., Nielsen, M. Ø., MacKinnon, J. G., and Webb, M. D. (2019). Fast and wild: Bootstrap inference in Stata using boottest. The Stata Journal, 19(1):4-60.

Semenova, V., Goldman, M., Chernozhukov, V., and Taddy, M. (2023a). Inference on heterogeneous treatment effects in high-dimensional dynamic panels under weak dependence. Quantitative Economics, 14(2):471-510.

Semenova, V., Goldman, M., Chernozhukov, V., and Taddy, M. (2023b). Supplement to "Inference on heterogeneous treatment effects in high-dimensional dynamic panels under weak dependence". Quantitative Economics, 14(2):471-510.

Strassen, V. (1965). The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 36(2):423-439.

Thompson, S. B. (2011). Simple formulas for standard errors that cluster by both firm and time. Journal of Financial Economics, 99(1):1-10.

Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166-1202.

Vogt, M., Walsh, C., and Linton, O. (2022). CCE estimation of high-dimensional panel data models with interactive fixed effects. arXiv preprint arXiv:2206.12152.

Wooldridge, J. M. (2021). Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators. Available at SSRN 3906345.
Wooldridge, J. M. and Zhu, Y. (2020). Inference in approximately sparse correlated random effects probit models with panel data. Journal of Business & Economic Statistics, 38(1):1-18.

Wu, W.-B. and Wu, Y. N. (2016). Performance bounds for parameter estimates of high-dimensional linear models with correlated errors.

Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):217-242.

APPENDIX 3A

PROOFS FOR CHAPTER 3.2

We first introduce two lemmas regarding the law of large numbers (LLN) and the central limit theorem (CLT) for two-way clustered arrays with correlated time effects. They are restated and generalized from Theorems 1 and 2 in Chiang et al. (2024). The following notation will be used frequently throughout the appendices. Let $\{W_{it} : i = 1, \ldots, N;\ t = 1, \ldots, T\}$ be an array of random vectors taking values in $\mathbb{R}^p$. Let $F : \mathbb{R}^p \to \mathbb{R}^k$ be a measurable function where $k$ is a constant. We define the Hajek projection terms
\[
a_i = E\big[F(W_{it}) - E[F(W_{it})] \mid \alpha_i\big], \quad g_t = E\big[F(W_{it}) - E[F(W_{it})] \mid \gamma_t\big], \quad e_{it} = F(W_{it}) - E[F(W_{it})] - a_i - g_t,
\]
and their corresponding (long-run) variance-covariance matrices:
\[
\Sigma_a = E[a_i a_i'], \qquad \Sigma_g = \sum_{l=-\infty}^{\infty} E[g_t g_{t+l}'], \qquad \Sigma_e = \sum_{l=-\infty}^{\infty} E[e_{it} e_{i,t+l}'].
\]
We can rewrite $F(W_{it}) - E[F(W_{it})] = a_i + g_t + e_{it}$. Suppose that $W_{it}$ satisfies Assumptions AHK and AR; then the decomposition has the following properties: (i) $\{a_i\}_{i\ge1}$ is a sequence of i.i.d. random vectors; $\{g_t\}_{t\ge1}$ is strictly stationary and $\beta$-mixing with mixing coefficient $\beta_g(m) \le \beta_\gamma(m)$ for all $m\ge1$; for each $i$, $\{e_{it}\}_{t\ge1}$ is also strictly stationary; and $a_i$ is independent of $g_t$. (ii) $a_i$, $g_t$, and $e_{it}$ are mean zero. (iii) Conditional on $(\gamma_t, \gamma_r)$, $e_{it}$ and $e_{jr}$ are independent for $j \ne i$. (iv) The sequences $\{a_i\}$, $\{g_t\}$, and $\{e_{it}\}$ are mutually uncorrelated.

Properties (i) and (ii) are straightforward. Property (iii) is due to the assumption that $\{\alpha_i\}$ and $\{\varepsilon_{it}\}$ are each i.i.d. sequences and independent of each other. Property (iv) is less obvious. One can show $E_P[e_{it}|\gamma_r] = 0$ and $E_P[e_{it}|\alpha_j] = 0$ for any $i, t, j, r$. The least obvious case is $E_P[e_{it}|\gamma_r] = 0$ for $r \ne t$:
\[
E_P[e_{it}|\gamma_r] = E_P\big[F(W_{it}) - E[F(W_{it})]\,\big|\,\gamma_r\big] - E_P[a_i|\gamma_r] - E_P[g_t|\gamma_r]
\]
\[
= E_P\Big\{E_P\big[F(f(\alpha_i,\gamma_t,\varepsilon_{it})) - E[F(W_{it})]\,\big|\,\gamma_t,\gamma_r\big]\,\Big|\,\gamma_r\Big\} - E_P[a_i] - E_P[g_t|\gamma_r]
\]
\[
= E_P\Big\{E_P\big[F(f(\alpha_i,\gamma_t,\varepsilon_{it})) - E[F(W_{it})]\,\big|\,\gamma_t\big]\,\Big|\,\gamma_r\Big\} - E_P[a_i] - E_P[g_t|\gamma_r]
= E_P[g_t|\gamma_r] - E_P[g_t|\gamma_r] = 0,
\]
where the second equality follows from iterated expectations and the independence of $\alpha_i$ and $\gamma_r$, and the third equality follows from the fact that, given $\gamma_t$, $\gamma_r$ is independent of $(\alpha_i, \varepsilon_{it})$. Using the properties above, one can derive the LLN and CLT for two-way clustered panel data. The following lemma gives the LLN.

Lemma A.1 Suppose that $W_{it}$ satisfies Assumptions AHK and AR and $E\big[\|F(W_{it})\|^{4(r+\delta)}\big] < \infty$. Then, (i) $\|\Sigma_a\| < \infty$, $\|\Sigma_g\| < \infty$, and $\|\Sigma_e\| < \infty$; (ii) $\mathrm{Var}\big(\mathbb{E}_{NT}[F(W_{it})]\big) = \frac{1}{N}\Sigma_a + \frac{1}{T}\Sigma_g(1+o(1)) + \frac{1}{NT}\Sigma_e(1+o(1))$ as $N, T \to \infty$; (iii) $\mathbb{E}_{NT}[F(W_{it})] \xrightarrow{p} E[F(W_{it})]$ as $N, T \to \infty$.

Lemma A.2 With the same setting as in Lemma A.1, further assume that either $\lambda_{\min}[\Sigma_a] > 0$ or $\lambda_{\min}[\Sigma_g] > 0$. Then, as $N, T \to \infty$ and $N/T \to c$,
\[
\sqrt{N}\big(\mathbb{E}_{NT}[F(W_{it})] - E[F(W_{it})]\big) \xrightarrow{d} \mathcal{N}(0, \Sigma_a + c\Sigma_g).
\]
Lemmas A.1 and A.2 are the same as Theorems 1 and 2 in Chiang et al. (2024) except that $W_{it}$ is replaced by $F(W_{it})$ and we do not consider the i.i.d. case here. The proofs with $W_{it}$ replaced by $F(W_{it})$ still go through, so they are not repeated here.
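The variance decomposition in Lemma A.1(ii) is easy to visualize numerically. The following is a minimal simulation sketch, not part of the formal argument: it assumes a simple AHK-type data-generating process with standard normal individual effects, an AR(1) time effect (which is $\beta$-mixing), and i.i.d. idiosyncratic noise, and compares the Monte Carlo variance of the two-way sample mean with the approximation $\frac{1}{N}\Sigma_a + \frac{1}{T}\Sigma_g + \frac{1}{NT}\Sigma_e$.

```python
# Minimal simulation sketch of Lemma A.1(ii); the DGP below (normal effects,
# AR(1) time effect with coefficient rho) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
N, T, R, rho = 50, 50, 5000, 0.5

means = np.empty(R)
for r in range(R):
    a = rng.normal(size=(N, 1))                  # i.i.d. alpha_i, so Sigma_a = 1
    g = np.empty(T)                              # AR(1) gamma_t (beta-mixing)
    g[0] = rng.normal() / np.sqrt(1 - rho**2)    # stationary initialization
    for t in range(1, T):
        g[t] = rho * g[t - 1] + rng.normal()
    e = rng.normal(size=(N, T))                  # i.i.d. idiosyncratic part
    means[r] = (a + g[None, :] + e).mean()       # two-way sample mean

Sigma_a = 1.0                                    # Var(a_i)
Sigma_g = 1.0 / (1 - rho) ** 2                   # long-run variance of AR(1)
Sigma_e = 1.0                                    # no serial correlation in e_it
approx = Sigma_a / N + Sigma_g / T + Sigma_e / (N * T)
print(f"Monte Carlo Var: {means.var():.5f}  vs  approximation: {approx:.5f}")
```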
The following lemma provides a probability limit of the infeasible penalty weights.

Lemma A.3 Let $\omega_j$ be as defined in (3.11) with bandwidth $M$ such that $M/T^{0.5} = o(1)$. With the same setting as in Lemma A.2 for $F(W_{it}) = f_{it,j}V_{it}$, we have $\omega_j \xrightarrow{p} \frac{N\wedge T}{N}\Sigma_a + \frac{N\wedge T}{T}\Sigma_g$ as $N, T \to \infty$ and $N/T \to c$.

Proof of Lemma A.3 Since $a_{i,j}$ is independent over $i$, we can apply the weak law of large numbers and obtain
\[
\frac{N\wedge T}{N^2}\sum_{i=1}^N a_{i,j}^2 = \frac{N\wedge T}{N}\Sigma_a + o_P(1).
\]
To show the convergence of the second term, we can apply Proposition 2 of Bester et al. (2008) by verifying its Assumption 7. Since the block size here is $h = \mathrm{round}(T^{1/5}) + 1$, it diverges with the time-series sample size and $h/T \to 0$ as $T \to \infty$, so Assumption 7(i) follows. Note that the $\beta$-mixing property of $g_{t,j}$ implies that it is also $\alpha$-mixing with mixing coefficient $\alpha_g(q) \le \beta_g(q) \le \beta_\gamma(q) = c_\kappa\exp(-\kappa q)$ for all $q \ge 1$. Let $\zeta$ be some positive constant; then we have
\[
\sum_{q=1}^\infty q^2\alpha_g(q)^{\zeta/(4+\zeta)} \le c_\kappa^{\zeta/(4+\zeta)}\sum_{q=1}^\infty q^2\exp\big(-\kappa\zeta q/(4+\zeta)\big) = c_\kappa^{\zeta/(4+\zeta)}\sum_{q=1}^\infty q^2\exp(-aq),
\]
where $a := \kappa\zeta/(4+\zeta)$. We can use the ratio test to examine the convergence of the sum:
\[
\lim_{q\to\infty}\frac{(q+1)^2\exp(-a(q+1))}{q^2\exp(-aq)} = \lim_{q\to\infty}\Big(\frac{q+1}{q}\Big)^2\exp(-a) = \exp(-a).
\]
Since $\kappa > 0$ and $\zeta > 0$, we have $a > 0$ and so $\exp(-a) < 1$. Thus we conclude that the infinite sum does not diverge. The third condition is ensured directly by our assumptions. Thus, by Proposition 2 of Bester et al. (2008), we have
\[
\frac{N\wedge T}{T^2}\sum_{b=1}^B\Big(\sum_{t\in H_b} g_{t,j}\Big)^2 = \frac{N\wedge T}{T}\Sigma_g + o_P(1).
\]
The conclusion follows. □

The following notation and lemma are used to derive the performance bounds for the post-LASSO estimator. Corresponding to $\widehat\Gamma$ defined above Theorem 3.1, we define $\Gamma_0$ as the support of $\zeta_0$ and $\widehat m = \|\widehat\Gamma\setminus\Gamma_0\|_0$. Define $P_\Gamma$ as the projection matrix that projects an $NT\times1$ vector onto the linear span of the $NT\times1$ vectors $f_j$ with $j\in\Gamma$. The post-LASSO estimator $\widehat\zeta_{PL}$ is defined as the OLS estimator of the linear projection of $Y_{it}$ onto $\{f_{it,j} : j\in\widehat\Gamma\}$.

Lemma A.4 Under Assumption ASM, if $S_{\max} := \max_{1\le j\le p}\big|\mathbb{E}_{NT}[\omega_j^{-1/2}f_{it,j}V_{it}]\big| \le \frac{\lambda}{2c_1NT}$, $0 < a = \min_j\omega_j^{1/2} \le \max_j\omega_j^{1/2} = b < \infty$, and $u \ge 1 \ge l \ge 1/c_1$, then
\[
\|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} = \Bigg(\sqrt{\frac{s}{\phi_{\min}(s)(M_f)}} + \sqrt{\frac{\widehat m}{\phi_{\min}(\widehat m)(M_f)}}\Bigg)O_P\Big(\frac{\lambda}{NT}\Big) + O_P\big(\|f(X_{it}) - (P_{\widehat\Gamma}f)_{it}\|_{NT,2}\big).
\]

Proof of Lemma A.4 We can decompose $f(X_{it}) - f_{it}\widehat\zeta_{PL}$ as follows:
\[
f(X_{it}) - f_{it}\widehat\zeta_{PL} = f(X_{it}) - (P_{\widehat\Gamma}Y)_{it} = \big((I_{NT} - P_{\widehat\Gamma})f(X)\big)_{it} - \big((P_{\Gamma_0} + P_{\widehat\Gamma\setminus\Gamma_0})V\big)_{it},
\]
so that
\[
\|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} \le \big\|\big((I_{NT}-P_{\widehat\Gamma})f\big)_{it}\big\|_{NT,2} + \big\|(P_{\Gamma_0}V)_{it}\big\|_{NT,2} + \big\|(P_{\widehat\Gamma\setminus\Gamma_0}V)_{it}\big\|_{NT,2},
\]
where the last equality follows from the properties of the linear projection and the inequality follows from Minkowski's inequality.
By Hölder's inequality and the properties of the spectral norm, we have
\[
\big\|(P_{\widehat\Gamma\setminus\Gamma_0}V)_{it}\big\|_{NT,2} = \frac{1}{\sqrt{NT}}\big\|P_{\widehat\Gamma\setminus\Gamma_0}V\big\|_2 \le \frac{1}{\sqrt{NT}}\Big\|f_{\widehat\Gamma\setminus\Gamma_0}\big(f_{\widehat\Gamma\setminus\Gamma_0}'f_{\widehat\Gamma\setminus\Gamma_0}\big)^{-1}\Big\|_\infty\Big\|f_{\widehat\Gamma\setminus\Gamma_0}'V\Big\|_2
\]
\[
\le \frac{1}{\sqrt{NT}}\sqrt{\frac{1}{NT\,\phi_{\min}(\widehat m)(M_f)}}\Bigg(\sum_{j\in\widehat\Gamma\setminus\Gamma_0}\Big(\sum_{i=1}^N\sum_{t=1}^T f_{it,j}V_{it}\Big)^2\Bigg)^{1/2} \le \sqrt{\frac{\widehat m}{\phi_{\min}(\widehat m)(M_f)}}\,S_{\max} = \sqrt{\frac{\widehat m}{\phi_{\min}(\widehat m)(M_f)}}\,O_P\Big(\frac{\lambda}{NT}\Big),
\]
where the second-to-last inequality uses $\min_j\omega_j^{1/2} = a > 0$ and the last equality follows from $S_{\max} \le \frac{\lambda}{2c_1NT}$. By similar arguments, we have
\[
\big\|(P_{\Gamma_0}V)_{it}\big\|_{NT,2} = \frac{1}{\sqrt{NT}}\big\|P_{\Gamma_0}V\big\|_2 \le \frac{1}{\sqrt{NT}}\Big\|f_{\Gamma_0}\big(f_{\Gamma_0}'f_{\Gamma_0}\big)^{-1}\Big\|_\infty\Big\|f_{\Gamma_0}'V\Big\|_2 \le \sqrt{\frac{s}{\phi_{\min}(s)(M_f)}}\,O_P\Big(\frac{\lambda}{NT}\Big). \ \square
\]

Proof of Theorem 3.1 In this proof, we show the $\ell_1$ and $\ell_2$ convergence rates for $\widehat\zeta$. We first show the regularization event in terms of the infeasible penalty weights $\omega$ defined in (3.11). Due to the AHK representation in Assumption AHK, we can decompose $f_{it,j}V_{it}$ as
\[
f_{it,j}V_{it} = a_{i,j} + g_{t,j} + e_{it,j},
\]
where $a_{i,j} := E[f_{it,j}V_{it}|\alpha_i]$, $g_{t,j} := E[f_{it,j}V_{it}|\gamma_t]$, and $e_{it,j} := f_{it,j}V_{it} - a_{i,j} - g_{t,j}$, for $j = 1, \ldots, p$. To show that the regularization event holds with probability approaching one, we bound the probability of the following event for each $j = 1, \ldots, p$:
\[
P\Bigg(\bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\omega_j^{-1/2}f_{it,j}V_{it}\bigg| > \frac{\lambda}{2c_1NT}\Bigg) = P\Bigg(\omega_j^{-1/2}\bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\big(a_{i,j}+g_{t,j}+e_{it,j}\big)\bigg| > \frac{\lambda}{2c_1NT}\Bigg)
\]
\[
\le P\Bigg(\bigg|\frac{1}{N}\sum_{i=1}^N\omega_{a,j}^{-1/2}a_{i,j}\bigg| + \bigg|\frac{1}{T}\sum_{t=1}^T\omega_{g,j}^{-1/2}g_{t,j}\bigg| + \bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\omega_j^{-1/2}e_{it,j}\bigg| > \frac{\lambda}{2c_1NT}\Bigg)
\]
\[
\le P\Bigg(\bigg|\frac{1}{\sqrt N}\sum_{i=1}^N\omega_{a,j}^{-1/2}a_{i,j}\bigg| > \frac{\sqrt N\lambda}{6c_1NT}\Bigg) + P\Bigg(\bigg|\frac{1}{\sqrt T}\sum_{t=1}^T\omega_{g,j}^{-1/2}g_{t,j}\bigg| > \frac{\sqrt T\lambda}{6c_1NT}\Bigg) + P\Bigg(\bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\omega_j^{-1/2}e_{it,j}\bigg| > \frac{\lambda}{6c_1NT}\Bigg)
\]
\[
=: p_{1,j}(\lambda) + p_{2,j}(\lambda) + p_{3,j}(\lambda),
\]
where $\omega_{a,j} := \frac{N\wedge T}{N^2}\sum_{i=1}^N a_{i,j}^2$ and $\omega_{g,j} := \frac{N\wedge T}{T^2}\sum_{b=1}^B\big(\sum_{t\in H_b}g_{t,j}\big)^2$. The first inequality follows from the triangle inequality and the fact that $\omega_j^{1/2} = (\omega_{a,j}+\omega_{g,j})^{1/2} \ge \max\{\omega_{a,j}^{1/2},\omega_{g,j}^{1/2}\}$; the second inequality follows from a union bound.
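For concreteness, the following is a schematic sketch of how the weight components $\omega_{a,j}$ and $\omega_{g,j}$ above, and the penalty level $\lambda = 6c_1\frac{NT}{\sqrt{N\wedge T}}\Phi^{-1}(1-\frac{\gamma}{2p})$ that the proof arrives at below, could be computed. It is illustrative only: the arrays `a_ij` and `g_tj` stand in for the unobserved Hajek components $a_{i,j}$ and $g_{t,j}$, and the default value of `c1` is a placeholder for the slack constant $c_1$.

```python
# Schematic sketch of the infeasible weight components and penalty level used
# in the proof; a_ij (length N) and g_tj (length T) stand in for the
# unobserved Hajek components of f_{it,j} V_it, so this is illustrative only.
import numpy as np
from statistics import NormalDist

def penalty_weight_j(a_ij: np.ndarray, g_tj: np.ndarray) -> float:
    """omega_j = omega_{a,j} + omega_{g,j}, with block sums over H_b."""
    N, T = a_ij.size, g_tj.size
    m = min(N, T)
    omega_a = m / N**2 * np.sum(a_ij**2)
    h = round(T ** (1 / 5)) + 1                       # block length round(T^{1/5}) + 1
    blocks = [g_tj[s:s + h] for s in range(0, T, h)]  # non-overlapping blocks H_b
    omega_g = m / T**2 * sum(b.sum() ** 2 for b in blocks)
    return omega_a + omega_g

def penalty_level(N: int, T: int, p: int, gamma: float, c1: float = 1.1) -> float:
    """lambda = 6 c1 NT / sqrt(N ^ T) * Phi^{-1}(1 - gamma/(2p)); c1 is a placeholder."""
    return 6 * c1 * N * T / np.sqrt(min(N, T)) * NormalDist().inv_cdf(1 - gamma / (2 * p))
```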
Applying the union bound again, we obtain
\[
P\Bigg(\max_{j=1,\ldots,p}\bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\omega_j^{-1/2}f_{it,j}V_{it}\bigg| > \frac{\lambda}{2c_1NT}\Bigg) \le \sum_{j=1}^p\big[p_{1,j}(\lambda) + p_{2,j}(\lambda) + p_{3,j}(\lambda)\big].
\]
To bound $p_{1,j}(\lambda)$, we apply a moderate deviation theorem for self-normalized sums of independent random variables. For $j = 1, \ldots, p$, define
\[
\Xi_{a,j} = \frac{\big[E(a_{i,j})^2\big]^{1/2}}{\big[E|a_{i,j}|^3\big]^{1/3}}.
\]
Under Assumption REG(i), $\max_{j\le p}E|a_{i,j}|^3 < \infty$ by Hölder's inequality and Jensen's inequality. By Assumption REG(ii), $\min_{j\le p}E|a_{i,j}|^2 > 0$. Therefore, $\min_j\Xi_{a,j} > 0$. By Theorem 7.4 of Peña et al. (2009) with $\delta = 1$, we have for any $x\in[0, N^{1/6}\Xi_{a,j}]$ that
\[
P\Bigg(\bigg|\frac{1}{\sqrt N}\sum_{i=1}^N\omega_{a,j}^{-1/2}a_{i,j}\bigg| > x\Bigg) \le 2(1-\Phi(x))\Bigg[1 + O(1)\bigg(\frac{1+x}{N^{1/6}\Xi_{a,j}}\bigg)^3\Bigg].
\]
Let $l_{a,N}$ be some positive increasing sequence. If $N^{1/6}\Xi_{a,j}/l_{a,N} - 1 > 0$ and $x\in[0, N^{1/6}\Xi_{a,j}/l_{a,N} - 1]$, then
\[
P\Bigg(\bigg|\frac{1}{\sqrt N}\sum_{i=1}^N\omega_{a,j}^{-1/2}a_{i,j}\bigg| > x\Bigg) \le 2(1-\Phi(x))\Bigg[1 + O(1)\bigg(\frac{1}{l_{a,N}}\bigg)^3\Bigg].
\]
Then, setting $\lambda = 6c_1\frac{NT}{\sqrt N}\Phi^{-1}\big(1-\frac{\gamma}{2p}\big)$ gives
\[
\sum_{j=1}^p p_{1,j}(\lambda) \le 2p\big(1-\Phi(\Phi^{-1}(1-\gamma/2p))\big)\big[1 + O(1)(1/l_{a,N})^3\big] \le \gamma\big[1 + O(1)(1/l_{a,N})^3\big],
\]
given that $\Phi^{-1}\big(1-\frac{\gamma}{2p}\big)\in\big[0, N^{1/6}\min_j\Xi_{a,j}/l_{a,N} - 1\big]$ and $N^{1/6}\min_j\Xi_{a,j}/l_{a,N} - 1 > 0$. Note that, under Assumption REG(i) and $N/T\to c$ as $N, T\to\infty$, $\Phi^{-1}\big(1-\frac{\gamma}{2p}\big) \lesssim \sqrt{\log(p/\gamma)} = o\big(N^{1/12}/\log N\big)$. Therefore, it suffices to take $l_{a,N} = O(\log N)$, and it follows that $\sum_{j=1}^p p_{1,j}(\lambda) \to 0$ as $\gamma\to0$ and $(N,T)\to\infty$.

To bound $p_{2,j}(\lambda)$, we utilize a moderate deviation theorem for self-normalized sums of weakly dependent random variables. Observe that $g_{t,j} = E[f_{it,j}V_{it}|\gamma_t]$ is $\beta$-mixing with coefficient $\beta_g(q)$ satisfying
\[
\beta_g(q) \le \beta_\gamma(q) \le c_\kappa\exp(-\kappa q) \quad \forall q\in\mathbb{Z}_+.
\]
Furthermore, by strict stationarity and the non-degeneracy condition in Assumption REG(iii), one can verify that for some $\nu > 0$, $E\big[\sum_{t=r}^{r+m}g_{t,j}\big]^2 \ge \nu^2m$ for all $t\ge1$, $r\ge0$, $m\ge1$. By Assumption REG(ii) and Hölder's inequality, we have $E|f_{it,j}V_{it}|^{4(\mu+\delta)} < \infty$ for some $\mu > 1$, $\delta > 0$. Then, by Theorem 3.2 of Gao et al. (2022) with $\tau = 1$ and $\alpha = \frac{1}{1+2\tau}$, we have
\[
\sum_{j=1}^p P\Bigg(\bigg|\frac{1}{\sqrt T}\sum_{t=1}^T\omega_{g,j}^{-1/2}g_{t,j}\bigg| > x\Bigg) \le 2p(1-\Phi(x))\Bigg[1 + O(1)\bigg(\frac{1}{l_{g,T}}\bigg)^2\Bigg]
\]
uniformly for $x\in\big(0,\ d_0(\log T)^{-1/2}T^{1/12}/l_{g,T}\big)$, where $d_0$ is some positive constant and $l_{g,T}$ is some positive increasing sequence. Then, setting $\lambda = 6c_1\frac{NT}{\sqrt T}\Phi^{-1}\big(1-\frac{\gamma}{2p}\big)$ gives
\[
\sum_{j=1}^p p_{2,j}(\lambda) \le \gamma\Bigg[1 + O(1)\bigg(\frac{1}{l_{g,T}}\bigg)^2\Bigg],
\]
given that $\Phi^{-1}(1-\frac{\gamma}{2p})\in\big(0,\ d_0(\log T)^{-1/2}T^{1/12}/l_{g,T}\big)$. Under Assumption REG(i), we have $\log(p/\gamma) = o\big(T^{1/6}/(\log T)^2\big)$ and so $\Phi^{-1}\big(1-\frac{\gamma}{2p}\big) \lesssim \sqrt{\log(p/\gamma)} = o\big(T^{1/12}/\log T\big)$. Therefore, by taking $l_{g,T} = O\big((\log T)^{1/2}\big)$, it follows that $\sum_{j=1}^p p_{2,j}(\lambda)\to0$ as $\gamma\to0$ and $(N,T)\to\infty$.

Consider $p_{3,j}(\lambda)$. Define $\bar e_{i,j} := \frac1T\sum_{t=1}^T e_{it,j}$. Observe that $E[\bar e_{i,j}] = 0$ by iterated expectations and that, conditional on $\{\gamma_t\}_{t=1}^T$, $\bar e_{i,j}$ is independent over $i$.
We have shown previously that $E|f_{it,j}V_{it}|^{4(\mu+\delta)} < \infty$ for some $\mu > 1$, $\delta > 0$. Given that $e_{it,j} = f_{it,j}V_{it} - a_{i,j} - g_{t,j}$ and that $E|a_{i,j}|^{4(\mu+\delta)} < \infty$ and $E|g_{t,j}|^{4(\mu+\delta)} < \infty$ by Jensen's inequality and iterated expectations, we have $E|e_{it,j}|^{4(\mu+\delta)} < \infty$ and so $E|\bar e_{i,j}|^{4(\mu+\delta)} < \infty$ by Minkowski's inequality. Note that
\[
\mathrm{Var}(\bar e_{i,j}) = \frac1T\sum_{l=-(T-1)}^{T-1}\Big(1-\frac{|l|}{T}\Big)E(e_{it,j}e_{i,t+l,j}) = \frac1T\Sigma_e(1+o(1)),
\]
where $\Sigma_e$ is defined at the beginning of Appendix 3A with $k = 1$ in this case. By Lemma A.1, $|\Sigma_{e,j}| < \infty$. Furthermore, as shown below, $\omega_j^{1/2}$ is bounded from below by some constant $a > 0$. Now, by the conditional version of Corollary 4 of Fuk and Nagaev (1971), there exist constants $a_1$ and $a_2$ such that
\[
P\Bigg(\bigg|\sum_{i=1}^N\omega_j^{-1/2}\bar e_{i,j}\bigg| > \frac{\lambda}{6c_1T}\,\bigg|\,\{\gamma_t\}_{t=1}^T\Bigg) \le P\Bigg(\bigg|\sum_{i=1}^N\bar e_{i,j}\bigg| > \frac{a\lambda}{6c_1T}\,\bigg|\,\{\gamma_t\}_{t=1}^T\Bigg)
\le a_1(\lambda/T)^{-4}\sum_{i=1}^N E\big(|\bar e_{i,j}|^4\,\big|\,\{\gamma_t\}_{t=1}^T\big) + \exp\Bigg(-\frac{a_2(\lambda/T)^2}{\sum_{i=1}^N\mathrm{Var}\big(\bar e_{i,j}|\{\gamma_t\}_{t=1}^T\big)}\Bigg).
\]
Note that $\exp(-1/z)$ is not globally concave, but it is concave for $z > 1/2$ and bounded by $z/e^2$ for $z\in(0,1/2)$, where $e$ is Euler's number. Denote $z = \frac{(T/\lambda)^2}{a_2}\sum_{i=1}^N\mathrm{Var}(\bar e_{i,j}|\{\gamma_t\}_{t=1}^T)$. Then we have
\[
\exp\Bigg(-\frac{a_2(\lambda/T)^2}{\sum_{i=1}^N\mathrm{Var}(\bar e_{i,j}|\{\gamma_t\}_{t=1}^T)}\Bigg) = \exp(-1/z) \le (z/e^2)\,\mathbf{1}\{z\in(0,1/2)\} + \exp(-1/z)\,\mathbf{1}\{z > 1/2\}.
\]
By the Fubini theorem, Jensen's inequality, and the bounded moments, we have
\[
p_{3,j}(\lambda) = P\Bigg(\bigg|\sum_{i=1}^N\omega_j^{-1/2}\bar e_{i,j}\bigg| > \frac{\lambda}{6c_1T}\Bigg)
\le a_1(\lambda/T)^{-4}\sum_{i=1}^N E\big(|\bar e_{i,j}|^4\big) + \frac{(T/\lambda)^2}{a_2}\sum_{i=1}^N\mathrm{Var}(\bar e_{i,j})/e^2 + \exp\Bigg(-\frac{a_2(\lambda/T)^2}{\sum_{i=1}^N\mathrm{Var}(\bar e_{i,j})}\Bigg)
\]
\[
= O\big((\lambda/T)^{-4}N\big) + O\Big(\frac{TN}{\lambda^2}\Big) + \exp\Bigg(-\frac{a_2(\lambda/T)^2}{(N/T)\,\Sigma_e}\Bigg).
\]
Therefore, $\sum_{j=1}^p p_{3,j}(\lambda) = O\big(p(\lambda/T)^{-4}N\big) + O\big(\frac{pTN}{\lambda^2}\big) + p\exp\big(-\frac{a_2(\lambda/T)^2}{(N/T)\Sigma_e}\big)$. Given that $N/T\to c$ with $0 < c < \infty$, by taking
\[
\lambda_e = \frac{p^{1/4}TN^{1/4}}{\varepsilon^{1/4}} \vee \frac{\sqrt{pNT}}{\varepsilon^{1/2}} \vee \sqrt{NT\log(p/\gamma)\,\Sigma_e/a_2}
\]
for some $\varepsilon = o(1)$, we have $\sum_{j=1}^p p_{3,j}(\lambda_e) = o(1)$. Under REG(i), $\log(p/\gamma) = o\big(T^{1/6}/(\log T)^2\big)$ and $p = o\big(T^{7/6}/(\log T)^2\big)$, so $\lambda_e = O(\lambda)$ where $\lambda = 6c_1\frac{NT}{\sqrt{N\wedge T}}\Phi^{-1}\big(1-\frac{\gamma}{2p}\big)$. Therefore, we have shown $\sum_{j=1}^p p_{3,j}(\lambda)\to0$ for $\lambda = 6c_1\frac{NT}{\sqrt{N\wedge T}}\Phi^{-1}(1-\frac{\gamma}{2p})$.

Put together, we have shown that
\[
P\Bigg(\max_{j=1,\ldots,p}\bigg|\frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T\omega_j^{-1/2}f_{it,j}V_{it}\bigg| \le \frac{\lambda}{2c_1NT}\Bigg) \to 1. \tag{3A.1}
\]
Now we can apply Lemma 6 of Belloni et al. (2012) to obtain finite-sample bounds on $\|f_{it}(\widehat\zeta-\zeta_0)\|_{NT,2}$ and $|\omega^{1/2}(\widehat\zeta-\zeta_0)|_1$. Let $J_p^1$ be a subset of the index set $J_p = \{1,\ldots,p\}$ and $J_p^0 = J_p\setminus J_p^1$. Let $\delta$ be a generic vector of nuisance parameters, let $\delta^1$ be a copy of $\delta$ with its $j$-th element replaced by 0 for all $j\in J_p^0$, and similarly let $\delta^0$ be a copy of $\delta$ with its $j$-th element replaced by 0 for all $j\in J_p^1$.
Define the restricted eigenvalue and the Gram matrix as follows:
\[
K_C(M_f) = \min_{\delta:\,\|\delta^0\|_1\le C\|\delta^1\|_1,\ \|\delta\|\ne0,\ |J_p^1|\le s}\frac{\sqrt{s}\sqrt{\delta'M_f\delta}}{\|\delta^1\|_1}, \qquad M_f = \mathbb{E}_{NT}[f_{it}'f_{it}].
\]
Define the weighted restricted eigenvalue as
\[
K_C^\omega(M_f) = \min_{\delta:\,\|\omega^{1/2}\delta^0\|_1\le C\|\omega^{1/2}\delta^1\|_1,\ \|\delta\|\ne0,\ |J_p^1|\le s}\frac{\sqrt{s}\sqrt{\delta'M_f\delta}}{\|\omega^{1/2}\delta^1\|_1}.
\]
Let $a := \min_{j=1,\ldots,p}\omega_j^{1/2}$ and $b := \max_{j=1,\ldots,p}\omega_j^{1/2}$. As shown in Belloni et al. (2016),
\[
K_C^\omega(M_f) \ge \frac1b K_{bC/a}(M_f). \tag{3A.2}
\]
Denote $\Sigma_{a,j} = \big(E[a_{i,j}^2]\big)^{1/2}$ and $\Sigma_{g,j} = \big(\sum_{l=-\infty}^\infty E[g_{t,j}g_{t+l,j}]\big)^{1/2}$. By Lemma A.3 above, we have $\omega_j \xrightarrow{p} \frac{N\wedge T}{N}\Sigma_{a,j}^2 + \frac{N\wedge T}{T}\Sigma_{g,j}^2$. By Lemma A.1, $|\Sigma_{a,j}| < \infty$ and $|\Sigma_{g,j}| < \infty$. Assumption REG(iii) implies that $\min_{j\le p}\Sigma_{a,j}^2 > 0$. Therefore, $\omega_j$ is bounded away from zero and bounded from above for each $j = 1,\ldots,p$ with probability approaching one as $N, T\to\infty$.

Under Assumption ASM, condition (3.12), and (3A.1), Lemma 6 of Belloni et al. (2012) implies that
\[
\big\|f_{it}(\widehat\zeta-\zeta_0)\big\|_{NT,2} \le \Big(u+\frac{1}{c_1}\Big)\frac{\sqrt s\,\lambda}{NT\,K_{c_0}^\omega(M_f)} + 2\|r\|_{NT,2} = O_P\Bigg(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Bigg),
\]
\[
\big|\omega^{1/2}(\widehat\zeta-\zeta_0)\big|_1 \le \frac{3c_0\sqrt s}{K_{2c_0}^\omega(M_f)}\Bigg[\Big(u+\frac{1}{c_1}\Big)\frac{\sqrt s\,\lambda}{NT\,K_{c_0}^\omega(M_f)} + 2\|r\|_{NT,2}\Bigg] + 3c_0\frac{NT}{\lambda}\|r\|_{NT,2}^2 = O_P\Bigg(s\sqrt{\frac{\log(p/\gamma)}{N\wedge T}} + \frac{s}{\sqrt{(N\wedge T)\log(p/\gamma)}}\Bigg),
\]
where $c_0 := \frac{uc+1}{lc-1} > 1$. By (3A.2), we have $1/K_{c_0}^\omega(M_f) \le b/K_{\bar C}(M_f)$, where $\bar C := bc_0/a$. By the arguments given in Bickel et al. (2009), Assumption SE implies that $1/K_C(M_f) = O_P(1)$ for any $C > 0$. Therefore,
\[
\big\|f_{it}(\widehat\zeta-\zeta_0)\big\|_{NT,2} = O_P\Bigg(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Bigg), \qquad \big|\omega^{1/2}(\widehat\zeta-\zeta_0)\big|_1 = O_P\Bigg(s\sqrt{\frac{\log(p/\gamma)}{N\wedge T}}\Bigg).
\]
By Hölder's inequality and the fact that $\min_j\omega_j^{1/2} \ge a > 0$,
\[
\|\widehat\zeta-\zeta_0\|_1 \le \|\omega^{-1/2}\|_\infty\,\big|\omega^{1/2}(\widehat\zeta-\zeta_0)\big|_1 = O_P\Bigg(s\sqrt{\frac{\log(p/\gamma)}{N\wedge T}}\Bigg) = O_P\Bigg(s\sqrt{\frac{\log(p\vee NT)}{N\wedge T}}\Bigg).
\]
The $\ell_2$ rate of convergence will be derived after the sparsity bounds.

We now turn to the post-LASSO. By the finite-sample bounds of Lemma A.4, we have
\[
\|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} = \Bigg(\sqrt{\frac{s}{\phi_{\min}(s)(M_f)}} + \sqrt{\frac{\widehat m}{\phi_{\min}(\widehat m)(M_f)}}\Bigg)O_P\Big(\frac{\lambda}{NT}\Big) + O_P\big(\|f(X_{it}) - (P_{\widehat\Gamma}f)_{it}\|_{NT,2}\big). \tag{3A.3}
\]
By the finite-sample bounds of Lemma 7 of Belloni et al. (2012), we have
\[
\big\|\omega^{1/2}(\widehat\zeta_{PL}-\zeta_0)\big\|_1 \le \frac{b\sqrt{\widehat m+s}}{\sqrt{\phi_{\min}(\widehat m+s)(M_f)}}\,\big\|f_{it}(\widehat\zeta_{PL}-\zeta_0)\big\|_{NT,2}, \tag{3A.4}
\]
\[
\big\|f_{it}(\widehat\zeta_{PL}-\zeta_0)\big\|_{NT,2} \le \|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} + \|r_{it}\|_{NT,2}, \tag{3A.5}
\]
\[
\|f(X_{it}) - P_{\widehat\Gamma}f(X_{it})\|_{NT,2} \le \Big(u+\frac{1}{c_1}\Big)\frac{\sqrt s\,\lambda}{NT\,K_{c_0}^\omega(M_f)} + 3\|r_{it}\|_{NT,2}. \tag{3A.6}
\]
The finite-sample bound of Lemma 8 of Belloni et al. (2012) gives
\[
\widehat m \le \phi_{\max}(\widehat m)(M_f)\,a^{-2}\Bigg(\frac{2c_0\sqrt s}{K_{c_0}^\omega(M_f)} + \frac{6c_0NT\|r_{it}\|_{NT,2}}{\lambda}\Bigg)^2,
\]
where $a > 0$ has been shown previously. Let
\[
\mathcal M = \Bigg\{m\in\mathbb N : m > 2\phi_{\max}(m)(M_f)\,a^{-2}\Bigg(\frac{2c_0\sqrt s}{K_{c_0}^\omega(M_f)} + \frac{6c_0NT\|r_{it}\|_{NT,2}}{\lambda}\Bigg)^2\Bigg\}.
\]
Lemma 10 of Belloni et al. (2012) gives
\[
\widehat m \le \min_{m\in\mathcal M}\phi_{\max}(m\wedge NT)(M_f)\,a^{-2}\Bigg(\frac{2c_0\sqrt s}{K_{c_0}^\omega(M_f)} + \frac{6c_0NT\|r_{it}\|_{NT,2}}{\lambda}\Bigg)^2. \tag{3A.7}
\]
Note that $\frac{6c_0NT\|r_{it}\|_{NT,2}}{\sqrt s\,\lambda} = O_P\big(1/\log(p\wedge NT)\big)\xrightarrow{p}0$. Recall that $1/K_{c_0}^\omega(M_f) \le b/K_{\bar C}(M_f) < \infty$. Let
\[
\mu := \min_m\Big\{\sqrt{\phi_{\max}(m)(M_f)/\phi_{\min}(m)(M_f)}\ :\ m > 18\bar C^2s\,\phi_{\max}(m)(M_f)/K_{\bar C}^2(M_f)\Big\},
\]
and let $\bar m$ be the integer associated with $\mu$. By the definition of $\mathcal M$, $\bar m\in\mathcal M$ with probability approaching one, which implies $\bar m > \widehat m$ due to (3A.7). By Lemma 9 (the sub-linearity of sparse eigenvalues) of Belloni et al. (2012) and (3A.7), we have
\[
\widehat m \lesssim_P s\mu^2\phi_{\min}(\bar m+s)/K_{\bar C}^2 \lesssim s\mu^2\phi_{\min}(\widehat m+s)/K_{\bar C}^2.
\]
Combining the results above with (3A.3) and (3A.6) gives
\[
\|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} = O_P\Bigg(\sqrt{\frac{s\mu^2\log(p/\gamma)}{(N\wedge T)K_{\bar C}^2}} + \|r_{it}\|_{NT,2} + \frac{\sqrt s\,\lambda}{NT\,K_{c_0}^\omega(M_f)}\Bigg).
\]
Recall that $b < \infty$ and Condition SE imply $1/K_{c_0}^\omega(M_f) \le b/K_{\bar C}(M_f) < \infty$. Then Condition SE, Condition ASM, and the choice of $\lambda$ together imply
\[
\|f(X_{it}) - f_{it}\widehat\zeta_{PL}\|_{NT,2} = O_P\Bigg(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Bigg).
\]
For the $\ell_1$ convergence rate, note that $\|\widehat\zeta_{PL}-\zeta_0\|_0 \le \widehat m + s$. Applying the Cauchy-Schwarz inequality to $\|\widehat\zeta_{PL}-\zeta_0\|_1 = \sum_{j=1}^p|\widehat\zeta_{PL,j}-\zeta_{0,j}| = \sum_{j\in\widehat\Gamma\cup\Gamma_0}|\widehat\zeta_{PL,j}-\zeta_{0,j}|$ gives
\[
\|\widehat\zeta_{PL}-\zeta_0\|_1 \le \sqrt{\widehat m+s}\,\|\widehat\zeta_{PL}-\zeta_0\|_2.
\]
To derive the convergence rate in the $\ell_2$-norm of the post-LASSO estimator (the $\ell_2$ rate for the LASSO estimator is obtained similarly), we utilize the sparse eigenvalue condition and the prediction norm. If $\widehat\zeta_{PL}-\zeta_0 = 0$, the conclusion holds trivially. Otherwise, define $b = (\widehat\zeta_{PL}-\zeta_0)/\|\widehat\zeta_{PL}-\zeta_0\|_2$ and $\Delta(\widehat m+s) = \{\delta : \|\delta\|_0 = \widehat m+s,\ \|\delta\|_2 = 1\}$. Then $\|b\|_2 = 1$ and so $b\in\Delta(\widehat m+s)$. By Assumption SE, we have
\[
0 < \kappa_1 \le \phi_{\min}(\widehat m+s)(M_f) \le \frac{(b'M_fb)^{1/2}}{\|b\|_2} = \frac{\big\|f_{it}(\widehat\zeta_{PL}-\zeta_0)\big\|_{NT,2}}{\|\widehat\zeta_{PL}-\zeta_0\|_2}.
\]
Therefore, using the bound on the prediction norm above, we conclude that
\[
\|\widehat\zeta_{PL}-\zeta_0\|_2 \le \frac{\big\|f_{it}(\widehat\zeta_{PL}-\zeta_0)\big\|_{NT,2}}{\kappa_1} = O_P\Bigg(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Bigg).
\]
It implies that $\|\widehat\zeta_{PL}-\zeta_0\|_1 = \sqrt{\widehat m+s}\,O_P\Big(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Big) = O_P\Big(\sqrt{\frac{s^2\log(p/\gamma)}{N\wedge T}}\Big)$. □

APPENDIX 3B

PROOFS FOR CHAPTER 3.3

The following lemma, quoted from Semenova et al. (2023a) (Lemma A.3), follows from the weak form of Strassen's coupling (Strassen, 1965) and the strong form of Strassen's coupling via Lemma 2.11 of Dudley and Philipp (1983):

Lemma B.1 Let $(X, Y)$ be a random element taking values in a Polish space $S = S_1\times S_2$ with marginal laws $P_X$ and $P_Y$, respectively. Then we can construct $(\widetilde X, \widetilde Y)$ taking values in $(S_1, S_2)$ such that (i) $\widetilde X$ and $\widetilde Y$ are independent of each other; (ii) their laws satisfy $\mathcal L(\widetilde X) = P_X$ and $\mathcal L(\widetilde Y) = P_Y$; and (iii)
\[
P\big\{(X, Y)\ne(\widetilde X, \widetilde Y)\big\} = \tfrac12\big\|P_{X,Y} - P_X\times P_Y\big\|_{TV}.
\]
The proof is provided in Semenova et al. (2023b). To apply the independence coupling result to cross-fitting in panel data, we introduce another lemma:

Lemma B.2 Let $X_1,\ldots,X_m$ and $Y$ be random elements taking values in a Polish space $S = S_1\times\cdots\times S_m\times S_y$. Then
\[
\beta\big((X_1,\ldots,X_m),\ Y\big) \le \sum_{i=1}^m\beta(X_i, Y).
\]
Proof of Lemma B.2 By Lemma B.1, we have
\[
\beta\big((X_1,\ldots,X_m),Y\big) = \tfrac12\big\|P_{(X_1,\ldots,X_m),Y} - P_{(X_1,\ldots,X_m)}\times P_Y\big\|_{TV} = P\big((X_1,\ldots,X_m,Y)\ne(\widetilde X_1,\ldots,\widetilde X_m,\widetilde Y)\big) \le \sum_{i=1}^m P\big((X_i,Y)\ne(\widetilde X_i,\widetilde Y)\big) = \sum_{i=1}^m\beta(X_i,Y),
\]
where the inequality follows from the union bound. □

Now we can prove Lemma 3.1 from the main body of the chapter.

Proof of Lemma 3.1 By Lemma B.1, for each $(k,l)$ we have
\[
P\big\{(W(k,l), W(-k,-l))\ne(\widetilde W(k,l),\widetilde W(-k,-l))\big\} = \beta\big(W(k,l), W(-k,-l)\big) = \beta\Bigg(\{W_{it}\}_{i\in I_k,\,t\in S_l},\ \bigcup_{k'\ne k,\ l'\ne l,l\pm1}\{W_{it}\}_{i\in I_{k'},\,t\in S_{l'}}\Bigg)
\]
\[
\le \sum_{i\in I_k}\beta\Bigg(\{W_{it}\}_{t\in S_l},\ \bigcup_{k'\ne k,\ l'\ne l,l\pm1}\{W_{it}\}_{i\in I_{k'},\,t\in S_{l'}}\Bigg) \le \sum_{k'\ne k,\ l'\ne l,l\pm1}\ \sum_{j\in I_{k'}}\ \sum_{i\in I_k}\beta\big(\{W_{it}\}_{t\in S_l},\ \{W_{jt}\}_{t\in S_{l'}}\big),
\]
where the last two inequalities follow from Lemma B.2. Note that for $s, m \ge 1$ we have
\[
\beta\big(\{W_{it}\}_{t\le s},\ \{W_{jt}\}_{t\ge s+m}\big) = \big\|P_{\{W_{it}\}_{t\le s},\{W_{jt}\}_{t\ge s+m}} - P_{\{W_{it}\}_{t\le s}}\times P_{\{W_{jt}\}_{t\ge s+m}}\big\|_{TV} \le \sup_{A\in\sigma(\{W_{jt}\}_{t\ge s+m})}E_P\big|P\big(A\,\big|\,\sigma(\{W_{it}\}_{t\le s})\big) - P(A)\big|
\]
\[
= \sup_{A\in\sigma(\{W_{jt}\}_{t\ge s+m})}E_P\Big|P\Big(P\big(A\,\big|\,\sigma(\alpha_i,\{\gamma_t\}_{t\le s},\{\varepsilon_{it}\}_{t\le s})\big)\,\Big|\,\sigma(\{W_{it}\}_{t\le s})\Big) - P(A)\Big| = \sup_{A\in\sigma(\{W_{jt}\}_{t\ge s+m})}E_P\big|P\big(A\,\big|\,\sigma(\{\gamma_t\}_{t\le s})\big) - P(A)\big|
\]
\[
= \sup_{A\in\sigma(\{\gamma_t\}_{t\ge s+m})}E_P\big|P\big(A\,\big|\,\sigma(\{\gamma_t\}_{t\le s})\big) - P(A)\big| \le c_\kappa\exp(-\kappa m),
\]
where the last inequality follows from Assumption 3.2. Therefore,
\[
P\big\{(W(k,l),W(-k,-l))\ne(\widetilde W(k,l),\widetilde W(-k,-l))\big\} \le KLN^2c_\kappa\exp(-\kappa T_l),
\]
which in turn gives
\[
P\big\{(W(k,l),W(-k,-l))\ne(\widetilde W(k,l),\widetilde W(-k,-l))\ \text{for some}\ (k,l)\big\} \le K^2L^2N^2c_\kappa\exp(-\kappa T_l),
\]
where $T_l = T/L$. Given that $\log(N)/T = o(1)$ and $(K, L)$ are finite, it follows that
\[
P\big\{(W(k,l),W(-k,-l))\ne(\widetilde W(k,l),\widetilde W(-k,-l))\ \text{for some}\ (k,l)\big\} = o(1). \ \square
\]

Proof of Theorem 3.2 By Assumption DML2(i), $\widehat\eta_{kl}\in\mathcal T_{NT}$ with probability $1-\Delta_{NT}$, so
\[
P\big(\widehat\eta_{kl}\in\mathcal T_{NT},\ \forall(k,l)\big) \ge 1 - KL\Delta_{NT} = 1 - o(1).
\]
Denote the event $\{\widehat\eta_{kl}\in\mathcal T_{NT},\ \forall(k,l)\}$ by $\mathcal E_\eta$ and the event $\{(W(k,l),W(-k,-l)) = (\widetilde W(k,l),\widetilde W(-k,-l))\ \text{for all}\ (k,l)\}$ by $\mathcal E_{cp}$. By Lemma 3.1, we have $P(\mathcal E_{cp}) = 1-o(1)$. By the union bound inequality, we have $P(\mathcal E_\eta^c\cup\mathcal E_{cp}^c) \le P(\mathcal E_\eta^c) + P(\mathcal E_{cp}^c) = o(1)$. So $P(\mathcal E_\eta\cap\mathcal E_{cp}) = 1 - P(\mathcal E_\eta^c\cup\mathcal E_{cp}^c) \ge 1-o(1)$.

Let $\widehat\theta$ be a solution of equation (3.13). To simplify the notation, we denote
\[
\widehat A_{kl} = \mathbb E_{kl}[\psi^a(W_{it},\widehat\eta_{kl})], \quad \widehat{\bar A} = \frac{1}{KL}\sum_{k=1}^K\sum_{l=1}^L\widehat A_{kl}, \quad A_0 = E_P[\psi^a(W_{it};\eta_0)],
\]
\[
\widehat B_{kl} = \mathbb E_{kl}[\psi^b(W_{it},\widehat\eta_{kl})], \quad \widehat{\bar B} = \frac{1}{KL}\sum_{k=1}^K\sum_{l=1}^L\widehat B_{kl}, \quad B_0 = E_P[\psi^b(W_{it};\eta_0)],
\]
\[
\widehat{\bar\psi}(\theta) = \widehat{\bar A}\theta + \widehat{\bar B}, \qquad \bar\psi(\theta,\eta) = \mathbb E_{NT}\,\psi(W_{it};\theta,\eta).
\]
Claim B.1. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\|\widehat{\bar A} - A_0\| = O_P(N^{-1/2} + r_{NT})$.

By Claim B.1 and Assumption 3(iii), which requires all singular values of $A_0$ to be bounded away from zero, it follows that all singular values of $\widehat{\bar A}$ are also bounded away from zero on the event $\mathcal E_\eta$. Then, by the linearity in Assumption 3(i), we can write $\widehat\theta = -\widehat{\bar A}^{-1}\widehat{\bar B}$ and $\theta_0 = -A_0^{-1}B_0$. By basic algebra, we have
\[
\sqrt N(\widehat\theta-\theta_0) = \sqrt N\big(-\widehat{\bar A}^{-1}\widehat{\bar B} - \theta_0\big) = -\sqrt N\,\widehat{\bar A}^{-1}\big(\widehat{\bar B} + \widehat{\bar A}\theta_0\big) = -\sqrt N\,\widehat{\bar A}^{-1}\widehat{\bar\psi}(\theta_0)
\]
\[
= -\sqrt N A_0^{-1}\bar\psi(\theta_0,\eta_0) - \sqrt N A_0^{-1}\big(\widehat{\bar\psi}(\theta_0)-\bar\psi(\theta_0,\eta_0)\big) - \sqrt N\Big[\big(A_0+(\widehat{\bar A}-A_0)\big)^{-1}-A_0^{-1}\Big]\big(\bar\psi(\theta_0,\eta_0)+\widehat{\bar\psi}(\theta_0)-\bar\psi(\theta_0,\eta_0)\big).
\]
Claim B.2. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\|\widehat{\bar\psi}(\theta_0) - \bar\psi(\theta_0,\eta_0)\| = O_P\big(r_{NT}'/\sqrt N + \lambda_{NT} + \lambda_{NT}'\big)$.
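As a complement to the algebra above, the following is a schematic sketch (not the chapter's code) of how the cross-fitted quantities $\widehat A_{kl}$ and $\widehat B_{kl}$ and the point estimate $\widehat\theta = -\widehat{\bar A}^{-1}\widehat{\bar B}$ could be assembled for a scalar parameter. The functions `psi_a`, `psi_b`, and `fit_nuisance` are hypothetical placeholders for the score components and the nuisance estimator; the partition into $(I_k, S_l)$ blocks, with the time blocks $S_l$ and $S_{l\pm1}$ removed from the auxiliary sample, follows the coupling argument of Lemma 3.1.

```python
# Schematic sketch of the cross-fitted linear-score estimator for scalar theta;
# psi_a, psi_b, and fit_nuisance are hypothetical placeholders. The auxiliary
# sample W(-k,-l) drops fold I_k and the time blocks S_l and S_{l+-1}.
import numpy as np

def panel_dml(W, K, L, psi_a, psi_b, fit_nuisance):
    """W has shape (N, T, ...); assumes L >= 4 so W(-k,-l) is non-empty."""
    N, T = W.shape[0], W.shape[1]
    I = np.array_split(np.arange(N), K)      # cross-sectional folds I_k
    S = np.array_split(np.arange(T), L)      # consecutive time blocks S_l
    A_sum = B_sum = 0.0
    for k in range(K):
        for l in range(L):
            keep_i = np.concatenate([I[kk] for kk in range(K) if kk != k])
            keep_t = np.concatenate([S[ll] for ll in range(L)
                                     if ll not in (l - 1, l, l + 1)])
            eta_kl = fit_nuisance(W[np.ix_(keep_i, keep_t)])  # auxiliary sample
            block = W[np.ix_(I[k], S[l])]                      # main block (k,l)
            A_sum += psi_a(block, eta_kl).mean()               # A_hat_{kl}
            B_sum += psi_b(block, eta_kl).mean()               # B_hat_{kl}
    A_bar, B_bar = A_sum / (K * L), B_sum / (K * L)
    return -B_bar / A_bar                                      # theta_hat
```

Dropping the neighboring time blocks $S_{l\pm1}$ acts as a buffer that weakens the serial dependence between the main block and the auxiliary sample, which is exactly what makes the coupling probability in Lemma 3.1 vanish at rate $\exp(-\kappa T_l)$.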
By Assumption DML2(i) and Jensen's inequality, we have $\|A_0\| \le m_{NT}' \le c_m$. Then Claim B.2 implies that
\[
\big\|\sqrt N A_0^{-1}\big(\widehat{\bar\psi}(\theta_0)-\bar\psi(\theta_0,\eta_0)\big)\big\| = O_P(1)\,O_P\big(\sqrt N\,r_{NT}'/\sqrt N + \sqrt N\lambda_{NT} + \sqrt N\lambda_{NT}'\big) = O_P\big(r_{NT}' + \sqrt N\lambda_{NT} + \sqrt N\lambda_{NT}'\big).
\]
Since $E[\bar\psi(\theta_0,\eta_0)] = 0$, Lemma A.2 yields $\sqrt N\,\bar\psi(\theta_0,\eta_0)\xrightarrow{d}\mathcal N(0,\Omega)$, where $\Omega = \Sigma_a + c\Sigma_g$ and $\|\Omega\| < \infty$. By Claims B.1 and B.2 and the asymptotic normality of $\sqrt N\,\bar\psi(\theta_0,\eta_0)$, we have
\[
\Big\|\sqrt N\Big[\big(A_0+(\widehat{\bar A}-A_0)\big)^{-1}-A_0^{-1}\Big]\big(\bar\psi(\theta_0,\eta_0)+\widehat{\bar\psi}(\theta_0)-\bar\psi(\theta_0,\eta_0)\big)\Big\|
\le \big\|\widehat{\bar A}^{-1}\big\|\,\big\|\widehat{\bar A}-A_0\big\|\,\big\|A_0^{-1}\big\|\,\big\|\sqrt N\big(\bar\psi(\theta_0,\eta_0)+\widehat{\bar\psi}(\theta_0)-\bar\psi(\theta_0,\eta_0)\big)\big\|
\]
\[
= O_P(1)\,O_P\big(N^{-1/2}+r_{NT}\big)\,O_P(1)\Big(O_P(1) + O_P\big(r_{NT}'+\sqrt N\lambda_{NT}+\sqrt N\lambda_{NT}'\big)\Big) = O_P\big(N^{-1/2}+r_{NT}\big),
\]
and therefore
\[
\sqrt N\big(\widehat\theta-\theta_0\big) = -A_0^{-1}\sqrt N\,\bar\psi(\theta_0,\eta_0) + O_P\big(N^{-1/2}+r_{NT}+r_{NT}'+\sqrt N\lambda_{NT}+\sqrt N\lambda_{NT}'\big) \xrightarrow{d} A_0^{-1}\mathcal N(0,\Omega).
\]

Proof of Claim B.1. Fix any $(k,l)$. We have
\[
\big\|\widehat A_{kl}-A_0\big\| \le \big\|\widehat A_{kl} - E_P[\widehat A_{kl}|W(-k,-l)]\big\| + \big\|E_P[\widehat A_{kl}|W(-k,-l)] - A_0\big\| =: \|\Delta_{A,1}\| + \|\Delta_{A,2}\|.
\]
On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, we have $\widehat\eta_{kl}\in\mathcal T_{NT}$ and independence between $W(-k,-l)$ and $W(k,l)$. So, by Assumption DML2, we have $\|\Delta_{A,2}\| \le r_{NT}$. By iterated expectations, $E_P[\Delta_{A,1}] = 0$. To simplify the notation, we denote $\ddot\psi_{it}^{a,kl} := \psi^a(W_{it},\widehat\eta_{kl}) - E_P[\psi^a(W_{it},\widehat\eta_{kl})|W(-k,-l)]$. Consider $\|\Delta_{A,1}\|$:
\[
(N_kT_l)^2\,E\big(\|\Delta_{A,1}\|^2\,\big|\,W(-k,-l)\big) = E_P\Bigg[\bigg\|\sum_{i\in I_k,\,t\in S_l}\ddot\psi_{it}^{a,kl}\bigg\|^2\,\Bigg|\,W(-k,-l)\Bigg]
\]
\[
\le \sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{ir}^{a,kl}\rangle|W(-k,-l)\big]\Big| + \sum_{t\in S_l,\,i\in I_k,\,j\in I_k}\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{jt}^{a,kl}\rangle|W(-k,-l)\big]\Big| + \sum_{t\in S_l,\,i\in I_k}\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{it}^{a,kl}\rangle|W(-k,-l)\big]\Big|
\]
\[
+ 2\sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i,j\in I_k}\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{j,t+m}^{a,kl}\rangle|W(-k,-l)\big]\Big| + 2\sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i\in I_k}\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{i,t+m}^{a,kl}\rangle|W(-k,-l)\big]\Big|
\]
\[
=: a(1) + a(2) + a(3) + 2a(4) + 2a(5).
\]
By the conditional Cauchy-Schwarz inequality, for any $i, t, j, s$, we have
\[
\Big|E_P\big[\langle\ddot\psi_{it}^{a,kl},\ddot\psi_{js}^{a,kl}\rangle|W(-k,-l)\big]\Big| \le \Big(E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big]\,E_P\big[\|\ddot\psi_{js}^{a,kl}\|^2|W(-k,-l)\big]\Big)^{1/2} = E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big].
\]
Therefore, we have
\[
a(1) \le N_kT_l^2\,E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big], \quad a(2) \le N_k^2T_l\,E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big],
\]
\[
a(3) \le N_kT_l\,E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big], \quad a(5) \le N_kT_l^2\,E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big].
\]
On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, we have, for $i\in I_k$, $t\in S_l$,
\[
\Big(E_P\big[\|\ddot\psi_{it}^{a,kl}\|^2|W(-k,-l)\big]\Big)^{1/2} \lesssim \Big(E_P\big[\|\psi^a(W_{it},\widehat\eta_{kl})\|^2|W(-k,-l)\big]\Big)^{1/2} < \infty,
\]
where the first inequality follows from expanding the term and applying Jensen's inequality, and the second from Assumption DML2(i). Let $D$ denote the dimension of $\psi^a(W,\eta)$. Then we have
\[
a(4) = a(5) + \sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i,j\in I_k,\,i\ne j}\sum_{d=1}^D E_P\big[\ddot\psi_{d,i,t}^{a,kl}\,\ddot\psi_{d,j,t+m}^{a,kl}\,\big|\,W(-k,-l)\big].
\]
For each $i\in I_k$, $t\in S_l$, we can decompose $\ddot\psi_{d,i,t}^{a,kl} = a_i^{kl} + g_t^{kl} + e_{it}^{kl}$, where $a_i^{kl} = E[\ddot\psi_{d,i,t}^{a,kl}|\alpha_i]$, $g_t^{kl} = E[\ddot\psi_{d,i,t}^{a,kl}|\gamma_t]$, and $e_{it}^{kl} = \ddot\psi_{d,i,t}^{a,kl} - a_i^{kl} - g_t^{kl}$. Conditional on $W(-k,-l)$, $(a_i^{kl}, g_t^{kl}, e_{it}^{kl})$ are mutually uncorrelated, $a_i^{kl}\perp\!\!\!\perp a_j^{kl}$ for $i\ne j$, and $g_t^{kl}$ is also $\beta$-mixing with $\beta_g(m)\le\beta_\gamma(m)$. Therefore, we have
\[
E_P\big[\ddot\psi_{d,i,t}^{a,kl}\,\ddot\psi_{d,j,t+m}^{a,kl}\,\big|\,W(-k,-l)\big] = E_P\big[g_t^{kl}g_{t+m}^{kl} + e_{it}^{kl}e_{j,t+m}^{kl}\,\big|\,W(-k,-l)\big]
= E_P\big[g_t^{kl}g_{t+m}^{kl}|W(-k,-l)\big] + E_P\Big[E_P\big[e_{it}^{kl}e_{j,t+m}^{kl}\,\big|\,\alpha_i,\alpha_j,W(-k,-l)\big]\,\Big|\,W(-k,-l)\Big].
\]
Note that $\beta$-mixing of $\gamma_t$ implies $\alpha$-mixing with mixing coefficient $\alpha_\gamma(m)\le\beta_\gamma(m)$ for all $m\in\mathbb Z_+$, and, conditional on $W(-k,-l)$ and $\alpha_i$, $e_{it}^{kl}$ is also $\alpha$-mixing with mixing coefficient not larger than $\alpha_\gamma(m)$ by Theorem 14.12 of Hansen (2022). Then we have
\[
E_P\Big[\big|E_P\big[e_{it}^{kl}e_{j,t+m}^{kl}\,\big|\,\alpha_i,\alpha_j,W(-k,-l)\big]\big|\,\Big|\,W(-k,-l)\Big]
\lesssim 8\,\alpha_\gamma(m)^{1-2/q}\Big(E_P\big[|\ddot\psi_{d,i,t}^{a,kl}|^q|W(-k,-l)\big]\Big)^{1/q}\Big(E_P\big[|\ddot\psi_{d,j,t+m}^{a,kl}|^q|W(-k,-l)\big]\Big)^{1/q} \lesssim 32\,\alpha_\gamma(m)^{1-2/q}c_m^2,
\]
where the first inequality follows from Jensen's inequality together with the fact that $E[e_{it}^{kl}|\alpha_i, W(-k,-l)] = 0$ and Theorem 14.13(ii) of Hansen (2022), and the last inequality follows from the moment conditions in Assumption DML2 and the independence of $W(-k,-l)$ and $W(k,l)$ on $\mathcal E_{cp}$. Similarly,
\[
\big|E_P\big[g_t^{kl}g_{t+m}^{kl}|W(-k,-l)\big]\big| \lesssim \alpha_\gamma(m)^{1-2/q}c_m^2.
\]
Then we have
\[
\frac{1}{N_k^2T_l}\sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i,j\in I_k,\,i\ne j}\sum_{d=1}^D E_P\big[\ddot\psi_{d,i,t}^{a,kl}\ddot\psi_{d,j,t+m}^{a,kl}|W(-k,-l)\big] \lesssim c_m^2D\sum_{m=1}^\infty\alpha_\gamma(m)^{1-2/q} \le c_m^2D\sum_{m=1}^\infty c_\kappa\exp(-\kappa m)^{1-2/q} \le \frac{c_m^2Dc_\kappa}{\exp(\kappa(1-2/q))-1} < \infty,
\]
where the last inequality follows from the geometric sum. Thus, as $(N_k, T_l)\to\infty$, we have
\[
E\big(\|\Delta_{A,1}\|^2|W(-k,-l)\big) = \Big(\frac{1}{N_kT_l}\Big)^2\big[a(1)+a(2)+a(3)+2a(4)+2a(5)\big] = O_P(1/T_l) = O_P(1/N),
\]
where the last step follows from the fact that $L$ is constant and $N/T\to c$ as $N, T\to\infty$.
By Markov's inequality, we conclude that, conditional on $W(-k,-l)$, $\|\Delta_{A,1}\| = O_P(1/\sqrt N)$. By Lemma 6.1 of Chernozhukov et al. (2018a), conditional convergence implies unconditional convergence, so $\|\Delta_{A,1}\| = O_P(1/\sqrt N)$. To summarize, we have $\|\widehat A_{kl}-A_0\| = O_P(N^{-1/2}+r_{NT})$, which implies $\|\widehat{\bar A}-A_0\| = O_P(N^{-1/2}+r_{NT})$.

Proof of Claim B.2. Since $K$ and $L$ are finite, it suffices to show, for any $(k,l)$,
\[
\big\|\mathbb E_{kl}\big[\psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0)\big]\big\| = O_P\big(r_{NT}'/\sqrt{N_k} + \lambda_{NT} + \lambda_{NT}'\big).
\]
To simplify the notation, we denote
\[
\ddot\psi_{it}^{kl} = \psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0), \qquad \widetilde{\ddot\psi}_{it}^{kl} = \ddot\psi_{it}^{kl} - E_P[\ddot\psi_{it}^{kl}|W(-k,-l)],
\]
\[
b(1) = \frac{\sqrt{N_k}}{N_kT_l}\bigg\|\sum_{i\in I_k,\,t\in S_l}\big[\ddot\psi_{it}^{kl} - E_P[\ddot\psi_{it}^{kl}|W(-k,-l)]\big]\bigg\|, \qquad b(2) = \big\|E_P[\psi(W_{it};\theta_0,\widehat\eta_{kl})|W(-k,-l)] - E_P[\psi(W_{it};\theta_0,\eta_0)]\big\|.
\]
We also denote $\widetilde{\ddot\psi}_{d,it}$ as each element of the vector $\widetilde{\ddot\psi}_{it}^{kl}$ for $d = 1,\ldots,D$, suppressing the subscripts $k, l$ for convenience. By the triangle inequality, we have
\[
\big\|\mathbb E_{kl}[\psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0)]\big\| \le b(1)/\sqrt{N_k} + b(2).
\]
To bound $b(1)$, first note that it is mean zero by the iterated expectation argument. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, we have
\[
E_P[b(1)^2|W(-k,-l)] \le \frac{1}{N_kT_l^2}\Bigg\{\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{ir}^{kl}\rangle|W(-k,-l)\big]\Big| + \sum_{t\in S_l,\,i\in I_k,\,j\in I_k}\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{jt}^{kl}\rangle|W(-k,-l)\big]\Big| + \sum_{t\in S_l,\,i\in I_k}\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{it}^{kl}\rangle|W(-k,-l)\big]\Big|
\]
\[
+ 2\sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i,j\in I_k}\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{j,t+m}^{kl}\rangle|W(-k,-l)\big]\Big| + 2\sum_{m=1}^{T_l-1}\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i\in I_k}\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{i,t+m}^{kl}\rangle|W(-k,-l)\big]\Big|\Bigg\}
=: c(1)+c(2)+c(3)+2c(4)+2c(5).
\]
By the conditional Cauchy-Schwarz inequality, for any $i, t, j, s$, we have
\[
\Big|E_P\big[\langle\widetilde{\ddot\psi}_{it}^{kl},\widetilde{\ddot\psi}_{js}^{kl}\rangle|W(-k,-l)\big]\Big| \le \Big(E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big]\,E_P\big[\|\widetilde{\ddot\psi}_{js}^{kl}\|^2|W(-k,-l)\big]\Big)^{1/2}.
\]
Applying Minkowski's inequality and Jensen's inequality on the event $\mathcal E_\eta\cap\mathcal E_{cp}$, we have, for $i\in I_k$, $t\in S_l$,
\[
\Big(E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big]\Big)^{1/2} \le \Big(E_P\big[\|\ddot\psi_{it}^{kl}\|^2|W(-k,-l)\big]\Big)^{1/2} + \Big(E_P\big[\|E_P[\ddot\psi_{it}^{kl}|W(-k,-l)]\|^2\,\big|\,W(-k,-l)\big]\Big)^{1/2} \le 2\Big(E_P\big[\|\ddot\psi_{it}^{kl}\|^2|W(-k,-l)\big]\Big)^{1/2} \le 2r_{NT}'.
\]
Therefore, we have
\[
c(1) \le E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big] = O\big((r_{NT}')^2\big), \quad c(2) \le c\,E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big] = O\big((r_{NT}')^2\big),
\]
\[
c(3) \le \frac{1}{N_k}E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big] = O\big((r_{NT}')^2/N\big), \quad c(5) \le E_P\big[\|\widetilde{\ddot\psi}_{it}^{kl}\|^2|W(-k,-l)\big] = O\big((r_{NT}')^2\big).
\]
Following similar arguments as for bounding $a(4)$, $c(4)$ is of order $O\big((r_{NT}')^2\big)$. So we have shown $E_P[b(1)^2|W(-k,-l)] = O_P\big((r_{NT}')^2\big)$, which implies $b(1) = O_P(r_{NT}')$ by Markov's inequality and Lemma 6.1 of Chernozhukov et al. (2018a).

To bound $b(2)$, we first define
\[
f_{kl}(r) := E_P\big[\psi\big(W_{it};\theta_0,\eta_0 + r(\widehat\eta_{kl}-\eta_0)\big)\,\big|\,W(-k,-l)\big] - E_P[\psi(W_{it};\theta_0,\eta_0)], \qquad r\in[0,1],
\]
for some $i\in I_k$, $t\in S_l$, so that $b(2) = \|f_{kl}(1)\|$. Expanding $f_{kl}(r)$ around 0 using the mean value theorem and evaluating at $r = 1$, we have
\[
f_{kl}(1) = f_{kl}(0) + f_{kl}'(0) + f_{kl}''(\widetilde r)/2, \qquad \widetilde r\in(0,1).
\]
We note that $f_{kl}(0) = 0$ on the event $\mathcal E_{cp}$. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$ and under Assumption DML1(ii) (near-orthogonality), we have $\|f_{kl}'(0)\| \le \lambda_{NT}$ and $\|f_{kl}''(\widetilde r)\| \le \lambda_{NT}'$. Therefore, we have shown that $b(2) = O_P(\lambda_{NT}) + O_P(\lambda_{NT}')$. Combining the bounds for $b(1)$ and $b(2)$ completes the proof of Claim B.2. □

Proof of Theorem 3.3 By the same arguments as for Theorem 3.2, we have
\[
P(\mathcal E_\eta\cap\mathcal E_{cp}) = 1 - P(\mathcal E_\eta^c\cup\mathcal E_{cp}^c) \ge 1-o(1).
\]
By Claim B.1, we have $\|\widehat{\bar A}-A_0\| = O_P(N^{-1/2}+r_{NT})$ on the event $\mathcal E_\eta\cap\mathcal E_{cp}$. Therefore, since $\|A_0^{-1}\| \le a_0^{-1}$ is ensured by Assumption DML1(iv) and $\|\Omega\| < \infty$ as shown in the proof of Theorem 3.2, it suffices to show $\|\widehat\Omega_{CHS}-\Omega\| = o_P(1)$.
Furthermore, since $K$ and $L$ are fixed constants, it suffices to show for each $(k,l)$ that $\|\widehat\Omega_{CHS,kl}-\Omega\| = o_P(1)$, where
\[
\widehat\Omega_{CHS,kl} := \widehat\Omega_{a,kl} + \widehat\Omega_{b,kl} - \widehat\Omega_{c,kl} + \widehat\Omega_{d,kl} + \widehat\Omega_{d,kl}',
\]
\[
\widehat\Omega_{a,kl} := \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\psi(W_{ir};\widehat\theta,\widehat\eta_{kl})',
\]
\[
\widehat\Omega_{b,kl} := \frac{K/L}{N_kT_l^2}\sum_{t\in S_l,\,i\in I_k,\,j\in I_k}\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\psi(W_{jt};\widehat\theta,\widehat\eta_{kl})',
\]
\[
\widehat\Omega_{c,kl} := \frac{K/L}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l}\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\psi(W_{it};\widehat\theta,\widehat\eta_{kl})',
\]
\[
\widehat\Omega_{d,kl} := \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}k\Big(\frac{m}{M}\Big)\sum_{t=\min(S_l)}^{\max(S_l)-m}\sum_{i,j\in I_k,\,j\ne i}\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\psi(W_{j,t+m};\widehat\theta,\widehat\eta_{kl})'.
\]
Since a sequence of symmetric matrices $\Omega_n$ converges to a symmetric matrix $\Omega_0$ if and only if $e'\Omega_ne\to e'\Omega_0e$ for all conformable $e$, we may assume without loss of generality that $\psi$ is scalar. To simplify the expressions, we denote
\[
\psi_{it}^{(0)} = \psi(W_{it};\theta_0,\eta_0), \qquad \widehat\psi_{it}^{(kl)} = \psi(W_{it};\widehat\theta,\widehat\eta_{kl}).
\]
Claim B.3. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\big|\widehat\Omega_{a,kl}-\Sigma_a\big| = O_P\big(N^{-1/2}+r_{NT}'\big)$.

Claim B.4. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\big|\widehat\Omega_{b,kl}-cE_P[g_tg_t']\big| = O_P\big(N^{-1/2}+r_{NT}'\big)$.

Claim B.5. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\big|\widehat\Omega_{c,kl}\big| = O_P(T^{-1})$.

Claim B.6. On the event $\mathcal E_\eta\cap\mathcal E_{cp}$, $\big|\widehat\Omega_{d,kl}-c\sum_{m=1}^\infty E_P[g_tg_{t+m}]\big| = o_P(1)$.

The decomposition techniques used in the proofs of Claims B.3, B.4, and B.6 follow the proofs of Lemmas 1 and 2 in Appendix E of Chiang et al. (2024). Combining Claims B.3-B.6 completes the proof of Theorem 3.3.

Proof of Claim B.3. By the triangle inequality, we have
\[
\big|\widehat\Omega_{a,kl}-\Sigma_a\big| \le \big|I_{a,1}^{(kl)}\big| + \big|I_{a,2}^{(kl)}\big| + \big|I_{a,3}^{(kl)}\big|,
\]
where
\[
I_{a,1}^{(kl)} := \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big\{\widehat\psi_{it}^{(kl)}\widehat\psi_{ir}^{(kl)} - \psi_{it}^{(0)}\psi_{ir}^{(0)}\Big\}, \qquad
I_{a,2}^{(kl)} := \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big\{\psi_{it}^{(0)}\psi_{ir}^{(0)} - E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}]\Big\},
\]
\[
I_{a,3}^{(kl)} := \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}] - E_P[a_ia_i].
\]
By the law of total covariance and the mean-zero property of $\psi_{it}^{(0)}$, we have
\[
E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}\big] = E_P\big[\mathrm{cov}(\psi_{it}^{(0)},\psi_{ir}^{(0)}|\alpha_i)\big] + E_P\big(E_P[\psi_{it}^{(0)}|\alpha_i]\,E_P[\psi_{ir}^{(0)}|\alpha_i]\big).
\]
Due to the identical distribution of $\alpha_i$ and the mean-zero property, we have
\[
\frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}] = \frac{1}{T_l^2}\sum_{t\in S_l,\,r\in S_l}\Big\{E_P\big[\mathrm{cov}(\psi_{it}^{(0)},\psi_{ir}^{(0)}|\alpha_i)\big] + E_P\big(E_P[\psi_{it}^{(0)}|\alpha_i]\,E_P[\psi_{ir}^{(0)}|\alpha_i]\big)\Big\},
\]
and the second term equals $E_P[a_ia_i]$. Conditional on $\alpha_i$, $\{\psi_{it}^{(0)}\}_{t\ge1}$ is $\beta$-mixing with the same mixing coefficient as $\gamma_t$. Therefore, we can apply Theorem 14.13(ii) of Hansen (2022) and Jensen's inequality:
\[
\big|E_P\big[\mathrm{cov}(\psi_{it}^{(0)},\psi_{ir}^{(0)}|\alpha_i)\big]\big| \le 8\big(E_P|\psi_{it}^{(0)}|^q\big)^{2/q}\beta_\gamma(|t-r|)^{1-2/q}.
\]
Note that $\frac{1}{T_l^2}\sum_{t\in S_l,\,r\in S_l}\beta_\gamma(|t-r|)^{1-2/q} = O(1/T_l)$ under Assumption 3.2. So $I_{a,3}^{(kl)} = O(1/T_l) = O(T^{-1})$.

To bound $I_{a,2}^{(kl)}$, we can rewrite it by the triangle inequality as
\[
\big|I_{a,2}^{(kl)}\big| \le \bigg|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\bigg| + \bigg|\frac{1}{N_k}\sum_{i\in I_k}\widetilde I_{a,2,i}^{(kl)}\bigg|,
\]
where
\[
I_{a,2,i}^{(kl)} := \frac{1}{T_l^2}\sum_{t,r\in S_l}\Big\{\psi_{it}^{(0)}\psi_{ir}^{(0)} - E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}|\{\gamma_t\}_{t\in S_l}\big]\Big\}, \qquad
\widetilde I_{a,2,i}^{(kl)} := \frac{1}{T_l^2}\sum_{t,r\in S_l}\Big\{E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}|\{\gamma_t\}_{t\in S_l}\big] - E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}]\Big\}.
\]
Due to the identical distribution of $\alpha_i$, $\widetilde I_{a,2,i}^{(kl)}$ does not vary over $i$, so that $E_P\big|\frac{1}{N_k}\sum_{i\in I_k}\widetilde I_{a,2,i}^{(kl)}\big|^2 = E_P\big|\widetilde I_{a,2,i}^{(kl)}\big|^2$. Denote $h_i(\gamma_t,\gamma_r) = E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}|\gamma_t,\gamma_r] - E_P[\psi_{it}^{(0)}\psi_{ir}^{(0)}]$. By direct calculation, we have
\[
E_P\big|\widetilde I_{a,2,i}^{(kl)}\big|^2 = \frac{1}{T_l^4}\sum_{t,r,t',r'\in S_l}E_P\big[h_i(\gamma_t,\gamma_r)h_i(\gamma_{t'},\gamma_{r'})\big].
\]
To bound the right-hand side, we can apply Lemma 3.4 of Dehling and Wendler (2010) by verifying the following two conditions:
\[
E_P|h_i(\gamma_t,\gamma_r)|^{2+\delta} < \infty, \tag{3B.1}
\]
\[
\int\!\!\int|h_i(u,v)|^{2+\delta}\,dF(u)\,dF(v) < \infty, \tag{3B.2}
\]
for some $\delta > 0$, where $F(\cdot)$ is the common CDF of $\gamma_t$. Consider condition (3B.1).
By Minkowski's inequality, Jensen's inequality, and the law of iterated expectations, we have
\[
\big(E_P|h_i(\gamma_t,\gamma_r)|^{2+\delta}\big)^{\frac{1}{2+\delta}} \le \Big(E_P\big|\psi_{it}^{(0)}\psi_{ir}^{(0)}\big|^{2+\delta}\Big)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\psi_{ir}^{(0)}\big| \le \Big(E_P\big|\psi_{it}^{(0)}\big|^{4+2\delta}\Big)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\big|^2,
\]
where the second inequality follows from Hölder's inequality and the identical distribution of $\gamma_t$. Let $\delta = \frac{q-4}{2}$; then $\big(E_P|\psi_{it}^{(0)}|^{4+2\delta}\big)^{\frac{1}{2+\delta}} \le c_m^2$ and $E_P|\psi_{it}^{(0)}|^2 < c_m^2$ follow from Assumption DML2(i). Therefore, condition (3B.1) is satisfied. Consider condition (3B.2). By Minkowski's inequality and Jensen's inequality, we have
\[
\bigg(\int\!\!\int\Big|E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}\,\big|\,\gamma_t=u,\gamma_r=v\big] - E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}\big]\Big|^{2+\delta}dF(u)\,dF(v)\bigg)^{\frac{1}{2+\delta}}
\le \bigg(\int\!\!\int\Big|E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}\,\big|\,\gamma_t=u,\gamma_r=v\big]\Big|^{2+\delta}dF(u)\,dF(v)\bigg)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\psi_{ir}^{(0)}\big|
\]
\[
\le \bigg(\int\!\!\int\Big(E_P\big[|\psi_{it}^{(0)}|^2\,\big|\,\gamma_t=u\big]\Big)^{\frac{2+\delta}{2}}\Big(E_P\big[|\psi_{ir}^{(0)}|^2\,\big|\,\gamma_r=v\big]\Big)^{\frac{2+\delta}{2}}dF(u)\,dF(v)\bigg)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\big|^2
\]
\[
\le \bigg(\int\!\!\int E_P\big[|\psi_{it}^{(0)}|^{2+\delta}\,\big|\,\gamma_t=u\big]\,E_P\big[|\psi_{ir}^{(0)}|^{2+\delta}\,\big|\,\gamma_r=v\big]\,dF(u)\,dF(v)\bigg)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\big|^2
= \Big(E_P\big|\psi_{it}^{(0)}\big|^{4+2\delta}\Big)^{\frac{1}{2+\delta}} + E_P\big|\psi_{it}^{(0)}\big|^2 < \infty,
\]
where the second inequality follows from the (conditional) Hölder's inequality and the identical distribution of $\gamma_t$; the third inequality follows from Jensen's inequality; and the last equality follows from the law of iterated expectations and the identical distribution of $\gamma_t$. Therefore, condition (3B.2) is also satisfied with $\delta = \frac{q-4}{2}$. By Lemma 3.4 of Dehling and Wendler (2010), we conclude
\[
E_P\big|\widetilde I_{a,2,i}^{(kl)}\big|^2 = \frac{1}{T_l^4}\sum_{t,r,t',r'\in S_l}E_P\big[h_i(\gamma_t,\gamma_r)h_i(\gamma_{t'},\gamma_{r'})\big] = o(T_l^{-1}) = o(T^{-1}).
\]
Therefore, by Markov's inequality, we have $\widetilde I_{a,2,i}^{(kl)} = o_P(T^{-1/2})$. Next, consider $\big|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\big|$.
Note that, conditional on $\{\gamma_t\}_{t\in S_l}$, $I_{a,2,i}^{(kl)}$ is i.i.d. over $i$. So we have
\[
E_P\Bigg[\bigg|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\bigg|^2\,\Bigg|\,\{\gamma_t\}_{t\in S_l}\Bigg] = \frac{1}{N_k^2}\sum_{i\in I_k}E_P\Big[\big|I_{a,2,i}^{(kl)}\big|^2\,\Big|\,\{\gamma_t\}_{t\in S_l}\Big] = \frac{1}{N_k}E_P\Big[\big|I_{a,2,i}^{(kl)}\big|^2\,\Big|\,\{\gamma_t\}_{t\in S_l}\Big].
\]
By the conditional Markov inequality, we have
\[
P\Bigg(\bigg|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\bigg| > \varepsilon\,\Bigg|\,\{\gamma_t\}_{t\in S_l}\Bigg) = O\bigg(\frac{1}{N_k}E_P\Big[\big|I_{a,2,i}^{(kl)}\big|^2\,\Big|\,\{\gamma_t\}_{t\in S_l}\Big]\bigg).
\]
By Minkowski's inequality for infinite sums, Jensen's inequality, and Hölder's inequality, we have
\[
\Big(E_P\big[\big|I_{a,2,i}^{(kl)}\big|^2\big]\Big)^{1/2} \lesssim \frac{1}{T_l^2}\sum_{t,r\in S_l}\Big(E_P\big[\psi_{it}^{(0)}\psi_{ir}^{(0)}\big]^2\Big)^{1/2} \le \frac{1}{T_l^2}\sum_{t,r\in S_l}\Big(E_P\big|\psi_{it}^{(0)}\big|^4\Big)^{1/2} \le c_m^2,
\]
where the last inequality follows from Assumption DML2(i). Then, by the law of iterated expectations, we have
\[
P\Bigg(\bigg|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\bigg| > \varepsilon\Bigg) = O\big(N_k^{-1}\big),
\]
and $\big|\frac{1}{N_k}\sum_{i\in I_k}I_{a,2,i}^{(kl)}\big| = O_P\big(N_k^{-1/2}\big) = O_P\big(N^{-1/2}\big)$. Therefore, we have shown $\big|I_{a,2}^{(kl)}\big| = O_P\big(N^{-1/2}\big) + o_P\big(T^{-1/2}\big)$.

Next, consider $I_{a,1}^{(kl)}$. By the product decomposition, the triangle inequality, and the Cauchy-Schwarz inequality, we have
\[
\big|I_{a,1}^{(kl)}\big| \le \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big|\widehat\psi_{it}^{(kl)}\widehat\psi_{ir}^{(kl)} - \psi_{it}^{(0)}\psi_{ir}^{(0)}\Big|
\le \frac{1}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}\Big\{\big|\widehat\psi_{it}^{(kl)}-\psi_{it}^{(0)}\big|\,\big|\widehat\psi_{ir}^{(kl)}-\psi_{ir}^{(0)}\big| + \big|\psi_{it}^{(0)}\big|\,\big|\widehat\psi_{ir}^{(kl)}-\psi_{ir}^{(0)}\big| + \big|\widehat\psi_{it}^{(kl)}-\psi_{it}^{(0)}\big|\,\big|\psi_{ir}^{(0)}\big|\Big\}
\]
\[
\lesssim R_{kl}\Big\{\big\|\psi_{it}^{(0)}\big\|_{kl,2} + R_{kl}\Big\},
\]
where $R_{kl} = \big\|\widehat\psi_{it}^{(kl)}-\psi_{it}^{(0)}\big\|_{kl,2}$. By Markov's inequality and Assumption DML2(i), we have
\[
E_P\Bigg[\frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\big(\psi_{it}^{(0)}\big)^2\Bigg] = E_P\big[|\psi(W_{it};\theta_0,\eta_0)|^2\big] \le c_m^2.
\]
Therefore, $\frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\big(\psi_{it}^{(0)}\big)^2 = O_P(1)$. To bound $R_{kl}$, note that by Assumption DML1(i) (linearity) we have
\[
R_{kl}^2 = \frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\Big(\psi^a(W_{it};\widehat\eta_{kl})(\widehat\theta-\theta_0) + \psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0)\Big)^2
\]
\[
\lesssim \frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\big|\psi^a(W_{it};\widehat\eta_{kl})\big|^2\,\big|\widehat\theta-\theta_0\big|^2 + \frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\big|\psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0)\big|^2.
\]
By Markov's inequality and Assumption DML2(i), we have $\frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}|\psi^a(W_{it};\widehat\eta_{kl})|^2 = O_P(1)$.
By Theorem 3.2, $|\widehat\theta-\theta_0|^2 = O_P(N^{-1})$. Therefore, the first term on the right-hand side is $O_P(N^{-1})$. For the second term on the right-hand side, consider its conditional expectation given the auxiliary sample $W(-k,-l)$ on the event $\mathcal{E}_\eta\cap\mathcal{E}^c_p$:
$$E_P\Big[\big|\psi(W_{it};\theta_0,\widehat\eta_{kl}) - \psi(W_{it};\theta_0,\eta_0)\big|^2\,\Big|\,W(-k,-l)\Big] \le \delta^2_{NT},$$
where the inequality follows from Assumption DML2(ii). Then, by the Markov inequality and Lemma 6.1 from Chernozhukov et al. (2018a), we have $R^2_{kl} = O_P\big(N^{-1}+(r'_{NT})^2\big)$ and so $\big|I^{kl}_{a,1}\big| = O_P\big(N^{-1/2}+r'_{NT}\big)$. To summarize, we have shown
$$\big|\widehat\Omega_{a,kl} - \Sigma_a\big| = O_P\big(N^{-1/2}+r'_{NT}\big) + O_P\big(N^{-1/2}\big) + o_P\big(T^{-1/2}\big) + O\big(T^{-2}\big) = O_P\big(N^{-1/2}+r'_{NT}\big).$$

Proof of Claim B.4. By the triangle inequality, we have
$$\big|\widehat\Omega_{b,kl} - cE_P[g_tg_t']\big| \le \big|I^{(kl)}_{b,1}\big| + \big|I^{(kl)}_{b,2}\big| + \big|I^{(kl)}_{b,3}\big|,$$
where
$$\begin{aligned}
I^{(kl)}_{b,1} &:= \frac{K/L}{N_kT_l^2}\sum_{t\in S_l,\,i\in I_k,\,j\in I_k}\Big\{\widehat\psi^{(kl)}_{it}\widehat\psi^{(kl)}_{jt} - \psi^{(0)}_{it}\psi^{(0)}_{jt}\Big\},\\
I^{(kl)}_{b,2} &:= \frac{K/L}{N_kT_l^2}\sum_{t\in S_l,\,i\in I_k,\,j\in I_k}\Big\{\psi^{(0)}_{it}\psi^{(0)}_{jt} - E_P[\psi^{(0)}_{it}\psi^{(0)}_{jt}]\Big\},\\
I^{(kl)}_{b,3} &:= \frac{K/L}{N_kT_l^2}\sum_{t\in S_l,\,i\in I_k,\,j\in I_k} E_P[\psi^{(0)}_{it}\psi^{(0)}_{jt}] - cE_P[g_tg_t'],
\end{aligned}$$
and $\frac{K/L}{N_kT_l^2}\,N_k^2T_l = c$.

Consider $I^{(kl)}_{b,3}$. By the law of total covariance, we have
$$E_P[\psi^{(0)}_{it}\psi^{(0)}_{jt}] = \mathrm{cov}\big(\psi^{(0)}_{it},\psi^{(0)}_{jt}\big) = E_P\big[\mathrm{cov}\big(\psi^{(0)}_{it},\psi^{(0)}_{jt}\,\big|\,\gamma_t\big)\big] + \mathrm{cov}\big(E_P[\psi^{(0)}_{it}\,|\,\gamma_t],\,E_P[\psi^{(0)}_{jt}\,|\,\gamma_t]\big) = 0 + E_P[g_tg_t'].$$
Due to the identical distribution of $\gamma_t$, $E_P[g_tg_t']$ does not vary over $t$, and so $I^{(kl)}_{b,3} = 0$.

To bound $I^{(kl)}_{b,2}$, we can rewrite it by the triangle inequality as follows:
$$\frac1c\big|I^{kl}_{b,2}\big| \le \Bigg|\frac{1}{T_l}\sum_{t\in S_l} I^{(kl)}_{b,2,t}\Bigg| + \Bigg|\frac{1}{T_l}\sum_{t\in S_l} \widetilde I^{(kl)}_{b,2,t}\Bigg|,$$
where
$$\begin{aligned}
I^{(kl)}_{b,2,t} &:= \frac{1}{N_k^2}\sum_{i,j\in I_k}\Big\{\psi^{(0)}_{it}\psi^{(0)}_{jt} - E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{jt}\,\big|\,\{\alpha_i\}_{i\in I_k}\big]\Big\},\\
\widetilde I^{(kl)}_{b,2,t} &:= \frac{1}{N_k^2}\sum_{i,j\in I_k}\Big\{E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{jt}\,\big|\,\{\alpha_i\}_{i\in I_k}\big] - E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{jt}\big]\Big\}.
\end{aligned}$$
Due to the identical distribution of $\gamma_t$, $\widetilde I^{(kl)}_{b,2,t}$ does not vary over $t$, so that $E_P\big|\frac{1}{T_l}\sum_{t\in S_l}\widetilde I^{(kl)}_{b,2,t}\big|^2 = E_P\big|\widetilde I^{(kl)}_{b,2,t}\big|^2$. Denote $\zeta_{ij,t} = \psi^{(0)}_{it}\psi^{(0)}_{jt}$. By direct calculation, we have
$$E_P\big|\widetilde I^{(kl)}_{b,2,t}\big|^2 = \frac{1}{N_k^4}\sum_{i,j\in I_k}\sum_{i',j'\in I_k} E_P\Big[\big(E_P[\zeta_{ij,t}|\alpha_i,\alpha_j]-E_P[\zeta_{ij,t}]\big)\big(E_P[\zeta_{i'j',t}|\alpha_{i'},\alpha_{j'}]-E_P[\zeta_{i'j',t}]\big)\Big] \lesssim \frac{1}{N_k}E_P\big[\zeta_{ij,t}^2\big] < \frac{1}{N_k}E_P\big[(\psi^{(0)}_{it})^4\big] = O(1/N_k),$$
where the first inequality follows from the assumption that $\alpha_i$ is independent over $i$ together with an application of Hölder's inequality and Jensen's inequality; the second inequality follows from Hölder's inequality; and the last equality follows from Assumption DML2(i) with some $q>4$. Therefore, by the Markov inequality, we have $\big|\frac{1}{T_l}\sum_{t\in S_l}\widetilde I^{(kl)}_{b,2,t}\big| = O_P(N_k^{-1/2}) = O_P(N^{-1/2})$.
Now consider $\big|\frac{1}{T_l}\sum_{t\in S_l} I^{(kl)}_{b,2,t}\big|$. Note that conditional on $\{\alpha_i\}$, $I^{(kl)}_{b,2,t}$ is also $\beta$-mixing, with the same mixing coefficient as $\gamma_t$. Then, with an application of the conditional version of Theorem 14.2 from Davidson (1994), we have
$$\left(E_P\Big[\big|E_P[I^{(kl)}_{b,2,t}\,|\,\{\alpha_i\}_{i\in I_k},\mathcal{F}^{t-l}_{-\infty}]\big|^2\,\Big|\,\{\alpha_i\}_{i\in I_k}\Big]\right)^{1/2} \le 2\big(2^{1/2}+1\big)\,\beta(l)^{\frac12-\frac2q}\left(E_P\Big[\big|I^{(kl)}_{b,2,t}\big|^{\frac q2}\,\Big|\,\{\alpha_i\}_{i\in I_k}\Big]\right)^{\frac2q}.$$
Then, we can apply the conditional version of Lemma A from Hansen (1992) to show that
$$\left(E_P\Bigg[\bigg|\frac{1}{T_l}\sum_{t\in S_l} I^{(kl)}_{b,2,t}\bigg|^2\,\Bigg|\,\{\alpha_i\}_{i\in I_k}\Bigg]\right)^{1/2} \lesssim \frac{1}{T_l}\sum_{l=1}^{\infty}\beta(l)^{\frac12-\frac2q}\left(\sum_{t\in S_l}\Big(E_P\Big[\big|I^{(kl)}_{b,2,t}\big|^{\frac q2}\,\Big|\,\{\alpha_i\}_{i\in I_k}\Big]\Big)^{\frac4q}\right)^{1/2} \lesssim \frac{1}{\sqrt{T_l}}\left(E_P\Big[\big|I^{(kl)}_{b,2,t}\big|^{\frac q2}\,\Big|\,\{\alpha_i\}_{i\in I_k}\Big]\right)^{\frac2q}.$$
By the conditional Markov inequality, we have
$$P\left(\bigg|\frac{1}{T_l}\sum_{t\in S_l} I^{(kl)}_{b,2,t}\bigg| > \varepsilon\,\Bigg|\,\{\alpha_i\}_{i\in I_k}\right) = O\left(T_l^{-1}\,E_P\Big[\big|I^{(kl)}_{b,2,t}\big|^{\frac q2}\,\Big|\,\{\alpha_i\}_{i\in I_k}\Big]\right).$$
By Minkowski's inequality for infinite sums, Jensen's inequality, and Hölder's inequality, we have
$$\left(E_P\Big[\big|I^{(kl)}_{b,2,t}\big|^{\frac q2}\Big]\right)^{\frac2q} \lesssim \frac{1}{N_k^2}\sum_{i,j\in I_k}\left(E_P\Big[\big(\psi^{(0)}_{it}\psi^{(0)}_{jt}\big)^{\frac q2}\Big]\right)^{\frac2q} \le \frac{1}{N_k^2}\sum_{i,j\in I_k}\Big(E_P\big[(\psi^{(0)}_{it})^q\big]\Big)^{\frac2q} \le c_m^2,$$
where the last inequality follows from Assumption DML2(i). Then, by the law of iterated expectations, we have
$$P\left(\bigg|\frac{1}{T_l}\sum_{t\in S_l} I^{(kl)}_{b,2,t}\bigg| > \varepsilon\right) = O\big(T_l^{-1/2}\big).$$
Therefore, we have shown $\big|I^{kl}_{b,2}\big| = O_P\big(N_k^{-1/2}\big) + O_P\big(T_l^{-1/2}\big)$.

Consider $I^{kl}_{b,1}$. By the same type of inequality as for $\big|I^{kl}_{a,1}\big|$, we have
$$\frac1c\big|I^{kl}_{b,1}\big| \lesssim R_{kl}\left\{\Bigg(\frac{1}{N_kT_l}\sum_{i\in I_k,\,t\in S_l}\big(\psi^{(0)}_{it}\big)^2\Bigg)^{1/2} + R_{kl}\right\},$$
where $R_{kl} = \big\|\widehat\psi^{(kl)}_{it}-\psi^{(0)}_{it}\big\|_{kl,2}$. We have shown in the proof of Claim B.3 that $\big\|\psi^{(0)}_{it}\big\|_{kl,2} = O_P(1)$ and $R^2_{kl} = O_P\big(N^{-1}+(r'_{NT})^2\big)$. So $\big|I^{kl}_{b,1}\big| = O_P\big(N^{-1/2}+r'_{NT}\big)$. To summarize,
$$\big|\widehat\Omega_{b,kl} - cE_P[g_tg_t']\big| = O_P\big(N^{-1/2}\big) + O_P\big(T^{-1/2}\big) + O_P\big(N^{-1/2}+r'_{NT}\big) = O_P\big(N^{-1/2}+r'_{NT}\big),$$
which completes the proof of Claim B.4.

Proof of Claim B.5.
By the triangle inequality, we have
$$\big|\widehat\Omega_{c,kl}\big| \le \big|I^{(kl)}_{c,1}\big| + \big|I^{(kl)}_{c,2}\big| + \big|I^{(kl)}_{c,3}\big|,$$
where
$$\begin{aligned}
I^{(kl)}_{c,1} &:= \frac{K/L}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l}\Big\{\widehat\psi^{(kl)}_{it}\widehat\psi^{(kl)}_{it} - \psi^{(0)}_{it}\psi^{(0)}_{it}\Big\},\\
I^{(kl)}_{c,2} &:= \frac{K/L}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l}\Big\{\psi^{(0)}_{it}\psi^{(0)}_{it} - E_P[\psi^{(0)}_{it}\psi^{(0)}_{it}]\Big\},\\
I^{(kl)}_{c,3} &:= \frac{K/L}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l} E_P[\psi^{(0)}_{it}\psi^{(0)}_{it}].
\end{aligned}$$
Consider $I^{(kl)}_{c,3}$. Note that under Assumption DML2(i), we have $E_P[\psi^{(0)}_{it}\psi^{(0)}_{it}] \le c_m^2$. Thus, $I^{(kl)}_{c,3} = O(1/T_l) = O(T^{-1})$.

Consider $I^{kl}_{c,2}$. We denote $\xi_{it} = \psi^{(0)}_{it}\psi^{(0)}_{it} - E_P[\psi^{(0)}_{it}\psi^{(0)}_{it}]$. By expanding $E\big|I^{kl}_{c,2}\big|^2$ and applying Hölder's inequality to the cross terms, we have
$$E\big|I^{kl}_{c,2}\big|^2 \le \left(\frac{K/L}{N_kT_l^2}\right)^2\left\{\sum_{t\in S_l,\,i\in I_k,\,j\in I_k} E_P|\xi_{it}|^2 + 2\sum_{m=1}^{T_l-1}\;\sum_{t=\min(S_l)}^{\max(S_l)-m}\;\sum_{i,j\in I_k} E_P|\xi_{it}|^2\right\}.$$
Note that for each $(i,t)$, by Hölder's inequality and Assumption DML2(i), we have $E_P|\xi_{it}|^2 \lesssim E_P\big[\psi(W_{it};\theta_0,\eta_0)^4\big] \le c_m^4$. Thus, $E\big|I^{(kl)}_{c,2}\big|^2 = O(T^{-2})$ and so $I^{(kl)}_{c,2} = O_P(T^{-1})$.

Now consider $I^{(kl)}_{c,1}$. Following the same steps as for $I^{(kl)}_{b,1}$, we have
$$\big|I^{(kl)}_{c,1}\big| \lesssim \frac{K/L}{T_l}\,R_{kl}\Big\{\big\|\psi^{(0)}_{it}\big\|_{kl,2} + R_{kl}\Big\},$$
where $R_{kl} = \big\|\widehat\psi^{(kl)}_{it}-\psi^{(0)}_{it}\big\|_{kl,2}$. We have shown in the proof of Claim B.3 that $\big\|\psi^{(0)}_{it}\big\|_{kl,2} = O_P(1)$ and $R^2_{kl} = O_P\big(N^{-1}+(r'_{NT})^2\big)$. So, $\big|I^{(kl)}_{c,1}\big| = O_P\big(N^{-1/2}/T + r'_{NT}/T\big)$. To summarize,
$$\big|\widehat\Omega_{c,kl}\big| = O_P\big(T^{-1}\big) + O_P\big(N^{-1/2}/T + r'_{NT}/T\big) = O_P\big(T^{-1}\big),$$
which completes the proof of Claim B.5.

Proof of Claim B.6.
By the triangle inequality, we have
$$\Bigg|\widehat\Omega_{d,kl} - c\sum_{m=1}^{\infty}E_P[g_tg'_{t+m}]\Bigg| \le \big|I^{(kl)}_{d,1}\big| + \big|I^{(kl)}_{d,2}\big| + \big|I^{(kl)}_{d,3}\big| + \big|I^{(kl)}_{d,4}\big| + \big|I^{(kl)}_{d,5}\big| + \big|I^{(kl)}_{d,6}\big|,$$
where
$$\begin{aligned}
I^{(kl)}_{d,1} &:= \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i}\Big\{\widehat\psi^{(kl)}_{it}\widehat\psi^{(kl)}_{j,t+m} - \psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\Big\},\\
I^{(kl)}_{d,2} &:= \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i}\Big\{\psi^{(0)}_{it}\psi^{(0)}_{j,t+m} - E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big]\Big\},\\
I^{(kl)}_{d,3} &:= \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}\Big(k\Big(\frac mM\Big)-1\Big)\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i} E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big],\\
I^{(kl)}_{d,4} &:= \frac{K/L}{N_kT_l^2}\sum_{m=M}^{\infty}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i} E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big],\\
I^{(kl)}_{d,5} &:= \frac{K/L}{N_kT_l^2}\sum_{m=1}^{\infty}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i} E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big] - c\sum_{m=1}^{\infty}E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big],\\
I^{(kl)}_{d,6} &:= c\sum_{m=1}^{\infty}E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big] - c\sum_{m=1}^{\infty}E_P[g_tg_{t+m}],
\end{aligned}$$
and $\frac{K/L}{N_kT_l^2}\,N_k^2T_l = c$.

Consider $I^{(kl)}_{d,6}$. By the law of total covariance, we have
$$E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big] = \mathrm{cov}\big(\psi^{(0)}_{it},\psi^{(0)}_{j,t+m}\big) = E_P\big[\mathrm{cov}\big(\psi^{(0)}_{it},\psi^{(0)}_{j,t+m}\,\big|\,\gamma_t,\gamma_{t+m}\big)\big] + \mathrm{cov}\big(E_P[\psi^{(0)}_{it}\,|\,\gamma_t],\,E_P[\psi^{(0)}_{j,t+m}\,|\,\gamma_{t+m}]\big) = 0 + E_P[g_tg'_{t+m}],$$
where the last equality follows from the properties of the Hajek projection components, as discussed in the beginning of Appendix A. Therefore, $I^{(kl)}_{d,6} = 0$.

Consider $I^{(kl)}_{d,5}$. The strict stationarity of $\gamma_t$ implies that $\psi^{(0)}_{it}$ is also strictly stationary over $t$, and under Assumption 3.1 there is no heterogeneity across $i$. Then, as $M, T\to\infty$, we have $I^{(kl)}_{d,5} = o(1)$.

Consider $I^{(kl)}_{d,4}$. Under Assumption DML2(i), $\big(E_P|\psi^{(0)}_{it}|^q\big)^{1/q} \le c_m$ for $q>4$. And conditional on $\alpha_i$, $\psi^{(0)}_{it}$ is $\beta$-mixing with a mixing coefficient not larger than that of $\gamma_t$. Then, by Theorem 14.13(ii) in Hansen (2022), we have
$$\Big|E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\,\big|\,\{\alpha_i\}_{i\in I_k}\big]\Big| \le 8\Big(E_P\big[|\psi^{(0)}_{it}|^q\,\big|\,\alpha_i\big]\Big)^{1/q}\Big(E_P\big[|\psi^{(0)}_{j,t+m}|^q\,\big|\,\alpha_j\big]\Big)^{1/q}\alpha_\gamma(m)^{1-2/q}.$$
By iterated expectations and Jensen's inequality, we have
$$\begin{aligned}
\Big|E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big]\Big| &\le E_P\Big[\Big|E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\,\big|\,\{\alpha_i\}_{i\in I_k}\big]\Big|\Big]\\
&\le 8\,E_P\Big[\big(E_P[|\psi^{(0)}_{it}|^q\,|\,\alpha_i]\big)^{1/q}\big(E_P[|\psi^{(0)}_{j,t+m}|^q\,|\,\alpha_j]\big)^{1/q}\Big]\,\alpha_\gamma(m)^{1-2/q}\\
&\le 8\,E_P\Big[\big(E_P[|\psi^{(0)}_{it}|^q\,|\,\alpha_i]\big)^{1/q}\Big]\,E_P\Big[\big(E_P[|\psi^{(0)}_{j,t+m}|^q\,|\,\alpha_j]\big)^{1/q}\Big]\,\alpha_\gamma(m)^{1-2/q}\\
&\lesssim c_m^2\,\alpha_\gamma(m)^{1-2/q},
\end{aligned}$$
where the third inequality follows from the independence of the $\alpha_i$ over $i$.
Then, as $M\to\infty$,
$$\big|I^{(kl)}_{d,4}\big| \le \frac{K/L}{N_kT_l^2}\sum_{m=M}^{\infty}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i}\Big|E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big]\Big| \lesssim \sum_{m=M}^{\infty}\alpha_\gamma(m)^{1-2/q} \le \sum_{m=M}^{\infty}\beta_\gamma(m)^{1-2/q} \le c_\kappa\sum_{m=M}^{\infty}\exp(-\kappa m) = c_\kappa\left(\frac{1}{1-e^{-\kappa}} - \frac{1-e^{-\kappa M}}{1-e^{-\kappa}}\right) = O\big(e^{-\kappa M}\big).$$

Consider $I^{(kl)}_{d,3}$:
$$\big|I^{(kl)}_{d,3}\big| \le \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}\Big|k\Big(\frac mM\Big)-1\Big|\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i}\Big|E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big]\Big| \le c\,c_m^2\sum_{m=1}^{M-1}\Big|k\Big(\frac mM\Big)-1\Big|\,\alpha_\gamma(m)^{1-2/q}.$$
Note that for each $m$, $\big|k\big(\frac mM\big)-1\big|\to0$ as $M\to\infty$. Since $\big|k\big(\frac mM\big)-1\big| \le 1$ and $\alpha_\gamma(m)^{1-2/q}$ is summable, we can apply the dominated convergence theorem to conclude that $I^{(kl)}_{d,3} = o(1)$.

To bound $I^{(kl)}_{d,2}$, we can rewrite it by the triangle inequality as follows:
$$\frac1c\big|I^{(kl)}_{d,2}\big| \le \Bigg|\sum_{m=1}^{M-1}\frac{k\big(\frac mM\big)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} I^{(kl)}_{d,2,tm}\Bigg| + \Bigg|\sum_{m=1}^{M-1}\frac{k\big(\frac mM\big)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} \widetilde I^{(kl)}_{d,2,tm}\Bigg|,$$
where
$$\begin{aligned}
I^{(kl)}_{d,2,tm} &:= \frac{1}{N_k^2}\sum_{i,j\in I_k,\,i\ne j}\Big\{\psi^{(0)}_{it}\psi^{(0)}_{j,t+m} - E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\,\big|\,\{\alpha_i\}_{i\in I_k}\big]\Big\},\\
\widetilde I^{(kl)}_{d,2,tm} &:= \frac{1}{N_k^2}\sum_{i,j\in I_k,\,i\ne j}\Big\{E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\,\big|\,\{\alpha_i\}_{i\in I_k}\big] - E_P\big[\psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\big]\Big\}.
\end{aligned}$$
Due to the identical distribution of $\gamma_t$, $\widetilde I^{(kl)}_{d,2,tm}$ does not vary over $t$, so that
$$E_P\Bigg|\sum_{m=1}^{M-1}\frac{k\big(\frac mM\big)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\widetilde I^{(kl)}_{d,2,tm}\Bigg|^2 \le E_P\Bigg|\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\widetilde I^{(kl)}_{d,2,tm}\Bigg|^2.$$
And by Minkowski's inequality, we have
$$\left(E_P\Bigg|\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\widetilde I^{(kl)}_{d,2,tm}\Bigg|^2\right)^{1/2} \le \sum_{m=1}^{M-1}k\Big(\frac mM\Big)\Big(E_P\big[\big(\widetilde I^{(kl)}_{d,2,tm}\big)^2\big]\Big)^{1/2}.$$
Denote $\zeta_{ijm} = \psi^{(0)}_{it}\psi^{(0)}_{j,t+m}$. By direct calculation, we have
$$E_P\big|\widetilde I^{(kl)}_{d,2,tm}\big|^2 = \frac{1}{N_k^4}\sum_{i,j\in I_k}\sum_{i',j'\in I_k}E_P\Big[\big(E_P[\zeta_{ijm}|\alpha_i,\alpha_j]-E_P[\zeta_{ijm}]\big)\big(E_P[\zeta_{i'j'm}|\alpha_{i'},\alpha_{j'}]-E_P[\zeta_{i'j'm}]\big)\Big] \lesssim \frac{1}{N_k}E_P\big[\zeta_{ijm}^2\big] < \frac{1}{N_k}E_P\big[(\psi^{(0)}_{it})^4\big] = O(1/N_k),$$
where the first inequality follows from the assumption that $\alpha_i$ is independent over $i$ together with an application of Hölder's inequality and Jensen's inequality; the second inequality follows from Hölder's inequality; and the last equality follows from Assumption DML2(i) with some $q>4$. Therefore, we have
$$\left(E_P\Bigg|\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\widetilde I^{(kl)}_{d,2,tm}\Bigg|^2\right)^{1/2} = O\Big(\frac{M}{N^{1/2}}\Big) = O\Big(\frac{M}{T^{1/2}}\Big).$$
By the Markov inequality, we have $\Big|\sum_{m=1}^{M-1}\frac{k(m/M)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\widetilde I^{(kl)}_{d,2,tm}\Big| = O_P\big(MT^{-1/2}\big)$.

Now consider $\Big|\sum_{m=1}^{M-1}\frac{k(m/M)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} I^{(kl)}_{d,2,tm}\Big|$. By Minkowski's inequality, we have
$$\left(E_P\Bigg|\sum_{m=1}^{M-1}\frac{k\big(\frac mM\big)}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} I^{(kl)}_{d,2,tm}\Bigg|^2\right)^{1/2} \le \sum_{m=1}^{M-1}k\Big(\frac mM\Big)\left(E_P\Bigg|\frac{1}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} I^{(kl)}_{d,2,tm}\Bigg|^2\right)^{1/2}.$$
Following the same steps as for $I^{(kl)}_{b,2,t}$, we can show
$$E_P\Bigg|\frac{1}{T_l}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m} I^{(kl)}_{d,2,tm}\Bigg|^2 = O\big(T_l^{-1}\big).$$
Therefore, $\Big|\sum_{m=1}^{M-1}\frac{k(m/M)}{T_l}\sum_t I^{(kl)}_{d,2,tm}\Big| = O_P\big(MT_l^{-1/2}\big) = O_P\big(MT^{-1/2}\big)$. We have shown $\big|I^{(kl)}_{d,2}\big| = O_P(1/N_k) + O_P\big(MT^{-1/2}\big) = O_P\big(MT^{-1/2}\big)$.

Consider $I^{(kl)}_{d,1}$. Denote
$$I^{(kl)}_{d,1,m} = \frac{K/L}{N_kT_l^2}\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k,\,j\ne i}\Big\{\widehat\psi^{(kl)}_{it}\widehat\psi^{(kl)}_{j,t+m} - \psi^{(0)}_{it}\psi^{(0)}_{j,t+m}\Big\}$$
for each $m$. Then, $I^{(kl)}_{d,1} = \sum_{m=1}^{M-1}k\big(\frac mM\big)I^{(kl)}_{d,1,m}$. Following the same steps as for $I^{(kl)}_{a,1}$, we can show $\big|I^{(kl)}_{d,1,m}\big| = O_P\big(T^{-1/2}+r'_{NT}\big)$ for each $m$. Therefore, $\big|I^{(kl)}_{d,1}\big| = O_P\big(MT^{-1/2}+Mr'_{NT}\big)$, where
$$Mr'_{NT} \le M\delta_{NT}N^{-1/2} = \frac{M}{T^{1/2}}\cdot\frac{T^{1/2}}{N^{1/2}}\,\delta_{NT} = o(1).$$
To summarize,
$$\Bigg|\widehat\Omega_{d,kl} - c\sum_{m=1}^{\infty}E_P[g_tg'_{t+m}]\Bigg| = O_P\big(MT^{-1/2}+Mr'_{NT}\big) + O_P\big(MT^{-1/2}\big) + o(1) + O\big(e^{-\kappa M}\big) + o(1) + 0 = o_P(1),$$
which completes the proof of Claim B.6. □

Proof of Theorem 3.4 Since $(K,L)$ are fixed constants, it suffices to show for each $(k,l)$ that
$$\widehat\Omega_{NW,kl} := \frac{K/L}{N_kT_l^2}\sum_{i\in I_k,\,t\in S_l,\,r\in S_l}k\left(\frac{|t-r|}{M}\right)\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\,\psi(W_{ir};\widehat\theta,\widehat\eta_{kl})' = o_P(1).$$
Note that we can rewrite $\widehat\Omega_{NW,kl}$ as $\widehat\Omega_{NW,kl} = \widehat\Omega_{c,kl} + \widehat\Omega_{e,kl} - \widehat\Omega_{d,kl}$, where $\widehat\Omega_{c,kl}$ and $\widehat\Omega_{d,kl}$ are defined in the beginning of the proof of Theorem 3.3, and $\widehat\Omega_{e,kl}$ is defined as follows:
$$\widehat\Omega_{e,kl} := \frac{K/L}{N_kT_l^2}\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\sum_{t=\lfloor S_l\rfloor}^{\lceil S_l\rceil-m}\;\sum_{i\in I_k,\,j\in I_k}\psi(W_{it};\widehat\theta,\widehat\eta_{kl})\,\psi(W_{j,t+m};\widehat\theta,\widehat\eta_{kl})'.$$
Observe that, replacing $\widehat\Omega_{d,kl}$ by $\widehat\Omega_{e,kl}$, each step in the proof of Claim B.6 goes through. It implies that $\widehat\Omega_{e,kl} = \widehat\Omega_{d,kl} + o_P(1)$. By Lemma A.6, we have $\widehat\Omega_{c,kl} = O_P(T^{-1})$. Therefore, we conclude that $\widehat\Omega_{NW,kl} = o_P(1)$. □
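To make the object analyzed in Theorem 3.4 concrete, the following sketch computes a sample version of the kernel-weighted term $\widehat\Omega_{NW}$ for a scalar score, without the sample splitting used in the formal construction. It is our own illustration: the array name `psi_hat`, the Bartlett kernel, and the bandwidth `M` are assumptions made for the example.

    import numpy as np

    def omega_nw(psi_hat, M):
        """Illustrative kernel-weighted term (1 / (N T^2)) * sum_{i,t,r} k(|t-r|/M) psi_it psi_ir
        for a scalar score stored in an N x T array; Bartlett kernel k(x) = max(1 - |x|, 0)."""
        N, T = psi_hat.shape
        total = 0.0
        for m in range(-(M - 1), M):          # lags with nonzero Bartlett weight
            w = 1.0 - abs(m) / M              # kernel weight k(m/M)
            lag = abs(m)
            # sum over i and t of psi_{i,t} * psi_{i,t+lag}
            total += w * np.sum(psi_hat[:, : T - lag] * psi_hat[:, lag:])
        return total / (N * T**2)

    # toy usage: with serially independent scores this term vanishes as T grows
    rng = np.random.default_rng(0)
    psi = rng.normal(size=(50, 200))
    print(omega_nw(psi, M=10))

The loop over signed lags counts each pair $(t, r)$ with $|t-r| = m \ge 1$ twice, once for $+m$ and once for $-m$, which matches the double sum over $(t, r)$ in the definition.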
APPENDIX 3C

PROOFS FOR CHAPTER 3.4

Proof of Theorem 3.5 Let $P\in\mathcal{P}_{NT}$ for each $(N,T)$. We denote
$$A_{NT} = \frac{1}{NT}(V^Z)'V^D, \qquad \widehat A_{NT} = \frac{1}{NT}(Z-f\widehat\zeta)'(D-f\widehat\pi),$$
$$\psi_{NT} = \frac{1}{NT}(V^Z)'V^g, \qquad \widehat\psi_{NT} = \frac{1}{NT}(Z-f\widehat\zeta)'\Big(Y-f\widehat\beta-(D-f\widehat\pi)\theta_0\Big).$$
We can write $\widehat\theta-\theta_0 = \widehat A_{NT}^{-1}\widehat\psi_{NT}$. By product decomposition, we have
$$\widehat\theta-\theta_0 = A_{NT}^{-1}\psi_{NT} + A_{NT}^{-1}\big[\widehat\psi_{NT}-\psi_{NT}\big] + \big[\widehat A_{NT}^{-1}-A_{NT}^{-1}\big]\big[\widehat\psi_{NT}-\psi_{NT}\big] + \big[\widehat A_{NT}^{-1}-A_{NT}^{-1}\big]\psi_{NT}.$$
For the asymptotic normality of $\sqrt{N\wedge T}\,\big(\widehat\theta-\theta_0\big)$, we need to show the following statements: (i) $A_{NT}\overset{p}{\to}A_0 = E[V^Z_{it}V^D_{it}]$; (ii) $\sqrt{N\wedge T}\,\psi_{NT}\overset{d}{\to}\mathcal N(0,\Omega_0)$; (iii) $\sqrt{N\wedge T}\,\big[\widehat\psi_{NT}-\psi_{NT}\big] = o_P(1)$; (iv) $\widehat A_{NT}-A_{NT} = o_P(1)$. With Statements (i)-(iv) and the identification condition in Assumption REG-P(i) that $A_0$ is non-singular, $\sqrt{N\wedge T}\,\big(\widehat\theta-\theta_0\big)\overset{d}{\to}\mathcal N\big(0,\,A_0^{-1}\Omega_0A_0^{-1\prime}\big)$. Then, the conclusion of the theorem follows.

Before we show Statements (i)-(iv), we note that Assumptions REG-P(ii) and AHK imply that $(\bar F_i,\bar F_t)$ are functions of only $(\alpha_i,\gamma_t,\varepsilon_{it})$, and so are $f_{it}$ and $V^l_{it}$ for $l = g, D, Y, Z$. Therefore, the results based on the Hajek projection are still applicable. Also, due to Assumption REG-P(ii), $\bar F_i$ is a function of only $(c_i,\epsilon_i)$ and $\bar F_t$ is a function of only $(d_t,\epsilon_t)$, so $f_{it}$ is a function of $(X_{it},c_i,\epsilon_i,d_t,\epsilon_t)$, which are mean independent of $U^D_{it}$. Therefore, $E_P[f_{it}V^D_{it}] = E_P\big[f_{it}\big[(L_{2,it}-E[L_{2,it}])\eta_{D,2}+U^D_{it}\big]\big] = 0$, given that $f_{it}$ is uncorrelated with $L_{2,it}$ as discussed in the main text; similar arguments apply to $V^l_{it}$ for the other $l$.

Statement (i) follows from Lemma A.1 under Assumptions AHK, AR, and REG-P(iii). For Statement (ii), we first observe that $V^Z_{it} = Z_{it}(1-\zeta_0)$, where $\zeta_0 = \big(E[f'_{it}f_{it}]\big)^{-1}E[f'_{it}Z_{it}]$. Due to the exogeneity condition $E_P[Z_{it}U^g_{it}] = 0$ and the independence between $(\bar F_i,\bar F_t,Z_{it},X_{it})$ and $(\epsilon_i,\epsilon_t)$, we have $E_P[V^Z_{it}V^g_{it}] = 0$. With the additional Assumption REG-P(iv), Statement (ii) follows from Lemma A.2.

Consider Statement (iii). By product decomposition and the triangle inequality, we have
$$\begin{aligned}
NT\,\big|\widehat\psi_{NT}-\psi_{NT}\big| &\le \Big|\big(f(\zeta_0-\widehat\zeta)\big)'\Big(f(\beta_0-\widehat\beta)+V^Y+r^Y-\theta_0\big(f(\pi_0-\widehat\pi)+V^D+r^D\big)\Big)\Big| + \Big|(Z-f\zeta_0)'\Big(\theta_0f(\widehat\pi-\pi_0)-f(\beta_0-\widehat\beta)+r^g\Big)\Big|\\
&\lesssim \big|(f(\zeta_0-\widehat\zeta))'f(\beta_0-\widehat\beta)\big| + \big|(f(\zeta_0-\widehat\zeta))'V^Y\big| + \big|(f(\zeta_0-\widehat\zeta))'r^Y\big| + \big|(f(\zeta_0-\widehat\zeta))'f(\pi_0-\widehat\pi)\big| + \big|(f(\zeta_0-\widehat\zeta))'V^D\big| + \big|(f(\zeta_0-\widehat\zeta))'r^D\big|\\
&\quad + \big|(V^Z)'f(\widehat\pi-\pi_0)\big| + \big|(V^Z)'f(\beta_0-\widehat\beta)\big| + \big|(V^Z)'r^g\big|. \tag{3C.1}
\end{aligned}$$
Under Assumptions AHK and AR, the sparse approximation conditions, as well as Assumptions REG-P(ii)-(vii), we can apply Theorem 3.1 with the penalty level $\lambda = \frac{6c_{1NT}}{\sqrt{N\wedge T}}\Phi^{-1}(1-\gamma/2p)$ to obtain that
$$\big\|f_{it}(\eta_0-\widehat\eta)\big\|_{NT,2} = O_P\left(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\right), \qquad \|\eta_0-\widehat\eta\|_1 = O_P\left(s\sqrt{\frac{\log(p/\gamma)}{N\wedge T}}\right)$$
for $\eta = \zeta,\pi,\beta$ and $l = Z, D, Y$. This choice of $\lambda$ is valid because $\max_{j\le p}\big|\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}f_{it,j}V^l_{it}\big| = O_P\big(\Phi^{-1}(1-\gamma/2p)/\sqrt{N\wedge T}\big)$, and by Lemma A.2, $\omega_{j,l}\overset{p}{\to}\Sigma_{a,j,l}+\frac{N\wedge T}{T}\Sigma_{g,j,l}$, where $\min_{j\le p}\omega_{j,l} > 0$ by Assumption REG-P(iv) and Lemma A.1.
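To see why these rates suffice for Statement (iii), note the following arithmetic, which is a restatement of the bounds above rather than an additional assumption:
$$\sqrt{N\wedge T}\,\big\|f_{it}(\zeta_0-\widehat\zeta)\big\|_{NT,2}\,\big\|f_{it}(\beta_0-\widehat\beta)\big\|_{NT,2} = O_P\left(\sqrt{N\wedge T}\cdot\frac{s\log(p/\gamma)}{N\wedge T}\right) = O_P\left(\frac{s\log(p/\gamma)}{\sqrt{N\wedge T}}\right),$$
which is $o_P(1)$ exactly when $s = o\big(\sqrt{N\wedge T}/\log(p/\gamma)\big)$, the sparsity condition invoked at the end of the argument below.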
Consider the first term in (3C.1). By the Cauchy-Schwarz inequality, we have
$$\frac{\sqrt{N\wedge T}}{NT}\big|(f(\zeta_0-\widehat\zeta))'f(\beta_0-\widehat\beta)\big| \le \sqrt{N\wedge T}\,\big\|f_{it}(\zeta_0-\widehat\zeta)\big\|_{NT,2}\big\|f_{it}(\beta_0-\widehat\beta)\big\|_{NT,2} = O_P\left(\frac{s\log(p/\gamma)}{\sqrt{N\wedge T}}\right).$$
Consider the second term in (3C.1). By Hölder's inequality, we have
$$\frac{\sqrt{N\wedge T}}{NT}\big|(f(\zeta_0-\widehat\zeta))'V^Y\big| \le \sqrt{N\wedge T}\,\|\zeta_0-\widehat\zeta\|_1\,\frac{\|f'V^Y\|_\infty}{NT} = O_P\left(\frac{s\log(p/\gamma)}{\sqrt{N\wedge T}}\right).$$
Consider the third term in (3C.1). By the Cauchy-Schwarz inequality, we have
$$\frac{\sqrt{N\wedge T}}{NT}\big|(f(\zeta_0-\widehat\zeta))'r^Y\big| \le \sqrt{N\wedge T}\,\big\|f_{it}(\zeta_0-\widehat\zeta)\big\|_{NT,2}\big\|r^Y_{it}\big\|_{NT,2} = O_P\left(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\right).$$
For the last term of (3C.1), the Cauchy-Schwarz inequality implies that
$$\frac{\sqrt{N\wedge T}}{NT}\big|(V^Z)'r^g\big| \le \sqrt{N\wedge T}\,\big\|V^Z_{it}\big\|_{NT,2}\big\|r^g_{it}\big\|_{NT,2}.$$
By Assumption REG-P(ii), we have $E\big[|V^Z_{it}|^{4(\mu+\delta)}\big] < \infty$. Then we can apply Lemma A.1 and obtain that $\|V^Z_{it}\|_{NT,2}\overset{p}{\to}\big(E[(V^Z_{it})^2]\big)^{1/2}$. Therefore, we have $\frac{\sqrt{N\wedge T}}{NT}\big|(V^Z)'r^g\big| = o_P(1)$. The arguments for the rest of the terms in (3C.1) are similar. Under the sparsity condition $s = o\big(\sqrt{N\wedge T}/\log(p/\gamma)\big)$, we conclude that $\sqrt{N\wedge T}\,\big|\widehat\psi_{NT}-\psi_{NT}\big| = o_P(1)$.

Consider Statement (iv). By product decomposition, we have
$$\begin{aligned}
\big\|\widehat A_{NT}-A_{NT}\big\|_1 &= \frac{1}{NT}\Big\|\big(f(\zeta_0-\widehat\zeta)\big)'(D-f\pi_0) + (Z-f\zeta_0)'f(\pi_0-\widehat\pi) + \big(f(\zeta_0-\widehat\zeta)\big)'f(\pi_0-\widehat\pi)\Big\|_1\\
&\le \frac{1}{NT}\Big\{\big\|(f(\zeta_0-\widehat\zeta))'f(\pi_0-\widehat\pi)\big\|_1 + \big\|(f(\zeta_0-\widehat\zeta))'(r^D+V^D)\big\|_1 + \big\|(V^Z)'f(\pi_0-\widehat\pi)\big\|_1\Big\}.
\end{aligned}$$
We observe that, by similar arguments as for Statement (iii), $\big\|\widehat A_{NT}-A_{NT}\big\|_1 = o_P(1)$. We have shown Statements (i)-(iv), completing the proof. □

Proof of Theorem 3.6 We have shown in the proof of Theorem 3.5 that $\widehat A_{NT}-A_{NT} = o_P(1)$ and $A_{NT}-A_0 = o_P(1)$. By the triangle inequality, we have $\widehat A_{NT}-A_0 = o_P(1)$. Then, it suffices to show $\widehat\Omega_{CHS}-\Omega = o_P(1)$. We decompose $\widehat\Omega_{CHS}$ as follows:
$$\widehat\Omega_{CHS} := \widehat\Omega_a + \widehat\Omega_b - \widehat\Omega_c + \widehat\Omega_d + \widehat\Omega'_d,$$
where
$$\begin{aligned}
\widehat\Omega_a &:= \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T}\psi_{it}(\widehat\theta,\widehat\eta)\,\psi_{ir}(\widehat\theta,\widehat\eta)', \qquad
\widehat\Omega_b := \frac{1}{NT^2}\sum_{t=1}^{T}\sum_{i=1}^{N}\sum_{j=1}^{N}\psi_{it}(\widehat\theta,\widehat\eta)\,\psi_{jt}(\widehat\theta,\widehat\eta)',\\
\widehat\Omega_c &:= \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\psi_{it}(\widehat\theta,\widehat\eta)\,\psi_{it}(\widehat\theta,\widehat\eta)', \qquad
\widehat\Omega_d := \frac{1}{NT^2}\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\sum_{t=1}^{T-m}\sum_{i=1}^{N}\sum_{j=1,\,j\ne i}^{N}\psi_{it}(\widehat\theta,\widehat\eta)\,\psi_{j,t+m}(\widehat\theta,\widehat\eta)',
\end{aligned}$$
where $\psi_{it}(\theta,\eta) = (Z_{it}-f_{it}\zeta)\big(Y_{it}-f_{it}\beta-\theta(D_{it}-f_{it}\pi)\big)$ and $\eta = (\zeta,\beta,\pi)$. We need to show $\widehat\Omega_a\overset{p}{\to}\Sigma_a = E_P[a_i^2]$, $\widehat\Omega_b\overset{p}{\to}cE[g_t^2]$, $\widehat\Omega_c = o_P(1)$, and $\widehat\Omega_d\overset{p}{\to}c\sum_{m=1}^{\infty}E_P[g_tg_{t+m}]$. First, consider $\widehat\Omega_a - E_P[a_i^2]$.
By the triangle inequality, we have
$$\big|\widehat\Omega_a - E_P[a_i^2]\big| \le |I_{a,1}| + |I_{a,2}| + |I_{a,3}|,$$
where
$$\begin{aligned}
I_{a,1} &:= \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T}\Big\{\psi_{it}(\widehat\theta,\widehat\eta)\psi_{ir}(\widehat\theta,\widehat\eta) - \psi_{it}(\theta_0,\eta_0)\psi_{ir}(\theta_0,\eta_0)\Big\},\\
I_{a,2} &:= \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T}\Big\{\psi_{it}(\theta_0,\eta_0)\psi_{ir}(\theta_0,\eta_0) - E[\psi_{it}(\theta_0,\eta_0)\psi_{ir}(\theta_0,\eta_0)]\Big\},\\
I_{a,3} &:= \frac{1}{NT^2}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{r=1}^{T}\Big\{E[\psi_{it}(\theta_0,\eta_0)\psi_{ir}(\theta_0,\eta_0)] - E[a_i^2]\Big\}.
\end{aligned}$$
Note that in proving Claim B.3, the cross-fitting device is only used to show that $I_{a,1}$ is of small order. Since the arguments for showing $I_{a,2}$ and $I_{a,3}$ to be of small order are essentially the same as those in the proof of Claim B.3, they are not repeated here.

Consider $I_{a,1}$. By product decomposition, the triangle inequality, and the Cauchy-Schwarz inequality, we have
$$|I_{a,1}| \lesssim R_{NT}\Big\{\big\|\psi_{it}(\theta_0,\eta_0)\big\|_{NT,2} + R_{NT}\Big\}, \qquad R_{NT} := \big\|\psi_{it}(\widehat\theta,\widehat\eta) - \psi_{it}(\theta_0,\eta_0)\big\|_{NT,2}.$$
By Minkowski's inequality, we have
$$\begin{aligned}
R_{NT} &= \Big\|\psi^a_{it}(\eta_0)(\widehat\theta-\theta_0) + \big(\psi^a_{it}(\widehat\eta)-\psi^a_{it}(\eta_0)\big)(\widehat\theta-\theta_0) + \psi_{it}(\theta_0,\widehat\eta) - \psi_{it}(\theta_0,\eta_0)\Big\|_{NT,2}\\
&\le \big\|\psi^a_{it}(\eta_0)(\widehat\theta-\theta_0)\big\|_{NT,2} + \big\|\big(\psi^a_{it}(\widehat\eta)-\psi^a_{it}(\eta_0)\big)(\widehat\theta-\theta_0)\big\|_{NT,2} + \big\|\psi_{it}(\theta_0,\widehat\eta)-\psi_{it}(\theta_0,\eta_0)\big\|_{NT,2}\\
&=: R_{a,1} + R_{a,2} + R_{a,3},
\end{aligned}$$
where $\psi^a_{it}(\eta) := (Z_{it}-f_{it}\zeta)(D_{it}-f_{it}\pi)$. Under Assumption REG-P(ii), we have $E_P\big[\psi^a_{it}(\eta_0)\big]^2 = E_P\big[V^Z_{it}(V^D_{it}+r^D_{it})\big]^2 = O(1)$, and the Markov inequality implies that $\big\|\psi^a_{it}(\eta_0)\big\|_{NT,2} = O_P(1)$. By Theorem 3.5, we have $\widehat\theta-\theta_0 = O_P\big(\frac{1}{\sqrt{N\wedge T}}\big)$. Therefore, $R_{a,1} \le \big\|\psi^a_{it}(\eta_0)\big\|_{NT,2}\,\big|\widehat\theta-\theta_0\big| = O_P\big(\frac{1}{\sqrt{N\wedge T}}\big)$.

To bound $R_{a,2}$, we note
$$\big\|\psi^a_{it}(\eta_0)-\psi^a_{it}(\widehat\eta)\big\|_{NT,2} = \Big\|f_{it}(\widehat\zeta-\zeta_0)(D_{it}-f_{it}\pi_0) + f_{it}(\widehat\zeta-\zeta_0)f_{it}(\widehat\pi-\pi_0) + (Z_{it}-f_{it}\zeta_0)f_{it}(\widehat\pi-\pi_0)\Big\|_{NT,2}.$$
Under Assumption REG-P(iii), we have $E_P|V^D_{it}|^{8(\mu+\delta)} < \infty$, which implies
$$E_P\Big[\max_{i\le N,\,t\le T}|V^D_{it}|^2\Big] \lesssim (NT)^{\frac{1}{4(\mu+\delta)}}.$$
By the Markov inequality, we have $\max_{i\le N,t\le T}|V^D_{it}|^2 = O_P\big((NT)^{\frac{1}{4(\mu+\delta)}}\big)$. As in the proof of Theorem 3.5, Theorem 3.1 can be applied to obtain $\big\|f_{it}(\widehat\zeta-\zeta_0)\big\|_{NT,2} = O_P\Big(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\Big)$. Then, we have
$$\big\|f_{it}(\widehat\zeta-\zeta_0)V^D_{it}\big\|_{NT,2} \le \Big(\max_{i\le N,\,t\le T}|V^D_{it}|^2\Big)^{1/2}\big\|f_{it}(\widehat\zeta-\zeta_0)\big\|_{NT,2} = O_P\big((NT)^{\frac{1}{8(\mu+\delta)}}\big)\,O_P\left(\sqrt{\frac{s\log(p/\gamma)}{N\wedge T}}\right) = O_P\big((NT)^{\frac{1}{8(\mu+\delta)}}\big)\,o_P\left(\frac{1}{(N\wedge T)^{1/4}}\right) = o_P(1).$$
Similar arguments can be made to bound $R_{a,3}$. Therefore, we have $R_{NT} = o_P(1)$ and so $\widehat\Omega_a\overset{p}{\to}\Sigma_a$.

It remains to show that $\widehat\Omega_b\overset{p}{\to}cE[g_t^2]$, $\widehat\Omega_c = o_P(1)$, and $\widehat\Omega_d\overset{p}{\to}c\sum_{m=1}^{\infty}E_P[g_tg_{t+m}]$. As shown in the proof of Theorem 3.3 (Lemmas A.5-A.7), the only step in establishing these claims that involves the cross-fitting technique is showing that the same term $R_{NT}$ converges to zero in probability.
Otherwise, the arguments are essentially the same and are not repeated here. Combining these results, we obtain
$$\widehat\Omega \overset{p}{\to} E_P(a_i^2) + cE_P(g_t^2) + c\sum_{m=1}^{\infty}E_P(g_tg_{t+m}) = \Sigma_a + c\Sigma_g.$$
To show $\widehat V_{DKA} = \widehat V_{CHS} + o_P(1)$, it suffices to show $\widehat\Omega_{NW} = o_P(1)$. We decompose $\widehat\Omega_{NW}$ as follows: $\widehat\Omega_{NW} = \widehat\Omega_c + \widehat\Omega_e - \widehat\Omega_d$, where $\widehat\Omega_c$ and $\widehat\Omega_d$ are defined as above and $\widehat\Omega_e$ is defined as follows:
$$\widehat\Omega_e := \frac{1}{NT^2}\sum_{m=1}^{M-1}k\Big(\frac mM\Big)\sum_{t=1}^{T-m}\sum_{i=1}^{N}\sum_{j=1}^{N}\psi(W_{it};\widehat\theta,\widehat\eta)\,\psi(W_{j,t+m};\widehat\theta,\widehat\eta).$$
Following the same arguments as in the proof of Claim B.6, we have $\widehat\Omega_e = \widehat\Omega_d + o_P(1)$. We have shown $\widehat\Omega_c = o_P(1)$. Therefore, we conclude that $\widehat\Omega_{NW} = o_P(1)$, which completes the proof. □

CHAPTER 4

ANOTHER LOOK AT THE LINEAR PROBABILITY MODEL AND NONLINEAR INDEX MODELS (CO-AUTHORED WITH ROBERT S. MARTIN and JEFFREY M. WOOLDRIDGE)

4.1 Introduction

When an outcome variable, $y$, is binary, empirical researchers usually choose between two general strategies given a vector of (exogenous) explanatory variables, x: (i) approximate the response probability, $P(y=1|\mathbf{x})$, using a model linear in parameters, or (ii) use a nonlinear model, such as logit or probit. The first strategy is commonly known as using a linear probability model (LPM). The benefits of the LPM are well known and include ease of interpretation and simple estimation. The shortcomings of the LPM are also well known and discussed in most introductory econometrics texts; see, for example, Wooldridge, 2019, Section 7.5. More advanced discussions of the LPM recognize that one should not take the linear model for $P(y=1|\mathbf{x})$ literally but only as an approximation. The approximation can be exact in special cases—such as when x consists of binary indicators that are exhaustive and mutually exclusive—and it may be poor in other cases. However, for the most part, prediction is not the primary use of LPMs specifically or binary response models generally. Rather, researchers are largely interested in using binary response models to measure ceteris paribus or causal effects, and it is from this perspective that the LPM approximation should be evaluated. Angrist and Pischke, 2009, Section 3.4.1 and Wooldridge, 2010, Section 15.2 take this perspective. Wooldridge, 2010, Section 15.6, p. 579 shows how the results of Stoker (1986) can be applied to OLS estimation of the parameters in a LPM. Remarkably, there are situations where the linear projection exactly recovers the average partial effects (APEs) across a broad range of binary response models.1 Even though it is natural to study the LPM from the linear projection perspective, this opinion is not universally held.

The co-authors have approved the inclusion of this co-authored chapter. Co-author contact: Robert S. Martin, Division of Price and Index Number Research, Bureau of Labor Statistics. Email: martin.robert@bls.gov. Jeffrey M. Wooldridge, Department of Economics, Michigan State University. Email: wooldri1@msu.edu

1 Note that extensions of the LPM do not necessarily recover the APE. For instance, see Li et al. (2022) for the case of the LPM with endogenous x and two-stage least squares estimation.
In an influential paper, Horrace and Oaxaca (2006) study both the bias and inconsistency of the OLS estimator for the parameters of an underlying piecewise linear model for the response probability that ensures the probabilities are in the unit interval.2 The Horrace and Oaxaca paper is regularly cited in empirical research,3 sometimes as a cautionary tale in using the LPM and sometimes as support for using the LPM when relatively few fitted values lie outside the unit interval. While Horrace and Oaxaca take the piecewise linear model seriously, much if not most of the citing literature seeks to use their results to choose between the LPM and an alternative like probit or logit.4 In the current paper, we revisit the Horrace and Oaxaca framework but, rather than focus on parameters, we focus on APEs. We show that Horrace and Oaxaca set up the problem so that, in general, the response probability is nonlinear in the underlying linear index, x𝛽 = 𝛽1 + 𝛽2𝑥2 + · · · + 𝛽𝐾𝑥𝐾: 𝑃(𝑦 = 1|x) = 𝑅(x𝛽) = 0, 𝑥 𝛽 ≤ 0 𝑥 𝛽, 𝑥 𝛽 ∈ (0, 1). 1, 𝑥 𝛽 ≥ 1    (4.1) The nonlinear function 𝑅(·)—sometimes called the ramp function—is piecewise linear and continu- ous, but it is not strictly increasing, and it is nondifferentiable at two inflection points. Nevertheless, under fairly weak assumptions, one can define the APEs. For continuous variables, the associated APEs are necessarily smaller in magnitude than the index slope coefficients in the underlying non- linear model. Consequently, Horrace and Oaxaca’s focus on index parameters rather than APEs is essentially the same as focusing on parameters in smooth response probabilities such as the logit and probit functions. Therefore, any conclusions about the usefulness of the LPM should be reexamined from the perspective of identifying APEs rather than coefficients. 2Horrace and Oaxaca (2006) defines the LPM as the piecewise linear ramp model. However, in this paper, we differentiate between the “ramp model” and the “LPM” (which is linear everywhere). 3In recent years (2020-2024), Horrace and Oaxaca (2006) has more than 300 Google Scholar citations. 4See, for example, Footnote 20 of van den Berg and Siflinger (2022). 149 It is important to understand that we are not advocating the ramp function as an especially sensible model of the response probability. Rather, we primarily study that specification from the perspective of average partial effects to determine how the Horrace and Oaxaca conclusions hold up. Briefly, in some cases, the linear projection parameters do a very good job of approximating the APEs even when a large percentage of the fitted values are outside the unit interval. Conversely, in other cases, the linear projection parameters do a very poor job of approximating the APEs even when a high percentage of the fitted values are within the unit interval. A practical implication is that there is little justification for how the Horrace and Oaxaca paper is cited in empirical research. We compare the OLS estimation of the LPM to a few nonlinear competitors, including probit and logit quasi-maximum likelihood estimation (QMLE), as natural benchmarks. Horrace and Oaxaca cite a few theoretical rationalizations for the ramp model, so it also makes sense to see if a consistent estimator exists that takes it seriously. Horrace and Oaxaca suggest trimming the sample of fitted values outside the unit interval and re-estimating using OLS, but do not present any theoretical or simulation results. 
In unreported simulations, we found that trimming the sample once did not necessarily improve performance over OLS for estimating the APEs. Interestingly, by iteratively trimming the sample and performing OLS estimation (referred to as the ITO procedure hereafter), we show that the procedure produces results equivalent to those from numerically minimizing the nonlinear least squares (NLS) objective function with the ramp model. In Section 4.3, we show that the NLS estimator of the ramp model is consistent and asymptotically normal under mild assumptions, which in turn justifies trimming procedures in practice. For estimating the APEs, we find that NLS estimation of the ramp function performs comparably to quasi-MLE estimation of the logit and probit models and has good finite sample properties even when OLS estimation of the LPM does not.

Section 4.2 delivers our main theoretical arguments. Starting with a linear index model as the response probability of a binary outcome, we define and contrast the parameters of interest: the index slope coefficients, the average partial effects, and the linear projection parameters. By leveraging results from Stoker (1986), we describe scenarios where the linear projection parameters recover the APEs. In particular, we extend the discussion in Wooldridge, 2010, Section 15.6 and show that, when the covariates have a multivariate normal distribution, the linear projection identifies the APEs. Section 4.4 continues this line of inquiry by conducting simulations, both in scenarios where the theory makes sharp predictions and in scenarios where the theory is suggestive but does not fully pin down the relationships. Related to our main theoretical arguments, we show that a large fraction of fitted values in $[0,1]$ is neither a sufficient nor a necessary condition for the LPM to approximate the APEs well. We revisit an empirical study of mortgage lending decisions in Section 4.5. The LPM estimated by OLS, with a full set of interactions between the variable of interest and the control variables, delivers a notably smaller and only marginally statistically significant estimate of the effect of being white on the approval probability. The NLS estimator of the ramp function, probit QMLE, and logit QMLE are very similar and all statistically significant at the 0.2% level—both because the estimated effects are larger and because the (robust) standard errors are notably smaller. In Section 4.6, we conclude with some implications for empirical research.

4.2 Relevant Parameters of Binary Response Models

Let $y$ be the binary outcome variable and x the $1\times K$ vector of explanatory variables, where $x_1\equiv1$ allows for an intercept in the index. Consider a linear index model of the response probability for $y$:
$$P(y=1|\mathbf{x}) \equiv p(\mathbf{x}) = G(\mathbf{x}\beta) = G(\beta_1+\beta_2x_2+\cdots+\beta_Kx_K), \tag{4.2}$$
where $G:\mathbb{R}\to[0,1]$. This embeds the probit/logit model by setting $G(\cdot)$ to the standard normal CDF/standard logistic function, and it includes the ramp model in Horrace and Oaxaca (2006) by setting $G(\cdot) = R(\cdot)$ as in (4.1). The following subsection compares different parameters of interest for the linear index model generally and for the ramp model specifically. The ramp model for the response probability was suggested by Horowitz and Savin (2001) as being suitable when one starts with a linear model for $p(\mathbf{x})$ but wants to ensure that the probabilities are within the unit interval. While not necessarily advocating this view, our purpose is to show that Horrace and Oaxaca's conclusions about one set of parameters ($\beta$) do not necessarily apply to the most interesting set of parameters (the APEs).
While not necessarily advocating this view, our purpose is to show that Horrace and Oaxaca’s conclusions about one set 151 of parameters (𝛽) do not necessarily apply to the most interesting set of parameters (the APEs). 4.2.1 APEs, Index Slopes, and Linear Projection Parameters We will first consider partial effects for continuous variables. Let 𝑥 𝑗 be a continuously distributed explanatory variable. For simplicity, the discussion here assumes that 𝑥 𝑗 appears only by itself. If the model includes quadratics, interactions, and so on then the details become more complicated but the conclusions do not change substantively. Assume 𝐺 (·) is differentiable almost everywhere, with its derivative denoted by 𝑔(·). Then, the partial effects and average partial effects of 𝑥 𝑗 on the response probability of 𝑦 can be defined as: 𝑃𝐸 𝑗 (x) ≡ 𝛽 𝑗 𝑔 (x𝛽) , 𝐴𝑃𝐸 𝑗 ≡ 𝛽 𝑗 𝐸 [𝑔 (x𝛽)] In the case of the ramp function, even though 𝑅 (𝑧) is non-differentiable at 𝑧 = 0 and 𝑧 = 1, it is still differentiable with probability one as long as x𝛽 is continuous, and so 𝑃 (x𝛽 = 0) = 𝑃 (x𝛽 = 1) = 0. This holds true provided that at least one element of x is continuous, and that element has a nonzero coefficient, which is a very common assumption imposed in the semiparametric literature on binary response models. In what follows, we maintain that x𝛽 is continuous so that partial effects are well-defined with probability one. As a result, we can define a partial effect function as the derivative of 𝑅 (x𝛽) and ignore points where the derivative does not exist: 𝑃𝐸 𝑗 (x) = 𝜕 𝑝 𝜕𝑥 𝑗 (x) = 𝛽 𝑗 1 [0 ≤ x𝛽 ≤ 1] , where 1 [·] is the indicator function. Therefore, under the ramp model, the APE is 𝐴𝑃𝐸 𝑗 ≡ 𝐸 (cid:2)𝑃𝐸 𝑗 (x)(cid:3) = 𝛽 𝑗 𝑃 (0 ≤ x𝛽 ≤ 1) . (4.3) There are some simple but useful observations about (4.3). First, similar to the probit or logit cases, 𝐴𝑃𝐸 𝑗 always has the same sign as 𝛽 𝑗 . Second, because 𝑃 (0 ≤ x𝛽 ≤ 1) ≤ 1, (cid:12) (cid:12)𝛽 𝑗 with wide support for x𝛽, 𝐴𝑃𝐸 𝑗 can be much smaller in magnitude than 𝛽 𝑗 . Moreover, 𝐴𝑃𝐸 𝑗 = 𝛽 𝑗 (cid:12)𝐴𝑃𝐸 𝑗 (cid:12) ≤ (cid:12) (cid:12) (cid:12) (cid:12); if and only if 𝑃 (0 ≤ x𝛽 ≤ 1) = 1, which means the support of x𝛽 is inside the unit interval. This is essentially the condition used by Horrace and Oaxaca (2006) to conclude that the OLS estimator in 152 linear regression is unbiased and consistent for 𝛽. Our goal here is to compare the OLS estimators with the APEs in the general case where 𝑃 (0 ≤ x𝛽 ≤ 1) < 1; the Horrace and Oaxaca condition is then a special case where the index coefficient, 𝛽 𝑗 , is identical to 𝐴𝑃𝐸 𝑗 . In order to understand the behavior of the OLS estimator under a linear index model, it is important to introduce a third set of parameters: the linear projection parameters, denoted as 𝛾. Assume that the 𝑥 𝑗 have finite second moments and that the 𝐾 × 𝐾 matrix 𝐸 (x′x) is nonsingular. Then we can always define the 𝐾 × 1 vector 𝛾 as 𝛾 = [𝐸 (x′x)]−1 𝐸 (x′𝑦) . We then write the linear projection of 𝑦 on (1, 𝑥2, ..., 𝑥𝐾) as. 𝐿 (𝑦|x) = 𝐿 (𝑦|1, 𝑥2, ..., 𝑥𝐾) = 𝛾1 + 𝛾2𝑥2 + · · · + 𝛾𝐾𝑥𝐾 = x𝛾. In understanding the findings in Horrace and Oaxaca, and their limitations, it is important to know that, given the model of 4.2, 𝐴𝑃𝐸 𝑗 , 𝛽 𝑗 , and 𝛾 𝑗 are all well-defined parameters and, in general, they are all different. Defining 𝛽 and the APEs require an underlying model for the response probability whereas defining 𝛾 does not. 
In order to understand the behavior of the OLS estimator under a linear index model, it is important to introduce a third set of parameters: the linear projection parameters, denoted $\gamma$. Assume that the $x_j$ have finite second moments and that the $K\times K$ matrix $E(\mathbf{x}'\mathbf{x})$ is nonsingular. Then we can always define the $K\times1$ vector $\gamma$ as
$$\gamma = [E(\mathbf{x}'\mathbf{x})]^{-1}E(\mathbf{x}'y).$$
We then write the linear projection of $y$ on $(1,x_2,\dots,x_K)$ as
$$L(y|\mathbf{x}) = L(y|1,x_2,\dots,x_K) = \gamma_1+\gamma_2x_2+\cdots+\gamma_Kx_K = \mathbf{x}\gamma.$$
In understanding the findings in Horrace and Oaxaca, and their limitations, it is important to know that, given the model in (4.2), $APE_j$, $\beta_j$, and $\gamma_j$ are all well-defined parameters and, in general, they are all different. Defining $\beta$ and the APEs requires an underlying model for the response probability, whereas defining $\gamma$ does not.

As is well known, under random sampling the OLS estimator consistently estimates the parameters of the linear projection; see, for example, Wooldridge (2010, Chapter 4.2). In other words, if we run the OLS regression underlying LPM estimation, $y_i$ on $1,x_{i2},\dots,x_{iK}$, $i=1,\dots,N$, and obtain the $\widehat\gamma_j$, then $\widehat\gamma_j\overset{p}{\to}\gamma_j$. Again, this result holds free of any kind of underlying model. Under the ramp model, Horrace and Oaxaca study the consistency of the $\widehat\gamma_j$ when considered as estimators of $\beta_j$—the coefficients in the index. In other words, their asymptotic analysis is the same as comparing the linear projection parameters $\gamma_j$ to the index parameters $\beta_j$. Our view is that this usually does not make much sense—for the same reason, we do not study the consistency of the OLS estimator for the index parameters in, say, probit or logit. If one explicitly models the response probability as a nonlinear function of $\mathbf{x}\beta$, then one must recognize that nonlinearity when defining the parameters of interest. When interest is in the effects of the explanatory variables on the response probability—which describes almost all modern usages of the LPM—it only makes sense to compare the linear projection parameters to the APEs. In other words, we should ask: When is $\gamma_j$ "close" to $APE_j$? This is not the same as studying when $\gamma_j$ is "close" to $\beta_j$ (except in the special case where $P(0\le\mathbf{x}\beta\le1)=1$).

Under the ramp model we can write
$$E(y|\mathbf{x}) = p(\mathbf{x}) = 1[0\le\mathbf{x}\beta\le1]\,\mathbf{x}\beta + 1[\mathbf{x}\beta>1].$$
If $P(0\le\mathbf{x}\beta\le1)=1$ then, with probability one, $E(y|\mathbf{x}) = \mathbf{x}\beta = L(y|\mathbf{x})$, in which case $APE_j = \beta_j = \gamma_j$, and so the OLS estimators, $\widehat\gamma_j$, are consistent for $\beta_j$ and $APE_j$. If, for a random sample of size $N$, $\mathbf{x}_i\beta\in[0,1]$ for all $i$, then $E(y_i|\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_N) = \mathbf{x}_i\beta$, and it follows that the OLS estimators are conditionally unbiased for the $\beta_j$—the conclusion reached in Horrace and Oaxaca. If $P(0\le\mathbf{x}\beta\le1)<1$, then the $\beta_j$ measure the partial effects when $0\le\mathbf{x}\beta\le1$, but this restriction depends on the unknown parameter vector, and the $\beta_j$ need not be very useful as summary measures of the partial effects. In the next subsection, we discuss more generally when the linear projection parameters identify the APEs.

4.2.2 When Linear Projection Recovers the APEs

In addition to being easy to interpret, empirically, the OLS estimates of the LPM are often similar to the corresponding APEs from nonlinear index models—particularly logit or probit. Wooldridge, 2010, Section 15.6 provides a discussion based on a result of Stoker (1986) that helps one understand these empirical findings. Here we expand that discussion to allow for an extension to the ramp model.

As argued in Wooldridge, 2010, Section 15.6, the results of Stoker (1986) imply that, if $(x_2,\dots,x_K)$ has a multivariate normal distribution and $G(\cdot)$ is differentiable almost everywhere on $\mathbb{R}$ (with respect to Lebesgue measure), then
$$\gamma_j = \beta_j\,E[g(\mathbf{x}\beta)] = APE_j, \qquad j = 2,\dots,K,$$
where $\gamma_j$ is the slope coefficient on $x_j$ in $L(y|\mathbf{x}) = \mathbf{x}\gamma$ and $g(\cdot)$ is the almost-everywhere derivative of $G(\cdot)$. The ramp function $R(\cdot)$ is differentiable everywhere except at zero and one, and so it satisfies Stoker's (1986) assumptions. The result is that OLS consistently estimates $APE_j$, even though the $APE_j$ are attenuated versions of the $\beta_j$. This equality holds even when $P(0\le\mathbf{x}\beta\le1)$ is very close to zero.
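A quick way to see this result numerically is to simulate a ramp-model outcome with a normal covariate and compare the OLS slope with the APE. The sketch below is our own illustration (the coefficient values are arbitrary and are not from the designs in Section 4.4):

    import numpy as np

    rng = np.random.default_rng(42)
    n = 1_000_000
    b0, b1 = 0.5, 1.5                      # arbitrary index parameters

    x = rng.normal(size=n)                 # normal covariate: Stoker's condition holds
    index = b0 + b1 * x
    p = np.clip(index, 0.0, 1.0)           # ramp response probability R(x*beta)
    y = (rng.uniform(size=n) < p).astype(float)

    # true APE of x: beta_1 * P(0 <= x*beta <= 1)
    ape = b1 * np.mean((index >= 0) & (index <= 1))

    # OLS slope of the linear probability model (linear projection coefficient)
    X = np.column_stack([np.ones(n), x])
    gamma = np.linalg.lstsq(X, y, rcond=None)[0]

    print(f"APE = {ape:.4f}, OLS slope = {gamma[1]:.4f}")  # nearly identical

Both numbers settle near 0.39 in this design, while the index slope is 1.5, illustrating that OLS recovers the APE rather than $\beta_1$.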
Horrace and Oaxaca (2006), and many papers citing their findings, focus on the inconsistency of OLS for $\beta_j$, failing to recognize that the OLS estimators from the linear model could be consistent for the more interesting quantities, the $APE_j$. This point is key to our argument: if the model of the response probability is nonlinear so that $0\le p(\mathbf{x})\le1$ is ensured, one should study estimation of APEs, not underlying index parameters.

Other than the case of multivariate normality of $(x_2,\dots,x_K)$, there is another case where the linear projection parameters, $\gamma_j$, $j=2,\dots,K$, equal the APEs: $x_2,\dots,x_K$ are mutually exclusive binary indicators that, along with a base group given by $x_2=x_3=\cdots=x_K=0$, are exhaustive. See Angrist and Pischke, 2009, Section 3.1.4 and Wooldridge, 2010, Section 15.2. If $x_1=1$ denotes the base group, then the APEs are simply
$$APE_j = E(y|x_j=1) - E(y|x_1=1), \qquad j=2,\dots,K,$$
and these are identical to the corresponding LPM coefficients.

4.2.3 More General Cases

Clearly the assumption of multivariate normality of x is too restrictive to be widely applicable. Nevertheless, the results of Stoker (1986) are suggestive, especially when combined with Ruud (1983). Ruud studies smooth nonlinear functional forms that never hit the endpoints of the unit interval, like probit and logit. In these cases, quasi-MLE identifies the index coefficients up to scale.5 If x has a centrally symmetric distribution—of which the multivariate normal is a special case—then Ruud's (1983) conditions hold. In Section 4.4, we will find several cases where the covariates are symmetrically distributed (but not multivariate normal) and the APEs are still approximated well by the linear projection parameters.

Beyond the extreme cases described here, there appears to be no general theory to determine when the linear projection coefficients will be the same as, or "close" to, the APEs. Many empirical applications include a combination of continuous, discrete, and even mixed explanatory variables. Rarely do these all have symmetric marginal distributions, let alone a symmetric joint distribution. Moreover, such explanatory variables often appear in quadratics, interactions, and other functional forms—which also do not have symmetric distributions. In Section 4.4, we use simulations to shed light on when the LPM coefficients closely approximate the APEs—and when they do not. When evaluating the performance of the LPM as an approximation to Horrace and Oaxaca's ramp model, it makes sense to consider an estimator which takes such a model seriously. To that end, the next section describes such an estimator.

4.3 Asymptotically Valid Estimators of the Ramp Model

4.3.1 Nonlinear Least Squares Estimation

We have already seen that if $P(0\le\mathbf{x}\beta\le1)=1$, then OLS is consistent for the $\beta_j$, which are equal to the $APE_j$ in the case of a continuous covariate $x_j$ under model (4.1). If the probability that $\mathbf{x}\beta$ lies outside the unit interval is nonzero, then OLS is no longer consistent for the $\beta_j$, and it may or may not approximate the $APE_j$, depending on the distribution of x. In addition to probit and logit quasi-MLE, it makes sense to consider an estimator which is consistent if the ramp model is true. Of course, Bernoulli MLE in the usual fashion using the ramp model as the conditional response probability is not feasible because the log-likelihood is not necessarily defined for $\mathbf{x}\beta\notin(0,1)$.

5 Li et al. (2022) discuss this further and show that in the case of a single normal covariate, logit quasi-MLE identifies the APE, but probit quasi-MLE does not.
Instead, we consider nonlinear least squares (NLS) using the piecewise-linear ramp function $R(\mathbf{x}\beta)$ from (4.1) as the conditional mean. In addition, since there may not be much justification to think the ramp function is the true response probability, we allow for general misspecification. Therefore, we define $\beta_o$ as the pseudo-true value in the sense that $\beta_o$ is the unique solution to
$$\min_\beta E\big[(y-R(\mathbf{x}\beta))^2\big] \equiv \min_\beta Q(\beta). \tag{4.4}$$
We say that the model is misspecified if there is no $\beta$ such that $E[y|\mathbf{x}] = R(\mathbf{x}\beta)$. By construction, $\beta_o$ is the true coefficient when the model is correctly specified; otherwise, we view $R(\mathbf{x}\beta_o)$ as the best mean squared error approximation to $E[y|\mathbf{x}]$ over all ramp functions $R(\mathbf{x}\beta)$.

Assume a random sample indexed by $i=1,\dots,N$. As a sample analogue of (4.4), we define the objective function $Q_N(\beta)$ as
$$Q_N(\beta) \equiv \frac1N\sum_{i=1}^{N}\big(y_i - R(\mathbf{x}_i\beta)\big)^2 = \frac1N\sum_{i=1}^{N}\Big(y_i^2\,1\{\mathbf{x}_i\beta\le0\} + (y_i-\mathbf{x}_i\beta)^2\,1\{\mathbf{x}_i\beta\in(0,1)\} + (y_i-1)^2\,1\{\mathbf{x}_i\beta\ge1\}\Big),$$
where $N$ is the sample size. We define the NLS estimator $\widehat\beta$ as
$$\widehat\beta \equiv \operatorname*{argmin}_\beta\, Q_N(\beta).$$
The following theorem gives the consistency of the NLS estimator for the pseudo-true value, allowing for misspecification of the conditional mean model.

Theorem 4.1 Let $\{y_i,\mathbf{x}_i\}_{i=1}^{\infty}$ be an i.i.d. sequence with $y$ only taking on values zero and one, and let $R:\mathbb{R}\to[0,1]$ be the ramp function defined in (4.1). Suppose $\beta\in\mathcal{B}$ such that $\mathcal{B}\subset\mathbb{R}^K$ is compact, and $\beta_o$ is identified in the sense that for all $\beta\in\mathcal{B}$ with $\beta\ne\beta_o$,
$$E\big[(y_i - R(\mathbf{x}_i\beta_o))^2\big] < E\big[(y_i - R(\mathbf{x}_i\beta))^2\big].$$
Then, $\widehat\beta\overset{p}{\to}\beta_o$ as $N\to\infty$.

The consistency result of Theorem 4.1 follows directly from Theorem 12.2 of Wooldridge (2010). If x contains a continuously distributed $x_j$ and $\beta_{jo}$ is nonzero, then the probability that $\mathbf{x}_i\beta_o$ equals 0 or 1 is zero. Then, with suitable moment conditions on x (so the Leibniz integral rule applies), the first-order condition of (4.4) is well defined with probability one as follows:
$$E\big[\mathbf{x}'_iu_i\,1\{\mathbf{x}_i\beta_o\in(0,1)\}\big] = 0, \tag{4.5}$$
where $u_i(\beta) = y_i - R(\mathbf{x}_i\beta)$ and $u_i \equiv u_i(\beta_o)$. Define the score function for random draw $i$:
$$\mathbf{s}_i(\beta) = -\mathbf{x}'_i\,u_i(\beta)\,1\{\mathbf{x}_i\beta\in(0,1)\}.$$
Then, $\beta_o$ solves $E[\mathbf{s}_i(\beta_o)] = 0$. The variance-covariance matrix of $\mathbf{s}_i(\beta)$ is
$$\mathbf{\Omega}(\beta) = E\big[\mathbf{x}'_i\mathbf{x}_i\,u_i(\beta)^2\,1\{\mathbf{x}_i\beta\in(0,1)\}\big]. \tag{4.6}$$
The natural definition of the Jacobian of $\mathbf{s}_i(\beta)$ is
$$\mathbf{A}_i(\beta) = \mathbf{x}'_i\mathbf{x}_i\,1\{\mathbf{x}_i\beta\in(0,1)\}.$$
For a similar reason as for (4.5), the Hessian of $Q(\beta)$ is well defined with probability one at $\beta_o$ as follows:
$$\mathbf{A}(\beta_o) = E\big[\mathbf{x}'_i\mathbf{x}_i\,1\{\mathbf{x}_i\beta_o\in(0,1)\}\big]. \tag{4.7}$$
Note that (4.6) and (4.7) are the same whether the conditional mean model is correctly specified or not. Therefore, the following asymptotic distribution result allows for misspecification of the model.

Theorem 4.2 Suppose that the assumptions from Theorem 4.1 hold, and (i) $\beta_o$ is an interior point of $\mathcal{B}$; (ii) $\mathbf{x}_i$ contains a continuously distributed random variable with a nonzero coefficient; (iii) $E\|\mathbf{x}_i\|^2<\infty$ and $E\big[\mathbf{x}'_i\mathbf{x}_i1\{\mathbf{x}_i\beta_o\in(0,1)\}\big] > 0$, where $\|\cdot\|$ denotes the $\ell_2$-norm. Then, as $N\to\infty$,
$$\sqrt N\big(\widehat\beta-\beta_o\big)\overset{d}{\to}\mathcal N\big(0,\,\mathbf{A}(\beta_o)^{-1}\mathbf{\Omega}(\beta_o)\mathbf{A}(\beta_o)^{-1}\big).$$
The proof of Theorem 4.2 is given in Appendix 4A. The asymptotic normality result does not follow directly from standard M-estimator theory due to the non-smoothness of the objective function. We therefore leverage an asymptotic normality result for estimators with non-smooth objective functions from Newey and McFadden (1994).
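The estimator itself is straightforward to compute with a generic optimizer. The following sketch is our own illustration in Python (the chapter's simulations use Stata's "nl" command); it minimizes $Q_N(\beta)$ directly, with a derivative-free method since the objective has kinks:

    import numpy as np
    from scipy.optimize import minimize

    def ramp(z):
        # R(z) from (4.1): 0 below 0, the identity on (0, 1), and 1 above 1
        return np.clip(z, 0.0, 1.0)

    def nls_ramp(X, y):
        """Minimize Q_N(beta) = mean((y - R(X @ beta))^2), starting from OLS."""
        beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS starting values
        qn = lambda b: np.mean((y - ramp(X @ b)) ** 2)
        res = minimize(qn, beta_ols, method="Nelder-Mead",
                       options={"xatol": 1e-8, "fatol": 1e-10})
        return res.x

    # toy usage with a correctly specified ramp model
    rng = np.random.default_rng(1)
    n = 5000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta_true = np.array([0.5, 0.5])
    y = (rng.uniform(size=n) < ramp(X @ beta_true)).astype(float)
    print(nls_ramp(X, y))   # should be close to (0.5, 0.5)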
Taking the sample analogue of the asymptotic variance from Theorem 4.2, we define a variance estimator of $\sqrt N(\widehat\beta-\beta_o)$ as
$$\widehat{\mathbf V} = \mathbf{A}_N(\widehat\beta)^{-1}\mathbf{\Omega}_N(\widehat\beta)\mathbf{A}_N(\widehat\beta)^{-1},$$
where $\mathbf{A}_N(\widehat\beta) = N^{-1}\sum_{i=1}^{N}\mathbf{x}'_i\mathbf{x}_i1\{\mathbf{x}_i\widehat\beta\in(0,1)\}$, $\mathbf{\Omega}_N(\widehat\beta) = N^{-1}\sum_{i=1}^{N}\mathbf{x}'_i\mathbf{x}_i\widehat u_i^21\{\mathbf{x}_i\widehat\beta\in(0,1)\}$, and $\widehat u_i = y_i - R(\mathbf{x}_i\widehat\beta)$. Standard errors are obtained in the usual way from $\widehat{\mathbf V}/N$. The next theorem gives the consistency result for the variance estimator.

Theorem 4.3 Under the same assumptions as Theorem 4.2 and $E\|\mathbf{x}\|^4 < \infty$, as $N\to\infty$, $\widehat{\mathbf V}\overset{p}{\to}\mathbf{A}(\beta_o)^{-1}\mathbf{\Omega}(\beta_o)\mathbf{A}(\beta_o)^{-1}$.

The proof of Theorem 4.3 is given in Appendix 4A. As before, we are interested in the APE. Considering the best ramp approximation in (4.4), the APE of a continuous random variable $x_k$ is defined as
$$APE_k = E\left[\frac{\partial R(\mathbf{x}_i\beta_o)}{\partial x_k}\right] = \beta_{ko}\,P\big(\mathbf{x}_i\beta_o\in(0,1)\big).$$
A sample-analogue estimator of the APE is then given by
$$\widehat{APE}_k = \widehat\beta_k\,\frac1N\sum_{i=1}^{N}1\big\{\mathbf{x}_i\widehat\beta\in(0,1)\big\}.$$
Define $g(\mathbf{x}_i,\beta) = \beta_k1\{\mathbf{x}_i\beta\in(0,1)\}$, $\delta_o = E[g(\mathbf{x}_i,\beta_o)]$, and $\mathbf{G}_o = \nabla_\beta g(\mathbf{x}_i,\beta_o)$. Following Problem 12.17 of Wooldridge (2010), the asymptotic variance of the estimated APE is given by
$$\mathrm{Avar}\Big(\sqrt N\big(\widehat{APE}_k - APE_k\big)\Big) = \mathrm{Var}\Big(g(\mathbf{x}_i,\beta_o) - \delta_o - \mathbf{G}_o\mathbf{A}(\beta_o)^{-1}\mathbf{s}_i(\beta_o)\Big),$$
where $\mathbf{G}_o$ is a $1\times K$ vector with the $k$th element equal to $p_o \equiv P(\mathbf{x}_i\beta_o\in(0,1))$ and all other elements zero. The asymptotic variance can be estimated by the sample variance of $g(\mathbf{x}_i,\widehat\beta)-\widehat\delta-\widehat{\mathbf G}\mathbf{A}_N(\widehat\beta)^{-1}\mathbf{s}_i(\widehat\beta)$, where $\widehat\delta = \frac1N\sum_{i=1}^{N}g(\mathbf{x}_i,\widehat\beta)$ and $\widehat{\mathbf G}$ is a $1\times K$ vector with the $k$th element equal to $\widehat p = \frac1N\sum_{i=1}^{N}1\{\mathbf{x}_i\widehat\beta\in(0,1)\}$.

The APE for a discrete random variable $x_k$ can be defined as
$$APE_k = E\big[R(\mathbf{x}_{i,-k}\beta_{-k,o}+\beta_{ko}) - R(\mathbf{x}_{i,-k}\beta_{-k,o})\big].$$
A sample-analogue estimator of $APE_k$ is given by
$$\widehat{APE}_k = \frac1N\sum_{i=1}^{N}\Big(R(\mathbf{x}_{i,-k}\widehat\beta_{-k}+\widehat\beta_k) - R(\mathbf{x}_{i,-k}\widehat\beta_{-k})\Big).$$
The asymptotic variance can be found and estimated in a similar manner as in the continuous case.

4.3.2 Iterative Trimming OLS Estimation

To estimate $\beta_o$, Horrace and Oaxaca suggest running OLS on a trimmed sample (i.e., those observations for which the initial OLS fitted values are inside the unit interval) to reduce bias. We find in practice that a single round of trimming does not necessarily reduce the bias for the APEs in the cases where OLS is not consistent for them. However, we find that an iterative trimming OLS procedure (ITO) does reduce the bias for estimating the APEs, as well as $\beta_o$. The procedure goes: 1) estimate the LPM by OLS; 2) compute fitted values; 3) drop observations with fitted values outside the unit interval; and 4) repeat starting at 1) until no further observations are dropped (see the sketch following this list).
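A minimal version of the loop, assuming the same NumPy setup as the NLS sketch above (again our own illustration, not the Stata implementation used in the chapter):

    import numpy as np

    def ito(X, y, max_iter=100):
        """Iterative trimming OLS: re-run OLS on the observations whose current
        fitted values lie strictly inside the unit interval, until the kept set
        stabilizes."""
        keep = np.ones(len(y), dtype=bool)
        beta = np.linalg.lstsq(X, y, rcond=None)[0]          # initial full-sample OLS
        for _ in range(max_iter):
            fitted = X @ beta                                # evaluated on the full sample
            new_keep = (fitted > 0.0) & (fitted < 1.0)
            if np.array_equal(new_keep, keep):
                break                                        # no further observations dropped
            keep = new_keep
            beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        return beta

Note that the kept set is recomputed over the full sample at every iteration, so observations can re-enter if a later iterate moves their fitted values back inside the unit interval; this matches the Newton-Raphson interpretation derived next.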
In fact, we find in simulations that the NLS estimates are numerically the same as the ITO estimates up to machine precision.6 It turns out that ITO implicitly minimizes the NLS sample objective function using the OLS estimates as starting values and following the Newton-Raphson numerical method, which is iterative (see Wooldridge, 2010, Section 12.7.1). Given an estimate $\beta^{\{g\}}$, the next iteration is given (using our notation) by
$$\begin{aligned}
\beta^{\{g+1\}} &= \beta^{\{g\}} - \Bigg[N^{-1}\sum_{i=1}^{N}\mathbf{A}_i(\beta^{\{g\}})\Bigg]^{-1}N^{-1}\sum_{i=1}^{N}\mathbf{s}_i(\beta^{\{g\}})\\
&= \beta^{\{g\}} + \Bigg[N^{-1}\sum_{i=1}^{N}\mathbf{x}'_i\mathbf{x}_i1\big\{\mathbf{x}_i\beta^{\{g\}}\in(0,1)\big\}\Bigg]^{-1}N^{-1}\sum_{i=1}^{N}\big(y_i-\mathbf{x}_i\beta^{\{g\}}\big)\mathbf{x}'_i\,1\big\{\mathbf{x}_i\beta^{\{g\}}\in(0,1)\big\}\\
&= \Bigg[N^{-1}\sum_{i=1}^{N}\mathbf{x}'_i\mathbf{x}_i1\big\{\mathbf{x}_i\beta^{\{g\}}\in(0,1)\big\}\Bigg]^{-1}N^{-1}\sum_{i=1}^{N}\mathbf{x}'_iy_i1\big\{\mathbf{x}_i\beta^{\{g\}}\in(0,1)\big\}.
\end{aligned}$$
The second equality above substitutes our expressions for $\mathbf{s}_i(\beta)$ and $\mathbf{A}_i(\beta)$ and uses the fact that $R(\mathbf{x}_i\beta) = \mathbf{x}_i\beta$ for $\mathbf{x}_i\beta\in(0,1)$. This shows that $\beta^{\{g+1\}}$ is simply the OLS estimator on the sample with $\mathbf{x}_i\beta^{\{g\}}\in(0,1)$.

6 With some DGPs, it was occasionally necessary to specify OLS starting values for the NLS function evaluator for the NLS and ITO estimates to match to machine precision. The two were still otherwise very close.

As a consequence, the preceding consistency and asymptotic normality results for the NLS estimator justify using the ITO procedure to reduce the OLS bias. However, it is worth mentioning that, at least in Stata, the pre-loaded NLS solver (the "nl" command) may have a performance advantage over ITO in practice. We find in simulations that ITO can fail to terminate when only a very small portion of observations is left for estimation after iterative trimming. The pre-loaded NLS algorithm continues to work well in those cases.

4.4 Simulations

In this section, we present several Monte Carlo simulations that provide insights into the behavior of the different modeling and estimation approaches. The LPM is estimated by OLS, and the ramp function is estimated by NLS. For the LPM, the APE estimates come directly from the linear projection (e.g., the estimated slope coefficient for a non-interacted variable). For the ramp model, the APEs are estimated using averages of derivatives and differences of the ramp function, as discussed in Section 4.2. These resemble the familiar formulas for the linear model, though the individual unit partial effects need to be scaled by $1\big[0\le\mathbf{x}\widehat\beta_{NLS}\le1\big]$ before averaging, where $\widehat\beta_{NLS}$ corresponds to the NLS slope estimate. The logit and probit parameters are estimated by the (quasi-) maximum likelihood estimator, and then the APEs are estimated using the usual APE formulas. We used Stata® 17 for the simulations.7

7 The Stata code is available via the repository https://kaichengchen.github.io/lpm_simulation_post.rar

To better evaluate the findings from Horrace and Oaxaca (2006), we generate the responses to follow the ramp model for their true conditional probabilities. We also show that our main arguments hold when the true responses are probit. It is useful to observe that the response probability can be derived from a latent variable formulation:
$$y^* = \mathbf{x}\beta - u, \tag{4.8}$$
$$y = 1[y^* > 0]. \tag{4.9}$$
For the ramp model in (4.1), suppose that
$$u|\mathbf{x}\sim\mathrm{Uniform}(0,1). \tag{4.10}$$
Under (4.10), the CDF of $u$ is identical to the ramp function $R(\cdot)$; it follows immediately that (4.8), (4.9), and (4.10) lead to the response probability in (4.1). In Appendix 4B, we show an extension of the above model where $u$ has variable support, which is another way to represent the role of the unit interval bounds for response probabilities.
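The latent-variable representation translates directly into simulation code. A minimal sketch, assuming the same Python setup as the earlier sketches (the chapter's simulations themselves are in Stata):

    import numpy as np

    def simulate_ramp_dgp(n, beta, rng):
        """Generate (y, X) from y = 1[x*beta - u > 0] with u ~ Uniform(0, 1),
        so that P(y = 1 | x) = R(x*beta) as in (4.1)."""
        x = rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        u = rng.uniform(size=n)                 # CDF of u is the ramp function
        y = (X @ beta - u > 0).astype(float)
        return y, X

    rng = np.random.default_rng(7)
    y, X = simulate_ramp_dgp(100_000, np.array([0.5, 0.25]), rng)
    print(y.mean())   # empirical P(y = 1)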
Initially, the true models take the form (we drop the $o$ subscript on $\beta$ here)

$$y = 1[\beta_0 + \beta_1 x_1 + \beta_2 x_2 - u > 0].$$

For a given choice of $(\beta_0, \beta_1, \beta_2) = (b_0, b_1, b_2)$, we can scale $(\beta_1, \beta_2)$ by a positive constant $c$, $(\beta_0, \beta_1, \beta_2) = (b_0, cb_1, cb_2)$, to govern how close to linear the response probability is. When $u \sim \text{Uniform}(0,1)$, the ramp model is correctly specified, but the LPM is misspecified to varying degrees. For given initial values $(b_0, b_1, b_2)$, a larger scaling factor $c$ makes the kinks in the ramp function more likely to be binding, and the LPM can then provide a poor approximation to the response probability. Naturally, the logit and probit models are always misspecified in this case. As stated before, we focus here on the APEs rather than on the underlying parameters or on how well the models approximate the true response probability.

The sample size is $N = 1{,}000$ and 10,000 replications are used. The population (or true) APEs are not available in closed form, so we simulate them along with the estimators. In the tables that follow, the columns labeled "Simulated Truth" contain the empirical means and standard deviations of the sample APEs evaluated at the true parameter values. We also simulate the probabilities $P(y = 1)$ and $P(0 \le x\beta \le 1)$, where the first quantity is the (Monte Carlo) population response probability and the second tells us how binding the kinks of the ramp function are. We also simulate the fraction of OLS fitted values in the unit interval, $P(0 \le x\hat{\beta}_{OLS} \le 1)$. This is practically relevant because researchers often check the fraction of fitted values outside the unit interval to judge the adequacy of the LPM.

4.4.1 Symmetrically Distributed Explanatory Variables

In the first design, $(x_1, x_2)$ are generated as

$$x_1 = v/(2\sqrt{2}) + e/(2\sqrt{2}), \qquad x_2 = 1[v + r > 0],$$

where $v$, $e$, and $r$ are independent standard normals. The initial choice of parameters is $(b_0, b_1, b_2) = (1/2, 1/4, 1/4)$.

Table 4.4.1 reports the findings when $c = 1$. There is a small probability that $x\beta \notin [0,1]$, roughly 0.021. Moreover, across all simulations, about 2.0% of the OLS fitted values are outside the unit interval. The pattern is clear: all of the estimators of the APEs show very little bias and have essentially the same precision. This is true for the continuous variable, $x_1$, and the binary variable, $x_2$. Note that this is not predicted by an application of the Stoker results, because $x_2$ is a discrete variable.8 Nevertheless, this table illustrates what is often observed in practice: the LPM coefficients estimated by OLS are often close to the probit and logit APEs.

8 Admittedly, when the LPM is used for approximation, the bias for APE2 is slightly larger compared with APE1, but it is still reasonably small and comparable to that of the probit and logit approximations.

Table 4.4.1. $u \sim$ Uniform(0,1), $x_1$ normal, $x_2$ binary; $c = 1$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.2448      0.2444   0.2450   0.2483   0.2452
      sd      0.0011      0.0288   0.0292   0.0287   0.0290
APE2  mean    0.2489      0.2506   0.2493   0.2454   0.2449
      sd      0.0003      0.0325   0.0326   0.0323   0.0324

$P(y=1) = 0.6238$, $P(0 \le x\beta \le 1) = 0.9792$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9806$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9774$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.
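For concreteness, one replication of this first design can be generated as follows; this is a minimal sketch (the seed is arbitrary), with $u$ uniform so that the ramp model is the true response probability.

```stata
* Sketch of one replication of the symmetric design with c = 1.
clear
set obs 1000
set seed 12345
generate v  = rnormal()
generate e  = rnormal()
generate r  = rnormal()
generate x1 = v/(2*sqrt(2)) + e/(2*sqrt(2))
generate byte x2 = (v + r > 0)
generate byte y  = (0.5 + 0.25*x1 + 0.25*x2 - runiform() > 0)
```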
The story does not change when the constraints of the ramp function are strongly binding. In Table 4.4.2, we scale the initial coefficients by $c = 2$; now $P(0 \le x\beta \le 1)$ is only about 0.67, and about 28% of the OLS fitted values are outside $[0,1]$. And yet, for estimating the APEs, the LPM does essentially as well as probit and logit, with the bias being slightly larger for APE2. This delivers the first argument: having a large fraction of observations with fitted values inside $[0,1]$ is not a necessary condition for the OLS estimator to produce a good estimate of the APE.

Table 4.4.2. $u \sim$ Uniform(0,1), $x_1$ normal, $x_2$ binary; $c = 2$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.3219      0.3200   0.3221   0.3242   0.3220
      sd      0.0075      0.0237   0.0237   0.0226   0.0236
APE2  mean    0.4003      0.4186   0.4006   0.4051   0.4036
      sd      0.0044      0.0270   0.0274   0.0263   0.0270

$P(y=1) = 0.6769$, $P(0 \le x\beta \le 1) = 0.6738$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.8155$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.6406$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

Table 4.4.3 shows the case where $P(0 \le x\beta \le 1)$ is very close to one (the consistency result for the OLS estimator of the index coefficients in Horrace and Oaxaca (2006) applies when $P(0 \le x\beta \le 1)$ is exactly one). We would expect the LPM to work very well in this case, and it does. What is perhaps more surprising is that probit and logit work just as well, even though the true response probability is largely linear over the support of $x\beta$. These findings are a good reminder of why statements such as "the linear probability model is preferred to probit because the latter assumes normality" are not just misleading: they are wrong. In the end, what we care about is how well each approach approximates the partial effects on $P(y = 1|x)$. When we consider the APEs, all methods do well, even when the response probability has the peculiar ramp shape.

Table 4.4.3. $u \sim$ Uniform(0,1), $x_1$ normal, $x_2$ binary; $c = 0.75$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.1874      0.1873   0.1875   0.1886   0.1877
      sd      0.0001      0.0312   0.0314   0.0313   0.0313
APE2  mean    0.1875      0.1881   0.1880   0.1859   0.1856
      sd      0.0000      0.0334   0.0334   0.0333   0.0334

$P(y=1) = 0.5937$, $P(0 \le x\beta \le 1) = 0.9996$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9991$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9990$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.
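For reference, the probit and logit columns in these tables can be produced with standard commands; a minimal sketch with hypothetical variable names:

```stata
* Probit QMLE and its APEs via margins; logit is analogous.
* Factor-variable notation treats x2 as discrete for the APE.
probit y c.x1 i.x2, vce(robust)
margins, dydx(*)
logit y c.x1 i.x2, vce(robust)
margins, dydx(*)
```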
We next consider the response probability resulting from (4.8), (4.9), and (4.11), under which only the probit model is correctly specified. Tables 4.4.4 and 4.4.5 maintain the same true index slopes as Tables 4.4.1 and 4.4.2, but due to the scaling from the standard normal PDF, the true APEs are lower. Nevertheless, a similar pattern emerges. Table 4.4.4 can be compared with Table 4.4.1, where $P(0 \le x\beta \le 1)$ is close to one. In this case, OLS does an even better job of fitting the response probability inside the unit interval and, not surprisingly, the LPM estimated by OLS performs just as well as the correctly specified probit model in producing estimated APEs. As $c$ increases from 1 to 2 in Table 4.4.5, $P(0 \le x\beta \le 1)$ drops to 0.64. Due to the better-behaved Gaussian error, a large fraction of OLS fitted values still lie within $[0,1]$, and there is not much difference across methods. To better compare with the true APEs of Table 4.4.1, we increase $c$ even further in Table 4.4.6. In this case, the support of the linear index becomes quite wide, and $P(0 \le x\beta \le 1)$ is as small as 0.37. Correspondingly, only 86% of observations have OLS fitted values within $[0,1]$. However, OLS still produces estimates of the APEs as good as those produced by the nonlinear methods. Not surprisingly, probit and logit QMLE have low bias, while NLS of the ramp model has slightly higher bias in the cases (e.g., Table 4.4.6) where the true APEs are larger.

Table 4.4.4. $u \sim N(0,1)$, $x_1$ normal, $x_2$ binary; $c = 1$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.0810      0.0814   0.0814   0.0814   0.0814
      sd      0.0003      0.0302   0.0303   0.0302   0.0302
APE2  mean    0.0815      0.0817   0.0817   0.0815   0.0816
      sd      0.0002      0.0306   0.0306   0.0306   0.0306

$P(y=1) = 0.7296$, $P(0 \le x\beta \le 1) = 0.9793$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9999$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9999$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

We also generated the outcome $y$ using an interaction between $x_1$ and $x_2$, with $u$ having a uniform distribution. Specifically,

$$y = 1[\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \cdot x_2) - u > 0],$$

with initial parameters $(b_0, b_1, b_2, b_3) = (1/2, 1/4, 1/4, 1/8)$ and scaled parameters $(\beta_0, \beta_1, \beta_2, \beta_3) = (b_0, cb_1, cb_2, cb_3)$.

Table 4.4.5. $u \sim N(0,1)$, $x_1$ normal, $x_2$ binary; $c = 2$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.1448      0.1450   0.1484   0.1451   0.1450
      sd      0.0013      0.0281   0.0305   0.0280   0.0281
APE2  mean    0.1478      0.1488   0.1484   0.1480   0.1483
      sd      0.0008      0.0288   0.0288   0.0287   0.0288

$P(y=1) = 0.7550$, $P(0 \le x\beta \le 1) = 0.6438$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9890$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9845$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

Tables 4.4.7 and 4.4.8 display simulation results with uniformly distributed $u$ and normally distributed $u$, respectively. The scaling factor $c$ is set to 2 to focus on scenarios with small $P(0 \le x\beta \le 1)$ and potentially small $P(0 \le x\hat{\beta}_{OLS} \le 1)$. Recall that both $x_1$ and $x_2$ have symmetric distributions, but this functional form falls outside Stoker's results because $x_2$ is discrete, and so is $x_1 \cdot x_2$: it has a mass point at zero and is otherwise continuously distributed. However, the four approaches, with the interaction term included in estimation, delivered similar estimated APEs that were close to the simulated "true" APEs (as before, the probit, logit, and LPM approaches use a misspecified response probability).

Table 4.4.6. $u \sim N(0,1)$, $x_1$ normal, $x_2$ binary; $c = 4$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.2296      0.2295   0.2368   0.2298   0.2296
      sd      0.0043      0.0253   0.0257   0.0247   0.0249
APE2  mean    0.2375      0.2396   0.2420   0.2375   0.2393
      sd      0.0029      0.0257   0.0279   0.0255   0.0256

$P(y=1) = 0.7733$, $P(0 \le x\beta \le 1) = 0.3725$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.8605$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.7028$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

Table 4.4.7. $u \sim$ Uniform(0,1), $x_1$ normal, $x_2$ binary; $c = 2$; with interaction ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.3634      0.3606   0.3638   0.3664   0.3641
      sd      0.0089      0.0245   0.0249   0.0241   0.0246
APE2  mean    0.3509      0.3777   0.3512   0.3471   0.3456
      sd      0.0040      0.0281   0.0289   0.0275   0.0278

$P(y=1) = 0.6645$, $P(0 \le x\beta \le 1) = 0.6436$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.8554$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.6403$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.
Table 4.4.8. $u \sim N(0,1)$, $x_1$ normal, $x_2$ binary; $c = 2$; with interaction ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.1685      0.1684   0.1717   0.1689   0.1689
      sd      0.0013      0.0280   0.0295   0.0279   0.0280
APE2  mean    0.1393      0.1427   0.1418   0.1392   0.1384
      sd      0.0005      0.0290   0.0292   0.0292   0.0293

$P(y=1) = 0.7566$, $P(0 \le x\beta \le 1) = 0.6437$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9842$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9728$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

4.4.2 Asymmetrically Distributed Explanatory Variables

The story changes markedly when the distributions of $x_1$ and $x_2$ are asymmetric. With $v$, $e$, and $r$ generated as before, $x_1$ and $x_2$ are now generated as

$$x_1 = \exp\left(-1/4 + v/(2\sqrt{2}) + e/(2\sqrt{2})\right), \qquad x_2 = 1[-1/4 + v + e > 0],$$

so that $x_1$ has a lognormal distribution. The variable $x_2$ is still binary, but $P(x_2 = 1)$ is below 0.5. The unscaled parameter values are, again, $(b_0, b_1, b_2) = (1/2, 1/4, 1/4)$.
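A minimal sketch of this asymmetric design (one replication, hypothetical variable names) follows; only the construction of $x_1$ and $x_2$ changes relative to the earlier sketch.

```stata
* Sketch of the asymmetric design: lognormal x1 and a binary x2 with
* success probability below one half; c = 1 here.
clear
set obs 1000
generate v  = rnormal()
generate e  = rnormal()
generate x1 = exp(-0.25 + v/(2*sqrt(2)) + e/(2*sqrt(2)))
generate byte x2 = (-0.25 + v + e > 0)
generate byte y  = (0.5 + 0.25*x1 + 0.25*x2 - runiform() > 0)
```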
Table 4.5.1 repeats the same experiment as Table 4.4.1, with scaling factor $c = 1$, except that the explanatory variables are now asymmetrically distributed. We observe that the OLS-estimated APE for $x_1$ under the LPM is severely biased. The misspecified probit and logit models estimated by QMLE appear to be slightly biased, too. The ramp model is correctly specified and, as predicted by the asymptotic properties given in Section 4.3, the NLS estimator continues to perform well. The relative biases of probit and logit are also higher than in the previous tables, but not to as high a degree as the LPM.

Table 4.5.1. $u \sim$ Uniform(0,1), $x_1$ lognormal, $x_2$ asym. binary; $c = 1$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.1975      0.1299   0.1988   0.2225   0.2203
      sd      0.0032      0.0220   0.0350   0.0361   0.0383
APE2  mean    0.2203      0.2354   0.2211   0.2291   0.2298
      sd      0.0020      0.0226   0.0233   0.0226   0.0230

$P(y=1) = 0.8024$, $P(0 \le x\beta \le 1) = 0.7900$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9011$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.7857$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

The findings in Table 4.5.2 are striking. Even though $P(0 \le x\beta \le 1)$ is high (around 0.95) and the OLS fitted values are very rarely outside the unit interval (only about 3.3 percent of the time), LPM/OLS is badly biased for the APEs and notably worse than the other methods. This goes against the conventional wisdom of checking the proportion of fitted values within $[0,1]$ and confirms the second argument: having a large fraction of OLS fitted values within the unit interval is not sufficient. Among the other estimators, Ramp/NLS has a smaller bias for both APEs, while probit and logit appear to have small bias for the discrete APE but higher bias for the continuous APE. With respect to the performance of the LPM, the results with a normally distributed $u$ and with an interaction term are similar and so are skipped for brevity.

Table 4.5.2. $u \sim$ Uniform(0,1), $x_1$ lognormal, $x_2$ asym. binary; $c = 0.75$ ($N = 1000$)

              Simulated   LPM      Ramp     Probit   Logit
              Truth*      (OLS)    (NLS)    (QMLE)   (QMLE)
APE1  mean    0.1776      0.1486   0.1796   0.2110   0.2111
      sd      0.0013      0.0248   0.0345   0.0358   0.0378
APE2  mean    0.1828      0.1910   0.1829   0.1835   0.1835
      sd      0.0007      0.0278   0.0282   0.0281   0.0285

$P(y=1) = 0.7413$, $P(0 \le x\beta \le 1) = 0.9471$,
$P(0 \le x\hat{\beta}_{OLS} \le 1) = 0.9671$, $P(0 \le x\hat{\beta}_{NLS} \le 1) = 0.9446$.
*This column contains the empirical means and standard deviations of the sample APEs at the true parameter values.

4.5 Mortgage Approval Probabilities

As an illustration of linear and nonlinear estimators for binary response models, we revisit the analysis of mortgage lending decisions from Hunter and Walker (1996).9 We compare linear and nonlinear estimates of the average effect of being white on the probability of loan approval, holding constant a number of loan, property, and borrower characteristics. Table 4.5.1 presents basic summary statistics for the dependent variable "approve" and 23 covariates.

9 We use a version of the loan applications dataset provided by Mary Beth Walker for Wooldridge (2019).

Table 4.5.1: Loan Approval Summary Statistics ($N = 1989$)

Variable   Description
approve    =1 if loan approved
white      =1 if white
loanamt    Loan amount, $1000s
suffolk    =1 if in Suffolk County
appinc     Applicant income, $1000s
unit       Number of units in property
married    =1 if applicant married
dep        Number of dependents
emp        Years employed in line of work
yjob       Years at this job
atotinc    Total monthly income
self       =1 if self-employed
other      Other financing, $1000s
rep        Number of credit reports
pubrec     =1 if filed bankruptcy
hrat       Housing expense, % of total income
obrat      Other obligations, % of total income
cosign     =1 if there is a cosigner
sch        =1 if > 12 years schooling
mortno     =1 if no mortgage history
mortlat1   =1 if one or two late payments
mortlat2   =1 if more than two late payments
chist      =0 if accounts are delinquent 60 or more days
loanprc    Loan amount / purchase price

The table reports the mean, standard deviation, skewness, and kurtosis of each variable; for example, 88% of the sampled loans were approved and 85% of applicants were white.

For our index model, we include interactions between "white" and all other explanatory variables to allow factors like loan amount and credit history to have a differential impact on approval probability by group. Let $w$ denote "white" and $z$ be a vector of the 22 other covariates, so that $x = (1, z, w, wz)$ and $\beta = (\beta_0, \beta_z, \beta_w, \beta_{wz})$, where $\beta_0$ is the intercept, $\beta_z$ and $\beta_w$ are the coefficients on $z$ and $w$, respectively, and $\beta_{wz}$ is the vector of coefficients on $wz$. The partial effects we average are formed as the difference between the probabilities evaluated at $w = 1$ and $w = 0$:

$$APE_w = E\left[G(\beta_0 + \beta_w + z(\beta_z + \beta_{wz})) - G(\beta_0 + z\beta_z)\right],$$

where $G(\cdot)$ is the identity function (for the LPM), the probit CDF, the logit CDF, or the ramp function.

Table 4.5.2 presents the results. Using the LPM estimated by OLS, about 18% of observations have predicted probabilities outside the unit interval.10 The Horrace and Oaxaca results then clearly imply that OLS is inconsistent for the slope parameters if the ramp model is correct. There is little reason to expect the LPM to approximate this APE either, based on the theoretical results of Stoker (1986) or our simulation study.

10 Within this 18%, 98% of observations had predicted values greater than 1.
Many of the explanatory variables are binary, and the continuous variables (e.g., income) tend to be skewed. For each variable, normality is strongly rejected by a Jarque-Bera test (a joint test of skewness and kurtosis), with p-values well below 1%. The model also includes interactions between the continuous variables and a binary variable. Using the LPM estimates, the APE of white is 5.3 percentage points, and it is only marginally significant. Using the nonlinear estimators, the APEs are each a bit larger, at about 7.0 percentage points, and all are significant at the 1% level.

Table 4.5.2: Estimates of the APE of "White" on Loan Approval

                       LPM      Ramp     Probit   Logit
                       (OLS)    (NLS)    (QMLE)   (QMLE)
Estimate               0.0532   0.0706   0.0695   0.0712
Robust SE              0.0278   0.0227   0.0220   0.0219
Mean Squared Error     0.1171   0.0839   0.0840   0.0837

Notes: There were only 1976 complete cases out of 1989 total observations. All robust standard errors were computed using the sandwich form and the delta method. The fraction of predicted linear indexes within the unit interval is 0.8173 by OLS and 0.6027 by NLS.
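For reference, the logit column can be sketched with standard commands; the covariate list below is abbreviated and hypothetical, whereas the model in the text interacts white with all 22 other covariates.

```stata
* Hypothetical sketch of the logit APE of white with interactions;
* margins computes the average discrete change in P(approve = 1).
logit approve i.white##(c.loanamt c.appinc c.obrat i.married), vce(robust)
margins, dydx(white)
```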
Interestingly, OLS predicts only 18% of observations with indexes outside the unit interval, whereas NLS predicts nearly 40%, which follows the pattern of many of our simulations from the previous section and suggests that trimming the sample once is not sufficient to consistently estimate the parameters or the APEs under the piecewise linear model.11 Of this 40%, 99% had NLS-predicted linear indexes greater than 1, and most had high predicted probabilities of approval regardless of the model or counterfactual race.12 Model selection by the minimum mean squared error favors logit, though the other nonlinear models are very similar.

11 The reason we report the fraction of NLS-predicted linear indexes outside 0 and 1 here is to illustrate what proportion of observations would have been trimmed by the iterative trimming OLS procedure. We note that this quantity is not of essential interest, just as the linear indexes in probit and logit models are not.

12 NLS drops these observations because they have predicted indexes outside the unit interval, not necessarily because they have high leverage. In fact, under the logit model, the average Pregibon (1981) leverage statistic for the 40% ("predict lev, hat" in Stata following logit estimation) was lower (0.008) than the average for the included observations (0.033).

4.6 Implications for Empirical Research

We have revisited the conclusions reached by Horrace and Oaxaca (2006) concerning the ability of the linear projection parameters, consistently estimated by OLS, to recover interesting parameters. We argue that Horrace and Oaxaca's focus on the parameters of the underlying index model is misguided; instead, one should focus on the APEs. Focusing on the APEs is hardly controversial, as almost every modern study that employs a model nonlinear in the explanatory variables reports estimated APEs.

Once the focus is on the APEs, a few useful conclusions emerge. First, having a high fraction of estimated response probabilities in $[0,1]$ is neither necessary nor sufficient for good performance of the LPM. Notably, when the explanatory variables have a multivariate normal distribution, the linear projection parameters are identical to the population APEs under a general index model, and this is true even when the flat parts of the ramp function occur with high probability, i.e., when $P(0 \le x\beta \le 1)$ is small. In this case, the linear projection parameters, $\gamma_j$, will be greatly attenuated toward zero compared with the index parameters, $\beta_j$. We find that OLS estimation of the LPM continues to have good finite-sample properties for the APEs in many cases when the covariates are symmetrically distributed. When the explanatory variables have asymmetric distributions, however, the conclusions for the LPM are not as sanguine, unless the support of $x\beta$ is contained entirely in the interval $[0,1]$. Some simulations show that even when $x\beta$ falls in the unit interval with high probability and roughly 97% of the OLS fitted values are inside $[0,1]$ (Table 4.5.2), the linear projection parameters are not very close to the true APEs.

For the DGPs we study, we also find that the logit and probit models, estimated by quasi-MLE (because the response probabilities are misspecified), tend to approximate the APEs very well. Especially when the support of $x\beta$ is wide relative to $[0,1]$, the logit and probit approximations to the APEs may be notably better than those of the LPM when the covariates have asymmetric distributions, though this is not guaranteed. Although the ramp function may not be particularly realistic as a model for the response probability, we have shown that NLS estimation based on it is consistent (for the best MSE approximation to the true response probability) and asymptotically normal. A nonlinear model, of course, offers other advantages over the LPM, such as more realistic response probabilities and nonconstant partial effects. Especially given the ease of modern computation, an implication of our simulation findings is that researchers should generally try a nonlinear estimator, as it may be more robust to covariate asymmetry and variance than OLS estimation of the LPM.

To summarize: in evaluating different strategies, one needs to carefully define the population quantities of interest and then make proper comparisons across approaches. We find that probit, logit, and the ramp model have the best finite-sample properties for estimating the APEs across the DGPs we study. However, when the APEs are of interest, we also find that the LPM is more widely applicable than a simple reading of Horrace and Oaxaca might suggest.

The conclusions drawn here are easily extended to the case where $y$ is a fractional response, where the limit values zero and one can occur with positive probability. In particular, the results of Stoker (1986) apply to $E(y|x)$. If this conditional mean follows the same ramp function, the qualitative conclusions obtained in the binary case remain.

BIBLIOGRAPHY

Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

Horowitz, J. L. and Savin, N. (2001). Binary response models: Logits, probits and semiparametrics. Journal of Economic Perspectives, 15(4):43–56.

Horrace, W. C. and Oaxaca, R. L. (2006). Results on the bias and inconsistency of ordinary least squares for the linear probability model. Economics Letters, 90(3):321–327.

Hunter, W. C. and Walker, M. B. (1996). The cultural affinity hypothesis and mortgage lending decisions. Journal of Real Estate Finance and Economics, 13:57–70.

Li, C., Poskitt, D. S., Windmeijer, F., and Zhao, X. (2022). Binary outcomes, OLS, 2SLS and IV probit. Econometric Reviews, 41(8):859–876.
Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2113–2245.

Pregibon, D. (1981). Logistic regression diagnostics. The Annals of Statistics, 9(4):705–724.

Ruud, P. A. (1983). Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models. Econometrica, 51(1):225–228.

Stoker, T. M. (1986). Consistent estimation of scaled coefficients. Econometrica, 54(6):1461–1481.

van den Berg, G. J. and Siflinger, B. M. (2022). The effects of a daycare reform on health in childhood: Evidence from Sweden. Journal of Health Economics, 81:102577.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.

Wooldridge, J. M. (2019). Introductory Econometrics: A Modern Approach. Cengage Learning.

APPENDIX 4A
PROOFS FOR CHAPTER 4

Proof of Theorem 4.2: We obtain the asymptotic normality of the NLS estimator by applying Theorem 7.1 of Newey and McFadden (1994). Conditions (i) and (ii) of Theorem 7.1 follow from our assumptions. As discussed in the main text, condition (iii) is satisfied as long as $x$ contains a continuous variable $x_j$ with nonzero $\beta_{jo}$, so that $P(x_i\beta_o = 0 \text{ or } x_i\beta_o = 1) = 0$. For condition (iv), notice that the first derivative of the objective function is well defined at $\beta_o$ with probability 1:

$$D_N(\beta_o) = \nabla_\beta Q_N(\beta_o) = \frac{1}{N}\sum_{i=1}^{N} x_i'(y_i - x_i\beta_o)1\{x_i\beta_o \in (0,1)\} = \frac{1}{N}\sum_{i=1}^{N} x_i' u_i 1\{x_i\beta_o \in (0,1)\},$$

where $u_i = y_i - R(x_i\beta_o)$. Since $E\left[\|x_i' u_i\| 1\{x_i\beta_o \in (0,1)\}\right] < \infty$ under the assumption $E\|x_i\|^2 < \infty$, the vector Lindeberg-Levy CLT applies:

$$\sqrt{N} D_N(\beta_o) \overset{d}{\to} N(0, \Omega(\beta_o)),$$

giving condition (iv). Lastly, for condition (v), following Newey and McFadden (1994), we can rewrite

$$\sqrt{N}\left[Q_N(\beta) - Q_N(\beta_o)\right] = \sqrt{N}\left[D_N(\beta_o)(\beta - \beta_o) + Q(\beta) - Q(\beta_o)\right] + \|\beta - \beta_o\| M_N(\beta),$$

where $M_N(\beta)$ is the remainder term, defined as

$$M_N(\beta) = \frac{\sqrt{N}\left[Q_N(\beta) - Q_N(\beta_o) - D_N(\beta_o)'(\beta - \beta_o) - (Q(\beta) - Q(\beta_o))\right]}{\|\beta - \beta_o\|}.$$

Let $U_N$ be a neighborhood of $\beta_o$: $U_N = \{\beta \in \mathcal{B} : \|\beta - \beta_o\| < \varepsilon_N\}$, where $\varepsilon_N \to 0$, and consider any $\beta \in U_N$. Since $D_N(\beta_o)$ is the gradient of $Q_N(\beta)$ at $\beta_o$, $Q_N(\beta) - Q_N(\beta_o) - D_N(\beta_o)(\beta - \beta_o)$ goes to zero faster than $\|\beta - \beta_o\|$ as $\beta$ goes to $\beta_o$, by the definition of the gradient. Similarly, because $\nabla_\beta Q(\beta_o) = E(s_i(\beta_o)) = 0$, $Q(\beta) - Q(\beta_o)$ goes to zero faster than $\|\beta - \beta_o\|$ as $\beta$ goes to $\beta_o$. Under the moment conditions, we can easily show that $Q_N(\beta) - Q(\beta) \to 0$ in probability for each $\beta$, so $\sqrt{N}[Q_N(\beta) - Q(\beta)]$ is bounded in probability for each $\beta$. Also note that $\sqrt{N} D_N(\beta_o)$ is bounded in probability due to asymptotic normality. Since the numerator is bounded in probability and converges to zero faster than the denominator, we conclude that $\sup_{\beta \in U_N} |M_N(\beta)| \to 0$ in probability as $N \to \infty$, which implies condition (v).

Proof of Theorem 4.3: Consider $\Omega_N(\hat{\beta})$:

$$\Omega_N(\hat{\beta}) = \frac{1}{N}\sum_{i=1}^{N} x_i' x_i \left(y_i - R(x_i\hat{\beta})\right)^2 1\{x_i\hat{\beta} \in (0,1)\} \equiv \frac{1}{N}\sum_{i=1}^{N} a(x_i, \hat{\beta}).$$

Note that $E|y_i - R(x_i\beta)|^4 \le 1$ for any $\beta \in \mathcal{B}$, since both $y_i$ and $R(\cdot)$ are naturally bounded in $[0,1]$ with probability 1. Then we have

$$E \sup_{\beta \in \mathcal{B}} \|a(x_i, \beta)\| \le \left(E\|x_i\|^4\, E|y_i - R(x_i\beta)|^4\right)^{1/2} < \infty,$$

where the first inequality follows from Hölder's inequality. Also note that $a(x_i, \beta)$ is continuous at $\beta_o$ with probability one, given that $P(x_i\beta_o = 0) = P(x_i\beta_o = 1) = 0$. Then we can apply Lemma 4.3 of Newey and McFadden (1994):

$$\Omega_N(\hat{\beta}) = \frac{1}{N}\sum_{i=1}^{N} a(x_i, \hat{\beta}) \overset{p}{\to} E\left[a(x_i, \beta_o)\right] = \Omega(\beta_o).$$
Similarly, Lemma 4.3 also applies to $A_N(\hat{\beta}) = \frac{1}{N}\sum_{i=1}^{N} x_i' x_i 1\{x_i\hat{\beta} \in (0,1)\}$:

$$\operatorname*{plim}_{N \to \infty} \frac{1}{N}\sum_{i=1}^{N} x_i' x_i 1\{x_i\hat{\beta} \in (0,1)\} = A(\beta_o).$$

So we conclude that

$$\hat{V} = A_N(\hat{\beta})^{-1}\Omega_N(\hat{\beta})A_N(\hat{\beta})^{-1} \overset{p}{\to} A(\beta_o)^{-1}\Omega(\beta_o)A(\beta_o)^{-1}.$$

APPENDIX 4B
THE RAMP MODEL WITH VARIABLE SUPPORT

In this appendix, we modify and extend the Horrace and Oaxaca setup to show how the constraint on the linear index through a ramp function can be interpreted in terms of the support of the latent model error term. In particular, write

$$y^* = x\beta - u, \quad u|x \sim \text{Uniform}(-a, a), \quad y = 1[y^* > 0] \quad (4B.1)$$

for some $a > 0$. Compared with Horrace and Oaxaca, we have shifted the intercept so that $u$ has a symmetric distribution about its mean of zero. Also, we allow $u$ to have narrow or wide support, depending on $a$. The CDF for the Uniform$(-\sqrt{3}, \sqrt{3})$ distribution, which has unit variance, is graphed in Figure 1.

[Figure 1: The CDF of $y$ with $u|x \sim$ Uniform$(-\sqrt{3}, \sqrt{3})$.]

Given the latent variable model in (4B.1), we can derive the response probability:

$$p(x) \equiv P(y = 1|x) = P(y^* > 0|x) = P(u \le x\beta|x) = F_u(x\beta) = \begin{cases} 0, & x\beta < -a \\ \frac{x\beta + a}{2a}, & -a \le x\beta \le a \\ 1, & x\beta > a. \end{cases}$$

We write this function as $F_u(x\beta) \equiv R_a(x\beta)$, a ramp function that is nondifferentiable at $-a$ and $a$. For an $x_j$ with a positive coefficient, the response probability has the same shape as in Figure 1. As $a$ increases relative to $\beta$, the response probability is linear over more of the support of $x$. If

$$P(-a \le x\beta \le a) = 1, \quad (4B.2)$$

then, with probability one, $R_a(x\beta) = (x\beta + a)/(2a)$, a linear function of $x$. In this case, the partial effects are constant and equal to $\beta_j/(2a)$, $j = 2, \ldots, K$. These are also the linear projection parameters $\gamma_j$, and so OLS consistently estimates the APEs under (4B.2).

If $x_j$ is a continuous variable, we are interested in the APE defined as a derivative, which exists with probability one when $x\beta$ is continuously distributed. At $x\beta \in \{-a, a\}$ the definition of the partial effect is immaterial. To be concrete, take

$$PE_j(x) = \frac{\beta_j}{2a} \cdot 1[-a \le x\beta \le a].$$

Notice that $PE_j(x) = 0$ if $x\beta < -a$ or $x\beta > a$, because we are on one of the flat parts of the ramp. This feature of $PE_j(x)$ is taken into account in computing the APE:

$$APE_j = E[PE_j(x)] = \frac{\beta_j}{2a} \cdot P(-a \le x\beta \le a).$$

The case that aligns with Horrace and Oaxaca is $a = 1/2$, so that the Uniform$(0,1)$ distribution has just been shifted to have zero mean, in which case $|APE_j| \le |\beta_j|$. It is easily seen that $|APE_j| \le |\beta_j|$ for any $a \ge 1/2$, and the difference between $APE_j$ and $\beta_j$ can be large. In the extended model (4B.1), depending on the values of $a$ and $P(-a \le x\beta \le a)$, $|APE_j|$ need not be smaller than $|\beta_j|$. While the latent error support parameter $a$ is not separately identified from $\beta$, this model is a convenient device for generating data where the unit-interval bounds on probabilities are binding to varying degrees.
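A minimal sketch of data generation from (4B.1) follows; the slope values here are hypothetical, and larger $a$ makes the unit-interval bounds less binding.

```stata
* Sketch: one draw from the variable-support ramp model (4B.1)
* with a = sqrt(3), so u has unit variance.
clear
set obs 1000
local a = sqrt(3)
generate x = rnormal()
generate u = -`a' + 2*`a'*runiform()      // Uniform(-a, a)
generate byte y = (0.25 + 0.5*x - u > 0)  // hypothetical slopes
```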
CHAPTER 5
IDENTIFICATION OF PARTIAL EFFECTS WITH ENDOGENOUS CONTROLS
(CO-AUTHORED WITH KYOO IL KIM)

5.1 Introduction

In models with endogenous treatment, to obtain consistent estimates of treatment effects, researchers commonly impose conditional (mean) independence or use instrumental variables (IV) for the treatment, while rather casually assuming the other observable control variables are exogenous. In reality, however, empirical researchers often end up with control variables that may be subject to additional endogeneity concerns, while finding instruments for every endogenous control is challenging or impossible. In this note, we demonstrate that if the objects of interest are limited to parameters associated with the treatment, then we can get around the endogeneity of the control variables in certain settings.

To illustrate the problem, consider the following linear model:

$$Y = D\tau + X\beta + \varepsilon, \quad E[\varepsilon|D, X] = E[\varepsilon|X],$$

where $D$ is a treatment variable of interest, $X$ is another observable determinant of the outcome $Y$, and $\varepsilon$ is an unobserved determinant. A common scenario in which the endogenous control problem arises is that the treatment $D$ depends on $X$ while, at the same time, $X$ depends on $\varepsilon$, too. As a result, (1) without $X$ in the specification, the dependence between $D$ and $X$ can cause bias in the OLS estimation; (2) with $X$ included, the dependence between $X$ and $\varepsilon$ can also pollute the estimation of the partial effect $\tau$, even if $D$ is conditionally independent of $\varepsilon$. In either case (1) or (2), $\tau$ is not identified by the linear projection parameters, so the OLS estimator is biased. In case (1), it is simply an omitted variable bias. To see the bias in case (2), let $W = (D, X)$ and $\theta = (\tau', \beta')'$; the linear projection parameters are defined as

$$\gamma_{LP} \equiv E[W'W]^{-1}E[W'Y] = \theta + E[W'W]^{-1}E[W'E[\varepsilon|X]].$$

Therefore, without further restriction on $E[\varepsilon|X]$, OLS does not produce a consistent estimate of $\theta$, or $\tau$ in particular.

In a worse scenario, $X$ may itself be an outcome of $D$, in which case $X$ is described as a bad control, as in Angrist and Pischke (2009). It is well known that a bad control can cause problems for identification, and the problem is present even if we start with a randomly assigned treatment (see Wooldridge, 2005; Lechner, 2008). However, as is shown below, $\tau$ can still be identified even when $X$ is affected by $D$, as long as $X$ is not solely a function of $D$. More formally, this extra condition is referred to as measurable separability, first introduced in Florens et al. (1990); it will be defined formally later. At its essence, this assumption ensures that we can vary the value of $D = d$ while holding $X = x$ at a particular value $x$. Note that this still allows the distribution of $X$ to depend on $D$, and vice versa.

In the case of a continuous random variable $D$, $\tau$ is nonparametrically identified as follows:

$$\partial_d E[Y|D = d, X = x] = \tau + \partial_d (x\beta + E[\varepsilon|D = d, X = x]) = \tau + \partial_d (x\beta + E[\varepsilon|X = x]) = \tau,$$

where the last equality holds due to the measurable separability of $D$ and $X$. To see this, suppose measurable separability does not hold; for example, $X = f(D)$ almost surely, with neither being constant. Then conditioning on $D = d$ and $X = x$ necessitates $x = f(d)$.

The co-authors have approved the inclusion of this co-authored chapter. Co-author's contact: Kyoo il Kim, Department of Economics, Michigan State University. Email: kyookim@msu.edu
In that case, we would have

$$\partial_d E[Y|D = d, X = x] = \tau + \partial_d \left(f(d)\beta + E[\varepsilon|X = f(d)]\right) \neq \tau.$$

In the case of a binary $D$, measurable separability between $D$ and $X$ allows for conditioning on $D = 1$ and $D = 0$ at different realized values of $X$. Combined with the conditional independence condition, identification of $\tau$ is achieved as follows:

$$\begin{aligned}
E[Y|D = 1, X = x] - E[Y|D = 0, X = x] &= \tau + E[\varepsilon|D = 1, X = x] - E[\varepsilon|D = 0, X = x] \\
&= \tau + E[\varepsilon|X = x] - E[\varepsilon|X = x] = \tau.
\end{aligned}$$

EXAMPLE 5.1 As a concrete example, consider the linear regression model relating district-level average test scores (avgscore) to district-level educational expenditure per student (expend) and average family income (avginc), from Wooldridge (2019, Chapter 3):

$$avgscore = \alpha + \tau \cdot expend + \beta \cdot avginc + \varepsilon.$$

Suppose we are interested in the partial effect of expend, $\tau$. Since avginc is relevant for expend at the district level and avginc can also affect avgscore through other channels (e.g., private tutoring), including avginc as a control variable, or as a proxy for those unobserved determinants, is sensible. However, avginc may also be correlated with other unobserved determinants that affect both avginc and avgscore; the endogeneity of avginc then pollutes the identification of $(\tau, \beta)$ by the linear projection, and OLS fails to produce consistent estimates. Nevertheless, $\tau$ is nonparametrically identified as long as (1) expend is independent of $\varepsilon$ conditional on avginc, and (2) expend and avginc are measurably separated. Given these two conditions and the identification results above, a consistent estimate of $\tau$ is available through standard nonparametric estimators.

Does the problem go away with an excludable instrumental variable? When the instrumental variable is truly exogenous and affects the outcome only through the treatment, the answer is yes, because no control is needed. However, in practice, control variables are commonly included in models with instrumental variables, sometimes out of concern that the excludability condition holds only after controlling for certain observable variables. In those cases, again, the endogeneity of those controls can cause problems for identification. To illustrate, consider the same outcome equation with an excludable instrumental variable in a linear triangular model:

$$Y = D\tau + X\beta + \varepsilon, \qquad D = Z\pi_Z + X\pi_X + \eta.$$

In this model, the instrumental variable $Z$ is needed because $D$ is not conditionally independent of $\varepsilon$ even after conditioning on $X$, so the nonparametric identification method introduced above is no longer valid. However, without controlling for $X$, the instrument $Z$ itself may not suffice for identification because $Z$ may affect $Y$ through $X$ as well. Again, the endogeneity of $X$ poses a dilemma: (1) without controlling for $X$, $Z$ is not a valid IV; (2) with $X$ included, neither $\tau$ nor $\beta$ is identified by the usual IV or 2SLS projection without further restriction on $E[\varepsilon|X]$. Nevertheless, $\tau$ is nonparametrically identified given (i) measurable separability between $Z$ and $X$ and (ii) $\pi_Z \neq 0$:

$$\begin{aligned}
E[Y|X = x, Z = z] &= (z\pi_Z + x\pi_X + E[\eta|X = x])\tau + x\beta + E[\varepsilon|X = x], \\
E[D|X = x, Z = z] &= z\pi_Z + x\pi_X + E[\eta|X = x], \\
\frac{\partial_z E[Y|X = x, Z = z]}{\partial_z E[D|X = x, Z = z]} &= \tau. \quad (5.1)
\end{aligned}$$

Again, at its essence, the extra measurable separability condition ensures that we can vary the value of $Z = z$ while holding $X = x$ fixed. Alternatively, $\tau$ can also be nonparametrically identified through a control function approach.
In this case, because $D = Z\pi_Z + X\pi_X + \eta$ with $\pi_Z \neq 0$, the measurable separability between $Z$ and $X$ also implies the measurable separability between $(X, \eta)$ and $D$. It will be shown later that, given the exogeneity of $Z$, $D$ is independent of $\varepsilon$ conditional on $(\eta, X)$. Therefore, $\tau$ can also be identified as follows:

$$\partial_d E[Y|D = d, X = x, \eta = e] = \partial_d \left[d\tau + x\beta + E[\varepsilon|X = x, \eta = e]\right] = \tau. \quad (5.2)$$

EXAMPLE 5.2 Consider a linear triangular model relating individual wages to school attendance. In an influential paper studying the causal impact of compulsory school attendance on earnings, Angrist and Krueger (1991) use quarter of birth (qbirth) as an instrument for educational attainment (totaledu) in wage equations, based on the observation that school-entry requirements and compulsory schooling laws compel students born at the end of the year to attend school longer than students born in other months. Suppose we include parents' income (parinc) as a control because parents' income may also affect the birth quarters of their children, so its inclusion makes the exogeneity condition of the instrument more likely to hold. The heuristic model can be specified as follows:

$$\begin{aligned}
\log(wage) &= \alpha + \tau \cdot totaledu + \beta \cdot parinc + \varepsilon, \\
totaledu &= \pi_0 + \pi_1 \cdot qbirth + \pi_2 \cdot parinc + \eta.
\end{aligned}$$

However, students from high-income families may be able to access social resources that are positively correlated with earnings, making parinc an endogenous control. In that case, IV or 2SLS does not yield consistent estimates of $(\tau, \beta)$, as discussed above. Nevertheless, $\tau$ is identified nonparametrically as long as the usual restrictions for IV hold and the measurable separability condition is justified.

The heuristic exposition above focuses on linear models. When the true data-generating process is nonlinear in the parameters or nonparametric, it is not clear whether the same idea is still applicable. In particular, when the unobserved determinant is not separable from other observable covariates, it is not clear whether the dependence between the controls and the unobserved determinant could substantially change the exposition. Ideally, we would like our approach to apply to a large class of data-generating processes. We therefore derive the main results under nonparametric, nonseparable models, considering cases with and without instrumental variables. The basic idea of how nonparametric estimation helps alleviate the bias due to endogenous controls is introduced in Frölich (2008). However, our work is the first to provide a formal identification result, which not only justifies the use of nonparametric methods in the presence of endogenous controls but also delineates the boundary of this method through the measurable separability condition.

The issue of endogenous controls is prevalent in empirical research but is not well studied in the econometrics literature. One exception, outside our setting, concerns the regression discontinuity (RD) design, where Kim (2013) finds that endogenous control variables yield asymptotic bias in the RD estimator, while the inclusion of these relevant controls may offset this bias and improve some higher-order properties of the estimator. Diegert et al. (2022) assess the omitted variable bias when the controls are potentially correlated with the omitted variables in a sensitivity-analysis framework.

The rest of the paper is outlined as follows. In Section 5.2, we establish the main identification results for nonseparable models under conditional independence.
In Section 5.3, we consider methods based on an instrumental variable that is only conditionally exogenous. In Section 5.4, Monte Carlo simulations demonstrate the issue of endogenous controls and the performance of the proposed methods in finite samples. Section 5.5 concludes the note with recommendations for empirical practice.

5.2 Nonseparable Models with Conditional Independence

Identification results for a nonseparable model with an endogenous treatment, $Y = m(D, \varepsilon)$, are given in Altonji and Matzkin (2005), assuming there exists some vector $X$ such that, conditional on $X$, the treatment variable $D$ is independent of the stochastic error $\varepsilon$. However, in many empirical applications with nonparametric, semiparametric, or parametric models, the vector of control variables usually appears in the model for the outcome $Y$. The question is: can we still identify, for example, the local average response (LAR) and ATE of $D$ on $Y$ under the conditional independence assumption when $X$ is endogenous? In this section, we show the answer is positive. To focus on our main point, for convenience, we assume all relevant (conditional) probability density functions are well defined below and throughout the paper.

Consider a nonseparable nonparametric model:

$$Y = m(D, X, \varepsilon). \quad (5.3)$$

We are interested in identifying the conditional LAR (CLAR) and the unconditional LAR, denoted by $\beta(d, x)$ and $\beta(d)$, respectively. For now, the focus is on a continuous outcome $Y$, but the results can be extended to binary choice models, as shown in Altonji and Matzkin (2005). Assume $m(\cdot)$ is differentiable with respect to its first argument and $D$ is a continuous treatment; then $\beta(d, x)$ and $\beta(d)$ are defined as

$$\begin{aligned}
\beta(d, x) &= \int \frac{\partial m(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x}(\epsilon)\, d\epsilon, \\
\beta(d) &= \iint \frac{\partial m(d, x, \epsilon)}{\partial d} f_{X,\varepsilon|D=d}(x, \epsilon)\, dx\, d\epsilon,
\end{aligned}$$

where $f_{\varepsilon|D=d,X=x}(\epsilon)$ and $f_{X,\varepsilon|D=d}(x, \epsilon)$ denote the relevant conditional density functions. If $D$ is a binary random variable (or if we are interested in a discrete change in $D$), we define the CLAR and LAR as

$$\begin{aligned}
\tilde{\beta}(d, x) &= \int \left(m(1, x, \epsilon) - m(0, x, \epsilon)\right) f_{\varepsilon|D=d,X=x}(\epsilon)\, d\epsilon, \\
\tilde{\beta}(d) &= \iint \left(m(1, x, \epsilon) - m(0, x, \epsilon)\right) f_{X,\varepsilon|D=d}(x, \epsilon)\, dx\, d\epsilon.
\end{aligned}$$

Assumption 5.1 $f_{\varepsilon|D,X}(\epsilon) = f_{\varepsilon|X}(\epsilon)$ for all $\epsilon \in \mathbb{R}$.

Assumption 5.1 is the conditional independence assumption also imposed in Altonji and Matzkin (2005). We note that Assumption 5.1 does not rule out $X$ being endogenous (i.e., not independent of $\varepsilon$). If $D$ is a continuous random variable, we can represent $D$ by some function $h$ as

$$D = h(X, U), \quad (5.4)$$

where $X$ is independent of a continuous error term $U$ and $h(X, u)$ is strictly monotonic in $u$ almost surely (see Matzkin (2003)). To identify the CLAR and LAR in the continuous-treatment case while allowing for endogeneity in $X$, we need an extra rank condition:

Assumption 5.2 $D$ and $X$ are measurably separated; that is, any function of $D$ almost surely equal to a function of $X$ must be almost surely equal to a constant.

To see why this is a type of rank condition, consider a case where the condition is violated at some point in the interior of the support of $(D, X)$, i.e., $l(D) = q(X)$ for some measurable, nonconstant functions $l(\cdot)$ and $q(\cdot)$. Then $l(h(X, U)) = q(X)$. Differentiating both sides with respect to $U$, we have

$$\frac{\partial l}{\partial h}\frac{\partial h}{\partial U} = \frac{\partial q}{\partial U} = 0.$$

Given that measurable separability fails, we have $\partial l/\partial h \neq 0$, and so $\partial h/\partial U = 0$. Therefore, Assumption 5.2 requires $U$ to affect $D$.
Following Theorem 3 in Florens et al. (2008), we give primitive conditions for Assumption 5.2 as follows:

Assumption 5.3 (i) $D$ is determined by (5.4), where $X$ is continuously distributed and independent of $U$, and $h(x, u)$ is continuous in $x$. (ii) For any fixed $x$, the support of the distribution of $h(x, U)$ contains an open interval.

In Appendix 5A, we give a lemma, which follows from Theorem 3 in Florens et al. (2008), under which the conditions in Assumption 5.3 are sufficient for Assumption 5.2 to hold. Note that Assumption 5.3(i) implicitly restricts the treatment $D$ to be continuous, so it is not appropriate to impose this restriction for a binary treatment $D$. Assumption 5.3(ii) requires $U$ to be continuously distributed and $h(x, U)$ to be a continuous monotonic function of $U$ for any fixed $x$.

We now give identification results for the CLAR and LAR for both the continuous-treatment and binary-treatment cases in the following theorem:

Theorem 5.1 Consider the model defined in (5.3) and (5.4). (i) For a continuous random variable $D$, suppose that Assumptions 5.1 and 5.3 hold and $E\left[\left|\frac{\partial m(d,x,\varepsilon)}{\partial d}\right| \,\middle|\, D = d, X = x\right] < \infty$. Then the LAR and CLAR are identified for all $d \in \text{Supp}(D)$ and $x \in \text{Supp}(X|D = d)$ as

$$\beta(d, x) = \frac{\partial E[Y|D = d, X = x]}{\partial d}, \qquad \beta(d) = \int \frac{\partial E[Y|D = d, X = x]}{\partial d} f_{X|D=d}(x)\, dx.$$

(ii) For a binary random variable $D$, suppose Assumptions 5.1 and 5.2 hold and, for all $d \in \text{Supp}(D)$ and $x \in \text{Supp}(X|D = d)$, $E\left[|m(1, x, \varepsilon) - m(0, x, \varepsilon)| \,\middle|\, D = d, X = x\right] < \infty$. Then the LAR and CLAR are identified for all $d \in \text{Supp}(D)$ and $x \in \text{Supp}(X|D = d)$ as

$$\begin{aligned}
\tilde{\beta}(d, x) &= E[Y|D = 1, X = x] - E[Y|D = 0, X = x], \\
\tilde{\beta}(d) &= \int \left(E[Y|D = 1, X = x] - E[Y|D = 0, X = x]\right) f_{X|D=d}(x)\, dx.
\end{aligned}$$

The proof of Theorem 5.1 is given in Appendix 5A.

5.3 Nonseparable Triangular Models

As an alternative to the conditional independence assumption, another useful identifying restriction for addressing the endogeneity of the treatment is an excluded IV, from which we can construct a control variable that accounts for the endogeneity in the treatment equation. In applications, other observable control variables are included to make the exogeneity condition of the IV more likely to hold. Commonly, these observable controls are assumed to be exogenous and are included in both the outcome equation and the reduced-form equation. We caution that these control variables may be endogenous, too, while finding an IV for every endogenous control is not possible. In this section, we study a nonseparable triangular model similar to the one in Imbens and Newey (2009), where we explicitly include potentially endogenous control variables in the model and provide identification results for the LAR and treatment effects.

Consider the nonseparable triangular model

$$Y = g(D, X, \varepsilon), \quad (5.5)$$
$$D = q(Z, X, \eta), \quad (5.6)$$

where $D$ is a continuously distributed random variable that is endogenous to the stochastic error, $X$ is a vector of observable control variables potentially endogenous to unobservable determinants of $Y$, and $Z$ is an exogenous variable excluded from the outcome equation (5.5) that is conditionally independent of $(\varepsilon, \eta)$:

Assumption 5.4 $Z \perp\!\!\!\perp (\varepsilon, \eta) \mid X$.

This differs from Imbens and Newey (2009) in that the endogenous variables have now been separated into two vectors, $D$ and $X$, and we are only interested in identifying parameters associated with $D$. Note that $X$ is allowed to be correlated with both $D$ and $Z$, which also motivates the inclusion of $X$ in the model, as it makes Assumption 5.4 more likely to hold.
If there exists a control variable $V$ such that

$$f_{\varepsilon|D,X,V}(\epsilon) = f_{\varepsilon|X,V}(\epsilon), \quad \forall \epsilon \in \mathbb{R}, \quad (5.7)$$

and both $X$ and $V$ are measurably separated from $D$, then we can apply an approach similar to that of Section 5.2 to identify the CLAR and LAR, which in this case are defined as

$$\begin{aligned}
\beta(d, x) &= \int \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x}(\epsilon)\, d\epsilon = \iint \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x,V=v}(\epsilon) f_{V|D=d,X=x}(v)\, dv\, d\epsilon, \\
\beta(d) &= \iint \frac{\partial g(d, x, \epsilon)}{\partial d} f_{X,\varepsilon|D=d}(x, \epsilon)\, dx\, d\epsilon = \iiint \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x,V=v}(\epsilon) f_{V|D=d,X=x}(v) f_{X|D=d}(x)\, dx\, dv\, d\epsilon.
\end{aligned}$$

Under condition (5.7), measurable separability, and appropriate regularity conditions allowing the derivative to pass through the expectation, we have

$$\frac{\partial E[Y|D = d, X = x, V = v]}{\partial d} = \frac{\partial}{\partial d}\int g(d, x, \epsilon) f_{\varepsilon|D=d,X=x,V=v}(\epsilon)\, d\epsilon = \int \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x,V=v}(\epsilon)\, d\epsilon.$$

Then the CLAR and LAR are identified for all $d \in \text{Supp}(D)$ and $x \in \text{Supp}(X|D = d)$ as

$$\begin{aligned}
\int \frac{\partial E[Y|D = d, X = x, V = v]}{\partial d} f_{V|D=d,X=x}(v)\, dv &= \iint \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x,V=v}(\epsilon) f_{V|D=d,X=x}(v)\, dv\, d\epsilon = \beta(d, x), \\
\iint \frac{\partial E[Y|D = d, X = x, V = v]}{\partial d} f_{V|D=d,X=x}(v) f_{X|D=d}(x)\, dv\, dx &= \iiint \frac{\partial g(d, x, \epsilon)}{\partial d} f_{\varepsilon|D=d,X=x,V=v}(\epsilon) f_{V|D=d,X=x}(v) f_{X|D=d}(x)\, dx\, dv\, d\epsilon = \beta(d).
\end{aligned}$$

We construct such a control variable $V$ satisfying condition (5.7) in a fashion similar to Imbens and Newey (2009): $V = F_{D|Z,X}(D)$, i.e., the conditional CDF of $D$ given $(Z, X)$. The following assumption is essential for the construction of $V$ and for ensuring that the information contained in $V$ is the same as that in $\eta$.

Assumption 5.5 (i) $q(z, x, e)$ is strictly monotonic in $e$ for any fixed $(z, x)$; (ii) $\eta$ is continuously distributed with its CDF $F_\eta(e)$ strictly increasing on the support of $\eta$.

Assumption 5.5(i) allows the inverse function of $q(z, x, e)$ with respect to $e$ to exist. Assumption 5.5(ii) implies that $F_\eta(e)$ is a one-to-one function of $e$.

We further discuss the measurable separability conditions. First note that if $\eta$ is independent of $(Z, X)$ in (5.6), we can fix $\eta$ and see that $D$ and $X$ are measurably separated (i) if $Z$ and $X$ are measurably separated, which would hold under sufficient conditions similar to Assumption 5.3, and (ii) if $q(z, \cdot, \cdot)$ is continuous in $z$. Again, this essentially means that, given fixed values of $(X, \eta) = (x, e)$, we can vary $Z = z$, and hence $D = d$, because $d = q(z, x, e)$. We also provide primitive conditions for the measurable separability between $D$ and $\eta$:

Assumption 5.6 (i) $D$ is determined by (5.6), where $\eta$ is continuously distributed and independent of $(Z, X)$, and $q(z, x, e)$ is continuous in $e$. (ii) For any fixed $e$, the support of the distribution of $q(Z, X, e)$ contains an open interval.

This is the counterpart of Assumption 5.3, and it implies the measurable separability of $D$ and $\eta$ by Lemma 5A.1 in Appendix 5A. The difference is that Assumption 5.6 does impose some restrictions on $X$ and $Z$: Assumption 5.6(i) requires $X$ to be independent of the unobservable determinant of $D$, and Assumption 5.6(ii) requires $(Z, X)$ to contain a continuous element, with $q$ continuous in that element for any fixed $e$.

In the next theorem, we show that the constructed control variable $V$ satisfies condition (5.7) and is measurably separated from $D$.

Theorem 5.2 Suppose Assumption 5.4 holds for the nonseparable model in (5.5) and (5.6). Then: (i) $D$ is independent of $\varepsilon$ conditional on $(\eta, X)$. (ii) If, additionally, Assumptions 5.5 and 5.6 hold, then condition (5.7) is satisfied with $V = F_\eta(\eta) = F_{D|Z,X}(D)$, and $D$ is measurably separated from $V$.

The proof of Theorem 5.2 is given in Appendix 5A.
5.4 Simulation

In this section, we use Monte Carlo simulations to demonstrate the bias due to endogenous controls and how the proposed methods perform in finite samples. We consider two DGPs: (1) the first covers the scenario where the control is endogenous and the treatment is conditionally independent of the unobserved determinants, as in Section 5.2; (2) the second covers the scenario where the IV is conditionally valid but the control is endogenous, as in Section 5.3. First, consider DGP(1):

$$\text{DGP(1):} \quad Y = \beta_0 + \beta_1 D + \beta_2 X + U, \quad D = a^2 + N(0,1), \quad X = a, \quad U = b^2 + N(0,1),$$

where $(a, b)$ are jointly normal with mean zero, variance one, and covariance 0.75, and $N(0,1)$ denotes an independent standard normal draw (independent of $a$ and $b$). We observe that (1) $X$ is relevant for both $Y$ and $D$; (2) $U$ affects both $D$ and $X$; (3) conditional on $X$, $D$ is independent of $U$; and (4) $D$ and $X$ are measurably separated. As a result, the linear projection parameters do not identify $\beta_1$, which is the local average response and our parameter of interest. As shown in Section 5.2, $\beta_1$ is nonparametrically identified.

In Table 5.1, we compare estimates of $\beta_1$ using (i) OLS without the control, (ii) OLS with the control, (iii) nonparametric estimation via local linear regression, and (iv) third-order polynomial series regression.

Table 5.1: Simulation for DGP(1)

Methods   (i)     (ii)    (iii)   (iv)
Bias      0.374   0.374   0.114   0.001
SD        0.055   0.044   0.444   0.048

Note: Simulation results are based on 1,000 replications and random samples of size $n = 1{,}000$. The series regression uses third-order polynomials.

The results are clear: while the conventional methods that assume exogeneity of the control are severely biased, both nonparametric methods perform much better in terms of bias. The local linear kernel method is less efficient than the series approximation method. Note that we can rewrite $X = (D - N(0,1))^{1/2}$, a function of the treatment $D$, so $X$ is subject to the bad-control critique. However, the proposed methods still work when the target is the average partial effect or the local average response.
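A minimal Stata sketch of DGP(1) and estimators (ii) and (iv) follows; the true coefficients $(\beta_0, \beta_1, \beta_2) = (0, 1, 1)$ are hypothetical, since the text does not report them. Under this design, $E[U|X = x] = 0.5625x^2 + 0.4375$ is quadratic in $x$, so the cubic series absorbs it and the coefficient on $D$ is consistent for $\beta_1$.

```stata
* Sketch of DGP(1) with hypothetical coefficient values (0, 1, 1).
clear
set obs 1000
matrix C = (1, 0.75 \ 0.75, 1)
drawnorm a b, corr(C)
generate x = a
generate d = a^2 + rnormal()
generate u = b^2 + rnormal()
generate y = d + x + u
regress y d x                      // (ii): biased for beta1
regress y d c.x##c.x##c.x          // (iv): cubic series in x
```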
Next, consider DGP(2):

$$\text{DGP(2):} \quad Y = \beta_0 + \beta_1 D + \beta_2 X + U, \quad D = X + Z + \eta, \quad X = a, \quad Z = a^2 + N(0,1), \quad \eta = c, \quad U = b^2 + d^2 + N(0,1),$$

where $(c, d)$ are also jointly normal with mean zero, variance one, and covariance 0.75, and $(c, d)$ are independent of $(a, b)$. We observe that (i) $D$ is not conditionally independent of $U$ given $X$; (ii) $Z$ is not a valid IV except conditional on $X$, but $X$ is endogenous; (iii) $Z$ and $X$ are measurably separated, so the local average response parameter $\beta_1$ is nonparametrically identified as in (5.1); (iv) the measurable separability between $Z$ and $X$ here also implies the measurable separability between $D$ and $(X, \eta)$, so $\beta_1$ can also be nonparametrically identified as in (5.2), and we can use a two-step control function approach. Note that the model is linear, a special case of the nonseparable model in Section 5.3, so we can also resort to Theorem 5.2 for identification.

Table 5.2 compares the following five methods, corresponding to the column numbers in the table: (i) IV without the control; (ii) IV with the control; (iii) the nonparametric approach based on (5.1), using local linear regression; (iv) the nonparametric approach based on (5.1), using series regression; (v) the two-step control function approach based on Theorem 5.2, using series regression in the second step.

Table 5.2: Simulation for DGP(2)

Methods   (i)     (ii)    (iii)   (iv)    (v)
Bias      0.374   0.375   0.139   0.001   0.001
SD        0.057   0.052   0.161   0.066   0.068

Note: Simulation results are based on 1,000 replications and random samples of size $n = 1{,}000$.

Although (v) is the most general approach, allowing for nonseparable models, it is computationally costly because we need to estimate the conditional CDF $F_{D|Z,X}(D_i)$ for each realized $D_i$ in the sample. In practice, method (v) is implemented as follows: (a) for a given $i = 1, \ldots, n$, generate the indicator variable $1\{D < D_i\}$; (b) use local linear kernel methods to regress $1\{D < D_i\}$ on $X$ and $Z$ and retain the fitted value for the $i$-th observation only. Repeating steps (a)-(b) for each $i = 1, \ldots, n$ yields the estimated $F_{D|Z,X}(D_i)$ for every $i$. Finally, we nonparametrically regress $Y$ on $D$, $X$, and the estimated $F_{D|Z,X}(D_i)$ using polynomial series regression.

The results of this comparison are as theory predicts. While the IV-based methods (i) and (ii), which implicitly impose exogeneity of the control, fail to produce consistent estimates of the partial effect of interest, the alternative identification methods paired with standard nonparametric estimators in (iii)-(v) perform well in finite samples.
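Because the first stage in DGP(2) is linear with an additive $\eta$, a simplified control function sketch can use the first-stage residual in place of $F_{D|Z,X}(D)$; under Assumption 5.5 the two generate the same sigma-algebra. The coefficient values are hypothetical, as above, and the second-step standard errors ignore the generated regressor.

```stata
* Simplified control function sketch for DGP(2): with a linear first
* stage, the residual etahat carries the same information as V.
regress d x z
predict double etahat, residuals
regress y d c.x##c.x##c.x c.etahat##c.etahat##c.etahat
```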
5.5 Conclusion

This note addresses a critical, prevalent, yet often overlooked problem in empirical research: the endogeneity of control variables. Building on the insightful observation and discussion in Frölich (2008) that nonparametric estimation can help with the endogenous control problem, we provide formal identification results in a simple linear model with and without the presence of instrumental variables, and we extend the results to a general class of nonseparable models, focusing on identifying local average responses.

For empirical practice, this note provides a more flexible framework for dealing with endogenous controls. Using the primitive conditions provided in this note, researchers can evaluate whether the inclusion of potentially endogenous control variables is desirable and whether estimation and inference can be made robust to potential endogeneity in the control. Estimation based on our identification results is also standard in common empirical settings.

BIBLIOGRAPHY

Altonji, J. G. and Matzkin, R. L. (2005). Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica, 73(4):1053–1102.

Angrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics, 106(4):979–1014.

Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

Diegert, P., Masten, M. A., and Poirier, A. (2022). Assessing omitted variable bias when the controls are endogenous. arXiv preprint arXiv:2206.02303.

Durrett, R. (2019). Probability: Theory and Examples. Cambridge University Press.

Florens, J.-P., Heckman, J. J., Meghir, C., and Vytlacil, E. (2008). Identification of treatment effects using control functions in models with continuous, endogenous treatment and heterogeneous effects. Econometrica, 76(5):1191–1206.

Florens, J.-P., Mouchart, M., and Rolin, J.-M. (1990). Elements of Bayesian Statistics. CRC Press.

Frölich, M. (2008). Parametric and nonparametric regression in the presence of endogenous control variables. International Statistical Review, 76(2):214–227.

Imbens, G. W. and Newey, W. K. (2009). Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481–1512.

Kim, K. (2013). Regression discontinuity design with endogenous covariates. Journal of Economic Theory and Econometrics, 24(4):320–337.

Lechner, M. (2008). A note on the common support problem in applied evaluation studies. Annales d'Économie et de Statistique, pages 217–235.

Matzkin, R. L. (2003). Nonparametric estimation of nonadditive random functions. Econometrica, 71(5):1339–1375.

Wooldridge, J. M. (2005). Violating ignorability of treatment by controlling for too many factors. Econometric Theory, 21(5):1026–1028.

Wooldridge, J. M. (2019). Introductory Econometrics: A Modern Approach. Cengage Learning.

APPENDIX 5A PROOFS FOR CHAPTER 5

We first restate Theorem 3 of Florens et al. (2008) as Lemma 5A.1 below, which gives primitive conditions for measurable separability.

Lemma 5A.1 Suppose 𝐷 is determined by 𝐷 = ℎ(𝑍, 𝑉), where 𝑉 is continuously distributed and independent of 𝑍, and ℎ(𝑧, 𝑣) is continuous in 𝑣. Further, for any fixed 𝑣, the support of the distribution of ℎ(𝑍, 𝑣) contains an open interval. Then, 𝐷 and 𝑉 are measurably separated.

Proof of Theorem 5.1: First, note that Assumption 5.3 implies Assumption 5.2 by Lemma 5A.1 with 𝑍 = 𝑈 and 𝑉 = 𝑋. For continuous 𝐷, Assumptions 5.1 and 5.2 imply that

∂f(ε | D = d, X = x)/∂d = ∂f(ε | X = x)/∂d = 0.

Then, we have

∂E[Y | D = d, X = x]/∂d = ∂/∂d ∫ m(d, x, ε) f_{ε|D=d,X=x}(ε) dε
                        = ∫ (∂m(d, x, ε)/∂d) f_{ε|D=d,X=x}(ε) dε,

where the last equality follows from the Leibniz integral rule and the chain rule. Therefore, 𝛽(𝑑, 𝑥) is identified by ∂E[Y | D = d, X = x]/∂d for all 𝑑 ∈ Supp(𝐷) and 𝑥 ∈ Supp(𝑋 | 𝐷 = 𝑑). Furthermore, integrating both sides with respect to 𝑋 given 𝐷 = 𝑑 gives

∫ (∂E[Y | D = d, X = x]/∂d) f_{X|D=d}(x) dx = ∫∫ (∂m(d, x, ε)/∂d) f_{X,ε|D=d}(x, ε) dx dε.

So, 𝛽(𝑑) is identified by ∫ (∂E[Y | D = d, X = x]/∂d) f_{X|D=d}(x) dx.

In the case of binary 𝐷, Assumption 5.1 implies that f_{ε|D=1,X=x}(ε) = f_{ε|D=0,X=x}(ε). Assumption 5.2 allows for conditioning on 𝐷 = 1 and 𝐷 = 0 at different realized values of 𝑋, so we have

E[Y | D = 1, X = x] − E[Y | D = 0, X = x]
= ∫ m(1, x, ε) f_{ε|D=1,X=x}(ε) dε − ∫ m(0, x, ε) f_{ε|D=0,X=x}(ε) dε
= ∫ (m(1, x, ε) − m(0, x, ε)) f_{ε|D=d,X=x}(ε) dε.

So, β̃(𝑑, 𝑥) is identified for all 𝑑 ∈ Supp(𝐷) and 𝑥 ∈ Supp(𝑋 | 𝐷 = 𝑑), and integrating both sides with respect to 𝑋 given 𝐷 = 𝑑 gives

β̃(𝑑) = ∫ (E[Y | D = 1, X = x] − E[Y | D = 0, X = x]) f_{X|D=d}(x) dx
      = ∫∫ (m(1, x, ε) − m(0, x, ε)) f_{X,ε|D=d}(x, ε) dε dx.

Proof of Theorem 5.2: The proof of statement (i) and part of statement (ii) follows closely the proof of Theorem 1 in Imbens and Newey (2009). For statement (i), let 𝑙 be any continuous and bounded real function. Due to the independence of 𝑍 and (𝜀, 𝜂) conditional on 𝑋, we first obtain conditional mean independence as an intermediate result:

E[l(D) | ε, η, X] = E[l(q(Z, X, η)) | ε, η, X]
                  = ∫ l(q(z, X, η)) dF_{Z|ε,η,X}(z)
                  = ∫ l(q(z, X, η)) dF_{Z|X}(z)
                  = E[l(D) | η, X].

Then, we can check the conditional independence of 𝐷 and 𝜀 given (𝜂, 𝑋) using a conditional version of Theorem 2.1.12 of Durrett (2019).
Let 𝑎(·) and 𝑏(·) be any continuous and bounded real functions. Then

E[a(D)b(ε) | η, X] = E[E[a(D)b(ε) | ε, η, X] | η, X]
                   = E[E[a(D) | ε, η, X] b(ε) | η, X]
                   = E[E[a(D) | η, X] b(ε) | η, X]
                   = E[a(D) | η, X] E[b(ε) | η, X].

Consider statement (ii). The measurable separability between 𝐷 and 𝜂 is implied by Assumption 5.6 using Lemma 5A.1 with 𝑍 = (𝑍, 𝑋) and 𝑉 = 𝜂. So it suffices to show that the sigma-algebra generated by 𝑉 is the same as that generated by 𝜂. By the strict monotonicity of 𝑞(𝑧, 𝑥, 𝑒) in 𝑒 for any fixed (𝑧, 𝑥), there exists an inverse function 𝑞⁻¹(𝑧, 𝑥, 𝑑) = 𝑒. Then, we have

F_{D|Z=z,X=x}(d) = Pr(D ≤ d | Z = z, X = x)
                 = Pr(q(z, x, η) ≤ d | Z = z, X = x)
                 = Pr(η ≤ q⁻¹(z, x, d) | Z = z, X = x)
                 = Pr(η ≤ q⁻¹(z, x, d))
                 = F_η(q⁻¹(z, x, d)),

where the second-to-last equality follows from the independence of (𝑍, 𝑋) and 𝜂 under Assumption 5.6. Note that 𝜂 = 𝑞⁻¹(𝑍, 𝑋, 𝐷) a.s., so we have 𝑉 = F_{D|Z,X}(D) = F_η(η). Under Assumption 5.5, F_η(e) is a one-to-one function of 𝑒, which implies that the sigma-algebra generated by F_η(η) is the same as that generated by 𝜂. Furthermore, combining this with the independence of 𝜂 and 𝑋 under Assumption 5.6, we have

E[a(D)b(ε) | V, X] = E[a(D)b(ε) | η, X]
                   = E[a(D) | η, X] E[b(ε) | η, X]
                   = E[a(D) | V, X] E[b(ε) | V, X],

which implies condition (5.7).
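As a numerical illustration of this construction, note that in DGP(2) of Section 5.4 we have q(Z, X, η) = X + Z + η with η ~ N(0, 1), so the control function has the closed form V = F_{D|Z,X}(D) = Φ(D − X − Z) = F_η(η), where Φ is the standard normal CDF. The sketch below, with the same placeholder coefficients (𝛽₀, 𝛽₁, 𝛽₂) = (1, 1, 1) as before, feeds this oracle V into a second-step polynomial series regression; the coefficient on D should be close to 𝛽₁ up to series approximation and sampling error.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1, beta2 = 1.0, 1.0, 1.0           # hypothetical true values
cov = [[1.0, 0.75], [0.75, 1.0]]

# DGP(2)
a, b = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
c, d = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
X, Z, eta = a, a**2 + rng.standard_normal(n), c
D = X + Z + eta
U = b**2 + d**2 + rng.standard_normal(n)
Y = beta0 + beta1 * D + beta2 * X + U

# oracle control function: V = F_{D|Z,X}(D) = F_eta(D - X - Z) = Phi(eta)
V = norm.cdf(D - X - Z)

# second step: series regression of Y on D and 3rd-order polynomials in (X, V)
W = np.column_stack([np.ones(n), D, X, X**2, X**3, V, V**2, V**3])
print(f"estimate of beta1: {np.linalg.lstsq(W, Y, rcond=None)[0][1]:.3f}")
```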