THREE ESSAYS ON PANEL DATA MODELS WITH INTERACTIVE AND UNOBSERVED EFFECTS

By Nicholas Lynn Brown

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics – Doctor of Philosophy

2022

ABSTRACT

Chapter 1: More Efficient Estimation of Multiplicative Panel Data Models in the Presence of Serial Correlation (with Jeffrey Wooldridge)

We provide a systematic approach to obtaining an estimator asymptotically more efficient than the popular fixed effects Poisson (FEP) estimator for panel data models with multiplicative heterogeneity in the conditional mean. In particular, we derive the optimal instrumental variables under appealing "working" second moment assumptions that allow underdispersion, overdispersion, and general patterns of serial correlation. Because parameters in the optimal instruments must be estimated, we argue for combining our new moment conditions with those that define the FEP estimator to obtain a generalized method of moments (GMM) estimator no less efficient than either the FEP estimator or the estimator using the new instruments. A simulation study shows that the GMM estimator behaves well in terms of bias, and it often delivers nontrivial efficiency gains, even when the working second-moment assumptions fail.

Chapter 2: Information equivalence among transformations of semiparametric nonlinear panel data models

I consider transformations of nonlinear semiparametric mean functions which yield moment conditions for estimation. Such transformations are said to be information equivalent if they yield the same asymptotic efficiency bound. I first derive a unified theory of algebraic equivalence for moment conditions created by a given linear transformation.
The main equivalence result states that, under standard regularity conditions, transformations which create conditional moment restrictions in a given empirical setting need only have equal rank to reach the same efficiency bound. Example applications are considered, including nonlinear models with multiplicative heterogeneity and linear models with arbitrary unobserved factor structures.

Chapter 3: Moment-based Estimation of Linear Panel Data Models with Factor-augmented Errors

I consider linear panel data models with unobserved factor structures when the number of time periods is small relative to the number of cross-sectional units. I examine two popular methods of estimation: the first eliminates the factors with a parameterized quasi-long-differencing (QLD) transformation. The other, referred to as common correlated effects (CCE), uses the cross-sectional averages of the independent and response variables to project out the space spanned by the factors. I show that the classical CCE assumptions imply unused moment conditions which can be exploited by the QLD transformation to derive new linear estimators which weaken identifying assumptions and have desirable theoretical properties. I prove asymptotic normality of the linear QLD estimators under a heterogeneous slope model which allows for a tradeoff between identifying conditions. These estimators do not require the number of cross-sectional variables to be less than $T-1$, a strong restriction in fixed-$T$ CCE analysis. Finally, I investigate the effects of per-student expenditure on standardized test performance using data from the state of Michigan.

ACKNOWLEDGEMENTS

To my dissertation committee: Jeff, you have given me more of your time than I ever deserved. Thank you for all of your patience and guidance. Thank you Peter for seeing potential in me and helping me along my academic journey. Despite your protest, I can't help but think of you as my co-chair.
To Ben: I have benefited greatly from having such a brilliant applied researcher on my committee, someone who quickly digested my work and showed me how to apply it in relevant cases. Finally, I want to thank Nicky; I have enjoyed working with you through the AFRE tutoring program.

To my fellow graduate students: thank you for your friendship and support throughout these past five years. My qualifying exam study group, Sean, Andrew, Joffré, Elise, and Alex, without whom I would not be here. To Mehmet and Taeyoon, the oddball macro and financial economists. And to my econometrics mentors, Alyssa and Akanksha. Finally, I want to give a special thanks to Bhavna: despite living halfway across the world, you were always available to jump on the phone and support me, especially during the job market. I look forward to our future collaborations.

To my family: my emotional bedrock. To my mom Kathi and dad Curt, I can never repay you for your love and support throughout my entire life. You have nurtured me into the person I am today, and I am forever grateful. Also to my bonus parents Lorraine and Kevin, who have become an integral part of my family. To my brothers Jack, Nicky, Mark, and Kian, four of my closest friends and partners in crime. To Katie, whose presence brightens my home. To my confidant and future sister Dana: I would not have made it through U of I without your friendship. I am elated you have joined our family. Finally, to my flock: Griffin and Stark, my feathered friends. You drive me insane, but I could not imagine life without you two.

Last but not least, Danielle. You are my best friend. You give me the strength to go on. You are my solace and my inspiration. Everything I do I do for you. I love you more than life. If I were a poet, I could fully articulate how much you mean to me, but unfortunately I'm only an economist, so you'll just have to take my word.

TABLE OF CONTENTS

LIST OF TABLES
CHAPTER 1  MORE EFFICIENT ESTIMATION OF MULTIPLICATIVE PANEL DATA MODELS IN THE PRESENCE OF SERIAL CORRELATION
  1.1 Introduction
  1.2 Model and Background
  1.3 Optimal Instruments under Second Moment Assumptions
  1.4 Operationalizing Optimal IV Estimation
  1.5 A Small Simulation Study
  1.6 Summary and Conclusion
CHAPTER 2  INFORMATION EQUIVALENCE AMONG TRANSFORMATIONS OF SEMIPARAMETRIC NONLINEAR PANEL DATA MODELS
  2.1 Introduction
  2.2 Information equivalence
    2.2.1 Model
    2.2.2 General equivalence result
  2.3 Examples of information equivalence
    2.3.1 Multiplicative heterogeneity
    2.3.2 Linear factor model
    2.3.3 Random trend
  2.4 Practical considerations
  2.5 Conclusion
CHAPTER 3  MOMENT-BASED ESTIMATION OF LINEAR PANEL DATA MODELS WITH FACTOR-AUGMENTED ERRORS
  3.1 Introduction
  3.2 Model
    3.2.1 Common Correlated Effects
    3.2.2 Quasi-long-differencing
  3.3 Estimation
    3.3.1 CCE Moment Conditions
    3.3.2 Pooled and Mean Group QLD
  3.4 Heterogeneous Slopes
  3.5 Simulations
    3.5.1 Main Results
    3.5.2 Comparison to TWFE
  3.6 Application
  3.7 Conclusion
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: Conditional Poisson distribution
Table 1.2: Conditional Gamma distribution
Table 3.1: GMM estimators
Table 3.2: Misspecifying $p_0$
Table 3.3: Pooled estimators, $K = 2$
Table 3.4: Pooled estimators, $K = 3$
Table 3.5: Mean group estimators
Table 3.6: AR(1) factor structure
Table 3.7: TWFE specification
Table 3.8: Testing for $p_0$
Table 3.9: Controlling for heterogeneous intercept
Table .1: Pooled estimator, $K = 2$
Table .2: Pooled estimators, $K = 3$

CHAPTER 1

MORE EFFICIENT ESTIMATION OF MULTIPLICATIVE PANEL DATA MODELS IN THE PRESENCE OF SERIAL CORRELATION

1.1 Introduction

The fixed effects Poisson (FEP) estimator was originally developed by Hausman, Hall, and Griliches (1984) (hereafter, HHG) in their study of the effects of firm-level R&D spending on patent filings. HHG used the method of conditional maximum likelihood estimation (CMLE) to estimate the parameters in the conditional mean. In deriving the CMLE, HHG assumed that, conditional on the unobserved heterogeneity and the history of the covariates, the outcome variable is independent over time with a Poisson distribution. HHG showed that, conditional on the covariates and the sum of the counts over time, the joint distribution of the counts is multinomial and does not depend on the heterogeneity. Therefore, standard maximum likelihood theory applies, and the asymptotic theory assuming a fixed number of time periods is standard. Hahn (1997) verified that the FEP estimator achieves the semiparametric efficiency bound under the full distributional and conditional independence assumptions.

Wooldridge (1999) showed that the consistency of the FEP estimator only requires correct specification of the conditional mean function up to a multiplicative heterogeneity term. In particular, any kind of variance is allowed along with any kind of serial dependence.
In fact, the outcome variable need not even be a count variable: it can be any nonnegative outcome, including a continuous or corner solution response. Thus, the FEP estimator is to multiplicative panel data models what the linear FE estimator is to linear models with additive heterogeneity. When the conditional mean function is differentiable in the parameters (by far the leading case), Wooldridge (1999) established Fisher consistency of the FEP very generally. Specifically, Wooldridge showed that the score has a zero conditional mean (evaluated at the true parameter value) when the structural conditional mean is correctly specified. In addition to establishing robustness of the FEP estimator, the zero conditional mean property of the score leads to additional moment conditions that can be exploited in generalized method of moments (GMM) estimation to obtain estimators asymptotically more efficient than the FEP estimator. Unfortunately, the extra moment conditions proposed by Wooldridge (1999) are essentially ad hoc: they are not based on any notion of optimality. Consequently, the GMM approach to estimating multiplicative panel data models has not caught on: FEP estimation with the fully robust standard errors derived in Wooldridge (1999) is much more common. Some recent examples include McCabe and Snyder (2014, 2015), Schlenker and Walker (2016), Krapf, Ursprung, and Zimmermann (2017), Castillo, Mejia, and Restrepo (2018), and Williams, Burnap, Javed, Liu, and Ozalp (2020).

Given that the FEP estimator is fully robust to distributional misspecification and serial dependence, it is natural to wonder about its asymptotic efficiency under assumptions weaker than the full set of assumptions used by Hahn (1997). Recently, Verdier (2018) showed that the Poisson distributional assumption and conditional independence are not necessary for the FEP estimator to achieve Chamberlain's (1987, 1992) efficiency bound.
In particular, Verdier (2018) showed that it is sufficient to impose the Poisson assumption that the variance equals the mean and that the outcomes are serially uncorrelated conditional on heterogeneity and the covariates. While weaker than the HHG assumptions, these conditions are still restrictive. The assumption that the variance equals the mean, even after conditioning on unobserved heterogeneity, is very special. For example, the most common parameterization of the gamma distribution violates equality of the variance and mean. Moreover, serial correlation in the idiosyncratic errors of linear unobserved effects models is pervasive (which is why researchers now routinely compute standard errors robust to general serial correlation), and it is known how to exploit serial correlation in fixed effects versions of generalized least squares (GLS) to improve efficiency over the usual fixed effects estimator; see, for example, Im, Ahn, Schmidt, and Wooldridge (1999). It seems natural to search for analogous improvements over the FEP estimator in the presence of serial correlation and more flexible variance-mean relationships.

In this paper, we relax the second moment assumptions that are implied by the traditional HHG assumptions and derive the optimal instruments, thereby showing how to obtain an estimator that achieves Chamberlain's (1992) lower bound. Our efficiency result is new, and includes the Verdier (2018) result as a special case. The variance assumption we use to derive the optimal instruments is appealing because, conditional on the observed covariates and unobserved heterogeneity, it allows for underdispersion (relative to the Poisson) or overdispersion. In the spirit of the popular generalized estimating equations (GEE) approach (see Liang and Zeger, 1986), we assume constant conditional correlations, but allow for any pattern of serial correlation.
One important difference from the GEE literature is that our assumptions are more "structural" in that we state the second moment assumptions conditional on the unobserved heterogeneity. This is analogous to the linear model with an additive, unobserved effect when the working correlation matrix of the idiosyncratic errors is assumed to be constant but is otherwise unrestricted.

In order to obtain parametric forms for the optimal instruments, we supplement the flexible second moment assumptions for the response variable with moment assumptions about the multiplicative heterogeneity. These parametric assumptions are fairly flexible and are commonly used in the literature, particularly in traditional and correlated random effects environments when one needs to impose distributional assumptions on the heterogeneity in order to obtain consistent estimators. Here, we impose first and second moment assumptions in order to obtain the optimal instruments. We must emphasize that the estimator based on the optimal instruments, which we refer to as the "generalized FEP (GFEP) estimator," does not require any assumptions for consistency and asymptotic normality beyond those used by the FEP estimator. That our new estimator is just as robust as the FEP estimator in terms of consistency is important, as it is unfair to claim efficiency improvements if the new estimator is not as robust as the popular, robust FEP estimator. In order to emphasize the robustness of our estimator, we use the term "working" assumptions. The key is that, under these parametric "working" assumptions, we obtain the optimal instruments. If the working assumptions are correct, then we have a just identified estimator that is more efficient than the FEP estimator. If any of the working assumptions are incorrect, the "optimal" instrumental variables (IVs) are no longer optimal, and so the GFEP no longer achieves Chamberlain's lower bound.
Therefore, we have two estimators that are consistent under the same assumptions but efficient under different working assumptions. To ensure that we have an estimator that is at least as efficient as both the FEP estimator and the GFEP estimator, and usually more efficient, we combine the two sets of moment conditions. With $K$ parameters this gives $K$ overidentifying restrictions. The overidentifying restrictions are useful for testing the conditional mean specification, not the working assumptions, as those are not being used for consistency.

To summarize, this paper has three primary contributions. First, we relax the second moment assumptions implied by the traditional fixed effects Poisson setting and obtain the optimal instruments under an appealing set of second moment working assumptions, including allowing for general patterns of serial correlation. Second, we operationalize the estimator by imposing additional working assumptions on moments of the heterogeneity distribution, resulting in a GMM estimator that is computationally simple and is guaranteed to be asymptotically no less efficient than both the FEP estimator and the GFEP estimator. Third, we significantly relax the conditions under which the FEP estimator achieves the asymptotic variance lower bound, allowing for both underdispersion and overdispersion in the variance conditional on observed covariates and unobserved heterogeneity.

The underlying asymptotic theory in this paper is for the microeconometric setting that treats the number of time periods, $T$, as fixed, and lets the cross section dimension, $N$, increase without bound. We assume random sampling in the cross section dimension but impose no restrictions on the time series dependence. We do not provide formal regularity conditions because the asymptotic theory is standard and follows arguments used in hundreds of panel data papers that impose random sampling in the cross section.
We do assume smoothness so that certain derivatives (in particular, that of the conditional mean function) exist and are continuous.

The rest of the paper is organized as follows. Section 1.2 presents the conditional mean model and summarizes the consistency result for the FEP estimator. Section 1.3 derives the optimal instruments under two working variance assumptions, including an unrestricted (but constant) conditional correlation matrix. Section 1.4 shows how to implement the GFEP estimator and the GMM estimator that combines the two sets of moment conditions. Section 1.5 provides promising simulation evidence comparing the FEP, GFEP, and GMM estimators under serial correlation with both underdispersion and overdispersion in the structural variance. Section 1.6 contains concluding remarks.

1.2 Model and Background

We consider a balanced panel data setting where, for each $i$, $\{(y_{it}, \mathbf{x}_{it}, c_i) : t = 1, 2, \ldots, T\}$ is a random draw from the population. We observe the nonnegative response variable $y_{it} \geq 0$ and $\mathbf{x}_{it}$, a $1 \times K$ vector. The scalar $c_i$ is the unobserved heterogeneity. As is usual in fixed effects environments, the elements of $\mathbf{x}_{it}$ must have variation across $t$ for at least some population units. Typically, these would include dummy variables indicating different time periods to allow for flexible aggregate time effects. The entire observed history of the covariates is $\mathbf{x}_i = (\mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iT})$. As mentioned in the introduction, we are treating $T$ as fixed in the asymptotic analysis. Therefore, because we assume random sampling in the cross section, relevant assumptions can be stated for a random draw $i$ from the population. The substantive assumptions that we make throughout the paper are that the model of the conditional mean is correctly specified, the heterogeneity is multiplicative, and the covariates are strictly exogenous conditional on $c_i$.
These are all captured by the following.

Assumption Conditional Mean (CM): For $t = 1, \ldots, T$ and some $\boldsymbol{\beta}_0 \in \mathbb{R}^P$,

$$E(y_{it} \mid \mathbf{x}_i, c_i) = E(y_{it} \mid \mathbf{x}_{it}, c_i) = c_i m_t(\mathbf{x}_{it}, \boldsymbol{\beta}_0) \quad (1.2.1)$$

where $m_t(\mathbf{x}_t, \cdot) \geq 0$ is continuously differentiable on $\mathbb{R}^P$ for all $\mathbf{x}_t \in \mathcal{X}_t$, the support of $\mathbf{x}_{it}$. $\blacksquare$

As discussed in Wooldridge (1999), for consistency of the FEP estimator one can get by with assuming continuity over the parameter space, but we impose assumptions that imply asymptotic normality and easy calculation of asymptotic efficiency bounds. See Newey and McFadden (1994) or Wooldridge (2010, Chapter 12) for formal regularity conditions. In terms of smoothness, assuming $m_t(\mathbf{x}_{it}, \cdot)$ is twice continuously differentiable is sufficient and is almost always true in practice.

By far the leading case of the conditional mean function is

$$E(y_{it} \mid \mathbf{x}_{it}, c_i) = c_i \exp(\mathbf{x}_{it} \boldsymbol{\beta}_0) \quad (1.2.2)$$

where $\mathbf{x}_{it}$ can include time period dummies to allow different intercepts inside the exponential function. Naturally, $\mathbf{x}_{it}$ can also include nonlinear functions of underlying explanatory variables, including squares and interactions. Given the choice in (1.2.2), $P = K$, but we also allow more general mean functions. Because we want to allow arbitrary dependence between $c_i$ and $\mathbf{x}_{it}$, we need time variation in the latter for at least some units in the population. This permits, for example, interactions among variables that have some time variation and others that do not.

Strict exogeneity conditional on the unobserved effect $c_i$ is implied by the first equality in (1.2.1). This assumption is restrictive (for example, it rules out lagged dependent variables), but it is much less restrictive than the strict exogeneity assumption typically used in the GEE literature because of conditioning on $c_i$.
In the typical GEE approach the strict exogeneity assumption is stated as $E(y_{it} \mid \mathbf{x}_i) = E(y_{it} \mid \mathbf{x}_{it})$. [For a discussion of GEE from an econometrics perspective, see Wooldridge (2010, Section 13.11.4).] Using iterated expectations, if (1.2.1) holds then

$$E(y_{it} \mid \mathbf{x}_i) = E(c_i \mid \mathbf{x}_i) m_t(\mathbf{x}_{it}, \boldsymbol{\beta}_0)$$

and the latter expression is not $E(y_{it} \mid \mathbf{x}_{it})$ if $E(c_i \mid \mathbf{x}_i) \neq E(c_i)$.

The multiplicative formulation using the exponential function in (1.2.2) can be obtained from $E(y_{it} \mid \mathbf{x}_{it}, a_i) = \exp(a_i + \mathbf{x}_{it} \boldsymbol{\beta}_0)$ where $c_i \equiv \exp(a_i)$. In applications where $P(y_{it} = 0) > 0$, it is important to use (1.2.2) to allow for the possibility that $c_i = 0$, which then implies $y_{it} = 0$, $t = 1, 2, \ldots, T$. Often in count data and corner solution applications one sees some units with $y_{it} = 0$ for all $t$. Remember, we are only assuming $y_{it} \geq 0$; no other restrictions are imposed on the support of $y_{it}$. In most cases, a model such as (1.2.2) is appealing when $y_{it}$ has no natural upper bound.

In FEP estimation, the following residual function, first studied by HHG, plays an important role:

$$u_{it}(\boldsymbol{\beta}) \equiv y_{it} - n_i p_t(\mathbf{x}_i, \boldsymbol{\beta}) \quad (1.2.3)$$

where $n_i \equiv \sum_{r=1}^{T} y_{ir}$ and

$$p_t(\mathbf{x}_i, \boldsymbol{\beta}) \equiv \frac{m_t(\mathbf{x}_{it}, \boldsymbol{\beta})}{\sum_{r=1}^{T} m_r(\mathbf{x}_{ir}, \boldsymbol{\beta})} \quad (1.2.4)$$

As convenient shorthand, we write $m_{it}(\boldsymbol{\beta}) = m_t(\mathbf{x}_{it}, \boldsymbol{\beta})$ and $p_{it}(\boldsymbol{\beta}) = p_t(\mathbf{x}_i, \boldsymbol{\beta})$.
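As a concrete illustration of (1.2.3) and (1.2.4), the following minimal numpy sketch (simulated Poisson data with gamma heterogeneity; all variable names and parameter values are hypothetical) computes the residuals and confirms that their sample mean is near zero at the true parameter, since the multiplicative heterogeneity cancels in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K = 20000, 4, 2
beta0 = np.array([0.5, -0.3])

x = rng.normal(size=(N, T, K))           # strictly exogenous covariates
c = rng.gamma(2.0, 0.5, size=(N, 1))     # multiplicative heterogeneity c_i
m = np.exp(x @ beta0)                    # m_t(x_it, beta0) = exp(x_it beta0), N x T
y = rng.poisson(c * m)                   # outcomes with E(y_it | x_i, c_i) = c_i m_it

def residuals(y, x, beta):
    """u_it(beta) = y_it - n_i p_t(x_i, beta), as in (1.2.3)-(1.2.4)."""
    m = np.exp(x @ beta)
    p = m / m.sum(axis=1, keepdims=True)   # shares p_it(beta), sum to one over t
    n = y.sum(axis=1, keepdims=True)       # n_i = sum_r y_ir
    return y - n * p

u = residuals(y, x, beta0)
print(np.abs(u.mean(axis=0)).max())        # near zero: c_i cancels in expectation
```

The Poisson and gamma choices here are only for data generation; the zero-mean property of the residuals relies on Assumption CM alone.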
We can stack the $p_{it}(\boldsymbol{\beta})$ into the $T \times 1$ vector $\mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta})$ and write

$$\mathbf{u}_i(\boldsymbol{\beta}) = \mathbf{y}_i - \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}) n_i = \mathbf{y}_i - \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}) \mathbf{1}_T' \mathbf{y}_i = \left[\mathbf{I}_T - \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}) \mathbf{1}_T'\right] \mathbf{y}_i \quad (1.2.5)$$

where $\mathbf{u}_i(\boldsymbol{\beta})$ is the $T \times 1$ vector with $t$th element $u_{it}(\boldsymbol{\beta})$ and $\mathbf{1}_T$ is the $T \times 1$ vector with all elements unity. As shown in Wooldridge (1999), under Assumption CM,

$$E[\mathbf{u}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i] = \mathbf{0} \quad (1.2.6)$$

Further, the score of the quasi-log-likelihood function for random draw $i$ can be written as

$$\mathbf{s}_i(\boldsymbol{\beta}) = \nabla_{\boldsymbol{\beta}} \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta})' \mathbf{W}(\mathbf{x}_i, \boldsymbol{\beta}) \mathbf{u}_i(\boldsymbol{\beta}) \quad (1.2.7)$$

where

$$\mathbf{W}(\mathbf{x}_i, \boldsymbol{\beta}) = \mathrm{diag}\left\{[p_{i1}(\boldsymbol{\beta})]^{-1}, [p_{i2}(\boldsymbol{\beta})]^{-1}, \ldots, [p_{iT}(\boldsymbol{\beta})]^{-1}\right\} \quad (1.2.8)$$

is $T \times T$. It follows immediately that

$$E[\mathbf{s}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i] = \mathbf{0} \quad (1.2.9)$$

and this translates, under standard regularity conditions, into the consistency and $\sqrt{N}$-asymptotic normality of the FEP estimator. For emphasis, only Assumption CM is needed for consistency and asymptotic normality, and fully robust inference using a sandwich estimator is essentially trivial.

Wooldridge (1999) also notes that the conditional moment restrictions in (1.2.6) lead to uncountably many unconditional moment restrictions beyond those used by the FEP estimator, which are given by $E[\mathbf{s}_i(\boldsymbol{\beta}_0)] = \mathbf{0}$. Wooldridge (1999) suggests some extra moment conditions but makes no attempt to find the optimal estimator based on (1.2.6). In the next section we derive the optimal instruments under a set of second moment assumptions.

1.3 Optimal Instruments under Second Moment Assumptions

Given the moment conditions in (1.2.6), we can apply Chamberlain's (1992) semiparametric efficiency bound to obtain an asymptotically efficient estimator.
Define

$$\mathbf{D}_o(\mathbf{x}_i) \equiv E\left[\nabla_{\boldsymbol{\beta}} \mathbf{u}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i\right] \quad (1.3.1)$$

and

$$\mathbf{V}_o(\mathbf{x}_i) \equiv \mathrm{Var}\left[\mathbf{u}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i\right] \quad (1.3.2)$$

Under regularity conditions of the kind found in Newey and McFadden (1994), Newey (2001) extended Chamberlain (1992) by allowing $\mathbf{V}_o(\mathbf{x}_i)$ to be singular and showed that the efficient estimator that uses only (1.2.6) has asymptotic variance

$$\left\{E\left[\mathbf{D}_o(\mathbf{x}_i)' \mathbf{V}_o(\mathbf{x}_i)^- \mathbf{D}_o(\mathbf{x}_i)\right]\right\}^{-1} \quad (1.3.3)$$

where $\mathbf{V}_o(\mathbf{x}_i)^-$ denotes any generalized inverse ($g$-inverse), which means $\mathbf{V}_o(\mathbf{x}_i) \mathbf{V}_o(\mathbf{x}_i)^- \mathbf{V}_o(\mathbf{x}_i) = \mathbf{V}_o(\mathbf{x}_i)$. Because $\mathbf{V}_o(\mathbf{x}_i)$ is symmetric, a symmetric $g$-inverse always exists, and it simplifies notation to take $\mathbf{V}_o(\mathbf{x}_i)^-$ to be symmetric. Below we will obtain an explicit formula for a symmetric $g$-inverse.

Given a random sample of size $N$ and knowledge of $\mathbf{D}_o(\mathbf{x}_i)$ and $\mathbf{V}_o(\mathbf{x}_i)$, an estimator $\hat{\boldsymbol{\beta}}_{OPT}$ that achieves this lower bound solves the exactly identified moment equations

$$\sum_{i=1}^{N} \mathbf{D}_o(\mathbf{x}_i)' \mathbf{V}_o(\mathbf{x}_i)^- \mathbf{u}_i\left(\hat{\boldsymbol{\beta}}_{OPT}\right) = \mathbf{0} \quad (1.3.4)$$

Of course, this estimator is infeasible because $\mathbf{D}_o(\mathbf{x}_i)$ and $\mathbf{V}_o(\mathbf{x}_i)$ are generally unknown. In principle, both can be nonparametrically estimated. However, especially given the often large dimension of $\mathbf{x}_i$, nonparametric estimation of many conditional means, variances, and covariances hardly seems worth it just to improve asymptotic efficiency over the FEP estimator. Moreover, the finite-sample properties of the resulting estimator could be poor. Our goal here is to obtain simple formulas for the optimal IVs $\mathbf{Z}^*(\mathbf{x}_i) \equiv \mathbf{V}_o(\mathbf{x}_i)^- \mathbf{D}_o(\mathbf{x}_i)$ under reasonably flexible parametric second moment assumptions that have antecedents in the count data literature.
To find $\mathbf{D}_o(\mathbf{x}_i)$, note that

$$\nabla_{\boldsymbol{\beta}} \mathbf{u}_i(\boldsymbol{\beta}) = -\nabla_{\boldsymbol{\beta}} \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}) n_i \quad (1.3.5)$$

where, for each $t$, we can write

$$\nabla_{\boldsymbol{\beta}} p_{it}(\boldsymbol{\beta}) = \left[\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta})\right]^{-1} \left\{\nabla_{\boldsymbol{\beta}} m_{it}(\boldsymbol{\beta}) - \left[\sum_{r=1}^{T} \nabla_{\boldsymbol{\beta}} m_{ir}(\boldsymbol{\beta})\right] p_{it}(\boldsymbol{\beta})\right\}$$

Therefore,

$$\nabla_{\boldsymbol{\beta}} \mathbf{p}_i(\boldsymbol{\beta}) = \left[\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta})\right]^{-1} \left[\nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta}) - \mathbf{p}_i(\boldsymbol{\beta}) \mathbf{1}_T' \nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta})\right] = \left[\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta})\right]^{-1} \left[\mathbf{I}_T - \mathbf{p}_i(\boldsymbol{\beta}) \mathbf{1}_T'\right] \nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta}) \quad (1.3.6)$$

which gives us the necessary gradient. Further, because

$$E(n_i \mid \mathbf{x}_i, c_i) = c_i \left[\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta}_0)\right]$$

we have

$$E\left[\nabla_{\boldsymbol{\beta}} \mathbf{u}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i, c_i\right] = -c_i \left[\mathbf{I}_T - \mathbf{p}_i(\boldsymbol{\beta}_0) \mathbf{1}_T'\right] \nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta}_0)$$

Now, let $\mu_c(\mathbf{x}_i) \equiv E(c_i \mid \mathbf{x}_i)$. Then we have shown

$$\mathbf{D}_o(\mathbf{x}_i) = -\mu_c(\mathbf{x}_i) \left[\mathbf{I}_T - \mathbf{p}_i(\boldsymbol{\beta}_0) \mathbf{1}_T'\right] \nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta}_0) \quad (1.3.7)$$

which is the first piece needed to derive the optimal instruments. The unknown function in $\mathbf{D}_o(\mathbf{x}_i)$, $\mu_c(\mathbf{x}_i)$, is the conditional mean in the heterogeneity distribution.

Next, consider $\mathbf{V}_o(\mathbf{x}_i)^-$. First, we can write

$$\mathbf{V}_o(\mathbf{x}_i) \equiv \mathrm{Var}\left[\mathbf{u}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i\right] = \mathrm{Var}\left\{\left[\mathbf{I}_T - \mathbf{p}_i(\boldsymbol{\beta}_0) \mathbf{1}_T'\right] \mathbf{y}_i \mid \mathbf{x}_i\right\} = (\mathbf{I}_T - \mathbf{P}_i) \boldsymbol{\Omega}_i (\mathbf{I}_T - \mathbf{P}_i)' \quad (1.3.8)$$

where

$$\boldsymbol{\Omega}_i \equiv \mathrm{Var}(\mathbf{y}_i \mid \mathbf{x}_i) \quad (1.3.9)$$

is assumed to be nonsingular (with probability one) and $\mathbf{P}_i \equiv \mathbf{p}_i(\boldsymbol{\beta}_0) \mathbf{1}_T'$ is $T \times T$. Because the $p_{it}(\boldsymbol{\beta}_0)$ sum to unity across $t$, it is easy to show that $\mathbf{P}_i$ is an idempotent (but not symmetric) matrix with $\mathrm{rank}(\mathbf{P}_i) = 1$. In establishing that the FEP estimator is asymptotically efficient under the Poisson first and second moment assumptions, Verdier (2018) finds a particular symmetric matrix which is inherent to the FEP solution.
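The gradient formula (1.3.6) is easy to get wrong in code, so a quick finite-difference check is worthwhile; the sketch below (exponential mean as in (1.2.2), all values hypothetical) compares the analytic gradient of the shares for a single unit against central differences:

```python
import numpy as np

rng = np.random.default_rng(3)
T, K = 4, 3
x = rng.normal(size=(T, K))              # one unit's covariates
beta = rng.normal(size=K)

def p_shares(b):
    m = np.exp(x @ b)                    # exponential mean, as in (1.2.2)
    return m / m.sum()

# Analytic gradient of the shares, following (1.3.6)
m = np.exp(x @ beta)
p = m / m.sum()
grad_m = m[:, None] * x                  # T x K gradient of m_it for the exponential mean
grad_p = (grad_m - np.outer(p, grad_m.sum(axis=0))) / m.sum()

# Central finite differences, one parameter at a time
eps = 1e-6
fd = np.column_stack([(p_shares(beta + eps * e) - p_shares(beta - eps * e)) / (2 * eps)
                      for e in np.eye(K)])
print(np.allclose(grad_p, fd, atol=1e-6))
```

The same check works for any smooth mean function; only the `grad_m` line is specific to the exponential case.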
The matrix

$$\mathbf{V}_o(\mathbf{x}_i)^- = \boldsymbol{\Omega}_i^{-1} - \boldsymbol{\Omega}_i^{-1} \mathbf{p}_i(\boldsymbol{\beta}_0) \left[\mathbf{p}_i(\boldsymbol{\beta}_0)' \boldsymbol{\Omega}_i^{-1} \mathbf{p}_i(\boldsymbol{\beta}_0)\right]^{-1} \mathbf{p}_i(\boldsymbol{\beta}_0)' \boldsymbol{\Omega}_i^{-1} = \boldsymbol{\Omega}_i^{-1} - \boldsymbol{\Omega}_i^{-1} \mathbf{m}_i(\boldsymbol{\beta}_0) \left[\mathbf{m}_i(\boldsymbol{\beta}_0)' \boldsymbol{\Omega}_i^{-1} \mathbf{m}_i(\boldsymbol{\beta}_0)\right]^{-1} \mathbf{m}_i(\boldsymbol{\beta}_0)' \boldsymbol{\Omega}_i^{-1} \quad (1.3.10)$$

is a generalized inverse of $\mathbf{V}_o(\mathbf{x}_i)$. The second equality in (1.3.10) follows by the definition of $\mathbf{p}_i(\boldsymbol{\beta}_0)$ and by cancelling terms. By simple multiplication it is easily seen that $\mathbf{p}_i(\boldsymbol{\beta}_0)' \mathbf{V}_o(\mathbf{x}_i)^- = \mathbf{0}$, and so

$$\mathbf{D}_o(\mathbf{x}_i)' \mathbf{V}_o(\mathbf{x}_i)^- = -\mu_c(\mathbf{x}_i) \nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta}_0)' \mathbf{V}_o(\mathbf{x}_i)^- \quad (1.3.11)$$

The expression for the optimal instruments in (1.3.11) is not directly applicable because $\mu_c(\cdot)$ and $\mathbf{V}_o(\cdot)$ are unknown, with the latter depending on the unknown $\boldsymbol{\Omega}_i$. We now impose assumptions on the structural variance-covariance matrix, $\mathrm{Var}(\mathbf{y}_i \mid \mathbf{x}_i, c_i)$, that lead to useful simplifications. The first restriction is on the diagonal elements.

Assumption Working Variance 1 (WV.1): For $t = 1, \ldots, T$, there exists $\alpha > 0$ such that

$$\mathrm{Var}(y_{it} \mid \mathbf{x}_i, c_i) = \mathrm{Var}(y_{it} \mid \mathbf{x}_{it}, c_i) = \alpha E(y_{it} \mid \mathbf{x}_{it}, c_i) = \alpha c_i m_{it}(\boldsymbol{\beta}_0) \quad (1.3.12)$$ $\blacksquare$

Assumption WV.1 is motivated by the count data literature, where the assumption that the variance is proportional to the mean is commonly used in generalized linear models (GLM) and GEE settings; see, for example, McCullagh and Nelder (1989), Liang and Zeger (1986), Hardin and Hilbe (2012), and Wooldridge (2010, Section 13.11). Again, one important difference between our setting and the standard GEE setting is that we state the first and second moments conditional on the unobserved heterogeneity, $c_i$, in addition to the observable variables, $\mathbf{x}_i$.
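The two properties of (1.3.10) used above, that it is a $g$-inverse of $\mathbf{V}_o(\mathbf{x}_i)$ and that it annihilates $\mathbf{p}_i(\boldsymbol{\beta}_0)$, can also be confirmed numerically; a minimal sketch with an arbitrary positive definite $\boldsymbol{\Omega}_i$ (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5

# Arbitrary positive mean vector m_i and nonsingular Omega_i = Var(y_i | x_i)
m = np.exp(rng.normal(size=T))
p = m / m.sum()                              # shares p_i(beta0), sum to one
A = rng.normal(size=(T, T))
Omega = A @ A.T + T * np.eye(T)              # symmetric positive definite

P = np.outer(p, np.ones(T))                  # P_i = p_i 1_T'
V = (np.eye(T) - P) @ Omega @ (np.eye(T) - P).T   # V_o(x_i) as in (1.3.8)

Oinv = np.linalg.inv(Omega)
# Candidate g-inverse, second expression in (1.3.10), written in terms of m_i
Vg = Oinv - (Oinv @ np.outer(m, m) @ Oinv) / (m @ Oinv @ m)

print(np.allclose(V @ Vg @ V, V))            # g-inverse property: V V^- V = V
print(np.allclose(p @ Vg, 0))                # p_i' V_o^- = 0, as used in (1.3.11)
```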
Once the population is effectively partitioned on the basis of (x_i, c_i), the so-called "GLM variance assumption" is more appealing. We do not restrict the value of α = Var(y_it|x_it, c_i)/E(y_it|x_it, c_i), and so the y_it can exhibit underdispersion or overdispersion relative to the Poisson distribution. This variance-mean relationship also holds for one popular parameterization of the negative binomial distribution (which implies overdispersion), and can hold for continuous outcomes as well, such as a common parameterization of the gamma distribution. The second working assumption is on the conditional correlation matrix.

Assumption Working Variance 2 (WV.2): For a T × T symmetric, positive definite matrix R (with unity down the diagonal),
$$\mathrm{Corr}(\mathbf{y}_i|\mathbf{x}_i,c_i) = \mathbf{R}. \;\blacksquare \quad (1.3.13)$$

Assumption WV.2 is motivated by the GEE literature, where a constant conditional correlation matrix is the leading example of a working correlation assumption. We do not put restrictions on the elements of R, ρ_ts = Corr(y_it, y_is|x_i, c_i), other than those that ensure R is a valid correlation matrix. The special case of no serial correlation conditional on (x_i, c_i) is R = I_T. One could impose an exchangeability restriction on R, as is common in the GEE literature, but that is less attractive here because we are conditioning on c_i (which would often be assumed to be an explanation for an exchangeable structure without conditioning on c_i). With large N and small T, there is little reason to impose restrictions on R. Again, an important difference with the GEE literature is that we condition the correlation matrix on c_i as well as x_i – which makes R = I_T more tenable (but still unnecessary).
We can combine Assumptions WV.1 and WV.2 into a working variance-covariance matrix conditional on (x_i, c_i):
$$\mathrm{Var}(\mathbf{y}_i|\mathbf{x}_i,c_i) = \alpha\,c_i\,\mathbf{M}_i^{1/2}\mathbf{R}\mathbf{M}_i^{1/2} \quad (1.3.14)$$
where M_i ≡ diag{m_i1(β_0), m_i2(β_0), ..., m_iT(β_0)} and M_i^{1/2} is the obvious matrix square root. If not for conditioning on the unobserved heterogeneity c_i, (1.3.14) has a structure very familiar from the GEE literature on estimating conditional means of count variables with longitudinal data.

In stating Assumptions WV.1 and WV.2, we have opted not to include a "0" subscript on α or R. This decision requires a brief explanation. For deriving the optimal instruments, we are assuming the existence of "true values." However, when we discuss implementation of our new estimator in Section 1.4, we do not assume Assumptions WV.1 or WV.2 are in force. To ensure that the focus is on estimating β_0, and to simplify the notation, we omit the "0" subscripts on the parameters in the working assumptions.

Before deriving the optimal instruments, we first obtain Ω_i = Var(y_i|x_i) and provide a useful expression for its inverse. As shorthand, let m_i be the T × 1 vector of m_it(β_0), and define M_i^{1/2} as above. We use √m_i to denote the T × 1 vector containing the square roots of the m_it(β_0). In stating the next lemma, let σ_c²(x_i) = Var(c_i|x_i).

Lemma 1.3.1. Under Assumptions CM, WV.1, and WV.2,
$$\mathrm{Var}(\mathbf{y}_i|\mathbf{x}_i) = \boldsymbol{\Omega}_i = \alpha\mu_c(\mathbf{x}_i)\mathbf{M}_i^{1/2}\mathbf{R}\mathbf{M}_i^{1/2} + \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i' \quad (1.3.15)$$
which is positive definite. Further,
$$\boldsymbol{\Omega}_i^{-1} = \frac{1}{\alpha\mu_c(\mathbf{x}_i)}\,\mathbf{M}_i^{-1/2}\left\{\mathbf{R}^{-1} - \frac{\sigma_c^2(\mathbf{x}_i)}{\alpha\mu_c(\mathbf{x}_i) + \sigma_c^2(\mathbf{x}_i)\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}}\,\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\right\}\mathbf{M}_i^{-1/2}$$
Proof. See Appendix for proof.
□

Establishing the formula for Ω_i uses the law of total variance (for matrices). Positive definiteness of Ω_i follows because the first term in (1.3.15) is positive definite under WV.1 and WV.2 and the second is always positive semidefinite. As shown in the Appendix, the formula for Ω_i^{-1} applies a result due to Sherman and Morrison (1950). Now we can state the main optimal instrument result.

Theorem 1.3.1. Under Assumptions CM, WV.1, and WV.2, a symmetric generalized inverse of V_o(x_i) is
$$\mathbf{V}_o(\mathbf{x}_i)^- = \frac{1}{\alpha\mu_c(\mathbf{x}_i)}\,\mathbf{M}_i^{-1/2}\left[\mathbf{R}^{-1} - \frac{1}{\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}}\,\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\right]\mathbf{M}_i^{-1/2} \quad (1.3.16)$$
Further, the optimal T × K matrix of instruments, Z*(x_i), is
$$\mathbf{Z}^*(\mathbf{x}_i)' \equiv \nabla_{\boldsymbol{\beta}}\mathbf{m}_i(\boldsymbol{\beta}_0)'\,\mathbf{M}_i^{-1/2}\left[\mathbf{R}^{-1} - \frac{1}{\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}}\,\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\right]\mathbf{M}_i^{-1/2} \quad (1.3.17)$$
where, again, m_i and M_i are evaluated at β_0. We have dropped the minus sign in D_o(x_i) as that does not affect the optimal choice.

Proof. See Appendix for proof. □

The optimal instrument matrix in (1.3.17) has a rather remarkable feature: it does not depend on the constant α nor on the conditional first two moments of the heterogeneity distribution, μ_c(x_i) and σ_c(x_i) – even though Ω_i^{-1} depends on all of these quantities and D_o(x_i) depends on μ_c(x_i). Under the working variance matrix assumptions, the optimal instruments depend only on β_0 and R. We have a natural preliminary estimator of β_0, namely, the FEP estimator. Estimating R is much more challenging, and for that we will introduce additional working assumptions – something we take up in the next section.
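The matrix algebra in Lemma 1.3.1 and Theorem 1.3.1 can be spot-checked numerically. The sketch below is illustrative only (the values of T, α, μ_c(x_i), σ_c²(x_i), m_i, and R are arbitrary hypothetical choices, not estimates from any data): it builds Ω_i from (1.3.15), confirms the Sherman–Morrison form of Ω_i^{-1}, and confirms that (1.3.16) is a generalized inverse of V_o(x_i) that annihilates p_i(β_0).

```python
import numpy as np

rng = np.random.default_rng(0)
T, alpha, mu_c, sig2_c = 5, 1.3, 0.8, 0.6        # hypothetical values
m = rng.uniform(0.5, 2.0, T)                      # stand-in for m_i(beta_0)
s = np.sqrt(m)                                    # the vector sqrt(m_i)
A = rng.normal(size=(T, T))
S = A @ A.T + T * np.eye(T)
R = S / np.outer(np.sqrt(np.diag(S)), np.sqrt(np.diag(S)))  # a valid correlation matrix

Mh = np.diag(s)                                   # M_i^{1/2}
Mih = np.diag(1.0 / s)                            # M_i^{-1/2}
Rinv = np.linalg.inv(R)

# Omega_i from (1.3.15) and its closed-form inverse from Lemma 1.3.1
Omega = alpha * mu_c * Mh @ R @ Mh + sig2_c * np.outer(m, m)
denom = alpha * mu_c + sig2_c * (s @ Rinv @ s)
Omega_inv = (Mih @ (Rinv - (sig2_c / denom) * np.outer(Rinv @ s, Rinv @ s)) @ Mih) / (alpha * mu_c)
print(np.allclose(Omega @ Omega_inv, np.eye(T)))  # True

# V_o(x_i) from (1.3.8) and the generalized inverse (1.3.16)
p = m / m.sum()
IP = np.eye(T) - np.outer(p, np.ones(T))          # I_T - p_i 1_T'
V = IP @ Omega @ IP.T
Vg = (Mih @ (Rinv - np.outer(Rinv @ s, Rinv @ s) / (s @ Rinv @ s)) @ Mih) / (alpha * mu_c)
print(np.allclose(V @ Vg @ V, V), np.allclose(p @ Vg, 0))  # True True
```

Both checks pass for any positive m_i and any valid correlation matrix R, reflecting that the result is algebraic rather than distributional.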
An interesting special case of Theorem 1.3.1 is when the {y_it : t = 1, 2, ..., T} are conditionally uncorrelated, an assumption with a long history in linear and nonlinear unobserved effects models. Traditional treatments of linear unobserved effects models – often called "random effects" models – include the assumption that idiosyncratic shocks are serially uncorrelated, which implies that, conditional on (x_i, c_i), the {y_it : t = 1, 2, ..., T} are uncorrelated. In using joint maximum likelihood to estimate nonlinear models with unobserved heterogeneity – random effects probit and ordered probit, random effects multinomial logit, random effects Tobit, random effects versions of the Poisson and negative binomial models, among others – it is almost always assumed that the {y_it : t = 1, 2, ..., T} are independent conditional on (x_i, c_i); see Sections 13.9, 15.8, 17.8, and 18.7 in Wooldridge (2010).

Corollary 1.3.1. Under Assumptions CM, WV.1, and WV.2 with R = I_T, the FEP estimator is efficient among estimators that use only Assumption CM for consistency.

Proof. See Appendix for proof. □

Corollary 1.3.1 is a new result that shows the FEP estimator is asymptotically efficient for any α > 0 in Assumption WV.1 provided there is no serial correlation. Conditional on x_i and c_i, any amount of constant underdispersion or overdispersion is allowed. Therefore, Corollary 1.3.1 improves on Verdier (2018), who imposed α = 1, the value that holds for the Poisson distribution. That FEP is asymptotically efficient for any α while allowing for any dependence between c_i and x_i allows us to make an interesting connection with the cross-sectional GLM literature.
As pointed out in Wooldridge (2010, Section 13.11.3), the cross-sectional version of Assumption WV.1 implies that the Poisson QMLE is asymptotically efficient among estimators that use only correct specification of the conditional mean function for consistency.

1.4 Operationalizing Optimal IV Estimation

From Theorem 1.3.1, in order to obtain a feasible optimal IV estimator under Assumptions CM, WV.1, and WV.2, we need a preliminary consistent estimator of β_0, and we either need to know R or have a consistent estimator of it. If we want to impose a specific structure on R – say, an AR(1) model with a known AR(1) parameter – then (1.3.17) can be used after replacing β_0 with β̂_FEP (the clear choice for a first-stage estimator of β_0). Remember, imposing such a restriction when it is incorrect would not affect consistency of the method of moments estimator; but the estimator would not be asymptotically efficient. Generally, we want to estimate R without imposing any restrictions.

In order to ignore the first-stage estimation when obtaining the asymptotic variance of √N(β̂_OPT − β_0), the first-stage estimators should be √N-consistent – a weak requirement because we are assuming random sampling and smooth moment and objective functions. See Wooldridge (2010, Chapter 14) for discussion. As mentioned earlier, it is very natural to use the FEP estimator as the initial estimator of β_0. Estimation of R is more difficult because it is the (working) correlation matrix conditional on the unobserved heterogeneity, c_i, in addition to x_i. The key to estimating R is the relationship in (1.3.15).
To see how (1.3.15) can be used, define a T × 1 vector of errors
$$\mathbf{v}_i \equiv \mathbf{y}_i - \mathrm{E}(\mathbf{y}_i|\mathbf{x}_i) = \mathbf{y}_i - \mu_c(\mathbf{x}_i)\,\mathbf{m}_i \quad (1.4.1)$$
Then
$$\mathrm{E}\left(\mathbf{v}_i\mathbf{v}_i'\,\middle|\,\mathbf{x}_i\right) = \alpha\mu_c(\mathbf{x}_i)\mathbf{M}_i^{1/2}\mathbf{R}\mathbf{M}_i^{1/2} + \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i' \quad (1.4.2)$$
which we can write in matrix error form as
$$\mathbf{v}_i\mathbf{v}_i' = \alpha\mu_c(\mathbf{x}_i)\mathbf{M}_i^{1/2}\mathbf{R}\mathbf{M}_i^{1/2} + \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i' + \mathbf{S}_i, \qquad \mathrm{E}(\mathbf{S}_i|\mathbf{x}_i) = \mathbf{0} \quad (1.4.3)$$
Next, define
$$\mathbf{k}_i \equiv \mathrm{E}(\mathbf{y}_i|\mathbf{x}_i) = \mu_c(\mathbf{x}_i)\,\mathbf{m}_i \quad (1.4.4)$$
and let K_i be the diagonalized version of k_i. Then
$$\mathbf{v}_i\mathbf{v}_i' - \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i' = \alpha\,\mathbf{K}_i^{1/2}\mathbf{R}\mathbf{K}_i^{1/2} + \mathbf{S}_i \quad (1.4.5)$$
and so
$$\mathbf{K}_i^{-1/2}\left[\mathbf{v}_i\mathbf{v}_i' - \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i'\right]\mathbf{K}_i^{-1/2}/\alpha = \mathbf{R} + \mathbf{K}_i^{-1/2}\mathbf{S}_i\mathbf{K}_i^{-1/2}/\alpha \quad (1.4.6)$$
By (1.4.3) and iterated expectations, the second term in (1.4.6), K_i^{-1/2} S_i K_i^{-1/2}/α, has a mean of zero. Therefore, we have shown
$$\mathbf{R} = \mathrm{E}\left\{\alpha^{-1}\mathbf{K}_i^{-1/2}\left[\mathbf{v}_i\mathbf{v}_i' - \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i'\right]\mathbf{K}_i^{-1/2}\right\} \quad (1.4.7)$$
Combining (1.4.7) with (1.3.17) shows that α appears as a multiplicative factor in Z*(x_i), and therefore does not affect the optimal choice of instruments. Equation (1.4.7) for R suggests simply computing the sample analog of the matrix inside the expected value. However, we must deal with the fact that the matrix depends on three unknown quantities: the parameter α, the conditional mean function μ_c(·) (which appears in the definition of v_i), and the conditional variance function σ_c²(·). There are different ways to approach estimation of μ_c(·).
For example, under Assumption CM,
$$\mathrm{E}(n_i|\mathbf{x}_i,c_i) = c_i\left[\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta}_0)\right] \quad (1.4.8)$$
and so
$$\mathrm{E}\left[\frac{n_i}{\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta}_0)}\,\middle|\,\mathbf{x}_i\right] = \mu_c(\mathbf{x}_i) \quad (1.4.9)$$
Alternatively, we can write
$$\mathrm{E}\left[T^{-1}\sum_{t=1}^{T}\frac{y_{it}}{m_{it}(\boldsymbol{\beta}_0)}\,\middle|\,\mathbf{x}_i\right] = \mu_c(\mathbf{x}_i) \quad (1.4.10)$$
Because we have available √N-consistent estimators of β_0, expressions (1.4.9) and (1.4.10) show that μ_c(·) is nonparametrically identified. In fact, we can use these expressions to motivate a nonparametric estimator. Almost certainly the initial estimator of β_0 is β̂_FEP, in which case we construct a dependent variable, n_i/(Σ_{r=1}^T m̂_ir), where m̂_ir = m_ir(β̂_FEP), and use it in a cross-sectional nonparametric regression to obtain μ̂_c(·). For σ_c²(·), the law of total variance gives the conditional form given x_i. We have
$$\mathrm{E}\left(v_{it}^2\,\middle|\,\mathbf{x}_i\right) = \mathrm{Var}(y_{it}|\mathbf{x}_i) = \mathrm{E}\left[\mathrm{Var}(y_{it}|\mathbf{x}_i,c_i)\,\middle|\,\mathbf{x}_i\right] + \mathrm{Var}\left[\mathrm{E}(y_{it}|\mathbf{x}_i,c_i)\,\middle|\,\mathbf{x}_i\right] = \alpha\mu_c(\mathbf{x}_i)\,m_{it}(\boldsymbol{\beta}_0) + \sigma_c^2(\mathbf{x}_i)\left[m_{it}(\boldsymbol{\beta}_0)\right]^2 \quad (1.4.11)$$
where we impose the working variance Assumption WV.1. Given that μ_c(x_i) is identified from the previous argument, this expression identifies α and σ_c²(·). In fact, after obtaining (semiparametric) residuals v̂_it = y_it − μ̂_c(x_i) m_it(β̂_FEP), we can use the squared residuals, v̂_it², as the dependent variable in nonparametric estimation of σ_c²(·). Therefore, a semiparametric approach to estimating the optimal IVs is available under Assumptions CM, WV.1, and WV.2.
For practical reasons, our suggestion is to avoid estimating either μ_c(·) or σ_c²(·) nonparametrically. Remember, we only need to estimate these conditional moments to obtain IVs more efficient than those used by the FEP estimator. The dimension of x_i = (x_i1, x_i2, ..., x_iT) is often large. We can reduce the dimension by using a nonparametric Mundlak (1978) device, which would have μ_c(·) and σ_c²(·) depending only on the time averages x̄_i ≡ T^{-1} Σ_{r=1}^T x_ir. Nevertheless, estimating a conditional variance along with a conditional mean when K is even moderately large is still challenging, both theoretically and practically. It would involve choosing at least two tuning parameters. From a robustness perspective, we cannot improve over the FEP estimator because it is consistent under Assumption CM. High-dimensional nonparametric estimation seems unnecessary to improve over the usual FEP estimator in the presence of serial correlation and under- or overdispersion, especially if one factors in finite-sample considerations. Instead, we draw on the literature on models for nonnegative responses to suggest working assumptions for the conditional mean and variance of the heterogeneity – as summarized, for example, in Wooldridge (2010, Section 18.7.3). For concreteness, and because it is by far the leading case, we now assume that m_it(β_0) = exp(x_it β_0). Other forms of m_it(β_0) are easily handled, but the formulas and connections with other literatures are not as straightforward. In fact, we do not even need a generalized linear model form in our current setting, though such a mean function tends to lead to easier interpretation.
Assumption WH.1: For known 1 × Q functions h(x_i), a scalar η, and a Q × 1 vector λ,
$$\mu_c(\mathbf{x}_i) \equiv \mathrm{E}(c_i|\mathbf{x}_i) = \exp\left[\eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right]. \;\blacksquare \quad (1.4.12)$$

The leading case is to use the (nonredundant) time averages of {x_it : t = 1, ..., T}, which is an extension of the Mundlak (1978) device to the nonlinear case, so that h(x_i) = x̄_i. But we can also use Chamberlain's (1980) less restrictive version, or include other functions of {x_it : t = 1, ..., T}, such as unit-specific trends or even unit-specific second moments. It seems sensible to use something simple, such as the Mundlak device, as we are only using WH.1 to generate instruments. When we combine Assumption WH.1 with the exponential conditional mean for E(y_it|x_i, c_i), we obtain, by iterated expectations,
$$\mathrm{E}(y_{it}|\mathbf{x}_i) = \exp\left[\eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right]\exp(\mathbf{x}_{it}\boldsymbol{\beta}_0) = \exp\left[\mathbf{x}_{it}\boldsymbol{\beta}_0 + \eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right] \quad (1.4.13)$$
The parameters in this conditional mean function can be consistently estimated using a variety of methods. A simple approach is to exploit equation (1.4.9) or (1.4.10) using exponential mean functions. After obtaining the FEP estimator β̂_FEP, estimate η and λ by a cross-sectional Poisson regression with mean function exp[η + h(x_i)λ] and one of the dependent variables
$$\frac{n_i}{\sum_{r=1}^{T}\exp\left(\mathbf{x}_{ir}\hat{\boldsymbol{\beta}}_{FEP}\right)} \qquad\text{or}\qquad T^{-1}\sum_{t=1}^{T}\frac{y_{it}}{\exp\left(\mathbf{x}_{it}\hat{\boldsymbol{\beta}}_{FEP}\right)} \quad (1.4.14)$$
Even if the original y_it are count variables – and there is no presumption that they are – neither of the regressands in (1.4.14) would be a count variable. Of course, this is of no consequence because of the robustness of the Poisson QMLE for estimating the parameters of the conditional mean regardless of the nature of the dependent variable (provided it is nonnegative).
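The two-step recipe above – construct the ratio regressand in (1.4.14) from a first-stage estimate of β, then run a cross-sectional Poisson regression on [1, h(x_i)] – can be sketched as follows. This is an illustrative implementation, not the paper's code: it assumes h(x_i) = x̄_i (the Mundlak device), simulated data, and a pretend first-stage estimate, and it implements the Poisson QMLE first-order conditions by Newton iteration.

```python
import numpy as np

def poisson_qmle(H, w, tol=1e-10, max_iter=100):
    """Poisson QMLE with mean exp(H @ theta); the regressand w need not be a count."""
    theta = np.zeros(H.shape[1])
    for _ in range(max_iter):
        mu = np.exp(H @ theta)
        score = H.T @ (w - mu)                  # first-order conditions
        hess = H.T @ (mu[:, None] * H)          # negative of the QMLE Hessian
        step = np.linalg.solve(hess, score)
        theta += step
        if np.max(np.abs(step)) < tol:
            break
    return theta

rng = np.random.default_rng(1)
N, T, K = 500, 4, 2
beta_fep = np.array([0.15, 0.25])               # pretend first-stage estimate of beta
x = rng.normal(size=(N, T, K))
xbar = x.mean(axis=1)                           # Mundlak averages h(x_i)
c = rng.exponential(np.exp(0.1 + xbar @ np.array([0.3, -0.2])))  # illustrative heterogeneity
m_hat = np.exp(x @ beta_fep)                    # m_it evaluated at beta_fep, N x T
y = rng.poisson(c[:, None] * m_hat)
w = y.sum(axis=1) / m_hat.sum(axis=1)           # first regressand in (1.4.14)
H = np.column_stack([np.ones(N), xbar])         # [1, h(x_i)]
theta_hat = poisson_qmle(H, w)                  # (eta_hat, lambda_hat)
print(theta_hat)
```

The same routine applied to the second regressand in (1.4.14) gives an alternative, equally valid estimate of (η, λ).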
Alternatively, β_0, η, and λ can be estimated jointly using the pooled Poisson QMLE. The pooled Poisson QMLE is completely robust to distributional misspecification and serial correlation. Of course, to preserve consistency of the resulting method of moments estimator we do not need Assumption WH.1 to hold; we are using it only to estimate the optimal instruments derived earlier.

The second working assumption on the heterogeneity distribution imposes a restriction on the variance-mean relationship.

Assumption WH.2: For δ > 0,
$$\sigma_c^2(\mathbf{x}_i) \equiv \mathrm{Var}(c_i|\mathbf{x}_i) = \delta\left[\mu_c(\mathbf{x}_i)\right]^2 = \delta\left\{\exp\left[\eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right]\right\}^2. \;\blacksquare \quad (1.4.15)$$

Assumption WH.2 is very common in settings with nonnegative, continuous heterogeneity (including so-called random effects Poisson and negative binomial models). The condition that the variance is proportional to the square of the mean holds for the natural parameterizations of the gamma and lognormal distributions, and holds whenever
$$c_i = h_i\,\mu_c(\mathbf{x}_i) \quad (1.4.16)$$
for h_i ≥ 0 and independent of x_i, without any further restrictions on the distribution of h_i. Like Assumption WH.1, Assumption WH.2 is not needed for consistent estimation using the method of moments estimator but only to estimate the optimal instruments under the working Assumptions WV.1 and WV.2.

Using Assumptions CM, WV.1, WH.1, and WH.2, we can obtain estimating equations for α and δ. First, note that
$$\mathrm{E}\left(v_{it}^2\,\middle|\,\mathbf{x}_i\right) = \alpha k_{it} + \delta k_{it}^2 \quad (1.4.17)$$
where
$$k_{it} \equiv \mathrm{E}(y_{it}|\mathbf{x}_i) = \exp\left[\mathbf{x}_{it}\boldsymbol{\beta}_0 + \eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right]$$
An immediate implication of equation (1.4.17) is
$$\mathrm{E}\left[\left(\frac{v_{it}}{\sqrt{k_{it}}}\right)^2\,\middle|\,\mathbf{x}_i\right] = \alpha + \delta k_{it} \quad (1.4.18)$$
which is the basis for estimating variance parameters in common cross-sectional models where heterogeneity is assumed independent of the covariates.
A simple way to operationalize the conditional mean is
$$\hat{v}_{it} = y_{it} - \hat{k}_{it} = y_{it} - \exp\left[\mathbf{x}_{it}\hat{\boldsymbol{\beta}}_{FEP} + \hat{\eta} + \mathbf{h}(\mathbf{x}_i)\hat{\boldsymbol{\lambda}}\right] \quad (1.4.19)$$
where η̂ and λ̂ are from one of the Poisson regressions described after equation (1.4.13). Then α̂ and δ̂ are, respectively, the intercept and slope in the pooled simple regression
$$\frac{\hat{v}_{it}^2}{\hat{k}_{it}} \;\text{ on }\; 1,\ \hat{k}_{it}, \qquad t = 1, ..., T;\ i = 1, ..., N \quad (1.4.20)$$
It is clear from equation (1.3.17) that α̂ does not appear in the optimal instruments, but we need to estimate α in order to obtain δ̂. In order to conclude the working assumptions are a reasonable approximation to reality, both α̂ and δ̂ should be nonnegative. If one of them is negative (most likely δ̂), then δ̂ should be set to zero. Because α̂ drops out of the optimal IVs, we need not estimate it when we set δ̂ = 0. Nevertheless, one may be curious about the estimated amount of overdispersion when δ is set to zero. With δ = 0, the estimate of α is simply
$$\hat{\alpha} = (NT)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{v}_{it}^2/\hat{k}_{it} \quad (1.4.21)$$
and this is guaranteed to be nonnegative. However, as mentioned above, α̂ does not affect estimation of the optimal IVs when δ = 0. When we add Assumptions WH.1 and WH.2 to the previous assumptions, we obtain a simple form for R:
$$\mathbf{R} = \mathrm{E}\left\{\mathbf{K}_i^{-1/2}\left[\mathbf{v}_i\mathbf{v}_i' - \delta\,\mathbf{k}_i\mathbf{k}_i'\right]\mathbf{K}_i^{-1/2}/\alpha\right\}$$
which leads immediately to the method-of-moments/plug-in estimator
$$\hat{\mathbf{R}} = \frac{1}{\hat{\alpha}}\,N^{-1}\sum_{i=1}^{N}\hat{\mathbf{K}}_i^{-1/2}\left[\hat{\mathbf{v}}_i\hat{\mathbf{v}}_i' - \hat{\delta}\,\hat{\mathbf{k}}_i\hat{\mathbf{k}}_i'\right]\hat{\mathbf{K}}_i^{-1/2} \quad (1.4.22)$$
By a standard application of the uniform weak law of large numbers [Wooldridge (2010, Lemma 12.1)], R̂ →p R.
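To make the pooled regression in (1.4.20) concrete, the sketch below recovers α and δ from fitted means and squared residuals. It is purely illustrative: for transparency the squared residuals are constructed to satisfy (1.4.17) exactly (no sampling noise), so the regression returns the assumed values α = 1.3 and δ = 0.5; the names and values are stand-ins, not estimates from any data.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 400, 5
alpha_true, delta_true = 1.3, 0.5             # assumed working-variance parameters
k_hat = rng.uniform(0.5, 3.0, size=(N, T))    # stand-in for the fitted means k_hat_it
# squared residuals constructed to satisfy (1.4.17) exactly:
vsq = alpha_true * k_hat + delta_true * k_hat**2
# pooled regression (1.4.20): vsq / k_hat on an intercept and k_hat
w = (vsq / k_hat).ravel()
X = np.column_stack([np.ones(N * T), k_hat.ravel()])
alpha_hat, delta_hat = np.linalg.lstsq(X, w, rcond=None)[0]
print(alpha_hat, delta_hat)   # recovers 1.3 and 0.5
```

With real residuals the fitted intercept and slope would differ from (α, δ) by sampling error, and either could come out negative, in which case the text's advice is to set δ̂ = 0.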
For each t ≠ s, the correlations are estimated as
$$\hat{\rho}_{st} = \frac{1}{\hat{\alpha}}\,N^{-1}\sum_{i=1}^{N}\frac{\hat{v}_{is}\hat{v}_{it} - \hat{\delta}\hat{k}_{is}\hat{k}_{it}}{\sqrt{\hat{k}_{is}\hat{k}_{it}}} \quad (1.4.23)$$
From the definitions of α̂ and δ̂ obtained from (1.4.18), it is easily seen that ρ̂_tt = 1 for t = 1, ..., T, and so this estimator imposes the logical requirement that a correlation matrix must have unity down its diagonal. If we set δ = 0, R̂ reduces to
$$\hat{\mathbf{R}} = \frac{1}{\hat{\alpha}}\,N^{-1}\sum_{i=1}^{N}\hat{\mathbf{K}}_i^{-1/2}\hat{\mathbf{v}}_i\hat{\mathbf{v}}_i'\hat{\mathbf{K}}_i^{-1/2} \quad (1.4.24)$$
With this choice of R̂, we can make a direct connection with the GEE literature by ignoring the presence of c_i and working off the first two conditional moments of y_i given x_i – see, for example, Liang and Zeger (1986) and Wooldridge (2010, Sections 13.11.4 and 18.7.3). Namely, under the full set of working assumptions with δ = 0,
$$\mathrm{E}(y_{it}|\mathbf{x}_i) = \exp\left[\mathbf{x}_{it}\boldsymbol{\beta}_0 + \eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right] = k_{it}, \quad t = 1, ..., T \quad (1.4.25)$$
$$\mathrm{Var}(y_{it}|\mathbf{x}_i) = \alpha\,\mathrm{E}(y_{it}|\mathbf{x}_i), \quad t = 1, ..., T \quad (1.4.26)$$
$$\mathrm{Var}(\mathbf{y}_i|\mathbf{x}_i) = \alpha\,\mathbf{K}_i^{1/2}\mathbf{R}\mathbf{K}_i^{1/2} \quad (1.4.27)$$
This collection of moment assumptions is precisely what is used in GEE applications of Poisson regression (whether or not y_it is a count variable), with the addition of the vector of functions h(x_i). We emphasize that these are all working assumptions in the current context. Not even the conditional mean function in (1.4.25) is assumed to hold for consistency, because (1.4.25) is obtained from Assumptions CM and WH.1, whereas we only require Assumption CM for consistency. We impose Assumptions WH.1 and WH.2 in order to estimate R and then to estimate Ω_i. Provided it leads to a positive definite estimate, we prefer (1.4.22) because it is the correct expression under all of the working assumptions.
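A direct implementation of the plug-in estimator (1.4.22), with the δ = 0 special case (1.4.24) available by passing a zero, might look like the following. The inputs v_hat and k_hat are stand-ins for the residuals and fitted means from the earlier steps; the toy data are constructed with no serial correlation and α = 1, δ = 0, so the estimate should be close to the identity matrix.

```python
import numpy as np

def estimate_R(v_hat, k_hat, alpha_hat, delta_hat):
    """Plug-in estimator of the working correlation matrix R, eq. (1.4.22);
    passing delta_hat = 0.0 gives the simpler form (1.4.24)."""
    N, T = v_hat.shape
    R = np.zeros((T, T))
    for i in range(N):
        Kinv_sqrt = np.diag(1.0 / np.sqrt(k_hat[i]))
        outer = np.outer(v_hat[i], v_hat[i]) - delta_hat * np.outer(k_hat[i], k_hat[i])
        R += Kinv_sqrt @ outer @ Kinv_sqrt
    return R / (N * alpha_hat)

rng = np.random.default_rng(3)
N, T = 1000, 4
k_hat = rng.uniform(0.5, 2.0, size=(N, T))
v_hat = rng.normal(scale=np.sqrt(k_hat))      # toy residuals with Var = k (alpha = 1, delta = 0)
R_hat = estimate_R(v_hat, k_hat, alpha_hat=1.0, delta_hat=0.0)
print(np.round(R_hat, 2))                     # approximately the identity matrix
```

R̂ is symmetric by construction; whether it is positive definite in a given sample is exactly the check that decides between (1.4.22) and the δ = 0 fallback described in the simulation section.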
Under Assumption CM and the full set of working assumptions, we can estimate the optimal IVs, for each i, as
$$\nabla_{\boldsymbol{\beta}}\hat{\mathbf{m}}_i'\,\hat{\mathbf{M}}_i^{-1/2}\left[\hat{\mathbf{R}}^{-1} - \frac{1}{\sqrt{\hat{\mathbf{m}}_i}'\hat{\mathbf{R}}^{-1}\sqrt{\hat{\mathbf{m}}_i}}\,\hat{\mathbf{R}}^{-1}\sqrt{\hat{\mathbf{m}}_i}\sqrt{\hat{\mathbf{m}}_i}'\hat{\mathbf{R}}^{-1}\right]\hat{\mathbf{M}}_i^{-1/2} \quad (1.4.28)$$
where "ˆ" means the quantity is evaluated at a first-round estimator, most likely β̂_FEP, and R̂ is from (1.4.22) or, if necessary, (1.4.24). [In either case, α̂ drops out of (1.4.28).] However, without the full set of working assumptions, this choice of IVs is not guaranteed to improve over the FEP estimator because of its dependence on R̂. A somewhat subtle point is that (1.4.28) is not even optimal under Assumptions CM, WV.1, and WV.2 alone, because consistency of R̂ for R generally requires correct specification of the heterogeneity mean and variance – that is, Assumptions WH.1 and WH.2. As mentioned previously, if we did not have to estimate R, we could use (1.4.28) with R̂ replaced by R, and then we would have just identification, as with the FEP estimator. Naturally, we want to use the data to provide an estimator of R better than just guessing. Incidentally, expression (1.4.28) shows that the estimator α̂ has no direct effect on the optimal IVs because it factors out as a constant.

In order to ensure improvements over FEP, our recommendation is to stack the FEP and the new "optimal" IVs to form an expanded IV matrix and use GMM. The resulting estimator, which we simply call the "GMM estimator," is guaranteed to be asymptotically at least as efficient as the FEP and GFEP estimators; usually it is strictly more efficient than both.
In other words, the T × 2K matrix of IVs is Ẑ_i, written in transposed form as
$$\hat{\mathbf{Z}}_i' = \begin{pmatrix} \nabla_{\boldsymbol{\beta}}\hat{\mathbf{m}}_i'\,\hat{\mathbf{M}}_i^{-1/2}\left(\mathbf{I}_T - \sqrt{\hat{\mathbf{p}}_i}\sqrt{\hat{\mathbf{p}}_i}'\right)\hat{\mathbf{M}}_i^{-1/2} \\[6pt] \nabla_{\boldsymbol{\beta}}\hat{\mathbf{m}}_i'\,\hat{\mathbf{M}}_i^{-1/2}\left(\hat{\mathbf{R}}^{-1} - \dfrac{1}{\sqrt{\hat{\mathbf{m}}_i}'\hat{\mathbf{R}}^{-1}\sqrt{\hat{\mathbf{m}}_i}}\,\hat{\mathbf{R}}^{-1}\sqrt{\hat{\mathbf{m}}_i}\sqrt{\hat{\mathbf{m}}_i}'\hat{\mathbf{R}}^{-1}\right)\hat{\mathbf{M}}_i^{-1/2} \end{pmatrix} \quad (1.4.29)$$
Given this choice of Ẑ_i, the mechanics of GMM are straightforward. After obtaining β̂_FEP, obtain the T × 1 residual vectors
$$\tilde{\mathbf{u}}_i = \mathbf{y}_i - \mathbf{p}\left(\mathbf{x}_i,\hat{\boldsymbol{\beta}}_{FEP}\right)n_i \quad (1.4.30)$$
Then, given the estimators of η, λ, α, δ, and R described above, obtain the 2K × 2K matrix
$$\hat{\boldsymbol{\Psi}} = N^{-1}\sum_{i=1}^{N}\hat{\mathbf{Z}}_i'\tilde{\mathbf{u}}_i\tilde{\mathbf{u}}_i'\hat{\mathbf{Z}}_i \quad (1.4.31)$$
Assuming Ψ̂ is positive definite (which generally holds with probability approaching one), the optimal GMM estimator, β̂_GMM, solves
$$\min_{\boldsymbol{\beta}\in\mathbb{R}^K}\left[\sum_{i=1}^{N}\mathbf{u}_i(\boldsymbol{\beta})'\hat{\mathbf{Z}}_i\right]\hat{\boldsymbol{\Psi}}^{-1}\left[\sum_{i=1}^{N}\hat{\mathbf{Z}}_i'\mathbf{u}_i(\boldsymbol{\beta})\right] \quad (1.4.32)$$
Because we have chosen very smooth mean, variance, and correlation functions, consistency and √N-asymptotic normality are standard; see, for example, Wooldridge (2010, Chapter 14). Remember, Ψ̂^{-1} is an (estimated) optimal weighting matrix given the choice of instruments; the standard GMM inference does not require that Ẑ_i is optimal. Regardless of the size of T, the GMM estimator generates K overidentification restrictions that can be used to test Assumption CM.

1.5 A Small Simulation Study

We now present the results of a small Monte Carlo simulation to demonstrate the efficacy of the improved GMM estimator. The conditional mean model, which has an exponential form, includes three time-varying explanatory variables and multiplicative heterogeneity. We consider two conditional distributions for the outcome variable, y_it.
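The GMM mechanics in (1.4.30)–(1.4.32) reduce to a standard quadratic-form minimization. The sketch below codes the residual function and objective for a generic instrument array and minimizes it numerically; it is an illustration with toy data and a toy instrument block (Z_i set to x_i, a stand-in for the stacked Ẑ_i in (1.4.29)), not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, T, K = 300, 4, 2
beta0 = np.array([0.15, 0.25])
x = rng.normal(size=(N, T, K))
c = rng.exponential(1.0, N)
y = rng.poisson(c[:, None] * np.exp(x @ beta0))

def residuals(beta):
    m = np.exp(x @ beta)                          # m_it(beta), N x T
    p = m / m.sum(axis=1, keepdims=True)          # shares p_it(beta)
    return y - p * y.sum(axis=1, keepdims=True)   # u_i(beta) as in (1.4.30)

Z = x.copy()                                      # toy T x K instrument block per i

def gmm_objective(beta, Psi_inv):
    g = np.einsum('itk,it->k', Z, residuals(beta))   # sum_i Z_i' u_i
    return g @ Psi_inv @ g / N

# weighting matrix (1.4.31) built from residuals at an initial guess
u0 = residuals(np.zeros(K))
Zu = np.einsum('itk,it->ik', Z, u0)
Psi = Zu.T @ Zu / N
Psi_inv = np.linalg.inv(Psi)
res = minimize(gmm_objective, np.zeros(K), args=(Psi_inv,), method='Nelder-Mead')
print(res.x)   # should be near beta0
```

With this toy Z the system is just identified, so the weighting matrix is innocuous; stacking the FEP instruments with the "optimal" block as in (1.4.29) doubles the moment count and makes Ψ̂^{-1} operative, along with the K overidentification restrictions noted in the text.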
In the first case, y_it is a count variable generated as
$$y_{it}\,|\,\mathbf{x}_i,c_i,\mathbf{e}_i \sim \mathrm{Poisson}\left[c_i\exp(\mathbf{x}_{it}\boldsymbol{\beta} + e_{it})\right] \quad (1.5.1)$$
where e_i = (e_i1, e_i2, ..., e_iT)' is distributed as multivariate normal with unit variances. In order to generate serial dependence in {y_it : t = 1, ..., T} conditional on (x_i, c_i), {e_it : t = 1, 2, ..., T} follows an AR(1) process with first-order correlation φ ∈ {0, 0.25, 0.75}. This autoregressive process generates no conditional dependence when φ = 0 and fairly strong time series dependence when φ = 0.75. Because of the inclusion of e_it, the conditional distribution D(y_it|x_i, c_i) is not Poisson; in fact, it exhibits overdispersion because exp(e_it) is integrated out in obtaining D(y_it|x_i, c_i). However, consistency of the estimators requires only that E(y_it|x_i, c_i) has the exponential form with multiplicative c_i.

The strictly exogenous explanatory variables, x_it, are generated as a trivariate, stationary vector autoregression, where the stochastic term is an independent multivariate standard normal distribution with autocorrelation parameter 0.125. The processes x_i = (x_i1, ..., x_iT) and e_i are independent. The vector β is set to β' = (0.15, 0.25, 0.35) (where we drop the "0" subscript to make the tables easier to read). To generate correlation between c_i and x_i, we use an exponential version of the Mundlak (1978) device and an exponential distribution:
$$c_i\,|\,\mathbf{x}_i \sim \mathrm{Exponential}\left[\exp(\eta + \bar{\mathbf{x}}_i\boldsymbol{\lambda})\right] \quad (1.5.2)$$
Under this specification, the working assumptions WH.1 and WH.2 are both satisfied with h(x_i) = x̄_i and, in the case of WH.2, δ = 1.
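The count-data design above can be sketched as follows. This is an illustrative data-generating script under stated assumptions: the η and λ values are made up (the text does not report them), and the covariate process is a simplified stand-in (three independent AR-type series with coefficient 0.125) rather than a full VAR.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, phi = 300, 4, 0.75
beta = np.array([0.15, 0.25, 0.35])
eta, lam = 0.0, np.array([0.1, 0.1, 0.1])   # illustrative values for (1.5.2)

# covariates: three mildly autocorrelated series (stand-in for the VAR in the text)
x = np.zeros((N, T, 3))
x[:, 0] = rng.normal(size=(N, 3))
for t in range(1, T):
    x[:, t] = 0.125 * x[:, t - 1] + rng.normal(size=(N, 3))

# AR(1) idiosyncratic errors with unit variance and first-order correlation phi
e = np.zeros((N, T))
e[:, 0] = rng.normal(size=N)
for t in range(1, T):
    e[:, t] = phi * e[:, t - 1] + np.sqrt(1 - phi**2) * rng.normal(size=N)

# heterogeneity correlated with x through the Mundlak device, eq. (1.5.2)
xbar = x.mean(axis=1)
c = rng.exponential(np.exp(eta + xbar @ lam))

# count outcome, eq. (1.5.1)
y = rng.poisson(c[:, None] * np.exp(x @ beta + e))
print(y.shape, y.dtype)
```

Setting φ = 0 removes the conditional serial dependence, which is the case where Corollary 1.3.1 says FEP is already optimal.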
We estimate the parameters in the heterogeneity moments using a two-step pooled Poisson QMLE with the FEP estimator as the first-stage estimator of β. The estimates α̂ and δ̂ are obtained via the pooled OLS regression in equation (1.4.20), and R̂ is estimated as in (1.4.22). When R̂ is not positive definite for a particular draw, we set δ̂ = 0 and estimate R̂ as in (1.4.24) (in which case the value of α̂ plays no role in the estimation of β). This situation occurs in between 60% and 80% of the simulations. We use N = 300, T ∈ {4, 8}, and 1,000 replications in the simulations. The findings are reported in Table 1.1.

Some general patterns emerge from Table 1.1. First, the FEP estimator shows very little bias, and its bias is almost always smaller than that of the GFEP and GMM estimators. The GFEP estimator generally shows the most bias – as high as nine percent in some cases. Still, we only have N = 300, which is not especially large. Interestingly, the bias in the GMM estimator – which combines both sets of moment conditions – is well below that of the GFEP estimator.
Table 1.1: Conditional Poisson distribution
(The three rows in each panel correspond to the three elements of β.)

                          Bias                      SD                       RMSE
               FEP     GFEP     GMM      FEP    GFEP    GMM      FEP    GFEP    GMM
φ = 0
 T = 4       0.002   -0.004   0.000    0.082   0.075  0.072    0.082   0.075  0.072
             0.001   -0.011  -0.003    0.083   0.078  0.072    0.083   0.079  0.072
            -0.001   -0.016  -0.005    0.083   0.079  0.075    0.083   0.081  0.075
 T = 8       0.011   -0.010  -0.005    0.052   0.044  0.041    0.052   0.045  0.041
             0.000   -0.020  -0.011    0.053   0.044  0.042    0.053   0.049  0.044
             0.001   -0.027  -0.014    0.051   0.045  0.042    0.052   0.052  0.045
φ = 0.25
 T = 4      -0.007   -0.016   0.008    0.081   0.074  0.072    0.081   0.076  0.073
            -0.003   -0.014   0.004    0.082   0.075  0.070    0.079   0.077  0.070
             0.002   -0.015   0.003    0.079   0.075  0.070    0.079   0.077  0.070
 T = 8      -0.001   -0.014  -0.007    0.051   0.045  0.042    0.051   0.047  0.043
             0.000   -0.021  -0.010    0.048   0.044  0.040    0.048   0.049  0.042
            -0.001   -0.029  -0.015    0.051   0.046  0.043    0.051   0.054  0.046
φ = 0.75
 T = 4      -0.001   -0.007  -0.003    0.057   0.054  0.051    0.057   0.055  0.051
             0.005   -0.008   0.001    0.060   0.058  0.052    0.061   0.059  0.052
             0.001   -0.014  -0.002    0.060   0.059  0.053    0.060   0.060  0.053
 T = 8       0.001   -0.012  -0.004    0.043   0.035  0.034    0.043   0.037  0.034
            -0.001   -0.023  -0.011    0.044   0.036  0.034    0.044   0.043  0.036
            -0.002   -0.032  -0.015    0.047   0.038  0.036    0.047   0.050  0.039

The bias in both the GFEP and GMM estimators appears to increase with T. Overall, the bias in the GMM estimator seems acceptable, especially given the small N. The GMM estimator always has the smallest sampling standard deviation, sometimes being about 80% of the FEP standard error. The SD of the GFEP estimator falls in between those of the FEP and GMM estimators. In a few cases the FEP estimator has a smaller root mean squared error (RMSE) than the GFEP estimator. The asymptotic theory of GMM estimation implies that the GMM estimator is asymptotically more efficient than FEP or GFEP because, in the setting of the simulation, the entire set of working assumptions does not hold, and so GFEP does not use the optimal IVs.
The ranking of the estimators in terms of root mean squared error favors the GMM estimator in every case.

To see how the estimators perform when y_it is a continuous outcome, we generated y_it as
$$y_{it}\,|\,\mathbf{x}_i,c_i,\mathbf{e}_i \sim \mathrm{Gamma}\left[\exp(\mathbf{x}_{it}\boldsymbol{\beta} + e_{it}),\,c_i\right] \quad (1.5.3)$$
where the gamma distribution is parameterized so that E(y_it|x_it, c_i, e_i) = c_i exp(x_it β + e_it), as before. The conditional variance is Var(y_it|x_it, c_i, e_i) = c_i² exp(x_it β + e_it). We use the same process in (1.5.2) to generate c_i. The simulation findings are reported in Table 1.2.

Table 1.2: Conditional Gamma distribution
(The three rows in each panel correspond to the three elements of β.)

                          Bias                      SD                       RMSE
               FEP     GFEP     GMM      FEP    GFEP    GMM      FEP    GFEP    GMM
φ = 0
 T = 4       0.000   -0.006  -0.002    0.090   0.087  0.081    0.090   0.087  0.081
             0.003   -0.008   0.003    0.089   0.085  0.080    0.089   0.085  0.080
             0.001   -0.014   0.000    0.090   0.088  0.083    0.090   0.089  0.083
 T = 8       0.000   -0.012  -0.006    0.056   0.049  0.048    0.056   0.051  0.048
            -0.001   -0.019  -0.009    0.052   0.050  0.047    0.052   0.054  0.048
            -0.001   -0.027  -0.014    0.054   0.051  0.048    0.054   0.058  0.050
φ = 0.25
 T = 4       0.002   -0.007   0.002    0.086   0.082  0.078    0.086   0.082  0.078
            -0.003   -0.016  -0.004    0.085   0.082  0.077    0.085   0.084  0.078
             0.002   -0.014  -0.001    0.086   0.084  0.081    0.086   0.085  0.081
 T = 8       0.000   -0.013  -0.006    0.057   0.050  0.048    0.057   0.052  0.048
             0.000   -0.019  -0.009    0.055   0.050  0.048    0.055   0.053  0.049
            -0.001   -0.033  -0.017    0.058   0.053  0.051    0.058   0.062  0.053
φ = 0.75
 T = 4       0.001   -0.006   0.000    0.069   0.067  0.063    0.069   0.067  0.063
             0.000   -0.012  -0.001    0.074   0.072  0.067    0.074   0.073  0.067
             0.000   -0.016  -0.001    0.070   0.072  0.064    0.070   0.074  0.064
 T = 8       0.001   -0.014  -0.005    0.049   0.041  0.040    0.049   0.044  0.040
             0.000   -0.023  -0.008    0.048   0.042  0.039    0.048   0.048  0.040
            -0.001   -0.034  -0.013    0.050   0.046  0.043    0.050   0.057  0.045

The general pattern found in Table 1.1 continues to hold in Table 1.2.
The FEP estimator generally has the lowest bias, although the GMM estimator also does well with bias. The GFEP estimator, which uses only the “optimal” IVs, shows more bias – again, sometimes more than nine percent. In terms of precision and RMSE, the GMM estimator outperforms FEP and GFEP in all scenarios, although the gains are modest in some cases. We tried several additional scenarios, including cases where Assumption WH.2 is violated – by drawing 𝑐𝑖 from a Poisson distribution – and cases where, conditional on (x𝑖, 𝑐𝑖), 𝑦𝑖𝑡 is an underdispersed gamma random variable. In the former case, we found only minor differences among the estimators, although sometimes the FEP estimator outperformed the other two in terms of RMSE. In the latter case, where we did not allow serial correlation, the estimators perform very similarly. As a final set of simulations, we misspecified the conditional mean E(𝑐𝑖 | x𝑖) in (1.5.2) by letting the mean depend on the average of the first and last time periods rather than x̄𝑖. In other words, Assumption WH.1 is violated. The GMM estimator uniformly performed the best based on RMSE and exhibited biases on the order of those reported in Tables 1.1 and 1.2. These simulations are available upon request from the authors.

1.6 Summary and Conclusion

We have characterized the optimal instruments in a multiplicative panel model under a general set of working assumptions. The variance-mean relationship, conditional on unobserved heterogeneity as well as covariates, is allowed to be any positive number. The conditional correlation matrix is assumed to be constant but is otherwise unrestricted. Under these assumptions, the optimal IVs depend only on the unknown correlation matrix, R (and the value of the conditional mean parameters, 𝜷0).
In the special case that R = I𝑇, we show that the FEP estimator achieves the asymptotic efficiency bound for any amount of overdispersion or underdispersion, thereby relaxing the assumptions under which the FEP estimator is known to be asymptotically efficient. When R is not the identity matrix, it is possible to improve on the FEP estimator. To operationalize the optimal IVs in order to exploit serial correlation, we add working first and second moment assumptions on the conditional heterogeneity distribution. These assumptions are common in literatures that allow nonnegative heterogeneity in cross-sectional and panel data models. We show that estimating the optimal IVs is straightforward, and suggest a GMM approach that is guaranteed to improve asymptotic efficiency whether or not serial correlation is present. Our simulations show that the GMM estimator that combines the FEP moment conditions and the new “optimal” moment conditions has very good bias properties and provides nontrivial efficiency gains – even when the cross-sectional sample size is only 𝑁 = 300. Our results and new estimator are appealing for cases where 𝑁 is substantially larger than 𝑇, as we have used the standard microeconometric setting where 𝑇 is fixed in the asymptotic analysis. Naturally, this is not the only possibility. For example, Fernández-Val and Weidner (2018) and Chen, Fernández-Val, and Weidner (2020) have proposed quasi-MLEs that allow more heterogeneity. However, consistency requires 𝑇 → ∞ along with 𝑁 → ∞, and necessarily restricts the amount of time series heterogeneity and dependence.

CHAPTER 2
INFORMATION EQUIVALENCE AMONG TRANSFORMATIONS OF SEMIPARAMETRIC NONLINEAR PANEL DATA MODELS

2.1 Introduction

In the standard linear panel data model with additive unobserved heterogeneity, it is well known that numerous transformations can be used to eliminate the heterogeneity prior to estimation.
The most common methods are the within and first-differencing transformations.1

1 For a comprehensive review of linear panel models with additive heterogeneity, see Chapters 10 and 11 of Wooldridge (2010).

Similarly, when the heterogeneity appears as a multiplicative term in the conditional mean, as in certain Generalized Linear Model settings, modified within and differencing transformations can control for the heterogeneity and provide moment conditions for estimation. There exist other transformations which control for heterogeneity but are clearly absurd. For example, multiplying all the data by zero eliminates the heterogeneity along with all information for estimation. For a less trivial example, suppose the population model is linear with a single additive effect and the first-differenced errors are homoskedastic and uncorrelated. Then second-differencing is still consistent but less efficient than first-differencing. These examples raise the question of how to evaluate methods for eliminating heterogeneity while preserving information for estimation. This paper considers conditional mean models with unobserved heterogeneity. The general framework derived within encompasses a large class of both linear and strictly nonlinear models, examples of which are given in Section 2.2.1. The models are referred to as “semiparametric” in the sense that nothing is assumed about the relationship between the heterogeneity and observables other than regularity conditions needed for asymptotic analysis. In place of assumptions on the conditional distribution of the heterogeneity, these models often require a transformation to eliminate or control for the term. I provide a unified framework for comparing such transformations in terms of the information they preserve. Those that yield the same moment conditions, given certain regularity assumptions, will provide the same √𝑁-asymptotic efficiency bound if they have equal rank.
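The within and first-differencing transformations mentioned above both annihilate an additive effect, which is what makes them valid starting points for estimation; a minimal numeric sketch (the data below are made up for illustration):

```python
import numpy as np

T = 4
Q = np.eye(T) - np.ones((T, T)) / T   # within (demeaning) transformation
D = np.diff(np.eye(T), axis=0)        # (T-1) x T first-differencing matrix

# Any vector of the form c*1 lies in the null space of both transformations.
c, beta = 2.5, 0.7
x = np.arange(1.0, T + 1)             # a made-up covariate path
y = c * np.ones(T) + beta * x         # y_t = c + x_t*beta, with u_t = 0

# After transforming, the additive effect c is gone and only beta remains:
assert np.allclose(Q @ y, beta * (Q @ x))
assert np.allclose(D @ y, beta * (D @ x))
```

Multiplying by zero would also satisfy the first property, which is exactly why a rank condition on the transformation is needed to preserve information.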
As mentioned above, the within and first-differencing transformations are the most common in the linear panel case for eliminating additive heterogeneity. When the covariates are strictly exogenous with respect to the idiosyncratic errors, these transformations provide conditional moment restrictions which can be exploited for estimation of the population parameters. For a given conditional variance matrix, Arellano and Bover (1995) suggest that Generalized Least Squares (GLS) on the demeaned equations is equivalent to the efficient 3SLS estimator. This claim was later proven in Im et al. (1999), along with a proof that the GLS estimators based on the demeaned and the first-differenced equations are equivalent. Their result shows that two commonly used methods of estimation preserve the same information in the linear case. However, they limit their investigation to a small number of estimators and only allow for a single time-invariant individual effect. My approach recovers the same result as Im et al. (1999) and extends it to general factor-augmented panels with an arbitrary number of individual effects with time-varying coefficients. For nonlinear models with a multiplicative heterogeneity term, one approach to estimation is the fixed effects Poisson (FEP) estimator. Hausman et al. (1984) derive the FEP as the maximum likelihood estimator of a multinomial distribution.2 Wooldridge (1999) shows that the FEP is in fact consistent under a much weaker strict exogeneity assumption. One proof of this result shows that the score of the likelihood function has a mean of zero at the true parameter value due to the likelihood’s implicit transformation of the data. This transformation subtracts the weighted time averages from each outcome and so I refer to it as the generalized within transformation.
Another approach is the generalized next-differencing transformation first studied by Chamberlain (1992) and Wooldridge (1997), which subtracts from each time period the next period’s outcome, weighted by the quotient of the mean functions. While generalized next-differencing was originally proposed for a sequential exogeneity setting, I study it here in the context of strict exogeneity. I also consider the residual maker matrix from regressing on the outcome variable’s mean function. To the best of my knowledge, this paper is the first to show information equivalence of these transformations.

2 Similar to the linear fixed effects estimator, the FEP estimator is a true fixed effects procedure as it can be derived by estimating via pooled Poisson regression and treating the multiplicative terms as parameters to estimate.

In Section 2.2, I define information equivalence in a first order asymptotic sense. The efficiency bounds studied will apply to “small-𝑇” settings where asymptotics are derived with 𝑇 fixed as 𝑁 → ∞. I then derive sufficient conditions under which transformations of the data which yield moment restrictions for estimation preserve the same information. This result is general and can apply to a number of finite and asymptotic settings. In Section 2.3, I apply the main result from Section 2.2 to a nonlinear multiplicative model, a linear model with an unknown factor structure, and a linear random trend model. Section 2.4 discusses further practical suggestions like implementation and extensions. Section 2.5 provides concluding remarks along with potential directions for future research.

2.2 Information equivalence

As mentioned in the Introduction, the results of this section apply to population moments. In what follows, (𝒚𝑖, 𝒙𝑖, 𝒄𝑖) is assumed to be a random draw from an infinite population. The matrix (𝒚𝑖, 𝒙𝑖) is 𝑇 × (1 + 𝐾) and observable whereas the random 𝑝 × 1 vector 𝒄𝑖 is not.
All statements involving expressions of random variables hold almost surely. For example, conditional means and rank conditions for random matrices hold with probability one. Finally, I assume regularity conditions suitable for asymptotic analysis such as bounds on the higher-order moments of the data.

2.2.1 Model

The following conditional mean assumption specifies the empirical setting:

Assumption CM: For 𝑡 = 1, ..., 𝑇,

𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝒄𝑖) = 𝑚𝑡(𝒙𝑖𝑡, 𝜷0, 𝒄𝑖)    (2.2.1)

where 𝑚𝑡(𝒙, ·, 𝒄) : R𝐾 → R is a known Borel twice-differentiable function for every 𝒙 ∈ X𝑡 and 𝒄 ∈ C, where X𝑡 and C are the respective supports of 𝒙𝑖𝑡 and 𝒄𝑖. ■

Equation (2.2.1) specifies a nonlinear semiparametric conditional mean function with strictly exogenous covariates where 𝜷0 is a 𝐾 × 1 vector of parameters.3 The mean function itself is allowed to vary over time periods. The heterogeneity is also allowed to enter the mean function in any arbitrary way. In the linear panel case, the simplest and most common specification is an individual-specific intercept. In nonlinear cases, the heterogeneity is often included as a multiplicative term. I do not place any identifying assumptions directly on 𝑚𝑡. These implicit identification conditions will come later in the form of rank assumptions. Essentially, the results contained in this paper apply to nontrivial empirical situations. For example, consider a model 𝑦𝑖1 = 𝑐𝑖 + 𝛽𝑦𝑖2 where 𝑐𝑖 is an individual-specific intercept and 𝑦𝑖2 is an indicator variable associated with a treatment or policy intervention. If 𝑐𝑖 has a mass point at zero, it must be the case that there is variation, so that 𝑦𝑖1 ≠ 0 for all 𝑖.
The following examples illustrate some common empirical settings for which Assumption CM applies:

Example 1 (Linear model with additive effects): Consider the following specification:

𝑦𝑖𝑡 = 𝑐𝑖 + 𝒙𝑖𝑡 𝜷0 + 𝑢𝑖𝑡

This model is common among applied microeconometric researchers. Im et al. (1999) show that the 3SLS estimator of 𝜷0 using the differenced covariates as instruments is algebraically equivalent to GLS estimators based off of both the within and differenced transformed residuals.4 This example will be discussed in Sections 2.2.2 and 2.3.1. We can include multiple individual effects loaded onto macro shocks in the form

𝑦𝑖𝑡 = 𝒄′𝑖 𝒇𝑡 + 𝒙𝑖𝑡 𝜷0 + 𝑢𝑖𝑡

where 𝒄′𝑖 𝒇𝑡 = ∑_{𝑟=1}^{𝑝} 𝑐𝑖𝑟 𝑓𝑟𝑡 and 𝒇𝑡 is observable. An example of the general setting is the random trend linear model

𝑦𝑖𝑡 = 𝑐𝑖 + 𝑎𝑖 𝑡 + 𝒙𝑖𝑡 𝜷0 + 𝑢𝑖𝑡

The standard approach to estimation is to first-difference the outcomes to yield another linear model with only an additive individual effect. If strict exogeneity is assumed with respect to 𝒙𝑖, we have the same empirical setting as above, and so the same analysis will apply. I discuss the general model in Section 2.3.2. ■

Example 2 (Exponential mean): Consider the following mean function:

𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝑐𝑖) = exp(𝑐𝑖 + 𝒙𝑖𝑡 𝜷0)

The exponential mean function is most popularly employed to study count data.

3 In this context, nonlinear does not mean ‘strictly nonlinear’, but can also include linear models.
4 The setting studied by Im et al. is motivated by considering covariates which satisfy 𝐸(𝒙𝑖 ⊗ 𝒖𝑖) = 0. The equivalence result provided in their paper, however, is purely algebraic in nature and holds regardless of the covariance between the covariates and idiosyncratic errors.
The most common estimator of the parameters in this model is the FEP estimator. Wooldridge (1999) shows that Assumption CM is sufficient for identification using the following transformation:

𝑦𝑖𝑡 − (∑_{𝑠=1}^{𝑇} 𝑦𝑖𝑠) exp(𝒙𝑖𝑡 𝜷0) / (∑_{𝑠=1}^{𝑇} exp(𝒙𝑖𝑠 𝜷0))

This transformation will be referred to as the generalized within transformation and provides the basis of the FEP estimator since it shows up in the score function of the Poisson QMLE and has an expectation of zero conditional on 𝒙𝑖. Another possible transformation is

𝑦𝑖𝑡 − 𝑦𝑖,𝑡+1 exp(𝒙𝑖𝑡 𝜷0) / exp(𝒙𝑖,𝑡+1 𝜷0)

which I refer to as the generalized next-differencing transformation. Both of these transformations are studied in generality in Section 2.3. In an analogy to the linear setting, we can discuss an exponential random trend model with multiplicative specification

𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝒄𝑖) = 𝑐𝑖 𝑎𝑖^𝑡 exp(𝒙𝑖𝑡 𝜷0)

which can be motivated by the form 𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝒄𝑖) = exp(𝛾𝑖 + 𝛼𝑖 𝑡 + 𝒙𝑖𝑡 𝜷0). This model has received no attention in the econometric literature to the best of my knowledge. I discuss how the results of this paper could apply to such a model in Section 2.3.1. ■

Example 3 (Production functions): Suppose the dependent variable is firm output which follows the given production technology:

𝑄𝑖𝑡 = exp(𝜖𝑖𝑡 − 𝑐𝑖) 𝐿𝑖𝑡^{𝛽1} 𝐾𝑖𝑡^{𝛽2}

where (𝐿, 𝐾) are labor and capital stock respectively. The heterogeneity can be written exp(−𝑐𝑖). If 𝐸(𝜖𝑖𝑡 | 𝑐𝑖, 𝐿𝑖𝑡, 𝐾𝑖𝑡) is assumed constant,5 then the transformations studied in Section 2.3 can be used for estimation of the parameters and average partial effects under weak assumptions on the heterogeneity term.
This example serves as an interesting bridge between the linear and nonlinear specifications as production theory can be stated in the above nonlinear fashion, but production function estimation is often carried out after log-linearization, for which the results of Im et al. (1999) would apply. The specific form of the error is reminiscent of a stochastic frontier model with a time-invariant inefficiency term. See Section V of Amsler, Lee, and Schmidt (2009). ■

5 The value of 𝐸(𝜖𝑖𝑡 | 𝐿𝑖𝑡, 𝐾𝑖𝑡) is allowed to differ over time as long as it is not a function of observables. The researcher can then just specify time dummies in the mean function to capture the temporal change.

For the general treatment of the paper, I consider transformations of the mean function which provide moment conditions for estimating 𝜷0. Assumption MAT characterizes such matrix transformations:

Assumption MAT: Let 𝐿 ≤ 𝑇, and let 𝑨(𝒙, 𝜷) be an 𝐿 × 𝑇 matrix which satisfies

𝑨(𝒙𝑖, 𝜷0) 𝐸(𝒚𝑖 | 𝒙𝑖, 𝒄𝑖) = 0    (2.2.2)

and is differentiable in 𝜷 over int(𝚯) for every 𝒙 ∈ X. ■

𝑨 acts as a residual maker matrix, annihilating the conditional mean at the true parameter value 𝜷0. I assume 𝐿 ≤ 𝑇 which corresponds to the examples studied in Section 2.3. While 𝐿 > 𝑇 is theoretically possible and would rely on the same theory of g-inverses employed in this paper, I do not consider such a case. In fact, cases of the examples in Section 2.3 where 𝐿 > 𝑇 often correspond to linearly dependent and hence redundant sets of moment conditions. Under the previous assumptions,

𝐸(𝑨(𝒙𝑖, 𝜷0) 𝒚𝑖 | 𝒙𝑖) = 0    (2.2.3)

by iterated expectations. We can thus use equation (2.2.3) as the basis of a GMM estimator of 𝜷0, where any function of 𝒙𝑖 can be used as instruments for 𝑨(𝒙𝑖, 𝜷0) 𝒚𝑖 to improve efficiency. Note that 𝑨 could contain external instrumental variables which do not appear in the mean function. This more general case is considered in Section 2.2.2. The following Lemma demonstrates a useful fact for characterizing information equivalent transformations and has clear parallels in the linear model case. First define 𝒎𝑖(𝜷) = (𝑚1(𝒙𝑖1, 𝜷, 𝒄𝑖), ..., 𝑚𝑇(𝒙𝑖𝑇, 𝜷, 𝒄𝑖))′.

Lemma 2.2.1. Suppose 𝑨(𝒙, 𝜷) is an 𝐿 × 𝑇 matrix satisfying Assumption MAT. Then for any (𝒙0, 𝒄0) ∈ X × C such that |𝑚𝑡(𝒙𝑡0, 𝜷0, 𝒄0)| > 0 for some 𝑡, Rank(𝑨(𝒙0, 𝜷0)) < 𝑇.

Proof. 𝑨(𝒙0, 𝜷0)𝒎(𝜷0) = 0 by Assumption MAT, where 𝒎(𝜷0) stacks the means 𝑚𝑡(𝒙𝑡0, 𝜷0, 𝒄0). As |𝑚𝑡(𝒙𝑡0, 𝜷0, 𝒄0)| > 0 for some 𝑡, 𝒎(𝜷0) ≠ 0, so 𝑨(𝒙0, 𝜷0) has a nontrivial null space, and hence its rank is less than 𝑇. □

The theory for choosing optimal instruments is well-known: when the conditional variance is nonsingular, the optimal GMM estimator uses instruments

(Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)⁻¹ 𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖))′.

However, in most nontrivial cases when 𝑨 is 𝑇 × 𝑇, the conditional variance matrix of 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 is singular even when Var(𝒚𝑖 | 𝒙𝑖) is nonsingular. I make one additional assumption on the transformation studied which allows for such a generality. Assumption SYS specifies consistency of a particular linear system which is necessary for the definition of the asymptotic efficiency bound. It will allow us to use a certain class of generalized inverses when the conditional variance is singular.
Assumption SYS: The system

Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖) 𝑭(𝒙𝑖) = 𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)

is consistent in 𝑭(𝒙𝑖) and 𝐸(𝑭(𝒙𝑖)′ Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖) 𝑭(𝒙𝑖)) is nonsingular for a given solution. ■

Consistency of a linear system only requires the existence of a solution and not necessarily uniqueness. In fact, Section 2.3 considers relevant cases for which uniqueness does not hold. Assumption SYS is posed in Newey (2001) for studying censored and truncated regression. It holds trivially when the conditional variance is nonsingular, in which case the unique solution is Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)⁻¹ 𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖). The results in Chamberlain (1987) and Newey (2001) show that the semiparametric efficiency bound for estimating 𝜷0 using equation (2.2.3) and Assumptions CM, MAT, and SYS is

[𝐸(𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)′ Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)⁻ 𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖))]⁻¹    (2.2.4)

where “⁻” denotes a symmetric g-inverse.6 That is, no √𝑁-consistent estimator of 𝜷0 based off of equation (2.2.3) has a smaller asymptotic variance than (2.2.4). Theorem 5.2 in Newey (2001) shows that the efficiency bound in (2.2.4) is invariant to choice of symmetric g-inverse under Assumption SYS. If the conditional variance is nonsingular, then the g-inverse can be replaced by a proper inverse as in Chamberlain (1987). Otherwise any g-inverse will work as long as the consistency assumption holds.
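The g-inverse appearing in (2.2.4) is easy to illustrate numerically; in the following sketch, the rank-one Ω is a made-up stand-in for a singular conditional variance, and the Moore-Penrose pseudoinverse serves as one valid symmetric g-inverse:

```python
import numpy as np

# A singular, symmetric stand-in for Var(A(x_i, b0) y_i | x_i):
v = np.array([[1.0], [2.0], [3.0]])
Omega = v @ v.T                      # rank one, so no ordinary inverse exists
assert np.linalg.matrix_rank(Omega) == 1

# The Moore-Penrose pseudoinverse satisfies Omega @ G @ Omega = Omega,
# which is exactly the defining property of a g-inverse:
G = np.linalg.pinv(Omega)
assert np.allclose(Omega @ G @ Omega, Omega)
```

Any other matrix G satisfying the displayed property would serve equally well in (2.2.4) under Assumption SYS; the pseudoinverse is simply a convenient, always-available choice.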
The matrix in (2.2.4) is also equivalent to the asymptotic variance of the GMM estimator based off of the moment conditions in (2.2.3) which uses the optimal instruments (Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)⁻ 𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖))′. The system is just identified and so no weight matrix is required for the asymptotic bound. Realizing this efficiency bound is the subject of Section 2.4. The rest of the paper is concerned with studying transformations of the observed data which provide the same semiparametric efficiency bound as defined in (2.2.4). The following definition characterizes the types of transformations I consider:

Definition: Let Assumption CM hold, and let 𝑨(𝒙𝑖, 𝜷) and 𝑩(𝒙𝑖, 𝜷) be 𝐿 × 𝑇 and 𝑀 × 𝑇, respectively. Given 𝑨 and 𝑩 satisfy Assumptions MAT and SYS, the matrices are information equivalent transformations if their semiparametric efficiency bounds given by (2.2.4) are equal. ■

Information equivalence defined above is an equivalence relation on the set of 𝐾 × 𝐾 real-valued matrices since it is defined via matrix equalities. This fact will be used in Section 2.3 to show information equivalence between general forms of applied transformations, since equivalence is transitive and it is easiest to evaluate the information bound in relation to the generalized within transformation. Information equivalence is similar to the definition of redundancy of moment conditions as given by Breusch et al. (1999). However, the results in this paper are not direct consequences of their redundancy results as I allow the moment conditions to have singular covariance matrices, which directly applies to the examples in Section 2.3.

6 A g-inverse for a matrix 𝛀 is a matrix 𝛀⁻ such that 𝛀𝛀⁻𝛀 = 𝛀. This condition is weaker than that defining the Moore-Penrose inverse, which requires three additional properties.
It is worth noting that the Moore-Penrose inverse is unique, but a g-inverse need not be; this fact will be used to prove the main results in Section 2.2.2. For a general treatment of g-inverses, see Rao and Mitra (1971).

2.2.2 General equivalence result

I now prove the general unifying theory of information equivalence. Consider the empirical setting proposed in Section 2.2.1 where Assumption CM holds. I suppose there is a 𝑇 × 𝑇 matrix 𝑴(𝒛𝑖, 𝜷) satisfying Assumptions MAT and SYS where 𝒛𝑖 is allowed to include any element of 𝒙𝑖 as well as outside instruments. Dropping the arguments and writing 𝑴𝑖 = 𝑴(𝒛𝑖, 𝜷0) for simplicity, we have the following moment conditions:

𝐸(𝑴𝑖 𝒚𝑖 | 𝒛𝑖) = 0    (2.2.5)

Equation (2.2.5) includes the case of unconditional moment restrictions. I denote 𝑽𝑖 = 𝐸(𝒚𝑖 𝒚′𝑖 | 𝒛𝑖). I now consider transformations which still yield valid moment conditions. Let 𝑩𝑖 = 𝑩(𝒛𝑖, 𝜷0) be a 𝐽 × 𝑇 matrix such that 𝐸(𝑩𝑖 𝒚𝑖 | 𝒛𝑖) = 0. Now I make the following assumptions which are pivotal for the general result of this section, and thus refer to them as Assumptions GR.1 and GR.2.

Assumption GR.1: 𝑩𝑖 𝑴𝑖 = 𝑩𝑖 and Rank(𝑴𝑖 𝑽𝑖 𝑴′𝑖) = Rank(𝑴𝑖) = 𝐽 < 𝑇. ■

Assumption GR.2: Rank(𝑩𝑖 𝑽𝑖 𝑩′𝑖) = Rank(𝑩𝑖) = 𝐽. ■

The notation for 𝑴𝑖 in Assumption GR.1 is motivated by the standard notation for a residual maker matrix. In fact, one possible sufficient condition for Assumption GR.1 is that Rank(𝑽𝑖) = 𝐽 and that 𝑽𝑖 shares a null space with 𝑩𝑖. This assumption would also suffice for Assumption GR.2 since 𝑩′𝑖 spans the column space of 𝑽𝑖, and is relevant in linear panel models with additive heterogeneity.
We can then let 𝑴𝑖 be a residual maker matrix from regressing on a basis vector for the null space of 𝑩𝑖. Another relevant setting to this paper is when 𝑴𝑖 = 𝑰𝑇 − 𝑷𝑖 where 𝑷𝑖 has rank 𝑇 − 𝐽 and 𝑩𝑖 𝑷𝑖 = 0. This setting characterizes the nonlinear models studied in Section 2.3 and is also sufficient for Assumptions GR.1 and GR.2. Given the discussion above, I now prove a lemma which is essential to the proof of the general equivalence result.

Lemma 2.2.2. 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 is a g-inverse of 𝑴𝑖 𝑽𝑖 𝑴′𝑖.

Proof. Using 𝑩𝑖 𝑴𝑖 = 𝑩𝑖 (and hence 𝑴′𝑖 𝑩′𝑖 = 𝑩′𝑖),

𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 𝑴𝑖 𝑽𝑖 𝑴′𝑖 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 = 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 𝑽𝑖 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 = 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖.

Since Rank(𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖) = 𝐽 by Assumption GR.2 and Rank(𝑴𝑖 𝑽𝑖 𝑴′𝑖) = 𝐽 by Assumption GR.1, 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 is a g-inverse of 𝑴𝑖 𝑽𝑖 𝑴′𝑖 by Theorem 2.6 of Rao and Mitra (1971). □

Theorem 2.2.1. The equality

𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 = 𝑴′𝑖 (𝑴𝑖 𝑽𝑖 𝑴′𝑖)⁻ 𝑴𝑖    (2.2.6)

holds for any choice of matrix 𝑩𝑖 satisfying Assumptions GR.1 and GR.2 for the same 𝑴𝑖 and for any g-inverse of 𝑴𝑖 𝑽𝑖 𝑴′𝑖.

Proof. By Rao and Mitra (1971, p. 603), the expression

𝑴′𝑖 (𝑴𝑖 𝑽𝑖 𝑴′𝑖)⁻ 𝑴𝑖    (2.2.7)

is invariant to the choice of g-inverse as Rank(𝑴𝑖 𝑽𝑖 𝑴′𝑖) = Rank(𝑴𝑖) by Assumption GR.1.
Since 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 is such a g-inverse by Lemma 2.2.2 and 𝑩𝑖 𝑴𝑖 = 𝑩𝑖, we have

𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 = 𝑴′𝑖 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 𝑴𝑖 = 𝑴′𝑖 (𝑴𝑖 𝑽𝑖 𝑴′𝑖)⁻ 𝑴𝑖

which is independent of 𝑩𝑖. □

Equation (2.2.6) of Theorem 2.2.1 provides the framework for evaluating information equivalence. To see how, I include an additional orthogonality assumption which simplifies the efficiency bound in (2.2.4).

Assumption ORTH: 𝑨(𝒙𝑖, 𝜷) is an 𝐿 × 𝑇 matrix, 𝐿 ≤ 𝑇, such that 𝑨(𝒙𝑖, 𝜷)𝒎𝑖(𝜷) = 0 for all 𝜷 in some open ball about 𝜷0. ■

Assumption ORTH is clearly sufficient for Assumption MAT. The transformations studied in the next section satisfy Assumption ORTH for all values of 𝜷 ∈ R𝐾 for which the mean function is well-defined. However, it only needs to hold on a relatively small open and connected set so that it applies with respect to differentiation. Note that ORTH does not say anything about point identification of 𝜷0. Assumption CM guarantees 𝐸(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖) = 0 only at 𝜷0 because 𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝒄𝑖) = 𝑚𝑡(𝒙𝑖𝑡, 𝜷0, 𝒄𝑖). I also note that every transformation considered in Section 2.3 satisfies Assumption ORTH. The following lemma is a consequence of Assumption ORTH which greatly simplifies the bound in (2.2.4).

Lemma 2.2.3. Let 𝑨(𝒙𝑖, 𝜷) satisfy Assumption ORTH. Then under regularity conditions which allow us to pass the gradient operator through the conditional expectation,

𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖) = 𝑨(𝒙𝑖, 𝜷0) ∇𝜷 𝒎𝑖(𝜷0)

Proof. See Appendix for proof. □

Lemma 2.2.3 greatly simplifies the efficiency bound in (2.2.4).
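The equality in (2.2.6) can be checked numerically for the familiar linear pair, taking the within matrix as 𝑴𝑖 and the first-differencing matrix as 𝑩𝑖 (which satisfy 𝑩𝑖𝑴𝑖 = 𝑩𝑖). A sketch under assumed test data: the positive definite 𝑽 below is arbitrary, and the Moore-Penrose pseudoinverse serves as one choice of g-inverse:

```python
import numpy as np

T = 3
M = np.eye(T) - np.ones((T, T)) / T    # within transformation, rank T - 1
B = np.diff(np.eye(T), axis=0)         # first-differencing, (T-1) x T
assert np.allclose(B @ M, B)           # Assumption GR.1: B M = B

rng = np.random.default_rng(0)
A = rng.normal(size=(T, T))
V = A @ A.T + np.eye(T)                # an arbitrary nonsingular stand-in for V_i

lhs = B.T @ np.linalg.inv(B @ V @ B.T) @ B
rhs = M.T @ np.linalg.pinv(M @ V @ M.T) @ M   # pinv is one valid g-inverse
assert np.allclose(lhs, rhs)           # equation (2.2.6) holds numerically
```

Replacing B with any other full-rank (T−1) × T matrix annihilating the vector of ones leaves the right-hand side unchanged, which is the invariance the theorem asserts.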
The lemma also allows us to say something about finite sample equivalence among certain types of transformations. I summarize these results here:

Corollary 2.2.1. Let 𝑨(𝒙𝑖, 𝜷) be an 𝐿 × 𝑇 matrix satisfying Assumptions MAT, SYS, and ORTH. Then 𝑨(𝒙𝑖, 𝜷0) has the following efficiency bound:

[𝐸(∇𝜷 𝒎𝑖(𝜷0)′ 𝑨(𝒙𝑖, 𝜷0)′ (𝑨(𝒙𝑖, 𝜷0) 𝐸(𝒚𝑖 𝒚′𝑖 | 𝒙𝑖) 𝑨(𝒙𝑖, 𝜷0)′)⁻ 𝑨(𝒙𝑖, 𝜷0) ∇𝜷 𝒎𝑖(𝜷0))]⁻¹    (2.2.8)

Corollary 2.2.2. Suppose 𝑨𝑖 and 𝑩𝑖 are 𝐽 × 𝑇 matrices and 𝑴𝑖 is a 𝑇 × 𝑇 matrix such that Assumptions GR.1 and GR.2 hold for 𝑨 and 𝑩. Further suppose 𝑨𝑖, 𝑩𝑖, 𝑴𝑖, and the conditional gradient ∇𝜷 𝐸(𝒚𝑖 | 𝒛𝑖) are independent of 𝜷. Then

∇𝜷 𝒎′𝑖 𝑨′𝑖 (𝑨𝑖 𝑽𝑖 𝑨′𝑖)⁻¹ 𝑨𝑖 𝒎𝑖(𝜷) = ∇𝜷 𝒎′𝑖 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 𝒎𝑖(𝜷)    (2.2.9)

for any value of 𝜷 in 𝒎𝑖(𝜷).

Corollary 2.2.1 allows us to directly apply the result from Theorem 2.2.1 to the relevant cases in Section 2.3. For information equivalence, it will suffice to show that the relevant transformations satisfying Assumptions MAT, SYS, and ORTH only need to satisfy a rank assumption to be information equivalent. The choice of 𝑴 will become apparent based on the empirical setting. Corollary 2.2.2 gives an even more powerful result than equivalence of efficiency bounds.
For example, if the moment conditions in (2.2.5) are conditional on 𝒙𝑖, the efficient GMM estimator of 𝜷0, say 𝜷̂, solves

∑_{𝑖=1}^{𝑁} ∇𝜷 𝒎′𝑖 𝑴′𝑖 (𝑴𝑖 𝑽𝑖 𝑴′𝑖)⁻ 𝑴𝑖 𝒎𝑖(𝜷̂) = 0    (2.2.10)

Corollary 2.2.2 tells us that the efficient estimators based off of 𝐸(𝑨𝑖 𝒚𝑖 | 𝒙𝑖) and 𝐸(𝑩𝑖 𝒚𝑖 | 𝒙𝑖) are algebraically equivalent. When the transformations are themselves functions of the parameters, implementation of the efficient instruments depends on first-stage estimators whereas the transformation 𝑨𝑖 𝒎𝑖 depends on the FOC solution, so the results only hold asymptotically. The proof of Theorem 4.2 in Im et al. (1999) uses a specific form of the argument in the proof above. This fact suggests further applications to panel data transformations with strictly exogenous covariates which I explore in the next section.

2.3 Examples of information equivalence

This section considers the application of Theorem 2.2.1 to a variety of interesting empirical settings.

2.3.1 Multiplicative heterogeneity

I now consider the case of a single multiplicative heterogeneous effect:

𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝑐𝑖) = 𝑐𝑖 𝑚𝑡(𝒙𝑖𝑡, 𝜷0)    (2.3.1)

This specification has grown in popularity in recent years. For example, see Krapf, Ursprung, and Zimmermann (2017), Fischer, Royer, and White (2018), Castillo, Mejía, and Restrepo (2020), Schlenker and Walker (2016), McCabe and Snyder (2014, 2015), and Williams et al. (2020). The most common specification of equation (2.3.1) in practice is the exponential mean function as demonstrated in Example 2. Often the data generating process is a count variable with a mass point at zero, but the model can apply to any nonnegative outcome. This typically implies 𝑚𝑡(𝒙, 𝜷0) > 0 for all 𝒙 ∈ X, which the rank assumptions made in this section will imply.
I consider the following generalized residual functions, first introduced in Example 2:

u_it(β) = y_it − (∑_{s=1}^T y_is) p_it(β) (2.3.2)

r_{i,t,s}(β) = y_it − y_is m_t(x_it, β)/m_s(x_is, β) (2.3.3)

where p_it(β) = m_t(x_it, β)(∑_{s=1}^T m_s(x_is, β))⁻¹. Equation (2.3.2) is reminiscent of the linear within transformation. However, the transformation in the linear case demeans using the time averages, whereas the generalized within transformation weights by the pseudo-probability p_it(β). The generalized differencing residual in equation (2.3.3) allows a large number of differencing procedures, including next- and first-differencing as well as differencing one time period from the others, in which t is fixed and s is allowed to vary. Any other arbitrary pattern of generalized differencing is allowed so long as it produces a full rank transformation. In contrast to the linear model with an additive effect, the transformations in equations (2.3.2) and (2.3.3) do not eliminate the heterogeneity but still create valid moment conditions. For example, taking the mean of equation (2.3.3) conditional on (x_i, c_i) gives

E(r_{i,t,s}(β0)|x_i, c_i) = c_i m_t(x_it, β0) − c_i m_s(x_is, β0) m_t(x_it, β0)/m_s(x_is, β0) = c_i(m_t(x_it, β0) − m_t(x_it, β0)) = 0

which still yields conditional moment restrictions.
Define the respective T × 1 and (T − 1) × 1 residual vectors

u_i(β) = (I_T − p_i(β)1′) y_i (2.3.4)

r_i(β) = D_i(β) y_i (2.3.5)

where 1 is a T × 1 vector of ones and D_i(β) is the (T − 1) × T weighted generalized differencing matrix which yields the desired residuals as in (2.3.3). I refer to the transformations in equations (2.3.4) and (2.3.5) as the generalized within and generalized differencing transformations, respectively. Then an iterated expectations argument shows E(u_i(β0)|x_i) = 0 and E(r_i(β0)|x_i) = 0. Thus equations (2.3.4) and (2.3.5) satisfy Assumption MAT and suggest moment conditions for efficient GMM estimation which could reach their respective efficiency bounds in (2.2.4). As discussed in the Introduction, equation (2.3.4) is the foundation of the FEP estimator. The FEP is defined in Hausman et al. (1984) as the MLE of a conditional Multinomial distribution with probability and count parameters p_i(β0) = (p_i1(β0), ..., p_iT(β0))′ and n_i. Wooldridge (1999) shows that the FEP is consistent under Assumption CM using the fact that equation (2.3.4) has a zero conditional mean at β0 regardless of the true distribution of y_i|x_i. This robustness result helped lead to its proliferation in empirical research. As for efficiency, Hahn (1997) shows that the FEP is asymptotically efficient under the full set of Multinomial distributional assumptions. Verdier (2018) strengthens this result substantially by showing efficiency under just zero conditional correlation and conditional mean-variance equality. Brown and Wooldridge (2021) extend this result to allow arbitrary constant conditional mean-variance dispersion. Equation (2.3.5) was first studied by Chamberlain (1992) and Wooldridge (1997) in the context of next-differencing for nonlinear models.
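To make the transformations concrete, here is a small numerical sketch (my own illustration; the exponential mean, the parameter values, and the Poisson draws are assumptions for the demo, not part of the text) of the generalized within and next-differencing residuals in (2.3.2)-(2.3.5). Both have conditional mean zero at β0 even though neither eliminates c_i.

```python
import numpy as np

rng = np.random.default_rng(1)
T, beta0, c = 4, 0.5, 2.0                 # c plays the role of c_i
x = rng.standard_normal(T)                # one unit's covariates, held fixed
m = np.exp(x * beta0)                     # exponential mean m_t(x_t, beta0)

# Generalized within: u_i = (I_T - p_i 1') y_i with p_it = m_t / sum_s m_s
p = m / m.sum()
W = np.eye(T) - np.outer(p, np.ones(T))

# Generalized next-differencing: r_{i,t,t+1} = y_t - y_{t+1} m_t / m_{t+1}
D = np.zeros((T - 1, T))
for t in range(T - 1):
    D[t, t], D[t, t + 1] = 1.0, -m[t] / m[t + 1]

# Exact algebra: both transformations annihilate E(y_i | x_i, c_i) = c*m
print(np.allclose(W @ (c * m), 0), np.allclose(D @ (c * m), 0))

# Simulation: average the residuals over many Poisson draws for this unit;
# both should be near zero up to sampling error
y = rng.poisson(c * m, size=(500_000, T))
print(np.abs(W @ y.mean(axis=0)).max() < 0.05)
print(np.abs(D @ y.mean(axis=0)).max() < 0.05)
```

The exact checks mirror the conditional-mean derivation after (2.3.3); the simulated averages illustrate the same moment conditions in finite samples.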
It can also allow for estimation of β0 under weaker forms of exogeneity, like sequential exogeneity in the next-differencing case of s = t + 1, rather than the strict exogeneity implied by Assumption CM. Sequential exogeneity allows the researcher to specify lag dynamics in the mean function, which violate strict exogeneity. However, remarkably less is known about efficient estimation based on equation (2.3.5) when compared to equation (2.3.4) in the context of strict exogeneity as studied here.

The transformations defined in (2.3.4) and (2.3.5) are clearly not the only transformations which satisfy Assumption MAT. Consider the residual maker matrix from regressing on the mean function defined by equation (2.3.1): (I_T − m_i(β)(m_i(β)′m_i(β))⁻¹m_i(β)′). This matrix satisfies Assumption ORTH, and thus Assumption MAT, since it is algebraically orthogonal to the mean function by construction. It is also well known that the matrix is symmetric, idempotent, and has rank T − 1. I will refer to this matrix as the residual maker transformation.

The main theorem of this section proves the information equivalence between the generalized within, generalized differencing, and residual maker transformations. This result is similar to Theorem 4.2 of Im et al. (1999), who prove algebraic equivalences of GLS estimators based on strictly exogenous covariates in linear panel data models with additive effects. There are two primary differences between Theorem 2.3.1 in this paper and Theorem 4.2 in Im et al. First, the heterogeneity is multiplicative rather than additive. This difference is not without loss of generality: an additive-heterogeneity model cannot simply be rewritten in multiplicative form, as the implied heterogeneity would have time variation⁷. Second, Im et al. show an algebraic equivalence between the estimators studied, while I show an asymptotic equivalence.
As mentioned after Theorem 2.2.1, finite sample equality will not necessarily follow when the transformations are functions of the parameter β0 and require a first-step estimator to implement. By Lemma 2.2.1, the conditional variance of the generalized within transformation is necessarily singular, so I will need to show that its efficiency bound is well-defined and invariant to the choice of symmetric g-inverse. Lemma 1 of Verdier (2018) shows that it has rank T − 1 at the true parameter value. This fact suggests that deleting a row to remove the rank degeneracy leads to a transformation with a nonsingular variance matrix. Im et al. (1999) take this approach when showing equivalence between the within and differenced linear estimators. Let Q be a (T − 1) × T matrix which removes an arbitrary row from a given T × T matrix. Then the transformation Q(I_T − p_i(β0)1′) is the generalized within transformation with an arbitrary row deleted. A similar procedure can be used to make the residual maker transformation full rank. The main result will show that information equivalence is invariant to the row deleted.

Lemma 2.3.1 will show that the efficiency bounds of the within and residual maker transformations are well-defined. First I assume that E(y_i y_i′|x_i) is strictly positive definite, a weaker assumption than the conditional variance of y_i itself being positive definite. Under this assumption, the conditional variance of the generalized differencing transformation is nonsingular under a rank condition provided below. Before I can verify Assumption SYS, I will need an additional rank assumption for each respective transformation.

⁷ If y_it = m_t + u_i, then rewriting this as y_it = c_i m_t implies c_i = (m_t + u_i)/m_t, which depends on the time period specified.
Assumption RK.1: Rank(D_i(β0)) = T − 1. ■

Assumption RK.1 states that the differencing matrix has full row rank. It requires that none of the differences used for estimation are redundant in the sense that some row or rows are linear combinations of the others. Necessarily the researcher cannot reuse rows, and if y_it is differenced from y_is, then y_is cannot be differenced from y_it. Further, we must have s ≠ t for each row so that D does not have any zero rows. For example, including all pairwise differences leads to linear dependence, which causes RK.1 to fail.

Assumption RK.2: Let Σ_i = E(y_i y_i′|x_i) be positive definite. Define

V_i⁻ = Σ_i⁻¹ − (1/a_i) Σ_i⁻¹ m_i(β0) m_i(β0)′ Σ_i⁻¹

where a_i = m_i(β0)′ Σ_i⁻¹ m_i(β0). Then the square matrix E(∇_β m_i(β0)′ V_i⁻ ∇_β m_i(β0)) has full rank. ■

V_i⁻ is a symmetric g-inverse of Var((I_T − p_i(β0)1′) y_i|x_i). In fact, it also satisfies the property

V_i⁻ [Var((I_T − p_i(β0)1′) y_i|x_i)] V_i⁻ = V_i⁻ (2.3.6)

as shown in Lemma 2 of Verdier (2018), so it is a reflexive and clearly symmetric g-inverse. Assumption RK.2 suffices for the bound in (2.2.4) to exist: I show in the next lemma that V_i⁻ ∇_β m_i(β0) is a solution to the system in Assumption SYS. This fact, along with the fact that V_i⁻ m_i(β0) = 0 and Lemma 2.2.3, gives the bound in (2.2.4) as the expectation above. The following lemma shows that all transformations studied satisfy Assumption SYS and so any symmetric g-inverse will suffice.

Lemma 2.3.1.
Suppose Assumptions CM, RK.1, and RK.2 hold and that E(y_i y_i′|x_i) is positive definite. Then the generalized differencing, generalized within, and residual maker transformations satisfy Assumption SYS. Further, either of the T × T transformations with any arbitrary row deleted also satisfies Assumption SYS.

Proof. See Appendix for proof. □

The main consequence of Lemma 2.3.1 is that the asymptotic efficiency bound is well-defined and invariant to the symmetric g-inverse for all of the transformations studied in this section. Now I can formally state the application of the main equivalence theorem to the transformations studied in this section. First note that Assumptions CM, RK.1, RK.2, and the positive definiteness of E(y_i y_i′|x_i) are sufficient for each of the transformations studied to satisfy Assumptions SYS and ORTH (and thus MAT), so that their asymptotic efficiency bounds are well-defined and given by (2.2.8).

Theorem 2.3.1. Suppose Assumptions CM, RK.1, and RK.2 hold and that E(y_i y_i′|x_i) is positive definite. Then (I_T − p_i(β0)1′), D_i(β0), (I_T − m_i(β0)(m_i(β0)′m_i(β0))⁻¹m_i(β0)′), Q(I_T − p_i(β0)1′), and Q(I_T − m_i(β0)(m_i(β0)′m_i(β0))⁻¹m_i(β0)′) are information equivalent and invariant to the row deleted by Q.

Proof. See Appendix for proof. □

The proof of Theorem 2.3.1 is independent of which row is deleted in choosing Q and of the type of differencing chosen in D satisfying Assumption RK.1, reinforcing the importance of the rank assumptions. As in Theorem 2.2.1, transformations with rank L < T can be shown to be information equivalent via a similar argument, but this fact is not directly relevant to the current results.
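The content of Theorem 2.3.1 and of Assumption RK.2 can be checked numerically. The sketch below (my own construction; Σ and the mean values are arbitrary stand-ins) builds the generalized within, residual maker, generalized next-differencing, and row-deleted within transformations, verifies that they share the same bound kernel A′(A Σ A′)⁻A, and confirms that the common value is exactly the g-inverse V⁻ from Assumption RK.2 with its reflexivity property (2.3.6).

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5

S = rng.standard_normal((T, T))
Sigma = S @ S.T + T * np.eye(T)          # stand-in for E(y_i y_i'|x_i), p.d.
m = np.exp(rng.standard_normal(T))       # mean function values m_t > 0

def bread(A):
    """A'(A Sigma A')^+ A: the kernel of the efficiency bound in (2.2.8)."""
    return A.T @ np.linalg.pinv(A @ Sigma @ A.T) @ A

# Generalized within: I_T - p 1' with p = m / (1'm)
W = np.eye(T) - np.outer(m / m.sum(), np.ones(T))
# Residual maker: I_T - m (m'm)^{-1} m'
M = np.eye(T) - np.outer(m, m) / (m @ m)
# Generalized next-differencing: row t is e_t' - (m_t / m_{t+1}) e_{t+1}'
D = np.zeros((T - 1, T))
for t in range(T - 1):
    D[t, t], D[t, t + 1] = 1.0, -m[t] / m[t + 1]
# Row-deleted generalized within, Q(I_T - p 1') (drop the first row)
QW = W[1:, :]

base = bread(D)
same = all(np.allclose(bread(A), base) for A in (W, M, QW))
print(same)  # True: all four transformations share one bound kernel

# The common kernel equals V^- from Assumption RK.2, which is reflexive
# against the variance of the within-transformed outcome, as in (2.3.6)
Si = np.linalg.inv(Sigma)
Vminus = Si - np.outer(Si @ m, Si @ m) / (m @ Si @ m)
Var = W @ Sigma @ W.T
print(np.allclose(base, Vminus))
print(np.allclose(Vminus @ Var @ Vminus, Vminus))
```

Each transformation annihilates m and has rank T − 1, so all of them have the same row space; the pseudo-inverse then washes out the particular choice of rows, which is the algebraic heart of the theorem.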
It is also important to note that the list of information-equivalent transformations is not necessarily exhaustive: any T × T or (T − 1) × T matrix with rank T − 1 and the respective orthogonality condition will be information equivalent to the transformations in Theorem 2.3.1 by Theorem 2.2.1. Similar to the discussion after Theorem 2.2.1, the results in Theorem 2.3.1 could also apply to mean functions which have already been transformed. For example, consider the multiplicative random trend from Example 2, y_it = c_i a_i^t m_t(x_it, β0) u_it, where u_it is an idiosyncratic error. If we assume the outcomes are bounded away from zero, we could first divide each outcome by the previous period's outcome. We then have the multiplicative model

y_it* = a_i [m_t(x_it, β0)/m_{t−1}(x_{i,t−1}, β0)] (u_it/u_{i,t−1}).

If u_it/u_{i,t−1} is independent of x_i and a_i with mean 1, we have the model from equation (2.3.1). Then all of the transformations studied here are information equivalent on the transformed outcomes y_i*.

2.3.2 Linear factor model

This section considers linear panels with a factor-augmented error:

y_it = x_it β0 + f_t′ γ_i + u_it (2.3.7)

where f_t is a p × 1 vector of common factors. Stacking the factors into the T × p matrix F = (f_1, ..., f_T)′, Pesaran (2006) adds the additional reduced-form equation

x_i = F Γ_i + v_i (2.3.8)

where Γ_i is a p × K matrix of "factor loadings" and v_i is a T × K matrix of mean-zero idiosyncratic errors. Write z_i = (y_i, x_i). Under the assumptions in Pesaran (2006), equations (2.3.7) and (2.3.8) imply

E(z_i) = F C_Q (2.3.9)

where C_Q is a p × (K + 1) matrix.
Assuming p ≤ K + 1, C_Q is full rank, which suggests that E(z_i) can control for the space spanned by F. The pooled common correlated effects (CCEP) estimator is defined as

β̂_CCEP = (∑_{i=1}^N x_i′ M_F̂ x_i)⁻¹ ∑_{i=1}^N x_i′ M_F̂ y_i (2.3.10)

where F̂ = Z̄ = (1/N) ∑_{i=1}^N (y_i, x_i). Westerlund et al. (2019) show that when T is fixed and N → ∞, M_F̂ →_p M_F − P_{−p}, where P_{−p} is a nonlinear function of the model's errors. When p = K + 1 and the number of cross-sectional averages equals the number of factors, P_{−p} = 0 and so the CCEP removes the factors and nothing else.

Another fixed-T approach comes from Ahn et al. (2013). They do not make the reduced-form assumption in equation (2.3.8). Instead, they introduce new parameters which eliminate F. As both F and γ_i are unobserved, they impose p² normalizations on the factor matrix:

F = (Θ′, −I_p)′ (2.3.11)

where Θ is a (T − p) × p matrix of unrestricted parameters. Let θ = vec(Θ). They then define the quasi-long-differencing (QLD) matrix

H(θ) = (I_{T−p}, Θ)′ (2.3.12)

so that H(θ)′F = 0. The Ahn et al. (2013) technique involves jointly estimating (β0′, θ′)′ with the use of many instruments. Instead, I focus on the QLD transformation and compare it to the asymptotic CCE transformation. Suppose Ω_i = E(u_i u_i′|x_i) is known and has full rank. Define the CCE GLS and QLD GLS estimators as

β̂_CCEGLS = (∑_{i=1}^N x_i′ M_F (M_F Ω_i M_F)⁻ M_F x_i)⁻¹ ∑_{i=1}^N x_i′ M_F (M_F Ω_i M_F)⁻ M_F y_i (2.3.13)
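A minimal simulation sketch of the CCEP estimator in (2.3.10) may help fix ideas. This is my own construction, not from the text: the DGP, sample sizes, and parameter values are assumptions, and the design keeps loadings independent of the errors with homogeneous slopes, a case in which CCEP is consistent.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, beta0 = 5000, 6, 1.0               # K = 1 regressor, p = 1 factor

# Assumed DGP: y_it = x_it*beta0 + f_t*gamma_i + u_it,  x_i = f*Gamma_i + v_i
f = rng.standard_normal(T)               # common factor
gamma = 1.0 + rng.standard_normal(N)     # loadings in y
Gamma = 1.0 + rng.standard_normal(N)     # loadings in x
x = f[None, :] * Gamma[:, None] + rng.standard_normal((N, T))
y = x * beta0 + f[None, :] * gamma[:, None] + rng.standard_normal((N, T))

# Factor proxies: cross-sectional averages of (y, x), a T x (K+1) matrix
Fhat = np.column_stack([y.mean(axis=0), x.mean(axis=0)])
# Annihilator M_Fhat = I_T - Fhat (Fhat'Fhat)^{-1} Fhat'
M = np.eye(T) - Fhat @ np.linalg.solve(Fhat.T @ Fhat, Fhat.T)

# Pooled CCE, equation (2.3.10), written out for a single regressor
num = sum(x[i] @ M @ y[i] for i in range(N))
den = sum(x[i] @ M @ x[i] for i in range(N))
beta_ccep = num / den
print(abs(beta_ccep - beta0) < 0.1)  # close to the truth in this design
```

Note that K + 1 = 2 proxies are projected out even though there is only p = 1 factor, illustrating the point above that CCE generally uses more factor proxies than necessary.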
โˆ’1 ๐‘ โˆ‘๏ธ โˆ‘๏ธ ๐œท b๐‘„๐ฟ๐ท๐บ ๐ฟ๐‘† = ๐’™๐‘–โ€ฒ ๐‘ฏ(๐œฝ)(๐‘ฏ(๐œฝ) โ€ฒ๐›€๐‘– ๐‘ฏ(๐œฝ)) โˆ’1 ๐‘ฏ(๐œฝ) โ€ฒ ๐’™๐‘– ๐’™๐‘–โ€ฒ ๐‘ฏ(๐œฝ)(๐‘ฏ(๐œฝ) โ€ฒ๐›€๐‘– ๐‘ฏ(๐œฝ)) โˆ’1 ๐‘ฏ(๐œฝ) โ€ฒ ๐’š๐‘– ๐‘–=1 ๐‘–=1 (2.3.14) Theorem 2.3.2. Suppose Assumption CM holds, ๐ธ ( ๐’š๐‘– ๐’šโ€ฒ๐‘– |๐’™๐‘– ) is positive definite, and ๐‘…๐‘Ž๐‘›๐‘˜ (๐‘ญ) = ๐‘ < ๐‘‡. Then ๐œท b๐ถ๐ถ๐ธ๐บ ๐ฟ๐‘† = ๐œท b๐‘„๐ฟ๐ท๐บ ๐ฟ๐‘† . Proof. ๐‘…๐‘Ž๐‘›๐‘˜ (๐‘ฏ(๐œฝ)) = ๐‘…๐‘Ž๐‘›๐‘˜ ( ๐‘ด ๐‘ญ ) = ๐‘‡โˆ’๐‘ so ๐‘ด ๐‘ญ ( ๐‘ด ๐‘ญ ๐›€๐‘– ๐‘ด ๐‘ญ ) โˆ’ ๐‘ด ๐‘ญ = ๐‘ฏ(๐œฝ)(๐‘ฏ(๐œฝ) โ€ฒ๐›€๐‘– ๐‘ฏ(๐œฝ)) โˆ’1 ๐‘ฏ(๐œฝ) โ€ฒ by Theorem 1. โ–ก Because ๐‘ฏ(๐œฝ) and ๐‘ด ๐‘ญ are only available asymptotically, the best we can hope to achieve is an asymptotic equivalence result. Further, as discussed earlier, the CCE transformation ๐‘ด ๐‘ญb only converges in probability to ๐‘ด ๐‘ญ when ๐‘ = ๐พ + 1. Other fixed-๐‘‡ approaches in the literature include Robertson and Sarafidis (2015) who parameterize the correlation between the exogenous instruments and the factor loadings. They show that one of their estimators is asymptotically equivalent to the full QLD GMM estimator of Ahn et al. (2013) which suggests a similar efficiency result as Theorem 3. Westerlund (2020) studies the principal components (PC) estimator using the Pesaran (2006) CCE model. PC estimation is essentially fixed effects OLS which estimates the 48 factors and loadings as additional parameters. If the estimator of ๐‘ด ๐‘ญ is consistent for ๐‘ด ๐‘ญ , it can be made asymptotically efficient in the sense of Theorem 2.3.2 and thus a possible efficient alternative to CCE estimation when ๐‘‡ is fixed. 2.3.3 Random trend I now consider a particular factor specification which is common in applied settings. This linear model with additive effects as described in Example 1 of Section 2.2.1. 
takes the form

y_it = c_i + a_i t + x_it β0 + u_it (2.3.15)

Such a model is often called a random trend model because the outcome variable has an unobserved heterogeneous response to the observable time trend⁸. A standard technique for dealing with the heterogeneous trend is to first-difference. Define Δy_it = y_it − y_{i,t−1}, with similar definitions for Δx_it and Δu_it. Then

Δy_it = a_i + Δx_it β0 + Δu_it (2.3.16)

Under the strict exogeneity assumption of Assumption CM, we have E(Δu_it|x_i) = 0 for each t ≥ 2. Thus we have strictly exogenous covariates with an additive heterogeneity term. The most popular technique for estimating β0 in a linear model with additive heterogeneity is fixed effects estimation, which applies the within transformation I_{T−1} − (1/(T−1)) 1_{T−1} 1′_{T−1}, where 1_{T−1} is a (T − 1) × 1 vector of ones, to the first-differenced residuals Δy_it − Δx_it β0.

Another way to eliminate the heterogeneity in equation (2.3.15) is to apply the first-differencing transformation again to equation (2.3.16). This technique is often referred to as second-differencing. The regression is then run for Δy_it − Δy_{i,t−1} on Δx_it − Δx_{i,t−1}. Since the heterogeneous terms correspond to a known intercept and time trend, we can also run a full fixed effects regression on equation (2.3.15) which treats (c_1, ..., c_N, a_1, ..., a_N) as parameters.

One final transformation to consider is the forward orthogonal deviations (FOD) operator in Arellano and Bover (1995). This matrix applies the following transformation to the errors u_it in equation (2.3.16):

((T − t)/(T − t + 1))^{1/2} [u_it − (1/(T − t))(u_{i,t+1} + ... + u_{iT})] (2.3.17)

The transformation can be written in matrix form as

F = diag(((T−1)/T)^{1/2}, ..., (1/2)^{1/2}) ×
[ 1   −(T−1)⁻¹   −(T−1)⁻¹   ...   −(T−1)⁻¹ ]
[ 0   1          −(T−2)⁻¹   ...   −(T−2)⁻¹ ]
[ ⋮              ⋱                 ⋮       ]
[ 0   0          ...         1    −1       ] (2.3.18)

I denote this FOD transformation as the matrix F. For each of the first T − 1 observations, F subtracts off a weighted mean of the remaining observations. While initially studied in the context of sequential exogeneity and predetermined systems like first-differencing, I study it here in the context of strict exogeneity to determine information equivalence. Since I am also assuming the structure in (2.3.16), where first-differencing has already occurred, I consider the (T − 2) × (T − 1) matrix F which corresponds to the definition in equation (2.3.18) but assumes T − 1 dependent variables instead of T. Regardless of the number of time periods considered, F has full row rank, which is T − 2 in this case.

To show information equivalence of the techniques described, let D_1 and D_2 be the respective (T − 1) × T and (T − 2) × (T − 1) full rank first-differencing matrices, W = I_{T−1} − (1/(T−1)) 1_{T−1} 1′_{T−1} be the (T − 1) × (T − 1) within transformation which has rank T − 2, F be the (T − 2) × (T − 1) full rank matrix defined similarly to equation (2.3.18), and M be the T × T residual maker matrix from regressing on (1, t).

⁸ See Section 11.7.1 of Wooldridge (2010).
Then

D_2 D_1 E((y_i − x_i β0)|x_i) = E(D_2 D_1 (y_i − x_i β0)|x_i) = 0 (2.3.19)
W D_1 E((y_i − x_i β0)|x_i) = E(W D_1 (y_i − x_i β0)|x_i) = 0 (2.3.20)
M E((y_i − x_i β0)|x_i) = E(M (y_i − x_i β0)|x_i) = 0 (2.3.21)
F D_1 E((y_i − x_i β0)|x_i) = E(F D_1 (y_i − x_i β0)|x_i) = 0 (2.3.22)

where equations (2.3.19)-(2.3.22) correspond to the residuals from the second-differencing, first-differencing then within, full fixed effects, and first-differencing then forward orthogonal deviations transformations, respectively. Thus each of the transformations satisfies Assumption MAT, and so we can apply the general theory from Section 2.2.2.

Theorem 2.3.3. Suppose Assumption CM holds and E(y_i y_i′|x_i) is positive definite. Then D_2 D_1, W D_1, F D_1, and M are information equivalent.

Proof. As D_1 is full rank, Rank(D_2 D_1) = Rank(W D_1) = Rank(F D_1) = T − 2. Since Rank(M) = T − 2 by definition, the result holds by Theorem 1. □

The simplicity of the proof follows from the general nature of the unified theory proved in Section 2.2 and thus demonstrates its usefulness. In the language of Im et al. (1999), the GLS estimators based on the residuals in equations (2.3.19)-(2.3.22) are algebraically equivalent for a given covariance matrix E(u_i u_i′|X_i). Finally, Theorem 2.3.3 can be seen as a generalization of Theorem 4.3 of Im et al. (1999).

2.4 Practical considerations

The final section of the paper provides useful applications of the results in the previous two sections. I first consider implementation of the efficiency bounds discussed in the paper.
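Theorem 2.3.3 can also be illustrated numerically. The sketch below (my own; Ω is an arbitrary positive definite matrix standing in for the conditional second-moment matrix) builds the four transformations for the random trend model and checks that they share the same bound kernel A′(A Ω A′)⁻A.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 6

def diff(T):
    """(T-1) x T first-differencing matrix."""
    return np.eye(T - 1, T, k=1) - np.eye(T - 1, T)

D1, D2 = diff(T), diff(T - 1)
W = np.eye(T - 1) - np.ones((T - 1, T - 1)) / (T - 1)    # within, (T-1) square
# FOD on T-1 observations: row t nets out the mean of the later rows
FOD = np.zeros((T - 2, T - 1))
for t in range(T - 2):
    k = T - 2 - t
    FOD[t, t] = 1.0
    FOD[t, t + 1:] = -1.0 / k
    FOD[t] *= np.sqrt(k / (k + 1))
# Residual maker from regressing on (1, t): full fixed effects on (2.3.15)
Z = np.column_stack([np.ones(T), np.arange(1, T + 1)])
M = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)

S = rng.standard_normal((T, T))
Omega = S @ S.T + T * np.eye(T)          # arbitrary p.d. second-moment matrix

def bread(A):
    return A.T @ np.linalg.pinv(A @ Omega @ A.T) @ A

base = bread(M)
same = all(np.allclose(bread(A), base) for A in (D2 @ D1, W @ D1, FOD @ D1))
print(same)  # True: all four transformations are information equivalent
```

All four matrices have rank T − 2 and annihilate both the constant and the trend, so they span the same row space, which is exactly the rank condition the theorem relies on.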
Given a transformation A(x_i, β) satisfying Assumptions SYS and ORTH (and thus MAT), I describe the efficient estimator. The estimator β̂_A which solves

∑_{i=1}^N ∇_β m_i(β0)′ A(x_i, β0)′ (A(x_i, β0) E(y_i y_i′|x_i) A(x_i, β0)′)⁻ A(x_i, β̂_A) y_i = 0 (2.4.1)

is √N-asymptotically normal with asymptotic variance equal to the efficiency bound given by equation (2.2.4). First-stage estimation of β0 comes from a GMM estimator with an arbitrary weight matrix. Second, one needs to consistently estimate E(y_i y_i′|x_i). A nonparametric regression estimator can be used in principle, but in practice this estimator may give highly imprecise estimates when T and K are relatively large. In the multiplicative heterogeneity setting, Brown and Wooldridge (2021) provide a simple and attractive parametric framework for the FEP setting. They assume Var(y_it|x_i, c_i) = αE(y_it|x_i, c_i), where α > 0 is an identified coefficient, along with a constant conditional correlation matrix. Asymptotically justified standard errors can be derived using the familiar sample analog of the efficiency bound in (2.2.4).

The researcher can then test the validity of parts of Assumption CM. For strict exogeneity, Wooldridge (2010, Chapter 18) suggests including functions of lead values of the independent variables and running a joint test of significance. This method's most attractive feature is the weakness of its alternative hypothesis: the null maintains strict exogeneity while the alternative is merely that strict exogeneity fails. It is also easy to implement and can be tested in most standard statistical packages. However, there is no guidance on how to choose which regressors to include or their functional forms.
Another possible way to examine strict exogeneity is via a Hausman test. The researcher could choose a competing estimator based on the desired alternative hypothesis. In the nonlinear multiplicative example of Section 2.3.1, suppose the researcher believes that sequential exogeneity holds, that is, E(y_it|x_i1, ..., x_it, c_i) = c_i m_t(x_it, β0). Then the generalized next-differencing transformation D_i(β) = (r_{i,1,2}(β), ..., r_{i,T−1,T}(β))′ still provides valid moment conditions. However, the instruments designed to reach the efficiency bound in (2.2.4) will not be valid under sequential exogeneity alone. Chamberlain (1992b) derives the asymptotic efficiency bound for moment conditions under sequential exogeneity and provides an implementable estimator which reaches said bound. Under the null hypothesis, both estimators are consistent, with the generalized next-differencing estimator as in (2.3.5) being asymptotically efficient. Under the alternative, only Chamberlain's instruments are valid (and in fact asymptotically efficient among √N-asymptotically normal estimators). Thus we can use a Hausman statistic to test the assumption of strict exogeneity.

The Chamberlain estimator described in the Hausman statistic procedure is difficult to implement, as the instruments may be comprised of multiple sums of conditional moments. The researcher will need either to greatly strengthen the assumptions of the model to allow for parametric forms of these moments or to utilize a large number of nonparametric regressions. Either way, this computational burden makes the Chamberlain estimator difficult to implement. Another possible application of the results involves finite-sample and computational concerns.
Phillips (2020) demonstrates that matrix inversion for estimators based on first-differencing can require significantly more computational resources than for those based on forward orthogonal deviations. He demonstrates with simulation evidence that computational time increases quickly with T even for relatively small values of N. While instruments need to satisfy two conditions given in Phillips (2020) which are not necessarily assumed here, I reiterate that the results in Section 2.2 are purely algebraic and can be applied in a large number of settings.

2.5 Conclusion

This paper considers linear transformations of nonlinear panel models with unobserved heterogeneity. When covariates are strictly exogenous in the zero conditional mean sense, such transformations provide uncountably many moment conditions exploitable for estimation. I consider specifically the asymptotic efficiency bound for estimating the model's parameters, which is reached by the optimal choice of instruments. This bound specifies how efficient any √N-asymptotically normal estimator of β0 can possibly be.

Transformations of the data are said to be information equivalent if they yield the same asymptotic efficiency bound. The main result of Section 2.2 is a unified framework for evaluating the efficiency bounds of transformations that provide moment conditions for estimation. It shows that, besides regularity conditions, matrix transformations which yield conditional moment restrictions and have the same rank yield the same information bound. I also simplify the form of the efficiency bound under a general and easily verifiable algebraic orthogonality property, which could potentially help in determining other interesting relationships between instrumental variable estimators.
The general framework is applied to show that the generalized within transformation, which provides the basis of the FEP estimator, is in fact information equivalent to a number of other transformations. These transformations, which include generalizations of varying differencing techniques used in the linear panel data context, such as next-, first-, and long-differencing, as well as the residual maker matrix from regression on the outcome variable's mean function, are only required to satisfy a rank condition for the main theorem to hold. It is also shown that any (T − 1) × T matrix which is algebraically orthogonal to the mean function of the outcome and of full rank is information equivalent, so deleting an arbitrary row from the generalized within transformation to remove the linear redundancy does not lose any information.

I also generalize a result of Im et al. (1999) on linear panels with an additive heterogeneity term to a general factor-augmented error structure as studied in Pesaran (2006), Ahn et al. (2013), and Westerlund (2020). I show that any transformation of the data which is full rank and eliminates the factors is information equivalent. I use this result to show that in the case of a random heterogeneous trend model, first-differencing twice, first-differencing and then applying a within transformation, and the true fixed effects estimator are information equivalent. For arbitrary factor structures, the QLD transformation of Ahn et al. (2013) is information equivalent to the infeasible fixed effects GLS estimator which takes the unobserved effects as known.

The work in this paper provides a basic framework for comparison of parametric estimators for a broad class of nonlinear models. I primarily consider strictly exogenous covariates so I could compare estimators using theoretically efficient instruments. However, the finite sample algebraic results hold regardless of the validity of the instruments.
As such, the main theorem in Section 2.2 can apply to any comparison of efficiency for instrumental variable estimators.

CHAPTER 3

MOMENT-BASED ESTIMATION OF LINEAR PANEL DATA MODELS WITH FACTOR-AUGMENTED ERRORS

3.1 Introduction

The prevalence of panel data in modern economics has led theorists and practitioners to pay more attention to unobserved and interactive heterogeneity. A popular representation of unobserved effects is the linear factor structure ∑_{j=1}^p f_tj γ_ji, where f_tj is a time-varying macro effect or "common factor" and γ_ji is an individually heterogeneous response or "factor loading". In studying the statistical properties of estimators of factor models, most theoretical treatments have relied on asymptotic expansions where the number of time periods T grows large with the number of cross-sectional units N. As the vast majority of microeconometric data sets have only a few time periods, the recent literature assumes T is fixed while N goes to infinity.

One of the most popular approaches is the common correlated effects (CCE) estimator of Pesaran (2006). He assumes that the covariates are a linear function of the common factors plus a matrix of independent idiosyncratic errors. The pooled CCE estimator comes from the OLS regression which estimates unit-specific slopes on the cross-sectional averages of the dependent and independent variables. CCE is similar to a fixed effects treatment which seeks to eliminate the factors and remove a source of both endogeneity and cross-sectional dependence. Consistency and asymptotic normality were originally proved for sequences of N and T going to infinity. Recent work extends the CCE framework to a fixed-T setting. De Vos and Everaert (2021) derive a fixed-T consistency correction for the dynamic CCE estimator but require T → ∞ for asymptotic normality. Westerlund et al.
(2019) provide the first asymptotic normality derivation of pooled CCE when 𝑇 is fixed and 𝑁 → ∞. However, they still maintain stringent assumptions on the model's DGP. For example, they assume that the factor loadings are independent of the idiosyncratic errors. My estimators do not require this assumption for consistency, though making it simplifies the standard errors. Further, the CCE estimator generally uses more factor proxies than necessary, which can lead to inefficiency. Finally, the CCE estimator requires 𝑇 > 𝐾 + 1, which is highly restrictive in microeconometric settings. For example, in an intervention analysis with only pre-treatment, treatment, and post-treatment observations, classical CCE would require the treatment indicator to be the only regressor. Aside from CCE, most existing fixed-𝑇 techniques create moment conditions by including additional parameters to estimate or by eliminating the factors with observed proxies. A few examples include Hayakawa (2012), Ahn et al. (2001, 2013), Robertson and Sarafidis (2015), and Juodis and Sarafidis (2018, 2020)¹. Of these approaches, I focus on Ahn et al. (2013), who define a parameterized quasi-long-differencing (QLD) transformation that eliminates the factor structure. The QLD residuals then form the basis for a GMM estimator which uses all available exogenous variables to generate moment conditions. I focus on the QLD technique for the sake of comparison to CCE, as both approaches eliminate the factor structure and allow for "fixed effects" assumptions. For example, Robertson and Sarafidis (2015) parameterize the correlation between the exogenous variables and the factor loadings. Ahn (2015) points out that if the factor loadings' distributions change over the cross-sectional units, identification in Robertson and Sarafidis (2015) does not hold. Ahn et al.
(2013) do not assume a pure factor structure in the covariates like Pesaran (2006) and leave the distribution of the covariates unspecified. However, the generality of Ahn et al. (2013) comes at the cost of identifying assumptions, which may explain its lack of use in the empirical literature. The QLD GMM estimator requires many moments to identify all the model's parameters. If either 𝑇 or the number of factors is large, their GMM estimator may require outside instruments. Their estimator also requires nonlinear optimization with a large number of moments and parameters. Hayakawa (2016) provides a simple example where the global identifying assumptions fail and there exist local stationary points.

¹ Juodis and Sarafidis (2021) allow for a linear estimator which requires no additional parameters. However, the fixed-𝑇 analysis requires strong assumptions on the loadings which this paper avoids. See Assumption S.1.1(d) in their Appendix.

I synthesize both approaches and weaken both the Pesaran (2006) and Ahn et al. (2013) assumptions. I use a weakened CCE model without any independence assumptions to provide a first-stage estimator of the additional QLD parameters. Using the QLD transformation, I then derive pooled and mean group linear estimators and provide standard errors which are valid even when the heterogeneity is correlated with the model's errors. These novel estimators have desirable rank conditions and do not require outside instruments as in Ahn et al. (2013). They also do not restrict the number of covariates to be less than the number of time periods minus one, an improvement over fixed-𝑇 CCE. Simulations suggest that the linear QLD estimators outperform the CCE and QLD GMM estimators in finite samples. Another potential source of heterogeneity in linear models comes from the slope coefficients on the observed variables of interest.
Pesaran (2006) proves fixed-𝑇 consistency of the mean group CCE estimator under random slopes but assumes they are independent of everything else in the model. Asymptotic normality requires 𝑇 → ∞, and pooled CCE is studied under constant slopes. I prove fixed-𝑇 consistency and asymptotic normality of the new pooled and mean group QLD estimators. I show that the first-stage estimation of the QLD parameters does not affect consistency, which mirrors the pooled OLS result of Wooldridge (2005), who assumes known factors. To the best of my knowledge, this paper is the first to consider arbitrary random slopes in the context of fixed-𝑇 panels with factor-driven endogeneity. The rest of the paper is structured as follows: Section 3.2 describes the main model of interest, which is weaker than that in Westerlund et al. (2019). Section 3.3 provides the assumptions which underlie the model and discusses implementation of the QLD-based estimators. Section 3.4 introduces random slopes. Section 3.5 provides simulation evidence for the finite sample properties of the QLD estimators. Section 3.6 compares the pooled QLD estimator to two-way fixed effects (TWFE) and CCE in estimating the effect of education expenditure on standardized test performance using a school district-level data set from the state of Michigan. Section 3.7 concludes with a brief summary and suggestions for future research.

3.2 Model

This section lays out the models considered in Westerlund et al. (2019) and Ahn et al. (2013), the fixed-𝑇 CCE and QLD approaches respectively. Throughout the paper, the equation of interest is

𝒚𝑖 = 𝑿𝑖𝜷0 + 𝑭0𝜸𝑖 + 𝒖𝑖   (3.2.1)

where 𝒚𝑖 is a 𝑇 × 1 vector of outcomes, 𝑿𝑖 is a 𝑇 × 𝐾 matrix of covariates, 𝑭0 is a 𝑇 × 𝑝0 matrix of factors common to all units in the population, 𝜸𝑖 is a 𝑝0 × 1 vector of factor loadings, and 𝒖𝑖 is a 𝑇 × 1 vector of idiosyncratic shocks.
A โ€˜0โ€™ subscript denotes the true or realized value of an unobserved parameter. ๐‘ 0 is then unobserved because ๐‘ญ0 and ๐œธ๐‘– are unobserved. Later, ๐‘ denotes the number of factors specified by the econometrician. ๐œท0 is the object of interest and the factor structure ๐‘ญ0 ๐œธ๐‘– is treated as a collection of nuisance parameters. This paper defines ๐‘ 0 as the number of factors whose loadings correlate with ๐‘ฟ๐‘– . This interpre- tation is similar to Ahn et al. (2013) and implicit to the CCE model as discussed in the following section. One justification of this interpretation is to write the full error as ๐‘ซ 0 ๐†๐‘– + ๐๐‘– where ๐‘ซ 0 is a possibly infinite dimensional matrix of common factors and ๐๐‘– is a vector of idiosyncratic errors. Then ๐‘ญ0 ๐œธ๐‘– is the set of variables from ๐‘ซ 0 ๐†๐‘– which are correlated with ๐‘ฟ๐‘– and the rest are absorbed into the error. However, it is entirely likely that ๐œธ๐‘– is correlated with the other loadings which are uncorrelated with ๐‘ฟ๐‘– . This correlation can cause problems for inference and is addressed in Section 3.3. Finally, I assume the factors in ๐‘ญ0 are constant for the purpose of asymptotic analysis. The alternative setting is to assume the factors are stochastic and independent of the other terms, or make the modeling assumptions conditional on the sigma-algebra generated by the factors like in Ahn et al. (2013). When ๐‘‡ is fixed, the stochastic nature of the factors is less relevant for the asymptotic arguments. Standard errors do not change as properly studentized test statistics converge to their usual distributions2. As such, I consider the standard microeconometric assumption of random 2 See Section 6 of Andrews (2005). 58 sampling in the cross-section. Hsiao (2018) provides examples of papers which make either the fixed or random assumption on the factors. 3.2.1 Common Correlated Effects The CCE model in Pesaran (2006) and Westerlund et al. 
(2019) adds an additional reduced-form equation which represents the relationship between the covariates and the factor structure:

𝑿𝑖 = 𝑭0𝚪𝑖 + 𝑽𝑖   (3.2.2)

where 𝚪𝑖 is a 𝑝0 × 𝐾 matrix of factor loadings and 𝑽𝑖 is a 𝑇 × 𝐾 matrix of idiosyncratic errors. Westerlund et al. (2019) follow Pesaran (2006) in assuming 𝑽𝑖, 𝚪𝑖, 𝜸𝑖, and 𝒖𝑖 are mutually independent³. Assuming that the idiosyncratic errors have mean zero, CCE estimates the factors with the matrix 𝑭̂ = (𝒚̄, 𝑿̄), where (𝒚̄, 𝑿̄) = (1/𝑁) Σ_{𝑖=1}^{𝑁} (𝒚𝑖, 𝑿𝑖) are the cross-sectional averages of 𝒚𝑖 and 𝑿𝑖. The pooled common correlated effects (CCEP) estimator treats the cross-sectional averages as fixed effects and can be represented as

𝜷̂𝐶𝐶𝐸𝑃 = ( Σ_{𝑖=1}^{𝑁} 𝑿𝑖′𝑴𝑭̂𝑿𝑖 )⁻¹ Σ_{𝑖=1}^{𝑁} 𝑿𝑖′𝑴𝑭̂𝒚𝑖   (3.2.3)

where 𝑴𝑭̂ = 𝑰𝑇 − 𝑭̂(𝑭̂′𝑭̂)⁺𝑭̂′. Here ‘+’ denotes a Moore–Penrose inverse, which can be replaced by a proper inverse in samples where 𝑭̂′𝑭̂ has full rank. Pesaran (2006) derives the CCE estimator under the following intuition: first, write 𝒁𝑖 = (𝒚𝑖, 𝑿𝑖). The two models in equations (3.2.1) and (3.2.2) imply

𝐸(𝒁𝑖) = 𝑭0𝐸(𝑪𝑖)𝑸   (3.2.4)

where 𝑪𝑖 = (𝜸𝑖, 𝚪𝑖) and

𝑸 = [ 1, 0_{1×𝐾} ; 𝜷0, 𝑰𝐾 ]

Thus, 𝑴𝑭̂ asymptotically eliminates the space spanned by 𝑭0, which includes 𝑭0𝜸𝑖. Westerlund et al. (2019) show that 𝑴𝑭̂ generally converges to the space orthogonal to both 𝑭0 and a random term which is a function of the model's idiosyncratic errors.

³ Westerlund et al. (2019) assume the loadings come from a fixed series of constant matrices, which is more general than the Pesaran (2006) assumption that the loadings are iid.
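To fix ideas, the following sketch simulates a DGP satisfying (3.2.1) and (3.2.2) and computes 𝜷̂𝐶𝐶𝐸𝑃 from equation (3.2.3). The DGP, sample sizes, and variable names are illustrative assumptions rather than anything taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K, p0 = 500, 6, 2, 1          # illustrative sizes; note T > K + 1 holds
beta0 = np.array([1.0, -0.5])
F0 = rng.normal(size=(T, p0))        # common factors (treated as constants)

# X_i = F0 Gamma_i + V_i and y_i = X_i beta0 + F0 gamma_i + u_i
Gam = rng.normal(1.0, 1.0, size=(N, p0, K))   # loadings with nonzero mean
gam = rng.normal(1.0, 1.0, size=(N, p0, 1))
X = F0 @ Gam + rng.normal(size=(N, T, K))
y = X @ beta0[:, None] + F0 @ gam + rng.normal(size=(N, T, 1))

# Factor proxies: cross-sectional averages F_hat = (y_bar, X_bar)
Fhat = np.concatenate([y.mean(0), X.mean(0)], axis=1)   # T x (K+1)
M = np.eye(T) - Fhat @ np.linalg.pinv(Fhat)             # annihilator M_Fhat

# Pooled CCE, equation (3.2.3)
A = sum(X[i].T @ M @ X[i] for i in range(N))
b = sum(X[i].T @ M @ y[i] for i in range(N))
beta_ccep = np.linalg.solve(A, b).ravel()
print(beta_ccep)                     # close to beta0 for large N
```

The Moore–Penrose pseudoinverse plays the role of (𝑭̂′𝑭̂)⁺ in the definition of 𝑴𝑭̂; with 𝐾 + 1 > 𝑝0, as here, 𝑭̂ is nearly collinear in large samples, which is exactly the situation where the pseudoinverse convention matters.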
For the sake of simplicity, suppose that 𝑴𝑭̂ converges in probability to 𝑴𝑭0, which is the case when 𝑝0 = 𝐾 + 1. Then the pooled CCE estimator is based on the moment conditions

𝐸(𝑿𝑖′𝑴𝑭0(𝒚𝑖 − 𝑿𝑖𝜷)) = 0

Assuming 𝐸(𝑽𝑖) = 0 as in Pesaran (2006) and Westerlund et al. (2019), the reduced-form portion of the CCE model also implies 𝐸(𝑴𝑭0𝑿𝑖) = 0. Since the CCE approach estimates no parameters in this set of moments, the additional moments are unused by the CCE residual above. I show how these reduced-form moments can be exploited for additional information in Section 3.3. A particularly harsh restriction of the pooled CCE estimator is the rank condition required for the denominator. 𝑴𝑭̂ is a residual-maker matrix, so it has rank 𝑇 − (𝐾 + 1). For the estimator to be well-defined, we require 𝑇 > 𝐾 + 1. This constraint is trivially nonbinding when 𝑇 → ∞ as in the prior literature. However, when 𝑇 is fixed as in this paper, we need 𝐾 < 𝑇 − 1. For example, if we only observe three time periods, we can only incorporate one regressor. Also, when 𝐾 + 1 > 𝑝0, the CCE estimator unnecessarily removes variation from the data which could improve precision of the estimator. I address both of these problems in Section 3.2.2.

3.2.2 Quasi-long-differencing

Ahn et al. (2013) do not assume the pure factor structure in 𝑿𝑖. They start with equation (3.2.1) and then parameterize the factors for the purpose of eliminating them. Before discussing how this process works, I introduce the ‘rotation problem’, a well-known issue in the factor literature. Since both 𝑭0 and 𝜸𝑖 are unobservable, they cannot be separately identified. To see why, consider any nonsingular 𝑝 × 𝑝 matrix 𝑨. Then 𝑭0𝚪𝑖 = 𝑭*𝚪𝑖* where 𝑭* = 𝑭0𝑨 and 𝚪𝑖* = 𝑨⁻¹𝚪𝑖.
We can only hope to identify the factors up to an arbitrary rotation of their linear subspace. Ahn et al. (2013) suggest the following 𝑝0² normalizations based on a row-reduction rotation:

𝑭0 = (𝚯0′, −𝑰𝑝0)′   (3.2.5)

where 𝚯0 is a (𝑇 − 𝑝0) × 𝑝0 matrix of unrestricted parameters. The given normalization is irrelevant because I am not interested in estimating 𝑭0. In this case, I only assume that the factors are full rank; the normalization chosen merely reflects this fact. Given the normalization of the general factor matrix 𝑭0 in equation (3.2.5), Ahn et al. (2013) define the quasi-long-differencing (QLD) matrix

𝑯(𝜽0) = (𝑰𝑇−𝑝0, 𝚯0)′   (3.2.6)

where 𝜽0 = vec(𝚯0). The QLD transformation eliminates the factors for any given 𝜽0: 𝑯(𝜽0)′𝑭0 = 0. This differencing technique allows for the construction of the QLD residual studied in Ahn et al. (2013):

𝐸(𝒘𝑖 ⊗ 𝑯(𝜽0)′(𝒚𝑖 − 𝑿𝑖𝜷0)) = 0   (3.2.7)

where 𝒘𝑖 is a vector of instruments which may contain vec(𝑿𝑖). The normalization in (3.2.5), and implicit in (3.2.6), is only one particular choice of rotation. The Ahn et al. (2013) estimator depends on the choice of normalization, an issue which is unaddressed in the original paper. I discuss this issue in the Appendix and provide potential solutions for the estimators derived in Section 3.2. While Ahn et al. (2013) provide a general framework for estimating 𝜷0 without strong restrictions on the distribution of 𝑿𝑖, it requires at least 𝑝0 + 𝐾/(𝑇 − 𝑝0) instruments in 𝒘𝑖 to identify all of the model's parameters. If some of the variables are not exogenous in each time period, as with weakly exogenous or predetermined variables, or if 𝑝0 is large, we may require outside instruments.
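The key algebraic property 𝑯(𝜽0)′𝑭0 = 0 holds for any 𝚯0 under the normalization (3.2.5) and is easy to verify numerically; the particular 𝚯0 below is an arbitrary illustrative value:

```python
import numpy as np

rng = np.random.default_rng(1)
T, p0 = 5, 2
Theta0 = rng.normal(size=(T - p0, p0))    # unrestricted QLD parameters (illustrative)

# Normalization (3.2.5): F0 stacks Theta0 over -I_{p0}
F0 = np.vstack([Theta0, -np.eye(p0)])     # T x p0

# QLD matrix (3.2.6): H(theta0) stacks I_{T-p0} over Theta0'
H = np.vstack([np.eye(T - p0), Theta0.T]) # T x (T - p0)

# H(theta0)'F0 = Theta0 - Theta0 = 0, whatever Theta0 is
print(np.abs(H.T @ F0).max())             # prints 0.0
```

The product cancels block by block, 𝑯(𝜽0)′𝑭0 = 𝚯0 − 𝚯0, which is why the elimination works for any value of the QLD parameters and not just the true one.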
Hayakawa (2016) demonstrates an example where the objective function based on equation (3.2.7) suffers from non-global stationary points due to the nonlinear nature of estimation with a large number of moments and parameters. The pure factor structure in equation (3.2.4) can thus be used for estimating the parameters in equation (3.2.6). If we assume 𝑿𝑖 = 𝑭0𝚪𝑖 + 𝑽𝑖 where 𝐸(𝑽𝑖) = 0, then

𝐸(𝑯(𝜽0)′𝒁𝑖) = 0   (3.2.8)

and 𝜽0 is identified by equation (3.2.8), which substantially reduces the number of moments needed to identify 𝜷0. I also show explicitly in the following section how and when these additional moments are useful for the purpose of identification and efficiency.

3.3 Estimation

I now state this paper's primary assumptions. The first assumption is similar to the ‘Basic Assumptions’ of Ahn et al. (2013) and is made for the sake of comparison to their approach. The second set specifies the pure factor structure in 𝑿𝑖 similar to Westerlund et al. (2019). I specify the models in the assumptions as the main results of the paper depend on which model is being assumed. Conditional moments hold almost surely.

Assumption 1 (Linear population model):

1. 𝒚𝑖 = 𝑿𝑖𝜷0 + 𝑭0𝜸𝑖 + 𝒖𝑖. ■

Assumption 2 (CCE reduced form equations):

1. 𝑿𝑖 = 𝑭0𝚪𝑖 + 𝑽𝑖.
2. (𝜸𝑖, 𝚪𝑖, 𝑽𝑖, 𝒖𝑖) are independent and identically distributed across 𝑖 with finite fourth moments.
3. 𝐸(𝑽𝑖) = 0 and 𝐸(𝒖𝑖|𝑽𝑖) = 0.
4. Rk(𝑭0) = 𝑝0 and Rk(𝐸([𝜸𝑖, 𝚪𝑖])) = 𝑝0 ≤ 𝐾 + 1. ■

Assumption 1 simply defines the relevant population model. I will not require the strong rank conditions of Ahn et al. (2013), which can be found in the Appendix, nor will I require outside instruments. Assumption 2 specifies the pure factor assumption similar to Pesaran (2006) and Westerlund et al. (2019).
I assume random sampling in the cross-section to simplify the asymptotic analysis, though this restriction is unnecessary. Westerlund et al. (2019) follow the classical CCE approach in assuming independence between all stochastic components of the model, which is unrealistic in microeconometric settings. Further, the asymptotic normality derivation in Westerlund et al. (2019) relies on the assumption that (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝜸𝑖 ⊗ 𝑽𝑖 = 𝑂𝑝(𝑁^{−1/2}). I demonstrate in Section 3.3.2 that it is unnecessary for consistency and asymptotic normality, and how misspecification causes inconsistency in the standard errors and bootstrapped test statistics provided in Westerlund et al. (2019). The factor structure allows us to weaken the Ahn et al. (2013) assumption from 𝐸(𝒖𝑖|𝑿𝑖) = 0 to 𝐸(𝒖𝑖|𝑽𝑖) = 0. Finally, I do not assume the reduced form equation is a conditional mean specification like Westerlund et al. (2019). They assume 𝐸(𝑽𝑖|𝚪𝑖) = 0, where I only need 𝐸(𝑽𝑖) = 0 and place no restrictions on 𝐷(𝑽𝑖|𝒖𝑖). Another way in which QLD can help weaken the CCE model is the relevant order conditions. As described earlier, Westerlund et al. (2019) require 𝑇 > 𝐾 + 1 for CCE estimation, but I will directly use the moments 𝐸(𝑯0′𝒁𝑖) = 0 to remove the factors, which only requires 𝐾 + 1 ≥ 𝑝0, a restriction also made by Pesaran (2006) and Westerlund et al. (2019). Ahn et al. (2013) do not require this condition but assume the existence of outside instruments, which may be infeasible given the application. I also discuss in Section 3.3.2 how to include known factors like a heterogeneous intercept, which decreases the number of relevant factors and makes the assumption even less restrictive.

3.3.1 CCE Moment Conditions

I now look at the moment conditions implied by Assumption 2.
Equation (3.2.8) of Section 3.2, 𝐸(𝑯0′𝒁𝑖) = 0 where 𝒁𝑖 = (𝒚𝑖, 𝑿𝑖), implies that Assumption 2 provides information on 𝜽0, which leads to more efficient estimation of 𝜷0 and provides a first-stage estimator which negates the need for the full joint estimator of Ahn et al. (2013). I first consider identification of 𝜽0 from the pure factor structure alone to show that it in fact yields valid moments. As in Ahn et al. (2013), identification hinges on correctly specifying 𝑝 = 𝑝0, where 𝑝 is the number of factors specified by the econometrician.

Lemma 3.3.1. Under Assumption 2, 𝜽0 is identified by 𝐸(𝑯(𝜽)′𝒁𝑖) = 0 if and only if 𝑝 = 𝑝0.

Proof. Assumption 2(3) implies

𝐸(𝑯(𝜽)′𝒁𝑖) = 𝑯(𝜽)′𝑭0𝐸(𝑪𝑖)𝑸   (3.3.1)

where 𝐸(𝑪𝑖) = 𝐸([𝜸𝑖, 𝚪𝑖]) and 𝑸 is given in Section 3.2.1. 𝑸 is nonsingular and 𝐸(𝑪𝑖) has full row rank by Assumption 2(4), so equation (3.3.1) is zero if and only if 𝑯(𝜽)′𝑭0 = 0. When 𝑝 = 𝑝0, 𝑯(𝜽)′𝑭0 = 𝚯0 − 𝚯, which is zero if and only if 𝜽 = 𝜽0. See the Appendix for the 𝑝 ≠ 𝑝0 cases. □

Remark (Misspecification): A possible reason for the lack of use of CCE estimation among microeconomists is the model in Assumption 2(1). This assumption is in fact not strictly necessary for identifying 𝜽0. Consider the following linear projection: 𝐸(𝒁𝑖) = 𝑭0𝑮 + 𝑬 where 𝑭0′𝑬 = 0. Then 𝜽0 is still identified by the moments 𝐸(𝑯(𝜽)′𝒁𝑖) if 𝑮 has full rank. ■

We can use Lemma 3.3.1 to provide an estimator of 𝜽0 based on the observed data alone. Let 𝑯̂ = 𝑯(𝜽̂), 𝑫𝜽 = 𝐸(∇𝜽 vec(𝑯0′𝒁𝑖)), and 𝑨𝜽 = 𝐸(vec(𝑯0′𝒁𝑖)vec(𝑯0′𝒁𝑖)′).

Theorem 3.3.1.
Suppose Assumption 2 holds, and let 𝜽̂ be the GMM estimator based on 𝐸(vec(𝑯0′𝒁𝑖)) = 0 using a consistent estimator of the optimal weight matrix. Then

1. √𝑁(𝜽̂ − 𝜽0) →d 𝑁(0, (𝑫𝜽′𝑨𝜽⁻¹𝑫𝜽)⁻¹).

Now suppose that 𝑨̂𝜽 →p 𝑨𝜽 using the first-step estimator 𝜽̂. Then

2. If 𝑝0 = 𝑝, then 𝑁⁻¹ (Σ_{𝑖=1}^{𝑁} vec(𝑯̂′𝒁𝑖))′ 𝑨̂𝜽⁻¹ (Σ_{𝑖=1}^{𝑁} vec(𝑯̂′𝒁𝑖)) →d 𝜒²((𝑇 − 𝑝0)(𝐾 + 1 − 𝑝0)).

3. If 𝑝0 > 𝑝, then 𝑁⁻¹ (Σ_{𝑖=1}^{𝑁} vec(𝑯̂′𝒁𝑖))′ 𝑨̂𝜽⁻¹ (Σ_{𝑖=1}^{𝑁} vec(𝑯̂′𝒁𝑖)) →p ∞.

Proof. The proof comes from standard theory; see Hansen (1982). The estimator of the optimal weight matrix is 𝑨̂𝜽 = (1/𝑁) Σ_{𝑖=1}^{𝑁} vec(𝑯(𝜽̃)′𝒁𝑖)vec(𝑯(𝜽̃)′𝒁𝑖)′, where 𝜽̃ is a consistent first-stage estimator of 𝜽0. □

It is entirely possible there are variables in the data set which are linear in the factors but not relevant for estimation. In this case, one can simply use them to estimate 𝜽0 but drop them from the estimating equation. Further, if relevant variables are not linear in 𝑭0, they should be dropped from the estimation in Theorem 3.3.1. This can occur if there are polynomial or interactive functions of the covariates in the estimating equation. De Vos and Westerlund (2019) study this case in the context of CCE. I also note that the just-identified case 𝑝0 = 𝐾 + 1 corresponds to a simple M-estimator:

Corollary 3.3.1. When 𝑝0 = 𝐾 + 1, the estimator 𝜽̂ solves

𝑯̂′(𝒚̄, 𝑿̄) = 0

Corollary 3.3.1 provides important robustness properties in Section 3.3. For now, I point out how Theorem 3.3.1 can help test for 𝑝0.
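In the just-identified case of Corollary 3.3.1 the first stage has a closed form: partitioning 𝒁̄ = (𝒚̄, 𝑿̄) into its first 𝑇 − 𝑝 rows 𝒁̄1 and last 𝑝 rows 𝒁̄2, the condition 𝑯̂′𝒁̄ = 𝒁̄1 + 𝚯̂𝒁̄2 = 0 gives 𝚯̂ = −𝒁̄1𝒁̄2⁻¹. A sketch under an assumed illustrative DGP (all names and parameter values are my own):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, K = 2000, 5, 1
p = K + 1                                  # just-identified: p0 = K + 1
beta0 = np.array([[0.7]])
F0 = rng.normal(size=(T, p))
# Loading means chosen so E([gamma_i, Gamma_i]) has full row rank p
gam = rng.normal(size=(N, p, 1)) + np.array([[1.0], [0.0]])
Gam = rng.normal(size=(N, p, K)) + np.array([[0.0], [1.0]])
X = F0 @ Gam + rng.normal(size=(N, T, K))
y = X @ beta0 + F0 @ gam + rng.normal(size=(N, T, 1))

# Corollary 3.3.1: theta_hat solves H(theta)'(y_bar, X_bar) = 0 exactly
Zbar = np.concatenate([y.mean(0), X.mean(0)], axis=1)   # T x (K+1)
Z1, Z2 = Zbar[:T - p], Zbar[T - p:]                     # (T-p) x p and p x p blocks
Theta_hat = -Z1 @ np.linalg.inv(Z2)
H = np.vstack([np.eye(T - p), Theta_hat.T])             # estimated QLD matrix
print(np.abs(H.T @ Zbar).max())                         # zero up to rounding
```

No nonlinear optimization is involved here, which is one practical appeal of the just-identified first stage relative to the full Ahn et al. (2013) GMM problem.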
There are (๐‘‡ โˆ’ ๐‘ 0 )(๐พ + 1) moments and (๐‘‡ โˆ’ ๐‘ 0 ) ๐‘ 0 parameters, so the system is underidentified when ๐พ + 1 < ๐‘ 0 and just identified like in Corollary 3.3.1 when ๐พ + 1 = ๐‘ 0 . When ๐พ + 1 > ๐‘ 0 , we have overidentifying restrictions to test for ๐‘ 0 . Ahn et al. (2013) recommend testing for ๐‘ 0 by first setting ๐‘ = 0 and setting ๐‘ฏ = ๐‘ฐ๐‘‡ . If the hypothesis is rejected using the statistic in part (2) of Theorem 3.3.1, move to ๐‘ = 1. Continue until the null hypothesis cannot be rejected. I refer the reader to Section 3 of Ahn et al. (2013) for additional details and tests. I follow a similar approach to testing based off of the moments in Theorem 3.3.1. I now demonstrate that the additional moments generally improve efficiency of the Ahn et al. (2013) GMM estimator by demonstrating that the CCE modelโ€™s reduced form assumption implies additional non-redundant moment conditions. The following theorem completely characterizes when the moments ๐ธ (๐‘ฏ0โ€ฒ ๐‘ฟ๐‘– ) = ๐ธ (๐‘ฏ0โ€ฒ ๐‘ฝ๐‘– ) = 0 are partially redundant for estimating ๐œท0 using the Ahn et al. (2013) estimator, meaning its asymptotic variance is the same with or without the additional moments. I do not include ๐ธ (๐‘ฏ0โ€ฒ ๐’š๐‘– ) = 0 because the efficiency result would require additional 65 assumptions on ๐‘‰ ๐‘Ž๐‘Ÿ (๐’–๐‘– ). Let ๐’ˆ๐‘–1 ( ๐œท, ๐œฝ) = vec( ๐‘ฟ๐‘– ) โŠ— ๐‘ฏ(๐œฝ) โ€ฒ ( ๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œท) and ๐’ˆ๐‘–2 (๐œฝ) = ๐‘ฏ(๐œฝ) โ€ฒ๐‘ฝ๐‘– be the residuals associated with the moment conditions from equations (3.2.7) and (3.2.8) respectively. Let ๐‘ซ 11 = ๐ธ (โˆ‡ ๐œท ๐’ˆ๐‘–1 ( ๐œท0 , ๐œฝ 0 )), ๐‘ซ 12 = ๐ธ (โˆ‡๐œฝ ๐’ˆ๐‘–1 ( ๐œท0 , ๐œฝ 0 )), and ๐›€11 = ๐‘‰ ๐‘Ž๐‘Ÿ ( ๐’ˆ๐‘–1 ( ๐œท0 , ๐œฝ 0 )). Theorem 3.3.2. Given Assumptions 1 and 2, suppose ๐ธ (๐’–๐‘– | ๐‘ฟ๐‘– ) and the Identifying Assumptions in the Appendix hold. 
Then the moment conditions 𝐸(𝒈𝑖2(𝜽0)) = 0 are partially redundant for estimating 𝜷0 if and only if

𝑫12′𝛀11⁻¹𝑫11 = 0   (3.3.2)

Proof. See Appendix for proof. The extra assumptions are only needed so that (𝜷0′, 𝜽0′)′ is identified by 𝐸(𝒈𝑖1(𝜷0, 𝜽0)) = 0 and are equivalent to the Basic Assumptions of Ahn et al. (2013). I assume 𝐸(𝒖𝑖|𝑿𝑖) = 0, whereas Assumption 2 implies the weaker 𝐸(𝒖𝑖|𝑽𝑖) = 0. I make the stronger exogeneity assumption for simplicity, though the moment conditions in 𝒈𝑖1 could be reformulated with 𝑯0′𝑽𝑖 ⊂ 𝒘𝑖. □

There is no reason to believe equation (3.3.2) holds in general, and so the additional moments improve the efficiency of estimating 𝜷0 using the QLD residual in equation (3.2.7). Trivial cases where equation (3.3.2) holds include 𝜽0 being known to the researcher and 𝑝0 = 0.

3.3.2 Pooled and Mean Group QLD

The QLD GMM approach of Ahn et al. (2013) can select appropriate instruments for a given time period. However, an abundance of moment conditions can induce finite-sample bias and local stationary points in the GMM objective function. This section introduces the linear pooled and mean group estimators based on the QLD transformation. They allow for a variety of rank and exogeneity conditions which are especially useful when the researcher includes heterogeneous slopes in the model, as in Section 3.4. I propose first estimating the parameters 𝜽0 using the pure factor structure assumed in 𝒁𝑖 and then running the relevant regressions using the "defactored" data 𝑯̂′𝒚𝑖 and 𝑯̂′𝑿𝑖:
โˆ’1 ๐‘ โˆ‘๏ธ โˆ‘๏ธ ๐œท b๐‘„๐ฟ๐ท๐‘ƒ = ๐‘ฟ๐‘–โ€ฒ ๐‘ฏb๐‘ฏbโ€ฒ ๐‘ฟ๐‘– ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ b๐‘ฏbโ€ฒ ๐’š๐‘– (3.3.3) ๐‘–=1 ๐‘–=1 ๐‘ 1 โˆ‘๏ธ โ€ฒ b bโ€ฒ โˆ’1 โ€ฒ b bโ€ฒ ๐œท๐‘„๐ฟ๐ท ๐‘€๐บ = b ( ๐‘ฟ ๐‘ฏ ๐‘ฏ ๐‘ฟ๐‘– ) ๐‘ฟ๐‘– ๐‘ฏ ๐‘ฏ ๐’š๐‘– (3.3.4) ๐‘ ๐‘–=1 ๐‘– The pooled quasi-long-differencing (QLDP) estimator defined by equation (3.3.3) is the pooled OLS estimator from regressing ๐‘ฏ bโ€ฒ ๐’š๐‘– on ๐‘ฏbโ€ฒ ๐‘ฟ๐‘– . A similar estimator was mentioned in Breitung and Hansen (2020) but not thoroughly studied. The mean group quasi-long-differencing (QLDMG) estimator defined by equation (3.3.4) can be obtained by running the ๐‘‡ โˆ’ ๐‘ observation time series regression ๐‘ฏ bโ€ฒ ๐’š๐‘– on ๐‘ฏ bโ€ฒ ๐‘ฟ๐‘– for each ๐‘–, and then averaging each of the ๐‘ estimates. It should be noted that ๐‘ฏbโ€ฒ can be used to โ€œdefactor" any variables which are linear in ๐‘ญ0 and not just those used in the estimator of ๐œฝ 0 . This observation allows for 2SLS estimation using outside instruments. Intuitively, the mean group estimator should allow for arbitrarily random slopes at the cost of rank assumptions and precision. If the model is thought to have homogeneous slopes, one should generally choose the pooled estimator over the mean group one. I ignore its asymptotic properties until Section 3.4 when I introduce random slopes. However, the pooled QLD allows us to relax the rank conditions used in Ahn et al. (2013) and Westerlund et al. (2019). Instead of ๐ธ (vec( ๐‘ฟ๐‘– ) โŠ— ๐‘ฏ0โ€ฒ (๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œท0 )) = 0, we can use the moments ๐ธ ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ (๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œท0 )) = 0. This residual represents a just-identified system of moments, requires no outside instruments, and allows ๐ธ (๐œธ๐‘– ๐œธ๐‘–โ€ฒ) and ๐ธ (๐œธ๐‘– ) to be completely arbitrary. Further, since estimation of ๐œฝ 0 comes from the reduced form moments, I do not require ๐‘‡ > ๐พ + 1. 
Before proving asymptotic normality, I point out that the case of 𝑝 = 𝐾 + 1 implies a powerful algebraic fact about the pooled QLD estimator: it is the same whether or not the researcher includes common variables in the regression. That is, all variables which do not vary over 𝑖 are irrelevant to the estimation of 𝜷0, which includes time dummies. Further, the pooled QLD residuals are the same with or without the inclusion of common variables. Note that I say 𝑝 = 𝐾 + 1 instead of 𝑝0 = 𝐾 + 1, as the following theorem is purely algebraic and independent of model specification or statistical properties. Let 𝑾 be a 𝑇 × 𝑞 matrix of common variables, and let (𝜶̃′, 𝜷̃′)′ be the estimates from the pooled regression of 𝑯̂′𝒚𝑖 on 𝑯̂′[𝑾, 𝑿𝑖]. Finally, let 𝝐̂𝑖 = 𝒚𝑖 − 𝑿𝑖𝜷̂𝑄𝐿𝐷𝑃 and 𝝐̃𝑖 = 𝒚𝑖 − 𝑿𝑖𝜷̃ − 𝑾𝜶̃ be the associated residuals.

Theorem 3.3.3. Suppose 𝑝 = 𝐾 + 1. If Rk(𝑯̂′𝑾) = 𝑞, then

1. 𝜷̂𝑄𝐿𝐷𝑃 = 𝜷̃.
2. 𝜶̃ = 0.
3. 𝝐̂𝑖 = 𝝐̃𝑖.

Proof. By Corollary 3.3.1, the first-stage estimator 𝜽̂ solves 𝑯̂′[𝒚̄, 𝑿̄] = 0. Then

Σ_{𝑖=1}^{𝑁} 𝑿𝑖′𝑯̂𝑯̂′𝑾 = 𝑁𝑿̄′𝑯̂𝑯̂′𝑾 = 0

by Corollary 3.3.1, so 𝑯̂′𝑿𝑖 and 𝑯̂′𝑾 are uncorrelated in the sample. Thus 𝜷̃ = 𝜷̂𝑄𝐿𝐷𝑃. Using the same argument,

𝜶̃ = ( Σ_{𝑖=1}^{𝑁} 𝑾′𝑯̂𝑯̂′𝑾 )⁻¹ Σ_{𝑖=1}^{𝑁} 𝑾′𝑯̂𝑯̂′𝒚𝑖 = ( 𝑁𝑾′𝑯̂𝑯̂′𝑾 )⁻¹ 𝑁𝑾′𝑯̂𝑯̂′𝒚̄ = 0

As 𝜶̃ = 0 and 𝜷̃ = 𝜷̂𝑄𝐿𝐷𝑃, we have 𝝐̃𝑖 = 𝝐̂𝑖. □

The above result suggests that when 𝑝 = 𝐾 + 1, the QLD matrix suffices to remove all unobserved time effects in the population, even those which do not interact with the heterogeneity.
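Theorem 3.3.3 can be confirmed numerically: under 𝑝 = 𝐾 + 1, augmenting the pooled QLD regression with common variables (here an assumed intercept and linear trend) returns 𝜶̃ = 0 and leaves the slope unchanged. A sketch with an illustrative DGP:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, K = 500, 5, 1
p = K + 1
beta0 = np.array([[0.7]])
F0 = rng.normal(size=(T, p))
gam = rng.normal(size=(N, p, 1)) + np.array([[1.0], [0.0]])
Gam = rng.normal(size=(N, p, K)) + np.array([[0.0], [1.0]])
X = F0 @ Gam + rng.normal(size=(N, T, K))
y = X @ beta0 + F0 @ gam + rng.normal(size=(N, T, 1))

# First stage solves H_hat'(y_bar, X_bar) = 0 (Corollary 3.3.1)
Zbar = np.concatenate([y.mean(0), X.mean(0)], axis=1)
Theta = -Zbar[:T - p] @ np.linalg.inv(Zbar[T - p:])
H = np.vstack([np.eye(T - p), Theta.T])
Xd, yd = H.T @ X, H.T @ y

# Pooled QLD without common variables
beta_qldp = np.linalg.solve(np.einsum('itk,itl->kl', Xd, Xd),
                            np.einsum('itk,itl->kl', Xd, yd))

# Add common variables W and re-run the pooled regression on [H'W, H'X_i]
W = np.column_stack([np.ones(T), np.arange(T, dtype=float)])   # T x q, q = 2
Wd = np.broadcast_to(H.T @ W, (N, T - p, 2))
R = np.concatenate([Wd, Xd], axis=2)
coef = np.linalg.solve(np.einsum('itk,itl->kl', R, R),
                       np.einsum('itk,itl->kl', R, yd))
print(coef[:2].ravel())        # alpha_tilde: zero up to rounding
print((coef[2:] - beta_qldp).ravel())   # slope unchanged up to rounding
```

The invariance is exact in the sample, not merely asymptotic, because 𝑯̂′𝑿̄ = 𝑯̂′𝒚̄ = 0 holds by construction of the just-identified first stage.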
The intuition is similar to the ‘zero sum’ class of estimators studied by Westerlund (2019). It may appear that Theorem 3.3.3 only applies in very special scenarios; however, simulation evidence in the Appendix suggests that overestimating 𝑝0 does not cause inconsistency. These results bolster the simulation evidence from Ahn et al. (2013), which suggests the same when using their GMM estimator. Breitung and Hansen (2020) also demonstrate that the Ahn et al. (2013) estimator performs well under the BIC method of estimating 𝑝0, which has a tendency to overestimate the number of factors. Overestimating 𝑝0 includes the case of incorrectly estimating factors when 𝑝0 = 0. Under strict exogeneity, CCE and QLD procedures will be consistent because their factor proxies are just functions of the exogenous variables. Reporting the QLDP which takes 𝑝 = 𝐾 + 1 could then serve as a robustness check if the estimated 𝑝0 is less than 𝐾 + 1. This fact is explored in a brief simulation study in Section 3.5.2. I now show asymptotic normality for the pooled QLD estimator. I demonstrate how first-stage estimation of 𝜽0 can affect the asymptotic distribution and show why ignoring this problem leads to incorrect standard errors even when pooled QLD is asymptotically normal. I briefly discuss why the standard errors in Westerlund et al. (2019) do not account for this problem. The full proof of asymptotic normality is given in the Appendix, so I will only sketch the problem here. Let 𝑨𝑃 = 𝐸(𝑽𝑖′𝑯0𝑯0′𝑽𝑖). I show in the Appendix that
โˆš ๐‘ 1 โˆ‘๏ธ โ€ฒ b bโ€ฒ ๐‘(๐œทb๐‘„๐ฟ๐ท๐‘ƒ โˆ’ ๐œท0 ) = ๐‘จโˆ’1 ๐‘ƒ โˆš ๐‘ฟ๐‘– ๐‘ฏ ๐‘ฏ (๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– ) + ๐‘œ ๐‘ (1) ๐‘ ๐‘–=1 After a mean value expansion about ๐œฝ 0 , and using the results from Theorem 3.3.1, the normalized estimator is โˆš โˆ’1 1 ๐‘ โˆ‘๏ธ ๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐’–๐‘– + ๐‘ฎ ๐‘ƒ ๐’“๐‘– (๐œฝ 0 ) + ๐‘œ ๐‘ (1)  ๐‘ ( ๐œท๐‘„๐ฟ๐ท๐‘ƒ โˆ’ ๐œท0 ) = ๐‘จ๐‘ƒ โˆš b ๐‘ ๐‘–=1 where ๐’“๐‘– (๐œฝ 0 ) is derived from Theorem 1 and ๐‘ฎ ๐‘ƒ = ๐ธ (โˆ‡๐œฝ ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ (๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– )) evaluated at ๐œฝ = ๐œฝ 0 . ๐‘ฎ ๐‘ƒ = 0 when ๐ธ (๐’–๐‘– โŠ— ๐‘ฝ๐‘– ) = 0, ๐ธ (๐’–๐‘– โŠ— ๐šช๐‘– ) = 0, and ๐ธ (๐‘ฝ๐‘– โŠ— ๐œธ๐‘– ) = 0. I only need exogeneity of ๐‘ฝ๐‘– with respect to ๐’–๐‘– for asymptotic normality, so the other assumptions only simplify the asymptotic variance. Westerlund et al. (2019) impose these assumptions which ignores the effect of first-stage estimation uncertainty. My result thus proves asymptotic normality of the pooled QLD under weaker assumptions than used in Westerlund et al. (2019) for the pooled CCE with an even more general asymptotic variance formula. In fact, one could only assume exogeneity on the last ๐‘ 0 elements of the differenced quantities, but this assumption is difficult to interpret. I now state the general asymptotic normality result assuming ๐‘ = ๐‘ 0 is known due to Theorem 3.3.1. Theorem 3.3.4. Given Assumptions 1 and 2, suppose that 69 1. ๐‘จ๐‘ƒ = ๐ธ (๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐‘ฝ๐‘– ) has full rank. 2. ๐ธ (๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐’–๐‘– ) = 0. ๐‘ Then ๐œท b๐‘„๐ฟ๐ท๐‘ƒ โ†’ ๐œท0 and โˆš ๐‘ ๐‘(๐œทb๐‘„๐ฟ๐ท๐‘ƒ โˆ’ ๐œท0 ) โ†’ ๐‘ (0, ๐‘จโˆ’1 ๐‘ฉ ๐‘ƒ ๐‘จโˆ’1 ) ๐‘ƒ ๐‘ƒ where ๐‘ฉ ๐‘ƒ = ๐ธ ((๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐’–๐‘– + ๐‘ฎ ๐‘ƒ ๐’“๐‘– (๐œฝ 0 ))(๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐’–๐‘– + ๐‘ฎ ๐‘ƒ ๐’“๐‘– (๐œฝ 0 )) โ€ฒ). 
If ๐ธ (๐’–๐‘– โŠ— ๐šช๐‘– ) = 0 and ๐ธ (๐‘ฝ๐‘– โŠ— ๐œธ๐‘– ) = 0, then ๐‘ฎ ๐‘ƒ = 0. Proof. See Appendix for proof and a derivation of ๐‘ฎ ๐‘ƒ and ๐’“๐‘– (๐œฝ 0 ). Condition (2) is not practically weaker than ๐ธ (๐’–๐‘– |๐‘ฝ๐‘– ) = 0 for linear estimation but I state it for completeness. โ–ก Remark (Joint estimation): The two-step procedure is less efficient than joint GMM estimation using ๐ธ (๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ( ๐’š๐‘– โˆ’๐‘ฝ๐‘– ๐œท0 )) = 0 and ๐ธ (๐‘ฏ0โ€ฒ ๐’๐‘– ) = 0 unless ๐‘ = ๐พ +1; see Ahn and Schmidt (1997). However, the ๐‘ = ๐พ +1 case confers the advantage of invariance to common variables from Theorem 3.3.3 and appears consistent even when ๐‘ 0 < ๐‘. There are also optimization issues involved in joint estimation because the moments which identify ๐œท0 are nonlinear in ๐œฝ 0 . โ–  Remark (Known factors): Eliminating known factors like random intercepts or polynomial time trends can make the QLD estimators more precise. Simply remove the known factors from [๐’š๐‘– , ๐‘ฟ๐‘– ] by regressing it, unit-by-unit, onto the known factors, then estimate ๐œฝ 0 as in Theorem 3.3.1 using the residuals. This procedure is equivalent to defining ๐‘ด = ๐‘ฐ๐‘‡ โˆ’ ๐‘ญ1 (๐‘ญ1โ€ฒ ๐‘ญ1 ) โˆ’1 ๐‘ญ1โ€ฒ , where ๐‘ญ1 are the known factors, and running estimation based off of (๐’š๐‘–โˆ— , ๐‘ฟ๐‘–โˆ— ) = ( ๐‘ฐ ๐‘ โŠ— ๐‘ด)(๐’š๐‘– , ๐‘ฟ๐‘– ). โ–  Remark (Bootstrap): While I provide analytic inference below, the standard errors can be quite complicated in general. Regardless of any additional restrictions which can simplify the โˆš calculation of standard errors, ๐‘ ( ๐œท b๐‘„๐ฟ๐ท๐‘ƒ โˆ’ ๐œท0 ) is asymptotically normal so that one can instead do inference via the nonparametric bootstrap. Just resample over ( ๐’š๐‘– , ๐‘ฟ๐‘– ), with ๐‘ฏ b estimated for each new sample to account for the first-stage estimation in the final standard errors. 
This procedure contrasts with Section 2 of the Supplement to Westerlund et al. (2019), which does not re-estimate $\hat{\boldsymbol{F}}$ with each new sample. I do not provide a proof of consistency because the problem is standard; Westerlund et al. (2019) needed a proof because the CCE projection matrix has a reduced-rank limit. ■

The asymptotic variance can be estimated by $\hat{\boldsymbol{A}}_P^{-1}\hat{\boldsymbol{B}}_P\hat{\boldsymbol{A}}_P^{-1}$ where
\[
\hat{\boldsymbol{A}}_P = \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{X}_i'\hat{\boldsymbol{H}}\hat{\boldsymbol{H}}'\boldsymbol{X}_i, \qquad \hat{\boldsymbol{B}}_P = \frac{1}{N}\sum_{i=1}^{N}\hat{\boldsymbol{v}}_i\hat{\boldsymbol{v}}_i'
\]
Here, $\hat{\boldsymbol{v}}_i = \boldsymbol{X}_i'\hat{\boldsymbol{H}}\hat{\boldsymbol{H}}'\hat{\boldsymbol{e}}_i + \hat{\boldsymbol{G}}_P(\hat{\boldsymbol{\theta}})\boldsymbol{r}_i(\hat{\boldsymbol{\theta}})$, where $\hat{\boldsymbol{e}}_i = \boldsymbol{y}_i - \boldsymbol{X}_i\hat{\boldsymbol{\beta}}_{QLDP}$ is the full pooled QLD residual. The gradient is
\[
\hat{\boldsymbol{G}}_P = \frac{1}{N}\sum_{i=1}^{N}\left[(\boldsymbol{I}_K \otimes \hat{\boldsymbol{e}}_i'\hat{\boldsymbol{H}})\begin{pmatrix} \boldsymbol{x}_i^{*1\prime} \otimes \boldsymbol{I}_{T-p_0} \\ \vdots \\ \boldsymbol{x}_i^{*K\prime} \otimes \boldsymbol{I}_{T-p_0} \end{pmatrix} + \boldsymbol{X}_i'\hat{\boldsymbol{H}}(\hat{\boldsymbol{e}}_i^{*\prime} \otimes \boldsymbol{I}_{T-p_0})\right] \tag{3.3.5}
\]
\[
\boldsymbol{r}_i(\hat{\boldsymbol{\theta}}) = (\hat{\boldsymbol{D}}_{\theta}'\hat{\boldsymbol{A}}_{\theta}^{-1}\hat{\boldsymbol{D}}_{\theta})^{-1}\hat{\boldsymbol{D}}_{\theta}'\hat{\boldsymbol{A}}_{\theta}^{-1}\,\mathrm{vec}(\hat{\boldsymbol{H}}'\boldsymbol{Z}_i) \tag{3.3.6}
\]
where a '$*$' denotes the last $p_0$ elements of a $T \times 1$ vector. The form of $\boldsymbol{r}_i(\hat{\boldsymbol{\theta}})$ comes from Theorem 3.3.1 and is derived in the proof of Theorem 3.3.4. The matrix $\boldsymbol{G}_P$ appears because of correlation between the full error $\boldsymbol{e}_i = \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i$ and the covariates $\boldsymbol{X}_i$, and the vector $\boldsymbol{r}_i$ comes from error in estimating $\boldsymbol{\theta}_0$ in the first stage. The regular cluster-robust standard errors for a pooled regression are only valid if $\boldsymbol{G}_P = 0$. Assuming factor loadings are independent of the errors causes this matrix to be zero, as in the classical CCE treatments of Pesaran (2006) and Westerlund et al. (2019).
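The bootstrap procedure described in the remark above, resampling whole cross-sectional units and re-estimating the first stage inside every replication, can be sketched as follows. This is a minimal illustration, not the dissertation's code; the `estimator` callable and all names are hypothetical:

```python
import numpy as np

def qld_bootstrap(y, X, estimator, n_boot=199, seed=0):
    """Nonparametric bootstrap over cross-sectional units (y_i, X_i).

    `estimator` maps a resampled (y, X) panel to a coefficient vector and
    must re-estimate the first-stage H(theta) internally, so the reported
    standard errors reflect first-stage estimation uncertainty.
    """
    rng = np.random.default_rng(seed)
    N = y.shape[0]
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)     # resample units with replacement
        draws.append(estimator(y[idx], X[idx]))
    draws = np.asarray(draws)
    return draws.std(axis=0, ddof=1)         # bootstrap standard errors
```

Because resampling is over units, the within-unit serial dependence of the errors is preserved automatically.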
Though the loadings are meant to model the correlation between $\boldsymbol{X}_i$ and all unobservables, they may still correlate with the errors due to misspecification. If there are additional factors in $\boldsymbol{y}_i$ not in $\boldsymbol{X}_i$, we can still estimate $\boldsymbol{\beta}_0$, but the asymptotic variances will depend on first-stage estimation of $\boldsymbol{\theta}_0$. In fact, if we allow for uncorrelated loadings, the CCE and QLD estimators exclude relevant information for estimation. Additionally assuming $E(\boldsymbol{V}_i|\boldsymbol{\gamma}_i) = 0$ as in Westerlund et al. (2019), we have:
\[
E((\boldsymbol{H}_0'\boldsymbol{V}_i) \otimes \boldsymbol{H}_0'(\boldsymbol{y}_i - \boldsymbol{V}_i\boldsymbol{\beta}_0)) = 0 \tag{3.3.7}
\]
\[
E((\boldsymbol{H}_0'\boldsymbol{V}_i) \otimes (\boldsymbol{y}_i - \boldsymbol{X}_i\boldsymbol{\beta}_0)) = 0 \tag{3.3.8}
\]
\[
E(\boldsymbol{X}_i \otimes \boldsymbol{H}_0'(\boldsymbol{y}_i - \boldsymbol{V}_i\boldsymbol{\beta}_0)) = 0 \tag{3.3.9}
\]
\[
E(\boldsymbol{H}_0'(\boldsymbol{y}_i - \boldsymbol{V}_i\boldsymbol{\beta}_0)) = 0 \tag{3.3.10}
\]
\[
E(\boldsymbol{H}_0'\boldsymbol{V}_i) = 0 \tag{3.3.11}
\]
Equations (3.3.7)-(3.3.11) list $(T - p_0)((T - p_0)K + 2TK + K + 1)$ moment conditions, which displays the strength of the CCE assumptions made in current applications. Without at least theoretically justifying $E(\boldsymbol{V}_i \otimes \boldsymbol{\gamma}_i) = 0$, CCE-based inference needs a modern treatment which accounts for first-stage estimation as in Brown et al. (2021). To summarize: if the loadings are allowed to be correlated, then the pooled CCE standard errors from Pesaran (2006) and Westerlund et al. (2019) are incorrect. If the loadings are assumed uncorrelated, then we have a significant number of unused moment restrictions. In fact, if first-stage estimation does not affect the asymptotic distribution, and the conditional covariance $E(\boldsymbol{u}_i\boldsymbol{u}_i'|\boldsymbol{X}_i)$ is estimable, the feasible version of the GLS estimator from Section 3.2 of Brown (2021) is $\sqrt{N}$-consistent and efficient among all estimators based on $E(\boldsymbol{M}_{\boldsymbol{F}_0}(\boldsymbol{y}_i - \boldsymbol{X}_i\boldsymbol{\beta}_0)) = 0$, in which case all the moments in equations (3.3.7)-(3.3.11) are redundant.
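For concreteness, once a first-stage estimate $\hat{\boldsymbol{H}} = \boldsymbol{H}(\hat{\boldsymbol{\theta}})$ is in hand, the pooled QLD point estimate of Section 3.3 is a simple linear computation. A minimal numpy sketch, with $\hat{\boldsymbol{H}}$ taken as given and all names hypothetical:

```python
import numpy as np

def pooled_qld(y, X, H):
    """Pooled QLD estimator: solves A beta = b, where
    A = sum_i X_i' H H' X_i and b = sum_i X_i' H H' y_i.

    y : (N, T) outcomes; X : (N, T, K) covariates
    H : (T, T - p) first-stage QLD matrix H(theta_hat), taken as given
    """
    P = H @ H.T                                 # (T, T) weighting matrix
    A = np.einsum('itk,ts,isl->kl', X, P, X)    # sum_i X_i' H H' X_i
    b = np.einsum('itk,ts,is->k', X, P, y)      # sum_i X_i' H H' y_i
    return np.linalg.solve(A, b)
```

The estimator is just pooled OLS on the transformed data, which is why the text emphasizes that only the standard errors, not the point estimate, require the first-stage correction terms.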
3.4 Heterogeneous Slopes

I now consider a generalization of the population model in equation (3.2.1) which allows for random slopes:
\[
\boldsymbol{y}_i = \boldsymbol{X}_i\boldsymbol{\beta}_i + \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i \tag{3.4.1}
\]
\[
\boldsymbol{\beta}_i = \boldsymbol{\beta}_0 + \boldsymbol{b}_i \tag{3.4.2}
\]
\[
\boldsymbol{b}_i \sim (0, \boldsymbol{\Sigma}) \tag{3.4.3}
\]
The random slopes model is identical to the forms in Wooldridge (2005) and Pesaran (2006), though the former assumes $\boldsymbol{F}_0$ is observable. Neither Ahn et al. (2013) nor Westerlund (2019) consider random slopes in their fixed-$T$ analyses. I summarize this model in the following assumption:

Assumption 3 (Random slopes):

1. $\boldsymbol{y}_i = \boldsymbol{X}_i(\boldsymbol{\beta}_0 + \boldsymbol{b}_i) + \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i$.

2. $(\boldsymbol{X}_i, \boldsymbol{b}_i, \boldsymbol{\gamma}_i, \boldsymbol{u}_i)$ are independent and identically distributed across $i$ with finite fourth moments.

3. $E(\boldsymbol{b}_i) = 0$. ■

The iid sampling assumption on $\boldsymbol{b}_i$ does not rule out correlation between $\boldsymbol{b}_i$ and the other stochastic components of the model. Similarly, Assumption 3(3) places no restrictions on the correlation between $\boldsymbol{b}_i$ and $\boldsymbol{X}_i$. It only states that $\boldsymbol{b}_i$ is the heterogeneous, unobserved deviation from the population parameter $\boldsymbol{\beta}_0$. Most fixed-$T$ treatments of random slope models either exclude factors altogether or simplify the factor structure as in a fixed effects analysis. Examples of fixed effects treatments include Juhl and Lugovskyy (2014), Campello et al. (2019), and Breitung and Salish (2021). Though Pesaran (2006), Chudik and Pesaran (2015), Neal (2015), and Norkutė et al. (2021) allow for random slopes and arbitrary factors, they require $T$ to grow to infinity and impose strong exogeneity conditions which I avoid. Before continuing with the analysis, I want to address how the random slopes model changes first-stage estimation of $\boldsymbol{\theta}_0$. The pure factor model for $\boldsymbol{Z}_i$ in equation (3.2.4) now takes the form
\[
E(\boldsymbol{Z}_i) = \boldsymbol{F}_0 E(\boldsymbol{C}_i\boldsymbol{Q}_i) + E(\boldsymbol{U}_i\boldsymbol{Q}_i)
\]
where $\boldsymbol{U}_i = [\boldsymbol{u}_i, \boldsymbol{V}_i]$.
In order for the identification result in Lemma 3.3.1 to hold, we need two additional conditions. First, $\mathrm{Rk}(E(\boldsymbol{C}_i\boldsymbol{Q}_i)) = p_0$, which is reasonable given Assumption 1. We also need $E(\boldsymbol{Q}_i\boldsymbol{U}_i) = 0$, which necessitates $E(\boldsymbol{\beta}_i'\boldsymbol{v}_{it}) = 0$ for each $t$, implying that $\boldsymbol{b}_i$ and $\boldsymbol{v}_{it}$ are uncorrelated while allowing arbitrary correlation between $\boldsymbol{b}_i$ and $(\boldsymbol{\gamma}_i, \boldsymbol{\Gamma}_i)$. We could instead estimate $\boldsymbol{\theta}_0$ based on $E(\boldsymbol{H}_0'\boldsymbol{X}_i) = E(\boldsymbol{H}_0'\boldsymbol{V}_i) = 0$ and require $p_0 \le K$ instead of $K + 1$. The robustness result of Theorem 3.3.3(1) holds for $p = K$, but parts (2) and (3) are not necessarily true.

Remark (Testing for random slopes): Assumption 2 allows us to test for correlated random slopes. Assuming that $p_0 < K + 1$, we can test the model $E(\boldsymbol{H}_0'\boldsymbol{Z}_i) = 0$ using the standard overidentifying restrictions test. The moments are zero under Assumptions 2 and 3 only when $\boldsymbol{\beta}_i$ is uncorrelated with $\boldsymbol{V}_i$. ■

The remainder of this section assumes $\boldsymbol{\theta}_0$ is derived from the reduced form moments $E(\boldsymbol{H}_0'\boldsymbol{V}_i) = 0$, with an analogous result to Theorem 1, to avoid uncertainty related to the overidentifying restrictions test. I first consider the Ahn et al. (2013) estimator in the presence of random slopes. The GMM estimator cannot estimate the individual random slopes due to the well-known incidental parameters problem. As such, I consider estimation which ignores the random slopes, so that $\boldsymbol{X}_i\boldsymbol{b}_i$ is absorbed into the error. The Ahn et al. (2013) expected residual becomes
\[
E(\mathrm{vec}(\boldsymbol{X}_i) \otimes \boldsymbol{H}_0'(\boldsymbol{y}_i - \boldsymbol{X}_i\boldsymbol{\beta}_0)) = E(\mathrm{vec}(\boldsymbol{X}_i) \otimes \boldsymbol{H}_0'\boldsymbol{X}_i\boldsymbol{b}_i) \tag{3.4.4}
\]
Theorem 3.4.1. Under Assumptions 1 and 3, $(\boldsymbol{\beta}_0', \boldsymbol{\theta}_0')'$ is identified by equation (3.4.4) if and only if
\[
E(\mathrm{vec}(\boldsymbol{X}_i) \otimes \boldsymbol{H}_0'\boldsymbol{V}_i\boldsymbol{b}_i) = 0
\]
Proof.
The proof is a corollary of the identification result presented in Section 3.1 of Ahn et al. (2013). □

Murtazashvili and Wooldridge (2008) consider IV estimation with random slopes and known factors. The exogeneity condition in Theorem 3.4.1 can depend on the type of instruments available. If there is a vector $\boldsymbol{w}_i$ of outside instruments, one sufficient condition is
\[
Cov(\boldsymbol{H}_0'\boldsymbol{X}_i, \boldsymbol{b}_i|\boldsymbol{w}_i) = Cov(\boldsymbol{H}_0'\boldsymbol{X}_i, \boldsymbol{b}_i) = 0 \tag{3.4.5}
\]
which is similar to Assumption 3.3 of Murtazashvili and Wooldridge (2008). With strictly exogenous covariates, the exogeneity condition is more similar to equations (12) and (13) of Wooldridge (2005), who considers fixed effects OLS. Wooldridge shows that pooled OLS is robust to heterogeneous slopes which are uncorrelated with the matrix of second moments of the defactored covariates; that is, $E(\boldsymbol{X}_i'\boldsymbol{M}_{\boldsymbol{F}_0}\boldsymbol{X}_i\boldsymbol{b}_i) = 0$, where he also assumes $\boldsymbol{F}_0$ is known. An even simpler sufficient condition would be $E(\boldsymbol{b}_i|\boldsymbol{X}_i) = 0$, which is in fact weaker than the random slope assumption of Pesaran (2006), who assumes $\boldsymbol{b}_i$ is independent of all stochastic components of the model. The Ahn et al. (2013) estimator requires stronger exogeneity and rank conditions than Wooldridge (2005) and Murtazashvili and Wooldridge (2008) because $\boldsymbol{\theta}_0$ needs to be estimated along with $\boldsymbol{\beta}_0$. If we add Assumption 2, we are able to obtain a first-stage $\sqrt{N}$-consistent estimator of $\boldsymbol{\theta}_0$ by Theorem 3.3.1, and so joint identification of $(\boldsymbol{\beta}_0', \boldsymbol{\theta}_0')'$ is irrelevant. This first-stage estimator allows us to substantially weaken the identification requirements for $\boldsymbol{\beta}_0$, which allows for estimation in a broader class of settings. Using the given estimator $\hat{\boldsymbol{\theta}}$ from Theorem 3.3.1, I study the pooled QLD estimator in the context of heterogeneous slopes.

Theorem 3.4.2.
Given Assumptions 2 and 3, where $\mathrm{Rk}(E(\boldsymbol{\Gamma}_i)) = p_0 \le K$, suppose that

1. $\boldsymbol{A}_P = E(\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'\boldsymbol{V}_i)$ has full rank.

2. $E(\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'(\boldsymbol{V}_i\boldsymbol{b}_i + \boldsymbol{u}_i)) = 0$.

Then $\hat{\boldsymbol{\beta}}_{QLDP} \xrightarrow{p} \boldsymbol{\beta}_0$ and
\[
\sqrt{N}(\hat{\boldsymbol{\beta}}_{QLDP} - \boldsymbol{\beta}_0) \xrightarrow{d} N(0, \boldsymbol{A}_P^{-1}\boldsymbol{B}_P\boldsymbol{A}_P^{-1})
\]
where $\boldsymbol{B}_P = E((\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'(\boldsymbol{V}_i\boldsymbol{b}_i + \boldsymbol{u}_i) + \boldsymbol{G}_P\boldsymbol{r}_{x,i}(\boldsymbol{\theta}_0))(\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'(\boldsymbol{V}_i\boldsymbol{b}_i + \boldsymbol{u}_i) + \boldsymbol{G}_P\boldsymbol{r}_{x,i}(\boldsymbol{\theta}_0))')$, $\boldsymbol{G}_P = E(\nabla_{\boldsymbol{\theta}}\,\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'(\boldsymbol{X}_i\boldsymbol{b}_i + \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i))$, and $\boldsymbol{r}_{x,i}(\boldsymbol{\theta}_0)$ is given in the Appendix. If $E(\boldsymbol{u}_i \otimes \boldsymbol{\Gamma}_i) = 0$, $E(\boldsymbol{V}_i \otimes \boldsymbol{b}_i) = 0$, and $E(\boldsymbol{V}_i \otimes \boldsymbol{\gamma}_i) = 0$, then $\boldsymbol{G}_P = 0$.

Proof. The proof is identical to the proof of Theorem 3.3.4 with the full error $\boldsymbol{e}_i = \boldsymbol{X}_i\boldsymbol{b}_i + \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i$. While $\boldsymbol{B}_P$ does not have the same form as in Theorem 3.3.4, the standard errors are calculated the same way but with $\boldsymbol{r}_{x,i}$ in place of $\boldsymbol{r}_i$, so I use the same notation. The additional rank assumption on $E(\boldsymbol{\Gamma}_i)$ allows us to estimate $\boldsymbol{\theta}_0$ via $E(\boldsymbol{H}_0'\boldsymbol{V}_i) = 0$, which overcomes the problems of correlation between $\boldsymbol{\beta}_i$ and $\boldsymbol{V}_i$. The asymptotic variance of $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ and the computation of $\boldsymbol{r}_{x,i}$ are given in the Appendix. □

Consistency is not affected by the first-stage estimates of $\boldsymbol{\theta}_0$ even with random slopes, so the exogeneity conditions needed are identical in spirit to Wooldridge (2005), who assumes known factors. I also do not require independence between $\boldsymbol{b}_i$ and $(\boldsymbol{X}_i, \boldsymbol{u}_i)$ like Pesaran (2006), but I still restrict the correlation between $\boldsymbol{X}_i$ and $\boldsymbol{b}_i$.
This condition can be weakened via mean group estimation, which allows an arbitrary conditional distribution $D(\boldsymbol{b}_i|\boldsymbol{X}_i)$ at the expense of much stronger rank and exogeneity conditions. I now state consistency and asymptotic normality for the mean group QLD estimator. Again, $\hat{\boldsymbol{\theta}}$ is derived from $E(\boldsymbol{H}_0'\boldsymbol{V}_i) = 0$. Define $\mathcal{T}$ as the parameter space of $\boldsymbol{\theta}_0$. Finally, let $a_i(\boldsymbol{\theta}) = \big(\sqrt{\sum_{k=1}^{K}\sigma_k(\boldsymbol{X}_i'\boldsymbol{H}(\boldsymbol{\theta})\boldsymbol{H}(\boldsymbol{\theta})'\boldsymbol{X}_i)}\big)^{-1}$, where $\{\sigma_k(\boldsymbol{D})\}_{k=1}^{K}$ are the singular values of the $K \times K$ matrix $\boldsymbol{D}$.

Theorem 3.4.3. Given Assumptions 2 and 3, where $\mathrm{Rk}(E(\boldsymbol{\Gamma}_i)) = p_0 \le K$, suppose that

1. The eigenvalues of $\boldsymbol{X}_i'\boldsymbol{H}(\boldsymbol{\theta})\boldsymbol{H}(\boldsymbol{\theta})'\boldsymbol{X}_i$ are almost surely positive uniformly over $\mathcal{T}$.

2. Uniformly over $\mathcal{T}$, $\max\left\{E(a_i(\boldsymbol{\theta})\|\boldsymbol{X}_i\|\|\boldsymbol{u}_i\|),\; E(a_i(\boldsymbol{\theta})^2\|\boldsymbol{X}_i\|^3\|\boldsymbol{u}_i\|)\right\} < \infty$.

3. $\mathcal{T}$ is a compact subset of $\mathbb{R}^{(T-p_0)p_0}$.

Then $\hat{\boldsymbol{\beta}}_{QLDMG} \xrightarrow{p} \boldsymbol{\beta}_0$ and
\[
\sqrt{N}(\hat{\boldsymbol{\beta}}_{QLDMG} - \boldsymbol{\beta}_0) \xrightarrow{d} N(0, \boldsymbol{B}_{MG})
\]
where $\boldsymbol{B}_{MG} = E\big(((\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'\boldsymbol{V}_i)^{-1}\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'\boldsymbol{u}_i + \boldsymbol{G}_{MG}\boldsymbol{r}_{x,i}(\boldsymbol{\theta}_0))((\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'\boldsymbol{V}_i)^{-1}\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'\boldsymbol{u}_i + \boldsymbol{G}_{MG}\boldsymbol{r}_{x,i}(\boldsymbol{\theta}_0))'\big)$. If $E(\boldsymbol{b}_i|\boldsymbol{V}_i) = 0$ and $E(\boldsymbol{V}_i \otimes \boldsymbol{\gamma}_i) = 0$, then $\boldsymbol{G}_{MG} = 0$.

Proof. See Appendix for the proof and the derivation of $\boldsymbol{G}_{MG}$. Note that Assumption 2 implies $E(\boldsymbol{u}_i|\boldsymbol{V}_i) = 0$. □

Standard errors are derived similarly to the pooled QLD estimator in Section 3.3.2.
Let ๐‘  โ€ฒ b= 1 โˆ‘๏ธ  ๐‘ฉ ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ b๐‘ฏbโ€ฒ ๐‘ฟ๐‘– ) โˆ’1 ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ b๐‘ฏ bโ€ฒb๐๐‘– ๐‘ฎ b ( ๐‘ฟโ€ฒ ๐‘ฏ b ๐‘€๐บ ๐’“ ๐‘ฅ,๐‘– ( ๐œฝ) ๐‘– b ๐‘ฏ b โ€ฒ ๐‘ฟ ๐‘– ) โˆ’1 โ€ฒ b bโ€ฒ b ๐‘ฟ๐‘– ๐‘ฏ ๐‘ฏ ๐ b ๐‘ฎ ๐’“ ๐‘– ๐‘€๐บ ๐‘ฅ,๐‘– ( ๐œฝ) b (3.4.6) ๐‘ ๐‘–=1 76 where ๐b๐‘– = ๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œทb๐ถ๐ถ๐ธ ๐‘€๐บ is the mean group QLD residual and ๐’“ ๐‘ฅ,๐‘– ( ๐œฝ) b comes from Lemma .0.2 in the Appendix. The gradient ๐‘ฎ ๐‘€๐บ can be estimated via ๐‘ 1 โˆ‘๏ธ  โ€ฒ b bโ€ฒ  โ€ฒ b bโ€ฒ โˆ’1 โ€ฒ b bโ€ฒ โˆ’1  ๐‘ฎ ๐‘€๐บ = b โˆ’ ๐‘ฐ๐พ โŠ— b ๐๐‘– ๐‘ฏ ๐‘ฏ ๐‘ฟ๐‘– ( ๐‘ฟ๐‘– ๐‘ฏ ๐‘ฏ ๐‘ฟ๐‘– ) โŠ— ( ๐‘ฟ๐‘– ๐‘ฏ ๐‘ฏ ๐‘ฟ๐‘– ) ( ๐‘ฐ๐พ 2 + ๐‘ฒ๐พ )( ๐‘ฐ๐พ โŠ— ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ)โˆ— b ๐‘ ๐‘–=1 ยฉ ๐’™๐‘– โˆ—1โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยช ยญ ยฎ .. โˆ—ยญ ยฎ+ ยญ ยฎ . ยญ ยฎ ยญ ยฎ ๐’™๐‘– โˆ—๐พ โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยซ ยฌ ยฉ ๐’™๐‘– โˆ— โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยช ยญ 1 ยฉ ยช ยญ ยฎ ยฎ .. + ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ bโ€ฒ ๐‘ฟ๐‘– ) โˆ’1 ยญยญ ๐‘ฐ๐พ โŠ— b๐๐‘–โ€ฒ ๐‘ฏ โ€ฒ b โˆ—โ€ฒ  b๐‘ฏ b ยญยญ . ยฎ ยฎ + ๐‘ฟ๐‘– ๐‘ฏ ๐ b ๐‘– โŠ— ๐‘ฐ๐‘‡โˆ’๐‘ 0 ยฎ ยฎ ยญ ยญ ยฎ ยฎ ยญ ยญ ยฎ ยฎ ๐’™๐‘– โˆ—๐พ โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยซ ยซ ยฌ ยฌ where ๐‘ฒ๐พ is the ๐พ 2 ร— ๐พ 2 commutation matrix. As discussed in Section 3.3.2, Theorem 3.4.3 is the first fixed-๐‘‡ proof of asymptotic normality for a mean group estimator which allows for arbitrary random factors. While I believe the mean group CCE estimator can be adjusted to allow ๐‘‡ fixed, it has yet to be proved, as Pesaran (2006) required ๐‘‡ โ†’ โˆž. Further, it is likely that a modern proof using the methods of Karabiyik et โˆš al. (2017) and Westerlund et al. (2019) is required. Like with the pooled estimator, the ๐‘- asymptotic normal convergence result in Theorem 3.4.3 implies that inference can be done via the usual nonparametric bootstrap, estimating ๐œฝb for each new bootstrap sample. 
Remark (Order conditions): Similar to the pooled estimator, one advantage of the QLD transformation is that it allows for more variables than CCE when $p_0$ is small. CCE uses the cross-sectional averages of $(\boldsymbol{y}, \boldsymbol{X})$ to control for the factors. The rank of $\boldsymbol{M}_{\hat{\boldsymbol{F}}}$ is generally $T - (K + 1)$ in finite samples, regardless of the number of factors. The rank of $\hat{\boldsymbol{H}}\hat{\boldsymbol{H}}'$ is $T - p$, which is assumed to be greater than $T - (K + 1)$ in Westerlund et al. (2019). ■

One consequence of the strong rank conditions is that we cannot allow variables which equal zero for all $t$ with positive probability. This rules out demographic dummy variables, which are common in applied microeconometrics. Instead, we could split the sample and run mean group estimation on each demographic subsample. The estimator's precision will suffer, but this technique allows us to estimate different slope means for different groups in the population.

3.5 Simulations

This section considers the finite-sample performance of the QLD estimators compared to the GMM and CCE estimators of Ahn et al. (2013) and Pesaran (2006), respectively.

3.5.1 Main Results

The main model is
\[
\boldsymbol{y}_i = \boldsymbol{X}_i\boldsymbol{\beta}_0 + \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i
\]
\[
\boldsymbol{X}_i = \boldsymbol{F}_0\boldsymbol{\Gamma}_i + \boldsymbol{V}_i
\]
as in Assumptions 1 and 2. There are two variables with slopes $\boldsymbol{\beta}_0 = (1, 1)'$. I do not include random slopes, as they would only serve to increase the amount of noise in the model and restrict the first-stage estimation of $\boldsymbol{\theta}_0$ for the QLD estimators and the cross-sectional averages for the CCE estimator. Theorems 3.4.2 and 3.4.3 dictate theoretically how the estimators should perform in given scenarios. I refer the reader to Campello et al. (2019) for simulation studies regarding the performance of pooled estimators when slopes are correlated with the variables of interest.
The two factors are generated as AR(1) processes with parameters 0.75 and $-0.75$ respectively, with initial values drawn from a normal distribution with mean 1 and variance 1. The factors are generated once and then fixed over repeated replications. The simulations do not substantively change if the factors are repeatedly drawn.⁴ As described earlier, since $T$ is small and fixed, it is the factor loadings which cause problems asymptotically, not the factors. The loadings on $\boldsymbol{X}_i$ are drawn as
\[
\boldsymbol{\Gamma}_i \sim \begin{pmatrix} N(1,1) & N(0,1) \\ N(0,1) & N(1,1) \end{pmatrix}
\]
so that $\boldsymbol{\theta}_0$ is identified from the reduced form moments. The loadings in $\boldsymbol{y}_i$ are drawn as
\[
\boldsymbol{\gamma}_i \sim \begin{pmatrix} N(\Gamma_{1,1}, 1) \\ N(\Gamma_{2,2}, 1) \end{pmatrix}
\]
The errors $\boldsymbol{u}_i$ and $\boldsymbol{V}_{ik}$ $(k = 1, 2)$ are drawn from a multivariate normal distribution with mean $\boldsymbol{0}_{T \times 1}$ and variance $\boldsymbol{C}$, where $\boldsymbol{C}$ is the correlation matrix of an AR(1) process with parameter 0.75. That is, the two errors in $\boldsymbol{V}_i = (\boldsymbol{V}_{i1}, \boldsymbol{V}_{i2})$ are both drawn from $MVN(\boldsymbol{0}_{T \times 1}, \boldsymbol{C})$ but are independent of each other and of $\boldsymbol{u}_i$. Each simulation study includes 1000 replications.

Table 3.1 compares the Ahn et al. (2013) estimator with and without the additional moments $E(\boldsymbol{H}_0'\boldsymbol{Z}_i) = 0$. Both estimators are computed as two-step estimators where the optimal weight matrix is calculated with a consistent first-step estimator. The first-step estimator uses an identity weight matrix.

⁴ Additional simulations are available upon request.
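The design above can be reproduced with a short script. This is a sketch under the stated assumptions (the loadings enter the covariates as $\boldsymbol{X}_i = \boldsymbol{F}_0\boldsymbol{\Gamma}_i + \boldsymbol{V}_i$; function and variable names are hypothetical):

```python
import numpy as np

def simulate_panel(N, T, rho=0.75, seed=0):
    """DGP sketch for Section 3.5.1: two AR(1) factors (parameters 0.75 and
    -0.75), correlated loadings, and AR(1)-correlated errors u_i and V_i."""
    rng = np.random.default_rng(seed)
    # two factors, AR(1) with N(1, 1) initial values (drawn once per call)
    F = np.empty((T, 2))
    F[0] = rng.normal(1.0, 1.0, size=2)
    for t in range(1, T):
        F[t] = np.array([0.75, -0.75]) * F[t - 1] + rng.normal(size=2)
    # AR(1) correlation matrix for the errors
    C = rho ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    beta = np.array([1.0, 1.0])
    # loadings on X_i: means [[1, 0], [0, 1]], unit variance
    Gamma = rng.normal(np.array([[1.0, 0.0], [0.0, 1.0]]), 1.0, size=(N, 2, 2))
    # loadings on y_i: means (Gamma_{1,1}, Gamma_{2,2}), unit variance
    gamma = rng.normal(Gamma[:, [0, 1], [0, 1]], 1.0)
    V = rng.multivariate_normal(np.zeros(T), C, size=(N, 2)).transpose(0, 2, 1)
    u = rng.multivariate_normal(np.zeros(T), C, size=N)
    X = np.einsum('tp,npk->ntk', F, Gamma) + V        # X_i = F Gamma_i + V_i
    y = X @ beta + np.einsum('tp,np->nt', F, gamma) + u
    return y, X
```

A Monte Carlo study would call this in a loop (re-drawing only loadings and errors) and feed the output to the estimators above.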
Table 3.1: GMM estimators

                    Bias                  SD                  RMSE
               GMM1     GMM2        GMM1     GMM2        GMM1     GMM2
N = 50
  T = 3      0.0328  -0.0107      0.2326   0.1812      0.2349   0.1815
            -0.0053  -0.0167      0.1719   0.1690      0.1720   0.1698
  T = 4     -0.0019  -0.0225      0.1444   0.1518      0.1444   0.1535
             0.0137  -0.0196      0.1626   0.1424      0.1632   0.1438
  T = 5      0.0170  -0.0249      0.1701   0.1694      0.1710   0.1712
             0.1375  -0.0055      0.3080   0.2057      0.3373   0.2058
N = 300
  T = 3      0.0328  -0.0107      0.2326   0.1812      0.2349   0.1815
            -0.0053  -0.0167      0.1719   0.1690      0.1720   0.1698
  T = 4     -0.0019  -0.0225      0.1444   0.1518      0.1444   0.1535
             0.0137  -0.0196      0.1626   0.1424      0.1632   0.1438
  T = 5      0.0005  -0.0016      0.0363   0.0364      0.0363   0.0365
             0.0156  -0.0029      0.1014   0.0367      0.1026   0.0368

GMM1 is the GMM estimator based only on the Ahn et al. (2013) residual $E(\mathrm{vec}(\boldsymbol{X}_i) \otimes \boldsymbol{H}_0'(\boldsymbol{y}_i - \boldsymbol{X}_i\boldsymbol{\beta}_0)) = 0$, whereas GMM2 uses the Ahn et al. residual together with the additional moments $E(\boldsymbol{H}_0'\boldsymbol{Z}_i) = 0$. The GMM estimator using both sets of moments consistently outperforms the original Ahn et al. (2013) estimator in terms of both bias and standard deviation, implying that the additional moments are practically relevant in finite samples.

Before turning to a comparison of the pooled QLD and CCE estimators, I first investigate the performance of QLDP when $p_0$ is misspecified in estimation of $\boldsymbol{\theta}_0$. The simulation setting implies $p_0 = 2$, so I look at the performance of QLDP for $p = 1, 2, 3$. I reiterate that $p_0$ is given by the DGP and $p$ is the number of factors specified by the econometrician.
Table 3.2: Misspecifying $p_0$

                       Bias                          SD                           RMSE
              p=1     p=2     p=3         p=1     p=2     p=3         p=1     p=2     p=3
N = 50
  T = 4    0.2700  0.0078  0.0118      0.1677  0.1097  0.1466      0.3178  0.1100  0.1471
           0.4024  0.0029  0.0120      0.1814  0.1097  0.1561      0.4414  0.1098  0.1566
  T = 5    0.4662  0.0095  0.0154      0.3511  0.1005  0.1282      0.5836  0.1009  0.1291
           0.5372  0.0058  0.0119      0.4111  0.0950  0.1228      0.6764  0.0952  0.1234
  T = 6    0.1697  0.0074  0.0126      0.1534  0.0956  0.1239      0.2287  0.0959  0.1246
           0.5843  0.0132  0.0200      0.1516  0.1025  0.1222      0.6036  0.1034  0.1238
N = 300
  T = 4    0.2748 -0.0003  0.0000      0.0657  0.0424  0.0559      0.2826  0.0424  0.0559
           0.4087  0.0024  0.0030      0.0746  0.0411  0.0587      0.4154  0.0411  0.0588
  T = 5    0.5267  0.0008  0.0032      0.2545  0.0382  0.0491      0.5849  0.0383  0.0492
           0.5993  0.0007  0.0038      0.2953  0.0369  0.0474      0.6681  0.0369  0.0476
  T = 6    0.1484  0.0015  0.0027      0.0646  0.0392  0.0470      0.1618  0.0392  0.0471
           0.6191  0.0013  0.0020      0.0596  0.0406  0.0480      0.6220  0.0406  0.0480

Table 3.2 gives the results for the QLDP under the different specifications. My results track with previous simulation evidence provided by Ahn et al. (2013) and Breitung and Hansen (2020). Underestimating $p_0$ leads to substantial bias which does not decrease with $N$. However, overestimating $p_0$ leads to only slightly worse performance than correct specification. The bias is larger but decreases with $N$; in fact, even $N = 300$ gives reasonable bias for the $p = 3$ estimator. The $p = 3$ estimator also performs worse than the correctly specified estimator in terms of standard deviation, which is not surprising. Overall, I find evidence that overestimation of $p_0$ does not lead to substantial bias in estimation, but underestimating $p_0$ can.

I now turn to a comparison of the QLDP and CCEP estimators. Tables 3.3 and 3.4 look at the QLDP estimator compared to the CCEP estimator, where the QLD transformation is estimated under $p = p_0 = 2$. Table 3.3 contains results for $K = 2$ and Table 3.4 contains results for $K = 3$.
I include ๐พ = 3 because it demonstrates how CCE removes more information as ๐พ grows but QLD does not. First note that the CCEP is biased when ๐‘‡ = 3 as ๐พ + 1 = 3 and this order condition is not allowed. 80 However, the QLDP is still consistent here. Further, the QLD estimators takes ๐‘ 0 as known while the CCE estimators โ€œoverestimates" ๐‘ 0 with the cross-sectional averages, of which there are ๐พ + 1. One might suspect this overestimation leads to inefficiency which is demonstrated by the results of the simulations. The QLDP estimator consistently shows a 15%-25% decline in standard deviation over the CCE estimator. Further, the CCE identifying condition requires ๐‘‡ > ๐พ + 1 which causes severe bias when violated. The QLDP estimator significantly outperforms the CCEP estimator in every setting provided. Table 3.3: Pooled estimators, ๐พ = 2 Bias SD RMSE CCEP QLDP CCEP QLDP CCEP QLDP N = 50 T=3 -0.5525 0.0082 25.9618 0.1546 25.9676 0.1548 1.2734 0.0034 12.5824 0.1555 12.6467 0.1556 T=4 0.0118 0.0078 0.1466 0.1097 0.1471 0.1100 0.0120 0.0029 0.1561 0.1097 0.1566 0.1098 T=5 0.0197 0.0095 0.1220 0.1005 0.1236 0.1009 0.0089 0.0058 0.1152 0.0950 0.1155 0.0952 N = 300 T=3 0.0272 0.0024 2.7295 0.0580 2.7296 0.0581 0.9400 0.0026 3.3976 0.0585 3.5253 0.0585 T=4 0.0000 -0.0003 0.0559 0.0424 0.0559 0.0424 0.0030 0.0024 0.0587 0.0411 0.0588 0.0411 T=5 0.0050 0.0008 0.0464 0.0382 0.0467 0.0383 0.0027 0.0007 0.0441 0.0369 0.0442 0.0369 Comparing table 3.3 to table 3.1, the QLDP performs much better than either of the GMM estimators despite the fact that we know they are using valid instruments. That the QLDP has better finite-sample performance than the overidentified systems from Ahn et al. (2013) is most likely due to the fact that it uses a smaller, just identified system of moments. See the Appendix for additional simulations including larger values of ๐‘‡. 
Finally, I investigate the performance of the mean group quasi-long-differencing (QLDMG) and mean group common correlated effects (CCEMG) estimators. The QLDMG estimator is given by equation (3.3.4), and the CCEMG estimator is identical to the QLDMG estimator but with $\boldsymbol{M}_{\hat{\boldsymbol{F}}}$ in place of $\hat{\boldsymbol{H}}\hat{\boldsymbol{H}}'$. Consistency is proved in Pesaran (2006) but, like the pooled estimator, will eventually require a modern treatment which either controls for the asymptotic degeneracy in $\boldsymbol{M}_{\hat{\boldsymbol{F}}}$, as in Karabiyik et al. (2017) and Westerlund et al. (2019), or assumes full rank limits, as in Brown et al. (2021).

Table 3.4: Pooled estimators, $K = 3$

                      Bias                    SD                     RMSE
               CCEP     QLDP         CCEP     QLDP         CCEP     QLDP
N = 50
  T = 3      0.0875   0.0076       3.0883   0.1586       3.0895   0.1588
             1.0809   0.0094       2.2956   0.1594       2.5373   0.1597
             0.3240  -0.0018       7.6585   0.1560       7.6654   0.1560
  T = 4      0.1574   0.0041       3.1025   0.1105       3.1065   0.1106
             1.1709   0.0140       3.2437   0.1107       3.4486   0.1116
            -0.2552  -0.0047       6.7375   0.1089       6.7423   0.1090
  T = 5      0.0151   0.0066       0.1530   0.0986       0.1537   0.0988
             0.0039   0.0031       0.1495   0.0979       0.1495   0.0979
            -0.0072  -0.0041       0.1408   0.0958       0.1410   0.0959
N = 300
  T = 3      1.9936   0.0030      61.6795   0.0580      61.7117   0.0581
             2.5873   0.0007      45.5170   0.0578      45.5905   0.0578
            -0.8012   0.0017      17.5764   0.0570      17.5947   0.0570
  T = 4      0.0011   0.0008       0.0601   0.0397       0.0601   0.0397
             0.0028   0.0001       0.0559   0.0394       0.0560   0.0394
             0.0035   0.0009       0.0571   0.0378       0.0572   0.0378
  T = 5      0.0064   0.0028       2.0502   0.0400       2.0502   0.0401
             1.0163   0.0020       0.9861   0.0414       1.4160   0.0414
            -0.0826   0.0006       3.6462   0.0400       3.6471   0.0400

Table 3.5 contains the results for the mean group estimators, where the QLD transformation is estimated assuming $p = p_0 = 2$. I start at $T = 5$ so that $T - p_0 > p_0$ and the CCEMG estimator is well-defined. Despite $T > 2K + 1$ for each setting, the CCEMG estimator exhibits substantial bias when $T = 6$, though the QLDMG estimator appears unbiased. The QLDMG outperforms the CCEMG in terms of RMSE for each $N$ and $T$ besides $N = 600$ and $T = 8$.
We would expect the CCEMG to perform well relative to the QLDMG as $T$ grows due to the incidental parameters problem in the first-stage QLD estimation. However, even for moderately low values of $N$ and large values of $T$, the QLDMG retains good properties.

Table 3.5: Mean group estimators

                       Bias                      SD                       RMSE
              CCEMG    QLDMG          CCEMG    QLDMG          CCEMG    QLDMG
N = 50
  T = 5     -1.5703  -0.0055        34.8038   0.4837        34.8392   0.4837
            -0.4832   0.0256        18.2402   0.6523        18.2466   0.6529
  T = 6      0.0324   0.0056         0.4630   0.1737         0.4641   0.1738
             0.0256   0.0044         0.3774   0.1820         0.3782   0.1820
  T = 7      0.0187   0.0156         0.1670   0.1658         0.1681   0.1665
             0.0113   0.0102         0.1628   0.1574         0.1632   0.1577
N = 300
  T = 5     -1.2597  -0.0039        27.7644   0.1537        27.7929   0.1537
             1.1968  -0.0030        34.6115   0.1420        34.6322   0.1420
  T = 6     -0.0077   0.0039         0.2846   0.0767         0.2847   0.0768
             0.0116  -0.0004         0.1768   0.0745         0.1772   0.0745
  T = 7      0.0003   0.0000         0.0649   0.0641         0.0649   0.0641
             0.0010   0.0009         0.0677   0.0595         0.0677   0.0595

3.5.2 Comparison to TWFE

Theorem 3.3.3 suggests a certain robustness property of the QLDP estimator relative to the traditional TWFE estimator. If the factor structure reduces to the traditional two-way error $\boldsymbol{f}_t'\boldsymbol{\gamma}_i + u_{it} = \gamma_i + f_t + u_{it}$, the QLDP can accommodate the time and individual fixed effects without Assumption 2 holding. If one regresses out a heterogeneous intercept and estimates $\hat{\boldsymbol{\theta}}$ assuming $p = K + 1$, the QLDP estimator will be consistent even though it is nonlinear in the unobserved effects. I first demonstrate that TWFE is inconsistent in the presence of an arbitrary factor structure. The DGP is the same as in Section 3.5.1, so the QLDP results are identical to Table 3.2. TWFE performs poorly as expected. I then generate the data according to the two-way error model, so that
\[
y_{it} = x_{it1} + x_{it2} + t + \gamma_i + u_{it}
\]
where $t$ is the time effect and $\gamma_i \sim N(1, 1)$ is the individual effect.
The covariates are generated as
\[
x_{it1} \sim \mathrm{Poisson}(|c_i + t|)
\]
\[
x_{it2} \sim U(0, \log((c_i + t)^2))
\]
so that Assumption 2 does not hold.

Table 3.6: AR(1) factor structure

                      Bias                    SD                     RMSE
               TWFE     QLDP         TWFE     QLDP         TWFE     QLDP
N = 50
  T = 3      0.0791   0.0082       0.1366   0.1546       0.1578   0.1548
             0.8684   0.0034       0.1339   0.1555       0.8787   0.1556
  T = 4      0.1148   0.0078       0.1351   0.1097       0.1773   0.1100
             0.8321   0.0029       0.1330   0.1097       0.8427   0.1098
  T = 5      0.1116   0.0095       0.1290   0.1005       0.1706   0.1009
             0.8107   0.0058       0.1302   0.0950       0.8211   0.0952
N = 300
  T = 3      0.0765   0.0024       0.0528   0.0580       0.0929   0.0581
             0.8851   0.0026       0.0513   0.0585       0.8865   0.0585
  T = 4      0.1089  -0.0003       0.0527   0.0424       0.1210   0.0424
             0.8321   0.0024       0.0527   0.0411       0.8337   0.0411
  T = 5      0.1119   0.0008       0.0529   0.0382       0.1238   0.0383
             0.8055   0.0007       0.0530   0.0369       0.8073   0.0369

The simulation results in Table 3.7 compare TWFE to QLDP when $\hat{\boldsymbol{\theta}}$ is computed with $p = K + 1$ (despite the fact that $p_0 = 1$) and after removing a random intercept from $\boldsymbol{X}_i$ and $\boldsymbol{y}_i$ unit by unit. That is, let $\boldsymbol{M}$ be the $T \times T$ within transformation. I compute $\hat{\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\beta}}_{QLDP}$ with $\boldsymbol{y}_i^*$ and $\boldsymbol{X}_i^*$, where $\boldsymbol{y}_i^* = \boldsymbol{M}\boldsymbol{y}_i$ and $\boldsymbol{X}_i^* = \boldsymbol{M}\boldsymbol{X}_i$. The time effects are irrelevant because the QLDP estimator is the same whether or not they are controlled for in the regression.
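The unit-by-unit removal of a random intercept described above is just a premultiplication by the within transformation $\boldsymbol{M} = \boldsymbol{I}_T - \boldsymbol{1}\boldsymbol{1}'/T$. A minimal sketch (names hypothetical):

```python
import numpy as np

def within_transform(y, X):
    """Remove unit-specific intercepts by premultiplying each (y_i, X_i)
    with M = I_T - 11'/T, as in the TWFE-robustness comparison."""
    T = y.shape[1]
    M = np.eye(T) - np.ones((T, T)) / T
    y_star = y @ M.T                             # (M y_i)' stacked by rows
    X_star = np.einsum('ts,nsk->ntk', M, X)      # M X_i for each unit
    return y_star, X_star
```

The transformed panel can then be passed to the first-stage estimation of $\boldsymbol{\theta}$ and to the pooled QLD routine.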
Table 3.7: TWFE specification

                      Bias                    SD                     RMSE
               TWFE     QLDP         TWFE     QLDP         TWFE     QLDP
N = 50
  T = 4     -0.0004  -0.0044       0.0284   0.0388       0.0284   0.0390
            -0.0006  -0.0013       0.0184   0.0276       0.0184   0.0277
  T = 5     -0.0010  -0.0022       0.0240   0.0300       0.0240   0.0301
             0.0000  -0.0015       0.0142   0.0196       0.0142   0.0197
  T = 6     -0.0004  -0.0022       0.0199   0.0251       0.0199   0.0252
             0.0007  -0.0013       0.0126   0.0157       0.0127   0.0157
N = 300
  T = 4     -0.0003  -0.0004       0.0106   0.0142       0.0106   0.0142
             0.0003  -0.0005       0.0061   0.0086       0.0061   0.0086
  T = 5     -0.0001  -0.0004       0.0092   0.0116       0.0092   0.0116
            -0.0002  -0.0001       0.0054   0.0072       0.0054   0.0072
  T = 6      0.0001   0.0001       0.0082   0.0105       0.0082   0.0105
            -0.0002  -0.0005       0.0048   0.0065       0.0048   0.0065

While the TWFE estimator is clearly superior in terms of both bias and standard deviation when $N$ is small, the QLDP shows promising results. When $N = 300$, the two estimators are nearly indistinguishable in terms of their bias. The QLDP's RMSE is inflated because of its higher variance, but this result is unsurprising, as it is a more conservative estimator which is trying to eliminate more heterogeneity. However, it performs comparably well even though it removes more variation from the data than is needed.

3.6 Application

I evaluate the effect of expenditure per student on standardized test performance. I consider school district-level data from the state of Michigan over the school years 1995-2001. The state of Michigan reformed education expenditure in 1994 to bring poorly-funded schools to parity with wealthier schools. See Papke (2005) for a comprehensive discussion of the data and institutional details. There are $N = 501$ school districts observed for $T = 7$ school years over 1995-2001. I present summary statistics and descriptions for the variables of interest.

Variable    Mean      Std. Dev.   Description
math4       0.6939    0.1515      Fraction of fourth graders who pass the MEAP math test
lunch       0.2886    0.1616      Fraction of students eligible for free and reduced lunch
enroll      3112.31   7965.49     Total enrollment
  avgrexp    6385.51   1034.94     Average real expenditure per pupil.

The outcome variable, math4, denotes the pass rate for fourth-grade students taking a standardized math test and stands as a measure of student achievement. Michigan students undertake a battery of standardized tests in elementary, junior, and secondary school. Like Papke (2005) and Papke and Wooldridge (2008), I focus on the fourth-grade math test because it has been consistently defined and measured over the observed time periods.

The primary variable of interest is average expenditure per pupil, as its coefficient measures the effect of additional expenditure on test scores. Starting in the 1994/1995 school year, the state of Michigan began awarding so-called "foundation grants" which were based on the per-student spending of the school district in the previous year. The goal was to eventually bring schools up to a benchmark "basic foundation" amount which increased over time. The state started by awarding foundation grants to increase expenditure to a minimum of $4200 per student or an additional $250 per student, whichever was higher. By 2000, the minimum and benchmark amounts were equal at $5700. Expenditures per pupil were averaged over the current year as well as the previous three, meaning average real expenditure per pupil in 1995 is an average of expenditure in 1992, 1993, 1994, and 1995. The equation of interest is
\[
math4_{it} = c_i + \log(avgrexp_{it})\beta_1 + lunch_{it}\beta_2 + \log(enroll_{it})\beta_3 + f_t'\gamma_i + e_{it}, \qquad (3.6.1)
\]
which is similar to Papke (2005). I collect $lunch_{it}$, $\log(enroll_{it})$, and $\log(avgrexp_{it})$ and use the reduced form CCE equation from Assumption 2 to implement the pooled QLD estimator. This specification allows me to test for the number of factors. I also use the Ahn et al.
(2013) GMM function to test for $p_0$, with and without the CCE equations. Table 3.8 provides the p-values for testing the hypothesis $H_0: p_0 = p$ versus $H_1: p_0 > p$.

Table 3.8: Testing for $p_0$ (p-values)

               RF       GMM1     GMM2
  p_0 = 0    0.0000   0.0000   0.0000
  p_0 = 1    0.0000   0.0000   0.0000
  p_0 = 2    0.0000   0.4852   0.0000
  p_0 = 3    0.0000   0.1157   0.0000

A rejection of the hypothesis suggests more factors than the tested value, and a failure to reject suggests the current value is correct. The titles 'GMM1', 'GMM2', and 'RF' (for reduced form) refer to the respective objective function used to test the relevant hypothesis. I stress that testing for $p_0$ comes from a long-established literature, briefly described in Ahn et al. (2013). The only new concept I introduce with respect to this specific specification test is using the reduced form moments $E(H_0'Z_i) = 0$. GMM1 is just the Ahn et al. (2013) objective function from equation (3.2.7). GMM2 is the Ahn et al. objective function with the additional moments $E(H_0'Z_i) = 0$. Finally, RF uses only the reduced form moments $E(H_0'Z_i) = 0$.

GMM1 suggests that the correct number of factors is $p_0 = 2$. GMM2 and RF both reject $p_0 = 2$ at any reasonable confidence level, and GMM2 rejects $p_0 = 3$, though it uses a much larger set of moments than the other two, which may decrease power. It may suffer from the same global identification problems discussed in Hayakawa (2016), which suggests the GMM1 test will perform better in practice. I stop testing at $p_0 = 3$ because RF is just identified at $p_0 = 4$. Regardless of the tests, the moments $E(H_0'Z_i) = 0$ only allow me to estimate up to four factors. Even if $p_0 > 4$, the QLDP nets more unobserved heterogeneity than TWFE. For the purpose of comparison with the pooled QLD estimator, I include the TWFE estimator and the pooled CCE estimator.
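The sequential logic behind the table (test $H_0: p_0 = p$ for $p = 0, 1, 2, \ldots$ and stop at the first failure to reject) can be sketched as follows; the p-values are the GMM1 column of Table 3.8, and the helper function is illustrative, not code from the dissertation:

```python
def select_num_factors(pvalues, alpha=0.05):
    """Return the first tested p for which H0: p0 = p is not rejected."""
    for p, pval in enumerate(pvalues):
        if pval > alpha:
            return p
    return len(pvalues)   # every tested value was rejected

# GMM1 column of Table 3.8, for p0 = 0, 1, 2, 3
gmm1 = [0.0000, 0.0000, 0.4852, 0.1157]
print(select_num_factors(gmm1))   # 2, matching the conclusion in the text
```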
As ๐‘‡ = 7 and ๐พ = 3, the CCE estimator can accommodate both ๐‘ฟ, ๐’š, and a heterogeneous intercept in ๐‘ญ. b Further, the pooled QLD estimator is computed with ๐‘ = ๐พ = 3 after eliminating a heterogeneous intercept from ๐‘ฟ๐‘– and ๐’š๐‘– , unit-by-unit. As such, QLDP is a natural comparison to TWFE. Theorem 3.3.3 tells us that ๐œท b๐‘„๐ฟ๐ท๐‘ƒ is invariant to common variables when ๐‘ = ๐พ. Since it also eliminates a heterogeneous intercept, it will be consistent if TWFE is consistent, assuming strictly exogenous covariates. I present results in table 3.9 which shows estimation after eliminating a heterogeneous intercept. For CCE, this simply amounts to ๐‘ญ b = (1, ๐’š, ๐‘ฟ). For QLDP, I project out the intercept from each ๐‘ฟ๐‘– and ๐’š๐‘– via the within transformation before estimating. Standard errors are in parentheses while p-values are in brackets. The reported standard errors are generated via the panel nonparametric bootstrap. The QLDP estimator suggests substantial estimates for the effect of per student expenditures. A 10% increase in the average expenditure per student is associated with an 8.3 percentage point increase in the math test pass rate, with a p-value of 0.0009. This estimate is more than twice as large as the TWFE estimate and more than three halves the CCEP estimate. These results suggest that TWFE is not adequately controlling for the heterogeneity present in the data set. Both the CCE and QLDP estimates are statistically significant at the 5% level. The TWFE standard errors are generally smaller than CCE and QLD because it removes less variation from the data. I also considered estimation via the mean group QLD and CCE estimators. 
Table 3.9: Controlling for heterogeneous intercept

                   TWFE       CCEP       QLDP
  lunch          -0.0419     0.0398    -0.1576
                 (0.0730)   (0.1367)   (0.1637)
                 [0.5658]   [0.7709]   [0.3381]
  log(enroll)     0.0021    -0.0592     0.0268
                 (0.0487)   (0.1497)   (0.2152)
                 [0.9663]   [0.6924]   [0.8838]
  log(avgrexp)    0.3771     0.5409     0.8287
                 (0.0704)   (0.2695)   (0.3785)
                 [0.0000]   [0.0446]   [0.0303]

However, both parameter estimates and standard errors were unreasonable compared to the other estimators. In fact, the p-values were significantly larger than in any other reported case and suggested a critical lack of precision. Recall that the mean group estimators require much stronger exogeneity and identifying conditions than the pooled estimators.

3.7 Conclusion

This paper considers fixed-$T$ estimation of linear panel data models where the errors have a general unknown factor structure. I use the quasi-long-difference transformation studied by Ahn et al. (2013) to eliminate the factor structure and provide moment conditions for estimation. For the purpose of comparison with the popular pooled common correlated effects estimator, I study the moments implied by assuming a pure factor structure in the covariates. Applying the QLD transformation to the independent variables improves the efficiency of estimating the parameters of interest in the main equation; this is information that pooled CCE does not use.

Current proofs of fixed-$T$ asymptotic normality of the pooled CCE estimator assume loadings which are strictly exogenous with respect to the idiosyncratic errors in the independent variables. I show that the uncorrelated loadings assumption implies the existence of an even larger number of moments which CCE neglects. Ultimately, if one makes the strong assumptions sufficient for asymptotic normality of pooled CCE in Westerlund et al. (2019), one should fully consider the information available for efficient estimation.
Regardless, I provide robust standard errors in a more general and appealing setting than the CCE models in Pesaran (2006) and Westerlund et al. (2019). I apply the moment-based perspective to a heterogeneous slopes model similar to the original Pesaran (2006) setting. I prove consistency and asymptotic normality of pooled and mean group estimators based on the QLD transformation which place no restrictions on the relationship between $T$ and $K$, in contrast to CCE. These estimators are shown to outperform CCE estimators in finite samples, even when $N$ is small. The pooled QLD estimator also has the desirable property of invariance to common variables, like time trends and macroeconomic indicators, when the estimated number of factors equals the number of regressors. I reexamine the effect of school district expenditures on standardized test performance and find significantly larger effects of educational spending compared to simple fixed effects regression. These estimates are also reported with reasonable precision, which suggests that applied researchers are not adequately controlling for heterogeneity in their data.

One important direction for future work concerns the overestimation of $p_0$. It is known that CCE is robust to $K + 1 > p_0$. Moon and Weidner (2015) prove that principal components estimation is also robust to overestimating the number of factors, provided $T$ is large. However, while there is ample simulation evidence suggesting the robustness of QLD to such a failure, a formal proof is lacking. It would also be useful to investigate the robustness of the QLDP estimators to failure of the reduced form equation in Assumption 2. Finally, the methods presented in this paper all assume balanced panels. Missing data causes challenges in constructing the CCE and QLD transformations.
It is not clear how even a complete-cases estimator would work, as the cross-sectional averages and first-stage estimator of $\hat\theta$ require all time periods for each unit in the sample.

APPENDIX

PROOFS FOR CHAPTER 1

This appendix collects proofs of the formal results stated in the text.

Proof of Lemma 1.3.1. From equation (1.3.14), Assumptions WV.1 and WV.2 imply
\[
\text{Var}(y_i | x_i, c_i) = \alpha c_i M_i^{1/2} R M_i^{1/2}.
\]
By the law of total variance,
\[
\begin{aligned}
\text{Var}(y_i | x_i) &= E[\text{Var}(y_i | x_i, c_i) | x_i] + \text{Var}[E(y_i | x_i, c_i) | x_i] \\
&= E\bigl[\alpha c_i M_i^{1/2} R M_i^{1/2} \bigm| x_i\bigr] + \text{Var}(c_i m_i | x_i) \\
&= \alpha \mu_c(x_i) M_i^{1/2} R M_i^{1/2} + \sigma_c^2(x_i) m_i m_i'. \qquad (.0.1)
\end{aligned}
\]
To simplify notation in what follows, write $\mu_i \equiv \mu_c(x_i)$ and $\sigma_i^2 \equiv \sigma_c^2(x_i)$. To derive $\Omega_i^{-1}$, we apply an implication of Sherman and Morrison (1950): for a nonsingular $T \times T$ matrix $A$ and $T \times 1$ vector $b$,
\[
(A + bb')^{-1} = A^{-1} - \frac{1}{1 + b'A^{-1}b} A^{-1}bb'A^{-1}, \qquad (.0.2)
\]
which can be verified by direct multiplication. Take $A \equiv \alpha\mu_i M_i^{1/2}RM_i^{1/2}$ and $b \equiv \sigma_i m_i$ in (.0.2), and note that $\bigl[\alpha\mu_i M_i^{1/2}RM_i^{1/2}\bigr]^{-1} = M_i^{-1/2}R^{-1}M_i^{-1/2}/(\alpha\mu_i)$ and $M_i^{-1/2}m_i = \sqrt{m}_i$.
Therefore,
\[
\begin{aligned}
\Omega_i^{-1} &= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}M_i^{-1/2} - \frac{1}{1 + \sigma_i^2\sqrt{m}_i'R^{-1}\sqrt{m}_i/(\alpha\mu_i)}\,\sigma_i^2\,M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2}/(\alpha\mu_i)^2 \\
&= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}M_i^{-1/2} - \frac{\sigma_i^2}{\alpha\mu_i\bigl(\alpha\mu_i + \sigma_i^2\sqrt{m}_i'R^{-1}\sqrt{m}_i\bigr)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2} \\
&= \frac{1}{\alpha\mu_i}M_i^{-1/2}\left\{R^{-1} - \frac{\sigma_i^2}{\alpha\mu_i + \sigma_i^2\sqrt{m}_i'R^{-1}\sqrt{m}_i}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}\right\}M_i^{-1/2}. \quad \square
\end{aligned}
\]

Proof of Theorem 1.3.1. Simplify the notation by defining $D_i \equiv D_o(x_i)$, $V_i \equiv V_o(x_i)$, $\mu_i \equiv \mu_c(x_i)$, $\sigma_i^2 \equiv \sigma_c^2(x_i)$, and drop dependences on $\beta_0$. With this simplified notation,
\[
V_i^- = \Omega_i^{-1} - \Omega_i^{-1}m_i\bigl(m_i'\Omega_i^{-1}m_i\bigr)^{-1}m_i'\Omega_i^{-1}
\]
and, from Lemma 1.3.1,
\[
\Omega_i^{-1} = \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}M_i^{-1/2} - \frac{\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2},
\]
where $a_i \equiv \sqrt{m}_i'R^{-1}\sqrt{m}_i$.
Therefore, because $M_i^{-1/2}m_i = \sqrt{m}_i$,
\[
\begin{aligned}
\Omega_i^{-1}m_i &= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}\sqrt{m}_i - \frac{\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\,\sqrt{m}_i'R^{-1}\sqrt{m}_i \\
&= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}\sqrt{m}_i - \frac{a_i\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i \\
&= \left[\frac{1}{\alpha\mu_i} - \frac{a_i\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}\right]M_i^{-1/2}R^{-1}\sqrt{m}_i \\
&= \frac{\alpha\mu_i + a_i\sigma_i^2 - a_i\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i \\
&= \frac{1}{\alpha\mu_i + a_i\sigma_i^2}M_i^{-1/2}R^{-1}\sqrt{m}_i.
\end{aligned}
\]
Also,
\[
m_i'\Omega_i^{-1}m_i = \frac{1}{\alpha\mu_i + a_i\sigma_i^2}\sqrt{m}_i'R^{-1}\sqrt{m}_i = \frac{a_i}{\alpha\mu_i + a_i\sigma_i^2}.
\]
It follows that
\[
\Omega_i^{-1}m_i\bigl(m_i'\Omega_i^{-1}m_i\bigr)^{-1}m_i'\Omega_i^{-1} = \frac{1}{a_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2}.
\]
Plugging into $V_i^-$ gives
\[
\begin{aligned}
V_i^- &= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}M_i^{-1/2} - \frac{\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2} \\
&\qquad - \frac{1}{a_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2} \\
&= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}M_i^{-1/2} - \frac{\alpha\mu_i + a_i\sigma_i^2}{a_i\,\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2} \\
&= \frac{1}{\alpha\mu_i}M_i^{-1/2}\left[R^{-1} - \frac{1}{a_i}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}\right]M_i^{-1/2},
\end{aligned}
\]
which completes the result for $V_i^-$.
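As a numerical sanity check on the closed form for $V_i^-$, the sketch below (illustrative values only, not code from the dissertation) builds $\Omega_i = \alpha\mu_i M_i^{1/2}RM_i^{1/2} + \sigma_i^2 m_i m_i'$ for a random design and confirms that the direct formula $\Omega_i^{-1} - \Omega_i^{-1}m_i(m_i'\Omega_i^{-1}m_i)^{-1}m_i'\Omega_i^{-1}$ matches the factored expression derived above:

```python
import numpy as np

rng = np.random.default_rng(1)
T, alpha, mu, sigma2 = 4, 1.3, 0.7, 0.5
m = rng.uniform(0.5, 2.0, T)               # conditional mean vector m_i
s = np.sqrt(m)                             # elementwise sqrt(m_i)
C = rng.standard_normal((T, T))
R = C @ C.T + T * np.eye(T)                # positive definite 'working' matrix
R /= np.sqrt(np.outer(np.diag(R), np.diag(R)))   # rescale to unit diagonal

Mh = np.diag(s)                            # M_i^{1/2}
Omega = alpha * mu * Mh @ R @ Mh + sigma2 * np.outer(m, m)
Oi = np.linalg.inv(Omega)

# Direct computation of V_i^-
V_direct = Oi - np.outer(Oi @ m, Oi @ m) / float(m @ Oi @ m)

# Closed form: (1/(alpha*mu)) M^{-1/2} [R^{-1} - (1/a) R^{-1} s s' R^{-1}] M^{-1/2}
Rinv = np.linalg.inv(R)
a = float(s @ Rinv @ s)
Mih = np.diag(1.0 / s)                     # M_i^{-1/2}
V_closed = (Mih @ (Rinv - np.outer(Rinv @ s, Rinv @ s) / a) @ Mih) / (alpha * mu)

print(np.allclose(V_direct, V_closed))     # True
```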
From (1.3.10), the optimal IVs are
\[
D_i'V_i^- = -\mu_i\nabla_\beta m_i'V_i^- = -\frac{1}{\alpha}\nabla_\beta m_i'\,M_i^{-1/2}\left[R^{-1} - \frac{1}{a_i}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}\right]M_i^{-1/2},
\]
and we can drop $-1/\alpha$ and factor out $M_i^{-1/2}$ to get the result. $\square$

Proof of Corollary 1.3.1. Putting $R = I_T$ into (1.3.17) and using simple algebra gives the optimal IVs as
\[
Z^*(x_i)' = \nabla_\beta m_i(\beta_0)'\left(M_i^{-1} - \frac{1}{\sum_{r=1}^T m_{ir}}\mathbf{1}_T\mathbf{1}_T'\right).
\]
We show that this choice of instruments leads to the FEP first order condition, as expressed by Wooldridge (1999), using the definition of $W_i$ given in Section 1.2:
\[
\nabla_\beta p_i(\beta_0)'W_i = \nabla_\beta m_i(\beta_0)'\bigl[I_T - \mathbf{1}_Tp_i(\beta_0)'\bigr]M_i^{-1}.
\]
To see the equivalence, note that
\[
\mathbf{1}_Tp_i(\beta_0)'M_i^{-1} = \frac{1}{\sum_{r=1}^T m_{ir}}\begin{pmatrix}m_i' \\ m_i' \\ \vdots \\ m_i'\end{pmatrix}M_i^{-1} = \frac{1}{\sum_{r=1}^T m_{ir}}\mathbf{1}_T\mathbf{1}_T',
\]
and so
\[
\nabla_\beta p_i(\beta_0)'W_i = \nabla_\beta m_i(\beta_0)'\left(M_i^{-1} - \frac{1}{\sum_{r=1}^T m_{ir}}\mathbf{1}_T\mathbf{1}_T'\right) = Z^*(x_i)'. \quad \square
\]

APPENDIX

PROOFS FOR CHAPTER 2

Proof of Lemma 2.2.3. Let $p_{it}(\beta) = m_t(x_{it}, \beta)\bigl(\sum_{s=1}^T m_s(x_{is}, \beta)\bigr)^{-1}$, $p_i(\beta) = (p_{i1}(\beta), \ldots, p_{iT}(\beta))'$, and $n_i = \sum_{s=1}^T y_{is}$. Let $\mathbf{1}$ be a $T \times 1$ vector of ones. First I directly show the conclusion holds for $I_T - p_i(\beta)\mathbf{1}'$, which satisfies the lemma's assumption. It also satisfies Assumption MAT, which is made clear in Section 2.3.
I need the following derivation:
\[
\begin{aligned}
\nabla_\beta p_{it} &= \Bigl(\sum_{r=1}^T m_{ir}(x_{ir}, \beta)\Bigr)^{-2}\Bigl(\nabla_\beta m_{it}(x_{it}, \beta)\sum_{r=1}^T m_{ir}(x_{ir}, \beta) - m_{it}(x_{it}, \beta)\sum_{r=1}^T\nabla_\beta m_{ir}(x_{ir}, \beta)\Bigr) \\
&= \Bigl(\sum_{r=1}^T m_{ir}(x_{ir}, \beta)\Bigr)^{-1}\Bigl(\nabla_\beta m_{it}(x_{it}, \beta) - p_{it}(\beta)\sum_{r=1}^T\nabla_\beta m_{ir}(x_{ir}, \beta)\Bigr).
\end{aligned}
\]
Stacking the $T$ equations gives
\[
\begin{aligned}
\nabla_\beta p_i(\beta) &= \Bigl(\sum_{r=1}^T m_{ir}(x_{ir}, \beta)\Bigr)^{-1}\bigl(\nabla_\beta m_i(\beta) - p_i(\beta)\mathbf{1}'\nabla_\beta m_i(\beta)\bigr) \\
&= \Bigl(\sum_{r=1}^T m_{ir}(x_{ir}, \beta)\Bigr)^{-1}\bigl(I_T - p_i(\beta)\mathbf{1}'\bigr)\nabla_\beta m_i(\beta).
\end{aligned}
\]
As $E(-n_i|x_i) = -\mu_c(x_i)\sum_{r=1}^T m_{ir}(x_{ir}, \beta_0)$, evaluating the derivative at $\beta_0$ and multiplying by $E(-n_i|x_i)$ yields the final result.

Now let $A(x_i, \beta)$ be an $L \times T$ matrix satisfying the assumption of the lemma: $A(x_i, \beta)(I_T - p_i(\beta)\mathbf{1}') = A(x_i, \beta)$ for all $\beta$ near $\beta_0$. Then writing $g(x_i, \beta) = (I_T - p_i(\beta)\mathbf{1}')y_i$, we have for all $\beta$ near $\beta_0$
\[
\begin{aligned}
E\bigl(\nabla_\beta(A(x_i, \beta)y_i)\big|x_i\bigr) &= E\bigl(\nabla_\beta(A(x_i, \beta)g(x_i, \beta))\big|x_i\bigr) \\
&= \nabla_\beta A(x_i, \beta)E(g(x_i, \beta)|x_i) + A(x_i, \beta)E(\nabla_\beta g(x_i, \beta)|x_i).
\end{aligned}
\]
Evaluating at $\beta_0$ yields $E(\nabla_\beta A(x_i, \beta_0)y_i|x_i) = A(x_i, \beta_0)\nabla_\beta m_i(\beta_0)$, since $E(g(x_i, \beta_0)|x_i) = 0$ and $E(\nabla_\beta g(x_i, \beta_0)|x_i) = (I_T - p_i(\beta_0)\mathbf{1}')\nabla_\beta m_i(\beta_0)$.
$\square$

Proof of Lemma 2.3.1. Write $E(y_iy_i'|x_i) = \Sigma_i$. Then for any $(T-1) \times T$ transformation $A(x_i, \beta_0)$ with rank $T-1$,
\[
\begin{aligned}
\mathrm{Rank}\bigl(A(x_i,\beta_0)\Sigma_iA(x_i,\beta_0)'\bigr) &= \mathrm{Rank}\bigl((A(x_i,\beta_0)\Sigma_i^{1/2})(A(x_i,\beta_0)\Sigma_i^{1/2})'\bigr) \\
&= \mathrm{Rank}\bigl(A(x_i,\beta_0)\Sigma_i^{1/2}\bigr) = \mathrm{Rank}\bigl(A(x_i,\beta_0)\bigr) = T-1,
\end{aligned}
\]
as $\Sigma_i^{1/2}$ is $T \times T$ and full rank. Thus the conditional variance is nonsingular and (2.2.4) holds with a proper inverse. Any generalized differencing residual with transformation satisfying Assumption RK.1 has a nonsingular conditional variance. This result applies to $Q(I_T - p_i(\beta_0)\mathbf{1}')$ and $Q(I_T - m_i(\beta_0)(m_i(\beta_0)'m_i(\beta_0))^{-1}m_i(\beta_0)')$ since their full transformations have rank $T-1$. Lemma 1 of Verdier (2018) shows $\mathrm{Rank}(I_T - p_i(\beta_0)\mathbf{1}') = T-1$; the rank of the residual-maker transformation is a well-known result.

First note that $V_i^-m_i(\beta_0) = 0$ by construction. As
\[
p_i(\beta_0)\mathbf{1}'\Bigl(I_T - \frac{1}{a_i}m_i(\beta_0)m_i(\beta_0)'\Sigma_i^{-1}\Bigr) = 0, \qquad
\bigl(I_T - m_i(\beta_0)(m_i(\beta_0)'m_i(\beta_0))^{-1}m_i(\beta_0)'\bigr)m_i(\beta_0) = 0,
\]
the conditional gradients are given as
\[
(I_T - p_i(\beta_0)\mathbf{1}')\nabla_\beta m_i(\beta_0), \qquad
\bigl(I_T - m_i(\beta_0)(m_i(\beta_0)'m_i(\beta_0))^{-1}m_i(\beta_0)'\bigr)\nabla_\beta m_i(\beta_0)
\]
by Lemma 2.2.3. Then the systems defined by Assumption SYS for both transformations are consistent with $F(x_i) = V_i^-\nabla_\beta m_i(\beta_0)$, and the singularity assumption in Assumption RK.2 guarantees both efficiency bounds exist.
$\square$

Proof of Theorem 2.3.1. As mentioned in the text, Assumptions CM, RK.1, RK.2, and the positive definiteness of $E(y_iy_i'|x_i)$ are sufficient for each of the transformations studied to satisfy Assumptions SYS and ORTH (and thus MAT), so that their asymptotic efficiency bounds are well-defined and given by (2.2.8). Let $B_i$ be one of the full-rank $(T-1) \times T$ transformations (evaluated at $x_i$ and $\beta_0$) studied. $B_i$ could be the generalized within transformation, or either the generalized within or residual-maker transformation with any arbitrary row deleted. I will prove the theorem by showing each of these transformations is information equivalent to the full generalized within transformation via Theorem 1, noting that a similar proof holds for the full residual-maker transformation.

Write $\Sigma_i = E(y_iy_i'|x_i)$. Since each of the potential $B_i$ matrices satisfies Assumption ORTH, its efficiency bound is given by (2.2.8):
\[
E\bigl(\nabla_\beta m_i(\beta_0)'B_i'(B_i\Sigma_iB_i')^{-1}B_i\nabla_\beta m_i(\beta_0)\bigr)^{-1}.
\]
In the notation of Theorem 1, let $V_i = (I_T - p_i(\beta_0)\mathbf{1}')\Sigma_i(I_T - \mathbf{1}p_i(\beta_0)')$ and $M_i = (I_T - p_i(\beta_0)\mathbf{1}')$. $B_iM_i = B_i$ as $B_ip_i(\beta_0) = 0$ by Assumption CM. Also $\mathrm{Rank}(M_iV_iM_i') = \mathrm{Rank}(V_i) = T-1 = \mathrm{Rank}(M_i)$, so Assumption GR.1 holds for the same $M_i$ regardless of $B_i$. As $B_iV_iB_i' = B_i\Sigma_iB_i'$, we have $\mathrm{Rank}(B_iV_iB_i') = T-1 = \mathrm{Rank}(B_i)$, so Assumption GR.2 holds. Thus by Theorem 1,
\[
B_i'(B_i\Sigma_iB_i')^{-1}B_i = M_i'(M_i\Sigma_iM_i')^-M_i.
\]
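This invariance can be illustrated numerically. In the sketch below (illustrative values only, not code from the dissertation), $M_i = I_T - p\mathbf{1}'$ with $\mathbf{1}'p = 1$, $B_i$ is $M_i$ with one row deleted, and the Moore-Penrose pseudoinverse stands in for the generalized inverse:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5
m = rng.uniform(0.5, 2.0, T)
p = m / m.sum()                            # p_i(beta0), so 1' p = 1
M = np.eye(T) - np.outer(p, np.ones(T))    # I_T - p 1', rank T - 1
B = M[:-1, :]                              # delete an arbitrary row: (T-1) x T

C = rng.standard_normal((T, T))
Sigma = C @ C.T + T * np.eye(T)            # positive definite conditional second moment

lhs = B.T @ np.linalg.inv(B @ Sigma @ B.T) @ B
rhs = M.T @ np.linalg.pinv(M @ Sigma @ M.T) @ M   # generalized (Moore-Penrose) inverse
print(np.allclose(lhs, rhs))               # True
```

Because $B$ spans the same row space as $M$, the two quadratic forms coincide, which is exactly what the equal-rank condition in the theorem requires.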
The information bound for the generalized within transformation is
\[
E\bigl(\nabla_\beta m_i(\beta_0)'M_i'(M_i\Sigma_iM_i')^-M_i\nabla_\beta m_i(\beta_0)\bigr)^{-1}.
\]
This expression is equal to the expression in (2.2.6) by Theorem 2.2.1, so the generalized within transformation is information equivalent to $B_i$. The proof for the residual-maker transformation is similar with $M_i = (I_T - m_i(\beta_0)(m_i(\beta_0)'m_i(\beta_0))^{-1}m_i(\beta_0)')$ and $V_i$ being the respective conditional covariance matrix. $\square$

APPENDIX

PROOFS FOR CHAPTER 3

Proof of Lemma 3.3.1. Separate the estimated parameters into the respective $(T-p) \times (p-p_0)$ and $(T-p) \times p_0$ matrices $(\Theta_1|\Theta_2)$. Separate the true regularized parameters by rows, $(\Theta_{10}'|\Theta_{20}')'$, which are then $(T-p) \times p_0$ and $(p-p_0) \times p_0$ matrices, respectively. Then for $p > p_0$, $H(\theta)'F_0 = \Theta_{10} + \Theta_1\Theta_{20} - \Theta_2$. Set $\Theta_2 = \Theta_{10} + \Theta_1\Theta_{20}$ for any value of $\Theta_1$, so that there are infinitely many solutions which make equation (3.3.1) zero. Finally, when $p < p_0$ there are more factors than the transformation can eliminate, so there are no values of $\Theta$ which cause (3.3.1) to be zero. These order conditions for estimation of $\theta_0$ are identical to Ahn et al. (2013). $\square$

Proof of Theorem 3.3.2. I first state the Identifying Assumption (IA), which comes from Ahn et al. (2013)'s Basic Assumptions:

Identifying Assumption: $\mathrm{Rk}(E(\gamma_i\gamma_i')) = p_0 < T$.
For any ๐‘‡ ร— (๐‘‡ โˆ’ ๐‘ 0 ) matrix ๐‘ฏ0 such that Rk(๐‘ญ0 , ๐‘ฏ0 ) = ๐‘‡, the following matrix has full column rank: ๐ธ (๐‘ฏ0โ€ฒ ๐‘ฟ๐‘– โŠ— vec( ๐‘ฟ๐‘– )), ๐‘ฐ๐‘‡โˆ’๐‘0 โŠ— ๐ธ (vec( ๐‘ฟ๐‘– )๐œธ๐‘–โ€ฒ) โ–   The two equations under consideration are equations (3.2.7) and (3.2.8), ๐ธ (๐’˜ ๐‘– โŠ— ๐‘ฏ0โ€ฒ ( ๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œท0 )) = 0 ๐ธ (๐‘ฏ0โ€ฒ ๐‘ฝ๐‘– ) = 0 I appeal to the partial redundancy results given in Section 4 of Breusch et al. (1999). In this setting, partial redundancy of two sets of moment conditions means that the asymptotic variance of the GMM estimator of ๐œท0 based off of both sets of moment conditions is the same as that of the GMM estimator which only uses the first set. See Section 1 of Breusch et al. (1999) for examples. 98 Write ๐€ = ( ๐œทโ€ฒ0 , ๐œฝ 0โ€ฒ ) โ€ฒ and let ๐€ 1 = ๐œท0 and ๐€2 = ๐œฝ 0 . Then ๐€ 1 is identified by equation (3.2.7) under IA1 and ๐€ 2 is identified by equation (3.2.8), both facts I use in the proof. They consider a general vector of moment conditions ๏ฃฎ ๏ฃน ๏ฃฏ ๐’ˆ1 (๐€, ๐œผ๐‘– )) ๏ฃบ ๐ธ ( ๐’ˆ(๐€, ๐œผ๐‘– )) = ๏ฃฏ ๏ฃบ =0 ๏ฃฏ ๏ฃบ ๏ฃฏ ๐’ˆ (๐€, ๐œผ )) ๏ฃบ ๏ฃฏ 2 ๐‘– ๏ฃบ ๏ฃฐ ๏ฃป where in my notation ๐œผ๐‘– = (๐’š๐‘– , ๐‘ฟ๐‘– , ๐œธ๐‘– , ๐šช๐‘– ), ๐’ˆ1 = ๐‘ฏ(๐œฝ) โ€ฒ (๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œท0 + ๐‘ญ๐œธ๐‘– ), and ๐’ˆ2 = ๐‘ฏ(๐œฝ) โ€ฒ๐‘ฝ๐‘– . I partition the gradient and covariances matrices as ๏ฃฎ ๏ฃน ๏ฃฏ ๐‘ซ 11 ๐‘ซ 12 ๏ฃบ๏ฃบ ๐‘ซ=๏ฃฏ ๏ฃฏ ๏ฃบ ๏ฃฏ๐‘ซ ๐‘ซ 22 ๏ฃบ๏ฃบ ๏ฃฏ 21 ๏ฃฐ ๏ฃป ๏ฃฎ ๏ฃน ๏ฃฏ๐›€11 ๐›€12 ๏ฃบ๏ฃบ ๐›€= ๏ฃฏ ๏ฃฏ ๏ฃบ ๏ฃฏ๐›€ ๐›€22 ๏ฃบ๏ฃบ ๏ฃฏ 21 ๏ฃฐ ๏ฃป where ๐‘ซ ๐‘š ๐‘› = ๐ธ (โˆ‡๐€ ๐‘› ๐’ˆ๐‘š (๐€, ๐œผ๐‘– )) and ๐›€๐‘š ๐‘› = ๐ธ ( ๐’ˆ๐‘š (๐€, ๐œผ๐‘– ) ๐’ˆ๐‘› (๐€, ๐œผ๐‘– ) โ€ฒ). 
Equation (3.2.8) is partially redundant for estimating $\beta_0$ if and only if
\[
D_{21} - \Omega_{21}\Omega_{11}^{-1}D_{11} = \bigl(D_{22} - \Omega_{21}\Omega_{11}^{-1}D_{12}\bigr)\bigl(D_{12}'\Omega_{11}^{-1}D_{12}\bigr)^{-1}\bigl(D_{12}'\Omega_{11}^{-1}D_{11}\bigr)
\]
by Theorem 7 of Breusch et al. (1999). As $u_i$ is mean independent of $X_i$, $\Omega_{21} = 0$ and $\Omega_{12} = 0$, so the necessary and sufficient condition for partial redundancy is
\[
D_{21} = D_{22}\bigl(D_{12}'\Omega_{11}^{-1}D_{12}\bigr)^{-1}\bigl(D_{12}'\Omega_{11}^{-1}D_{11}\bigr).
\]
Since $g_2(\lambda, \eta_i)$ is not a function of $\beta_0$, we also have $D_{21} = 0$. Assumption PF gives that $D_{22}$ has full column rank, so that $D_{22}(D_{12}'\Omega_{11}^{-1}D_{12})^{-1}$ is left-invertible. Therefore the redundancy condition becomes
\[
D_{12}'\Omega_{11}^{-1}D_{11} = 0. \quad \square
\]

Proof of Theorem 3.3.4. (See Section 3 of Ahn et al. (2013).)

I start with the proof of consistency. The centered QLDP estimator is written as
\[
\hat\beta_{QLDP} - \beta_0 = \Bigl(\frac{1}{N}\sum_{i=1}^N X_i'\hat H\hat H'X_i\Bigr)^{-1}\Bigl(\frac{1}{N}\sum_{i=1}^N X_i'\hat H\hat H'(F_0\gamma_i + u_i)\Bigr).
\]
The denominator equals its infeasible counterpart $\frac{1}{N}\sum_{i=1}^N V_i'H_0H_0'V_i$ up to an $O_p(N^{-1/2})$ term by Theorem 1 and the moment bounds from BASE. The inverse exists with probability approaching one by condition (1) of the theorem. Thus the denominator is an $O_p(1)$ term, so consistency depends on the numerator. The difference between the numerator and its infeasible counterpart is
\[
\frac{1}{N}\sum_{i=1}^N X_i'(\hat H\hat H' - H_0H_0')(F_0\gamma_i + u_i) = \Bigl(\frac{1}{N}\sum_{i=1}^N (F_0\gamma_i + u_i)' \otimes X_i'\Bigr)\mathrm{vec}(\hat H\hat H' - H_0H_0') = O_p(1)o_p(1).
\]
The sum converges to its finite expectation by the moment bounds from Assumption 2(2).
vec( ๐‘ฏ b๐‘ฏ bโ€ฒโˆ’ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ) = ๐‘‚ ๐‘ (๐‘ โˆ’1/2 ) by Theorem 3.3.1. The infeasible numerator, ๐‘1 ๐‘–=1 ร๐‘ โ€ฒ ๐‘ฟ๐‘– ๐‘ฏ0 ๐‘ฏ0โ€ฒ (๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– ), is ๐‘œ ๐‘ (1) as ๐‘ฏ0โ€ฒ ๐‘ญ0 = 0 and ๐‘1 ๐‘–=1 ร๐‘ โ€ฒ ๐‘ฟ๐‘– ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐’–๐‘– = ๐‘œ ๐‘ (1) by condition (3), so we have ๐œท b๐‘„๐ฟ๐ท๐‘ƒ โˆ’ ๐œท0 = ๐‘œ ๐‘ (1). Before deriving the asymptotic distribution of the QLDP, I need the following lemma: Lemma .0.1. Let ๐๐‘– = ๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– . Then ยฉ ๐’™๐‘– โˆ—1โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยช  ยญยญ .. ยฎ โˆ‡๐œฝ ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐๐‘– ) = ๐‘ฐ๐พ โŠ— ๐’–โ€ฒ๐‘– ๐‘ฏ0 ยญ ยฎ + ๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐๐‘–โˆ—โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0  (.0.1) ยฎ . ยญ ยฎ ยญ ยฎ โˆ— โ€ฒ ๐’™๐‘– ๐พ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยซ ยฌ where ๐’™๐‘– ๐‘— is the ๐‘—โ€™th column of ๐‘ฟ๐‘– and ๐’— โˆ— = (๐‘ฃ๐‘‡โˆ’๐‘0 +1 , ..., ๐‘ฃ๐‘‡ ) โ€ฒ is the last ๐‘ 0 elements of the ๐‘‡ ร— 1 vector ๐’—. Proof. I omit the pure factor notation for simplicity and work with the full matrix ๐‘ฟ๐‘– . Proposition 5.4 of Dhrymes (2013) gives โˆ‡๐œฝ ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ๐๐‘– ) = (๐๐‘–โ€ฒ ๐‘ฏ(๐œฝ) โŠ— ๐‘ฐ๐พ )โˆ‡๐œฝ ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)) + ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)โˆ‡๐œฝ (๐‘ฏ(๐œฝ) โ€ฒ๐๐‘– ) (.0.2) 100 where I follow standard notation in writing the derivative of the ๐‘› ร— ๐‘š matrix ๐‘จ with respect to the ๐‘˜ ร— 1 vector ๐œถ as โˆ‡๐œถ ๐‘จ = โˆ‡๐œถ vec( ๐‘จ). The row vectors of โˆ‡๐œถ ๐‘จ are then the 1 ร— ๐‘˜ gradient vectors of the elements of vec( ๐‘จ) with respect to ๐œถ. In order to derive the various derivatives, I first start with the case of an arbitrary ๐‘‡ ร— 1 vector ๐’— = (๐‘ฃ 1 , ..., ๐‘ฃ๐‘‡ ) โ€ฒ. As described in Section 3.1, ๐‘ฏ(๐œฝ) โ€ฒ = ( ๐‘ฐ๐‘‡โˆ’๐‘0 , ๐šฏ) where ๐œฝ = vec(๐šฏ). 
I write the $p_0$ column vectors of $\Theta$ as $\Theta = (\theta_1, \ldots, \theta_{p_0})$, where each column can be written as $\theta_j = (\theta_{j1}, \ldots, \theta_{j,T-p_0})'$. These definitions give the expression
\[
H(\theta)'v = \begin{pmatrix} v_1 + \theta_{11}v_{T-p_0+1} + \cdots + \theta_{p_01}v_T \\ \vdots \\ v_{T-p_0} + \theta_{1,T-p_0}v_{T-p_0+1} + \cdots + \theta_{p_0,T-p_0}v_T \end{pmatrix}. \qquad (.0.3)
\]
The expression above is similar to that derived below equation (4) of Ahn et al. (2013). They write the terms as the dot product between the rows of $H(\theta)'$ and $v^*$; I expand the sums so that the gradient is easier to see. Taking the gradient of the $r$'th element of $H(\theta)'v$ with respect to $\theta_j$ gives
\[
\nabla_{\theta_j}(v_r + \theta_{1r}v_{T-p_0+1} + \cdots + \theta_{p_0r}v_T) = (0, \ldots, 0, v_{T-p_0+j}, 0, \ldots, 0),
\]
where the only nonzero term is in the $r$'th column. Thus differentiating with respect to the $j$'th vector gives
\[
\nabla_{\theta_j}H(\theta)'v = \begin{pmatrix} v_{T-p_0+j} & & \\ & \ddots & \\ & & v_{T-p_0+j} \end{pmatrix} = v_{T-p_0+j}I_{T-p_0}.
\]
Putting together the $p_0$ gradients gives
\[
\nabla_\theta H(\theta)'v = \bigl(v_{T-p_0+1}I_{T-p_0}, \ldots, v_TI_{T-p_0}\bigr) = v^{*\prime} \otimes I_{T-p_0}. \qquad (.0.4)
\]
Equation (.0.4) implies $\nabla_\theta H(\theta)'\xi_i = \xi_i^{*\prime} \otimes I_{T-p_0}$. Handling $H(\theta)'X_i$ is done similarly.
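The Jacobian in equation (.0.4) can be checked against central finite differences. The sketch below (illustrative values only, not code from the dissertation) builds $\theta \mapsto H(\theta)'v$ with $H(\theta)' = (I_{T-p_0}, \Theta)$ and compares the numerical Jacobian to $v^{*\prime} \otimes I_{T-p_0}$:

```python
import numpy as np

T, p0 = 5, 2
rng = np.random.default_rng(2)
v = rng.standard_normal(T)
theta = rng.standard_normal((T - p0) * p0)     # theta = vec(Theta), column-major

def H_transform(theta, v):
    """Compute H(theta)'v with H(theta)' = (I_{T-p0}, Theta)."""
    Theta = theta.reshape(T - p0, p0, order="F")   # undo the column-major vec
    return v[: T - p0] + Theta @ v[T - p0:]

# Analytic Jacobian from equation (.0.4): v*' kron I_{T-p0}
analytic = np.kron(v[T - p0:], np.eye(T - p0))

# Central finite differences, one column of the Jacobian per element of theta
eps = 1e-6
numeric = np.zeros((T - p0, theta.size))
for j in range(theta.size):
    e = np.zeros_like(theta)
    e[j] = eps
    numeric[:, j] = (H_transform(theta + e, v) - H_transform(theta - e, v)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```

Since $H(\theta)'v$ is linear in $\theta$, the finite-difference Jacobian agrees with the analytic one essentially to machine precision.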
Writing the covariates in terms of the column vectors $X_i = (x_{i1}, \ldots, x_{iK})$, where now the subscript on $x_{ik}$ denotes the $T \times 1$ vector of observations on variable $k$ for individual $i$, we can see that
\[
H(\theta)'X_i = (H(\theta)'x_{i1}, \ldots, H(\theta)'x_{iK}),
\]
which implies that
\[
\mathrm{vec}(H(\theta)'X_i) = \begin{pmatrix} H(\theta)'x_{i1} \\ \vdots \\ H(\theta)'x_{iK} \end{pmatrix}.
\]
$H(\theta)'x_{ik}$ is a $(T-p_0) \times 1$ vector, so its gradient follows the same form as equation (.0.4). Thus
\[
\nabla_\theta\mathrm{vec}(H(\theta)'X_i) = \begin{pmatrix} x_{i1}^{*\prime} \otimes I_{T-p_0} \\ \vdots \\ x_{iK}^{*\prime} \otimes I_{T-p_0} \end{pmatrix}.
\]
Filling these gradients into equation (.0.2) gives the final answer. $\square$

Returning to the main proof of asymptotic normality, the pooled QLD estimator can be written as
\[
\sqrt{N}(\hat\beta_{QLDP} - \beta_0) = \Bigl(\frac{1}{N}\sum_{i=1}^N X_i'\hat H\hat H'X_i\Bigr)^{-1}\Bigl(\frac{1}{\sqrt{N}}\sum_{i=1}^N X_i'\hat H\hat H'(F_0\gamma_i + u_i)\Bigr).
\]
As before, the denominator equals $A_P$ up to an $O_p(N^{-1/2})$ term. The inverse exists with probability approaching one by condition (1) of the theorem. Thus asymptotic normality depends on the numerator. Write the full error as $\xi_i = F_0\gamma_i + u_i$, so that we study the asymptotic distribution of $\frac{1}{\sqrt{N}}\sum_{i=1}^N X_i'\hat H\hat H'\xi_i$. A mean value expansion about $\theta_0$ gives
\[
\frac{1}{\sqrt{N}}\sum_{i=1}^N X_i'\hat H\hat H'\xi_i = \frac{1}{\sqrt{N}}\sum_{i=1}^N V_i'H_0H_0'u_i + G_P\sqrt{N}(\hat\theta - \theta_0) + o_p(1),
\]
where $G_P = E(\nabla_\theta X_i'H_0H_0'\xi_i)$, which is derived explicitly in Lemma .0.1. The estimator $\hat\theta$ is derived in Theorem 3.3.1 as based on the moments $E(\mathrm{vec}(H_0'Z_i)) = 0$.
$\hat{\theta}$ is a GMM estimator using the optimal weight matrix $\hat{A}_{\theta} = \frac{1}{N}\sum_{i=1}^N \mathrm{vec}(\tilde{H}'Z_i)\mathrm{vec}(\tilde{H}'Z_i)'$, where $\tilde{H} = H(\tilde{\theta})$ uses an initial estimator. The first order conditions of the GMM optimization problem give
\[
\left(\sum_{i=1}^N \nabla_{\theta}\,\mathrm{vec}(\hat{H}'Z_i)\right)'\hat{A}_{\theta}^{-1}\left(\sum_{i=1}^N \mathrm{vec}(\hat{H}'Z_i)\right) = 0
\]
where
\[
\nabla_{\theta}\,\mathrm{vec}(\hat{H}'Z_i) = \begin{pmatrix} z_{i,1}^{*\prime} \otimes I_{T-p_0} \\ \vdots \\ z_{i,K+1}^{*\prime} \otimes I_{T-p_0} \end{pmatrix}
\]
comes from Lemma 3.3.1. Interestingly, this gradient is free of any parameters and is thus the same regardless of the estimator. Write $D_{\theta} = E(\nabla_{\theta}\,\mathrm{vec}(H_0'Z_i))$ and $A_{\theta} = E(\mathrm{vec}(H_0'Z_i)\mathrm{vec}(H_0'Z_i)')$, the notation from Theorem 3.3.1. Another standard mean value expansion gives
\[
\sqrt{N}(\hat{\theta} - \theta_0) = \frac{1}{\sqrt{N}}\sum_{i=1}^N (D_{\theta}'A_{\theta}^{-1}D_{\theta})^{-1}D_{\theta}'A_{\theta}^{-1}\mathrm{vec}(H_0'Z_i) + o_p(1) \tag{.0.5}
\]
which allows us to write the estimator as
\[
\sqrt{N}(\hat{\beta}_{QLDP} - \beta_0) = A_P^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^N \left(V_i'H_0H_0'u_i + G_P\, r_i(\theta_0)\right) + o_p(1) \tag{.0.6}
\]
where $r_i(\theta_0) = (D_{\theta}'A_{\theta}^{-1}D_{\theta})^{-1}D_{\theta}'A_{\theta}^{-1}\mathrm{vec}(H_0'Z_i)$. Thus we have
\[
\sqrt{N}(\hat{\beta}_{QLDP} - \beta_0) \overset{d}{\to} N\left(0, A_P^{-1}B_P A_P^{-1}\right) \tag{.0.7}
\]
where $B_P = E\left((V_i'H_0H_0'u_i + G_P\, r_i(\theta_0))(V_i'H_0H_0'u_i + G_P\, r_i(\theta_0))'\right)$. $\square$

Proof of Theorem 3.4.2

Now the asymptotic variance depends only on the moments $E(H_0'V_i) = 0$.

Lemma .0.2.
Suppose Assumption 2 holds and $\mathrm{Rk}(E(\Gamma_i)) = p_0$, and let $\hat{\theta}$ be the GMM estimator based off of $E(\mathrm{vec}(H_0'X_i)) = E(\mathrm{vec}(H_0'V_i)) = 0$ using a consistent estimator of the optimal weight matrix. Then
\[
\sqrt{N}(\hat{\theta} - \theta_0) \overset{d}{\to} N\left(0, (D_{x,\theta}'A_{x,\theta}^{-1}D_{x,\theta})^{-1}\right)
\]
with influence function $r_{x,i}(\theta_0) = (D_{x,\theta}'A_{x,\theta}^{-1}D_{x,\theta})^{-1}D_{x,\theta}'A_{x,\theta}^{-1}\mathrm{vec}(H_0'V_i)$, where $A_{x,\theta} = E(\mathrm{vec}(H_0'V_i)\mathrm{vec}(H_0'V_i)')$ and $D_{x,\theta} = E(\nabla_{\theta}\,\mathrm{vec}(H_0'V_i))$ is derived in Lemma .0.1. $\square$

Proof of Theorem 3.4.3

I first consider the proof of consistency. Facts about uniform convergence shown for consistency will be taken for granted in the proof of asymptotic normality. As a technical aside, I do not differentiate between the Euclidean vector norm and the Frobenius matrix norm in terms of notation. This does not affect the proof, as the two norms are compatible in the sense that $\|Ax\|_E \le \|A\|_F\|x\|_E$, where $A$ is an $n \times m$ matrix, $x$ is an $m \times 1$ vector, and the $F$ and $E$ subscripts refer to Frobenius and Euclidean respectively. Further, both norms are submultiplicative, so the distinction does not matter for the purposes of this proof; the intended norm should be clear from context. Finally, all statements involving random quantities are assumed to hold almost surely unless stated otherwise.
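The compatibility and submultiplicativity claims can be spot-checked numerically; they are generic properties of the Euclidean and Frobenius norms, not specific to the model:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    A = rng.standard_normal((4, 3))
    x = rng.standard_normal(3)
    B = rng.standard_normal((3, 5))
    # Compatibility: ||Ax||_E <= ||A||_F ||x||_E
    assert np.linalg.norm(A @ x) <= np.linalg.norm(A, "fro") * np.linalg.norm(x) + 1e-12
    # Submultiplicativity: ||AB||_F <= ||A||_F ||B||_F
    assert np.linalg.norm(A @ B, "fro") <= np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro") + 1e-12
```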
The QLDMG estimator can be written as
\[
\hat{\beta}_{QLDMG} - \beta_0 = \frac{1}{N}\sum_{i=1}^N \left((X_i'\hat{H}\hat{H}'X_i)^{-1}X_i'\hat{H}\hat{H}'(F_0\gamma_i + u_i) + b_i\right)
\]
\[
= \frac{1}{N}\sum_{i=1}^N (X_i'\hat{H}\hat{H}'X_i)^{-1}X_i'\hat{H}\hat{H}'(F_0\gamma_i + u_i) + O_p(N^{-1/2})
\]
where $\hat{H} = H(\hat{\theta})$, $\hat{\theta} \overset{p}{\to} \theta_0$ by Theorem 1, and $\frac{1}{N}\sum_{i=1}^N b_i = O_p(N^{-1/2})$ by the CLT. Consistency of the QLDMG estimator therefore does not depend on the correlation between $b_i$ and $(X_i, \gamma_i, u_i)$. However, since the rate of convergence is $\sqrt{N}$, this correlation will affect the asymptotic distribution; that fact is handled later in the proof. I write $Z_i(\theta) = (X_i'H(\theta)H(\theta)'X_i)^{-1}X_i'H(\theta)H(\theta)'(F_0\gamma_i + u_i)$ for convenience. The goal of this section is to show that
\[
\frac{1}{N}\sum_{i=1}^N Z_i(\hat{\theta}) \overset{p}{\to} E(Z_i(\theta_0)) = 0 \tag{.0.8}
\]
By Theorem 21.6 of Davidson (1994), the convergence result in equation (.0.8) is implied by the conditions:
\[
\hat{\theta} \overset{p}{\to} \theta_0 \tag{.0.9}
\]
\[
\sup_{\theta \in B_0}\left\|\frac{1}{N}\sum_{i=1}^N \left(Z_i(\theta) - E(Z_i(\theta))\right)\right\| = o_p(1), \quad \text{where } B_0 \text{ is some open set about } \theta_0 \tag{.0.10}
\]
and where $\|\cdot\|$ denotes the Euclidean $L_2$ norm for vectors and the Frobenius norm for matrices. Consistency of $\hat{\theta}$ holds by Theorem 1, so that uniform convergence is the only condition which needs to be verified. I show uniform convergence via a traditional argument which demonstrates both pointwise convergence in probability and stochastic equicontinuity (SE). Pointwise convergence in probability follows from the WLLN by the moment bounds and sampling assumptions in Assumption 3(2).
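A minimal simulation sketch of the mean-group construction above, assuming the QLD form $H(\theta)' = [I, \Theta]$ and treating $\theta_0$ as known (in the proof it is replaced by $\hat{\theta}$); the factor is normalized so its last $p_0$ rows equal $I_{p_0}$, and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, p0, K = 4000, 8, 1, 2
m = T - p0
beta0 = np.array([1.0, -0.5])

# Factor normalized so its last p0 rows are I_{p0}; then
# H0' = [I_m, -F[:m]] satisfies H0'F = 0 (the QLD annihilator)
F = np.vstack([rng.standard_normal((m, p0)), np.eye(p0)])
H0 = np.hstack([np.eye(m), -F[:m]]).T          # T x m
P = H0 @ H0.T

X = rng.standard_normal((N, T, K))
b = 0.3 * rng.standard_normal((N, K))          # heterogeneous slopes: beta_i = beta0 + b_i
gam = rng.standard_normal((N, p0))
u = rng.standard_normal((N, T))
y = np.einsum("ntk,nk->nt", X, beta0 + b) + gam @ F.T + u

# Mean group: average the unit-level QLD least-squares coefficients
betas = np.empty((N, K))
for i in range(N):
    betas[i] = np.linalg.solve(X[i].T @ P @ X[i], X[i].T @ P @ y[i])
beta_mg = betas.mean(axis=0)
assert np.max(np.abs(beta_mg - beta0)) < 0.05  # consistent for beta0
```

Because $H_0'F = 0$ by construction, the factor term drops out of every unit-level regression, which is exactly the mechanism the proof exploits.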
{๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ ๐‘ฟ๐‘– }๐‘–โ‰ฅ1 is a sequence of positive definite 104 random matrices for all possible values of ๐œฝ by condition (1) of the theorem. Thus for each ๐œฝ, ๐‘ {๐’๐‘– (๐œฝ)}๐‘–โ‰ฅ1 is well-defined and iid. By the WLLN, ๐‘1 ๐‘–=1 ร๐‘ ๐’๐‘– (๐œฝ) โ†’ ๐ธ ( ๐’๐‘– (๐œฝ)) which is 0 when ๐œฝ = ๐œฝ0. For the purpose of verifying SE of the random sequence, I show that the following Lipschitz condition of Theorem 21.11 from Davidson (1994) holds: for some random sequence {๐ต ๐‘ ๐‘– }๐‘–โ‰ฅ1 with bounded expectations and real function โ„Ž such that โ„Ž(๐‘ฅ) โ†’ 0 as ๐‘ฅ โ†’ 0, there exists ๐‘› โˆˆ N such that 1 ยค โˆ’ ๐ธ (๐’๐‘– ( ๐œฝ))) ยค ( ๐’๐‘– (๐œฝ) โˆ’ ๐ธ ( ๐’๐‘– (๐œฝ))) โˆ’ ( ๐’๐‘– ( ๐œฝ) โ‰ค ๐ต ๐‘ ๐‘– โ„Ž(โˆฅ๐œฝ โˆ’ ๐œฝ โ€ฒ โˆฅ) (.0.11) ๐‘ for all ๐œฝ, ๐œฝยค โˆˆ T and ๐‘ โ‰ฅ ๐‘›, where all stated inequalities hold almost surely as stated above. I start with the stochastic component ๐’๐‘– (๐œฝ) โˆ’ ๐’๐‘– ( ๐œฝ). ยค It will make sense to write ๐’๐‘– (๐œฝ) = ๐‘จ(๐œฝ) โˆ’1 ๐‘ฉ(๐œฝ) where ๐‘จ๐‘– (๐œฝ) = ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ ๐‘ฟ๐‘– ๐‘ฉ๐‘– (๐œฝ) = ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ (๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– ) We then have ยค = ๐‘จ๐‘– (๐œฝ) โˆ’1 ๐‘ฉ๐‘– (๐œฝ) โˆ’ ๐‘จ๐‘– ( ๐œฝ) ๐’๐‘– (๐œฝ) โˆ’ ๐’๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ๐‘– ( ๐œฝ) ยค โ‰ค ๐‘จ๐‘– (๐œฝ) โˆ’1 ๐‘ฉ๐‘– (๐œฝ) โˆ’ ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ๐‘– (๐œฝ) + ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ(๐œฝ) โˆ’ ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ( ๐œฝ) ยค We can bound the second normed value on the right-hand side. Let ๐‘ซ (๐œฝ, ๐œฝ) ยค = ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ โˆ’ ยค ๐‘ฏ( ๐œฝ)๐‘ฏ( ยค โ€ฒ. The Frobenius norm of a matrix is equal to the square root of the sum of its squared ๐œฝ) singular values (see, for example, Horn and Johnson (2013)). 
Thus ๐‘จ(๐œฝ) โˆ’1 = ๐‘Ž๐‘– (๐œฝ) > 0 and we have ยค โˆ’1 ๐‘ฉ๐‘– (๐œฝ) โˆ’ ๐‘จ๐‘– ( ๐œฝ) ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ๐‘– ( ๐œฝ) ยค = ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 (๐‘ฉ๐‘– (๐œฝ) โˆ’ ๐‘ฉ๐‘– ( ๐œฝ)) ยค ยค ๐‘ฟ โ€ฒ ๐‘ซ (๐œฝ, ๐œฝ)(๐‘ญ๐œธ โ‰ค ๐‘Ž๐‘– ( ๐œฝ) ยค ๐‘– + ๐’–๐‘– ) ๐‘– ยค โˆฅ ๐‘ฟ๐‘– โˆฅ โˆฅ๐‘ญ๐œธ๐‘– + ๐’–๐‘– โˆฅ ๐‘ซ (๐œฝ, ๐œฝ) โ‰ค ๐‘Ž๐‘– ( ๐œฝ) ยค 105 Turning now to the other term from the triangle inequality, note that condition (1) of the theorem implies ๐‘จ(๐œฝ) is nonsingular for any ๐œฝ in the parameter space. Then   ๐‘จ๐‘– (๐œฝ) โˆ’1 ๐‘ฉ๐‘– (๐œฝ) โˆ’ ๐‘จ๐‘– ( ๐œฝ)ยค โˆ’1 ๐‘ฉ๐‘– (๐œฝ) = ๐‘จ๐‘– (๐œฝ) โˆ’1 โˆ’ ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ๐‘– (๐œฝ)   = ยค โˆ’1 ๐‘จ๐‘– ( ๐œฝ) ๐‘จ๐‘– ( ๐œฝ) ยค ๐‘จ๐‘– (๐œฝ) โˆ’1 โˆ’ ๐‘จ๐‘– ( ๐œฝ)ยค โˆ’1 ๐‘จ๐‘– (๐œฝ) ๐‘จ๐‘– (๐œฝ) โˆ’1 ๐‘ฉ๐‘– (๐œฝ) ยค โˆ’1 ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’ ๐‘จ๐‘– (๐œฝ) ๐‘จ๐‘– (๐œฝ) โˆ’1 ๐‘ฉ๐‘– (๐œฝ)  = ๐‘จ๐‘– ( ๐œฝ) โ‰ค ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’ ๐‘จ๐‘– (๐œฝ) ๐‘จ๐‘– (๐œฝ) โˆ’1 โˆฅ ๐‘ฉ๐‘– (๐œฝ) โˆฅ As before, ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘จ๐‘– (๐œฝ) โˆ’1 = ๐‘Ž๐‘– ( ๐œฝ)๐‘Ž ยค ๐‘– (๐œฝ). โˆฅ ๐‘ฉ๐‘– (๐œฝ) โˆฅ = ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ (๐‘ญ๐œธ๐‘– + ๐’–๐‘– ) where (๐‘ญ๐œธ๐‘– + ๐’–๐‘– ) ๐‘ฟ๐‘–โ€ฒ is bounded in expectation. Condition (3) implies that sup๐œฝโˆˆT โˆฅ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ โˆฅ < ๐œ for some ๐œ < โˆž. Finally note that ยค โˆ’ ๐‘จ๐‘– (๐œฝ) = ๐‘ฟ โ€ฒ ๐‘ซ ( ๐œฝ, ๐‘จ๐‘– ( ๐œฝ) ยค ๐œฝ) ๐‘ฟ๐‘– ๐‘– โ‰ค โˆฅ ๐‘ฟ๐‘– โˆฅ 2 ๐‘ซ (๐œฝ, ๐œฝ) ยค as ๐‘ซ (๐œฝ, ๐œฝ)ยค = โˆ’๐‘ซ ( ๐œฝ, ยค ๐œฝ). Putting everything together yields 1 ยค โ‰ค 1 ๐‘Ž๐‘– ( ๐œฝ)  ยค โˆฅ ๐‘ฟ๐‘– โˆฅ โˆฅ(๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– )โˆฅ + ๐œ๐‘Ž๐‘– ( ๐œฝ)๐‘Ž  ยค ยค ๐‘– (๐œฝ) โˆฅ ๐‘ฟ๐‘– โˆฅ 3 โˆฅ (๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– ) โˆฅ ๐‘ซ (๐œฝ, ๐œฝ) ๐’๐‘– (๐œฝ) โˆ’ ๐’๐‘– ( ๐œฝ) ๐‘ ๐‘ Clearly ๐‘ซ (๐œฝ, ๐œฝ) ยค โ†’ 0 as ๐œฝ โˆ’ ๐œฝยค โ†’ 0. 
In the language of Davidson's Theorem 21.11,
\[
\sum_{i=1}^N B_{Ni} = \frac{1}{N}\sum_{i=1}^N \|X_i\|\,\|F_0\gamma_i + u_i\|\, a_i(\dot{\theta})\left(1 + \tau a_i(\theta)\|X_i\|^2\right)
\]
The random variables here have identical moments by Assumption 2(2), and the bound on $a_i(\theta)$ holds uniformly over $\mathcal{T}$ by condition (2), so that
\[
E\left(\sum_{i=1}^N B_{Ni}\right) = E\left(\|X_i\|\,\|F_0\gamma_i + u_i\|\, a_i(\dot{\theta})\left(1 + \tau a_i(\theta)\|X_i\|^2\right)\right) = O(1)
\]
as the expectation is finite. Looking at equation (.0.11), we have
\[
\left\|\left(Z_i(\theta) - E(Z_i(\theta))\right) - \left(Z_i(\dot{\theta}) - E(Z_i(\dot{\theta}))\right)\right\| \le \left\|Z_i(\theta) - Z_i(\dot{\theta})\right\| + \left\|E\left(Z_i(\theta) - Z_i(\dot{\theta})\right)\right\|
\]
As norms are convex, $\|E(Z_i(\theta) - Z_i(\dot{\theta}))\| \le E(\|Z_i(\theta) - Z_i(\dot{\theta})\|)$, which is bounded by the same argument as above. I have thus verified SE, and so $\hat{\beta}_{QLDMG} - \beta_0 = o_p(1)$.

Turning to asymptotic normality, I need a lemma on the mean value expansion of the QLDMG estimator like in Theorem 3.3.4.

Lemma .0.3. Let $\epsilon_i = X_i b_i + F_0\gamma_i + u_i$. Then
\[
\begin{aligned}
\nabla_{\theta}\left((X_i'H_0H_0'X_i)^{-1}X_i'H_0H_0'\epsilon_i\right) = &- \left(\epsilon_i'H_0H_0'V_i \otimes I_K\right)\left((V_i'H_0H_0'V_i)^{-1} \otimes (V_i'H_0H_0'V_i)^{-1}\right)(I_{K^2} + K_K)(I_K \otimes V_i'H_0)\begin{pmatrix} x_{i1}^{*\prime} \otimes I_{T-p_0} \\ \vdots \\ x_{iK}^{*\prime} \otimes I_{T-p_0} \end{pmatrix} \\
&+ (V_i'H_0H_0'V_i)^{-1}(I_K \otimes \epsilon_i'H_0)\begin{pmatrix} x_{i1}^{*\prime} \otimes I_{T-p_0} \\ \vdots \\ x_{iK}^{*\prime} \otimes I_{T-p_0} \end{pmatrix} \\
&+ (V_i'H_0H_0'V_i)^{-1}V_i'H_0\left(\epsilon_i^{*\prime} \otimes I_{T-p_0}\right)
\end{aligned}
\]
where $K_K$ is the $K^2 \times K^2$ commutation matrix.

Proof. Like in Lemma .0.1, I omit the factor structure $X_i = F_0\Gamma_i + V_i$ and derive the above form with respect to just $X_i$; the factor structure is substituted in after the lemma. Assumption 2 and conditions (1) and (2) imply that the inverse of $X_i'H(\theta)H(\theta)'X_i$ is differentiable about $\theta_0$. Proposition 5.16 of Dhrymes (2013) gives
\[
\nabla_{\theta}\left((X_i'H_0H_0'X_i)^{-1}\right) = -\left((X_i'H_0H_0'X_i)^{-1} \otimes (X_i'H_0H_0'X_i)^{-1}\right)\nabla_{\theta}\,\mathrm{vec}\left(X_i'H_0H_0'X_i\right)
\]
The differential of $X_i'H(\theta)H(\theta)'X_i$ can be worked out via 13.19(b) of Abadir and Magnus (2005):
\[
d\,\mathrm{vec}\left(X_i'H(\theta)H(\theta)'X_i\right) = (I_{K^2} + K_K)\left(I_K \otimes X_i'H(\theta)\right)d\,\mathrm{vec}\left(H(\theta)'X_i\right)
\]
The associated gradient was worked out in the proof of Theorem 3.3.4. Thus we have
\[
\nabla_{\theta}\left((X_i'H_0H_0'X_i)^{-1}\right) = -\left((X_i'H_0H_0'X_i)^{-1} \otimes (X_i'H_0H_0'X_i)^{-1}\right)(I_{K^2} + K_K)(I_K \otimes X_i'H_0)\begin{pmatrix} x_{i1}^{*\prime} \otimes I_{T-p_0} \\ \vdots \\ x_{iK}^{*\prime} \otimes I_{T-p_0} \end{pmatrix}
\]
The product rule for the gradient is given in Proposition 5.4 of Dhrymes (2013), and the gradient $\nabla_{\theta} X_i'H_0H_0'\epsilon_i$ comes from Lemma .0.1 in the proof of Theorem 3.3.4.
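The Abadir and Magnus differential identity used in the proof can be checked by finite differences. The sketch again assumes the QLD form $H(\theta)' = [I, \Theta]$ implied by equation (.0.3); the commutation matrix below satisfies $K_K\,\mathrm{vec}(M) = \mathrm{vec}(M')$ for $K \times K$ matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
T, p0, K = 6, 2, 3
m = T - p0

def H(theta):
    # H(theta)' = [I_{T-p0}, Theta], theta = vec(Theta)
    Theta = theta.reshape(p0, m).T
    return np.hstack([np.eye(m), Theta]).T

def commutation(K):
    # K_K: the K^2 x K^2 matrix with K_K @ vec(M) = vec(M')
    Kmat = np.zeros((K * K, K * K))
    for i in range(K):
        for j in range(K):
            Kmat[j * K + i, i * K + j] = 1.0
    return Kmat

theta = rng.standard_normal(m * p0)
X = rng.standard_normal((T, K))

def f(th):
    Ht = H(th)
    return (X.T @ Ht @ Ht.T @ X).flatten(order="F")   # vec(X'HH'X)

eps = 1e-6
J = np.column_stack([
    (f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
    for e in np.eye(m * p0)
])

# Closed form: (I_{K^2} + K_K)(I_K kron X'H) applied to the stacked
# blocks x*_k' kron I_{T-p0} of the gradient of vec(H'X) (Lemma .0.1)
Gx = np.vstack([np.kron(X[m:, k][None, :], np.eye(m)) for k in range(K)])
J_closed = (np.eye(K * K) + commutation(K)) @ np.kron(np.eye(K), X.T @ H(theta)) @ Gx
assert np.max(np.abs(J - J_closed)) < 1e-5
```

Since $\mathrm{vec}(X'HH'X)$ is quadratic in $\theta$, the central differences match the closed form up to roundoff.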
$\square$

The $\sqrt{N}$-normalized estimator is
\[
\sqrt{N}(\hat{\beta}_{QLDMG} - \beta_0) = \frac{1}{\sqrt{N}}\sum_{i=1}^N (X_i'\hat{H}\hat{H}'X_i)^{-1}X_i'\hat{H}\hat{H}'\epsilon_i
\]
where $\epsilon_i = X_i b_i + F_0\gamma_i + u_i$. I write the estimator in terms of its full error because the asymptotic variance generally depends on the correlation between $b_i$ and the other terms. I derive the asymptotic variance in full, with a simpler form available under stronger exogeneity conditions. Applying a mean value expansion to the above sum gives
\[
\frac{1}{\sqrt{N}}\sum_{i=1}^N (X_i'\hat{H}\hat{H}'X_i)^{-1}X_i'\hat{H}\hat{H}'\epsilon_i = \frac{1}{\sqrt{N}}\sum_{i=1}^N (V_i'H_0H_0'V_i)^{-1}V_i'H_0H_0'\epsilon_i + G_{MG}\sqrt{N}(\hat{\theta} - \theta_0) + o_p(1)
\]
where $G_{MG}$ comes from Lemma .0.3. Thus
\[
\sqrt{N}(\hat{\beta}_{QLDMG} - \beta_0) = \frac{1}{\sqrt{N}}\sum_{i=1}^N \left((V_i'H_0H_0'V_i)^{-1}V_i'H_0H_0'\epsilon_i + G_{MG}\, r_{x,i}(\theta_0)\right) + o_p(1) \tag{.0.12}
\]
where $r_{x,i}(\theta_0) = (D_{x,\theta}'A_{x,\theta}^{-1}D_{x,\theta})^{-1}D_{x,\theta}'A_{x,\theta}^{-1}\mathrm{vec}(H_0'V_i)$ comes from Lemma .0.2. We then have
\[
\sqrt{N}(\hat{\beta}_{QLDMG} - \beta_0) \overset{d}{\to} N(0, B_{MG}) \tag{.0.13}
\]
where $B_{MG} = \mathrm{Var}\left((V_i'H_0H_0'V_i)^{-1}V_i'H_0H_0'\epsilon_i + G_{MG}\, r_{x,i}(\theta_0)\right)$. $\square$

APPENDIX

ADDITIONAL TABLES FOR CHAPTER 3

I now present additional simulations comparing the pooled CCE and QLD estimators. Table .1 gives results for $K = 2$ and $p_0 = 2$ but for larger values of $T$.
Table .1: Pooled estimators, K = 2

                 Bias               SD                RMSE
                 CCEP     QLDP     CCEP     QLDP     CCEP     QLDP
 N = 50
 T = 6          0.0128   0.0074   0.1028   0.0956   0.1036   0.0959
                0.0128   0.0132   0.1019   0.1025   0.1027   0.1034
 T = 7          0.0146   0.0102   0.0994   0.1222   0.1004   0.1226
                0.0150   0.0096   0.0910   0.1191   0.0922   0.1194
 T = 8          0.0105   0.0061   0.0873   0.0886   0.0879   0.0888
                0.0166   0.0086   0.0855   0.0852   0.0871   0.0856
 N = 300
 T = 6          0.0029   0.0015   0.0405   0.0392   0.0406   0.0392
                0.0039   0.0013   0.0416   0.0406   0.0418   0.0406
 T = 7          0.0016   0.0001   0.0376   0.0477   0.0377   0.0477
                0.0021  -0.0001   0.0374   0.0450   0.0374   0.0450
 T = 8          0.0020   0.0009   0.0344   0.0348   0.0344   0.0349
                0.0010   0.0001   0.0345   0.0344   0.0346   0.0344

Both estimators perform poorly when N = 50, with CCEP outperforming QLDP in terms of SD most clearly at T = 7. Interestingly, QLDP's bias seems to decrease as T gets larger, despite the fact that the number of parameters increases linearly in T for fixed p_0. Generally, the differences in bias are small, and where CCEP has a smaller RMSE it is due to its reduced SD. Table .2 performs the same simulations but for K = 3. In these cases, QLDP has the smaller SD, most likely due to the fact that the additional covariates provide information which the QLD transformation can exploit.
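As a consistency check on the reported columns, they satisfy RMSE² = Bias² + SD² up to rounding; for example, the first Table .1 row:

```python
import math

# First row of Table .1 (N = 50, T = 6), CCEP: bias 0.0128, SD 0.1028
bias, sd = 0.0128, 0.1028
rmse = math.sqrt(bias**2 + sd**2)
assert round(rmse, 4) == 0.1036   # matches the reported RMSE

# Same row, QLDP: bias 0.0074, SD 0.0956
assert round(math.sqrt(0.0074**2 + 0.0956**2), 4) == 0.0959
```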
Table .2: Pooled estimators, K = 3

                 Bias               SD                RMSE
                 CCEP     QLDP     CCEP     QLDP     CCEP     QLDP
 N = 50
 T = 6          0.0115   0.0055   0.1174   0.1010   0.1179   0.1012
                0.0207   0.0131   0.1143   0.1024   0.1161   0.1032
               -0.0041  -0.0009   0.1151   0.1001   0.1151   0.1001
 T = 7          0.0184   0.0127   0.0991   0.1255   0.1008   0.1261
                0.0218   0.0079   0.1009   0.1245   0.1033   0.1247
               -0.0054  -0.0022   0.0998   0.1157   0.0999   0.1157
 T = 8          0.0151   0.0122   0.0883   0.0867   0.0896   0.0875
                0.0095   0.0084   0.0896   0.0873   0.0901   0.0877
                0.0015  -0.0041   0.0895   0.0870   0.0895   0.0871
 N = 300
 T = 6          0.0034   0.0024   0.0451   0.0374   0.0452   0.0375
               -0.0001   0.0007   0.0468   0.0404   0.0468   0.0404
                0.0001  -0.0016   0.0440   0.0391   0.0440   0.0391
 T = 7          0.0038   0.0021   0.0385   0.0468   0.0387   0.0468
                0.0048   0.0010   0.0381   0.0448   0.0384   0.0448
                0.0005   0.0016   0.0382   0.0461   0.0382   0.0461
 T = 8          0.0005  -0.0002   0.0352   0.0347   0.0352   0.0347
                0.0042   0.0015   0.0364   0.0336   0.0367   0.0336
                0.0000   0.0012   0.0351   0.0344   0.0351   0.0344

BIBLIOGRAPHY

Abadir, K. M., & Magnus, J. R. (2005). Matrix algebra (Vol. 1). Cambridge University Press.

Andrews, D. W. K. (2005). Cross-section regression with common shocks. Econometrica, 73, 1551–1585.

Ahn, S. C. (2015). Comment on "IV estimation of panels with factor residuals" by D. Robertson and V. Sarafidis. Journal of Econometrics, 185, 542–544. https://doi.org/10.1016/j.jeconom.2014.12.002

Ahn, S. C., Lee, Y. H., & Schmidt, P. (2013). Panel data models with multiple time-varying individual effects. Journal of Econometrics, 174, 1–14. https://doi.org/10.1016/j.jeconom.2012.12.002

Ahn, S. C., & Schmidt, P. (1997). Efficient estimation of dynamic panel data models: Alternative assumptions and simplified estimation. Journal of Econometrics, 76, 309–321.

Amsler, C., Lee, Y. H., & Schmidt, P. (2009). A survey of stochastic frontier models and likely future developments. Seoul Journal of Economics, 22(1).

Arellano, M., & Bover, O. (1995). Another look at the instrumental variable estimation of error-components models.
Journal of Econometrics, 68(1), 29–51.

Arellano, M., Hahn, J. et al. (2005). Understanding bias in nonlinear panel models: Some recent developments (tech. rep.). Mimeo, CEMFI.

Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica, 71(1), 135–171.

Bai, J. (2009). Panel data models with interactive fixed effects. Econometrica, 77(4), 1229–1279.

Breitung, J., & Hansen, P. (2021). Alternative estimation approaches for the factor augmented panel data model with small T. Empirical Economics, 60, 327–351. https://doi.org/10.1007/s00181-020-01948-7

Breitung, J., & Salish, N. (2021). Estimation of heterogeneous panels with systematic slope variations. Journal of Econometrics, 220, 399–415. https://doi.org/10.1016/j.jeconom.2020.04.007

Breusch, T., Qian, H., Schmidt, P., & Wyhowski, D. J. (1997). Redundancy of moment conditions. Journal of Econometrics, 91.

Brown, N. (2021). Information equivalence among transformations of semiparametric nonlinear panel data models. https://www.researchgate.net/publication/344047637_Information-equivalence_among_transformations_of_semiparametric_nonlinear_panel_data_models

Brown, N., Schmidt, P., & Wooldridge, J. M. (2021). Simple alternatives to the common correlated effects model. https://doi.org/10.13140/RG.2.2.12655.76969/1

Brown, N. L., & Wooldridge, J. M. (2021). More efficient estimation of multiplicative panel data models in the presence of serial correlation. Manuscript submitted for publication.

Campello, M., Galvao, A. F., & Juhl, T. (2019). Testing for slope heterogeneity bias in panel data models. Journal of Business and Economic Statistics, 37, 749–760. https://doi.org/10.1080/07350015.2017.1421545

Castillo, J. C., Mejía, D., & Restrepo, P. (2020). Scarcity without Leviathan: The violent effects of cocaine supply shortages in the Mexican drug war. Review of Economics and Statistics, 102(2), 269–286.

Chamberlain, G. (1980). Analysis of covariance with qualitative data.
Review of Economic Studies, 47, 225–238.

Chamberlain, G. (1987). Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3), 305–334.

Chamberlain, G. (1992). Efficiency bounds for semiparametric regression. Econometrica, 60(3), 567–596.

Chen, M., Fernández-Val, I., & Weidner, M. (2014). Nonlinear factor models for network and panel data. arXiv preprint arXiv:1412.5647.

Chudik, A., & Pesaran, M. H. (2015). Common correlated effects estimation of heterogeneous dynamic panel data models with weakly exogenous regressors. Journal of Econometrics, 188, 393–420. https://doi.org/10.1016/j.jeconom.2015.03.007

Davidson, J. (1994). Stochastic limit theory: An introduction for econometricians. Oxford University Press. https://doi.org/10.1093/0198774036.001.0001

Vos, I. D., & Everaert, G. (2021). Bias-corrected common correlated effects pooled estimation in dynamic panels. Journal of Business and Economic Statistics, 39, 294–306. https://doi.org/10.1080/07350015.2019.1654879

Vos, I. D., & Westerlund, J. (2019). On CCE estimation of factor-augmented models when regressors are not linear in the factors. Economics Letters, 178, 5–7. https://doi.org/10.1016/j.econlet.2019.02.001

Dhrymes, P. J. (2013). Mathematics for econometrics. Springer Science & Business Media.

Fernández-Val, I., & Weidner, M. (2018). Fixed effects estimation of large-T panel data models. Annual Review of Economics, 10, 109–138.

Fischer, S., Royer, H., & White, C. (2018). The impacts of reduced access to abortion and family planning services on abortions, births, and contraceptive purchases. Journal of Public Economics, 167, 43–68.

Hahn, J. (1997). A note on the efficient semiparametric estimation of some exponential panel models. Econometric Theory, 13(4), 583–588.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50, 1029–1054.
https://doi.org/10.2307/1912775

Hardin, J. W., & Hilbe, J. M. (2012). Generalized estimating equations (2nd ed.). London: Chapman & Hall.

Hausman, J., Hall, B. H., & Griliches, Z. (1984). Econometric models for count data with an application to the patents-R&D relationship. Econometrica, 52(4), 909–938.

Hayakawa, K. (2012). GMM estimation of short dynamic panel data models with interactive fixed effects. Journal of the Japan Statistical Society, 42, 109–123.

Hayakawa, K. (2016). Identification problem of GMM estimators for short panel data models with interactive fixed effects. Economics Letters, 139, 22–26. https://doi.org/10.1016/j.econlet.2015.12.012

Horn, R. A., & Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.

Hsiao, C. (2018). Panel models with interactive effects. Journal of Econometrics, 206, 645–673. https://doi.org/10.1016/j.jeconom.2018.06.017

Im, K. S., Ahn, S. C., Schmidt, P., & Wooldridge, J. M. (1999). Efficient estimation of panel data models with strictly exogenous explanatory variables. Journal of Econometrics, 93(1), 177–201.

Juhl, T., & Lugovskyy, O. (2014). A test for slope heterogeneity in fixed effects models. Econometric Reviews, 33, 906–935. https://doi.org/10.1080/07474938.2013.806708

Juodis, A., & Sarafidis, V. (2018). Fixed T dynamic panel data estimators with multifactor errors. Econometric Reviews, 37, 893–929. https://doi.org/10.1080/00927872.2016.1178875

Juodis, A., & Sarafidis, V. (2020). A linear estimator for factor-augmented fixed-T panels with endogenous regressors. Journal of Business and Economic Statistics. https://doi.org/10.1080/07350015.2020.1766469

Juodis, A., & Sarafidis, V. (2021). An incidental parameters free inference approach for panels with common shocks. Journal of Econometrics. https://doi.org/10.1016/j.jeconom.2021.03.011

Karabiyik, H., Reese, S., & Westerlund, J. (2017).
On the role of the rank condition in CCE estimation of factor-augmented panel regressions. Journal of Econometrics, 197(1), 60–64.

Krapf, M., Ursprung, H. W., & Zimmermann, C. (2017). Parenthood and productivity of highly skilled labor: Evidence from the groves of academe. Journal of Economic Behavior & Organization, 140, 147–175.

Liang, K.-Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.

McCabe, M. J., & Snyder, C. M. (2014). Identifying the effect of open access on citations using a panel of science journals. Economic Inquiry, 52(4), 1284–1300.

McCabe, M. J., & Snyder, C. M. (2015). Does online availability increase citations? Theory and evidence from a panel of economics and business journals. Review of Economics and Statistics, 97(1), 144–165.

McCullagh, P., & Nelder, J. (1989). Generalized linear models (2nd ed.). London: Chapman & Hall.

Moon, H. R., & Weidner, M. (2015). Linear regression for panel with unknown number of factors as interactive fixed effects. Econometrica, 83, 1543–1579. https://doi.org/10.3982/ecta9382

Mundlak, Y. (1978). On the pooling of time series and cross section data. Econometrica, 46, 69–85.

Murtazashvili, I., & Wooldridge, J. M. (2008). Fixed effects instrumental variables estimation in correlated random coefficient panel data models. Journal of Econometrics, 142, 539–552. https://doi.org/10.1016/j.jeconom.2007.09.001

Neal, T. (2015). Estimating heterogeneous coefficients in panel data models with endogenous regressors and common factors.

Newey, W. K. (2001). Conditional moment restrictions in censored and truncated regression models. Econometric Theory, 17(5), 863–888.

Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. In R. F. Engle & D. L. McFadden (Eds.), Handbook of Econometrics (Vol. IV, pp. 2112–2245).

Norkutė, M., Sarafidis, V., Yamagata, T., & Cui, G. (2021).
Instrumental variable estimation of dynamic linear panel data models with defactored regressors and a multifactor error structure. Journal of Econometrics, 220, 416–446. https://doi.org/10.1016/j.jeconom.2020.04.008

Papke, L. E. (2005). The effects of spending on test pass rates: Evidence from Michigan. Journal of Public Economics, 89, 821–839. https://doi.org/10.1016/j.jpubeco.2004.05.008

Papke, L. E., & Wooldridge, J. M. (2008). Panel data methods for fractional response variables with an application to test pass rates. Journal of Econometrics, 145, 121–133. https://doi.org/10.1016/j.jeconom.2008.05.009

Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica, 74, 967–1012.

Phillips, R. F. (2020). Quantifying the advantages of forward orthogonal deviations for long time series. Computational Economics, 55(2), 653–672.

Rao, C. R., & Mitra, S. K. (1972). Generalized inverse of a matrix and its applications. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California.

Robertson, D., & Sarafidis, V. (2015). IV estimation of panels with factor residuals. Journal of Econometrics, 185, 526–541. https://doi.org/10.1016/j.jeconom.2014.12.001

Schlenker, W., & Walker, W. R. (2016). Airports, air pollution, and contemporaneous health. The Review of Economic Studies, 83(2), 768–809.

Schmidt, P., Ahn, S. C., & Wyhowski, D. (1992). On the estimation of panel-data models with serial correlation when instruments are not strictly exogenous: Comment. Journal of Business & Economic Statistics, 10, 10–14. https://doi.org/10.2307/1391796

Sherman, J., & Morrison, W. J. (1950). Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Annals of Mathematical Statistics, 21, 124–127.

Verdier, V. (2018).
Local semi-parametric efficiency of the Poisson fixed effects estimator. Journal of Econometric Methods, 7(1).

Westerlund, J. (2019). On estimation and inference in heterogeneous panel regressions with interactive effects. Journal of Time Series Analysis, 40, 852–857. https://doi.org/10.1111/jtsa.12432

Westerlund, J. (2020). A cross-section average-based principal components approach for fixed-T panels. Journal of Applied Econometrics, 35(6), 776–785.

Westerlund, J., Petrova, Y., & Norkutė, M. (2019). CCE in fixed-T panels. Journal of Applied Econometrics, 34, 746–761. https://doi.org/10.1002/jae.2707

Williams, M. L., Burnap, P., Javed, A., Liu, H., & Ozalp, S. (2020). Hate in the machine: Anti-Black and anti-Muslim social media posts as predictors of offline racially and religiously aggravated crime. The British Journal of Criminology, 60(1), 93–117.

Wooldridge, J. M. (1997). Multiplicative panel data models without the strict exogeneity assumption. Econometric Theory, 13(5), 667–678.

Wooldridge, J. M. (1999). Distribution-free estimation of some nonlinear panel data models. Journal of Econometrics, 90(1), 77–97.

Wooldridge, J. M. (2005). Fixed-effects and related estimators for correlated random-coefficient and treatment-effect panel data models. The Review of Economics and Statistics, 87, 385–390.

Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed., Vol. 1). MIT Press.