THREE ESSAYS ON PANEL DATA MODELS WITH INTERACTIVE AND UNOBSERVED EFFECTS

By Nicholas Lynn Brown

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics – Doctor of Philosophy

2022

ABSTRACT

Chapter 1: More Efficient Estimation of Multiplicative Panel Data Models in the Presence of Serial Correlation (with Jeffrey Wooldridge)

We provide a systematic approach to obtaining an estimator asymptotically more efficient than the popular fixed effects Poisson (FEP) estimator for panel data models with multiplicative heterogeneity in the conditional mean. In particular, we derive the optimal instrumental variables under appealing "working" second moment assumptions that allow underdispersion, overdispersion, and general patterns of serial correlation. Because parameters in the optimal instruments must be estimated, we argue for combining our new moment conditions with those that define the FEP estimator to obtain a generalized method of moments (GMM) estimator no less efficient than either the FEP estimator or the estimator using the new instruments. A simulation study shows that the GMM estimator behaves well in terms of bias, and it often delivers nontrivial efficiency gains, even when the working second-moment assumptions fail.

Chapter 2: Information equivalence among transformations of semiparametric nonlinear panel data models

I consider transformations of nonlinear semiparametric mean functions which yield moment conditions for estimation. Such transformations are said to be information equivalent if they yield the same asymptotic efficiency bound. I first derive a unified theory of algebraic equivalence for moment conditions created by a given linear transformation.
The main equivalence result states that, under standard regularity conditions, transformations which create conditional moment restrictions in a given empirical setting need only have equal rank to reach the same efficiency bound. Example applications are considered, including nonlinear models with multiplicative heterogeneity and linear models with arbitrary unobserved factor structures.

Chapter 3: Moment-based Estimation of Linear Panel Data Models with Factor-augmented Errors

I consider linear panel data models with unobserved factor structures when the number of time periods is small relative to the number of cross-sectional units. I examine two popular methods of estimation: the first eliminates the factors with a parameterized quasi-long-differencing (QLD) transformation. The other, referred to as common correlated effects (CCE), uses the cross-sectional averages of the independent and response variables to project out the space spanned by the factors. I show that the classical CCE assumptions imply unused moment conditions which can be exploited by the QLD transformation to derive new linear estimators which weaken identifying assumptions and have desirable theoretical properties. I prove asymptotic normality of the linear QLD estimators under a heterogeneous slope model which allows for a tradeoff between identifying conditions. These estimators do not require the number of cross-sectional variables to be less than $T-1$, a strong restriction in fixed-$T$ CCE analysis. Finally, I investigate the effects of per-student expenditure on standardized test performance using data from the state of Michigan.

ACKNOWLEDGEMENTS

To my dissertation committee: Jeff, you have given me more of your time than I ever deserved. Thank you for all of your patience and guidance. Thank you Peter for seeing potential in me and helping me along my academic journey. Despite your protest, I can't help but think of you as my co-chair.
To Ben: I have benefited greatly from having such a brilliant applied researcher on my committee, someone who quickly digested my work and showed me how to apply it in relevant cases. Finally, I want to thank Nicky; I have enjoyed working with you through the AFRE tutoring program.

To my fellow graduate students: thank you for your friendship and support throughout these past five years. My qualifying exam study group, Sean, Andrew, Joffré, Elise, and Alex, without whom I would not be here. To Mehmet and Taeyoon, the oddball macro and financial economists. And to my econometrics mentors, Alyssa and Akanksha. Finally, I want to give a special thanks to Bhavna: despite living halfway across the world, you were always available to jump on the phone and support me, especially during the job market. I look forward to our future collaborations.

To my family: my emotional bedrock. To my mom Kathi and dad Curt, I can never repay you for your love and support throughout my entire life. You have nurtured me into the person I am today, and I am forever grateful. Also to my bonus parents Lorraine and Kevin, who have become an integral part of my family. To my brothers Jack, Nicky, Mark, and Kian, four of my closest friends and partners in crime. To Katie, whose presence brightens my home. To my confidant and future sister Dana: I would not have made it through U of I without your friendship. I am elated you have joined our family. Finally, to my flock: Griffin and Stark, my feathered friends. You drive me insane, but I could not imagine life without you two.

Last but not least, Danielle. You are my best friend. You give me the strength to go on. You are my solace and my inspiration. Everything I do I do for you. I love you more than life. If I were a poet, I could fully articulate how much you mean to me, but unfortunately I'm only an economist, so you'll just have to take my word.

TABLE OF CONTENTS

LIST OF TABLES
CHAPTER 1  MORE EFFICIENT ESTIMATION OF MULTIPLICATIVE PANEL DATA MODELS IN THE PRESENCE OF SERIAL CORRELATION
  1.1 Introduction
  1.2 Model and Background
  1.3 Optimal Instruments under Second Moment Assumptions
  1.4 Operationalizing Optimal IV Estimation
  1.5 A Small Simulation Study
  1.6 Summary and Conclusion
CHAPTER 2  INFORMATION EQUIVALENCE AMONG TRANSFORMATIONS OF SEMIPARAMETRIC NONLINEAR PANEL DATA MODELS
  2.1 Introduction
  2.2 Information equivalence
    2.2.1 Model
    2.2.2 General equivalence result
  2.3 Examples of information equivalence
    2.3.1 Multiplicative heterogeneity
    2.3.2 Linear factor model
    2.3.3 Random trend
  2.4 Practical considerations
  2.5 Conclusion
CHAPTER 3  MOMENT-BASED ESTIMATION OF LINEAR PANEL DATA MODELS WITH FACTOR-AUGMENTED ERRORS
  3.1 Introduction
  3.2 Model
    3.2.1 Common Correlated Effects
    3.2.2 Quasi-long-differencing
  3.3 Estimation
    3.3.1 CCE Moment Conditions
    3.3.2 Pooled and Mean Group QLD
  3.4 Heterogeneous Slopes
  3.5 Simulations
    3.5.1 Main Results
    3.5.2 Comparison to TWFE
  3.6 Application
  3.7 Conclusion
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: Conditional Poisson distribution
Table 1.2: Conditional Gamma distribution
Table 3.1: GMM estimators
Table 3.2: Misspecifying $p_0$
Table 3.3: Pooled estimators, $K = 2$
Table 3.4: Pooled estimators, $K = 3$
Table 3.5: Mean group estimators
Table 3.6: AR(1) factor structure
Table 3.7: TWFE specification
Table 3.8: Testing for $p_0$
Table 3.9: Controlling for heterogeneous intercept
Table .1: Pooled estimator, $K = 2$
Table .2: Pooled estimators, $K = 3$

CHAPTER 1

MORE EFFICIENT ESTIMATION OF MULTIPLICATIVE PANEL DATA MODELS IN THE PRESENCE OF SERIAL CORRELATION

1.1 Introduction

The fixed effects Poisson (FEP) estimator was originally developed by Hausman, Hall, and Griliches (1984) (hereafter, HHG) in their study of the effects of firm-level R&D spending on patent filings. HHG used the method of conditional maximum likelihood estimation (CMLE) to estimate the parameters in the conditional mean. In deriving the CMLE, HHG assumed that, conditional on the unobserved heterogeneity and the history of the covariates, the outcome variable is independent over time with a Poisson distribution. HHG showed that, conditional on the covariates and the sum of the counts over time, the joint distribution of the counts is multinomial and does not depend on the heterogeneity. Therefore, standard maximum likelihood theory applies, and the asymptotic theory assuming a fixed number of time periods is standard. Hahn (1997) verified that the FEP estimator achieves the semiparametric efficiency bound under the full distributional and conditional independence assumptions.

Wooldridge (1999) showed that the consistency of the FEP estimator only requires correct specification of the conditional mean function up to a multiplicative heterogeneity term. In particular, any kind of variance is allowed along with any kind of serial dependence.
In fact, the outcome variable need not even be a count variable: it can be any nonnegative outcome, including a continuous or corner solution response. Thus, the FEP estimator is to multiplicative panel data models what the linear FE estimator is to linear models with additive heterogeneity. When the conditional mean function is differentiable in the parameters (by far the leading case), Wooldridge (1999) established Fisher consistency of the FEP very generally. Specifically, Wooldridge showed that the score has a zero conditional mean (evaluated at the true parameter value) when the structural conditional mean is correctly specified. In addition to establishing robustness of the FEP estimator, the zero conditional mean property of the score leads to additional moment conditions that can be exploited in generalized method of moments (GMM) estimation to obtain estimators asymptotically more efficient than the FEP estimator. Unfortunately, the extra moment conditions proposed by Wooldridge (1999) are essentially ad hoc: they are not based on any notion of optimality. Consequently, the GMM approach to estimating multiplicative panel data models has not caught on: FEP estimation with the fully robust standard errors derived in Wooldridge (1999) is much more common. Some recent examples include McCabe and Snyder (2014, 2015), Schlenker and Walker (2016), Krapf, Ursprung, and Zimmermann (2017), Castillo, Mejia, and Restrepo (2018), and Williams, Burnap, Javed, Liu, and Ozalp (2020).

Given that the FEP estimator is fully robust to distributional misspecification and serial dependence, it is natural to wonder about its asymptotic efficiency under assumptions weaker than the full set of assumptions used by Hahn (1997). Recently, Verdier (2018) showed that the Poisson distributional assumption and conditional independence are not necessary for the FEP estimator to achieve Chamberlain's (1987, 1992) efficiency bound.
In particular, Verdier (2018) showed that it is sufficient to impose the Poisson assumption that the variance equals the mean and that the outcomes are serially uncorrelated conditional on heterogeneity and the covariates. While weaker than the HHG assumptions, these conditions are still restrictive. The assumption that the variance equals the mean, even after conditioning on unobserved heterogeneity, is very special. For example, the most common parameterization of the gamma distribution violates equality of the variance and mean. Moreover, serial correlation in the idiosyncratic errors of linear unobserved effects models is pervasive (which is why researchers now routinely compute standard errors robust to general serial correlation), and it is known how to exploit serial correlation in fixed effects versions of generalized least squares (GLS) to improve efficiency over the usual fixed effects estimator; see, for example, Im, Ahn, Schmidt, and Wooldridge (1999). It seems natural to search for analogous improvements over the FEP estimator in the presence of serial correlation and more flexible variance-mean relationships.

In this paper, we relax the second moment assumptions that are implied by the traditional HHG assumptions and derive the optimal instruments, thereby showing how to obtain an estimator that achieves Chamberlain's (1992) lower bound. Our efficiency result is new, and includes the Verdier (2018) result as a special case. The variance assumption we use to derive the optimal instruments is appealing because, conditional on the observed covariates and unobserved heterogeneity, it allows for underdispersion (relative to the Poisson) or overdispersion. In the spirit of the popular generalized estimating equations (GEE) approach (see Liang and Zeger, 1986), we assume constant conditional correlations, but allow for any pattern of serial correlation.
One important difference from the GEE literature is that our assumptions are more "structural" in that we state the second moment assumptions conditional on the unobserved heterogeneity. This is analogous to the linear model with an additive, unobserved effect when the working correlation matrix of the idiosyncratic errors is assumed to be constant but is otherwise unrestricted.

In order to obtain parametric forms for the optimal instruments, we supplement the flexible second moment assumptions for the response variable with moment assumptions about the multiplicative heterogeneity. These parametric assumptions are fairly flexible and are commonly used in the literature, particularly in traditional and correlated random effects environments when one needs to impose distributional assumptions on the heterogeneity in order to obtain consistent estimators. Here, we impose first and second moment assumptions in order to obtain the optimal instruments. We must emphasize that the estimator based on the optimal instruments, which we refer to as the "generalized FEP (GFEP) estimator," does not require any assumptions for consistency and asymptotic normality beyond those used by the FEP estimator. That our new estimator is just as robust as the FEP estimator in terms of consistency is important, as it is unfair to claim efficiency improvements if the new estimator is not as robust as the popular, robust FEP estimator. In order to emphasize the robustness of our estimator, we use the term "working" assumptions. The key is that, under these parametric "working" assumptions, we obtain the optimal instruments. If the working assumptions are correct, then we have a just identified estimator that is more efficient than the FEP estimator. If any of the working assumptions are incorrect, the "optimal" instrumental variables (IVs) are no longer optimal, and so the GFEP no longer achieves Chamberlain's lower bound.
Therefore, we have two estimators that are consistent under the same assumptions but efficient under different working assumptions. To ensure that we have an estimator that is at least as efficient as both the FEP estimator and the GFEP estimator, and usually more efficient, we combine the two sets of moment conditions. With $K$ parameters this gives $K$ overidentifying restrictions. The overidentifying restrictions are useful for testing the conditional mean specification, not the working assumptions, as those are not being used for consistency.

To summarize, this paper has three primary contributions. First, we relax the second moment assumptions implied by the traditional fixed effects Poisson setting and obtain the optimal instruments under an appealing set of second moment working assumptions, including allowing for general patterns of serial correlation. Second, we operationalize the estimator by imposing additional working assumptions on moments of the heterogeneity distribution, resulting in a GMM estimator that is computationally simple and is guaranteed to be asymptotically no less efficient than both the FEP estimator and the GFEP estimator. Third, we significantly relax the conditions under which the FEP estimator achieves the asymptotic variance lower bound, allowing for both underdispersion and overdispersion in the variance conditional on observed covariates and unobserved heterogeneity.

The underlying asymptotic theory in this paper is for the microeconometric setting that treats the number of time periods, $T$, as fixed, and lets the cross section dimension, $N$, increase without bound. We assume random sampling in the cross section dimension but impose no restrictions on the time series dependence. We do not provide formal regularity conditions because the asymptotic theory is standard and follows arguments used in hundreds of panel data papers that impose random sampling in the cross section.
We do assume smoothness so that certain derivatives (in particular, that of the conditional mean function) exist and are continuous.

The rest of the paper is organized as follows. Section 1.2 presents the conditional mean model and summarizes the consistency result for the FEP estimator. Section 1.3 derives the optimal instruments under two working variance assumptions, including an unrestricted (but constant) conditional correlation matrix. Section 1.4 shows how to implement the GFEP estimator and the GMM estimator that combines the two sets of moment conditions. Section 1.5 provides promising simulation evidence comparing the FEP, GFEP, and GMM estimators under serial correlation with both underdispersion and overdispersion in the structural variance. Section 1.6 contains concluding remarks.

1.2 Model and Background

We consider a balanced panel data setting where, for each $i$, $\{(y_{it}, \mathbf{x}_{it}, c_i) : t = 1, 2, \ldots, T\}$ is a random draw from the population. We observe the nonnegative response variable $y_{it} \geq 0$ and $\mathbf{x}_{it}$, a $1 \times K$ vector. The scalar $c_i$ is the unobserved heterogeneity. As is usual in fixed effects environments, the elements of $\mathbf{x}_{it}$ must have variation across $t$ for at least some population units. Typically, these would include dummy variables indicating different time periods to allow for flexible aggregate time effects. The entire observed history of the covariates is $\mathbf{x}_i = (\mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iT})$. As mentioned in the introduction, we are treating $T$ as fixed in the asymptotic analysis. Therefore, because we assume random sampling in the cross section, relevant assumptions can be stated for a random draw $i$ from the population. The substantive assumptions that we make throughout the paper are that the model of the conditional mean is correctly specified, the heterogeneity is multiplicative, and the covariates are strictly exogenous conditional on $c_i$.
These are all captured by the following.

Assumption Conditional Mean (CM): For $t = 1, \ldots, T$ and some $\boldsymbol{\beta}_0 \in \mathbb{R}^P$,

$$E(y_{it} \mid \mathbf{x}_i, c_i) = E(y_{it} \mid \mathbf{x}_{it}, c_i) = c_i m_t(\mathbf{x}_{it}, \boldsymbol{\beta}_0) \quad (1.2.1)$$

where $m_t(\mathbf{x}_t, \cdot) \geq 0$ is continuously differentiable on $\mathbb{R}^P$ for all $\mathbf{x}_t \in \mathcal{X}_t$, the support of $\mathbf{x}_{it}$. $\blacksquare$

As discussed in Wooldridge (1999), for consistency of the FEP estimator one can get by with assuming continuity over the parameter space, but we impose assumptions that imply asymptotic normality and easy calculation of asymptotic efficiency bounds. See Newey and McFadden (1994) or Wooldridge (2010, Chapter 12) for formal regularity conditions. In terms of smoothness, assuming $m_t(\mathbf{x}_{it}, \cdot)$ is twice continuously differentiable is sufficient and is almost always true in practice.

By far the leading case of the conditional mean function is

$$E(y_{it} \mid \mathbf{x}_{it}, c_i) = c_i \exp(\mathbf{x}_{it} \boldsymbol{\beta}_0) \quad (1.2.2)$$

where $\mathbf{x}_{it}$ can include time period dummies to allow different intercepts inside the exponential function. Naturally, $\mathbf{x}_{it}$ can also include nonlinear functions of underlying explanatory variables, including squares and interactions. Given the choice in (1.2.2), $P = K$, but we also allow more general mean functions. Because we want to allow arbitrary dependence between $c_i$ and $\mathbf{x}_{it}$, we need time variation in the latter for at least some units in the population. This permits, for example, interactions among variables that have some time variation and others that do not.

Strict exogeneity conditional on the unobserved effect $c_i$ is implied by the first equality in (1.2.1). This assumption is restrictive (for example, it rules out lagged dependent variables), but it is much less restrictive than the strict exogeneity assumption typically used in the GEE literature because of conditioning on $c_i$.
In the typical GEE approach the strict exogeneity assumption is stated as $E(y_{it} \mid \mathbf{x}_i) = E(y_{it} \mid \mathbf{x}_{it})$. [For a discussion of GEE from an econometrics perspective, see Wooldridge (2010, Section 13.11.4).] Using iterated expectations, if (1.2.1) holds then

$$E(y_{it} \mid \mathbf{x}_i) = E(c_i \mid \mathbf{x}_i) m_t(\mathbf{x}_{it}, \boldsymbol{\beta}_0)$$

and the latter expression is not $E(y_{it} \mid \mathbf{x}_{it})$ if $E(c_i \mid \mathbf{x}_i) \neq E(c_i)$.

The multiplicative formulation using the exponential function in (1.2.2) can be obtained from $E(y_{it} \mid \mathbf{x}_{it}, a_i) = \exp(a_i + \mathbf{x}_{it} \boldsymbol{\beta}_0)$ where $c_i \equiv \exp(a_i)$. In applications where $P(y_{it} = 0) > 0$, it is important to use (1.2.2) to allow for the possibility that $c_i = 0$, which then implies $y_{it} = 0$, $t = 1, 2, \ldots, T$. Often in count data and corner solution applications one sees some units with $y_{it} = 0$ for all $t$. Remember, we are only assuming $y_{it} \geq 0$; no other restrictions are imposed on the support of $y_{it}$. In most cases, a model such as (1.2.2) is appealing when $y_{it}$ has no natural upper bound.

In FEP estimation, the following residual function, first studied by HHG, plays an important role:

$$u_{it}(\boldsymbol{\beta}) \equiv y_{it} - n_i p_t(\mathbf{x}_i, \boldsymbol{\beta}) \quad (1.2.3)$$

where $n_i \equiv \sum_{r=1}^{T} y_{ir}$ and

$$p_t(\mathbf{x}_i, \boldsymbol{\beta}) \equiv \frac{m_t(\mathbf{x}_{it}, \boldsymbol{\beta})}{\sum_{r=1}^{T} m_r(\mathbf{x}_{ir}, \boldsymbol{\beta})} \quad (1.2.4)$$

As convenient shorthand, we write $m_{it}(\boldsymbol{\beta}) = m_t(\mathbf{x}_{it}, \boldsymbol{\beta})$ and $p_{it}(\boldsymbol{\beta}) = p_t(\mathbf{x}_i, \boldsymbol{\beta})$.
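As a concrete illustration of (1.2.3) and (1.2.4), the following minimal numpy sketch (simulated Poisson data with gamma heterogeneity; all variable names and parameter values are hypothetical) computes the residuals and confirms that their sample mean is near zero at the true parameter, since the multiplicative heterogeneity cancels in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K = 20000, 4, 2
beta0 = np.array([0.5, -0.3])

x = rng.normal(size=(N, T, K))           # strictly exogenous covariates
c = rng.gamma(2.0, 0.5, size=(N, 1))     # multiplicative heterogeneity c_i
m = np.exp(x @ beta0)                    # m_t(x_it, beta0) = exp(x_it beta0), N x T
y = rng.poisson(c * m)                   # outcomes with E(y_it | x_i, c_i) = c_i m_it

def residuals(y, x, beta):
    """u_it(beta) = y_it - n_i p_t(x_i, beta), as in (1.2.3)-(1.2.4)."""
    m = np.exp(x @ beta)
    p = m / m.sum(axis=1, keepdims=True)   # shares p_it(beta), sum to one over t
    n = y.sum(axis=1, keepdims=True)       # n_i = sum_r y_ir
    return y - n * p

u = residuals(y, x, beta0)
print(np.abs(u.mean(axis=0)).max())        # near zero: c_i cancels in expectation
```

The Poisson and gamma choices here are only for data generation; the zero-mean property of the residuals relies on Assumption CM alone.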
We can stack the $p_{it}(\boldsymbol{\beta})$ into the $T \times 1$ vector $\mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta})$ and write

$$\mathbf{u}_i(\boldsymbol{\beta}) = \mathbf{y}_i - \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}) n_i = \mathbf{y}_i - \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}) \mathbf{1}_T' \mathbf{y}_i = \left[\mathbf{I}_T - \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}) \mathbf{1}_T'\right] \mathbf{y}_i \quad (1.2.5)$$

where $\mathbf{u}_i(\boldsymbol{\beta})$ is the $T \times 1$ vector with $t$th element $u_{it}(\boldsymbol{\beta})$ and $\mathbf{1}_T$ is the $T \times 1$ vector with all elements unity. As shown in Wooldridge (1999), under Assumption CM,

$$E[\mathbf{u}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i] = \mathbf{0} \quad (1.2.6)$$

Further, the score of the quasi-log-likelihood function for random draw $i$ can be written as

$$\mathbf{s}_i(\boldsymbol{\beta}) = \nabla_{\boldsymbol{\beta}} \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta})' \mathbf{W}(\mathbf{x}_i, \boldsymbol{\beta}) \mathbf{u}_i(\boldsymbol{\beta}) \quad (1.2.7)$$

where

$$\mathbf{W}(\mathbf{x}_i, \boldsymbol{\beta}) = \mathrm{diag}\left\{[p_{i1}(\boldsymbol{\beta})]^{-1}, [p_{i2}(\boldsymbol{\beta})]^{-1}, \ldots, [p_{iT}(\boldsymbol{\beta})]^{-1}\right\} \quad (1.2.8)$$

is $T \times T$. It follows immediately that

$$E[\mathbf{s}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i] = \mathbf{0} \quad (1.2.9)$$

and this translates, under standard regularity conditions, into the consistency and $\sqrt{N}$-asymptotic normality of the FEP estimator. For emphasis, only Assumption CM is needed for consistency and asymptotic normality, and fully robust inference using a sandwich estimator is essentially trivial.

Wooldridge (1999) also notes that the conditional moment restrictions in (1.2.6) lead to uncountably many unconditional moment restrictions beyond those used by the FEP estimator, which are given by $E[\mathbf{s}_i(\boldsymbol{\beta}_0)] = \mathbf{0}$. Wooldridge (1999) suggests some extra moment conditions but makes no attempt to find the optimal estimator based on (1.2.6). In the next section we derive the optimal instruments under a set of second moment assumptions.

1.3 Optimal Instruments under Second Moment Assumptions

Given the moment conditions in (1.2.6), we can apply Chamberlain's (1992) semiparametric efficiency bound to obtain an asymptotically efficient estimator.
Define

$$\mathbf{D}_o(\mathbf{x}_i) \equiv E\left[\nabla_{\boldsymbol{\beta}} \mathbf{u}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i\right] \quad (1.3.1)$$

and

$$\mathbf{V}_o(\mathbf{x}_i) \equiv \mathrm{Var}\left[\mathbf{u}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i\right] \quad (1.3.2)$$

Under regularity conditions of the kind found in Newey and McFadden (1994), Newey (2001) extended Chamberlain (1992) by allowing $\mathbf{V}_o(\mathbf{x}_i)$ to be singular and showed that the efficient estimator that uses only (1.2.6) has asymptotic variance

$$\left\{E\left[\mathbf{D}_o(\mathbf{x}_i)' \mathbf{V}_o(\mathbf{x}_i)^- \mathbf{D}_o(\mathbf{x}_i)\right]\right\}^{-1} \quad (1.3.3)$$

where $\mathbf{V}_o(\mathbf{x}_i)^-$ denotes any generalized inverse ($g$-inverse), which means $\mathbf{V}_o(\mathbf{x}_i) \mathbf{V}_o(\mathbf{x}_i)^- \mathbf{V}_o(\mathbf{x}_i) = \mathbf{V}_o(\mathbf{x}_i)$. Because $\mathbf{V}_o(\mathbf{x}_i)$ is symmetric, a symmetric $g$-inverse always exists, and it simplifies notation to take $\mathbf{V}_o(\mathbf{x}_i)^-$ to be symmetric. Below we will obtain an explicit formula for a symmetric $g$-inverse.

Given a random sample of size $N$ and knowledge of $\mathbf{D}_o(\mathbf{x}_i)$ and $\mathbf{V}_o(\mathbf{x}_i)$, an estimator $\hat{\boldsymbol{\beta}}_{OPT}$ that achieves this lower bound solves the exactly identified moment equations

$$\sum_{i=1}^{N} \mathbf{D}_o(\mathbf{x}_i)' \mathbf{V}_o(\mathbf{x}_i)^- \mathbf{u}_i\left(\hat{\boldsymbol{\beta}}_{OPT}\right) = \mathbf{0} \quad (1.3.4)$$

Of course, this estimator is infeasible because $\mathbf{D}_o(\mathbf{x}_i)$ and $\mathbf{V}_o(\mathbf{x}_i)$ are generally unknown. In principle, both can be nonparametrically estimated. However, especially given the often large dimension of $\mathbf{x}_i$, nonparametric estimation of many conditional means, variances, and covariances hardly seems worth it just to improve asymptotic efficiency over the FEP estimator. Moreover, the finite-sample properties of the resulting estimator could be poor. Our goal here is to obtain simple formulas for the optimal IVs $\mathbf{Z}^*(\mathbf{x}_i) \equiv \mathbf{V}_o(\mathbf{x}_i)^- \mathbf{D}_o(\mathbf{x}_i)$ under reasonably flexible parametric second moment assumptions that have antecedents in the count data literature.
To find $\mathbf{D}_o(\mathbf{x}_i)$, note that

$$\nabla_{\boldsymbol{\beta}} \mathbf{u}_i(\boldsymbol{\beta}) = -\nabla_{\boldsymbol{\beta}} \mathbf{p}(\mathbf{x}_i, \boldsymbol{\beta}) n_i \quad (1.3.5)$$

where, for each $t$, we can write

$$\nabla_{\boldsymbol{\beta}} p_{it}(\boldsymbol{\beta}) = \left[\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta})\right]^{-1} \left\{\nabla_{\boldsymbol{\beta}} m_{it}(\boldsymbol{\beta}) - \left[\sum_{r=1}^{T} \nabla_{\boldsymbol{\beta}} m_{ir}(\boldsymbol{\beta})\right] p_{it}(\boldsymbol{\beta})\right\}$$

Therefore,

$$\nabla_{\boldsymbol{\beta}} \mathbf{p}_i(\boldsymbol{\beta}) = \left[\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta})\right]^{-1} \left[\nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta}) - \mathbf{p}_i(\boldsymbol{\beta}) \mathbf{1}_T' \nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta})\right] = \left[\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta})\right]^{-1} \left[\mathbf{I}_T - \mathbf{p}_i(\boldsymbol{\beta}) \mathbf{1}_T'\right] \nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta}) \quad (1.3.6)$$

which gives us the necessary gradient. Further, because

$$E(n_i \mid \mathbf{x}_i, c_i) = c_i \left[\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta}_0)\right]$$

we have

$$E\left[\nabla_{\boldsymbol{\beta}} \mathbf{u}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i, c_i\right] = -c_i \left[\mathbf{I}_T - \mathbf{p}_i(\boldsymbol{\beta}_0) \mathbf{1}_T'\right] \nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta}_0)$$

Now, let $\mu_c(\mathbf{x}_i) \equiv E(c_i \mid \mathbf{x}_i)$. Then we have shown

$$\mathbf{D}_o(\mathbf{x}_i) = -\mu_c(\mathbf{x}_i) \left[\mathbf{I}_T - \mathbf{p}_i(\boldsymbol{\beta}_0) \mathbf{1}_T'\right] \nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta}_0) \quad (1.3.7)$$

which is the first piece needed to derive the optimal instruments. The unknown function in $\mathbf{D}_o(\mathbf{x}_i)$, $\mu_c(\mathbf{x}_i)$, is the conditional mean in the heterogeneity distribution.

Next, consider $\mathbf{V}_o(\mathbf{x}_i)^-$. First, we can write

$$\mathbf{V}_o(\mathbf{x}_i) \equiv \mathrm{Var}\left[\mathbf{u}_i(\boldsymbol{\beta}_0) \mid \mathbf{x}_i\right] = \mathrm{Var}\left\{\left[\mathbf{I}_T - \mathbf{p}_i(\boldsymbol{\beta}_0) \mathbf{1}_T'\right] \mathbf{y}_i \mid \mathbf{x}_i\right\} = (\mathbf{I}_T - \mathbf{P}_i) \boldsymbol{\Omega}_i (\mathbf{I}_T - \mathbf{P}_i)' \quad (1.3.8)$$

where

$$\boldsymbol{\Omega}_i \equiv \mathrm{Var}(\mathbf{y}_i \mid \mathbf{x}_i) \quad (1.3.9)$$

is assumed to be nonsingular (with probability one) and $\mathbf{P}_i \equiv \mathbf{p}_i(\boldsymbol{\beta}_0) \mathbf{1}_T'$ is $T \times T$. Because the $p_{it}(\boldsymbol{\beta}_0)$ sum to unity across $t$, it is easy to show that $\mathbf{P}_i$ is an idempotent (but not symmetric) matrix with $\mathrm{rank}(\mathbf{P}_i) = 1$. In establishing that the FEP estimator is asymptotically efficient under the Poisson first and second moment assumptions, Verdier (2018) finds a particular symmetric matrix which is inherent to the FEP solution.
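The gradient formula (1.3.6) is easy to get wrong in code, so a quick finite-difference check is worthwhile; the sketch below (exponential mean as in (1.2.2), all values hypothetical) compares the analytic gradient of the shares for a single unit against central differences:

```python
import numpy as np

rng = np.random.default_rng(3)
T, K = 4, 3
x = rng.normal(size=(T, K))              # one unit's covariates
beta = rng.normal(size=K)

def p_shares(b):
    m = np.exp(x @ b)                    # exponential mean, as in (1.2.2)
    return m / m.sum()

# Analytic gradient of the shares, following (1.3.6)
m = np.exp(x @ beta)
p = m / m.sum()
grad_m = m[:, None] * x                  # T x K gradient of m_it for the exponential mean
grad_p = (grad_m - np.outer(p, grad_m.sum(axis=0))) / m.sum()

# Central finite differences, one parameter at a time
eps = 1e-6
fd = np.column_stack([(p_shares(beta + eps * e) - p_shares(beta - eps * e)) / (2 * eps)
                      for e in np.eye(K)])
print(np.allclose(grad_p, fd, atol=1e-6))
```

The same check works for any smooth mean function; only the `grad_m` line is specific to the exponential case.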
The matrix

$$\mathbf{V}_o(\mathbf{x}_i)^- = \boldsymbol{\Omega}_i^{-1} - \boldsymbol{\Omega}_i^{-1} \mathbf{p}_i(\boldsymbol{\beta}_0) \left[\mathbf{p}_i(\boldsymbol{\beta}_0)' \boldsymbol{\Omega}_i^{-1} \mathbf{p}_i(\boldsymbol{\beta}_0)\right]^{-1} \mathbf{p}_i(\boldsymbol{\beta}_0)' \boldsymbol{\Omega}_i^{-1} = \boldsymbol{\Omega}_i^{-1} - \boldsymbol{\Omega}_i^{-1} \mathbf{m}_i(\boldsymbol{\beta}_0) \left[\mathbf{m}_i(\boldsymbol{\beta}_0)' \boldsymbol{\Omega}_i^{-1} \mathbf{m}_i(\boldsymbol{\beta}_0)\right]^{-1} \mathbf{m}_i(\boldsymbol{\beta}_0)' \boldsymbol{\Omega}_i^{-1} \quad (1.3.10)$$

is a generalized inverse of $\mathbf{V}_o(\mathbf{x}_i)$. The second equality in (1.3.10) follows by the definition of $\mathbf{p}_i(\boldsymbol{\beta}_0)$ and by cancelling terms. By simple multiplication it is easily seen that $\mathbf{p}_i(\boldsymbol{\beta}_0)' \mathbf{V}_o(\mathbf{x}_i)^- = \mathbf{0}$, and so

$$\mathbf{D}_o(\mathbf{x}_i)' \mathbf{V}_o(\mathbf{x}_i)^- = -\mu_c(\mathbf{x}_i) \nabla_{\boldsymbol{\beta}} \mathbf{m}_i(\boldsymbol{\beta}_0)' \mathbf{V}_o(\mathbf{x}_i)^- \quad (1.3.11)$$

The expression for the optimal instruments in (1.3.11) is not directly applicable because $\mu_c(\cdot)$ and $\mathbf{V}_o(\cdot)$ are unknown, with the latter depending on the unknown $\boldsymbol{\Omega}_i$. We now impose assumptions on the structural variance-covariance matrix, $\mathrm{Var}(\mathbf{y}_i \mid \mathbf{x}_i, c_i)$, that lead to useful simplifications. The first restriction is on the diagonal elements.

Assumption Working Variance 1 (WV.1): For $t = 1, \ldots, T$, there exists $\alpha > 0$ such that

$$\mathrm{Var}(y_{it} \mid \mathbf{x}_i, c_i) = \mathrm{Var}(y_{it} \mid \mathbf{x}_{it}, c_i) = \alpha E(y_{it} \mid \mathbf{x}_{it}, c_i) = \alpha c_i m_{it}(\boldsymbol{\beta}_0) \quad (1.3.12)$$ $\blacksquare$

Assumption WV.1 is motivated by the count data literature, where the assumption that the variance is proportional to the mean is commonly used in generalized linear models (GLM) and GEE settings; see, for example, McCullagh and Nelder (1989), Liang and Zeger (1986), Hardin and Hilbe (2012), and Wooldridge (2010, Section 13.11). Again, one important difference between our setting and the standard GEE setting is that we state the first and second moments conditional on the unobserved heterogeneity, $c_i$, in addition to the observable variables, $\mathbf{x}_i$.
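The two properties of (1.3.10) used above, that it is a $g$-inverse of $\mathbf{V}_o(\mathbf{x}_i)$ and that it annihilates $\mathbf{p}_i(\boldsymbol{\beta}_0)$, can also be confirmed numerically; a minimal sketch with an arbitrary positive definite $\boldsymbol{\Omega}_i$ (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5

# Arbitrary positive mean vector m_i and nonsingular Omega_i = Var(y_i | x_i)
m = np.exp(rng.normal(size=T))
p = m / m.sum()                              # shares p_i(beta0), sum to one
A = rng.normal(size=(T, T))
Omega = A @ A.T + T * np.eye(T)              # symmetric positive definite

P = np.outer(p, np.ones(T))                  # P_i = p_i 1_T'
V = (np.eye(T) - P) @ Omega @ (np.eye(T) - P).T   # V_o(x_i) as in (1.3.8)

Oinv = np.linalg.inv(Omega)
# Candidate g-inverse, second expression in (1.3.10), written in terms of m_i
Vg = Oinv - (Oinv @ np.outer(m, m) @ Oinv) / (m @ Oinv @ m)

print(np.allclose(V @ Vg @ V, V))            # g-inverse property: V V^- V = V
print(np.allclose(p @ Vg, 0))                # p_i' V_o^- = 0, as used in (1.3.11)
```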
Once the population is effectively partitioned on the basis of (x_i, c_i), the so-called "GLM variance assumption" is more appealing. We do not restrict the value of α = Var(y_it|x_it, c_i)/E(y_it|x_it, c_i), and so the y_it can exhibit underdispersion or overdispersion relative to the Poisson distribution. This variance-mean relationship also holds for one popular parameterization of the negative binomial distribution (which implies overdispersion), and can hold for continuous outcomes as well, such as a common parameterization of the gamma distribution. The second working assumption is on the conditional correlation matrix.

Assumption Working Variance 2 (WV.2): For a T × T symmetric, positive definite matrix R (with unity down the diagonal),
$$\mathrm{Corr}(\mathbf{y}_i|\mathbf{x}_i,c_i) = \mathbf{R}. \;\blacksquare \quad (1.3.13)$$

Assumption WV.2 is motivated by the GEE literature, where a constant conditional correlation matrix is the leading example of a working correlation assumption. We do not put restrictions on the elements of R, ρ_ts = Corr(y_it, y_is|x_i, c_i), other than those that ensure R is a valid correlation matrix. The special case of no serial correlation conditional on (x_i, c_i) is R = I_T. One could impose an exchangeability restriction on R, as is common in the GEE literature, but that is less attractive here because we are conditioning on c_i (which would often be assumed to be an explanation for an exchangeable structure without conditioning on c_i). With large N and small T, there is little reason to impose restrictions on R. Again, an important difference with the GEE literature is that we condition the correlation matrix on c_i as well as x_i – which makes R = I_T more tenable (but still unnecessary).
We can combine Assumptions WV.1 and WV.2 into a working variance-covariance matrix conditional on (x_i, c_i):
$$\mathrm{Var}(\mathbf{y}_i|\mathbf{x}_i,c_i) = \alpha\,c_i\,\mathbf{M}_i^{1/2}\mathbf{R}\mathbf{M}_i^{1/2} \quad (1.3.14)$$
where M_i ≡ diag{m_i1(β_0), m_i2(β_0), ..., m_iT(β_0)} and M_i^{1/2} is the obvious matrix square root. If not for conditioning on the unobserved heterogeneity c_i, (1.3.14) has a structure very familiar from the GEE literature on estimating conditional means of count variables with longitudinal data.

In stating Assumptions WV.1 and WV.2, we have opted not to include a "0" subscript on α or R. This decision requires a brief explanation. For deriving the optimal instruments, we are assuming the existence of "true values." However, when we discuss implementation of our new estimator in Section 1.4, we do not assume Assumptions WV.1 or WV.2 are in force. To ensure that the focus is on estimating β_0, and to simplify the notation, we omit the "0" subscripts on the parameters in the working assumptions.

Before deriving the optimal instruments, we first obtain Ω_i = Var(y_i|x_i) and provide a useful expression for its inverse. As shorthand, let m_i be the T × 1 vector of m_it(β_0), and define M_i^{1/2} as above. We use √m_i to denote the T × 1 vector containing the square roots of the m_it(β_0). In stating the next lemma, let σ_c²(x_i) = Var(c_i|x_i).

Lemma 1.3.1. Under Assumptions CM, WV.1, and WV.2,
$$\mathrm{Var}(\mathbf{y}_i|\mathbf{x}_i) = \boldsymbol{\Omega}_i = \alpha\mu_c(\mathbf{x}_i)\mathbf{M}_i^{1/2}\mathbf{R}\mathbf{M}_i^{1/2} + \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i' \quad (1.3.15)$$
which is positive definite. Further,
$$\boldsymbol{\Omega}_i^{-1} = \frac{1}{\alpha\mu_c(\mathbf{x}_i)}\,\mathbf{M}_i^{-1/2}\left\{\mathbf{R}^{-1} - \frac{\sigma_c^2(\mathbf{x}_i)}{\alpha\mu_c(\mathbf{x}_i) + \sigma_c^2(\mathbf{x}_i)\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}}\,\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\right\}\mathbf{M}_i^{-1/2}$$
Proof. See Appendix for proof.
□

Establishing the formula for Ω_i uses the law of total variance (for matrices). Positive definiteness of Ω_i follows because the first term in (1.3.15) is positive definite under WV.1 and WV.2 and the second is always positive semidefinite. As shown in the Appendix, the formula for Ω_i^{-1} applies a result due to Sherman and Morrison (1950). Now we can state the main optimal instrument result.

Theorem 1.3.1. Under Assumptions CM, WV.1, and WV.2, a symmetric generalized inverse of V_o(x_i) is
$$\mathbf{V}_o(\mathbf{x}_i)^- = \frac{1}{\alpha\mu_c(\mathbf{x}_i)}\,\mathbf{M}_i^{-1/2}\left[\mathbf{R}^{-1} - \frac{1}{\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}}\,\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\right]\mathbf{M}_i^{-1/2} \quad (1.3.16)$$
Further, the optimal T × K matrix of instruments, Z*(x_i), is
$$\mathbf{Z}^*(\mathbf{x}_i)' \equiv \nabla_{\boldsymbol{\beta}}\mathbf{m}_i(\boldsymbol{\beta}_0)'\,\mathbf{M}_i^{-1/2}\left[\mathbf{R}^{-1} - \frac{1}{\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}}\,\mathbf{R}^{-1}\sqrt{\mathbf{m}_i}\sqrt{\mathbf{m}_i}'\mathbf{R}^{-1}\right]\mathbf{M}_i^{-1/2} \quad (1.3.17)$$
where, again, m_i and M_i are evaluated at β_0. We have dropped the minus sign in D_o(x_i) as that does not affect the optimal choice.

Proof. See Appendix for proof. □

The optimal instrument matrix in (1.3.17) has a rather remarkable feature: it does not depend on the constant α nor on the conditional first two moments of the heterogeneity distribution, μ_c(x_i) and σ_c(x_i) – even though Ω_i^{-1} depends on all of these quantities and D_o(x_i) depends on μ_c(x_i). Under the working variance matrix assumptions, the optimal instruments depend only on β_0 and R. We have a natural preliminary estimator of β_0, namely, the FEP estimator. Estimating R is much more challenging, and for that we will introduce additional working assumptions – something we take up in the next section.
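The matrix algebra in Lemma 1.3.1 and Theorem 1.3.1 can be spot-checked numerically. The sketch below is illustrative only (the values of T, α, μ_c(x_i), σ_c²(x_i), m_i, and R are arbitrary hypothetical choices, not estimates from any data): it builds Ω_i from (1.3.15), confirms the Sherman–Morrison form of Ω_i^{-1}, and confirms that (1.3.16) is a generalized inverse of V_o(x_i) that annihilates p_i(β_0).

```python
import numpy as np

rng = np.random.default_rng(0)
T, alpha, mu_c, sig2_c = 5, 1.3, 0.8, 0.6        # hypothetical values
m = rng.uniform(0.5, 2.0, T)                      # stand-in for m_i(beta_0)
s = np.sqrt(m)                                    # the vector sqrt(m_i)
A = rng.normal(size=(T, T))
S = A @ A.T + T * np.eye(T)
R = S / np.outer(np.sqrt(np.diag(S)), np.sqrt(np.diag(S)))  # a valid correlation matrix

Mh = np.diag(s)                                   # M_i^{1/2}
Mih = np.diag(1.0 / s)                            # M_i^{-1/2}
Rinv = np.linalg.inv(R)

# Omega_i from (1.3.15) and its closed-form inverse from Lemma 1.3.1
Omega = alpha * mu_c * Mh @ R @ Mh + sig2_c * np.outer(m, m)
denom = alpha * mu_c + sig2_c * (s @ Rinv @ s)
Omega_inv = (Mih @ (Rinv - (sig2_c / denom) * np.outer(Rinv @ s, Rinv @ s)) @ Mih) / (alpha * mu_c)
print(np.allclose(Omega @ Omega_inv, np.eye(T)))  # True

# V_o(x_i) from (1.3.8) and the generalized inverse (1.3.16)
p = m / m.sum()
IP = np.eye(T) - np.outer(p, np.ones(T))          # I_T - p_i 1_T'
V = IP @ Omega @ IP.T
Vg = (Mih @ (Rinv - np.outer(Rinv @ s, Rinv @ s) / (s @ Rinv @ s)) @ Mih) / (alpha * mu_c)
print(np.allclose(V @ Vg @ V, V), np.allclose(p @ Vg, 0))  # True True
```

Both checks pass for any positive m_i and any valid correlation matrix R, reflecting that the result is algebraic rather than distributional.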
An interesting special case of Theorem 1.3.1 is when the {y_it : t = 1, 2, ..., T} are conditionally uncorrelated, an assumption with a long history in linear and nonlinear unobserved effects models. Traditional treatments of linear unobserved effects models – often called "random effects" models – include the assumption that idiosyncratic shocks are serially uncorrelated, which implies that, conditional on (x_i, c_i), the {y_it : t = 1, 2, ..., T} are uncorrelated. In using joint maximum likelihood to estimate nonlinear models with unobserved heterogeneity – random effects probit and ordered probit, random effects multinomial logit, random effects Tobit, random effects versions of the Poisson and negative binomial models, among others – it is almost always assumed that the {y_it : t = 1, 2, ..., T} are independent conditional on (x_i, c_i); see Sections 13.9, 15.8, 17.8, and 18.7 in Wooldridge (2010).

Corollary 1.3.1. Under Assumptions CM, WV.1, and WV.2 with R = I_T, the FEP estimator is efficient among estimators that use only Assumption CM for consistency.

Proof. See Appendix for proof. □

Corollary 1.3.1 is a new result that shows the FEP estimator is asymptotically efficient for any α > 0 in Assumption WV.1 provided there is no serial correlation. Conditional on x_i and c_i, any amount of constant underdispersion or overdispersion is allowed. Therefore, Corollary 1.3.1 improves on Verdier (2018), who imposed α = 1, the value that holds for the Poisson distribution. That FEP is asymptotically efficient for any α while allowing for any dependence between c_i and x_i allows us to make an interesting connection with the cross-sectional GLM literature.
As pointed out in Wooldridge (2010, Section 13.11.3), the cross-sectional version of Assumption WV.1 implies that the Poisson QMLE is asymptotically efficient among estimators that use only correct specification of the conditional mean function for consistency.

1.4 Operationalizing Optimal IV Estimation

From Theorem 1.3.1, in order to obtain a feasible optimal IV estimator under Assumptions CM, WV.1, and WV.2, we need a preliminary consistent estimator of β_0, and we either need to know R or have a consistent estimator of it. If we want to impose a specific structure on R – say, an AR(1) model with a known AR(1) parameter – then (1.3.17) can be used after replacing β_0 with β̂_FEP (the clear choice for a first-stage estimator of β_0). Remember, imposing such a restriction when it is incorrect would not affect consistency of the method of moments estimator; but the estimator would not be asymptotically efficient. Generally, we want to estimate R without imposing any restrictions.

In order to ignore the first-stage estimation when obtaining the asymptotic variance of √N(β̂_OPT − β_0), the first-stage estimators should be √N-consistent – a weak requirement because we are assuming random sampling and smooth moment and objective functions. See Wooldridge (2010, Chapter 14) for discussion. As mentioned earlier, it is very natural to use the FEP estimator as the initial estimator of β_0. Estimation of R is more difficult because it is the (working) correlation matrix conditional on the unobserved heterogeneity, c_i, in addition to x_i. The key to estimating R is the relationship in (1.3.15).
To see how (1.3.15) can be used, define a T × 1 vector of errors
$$\mathbf{v}_i \equiv \mathbf{y}_i - \mathrm{E}(\mathbf{y}_i|\mathbf{x}_i) = \mathbf{y}_i - \mu_c(\mathbf{x}_i)\,\mathbf{m}_i \quad (1.4.1)$$
Then
$$\mathrm{E}\left(\mathbf{v}_i\mathbf{v}_i'\,\middle|\,\mathbf{x}_i\right) = \alpha\mu_c(\mathbf{x}_i)\mathbf{M}_i^{1/2}\mathbf{R}\mathbf{M}_i^{1/2} + \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i' \quad (1.4.2)$$
which we can write in matrix error form as
$$\mathbf{v}_i\mathbf{v}_i' = \alpha\mu_c(\mathbf{x}_i)\mathbf{M}_i^{1/2}\mathbf{R}\mathbf{M}_i^{1/2} + \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i' + \mathbf{S}_i, \qquad \mathrm{E}(\mathbf{S}_i|\mathbf{x}_i) = \mathbf{0} \quad (1.4.3)$$
Next, define
$$\mathbf{k}_i \equiv \mathrm{E}(\mathbf{y}_i|\mathbf{x}_i) = \mu_c(\mathbf{x}_i)\,\mathbf{m}_i \quad (1.4.4)$$
and let K_i be the diagonalized version of k_i. Then
$$\mathbf{v}_i\mathbf{v}_i' - \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i' = \alpha\,\mathbf{K}_i^{1/2}\mathbf{R}\mathbf{K}_i^{1/2} + \mathbf{S}_i \quad (1.4.5)$$
and so
$$\mathbf{K}_i^{-1/2}\left[\mathbf{v}_i\mathbf{v}_i' - \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i'\right]\mathbf{K}_i^{-1/2}/\alpha = \mathbf{R} + \mathbf{K}_i^{-1/2}\mathbf{S}_i\mathbf{K}_i^{-1/2}/\alpha \quad (1.4.6)$$
By (1.4.3) and iterated expectations, the second term in (1.4.6), K_i^{-1/2} S_i K_i^{-1/2}/α, has a mean of zero. Therefore, we have shown
$$\mathbf{R} = \mathrm{E}\left\{\alpha^{-1}\mathbf{K}_i^{-1/2}\left[\mathbf{v}_i\mathbf{v}_i' - \sigma_c^2(\mathbf{x}_i)\,\mathbf{m}_i\mathbf{m}_i'\right]\mathbf{K}_i^{-1/2}\right\} \quad (1.4.7)$$
Combining (1.4.7) with (1.3.17) shows that α appears as a multiplicative factor in Z*(x_i), and therefore does not affect the optimal choice of instruments. Equation (1.4.7) for R suggests simply computing the sample analog of the matrix inside the expected value. However, we must deal with the fact that the matrix depends on three unknown quantities: the parameter α, the conditional mean function μ_c(·) (which appears in the definition of v_i), and the conditional variance function σ_c²(·). There are different ways to approach estimation of μ_c(·).
For example, under Assumption CM,
$$\mathrm{E}(n_i|\mathbf{x}_i,c_i) = c_i\left[\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta}_0)\right] \quad (1.4.8)$$
and so
$$\mathrm{E}\left[\frac{n_i}{\sum_{r=1}^{T} m_{ir}(\boldsymbol{\beta}_0)}\,\middle|\,\mathbf{x}_i\right] = \mu_c(\mathbf{x}_i) \quad (1.4.9)$$
Alternatively, we can write
$$\mathrm{E}\left[T^{-1}\sum_{t=1}^{T}\frac{y_{it}}{m_{it}(\boldsymbol{\beta}_0)}\,\middle|\,\mathbf{x}_i\right] = \mu_c(\mathbf{x}_i) \quad (1.4.10)$$
Because we have available √N-consistent estimators of β_0, expressions (1.4.9) and (1.4.10) show that μ_c(·) is nonparametrically identified. In fact, we can use these expressions to motivate a nonparametric estimator. Almost certainly the initial estimator of β_0 is β̂_FEP, in which case we construct a dependent variable, n_i/(Σ_{r=1}^T m̂_ir), where m̂_ir = m_ir(β̂_FEP), and use it in a cross-sectional nonparametric regression to obtain μ̂_c(·). For σ_c²(·), the law of total variance gives the conditional form given x_i. We have
$$\mathrm{E}\left(v_{it}^2\,\middle|\,\mathbf{x}_i\right) = \mathrm{Var}(y_{it}|\mathbf{x}_i) = \mathrm{E}\left[\mathrm{Var}(y_{it}|\mathbf{x}_i,c_i)\,\middle|\,\mathbf{x}_i\right] + \mathrm{Var}\left[\mathrm{E}(y_{it}|\mathbf{x}_i,c_i)\,\middle|\,\mathbf{x}_i\right] = \alpha\mu_c(\mathbf{x}_i)\,m_{it}(\boldsymbol{\beta}_0) + \sigma_c^2(\mathbf{x}_i)\left[m_{it}(\boldsymbol{\beta}_0)\right]^2 \quad (1.4.11)$$
where we impose the working variance Assumption WV.1. Given that μ_c(x_i) is identified from the previous argument, this expression identifies α and σ_c²(·). In fact, after obtaining (semiparametric) residuals v̂_it = y_it − μ̂_c(x_i) m_it(β̂_FEP), we can use the squared residuals, v̂_it², as the dependent variable in nonparametric estimation of σ_c²(·). Therefore, a semiparametric approach to estimating the optimal IVs is available under Assumptions CM, WV.1, and WV.2.
For practical reasons, our suggestion is to avoid estimating either μ_c(·) or σ_c²(·) nonparametrically. Remember, we only need to estimate these conditional moments to obtain IVs more efficient than those used by the FEP estimator. The dimension of x_i = (x_i1, x_i2, ..., x_iT) is often large. We can reduce the dimension by using a nonparametric Mundlak (1978) device, which would have μ_c(·) and σ_c²(·) depending only on the time averages x̄_i ≡ T^{-1} Σ_{r=1}^T x_ir. Nevertheless, estimating a conditional variance along with a conditional mean when K is even moderately large is still challenging, both theoretically and practically. It would involve choosing at least two tuning parameters. From a robustness perspective, we cannot improve over the FEP estimator because it is consistent under Assumption CM. High-dimensional nonparametric estimation seems unnecessary to improve over the usual FEP estimator in the presence of serial correlation and under- or overdispersion, especially if one factors in finite-sample considerations. Instead, we draw on the literature on models for nonnegative responses to suggest working assumptions for the conditional mean and variance of the heterogeneity – as summarized, for example, in Wooldridge (2010, Section 18.7.3). For concreteness, and because it is by far the leading case, we now assume that m_it(β_0) = exp(x_it β_0). Other forms of m_it(β_0) are easily handled, but the formulas and connections with other literatures are not as straightforward. In fact, we do not even need a generalized linear model form in our current setting, though such a mean function tends to lead to easier interpretation.
Assumption WH.1: For known 1 × Q functions h(x_i), a scalar η, and a Q × 1 vector λ,
$$\mu_c(\mathbf{x}_i) \equiv \mathrm{E}(c_i|\mathbf{x}_i) = \exp\left[\eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right]. \;\blacksquare \quad (1.4.12)$$

The leading case is to use the (nonredundant) time averages of {x_it : t = 1, ..., T}, which is an extension of the Mundlak (1978) device to the nonlinear case, so that h(x_i) = x̄_i. But we can also use Chamberlain's (1980) less restrictive version, or include other functions of {x_it : t = 1, ..., T}, such as unit-specific trends or even unit-specific second moments. It seems sensible to use something simple, such as the Mundlak device, as we are only using WH.1 to generate instruments. When we combine Assumption WH.1 with the exponential conditional mean for E(y_it|x_i, c_i), we obtain, by iterated expectations,
$$\mathrm{E}(y_{it}|\mathbf{x}_i) = \exp\left[\eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right]\exp(\mathbf{x}_{it}\boldsymbol{\beta}_0) = \exp\left[\mathbf{x}_{it}\boldsymbol{\beta}_0 + \eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right] \quad (1.4.13)$$
The parameters in this conditional mean function can be consistently estimated using a variety of methods. A simple approach is to exploit equation (1.4.9) or (1.4.10) using exponential mean functions. After obtaining the FEP estimator β̂_FEP, estimate η and λ by a cross-sectional Poisson regression with mean function exp[η + h(x_i)λ] and one of the dependent variables
$$\frac{n_i}{\sum_{r=1}^{T}\exp\left(\mathbf{x}_{ir}\hat{\boldsymbol{\beta}}_{FEP}\right)} \qquad\text{or}\qquad T^{-1}\sum_{t=1}^{T}\frac{y_{it}}{\exp\left(\mathbf{x}_{it}\hat{\boldsymbol{\beta}}_{FEP}\right)} \quad (1.4.14)$$
Even if the original y_it are count variables – and there is no presumption that they are – neither of the regressands in (1.4.14) would be a count variable. Of course, this is of no consequence because of the robustness of the Poisson QMLE for estimating the parameters of the conditional mean regardless of the nature of the dependent variable (provided it is nonnegative).
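The two-step recipe above – construct the ratio regressand in (1.4.14) from a first-stage estimate of β, then run a cross-sectional Poisson regression on [1, h(x_i)] – can be sketched as follows. This is an illustrative implementation, not the paper's code: it assumes h(x_i) = x̄_i (the Mundlak device), simulated data, and a pretend first-stage estimate, and it implements the Poisson QMLE first-order conditions by Newton iteration.

```python
import numpy as np

def poisson_qmle(H, w, tol=1e-10, max_iter=100):
    """Poisson QMLE with mean exp(H @ theta); the regressand w need not be a count."""
    theta = np.zeros(H.shape[1])
    for _ in range(max_iter):
        mu = np.exp(H @ theta)
        score = H.T @ (w - mu)                  # first-order conditions
        hess = H.T @ (mu[:, None] * H)          # negative of the QMLE Hessian
        step = np.linalg.solve(hess, score)
        theta += step
        if np.max(np.abs(step)) < tol:
            break
    return theta

rng = np.random.default_rng(1)
N, T, K = 500, 4, 2
beta_fep = np.array([0.15, 0.25])               # pretend first-stage estimate of beta
x = rng.normal(size=(N, T, K))
xbar = x.mean(axis=1)                           # Mundlak averages h(x_i)
c = rng.exponential(np.exp(0.1 + xbar @ np.array([0.3, -0.2])))  # illustrative heterogeneity
m_hat = np.exp(x @ beta_fep)                    # m_it evaluated at beta_fep, N x T
y = rng.poisson(c[:, None] * m_hat)
w = y.sum(axis=1) / m_hat.sum(axis=1)           # first regressand in (1.4.14)
H = np.column_stack([np.ones(N), xbar])         # [1, h(x_i)]
theta_hat = poisson_qmle(H, w)                  # (eta_hat, lambda_hat)
print(theta_hat)
```

The same routine applied to the second regressand in (1.4.14) gives an alternative, equally valid estimate of (η, λ).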
Alternatively, β_0, η, and λ can be estimated jointly using the pooled Poisson QMLE. The pooled Poisson QMLE is completely robust to distributional misspecification and serial correlation. Of course, to preserve consistency of the resulting method of moments estimator we do not need Assumption WH.1 to hold; we are using it only to estimate the optimal instruments derived earlier.

The second working assumption on the heterogeneity distribution imposes a restriction on the variance-mean relationship.

Assumption WH.2: For δ > 0,
$$\sigma_c^2(\mathbf{x}_i) \equiv \mathrm{Var}(c_i|\mathbf{x}_i) = \delta\left[\mu_c(\mathbf{x}_i)\right]^2 = \delta\left\{\exp\left[\eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right]\right\}^2. \;\blacksquare \quad (1.4.15)$$

Assumption WH.2 is very common in settings with nonnegative, continuous heterogeneity (including so-called random effects Poisson and negative binomial models). The condition that the variance is proportional to the square of the mean holds for the natural parameterizations of the gamma and lognormal distributions, and holds whenever
$$c_i = h_i\,\mu_c(\mathbf{x}_i) \quad (1.4.16)$$
for h_i ≥ 0 and independent of x_i, without any further restrictions on the distribution of h_i. Like Assumption WH.1, Assumption WH.2 is not needed for consistent estimation using the method of moments estimator but only to estimate the optimal instruments under the working Assumptions WV.1 and WV.2.

Using Assumptions CM, WV.1, WH.1, and WH.2, we can obtain estimating equations for α and δ. First, note that
$$\mathrm{E}\left(v_{it}^2\,\middle|\,\mathbf{x}_i\right) = \alpha k_{it} + \delta k_{it}^2 \quad (1.4.17)$$
where
$$k_{it} \equiv \mathrm{E}(y_{it}|\mathbf{x}_i) = \exp\left[\mathbf{x}_{it}\boldsymbol{\beta}_0 + \eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right]$$
An immediate implication of equation (1.4.17) is
$$\mathrm{E}\left[\left(\frac{v_{it}}{\sqrt{k_{it}}}\right)^2\,\middle|\,\mathbf{x}_i\right] = \alpha + \delta k_{it} \quad (1.4.18)$$
which is the basis for estimating variance parameters in common cross-sectional models where heterogeneity is assumed independent of the covariates.
A simple way to operationalize the conditional mean is
$$\hat{v}_{it} = y_{it} - \hat{k}_{it} = y_{it} - \exp\left[\mathbf{x}_{it}\hat{\boldsymbol{\beta}}_{FEP} + \hat{\eta} + \mathbf{h}(\mathbf{x}_i)\hat{\boldsymbol{\lambda}}\right] \quad (1.4.19)$$
where η̂ and λ̂ are from one of the Poisson regressions described after equation (1.4.13). Then α̂ and δ̂ are, respectively, the intercept and slope in the pooled simple regression
$$\frac{\hat{v}_{it}^2}{\hat{k}_{it}} \;\text{ on }\; 1,\ \hat{k}_{it}, \qquad t = 1, ..., T;\ i = 1, ..., N \quad (1.4.20)$$
It is clear from equation (1.3.17) that α̂ does not appear in the optimal instruments, but we need to estimate α in order to obtain δ̂. In order to conclude the working assumptions are a reasonable approximation to reality, both α̂ and δ̂ should be nonnegative. If one of them is negative (most likely δ̂), then δ̂ should be set to zero. Because α̂ drops out of the optimal IVs, we need not estimate it when we set δ̂ = 0. Nevertheless, one may be curious about the estimated amount of overdispersion when δ is set to zero. With δ = 0, the estimate of α is simply
$$\hat{\alpha} = (NT)^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{v}_{it}^2/\hat{k}_{it} \quad (1.4.21)$$
and this is guaranteed to be nonnegative. However, as mentioned above, α̂ does not affect estimation of the optimal IVs when δ = 0. When we add Assumptions WH.1 and WH.2 to the previous assumptions, we obtain a simple form for R:
$$\mathbf{R} = \mathrm{E}\left\{\mathbf{K}_i^{-1/2}\left[\mathbf{v}_i\mathbf{v}_i' - \delta\,\mathbf{k}_i\mathbf{k}_i'\right]\mathbf{K}_i^{-1/2}/\alpha\right\}$$
which leads immediately to the method-of-moments/plug-in estimator
$$\hat{\mathbf{R}} = \frac{1}{\hat{\alpha}}\,N^{-1}\sum_{i=1}^{N}\hat{\mathbf{K}}_i^{-1/2}\left[\hat{\mathbf{v}}_i\hat{\mathbf{v}}_i' - \hat{\delta}\,\hat{\mathbf{k}}_i\hat{\mathbf{k}}_i'\right]\hat{\mathbf{K}}_i^{-1/2} \quad (1.4.22)$$
By a standard application of the uniform weak law of large numbers [Wooldridge (2010, Lemma 12.1)], R̂ →p R.
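To make the pooled regression in (1.4.20) concrete, the sketch below recovers α and δ from fitted means and squared residuals. It is purely illustrative: for transparency the squared residuals are constructed to satisfy (1.4.17) exactly (no sampling noise), so the regression returns the assumed values α = 1.3 and δ = 0.5; the names and values are stand-ins, not estimates from any data.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 400, 5
alpha_true, delta_true = 1.3, 0.5             # assumed working-variance parameters
k_hat = rng.uniform(0.5, 3.0, size=(N, T))    # stand-in for the fitted means k_hat_it
# squared residuals constructed to satisfy (1.4.17) exactly:
vsq = alpha_true * k_hat + delta_true * k_hat**2
# pooled regression (1.4.20): vsq / k_hat on an intercept and k_hat
w = (vsq / k_hat).ravel()
X = np.column_stack([np.ones(N * T), k_hat.ravel()])
alpha_hat, delta_hat = np.linalg.lstsq(X, w, rcond=None)[0]
print(alpha_hat, delta_hat)   # recovers 1.3 and 0.5
```

With real residuals the fitted intercept and slope would differ from (α, δ) by sampling error, and either could come out negative, in which case the text's advice is to set δ̂ = 0.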
For each t ≠ s, the correlations are estimated as
$$\hat{\rho}_{st} = \frac{1}{\hat{\alpha}}\,N^{-1}\sum_{i=1}^{N}\frac{\hat{v}_{is}\hat{v}_{it} - \hat{\delta}\hat{k}_{is}\hat{k}_{it}}{\sqrt{\hat{k}_{is}\hat{k}_{it}}} \quad (1.4.23)$$
From the definitions of α̂ and δ̂ obtained from (1.4.18), it is easily seen that ρ̂_tt = 1 for t = 1, ..., T, and so this estimator imposes the logical requirement that a correlation matrix must have unity down its diagonal. If we set δ = 0, R̂ reduces to
$$\hat{\mathbf{R}} = \frac{1}{\hat{\alpha}}\,N^{-1}\sum_{i=1}^{N}\hat{\mathbf{K}}_i^{-1/2}\hat{\mathbf{v}}_i\hat{\mathbf{v}}_i'\hat{\mathbf{K}}_i^{-1/2} \quad (1.4.24)$$
With this choice of R̂, we can make a direct connection with the GEE literature by ignoring the presence of c_i and working off the first two conditional moments of y_i given x_i – see, for example, Liang and Zeger (1986) and Wooldridge (2010, Sections 13.11.4 and 18.7.3). Namely, under the full set of working assumptions with δ = 0,
$$\mathrm{E}(y_{it}|\mathbf{x}_i) = \exp\left[\mathbf{x}_{it}\boldsymbol{\beta}_0 + \eta + \mathbf{h}(\mathbf{x}_i)\boldsymbol{\lambda}\right] = k_{it}, \quad t = 1, ..., T \quad (1.4.25)$$
$$\mathrm{Var}(y_{it}|\mathbf{x}_i) = \alpha\,\mathrm{E}(y_{it}|\mathbf{x}_i), \quad t = 1, ..., T \quad (1.4.26)$$
$$\mathrm{Var}(\mathbf{y}_i|\mathbf{x}_i) = \alpha\,\mathbf{K}_i^{1/2}\mathbf{R}\mathbf{K}_i^{1/2} \quad (1.4.27)$$
This collection of moment assumptions is precisely what is used in GEE applications of Poisson regression (whether or not y_it is a count variable), with the addition of the vector of functions h(x_i). We emphasize that these are all working assumptions in the current context. Not even the conditional mean function in (1.4.25) is assumed to hold for consistency, because (1.4.25) is obtained from Assumptions CM and WH.1, whereas we only require Assumption CM for consistency. We impose Assumptions WH.1 and WH.2 in order to estimate R and then to estimate Ω_i. Provided it leads to a positive definite estimate, we prefer (1.4.22) because it is the correct expression under all of the working assumptions.
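A direct implementation of the plug-in estimator (1.4.22), with the δ = 0 special case (1.4.24) available by passing a zero, might look like the following. The inputs v_hat and k_hat are stand-ins for the residuals and fitted means from the earlier steps; the toy data are constructed with no serial correlation and α = 1, δ = 0, so the estimate should be close to the identity matrix.

```python
import numpy as np

def estimate_R(v_hat, k_hat, alpha_hat, delta_hat):
    """Plug-in estimator of the working correlation matrix R, eq. (1.4.22);
    passing delta_hat = 0.0 gives the simpler form (1.4.24)."""
    N, T = v_hat.shape
    R = np.zeros((T, T))
    for i in range(N):
        Kinv_sqrt = np.diag(1.0 / np.sqrt(k_hat[i]))
        outer = np.outer(v_hat[i], v_hat[i]) - delta_hat * np.outer(k_hat[i], k_hat[i])
        R += Kinv_sqrt @ outer @ Kinv_sqrt
    return R / (N * alpha_hat)

rng = np.random.default_rng(3)
N, T = 1000, 4
k_hat = rng.uniform(0.5, 2.0, size=(N, T))
v_hat = rng.normal(scale=np.sqrt(k_hat))      # toy residuals with Var = k (alpha = 1, delta = 0)
R_hat = estimate_R(v_hat, k_hat, alpha_hat=1.0, delta_hat=0.0)
print(np.round(R_hat, 2))                     # approximately the identity matrix
```

R̂ is symmetric by construction; whether it is positive definite in a given sample is exactly the check that decides between (1.4.22) and the δ = 0 fallback described in the simulation section.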
Under Assumption CM and the full set of working assumptions, we can estimate the optimal IVs, for each i, as
$$\nabla_{\boldsymbol{\beta}}\hat{\mathbf{m}}_i'\,\hat{\mathbf{M}}_i^{-1/2}\left[\hat{\mathbf{R}}^{-1} - \frac{1}{\sqrt{\hat{\mathbf{m}}_i}'\hat{\mathbf{R}}^{-1}\sqrt{\hat{\mathbf{m}}_i}}\,\hat{\mathbf{R}}^{-1}\sqrt{\hat{\mathbf{m}}_i}\sqrt{\hat{\mathbf{m}}_i}'\hat{\mathbf{R}}^{-1}\right]\hat{\mathbf{M}}_i^{-1/2} \quad (1.4.28)$$
where "ˆ" means the quantity is evaluated at a first-round estimator, most likely β̂_FEP, and R̂ is from (1.4.22) or, if necessary, (1.4.24). [In either case, α̂ drops out of (1.4.28).] However, without the full set of working assumptions, this choice of IVs is not guaranteed to improve over the FEP estimator because of its dependence on R̂. A somewhat subtle point is that (1.4.28) is not even optimal under Assumptions CM, WV.1, and WV.2 alone, because consistency of R̂ for R generally requires correct specification of the heterogeneity mean and variance – that is, Assumptions WH.1 and WH.2. As mentioned previously, if we did not have to estimate R, we could use (1.4.28) with R̂ replaced by R, and then we would have just identification, as with the FEP estimator. Naturally, we want to use the data to provide an estimator of R better than just guessing. Incidentally, expression (1.4.28) shows that the estimator α̂ has no direct effect on the optimal IVs because it factors out as a constant.

In order to ensure improvements over FEP, our recommendation is to stack the FEP and the new "optimal" IVs to form an expanded IV matrix and use GMM. The resulting estimator, which we simply call the "GMM estimator," is guaranteed to be asymptotically at least as efficient as the FEP and GFEP estimators; usually it is strictly more efficient than both.
In other words, the T × 2K matrix of IVs is Ẑ_i, written in transposed form as
$$\hat{\mathbf{Z}}_i' = \begin{pmatrix} \nabla_{\boldsymbol{\beta}}\hat{\mathbf{m}}_i'\,\hat{\mathbf{M}}_i^{-1/2}\left(\mathbf{I}_T - \sqrt{\hat{\mathbf{p}}_i}\sqrt{\hat{\mathbf{p}}_i}'\right)\hat{\mathbf{M}}_i^{-1/2} \\[6pt] \nabla_{\boldsymbol{\beta}}\hat{\mathbf{m}}_i'\,\hat{\mathbf{M}}_i^{-1/2}\left(\hat{\mathbf{R}}^{-1} - \dfrac{1}{\sqrt{\hat{\mathbf{m}}_i}'\hat{\mathbf{R}}^{-1}\sqrt{\hat{\mathbf{m}}_i}}\,\hat{\mathbf{R}}^{-1}\sqrt{\hat{\mathbf{m}}_i}\sqrt{\hat{\mathbf{m}}_i}'\hat{\mathbf{R}}^{-1}\right)\hat{\mathbf{M}}_i^{-1/2} \end{pmatrix} \quad (1.4.29)$$
Given this choice of Ẑ_i, the mechanics of GMM are straightforward. After obtaining β̂_FEP, obtain the T × 1 residual vectors
$$\tilde{\mathbf{u}}_i = \mathbf{y}_i - \mathbf{p}\left(\mathbf{x}_i,\hat{\boldsymbol{\beta}}_{FEP}\right)n_i \quad (1.4.30)$$
Then, given the estimators of η, λ, α, δ, and R described above, obtain the 2K × 2K matrix
$$\hat{\boldsymbol{\Psi}} = N^{-1}\sum_{i=1}^{N}\hat{\mathbf{Z}}_i'\tilde{\mathbf{u}}_i\tilde{\mathbf{u}}_i'\hat{\mathbf{Z}}_i \quad (1.4.31)$$
Assuming Ψ̂ is positive definite (which generally holds with probability approaching one), the optimal GMM estimator, β̂_GMM, solves
$$\min_{\boldsymbol{\beta}\in\mathbb{R}^K}\left[\sum_{i=1}^{N}\mathbf{u}_i(\boldsymbol{\beta})'\hat{\mathbf{Z}}_i\right]\hat{\boldsymbol{\Psi}}^{-1}\left[\sum_{i=1}^{N}\hat{\mathbf{Z}}_i'\mathbf{u}_i(\boldsymbol{\beta})\right] \quad (1.4.32)$$
Because we have chosen very smooth mean, variance, and correlation functions, consistency and √N-asymptotic normality are standard; see, for example, Wooldridge (2010, Chapter 14). Remember, Ψ̂^{-1} is an (estimated) optimal weighting matrix given the choice of instruments; the standard GMM inference does not require that Ẑ_i is optimal. Regardless of the size of T, the GMM estimator generates K overidentification restrictions that can be used to test Assumption CM.

1.5 A Small Simulation Study

We now present the results of a small Monte Carlo simulation to demonstrate the efficacy of the improved GMM estimator. The conditional mean model, which has an exponential form, includes three time-varying explanatory variables and multiplicative heterogeneity. We consider two conditional distributions for the outcome variable, y_it.
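The GMM mechanics in (1.4.30)–(1.4.32) reduce to a standard quadratic-form minimization. The sketch below codes the residual function and objective for a generic instrument array and minimizes it numerically; it is an illustration with toy data and a toy instrument block (Z_i set to x_i, a stand-in for the stacked Ẑ_i in (1.4.29)), not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, T, K = 300, 4, 2
beta0 = np.array([0.15, 0.25])
x = rng.normal(size=(N, T, K))
c = rng.exponential(1.0, N)
y = rng.poisson(c[:, None] * np.exp(x @ beta0))

def residuals(beta):
    m = np.exp(x @ beta)                          # m_it(beta), N x T
    p = m / m.sum(axis=1, keepdims=True)          # shares p_it(beta)
    return y - p * y.sum(axis=1, keepdims=True)   # u_i(beta) as in (1.4.30)

Z = x.copy()                                      # toy T x K instrument block per i

def gmm_objective(beta, Psi_inv):
    g = np.einsum('itk,it->k', Z, residuals(beta))   # sum_i Z_i' u_i
    return g @ Psi_inv @ g / N

# weighting matrix (1.4.31) built from residuals at an initial guess
u0 = residuals(np.zeros(K))
Zu = np.einsum('itk,it->ik', Z, u0)
Psi = Zu.T @ Zu / N
Psi_inv = np.linalg.inv(Psi)
res = minimize(gmm_objective, np.zeros(K), args=(Psi_inv,), method='Nelder-Mead')
print(res.x)   # should be near beta0
```

With this toy Z the system is just identified, so the weighting matrix is innocuous; stacking the FEP instruments with the "optimal" block as in (1.4.29) doubles the moment count and makes Ψ̂^{-1} operative, along with the K overidentification restrictions noted in the text.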
In the first case, y_it is a count variable generated as
$$y_{it}\,|\,\mathbf{x}_i,c_i,\mathbf{e}_i \sim \mathrm{Poisson}\left[c_i\exp(\mathbf{x}_{it}\boldsymbol{\beta} + e_{it})\right] \quad (1.5.1)$$
where e_i = (e_i1, e_i2, ..., e_iT)' is distributed as multivariate normal with unit variances. In order to generate serial dependence in {y_it : t = 1, ..., T} conditional on (x_i, c_i), {e_it : t = 1, 2, ..., T} follows an AR(1) process with first-order correlation φ ∈ {0, 0.25, 0.75}. This autoregressive process generates no conditional dependence when φ = 0 and fairly strong time series dependence when φ = 0.75. Because of the inclusion of e_it, the conditional distribution D(y_it|x_i, c_i) is not Poisson; in fact, it exhibits overdispersion because exp(e_it) is integrated out in obtaining D(y_it|x_i, c_i). However, consistency of the estimators requires only that E(y_it|x_i, c_i) has the exponential form with multiplicative c_i.

The strictly exogenous explanatory variables, x_it, are generated as a trivariate, stationary vector autoregression, where the stochastic term is an independent multivariate standard normal distribution with autocorrelation parameter 0.125. The processes x_i = (x_i1, ..., x_iT) and e_i are independent. The vector β is set to β' = (0.15, 0.25, 0.35) (where we drop the "0" subscript to make the tables easier to read). To generate correlation between c_i and x_i, we use an exponential version of the Mundlak (1978) device and an exponential distribution:
$$c_i\,|\,\mathbf{x}_i \sim \mathrm{Exponential}\left[\exp(\eta + \bar{\mathbf{x}}_i\boldsymbol{\lambda})\right] \quad (1.5.2)$$
Under this specification, the working assumptions WH.1 and WH.2 are both satisfied with h(x_i) = x̄_i and, in the case of WH.2, δ = 1.
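The count-data design above can be sketched as follows. This is an illustrative data-generating script under stated assumptions: the η and λ values are made up (the text does not report them), and the covariate process is a simplified stand-in (three independent AR-type series with coefficient 0.125) rather than a full VAR.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, phi = 300, 4, 0.75
beta = np.array([0.15, 0.25, 0.35])
eta, lam = 0.0, np.array([0.1, 0.1, 0.1])   # illustrative values for (1.5.2)

# covariates: three mildly autocorrelated series (stand-in for the VAR in the text)
x = np.zeros((N, T, 3))
x[:, 0] = rng.normal(size=(N, 3))
for t in range(1, T):
    x[:, t] = 0.125 * x[:, t - 1] + rng.normal(size=(N, 3))

# AR(1) idiosyncratic errors with unit variance and first-order correlation phi
e = np.zeros((N, T))
e[:, 0] = rng.normal(size=N)
for t in range(1, T):
    e[:, t] = phi * e[:, t - 1] + np.sqrt(1 - phi**2) * rng.normal(size=N)

# heterogeneity correlated with x through the Mundlak device, eq. (1.5.2)
xbar = x.mean(axis=1)
c = rng.exponential(np.exp(eta + xbar @ lam))

# count outcome, eq. (1.5.1)
y = rng.poisson(c[:, None] * np.exp(x @ beta + e))
print(y.shape, y.dtype)
```

Setting φ = 0 removes the conditional serial dependence, which is the case where Corollary 1.3.1 says FEP is already optimal.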
We estimate the parameters in the heterogeneity moments using a two-step pooled Poisson QMLE with the FEP estimator as the first-stage estimator of β. The estimates α̂ and δ̂ are obtained via the pooled OLS regression in equation (1.4.20), and R̂ is estimated as in (1.4.22). When R̂ is not positive definite for a particular draw, we set δ̂ = 0 and estimate R̂ as in (1.4.24) (in which case the value of α̂ plays no role in the estimation of β). This situation occurs in between 60% and 80% of the simulations. We use N = 300, T ∈ {4, 8}, and 1,000 replications in the simulations. The findings are reported in Table 1.1.

Some general patterns emerge from Table 1.1. First, the FEP estimator shows very little bias, and its bias is almost always smaller than that of the GFEP and GMM estimators. The GFEP estimator generally shows the most bias – as high as nine percent in some cases. Still, we only have N = 300, which is not especially large. Interestingly, the bias in the GMM estimator – which combines both sets of moment conditions – is well below that of the GFEP estimator.
Table 1.1: Conditional Poisson distribution
(The three rows in each panel correspond to the three elements of β.)

                          Bias                      SD                       RMSE
               FEP     GFEP     GMM      FEP    GFEP    GMM      FEP    GFEP    GMM
φ = 0
 T = 4       0.002   -0.004   0.000    0.082   0.075  0.072    0.082   0.075  0.072
             0.001   -0.011  -0.003    0.083   0.078  0.072    0.083   0.079  0.072
            -0.001   -0.016  -0.005    0.083   0.079  0.075    0.083   0.081  0.075
 T = 8       0.011   -0.010  -0.005    0.052   0.044  0.041    0.052   0.045  0.041
             0.000   -0.020  -0.011    0.053   0.044  0.042    0.053   0.049  0.044
             0.001   -0.027  -0.014    0.051   0.045  0.042    0.052   0.052  0.045
φ = 0.25
 T = 4      -0.007   -0.016   0.008    0.081   0.074  0.072    0.081   0.076  0.073
            -0.003   -0.014   0.004    0.082   0.075  0.070    0.079   0.077  0.070
             0.002   -0.015   0.003    0.079   0.075  0.070    0.079   0.077  0.070
 T = 8      -0.001   -0.014  -0.007    0.051   0.045  0.042    0.051   0.047  0.043
             0.000   -0.021  -0.010    0.048   0.044  0.040    0.048   0.049  0.042
            -0.001   -0.029  -0.015    0.051   0.046  0.043    0.051   0.054  0.046
φ = 0.75
 T = 4      -0.001   -0.007  -0.003    0.057   0.054  0.051    0.057   0.055  0.051
             0.005   -0.008   0.001    0.060   0.058  0.052    0.061   0.059  0.052
             0.001   -0.014  -0.002    0.060   0.059  0.053    0.060   0.060  0.053
 T = 8       0.001   -0.012  -0.004    0.043   0.035  0.034    0.043   0.037  0.034
            -0.001   -0.023  -0.011    0.044   0.036  0.034    0.044   0.043  0.036
            -0.002   -0.032  -0.015    0.047   0.038  0.036    0.047   0.050  0.039

The bias in both the GFEP and GMM estimators appears to increase with T. Overall, the bias in the GMM estimator seems acceptable, especially given the small N. The GMM estimator always has the smallest sampling standard deviation, sometimes being about 80% of the FEP standard error. The SD of the GFEP estimator falls in between those of the FEP and GMM estimators. In a few cases the FEP estimator has a smaller root mean squared error (RMSE) than the GFEP estimator. The asymptotic theory of GMM estimation implies that the GMM estimator is asymptotically more efficient than FEP or GFEP because, in the setting of the simulation, the entire set of working assumptions does not hold, and so GFEP does not use the optimal IVs.
The ranking of the estimators in terms of root mean squared error favors the GMM estimator in every case.

To see how the estimators perform when y_it is a continuous outcome, we generated y_it as
$$y_{it}\,|\,\mathbf{x}_i,c_i,\mathbf{e}_i \sim \mathrm{Gamma}\left[\exp(\mathbf{x}_{it}\boldsymbol{\beta} + e_{it}),\,c_i\right] \quad (1.5.3)$$
where the gamma distribution is parameterized so that E(y_it|x_it, c_i, e_i) = c_i exp(x_it β + e_it), as before. The conditional variance is Var(y_it|x_it, c_i, e_i) = c_i² exp(x_it β + e_it). We use the same process in (1.5.2) to generate c_i. The simulation findings are reported in Table 1.2.

Table 1.2: Conditional Gamma distribution
(The three rows in each panel correspond to the three elements of β.)

                          Bias                      SD                       RMSE
               FEP     GFEP     GMM      FEP    GFEP    GMM      FEP    GFEP    GMM
φ = 0
 T = 4       0.000   -0.006  -0.002    0.090   0.087  0.081    0.090   0.087  0.081
             0.003   -0.008   0.003    0.089   0.085  0.080    0.089   0.085  0.080
             0.001   -0.014   0.000    0.090   0.088  0.083    0.090   0.089  0.083
 T = 8       0.000   -0.012  -0.006    0.056   0.049  0.048    0.056   0.051  0.048
            -0.001   -0.019  -0.009    0.052   0.050  0.047    0.052   0.054  0.048
            -0.001   -0.027  -0.014    0.054   0.051  0.048    0.054   0.058  0.050
φ = 0.25
 T = 4       0.002   -0.007   0.002    0.086   0.082  0.078    0.086   0.082  0.078
            -0.003   -0.016  -0.004    0.085   0.082  0.077    0.085   0.084  0.078
             0.002   -0.014  -0.001    0.086   0.084  0.081    0.086   0.085  0.081
 T = 8       0.000   -0.013  -0.006    0.057   0.050  0.048    0.057   0.052  0.048
             0.000   -0.019  -0.009    0.055   0.050  0.048    0.055   0.053  0.049
            -0.001   -0.033  -0.017    0.058   0.053  0.051    0.058   0.062  0.053
φ = 0.75
 T = 4       0.001   -0.006   0.000    0.069   0.067  0.063    0.069   0.067  0.063
             0.000   -0.012  -0.001    0.074   0.072  0.067    0.074   0.073  0.067
             0.000   -0.016  -0.001    0.070   0.072  0.064    0.070   0.074  0.064
 T = 8       0.001   -0.014  -0.005    0.049   0.041  0.040    0.049   0.044  0.040
             0.000   -0.023  -0.008    0.048   0.042  0.039    0.048   0.048  0.040
            -0.001   -0.034  -0.013    0.050   0.046  0.043    0.050   0.057  0.045

The general pattern found in Table 1.1 continues to hold in Table 1.2.
The FEP estimator generally has the lowest bias, although the GMM estimator also does well with bias. The GFEP estimator, which uses only the “optimal” IVs, shows more bias – again, sometimes more than nine percent. In terms of precision and RMSE, the GMM estimator outperforms FEP and GFEP in all scenarios, although the gains are modest in some cases. We tried several additional scenarios, including cases where Assumption WH.2 is violated – by drawing 𝑐𝑖 from a Poisson distribution – and cases where, conditional on (x𝑖, 𝑐𝑖), 𝑦𝑖𝑡 is an underdispersed gamma random variable. In the former case, we found only minor differences among the estimators, although sometimes the FEP estimator outperformed the other two in terms of RMSE. In the latter case, where we did not allow serial correlation, the estimators perform very similarly. As a final set of simulations, we misspecified the conditional mean E(𝑐𝑖 | x𝑖) in (1.5.2) by letting the mean depend on the average of the first and last time periods rather than x̄𝑖. In other words, Assumption WH.1 is violated. The GMM estimator uniformly performed the best based on RMSE and exhibited biases on the order of those reported in Tables 1.1 and 1.2. These simulations are available upon request from the authors.

1.6 Summary and Conclusion

We have characterized the optimal instruments in a multiplicative panel model under a general set of working assumptions. The variance-mean relationship, conditional on unobserved heterogeneity as well as covariates, is allowed to be any positive number. The conditional correlation matrix is assumed to be constant but is otherwise unrestricted. Under these assumptions, the optimal IVs depend only on the unknown correlation matrix, R (and the value of the conditional mean parameters, 𝜷0).
In the special case that R = I𝑇, we show that the FEP estimator achieves the asymptotic efficiency bound for any amount of overdispersion or underdispersion, thereby relaxing the assumptions under which the FEP estimator is known to be asymptotically efficient. When R is not the identity matrix, it is possible to improve on the FEP estimator. To operationalize the optimal IVs in order to exploit serial correlation, we add working first and second moment assumptions on the conditional heterogeneity distribution. These assumptions are common in literatures that allow nonnegative heterogeneity in cross-sectional and panel data models. We show that estimating the optimal IVs is straightforward, and suggest a GMM approach that is guaranteed to improve asymptotic efficiency whether or not serial correlation is present. Our simulations show that the GMM estimator that combines the FEP moment conditions and the new “optimal” moment conditions has very good bias properties and provides nontrivial efficiency gains – even when the cross-sectional sample size is only 𝑁 = 300. Our results and new estimator are appealing for cases where 𝑁 is substantially larger than 𝑇, as we have used the standard microeconometric setting where 𝑇 is fixed in the asymptotic analysis. Naturally, this is not the only possibility. For example, Fernández-Val and Weidner (2018) and Chen, Fernández-Val, and Weidner (2020) have proposed quasi-MLEs that allow more heterogeneity. However, consistency requires 𝑇 → ∞ along with 𝑁 → ∞, and necessarily restricts the amount of time series heterogeneity and dependence.

CHAPTER 2
INFORMATION EQUIVALENCE AMONG TRANSFORMATIONS OF SEMIPARAMETRIC NONLINEAR PANEL DATA MODELS

2.1 Introduction

In the standard linear panel data model with additive unobserved heterogeneity, it is well known that numerous transformations can be used to eliminate the heterogeneity prior to estimation.
The most common methods are the within and first-differencing transformations.1

1 For a comprehensive review of linear panel models with additive heterogeneity, see Chapters 10 and 11 of Wooldridge (2010).

Similarly, when the heterogeneity appears as a multiplicative term in the conditional mean, as in certain Generalized Linear Model settings, modified within and differencing transformations can control for the heterogeneity and provide moment conditions for estimation. There exist other transformations which control for heterogeneity but are clearly absurd. For example, multiplying all the data by zero eliminates the heterogeneity along with all information for estimation. For a less trivial example, suppose the population model is linear with a single additive effect and the first-differenced errors are homoskedastic and uncorrelated. Then second-differencing is still consistent but less efficient than first-differencing. These examples raise the question of how to evaluate methods for eliminating heterogeneity while preserving information for estimation. This paper considers conditional mean models with unobserved heterogeneity. The general framework derived within encompasses a large class of both linear and strictly nonlinear models, examples of which are given in Section 2.2.1. The models are referred to as “semiparametric” in the sense that nothing is assumed about the relationship between the heterogeneity and observables other than regularity conditions needed for asymptotic analysis. In place of assumptions on the conditional distribution of the heterogeneity, these models often require a transformation to eliminate or control for the term. I provide a unified framework for comparing such transformations in terms of the information they preserve. Those that yield the same moment conditions, given certain regularity assumptions, will provide the same √𝑁-asymptotic efficiency bound if they have equal rank.
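The within and first-differencing transformations mentioned above both annihilate an additive effect, which is what makes them valid starting points for estimation; a minimal numeric sketch (the data below are made up for illustration):

```python
import numpy as np

T = 4
Q = np.eye(T) - np.ones((T, T)) / T   # within (demeaning) transformation
D = np.diff(np.eye(T), axis=0)        # (T-1) x T first-differencing matrix

# Any vector of the form c*1 lies in the null space of both transformations.
c, beta = 2.5, 0.7
x = np.arange(1.0, T + 1)             # a made-up covariate path
y = c * np.ones(T) + beta * x         # y_t = c + x_t*beta, with u_t = 0

# After transforming, the additive effect c is gone and only beta remains:
assert np.allclose(Q @ y, beta * (Q @ x))
assert np.allclose(D @ y, beta * (D @ x))
```

Multiplying by zero would also satisfy the first property, which is exactly why a rank condition on the transformation is needed to preserve information.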
As mentioned above, the within and first-differencing transformations are the most common in the linear panel case for eliminating additive heterogeneity. When the covariates are strictly exogenous with respect to the idiosyncratic errors, these transformations provide conditional moment restrictions which can be exploited for estimation of the population parameters. For a given conditional variance matrix, Arellano and Bover (1995) suggest that Generalized Least Squares (GLS) on the demeaned equations is equivalent to the efficient 3SLS estimator. This claim was later proven in Im et al. (1999), along with a proof that the GLS estimators based on the demeaned and the first-differenced equations are equivalent. Their result shows that two commonly used methods of estimation preserve the same information in the linear case. However, they limit their investigation to a small number of estimators and only allow for a single time-invariant individual effect. My approach recovers the same result as Im et al. (1999) and extends it to general factor-augmented panels with an arbitrary number of individual effects with time-varying coefficients. For nonlinear models with a multiplicative heterogeneity term, one approach to estimation is the fixed effects Poisson (FEP) estimator. Hausman et al. (1984) derive the FEP as the maximum likelihood estimator of a multinomial distribution.2 Wooldridge (1999) shows that the FEP is in fact consistent under a much weaker strict exogeneity assumption. One proof of this result shows that the score of the likelihood function has a mean of zero at the true parameter value due to the likelihood’s implicit transformation of the data. This transformation subtracts the weighted time averages from each outcome and so I refer to it as the generalized within transformation.
Another approach is the generalized next-differencing transformation first studied by Chamberlain (1992) and Wooldridge (1997), which subtracts from each time period the next period’s outcome, weighted by the quotient of the mean functions. While generalized next-differencing was originally proposed for a sequential exogeneity setting, I study it here in the context of strict exogeneity. I also consider the residual maker matrix from regressing on the outcome variable’s mean function. To the best of my knowledge, this paper is the first to show information equivalence of these transformations.

2 Similar to the linear fixed effects estimator, the FEP estimator is a true fixed effects procedure as it can be derived by estimating via pooled Poisson regression and treating the multiplicative terms as parameters to estimate.

In Section 2.2, I define information equivalence in a first order asymptotic sense. The efficiency bounds studied will apply to “small-𝑇” settings where asymptotics are derived with 𝑇 fixed as 𝑁 → ∞. I then derive sufficient conditions under which transformations of the data which yield moment restrictions for estimation preserve the same information. This result is general and can apply to a number of finite and asymptotic settings. In Section 2.3, I apply the main result from Section 2.2 to a nonlinear multiplicative model, a linear model with an unknown factor structure, and a linear random trend model. Section 2.4 discusses further practical suggestions like implementation and extensions. Section 2.5 provides concluding remarks along with potential directions for future research.

2.2 Information equivalence

As mentioned in the Introduction, the results of this section apply to population moments. In what follows, (𝒚𝑖, 𝒙𝑖, 𝒄𝑖) is assumed to be a random draw from an infinite population. The matrix (𝒚𝑖, 𝒙𝑖) is 𝑇 × (1 + 𝐾) and observable whereas the random 𝑝 × 1 vector 𝒄𝑖 is not.
All statements involving expressions of random variables hold almost surely. For example, conditional means and rank conditions for random matrices hold with probability one. Finally, I assume regularity conditions suitable for asymptotic analysis such as bounds on the higher-order moments of the data.

2.2.1 Model

The following conditional mean assumption specifies the empirical setting:

Assumption CM: For 𝑡 = 1, ..., 𝑇,

𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝒄𝑖) = 𝑚𝑡(𝒙𝑖𝑡, 𝜷0, 𝒄𝑖)    (2.2.1)

where 𝑚𝑡(𝒙, ·, 𝒄) : R𝐾 → R is a known Borel twice-differentiable function for every 𝒙 ∈ X𝑡 and 𝒄 ∈ C, where X𝑡 and C are the respective supports of 𝒙𝑖𝑡 and 𝒄𝑖. ■

Equation (2.2.1) specifies a nonlinear semiparametric conditional mean function with strictly exogenous covariates where 𝜷0 is a 𝐾 × 1 vector of parameters.3 The mean function itself is allowed to vary over time periods. The heterogeneity is also allowed to enter the mean function in any arbitrary way. In the linear panel case, the simplest and most common specification is an individual-specific intercept. In nonlinear cases, the heterogeneity is often included as a multiplicative term. I do not place any identifying assumptions directly on 𝑚𝑡. These implicit identification conditions will come later in the form of rank assumptions. Essentially, the results contained in this paper apply to nontrivial empirical situations. For example, consider a model 𝑦𝑖1 = 𝑐𝑖 + 𝛽𝑦𝑖2 where 𝑐𝑖 is an individual-specific intercept and 𝑦𝑖2 is an indicator variable associated with a treatment or policy intervention. If 𝑐𝑖 has a mass point at zero, it must be the case that there is variation, so that 𝑦𝑖1 ≠ 0 for all 𝑖.
The following examples illustrate some common empirical settings for which Assumption CM applies:

Example 1 (Linear model with additive effects): Consider the following specification:

𝑦𝑖𝑡 = 𝑐𝑖 + 𝒙𝑖𝑡 𝜷0 + 𝑢𝑖𝑡

This model is common among applied microeconometric researchers. Im et al. (1999) show that the 3SLS estimator of 𝜷0 using the differenced covariates as instruments is algebraically equivalent to GLS estimators based off of both the within and differenced transformed residuals.4 This example will be discussed in Sections 2.2.2 and 2.3.1. We can include multiple individual effects loaded onto macro shocks in the form

𝑦𝑖𝑡 = 𝒄′𝑖 𝒇𝑡 + 𝒙𝑖𝑡 𝜷0 + 𝑢𝑖𝑡

where 𝒄′𝑖 𝒇𝑡 = ∑_{𝑟=1}^{𝑝} 𝑐𝑖𝑟 𝑓𝑟𝑡 and 𝒇𝑡 is observable. An example of the general setting is the random trend linear model

𝑦𝑖𝑡 = 𝑐𝑖 + 𝑎𝑖 𝑡 + 𝒙𝑖𝑡 𝜷0 + 𝑢𝑖𝑡

The standard approach to estimation is to first-difference the outcomes to yield another linear model with only an additive individual effect. If strict exogeneity is assumed with respect to 𝒙𝑖, we have the same empirical setting as above, and so the same analysis will apply. I discuss the general model in Section 2.3.2. ■

Example 2 (Exponential mean): Consider the following mean function:

𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝑐𝑖) = exp(𝑐𝑖 + 𝒙𝑖𝑡 𝜷0)

The exponential mean function is most popularly employed to study count data.

3 In this context, nonlinear does not mean ‘strictly nonlinear’, but can also include linear models.
4 The setting studied by Im et al. is motivated by considering covariates which satisfy 𝐸(𝒙𝑖 ⊗ 𝒖𝑖) = 0. The equivalence result provided in their paper, however, is purely algebraic in nature and holds regardless of the covariance between the covariates and idiosyncratic errors.
The most common estimator of the parameters in this model is the FEP estimator. Wooldridge (1999) shows that Assumption CM is sufficient for identification using the following transformation:

𝑦𝑖𝑡 − (∑_{𝑠=1}^{𝑇} 𝑦𝑖𝑠) exp(𝒙𝑖𝑡 𝜷0) / (∑_{𝑠=1}^{𝑇} exp(𝒙𝑖𝑠 𝜷0))

This transformation will be referred to as the generalized within transformation and provides the basis of the FEP estimator since it shows up in the score function of the Poisson QMLE and has an expectation of zero conditional on 𝒙𝑖. Another possible transformation is

𝑦𝑖𝑡 − 𝑦𝑖,𝑡+1 exp(𝒙𝑖𝑡 𝜷0) / exp(𝒙𝑖,𝑡+1 𝜷0)

which I refer to as the generalized next-differencing transformation. Both of these transformations are studied in generality in Section 2.3. In an analogy to the linear setting, we can discuss an exponential random trend model with multiplicative specification

𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝒄𝑖) = 𝑐𝑖 𝑎𝑖^𝑡 exp(𝒙𝑖𝑡 𝜷0)

which can be motivated by the form 𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝒄𝑖) = exp(𝛾𝑖 + 𝛼𝑖 𝑡 + 𝒙𝑖𝑡 𝜷0). This model has received no attention in the econometric literature to the best of my knowledge. I discuss how the results of this paper could apply to such a model in Section 2.3.1. ■

Example 3 (Production functions): Suppose the dependent variable is firm output which follows the given production technology:

𝑄𝑖𝑡 = exp(𝜖𝑖𝑡 − 𝑐𝑖) 𝐿𝑖𝑡^{𝛽1} 𝐾𝑖𝑡^{𝛽2}

where (𝐿, 𝐾) are labor and capital stock respectively. The heterogeneity can be written exp(−𝑐𝑖). If 𝐸(𝜖𝑖𝑡 | 𝑐𝑖, 𝐿𝑖𝑡, 𝐾𝑖𝑡) is assumed constant,5 then the transformations studied in Section 2.3 can be used for estimation of the parameters and average partial effects under weak assumptions on the heterogeneity term.
This example serves as an interesting bridge between the linear and nonlinear specifications as production theory can be stated in the above nonlinear fashion, but production function estimation is often carried out after log-linearization, for which the results of Im et al. (1999) would apply. The specific form of the error is reminiscent of a stochastic frontier model with a time-invariant inefficiency term. See Section V of Amsler, Lee, and Schmidt (2009). ■

5 The value of 𝐸(𝜖𝑖𝑡 | 𝐿𝑖𝑡, 𝐾𝑖𝑡) is allowed to differ over time as long as it is not a function of observables. The researcher can then just specify time dummies in the mean function to capture the temporal change.

For the general treatment of the paper, I consider transformations of the mean function which provide moment conditions for estimating 𝜷0. Assumption MAT characterizes such matrix transformations:

Assumption MAT: Let 𝐿 ≤ 𝑇, and let 𝑨(𝒙, 𝜷) be an 𝐿 × 𝑇 matrix which satisfies

𝑨(𝒙𝑖, 𝜷0) 𝐸(𝒚𝑖 | 𝒙𝑖, 𝒄𝑖) = 0    (2.2.2)

and is differentiable in 𝜷 over int(𝚯) for every 𝒙 ∈ X. ■

𝑨 acts as a residual maker matrix, annihilating the conditional mean at the true parameter value 𝜷0. I assume 𝐿 ≤ 𝑇 which corresponds to the examples studied in Section 2.3. While 𝐿 > 𝑇 is theoretically possible and would rely on the same theory of g-inverses employed in this paper, I do not consider such a case. In fact, cases of the examples in Section 2.3 where 𝐿 > 𝑇 often correspond to linearly dependent and hence redundant sets of moment conditions. Under the previous assumptions,

𝐸(𝑨(𝒙𝑖, 𝜷0) 𝒚𝑖 | 𝒙𝑖) = 0    (2.2.3)

by iterated expectations. We can thus use equation (2.2.3) as the basis of a GMM estimator of 𝜷0, where any function of 𝒙𝑖 can be used as instruments for 𝑨(𝒙𝑖, 𝜷0) 𝒚𝑖 to improve efficiency. Note that 𝑨 could contain external instrumental variables which do not appear in the mean function. This more general case is considered in Section 2.2.2. The following Lemma demonstrates a useful fact for characterizing information equivalent transformations and has clear parallels in the linear model case. First define 𝒎𝑖(𝜷) = (𝑚1(𝒙𝑖1, 𝜷, 𝒄𝑖), ..., 𝑚𝑇(𝒙𝑖𝑇, 𝜷, 𝒄𝑖))′.

Lemma 2.2.1. Suppose 𝑨(𝒙, 𝜷) is an 𝐿 × 𝑇 matrix satisfying Assumption MAT. Then for any (𝒙0, 𝒄0) ∈ X × C such that |𝑚𝑡(𝒙𝑡0, 𝜷0, 𝒄0)| > 0 for some 𝑡, Rank(𝑨(𝒙0, 𝜷0)) < 𝑇.

Proof. 𝑨(𝒙0, 𝜷0)𝒎(𝜷0) = 0 by Assumption MAT, where 𝒎(𝜷0) stacks the means 𝑚𝑡(𝒙𝑡0, 𝜷0, 𝒄0). As |𝑚𝑡(𝒙𝑡0, 𝜷0, 𝒄0)| > 0 for some 𝑡, 𝒎(𝜷0) ≠ 0, so 𝑨(𝒙0, 𝜷0) has a nontrivial null space, and hence its rank is less than 𝑇. □

The theory for choosing optimal instruments is well-known: when the conditional variance is nonsingular, the optimal GMM estimator uses instruments

(Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)⁻¹ 𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖))′.

However, in most nontrivial cases when 𝑨 is 𝑇 × 𝑇, the conditional variance matrix of 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 is singular even when Var(𝒚𝑖 | 𝒙𝑖) is nonsingular. I make one additional assumption on the transformation studied which allows for such a generality. Assumption SYS specifies consistency of a particular linear system which is necessary for the definition of the asymptotic efficiency bound. It will allow us to use a certain class of generalized inverses when the conditional variance is singular.
Assumption SYS: The system

Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖) 𝑭(𝒙𝑖) = 𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)

is consistent in 𝑭(𝒙𝑖) and 𝐸(𝑭(𝒙𝑖)′ Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖) 𝑭(𝒙𝑖)) is nonsingular for a given solution. ■

Consistency of a linear system only requires the existence of a solution and not necessarily uniqueness. In fact, Section 2.3 considers relevant cases for which uniqueness does not hold. Assumption SYS is posed in Newey (2001) for studying censored and truncated regression. It holds trivially when the conditional variance is nonsingular, in which case the unique solution is Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)⁻¹ 𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖). The results in Chamberlain (1987) and Newey (2001) show that the semiparametric efficiency bound for estimating 𝜷0 using equation (2.2.3) and Assumptions CM, MAT, and SYS is

[𝐸(𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)′ Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)⁻ 𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖))]⁻¹    (2.2.4)

where “⁻” denotes a symmetric g-inverse.6 That is, no √𝑁-consistent estimator of 𝜷0 based off of equation (2.2.3) has a smaller asymptotic variance than (2.2.4). Theorem 5.2 in Newey (2001) shows that the efficiency bound in (2.2.4) is invariant to choice of symmetric g-inverse under Assumption SYS. If the conditional variance is nonsingular, then the g-inverse can be replaced by a proper inverse as in Chamberlain (1987). Otherwise any g-inverse will work as long as the consistency assumption holds.
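The g-inverse appearing in (2.2.4) is easy to illustrate numerically; in the following sketch, the rank-one Ω is a made-up stand-in for a singular conditional variance, and the Moore-Penrose pseudoinverse serves as one valid symmetric g-inverse:

```python
import numpy as np

# A singular, symmetric stand-in for Var(A(x_i, b0) y_i | x_i):
v = np.array([[1.0], [2.0], [3.0]])
Omega = v @ v.T                      # rank one, so no ordinary inverse exists
assert np.linalg.matrix_rank(Omega) == 1

# The Moore-Penrose pseudoinverse satisfies Omega @ G @ Omega = Omega,
# which is exactly the defining property of a g-inverse:
G = np.linalg.pinv(Omega)
assert np.allclose(Omega @ G @ Omega, Omega)
```

Any other matrix G satisfying the displayed property would serve equally well in (2.2.4) under Assumption SYS; the pseudoinverse is simply a convenient, always-available choice.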
The matrix in (2.2.4) is also equivalent to the asymptotic variance of the GMM estimator based off of the moment conditions in (2.2.3) which uses the optimal instruments (Var(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖)⁻ 𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖))′. The system is just identified and so no weight matrix is required for the asymptotic bound. Realizing this efficiency bound is the subject of Section 2.4. The rest of the paper is concerned with studying transformations of the observed data which provide the same semiparametric efficiency bound as defined in (2.2.4). The following definition characterizes the types of transformations I consider:

Definition: Let Assumption CM hold, and let 𝑨(𝒙𝑖, 𝜷) and 𝑩(𝒙𝑖, 𝜷) be 𝐿 × 𝑇 and 𝑀 × 𝑇, respectively. Given 𝑨 and 𝑩 satisfy Assumptions MAT and SYS, the matrices are information equivalent transformations if their semiparametric efficiency bounds given by (2.2.4) are equal. ■

Information equivalence defined above is an equivalence relation on the set of 𝐾 × 𝐾 real-valued matrices since it is defined via matrix equalities. This fact will be used in Section 2.3 to show information equivalence between general forms of applied transformations, since equivalence is transitive and it is easiest to evaluate the information bound in relation to the generalized within transformation. Information equivalence is similar to the definition of redundancy of moment conditions as given by Breusch et al. (1999). However, the results in this paper are not direct consequences of their redundancy results as I allow the moment conditions to have singular covariance matrices, which directly applies to the examples in Section 2.3.

6 A g-inverse for a matrix 𝛀 is a matrix 𝛀⁻ such that 𝛀𝛀⁻𝛀 = 𝛀. This condition is weaker than that defining the Moore-Penrose inverse, which requires three additional properties.
It is worth noting that the Moore-Penrose inverse is unique, but a g-inverse need not be; this fact will be used to prove the main results in Section 2.2.2. For a general treatment of g-inverses, see Rao and Mitra (1971).

2.2.2 General equivalence result

I now prove the general unifying theory of information equivalence. Consider the empirical setting proposed in Section 2.2.1 where Assumption CM holds. I suppose there is a 𝑇 × 𝑇 matrix 𝑴(𝒛𝑖, 𝜷) satisfying Assumptions MAT and SYS where 𝒛𝑖 is allowed to include any element of 𝒙𝑖 as well as outside instruments. Dropping the arguments and writing 𝑴𝑖 = 𝑴(𝒛𝑖, 𝜷0) for simplicity, we have the following moment conditions:

𝐸(𝑴𝑖 𝒚𝑖 | 𝒛𝑖) = 0    (2.2.5)

Equation (2.2.5) includes the case of unconditional moment restrictions. I denote 𝑽𝑖 = 𝐸(𝒚𝑖 𝒚′𝑖 | 𝒛𝑖). I now consider transformations which still yield valid moment conditions. Let 𝑩𝑖 = 𝑩(𝒛𝑖, 𝜷0) be a 𝐽 × 𝑇 matrix such that 𝐸(𝑩𝑖 𝒚𝑖 | 𝒛𝑖) = 0. Now I make the following assumptions which are pivotal for the general result of this section, and thus refer to them as Assumptions GR.1 and GR.2.

Assumption GR.1: 𝑩𝑖 𝑴𝑖 = 𝑩𝑖 and Rank(𝑴𝑖 𝑽𝑖 𝑴′𝑖) = Rank(𝑴𝑖) = 𝐽 < 𝑇. ■

Assumption GR.2: Rank(𝑩𝑖 𝑽𝑖 𝑩′𝑖) = Rank(𝑩𝑖) = 𝐽. ■

The notation for 𝑴𝑖 in Assumption GR.1 is motivated by the standard notation for a residual maker matrix. In fact, one possible sufficient condition for Assumption GR.1 is that Rank(𝑽𝑖) = 𝐽 and that 𝑽𝑖 shares a null space with 𝑩𝑖. This assumption would also suffice for Assumption GR.2 since 𝑩′𝑖 spans the column space of 𝑽𝑖, and is relevant in linear panel models with additive heterogeneity.
We can then let 𝑴𝑖 be a residual maker matrix from regressing on a basis vector for the null space of 𝑩𝑖. Another relevant setting to this paper is when 𝑴𝑖 = 𝑰𝑇 − 𝑷𝑖 where 𝑷𝑖 has rank 𝑇 − 𝐽 and 𝑩𝑖 𝑷𝑖 = 0. This setting characterizes the nonlinear models studied in Section 2.3 and is also sufficient for Assumptions GR.1 and GR.2. Given the discussion above, I now prove a lemma which is essential to the proof of the general equivalence result.

Lemma 2.2.2. 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 is a g-inverse of 𝑴𝑖 𝑽𝑖 𝑴′𝑖.

Proof. Using 𝑩𝑖 𝑴𝑖 = 𝑩𝑖 (and hence 𝑴′𝑖 𝑩′𝑖 = 𝑩′𝑖),

𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 𝑴𝑖 𝑽𝑖 𝑴′𝑖 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 = 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 𝑽𝑖 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 = 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖.

Since Rank(𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖) = 𝐽 by Assumption GR.2 and Rank(𝑴𝑖 𝑽𝑖 𝑴′𝑖) = 𝐽 by Assumption GR.1, 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 is a g-inverse of 𝑴𝑖 𝑽𝑖 𝑴′𝑖 by Theorem 2.6 of Rao and Mitra (1971). □

Theorem 2.2.1. The equality

𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 = 𝑴′𝑖 (𝑴𝑖 𝑽𝑖 𝑴′𝑖)⁻ 𝑴𝑖    (2.2.6)

holds for any choice of matrix 𝑩𝑖 satisfying Assumptions GR.1 and GR.2 for the same 𝑴𝑖 and for any g-inverse of 𝑴𝑖 𝑽𝑖 𝑴′𝑖.

Proof. By Rao and Mitra (1971, p. 603), the expression

𝑴′𝑖 (𝑴𝑖 𝑽𝑖 𝑴′𝑖)⁻ 𝑴𝑖    (2.2.7)

is invariant to the choice of g-inverse as Rank(𝑴𝑖 𝑽𝑖 𝑴′𝑖) = Rank(𝑴𝑖) by Assumption GR.1.
Since 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 is such a g-inverse by Lemma 2.2.2 and 𝑩𝑖 𝑴𝑖 = 𝑩𝑖, we have

𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 = 𝑴′𝑖 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 𝑴𝑖 = 𝑴′𝑖 (𝑴𝑖 𝑽𝑖 𝑴′𝑖)⁻ 𝑴𝑖

which is independent of 𝑩𝑖. □

Equation (2.2.6) of Theorem 2.2.1 provides the framework for evaluating information equivalence. To see how, I include an additional orthogonality assumption which simplifies the efficiency bound in (2.2.4).

Assumption ORTH: 𝑨(𝒙𝑖, 𝜷) is an 𝐿 × 𝑇 matrix, 𝐿 ≤ 𝑇, such that 𝑨(𝒙𝑖, 𝜷)𝒎𝑖(𝜷) = 0 for all 𝜷 in some open ball about 𝜷0. ■

Assumption ORTH is clearly sufficient for Assumption MAT. The transformations studied in the next section satisfy Assumption ORTH for all values of 𝜷 ∈ R𝐾 for which the mean function is well-defined. However, it only needs to hold on a relatively small open and connected set so that it applies with respect to differentiation. Note that ORTH does not say anything about point identification of 𝜷0. Assumption CM guarantees 𝐸(𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖) = 0 only at 𝜷0 because 𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝒄𝑖) = 𝑚𝑡(𝒙𝑖𝑡, 𝜷0, 𝒄𝑖). I also note that every transformation considered in Section 2.3 satisfies Assumption ORTH. The following lemma is a consequence of Assumption ORTH which greatly simplifies the bound in (2.2.4).

Lemma 2.2.3. Let 𝑨(𝒙𝑖, 𝜷) satisfy Assumption ORTH. Then under regularity conditions which allow us to pass the gradient operator through the conditional expectation,

𝐸(∇𝜷 𝑨(𝒙𝑖, 𝜷0)𝒚𝑖 | 𝒙𝑖) = 𝑨(𝒙𝑖, 𝜷0) ∇𝜷 𝒎𝑖(𝜷0)

Proof. See Appendix for proof. □

Lemma 2.2.3 greatly simplifies the efficiency bound in (2.2.4).
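The equality in (2.2.6) can be checked numerically for the familiar linear pair, taking the within matrix as 𝑴𝑖 and the first-differencing matrix as 𝑩𝑖 (which satisfy 𝑩𝑖𝑴𝑖 = 𝑩𝑖). A sketch under assumed test data: the positive definite 𝑽 below is arbitrary, and the Moore-Penrose pseudoinverse serves as one choice of g-inverse:

```python
import numpy as np

T = 3
M = np.eye(T) - np.ones((T, T)) / T    # within transformation, rank T - 1
B = np.diff(np.eye(T), axis=0)         # first-differencing, (T-1) x T
assert np.allclose(B @ M, B)           # Assumption GR.1: B M = B

rng = np.random.default_rng(0)
A = rng.normal(size=(T, T))
V = A @ A.T + np.eye(T)                # an arbitrary nonsingular stand-in for V_i

lhs = B.T @ np.linalg.inv(B @ V @ B.T) @ B
rhs = M.T @ np.linalg.pinv(M @ V @ M.T) @ M   # pinv is one valid g-inverse
assert np.allclose(lhs, rhs)           # equation (2.2.6) holds numerically
```

Replacing B with any other full-rank (T−1) × T matrix annihilating the vector of ones leaves the right-hand side unchanged, which is the invariance the theorem asserts.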
The lemma also allows us to say something about finite sample equivalence among certain types of transformations. I summarize these results here:

Corollary 2.2.1. Let 𝑨(𝒙𝑖, 𝜷) be an 𝐿 × 𝑇 matrix satisfying Assumptions MAT, SYS, and ORTH. Then 𝑨(𝒙𝑖, 𝜷0) has the following efficiency bound:

[𝐸(∇𝜷 𝒎𝑖(𝜷0)′ 𝑨(𝒙𝑖, 𝜷0)′ (𝑨(𝒙𝑖, 𝜷0) 𝐸(𝒚𝑖 𝒚′𝑖 | 𝒙𝑖) 𝑨(𝒙𝑖, 𝜷0)′)⁻ 𝑨(𝒙𝑖, 𝜷0) ∇𝜷 𝒎𝑖(𝜷0))]⁻¹    (2.2.8)

Corollary 2.2.2. Suppose 𝑨𝑖 and 𝑩𝑖 are 𝐽 × 𝑇 matrices and 𝑴𝑖 is a 𝑇 × 𝑇 matrix such that Assumptions GR.1 and GR.2 hold for 𝑨 and 𝑩. Further suppose 𝑨𝑖, 𝑩𝑖, 𝑴𝑖, and the conditional gradient ∇𝜷 𝐸(𝒚𝑖 | 𝒛𝑖) are independent of 𝜷. Then

∇𝜷 𝒎′𝑖 𝑨′𝑖 (𝑨𝑖 𝑽𝑖 𝑨′𝑖)⁻¹ 𝑨𝑖 𝒎𝑖(𝜷) = ∇𝜷 𝒎′𝑖 𝑩′𝑖 (𝑩𝑖 𝑽𝑖 𝑩′𝑖)⁻¹ 𝑩𝑖 𝒎𝑖(𝜷)    (2.2.9)

for any value of 𝜷 in 𝒎𝑖(𝜷).

Corollary 2.2.1 allows us to directly apply the result from Theorem 2.2.1 to the relevant cases in Section 2.3. For information equivalence, it will suffice to show that the relevant transformations satisfying Assumptions MAT, SYS, and ORTH only need to satisfy a rank assumption to be information equivalent. The choice of 𝑴 will become apparent based on the empirical setting. Corollary 2.2.2 gives an even more powerful result than equivalence of efficiency bounds.
For example, if the moment conditions in (2.2.5) are conditional on 𝒙𝑖, the efficient GMM estimator of 𝜷0, say 𝜷̂, solves

∑_{𝑖=1}^{𝑁} ∇𝜷 𝒎′𝑖 𝑴′𝑖 (𝑴𝑖 𝑽𝑖 𝑴′𝑖)⁻ 𝑴𝑖 𝒎𝑖(𝜷̂) = 0    (2.2.10)

Corollary 2.2.2 tells us that the efficient estimators based off of 𝐸(𝑨𝑖 𝒚𝑖 | 𝒙𝑖) and 𝐸(𝑩𝑖 𝒚𝑖 | 𝒙𝑖) are algebraically equivalent. When the transformations are themselves functions of the parameters, implementation of the efficient instruments depends on first-stage estimators whereas the transformation 𝑨𝑖 𝒎𝑖 depends on the FOC solution, so the results only hold asymptotically. The proof of Theorem 4.2 in Im et al. (1999) uses a specific form of the argument in the proof above. This fact suggests further applications to panel data transformations with strictly exogenous covariates which I explore in the next section.

2.3 Examples of information equivalence

This section considers the application of Theorem 2.2.1 to a variety of interesting empirical settings.

2.3.1 Multiplicative heterogeneity

I now consider the case of a single multiplicative heterogeneous effect:

𝐸(𝑦𝑖𝑡 | 𝒙𝑖, 𝑐𝑖) = 𝑐𝑖 𝑚𝑡(𝒙𝑖𝑡, 𝜷0)    (2.3.1)

This specification has grown in popularity in recent years. For example, see Krapf, Ursprung, and Zimmermann (2017), Fischer, Royer, and White (2018), Castillo, Mejía, and Restrepo (2020), Schlenker and Walker (2016), McCabe and Snyder (2014, 2015), and Williams et al. (2020). The most common specification of equation (2.3.1) in practice is the exponential mean function as demonstrated in Example 2. Often the data generating process is a count variable with a mass point at zero, but the model can apply to any nonnegative outcome. This typically implies 𝑚𝑡(𝒙, 𝜷0) > 0 for all 𝒙 ∈ X, which the rank assumptions made in this section will imply.
I consider the following generalized residual functions, first introduced in Example 2:

u_it(β) = y_it − (∑_{s=1}^T y_is) p_it(β) (2.3.2)

r_{i,t,s}(β) = y_it − y_is m_t(x_it, β)/m_s(x_is, β) (2.3.3)

where p_it(β) = m_t(x_it, β)(∑_{s=1}^T m_s(x_is, β))⁻¹. Equation (2.3.2) is reminiscent of the linear within transformation. However, the transformation in the linear case demeans using the time averages, whereas the generalized within transformation weights by the pseudo-probability p_it(β). The generalized differencing residual in equation (2.3.3) allows a large number of differencing procedures, including next- and first-differencing as well as differencing one time period from the others, in which t is fixed and s is allowed to vary. Any other arbitrary pattern of generalized differencing is allowed so long as it produces a full rank transformation. In contrast to the linear model with an additive effect, the transformations in equations (2.3.2) and (2.3.3) do not eliminate the heterogeneity but still create valid moment conditions. For example, taking the mean of equation (2.3.3) conditional on (x_i, c_i) gives

E(r_{i,t,s}(β0)|x_i, c_i) = c_i m_t(x_it, β0) − c_i m_s(x_is, β0) m_t(x_it, β0)/m_s(x_is, β0) = c_i(m_t(x_it, β0) − m_t(x_it, β0)) = 0

which still yields conditional moment restrictions.
Define the respective T × 1 and (T − 1) × 1 residual vectors

u_i(β) = (I_T − p_i(β)1′) y_i (2.3.4)

r_i(β) = D_i(β) y_i (2.3.5)

where 1 is a T × 1 vector of ones and D_i(β) is the (T − 1) × T weighted generalized differencing matrix which yields the desired residuals as in (2.3.3). I refer to the transformations in equations (2.3.4) and (2.3.5) as the generalized within and generalized differencing transformations, respectively. Then an iterated expectations argument shows E(u_i(β0)|x_i) = 0 and E(r_i(β0)|x_i) = 0. Thus equations (2.3.4) and (2.3.5) satisfy Assumption MAT and suggest moment conditions for efficient GMM estimation which could reach their respective efficiency bounds in (2.2.4). As discussed in the Introduction, equation (2.3.4) is the foundation of the FEP estimator. The FEP is defined in Hausman et al. (1984) as the MLE of a conditional Multinomial distribution with probability and count parameters p_i(β0) = (p_i1(β0), ..., p_iT(β0))′ and n_i. Wooldridge (1999) shows that the FEP is consistent under Assumption CM using the fact that equation (2.3.4) has a zero conditional mean at β0 regardless of the true distribution of y_i|x_i. This robustness result helped lead to its proliferation in empirical research. As for efficiency, Hahn (1997) shows that the FEP is asymptotically efficient under the full set of Multinomial distributional assumptions. Verdier (2018) strengthens this result substantially by showing efficiency under just zero conditional correlation and conditional mean-variance equality. Brown and Wooldridge (2021) extend this result to allow arbitrary constant conditional mean-variance dispersion. Equation (2.3.5) was first studied by Chamberlain (1992) and Wooldridge (1997) in the context of next-differencing for nonlinear models.
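To make the transformations concrete, here is a small numerical sketch (my own illustration; the exponential mean, the parameter values, and the Poisson draws are assumptions for the demo, not part of the text) of the generalized within and next-differencing residuals in (2.3.2)-(2.3.5). Both have conditional mean zero at β0 even though neither eliminates c_i.

```python
import numpy as np

rng = np.random.default_rng(1)
T, beta0, c = 4, 0.5, 2.0                 # c plays the role of c_i
x = rng.standard_normal(T)                # one unit's covariates, held fixed
m = np.exp(x * beta0)                     # exponential mean m_t(x_t, beta0)

# Generalized within: u_i = (I_T - p_i 1') y_i with p_it = m_t / sum_s m_s
p = m / m.sum()
W = np.eye(T) - np.outer(p, np.ones(T))

# Generalized next-differencing: r_{i,t,t+1} = y_t - y_{t+1} m_t / m_{t+1}
D = np.zeros((T - 1, T))
for t in range(T - 1):
    D[t, t], D[t, t + 1] = 1.0, -m[t] / m[t + 1]

# Exact algebra: both transformations annihilate E(y_i | x_i, c_i) = c*m
print(np.allclose(W @ (c * m), 0), np.allclose(D @ (c * m), 0))

# Simulation: average the residuals over many Poisson draws for this unit;
# both should be near zero up to sampling error
y = rng.poisson(c * m, size=(500_000, T))
print(np.abs(W @ y.mean(axis=0)).max() < 0.05)
print(np.abs(D @ y.mean(axis=0)).max() < 0.05)
```

The exact checks mirror the conditional-mean derivation after (2.3.3); the simulated averages illustrate the same moment conditions in finite samples.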
It can also allow for estimation of β0 under weaker forms of exogeneity, like sequential exogeneity in the next-differencing case of s = t + 1, rather than the strict exogeneity implied by Assumption CM. Sequential exogeneity allows the researcher to specify lag dynamics in the mean function, which violate strict exogeneity. However, remarkably less is known about efficient estimation based on equation (2.3.5) when compared to equation (2.3.4) in the context of strict exogeneity as studied here.

The transformations defined in (2.3.4) and (2.3.5) are clearly not the only transformations which satisfy Assumption MAT. Consider the residual maker matrix from regressing on the mean function defined by equation (2.3.1): (I_T − m_i(β)(m_i(β)′m_i(β))⁻¹m_i(β)′). This matrix satisfies Assumption ORTH, and thus Assumption MAT, since it is algebraically orthogonal to the mean function by construction. It is also well known that the matrix is symmetric, idempotent, and has rank T − 1. I will refer to this matrix as the residual maker transformation.

The main theorem of this section proves the information equivalence between the generalized within, generalized differencing, and residual maker transformations. This result is similar to Theorem 4.2 of Im et al. (1999), who prove algebraic equivalences of GLS estimators based on strictly exogenous covariates in linear panel data models with additive effects. There are two primary differences between Theorem 2.3.1 in this paper and Theorem 4.2 in Im et al. First, the heterogeneity is multiplicative rather than additive. This difference is not without loss of generality: an additive-heterogeneity model cannot simply be rewritten in multiplicative form, as the implied heterogeneity would have time variation⁷. Second, Im et al. show an algebraic equivalence between the estimators studied, while I show an asymptotic equivalence.
As mentioned after Theorem 2.2.1, finite sample equality will not necessarily follow when the transformations are functions of the parameter β0 and require a first-step estimator to implement. By Lemma 2.2.1, the conditional variance of the generalized within transformation is necessarily singular, so I will need to show that its efficiency bound is well-defined and invariant to the choice of symmetric g-inverse. Lemma 1 of Verdier (2018) shows that it has rank T − 1 at the true parameter value. This fact suggests that deleting a row to remove the rank degeneracy leads to a transformation with a nonsingular variance matrix. Im et al. (1999) take this approach when showing equivalence between the within and differenced linear estimators. Let Q be a (T − 1) × T matrix which removes an arbitrary row from a given T × T matrix. Then the transformation Q(I_T − p_i(β0)1′) is the generalized within transformation with an arbitrary row deleted. A similar procedure can be used to make the residual maker transformation full rank. The main result will show that information equivalence is invariant to the row deleted.

Lemma 2.3.1 will show that the efficiency bounds of the within and residual maker transformations are well-defined. First I assume that E(y_i y_i′|x_i) is strictly positive definite, a weaker assumption than the conditional variance of y_i itself being positive definite. Under this assumption, the conditional variance of the generalized differencing transformation is nonsingular under a rank condition provided below. Before I can verify Assumption SYS, I will need an additional rank assumption for each respective transformation.

⁷ If y_it = m_t + u_i, then rewriting this as y_it = c_i m_t implies c_i = (m_t + u_i)/m_t, which depends on the time period specified.
Assumption RK.1: Rank(D_i(β0)) = T − 1. ■

Assumption RK.1 states that the differencing matrix has full row rank. It requires that none of the differences used for estimation are redundant in the sense that some row or rows are linear combinations of the others. Necessarily the researcher cannot reuse rows, and if y_it is differenced from y_is, then y_is cannot be differenced from y_it. Further, we must have s ≠ t for each row so that D does not have any zero rows. For example, including all pairwise differences leads to linear dependence, which causes RK.1 to fail.

Assumption RK.2: Let Σ_i = E(y_i y_i′|x_i) be positive definite. Define

V_i⁻ = Σ_i⁻¹ − (1/a_i) Σ_i⁻¹ m_i(β0) m_i(β0)′ Σ_i⁻¹

where a_i = m_i(β0)′ Σ_i⁻¹ m_i(β0). Then the square matrix E(∇_β m_i(β0)′ V_i⁻ ∇_β m_i(β0)) has full rank. ■

V_i⁻ is a symmetric g-inverse of Var((I_T − p_i(β0)1′) y_i|x_i). In fact, it also satisfies the property

V_i⁻ [Var((I_T − p_i(β0)1′) y_i|x_i)] V_i⁻ = V_i⁻ (2.3.6)

as shown in Lemma 2 of Verdier (2018), so it is a reflexive and clearly symmetric g-inverse. Assumption RK.2 suffices for the bound in (2.2.4) to exist: I show in the next lemma that V_i⁻ ∇_β m_i(β0) is a solution to the system in Assumption SYS. This fact, along with the fact that V_i⁻ m_i(β0) = 0 and Lemma 2.2.3, gives the bound in (2.2.4) as the expectation above. The following lemma shows that all transformations studied satisfy Assumption SYS and so any symmetric g-inverse will suffice.

Lemma 2.3.1.
Suppose Assumptions CM, RK.1, and RK.2 hold and that E(y_i y_i′|x_i) is positive definite. Then the generalized differencing, generalized within, and residual maker transformations satisfy Assumption SYS. Further, either of the T × T transformations with any arbitrary row deleted also satisfies Assumption SYS.

Proof. See Appendix for proof. □

The main consequence of Lemma 2.3.1 is that the asymptotic efficiency bound is well-defined and invariant to the symmetric g-inverse for all of the transformations studied in this section. Now I can formally state the application of the main equivalence theorem to the transformations studied in this section. First note that Assumptions CM, RK.1, RK.2, and the positive definiteness of E(y_i y_i′|x_i) are sufficient for each of the transformations studied to satisfy Assumptions SYS and ORTH (and thus MAT), so that their asymptotic efficiency bounds are well-defined and given by (2.2.8).

Theorem 2.3.1. Suppose Assumptions CM, RK.1, and RK.2 hold and that E(y_i y_i′|x_i) is positive definite. Then (I_T − p_i(β0)1′), D_i(β0), (I_T − m_i(β0)(m_i(β0)′m_i(β0))⁻¹m_i(β0)′), Q(I_T − p_i(β0)1′), and Q(I_T − m_i(β0)(m_i(β0)′m_i(β0))⁻¹m_i(β0)′) are information equivalent and invariant to the row deleted by Q.

Proof. See Appendix for proof. □

The proof of Theorem 2.3.1 is independent of which row is deleted in choosing Q and of the type of differencing chosen in D satisfying Assumption RK.1, reinforcing the importance of the rank assumptions. As in Theorem 2.2.1, transformations with rank L < T can be shown to be information equivalent via a similar argument, but this fact is not directly relevant to the current results.
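The content of Theorem 2.3.1 and of Assumption RK.2 can be checked numerically. The sketch below (my own construction; Σ and the mean values are arbitrary stand-ins) builds the generalized within, residual maker, generalized next-differencing, and row-deleted within transformations, verifies that they share the same bound kernel A′(A Σ A′)⁻A, and confirms that the common value is exactly the g-inverse V⁻ from Assumption RK.2 with its reflexivity property (2.3.6).

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5

S = rng.standard_normal((T, T))
Sigma = S @ S.T + T * np.eye(T)          # stand-in for E(y_i y_i'|x_i), p.d.
m = np.exp(rng.standard_normal(T))       # mean function values m_t > 0

def bread(A):
    """A'(A Sigma A')^+ A: the kernel of the efficiency bound in (2.2.8)."""
    return A.T @ np.linalg.pinv(A @ Sigma @ A.T) @ A

# Generalized within: I_T - p 1' with p = m / (1'm)
W = np.eye(T) - np.outer(m / m.sum(), np.ones(T))
# Residual maker: I_T - m (m'm)^{-1} m'
M = np.eye(T) - np.outer(m, m) / (m @ m)
# Generalized next-differencing: row t is e_t' - (m_t / m_{t+1}) e_{t+1}'
D = np.zeros((T - 1, T))
for t in range(T - 1):
    D[t, t], D[t, t + 1] = 1.0, -m[t] / m[t + 1]
# Row-deleted generalized within, Q(I_T - p 1') (drop the first row)
QW = W[1:, :]

base = bread(D)
same = all(np.allclose(bread(A), base) for A in (W, M, QW))
print(same)  # True: all four transformations share one bound kernel

# The common kernel equals V^- from Assumption RK.2, which is reflexive
# against the variance of the within-transformed outcome, as in (2.3.6)
Si = np.linalg.inv(Sigma)
Vminus = Si - np.outer(Si @ m, Si @ m) / (m @ Si @ m)
Var = W @ Sigma @ W.T
print(np.allclose(base, Vminus))
print(np.allclose(Vminus @ Var @ Vminus, Vminus))
```

Each transformation annihilates m and has rank T − 1, so all of them have the same row space; the pseudo-inverse then washes out the particular choice of rows, which is the algebraic heart of the theorem.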
It is also important to note that the list of information-equivalent transformations is not necessarily exhaustive: any T × T or (T − 1) × T matrix with rank T − 1 and the respective orthogonality condition will be information equivalent to the transformations in Theorem 2.3.1 by Theorem 2.2.1. Similar to the discussion after Theorem 2.2.1, the results in Theorem 2.3.1 could also apply to mean functions which have already been transformed. For example, consider the multiplicative random trend from Example 2, y_it = c_i a_i^t m_t(x_it, β0) u_it, where u_it is an idiosyncratic error. If we assume the outcomes are bounded away from zero, we could first divide each outcome by the previous period's outcome. We then have the multiplicative model

y_it* = a_i [m_t(x_it, β0)/m_{t−1}(x_{i,t−1}, β0)] (u_it/u_{i,t−1}).

If u_it/u_{i,t−1} is independent of x_i and a_i with mean 1, we have the model from equation (2.3.1). Then all of the transformations studied here are information equivalent on the transformed outcomes y_i*.

2.3.2 Linear factor model

This section considers linear panels with a factor-augmented error:

y_it = x_it β0 + f_t′ γ_i + u_it (2.3.7)

where f_t is a p × 1 vector of common factors. Stacking the factors into the T × p matrix F = (f_1, ..., f_T)′, Pesaran (2006) adds the additional reduced-form equation

x_i = F Γ_i + v_i (2.3.8)

where Γ_i is a p × K matrix of "factor loadings" and v_i is a T × K matrix of mean-zero idiosyncratic errors. Write z_i = (y_i, x_i). Under the assumptions in Pesaran (2006), equations (2.3.7) and (2.3.8) imply

E(z_i) = F C_Q (2.3.9)

where C_Q is a p × (K + 1) matrix.
Assuming p ≤ K + 1, C_Q is full rank, which suggests that E(z_i) can control for the space spanned by F. The pooled common correlated effects (CCEP) estimator is defined as

β̂_CCEP = (∑_{i=1}^N x_i′ M_F̂ x_i)⁻¹ ∑_{i=1}^N x_i′ M_F̂ y_i (2.3.10)

where F̂ = Z̄ = (1/N) ∑_{i=1}^N (y_i, x_i). Westerlund et al. (2019) show that when T is fixed and N → ∞, M_F̂ →_p M_F − P_{−p}, where P_{−p} is a nonlinear function of the model's errors. When p = K + 1 and the number of cross-sectional averages equals the number of factors, P_{−p} = 0 and so the CCEP removes the factors and nothing else.

Another fixed-T approach comes from Ahn et al. (2013). They do not make the reduced-form assumption in equation (2.3.8). Instead, they introduce new parameters which eliminate F. As both F and γ_i are unobserved, they impose p² normalizations on the factor matrix:

F = (Θ′, −I_p)′ (2.3.11)

where Θ is a (T − p) × p matrix of unrestricted parameters. Let θ = vec(Θ). They then define the quasi-long-differencing (QLD) matrix

H(θ) = (I_{T−p}, Θ)′ (2.3.12)

so that H(θ)′F = 0. The Ahn et al. (2013) technique involves jointly estimating (β0′, θ′)′ with the use of many instruments. Instead, I focus on the QLD transformation and compare it to the asymptotic CCE transformation. Suppose Ω_i = E(u_i u_i′|x_i) is known and has full rank. Define the CCE GLS and QLD GLS estimators as

β̂_CCEGLS = (∑_{i=1}^N x_i′ M_F (M_F Ω_i M_F)⁻ M_F x_i)⁻¹ ∑_{i=1}^N x_i′ M_F (M_F Ω_i M_F)⁻ M_F y_i (2.3.13)
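A minimal simulation sketch of the CCEP estimator in (2.3.10) may help fix ideas. This is my own construction, not from the text: the DGP, sample sizes, and parameter values are assumptions, and the design keeps loadings independent of the errors with homogeneous slopes, a case in which CCEP is consistent.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, beta0 = 5000, 6, 1.0               # K = 1 regressor, p = 1 factor

# Assumed DGP: y_it = x_it*beta0 + f_t*gamma_i + u_it,  x_i = f*Gamma_i + v_i
f = rng.standard_normal(T)               # common factor
gamma = 1.0 + rng.standard_normal(N)     # loadings in y
Gamma = 1.0 + rng.standard_normal(N)     # loadings in x
x = f[None, :] * Gamma[:, None] + rng.standard_normal((N, T))
y = x * beta0 + f[None, :] * gamma[:, None] + rng.standard_normal((N, T))

# Factor proxies: cross-sectional averages of (y, x), a T x (K+1) matrix
Fhat = np.column_stack([y.mean(axis=0), x.mean(axis=0)])
# Annihilator M_Fhat = I_T - Fhat (Fhat'Fhat)^{-1} Fhat'
M = np.eye(T) - Fhat @ np.linalg.solve(Fhat.T @ Fhat, Fhat.T)

# Pooled CCE, equation (2.3.10), written out for a single regressor
num = sum(x[i] @ M @ y[i] for i in range(N))
den = sum(x[i] @ M @ x[i] for i in range(N))
beta_ccep = num / den
print(abs(beta_ccep - beta0) < 0.1)  # close to the truth in this design
```

Note that K + 1 = 2 proxies are projected out even though there is only p = 1 factor, illustrating the point above that CCE generally uses more factor proxies than necessary.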
โˆ’1 ๐‘ โˆ‘๏ธ โˆ‘๏ธ ๐œท b๐‘„๐ฟ๐ท๐บ ๐ฟ๐‘† = ๐’™๐‘–โ€ฒ ๐‘ฏ(๐œฝ)(๐‘ฏ(๐œฝ) โ€ฒ๐›€๐‘– ๐‘ฏ(๐œฝ)) โˆ’1 ๐‘ฏ(๐œฝ) โ€ฒ ๐’™๐‘– ๐’™๐‘–โ€ฒ ๐‘ฏ(๐œฝ)(๐‘ฏ(๐œฝ) โ€ฒ๐›€๐‘– ๐‘ฏ(๐œฝ)) โˆ’1 ๐‘ฏ(๐œฝ) โ€ฒ ๐’š๐‘– ๐‘–=1 ๐‘–=1 (2.3.14) Theorem 2.3.2. Suppose Assumption CM holds, ๐ธ ( ๐’š๐‘– ๐’šโ€ฒ๐‘– |๐’™๐‘– ) is positive definite, and ๐‘…๐‘Ž๐‘›๐‘˜ (๐‘ญ) = ๐‘ < ๐‘‡. Then ๐œท b๐ถ๐ถ๐ธ๐บ ๐ฟ๐‘† = ๐œท b๐‘„๐ฟ๐ท๐บ ๐ฟ๐‘† . Proof. ๐‘…๐‘Ž๐‘›๐‘˜ (๐‘ฏ(๐œฝ)) = ๐‘…๐‘Ž๐‘›๐‘˜ ( ๐‘ด ๐‘ญ ) = ๐‘‡โˆ’๐‘ so ๐‘ด ๐‘ญ ( ๐‘ด ๐‘ญ ๐›€๐‘– ๐‘ด ๐‘ญ ) โˆ’ ๐‘ด ๐‘ญ = ๐‘ฏ(๐œฝ)(๐‘ฏ(๐œฝ) โ€ฒ๐›€๐‘– ๐‘ฏ(๐œฝ)) โˆ’1 ๐‘ฏ(๐œฝ) โ€ฒ by Theorem 1. โ–ก Because ๐‘ฏ(๐œฝ) and ๐‘ด ๐‘ญ are only available asymptotically, the best we can hope to achieve is an asymptotic equivalence result. Further, as discussed earlier, the CCE transformation ๐‘ด ๐‘ญb only converges in probability to ๐‘ด ๐‘ญ when ๐‘ = ๐พ + 1. Other fixed-๐‘‡ approaches in the literature include Robertson and Sarafidis (2015) who parameterize the correlation between the exogenous instruments and the factor loadings. They show that one of their estimators is asymptotically equivalent to the full QLD GMM estimator of Ahn et al. (2013) which suggests a similar efficiency result as Theorem 3. Westerlund (2020) studies the principal components (PC) estimator using the Pesaran (2006) CCE model. PC estimation is essentially fixed effects OLS which estimates the 48 factors and loadings as additional parameters. If the estimator of ๐‘ด ๐‘ญ is consistent for ๐‘ด ๐‘ญ , it can be made asymptotically efficient in the sense of Theorem 2.3.2 and thus a possible efficient alternative to CCE estimation when ๐‘‡ is fixed. 2.3.3 Random trend I now consider a particular factor specification which is common in applied settings. This linear model with additive effects as described in Example 1 of Section 2.2.1. 
takes the form

y_it = c_i + a_i t + x_it β0 + u_it (2.3.15)

Such a model is often called a random trend model because the outcome variable has an unobserved heterogeneous response to the observable time trend⁸. A standard technique for dealing with the heterogeneous trend is to first-difference. Define Δy_it = y_it − y_{i,t−1}, with similar definitions for Δx_it and Δu_it. Then

Δy_it = a_i + Δx_it β0 + Δu_it (2.3.16)

Under the strict exogeneity assumption of Assumption CM, we have E(Δu_it|x_i) = 0 for each t ≥ 2. Thus we have strictly exogenous covariates with an additive heterogeneity term. The most popular technique for estimating β0 in a linear model with additive heterogeneity is fixed effects estimation, which applies the within transformation I_{T−1} − (1/(T−1)) 1_{T−1} 1′_{T−1}, where 1_{T−1} is a (T − 1) × 1 vector of ones, to the first-differenced residuals Δy_it − Δx_it β0.

Another way to eliminate the heterogeneity in equation (2.3.15) is to apply the first-differencing transformation again to equation (2.3.16). This technique is often referred to as second-differencing. The regression is then run for Δy_it − Δy_{i,t−1} on Δx_it − Δx_{i,t−1}. Since the heterogeneous terms correspond to a known intercept and time trend, we can also run a full fixed effects regression on equation (2.3.15) which treats (c_1, ..., c_N, a_1, ..., a_N) as parameters.

One final transformation to consider is the forward orthogonal deviations (FOD) operator in Arellano and Bover (1995). This matrix applies the following transformation to the errors u_it in equation (2.3.16):

((T − t)/(T − t + 1))^{1/2} [u_it − (1/(T − t))(u_{i,t+1} + ... + u_{iT})] (2.3.17)

The transformation can be written in matrix form as

F = diag(((T−1)/T)^{1/2}, ..., (1/2)^{1/2}) ×
[ 1   −(T−1)⁻¹   −(T−1)⁻¹   ...   −(T−1)⁻¹ ]
[ 0   1          −(T−2)⁻¹   ...   −(T−2)⁻¹ ]
[ ⋮              ⋱                 ⋮       ]
[ 0   0          ...         1    −1       ] (2.3.18)

I denote this FOD transformation as the matrix F. For each of the first T − 1 observations, F subtracts off a weighted mean of the remaining observations. While initially studied in the context of sequential exogeneity and predetermined systems like first-differencing, I study it here in the context of strict exogeneity to determine information equivalence. Since I am also assuming the structure in (2.3.16), where first-differencing has already occurred, I consider the (T − 2) × (T − 1) matrix F which corresponds to the definition in equation (2.3.18) but assumes T − 1 dependent variables instead of T. Regardless of the number of time periods considered, F has full row rank, which is T − 2 in this case.

To show information equivalence of the techniques described, let D_1 and D_2 be the respective (T − 1) × T and (T − 2) × (T − 1) full rank first-differencing matrices, W = I_{T−1} − (1/(T−1)) 1_{T−1} 1′_{T−1} be the (T − 1) × (T − 1) within transformation which has rank T − 2, F be the (T − 2) × (T − 1) full rank matrix defined similarly to equation (2.3.18), and M be the T × T residual maker matrix from regressing on (1, t).

⁸ See Section 11.7.1 of Wooldridge (2010).
Then

D_2 D_1 E((y_i − x_i β0)|x_i) = E(D_2 D_1 (y_i − x_i β0)|x_i) = 0 (2.3.19)
W D_1 E((y_i − x_i β0)|x_i) = E(W D_1 (y_i − x_i β0)|x_i) = 0 (2.3.20)
M E((y_i − x_i β0)|x_i) = E(M (y_i − x_i β0)|x_i) = 0 (2.3.21)
F D_1 E((y_i − x_i β0)|x_i) = E(F D_1 (y_i − x_i β0)|x_i) = 0 (2.3.22)

where equations (2.3.19)-(2.3.22) correspond to the residuals from the second-differencing, first-differencing then within, full fixed effects, and first-differencing then forward orthogonal deviations transformations, respectively. Thus each of the transformations satisfies Assumption MAT, and so we can apply the general theory from Section 2.2.2.

Theorem 2.3.3. Suppose Assumption CM holds and E(y_i y_i′|x_i) is positive definite. Then D_2 D_1, W D_1, F D_1, and M are information equivalent.

Proof. As D_1 is full rank, Rank(D_2 D_1) = Rank(W D_1) = Rank(F D_1) = T − 2. Since Rank(M) = T − 2 by definition, the result holds by Theorem 1. □

The simplicity of the proof follows from the general nature of the unified theory proved in Section 2.2 and thus demonstrates its usefulness. In the language of Im et al. (1999), the GLS estimators based on the residuals in equations (2.3.19)-(2.3.22) are algebraically equivalent for a given covariance matrix E(u_i u_i′|X_i). Finally, Theorem 2.3.3 can be seen as a generalization of Theorem 4.3 of Im et al. (1999).

2.4 Practical considerations

The final section of the paper provides useful applications of the results in the previous two sections. I first consider implementation of the efficiency bounds discussed in the paper.
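Theorem 2.3.3 can also be illustrated numerically. The sketch below (my own; Ω is an arbitrary positive definite matrix standing in for the conditional second-moment matrix) builds the four transformations for the random trend model and checks that they share the same bound kernel A′(A Ω A′)⁻A.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 6

def diff(T):
    """(T-1) x T first-differencing matrix."""
    return np.eye(T - 1, T, k=1) - np.eye(T - 1, T)

D1, D2 = diff(T), diff(T - 1)
W = np.eye(T - 1) - np.ones((T - 1, T - 1)) / (T - 1)    # within, (T-1) square
# FOD on T-1 observations: row t nets out the mean of the later rows
FOD = np.zeros((T - 2, T - 1))
for t in range(T - 2):
    k = T - 2 - t
    FOD[t, t] = 1.0
    FOD[t, t + 1:] = -1.0 / k
    FOD[t] *= np.sqrt(k / (k + 1))
# Residual maker from regressing on (1, t): full fixed effects on (2.3.15)
Z = np.column_stack([np.ones(T), np.arange(1, T + 1)])
M = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)

S = rng.standard_normal((T, T))
Omega = S @ S.T + T * np.eye(T)          # arbitrary p.d. second-moment matrix

def bread(A):
    return A.T @ np.linalg.pinv(A @ Omega @ A.T) @ A

base = bread(M)
same = all(np.allclose(bread(A), base) for A in (D2 @ D1, W @ D1, FOD @ D1))
print(same)  # True: all four transformations are information equivalent
```

All four matrices have rank T − 2 and annihilate both the constant and the trend, so they span the same row space, which is exactly the rank condition the theorem relies on.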
Given a transformation A(x_i, β) satisfying Assumptions SYS and ORTH (and thus MAT), I describe the efficient estimator. The estimator β̂_A which solves

∑_{i=1}^N ∇_β m_i(β0)′ A(x_i, β0)′ (A(x_i, β0) E(y_i y_i′|x_i) A(x_i, β0)′)⁻ A(x_i, β̂_A) y_i = 0 (2.4.1)

is √N-asymptotically normal with asymptotic variance equal to the efficiency bound given by equation (2.2.4). First-stage estimation of β0 comes from a GMM estimator with an arbitrary weight matrix. Second, one needs to consistently estimate E(y_i y_i′|x_i). A nonparametric regression estimator can be used in principle, but in practice this estimator may give highly imprecise estimates when T and K are relatively large. In the multiplicative heterogeneity setting, Brown and Wooldridge (2021) provide a simple and attractive parametric framework for the FEP setting. They assume Var(y_it|x_i, c_i) = αE(y_it|x_i, c_i), where α > 0 is an identified coefficient, along with a constant conditional correlation matrix. Asymptotically justified standard errors can be derived using the familiar sample analog of the efficiency bound in (2.2.4).

The researcher can then test the validity of parts of Assumption CM. For strict exogeneity, Wooldridge (2010, Chapter 18) suggests including functions of lead values of the independent variables and running a joint test of significance. This method's most attractive feature is the weakness of its alternative hypothesis: the null maintains strict exogeneity while the alternative is merely that strict exogeneity fails. It is also easy to implement and can be tested in most standard statistical packages. However, there is no guidance on how to choose which regressors to include or their functional forms.
Another possible way to examine strict exogeneity is via a Hausman test. The researcher could choose a competing estimator based on the desired alternative hypothesis. In the nonlinear multiplicative example of Section 2.3.1, suppose the researcher believes that sequential exogeneity holds, that is, E(y_it|x_i1, ..., x_it, c_i) = c_i m_t(x_it, β0). Then the generalized next-differencing transformation D_i(β) = (r_{i,1,2}(β), ..., r_{i,T−1,T}(β))′ still provides valid moment conditions. However, the instruments designed to reach the efficiency bound in (2.2.4) will not be valid under sequential exogeneity alone. Chamberlain (1992b) derives the asymptotic efficiency bound for moment conditions under sequential exogeneity and provides an implementable estimator which reaches said bound. Under the null hypothesis, both estimators are consistent, with the generalized next-differencing estimator as in (2.3.5) being asymptotically efficient. Under the alternative, only Chamberlain's instruments are valid (and in fact asymptotically efficient among √N-asymptotically normal estimators). Thus we can use a Hausman statistic to test the assumption of strict exogeneity.

The Chamberlain estimator described in the Hausman statistic procedure is difficult to implement, as the instruments may be comprised of multiple sums of conditional moments. The researcher will need either to greatly strengthen the assumptions of the model to allow for parametric forms of these moments or to utilize a large number of nonparametric regressions. Either way, this computational burden makes the Chamberlain estimator difficult to implement. Another possible application of the results involves finite-sample and computational concerns.
Phillips (2020) demonstrates that matrix inversion for estimators based on first-differencing can require significantly more computational resources than for those based on forward orthogonal deviations. He demonstrates with simulation evidence that computational time increases quickly with T even for relatively small values of N. While instruments need to satisfy two conditions given in Phillips (2020) which are not necessarily assumed here, I reiterate that the results in Section 2.2 are purely algebraic and can be applied in a large number of settings.

2.5 Conclusion

This paper considers linear transformations of nonlinear panel models with unobserved heterogeneity. When covariates are strictly exogenous in the zero conditional mean sense, such transformations provide uncountably many moment conditions exploitable for estimation. I consider specifically the asymptotic efficiency bound for estimating the model's parameters, which is reached by the optimal choice of instruments. This bound specifies how efficient any √N-asymptotically normal estimator of β0 can possibly be.

Transformations of the data are said to be information equivalent if they yield the same asymptotic efficiency bound. The main result of Section 2.2 is a unified framework for evaluating the efficiency bounds of transformations that provide moment conditions for estimation. It shows that, besides regularity conditions, matrix transformations which yield conditional moment restrictions and have the same rank yield the same information bound. I also simplify the form of the efficiency bound under a general and easily verifiable algebraic orthogonality property, which could potentially help in determining other interesting relationships between instrumental variable estimators.
The general framework is applied to show that the generalized within transformation, which provides the basis of the FEP estimator, is in fact information equivalent to a number of other transformations. These transformations, which include generalizations of varying differencing techniques used in the linear panel data context, such as next-, first-, and long-differencing, as well as the residual maker matrix from regression on the outcome variable's mean function, are only required to satisfy a rank condition for the main theorem to hold. It is also shown that any (T − 1) × T matrix which is algebraically orthogonal to the mean function of the outcome and of full rank is information equivalent, so deleting an arbitrary row from the generalized within transformation to remove the linear redundancy does not lose any information.

I also generalize a result of Im et al. (1999) on linear panels with an additive heterogeneity term to a general factor-augmented error structure as studied in Pesaran (2006), Ahn et al. (2013), and Westerlund (2020). I show that any transformation of the data which is full rank and eliminates the factors is information equivalent. I use this result to show that in the case of a random heterogeneous trend model, first-differencing twice, first-differencing and then applying a within transformation, and the true fixed effects estimator are information equivalent. For arbitrary factor structures, the QLD transformation of Ahn et al. (2013) is information equivalent to the infeasible fixed effects GLS estimator which takes the unobserved effects as known.

The work in this paper provides a basic framework for comparison of parametric estimators for a broad class of nonlinear models. I primarily consider strictly exogenous covariates so I could compare estimators using theoretically efficient instruments. However, the finite sample algebraic results hold regardless of the validity of the instruments.
As such, the main theorem in Section 2.2 can apply to any comparison of efficiency for instrumental variable estimators.

CHAPTER 3

MOMENT-BASED ESTIMATION OF LINEAR PANEL DATA MODELS WITH FACTOR-AUGMENTED ERRORS

3.1 Introduction

The prevalence of panel data in modern economics has led theorists and practitioners to pay more attention to unobserved and interactive heterogeneity. A popular representation of unobserved effects is the linear factor structure ∑_{j=1}^p f_tj γ_ji, where f_tj is a time-varying macro effect or "common factor" and γ_ji is an individually heterogeneous response or "factor loading". In studying the statistical properties of estimators of factor models, most theoretical treatments have relied on asymptotic expansions where the number of time periods T grows large with the number of cross-sectional units N. As the vast majority of microeconometric data sets have only a few time periods, the recent literature assumes T is fixed while N goes to infinity.

One of the most popular approaches is the common correlated effects (CCE) estimator of Pesaran (2006). He assumes that the covariates are a linear function of the common factors plus a matrix of independent idiosyncratic errors. The pooled CCE estimator comes from the OLS regression which estimates unit-specific slopes on the cross-sectional averages of the dependent and independent variables. CCE is similar to a fixed effects treatment which seeks to eliminate the factors and remove a source of both endogeneity and cross-sectional dependence. Consistency and asymptotic normality were originally proved for sequences of N and T going to infinity. Recent work extends the CCE framework to a fixed-T setting. De Vos and Everaert (2021) derive a fixed-T consistency correction for the dynamic CCE estimator but require T → ∞ for asymptotic normality. Westerlund et al.
(2019) provide the first asymptotic normality derivation of pooled CCE when 𝑇 is fixed and 𝑁 → ∞. However, they still maintain stringent assumptions on the model's DGP. For example, they assume that the factor loadings are independent of the idiosyncratic errors. My estimators do not require this assumption for consistency, though making it simplifies the standard errors. Further, the CCE estimator generally uses more factor proxies than necessary, which can lead to inefficiency. Finally, the CCE estimator requires 𝑇 > 𝐾 + 1, which is highly restrictive in microeconometric settings. For example, in an intervention analysis with only pre-treatment, treatment, and post-treatment observations, classical CCE would require the treatment indicator to be the only regressor. Aside from CCE, most existing fixed-𝑇 techniques create moment conditions by including additional parameters to estimate or by eliminating the factors with observed proxies. A few examples include Hayakawa (2012), Ahn et al. (2001, 2013), Robertson and Sarafidis (2015), and Juodis and Sarafidis (2018, 2020)¹. Of these approaches, I focus on Ahn et al. (2013), who define a parameterized quasi-long-differencing (QLD) transformation that eliminates the factor structure. The QLD residuals then form the basis for a GMM estimator which uses all available exogenous variables to generate moment conditions. I focus on the QLD technique for the sake of comparison to CCE, as both approaches eliminate the factor structure and allow for "fixed effects" assumptions. For example, Robertson and Sarafidis (2015) parameterize the correlation between the exogenous variables and the factor loadings. Ahn (2015) points out that if the factor loadings' distributions change over the cross-sectional units, identification in Robertson and Sarafidis (2015) does not hold. Ahn et al.
(2013) do not assume a pure factor structure in the covariates like Pesaran (2006) and leave the distribution of the covariates unspecified. However, the generality of Ahn et al. (2013) comes at the cost of identifying assumptions, which may explain its lack of use in the empirical literature. The QLD GMM estimator requires many moments to identify all the model's parameters. If either 𝑇 or the number of factors is large, their GMM estimator may require outside instruments. Their estimator also requires nonlinear optimization with a large number of moments and parameters. Hayakawa (2016) provides a simple example where the global identifying assumptions fail and there exist local stationary points.

¹ Juodis and Sarafidis (2021) allow for a linear estimator which requires no additional parameters. However, the fixed-𝑇 analysis requires strong assumptions on the loadings which this paper avoids. See Assumption S.1.1(d) in their Appendix.

I synthesize both approaches and weaken both the Pesaran (2006) and Ahn et al. (2013) assumptions. I use a weakened CCE model without any independence assumptions to provide a first-stage estimator of the additional QLD parameters. Using the QLD transformation, I then derive pooled and mean group linear estimators and provide standard errors which are valid even when the heterogeneity is correlated with the model's errors. These novel estimators have desirable rank conditions and do not require outside instruments as in Ahn et al. (2013). They also do not restrict the number of covariates to be less than the number of time periods minus one, an improvement over fixed-𝑇 CCE. Simulations suggest that the linear QLD estimators outperform the CCE and QLD GMM estimators in finite samples. Another potential source of heterogeneity in linear models comes from the slope coefficients on the observed variables of interest.
Pesaran (2006) proves fixed-𝑇 consistency of the mean group CCE estimator under random slopes but assumes they are independent of everything else in the model. Asymptotic normality requires 𝑇 → ∞, and pooled CCE is studied under constant slopes. I prove fixed-𝑇 consistency and asymptotic normality of the new pooled and mean group QLD estimators. I show that the first-stage estimation of the QLD parameters does not affect consistency, which mirrors the pooled OLS result of Wooldridge (2005), who assumes known factors. To the best of my knowledge, this paper is the first to consider arbitrary random slopes in the context of fixed-𝑇 panels with factor-driven endogeneity. The rest of the paper is structured as follows: Section 3.2 describes the main model of interest, which is weaker than that in Westerlund et al. (2019). Section 3.3 provides the assumptions which underlie the model and discusses implementation of the QLD-based estimators. Section 3.4 introduces random slopes. Section 3.5 provides simulation evidence for the finite sample properties of the QLD estimators. Section 3.6 compares the pooled QLD estimator to two-way fixed effects (TWFE) and CCE in estimating the effect of education expenditure on standardized test performance using a school district-level data set from the state of Michigan. Section 3.7 concludes with a brief summary and suggestions for future research.

3.2 Model

This section lays out the models considered in Westerlund et al. (2019) and Ahn et al. (2013), the fixed-𝑇 CCE and QLD approaches respectively. Throughout the paper, the equation of interest is

𝒚𝑖 = 𝑿𝑖𝜷0 + 𝑭0𝜸𝑖 + 𝒖𝑖   (3.2.1)

where 𝒚𝑖 is a 𝑇 × 1 vector of outcomes, 𝑿𝑖 is a 𝑇 × 𝐾 matrix of covariates, 𝑭0 is a 𝑇 × 𝑝0 matrix of factors common to all units in the population, 𝜸𝑖 is a 𝑝0 × 1 vector of factor loadings, and 𝒖𝑖 is a 𝑇 × 1 vector of idiosyncratic shocks.
A โ€˜0โ€™ subscript denotes the true or realized value of an unobserved parameter. ๐‘ 0 is then unobserved because ๐‘ญ0 and ๐œธ๐‘– are unobserved. Later, ๐‘ denotes the number of factors specified by the econometrician. ๐œท0 is the object of interest and the factor structure ๐‘ญ0 ๐œธ๐‘– is treated as a collection of nuisance parameters. This paper defines ๐‘ 0 as the number of factors whose loadings correlate with ๐‘ฟ๐‘– . This interpre- tation is similar to Ahn et al. (2013) and implicit to the CCE model as discussed in the following section. One justification of this interpretation is to write the full error as ๐‘ซ 0 ๐†๐‘– + ๐๐‘– where ๐‘ซ 0 is a possibly infinite dimensional matrix of common factors and ๐๐‘– is a vector of idiosyncratic errors. Then ๐‘ญ0 ๐œธ๐‘– is the set of variables from ๐‘ซ 0 ๐†๐‘– which are correlated with ๐‘ฟ๐‘– and the rest are absorbed into the error. However, it is entirely likely that ๐œธ๐‘– is correlated with the other loadings which are uncorrelated with ๐‘ฟ๐‘– . This correlation can cause problems for inference and is addressed in Section 3.3. Finally, I assume the factors in ๐‘ญ0 are constant for the purpose of asymptotic analysis. The alternative setting is to assume the factors are stochastic and independent of the other terms, or make the modeling assumptions conditional on the sigma-algebra generated by the factors like in Ahn et al. (2013). When ๐‘‡ is fixed, the stochastic nature of the factors is less relevant for the asymptotic arguments. Standard errors do not change as properly studentized test statistics converge to their usual distributions2. As such, I consider the standard microeconometric assumption of random 2 See Section 6 of Andrews (2005). 58 sampling in the cross-section. Hsiao (2018) provides examples of papers which make either the fixed or random assumption on the factors. 3.2.1 Common Correlated Effects The CCE model in Pesaran (2006) and Westerlund et al. 
(2019) adds an additional reduced-form equation which represents the relationship between the covariates and the factor structure:

𝑿𝑖 = 𝑭0𝚪𝑖 + 𝑽𝑖   (3.2.2)

where 𝚪𝑖 is a 𝑝0 × 𝐾 matrix of factor loadings and 𝑽𝑖 is a 𝑇 × 𝐾 matrix of idiosyncratic errors. Westerlund et al. (2019) follow Pesaran (2006) in assuming 𝑽𝑖, 𝚪𝑖, 𝜸𝑖, and 𝒖𝑖 are mutually independent³. Assuming that the idiosyncratic errors have mean zero, CCE estimates the factors with the matrix 𝑭̂ = (𝒚̄, 𝑿̄), where (𝒚̄, 𝑿̄) = (1/𝑁) Σ_{𝑖=1}^{𝑁} (𝒚𝑖, 𝑿𝑖) are the cross-sectional averages of 𝒚𝑖 and 𝑿𝑖. The pooled common correlated effects (CCEP) estimator treats the cross-sectional averages as fixed effects and can be represented as

𝜷̂𝐶𝐶𝐸𝑃 = ( Σ_{𝑖=1}^{𝑁} 𝑿𝑖′𝑴𝑭̂𝑿𝑖 )⁻¹ Σ_{𝑖=1}^{𝑁} 𝑿𝑖′𝑴𝑭̂𝒚𝑖   (3.2.3)

where 𝑴𝑭̂ = 𝑰𝑇 − 𝑭̂(𝑭̂′𝑭̂)⁺𝑭̂′. Here ‘+’ denotes a Moore–Penrose inverse, which can be replaced by a proper inverse in samples where 𝑭̂′𝑭̂ has full rank. Pesaran (2006) derives the CCE estimator under the following intuition: first, write 𝒁𝑖 = (𝒚𝑖, 𝑿𝑖). The two models in equations (3.2.1) and (3.2.2) imply

𝐸(𝒁𝑖) = 𝑭0𝐸(𝑪𝑖)𝑸   (3.2.4)

where 𝑪𝑖 = (𝜸𝑖, 𝚪𝑖) and

𝑸 = [ 1, 0_{1×𝐾} ; 𝜷0, 𝑰𝐾 ]

Thus, 𝑴𝑭̂ asymptotically eliminates the space spanned by 𝑭0, which includes 𝑭0𝜸𝑖. Westerlund et al. (2019) show that 𝑴𝑭̂ generally converges to the space orthogonal to both 𝑭0 and a random term which is a function of the model's idiosyncratic errors.

³ Westerlund et al. (2019) assume the loadings come from a fixed series of constant matrices, which is more general than the Pesaran (2006) assumption that the loadings are iid.
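To fix ideas, the following sketch simulates a DGP satisfying (3.2.1) and (3.2.2) and computes 𝜷̂𝐶𝐶𝐸𝑃 from equation (3.2.3). The DGP, sample sizes, and variable names are illustrative assumptions rather than anything taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K, p0 = 500, 6, 2, 1          # illustrative sizes; note T > K + 1 holds
beta0 = np.array([1.0, -0.5])
F0 = rng.normal(size=(T, p0))        # common factors (treated as constants)

# X_i = F0 Gamma_i + V_i and y_i = X_i beta0 + F0 gamma_i + u_i
Gam = rng.normal(1.0, 1.0, size=(N, p0, K))   # loadings with nonzero mean
gam = rng.normal(1.0, 1.0, size=(N, p0, 1))
X = F0 @ Gam + rng.normal(size=(N, T, K))
y = X @ beta0[:, None] + F0 @ gam + rng.normal(size=(N, T, 1))

# Factor proxies: cross-sectional averages F_hat = (y_bar, X_bar)
Fhat = np.concatenate([y.mean(0), X.mean(0)], axis=1)   # T x (K+1)
M = np.eye(T) - Fhat @ np.linalg.pinv(Fhat)             # annihilator M_Fhat

# Pooled CCE, equation (3.2.3)
A = sum(X[i].T @ M @ X[i] for i in range(N))
b = sum(X[i].T @ M @ y[i] for i in range(N))
beta_ccep = np.linalg.solve(A, b).ravel()
print(beta_ccep)                     # close to beta0 for large N
```

The Moore–Penrose pseudoinverse plays the role of (𝑭̂′𝑭̂)⁺ in the definition of 𝑴𝑭̂; with 𝐾 + 1 > 𝑝0, as here, 𝑭̂ is nearly collinear in large samples, which is exactly the situation where the pseudoinverse convention matters.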
For the sake of simplicity, suppose that 𝑴𝑭̂ converges in probability to 𝑴𝑭0, which is the case when 𝑝0 = 𝐾 + 1. Then the pooled CCE estimator is based on the moment conditions

𝐸(𝑿𝑖′𝑴𝑭0(𝒚𝑖 − 𝑿𝑖𝜷)) = 0

Assuming 𝐸(𝑽𝑖) = 0 as in Pesaran (2006) and Westerlund et al. (2019), the reduced-form portion of the CCE model also implies 𝐸(𝑴𝑭0𝑿𝑖) = 0. Since the CCE approach estimates no parameters in this set of moments, the additional moments are unused by the CCE residual above. I show how these reduced-form moments can be exploited for additional information in Section 3.3. A particularly harsh restriction of the pooled CCE estimator is the rank condition required for the denominator. 𝑴𝑭̂ is a residual-maker matrix, so it has rank 𝑇 − (𝐾 + 1). For the estimator to be well-defined, we require 𝑇 > 𝐾 + 1. This constraint is trivially nonbinding when 𝑇 → ∞ as in the prior literature. However, when 𝑇 is fixed as in this paper, we need 𝐾 < 𝑇 − 1. For example, if we only observe three time periods, we can only incorporate one regressor. Also, when 𝐾 + 1 > 𝑝0, the CCE estimator unnecessarily removes variation from the data which could improve precision of the estimator. I address both of these problems in Section 3.2.2.

3.2.2 Quasi-long-differencing

Ahn et al. (2013) do not assume the pure factor structure in 𝑿𝑖. They start with equation (3.2.1) and then parameterize the factors for the purpose of eliminating them. Before discussing how this process works, I introduce the ‘rotation problem’, a well-known issue in the factor literature. Since both 𝑭0 and 𝜸𝑖 are unobservable, they cannot be separately identified. To see why, consider any nonsingular 𝑝 × 𝑝 matrix 𝑨. Then 𝑭0𝚪𝑖 = 𝑭*𝚪𝑖* where 𝑭* = 𝑭0𝑨 and 𝚪𝑖* = 𝑨⁻¹𝚪𝑖.
We can only hope to identify the factors up to an arbitrary rotation of their linear subspace. Ahn et al. (2013) suggest the following 𝑝0² normalizations based on a row-reduction rotation:

𝑭0 = (𝚯0′, −𝑰𝑝0)′   (3.2.5)

where 𝚯0 is a (𝑇 − 𝑝0) × 𝑝0 matrix of unrestricted parameters. The given normalization is irrelevant because I am not interested in estimating 𝑭0. In this case, I only assume that the factors are full rank; the normalization chosen merely reflects this fact. Given the normalization of the general factor matrix 𝑭0 in equation (3.2.5), Ahn et al. (2013) define the quasi-long-differencing (QLD) matrix

𝑯(𝜽0) = (𝑰𝑇−𝑝0, 𝚯0)′   (3.2.6)

where 𝜽0 = vec(𝚯0). The QLD transformation eliminates the factors for any given 𝜽0: 𝑯(𝜽0)′𝑭0 = 0. This differencing technique allows for the construction of the QLD residual studied in Ahn et al. (2013):

𝐸(𝒘𝑖 ⊗ 𝑯(𝜽0)′(𝒚𝑖 − 𝑿𝑖𝜷0)) = 0   (3.2.7)

where 𝒘𝑖 is a vector of instruments which may contain vec(𝑿𝑖). The normalization in (3.2.5), and implicit in (3.2.6), is only one particular choice of rotation. The Ahn et al. (2013) estimator depends on the choice of normalization, an issue which is unaddressed in the original paper. I discuss this issue in the Appendix and provide potential solutions for the estimators derived in Section 3.2. While Ahn et al. (2013) provide a general framework for estimating 𝜷0 without strong restrictions on the distribution of 𝑿𝑖, it requires at least 𝑝0 + 𝐾/(𝑇 − 𝑝0) instruments in 𝒘𝑖 to identify all of the model's parameters. If some of the variables are not exogenous in each time period, as with weakly exogenous or predetermined variables, or if 𝑝0 is large, we may require outside instruments.
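The key algebraic property 𝑯(𝜽0)′𝑭0 = 0 holds for any 𝚯0 under the normalization (3.2.5) and is easy to verify numerically; the particular 𝚯0 below is an arbitrary illustrative value:

```python
import numpy as np

rng = np.random.default_rng(1)
T, p0 = 5, 2
Theta0 = rng.normal(size=(T - p0, p0))    # unrestricted QLD parameters (illustrative)

# Normalization (3.2.5): F0 stacks Theta0 over -I_{p0}
F0 = np.vstack([Theta0, -np.eye(p0)])     # T x p0

# QLD matrix (3.2.6): H(theta0) stacks I_{T-p0} over Theta0'
H = np.vstack([np.eye(T - p0), Theta0.T]) # T x (T - p0)

# H(theta0)'F0 = Theta0 - Theta0 = 0, whatever Theta0 is
print(np.abs(H.T @ F0).max())             # prints 0.0
```

The product cancels block by block, 𝑯(𝜽0)′𝑭0 = 𝚯0 − 𝚯0, which is why the elimination works for any value of the QLD parameters and not just the true one.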
Hayakawa (2016) demonstrates an example where the objective function based on equation (3.2.7) suffers from non-global stationary points due to the nonlinear nature of estimation with a large number of moments and parameters. The pure factor structure in equation (3.2.4) can thus be used for estimating the parameters in equation (3.2.6). If we assume 𝑿𝑖 = 𝑭0𝚪𝑖 + 𝑽𝑖 where 𝐸(𝑽𝑖) = 0, then

𝐸(𝑯(𝜽0)′𝒁𝑖) = 0   (3.2.8)

and 𝜽0 is identified by equation (3.2.8), which substantially reduces the number of moments needed to identify 𝜷0. I also show explicitly in the following section how and when these additional moments are useful for the purpose of identification and efficiency.

3.3 Estimation

I now state this paper's primary assumptions. The first assumption is similar to the ‘Basic Assumptions’ of Ahn et al. (2013) and is made for the sake of comparison to their approach. The second set specifies the pure factor structure in 𝑿𝑖 similar to Westerlund et al. (2019). I specify the models in the assumptions as the main results of the paper depend on which model is being assumed. Conditional moments hold almost surely.

Assumption 1 (Linear population model):

1. 𝒚𝑖 = 𝑿𝑖𝜷0 + 𝑭0𝜸𝑖 + 𝒖𝑖. ■

Assumption 2 (CCE reduced form equations):

1. 𝑿𝑖 = 𝑭0𝚪𝑖 + 𝑽𝑖.
2. (𝜸𝑖, 𝚪𝑖, 𝑽𝑖, 𝒖𝑖) are independent and identically distributed across 𝑖 with finite fourth moments.
3. 𝐸(𝑽𝑖) = 0 and 𝐸(𝒖𝑖|𝑽𝑖) = 0.
4. Rk(𝑭0) = 𝑝0 and Rk(𝐸([𝜸𝑖, 𝚪𝑖])) = 𝑝0 ≤ 𝐾 + 1. ■

Assumption 1 simply defines the relevant population model. I will not require the strong rank conditions of Ahn et al. (2013), which can be found in the Appendix, nor will I require outside instruments. Assumption 2 specifies the pure factor assumption similar to Pesaran (2006) and Westerlund et al. (2019).
I assume random sampling in the cross-section to simplify the asymptotic analysis, though this restriction is unnecessary. Westerlund et al. (2019) follow the classical CCE approach in assuming independence between all stochastic components of the model, which is unrealistic in microeconometric settings. Further, the asymptotic normality derivation in Westerlund et al. (2019) relies on the assumption that (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝜸𝑖 ⊗ 𝑽𝑖 = 𝑂𝑝(𝑁^{−1/2}). I demonstrate in Section 3.3.2 that it is unnecessary for consistency and asymptotic normality, and how misspecification causes inconsistency in the standard errors and bootstrapped test statistics provided in Westerlund et al. (2019). The factor structure allows us to weaken the Ahn et al. (2013) assumption from 𝐸(𝒖𝑖|𝑿𝑖) = 0 to 𝐸(𝒖𝑖|𝑽𝑖) = 0. Finally, I do not assume the reduced form equation is a conditional mean specification like Westerlund et al. (2019). They assume 𝐸(𝑽𝑖|𝚪𝑖) = 0, where I only need 𝐸(𝑽𝑖) = 0 and place no restrictions on 𝐷(𝑽𝑖|𝒖𝑖). Another way in which QLD can help weaken the CCE model is the relevant order conditions. As described earlier, Westerlund et al. (2019) require 𝑇 > 𝐾 + 1 for CCE estimation, but I will directly use the moments 𝐸(𝑯0′𝒁𝑖) = 0 to remove the factors, which only requires 𝐾 + 1 ≥ 𝑝0, a restriction also made by Pesaran (2006) and Westerlund et al. (2019). Ahn et al. (2013) do not require this condition but assume the existence of outside instruments, which may be infeasible given the application. I also discuss in Section 3.3.2 how to include known factors like a heterogeneous intercept, which decreases the number of relevant factors and makes the assumption even less restrictive.

3.3.1 CCE Moment Conditions

I now look at the moment conditions implied by Assumption 2.
Equation (3.2.8) of Section 3.2, 𝐸(𝑯0′𝒁𝑖) = 0 where 𝒁𝑖 = (𝒚𝑖, 𝑿𝑖), implies that Assumption 2 provides information on 𝜽0, which leads to more efficient estimation of 𝜷0 and provides a first-stage estimator which negates the need for the full joint estimator of Ahn et al. (2013). I first consider identification of 𝜽0 from the pure factor structure alone to show that it in fact yields valid moments. As in Ahn et al. (2013), identification hinges on correctly specifying 𝑝 = 𝑝0, where 𝑝 is the number of factors specified by the econometrician.

Lemma 3.3.1. Under Assumption 2, 𝜽0 is identified by 𝐸(𝑯(𝜽)′𝒁𝑖) = 0 if and only if 𝑝 = 𝑝0.

Proof. Assumption 2(3) implies

𝐸(𝑯(𝜽)′𝒁𝑖) = 𝑯(𝜽)′𝑭0𝐸(𝑪𝑖)𝑸   (3.3.1)

where 𝐸(𝑪𝑖) = 𝐸([𝜸𝑖, 𝚪𝑖]) and 𝑸 is given in Section 3.2.1. 𝑸 is nonsingular and 𝐸(𝑪𝑖) has full row rank by Assumption 2(4), so equation (3.3.1) is zero if and only if 𝑯(𝜽)′𝑭0 = 0. When 𝑝 = 𝑝0, 𝑯(𝜽)′𝑭0 = 𝚯0 − 𝚯, which is zero if and only if 𝜽 = 𝜽0. See the Appendix for the 𝑝 ≠ 𝑝0 cases. □

Remark (Misspecification): A possible reason for the lack of use of CCE estimation among microeconomists is the model in Assumption 2(1). This assumption is in fact not strictly necessary for identifying 𝜽0. Consider the following linear projection: 𝐸(𝒁𝑖) = 𝑭0𝑮 + 𝑬 where 𝑭0′𝑬 = 0. Then 𝜽0 is still identified by the moments 𝐸(𝑯(𝜽)′𝒁𝑖) if 𝑮 has full rank. ■

We can use Lemma 3.3.1 to provide an estimator of 𝜽0 based on the observed data alone. Let 𝑯̂ = 𝑯(𝜽̂), 𝑫𝜽 = 𝐸(∇𝜽 vec(𝑯0′𝒁𝑖)), and 𝑨𝜽 = 𝐸(vec(𝑯0′𝒁𝑖)vec(𝑯0′𝒁𝑖)′).

Theorem 3.3.1.
Suppose Assumption 2 holds, and let 𝜽̂ be the GMM estimator based on 𝐸(vec(𝑯0′𝒁𝑖)) = 0 using a consistent estimator of the optimal weight matrix. Then

1. √𝑁(𝜽̂ − 𝜽0) →d 𝑁(0, (𝑫𝜽′𝑨𝜽⁻¹𝑫𝜽)⁻¹).

Now suppose that 𝑨̂𝜽 →p 𝑨𝜽 using the first-step estimator 𝜽̂. Then

2. If 𝑝0 = 𝑝, then 𝑁⁻¹ (Σ_{𝑖=1}^{𝑁} vec(𝑯̂′𝒁𝑖))′ 𝑨̂𝜽⁻¹ (Σ_{𝑖=1}^{𝑁} vec(𝑯̂′𝒁𝑖)) →d 𝜒²((𝑇 − 𝑝0)(𝐾 + 1 − 𝑝0)).

3. If 𝑝0 > 𝑝, then 𝑁⁻¹ (Σ_{𝑖=1}^{𝑁} vec(𝑯̂′𝒁𝑖))′ 𝑨̂𝜽⁻¹ (Σ_{𝑖=1}^{𝑁} vec(𝑯̂′𝒁𝑖)) →p ∞.

Proof. The proof comes from standard theory; see Hansen (1982). The estimator of the optimal weight matrix is 𝑨̂𝜽 = (1/𝑁) Σ_{𝑖=1}^{𝑁} vec(𝑯(𝜽̃)′𝒁𝑖)vec(𝑯(𝜽̃)′𝒁𝑖)′, where 𝜽̃ is a consistent first-stage estimator of 𝜽0. □

It is entirely possible there are variables in the data set which are linear in the factors but not relevant for estimation. In this case, one can simply use them to estimate 𝜽0 but drop them from the estimating equation. Further, if relevant variables are not linear in 𝑭0, they should be dropped from the estimation in Theorem 3.3.1. This can occur if there are polynomial or interactive functions of the covariates in the estimating equation. De Vos and Westerlund (2019) study this case in the context of CCE. I also note that the just-identified case 𝑝0 = 𝐾 + 1 corresponds to a simple M-estimator:

Corollary 3.3.1. When 𝑝0 = 𝐾 + 1, the estimator 𝜽̂ solves

𝑯̂′(𝒚̄, 𝑿̄) = 0

Corollary 3.3.1 provides important robustness properties in Section 3.3. For now, I point out how Theorem 3.3.1 can help test for 𝑝0.
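In the just-identified case of Corollary 3.3.1 the first stage has a closed form: partitioning 𝒁̄ = (𝒚̄, 𝑿̄) into its first 𝑇 − 𝑝 rows 𝒁̄1 and last 𝑝 rows 𝒁̄2, the condition 𝑯̂′𝒁̄ = 𝒁̄1 + 𝚯̂𝒁̄2 = 0 gives 𝚯̂ = −𝒁̄1𝒁̄2⁻¹. A sketch under an assumed illustrative DGP (all names and parameter values are my own):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, K = 2000, 5, 1
p = K + 1                                  # just-identified: p0 = K + 1
beta0 = np.array([[0.7]])
F0 = rng.normal(size=(T, p))
# Loading means chosen so E([gamma_i, Gamma_i]) has full row rank p
gam = rng.normal(size=(N, p, 1)) + np.array([[1.0], [0.0]])
Gam = rng.normal(size=(N, p, K)) + np.array([[0.0], [1.0]])
X = F0 @ Gam + rng.normal(size=(N, T, K))
y = X @ beta0 + F0 @ gam + rng.normal(size=(N, T, 1))

# Corollary 3.3.1: theta_hat solves H(theta)'(y_bar, X_bar) = 0 exactly
Zbar = np.concatenate([y.mean(0), X.mean(0)], axis=1)   # T x (K+1)
Z1, Z2 = Zbar[:T - p], Zbar[T - p:]                     # (T-p) x p and p x p blocks
Theta_hat = -Z1 @ np.linalg.inv(Z2)
H = np.vstack([np.eye(T - p), Theta_hat.T])             # estimated QLD matrix
print(np.abs(H.T @ Zbar).max())                         # zero up to rounding
```

No nonlinear optimization is involved here, which is one practical appeal of the just-identified first stage relative to the full Ahn et al. (2013) GMM problem.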
There are (๐‘‡ โˆ’ ๐‘ 0 )(๐พ + 1) moments and (๐‘‡ โˆ’ ๐‘ 0 ) ๐‘ 0 parameters, so the system is underidentified when ๐พ + 1 < ๐‘ 0 and just identified like in Corollary 3.3.1 when ๐พ + 1 = ๐‘ 0 . When ๐พ + 1 > ๐‘ 0 , we have overidentifying restrictions to test for ๐‘ 0 . Ahn et al. (2013) recommend testing for ๐‘ 0 by first setting ๐‘ = 0 and setting ๐‘ฏ = ๐‘ฐ๐‘‡ . If the hypothesis is rejected using the statistic in part (2) of Theorem 3.3.1, move to ๐‘ = 1. Continue until the null hypothesis cannot be rejected. I refer the reader to Section 3 of Ahn et al. (2013) for additional details and tests. I follow a similar approach to testing based off of the moments in Theorem 3.3.1. I now demonstrate that the additional moments generally improve efficiency of the Ahn et al. (2013) GMM estimator by demonstrating that the CCE modelโ€™s reduced form assumption implies additional non-redundant moment conditions. The following theorem completely characterizes when the moments ๐ธ (๐‘ฏ0โ€ฒ ๐‘ฟ๐‘– ) = ๐ธ (๐‘ฏ0โ€ฒ ๐‘ฝ๐‘– ) = 0 are partially redundant for estimating ๐œท0 using the Ahn et al. (2013) estimator, meaning its asymptotic variance is the same with or without the additional moments. I do not include ๐ธ (๐‘ฏ0โ€ฒ ๐’š๐‘– ) = 0 because the efficiency result would require additional 65 assumptions on ๐‘‰ ๐‘Ž๐‘Ÿ (๐’–๐‘– ). Let ๐’ˆ๐‘–1 ( ๐œท, ๐œฝ) = vec( ๐‘ฟ๐‘– ) โŠ— ๐‘ฏ(๐œฝ) โ€ฒ ( ๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œท) and ๐’ˆ๐‘–2 (๐œฝ) = ๐‘ฏ(๐œฝ) โ€ฒ๐‘ฝ๐‘– be the residuals associated with the moment conditions from equations (3.2.7) and (3.2.8) respectively. Let ๐‘ซ 11 = ๐ธ (โˆ‡ ๐œท ๐’ˆ๐‘–1 ( ๐œท0 , ๐œฝ 0 )), ๐‘ซ 12 = ๐ธ (โˆ‡๐œฝ ๐’ˆ๐‘–1 ( ๐œท0 , ๐œฝ 0 )), and ๐›€11 = ๐‘‰ ๐‘Ž๐‘Ÿ ( ๐’ˆ๐‘–1 ( ๐œท0 , ๐œฝ 0 )). Theorem 3.3.2. Given Assumptions 1 and 2, suppose ๐ธ (๐’–๐‘– | ๐‘ฟ๐‘– ) and the Identifying Assumptions in the Appendix hold. 
Then the moment conditions 𝐸(𝒈𝑖2(𝜽0)) = 0 are partially redundant for estimating 𝜷0 if and only if

𝑫12′𝛀11⁻¹𝑫11 = 0   (3.3.2)

Proof. See Appendix for proof. The extra assumptions are only needed so that (𝜷0′, 𝜽0′)′ is identified by 𝐸(𝒈𝑖1(𝜷0, 𝜽0)) = 0 and are equivalent to the Basic Assumptions of Ahn et al. (2013). I assume 𝐸(𝒖𝑖|𝑿𝑖) = 0, whereas Assumption 2 implies the weaker 𝐸(𝒖𝑖|𝑽𝑖) = 0. I make the stronger exogeneity assumption for simplicity, though the moment conditions in 𝒈𝑖1 could be reformulated with 𝑯0′𝑽𝑖 ⊂ 𝒘𝑖. □

There is no reason to believe equation (3.3.2) holds in general, and so the additional moments improve the efficiency of estimating 𝜷0 using the QLD residual in equation (3.2.7). Trivial cases where equation (3.3.2) holds include 𝜽0 being known to the researcher and 𝑝0 = 0.

3.3.2 Pooled and Mean Group QLD

The QLD GMM approach of Ahn et al. (2013) can select appropriate instruments for a given time period. However, an abundance of moment conditions can induce finite-sample bias and local stationary points in the GMM objective function. This section introduces the linear pooled and mean group estimators based on the QLD transformation. They allow for a variety of rank and exogeneity conditions which are especially useful when the researcher includes heterogeneous slopes in the model, as in Section 3.4. I propose first estimating the parameters 𝜽0 using the pure factor structure assumed in 𝒁𝑖 and then running the relevant regressions using the "defactored" data 𝑯̂′𝒚𝑖 and 𝑯̂′𝑿𝑖:
โˆ’1 ๐‘ โˆ‘๏ธ โˆ‘๏ธ ๐œท b๐‘„๐ฟ๐ท๐‘ƒ = ๐‘ฟ๐‘–โ€ฒ ๐‘ฏb๐‘ฏbโ€ฒ ๐‘ฟ๐‘– ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ b๐‘ฏbโ€ฒ ๐’š๐‘– (3.3.3) ๐‘–=1 ๐‘–=1 ๐‘ 1 โˆ‘๏ธ โ€ฒ b bโ€ฒ โˆ’1 โ€ฒ b bโ€ฒ ๐œท๐‘„๐ฟ๐ท ๐‘€๐บ = b ( ๐‘ฟ ๐‘ฏ ๐‘ฏ ๐‘ฟ๐‘– ) ๐‘ฟ๐‘– ๐‘ฏ ๐‘ฏ ๐’š๐‘– (3.3.4) ๐‘ ๐‘–=1 ๐‘– The pooled quasi-long-differencing (QLDP) estimator defined by equation (3.3.3) is the pooled OLS estimator from regressing ๐‘ฏ bโ€ฒ ๐’š๐‘– on ๐‘ฏbโ€ฒ ๐‘ฟ๐‘– . A similar estimator was mentioned in Breitung and Hansen (2020) but not thoroughly studied. The mean group quasi-long-differencing (QLDMG) estimator defined by equation (3.3.4) can be obtained by running the ๐‘‡ โˆ’ ๐‘ observation time series regression ๐‘ฏ bโ€ฒ ๐’š๐‘– on ๐‘ฏ bโ€ฒ ๐‘ฟ๐‘– for each ๐‘–, and then averaging each of the ๐‘ estimates. It should be noted that ๐‘ฏbโ€ฒ can be used to โ€œdefactor" any variables which are linear in ๐‘ญ0 and not just those used in the estimator of ๐œฝ 0 . This observation allows for 2SLS estimation using outside instruments. Intuitively, the mean group estimator should allow for arbitrarily random slopes at the cost of rank assumptions and precision. If the model is thought to have homogeneous slopes, one should generally choose the pooled estimator over the mean group one. I ignore its asymptotic properties until Section 3.4 when I introduce random slopes. However, the pooled QLD allows us to relax the rank conditions used in Ahn et al. (2013) and Westerlund et al. (2019). Instead of ๐ธ (vec( ๐‘ฟ๐‘– ) โŠ— ๐‘ฏ0โ€ฒ (๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œท0 )) = 0, we can use the moments ๐ธ ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ (๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œท0 )) = 0. This residual represents a just-identified system of moments, requires no outside instruments, and allows ๐ธ (๐œธ๐‘– ๐œธ๐‘–โ€ฒ) and ๐ธ (๐œธ๐‘– ) to be completely arbitrary. Further, since estimation of ๐œฝ 0 comes from the reduced form moments, I do not require ๐‘‡ > ๐พ + 1. 
Before proving asymptotic normality, I point out that the case of 𝑝 = 𝐾 + 1 implies a powerful algebraic fact about the pooled QLD estimator: it is the same whether or not the researcher includes common variables in the regression. That is, all variables which do not vary over 𝑖 are irrelevant to the estimation of 𝜷0, which includes time dummies. Further, the pooled QLD residuals are the same with or without the inclusion of common variables. Note that I say 𝑝 = 𝐾 + 1 instead of 𝑝0 = 𝐾 + 1, as the following theorem is purely algebraic and independent of model specification or statistical properties. Let 𝑾 be a 𝑇 × 𝑞 matrix of common variables, and let (𝜶̃′, 𝜷̃′)′ be the estimates from the pooled regression of 𝑯̂′𝒚𝑖 on 𝑯̂′[𝑾, 𝑿𝑖]. Finally, let 𝝐̂𝑖 = 𝒚𝑖 − 𝑿𝑖𝜷̂𝑄𝐿𝐷𝑃 and 𝝐̃𝑖 = 𝒚𝑖 − 𝑿𝑖𝜷̃ − 𝑾𝜶̃ be the associated residuals.

Theorem 3.3.3. Suppose 𝑝 = 𝐾 + 1. If Rk(𝑯̂′𝑾) = 𝑞, then

1. 𝜷̂𝑄𝐿𝐷𝑃 = 𝜷̃.
2. 𝜶̃ = 0.
3. 𝝐̂𝑖 = 𝝐̃𝑖.

Proof. By Corollary 3.3.1, the first-stage estimator 𝜽̂ solves 𝑯̂′[𝒚̄, 𝑿̄] = 0. Then

Σ_{𝑖=1}^{𝑁} 𝑿𝑖′𝑯̂𝑯̂′𝑾 = 𝑁𝑿̄′𝑯̂𝑯̂′𝑾 = 0

by Corollary 3.3.1, so 𝑯̂′𝑿𝑖 and 𝑯̂′𝑾 are uncorrelated in the sample. Thus 𝜷̃ = 𝜷̂𝑄𝐿𝐷𝑃. Using the same argument,

𝜶̃ = ( Σ_{𝑖=1}^{𝑁} 𝑾′𝑯̂𝑯̂′𝑾 )⁻¹ Σ_{𝑖=1}^{𝑁} 𝑾′𝑯̂𝑯̂′𝒚𝑖 = ( 𝑁𝑾′𝑯̂𝑯̂′𝑾 )⁻¹ 𝑁𝑾′𝑯̂𝑯̂′𝒚̄ = 0

As 𝜶̃ = 0 and 𝜷̃ = 𝜷̂𝑄𝐿𝐷𝑃, we have 𝝐̃𝑖 = 𝝐̂𝑖. □

The above result suggests that when 𝑝 = 𝐾 + 1, the QLD matrix suffices to remove all unobserved time effects in the population, even those which do not interact with the heterogeneity.
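Theorem 3.3.3 can be confirmed numerically: under 𝑝 = 𝐾 + 1, augmenting the pooled QLD regression with common variables (here an assumed intercept and linear trend) returns 𝜶̃ = 0 and leaves the slope unchanged. A sketch with an illustrative DGP:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, K = 500, 5, 1
p = K + 1
beta0 = np.array([[0.7]])
F0 = rng.normal(size=(T, p))
gam = rng.normal(size=(N, p, 1)) + np.array([[1.0], [0.0]])
Gam = rng.normal(size=(N, p, K)) + np.array([[0.0], [1.0]])
X = F0 @ Gam + rng.normal(size=(N, T, K))
y = X @ beta0 + F0 @ gam + rng.normal(size=(N, T, 1))

# First stage solves H_hat'(y_bar, X_bar) = 0 (Corollary 3.3.1)
Zbar = np.concatenate([y.mean(0), X.mean(0)], axis=1)
Theta = -Zbar[:T - p] @ np.linalg.inv(Zbar[T - p:])
H = np.vstack([np.eye(T - p), Theta.T])
Xd, yd = H.T @ X, H.T @ y

# Pooled QLD without common variables
beta_qldp = np.linalg.solve(np.einsum('itk,itl->kl', Xd, Xd),
                            np.einsum('itk,itl->kl', Xd, yd))

# Add common variables W and re-run the pooled regression on [H'W, H'X_i]
W = np.column_stack([np.ones(T), np.arange(T, dtype=float)])   # T x q, q = 2
Wd = np.broadcast_to(H.T @ W, (N, T - p, 2))
R = np.concatenate([Wd, Xd], axis=2)
coef = np.linalg.solve(np.einsum('itk,itl->kl', R, R),
                       np.einsum('itk,itl->kl', R, yd))
print(coef[:2].ravel())        # alpha_tilde: zero up to rounding
print((coef[2:] - beta_qldp).ravel())   # slope unchanged up to rounding
```

The invariance is exact in the sample, not merely asymptotic, because 𝑯̂′𝑿̄ = 𝑯̂′𝒚̄ = 0 holds by construction of the just-identified first stage.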
The intuition is similar to the ‘zero sum’ class of estimators studied by Westerlund (2019). It may appear that Theorem 3.3.3 only applies in very special scenarios; however, simulation evidence in the Appendix suggests that overestimating 𝑝0 does not cause inconsistency. These results bolster the simulation evidence from Ahn et al. (2013), which suggests the same when using their GMM estimator. Breitung and Hansen (2020) also demonstrate that the Ahn et al. (2013) estimator performs well under the BIC method of estimating 𝑝0, which has a tendency to overestimate the number of factors. Overestimating 𝑝0 includes the case of incorrectly estimating factors when 𝑝0 = 0. Under strict exogeneity, CCE and QLD procedures will be consistent because their factor proxies are just functions of the exogenous variables. Reporting the QLDP which takes 𝑝 = 𝐾 + 1 could then serve as a robustness check if the estimated 𝑝0 is less than 𝐾 + 1. This fact is explored in a brief simulation study in Section 3.5.2. I now show asymptotic normality for the pooled QLD estimator. I demonstrate how first-stage estimation of 𝜽0 can affect the asymptotic distribution and show why ignoring this problem leads to incorrect standard errors even when pooled QLD is asymptotically normal. I briefly discuss why the standard errors in Westerlund et al. (2019) do not account for this problem. The full proof of asymptotic normality is given in the Appendix, so I will only sketch the problem here. Let 𝑨𝑃 = 𝐸(𝑽𝑖′𝑯0𝑯0′𝑽𝑖). I show in the Appendix that
โˆš ๐‘ 1 โˆ‘๏ธ โ€ฒ b bโ€ฒ ๐‘(๐œทb๐‘„๐ฟ๐ท๐‘ƒ โˆ’ ๐œท0 ) = ๐‘จโˆ’1 ๐‘ƒ โˆš ๐‘ฟ๐‘– ๐‘ฏ ๐‘ฏ (๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– ) + ๐‘œ ๐‘ (1) ๐‘ ๐‘–=1 After a mean value expansion about ๐œฝ 0 , and using the results from Theorem 3.3.1, the normalized estimator is โˆš โˆ’1 1 ๐‘ โˆ‘๏ธ ๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐’–๐‘– + ๐‘ฎ ๐‘ƒ ๐’“๐‘– (๐œฝ 0 ) + ๐‘œ ๐‘ (1)  ๐‘ ( ๐œท๐‘„๐ฟ๐ท๐‘ƒ โˆ’ ๐œท0 ) = ๐‘จ๐‘ƒ โˆš b ๐‘ ๐‘–=1 where ๐’“๐‘– (๐œฝ 0 ) is derived from Theorem 1 and ๐‘ฎ ๐‘ƒ = ๐ธ (โˆ‡๐œฝ ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ (๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– )) evaluated at ๐œฝ = ๐œฝ 0 . ๐‘ฎ ๐‘ƒ = 0 when ๐ธ (๐’–๐‘– โŠ— ๐‘ฝ๐‘– ) = 0, ๐ธ (๐’–๐‘– โŠ— ๐šช๐‘– ) = 0, and ๐ธ (๐‘ฝ๐‘– โŠ— ๐œธ๐‘– ) = 0. I only need exogeneity of ๐‘ฝ๐‘– with respect to ๐’–๐‘– for asymptotic normality, so the other assumptions only simplify the asymptotic variance. Westerlund et al. (2019) impose these assumptions which ignores the effect of first-stage estimation uncertainty. My result thus proves asymptotic normality of the pooled QLD under weaker assumptions than used in Westerlund et al. (2019) for the pooled CCE with an even more general asymptotic variance formula. In fact, one could only assume exogeneity on the last ๐‘ 0 elements of the differenced quantities, but this assumption is difficult to interpret. I now state the general asymptotic normality result assuming ๐‘ = ๐‘ 0 is known due to Theorem 3.3.1. Theorem 3.3.4. Given Assumptions 1 and 2, suppose that 69 1. ๐‘จ๐‘ƒ = ๐ธ (๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐‘ฝ๐‘– ) has full rank. 2. ๐ธ (๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐’–๐‘– ) = 0. ๐‘ Then ๐œท b๐‘„๐ฟ๐ท๐‘ƒ โ†’ ๐œท0 and โˆš ๐‘ ๐‘(๐œทb๐‘„๐ฟ๐ท๐‘ƒ โˆ’ ๐œท0 ) โ†’ ๐‘ (0, ๐‘จโˆ’1 ๐‘ฉ ๐‘ƒ ๐‘จโˆ’1 ) ๐‘ƒ ๐‘ƒ where ๐‘ฉ ๐‘ƒ = ๐ธ ((๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐’–๐‘– + ๐‘ฎ ๐‘ƒ ๐’“๐‘– (๐œฝ 0 ))(๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐’–๐‘– + ๐‘ฎ ๐‘ƒ ๐’“๐‘– (๐œฝ 0 )) โ€ฒ). 
If ๐ธ (๐’–๐‘– โŠ— ๐šช๐‘– ) = 0 and ๐ธ (๐‘ฝ๐‘– โŠ— ๐œธ๐‘– ) = 0, then ๐‘ฎ ๐‘ƒ = 0. Proof. See Appendix for proof and a derivation of ๐‘ฎ ๐‘ƒ and ๐’“๐‘– (๐œฝ 0 ). Condition (2) is not practically weaker than ๐ธ (๐’–๐‘– |๐‘ฝ๐‘– ) = 0 for linear estimation but I state it for completeness. โ–ก Remark (Joint estimation): The two-step procedure is less efficient than joint GMM estimation using ๐ธ (๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ( ๐’š๐‘– โˆ’๐‘ฝ๐‘– ๐œท0 )) = 0 and ๐ธ (๐‘ฏ0โ€ฒ ๐’๐‘– ) = 0 unless ๐‘ = ๐พ +1; see Ahn and Schmidt (1997). However, the ๐‘ = ๐พ +1 case confers the advantage of invariance to common variables from Theorem 3.3.3 and appears consistent even when ๐‘ 0 < ๐‘. There are also optimization issues involved in joint estimation because the moments which identify ๐œท0 are nonlinear in ๐œฝ 0 . โ–  Remark (Known factors): Eliminating known factors like random intercepts or polynomial time trends can make the QLD estimators more precise. Simply remove the known factors from [๐’š๐‘– , ๐‘ฟ๐‘– ] by regressing it, unit-by-unit, onto the known factors, then estimate ๐œฝ 0 as in Theorem 3.3.1 using the residuals. This procedure is equivalent to defining ๐‘ด = ๐‘ฐ๐‘‡ โˆ’ ๐‘ญ1 (๐‘ญ1โ€ฒ ๐‘ญ1 ) โˆ’1 ๐‘ญ1โ€ฒ , where ๐‘ญ1 are the known factors, and running estimation based off of (๐’š๐‘–โˆ— , ๐‘ฟ๐‘–โˆ— ) = ( ๐‘ฐ ๐‘ โŠ— ๐‘ด)(๐’š๐‘– , ๐‘ฟ๐‘– ). โ–  Remark (Bootstrap): While I provide analytic inference below, the standard errors can be quite complicated in general. Regardless of any additional restrictions which can simplify the โˆš calculation of standard errors, ๐‘ ( ๐œท b๐‘„๐ฟ๐ท๐‘ƒ โˆ’ ๐œท0 ) is asymptotically normal so that one can instead do inference via the nonparametric bootstrap. Just resample over ( ๐’š๐‘– , ๐‘ฟ๐‘– ), with ๐‘ฏ b estimated for each new sample to account for the first-stage estimation in the final standard errors. 
This procedure contrasts with Section 2 of the Supplement to Westerlund et al. (2019), which does not re-estimate $\hat{\boldsymbol{F}}$ with each new sample. I do not provide a proof of consistency because the problem is standard; Westerlund et al. (2019) needed a proof because the CCE projection matrix has a reduced-rank limit. ■

The asymptotic variance can be estimated by $\hat{\boldsymbol{A}}_P^{-1}\hat{\boldsymbol{B}}_P\hat{\boldsymbol{A}}_P^{-1}$ where
\[
\hat{\boldsymbol{A}}_P = \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{X}_i'\hat{\boldsymbol{H}}\hat{\boldsymbol{H}}'\boldsymbol{X}_i, \qquad \hat{\boldsymbol{B}}_P = \frac{1}{N}\sum_{i=1}^{N}\hat{\boldsymbol{v}}_i\hat{\boldsymbol{v}}_i'
\]
Here, $\hat{\boldsymbol{v}}_i = \boldsymbol{X}_i'\hat{\boldsymbol{H}}\hat{\boldsymbol{H}}'\hat{\boldsymbol{e}}_i + \hat{\boldsymbol{G}}_P(\hat{\boldsymbol{\theta}})\boldsymbol{r}_i(\hat{\boldsymbol{\theta}})$, where $\hat{\boldsymbol{e}}_i = \boldsymbol{y}_i - \boldsymbol{X}_i\hat{\boldsymbol{\beta}}_{QLDP}$ is the full pooled QLD residual. The gradient is
\[
\hat{\boldsymbol{G}}_P = \frac{1}{N}\sum_{i=1}^{N}\left[(\boldsymbol{I}_K \otimes \hat{\boldsymbol{e}}_i'\hat{\boldsymbol{H}})\begin{pmatrix} \boldsymbol{x}_i^{*1\prime} \otimes \boldsymbol{I}_{T-p_0} \\ \vdots \\ \boldsymbol{x}_i^{*K\prime} \otimes \boldsymbol{I}_{T-p_0} \end{pmatrix} + \boldsymbol{X}_i'\hat{\boldsymbol{H}}(\hat{\boldsymbol{e}}_i^{*\prime} \otimes \boldsymbol{I}_{T-p_0})\right] \tag{3.3.5}
\]
\[
\boldsymbol{r}_i(\hat{\boldsymbol{\theta}}) = (\hat{\boldsymbol{D}}_{\theta}'\hat{\boldsymbol{A}}_{\theta}^{-1}\hat{\boldsymbol{D}}_{\theta})^{-1}\hat{\boldsymbol{D}}_{\theta}'\hat{\boldsymbol{A}}_{\theta}^{-1}\,\mathrm{vec}(\hat{\boldsymbol{H}}'\boldsymbol{Z}_i) \tag{3.3.6}
\]
where a '$*$' denotes the last $p_0$ elements of a $T \times 1$ vector. The form of $\boldsymbol{r}_i(\hat{\boldsymbol{\theta}})$ comes from Theorem 3.3.1 and is derived in the proof of Theorem 3.3.4. The matrix $\boldsymbol{G}_P$ appears because of correlation between the full error $\boldsymbol{e}_i = \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i$ and the covariates $\boldsymbol{X}_i$, and the vector $\boldsymbol{r}_i$ comes from error in estimating $\boldsymbol{\theta}_0$ in the first stage. The regular cluster-robust standard errors for a pooled regression are only valid if $\boldsymbol{G}_P = 0$. Assuming factor loadings are independent of the errors causes this matrix to be zero, as in the classical CCE treatments of Pesaran (2006) and Westerlund et al. (2019).
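The bootstrap procedure described in the remark above, resampling whole cross-sectional units and re-estimating the first stage inside every replication, can be sketched as follows. This is a minimal illustration, not the dissertation's code; the `estimator` callable and all names are hypothetical:

```python
import numpy as np

def qld_bootstrap(y, X, estimator, n_boot=199, seed=0):
    """Nonparametric bootstrap over cross-sectional units (y_i, X_i).

    `estimator` maps a resampled (y, X) panel to a coefficient vector and
    must re-estimate the first-stage H(theta) internally, so the reported
    standard errors reflect first-stage estimation uncertainty.
    """
    rng = np.random.default_rng(seed)
    N = y.shape[0]
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)     # resample units with replacement
        draws.append(estimator(y[idx], X[idx]))
    draws = np.asarray(draws)
    return draws.std(axis=0, ddof=1)         # bootstrap standard errors
```

Because resampling is over units, the within-unit serial dependence of the errors is preserved automatically.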
Though the loadings are meant to model the correlation between $\boldsymbol{X}_i$ and all unobservables, they may still correlate with the errors due to misspecification. If there are additional factors in $\boldsymbol{y}_i$ not in $\boldsymbol{X}_i$, we can still estimate $\boldsymbol{\beta}_0$, but the asymptotic variances will depend on first-stage estimation of $\boldsymbol{\theta}_0$. In fact, if we allow for uncorrelated loadings, the CCE and QLD estimators exclude relevant information for estimation. Additionally assuming $E(\boldsymbol{V}_i|\boldsymbol{\gamma}_i) = 0$ as in Westerlund et al. (2019), we have:
\[
E((\boldsymbol{H}_0'\boldsymbol{V}_i) \otimes \boldsymbol{H}_0'(\boldsymbol{y}_i - \boldsymbol{V}_i\boldsymbol{\beta}_0)) = 0 \tag{3.3.7}
\]
\[
E((\boldsymbol{H}_0'\boldsymbol{V}_i) \otimes (\boldsymbol{y}_i - \boldsymbol{X}_i\boldsymbol{\beta}_0)) = 0 \tag{3.3.8}
\]
\[
E(\boldsymbol{X}_i \otimes \boldsymbol{H}_0'(\boldsymbol{y}_i - \boldsymbol{V}_i\boldsymbol{\beta}_0)) = 0 \tag{3.3.9}
\]
\[
E(\boldsymbol{H}_0'(\boldsymbol{y}_i - \boldsymbol{V}_i\boldsymbol{\beta}_0)) = 0 \tag{3.3.10}
\]
\[
E(\boldsymbol{H}_0'\boldsymbol{V}_i) = 0 \tag{3.3.11}
\]
Equations (3.3.7)-(3.3.11) list $(T - p_0)((T - p_0)K + 2TK + K + 1)$ moment conditions, which displays the strength of the CCE assumptions made in current applications. Without at least theoretically justifying $E(\boldsymbol{V}_i \otimes \boldsymbol{\gamma}_i) = 0$, CCE-based inference needs a modern treatment which accounts for first-stage estimation as in Brown et al. (2021). To summarize: if the loadings are allowed to be correlated, then the pooled CCE standard errors from Pesaran (2006) and Westerlund et al. (2019) are incorrect. If the loadings are assumed uncorrelated, then we have a significant number of unused moment restrictions. In fact, if first-stage estimation does not affect the asymptotic distribution, and the conditional covariance $E(\boldsymbol{u}_i\boldsymbol{u}_i'|\boldsymbol{X}_i)$ is estimable, the feasible version of the GLS estimator from Section 3.2 of Brown (2021) is $\sqrt{N}$-consistent and efficient among all estimators based on $E(\boldsymbol{M}_{\boldsymbol{F}_0}(\boldsymbol{y}_i - \boldsymbol{X}_i\boldsymbol{\beta}_0)) = 0$, in which case all the moments in equations (3.3.7)-(3.3.11) are redundant.
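For concreteness, once a first-stage estimate $\hat{\boldsymbol{H}} = \boldsymbol{H}(\hat{\boldsymbol{\theta}})$ is in hand, the pooled QLD point estimate of Section 3.3 is a simple linear computation. A minimal numpy sketch, with $\hat{\boldsymbol{H}}$ taken as given and all names hypothetical:

```python
import numpy as np

def pooled_qld(y, X, H):
    """Pooled QLD estimator: solves A beta = b, where
    A = sum_i X_i' H H' X_i and b = sum_i X_i' H H' y_i.

    y : (N, T) outcomes; X : (N, T, K) covariates
    H : (T, T - p) first-stage QLD matrix H(theta_hat), taken as given
    """
    P = H @ H.T                                 # (T, T) weighting matrix
    A = np.einsum('itk,ts,isl->kl', X, P, X)    # sum_i X_i' H H' X_i
    b = np.einsum('itk,ts,is->k', X, P, y)      # sum_i X_i' H H' y_i
    return np.linalg.solve(A, b)
```

The estimator is just pooled OLS on the transformed data, which is why the text emphasizes that only the standard errors, not the point estimate, require the first-stage correction terms.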
3.4 Heterogeneous Slopes

I now consider a generalization of the population model in equation (3.2.1) which allows for random slopes:
\[
\boldsymbol{y}_i = \boldsymbol{X}_i\boldsymbol{\beta}_i + \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i \tag{3.4.1}
\]
\[
\boldsymbol{\beta}_i = \boldsymbol{\beta}_0 + \boldsymbol{b}_i \tag{3.4.2}
\]
\[
\boldsymbol{b}_i \sim (0, \boldsymbol{\Sigma}) \tag{3.4.3}
\]
The random slopes model is identical to the forms in Wooldridge (2005) and Pesaran (2006), though the former assumes $\boldsymbol{F}_0$ is observable. Neither Ahn et al. (2013) nor Westerlund (2019) consider random slopes in their fixed-$T$ analyses. I summarize this model in the following assumption:

Assumption 3 (Random slopes):

1. $\boldsymbol{y}_i = \boldsymbol{X}_i(\boldsymbol{\beta}_0 + \boldsymbol{b}_i) + \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i$.

2. $(\boldsymbol{X}_i, \boldsymbol{b}_i, \boldsymbol{\gamma}_i, \boldsymbol{u}_i)$ are independent and identically distributed across $i$ with finite fourth moments.

3. $E(\boldsymbol{b}_i) = 0$. ■

The iid sampling assumption on $\boldsymbol{b}_i$ does not rule out correlation between $\boldsymbol{b}_i$ and the other stochastic components of the model. Similarly, Assumption 3(3) places no restrictions on the correlation between $\boldsymbol{b}_i$ and $\boldsymbol{X}_i$. It only states that $\boldsymbol{b}_i$ is the heterogeneous, unobserved deviation from the population parameter $\boldsymbol{\beta}_0$. Most fixed-$T$ treatments of random slope models either exclude factors altogether or simplify the factor structure as in a fixed effects analysis. Examples of fixed effects treatments include Juhl and Lugovskyy (2014), Campello et al. (2019), and Breitung and Salish (2021). Though Pesaran (2006), Chudik and Pesaran (2015), Neal (2015), and Norkutė et al. (2021) allow for random slopes and arbitrary factors, they require $T$ to grow to infinity and impose strong exogeneity conditions which I avoid. Before continuing with the analysis, I want to address how the random slopes model changes first-stage estimation of $\boldsymbol{\theta}_0$. The pure factor model for $\boldsymbol{Z}_i$ in equation (3.2.4) now takes the form
\[
E(\boldsymbol{Z}_i) = \boldsymbol{F}_0 E(\boldsymbol{C}_i\boldsymbol{Q}_i) + E(\boldsymbol{U}_i\boldsymbol{Q}_i)
\]
where $\boldsymbol{U}_i = [\boldsymbol{u}_i, \boldsymbol{V}_i]$.
In order for the identification result in Lemma 3.3.1 to hold, we need two additional conditions. First, $\mathrm{Rk}(E(\boldsymbol{C}_i\boldsymbol{Q}_i)) = p_0$, which is reasonable given Assumption 1. We also need $E(\boldsymbol{Q}_i\boldsymbol{U}_i) = 0$, which necessitates $E(\boldsymbol{\beta}_i'\boldsymbol{v}_{it}) = 0$ for each $t$, implying that $\boldsymbol{b}_i$ and $\boldsymbol{v}_{it}$ are uncorrelated while allowing arbitrary correlation between $\boldsymbol{b}_i$ and $(\boldsymbol{\gamma}_i, \boldsymbol{\Gamma}_i)$. We could instead estimate $\boldsymbol{\theta}_0$ based on $E(\boldsymbol{H}_0'\boldsymbol{X}_i) = E(\boldsymbol{H}_0'\boldsymbol{V}_i) = 0$ and require $p_0 \le K$ instead of $K + 1$. The robustness result of Theorem 3.3.3(1) holds for $p = K$, but parts (2) and (3) are not necessarily true.

Remark (Testing for random slopes): Assumption 2 allows us to test for correlated random slopes. Assuming that $p_0 < K + 1$, we can test the model $E(\boldsymbol{H}_0'\boldsymbol{Z}_i) = 0$ using the standard overidentifying restrictions test. The moments are zero under Assumptions 2 and 3 only when $\boldsymbol{\beta}_i$ is uncorrelated with $\boldsymbol{V}_i$. ■

The remainder of this section assumes $\boldsymbol{\theta}_0$ is derived from the reduced form moments $E(\boldsymbol{H}_0'\boldsymbol{V}_i) = 0$, with an analogous result to Theorem 1, to avoid uncertainty related to the overidentifying restrictions test. I first consider the Ahn et al. (2013) estimator in the presence of random slopes. The GMM estimator cannot estimate the individual random slopes due to the well-known incidental parameters problem. As such, I consider estimation which ignores the random slopes, so that $\boldsymbol{X}_i\boldsymbol{b}_i$ is absorbed into the error. The Ahn et al. (2013) expected residual becomes
\[
E(\mathrm{vec}(\boldsymbol{X}_i) \otimes \boldsymbol{H}_0'(\boldsymbol{y}_i - \boldsymbol{X}_i\boldsymbol{\beta}_0)) = E(\mathrm{vec}(\boldsymbol{X}_i) \otimes \boldsymbol{H}_0'\boldsymbol{X}_i\boldsymbol{b}_i) \tag{3.4.4}
\]
Theorem 3.4.1. Under Assumptions 1 and 3, $(\boldsymbol{\beta}_0', \boldsymbol{\theta}_0')'$ is identified by equation (3.4.4) if and only if
\[
E(\mathrm{vec}(\boldsymbol{X}_i) \otimes \boldsymbol{H}_0'\boldsymbol{V}_i\boldsymbol{b}_i) = 0
\]
Proof.
The proof is a corollary of the identification result presented in Section 3.1 of Ahn et al. (2013). □

Murtazashvili and Wooldridge (2008) consider IV estimation with random slopes and known factors. The exogeneity condition in Theorem 3.4.1 can depend on the type of instruments available. If there is a vector $\boldsymbol{w}_i$ of outside instruments, one sufficient condition is
\[
Cov(\boldsymbol{H}_0'\boldsymbol{X}_i, \boldsymbol{b}_i|\boldsymbol{w}_i) = Cov(\boldsymbol{H}_0'\boldsymbol{X}_i, \boldsymbol{b}_i) = 0 \tag{3.4.5}
\]
which is similar to Assumption 3.3 of Murtazashvili and Wooldridge (2008). With strictly exogenous covariates, the exogeneity condition is more similar to equations (12) and (13) of Wooldridge (2005), who considers fixed effects OLS. Wooldridge shows that pooled OLS is robust to heterogeneous slopes which are uncorrelated with the matrix of second moments of the defactored covariates; that is, $E(\boldsymbol{X}_i'\boldsymbol{M}_{\boldsymbol{F}_0}\boldsymbol{X}_i\boldsymbol{b}_i) = 0$, where he also assumes $\boldsymbol{F}_0$ is known. An even simpler sufficient condition would be $E(\boldsymbol{b}_i|\boldsymbol{X}_i) = 0$, which is in fact weaker than the random slope assumption of Pesaran (2006), who assumes $\boldsymbol{b}_i$ is independent of all stochastic components of the model. The Ahn et al. (2013) estimator requires stronger exogeneity and rank conditions than Wooldridge (2005) and Murtazashvili and Wooldridge (2008) because $\boldsymbol{\theta}_0$ needs to be estimated along with $\boldsymbol{\beta}_0$. If we add Assumption 2, we are able to obtain a first-stage $\sqrt{N}$-consistent estimator of $\boldsymbol{\theta}_0$ by Theorem 3.3.1, and so joint identification of $(\boldsymbol{\beta}_0', \boldsymbol{\theta}_0')'$ is irrelevant. This first-stage estimator allows us to substantially weaken the identification requirements for $\boldsymbol{\beta}_0$, which allows for estimation in a broader class of settings. Using the given estimator $\hat{\boldsymbol{\theta}}$ from Theorem 3.3.1, I study the pooled QLD estimator in the context of heterogeneous slopes.

Theorem 3.4.2.
Given Assumptions 2 and 3, where $\mathrm{Rk}(E(\boldsymbol{\Gamma}_i)) = p_0 \le K$, suppose that

1. $\boldsymbol{A}_P = E(\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'\boldsymbol{V}_i)$ has full rank.

2. $E(\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'(\boldsymbol{V}_i\boldsymbol{b}_i + \boldsymbol{u}_i)) = 0$.

Then $\hat{\boldsymbol{\beta}}_{QLDP} \xrightarrow{p} \boldsymbol{\beta}_0$ and
\[
\sqrt{N}(\hat{\boldsymbol{\beta}}_{QLDP} - \boldsymbol{\beta}_0) \xrightarrow{d} N(0, \boldsymbol{A}_P^{-1}\boldsymbol{B}_P\boldsymbol{A}_P^{-1})
\]
where $\boldsymbol{B}_P = E((\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'(\boldsymbol{V}_i\boldsymbol{b}_i + \boldsymbol{u}_i) + \boldsymbol{G}_P\boldsymbol{r}_{x,i}(\boldsymbol{\theta}_0))(\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'(\boldsymbol{V}_i\boldsymbol{b}_i + \boldsymbol{u}_i) + \boldsymbol{G}_P\boldsymbol{r}_{x,i}(\boldsymbol{\theta}_0))')$, $\boldsymbol{G}_P = E(\nabla_{\boldsymbol{\theta}}\,\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'(\boldsymbol{X}_i\boldsymbol{b}_i + \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i))$, and $\boldsymbol{r}_{x,i}(\boldsymbol{\theta}_0)$ is given in the Appendix. If $E(\boldsymbol{u}_i \otimes \boldsymbol{\Gamma}_i) = 0$, $E(\boldsymbol{V}_i \otimes \boldsymbol{b}_i) = 0$, and $E(\boldsymbol{V}_i \otimes \boldsymbol{\gamma}_i) = 0$, then $\boldsymbol{G}_P = 0$.

Proof. The proof is identical to the proof of Theorem 3.3.4 with the full error $\boldsymbol{e}_i = \boldsymbol{X}_i\boldsymbol{b}_i + \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i$. While $\boldsymbol{B}_P$ does not have the same form as in Theorem 3.3.4, the standard errors are calculated the same way but with $\boldsymbol{r}_{x,i}$ in place of $\boldsymbol{r}_i$, so I use the same notation. The additional rank assumption on $E(\boldsymbol{\Gamma}_i)$ allows us to estimate $\boldsymbol{\theta}_0$ via $E(\boldsymbol{H}_0'\boldsymbol{V}_i) = 0$, which overcomes the problems of correlation between $\boldsymbol{\beta}_i$ and $\boldsymbol{V}_i$. The asymptotic variance of $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ and the computation of $\boldsymbol{r}_{x,i}$ are given in the Appendix. □

Consistency is not affected by the first-stage estimates of $\boldsymbol{\theta}_0$ even with random slopes, so the exogeneity conditions needed are identical in spirit to Wooldridge (2005), who assumes known factors. I also do not require independence between $\boldsymbol{b}_i$ and $(\boldsymbol{X}_i, \boldsymbol{u}_i)$ like Pesaran (2006), but I still restrict the correlation between $\boldsymbol{X}_i$ and $\boldsymbol{b}_i$.
This condition can be weakened via mean group estimation, which allows an arbitrary conditional distribution $D(\boldsymbol{b}_i|\boldsymbol{X}_i)$ at the expense of much stronger rank and exogeneity conditions. I now state consistency and asymptotic normality for the mean group QLD estimator. Again, $\hat{\boldsymbol{\theta}}$ is derived from $E(\boldsymbol{H}_0'\boldsymbol{V}_i) = 0$. Define $\mathcal{T}$ as the parameter space of $\boldsymbol{\theta}_0$. Finally, let $a_i(\boldsymbol{\theta}) = \big(\sqrt{\sum_{k=1}^{K}\sigma_k(\boldsymbol{X}_i'\boldsymbol{H}(\boldsymbol{\theta})\boldsymbol{H}(\boldsymbol{\theta})'\boldsymbol{X}_i)}\big)^{-1}$, where $\{\sigma_k(\boldsymbol{D})\}_{k=1}^{K}$ are the singular values of the $K \times K$ matrix $\boldsymbol{D}$.

Theorem 3.4.3. Given Assumptions 2 and 3, where $\mathrm{Rk}(E(\boldsymbol{\Gamma}_i)) = p_0 \le K$, suppose that

1. The eigenvalues of $\boldsymbol{X}_i'\boldsymbol{H}(\boldsymbol{\theta})\boldsymbol{H}(\boldsymbol{\theta})'\boldsymbol{X}_i$ are almost surely positive uniformly over $\mathcal{T}$.

2. Uniformly over $\mathcal{T}$, $\max\left\{E(a_i(\boldsymbol{\theta})\|\boldsymbol{X}_i\|\|\boldsymbol{u}_i\|),\; E(a_i(\boldsymbol{\theta})^2\|\boldsymbol{X}_i\|^3\|\boldsymbol{u}_i\|)\right\} < \infty$.

3. $\mathcal{T}$ is a compact subset of $\mathbb{R}^{(T-p_0)p_0}$.

Then $\hat{\boldsymbol{\beta}}_{QLDMG} \xrightarrow{p} \boldsymbol{\beta}_0$ and
\[
\sqrt{N}(\hat{\boldsymbol{\beta}}_{QLDMG} - \boldsymbol{\beta}_0) \xrightarrow{d} N(0, \boldsymbol{B}_{MG})
\]
where $\boldsymbol{B}_{MG} = E\big(((\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'\boldsymbol{V}_i)^{-1}\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'\boldsymbol{u}_i + \boldsymbol{G}_{MG}\boldsymbol{r}_{x,i}(\boldsymbol{\theta}_0))((\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'\boldsymbol{V}_i)^{-1}\boldsymbol{V}_i'\boldsymbol{H}_0\boldsymbol{H}_0'\boldsymbol{u}_i + \boldsymbol{G}_{MG}\boldsymbol{r}_{x,i}(\boldsymbol{\theta}_0))'\big)$. If $E(\boldsymbol{b}_i|\boldsymbol{V}_i) = 0$ and $E(\boldsymbol{V}_i \otimes \boldsymbol{\gamma}_i) = 0$, then $\boldsymbol{G}_{MG} = 0$.

Proof. See Appendix for the proof and the derivation of $\boldsymbol{G}_{MG}$. Note that Assumption 2 implies $E(\boldsymbol{u}_i|\boldsymbol{V}_i) = 0$. □

Standard errors are derived similarly to the pooled QLD estimator in Section 3.3.2.
Let ๐‘  โ€ฒ b= 1 โˆ‘๏ธ  ๐‘ฉ ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ b๐‘ฏbโ€ฒ ๐‘ฟ๐‘– ) โˆ’1 ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ b๐‘ฏ bโ€ฒb๐๐‘– ๐‘ฎ b ( ๐‘ฟโ€ฒ ๐‘ฏ b ๐‘€๐บ ๐’“ ๐‘ฅ,๐‘– ( ๐œฝ) ๐‘– b ๐‘ฏ b โ€ฒ ๐‘ฟ ๐‘– ) โˆ’1 โ€ฒ b bโ€ฒ b ๐‘ฟ๐‘– ๐‘ฏ ๐‘ฏ ๐ b ๐‘ฎ ๐’“ ๐‘– ๐‘€๐บ ๐‘ฅ,๐‘– ( ๐œฝ) b (3.4.6) ๐‘ ๐‘–=1 76 where ๐b๐‘– = ๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œทb๐ถ๐ถ๐ธ ๐‘€๐บ is the mean group QLD residual and ๐’“ ๐‘ฅ,๐‘– ( ๐œฝ) b comes from Lemma .0.2 in the Appendix. The gradient ๐‘ฎ ๐‘€๐บ can be estimated via ๐‘ 1 โˆ‘๏ธ  โ€ฒ b bโ€ฒ  โ€ฒ b bโ€ฒ โˆ’1 โ€ฒ b bโ€ฒ โˆ’1  ๐‘ฎ ๐‘€๐บ = b โˆ’ ๐‘ฐ๐พ โŠ— b ๐๐‘– ๐‘ฏ ๐‘ฏ ๐‘ฟ๐‘– ( ๐‘ฟ๐‘– ๐‘ฏ ๐‘ฏ ๐‘ฟ๐‘– ) โŠ— ( ๐‘ฟ๐‘– ๐‘ฏ ๐‘ฏ ๐‘ฟ๐‘– ) ( ๐‘ฐ๐พ 2 + ๐‘ฒ๐พ )( ๐‘ฐ๐พ โŠ— ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ)โˆ— b ๐‘ ๐‘–=1 ยฉ ๐’™๐‘– โˆ—1โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยช ยญ ยฎ .. โˆ—ยญ ยฎ+ ยญ ยฎ . ยญ ยฎ ยญ ยฎ ๐’™๐‘– โˆ—๐พ โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยซ ยฌ ยฉ ๐’™๐‘– โˆ— โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยช ยญ 1 ยฉ ยช ยญ ยฎ ยฎ .. + ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ bโ€ฒ ๐‘ฟ๐‘– ) โˆ’1 ยญยญ ๐‘ฐ๐พ โŠ— b๐๐‘–โ€ฒ ๐‘ฏ โ€ฒ b โˆ—โ€ฒ  b๐‘ฏ b ยญยญ . ยฎ ยฎ + ๐‘ฟ๐‘– ๐‘ฏ ๐ b ๐‘– โŠ— ๐‘ฐ๐‘‡โˆ’๐‘ 0 ยฎ ยฎ ยญ ยญ ยฎ ยฎ ยญ ยญ ยฎ ยฎ ๐’™๐‘– โˆ—๐พ โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยซ ยซ ยฌ ยฌ where ๐‘ฒ๐พ is the ๐พ 2 ร— ๐พ 2 commutation matrix. As discussed in Section 3.3.2, Theorem 3.4.3 is the first fixed-๐‘‡ proof of asymptotic normality for a mean group estimator which allows for arbitrary random factors. While I believe the mean group CCE estimator can be adjusted to allow ๐‘‡ fixed, it has yet to be proved, as Pesaran (2006) required ๐‘‡ โ†’ โˆž. Further, it is likely that a modern proof using the methods of Karabiyik et โˆš al. (2017) and Westerlund et al. (2019) is required. Like with the pooled estimator, the ๐‘- asymptotic normal convergence result in Theorem 3.4.3 implies that inference can be done via the usual nonparametric bootstrap, estimating ๐œฝb for each new bootstrap sample. 
Remark (Order conditions): Similar to the pooled estimator, one advantage of the QLD transformation is that it allows for more variables than CCE when $p_0$ is small. CCE uses the cross-sectional averages of $(\boldsymbol{y}, \boldsymbol{X})$ to control for the factors. The rank of $\boldsymbol{M}_{\hat{\boldsymbol{F}}}$ is generally $T - (K + 1)$ in finite samples, regardless of the number of factors. The rank of $\hat{\boldsymbol{H}}\hat{\boldsymbol{H}}'$ is $T - p$, which is assumed to be greater than $T - (K + 1)$ in Westerlund et al. (2019). ■

One consequence of the strong rank conditions is that we cannot allow variables which equal zero for all $t$ with positive probability. This rules out demographic dummy variables, which are common in applied microeconometrics. Instead, we could split the sample and run mean group estimation on each demographic subsample. The estimator's precision will suffer, but this technique allows us to estimate different slope means for different groups in the population.

3.5 Simulations

This section considers the finite-sample performance of the QLD estimators compared to the GMM and CCE estimators of Ahn et al. (2013) and Pesaran (2006), respectively.

3.5.1 Main Results

The main model is
\[
\boldsymbol{y}_i = \boldsymbol{X}_i\boldsymbol{\beta}_0 + \boldsymbol{F}_0\boldsymbol{\gamma}_i + \boldsymbol{u}_i
\]
\[
\boldsymbol{X}_i = \boldsymbol{F}_0\boldsymbol{\Gamma}_i + \boldsymbol{V}_i
\]
as in Assumptions 1 and 2. There are two variables with slopes $\boldsymbol{\beta}_0 = (1, 1)'$. I do not include random slopes, as they would only serve to increase the amount of noise in the model and restrict the first-stage estimation of $\boldsymbol{\theta}_0$ for the QLD estimators and the cross-sectional averages for the CCE estimator. Theorems 3.4.2 and 3.4.3 dictate theoretically how the estimators should perform in given scenarios. I refer the reader to Campello et al. (2019) for simulation studies regarding the performance of pooled estimators when slopes are correlated with the variables of interest.
The two factors are generated as AR(1) processes with parameters 0.75 and $-0.75$ respectively, with initial values drawn from a normal distribution with mean 1 and variance 1. The factors are generated once and then fixed over repeated replications. The simulations do not substantively change if the factors are repeatedly drawn.⁴ As described earlier, since $T$ is small and fixed, it is the factor loadings which cause problems asymptotically, not the factors. The loadings on $\boldsymbol{X}_i$ are drawn as
\[
\boldsymbol{\Gamma}_i \sim \begin{pmatrix} N(1,1) & N(0,1) \\ N(0,1) & N(1,1) \end{pmatrix}
\]
so that $\boldsymbol{\theta}_0$ is identified from the reduced form moments. The loadings in $\boldsymbol{y}_i$ are drawn as
\[
\boldsymbol{\gamma}_i \sim \begin{pmatrix} N(\Gamma_{1,1}, 1) \\ N(\Gamma_{2,2}, 1) \end{pmatrix}
\]
The errors $\boldsymbol{u}_i$ and $\boldsymbol{V}_{ik}$ $(k = 1, 2)$ are drawn from a multivariate normal distribution with mean $\boldsymbol{0}_{T \times 1}$ and variance $\boldsymbol{C}$, where $\boldsymbol{C}$ is the correlation matrix of an AR(1) process with parameter 0.75. That is, the two errors in $\boldsymbol{V}_i = (\boldsymbol{V}_{i1}, \boldsymbol{V}_{i2})$ are both drawn from $MVN(\boldsymbol{0}_{T \times 1}, \boldsymbol{C})$ but are independent of each other and of $\boldsymbol{u}_i$. Each simulation study includes 1000 replications.

Table 3.1 compares the Ahn et al. (2013) estimator with and without the additional moments $E(\boldsymbol{H}_0'\boldsymbol{Z}_i) = 0$. Both estimators are computed as two-step estimators where the optimal weight matrix is calculated with a consistent first-step estimator. The first-step estimator uses an identity weight matrix.

⁴ Additional simulations are available upon request.
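The design above can be reproduced with a short script. This is a sketch under the stated assumptions (the loadings enter the covariates as $\boldsymbol{X}_i = \boldsymbol{F}_0\boldsymbol{\Gamma}_i + \boldsymbol{V}_i$; function and variable names are hypothetical):

```python
import numpy as np

def simulate_panel(N, T, rho=0.75, seed=0):
    """DGP sketch for Section 3.5.1: two AR(1) factors (parameters 0.75 and
    -0.75), correlated loadings, and AR(1)-correlated errors u_i and V_i."""
    rng = np.random.default_rng(seed)
    # two factors, AR(1) with N(1, 1) initial values (drawn once per call)
    F = np.empty((T, 2))
    F[0] = rng.normal(1.0, 1.0, size=2)
    for t in range(1, T):
        F[t] = np.array([0.75, -0.75]) * F[t - 1] + rng.normal(size=2)
    # AR(1) correlation matrix for the errors
    C = rho ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    beta = np.array([1.0, 1.0])
    # loadings on X_i: means [[1, 0], [0, 1]], unit variance
    Gamma = rng.normal(np.array([[1.0, 0.0], [0.0, 1.0]]), 1.0, size=(N, 2, 2))
    # loadings on y_i: means (Gamma_{1,1}, Gamma_{2,2}), unit variance
    gamma = rng.normal(Gamma[:, [0, 1], [0, 1]], 1.0)
    V = rng.multivariate_normal(np.zeros(T), C, size=(N, 2)).transpose(0, 2, 1)
    u = rng.multivariate_normal(np.zeros(T), C, size=N)
    X = np.einsum('tp,npk->ntk', F, Gamma) + V        # X_i = F Gamma_i + V_i
    y = X @ beta + np.einsum('tp,np->nt', F, gamma) + u
    return y, X
```

A Monte Carlo study would call this in a loop (re-drawing only loadings and errors) and feed the output to the estimators above.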
Table 3.1: GMM estimators

                    Bias                  SD                  RMSE
               GMM1     GMM2        GMM1     GMM2        GMM1     GMM2
N = 50
  T = 3      0.0328  -0.0107      0.2326   0.1812      0.2349   0.1815
            -0.0053  -0.0167      0.1719   0.1690      0.1720   0.1698
  T = 4     -0.0019  -0.0225      0.1444   0.1518      0.1444   0.1535
             0.0137  -0.0196      0.1626   0.1424      0.1632   0.1438
  T = 5      0.0170  -0.0249      0.1701   0.1694      0.1710   0.1712
             0.1375  -0.0055      0.3080   0.2057      0.3373   0.2058
N = 300
  T = 3      0.0328  -0.0107      0.2326   0.1812      0.2349   0.1815
            -0.0053  -0.0167      0.1719   0.1690      0.1720   0.1698
  T = 4     -0.0019  -0.0225      0.1444   0.1518      0.1444   0.1535
             0.0137  -0.0196      0.1626   0.1424      0.1632   0.1438
  T = 5      0.0005  -0.0016      0.0363   0.0364      0.0363   0.0365
             0.0156  -0.0029      0.1014   0.0367      0.1026   0.0368

GMM1 is the GMM estimator based only on the Ahn et al. (2013) residual $E(\mathrm{vec}(\boldsymbol{X}_i) \otimes \boldsymbol{H}_0'(\boldsymbol{y}_i - \boldsymbol{X}_i\boldsymbol{\beta}_0)) = 0$, whereas GMM2 uses the Ahn et al. residual together with the additional moments $E(\boldsymbol{H}_0'\boldsymbol{Z}_i) = 0$. The GMM estimator using both sets of moments consistently outperforms the original Ahn et al. (2013) estimator in terms of both bias and standard deviation, implying that the additional moments are practically relevant in finite samples.

Before turning to a comparison of the pooled QLD and CCE estimators, I first investigate the performance of QLDP when $p_0$ is misspecified in estimation of $\boldsymbol{\theta}_0$. The simulation setting implies $p_0 = 2$, so I look at the performance of QLDP for $p = 1, 2, 3$. I reiterate that $p_0$ is given by the DGP and $p$ is the number of factors specified by the econometrician.
Table 3.2: Misspecifying $p_0$

                       Bias                          SD                           RMSE
              p=1     p=2     p=3         p=1     p=2     p=3         p=1     p=2     p=3
N = 50
  T = 4    0.2700  0.0078  0.0118      0.1677  0.1097  0.1466      0.3178  0.1100  0.1471
           0.4024  0.0029  0.0120      0.1814  0.1097  0.1561      0.4414  0.1098  0.1566
  T = 5    0.4662  0.0095  0.0154      0.3511  0.1005  0.1282      0.5836  0.1009  0.1291
           0.5372  0.0058  0.0119      0.4111  0.0950  0.1228      0.6764  0.0952  0.1234
  T = 6    0.1697  0.0074  0.0126      0.1534  0.0956  0.1239      0.2287  0.0959  0.1246
           0.5843  0.0132  0.0200      0.1516  0.1025  0.1222      0.6036  0.1034  0.1238
N = 300
  T = 4    0.2748 -0.0003  0.0000      0.0657  0.0424  0.0559      0.2826  0.0424  0.0559
           0.4087  0.0024  0.0030      0.0746  0.0411  0.0587      0.4154  0.0411  0.0588
  T = 5    0.5267  0.0008  0.0032      0.2545  0.0382  0.0491      0.5849  0.0383  0.0492
           0.5993  0.0007  0.0038      0.2953  0.0369  0.0474      0.6681  0.0369  0.0476
  T = 6    0.1484  0.0015  0.0027      0.0646  0.0392  0.0470      0.1618  0.0392  0.0471
           0.6191  0.0013  0.0020      0.0596  0.0406  0.0480      0.6220  0.0406  0.0480

Table 3.2 gives the results for the QLDP under the different specifications. My results track with previous simulation evidence provided by Ahn et al. (2013) and Breitung and Hansen (2020). Underestimating $p_0$ leads to substantial bias which does not decrease with $N$. However, overestimating $p_0$ leads to only slightly worse performance than correct specification. The bias is larger but decreases with $N$; in fact, even $N = 300$ gives reasonable bias for the $p = 3$ estimator. The $p = 3$ estimator also performs worse than the correctly specified estimator in terms of standard deviation, which is not surprising. Overall, I find evidence that overestimation of $p_0$ does not lead to substantial bias in estimation, but underestimating $p_0$ can.

I now turn to a comparison of the QLDP and CCEP estimators. Tables 3.3 and 3.4 look at the QLDP estimator compared to the CCEP estimator, where the QLD transformation is estimated under $p = p_0 = 2$. Table 3.3 contains results for $K = 2$ and Table 3.4 contains results for $K = 3$.
I include ๐พ = 3 because it demonstrates how CCE removes more information as ๐พ grows but QLD does not. First note that the CCEP is biased when ๐‘‡ = 3 as ๐พ + 1 = 3 and this order condition is not allowed. 80 However, the QLDP is still consistent here. Further, the QLD estimators takes ๐‘ 0 as known while the CCE estimators โ€œoverestimates" ๐‘ 0 with the cross-sectional averages, of which there are ๐พ + 1. One might suspect this overestimation leads to inefficiency which is demonstrated by the results of the simulations. The QLDP estimator consistently shows a 15%-25% decline in standard deviation over the CCE estimator. Further, the CCE identifying condition requires ๐‘‡ > ๐พ + 1 which causes severe bias when violated. The QLDP estimator significantly outperforms the CCEP estimator in every setting provided. Table 3.3: Pooled estimators, ๐พ = 2 Bias SD RMSE CCEP QLDP CCEP QLDP CCEP QLDP N = 50 T=3 -0.5525 0.0082 25.9618 0.1546 25.9676 0.1548 1.2734 0.0034 12.5824 0.1555 12.6467 0.1556 T=4 0.0118 0.0078 0.1466 0.1097 0.1471 0.1100 0.0120 0.0029 0.1561 0.1097 0.1566 0.1098 T=5 0.0197 0.0095 0.1220 0.1005 0.1236 0.1009 0.0089 0.0058 0.1152 0.0950 0.1155 0.0952 N = 300 T=3 0.0272 0.0024 2.7295 0.0580 2.7296 0.0581 0.9400 0.0026 3.3976 0.0585 3.5253 0.0585 T=4 0.0000 -0.0003 0.0559 0.0424 0.0559 0.0424 0.0030 0.0024 0.0587 0.0411 0.0588 0.0411 T=5 0.0050 0.0008 0.0464 0.0382 0.0467 0.0383 0.0027 0.0007 0.0441 0.0369 0.0442 0.0369 Comparing table 3.3 to table 3.1, the QLDP performs much better than either of the GMM estimators despite the fact that we know they are using valid instruments. That the QLDP has better finite-sample performance than the overidentified systems from Ahn et al. (2013) is most likely due to the fact that it uses a smaller, just identified system of moments. See the Appendix for additional simulations including larger values of ๐‘‡. 
Finally, I investigate the performance of the mean group quasi-long-differencing (QLDMG) and mean group common correlated effects (CCEMG) estimators. The QLDMG estimator is given by equation (3.3.4), and the CCEMG estimator is identical to the QLDMG estimator but with $\boldsymbol{M}_{\hat{\boldsymbol{F}}}$ in place of $\hat{\boldsymbol{H}}\hat{\boldsymbol{H}}'$. Consistency is proved in Pesaran (2006) but, like the pooled estimator, will eventually require a modern treatment which either controls for the asymptotic degeneracy in $\boldsymbol{M}_{\hat{\boldsymbol{F}}}$, as in Karabiyik et al. (2017) and Westerlund et al. (2019), or assumes full rank limits, as in Brown et al. (2021).

Table 3.4: Pooled estimators, $K = 3$

                      Bias                    SD                     RMSE
               CCEP     QLDP         CCEP     QLDP         CCEP     QLDP
N = 50
  T = 3      0.0875   0.0076       3.0883   0.1586       3.0895   0.1588
             1.0809   0.0094       2.2956   0.1594       2.5373   0.1597
             0.3240  -0.0018       7.6585   0.1560       7.6654   0.1560
  T = 4      0.1574   0.0041       3.1025   0.1105       3.1065   0.1106
             1.1709   0.0140       3.2437   0.1107       3.4486   0.1116
            -0.2552  -0.0047       6.7375   0.1089       6.7423   0.1090
  T = 5      0.0151   0.0066       0.1530   0.0986       0.1537   0.0988
             0.0039   0.0031       0.1495   0.0979       0.1495   0.0979
            -0.0072  -0.0041       0.1408   0.0958       0.1410   0.0959
N = 300
  T = 3      1.9936   0.0030      61.6795   0.0580      61.7117   0.0581
             2.5873   0.0007      45.5170   0.0578      45.5905   0.0578
            -0.8012   0.0017      17.5764   0.0570      17.5947   0.0570
  T = 4      0.0011   0.0008       0.0601   0.0397       0.0601   0.0397
             0.0028   0.0001       0.0559   0.0394       0.0560   0.0394
             0.0035   0.0009       0.0571   0.0378       0.0572   0.0378
  T = 5      0.0064   0.0028       2.0502   0.0400       2.0502   0.0401
             1.0163   0.0020       0.9861   0.0414       1.4160   0.0414
            -0.0826   0.0006       3.6462   0.0400       3.6471   0.0400

Table 3.5 contains the results for the mean group estimators, where the QLD transformation is estimated assuming $p = p_0 = 2$. I start at $T = 5$ so that $T - p_0 > p_0$ and the CCEMG estimator is well-defined. Despite $T > 2K + 1$ for each setting, the CCEMG estimator exhibits substantial bias when $T = 6$, though the QLDMG estimator appears unbiased. The QLDMG outperforms the CCEMG in terms of RMSE for each $N$ and $T$ besides $N = 600$ and $T = 8$.
We would expect the CCEMG to perform well relative to the QLDMG as $T$ grows due to the incidental parameters problem in the first-stage QLD estimation. However, even for moderately low values of $N$ and large values of $T$, the QLDMG retains good properties.

Table 3.5: Mean group estimators

                       Bias                      SD                       RMSE
              CCEMG    QLDMG          CCEMG    QLDMG          CCEMG    QLDMG
N = 50
  T = 5     -1.5703  -0.0055        34.8038   0.4837        34.8392   0.4837
            -0.4832   0.0256        18.2402   0.6523        18.2466   0.6529
  T = 6      0.0324   0.0056         0.4630   0.1737         0.4641   0.1738
             0.0256   0.0044         0.3774   0.1820         0.3782   0.1820
  T = 7      0.0187   0.0156         0.1670   0.1658         0.1681   0.1665
             0.0113   0.0102         0.1628   0.1574         0.1632   0.1577
N = 300
  T = 5     -1.2597  -0.0039        27.7644   0.1537        27.7929   0.1537
             1.1968  -0.0030        34.6115   0.1420        34.6322   0.1420
  T = 6     -0.0077   0.0039         0.2846   0.0767         0.2847   0.0768
             0.0116  -0.0004         0.1768   0.0745         0.1772   0.0745
  T = 7      0.0003   0.0000         0.0649   0.0641         0.0649   0.0641
             0.0010   0.0009         0.0677   0.0595         0.0677   0.0595

3.5.2 Comparison to TWFE

Theorem 3.3.3 suggests a certain robustness property of the QLDP estimator relative to the traditional TWFE estimator. If the factor structure reduces to the traditional two-way error $\boldsymbol{f}_t'\boldsymbol{\gamma}_i + u_{it} = \gamma_i + f_t + u_{it}$, the QLDP can accommodate the time and individual fixed effects without Assumption 2 holding. If one regresses out a heterogeneous intercept and estimates $\hat{\boldsymbol{\theta}}$ assuming $p = K + 1$, the QLDP estimator will be consistent even though it is nonlinear in the unobserved effects. I first demonstrate that TWFE is inconsistent in the presence of an arbitrary factor structure. The DGP is the same as in Section 3.5.1, so the QLDP results are identical to Table 3.2. TWFE performs poorly as expected. I then generate the data according to the two-way error model, so that
\[
y_{it} = x_{it1} + x_{it2} + t + \gamma_i + u_{it}
\]
where $t$ is the time effect and $\gamma_i \sim N(1, 1)$ is the individual effect.
The covariates are generated as
\[
x_{it1} \sim \mathrm{Poisson}(|c_i + t|)
\]
\[
x_{it2} \sim U(0, \log((c_i + t)^2))
\]
so that Assumption 2 does not hold.

Table 3.6: AR(1) factor structure

                      Bias                    SD                     RMSE
               TWFE     QLDP         TWFE     QLDP         TWFE     QLDP
N = 50
  T = 3      0.0791   0.0082       0.1366   0.1546       0.1578   0.1548
             0.8684   0.0034       0.1339   0.1555       0.8787   0.1556
  T = 4      0.1148   0.0078       0.1351   0.1097       0.1773   0.1100
             0.8321   0.0029       0.1330   0.1097       0.8427   0.1098
  T = 5      0.1116   0.0095       0.1290   0.1005       0.1706   0.1009
             0.8107   0.0058       0.1302   0.0950       0.8211   0.0952
N = 300
  T = 3      0.0765   0.0024       0.0528   0.0580       0.0929   0.0581
             0.8851   0.0026       0.0513   0.0585       0.8865   0.0585
  T = 4      0.1089  -0.0003       0.0527   0.0424       0.1210   0.0424
             0.8321   0.0024       0.0527   0.0411       0.8337   0.0411
  T = 5      0.1119   0.0008       0.0529   0.0382       0.1238   0.0383
             0.8055   0.0007       0.0530   0.0369       0.8073   0.0369

The simulation results in Table 3.7 compare TWFE to QLDP when $\hat{\boldsymbol{\theta}}$ is computed with $p = K + 1$ (despite the fact that $p_0 = 1$) and after removing a random intercept from $\boldsymbol{X}_i$ and $\boldsymbol{y}_i$ unit by unit. That is, let $\boldsymbol{M}$ be the $T \times T$ within transformation. I compute $\hat{\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\beta}}_{QLDP}$ with $\boldsymbol{y}_i^*$ and $\boldsymbol{X}_i^*$, where $\boldsymbol{y}_i^* = \boldsymbol{M}\boldsymbol{y}_i$ and $\boldsymbol{X}_i^* = \boldsymbol{M}\boldsymbol{X}_i$. The time effects are irrelevant because the QLDP estimator is the same whether or not they are controlled for in the regression.
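The unit-by-unit removal of a random intercept described above is just a premultiplication by the within transformation $\boldsymbol{M} = \boldsymbol{I}_T - \boldsymbol{1}\boldsymbol{1}'/T$. A minimal sketch (names hypothetical):

```python
import numpy as np

def within_transform(y, X):
    """Remove unit-specific intercepts by premultiplying each (y_i, X_i)
    with M = I_T - 11'/T, as in the TWFE-robustness comparison."""
    T = y.shape[1]
    M = np.eye(T) - np.ones((T, T)) / T
    y_star = y @ M.T                             # (M y_i)' stacked by rows
    X_star = np.einsum('ts,nsk->ntk', M, X)      # M X_i for each unit
    return y_star, X_star
```

The transformed panel can then be passed to the first-stage estimation of $\boldsymbol{\theta}$ and to the pooled QLD routine.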
Table 3.7: TWFE specification

                      Bias                    SD                     RMSE
               TWFE     QLDP         TWFE     QLDP         TWFE     QLDP
N = 50
  T = 4     -0.0004  -0.0044       0.0284   0.0388       0.0284   0.0390
            -0.0006  -0.0013       0.0184   0.0276       0.0184   0.0277
  T = 5     -0.0010  -0.0022       0.0240   0.0300       0.0240   0.0301
             0.0000  -0.0015       0.0142   0.0196       0.0142   0.0197
  T = 6     -0.0004  -0.0022       0.0199   0.0251       0.0199   0.0252
             0.0007  -0.0013       0.0126   0.0157       0.0127   0.0157
N = 300
  T = 4     -0.0003  -0.0004       0.0106   0.0142       0.0106   0.0142
             0.0003  -0.0005       0.0061   0.0086       0.0061   0.0086
  T = 5     -0.0001  -0.0004       0.0092   0.0116       0.0092   0.0116
            -0.0002  -0.0001       0.0054   0.0072       0.0054   0.0072
  T = 6      0.0001   0.0001       0.0082   0.0105       0.0082   0.0105
            -0.0002  -0.0005       0.0048   0.0065       0.0048   0.0065

While the TWFE estimator is clearly superior in terms of both bias and standard deviation when $N$ is small, the QLDP shows promising results. When $N = 300$, the two estimators are nearly indistinguishable in terms of their bias. The QLDP's RMSE is inflated because of its higher variance, but this result is unsurprising, as it is a more conservative estimator which is trying to eliminate more heterogeneity. However, it performs comparably well even though it removes more variation from the data than is needed.

3.6 Application

I evaluate the effect of expenditure per student on standardized test performance. I consider school district-level data from the state of Michigan over the school years 1995-2001. The state of Michigan reformed education expenditure in 1994 to bring poorly-funded schools to parity with wealthier schools. See Papke (2005) for a comprehensive discussion of the data and institutional details. There are $N = 501$ school districts observed for $T = 7$ school years over 1995-2001. I present summary statistics and descriptions for the variables of interest.

Variable    Mean      Std. Dev.   Description
math4       0.6939    0.1515      Fraction of fourth graders who pass the MEAP math test
lunch       0.2886    0.1616      Fraction of students eligible for free and reduced lunch
enroll      3112.31   7965.49     Total enrollment
  avgrexp    6385.51   1034.94     Average real expenditure per pupil.

The outcome variable, math4, denotes the pass rate for fourth-grade students taking a standardized math test and stands as a measure of student achievement. Michigan students undertake a battery of standardized tests in elementary, junior, and secondary school. Like Papke (2005) and Papke and Wooldridge (2008), I focus on the fourth-grade math test because it has been consistently defined and measured over the observed time periods.

The primary variable of interest is average expenditure per pupil, as its coefficient measures the effect of additional expenditure on test scores. Starting in the 1994/1995 school year, the state of Michigan began awarding so-called "foundation grants" which were based on the per-student spending of the school district in the previous year. The goal was to eventually bring schools up to a benchmark "basic foundation" amount which increased over time. The state started by awarding foundation grants to increase expenditure to a minimum of $4200 per student or an additional $250 per student, whichever was higher. By 2000, the minimum and benchmark amounts were equal at $5700. Expenditures per pupil were averaged over the current year as well as the previous three, meaning average real expenditure per pupil in 1995 is an average of expenditure in 1992, 1993, 1994, and 1995. The equation of interest is
\[
math4_{it} = c_i + \log(avgrexp_{it})\beta_1 + lunch_{it}\beta_2 + \log(enroll_{it})\beta_3 + f_t'\gamma_i + e_{it}, \qquad (3.6.1)
\]
which is similar to Papke (2005). I collect $lunch_{it}$, $\log(enroll_{it})$, and $\log(avgrexp_{it})$ and use the reduced form CCE equation from Assumption 2 to implement the pooled QLD estimator. This specification allows me to test for the number of factors. I also use the Ahn et al.
(2013) GMM function to test for $p_0$, with and without the CCE equations. Table 3.8 provides the p-values for testing the hypothesis $H_0: p_0 = p$ versus $H_1: p_0 > p$.

Table 3.8: Testing for $p_0$ (p-values)

               RF       GMM1     GMM2
  p_0 = 0    0.0000   0.0000   0.0000
  p_0 = 1    0.0000   0.0000   0.0000
  p_0 = 2    0.0000   0.4852   0.0000
  p_0 = 3    0.0000   0.1157   0.0000

A rejection of the hypothesis suggests more factors than the tested value, and a failure to reject suggests the current value is correct. The titles 'GMM1', 'GMM2', and 'RF' (for reduced form) refer to the respective objective function used to test the relevant hypothesis. I stress that testing for $p_0$ comes from a long-established literature, briefly described in Ahn et al. (2013). The only new concept I introduce with respect to this specific specification test is using the reduced form moments $E(H_0'Z_i) = 0$. GMM1 is just the Ahn et al. (2013) objective function from equation (3.2.7). GMM2 is the Ahn et al. objective function with the additional moments $E(H_0'Z_i) = 0$. Finally, RF uses only the reduced form moments $E(H_0'Z_i) = 0$.

GMM1 suggests that the correct number of factors is $p_0 = 2$. GMM2 and RF both reject $p_0 = 2$ at any reasonable confidence level, and GMM2 rejects $p_0 = 3$, though it uses a much larger set of moments than the other two, which may decrease power. It may suffer from the same global identification problems discussed in Hayakawa (2016), which suggests the GMM1 test will perform better in practice. I stop testing at $p_0 = 3$ because RF is just identified at $p_0 = 4$. Regardless of the tests, the moments $E(H_0'Z_i) = 0$ only allow me to estimate up to four factors. Even if $p_0 > 4$, the QLDP nets more unobserved heterogeneity than TWFE. For the purpose of comparison with the pooled QLD estimator, I include the TWFE estimator and the pooled CCE estimator.
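The sequential logic behind the table (test $H_0: p_0 = p$ for $p = 0, 1, 2, \ldots$ and stop at the first failure to reject) can be sketched as follows; the p-values are the GMM1 column of Table 3.8, and the helper function is illustrative, not code from the dissertation:

```python
def select_num_factors(pvalues, alpha=0.05):
    """Return the first tested p for which H0: p0 = p is not rejected."""
    for p, pval in enumerate(pvalues):
        if pval > alpha:
            return p
    return len(pvalues)   # every tested value was rejected

# GMM1 column of Table 3.8, for p0 = 0, 1, 2, 3
gmm1 = [0.0000, 0.0000, 0.4852, 0.1157]
print(select_num_factors(gmm1))   # 2, matching the conclusion in the text
```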
As ๐‘‡ = 7 and ๐พ = 3, the CCE estimator can accommodate both ๐‘ฟ, ๐’š, and a heterogeneous intercept in ๐‘ญ. b Further, the pooled QLD estimator is computed with ๐‘ = ๐พ = 3 after eliminating a heterogeneous intercept from ๐‘ฟ๐‘– and ๐’š๐‘– , unit-by-unit. As such, QLDP is a natural comparison to TWFE. Theorem 3.3.3 tells us that ๐œท b๐‘„๐ฟ๐ท๐‘ƒ is invariant to common variables when ๐‘ = ๐พ. Since it also eliminates a heterogeneous intercept, it will be consistent if TWFE is consistent, assuming strictly exogenous covariates. I present results in table 3.9 which shows estimation after eliminating a heterogeneous intercept. For CCE, this simply amounts to ๐‘ญ b = (1, ๐’š, ๐‘ฟ). For QLDP, I project out the intercept from each ๐‘ฟ๐‘– and ๐’š๐‘– via the within transformation before estimating. Standard errors are in parentheses while p-values are in brackets. The reported standard errors are generated via the panel nonparametric bootstrap. The QLDP estimator suggests substantial estimates for the effect of per student expenditures. A 10% increase in the average expenditure per student is associated with an 8.3 percentage point increase in the math test pass rate, with a p-value of 0.0009. This estimate is more than twice as large as the TWFE estimate and more than three halves the CCEP estimate. These results suggest that TWFE is not adequately controlling for the heterogeneity present in the data set. Both the CCE and QLDP estimates are statistically significant at the 5% level. The TWFE standard errors are generally smaller than CCE and QLD because it removes less variation from the data. I also considered estimation via the mean group QLD and CCE estimators. 
Table 3.9: Controlling for heterogeneous intercept

                   TWFE       CCEP       QLDP
  lunch          -0.0419     0.0398    -0.1576
                 (0.0730)   (0.1367)   (0.1637)
                 [0.5658]   [0.7709]   [0.3381]
  log(enroll)     0.0021    -0.0592     0.0268
                 (0.0487)   (0.1497)   (0.2152)
                 [0.9663]   [0.6924]   [0.8838]
  log(avgrexp)    0.3771     0.5409     0.8287
                 (0.0704)   (0.2695)   (0.3785)
                 [0.0000]   [0.0446]   [0.0303]

However, both parameter estimates and standard errors were unreasonable compared to the other estimators. In fact, the p-values were significantly larger than in any other reported case and suggested a critical lack of precision. Recall that the mean group estimators require much stronger exogeneity and identifying conditions than the pooled estimators.

3.7 Conclusion

This paper considers fixed-$T$ estimation of linear panel data models where the errors have a general unknown factor structure. I use the quasi-long-difference transformation studied by Ahn et al. (2013) to eliminate the factor structure and provide moment conditions for estimation. For the purpose of comparison with the popular pooled common correlated effects estimator, I study the moments implied by assuming a pure factor structure in the covariates. Applying the QLD transformation to the independent variables improves the efficiency of estimating the parameters of interest in the main equation; this is information that pooled CCE does not use.

Current proofs of fixed-$T$ asymptotic normality of the pooled CCE estimator assume loadings which are strictly exogenous with respect to the idiosyncratic errors in the independent variables. I show that the uncorrelated loadings assumption implies the existence of an even larger number of moments which CCE neglects. Ultimately, if one makes the strong assumptions sufficient for asymptotic normality of pooled CCE in Westerlund et al. (2019), one should fully consider the information available for efficient estimation.
Regardless, I provide robust standard errors in a more general and appealing setting than the CCE models in Pesaran (2006) and Westerlund et al. (2019). I apply the moment-based perspective to a heterogeneous slopes model similar to the original Pesaran (2006) setting. I prove consistency and asymptotic normality of pooled and mean group estimators based on the QLD transformation which place no restrictions on the relationship between $T$ and $K$, in contrast to CCE. These estimators are shown to outperform CCE estimators in finite samples, even when $N$ is small. The pooled QLD estimator also has the desirable property of invariance to common variables, like time trends and macroeconomic indicators, when the estimated number of factors equals the number of regressors. I reexamine the effect of school district expenditures on standardized test performance and find significantly larger effects of educational spending compared to simple fixed effects regression. These estimates are also reported with reasonable precision, which suggests that applied researchers are not adequately controlling for heterogeneity in their data.

One important direction for future work concerns the overestimation of $p_0$. It is known that CCE is robust to $K + 1 > p_0$. Moon and Weidner (2015) prove that principal components estimation is also robust to overestimating the number of factors, provided $T$ is large. However, while there is ample simulation evidence suggesting the robustness of QLD to such a failure, a formal proof is lacking. It would also be useful to investigate the robustness of the QLDP estimators to failure of the reduced form equation in Assumption 2. Finally, the methods presented in this paper all assume balanced panels. Missing data causes challenges in constructing the CCE and QLD transformations.
It is not clear how even a complete-cases estimator would work, as the cross-sectional averages and first-stage estimator of $\hat\theta$ require all time periods for each unit in the sample.

APPENDIX

PROOFS FOR CHAPTER 1

This appendix collects proofs of the formal results stated in the text.

Proof of Lemma 1.3.1. From equation (1.3.14), Assumptions WV.1 and WV.2 imply
\[
\text{Var}(y_i | x_i, c_i) = \alpha c_i M_i^{1/2} R M_i^{1/2}.
\]
By the law of total variance,
\[
\begin{aligned}
\text{Var}(y_i | x_i) &= E[\text{Var}(y_i | x_i, c_i) | x_i] + \text{Var}[E(y_i | x_i, c_i) | x_i] \\
&= E\bigl[\alpha c_i M_i^{1/2} R M_i^{1/2} \bigm| x_i\bigr] + \text{Var}(c_i m_i | x_i) \\
&= \alpha \mu_c(x_i) M_i^{1/2} R M_i^{1/2} + \sigma_c^2(x_i) m_i m_i'. \qquad (.0.1)
\end{aligned}
\]
To simplify notation in what follows, write $\mu_i \equiv \mu_c(x_i)$ and $\sigma_i^2 \equiv \sigma_c^2(x_i)$. To derive $\Omega_i^{-1}$, we apply an implication of Sherman and Morrison (1950): for a nonsingular $T \times T$ matrix $A$ and $T \times 1$ vector $b$,
\[
(A + bb')^{-1} = A^{-1} - \frac{1}{1 + b'A^{-1}b} A^{-1}bb'A^{-1}, \qquad (.0.2)
\]
which can be verified by direct multiplication. Take $A \equiv \alpha\mu_i M_i^{1/2}RM_i^{1/2}$ and $b \equiv \sigma_i m_i$ in (.0.2), and note that $\bigl[\alpha\mu_i M_i^{1/2}RM_i^{1/2}\bigr]^{-1} = M_i^{-1/2}R^{-1}M_i^{-1/2}/(\alpha\mu_i)$ and $M_i^{-1/2}m_i = \sqrt{m}_i$.
Therefore,
\[
\begin{aligned}
\Omega_i^{-1} &= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}M_i^{-1/2} - \frac{1}{1 + \sigma_i^2\sqrt{m}_i'R^{-1}\sqrt{m}_i/(\alpha\mu_i)}\,\sigma_i^2\,M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2}/(\alpha\mu_i)^2 \\
&= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}M_i^{-1/2} - \frac{\sigma_i^2}{\alpha\mu_i\bigl(\alpha\mu_i + \sigma_i^2\sqrt{m}_i'R^{-1}\sqrt{m}_i\bigr)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2} \\
&= \frac{1}{\alpha\mu_i}M_i^{-1/2}\left\{R^{-1} - \frac{\sigma_i^2}{\alpha\mu_i + \sigma_i^2\sqrt{m}_i'R^{-1}\sqrt{m}_i}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}\right\}M_i^{-1/2}. \quad \square
\end{aligned}
\]

Proof of Theorem 1.3.1. Simplify the notation by defining $D_i \equiv D_o(x_i)$, $V_i \equiv V_o(x_i)$, $\mu_i \equiv \mu_c(x_i)$, $\sigma_i^2 \equiv \sigma_c^2(x_i)$, and drop dependences on $\beta_0$. With this simplified notation,
\[
V_i^- = \Omega_i^{-1} - \Omega_i^{-1}m_i\bigl(m_i'\Omega_i^{-1}m_i\bigr)^{-1}m_i'\Omega_i^{-1}
\]
and, from Lemma 1.3.1,
\[
\Omega_i^{-1} = \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}M_i^{-1/2} - \frac{\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2},
\]
where $a_i \equiv \sqrt{m}_i'R^{-1}\sqrt{m}_i$.
Therefore, because $M_i^{-1/2}m_i = \sqrt{m}_i$,
\[
\begin{aligned}
\Omega_i^{-1}m_i &= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}\sqrt{m}_i - \frac{\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\,\sqrt{m}_i'R^{-1}\sqrt{m}_i \\
&= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}\sqrt{m}_i - \frac{a_i\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i \\
&= \left[\frac{1}{\alpha\mu_i} - \frac{a_i\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}\right]M_i^{-1/2}R^{-1}\sqrt{m}_i \\
&= \frac{\alpha\mu_i + a_i\sigma_i^2 - a_i\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i \\
&= \frac{1}{\alpha\mu_i + a_i\sigma_i^2}M_i^{-1/2}R^{-1}\sqrt{m}_i.
\end{aligned}
\]
Also,
\[
m_i'\Omega_i^{-1}m_i = \frac{1}{\alpha\mu_i + a_i\sigma_i^2}\sqrt{m}_i'R^{-1}\sqrt{m}_i = \frac{a_i}{\alpha\mu_i + a_i\sigma_i^2}.
\]
It follows that
\[
\Omega_i^{-1}m_i\bigl(m_i'\Omega_i^{-1}m_i\bigr)^{-1}m_i'\Omega_i^{-1} = \frac{1}{a_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2}.
\]
Plugging into $V_i^-$ gives
\[
\begin{aligned}
V_i^- &= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}M_i^{-1/2} - \frac{\sigma_i^2}{\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2} \\
&\qquad - \frac{1}{a_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2} \\
&= \frac{1}{\alpha\mu_i}M_i^{-1/2}R^{-1}M_i^{-1/2} - \frac{\alpha\mu_i + a_i\sigma_i^2}{a_i\,\alpha\mu_i(\alpha\mu_i + a_i\sigma_i^2)}M_i^{-1/2}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}M_i^{-1/2} \\
&= \frac{1}{\alpha\mu_i}M_i^{-1/2}\left[R^{-1} - \frac{1}{a_i}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}\right]M_i^{-1/2},
\end{aligned}
\]
which completes the result for $V_i^-$.
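As a numerical sanity check on the closed form for $V_i^-$, the sketch below (illustrative values only, not code from the dissertation) builds $\Omega_i = \alpha\mu_i M_i^{1/2}RM_i^{1/2} + \sigma_i^2 m_i m_i'$ for a random design and confirms that the direct formula $\Omega_i^{-1} - \Omega_i^{-1}m_i(m_i'\Omega_i^{-1}m_i)^{-1}m_i'\Omega_i^{-1}$ matches the factored expression derived above:

```python
import numpy as np

rng = np.random.default_rng(1)
T, alpha, mu, sigma2 = 4, 1.3, 0.7, 0.5
m = rng.uniform(0.5, 2.0, T)               # conditional mean vector m_i
s = np.sqrt(m)                             # elementwise sqrt(m_i)
C = rng.standard_normal((T, T))
R = C @ C.T + T * np.eye(T)                # positive definite 'working' matrix
R /= np.sqrt(np.outer(np.diag(R), np.diag(R)))   # rescale to unit diagonal

Mh = np.diag(s)                            # M_i^{1/2}
Omega = alpha * mu * Mh @ R @ Mh + sigma2 * np.outer(m, m)
Oi = np.linalg.inv(Omega)

# Direct computation of V_i^-
V_direct = Oi - np.outer(Oi @ m, Oi @ m) / float(m @ Oi @ m)

# Closed form: (1/(alpha*mu)) M^{-1/2} [R^{-1} - (1/a) R^{-1} s s' R^{-1}] M^{-1/2}
Rinv = np.linalg.inv(R)
a = float(s @ Rinv @ s)
Mih = np.diag(1.0 / s)                     # M_i^{-1/2}
V_closed = (Mih @ (Rinv - np.outer(Rinv @ s, Rinv @ s) / a) @ Mih) / (alpha * mu)

print(np.allclose(V_direct, V_closed))     # True
```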
From (1.3.10), the optimal IVs are
\[
D_i'V_i^- = -\mu_i\nabla_\beta m_i'V_i^- = -\frac{1}{\alpha}\nabla_\beta m_i'\,M_i^{-1/2}\left[R^{-1} - \frac{1}{a_i}R^{-1}\sqrt{m}_i\sqrt{m}_i'R^{-1}\right]M_i^{-1/2},
\]
and we can drop $-1/\alpha$ and factor out $M_i^{-1/2}$ to get the result. $\square$

Proof of Corollary 1.3.1. Putting $R = I_T$ into (1.3.17) and using simple algebra gives the optimal IVs as
\[
Z^*(x_i)' = \nabla_\beta m_i(\beta_0)'\left(M_i^{-1} - \frac{1}{\sum_{r=1}^T m_{ir}}\mathbf{1}_T\mathbf{1}_T'\right).
\]
We show that this choice of instruments leads to the FEP first order condition, as expressed by Wooldridge (1999), using the definition of $W_i$ given in Section 1.2:
\[
\nabla_\beta p_i(\beta_0)'W_i = \nabla_\beta m_i(\beta_0)'\bigl[I_T - \mathbf{1}_Tp_i(\beta_0)'\bigr]M_i^{-1}.
\]
To see the equivalence, note that
\[
\mathbf{1}_Tp_i(\beta_0)'M_i^{-1} = \frac{1}{\sum_{r=1}^T m_{ir}}\begin{pmatrix}m_i' \\ m_i' \\ \vdots \\ m_i'\end{pmatrix}M_i^{-1} = \frac{1}{\sum_{r=1}^T m_{ir}}\mathbf{1}_T\mathbf{1}_T',
\]
and so
\[
\nabla_\beta p_i(\beta_0)'W_i = \nabla_\beta m_i(\beta_0)'\left(M_i^{-1} - \frac{1}{\sum_{r=1}^T m_{ir}}\mathbf{1}_T\mathbf{1}_T'\right) = Z^*(x_i)'. \quad \square
\]

APPENDIX

PROOFS FOR CHAPTER 2

Proof of Lemma 2.2.3. Let $p_{it}(\beta) = m_t(x_{it}, \beta)\bigl(\sum_{s=1}^T m_s(x_{is}, \beta)\bigr)^{-1}$, $p_i(\beta) = (p_{i1}(\beta), \ldots, p_{iT}(\beta))'$, and $n_i = \sum_{s=1}^T y_{is}$. Let $\mathbf{1}$ be a $T \times 1$ vector of ones. First I directly show the conclusion holds for $I_T - p_i(\beta)\mathbf{1}'$, which satisfies the lemma's assumption. It also satisfies Assumption MAT, which is made clear in Section 2.3.
I need the following derivation:
\[
\begin{aligned}
\nabla_\beta p_{it} &= \Bigl(\sum_{r=1}^T m_{ir}(x_{ir}, \beta)\Bigr)^{-2}\Bigl(\nabla_\beta m_{it}(x_{it}, \beta)\sum_{r=1}^T m_{ir}(x_{ir}, \beta) - m_{it}(x_{it}, \beta)\sum_{r=1}^T\nabla_\beta m_{ir}(x_{ir}, \beta)\Bigr) \\
&= \Bigl(\sum_{r=1}^T m_{ir}(x_{ir}, \beta)\Bigr)^{-1}\Bigl(\nabla_\beta m_{it}(x_{it}, \beta) - p_{it}(\beta)\sum_{r=1}^T\nabla_\beta m_{ir}(x_{ir}, \beta)\Bigr).
\end{aligned}
\]
Stacking the $T$ equations gives
\[
\begin{aligned}
\nabla_\beta p_i(\beta) &= \Bigl(\sum_{r=1}^T m_{ir}(x_{ir}, \beta)\Bigr)^{-1}\bigl(\nabla_\beta m_i(\beta) - p_i(\beta)\mathbf{1}'\nabla_\beta m_i(\beta)\bigr) \\
&= \Bigl(\sum_{r=1}^T m_{ir}(x_{ir}, \beta)\Bigr)^{-1}\bigl(I_T - p_i(\beta)\mathbf{1}'\bigr)\nabla_\beta m_i(\beta).
\end{aligned}
\]
As $E(-n_i|x_i) = -\mu_c(x_i)\sum_{r=1}^T m_{ir}(x_{ir}, \beta_0)$, evaluating the derivative at $\beta_0$ and multiplying by $E(-n_i|x_i)$ yields the final result.

Now let $A(x_i, \beta)$ be an $L \times T$ matrix satisfying the assumption of the lemma: $A(x_i, \beta)(I_T - p_i(\beta)\mathbf{1}') = A(x_i, \beta)$ for all $\beta$ near $\beta_0$. Then writing $g(x_i, \beta) = (I_T - p_i(\beta)\mathbf{1}')y_i$, we have for all $\beta$ near $\beta_0$
\[
\begin{aligned}
E\bigl(\nabla_\beta(A(x_i, \beta)y_i)\big|x_i\bigr) &= E\bigl(\nabla_\beta(A(x_i, \beta)g(x_i, \beta))\big|x_i\bigr) \\
&= \nabla_\beta A(x_i, \beta)E(g(x_i, \beta)|x_i) + A(x_i, \beta)E(\nabla_\beta g(x_i, \beta)|x_i).
\end{aligned}
\]
Evaluating at $\beta_0$ yields $E(\nabla_\beta A(x_i, \beta_0)y_i|x_i) = A(x_i, \beta_0)\nabla_\beta m_i(\beta_0)$, since $E(g(x_i, \beta_0)|x_i) = 0$ and $E(\nabla_\beta g(x_i, \beta_0)|x_i) = (I_T - p_i(\beta_0)\mathbf{1}')\nabla_\beta m_i(\beta_0)$.
$\square$

Proof of Lemma 2.3.1. Write $E(y_iy_i'|x_i) = \Sigma_i$. Then for any $(T-1) \times T$ transformation $A(x_i, \beta_0)$ with rank $T-1$,
\[
\begin{aligned}
\mathrm{Rank}\bigl(A(x_i,\beta_0)\Sigma_iA(x_i,\beta_0)'\bigr) &= \mathrm{Rank}\bigl((A(x_i,\beta_0)\Sigma_i^{1/2})(A(x_i,\beta_0)\Sigma_i^{1/2})'\bigr) \\
&= \mathrm{Rank}\bigl(A(x_i,\beta_0)\Sigma_i^{1/2}\bigr) = \mathrm{Rank}\bigl(A(x_i,\beta_0)\bigr) = T-1,
\end{aligned}
\]
as $\Sigma_i^{1/2}$ is $T \times T$ and full rank. Thus the conditional variance is nonsingular and (2.2.4) holds with a proper inverse. Any generalized differencing residual with transformation satisfying Assumption RK.1 has a nonsingular conditional variance. This result applies to $Q(I_T - p_i(\beta_0)\mathbf{1}')$ and $Q(I_T - m_i(\beta_0)(m_i(\beta_0)'m_i(\beta_0))^{-1}m_i(\beta_0)')$ since their full transformations have rank $T-1$. Lemma 1 of Verdier (2018) shows $\mathrm{Rank}(I_T - p_i(\beta_0)\mathbf{1}') = T-1$; the rank of the residual-maker transformation is a well-known result.

First note that $V_i^-m_i(\beta_0) = 0$ by construction. As
\[
p_i(\beta_0)\mathbf{1}'\Bigl(I_T - \frac{1}{a_i}m_i(\beta_0)m_i(\beta_0)'\Sigma_i^{-1}\Bigr) = 0, \qquad
\bigl(I_T - m_i(\beta_0)(m_i(\beta_0)'m_i(\beta_0))^{-1}m_i(\beta_0)'\bigr)m_i(\beta_0) = 0,
\]
the conditional gradients are given as
\[
(I_T - p_i(\beta_0)\mathbf{1}')\nabla_\beta m_i(\beta_0), \qquad
\bigl(I_T - m_i(\beta_0)(m_i(\beta_0)'m_i(\beta_0))^{-1}m_i(\beta_0)'\bigr)\nabla_\beta m_i(\beta_0)
\]
by Lemma 2.2.3. Then the systems defined by Assumption SYS for both transformations are consistent with $F(x_i) = V_i^-\nabla_\beta m_i(\beta_0)$, and the singularity assumption in Assumption RK.2 guarantees both efficiency bounds exist.
$\square$

Proof of Theorem 2.3.1. As mentioned in the text, Assumptions CM, RK.1, RK.2, and the positive definiteness of $E(y_iy_i'|x_i)$ are sufficient for each of the transformations studied to satisfy Assumptions SYS and ORTH (and thus MAT), so that their asymptotic efficiency bounds are well-defined and given by (2.2.8). Let $B_i$ be one of the full-rank $(T-1) \times T$ transformations (evaluated at $x_i$ and $\beta_0$) studied. $B_i$ could be the generalized within transformation, or either the generalized within or residual-maker transformation with any arbitrary row deleted. I will prove the theorem by showing each of these transformations is information equivalent to the full generalized within transformation via Theorem 1, noting that a similar proof holds for the full residual-maker transformation.

Write $\Sigma_i = E(y_iy_i'|x_i)$. Since each of the potential $B_i$ matrices satisfies Assumption ORTH, its efficiency bound is given by (2.2.8):
\[
E\bigl(\nabla_\beta m_i(\beta_0)'B_i'(B_i\Sigma_iB_i')^{-1}B_i\nabla_\beta m_i(\beta_0)\bigr)^{-1}.
\]
In the notation of Theorem 1, let $V_i = (I_T - p_i(\beta_0)\mathbf{1}')\Sigma_i(I_T - \mathbf{1}p_i(\beta_0)')$ and $M_i = (I_T - p_i(\beta_0)\mathbf{1}')$. $B_iM_i = B_i$ as $B_ip_i(\beta_0) = 0$ by Assumption CM. Also $\mathrm{Rank}(M_iV_iM_i') = \mathrm{Rank}(V_i) = T-1 = \mathrm{Rank}(M_i)$, so Assumption GR.1 holds for the same $M_i$ regardless of $B_i$. As $B_iV_iB_i' = B_i\Sigma_iB_i'$, we have $\mathrm{Rank}(B_iV_iB_i') = T-1 = \mathrm{Rank}(B_i)$, so Assumption GR.2 holds. Thus by Theorem 1,
\[
B_i'(B_i\Sigma_iB_i')^{-1}B_i = M_i'(M_i\Sigma_iM_i')^-M_i.
\]
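This invariance can be illustrated numerically. In the sketch below (illustrative values only, not code from the dissertation), $M_i = I_T - p\mathbf{1}'$ with $\mathbf{1}'p = 1$, $B_i$ is $M_i$ with one row deleted, and the Moore-Penrose pseudoinverse stands in for the generalized inverse:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5
m = rng.uniform(0.5, 2.0, T)
p = m / m.sum()                            # p_i(beta0), so 1' p = 1
M = np.eye(T) - np.outer(p, np.ones(T))    # I_T - p 1', rank T - 1
B = M[:-1, :]                              # delete an arbitrary row: (T-1) x T

C = rng.standard_normal((T, T))
Sigma = C @ C.T + T * np.eye(T)            # positive definite conditional second moment

lhs = B.T @ np.linalg.inv(B @ Sigma @ B.T) @ B
rhs = M.T @ np.linalg.pinv(M @ Sigma @ M.T) @ M   # generalized (Moore-Penrose) inverse
print(np.allclose(lhs, rhs))               # True
```

Because $B$ spans the same row space as $M$, the two quadratic forms coincide, which is exactly what the equal-rank condition in the theorem requires.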
The information bound for the generalized within transformation is
\[
E\bigl(\nabla_\beta m_i(\beta_0)'M_i'(M_i\Sigma_iM_i')^-M_i\nabla_\beta m_i(\beta_0)\bigr)^{-1}.
\]
This expression is equal to the expression in (2.2.6) by Theorem 2.2.1, so the generalized within transformation is information equivalent to $B_i$. The proof for the residual-maker transformation is similar with $M_i = (I_T - m_i(\beta_0)(m_i(\beta_0)'m_i(\beta_0))^{-1}m_i(\beta_0)')$ and $V_i$ being the respective conditional covariance matrix. $\square$

APPENDIX

PROOFS FOR CHAPTER 3

Proof of Lemma 3.3.1. Separate the estimated parameters into the respective $(T-p) \times (p-p_0)$ and $(T-p) \times p_0$ matrices $(\Theta_1|\Theta_2)$. Separate the true regularized parameters by rows, $(\Theta_{10}'|\Theta_{20}')'$, which are then $(T-p) \times p_0$ and $(p-p_0) \times p_0$ matrices, respectively. Then for $p > p_0$, $H(\theta)'F_0 = \Theta_{10} + \Theta_1\Theta_{20} - \Theta_2$. Set $\Theta_2 = \Theta_{10} + \Theta_1\Theta_{20}$ for any value of $\Theta_1$, so that there are infinitely many solutions which make equation (3.3.1) zero. Finally, when $p < p_0$ there are more factors than the transformation can eliminate, so there are no values of $\Theta$ which cause (3.3.1) to be zero. These order conditions for estimation of $\theta_0$ are identical to Ahn et al. (2013). $\square$

Proof of Theorem 3.3.2. I first state the Identifying Assumption (IA), which comes from Ahn et al. (2013)'s Basic Assumptions:

Identifying Assumption: $\mathrm{Rk}(E(\gamma_i\gamma_i')) = p_0 < T$.
For any ๐‘‡ ร— (๐‘‡ โˆ’ ๐‘ 0 ) matrix ๐‘ฏ0 such that Rk(๐‘ญ0 , ๐‘ฏ0 ) = ๐‘‡, the following matrix has full column rank: ๐ธ (๐‘ฏ0โ€ฒ ๐‘ฟ๐‘– โŠ— vec( ๐‘ฟ๐‘– )), ๐‘ฐ๐‘‡โˆ’๐‘0 โŠ— ๐ธ (vec( ๐‘ฟ๐‘– )๐œธ๐‘–โ€ฒ) โ–   The two equations under consideration are equations (3.2.7) and (3.2.8), ๐ธ (๐’˜ ๐‘– โŠ— ๐‘ฏ0โ€ฒ ( ๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œท0 )) = 0 ๐ธ (๐‘ฏ0โ€ฒ ๐‘ฝ๐‘– ) = 0 I appeal to the partial redundancy results given in Section 4 of Breusch et al. (1999). In this setting, partial redundancy of two sets of moment conditions means that the asymptotic variance of the GMM estimator of ๐œท0 based off of both sets of moment conditions is the same as that of the GMM estimator which only uses the first set. See Section 1 of Breusch et al. (1999) for examples. 98 Write ๐€ = ( ๐œทโ€ฒ0 , ๐œฝ 0โ€ฒ ) โ€ฒ and let ๐€ 1 = ๐œท0 and ๐€2 = ๐œฝ 0 . Then ๐€ 1 is identified by equation (3.2.7) under IA1 and ๐€ 2 is identified by equation (3.2.8), both facts I use in the proof. They consider a general vector of moment conditions ๏ฃฎ ๏ฃน ๏ฃฏ ๐’ˆ1 (๐€, ๐œผ๐‘– )) ๏ฃบ ๐ธ ( ๐’ˆ(๐€, ๐œผ๐‘– )) = ๏ฃฏ ๏ฃบ =0 ๏ฃฏ ๏ฃบ ๏ฃฏ ๐’ˆ (๐€, ๐œผ )) ๏ฃบ ๏ฃฏ 2 ๐‘– ๏ฃบ ๏ฃฐ ๏ฃป where in my notation ๐œผ๐‘– = (๐’š๐‘– , ๐‘ฟ๐‘– , ๐œธ๐‘– , ๐šช๐‘– ), ๐’ˆ1 = ๐‘ฏ(๐œฝ) โ€ฒ (๐’š๐‘– โˆ’ ๐‘ฟ๐‘– ๐œท0 + ๐‘ญ๐œธ๐‘– ), and ๐’ˆ2 = ๐‘ฏ(๐œฝ) โ€ฒ๐‘ฝ๐‘– . I partition the gradient and covariances matrices as ๏ฃฎ ๏ฃน ๏ฃฏ ๐‘ซ 11 ๐‘ซ 12 ๏ฃบ๏ฃบ ๐‘ซ=๏ฃฏ ๏ฃฏ ๏ฃบ ๏ฃฏ๐‘ซ ๐‘ซ 22 ๏ฃบ๏ฃบ ๏ฃฏ 21 ๏ฃฐ ๏ฃป ๏ฃฎ ๏ฃน ๏ฃฏ๐›€11 ๐›€12 ๏ฃบ๏ฃบ ๐›€= ๏ฃฏ ๏ฃฏ ๏ฃบ ๏ฃฏ๐›€ ๐›€22 ๏ฃบ๏ฃบ ๏ฃฏ 21 ๏ฃฐ ๏ฃป where ๐‘ซ ๐‘š ๐‘› = ๐ธ (โˆ‡๐€ ๐‘› ๐’ˆ๐‘š (๐€, ๐œผ๐‘– )) and ๐›€๐‘š ๐‘› = ๐ธ ( ๐’ˆ๐‘š (๐€, ๐œผ๐‘– ) ๐’ˆ๐‘› (๐€, ๐œผ๐‘– ) โ€ฒ). 
Equation (3.2.8) is partially redundant for estimating $\beta_0$ if and only if
\[
D_{21} - \Omega_{21}\Omega_{11}^{-1}D_{11} = \bigl(D_{22} - \Omega_{21}\Omega_{11}^{-1}D_{12}\bigr)\bigl(D_{12}'\Omega_{11}^{-1}D_{12}\bigr)^{-1}\bigl(D_{12}'\Omega_{11}^{-1}D_{11}\bigr)
\]
by Theorem 7 of Breusch et al. (1999). As $u_i$ is mean independent of $X_i$, $\Omega_{21} = 0$ and $\Omega_{12} = 0$, so the necessary and sufficient condition for partial redundancy is
\[
D_{21} = D_{22}\bigl(D_{12}'\Omega_{11}^{-1}D_{12}\bigr)^{-1}\bigl(D_{12}'\Omega_{11}^{-1}D_{11}\bigr).
\]
Since $g_2(\lambda, \eta_i)$ is not a function of $\beta_0$, we also have $D_{21} = 0$. Assumption PF gives that $D_{22}$ has full column rank, so that $D_{22}(D_{12}'\Omega_{11}^{-1}D_{12})^{-1}$ is left-invertible. Therefore the redundancy condition becomes
\[
D_{12}'\Omega_{11}^{-1}D_{11} = 0. \quad \square
\]

Proof of Theorem 3.3.4. (See Section 3 of Ahn et al. (2013).)

I start with the proof of consistency. The centered QLDP estimator is written as
\[
\hat\beta_{QLDP} - \beta_0 = \Bigl(\frac{1}{N}\sum_{i=1}^N X_i'\hat H\hat H'X_i\Bigr)^{-1}\Bigl(\frac{1}{N}\sum_{i=1}^N X_i'\hat H\hat H'(F_0\gamma_i + u_i)\Bigr).
\]
The denominator equals its infeasible counterpart $\frac{1}{N}\sum_{i=1}^N V_i'H_0H_0'V_i$ up to an $O_p(N^{-1/2})$ term by Theorem 1 and the moment bounds from BASE. The inverse exists with probability approaching one by condition (1) of the theorem. Thus the denominator is an $O_p(1)$ term, so consistency depends on the numerator. The difference between the numerator and its infeasible counterpart is
\[
\frac{1}{N}\sum_{i=1}^N X_i'(\hat H\hat H' - H_0H_0')(F_0\gamma_i + u_i) = \Bigl(\frac{1}{N}\sum_{i=1}^N (F_0\gamma_i + u_i)' \otimes X_i'\Bigr)\mathrm{vec}(\hat H\hat H' - H_0H_0') = O_p(1)o_p(1).
\]
The sum converges to its finite expectation by the moment bounds from Assumption 2(2).
vec( ๐‘ฏ b๐‘ฏ bโ€ฒโˆ’ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ) = ๐‘‚ ๐‘ (๐‘ โˆ’1/2 ) by Theorem 3.3.1. The infeasible numerator, ๐‘1 ๐‘–=1 ร๐‘ โ€ฒ ๐‘ฟ๐‘– ๐‘ฏ0 ๐‘ฏ0โ€ฒ (๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– ), is ๐‘œ ๐‘ (1) as ๐‘ฏ0โ€ฒ ๐‘ญ0 = 0 and ๐‘1 ๐‘–=1 ร๐‘ โ€ฒ ๐‘ฟ๐‘– ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐’–๐‘– = ๐‘œ ๐‘ (1) by condition (3), so we have ๐œท b๐‘„๐ฟ๐ท๐‘ƒ โˆ’ ๐œท0 = ๐‘œ ๐‘ (1). Before deriving the asymptotic distribution of the QLDP, I need the following lemma: Lemma .0.1. Let ๐๐‘– = ๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– . Then ยฉ ๐’™๐‘– โˆ—1โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยช  ยญยญ .. ยฎ โˆ‡๐œฝ ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ0 ๐‘ฏ0โ€ฒ ๐๐‘– ) = ๐‘ฐ๐พ โŠ— ๐’–โ€ฒ๐‘– ๐‘ฏ0 ยญ ยฎ + ๐‘ฝ๐‘–โ€ฒ ๐‘ฏ0 ๐๐‘–โˆ—โ€ฒ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0  (.0.1) ยฎ . ยญ ยฎ ยญ ยฎ โˆ— โ€ฒ ๐’™๐‘– ๐พ โŠ— ๐‘ฐ๐‘‡โˆ’๐‘0 ยซ ยฌ where ๐’™๐‘– ๐‘— is the ๐‘—โ€™th column of ๐‘ฟ๐‘– and ๐’— โˆ— = (๐‘ฃ๐‘‡โˆ’๐‘0 +1 , ..., ๐‘ฃ๐‘‡ ) โ€ฒ is the last ๐‘ 0 elements of the ๐‘‡ ร— 1 vector ๐’—. Proof. I omit the pure factor notation for simplicity and work with the full matrix ๐‘ฟ๐‘– . Proposition 5.4 of Dhrymes (2013) gives โˆ‡๐œฝ ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ๐๐‘– ) = (๐๐‘–โ€ฒ ๐‘ฏ(๐œฝ) โŠ— ๐‘ฐ๐พ )โˆ‡๐œฝ ( ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)) + ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)โˆ‡๐œฝ (๐‘ฏ(๐œฝ) โ€ฒ๐๐‘– ) (.0.2) 100 where I follow standard notation in writing the derivative of the ๐‘› ร— ๐‘š matrix ๐‘จ with respect to the ๐‘˜ ร— 1 vector ๐œถ as โˆ‡๐œถ ๐‘จ = โˆ‡๐œถ vec( ๐‘จ). The row vectors of โˆ‡๐œถ ๐‘จ are then the 1 ร— ๐‘˜ gradient vectors of the elements of vec( ๐‘จ) with respect to ๐œถ. In order to derive the various derivatives, I first start with the case of an arbitrary ๐‘‡ ร— 1 vector ๐’— = (๐‘ฃ 1 , ..., ๐‘ฃ๐‘‡ ) โ€ฒ. As described in Section 3.1, ๐‘ฏ(๐œฝ) โ€ฒ = ( ๐‘ฐ๐‘‡โˆ’๐‘0 , ๐šฏ) where ๐œฝ = vec(๐šฏ). 
I write the $p_0$ column vectors of $\Theta$ as $\Theta = (\theta_1, \ldots, \theta_{p_0})$, where each column can be written as $\theta_j = (\theta_{j1}, \ldots, \theta_{j,T-p_0})'$. These definitions give the expression
\[
H(\theta)'v = \begin{pmatrix} v_1 + \theta_{11}v_{T-p_0+1} + \cdots + \theta_{p_01}v_T \\ \vdots \\ v_{T-p_0} + \theta_{1,T-p_0}v_{T-p_0+1} + \cdots + \theta_{p_0,T-p_0}v_T \end{pmatrix}. \qquad (.0.3)
\]
The expression above is similar to that derived below equation (4) of Ahn et al. (2013). They write the terms as the dot product between the rows of $H(\theta)'$ and $v^*$; I expand the sums so that the gradient is easier to see. Taking the gradient of the $r$'th element of $H(\theta)'v$ with respect to $\theta_j$ gives
\[
\nabla_{\theta_j}(v_r + \theta_{1r}v_{T-p_0+1} + \cdots + \theta_{p_0r}v_T) = (0, \ldots, 0, v_{T-p_0+j}, 0, \ldots, 0),
\]
where the only nonzero term is in the $r$'th column. Thus differentiating with respect to the $j$'th vector gives
\[
\nabla_{\theta_j}H(\theta)'v = \begin{pmatrix} v_{T-p_0+j} & & \\ & \ddots & \\ & & v_{T-p_0+j} \end{pmatrix} = v_{T-p_0+j}I_{T-p_0}.
\]
Putting together the $p_0$ gradients gives
\[
\nabla_\theta H(\theta)'v = \bigl(v_{T-p_0+1}I_{T-p_0}, \ldots, v_TI_{T-p_0}\bigr) = v^{*\prime} \otimes I_{T-p_0}. \qquad (.0.4)
\]
Equation (.0.4) implies $\nabla_\theta H(\theta)'\xi_i = \xi_i^{*\prime} \otimes I_{T-p_0}$. Handling $H(\theta)'X_i$ is done similarly.
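The Jacobian in equation (.0.4) can be checked against central finite differences. The sketch below (illustrative values only, not code from the dissertation) builds $\theta \mapsto H(\theta)'v$ with $H(\theta)' = (I_{T-p_0}, \Theta)$ and compares the numerical Jacobian to $v^{*\prime} \otimes I_{T-p_0}$:

```python
import numpy as np

T, p0 = 5, 2
rng = np.random.default_rng(2)
v = rng.standard_normal(T)
theta = rng.standard_normal((T - p0) * p0)     # theta = vec(Theta), column-major

def H_transform(theta, v):
    """Compute H(theta)'v with H(theta)' = (I_{T-p0}, Theta)."""
    Theta = theta.reshape(T - p0, p0, order="F")   # undo the column-major vec
    return v[: T - p0] + Theta @ v[T - p0:]

# Analytic Jacobian from equation (.0.4): v*' kron I_{T-p0}
analytic = np.kron(v[T - p0:], np.eye(T - p0))

# Central finite differences, one column of the Jacobian per element of theta
eps = 1e-6
numeric = np.zeros((T - p0, theta.size))
for j in range(theta.size):
    e = np.zeros_like(theta)
    e[j] = eps
    numeric[:, j] = (H_transform(theta + e, v) - H_transform(theta - e, v)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```

Since $H(\theta)'v$ is linear in $\theta$, the finite-difference Jacobian agrees with the analytic one essentially to machine precision.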
Writing the covariates in terms of the column vectors $X_i = (x_{i1}, \ldots, x_{iK})$, where now the subscript on $x_{ik}$ denotes the $T \times 1$ vector of observations on variable $k$ for individual $i$, we can see that
\[
H(\theta)'X_i = (H(\theta)'x_{i1}, \ldots, H(\theta)'x_{iK}),
\]
which implies that
\[
\mathrm{vec}(H(\theta)'X_i) = \begin{pmatrix} H(\theta)'x_{i1} \\ \vdots \\ H(\theta)'x_{iK} \end{pmatrix}.
\]
$H(\theta)'x_{ik}$ is a $(T-p_0) \times 1$ vector, so its gradient follows the same form as equation (.0.4). Thus
\[
\nabla_\theta\mathrm{vec}(H(\theta)'X_i) = \begin{pmatrix} x_{i1}^{*\prime} \otimes I_{T-p_0} \\ \vdots \\ x_{iK}^{*\prime} \otimes I_{T-p_0} \end{pmatrix}.
\]
Filling these gradients into equation (.0.2) gives the final answer. $\square$

Returning to the main proof of asymptotic normality, the pooled QLD estimator can be written as
\[
\sqrt{N}(\hat\beta_{QLDP} - \beta_0) = \Bigl(\frac{1}{N}\sum_{i=1}^N X_i'\hat H\hat H'X_i\Bigr)^{-1}\Bigl(\frac{1}{\sqrt{N}}\sum_{i=1}^N X_i'\hat H\hat H'(F_0\gamma_i + u_i)\Bigr).
\]
As before, the denominator equals $A_P$ up to an $O_p(N^{-1/2})$ term. The inverse exists with probability approaching one by condition (1) of the theorem. Thus asymptotic normality depends on the numerator. Write the full error as $\xi_i = F_0\gamma_i + u_i$, so that we study the asymptotic distribution of $\frac{1}{\sqrt{N}}\sum_{i=1}^N X_i'\hat H\hat H'\xi_i$. A mean value expansion about $\theta_0$ gives
\[
\frac{1}{\sqrt{N}}\sum_{i=1}^N X_i'\hat H\hat H'\xi_i = \frac{1}{\sqrt{N}}\sum_{i=1}^N V_i'H_0H_0'u_i + G_P\sqrt{N}(\hat\theta - \theta_0) + o_p(1),
\]
where $G_P = E(\nabla_\theta X_i'H_0H_0'\xi_i)$, which is derived explicitly in Lemma .0.1. The estimator $\hat\theta$ is derived in Theorem 3.3.1 as based on the moments $E(\mathrm{vec}(H_0'Z_i)) = 0$.
$\hat{\theta}$ is a GMM estimator using the optimal weight matrix $\hat{A}_{\theta} = \frac{1}{N}\sum_{i=1}^N \mathrm{vec}(\tilde{H}'Z_i)\mathrm{vec}(\tilde{H}'Z_i)'$, where $\tilde{H} = H(\tilde{\theta})$ uses an initial estimator. The first order conditions of the GMM optimization problem give
\[
\left(\sum_{i=1}^N \nabla_{\theta}\,\mathrm{vec}(\hat{H}'Z_i)\right)'\hat{A}_{\theta}^{-1}\left(\sum_{i=1}^N \mathrm{vec}(\hat{H}'Z_i)\right) = 0
\]
where
\[
\nabla_{\theta}\,\mathrm{vec}(\hat{H}'Z_i) = \begin{pmatrix} z_{i,1}^{*\prime} \otimes I_{T-p_0} \\ \vdots \\ z_{i,K+1}^{*\prime} \otimes I_{T-p_0} \end{pmatrix}
\]
comes from Lemma 3.3.1. Interestingly, this gradient is free of any parameters and is thus the same regardless of the estimator. Write $D_{\theta} = E(\nabla_{\theta}\,\mathrm{vec}(H_0'Z_i))$ and $A_{\theta} = E(\mathrm{vec}(H_0'Z_i)\mathrm{vec}(H_0'Z_i)')$, the notation from Theorem 3.3.1. Another standard mean value expansion gives
\[
\sqrt{N}(\hat{\theta} - \theta_0) = \frac{1}{\sqrt{N}}\sum_{i=1}^N (D_{\theta}'A_{\theta}^{-1}D_{\theta})^{-1}D_{\theta}'A_{\theta}^{-1}\mathrm{vec}(H_0'Z_i) + o_p(1) \tag{.0.5}
\]
which allows us to write the estimator as
\[
\sqrt{N}(\hat{\beta}_{QLDP} - \beta_0) = A_P^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^N \left(V_i'H_0H_0'u_i + G_P\, r_i(\theta_0)\right) + o_p(1) \tag{.0.6}
\]
where $r_i(\theta_0) = (D_{\theta}'A_{\theta}^{-1}D_{\theta})^{-1}D_{\theta}'A_{\theta}^{-1}\mathrm{vec}(H_0'Z_i)$. Thus we have
\[
\sqrt{N}(\hat{\beta}_{QLDP} - \beta_0) \overset{d}{\to} N\left(0, A_P^{-1}B_P A_P^{-1}\right) \tag{.0.7}
\]
where $B_P = E\left((V_i'H_0H_0'u_i + G_P\, r_i(\theta_0))(V_i'H_0H_0'u_i + G_P\, r_i(\theta_0))'\right)$. $\square$

Proof of Theorem 3.4.2

Now the asymptotic variance depends only on the moments $E(H_0'V_i) = 0$.

Lemma .0.2.
Suppose Assumption 2 holds and $\mathrm{Rk}(E(\Gamma_i)) = p_0$, and let $\hat{\theta}$ be the GMM estimator based off of $E(\mathrm{vec}(H_0'X_i)) = E(\mathrm{vec}(H_0'V_i)) = 0$ using a consistent estimator of the optimal weight matrix. Then
\[
\sqrt{N}(\hat{\theta} - \theta_0) \overset{d}{\to} N\left(0, (D_{x,\theta}'A_{x,\theta}^{-1}D_{x,\theta})^{-1}\right)
\]
with influence function $r_{x,i}(\theta_0) = (D_{x,\theta}'A_{x,\theta}^{-1}D_{x,\theta})^{-1}D_{x,\theta}'A_{x,\theta}^{-1}\mathrm{vec}(H_0'V_i)$, where $A_{x,\theta} = E(\mathrm{vec}(H_0'V_i)\mathrm{vec}(H_0'V_i)')$ and $D_{x,\theta} = E(\nabla_{\theta}\,\mathrm{vec}(H_0'V_i))$ is derived in Lemma .0.1. $\square$

Proof of Theorem 3.4.3

I first consider the proof of consistency. Facts about uniform convergence shown for consistency will be taken for granted in the proof of asymptotic normality. As a technical aside, I do not differentiate between the Euclidean vector norm and the Frobenius matrix norm in terms of notation. This does not affect the proof, as the two norms are compatible in the sense that $\|Ax\|_E \le \|A\|_F\|x\|_E$, where $A$ is an $n \times m$ matrix, $x$ is an $m \times 1$ vector, and the $F$ and $E$ subscripts refer to Frobenius and Euclidean respectively. Further, both norms are submultiplicative, so the distinction does not matter for the purposes of this proof; the intended norm should be clear from context. Finally, all statements involving random quantities are assumed to hold almost surely unless stated otherwise.
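The compatibility and submultiplicativity claims can be spot-checked numerically; they are generic properties of the Euclidean and Frobenius norms, not specific to the model:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    A = rng.standard_normal((4, 3))
    x = rng.standard_normal(3)
    B = rng.standard_normal((3, 5))
    # Compatibility: ||Ax||_E <= ||A||_F ||x||_E
    assert np.linalg.norm(A @ x) <= np.linalg.norm(A, "fro") * np.linalg.norm(x) + 1e-12
    # Submultiplicativity: ||AB||_F <= ||A||_F ||B||_F
    assert np.linalg.norm(A @ B, "fro") <= np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro") + 1e-12
```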
The QLDMG estimator can be written as
\[
\hat{\beta}_{QLDMG} - \beta_0 = \frac{1}{N}\sum_{i=1}^N \left((X_i'\hat{H}\hat{H}'X_i)^{-1}X_i'\hat{H}\hat{H}'(F_0\gamma_i + u_i) + b_i\right)
\]
\[
= \frac{1}{N}\sum_{i=1}^N (X_i'\hat{H}\hat{H}'X_i)^{-1}X_i'\hat{H}\hat{H}'(F_0\gamma_i + u_i) + O_p(N^{-1/2})
\]
where $\hat{H} = H(\hat{\theta})$, $\hat{\theta} \overset{p}{\to} \theta_0$ by Theorem 1, and $\frac{1}{N}\sum_{i=1}^N b_i = O_p(N^{-1/2})$ by the CLT. Consistency of the QLDMG estimator therefore does not depend on the correlation between $b_i$ and $(X_i, \gamma_i, u_i)$. However, since the rate of convergence is $\sqrt{N}$, this correlation will affect the asymptotic distribution; that fact is handled later in the proof. I write $Z_i(\theta) = (X_i'H(\theta)H(\theta)'X_i)^{-1}X_i'H(\theta)H(\theta)'(F_0\gamma_i + u_i)$ for convenience. The goal of this section is to show that
\[
\frac{1}{N}\sum_{i=1}^N Z_i(\hat{\theta}) \overset{p}{\to} E(Z_i(\theta_0)) = 0 \tag{.0.8}
\]
By Theorem 21.6 of Davidson (1994), the convergence result in equation (.0.8) is implied by the conditions:
\[
\hat{\theta} \overset{p}{\to} \theta_0 \tag{.0.9}
\]
\[
\sup_{\theta \in B_0}\left\|\frac{1}{N}\sum_{i=1}^N \left(Z_i(\theta) - E(Z_i(\theta))\right)\right\| = o_p(1), \quad \text{where } B_0 \text{ is some open set about } \theta_0 \tag{.0.10}
\]
and where $\|\cdot\|$ denotes the Euclidean $L_2$ norm for vectors and the Frobenius norm for matrices. Consistency of $\hat{\theta}$ holds by Theorem 1, so that uniform convergence is the only condition which needs to be verified. I show uniform convergence via a traditional argument which demonstrates both pointwise convergence in probability and stochastic equicontinuity (SE). Pointwise convergence in probability follows from the WLLN by the moment bounds and sampling assumptions in Assumption 3(2).
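A minimal simulation sketch of the mean-group construction above, assuming the QLD form $H(\theta)' = [I, \Theta]$ and treating $\theta_0$ as known (in the proof it is replaced by $\hat{\theta}$); the factor is normalized so its last $p_0$ rows equal $I_{p_0}$, and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, p0, K = 4000, 8, 1, 2
m = T - p0
beta0 = np.array([1.0, -0.5])

# Factor normalized so its last p0 rows are I_{p0}; then
# H0' = [I_m, -F[:m]] satisfies H0'F = 0 (the QLD annihilator)
F = np.vstack([rng.standard_normal((m, p0)), np.eye(p0)])
H0 = np.hstack([np.eye(m), -F[:m]]).T          # T x m
P = H0 @ H0.T

X = rng.standard_normal((N, T, K))
b = 0.3 * rng.standard_normal((N, K))          # heterogeneous slopes: beta_i = beta0 + b_i
gam = rng.standard_normal((N, p0))
u = rng.standard_normal((N, T))
y = np.einsum("ntk,nk->nt", X, beta0 + b) + gam @ F.T + u

# Mean group: average the unit-level QLD least-squares coefficients
betas = np.empty((N, K))
for i in range(N):
    betas[i] = np.linalg.solve(X[i].T @ P @ X[i], X[i].T @ P @ y[i])
beta_mg = betas.mean(axis=0)
assert np.max(np.abs(beta_mg - beta0)) < 0.05  # consistent for beta0
```

Because $H_0'F = 0$ by construction, the factor term drops out of every unit-level regression, which is exactly the mechanism the proof exploits.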
{๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ ๐‘ฟ๐‘– }๐‘–โ‰ฅ1 is a sequence of positive definite 104 random matrices for all possible values of ๐œฝ by condition (1) of the theorem. Thus for each ๐œฝ, ๐‘ {๐’๐‘– (๐œฝ)}๐‘–โ‰ฅ1 is well-defined and iid. By the WLLN, ๐‘1 ๐‘–=1 ร๐‘ ๐’๐‘– (๐œฝ) โ†’ ๐ธ ( ๐’๐‘– (๐œฝ)) which is 0 when ๐œฝ = ๐œฝ0. For the purpose of verifying SE of the random sequence, I show that the following Lipschitz condition of Theorem 21.11 from Davidson (1994) holds: for some random sequence {๐ต ๐‘ ๐‘– }๐‘–โ‰ฅ1 with bounded expectations and real function โ„Ž such that โ„Ž(๐‘ฅ) โ†’ 0 as ๐‘ฅ โ†’ 0, there exists ๐‘› โˆˆ N such that 1 ยค โˆ’ ๐ธ (๐’๐‘– ( ๐œฝ))) ยค ( ๐’๐‘– (๐œฝ) โˆ’ ๐ธ ( ๐’๐‘– (๐œฝ))) โˆ’ ( ๐’๐‘– ( ๐œฝ) โ‰ค ๐ต ๐‘ ๐‘– โ„Ž(โˆฅ๐œฝ โˆ’ ๐œฝ โ€ฒ โˆฅ) (.0.11) ๐‘ for all ๐œฝ, ๐œฝยค โˆˆ T and ๐‘ โ‰ฅ ๐‘›, where all stated inequalities hold almost surely as stated above. I start with the stochastic component ๐’๐‘– (๐œฝ) โˆ’ ๐’๐‘– ( ๐œฝ). ยค It will make sense to write ๐’๐‘– (๐œฝ) = ๐‘จ(๐œฝ) โˆ’1 ๐‘ฉ(๐œฝ) where ๐‘จ๐‘– (๐œฝ) = ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ ๐‘ฟ๐‘– ๐‘ฉ๐‘– (๐œฝ) = ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ (๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– ) We then have ยค = ๐‘จ๐‘– (๐œฝ) โˆ’1 ๐‘ฉ๐‘– (๐œฝ) โˆ’ ๐‘จ๐‘– ( ๐œฝ) ๐’๐‘– (๐œฝ) โˆ’ ๐’๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ๐‘– ( ๐œฝ) ยค โ‰ค ๐‘จ๐‘– (๐œฝ) โˆ’1 ๐‘ฉ๐‘– (๐œฝ) โˆ’ ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ๐‘– (๐œฝ) + ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ(๐œฝ) โˆ’ ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ( ๐œฝ) ยค We can bound the second normed value on the right-hand side. Let ๐‘ซ (๐œฝ, ๐œฝ) ยค = ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ โˆ’ ยค ๐‘ฏ( ๐œฝ)๐‘ฏ( ยค โ€ฒ. The Frobenius norm of a matrix is equal to the square root of the sum of its squared ๐œฝ) singular values (see, for example, Horn and Johnson (2013)). 
Thus ๐‘จ(๐œฝ) โˆ’1 = ๐‘Ž๐‘– (๐œฝ) > 0 and we have ยค โˆ’1 ๐‘ฉ๐‘– (๐œฝ) โˆ’ ๐‘จ๐‘– ( ๐œฝ) ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ๐‘– ( ๐œฝ) ยค = ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 (๐‘ฉ๐‘– (๐œฝ) โˆ’ ๐‘ฉ๐‘– ( ๐œฝ)) ยค ยค ๐‘ฟ โ€ฒ ๐‘ซ (๐œฝ, ๐œฝ)(๐‘ญ๐œธ โ‰ค ๐‘Ž๐‘– ( ๐œฝ) ยค ๐‘– + ๐’–๐‘– ) ๐‘– ยค โˆฅ ๐‘ฟ๐‘– โˆฅ โˆฅ๐‘ญ๐œธ๐‘– + ๐’–๐‘– โˆฅ ๐‘ซ (๐œฝ, ๐œฝ) โ‰ค ๐‘Ž๐‘– ( ๐œฝ) ยค 105 Turning now to the other term from the triangle inequality, note that condition (1) of the theorem implies ๐‘จ(๐œฝ) is nonsingular for any ๐œฝ in the parameter space. Then   ๐‘จ๐‘– (๐œฝ) โˆ’1 ๐‘ฉ๐‘– (๐œฝ) โˆ’ ๐‘จ๐‘– ( ๐œฝ)ยค โˆ’1 ๐‘ฉ๐‘– (๐œฝ) = ๐‘จ๐‘– (๐œฝ) โˆ’1 โˆ’ ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘ฉ๐‘– (๐œฝ)   = ยค โˆ’1 ๐‘จ๐‘– ( ๐œฝ) ๐‘จ๐‘– ( ๐œฝ) ยค ๐‘จ๐‘– (๐œฝ) โˆ’1 โˆ’ ๐‘จ๐‘– ( ๐œฝ)ยค โˆ’1 ๐‘จ๐‘– (๐œฝ) ๐‘จ๐‘– (๐œฝ) โˆ’1 ๐‘ฉ๐‘– (๐œฝ) ยค โˆ’1 ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’ ๐‘จ๐‘– (๐œฝ) ๐‘จ๐‘– (๐œฝ) โˆ’1 ๐‘ฉ๐‘– (๐œฝ)  = ๐‘จ๐‘– ( ๐œฝ) โ‰ค ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’ ๐‘จ๐‘– (๐œฝ) ๐‘จ๐‘– (๐œฝ) โˆ’1 โˆฅ ๐‘ฉ๐‘– (๐œฝ) โˆฅ As before, ๐‘จ๐‘– ( ๐œฝ) ยค โˆ’1 ๐‘จ๐‘– (๐œฝ) โˆ’1 = ๐‘Ž๐‘– ( ๐œฝ)๐‘Ž ยค ๐‘– (๐œฝ). โˆฅ ๐‘ฉ๐‘– (๐œฝ) โˆฅ = ๐‘ฟ๐‘–โ€ฒ ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ (๐‘ญ๐œธ๐‘– + ๐’–๐‘– ) where (๐‘ญ๐œธ๐‘– + ๐’–๐‘– ) ๐‘ฟ๐‘–โ€ฒ is bounded in expectation. Condition (3) implies that sup๐œฝโˆˆT โˆฅ๐‘ฏ(๐œฝ)๐‘ฏ(๐œฝ) โ€ฒ โˆฅ < ๐œ for some ๐œ < โˆž. Finally note that ยค โˆ’ ๐‘จ๐‘– (๐œฝ) = ๐‘ฟ โ€ฒ ๐‘ซ ( ๐œฝ, ๐‘จ๐‘– ( ๐œฝ) ยค ๐œฝ) ๐‘ฟ๐‘– ๐‘– โ‰ค โˆฅ ๐‘ฟ๐‘– โˆฅ 2 ๐‘ซ (๐œฝ, ๐œฝ) ยค as ๐‘ซ (๐œฝ, ๐œฝ)ยค = โˆ’๐‘ซ ( ๐œฝ, ยค ๐œฝ). Putting everything together yields 1 ยค โ‰ค 1 ๐‘Ž๐‘– ( ๐œฝ)  ยค โˆฅ ๐‘ฟ๐‘– โˆฅ โˆฅ(๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– )โˆฅ + ๐œ๐‘Ž๐‘– ( ๐œฝ)๐‘Ž  ยค ยค ๐‘– (๐œฝ) โˆฅ ๐‘ฟ๐‘– โˆฅ 3 โˆฅ (๐‘ญ0 ๐œธ๐‘– + ๐’–๐‘– ) โˆฅ ๐‘ซ (๐œฝ, ๐œฝ) ๐’๐‘– (๐œฝ) โˆ’ ๐’๐‘– ( ๐œฝ) ๐‘ ๐‘ Clearly ๐‘ซ (๐œฝ, ๐œฝ) ยค โ†’ 0 as ๐œฝ โˆ’ ๐œฝยค โ†’ 0. 
In the language of Davidson's Theorem 21.11,
\[
\sum_{i=1}^N B_{Ni} = \frac{1}{N}\sum_{i=1}^N \|X_i\|\,\|F_0\gamma_i + u_i\|\, a_i(\dot{\theta})\left(1 + \tau a_i(\theta)\|X_i\|^2\right)
\]
The random variables here have identical moments by Assumption 2(2), and the bound on $a_i(\theta)$ holds uniformly over $\mathcal{T}$ by condition (2), so that
\[
E\left(\sum_{i=1}^N B_{Ni}\right) = E\left(\|X_i\|\,\|F_0\gamma_i + u_i\|\, a_i(\dot{\theta})\left(1 + \tau a_i(\theta)\|X_i\|^2\right)\right) = O(1)
\]
as the expectation is finite. Looking at equation (.0.11), we have
\[
\left\|\left(Z_i(\theta) - E(Z_i(\theta))\right) - \left(Z_i(\dot{\theta}) - E(Z_i(\dot{\theta}))\right)\right\| \le \left\|Z_i(\theta) - Z_i(\dot{\theta})\right\| + \left\|E\left(Z_i(\theta) - Z_i(\dot{\theta})\right)\right\|
\]
As norms are convex, $\|E(Z_i(\theta) - Z_i(\dot{\theta}))\| \le E(\|Z_i(\theta) - Z_i(\dot{\theta})\|)$, which is bounded by the same argument as above. I have thus verified SE, and so $\hat{\beta}_{QLDMG} - \beta_0 = o_p(1)$.

Turning to asymptotic normality, I need a lemma on the mean value expansion of the QLDMG estimator like in Theorem 3.3.4.

Lemma .0.3. Let $\epsilon_i = X_i b_i + F_0\gamma_i + u_i$. Then
\[
\begin{aligned}
\nabla_{\theta}\left((X_i'H_0H_0'X_i)^{-1}X_i'H_0H_0'\epsilon_i\right) = &- \left(\epsilon_i'H_0H_0'V_i \otimes I_K\right)\left((V_i'H_0H_0'V_i)^{-1} \otimes (V_i'H_0H_0'V_i)^{-1}\right)(I_{K^2} + K_K)(I_K \otimes V_i'H_0)\begin{pmatrix} x_{i1}^{*\prime} \otimes I_{T-p_0} \\ \vdots \\ x_{iK}^{*\prime} \otimes I_{T-p_0} \end{pmatrix} \\
&+ (V_i'H_0H_0'V_i)^{-1}(I_K \otimes \epsilon_i'H_0)\begin{pmatrix} x_{i1}^{*\prime} \otimes I_{T-p_0} \\ \vdots \\ x_{iK}^{*\prime} \otimes I_{T-p_0} \end{pmatrix} \\
&+ (V_i'H_0H_0'V_i)^{-1}V_i'H_0\left(\epsilon_i^{*\prime} \otimes I_{T-p_0}\right)
\end{aligned}
\]
where $K_K$ is the $K^2 \times K^2$ commutation matrix.

Proof. Like in Lemma .0.1, I omit the factor structure $X_i = F_0\Gamma_i + V_i$ and derive the above form with respect to just $X_i$; the factor structure is substituted in after the lemma. Assumption 2 and conditions (1) and (2) imply that the inverse of $X_i'H(\theta)H(\theta)'X_i$ is differentiable about $\theta_0$. Proposition 5.16 of Dhrymes (2013) gives
\[
\nabla_{\theta}\left((X_i'H_0H_0'X_i)^{-1}\right) = -\left((X_i'H_0H_0'X_i)^{-1} \otimes (X_i'H_0H_0'X_i)^{-1}\right)\nabla_{\theta}\,\mathrm{vec}\left(X_i'H_0H_0'X_i\right)
\]
The differential of $X_i'H(\theta)H(\theta)'X_i$ can be worked out via 13.19(b) of Abadir and Magnus (2005):
\[
d\,\mathrm{vec}\left(X_i'H(\theta)H(\theta)'X_i\right) = (I_{K^2} + K_K)\left(I_K \otimes X_i'H(\theta)\right)d\,\mathrm{vec}\left(H(\theta)'X_i\right)
\]
The associated gradient was worked out in the proof of Theorem 3.3.4. Thus we have
\[
\nabla_{\theta}\left((X_i'H_0H_0'X_i)^{-1}\right) = -\left((X_i'H_0H_0'X_i)^{-1} \otimes (X_i'H_0H_0'X_i)^{-1}\right)(I_{K^2} + K_K)(I_K \otimes X_i'H_0)\begin{pmatrix} x_{i1}^{*\prime} \otimes I_{T-p_0} \\ \vdots \\ x_{iK}^{*\prime} \otimes I_{T-p_0} \end{pmatrix}
\]
The product rule for the gradient is given in Proposition 5.4 of Dhrymes (2013), and the gradient $\nabla_{\theta} X_i'H_0H_0'\epsilon_i$ comes from Lemma .0.1 in the proof of Theorem 3.3.4.
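The Abadir and Magnus differential identity used in the proof can be checked by finite differences. The sketch again assumes the QLD form $H(\theta)' = [I, \Theta]$ implied by equation (.0.3); the commutation matrix below satisfies $K_K\,\mathrm{vec}(M) = \mathrm{vec}(M')$ for $K \times K$ matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
T, p0, K = 6, 2, 3
m = T - p0

def H(theta):
    # H(theta)' = [I_{T-p0}, Theta], theta = vec(Theta)
    Theta = theta.reshape(p0, m).T
    return np.hstack([np.eye(m), Theta]).T

def commutation(K):
    # K_K: the K^2 x K^2 matrix with K_K @ vec(M) = vec(M')
    Kmat = np.zeros((K * K, K * K))
    for i in range(K):
        for j in range(K):
            Kmat[j * K + i, i * K + j] = 1.0
    return Kmat

theta = rng.standard_normal(m * p0)
X = rng.standard_normal((T, K))

def f(th):
    Ht = H(th)
    return (X.T @ Ht @ Ht.T @ X).flatten(order="F")   # vec(X'HH'X)

eps = 1e-6
J = np.column_stack([
    (f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
    for e in np.eye(m * p0)
])

# Closed form: (I_{K^2} + K_K)(I_K kron X'H) applied to the stacked
# blocks x*_k' kron I_{T-p0} of the gradient of vec(H'X) (Lemma .0.1)
Gx = np.vstack([np.kron(X[m:, k][None, :], np.eye(m)) for k in range(K)])
J_closed = (np.eye(K * K) + commutation(K)) @ np.kron(np.eye(K), X.T @ H(theta)) @ Gx
assert np.max(np.abs(J - J_closed)) < 1e-5
```

Since $\mathrm{vec}(X'HH'X)$ is quadratic in $\theta$, the central differences match the closed form up to roundoff.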
$\square$

The $\sqrt{N}$-normalized estimator is
\[
\sqrt{N}(\hat{\beta}_{QLDMG} - \beta_0) = \frac{1}{\sqrt{N}}\sum_{i=1}^N (X_i'\hat{H}\hat{H}'X_i)^{-1}X_i'\hat{H}\hat{H}'\epsilon_i
\]
where $\epsilon_i = X_i b_i + F_0\gamma_i + u_i$. I write the estimator in terms of its full error because the asymptotic variance generally depends on the correlation between $b_i$ and the other terms. I derive the asymptotic variance in full, with a simpler form available under stronger exogeneity conditions. Applying a mean value expansion to the above sum gives
\[
\frac{1}{\sqrt{N}}\sum_{i=1}^N (X_i'\hat{H}\hat{H}'X_i)^{-1}X_i'\hat{H}\hat{H}'\epsilon_i = \frac{1}{\sqrt{N}}\sum_{i=1}^N (V_i'H_0H_0'V_i)^{-1}V_i'H_0H_0'\epsilon_i + G_{MG}\sqrt{N}(\hat{\theta} - \theta_0) + o_p(1)
\]
where $G_{MG}$ comes from Lemma .0.3. Thus
\[
\sqrt{N}(\hat{\beta}_{QLDMG} - \beta_0) = \frac{1}{\sqrt{N}}\sum_{i=1}^N \left((V_i'H_0H_0'V_i)^{-1}V_i'H_0H_0'\epsilon_i + G_{MG}\, r_{x,i}(\theta_0)\right) + o_p(1) \tag{.0.12}
\]
where $r_{x,i}(\theta_0) = (D_{x,\theta}'A_{x,\theta}^{-1}D_{x,\theta})^{-1}D_{x,\theta}'A_{x,\theta}^{-1}\mathrm{vec}(H_0'V_i)$ comes from Lemma .0.2. We then have
\[
\sqrt{N}(\hat{\beta}_{QLDMG} - \beta_0) \overset{d}{\to} N(0, B_{MG}) \tag{.0.13}
\]
where $B_{MG} = \mathrm{Var}\left((V_i'H_0H_0'V_i)^{-1}V_i'H_0H_0'\epsilon_i + G_{MG}\, r_{x,i}(\theta_0)\right)$. $\square$

APPENDIX

ADDITIONAL TABLES FOR CHAPTER 3

I now present additional simulations comparing the pooled CCE and QLD estimators. Table .1 gives results for $K = 2$ and $p_0 = 2$ but for larger values of $T$.
Table .1: Pooled estimators, K = 2

                 Bias               SD                RMSE
                 CCEP     QLDP     CCEP     QLDP     CCEP     QLDP
 N = 50
 T = 6          0.0128   0.0074   0.1028   0.0956   0.1036   0.0959
                0.0128   0.0132   0.1019   0.1025   0.1027   0.1034
 T = 7          0.0146   0.0102   0.0994   0.1222   0.1004   0.1226
                0.0150   0.0096   0.0910   0.1191   0.0922   0.1194
 T = 8          0.0105   0.0061   0.0873   0.0886   0.0879   0.0888
                0.0166   0.0086   0.0855   0.0852   0.0871   0.0856
 N = 300
 T = 6          0.0029   0.0015   0.0405   0.0392   0.0406   0.0392
                0.0039   0.0013   0.0416   0.0406   0.0418   0.0406
 T = 7          0.0016   0.0001   0.0376   0.0477   0.0377   0.0477
                0.0021  -0.0001   0.0374   0.0450   0.0374   0.0450
 T = 8          0.0020   0.0009   0.0344   0.0348   0.0344   0.0349
                0.0010   0.0001   0.0345   0.0344   0.0346   0.0344

Both estimators perform poorly when N = 50, with CCEP outperforming QLDP in terms of SD most clearly at T = 7. Interestingly, QLDP's bias seems to decrease as T gets larger, despite the fact that the number of parameters increases linearly in T for fixed p_0. Generally, the differences in bias are small, and where CCEP has a smaller RMSE it is due to its reduced SD. Table .2 performs the same simulations but for K = 3. In these cases, QLDP has the smaller SD, most likely due to the fact that the additional covariates provide information which the QLD transformation can exploit.
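As a consistency check on the reported columns, they satisfy RMSE² = Bias² + SD² up to rounding; for example, the first Table .1 row:

```python
import math

# First row of Table .1 (N = 50, T = 6), CCEP: bias 0.0128, SD 0.1028
bias, sd = 0.0128, 0.1028
rmse = math.sqrt(bias**2 + sd**2)
assert round(rmse, 4) == 0.1036   # matches the reported RMSE

# Same row, QLDP: bias 0.0074, SD 0.0956
assert round(math.sqrt(0.0074**2 + 0.0956**2), 4) == 0.0959
```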
Table .2: Pooled estimators, K = 3

                 Bias               SD                RMSE
                 CCEP     QLDP     CCEP     QLDP     CCEP     QLDP
 N = 50
 T = 6          0.0115   0.0055   0.1174   0.1010   0.1179   0.1012
                0.0207   0.0131   0.1143   0.1024   0.1161   0.1032
               -0.0041  -0.0009   0.1151   0.1001   0.1151   0.1001
 T = 7          0.0184   0.0127   0.0991   0.1255   0.1008   0.1261
                0.0218   0.0079   0.1009   0.1245   0.1033   0.1247
               -0.0054  -0.0022   0.0998   0.1157   0.0999   0.1157
 T = 8          0.0151   0.0122   0.0883   0.0867   0.0896   0.0875
                0.0095   0.0084   0.0896   0.0873   0.0901   0.0877
                0.0015  -0.0041   0.0895   0.0870   0.0895   0.0871
 N = 300
 T = 6          0.0034   0.0024   0.0451   0.0374   0.0452   0.0375
               -0.0001   0.0007   0.0468   0.0404   0.0468   0.0404
                0.0001  -0.0016   0.0440   0.0391   0.0440   0.0391
 T = 7          0.0038   0.0021   0.0385   0.0468   0.0387   0.0468
                0.0048   0.0010   0.0381   0.0448   0.0384   0.0448
                0.0005   0.0016   0.0382   0.0461   0.0382   0.0461
 T = 8          0.0005  -0.0002   0.0352   0.0347   0.0352   0.0347
                0.0042   0.0015   0.0364   0.0336   0.0367   0.0336
                0.0000   0.0012   0.0351   0.0344   0.0351   0.0344

BIBLIOGRAPHY

Abadir, K. M., & Magnus, J. R. (2005). Matrix algebra (Vol. 1). Cambridge University Press.

Andrews, D. W. K. (2005). Cross-section regression with common shocks. Econometrica, 73, 1551–1585.

Ahn, S. C. (2015). Comment on "IV estimation of panels with factor residuals" by D. Robertson and V. Sarafidis. Journal of Econometrics, 185, 542–544. https://doi.org/10.1016/j.jeconom.2014.12.002

Ahn, S. C., Lee, Y. H., & Schmidt, P. (2013). Panel data models with multiple time-varying individual effects. Journal of Econometrics, 174, 1–14. https://doi.org/10.1016/j.jeconom.2012.12.002

Ahn, S. C., & Schmidt, P. (1997). Efficient estimation of dynamic panel data models: Alternative assumptions and simplified estimation. Journal of Econometrics, 76, 309–321.

Amsler, C., Lee, Y. H., & Schmidt, P. (2009). A survey of stochastic frontier models and likely future developments. Seoul Journal of Economics, 22(1).

Arellano, M., & Bover, O. (1995). Another look at the instrumental variable estimation of error-components models.
Journal of Econometrics, 68(1), 29–51.

Arellano, M., Hahn, J. et al. (2005). Understanding bias in nonlinear panel models: Some recent developments (tech. rep.). Mimeo, CEMFI.

Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica, 71(1), 135–171.

Bai, J. (2009). Panel data models with interactive fixed effects. Econometrica, 77(4), 1229–1279.

Breitung, J., & Hansen, P. (2021). Alternative estimation approaches for the factor augmented panel data model with small T. Empirical Economics, 60, 327–351. https://doi.org/10.1007/s00181-020-01948-7

Breitung, J., & Salish, N. (2021). Estimation of heterogeneous panels with systematic slope variations. Journal of Econometrics, 220, 399–415. https://doi.org/10.1016/j.jeconom.2020.04.007

Breusch, T., Qian, H., Schmidt, P., & Wyhowski, D. J. (1997). Redundancy of moment conditions. Journal of Econometrics, 91.

Brown, N. (2021). Information equivalence among transformations of semiparametric nonlinear panel data models. https://www.researchgate.net/publication/344047637_Information-equivalence_among_transformations_of_semiparametric_nonlinear_panel_data_models

Brown, N., Schmidt, P., & Wooldridge, J. M. (2021). Simple alternatives to the common correlated effects model. https://doi.org/10.13140/RG.2.2.12655.76969/1

Brown, N. L., & Wooldridge, J. M. (2021). More efficient estimation of multiplicative panel data models in the presence of serial correlation. Manuscript submitted for publication.

Campello, M., Galvao, A. F., & Juhl, T. (2019). Testing for slope heterogeneity bias in panel data models. Journal of Business and Economic Statistics, 37, 749–760. https://doi.org/10.1080/07350015.2017.1421545

Castillo, J. C., Mejía, D., & Restrepo, P. (2020). Scarcity without Leviathan: The violent effects of cocaine supply shortages in the Mexican drug war. Review of Economics and Statistics, 102(2), 269–286.

Chamberlain, G. (1980). Analysis of covariance with qualitative data.
Review of Economic Studies, 47, 225–238.

Chamberlain, G. (1987). Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3), 305–334.

Chamberlain, G. (1992). Efficiency bounds for semiparametric regression. Econometrica, 60(3), 567–596.

Chen, M., Fernández-Val, I., & Weidner, M. (2014). Nonlinear factor models for network and panel data. arXiv preprint arXiv:1412.5647.

Chudik, A., & Pesaran, M. H. (2015). Common correlated effects estimation of heterogeneous dynamic panel data models with weakly exogenous regressors. Journal of Econometrics, 188, 393–420. https://doi.org/10.1016/j.jeconom.2015.03.007

Davidson, J. (1994). Stochastic limit theory: An introduction for econometricians. Oxford University Press. https://doi.org/10.1093/0198774036.001.0001

Vos, I. D., & Everaert, G. (2021). Bias-corrected common correlated effects pooled estimation in dynamic panels. Journal of Business and Economic Statistics, 39, 294–306. https://doi.org/10.1080/07350015.2019.1654879

Vos, I. D., & Westerlund, J. (2019). On CCE estimation of factor-augmented models when regressors are not linear in the factors. Economics Letters, 178, 5–7. https://doi.org/10.1016/j.econlet.2019.02.001

Dhrymes, P. J. (2013). Mathematics for econometrics. Springer Science & Business Media.

Fernández-Val, I., & Weidner, M. (2018). Fixed effects estimation of large-T panel data models. Annual Review of Economics, 10, 109–138.

Fischer, S., Royer, H., & White, C. (2018). The impacts of reduced access to abortion and family planning services on abortions, births, and contraceptive purchases. Journal of Public Economics, 167, 43–68.

Hahn, J. (1997). A note on the efficient semiparametric estimation of some exponential panel models. Econometric Theory, 13(4), 583–588.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50, 1029–1054.
https://doi.org/10.2307/1912775

Hardin, J. W., & Hilbe, J. M. (2012). Generalized estimating equations (2nd ed.). London: Chapman & Hall.

Hausman, J., Hall, B. H., & Griliches, Z. (1984). Econometric models for count data with an application to the patents-R&D relationship. Econometrica, 52(4), 909–938.

Hayakawa, K. (2012). GMM estimation of short dynamic panel data models with interactive fixed effects. Journal of the Japan Statistical Society, 42, 109–123.

Hayakawa, K. (2016). Identification problem of GMM estimators for short panel data models with interactive fixed effects. Economics Letters, 139, 22–26. https://doi.org/10.1016/j.econlet.2015.12.012

Horn, R. A., & Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.

Hsiao, C. (2018). Panel models with interactive effects. Journal of Econometrics, 206, 645–673. https://doi.org/10.1016/j.jeconom.2018.06.017

Im, K. S., Ahn, S. C., Schmidt, P., & Wooldridge, J. M. (1999). Efficient estimation of panel data models with strictly exogenous explanatory variables. Journal of Econometrics, 93(1), 177–201.

Juhl, T., & Lugovskyy, O. (2014). A test for slope heterogeneity in fixed effects models. Econometric Reviews, 33, 906–935. https://doi.org/10.1080/07474938.2013.806708

Juodis, A., & Sarafidis, V. (2018). Fixed T dynamic panel data estimators with multifactor errors. Econometric Reviews, 37, 893–929. https://doi.org/10.1080/00927872.2016.1178875

Juodis, A., & Sarafidis, V. (2020). A linear estimator for factor-augmented fixed-T panels with endogenous regressors. Journal of Business and Economic Statistics. https://doi.org/10.1080/07350015.2020.1766469

Juodis, A., & Sarafidis, V. (2021). An incidental parameters free inference approach for panels with common shocks. Journal of Econometrics. https://doi.org/10.1016/j.jeconom.2021.03.011

Karabiyik, H., Reese, S., & Westerlund, J. (2017).
On the role of the rank condition in CCE estimation of factor-augmented panel regressions. Journal of Econometrics, 197(1), 60–64.

Krapf, M., Ursprung, H. W., & Zimmermann, C. (2017). Parenthood and productivity of highly skilled labor: Evidence from the groves of academe. Journal of Economic Behavior & Organization, 140, 147–175.

Liang, K.-Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.

McCabe, M. J., & Snyder, C. M. (2014). Identifying the effect of open access on citations using a panel of science journals. Economic Inquiry, 52(4), 1284–1300.

McCabe, M. J., & Snyder, C. M. (2015). Does online availability increase citations? Theory and evidence from a panel of economics and business journals. Review of Economics and Statistics, 97(1), 144–165.

McCullagh, P., & Nelder, J. (1989). Generalized linear models (2nd ed.). London: Chapman & Hall.

Moon, H. R., & Weidner, M. (2015). Linear regression for panel with unknown number of factors as interactive fixed effects. Econometrica, 83, 1543–1579. https://doi.org/10.3982/ecta9382

Mundlak, Y. (1978). On the pooling of time series and cross section data. Econometrica, 46, 69–85.

Murtazashvili, I., & Wooldridge, J. M. (2008). Fixed effects instrumental variables estimation in correlated random coefficient panel data models. Journal of Econometrics, 142, 539–552. https://doi.org/10.1016/j.jeconom.2007.09.001

Neal, T. (2015). Estimating heterogeneous coefficients in panel data models with endogenous regressors and common factors.

Newey, W. K. (2001). Conditional moment restrictions in censored and truncated regression models. Econometric Theory, 17(5), 863–888.

Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. In R. F. Engle & D. L. McFadden (Eds.), Handbook of Econometrics (Vol. IV, pp. 2112–2245).

Norkutė, M., Sarafidis, V., Yamagata, T., & Cui, G. (2021).
Instrumental variable estimation of dynamic linear panel data models with defactored regressors and a multifactor error structure. Journal of Econometrics, 220, 416–446. https://doi.org/10.1016/j.jeconom.2020.04.008

Papke, L. E. (2005). The effects of spending on test pass rates: Evidence from Michigan. Journal of Public Economics, 89, 821–839. https://doi.org/10.1016/j.jpubeco.2004.05.008

Papke, L. E., & Wooldridge, J. M. (2008). Panel data methods for fractional response variables with an application to test pass rates. Journal of Econometrics, 145, 121–133. https://doi.org/10.1016/j.jeconom.2008.05.009

Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica, 74, 967–1012.

Phillips, R. F. (2020). Quantifying the advantages of forward orthogonal deviations for long time series. Computational Economics, 55(2), 653–672.

Rao, C. R., & Mitra, S. K. (1972). Generalized inverse of a matrix and its applications. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California.

Robertson, D., & Sarafidis, V. (2015). IV estimation of panels with factor residuals. Journal of Econometrics, 185, 526–541. https://doi.org/10.1016/j.jeconom.2014.12.001

Schlenker, W., & Walker, W. R. (2016). Airports, air pollution, and contemporaneous health. The Review of Economic Studies, 83(2), 768–809.

Schmidt, P., Ahn, S. C., & Wyhowski, D. (1992). On the estimation of panel-data models with serial correlation when instruments are not strictly exogenous: Comment. Journal of Business & Economic Statistics, 10, 10–14. https://doi.org/10.2307/1391796

Sherman, J., & Morrison, W. J. (1950). Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Annals of Mathematical Statistics, 21, 124–127.

Verdier, V. (2018).
Local semi-parametric efficiency of the Poisson fixed effects estimator. Journal of Econometric Methods, 7(1).

Westerlund, J. (2019). On estimation and inference in heterogeneous panel regressions with interactive effects. Journal of Time Series Analysis, 40, 852–857. https://doi.org/10.1111/jtsa.12432

Westerlund, J. (2020). A cross-section average-based principal components approach for fixed-T panels. Journal of Applied Econometrics, 35(6), 776–785.

Westerlund, J., Petrova, Y., & Norkutė, M. (2019). CCE in fixed-T panels. Journal of Applied Econometrics, 34, 746–761. https://doi.org/10.1002/jae.2707

Williams, M. L., Burnap, P., Javed, A., Liu, H., & Ozalp, S. (2020). Hate in the machine: Anti-Black and anti-Muslim social media posts as predictors of offline racially and religiously aggravated crime. The British Journal of Criminology, 60(1), 93–117.

Wooldridge, J. M. (1997). Multiplicative panel data models without the strict exogeneity assumption. Econometric Theory, 13(5), 667–678.

Wooldridge, J. M. (1999). Distribution-free estimation of some nonlinear panel data models. Journal of Econometrics, 90(1), 77–97.

Wooldridge, J. M. (2005). Fixed-effects and related estimators for correlated random-coefficient and treatment-effect panel data models. The Review of Economics and Statistics, 87, 385–390.

Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed., Vol. 1). MIT Press.