EFFICIENT ESTIMATION WITH MISSING VALUES IN CROSS SECTION AND PANEL DATA

By

Bhavna Rai

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics – Doctor of Philosophy

2021

ABSTRACT

EFFICIENT ESTIMATION WITH MISSING VALUES IN CROSS SECTION AND PANEL DATA

By Bhavna Rai

Chapter 1: Efficient Estimation with Missing Data and Endogeneity

I study the problem of missing values in both the outcome and the covariates in linear models with endogenous covariates. I propose an estimator that improves efficiency relative to a Two Stage Least Squares (2SLS) estimator based only on the complete cases. My framework also unifies the literature on missing data and combining data sets, and includes the “Two-Sample 2SLS” as a special case. The method is an extension of Abrevaya and Donald (2017), who provide methods of improving efficiency over complete-cases estimators in linear models with cross-section data and missing covariates. I also provide guidance on dealing with missing values in the instruments and in commonly used nonlinear functions of the endogenous covariates, like squares and interactions, without introducing inconsistency in the estimates.

Chapter 2: Imputing Missing Covariate Values in Nonlinear Models

I study the problem of missing covariate values in nonlinear models with continuous or discrete covariates. In order to use the information in the incomplete cases, I propose an inverse probability weighted one-step imputation estimator that provides gains in efficiency relative to the complete-cases estimator by using a reduced form for the outcome in terms of the always-observed covariates. Unlike the two-step imputation and dummy variable methods commonly used in empirical work, my estimator is consistent for a wide class of nonlinear models. It relies only on the commonly used “missing at random” assumption, and provides a specification test for the resulting restrictions. I show how the results apply to nonlinear models for fractional and nonnegative responses.

Chapter 3: Efficient Estimation of Linear Panel Data Models with Missing Covariates

We study the problem of missing covariates in the context of linear, unobserved effects panel data models. In order to use the information in the incomplete cases, we propose generalized method of moments (GMM) estimation. By using information on the incomplete cases from all time periods, the proposed estimators provide gains in efficiency relative to the fixed effects (and Mundlak) estimators that use only the complete cases. The method is an extension of Abrevaya and Donald (2017), who consider a linear model with cross-sectional data and incorporate the linear imputation method in the set of moment conditions to obtain gains in efficiency. Our first proposed estimator uses the assumption of strict exogeneity of the covariates as well as the selection, while allowing the selection to be correlated with the observed covariates and unobserved heterogeneity in both the outcome equation and the imputation equation. We also consider the case in which the covariates are only sequentially exogenous and propose an estimator based on the method of forward orthogonal deviations introduced by Arellano and Bover (1995). Our framework suggests a simple test for whether selection is correlated with unobserved shocks, both contemporaneous and those in other time periods.
ACKNOWLEDGEMENTS

My sincere gratitude to my adviser Jeffrey Wooldridge, not only for lending his expertise to my dissertation but also for his patience and motivation throughout my Ph.D. I would also like to thank my committee members Peter Schmidt, Todd Elder, and Vincenzina Caputo for their insightful comments and discussions. My deep gratitude to my parents and brother for supporting me both materially and spiritually throughout the program. I am also thankful to my friends and fellow graduate students Katie Bollman, Marissa Eckrote, Pallavi Pal, and Ruonan Xu for their constant support and all the fun times. The financial support received from the Department of Economics, the Graduate School, and the College of Social Science has been instrumental in the completion of this work. Finally, the administrative support received from Lori Jean Nichols and Jay Feight has greatly facilitated navigating the program.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 EFFICIENT ESTIMATION WITH MISSING DATA AND ENDOGENEITY
1.1 Introduction
1.2 The population model and assumptions
1.3 The missing data scheme
1.4 Moment conditions and GMM estimation
1.5 Comparison with related estimators
1.5.1 Complete cases estimator
1.5.2 Estimators combining different data sets
1.5.3 Sequential estimators
1.5.4 Dummy variable method
1.6 Missing instruments
1.7 Nonlinearity in covariates and instruments
1.7.1 Missingness in outcome and covariates
1.7.2 Missingness in instruments
1.8 Monte Carlo simulations
1.8.1 Missingness in outcome and covariates
1.8.2 Missingness in instruments
1.9 Empirical application
1.10 Conclusion

CHAPTER 2 IMPUTING MISSING COVARIATE VALUES IN NONLINEAR MODELS
2.1 Introduction
2.2 The population optimization problems
2.3 Non random sampling and inverse probability weighting
2.4 Moment conditions and GMM
2.5 Examples
2.5.1 Models for binary and fractional responses
2.5.1.1 Continuous covariate with missing values
2.5.1.2 Binary covariate with missing values
2.5.1.3 Average partial effects
2.5.2 Exponential models
2.6 Comparison with related estimators
2.6.1 Complete cases
2.6.2 Sequential procedures
2.6.3 Dummy variable method
2.6.4 Unweighted estimators
2.7 Empirical application
2.8 Conclusion

CHAPTER 3 EFFICIENT ESTIMATION OF LINEAR PANEL DATA MODELS WITH MISSING COVARIATES*
3.1 Introduction
3.2 Population model
3.3 The missing data mechanism
3.4 Moment conditions and GMM
3.5 Comparison to related estimators
3.5.1 Complete cases estimator
3.5.2 Dummy variable method
3.5.3 Regression imputation
3.5.4 Mundlak device
3.6 Estimation under sequential exogeneity
3.7 Conclusion

APPENDICES
APPENDIX A PROOFS FOR CHAPTER 1
APPENDIX B TABLES FOR CHAPTER 1
APPENDIX C FIGURES FOR CHAPTER 1
APPENDIX D PROOFS FOR CHAPTER 2
APPENDIX E ASYMPTOTIC THEORY FOR UNWEIGHTED ESTIMATION
APPENDIX F TABLES FOR CHAPTER 2
APPENDIX G PROOFS FOR CHAPTER 3
APPENDIX H EXTENSIONS TO CHAPTER 3

REFERENCES

LIST OF TABLES

Table B.1: Monte Carlo simulations, Design 1
Table B.2: Monte Carlo simulations, Design 2
Table B.3: Monte Carlo simulations, Design 3
Table B.4: Monte Carlo simulations, Design 4
Table B.5: Monte Carlo simulations, Design 5
Table B.6: Monte Carlo simulations, Design 6
Table B.7: Monte Carlo simulations, Design 7
Table B.8: Effect of physician's advice on calorie consumption: complete cases versus the proposed estimator
Table F.1: Summary of missing data methods used in 5 highly ranked economics journals from 2018 to August 2020
Table F.2: Effect of grade variance on probability of having a 4 year college degree

LIST OF FIGURES

Figure C.1: Some admissible patterns of missingness (shaded areas represent complete cases)

CHAPTER 1
EFFICIENT ESTIMATION WITH MISSING DATA AND ENDOGENEITY

1.1 Introduction

The problem of missing data is highly prevalent in empirical research. While there is a vast literature on methods to deal with missing data, the issue of endogeneity of the covariates with missing values has not been explicitly addressed in the majority of it.1 In linear models with endogenous covariates and missing values in either the outcome or the endogenous covariates, a frequently used method is a 2SLS that only uses the “complete cases”, that is, the observations for which all the variables are observed.2 While consistent under commonly used assumptions, this method can lead to a substantial loss of efficiency due to discarding the information in the incomplete cases.

Recent literature has considered the case of missingness only in the endogenous covariates and has suggested some methods that make use of these incomplete cases. The first set of methods is based on “imputation”. For instance, McDonough & Millimet (2017) discuss an estimator which replaces the missing covariate values with fitted values from a first stage regression of the endogenous covariate on the instruments. A more efficient estimator is suggested by Abrevaya & Donald (2011), who use the incomplete cases via a reduced form for the outcome in terms of the instruments. The first contribution of this paper is to extend the framework of Abrevaya & Donald (2011) to allow for missingness in both the outcome and the endogenous covariates. I show that it is possible to obtain strict gains in efficiency for all coefficients relative to the complete cases 2SLS.

My framework also unifies the literature on missing data and that on combining data sets with missing variables. Empirical researchers sometimes have two distinct data sets, one of which contains only the outcome and the instruments, and the other contains only the endogenous covariates and the instruments. A commonly used estimator that combines the two is the “Two-Sample 2SLS” (henceforth TS2SLS).3 I relax assumptions traditionally used by this estimator and also provide a framework for combining more than two data sets with more general patterns of missing variables.

A second method that makes use of the incomplete cases is the so-called “dummy variable method”, which replaces the missing covariate values with zeros and includes an indicator for missingness as an additional covariate in the model. When the covariates are exogenous, Jones (1996) shows that this method produces inconsistent estimates unless some zero restrictions are imposed in the population. I show that this inconsistency carries over to the case of endogenous covariates.

1 For a comprehensive discussion of methods used to deal with missing data, see Schafer & Graham (2002).
2 Wooldridge (2010), Section 17.2.1.
One can also encounter missing values in the instruments, in which case interest lies in continuing to use the observations with missing instruments instead of discarding them. Mogstad & Wiswall (2012) discuss an estimator that imputes missing instrument values. This is a two-step estimator that in the first step replaces the missing instrument values with predicted values from a regression of the instrument on the always-observed exogenous covariates, and in the second step estimates the main model using a 2SLS with both the actual and imputed instrument values. They show that the resulting estimator for the coefficient on the endogenous covariate is numerically equivalent to a complete cases 2SLS. A second contribution of this paper is to propose an imputation estimator for the instruments that can achieve strict gains in efficiency over the complete cases 2SLS for all coefficients.4 This estimator includes as a special case the estimator suggested by Abrevaya & Donald (2017) in the case where the covariates are exogenous.

Finally, I show how to impute commonly used nonlinear functions of the endogenous covariates like squares and interactions. I show that two-step procedures which in the first step replace the missing values of the nonlinear functions of the covariates with the same nonlinear functions of the imputed values generally produce inconsistent estimates. A third contribution of this paper is to propose a consistent imputation estimator in this context that improves upon the efficiency of the complete cases 2SLS.

3 TS2SLS was first introduced by Klevmarken (1982), and more recently used by Angrist & Krueger (1995). Inoue & Solon (2010) show that the TS2SLS is more efficient than the related Two-Sample IV estimator. Inoue & Solon (2005) consider GMM estimation with arbitrary heteroskedasticity and stratification. Pacini & Windmeijer (2016) obtain robust standard errors for the traditional TS2SLS with arbitrary heteroskedasticity.
4 Abrevaya & Donald (2011) also propose an estimator for the case of missing instruments. My estimator is based on different moment conditions and is no less efficient than theirs.

The rest of the paper is organized as follows. Section 1.2 presents the population model of interest and associated assumptions. Section 1.3 describes the missing data scheme and the assumptions on the missingness mechanism for the case of missingness in the outcome and the endogenous covariates. Section 1.4 describes the resulting moment conditions and the asymptotic distribution of the proposed GMM estimator. Section 1.5 discusses four related estimators: the complete cases 2SLS, the TS2SLS, the imputation estimator, and the dummy variable estimator. Section 1.6 discusses the case of missingness in the instruments. Section 1.7 discusses the case of nonlinearity in the covariates. Section 1.8 presents results from Monte Carlo simulations comparing the proposed estimator with related estimators. Section 1.9 presents an empirical application to the effect of physician's advice on individuals' calorie consumption. Section 1.10 concludes. The Appendices include the proofs and tables.

1.2 The population model and assumptions

Consider the standard linear regression model
$$y = x_1\beta_1 + x_2\beta_2 + u \equiv x\beta + u, \quad (1.2.1)$$
where $x = (x_1, x_2)$ is the $1 \times (p+k)$ vector of covariates, $x_1$ is a $1 \times p$ vector of potentially endogenous covariates, and $x_2$ is a $1 \times k$ vector of exogenous covariates (including the constant). That is,
$$\mathrm{E}(x_2'u) = 0, \quad (1.2.2)$$
and we allow for $\mathrm{E}(x_1'u) \neq 0$.
We are interested in estimating $\beta = (\beta_1', \beta_2')'$, where $\beta_1$ and $\beta_2$ are $p \times 1$ and $k \times 1$ respectively. As is well known, OLS is inconsistent for $\beta$ under (1.2.2). Suppose we have a set of instruments $z = (z_1, x_2)$, where $z_1$ is a $1 \times q$ ($q \geq p$) vector of excluded instruments, such that
$$\mathrm{E}(z'u) = 0. \quad (1.2.3)$$
The first stage is given by the linear projection
$$x = z_1\Pi_1 + x_2\Pi_2 + r \equiv z\Pi + r, \quad (1.2.4)$$
where $\Pi$ is the $(q+k) \times (p+k)$ matrix of all the first stage coefficients, and $\Pi_1$ and $\Pi_2$ are the $q \times (p+k)$ and $k \times (p+k)$ matrices of coefficients on $z_1$ and $x_2$ respectively. By definition,
$$\mathrm{E}(z'r) = 0, \quad (1.2.5)$$
and by assumption $\Pi \neq 0$. Then, given a random sample and a rank condition, we can use 2SLS to consistently estimate $\beta$. Note that the errors $u$ and $r$ are assumed only to satisfy zero correlation with the instruments in (1.2.3) and (1.2.5); no other assumptions, such as homoskedasticity or a zero conditional mean, have been imposed on them.

Now, using (1.2.1) and (1.2.4), we get a reduced form for $y$ given by
$$y = z\Pi\beta + v, \qquad v \equiv r\beta + u, \quad (1.2.6)$$
and using (1.2.3) and (1.2.5), we have
$$\mathrm{E}(z'v) = 0. \quad (1.2.7)$$
Under the missing data scheme described in the next section, equation (1.2.6) allows us to use the incomplete cases for estimating $\beta$. When there is no missing data, the information in this equation is redundant given equations (1.2.1)-(1.2.5).

1.3 The missing data scheme

I characterize the potential missingness of the data using selection indicators. For any random draw $(x_i, y_i, z_i)$ from the population, we also draw the selection indicators $(s_{1i}, s_{2i})$ defined as follows:
$$s_{1i} = \begin{cases} 1 & \text{if } y_i \text{ is observed} \\ 0 & \text{otherwise} \end{cases} \qquad\qquad s_{2i} = \begin{cases} 1 & \text{if } x_i \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
Two things should be noted. First, I am assuming that $z_i$ is always observed. Since $z_i = (z_{1i}, x_{2i})$, I am allowing for missingness only in the endogenous covariates $x_{1i}$.5 Second, I am assuming that either all or none of the elements in $x_{1i}$ are observed. Then our “data” consists of $\{(y_i, x_i, z_i, s_{1i}, s_{2i}) : i = 1, \ldots, n\}$. Because identification is properly studied in the population, let $s_1$ and $s_2$ denote random variables with the distributions of $s_{1i}$ and $s_{2i}$ respectively for all $i$. In other words, $(y, x, z, s_1, s_2)$ now denotes the population.

This framework allows for several kinds of missing data patterns that arise frequently in practice. Figure 1 shows some of these cases. First, it allows for the case where we have a single sample in which $y$ is missing for certain observations, $x$ is missing for certain other observations, and for the rest of the observations all of $y$, $x$ and $z$ are observed (Figure 1.1). Another case is where only $x$ is missing for certain observations (Figure 1.2).6 In both of these cases, using only the complete cases may lead to a substantial loss of information. A third case is where $y$ and $x$ are missing for disjoint observations such that there are no complete cases (Figure 1.3). Such a sample is typically obtained by combining two samples such that only $y$ and $z$ are observed in one sample and only $x$ and $z$ in the other. The most commonly used estimator in this case is the TS2SLS, which is a special case of the estimator I propose in the next section.

To determine the properties of any estimation procedure using selected samples, we need to know how $s_1$ and $s_2$ are related to $(y, x, z)$. I place the following assumptions on the missingness indicators.

5 I discuss the case of missingness in exogenous covariates in Section 1.6.
6 This case has briefly been considered in Abrevaya & Donald (2011).
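Before turning to those assumptions, note that the selection indicators are computed directly from the raw sample. The short sketch below is only an illustration of this bookkeeping, not code from the dissertation; it assumes missing entries are coded as NaN in NumPy arrays, and all names are mine.

```python
import numpy as np

def selection_indicators(y, X1):
    """Construct s1 and s2 from a sample in which missing entries are NaN.

    y  : (n,) outcome, possibly containing NaN
    X1 : (n, p) potentially endogenous covariates, possibly containing NaN
    (z, including x2, is assumed always observed)
    """
    s1 = (~np.isnan(y)).astype(int)                 # 1 if y_i is observed
    s2 = (~np.isnan(X1)).all(axis=1).astype(int)    # 1 if all of x_1i is observed
    complete = (s1 * s2).astype(bool)               # complete cases
    only_x = ((1 - s1) * s2).astype(bool)           # x observed, y missing
    only_y = (s1 * (1 - s2)).astype(bool)           # y observed, x missing
    return s1, s2, complete, only_x, only_y
```

The three boolean masks correspond to the subsamples used by the moment functions introduced in Section 1.4.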
Assumption 1.3.1:
(i) $\mathrm{E}(s_1 s_2 z'u) = 0$; (ii) $\mathrm{E}(s_1 s_2 z'r) = 0$; (iii) $\mathrm{E}(s_1 z'u) = 0$; (iv) $\mathrm{E}(s_1 z'r) = 0$; (v) $\mathrm{E}(s_2 z'r) = 0$.

This assumption essentially implies that the orthogonality assumptions on the errors given in (1.2.3), (1.2.5) and (1.2.7) hold in the selected sub-populations as well. For instance, the first part of this assumption, which is the weakest possible assumption required for the consistency of a 2SLS based only on the complete cases, can be written as
$$\mathrm{E}(s_1 s_2 z'u) = \mathrm{E}[\mathrm{E}(s_1 s_2 z'u \mid s_1 s_2)] = P(s_1 s_2 = 1)\,\mathrm{E}(z'u \mid s_1 s_2 = 1) = 0, \quad (1.3.1)$$
where the first equality holds by the law of iterated expectations (LIE). If we assume that $P(s_1 s_2 = 1)$ is strictly positive, then we need the population orthogonality condition $\mathrm{E}(z'u) = 0$ to hold in the sub-population where $s_1 = s_2 = 1$ for this assumption to be true. The other parts of this assumption impose similar restrictions on the errors in (1.2.1) and (1.2.4) for different sub-populations.

Sufficient for Assumption 1.3.1 to hold is that $(s_1, s_2) \perp\!\!\!\perp (z, u, r)$, for which a sufficient condition is that $(s_1, s_2) \perp\!\!\!\perp (x, y, z)$. That is, selection is independent of everything else in the model. This is generally known as “missing completely at random” (MCAR) in the missing data literature.7 For instance, consider the first part of Assumption 1.3.1:
$$\mathrm{E}(s_1 s_2 z'u) = \mathrm{E}(s_1 s_2)\,\mathrm{E}(z'u) = 0, \quad (1.3.2)$$
and similarly for the other parts.

Assumption 1.3.1 also holds if we have correctly specified conditional means and selection is independent of the errors in both the model of interest and the first stage, conditional on the instruments. That is, strengthening the exogeneity conditions in (1.2.3) and (1.2.5) to $\mathrm{E}(u \mid z) = 0$ and $\mathrm{E}(r \mid z) = 0$ respectively and assuming $(s_1, s_2) \perp\!\!\!\perp (u, r) \mid z$ is sufficient. Again, consider the first part of Assumption 1.3.1:
$$\mathrm{E}(s_1 s_2 z'u) = \mathrm{E}[\mathrm{E}(s_1 s_2 z'u \mid z, s_1 s_2)] = \mathrm{E}[s_1 s_2 z'\,\mathrm{E}(u \mid z, s_1 s_2)] = \mathrm{E}[s_1 s_2 z'\,\mathrm{E}(u \mid z)] = 0, \quad (1.3.3)$$
where the third equality holds because of the conditional independence and the last one holds because of the zero conditional mean of the errors. An important special case is when selection is a deterministic function of $z$. But it can also depend on other unobservable random variables under certain conditions. For instance, we can let
$$s_1 s_2 = f(z, w), \quad (1.3.4)$$
where $w$ is an unobserved random variable. Then $\mathrm{E}(s_1 s_2 z'u) = 0$ holds if $\mathrm{E}(u \mid z) = 0$ and $w \perp\!\!\!\perp (z, u)$, as
$$\mathrm{E}(s_1 s_2 z'u) = \mathrm{E}[\mathrm{E}(s_1 s_2 z'u \mid z, w)] = \mathrm{E}[s_1 s_2 z'\,\mathrm{E}(u \mid z, w)] = \mathrm{E}[s_1 s_2 z'\,\mathrm{E}(u \mid z)] = 0. \quad (1.3.5)$$
What Assumption 1.3.1 rules out is $(s_1, s_2)$ depending on the errors $u$ and $r$. That is, selection cannot depend on the idiosyncratic errors in either $y$ or $x$. Whether or not this holds in an empirical application should be carefully considered by the researcher.

7 We do not require $s_1$ and $s_2$ to be independent of each other for Assumption 1.3.1 to hold.

1.4 Moment conditions and GMM estimation

Using equations (1.2.1)-(1.2.7) along with Assumption 1.3.1, I define the vector of moment functions as follows:
$$g(\beta, \Pi) = \begin{pmatrix} s_1 s_2\, z'(y - x\beta) \\ s_1 s_2\, z' \otimes (x - z\Pi)' \\ (1 - s_1) s_2\, z' \otimes (x - z\Pi)' \\ s_1 (1 - s_2)\, z'(y - z\Pi\beta) \end{pmatrix} \equiv \begin{pmatrix} g_1(\beta, \Pi) \\ g_2(\beta, \Pi) \\ g_3(\beta, \Pi) \\ g_4(\beta, \Pi) \end{pmatrix}, \quad (1.4.1)$$
where I suppress $(y, x, z, s_1, s_2)$ from $g(\cdot)$ for notational convenience. In the vector $g(\cdot)$, $g_1(\cdot)$ and $g_2(\cdot)$ use the information contained in the complete cases, $g_3(\cdot)$ uses the observations for which $x$ is observed but $y$ is not, while $g_4(\cdot)$
uses the observations for which $y$ is observed but $x$ is not.8 Then, the following result holds for $g(\cdot)$.

Lemma 1.4.1. Under Assumption 1.3.1, $\mathrm{E}[g(\beta, \Pi)] = 0$.

8 Note that equations (1.2.1)-(1.2.7) and our missing data scheme suggest 5 different moment functions: $g_1(\cdot)$-$g_4(\cdot)$ along with $g_5(\cdot) = s_1 s_2 z'(y - z\Pi\beta)$. However, since $g_5(\cdot)$ is a linear combination of $g_1(\cdot)$ and $g_2(\cdot)$, it is redundant given $g_1(\cdot)$-$g_4(\cdot)$ and hence I exclude it from the set of relevant moment functions.

This gives us a vector of $2(q+k)(1+p+k)$ moment conditions satisfied by the population parameter values $(\beta, \Pi)$. We have $(p+k)(1+q+k)$ parameters to estimate, giving us $2(q+k) + (p+k)(q+k-1)$ overidentifying restrictions.

Let $\bar{g}(\beta, \Pi) = n^{-1}\sum_{i=1}^{n} g(y_i, x_i, z_i, s_{1i}, s_{2i}, \beta, \Pi)$, let $\Omega$ be a square matrix of order $2(q+k)(1+p+k)$ that is nonrandom, symmetric, and positive definite, and let $\hat{\Omega}$ be a first-step consistent estimate of $\Omega$. Then, the standard two-step GMM estimator minimizes the objective function
$$\bar{g}(\beta, \Pi)'\,\hat{\Omega}\,\bar{g}(\beta, \Pi). \quad (1.4.2)$$
The variance-covariance matrix of the moment functions is given by
$$C \equiv \mathrm{E}[g(\beta, \Pi)\, g(\beta, \Pi)'] = \begin{pmatrix} C_{11} & C_{12} & 0 & 0 \\ C_{12}' & C_{22} & 0 & 0 \\ 0 & 0 & C_{33} & 0 \\ 0 & 0 & 0 & C_{44} \end{pmatrix},$$
where
$$C_{11} = \mathrm{E}(s_1 s_2 u^2 z'z), \quad C_{12} = \mathrm{E}(s_1 s_2\, z'u\; z \otimes r), \quad C_{22} = \mathrm{E}(s_1 s_2\; z' \otimes r'\; z \otimes r),$$
$$C_{33} = \mathrm{E}[(1 - s_1) s_2\; z' \otimes r'\; z \otimes r], \qquad C_{44} = \mathrm{E}[s_1(1 - s_2)\, v^2 z'z], \quad (1.4.3)$$
and $g(\cdot)$ is evaluated at the true value of the parameters. The optimal weight matrix is given by the inverse of $C$. Let $\hat{C}$ be a consistent estimate of $C$, which can be obtained by replacing the expectations by sample averages in the definition of $C$ above and replacing $u$, $r$ and $v$ by consistent estimates obtained using, for instance, GMM estimators that use $g_1(\cdot)$ only, $g_2(\cdot)$ and $g_3(\cdot)$ only, and $g_4(\cdot)$ only, respectively. Then, the optimal GMM estimator is defined as the following.

Definition 1.4.1. Call the estimators of $\beta$ and $\Pi$ that minimize (1.4.2) with the optimal weight matrix $\hat{\Omega} = \hat{C}^{-1}$, $\hat{\beta}$ and $\hat{\Pi}$ respectively.

Further, define the $(k+q)(2+k+p) \times (k+p)(1+k+q)$ matrix of expected derivatives of $g(\cdot)$ with respect to $(\beta', \mathrm{vec}(\Pi)')'$:
$$D \equiv \mathrm{E}[\nabla g(\beta, \Pi)] = \begin{pmatrix} D_{11} & 0 \\ 0 & D_{22} \\ 0 & D_{32} \\ D_{41} & D_{42} \end{pmatrix},$$
where
$$D_{11} = -\mathrm{E}(s_1 s_2 z'x), \qquad D_{22} = -\mathrm{E}[s_1 s_2 (z'z \otimes c_1, \ldots, z'z \otimes c_{p+k})],$$
$$D_{32} = -\mathrm{E}[(1 - s_1) s_2 (z'z \otimes c_1, \ldots, z'z \otimes c_{p+k})], \qquad D_{41} = -\mathrm{E}[s_1(1 - s_2) z'z\Pi], \qquad D_{42} = -\mathrm{E}[s_1(1 - s_2)\,\beta' \otimes z'z], \quad (1.4.4)$$
where $c_m$ is a $(p+k) \times 1$ vector with one in the $m$th row and all other rows being zero, $m = 1, \ldots, (p+k)$. I impose the following rank condition on $D$ for identification of $\beta$ and $\Pi$.

Assumption 1.3.2: $\mathrm{rank}(D) = (p+k)(1+q+k)$.

If $P(s_1 s_2 = 1) > 0$, then sufficient for this assumption to hold is that $\mathrm{E}(z'x \mid s_1 s_2 = 1)$ and $\mathrm{E}(z'z \mid s_1 s_2 = 1)$ have full column ranks $(p+k)$ and $(p+k)(q+k)$ respectively. In this case, $\mathrm{E}[g_1(\beta)] = 0$ identifies $\beta$ and $\mathrm{E}[g_2(\Pi)] = 0$ identifies $\Pi$. If $P(s_1 s_2 = 1) = 0$, for instance in the TS2SLS case, then sufficient is that $\mathrm{E}(z'z \mid s_2 = 1)$ and $\mathrm{E}(z'x \mid s_1 = 1)$ have full column ranks $(p+k)(q+k)$ and $(p+k)$ respectively. In this case, $\mathrm{E}[g_3(\Pi)] = 0$ identifies $\Pi$ and $\mathrm{E}[g_4(\beta)] = 0$ identifies $\beta$, since for the purpose of identification we can treat $\Pi$ as known. Then, we have the following result using Hansen (1982).

Theorem 1.4.1. Under standard regularity conditions and Assumptions 1.3.1 and 1.3.2,
$$\sqrt{n}\left[\left(\hat{\beta}', \mathrm{vec}(\hat{\Pi})'\right)' - \left(\beta', \mathrm{vec}(\Pi)'\right)'\right] \xrightarrow{d} N\left(0, (D'C^{-1}D)^{-1}\right)$$
and
$$n\,\bar{g}(\hat{\beta}, \hat{\Pi})'\,\hat{C}^{-1}\,\bar{g}(\hat{\beta}, \hat{\Pi}) \xrightarrow{d} \chi^2_{2(q+k)+(p+k)(q+k-1)}.$$
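Both the estimator in Definition 1.4.1 and the statistic in Theorem 1.4.1 are straightforward to compute numerically. The following sketch is purely illustrative and not part of the dissertation (all function and variable names are mine): it stacks the four blocks of (1.4.1), uses an identity-weighted first step to form $\hat{C}$, and then minimizes (1.4.2) with $\hat{\Omega} = \hat{C}^{-1}$, returning $\hat{\beta}$ and the overidentification statistic.

```python
import numpy as np
from scipy.optimize import minimize

def moments(theta, y, X, Z, s1, s2, p_k, q_k):
    """Stack the four moment blocks in (1.4.1) observation by observation.

    theta = (beta, vec(Pi)); beta has length p+k, Pi is (q+k) x (p+k).
    Missing entries of y and X may be NaN; they are zeroed out before use,
    which is harmless because they are multiplied by zero selection indicators."""
    beta = theta[:p_k]
    Pi = theta[p_k:].reshape(q_k, p_k)
    y0, X0 = np.nan_to_num(y), np.nan_to_num(X)
    e1 = y0 - X0 @ beta              # structural residual (used when s1*s2 = 1)
    R = X0 - Z @ Pi                  # first-stage residuals (used when s2 = 1)
    v = y0 - Z @ (Pi @ beta)         # reduced-form residual (used when s1 = 1)
    ZR = np.einsum('ij,ik->ijk', Z, R).reshape(len(y), -1)
    g1 = (s1 * s2)[:, None] * Z * e1[:, None]
    g2 = (s1 * s2)[:, None] * ZR
    g3 = ((1 - s1) * s2)[:, None] * ZR
    g4 = (s1 * (1 - s2))[:, None] * Z * v[:, None]
    return np.hstack([g1, g2, g3, g4])          # n x [2(q+k)(1+p+k)]

def gmm_objective(theta, W, *data):
    gbar = moments(theta, *data).mean(axis=0)
    return gbar @ W @ gbar

def two_step_gmm(y, X, Z, s1, s2):
    n, p_k = X.shape
    q_k = Z.shape[1]
    data = (y, X, Z, s1, s2, p_k, q_k)
    theta0 = np.zeros(p_k + q_k * p_k)           # a complete-case 2SLS start would also work
    # Step 1: identity weighting gives consistent preliminary estimates.
    W1 = np.eye(2 * q_k * (1 + p_k))
    step1 = minimize(gmm_objective, theta0, args=(W1,) + data, method='BFGS')
    # Step 2: optimal weighting with C-hat built from first-step residuals.
    G = moments(step1.x, *data)
    W2 = np.linalg.pinv(G.T @ G / n)
    step2 = minimize(gmm_objective, step1.x, args=(W2,) + data, method='BFGS')
    J = n * gmm_objective(step2.x, W2, *data)    # overidentification statistic
    return step2.x[:p_k], J
```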
The $\chi^2$ statistic in Theorem 1.4.1 can be used for the standard test of overidentifying restrictions. Note that this statistic is just the GMM objective function in (1.4.2) evaluated at the efficient estimates of the parameters and is distributed as chi-squared with degrees of freedom equal to the number of overidentifying restrictions.

1.5 Comparison with related estimators

1.5.1 Complete cases estimator

The most common practice in the presence of missing data is to just use the complete cases for estimation; that is, to only use the observations for which both $y$ and $x$ are observed. In the current framework, the first and most commonly used estimator that uses only the complete cases is the standard 2SLS. This estimator uses only $g_1(\cdot)$ in estimation, as it requires $s_1 = s_2 = 1$, and uses a weight matrix that is optimal when $u$ is homoskedastic.

Definition 1.5.1.1. Call the estimator of $\beta$ that minimizes (1.4.2), where $g(\cdot)$ contains only $g_1(\cdot)$ and $\hat{\Omega} = (n^{-1}\sum_{i=1}^{n} s_{1i}s_{2i}z_i'z_i)^{-1}$, the complete cases 2SLS (or $\hat{\beta}_{CC-2SLS}$).

The weight matrix used by $\hat{\beta}_{CC-2SLS}$ is optimal if $\mathrm{E}(u^2 \mid z, s_1, s_2) = \sigma^2$. When this assumption is violated, a more efficient complete cases estimator can be obtained by using optimal weighting.

Definition 1.5.1.2. Call the estimator of $\beta$ that minimizes (1.4.2), where $g(\cdot)$ contains only $g_1(\cdot)$ and $\hat{\Omega} = \hat{C}_{11}^{-1}$, the complete cases GMM (or $\hat{\beta}_{CC-GMM}$).

This is the optimal GMM estimator based only on the complete cases. Its asymptotic variance is easily obtained using standard GMM theory.

Lemma 1.5.1.1. Under Assumption 1.3.1, the complete cases GMM has an asymptotic variance given by
$$\mathrm{Avar}\left[\sqrt{n}(\hat{\beta}_{CC-GMM} - \beta)\right] = (D_{11}'C_{11}^{-1}D_{11})^{-1}.$$
Comparing the asymptotic variances of $\hat{\beta}$ and $\hat{\beta}_{CC-GMM}$, the former is no less efficient than the latter because it uses the information contained in the incomplete cases, while the latter simply
ignores this information. The gain in efficiency follows from the fact that adding valid moment conditions decreases, or at least does not increase, the asymptotic variance of a GMM estimator.9

Proposition 1.5.1.1. Under Assumption 1.3.1, $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{CC-GMM} - \beta)] - \mathrm{Avar}[\sqrt{n}(\hat{\beta} - \beta)]$ is positive semi-definite.

Further, I break down the gains in efficiency by $\beta_1$ and $\beta_2$, the coefficients on the potentially missing endogenous covariates $x_1$ and the always observed exogenous covariates $x_2$ respectively. For algebraic convenience, I consider the case where both $x_1$ and $z_1$ are scalars.10

Proposition 1.5.1.2. Let $p = q = 1$. Under Assumption 1.3.1,
(i) $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{1,CC-GMM} - \beta_1)] - \mathrm{Avar}[\sqrt{n}(\hat{\beta}_1 - \beta_1)] = \begin{pmatrix} A_1 & B_1 \end{pmatrix} E \begin{pmatrix} A_1' \\ B_1' \end{pmatrix} \geq 0$;
(ii) $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{2,CC-GMM} - \beta_2)] - \mathrm{Avar}[\sqrt{n}(\hat{\beta}_2 - \beta_2)] = \begin{pmatrix} A_2 & B_2 \end{pmatrix} E \begin{pmatrix} A_2' \\ B_2' \end{pmatrix} \geq 0$,
where $A_j = D_{32}D_{22}^{-1}C_{21}W_j$, $B_j = (D_{41}D_{11}^{-1}C_{11} + D_{42}D_{22}^{-1}C_{21})W_j$, $j = 1, 2$, and $W_1$, $W_2$ and $E$ are matrices defined in the appendix, $E$ being a positive definite matrix.

Starting with the first part of Proposition 1.5.1.2, since $E$ is positive definite, the difference is 0 if and only if $A_1 = B_1 = 0$. The corresponding difference for $\beta_2$ is 0 if and only if $A_2 = B_2 = 0$. Since neither $A_j$ nor $B_j$ is necessarily 0 under the assumptions made so far, it is possible to obtain strict gains in efficiency for both $\beta_1$ and $\beta_2$.

Finally, when there is no missingness, the moment conditions in (1.4.1) just give us the standard 2SLS estimator. Because $s_1$ and $s_2$ are always 1, $g_3(\cdot)$ and $g_4(\cdot)$ are always zero and we are left with $g_1(\cdot)$ and $g_2(\cdot)$. Since $g_2(\cdot)$ adds an equal number of additional parameters as additional moment functions to $g_1(\cdot)$, the GMM estimator of $\beta$ from $g_1(\cdot)$ will be the same as that from $g_1(\cdot)$ and $g_2(\cdot)$.11 Thus, estimation is based only on $g_1(\cdot) = z'(y - x\beta)$, which is the usual moment function used by 2SLS, along with a weight matrix constructed under homoskedasticity of $u$.

9 Wooldridge (2010), Section 8.6.
10 The proof for this proposition is an extension of the proof of Proposition 2 in Abrevaya & Donald (2017).
11 Ahn & Schmidt (1995), Theorem 1.

Proposition 1.5.1.3. If $P(s_1 = 1) = P(s_2 = 1) = 1$ and $\hat{\Omega}_{11} = (n^{-1}\sum_{i=1}^{n} z_i'z_i)^{-1}$, where $\hat{\Omega}_{11}$ is the upper left $(q+k) \times (q+k)$ block of $\hat{\Omega}$, then $\hat{\beta}$ equals the standard 2SLS estimator.

1.5.2 Estimators combining different data sets

A special case of missingness occurs when data are combined from more than one data set, one or more of which do not contain either $y$ or some or all elements of $x$. For instance, the pattern of missingness in Figure 1.1 can result from combining three data sets, one of which contains all of $y$, $x$ and $z$, a second is missing $y$, and a third is missing $x$. In this case, one can just use the first data set to estimate $\beta$, but the second and third can be used to achieve efficiency gains using the framework in Section 1.4. One does have to be careful in making sure that Assumption 1.3.1 holds in order to ensure consistency. For instance, a sufficient condition would be that the different data sets being combined are just random samples of different variables from the same population.

There may also be cases where estimation using a single data set is not possible at all. A prominent example is when one data set contains only $y$ and $z$, while the second contains only $x$ and $z$. The most commonly used estimator in this case is the TS2SLS.12 The TS2SLS is a sequential GMM estimator based only on $g_3(\cdot)$ and $g_4(\cdot)$, since $s_1 s_2 = 0$ in this case.

12 This estimator is discussed in detail in a GMM context by Inoue & Solon (2010).

Definition 1.5.2.1. Call the estimator of $\beta$ obtained by the following sequential procedure the two-sample two stage least squares (or $\hat{\beta}_{TS2SLS}$).
Step 1: Obtain $\breve{\Pi}$ by minimizing (1.4.2), where $g(\cdot)$ contains only $g_3(\cdot)$ and $\hat{\Omega} = I$.
Step 2: Estimate $\beta$ by minimizing (1.4.2), where $g(\cdot)$ contains only $g_4(\cdot)$, $\hat{\Omega} = (n^{-1}\sum_{i=1}^{n} s_{1i}z_i'z_i)^{-1}$, and $\Pi = \breve{\Pi}$ is treated as given.

There are two differences between $\hat{\beta}$ and $\hat{\beta}_{TS2SLS}$. The first is in terms of the assumptions made by the two estimators. The traditional analysis of $\hat{\beta}_{TS2SLS}$ or the related two-sample IV (TSIV) estimator either assumes MCAR (Angrist & Krueger, 1995), or imposes restrictions on $z$ and $x$ that essentially follow from assuming MCAR. For instance, Angrist & Krueger (1992), in using the TSIV estimator, assume that $\mathrm{E}(z'x \mid s_1 = 1) = \mathrm{E}(z'x \mid s_2 = 1)$. Inoue & Solon (2010) make the same assumption, along with $\mathrm{E}(z'z \mid s_1 = 1) = \mathrm{E}(z'z \mid s_2 = 1)$ and that the fourth moments of $z$ conditional on $s_1$ and $s_2$ are equal. The framework presented in this paper allows for relaxation of these restrictive assumptions. By allowing $s_1$ and $s_2$ to depend on $z$, I allow for the distribution of $z$ (and $x$) to be different conditional on $s_1$ and $s_2$. However, the coefficient in the linear projection of $x$ on $z$ (that is, $\Pi$) remains the same conditional on $s_1$ and $s_2$ under Assumption 1.3.1.13 The second difference is in terms of the weight matrix used.
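For concreteness, the two steps in Definition 1.5.2.1 amount to an exactly identified first-stage regression in the sample that observes $x$, followed by a regression of $y$ on $z\breve{\Pi}$ in the sample that observes $y$. The sketch below is my own illustration (hypothetical names, not code from the dissertation), assuming the two samples are stacked into common arrays with $s_1$ and $s_2$ indicating which variables are observed.

```python
import numpy as np

def ts2sls(y, X, Z, s1, s2):
    """Two-sample 2SLS of Definition 1.5.2.1 (illustrative sketch).

    Sample 1 (s1 == 1) observes (y, z); sample 2 (s2 == 1) observes (x, z).
    """
    # Step 1: first-stage coefficients Pi from the sample that observes x,
    # i.e. the exactly identified GMM problem based on g3 only.
    Zx, Xx = Z[s2 == 1], X[s2 == 1]
    Pi = np.linalg.solve(Zx.T @ Zx, Zx.T @ Xx)      # (q+k) x (p+k)
    # Step 2: with the stated weight matrix, minimizing the g4 block is
    # equivalent to regressing y on z*Pi in the sample that observes y.
    Zy, yy = Z[s1 == 1], y[s1 == 1]
    Xhat = Zy @ Pi
    beta = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ yy)
    return beta, Pi
```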
Note that the weight matrix used in Step 2 of Definition 1.5.2.1 is the sample counterpart of $C_{44}^{-1}$ (divided by the variance of $v$, which is just a constant) when $v$ satisfies the following assumption:
$$\mathrm{E}(v^2 \mid z, s_1) = \sigma_v^2. \quad (1.5.1)$$
That is, the variance of $v$ is constant conditional on both the instruments $z$ and $s_1$. If this assumption is not true, then $\hat{\beta}_{TS2SLS}$ uses a sub-optimal weight matrix in Step 2 and efficiency gains are possible by using the optimal weight matrix.14 Let
$$\hat{D}_{32} = -\frac{1}{n}\sum_i \left[s_{2i}\,(z_i'z_i \otimes c_1, \ldots, z_i'z_i \otimes c_{p+k})\right], \qquad \hat{D}_{42} = -\frac{1}{n}\sum_i \left(s_{1i}\,\hat{\beta}' \otimes z_i'z_i\right) \quad (1.5.2)$$
be consistent estimates of $D_{32}$ and $D_{42}$ respectively, and let $\hat{C}_{44}$ and $\hat{C}_{33}$ be as defined in Section 1.4, where consistent estimates of $\beta$ and $\Pi$ can now be obtained using $\hat{\beta}_{TS2SLS}$.

13 Note that for $\hat{\beta}_{TS2SLS}$ to be consistent, we only need $\Pi$ to be the same conditional on $s_1$ and $s_2$, and not the individual moments involved in the calculation of $\Pi$.
14 Two things should be noted here:
• Because $s_1 s_2 = 0$, $g_3(\cdot) = s_2\, z' \otimes (x - z\Pi)'$ and $g_4(\cdot) = s_1\, z'(y - z\Pi\beta)$.
• Since $\mathrm{E}[g_2(\cdot)] = 0$ is an exactly identified set of moment conditions, the weight matrix does not matter for estimation in Step 1 of Definition 1.5.2.1.

Definition 1.5.2.2. Call the estimator of $\beta$ obtained by replacing
$$\hat{\Omega} = \left[\hat{C}_{44} + \hat{D}_{42}\left(\hat{D}_{32}'\,\hat{C}_{33}^{-1}\,\hat{D}_{32}\right)^{-1}\hat{D}_{42}'\right]^{-1}$$
in Step 2 of the procedure in Definition 1.5.2.1 the optimal TS2SLS estimator (or $\hat{\beta}_{TS2SLS-O}$).

This is the optimal sequential GMM estimator under the assumptions made so far, and its asymptotic variance is given in the following result.

Proposition 1.5.2.1. Under Assumption 1.3.1, $\hat{\beta}_{TS2SLS-O}$ is the optimal sequential GMM estimator of $\beta$, and has an asymptotic variance given by
$$\mathrm{Avar}\left[\sqrt{n}(\hat{\beta}_{TS2SLS-O} - \beta)\right] = \left\{D_{41}'\left[C_{44} + D_{42}\left(D_{32}'\,C_{33}^{-1}\,D_{32}\right)^{-1}D_{42}'\right]^{-1}D_{41}\right\}^{-1}.$$
Since $\hat{\beta}_{TS2SLS}$ uses a sub-optimal weight matrix as opposed to $\hat{\beta}_{TS2SLS-O}$, the latter will be no less efficient than the former.

Proposition 1.5.2.2. Under Assumption 1.3.1, $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{TS2SLS} - \beta)] - \mathrm{Avar}[\sqrt{n}(\hat{\beta}_{TS2SLS-O} - \beta)]$ is positive semi-definite.

The proposed estimator $\hat{\beta}$ is then equally efficient as $\hat{\beta}_{TS2SLS-O}$.

Proposition 1.5.2.3. Under Assumption 1.3.1, $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{TS2SLS-O} - \beta)] = \mathrm{Avar}[\sqrt{n}(\hat{\beta} - \beta)]$.

From Propositions 1.5.2.2 and 1.5.2.3, we can conclude that $\hat{\beta}$ is no less efficient than $\hat{\beta}_{TS2SLS}$.

Proposition 1.5.2.4. Under Assumption 1.3.1, $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{TS2SLS} - \beta)] - \mathrm{Avar}[\sqrt{n}(\hat{\beta} - \beta)]$ is positive semi-definite.

Inoue & Solon (2005) address the issue of optimal weighting using a joint GMM and allowing for conditional heteroskedasticity. Their framework, however, is more restrictive than necessary. First, they start with zero conditional means of the errors in (1.2.1) and (1.2.4), which rules out the important case where (1.2.1) and (1.2.4) are just linear projections and the data are MCAR. Second, they impose restrictions on the second and third moments of $x$ and $z$, which this framework does not.
For this case, McDonough & Millimet (2017) discuss a sequential estimator which is the counterpart of linear imputation in the case where 𝑥 is exogenous in equation (1.2.1). Definition 1.5.3.1 Call the estimator of 𝛽 obtained by the following procedure the imputation estimator (or 𝛽ˆ 𝐼𝑚 𝑝 ). Step 1: Obtain Π̂ by minimizing (1.4.2), where 𝑔(.) contains only 𝑔2 (.) and Ω̂ = 𝐼. Step 2: Estimate 𝛽 by minimizing (1.4.2), where 𝑔(.) = 𝑔5 (.) = 𝑧0 {𝑦 − [𝑠𝑥 + (1 − 𝑠)𝑧Π̂] 𝛽}, Ω̂ = [𝑛−1 𝑖=1 𝑔5𝑖 (.)𝑔5𝑖 (.) 0] −1 and Π̂ is treated as given. Í𝑛 So in the first step, we estimate the first stage coefficients Π. We then replace the missing values of 𝑥 with 𝑧Π̂ and estimate 𝛽 in the second step using 2SLS on the full sample and treating the fitted values of 𝑥 as given. It is straightforward to show that this estimator is no more efficient than 𝛽. ˆ Consider the sequential estimator of 𝛽 that first estimates Π using 𝑔2 (.) and then estimates 𝛽 using 𝑔1 (.) and 𝑔4 (.), where 𝑔4 (.) uses the estimated Π from the first step. Definition 1.5.3.2 Call the estimator of 𝛽 obtained by the following procedure the sequential estimator (or 𝛽ˆ𝑆𝑒𝑞 ). Step 1: Same as Step 1 in Definition 1.5.3.1. 15 Step 2: Estimate 𝛽 by minimizing (1.4.2), where 𝑛 𝑔(𝛽, Π̂) = 𝑔1 (𝛽) 0, 𝑔4 (𝛽, Π̂) 0 0, Ω̂ = [𝑛−1 Õ 𝑔𝑖 (.)𝑔𝑖 (.) 0] −1  𝑖=1 and Π̂ is treated as given. By standard GMM theory, we know that 𝛽ˆ is no less efficient than 𝛽ˆ𝑆𝑒𝑞 , since it is a sequential estimator (as opposed to a joint estimator) based on the same moment conditions as 𝛽.15 ˆ Moreover, 𝑔5 (.), which is the moment condition used in Step 2 of Definition 1.5.3.1 can be obtained by adding 𝑔1 (𝛽) and 𝑔4 (𝛽, Π̂), which are the moment conditions used in step 2 of Definition 1.5.3.2. Since 𝛽ˆ𝑆𝑒𝑞 uses 𝑔5 (.) and an additional moment condition, it is no less efficient than 𝛽ˆ 𝐼𝑚 𝑝 . Thus we can conclude that 𝛽ˆ is no less efficient than 𝛽ˆ 𝐼𝑚 𝑝 and there is no reason to choose the latter over the former other than computational convenience. 1.5.4 Dummy variable method A common method used to deal with missingness in 𝑥 in the case where 𝑥 is exogenous is the dummy variable method, which entails replacing the missing values of 𝑥 with zeros and including an indicator for missingness as a covariate. As shown by Abrevaya & Donald (2017), this method is inconsistent unless some zero restrictions are imposed in the population. This method continues to be inconsistent in the current framework where 𝑥 is endogenous. Let 𝑃(𝑠1 = 1) = 1, that is, 𝑦 is always observed. Also note that (1.2.4) implies 𝑥 1 = 𝑧1 Π11 + 𝑥2 Π21 + 𝑟 1 , (1.5.3) where Π11 , Π21 and 𝑟 1 constitute the first 𝑝 columns of Π1 , Π2 and 𝑟 respectively.16 Then (1.2.1) and (1.5.2) imply 𝑦 = [𝑠2 𝑥1 + (1 − 𝑠2 )(𝑧1 Π11 + 𝑥 2 Π21 + 𝑟 1 )] 𝛽1 + 𝑥 2 𝛽2 + 𝑢. (1.5.4) 15Prokhorov & Schmidt (2009), Theorem 2.2, part 5. 16One can similarly write 𝑥2 = 𝑧1 Π12 + 𝑥2 Π22 + 𝑟 2 . However, it is clear that both Π12 and 𝑟 2 are identically 0 and Π22 is a 𝑘 × 𝑘 identity matrix. 16 Since 𝑥 2 contains the constant, write 𝑥 2 = (1, 𝑥22 ) where 𝑥 22 constitutes the last (𝑘 − 1) columns of 𝑥 2 . Correspondingly, write Π21 = (Π0211 , Π0212 ) 0, where Π211 is the first row of Π21 and Π212 constitutes the last (𝑘 − 1) rows of Π21 . Plugging this into (1.5.3) and re-arranging gives  𝑦 = 𝑠2 𝑥 1 𝛽1 + (1 − 𝑠2 ) 𝑧1 Π11 + Π211 + 𝑥 22 Π212 + 𝑟 1 𝛽1 + 𝑥 2 𝛽2 + 𝑢. 
The dummy variable method omits the covariates $(1 - s_2)z_1$ and $(1 - s_2)x_{22}$ from equation (1.5.5) and estimates by 2SLS the equation
$$y = s_2 x_1\beta_1 + (1 - s_2)\Pi_{211} + x_2\beta_2 + e \quad (1.5.6)$$
using instruments $(s_2 z_1, 1 - s_2, x_2)$, where $e \equiv (1 - s_2)(z_1\Pi_{11} + x_{22}\Pi_{212})\beta_1 + r_1\beta_1 + u$. However, since each of these instruments is now correlated with the new error $e$, 2SLS will not yield consistent estimates in general unless we impose some zero restrictions in the population.

Proposition 1.5.4.1. The 2SLS estimators of $\beta$ from equation (1.5.6) using instruments $(s_2 z_1, 1 - s_2, x_2)$ are inconsistent unless (i) $\beta_1 = 0$ or (ii) $\Pi_{11} = \Pi_{212} = 0$.

The first condition implies that $x_1$ is irrelevant in the model of interest (1.2.1), so the best solution is to drop it. The second implies that neither the excluded instruments $z_1$ nor the always observed covariates $x_{22}$ help in explaining $x_1$, in which case any estimation method based on $z_1$ cannot be used at all.

1.6 Missing instruments

In Sections 1.2-1.5, I discussed the case where $y$ and the endogenous elements of $x$ (that is, $x_1$) contain missing values, while the instruments $z$ are always observed. In this section, I consider the case where the excluded instruments $z_1$ contain missing values. This includes as a special case missingness in covariates when all the covariates are exogenous. Starting with the population model in Section 1.2, I now additionally introduce a linear projection of the excluded instruments $z_1$ on the always observed exogenous covariates $x_2$:
$$z_1 = x_2\Gamma + e, \quad (1.6.1)$$
where, by definition of a linear projection,
$$\mathrm{E}(x_2'e) = 0. \quad (1.6.2)$$
As discussed in Section 1.5.4, (1.2.4) implies that
$$x_1 = z_1\Pi_{11} + x_2\Pi_{21} + r_1. \quad (1.6.3)$$
Plugging (1.6.1) into (1.6.3) gives us a first stage in terms of $x_2$ only:
$$x_1 = x_2(\Gamma\Pi_{11} + \Pi_{21}) + (e\Pi_{11} + r_1). \quad (1.6.4)$$
Plugging (1.6.4) into (1.2.1) gives us a reduced form for $y$ in terms of $x_2$ only:
$$y = x_2(\Gamma\Pi_{11}\beta_1 + \Pi_{21}\beta_1 + \beta_2) + (e\Pi_{11}\beta_1 + r_1\beta_1 + u). \quad (1.6.5)$$
Now, for observation $i$, let
$$s_{3i} = \begin{cases} 1 & \text{if } z_{1i} \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
I impose the following assumptions on the missingness mechanism, which can be interpreted in a similar way as Assumption 1.3.1.

Assumption 1.6.1: (i) $\mathrm{E}(s_3 z'u) = 0$; (ii) $\mathrm{E}(s_3 z'r) = 0$; (iii) $\mathrm{E}(s_3 x_2'e) = 0$.

This gives us the following moment functions:
$$h(\beta, \Pi, \Gamma) = \begin{pmatrix} s_3\, z'(y - x\beta) \\ s_3\, z' \otimes (x_1 - z_1\Pi_{11} - x_2\Pi_{21})' \\ s_3\, x_2' \otimes (z_1 - x_2\Gamma)' \\ (1 - s_3)\, x_2' \otimes [x_1 - x_2(\Gamma\Pi_{11} + \Pi_{21})]' \\ (1 - s_3)\, x_2'\,[y - x_2(\Gamma\Pi_{11}\beta_1 + \Pi_{21}\beta_1 + \beta_2)] \end{pmatrix} \equiv \begin{pmatrix} h_1(\beta, \Pi, \Gamma) \\ h_2(\beta, \Pi, \Gamma) \\ h_3(\beta, \Pi, \Gamma) \\ h_4(\beta, \Pi, \Gamma) \\ h_5(\beta, \Pi, \Gamma) \end{pmatrix} \quad (1.6.6)$$
This vector of moment functions is basically using the original model of interest and first stage when $z_1$ is observed ($h_1(\cdot)$ and $h_2(\cdot)$). When $z_1$ is missing, it uses the reduced forms for $x_1$ and $y$
This gives us a set of 2𝑘 (1 + 𝑝) + 𝑞(1 + 𝑝 + 𝑘) moment conditions for ( 𝑝 + 𝑘)(1 + 𝑞 + 𝑘) + 𝑘𝑞 parameters, giving us 𝑘 (1 + 𝑝) + 𝑞 − 𝑝 overidentifying restrictions.17 Then, let ℎ(𝛽, ¯ Π, Γ) = 𝑛−1 𝑖=1 Í𝑛 ℎ(𝑦𝑖 , 𝑥𝑖 , 𝑧𝑖 , 𝑠3𝑖 , 𝛽, Π, Γ), Λ be a square matrix of order 2𝑘 (1 + 𝑝) + 𝑞(1 + 𝑝 + 𝑘) that is nonrandom, symmetric, and positive definite, and Λ̃ be a first step consistent estimate of Λ. Then, 𝛽˜0, 𝑣𝑒𝑐( Π̃) 0, 𝑣𝑒𝑐( Γ̃) 0 is the standard two-step GMM estimator that minimizes the objective  function ¯ ℎ(𝛽, Π, Γ) 0 Λ̃ ℎ(𝛽, ¯ Π, Γ). (1.6.7) Let 𝛽˜𝑐𝑐 be the complete cases GMM that minimizes (1.6.7) with ℎ(.) = ℎ1 (.) and Λ̃ is a consistent estimate of [E(ℎ1 (.)ℎ1 (.) 0] −1 . Then we know that 𝛽˜ is no less efficient than 𝛽˜𝑐𝑐 because the former uses more moment conditions. Proposition 1.6.1. Under Assumption 1.6.1, √ √ 𝑛( 𝛽˜𝑐𝑐 − 𝛽) − 𝐴𝑣𝑎𝑟 𝑛( 𝛽˜ − 𝛽) is positive semi-definite.   𝐴𝑣𝑎𝑟 Similar to Section 1.5, we can break down the efficiency gains by 𝛽1 and 𝛽2 , the coefficients on the endogenous and exogenous elements of 𝑥 respectively, and show that it is possible to obtain strict gains in efficiency for both 𝛽1 and 𝛽2 .18 This is in contrast with the sequential estimator discussed in Mogstad & Wiswall (2012). They consider the case where 𝑝 = 𝑞 = 1 and the estimator proceeds in two steps. In the first step, it estimates Γ using ℎ3 (.). It then replaces the missing values of 𝑧1 by the imputed values 𝑥 2 Γ̂, where Γ̂ is the first step estimate of Γ, and then in the second step estimates 𝛽 by minimizing (1.6.7) where ℎ(.) = 𝑧∗0 (𝑦 − 𝑥 𝛽) and 𝑧∗ = 𝑠3 𝑧1 + (1 − 𝑠3 )𝑥2 Γ̂, 𝑥2 .19 They show that the estimate of 𝛽1 using  17Since Π21 = 0 and Π22 = 𝐼, the only elements of Π that are being estimated are Π11 and Π21 . 18This proof is analogous to that of Proposition 5.1.2 and is available upon request. 19The weight matrix is irrelevant in this case due to exact identification. 19 this estimator is numerically equivalent to that using complete cases estimator 𝛽˜𝑐𝑐 . Thus, 𝛽˜ does better than this estimator as it is possible to obtain strict gains in efficiency for both 𝛽1 and 𝛽2 . Abrevaya & Donald (2011) also propose a GMM estimator to deal with missingness in 𝑧1 . Their estimator is based on the moment functions ℎ 𝐴 (𝛽) = 𝑧0𝐴 (𝑦 − 𝑥 𝛽), (1.6.8) where 𝑧 𝐴 = (𝑥 2 , (1 − 𝑠3 )𝑥2 , 𝑠3 𝑧1 ). It is clear that the moment functions in (1.6.6) contain (1.6.8) as a linear combination plus some additional moment conditions. Thus, 𝛽˜ is no less efficient than their estimator. Now, when 𝑥 1 is exogenous in equation (1.2.1) in the sense that E(𝑥 10 𝑢) = 0, (1.6.9) then 𝑥1 = 𝑧1 . In this case, ℎ2 (.) = 0 and ℎ4 (.) cannot be used anymore.20 So our vector of moment conditions is 0 (𝑦 − 𝑥 𝛽)      𝑠 3 𝑥  0         E  𝑠3 𝑥 20 ⊗ (𝑥 1 − 𝑥 2 Γ) 0  = 0    (1.6.10)      (1 − 𝑠3 )𝑥 20 [𝑦 − 𝑥 2 (Γ𝛽1 + 𝛽2 )]  0         These are the moment conditions used by Abrevaya & Donald (2017) who consider the case of missingness in a single exogenous covariate. Thus, the framework presented here encompasses theirs as a special case when when 𝑥 1 is exogenous and 𝑝 = 1. 1.7 Nonlinearity in covariates and instruments Nonlinear functions of the covariates, like squares and interactions, are frequently used in empirical work. If these covariates are endogenous, one generally uses nonlinear functions of the instruments as well. 
In general, any sequential procedure that plugs the fitted values of the covariates or the instruments from a first step into nonlinear functions of these variables produces inconsistent estimates. For instance, traditional imputation used when the covariates are exogenous will result in inconsistency if one replaces the missing value of, say, the square of a covariate with the square of the imputed value of that covariate. In this section, I provide estimators that are consistent as well as more efficient than these sequential procedures and the complete case methods.

Suppose that the model of interest is now given by
$$y = F_1(x_1, x_2)\beta + u, \quad (1.7.1)$$
where $x_1$ is a $1 \times p$ vector of potentially endogenous covariates, $x_2$ is a $1 \times k$ vector of exogenous covariates, $x = (x_1, x_2)$, and $F_1(x_1, x_2)$ is a $1 \times j_1$ vector of potentially nonlinear functions of $x_1$ and $x_2$, $j_1 \geq (p+k)$. For instance, suppose $p = k = 1$. Then $F_1(x_1, x_2)$ could equal $(x_1, x_1^2, x_1 x_2, x_2)$. We also have a $1 \times q$ vector of instruments $z_1$ for $x_1$, $q \geq p$. I assume
$$\mathrm{E}(u \mid z_1, x_2) = 0, \quad (1.7.2)$$
and allow for $\mathrm{E}(x_1'u) \neq 0$. So I now assume that $u$ has a zero mean conditional on $z_1$ and $x_2$.21 The first stage is given by the linear projection
$$F_1(x_1, x_2) = F_2(z_1, x_2)\Pi + r. \quad (1.7.3)$$
$F_2(z_1, x_2)$ is a $1 \times j_2$ vector of instruments, where $F_2(\cdot)$ is chosen by the researcher, and $\Pi$ is a $j_2 \times j_1$ matrix of coefficients. Because $F_1(x_1, x_2)$ contains nonlinear functions of $x_1$, $F_2(z_1, x_2)$ will most likely also contain nonlinear functions of $z_1$ and $x_2$. For instance, as discussed in Wooldridge (2010),22 if $F_1(x_1, x_2) = (x_1, x_1^2, x_1 x_2, x_2)$, one might want to choose $F_2(z_1, x_2) = (z_1, z_1^2, z_1 x_2, x_2, x_2^2)$. By definition,
$$\mathrm{E}[F_2(z_1, x_2)'r] = 0. \quad (1.7.4)$$
From equations (1.7.1) and (1.7.3), we get a reduced form for $y$ in terms of only $z_1$ and $x_2$:
$$y = F_2(z_1, x_2)\Pi\beta + v, \quad (1.7.5)$$
where $v \equiv r\beta + u$. Using (1.7.2) and (1.7.4), we have that
$$\mathrm{E}[F_2(z_1, x_2)'v] = 0. \quad (1.7.6)$$

21 This is a standard assumption made in the literature when the model includes nonlinear functions of covariates, and it motivates the choice of instruments.
22 Section 9.5.

1.7.1 Missingness in outcome and covariates

Starting with the case of missingness in $y$ and $x_1$, let the scheme of missingness be the same as described in Section 1.3. That is, both $y$ and $x_1$ contain missing values, while $z_1$ and $x_2$ are always observed. In this case, what seems like the natural extension of the sequential estimator discussed in McDonough & Millimet (2017) will be inconsistent for $\beta$ because it performs the “forbidden regression” discussed in Wooldridge (2010).23 For instance, let $F_1(x_1, x_2) = (x_1, x_1^2, x_1 x_2, x_2)$. The sequential estimator would regress $x_1$ on $F_2(z_1, x_2)$ and obtain the fitted values (say $\hat{x}_1$) in the first step, replace the missing values of $x_1$, $x_1^2$ and $x_1 x_2$ with $\hat{x}_1$, $(\hat{x}_1)^2$ and $\hat{x}_1 x_2$ respectively, and then estimate $\beta$ using 2SLS in the second step, treating the fitted values as data. The inconsistency is a result of replacing nonlinear functions of $x_1$ with the same nonlinear functions of fitted values. The correct way to proceed is to simultaneously estimate the first stage parameters $\Pi$ and the parameters of interest $\beta$. I first impose the following assumption on the missingness mechanism.

Assumption 1.7.1.1.
(i) $\mathrm{E}[s_1 s_2 F_2(z_1, x_2)'u] = 0$; (ii) $\mathrm{E}[s_1 s_2 F_2(z_1, x_2)'r] = 0$; (iii) $\mathrm{E}[s_1 F_2(z_1, x_2)'u] = 0$; (iv) $\mathrm{E}[s_1 F_2(z_1, x_2)'r] = 0$; (v) $\mathrm{E}[s_2 F_2(z_1, x_2)'r] = 0$.

This gives us the following moment conditions:
$$\mathrm{E}[g_{NL}(\beta, \Pi)] = \mathrm{E}\begin{pmatrix} s_1 s_2\, F_2(z_1, x_2)'\,[y - F_1(x_1, x_2)\beta] \\ s_1 s_2\, F_2(z_1, x_2)' \otimes [F_1(x_1, x_2) - F_2(z_1, x_2)\Pi]' \\ (1 - s_1) s_2\, F_2(z_1, x_2)' \otimes [F_1(x_1, x_2) - F_2(z_1, x_2)\Pi]' \\ s_1(1 - s_2)\, F_2(z_1, x_2)'\,[y - F_2(z_1, x_2)\Pi\beta] \end{pmatrix} = 0. \quad (1.7.7)$$
Compared to Section 1.4, we have simply replaced $x$ with $F_1(x_1, x_2)$ and $z$ with $F_2(z_1, x_2)$. Unlike Section 1.4, though, the traditional imputation is not consistent.

23 Section 9.5.2.

1.7.2 Missingness in instruments

Next we move to the missing data scenario of Section 1.6. That is, the only variables that contain missing values are the excluded instruments $z_1$. We re-write equation (1.7.3) as follows, by breaking up $F_2(z_1, x_2)$ into elements that do and do not depend on $z_1$:
$$F_1(x_1, x_2) = F_{21}(z_1, x_2)\Pi_a + F_{22}(x_2)\Pi_b + r, \quad (1.7.8)$$
where $F_2(z_1, x_2)\Pi \equiv F_{21}(z_1, x_2)\Pi_a + F_{22}(x_2)\Pi_b$, $F_{21}(z_1, x_2)$ is a $1 \times j_{21}$ vector that includes all elements of $F_2(z_1, x_2)$ that are functions of $z_1$, $F_{22}(x_2)$ is a $1 \times j_{22}$ vector that includes all elements of $F_2(z_1, x_2)$ that are functions only of $x_2$, and $j_2 = j_{21} + j_{22}$. From our example in Section 1.7.1, if $F_2(z_1, x_2) = (z_1, z_1^2, z_1 x_2, x_2, x_2^2)$, then $F_{21}(z_1, x_2) = (z_1, z_1^2, z_1 x_2)$ and $F_{22}(x_2) = (x_2, x_2^2)$. To handle missingness in $z_1$, we also need a linear projection of each of the instruments on $F_{22}(x_2)$:24
$$F_{21}(z_1, x_2) = F_{22}(x_2)\Gamma + e, \quad (1.7.9)$$
where by definition
$$\mathrm{E}[F_{22}(x_2)'e] = 0. \quad (1.7.10)$$
This gives us the reduced forms of $F_1(x_1, x_2)$ and $y$ in terms of $x_2$ only. Plugging (1.7.9) into (1.7.8), we get
$$F_1(x_1, x_2) = F_{22}(x_2)(\Gamma\Pi_a + \Pi_b) + e\Pi_a + r. \quad (1.7.11)$$
Similarly, plugging (1.7.11) into (1.7.1), we get
$$y = F_{22}(x_2)(\Gamma\Pi_a + \Pi_b)\beta + (e\Pi_a + r)\beta + u. \quad (1.7.12)$$
Next, I impose the following assumption on the missingness mechanism.

Assumption 1.7.2.1. (i) $\mathrm{E}[s_3 F_2(z_1, x_2)'u] = 0$; (ii) $\mathrm{E}[s_3 F_2(z_1, x_2)'r] = 0$; (iii) $\mathrm{E}[s_3 F_{22}(x_2)'e] = 0$.

24 Based on the exact functional form of $F_1(\cdot)$, one might want to choose different functions of $x_2$ in equation (1.7.9) than those in $F_{22}(\cdot)$. The framework can easily be extended to allow for that by replacing $F_{22}(x_2)$ with a different function $F_3(x_2)$ in (1.7.9) and deriving the reduced forms in (1.7.11) and (1.7.12) accordingly. For ease of exposition, I stick here with the same functions of $x_2$ in both (1.7.8) and (1.7.9).
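To fix ideas, the instrument sets of the running example and the projection in (1.7.9) can be built directly from the data; the sketch below is only my own illustration (names are hypothetical, not from the dissertation), for the $p = k = 1$ case with a single excluded instrument.

```python
import numpy as np

def build_F1(x1, x2):
    # F1(x1, x2) = (x1, x1^2, x1*x2, x2) in the running example
    return np.column_stack([x1, x1**2, x1 * x2, x2])

def build_F2(z1, x2):
    # F2(z1, x2) = (z1, z1^2, z1*x2, x2, x2^2); the first three columns depend
    # on z1 (F21), the last two are functions of x2 only (F22), as in (1.7.8)
    F21 = np.column_stack([z1, z1**2, z1 * x2])
    F22 = np.column_stack([x2, x2**2])
    return F21, F22

def project_F21_on_F22(F21, F22, s3):
    # Linear projection (1.7.9), estimated on the cases with z1 observed
    obs = s3 == 1
    Gamma = np.linalg.lstsq(F22[obs], F21[obs], rcond=None)[0]   # j22 x j21
    return Gamma
```

The point of the joint moment conditions that follow is that the incomplete cases enter only through the reduced forms in $F_{22}(x_2)$; at no point are imputed values of $z_1$ plugged into the nonlinear functions themselves.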
Assumption 1.7.2.1 gives us the following moment conditions:
$$\mathrm{E}[h_{NL}(\beta, \Pi, \Gamma)] = \mathrm{E}\begin{pmatrix} s_3\, F_2(z_1, x_2)'\,[y - F_1(x_1, x_2)\beta] \\ s_3\, F_2(z_1, x_2)' \otimes [F_1(x_1, x_2) - F_2(z_1, x_2)\Pi]' \\ s_3\, F_{22}(x_2)' \otimes [F_{21}(z_1, x_2) - F_{22}(x_2)\Gamma]' \\ (1 - s_3)\, F_{22}(x_2)' \otimes [F_1(x_1, x_2) - F_{22}(x_2)(\Gamma\Pi_a + \Pi_b)]' \\ (1 - s_3)\, F_{22}(x_2)'\,[y - F_{22}(x_2)(\Gamma\Pi_a + \Pi_b)\beta] \end{pmatrix} = 0. \quad (1.7.13)$$
In the case where $x_1$ is exogenous (and hence $x_1 = z_1$), this reduces to
$$\mathrm{E}[h_{NL}(\beta, \Pi, \Gamma)] = \mathrm{E}\begin{pmatrix} s_3\, F_2(z_1, x_2)'\,[y - F_1(x_1, x_2)\beta] \\ s_3\, F_{22}(x_2)' \otimes [F_{21}(x_1, x_2) - F_{22}(x_2)\Gamma]' \\ (1 - s_3)\, F_{22}(x_2)'\,[y - F_{22}(x_2)(\Gamma\Pi_a + \Pi_b)\beta] \end{pmatrix} = 0. \quad (1.7.14)$$
As discussed in Abrevaya & Donald (2017), when $x_1$ is exogenous, the second most commonly used method after the complete cases OLS is linear imputation. In the example we have been carrying along, where $F_1(x_1, x_2) = (x_1, x_1^2, x_1 x_2, x_2)$, it proceeds as follows. In the first step, it regresses $x_1$ on $x_2$ and obtains the fitted values (say $\tilde{x}_1$). In the second step, it replaces the missing values of $x_1$, $x_1^2$ and $x_1 x_2$ with $\tilde{x}_1$, $\tilde{x}_1^2$, and $\tilde{x}_1 x_2$ respectively. Not only does this method fail to use the optimal instruments for $x_1$ (as it fails to include the nonlinear functions of $x_2$ in the imputation equation), it performs a forbidden regression in the second step, and hence results in inconsistent estimates of $\beta$.

1.8 Monte Carlo simulations

1.8.1 Missingness in outcome and covariates

The data generating process is as follows:
$$y = 1 + x_1\beta_1 + x_2\beta_2 + u,$$
where $x_1$ is a scalar and $x_2 = [1\;\; x_{22}\;\; x_{23}]$ is a $1 \times 3$ vector. Moreover,
$$\begin{pmatrix} x_{22} \\ x_{23} \end{pmatrix} \sim N\left(\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 2 & 0.1 \\ 0.1 & 3 \end{pmatrix}\right).$$
The third is the imputation estimator 25 discussed in Section 1.5.3, followed by the dummy variable method and finally the proposed estimator. The first thing to note is that the proposed estimator works best in terms of efficiency in all cases, with substantial reductions in the standard deviation relative to other estimators. This is true not only for 𝛽22 and 𝛽23 , the coefficients on 𝑥2 , but also for 𝛽1 , the coefficient on the covariate with missing values. The pattern on bias relative to other estimators is less clear, but the proposed estimator still has the smallest root mean squared error out of all the estimators in all cases. The gains in efficiency of the proposed estimator are more pronounced when we have het- eroskedasticity. Relative to the complete cases GMM, the gains increase as the percentage of complete cases decreases, which is to be expected as the proposed estimator now incorporates more additional information into estimation. The gains remain substantial in the case where the coefficient on the covariate with missing values is small. The complete cases GMM is more efficient than the complete cases 2SLS when there is heteroskedasticity because of the optimal weighting, as expected. Yet it is less efficient than the proposed estimator in all cases, including when the error in the model of interest is homoskedastic. The imputation estimator on the other hand is not guaranteed to bring any efficiency gains relative to the complete cases GMM, and hence has no reason to be preferred over the former. The dummy variable method shows severe bias in all but the last design where the coefficient on the variable with missing value is close to 0, and does not even guarantee gains in efficiency over the complete cases GMM. Thus, this estimator cannot be recommended either. 1.8.2 Missingness in instruments The data generating process is as follows. 𝑦 = 1 + 𝑥 1 𝛽1 + 𝑥 2 𝛽2 + 𝑢. 26 where 𝑥 1 is a scalar and 𝑥 2 = [1 𝑥 22 𝑥 23 ] is a 1 × 3 vector. Moreover,      ! 𝑥  1  2 0.2 22   ∼𝑁  ,        𝑥  1 0.2 1   23            𝛽1 = 1, 𝛽2 = (𝛽21 , 𝛽22 , 𝛽23 ) 0 is fixed at (1, 1, 1) 0 throughout all designs. The error is 𝑢 = 𝜎𝑢 𝑢 ∗ , where 𝑢 ∗ is a standard normal, and 𝜎𝑢 will be used to vary the error variance. We have a single instrument 𝑧1 such that 𝑧1 = 𝑥 2 Γ + 𝑒, where Γ = (1, 0.5, 0.5) 0 and 𝑒 is standard normal. The first stage is given by 𝑥 1 = 𝑧1 Π11 + 𝑥2 Π21 + 𝑟 1 , where Π11 = 1, Π21 = (1, 0.5, 0.5) 0 and 𝑟 1 = 𝑟 1∗ + 𝑢 ∗ , where 𝑟 1∗ is a standard normal and 𝑢 ∗ is the part of 𝑥 1 that is correlated with 𝑢. The missingness is based on a uniform random variable, making the data MCAR. 𝑠∗ ∼ U (0, 1), 𝑠3 = 1[𝑠∗ > 𝑎]. I consider 3 designs. Design 5: 𝜎𝑢 = 4, 𝑎 = 0.5. q Design 6: 𝜎𝑢 = 𝑒𝑥 𝑝(𝑧211 ), 𝑎 = 0.5. q Design 7: 𝜎𝑢 = 𝑒𝑥 𝑝(𝑧211 ), 𝑎 = 0.4. Design 5 is the case of homoskedasticity in the model of interest, design 6 allows for 𝑢 to be heteroskedastic, and design 7 increases the percentage of complete cases. For all the designs, I do 1000 iterations with 𝑛 = 2000. The results are qualitatively similar to those in the previous sub-section. The proposed estimator substantially improves efficiency and has a lower root mean squared error relative to the complete cases 2SLS in all cases including that of homoskedasticity.25 The gains are more pronounced in the case of heteroskedasticity and increase with a reduction in the percentage of complete cases. 
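To make the simulation design concrete, the following is a minimal Python sketch (not part of the original code) of the Design 5 data generating process and of the complete-cases 2SLS benchmark it is compared against; the variable names and the use of a just-identified IV formula for the benchmark are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Design 5 of Section 1.8.2 (sigma_u = 4, a = 0.5); parameter values follow the
# text, the implementation details are my own.
beta1 = 1.0
beta2 = np.array([1.0, 1.0, 1.0])                       # coefficients on (1, x22, x23)
x2 = rng.multivariate_normal([1, 1], [[2, 0.2], [0.2, 1]], size=n)
X2 = np.column_stack([np.ones(n), x2])                  # x2 = [1, x22, x23]

u_star = rng.standard_normal(n)
u = 4.0 * u_star                                        # homoskedastic error, sigma_u = 4
e = rng.standard_normal(n)
z1 = X2 @ np.array([1.0, 0.5, 0.5]) + e                 # instrument equation z1 = x2*Gamma + e
r1 = rng.standard_normal(n) + u_star                    # endogeneity enters through u*
x1 = 1.0 * z1 + X2 @ np.array([1.0, 0.5, 0.5]) + r1     # first stage

y = 1.0 + beta1 * x1 + X2 @ beta2 + u

s3 = rng.uniform(size=n) > 0.5                          # z1 observed (MCAR), s3 = 1[s* > a]

# Complete-cases 2SLS benchmark: just identified, so beta_hat = (W'X)^{-1} W'y.
X = np.column_stack([X2, x1])[s3]
W = np.column_stack([X2, z1])[s3]
beta_hat = np.linalg.solve(W.T @ X, W.T @ y[s3])
print("complete-cases 2SLS (const, x22, x23, x1):", beta_hat.round(3))
```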
25The only exception is 𝛽1 in the case of homoskedasticity, where the two estimators perform equally well. 27 The imputation estimator for 𝛽1 is numerically equivalent to the complete cases 2SLS, as noted in Mogstad & Wiswall (2012). For 𝛽22 and 𝛽23 , this estimator always does no better than the proposed estimator, and sometimes does worse than even the complete cases 2SLS. Since it does not guarantee efficiency gains over the complete cases or the proposed estimator, there is no reason to prefer it over either of the two. 1.9 Empirical application I estimate the effect of physician’s advice to reduce weight on calorie consumption by individuals using the estimator proposed in Section 1.4. As noted by Joshi and Wooldridge (2020), physician’s advice is a low cost and precisely targeted intervention that can affect food consumption habits of individuals. The effect of physician’s advice on outcomes like smoking, dietary and exercise behavior has been considered by Loureiro & Nayga Jr (2006), Loureiro & Nayga Jr (2007), Secker- Walker et al. (1998), and Ortega-Sanchez et al. (2004), among others. The data comes from five most recent cycles of National Health and Nutritional Examination Survey (NHANES): 2007-08, 2009-2010, 2011-12, 2013-14, and 2015-16.26 The NHANES is designed to assess the health and nutritional status of adults and children in the US. It examines a nationally representative sample of about 5000 persons each year and contains demographic, socioeconomic, dietary, and health-related questions. The dependent variable (𝑦) is the log of calorie intake of individuals. The endogenous covariate (𝑥 1 ) is a binary variable which equals one if the physician advised the individual to lose weight. The excluded instruments (𝑧1 ) are binary variables indicating whether the individual has health insurance and a regular source of care. Other explanatory variables (𝑥 2 ) include demographic variables like age, gender, race, education, and income of the individual as well as health-related variables such as the individual’s body mass index (BMI), and indicators for whether they have high blood pressure, high cholesterol, Arthritis, a heart condition and Diabetes. Also included are year fixed effects and all variables have been demeaned. 26I would like to thank Riju Joshi for providing me with neatly compiled and cleaned data. 28 I restrict the sample to overweight individuals, that is, those with BMI greater than or equal to 25. I also exclude from the sample women who are pregnant, and individuals for whom the covariates 𝑥 2 or the excluded instruments 𝑧1 are missing. The final sample consists of 11,512 observations with 𝑦 missing for 952 observations and 𝑥 1 missing for 2173 observations. Table B8 reports the results for two estimators: the complete cases GMM and the estimator proposed in Section 1.4 which uses the incomplete cases. The former results in the coefficient of interest being insignificant, which continues to hold true with the reduced standard error resulting from the proposed estimator. The standard errors for all other coefficients are smaller as well using the proposed estimator, while the coefficients for most variables remain similar to those obtained using the complete cases GMM. 1.10 Conclusion I have offered some simple GMM estimators that improve efficiency over the currently used methods in the presence of missing data in linear regression models with endogenous covariates. 
I consider the cases of missingness in the outcomes and the endogenous covariates as well as that of missingness in the instruments. The latter includes the missingness in exogenous covariates as a special case. I also consider models that are nonlinear in the covariates and need a more careful treatment to ensure consistency. Thus, my framework can be used to deal with missingness in a wide variety of models frequently used in empirical work. In ongoing work, I am extending these methods to the case of panel data and models nonlinear in the parameters. 29 CHAPTER 2 IMPUTING MISSING COVARIATE VALUES IN NONLINEAR MODELS 2.1 Introduction Nonlinear models are widely considered better suited to explain limited dependent variables than linear models. With missing covariate values - a ubiquitous problem in empirical research - nonlinear models become even more important because unlike the case where all variables are observed, estimates from linear models are now not necessarily consistent for parameters in the best linear approximations to nonlinear models.1 Yet not much of the vast literature on missing data has explicitly addressed the unique issues that arise when dealing with missingness in nonlinear models. Economists deal with missing covariate values predominantly in three ways. The most common thing to do is to just use the “complete cases" - the observations for which all the covariates are observed. While easy to use, this method can lead to substantial loss of efficiency because of discarding the incomplete cases. This has inspired methods that make use of these incomplete cases. The first commonly used method in this regard is the dummy variable method (DVM), which replaces the missing values with 0 and includes an indicator for missingness as an additional covariate. The second commonly used method is two-step regression imputation. In the first step, it regresses the covariate with missing values (CMV) on the always-observed covariates using the complete cases and uses the estimated coefficients to predict missing values of the CMV. In the second step, it estimates the model of interest using all observations with this “composite" CMV, which consists of both observed and predicted values (Dagenais, 1973). Table D1 summarizes the usage of these methods in 5 highly ranked economics journals in the last 3 years. Out of 846 papers, about 26% reported having missing data. Out of these, about 62%, 19% and 14% used the complete cases estimator, the DVM and the two-step regression imputation respectively.2 1I discuss this issue in detail in Section 2.6.4. Also see Wooldridge (2002). 2Of all the other methods used, no single category stood out. About 18% of the papers use other methods, most 30 The choice of method comes down to consistency and relative efficiency. The complete cases estimator generally requires the least number of assumptions in both linear and nonlinear models to be consistent. For instance, when the econometric model is correctly specified, say a model of a mean or a distribution conditional on the covariates, it only requires that the missingness depends only on the covariates (Wooldridge, 2002). However, as mentioned above, it can be inefficient relative to the other two estimators that use the incomplete cases. The DVM on the other hand is generally inconsistent even in linear models (Jones, 1996) and as I show in this paper, in nonlinear models as well, unless some very strong zero assumptions are imposed. 
Even with these assumptions, it does not guarantee efficiency improvements over the complete cases estimator (Abrevaya & Donald, 2017). Yet this method is still widely used as is evident from Table D1, perhaps because of its ease of use. Two-step regression imputation also imposes additional assumptions on the model relative to the complete cases estimator, but these assumptions are much more plausible than those imposed by DVM. Practically, the most important one is ruling out the dependence of missingness on the CMV itself. Under this assumption, it is generally consistent in linear models. However, in this paper I show that even under this assumption, this method is generally in- consistent in nonlinear models. Most notable are models based on conditional means, including commonly used models like probit, tobit, and Poisson regression. The reason for inconsistency is that this method simply plugs the imputed values in the same objective function that one would minimize if there were no missing values. However, in nonlinear models, this objective function does not necessarily capture the correct relationship between the observed variables in observations with missing values. The core issue is that conditional expectation does not pass through nonlinear functions, unlike linear ones. For instance, in binary choice models, simply plugging imputed values in the standard probit response probability and maximizing the resulting log likelihood will generally result in inconsistency in estimators of both the structural parameters and other quantities of which are ad-hoc. This includes methods like replacing missing values with observations from the previous or following time period in case of panel data (5%), replacing missing values with 0 (4%), and dropping or combining variables with missingness (2%). Some papers also used hot deck (3%) and context specific imputation methods (2%). There were 2 instances each of multiple imputation and weighting. 31 of interest, such as average partial effects. To my knowledge, this issue has not been addressed in the literature and on the contrary, it has been claimed that this method is consistent in binary choice models (DeCanio & Watkins, 1998). The key contribution of this paper is to propose a one-step imputation estimator which relies on the same assumptions as two-step imputation, but is consistent in nonlinear models. It simulta- neously estimates the model of interest and the imputation model using the complete cases and a “reduced form" using all observations. The reduced form is a version of the main model in which we have “integrated out" the CMV using the imputation model, and hence it is able to make use of the incomplete cases. The key is that it correctly captures the relationship between the observed variables when the CMV is missing. The estimator provides potentially strict efficiency gains over the complete cases estimator for all coefficients, and using a generalized method of moments (GMM) framework provides the overidentification test as a test for underlying restrictions. The method is an extension of Abrevaya & Donald (2017), who proposed a one-step imputation estimator for linear models. I provide a unified treatment of linear and nonlinear models using an M-estimation framework. Special cases include linear and nonlinear least squares, conditional maximum likelihood, and quasi maximum likelihood methods. A second contribution is that I allow for nonlinearity in the imputation model itself. 
As mentioned above, the presence of missing data heightens the concerns about using linear models for limited dependent variables. Therefore, when imputing, say, a binary CMV, a probit may be more appropriate than a linear probability model. To my knowledge, the regression imputation literature has focused solely on linear imputation models, though some nonlinear models have been discussed in the context of multiple imputation, which is a Bayesian approach to imputation (Rubin, 1987; Van Buuren, 2007).

The rest of this paper is organized as follows. Section 2.2 lays out the population minimization problems obtained from the underlying model of interest and the imputation model. Section 2.3 describes the selection problem and the estimation of selection probabilities. Section 2.4 derives the proposed estimator, its asymptotic distribution, and a simple estimator of the asymptotic variance. Section 2.5 discusses two practically important examples: nonlinear models for fractional responses and for nonnegative responses, including count responses. Within each model, I consider a continuous and a binary CMV. Section 2.6 compares the proposed estimator to three other estimators: complete cases, two-step imputation, and DVM. Section 2.7 provides simulation results showing the relative performance of these estimators. Section 2.8 provides an empirical application to the estimation of the association between grade variance and educational attainment, as considered in Sandsor (2020). Section 2.9 concludes. Proofs, tables and figures are given in the appendices.

2.2 The population optimization problems

We start with the population optimization problem which defines the parameters of interest. Let $y$ be a $1 \times J$ random vector taking values in $\mathcal{Y} \subset \mathbb{R}^J$ and $x$ be a $1 \times (K+1)$ random vector taking values in $\mathcal{X} \subset \mathbb{R}^{K+1}$. We are interested in explaining $y$ in terms of $x$. Some aspect of the joint distribution of $(y, x)$ depends on an $L_1 \times 1$ parameter vector, $\alpha$, contained in a parameter space $\mathcal{A} \subset \mathbb{R}^{L_1}$. Let $f_1(y, x_1, x_2, \alpha)$ denote an objective function.

Assumption 2.2.1. $\alpha_0$ is the unique solution to the population minimization problem
\[
\min_{\alpha \in \mathcal{A}} \mathrm{E}[f_1(y, x_1, x_2, \alpha)]. \qquad (2.2.1)
\]
Often, $\alpha_0$ indexes some correctly specified feature of the distribution of $y$ conditional on $x$, such as a conditional mean or a conditional median. But we will derive consistency and asymptotic normality results for a general class of problems in which the underlying population model can be misspecified in some way.

Next, let $x = (x_1, x_2)$, where $x_1$ is a scalar³ and $x_2$ is a $1 \times K$ random vector, taking values in $\mathcal{X}_1 \subset \mathbb{R}$ and $\mathcal{X}_2 \subset \mathbb{R}^K$ respectively, with $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$. As discussed in Section 2.3, we will allow $x_1$ to contain missing values and assume that $(y, x_2)$ is always observed. Thus, we are interested in imputing $x_1$ using $x_2$. Let some aspect of the joint distribution of $(x_1, x_2)$ depend on an $L_2 \times 1$ parameter vector $\beta$, contained in a parameter space $\mathcal{B} \subset \mathbb{R}^{L_2}$. Let $f_2(x_1, x_2, \beta)$ denote an objective function, and consider the population optimization problem which characterizes the imputation parameters.

³ The discussion for a random vector $x_1$, all elements of which are missing and observed at the same time, is essentially the same.

Assumption 2.2.2. $\beta_0$ is the unique solution to the population minimization problem
\[
\min_{\beta \in \mathcal{B}} \mathrm{E}[f_2(x_1, x_2, \beta)]. \qquad (2.2.2)
\]
Similar to the model of interest, the underlying population model here can be misspecified in some way.
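To fix ideas, here is a sketch of the sample analogues of (2.2.1) and (2.2.2) for the probit / linear-normal pairing treated later in Section 2.5.1.1 (with a homoskedastic imputation error). The code is purely illustrative and the function and variable names are mine.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def f1_bar(alpha, y, x):
    """Sample analogue of (2.2.1): negative Bernoulli quasi-log-likelihood with E(y|x) = Phi(x @ alpha)."""
    p = np.clip(norm.cdf(x @ alpha), 1e-10, 1 - 1e-10)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def f2_bar(beta, x1, x2):
    """Sample analogue of (2.2.2): negative normal log-likelihood (up to a constant) for x1 | x2."""
    theta, log_sigma = beta[:-1], beta[-1]
    r = x1 - x2 @ theta
    return np.mean(log_sigma + 0.5 * r ** 2 / np.exp(2 * log_sigma))

# Illustration on synthetic data with no missing values:
rng = np.random.default_rng(0)
n = 500
x2 = np.column_stack([np.ones(n), rng.standard_normal(n)])
x1 = x2 @ np.array([0.5, 1.0]) + rng.standard_normal(n)
x = np.column_stack([x1, x2])
y = (x @ np.array([1.0, -0.5, 0.5]) + rng.standard_normal(n) > 0).astype(float)

alpha_hat = minimize(f1_bar, np.zeros(3), args=(y, x)).x
beta_hat = minimize(f2_bar, np.zeros(3), args=(x1, x2)).x
```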
The case that has been well studied in the classical imputation literature is where the underlying models for both 𝑓1 (𝑦, 𝑥, 𝛼) and 𝑓2 (𝑥1 , 𝑥2 , 𝛽) are linear. The framework presented here allows for both the underlying models to be nonlinear as long as they are estimable using M-estimators, which includes maximum likelihood, quasi-maximum likelihood, nonlinear least squares, and many other procedures. For instance, if both 𝑦 and 𝑥1 are binary, we can let both 𝑓1 (𝑦, 𝑥, 𝛼) and 𝑓2 (𝑥 1 , 𝑥2 , 𝛽) be negative of probit log-likelihoods, instead of basing them on linear models. Alternatively, 𝑦 could be a nonnegative count variable and 𝑥 1 could be continuous, in which case we can let 𝑓1 (𝑦, 𝑥, 𝛼) be the negative of Poisson log-likelihood and let 𝑓2 (𝑥 1 , 𝑥2 , 𝛽) come from a linear model. We consider these examples in detail in Section 2.5. Next, we define a reduced form M-estimation problem which is based only on the always- observed variables (𝑦, 𝑥2 ). This reduced form is what allows us to use the incomplete cases, and hence is the key to the efficiency gains of the proposed estimator. Let 𝛾 = 𝑞(𝛼, 𝛽) be a (potentially nonlinear) 𝐿 3 × 1 function of the parameters of interest 𝛼 and the imputation parameters 𝛽, where 𝛾 is contained in a parameter space Γ ⊂ R 𝐿 3 and 𝐿 3 ≤ 𝐿 1 + 𝐿 2 . We assume that we can obtain a “reduced form" objective function 𝑓3 (𝑦, 𝑥2 , 𝛾) in terms of the always-observed variables 𝑦 and 𝑥 2 as well as 𝛾 such that 𝛾0 = 𝑞(𝛼0 , 𝛽0 ) uniquely minimizes this function. Assumption 2.2.3. 𝛾0 is the unique solution to the population minimization problem min E[ 𝑓3 (𝑦, 𝑥2 , 𝛾)]. (2.2.3) 𝛾∈Γ 34 The reduced form model underlying 𝑓3 (𝑦, 𝑥2 , 𝛾) is derived by “integrating out" 𝑥1 from the model of interest using the imputation model. When the model of interest is a linear projection or a model of conditional mean linear in the parameters, the reduced form can be derived using iterated projections or iterated expectations properties without having to do explicit integration. This is the case considered in Abrevaya & Donald (2017). In commonly used models nonlinear in the parameters like probit and Poisson regression, “substituting" for 𝑥 1 using the imputation model eliminates the need for explicit integration. We consider these examples in Section 2.5. The dimension of 𝛾 warrants some discussion. It is possible that 𝐿 3 < 𝐿 1 + 𝐿 2 , that is, the reduced form only identifies certain functions of 𝛼0 and 𝛽0 , and not each element of 𝛼0 and 𝛽0 separately. Some examples are the case of linear projections considered in Abrevaya & Donald (2017) and the case of probit with continuous 𝑥 1 considered in Section 5.1.1. It is however, also possible that 𝐿 3 = 𝐿 1 + 𝐿 2 , in which case 𝛾0 = (𝛼00 , 𝛽00 ) 0, for instance in the case of probit with binary 𝑥 1 considered in Section 2.5.1.2.4 Assumptions (2.2.1)-(2.2.3) imply that (𝛼0 , 𝛽0 ) is the unique solution to the following equations, provided that we can interchange the expectation and the derivative.  ∗    𝑔 (𝑦, 𝑥1 , 𝑥2 , 𝛼)  0  1    E[𝑔 ∗ (𝑦, 𝑥1 , 𝑥2 , 𝛼, 𝛽)] = E  𝑔 ∗ (𝑥 1 , 𝑥2 , 𝛽)  = 0 ,     (2.2.4)  2     ∗  𝑔3 (𝑦, 𝑥2 , 𝛼, 𝛽)  0        where 𝑔1∗ (𝑦, 𝑥1 , 𝑥2 , 𝛼) ≡ ∇𝛼 𝑓1 (𝑦, 𝑥1 , 𝑥2 , 𝛼) 0 is the 𝐿 1 × 1 score of 𝑓1 (𝑦, 𝑥1 , 𝑥2 , 𝛼) , 𝑔2∗ (𝑥 1 , 𝑥2 , 𝛽) ≡ ∇ 𝛽 𝑓2 (𝑥 1 , 𝑥2 , 𝛽) 0 is the 𝐿 2 × 1 score of 𝑓2 (𝑥 1 , 𝑥2 , 𝛽), and 𝑔3∗ (𝑦, 𝑥2 , 𝛼, 𝛽) ≡ ∇𝛾 𝑓3 (𝑦, 𝑥2 , 𝛾) 0 is the 𝐿 3 × 1 score of 𝑓3 (𝑦, 𝑥2 , 𝛾). 
Equation (2.2.4) gives us a set of moment conditions, a transformation of which will be the basis of the proposed estimator, as discussed in Section 2.4.

⁴ As a note on notation, I express functions as explicitly depending on $\gamma$ only when it is necessary to take into account the nature of $\gamma$. For instance, when $L_3 < L_1 + L_2$, the score of $f_3(y, x_2, \gamma)$ should contain only partial derivatives with respect to $\gamma$, and not with respect to individual elements of $\alpha$ and $\beta$, to prevent redundancy in the resulting moment conditions. But for the most part, when looking at the derivatives of $f_1(\cdot)$, $f_2(\cdot)$, and $f_3(\cdot)$, we only need to acknowledge the fact that they are functions of $(\alpha, \beta)$.

2.3 Non random sampling and inverse probability weighting

I characterize nonrandom sampling through a selection indicator. For any random draw $(y_i, x_{1i}, x_{2i})$ from the population, we also draw $s_i$, a binary indicator equal to unity if $x_{1i}$ is observed, and zero otherwise. We assume that $y_i$ and $x_{2i}$ are always observed. A generic element from the population is now denoted $(y, x_1, x_2, s)$. Then the following assumption characterizes the nature of selection.

Assumption 2.3.1 (i) $x_1$ is observed whenever $s = 1$; $(y, x_2)$ is always observed. (ii) There is a random vector $z$ such that $P(s = 1|y, x, z) = P(s = 1|z) \equiv p(z)$. (iii) For all $z \in \mathcal{Z} \subset \mathbb{R}^M$, $p(z) > 0$. (iv) $z$ is always observed.

Part (i) simply defines data observability. Parts (ii) and (iii) are the key assumptions. They state that selection is based on observable variables. This is the same as the "missing at random" assumption used in the statistics literature (Rubin, 1976). Part (ii) states that $s$ is independent of $(y, x)$ conditional on $z$. Because the only variable assumed to contain missing values is $x_1$, we can, at a minimum, allow $z$ to contain $(y, x_2)$. Apart from this, $z$ can also contain some "outside" variables that are good predictors of selection and are always observed. Assumption 2.3.1 is therefore more general than allowing $s$ to depend only on the covariates $x_2$, which is the case considered in Abrevaya & Donald (2017) in the context of linear models. Moreover, the framework presented here can also be used when $y$ contains missing values: we simply redefine $s$ to equal 1 when both $y$ and $x_1$ are observed, and rule out $z$ containing $y$ (in addition to ruling out $z$ containing $x_1$). The proposed estimator discussed in the next section will then impute using the observations for which only $x_1$ is missing, and discard the $y$-missing observations.

For selection as described in Assumption 2.3.1, note that the first and second moment functions in (2.2.4) can only use the $s = 1$ observations since they depend on $x_1$, while the third moment function is able to use the $s = 0$ observations as well. We will weight the moment functions by the inverse of the appropriate probabilities in order to account for this selection. To this end, we specify a model for the selection probability. We assume that a conditional density determining selection is correctly specified, and that the standard regularity conditions required for maximum likelihood estimation (MLE) of the selection model are satisfied. Let $D(\cdot|\cdot)$ denote a conditional distribution.

Assumption 2.3.2 (i) $G(z, \delta)$ is a parametric model for $p(z)$, where $\delta \in \Delta \subset \mathbb{R}^P$ and $G(z, \delta) > 0$ for all $z \in \mathcal{Z} \subset \mathbb{R}^M$, $\delta \in \Delta$. (ii) There exists $\delta_0 \in \Delta$ such that $p(z) = G(z, \delta_0)$. (iii) The estimator $\hat{\delta}$ solves the binary response problem
\[
\max_{\delta \in \Delta} \sum_{i=1}^{N} \{s_i \log[G(z_i, \delta)] + (1 - s_i)\log[1 - G(z_i, \delta)]\}. \qquad (2.3.1)
\]
Given $\hat{\delta}$, we can form $G(z_i, \hat{\delta})$ for all $i$.
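The first step is therefore just a binary-response MLE of the selection indicator on $z$. Below is a minimal sketch of this step, assuming (purely for illustration) a logit specification for $G(z, \delta)$; any strictly positive parametric model estimable by the problem in (2.3.1) would be handled in the same way. The function names are mine.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(delta, s, z):
    """Negative of the binary-response log-likelihood in (2.3.1) under a logit G(z, delta)."""
    g = 1.0 / (1.0 + np.exp(-(z @ delta)))
    g = np.clip(g, 1e-10, 1 - 1e-10)
    return -np.sum(s * np.log(g) + (1 - s) * np.log(1 - g))

def fit_selection_model(s, z):
    """Return delta_hat and the fitted probabilities G(z_i, delta_hat)."""
    delta_hat = minimize(neg_loglik, np.zeros(z.shape[1]), args=(s, z)).x
    g_hat = 1.0 / (1.0 + np.exp(-(z @ delta_hat)))
    return delta_hat, g_hat

# Here z collects (y, x2) and any outside predictors of selection; the fitted
# probabilities g_hat are the weights used in the moment functions of Section 2.4.
```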
This leads us to the problem of estimation. 2.4 Moment conditions and GMM The proposed estimator is a GMM estimator based on the following transformation of the moment functions in (2.2.4). 𝑔1𝑖 (𝛼, 𝛽; 𝛿)   [𝑠𝑖 /𝐺 (𝑧𝑖 , 𝛿)]𝑔 ∗ (𝑦𝑖 , 𝑥1𝑖 , 𝑥2𝑖 , 𝛼)         1      𝑔𝑖 (𝛼, 𝛽; 𝛿) = 𝑔2𝑖 (𝛼, 𝛽; 𝛿)  ≡  [𝑠𝑖 /𝐺 (𝑧𝑖 , 𝛿)]𝑔 ∗ (𝑥 1𝑖 , 𝑥2𝑖 , 𝛽)  . (2.4.1)    2  𝑔3𝑖 (𝛼, 𝛽; 𝛿)      ∗ 𝑔3 (𝑦𝑖 , 𝑥2𝑖 , 𝛼, 𝛽)       Because both 𝑔1∗ (𝑦𝑖 , 𝑥1𝑖 , 𝑥2𝑖 , 𝛼) and 𝑔2∗ (𝑥 1𝑖 , 𝑥2𝑖 , 𝛽) are functions of 𝑥 1𝑖 , they can only use the complete cases - the observations for which 𝑠𝑖 = 1. We thus multiply these by 𝑠𝑖 and weight by the inverse of selection probability in the usual inverse probability weighting (IPW) fashion (Wooldridge, 2002, 2007). Since 𝑔3∗ (𝑦𝑖 , 𝑥2𝑖 , 𝛼, 𝛽) is a function only of the always-observed variables 𝑦𝑖 and 𝑥2𝑖 , it can use all the observations including the incomplete cases and hence we do not need to weight it. For a generic element from the population (𝑦, 𝑥1 , 𝑥2 , 𝑧, 𝑠), denote this vector of moment func- tions by 𝑔(𝛼, 𝛽; 𝛿) and its individual elements by 𝑔 𝑗 (𝛼, 𝛽; 𝛿), 𝑗 = 1, 2, 3. This is a set of overi- dentified moment functions. 𝑔1 (.) exactly identifies the parameters of interest 𝛼0 and 𝑔2 (.) exactly identifies the imputation parameters 𝛽0 . The overidentification (and hence the efficiency gains) in the system come from 𝑔3 (.). The number of overidentifying restrictions is 𝐿 3 , the dimension of the reduced form parameters 𝛾0 . Given the first step estimate 𝛿, ˆ we can write the sample analogue 37 of moment conditions based on (2.4.1) as Õ𝑁 ˆ = 𝑁 −1 𝑔¯ 𝑗 (𝛼, 𝛽; 𝛿) ˆ 𝑔 𝑗𝑖 (𝛼, 𝛽; 𝛿), 𝑗 = 1, 2, 3, (2.4.2) 𝑖=1 and 𝑔(𝛼, ¯ ˆ = [ 𝑔¯ 1 (𝛼, 𝛽; 𝛿) 𝛽; 𝛿) ˆ 0, 𝑔¯ 2 (𝛼, 𝛽; 𝛿) ˆ 0, 𝑔¯ 3 (𝛼, 𝛽; 𝛿) ˆ 0] 0. A GMM estimator based on (2.4.1) minimizes the following objective function with respect to (𝛼, 𝛽). ˆ 𝑄(𝛼, ˆ = 𝑔(𝛼, 𝛽; 𝛿) ¯ ˆ 0 𝑊ˆ 𝑔(𝛼, 𝛽; 𝛿) ¯ ˆ 𝛽; 𝛿), (2.4.3) 𝑝 where 𝑊ˆ is an estimated weight matrix such that 𝑊ˆ − → 𝑊. We first discuss identification of (𝛼0 , 𝛽0 ). The limit function for 𝑄(𝛼, ˆ ˆ is 𝑄(𝛼, 𝛽; 𝛿0 ) = 𝛽; 𝛿) E[𝑔(𝛼, 𝛽; 𝛿0 )] 0 𝑊 E[𝑔(𝛼, 𝛽; 𝛿0 )]. Lemma 2.4.1. (Identification) Assume that 𝑊 is a symmetric positive definite matrix. Then under Assumptions 2.2.1-2.2.3, 2.3.1, and 2.3.2, 𝑄(𝛼, 𝛽; 𝛿0 ) has a unique minimum at (𝛼0 , 𝛽0 ). For a nonsingular 𝑊, the GMM identification condition reduces to E[𝑔(𝛼, 𝛽; 𝛿0 )] ≠ 0 if (𝛼, 𝛽) ≠ (𝛼0 , 𝛽0 ). Sufficient is to show that a corresponding condition holds for each element of 𝑔(𝛼, 𝛽; 𝛿0 ). For instance, E[𝑔1 (𝛼, 𝛽; 𝛿0 )] ≠ 0 if 𝛼 ≠ 𝛼0 follows from identification of 𝛼0 in the population (Assumption 2.2.1) and the assumptions on selection (Assumptions 2.3.1 and 2.3.2). A formal proof of Lemma 2.4.1, along with all other proofs in the rest of the paper are given in the appendix. The GMM estimator based on the general weight matrix 𝑊ˆ is defined as the following. Definition 2.4.1: Call the estimator of (𝛼, 𝛽) that minimizes (2.4.3), ( 𝛼, ˆ ˆ 𝛽). Consistency of ( 𝛼, ˆ 𝛽)ˆ follows from Lemma 2.4.1 and standard regularity conditions given in the following theorem. Theorem 2.4.1 (Consistency) Assume that 1. {(𝑦𝑖 , 𝑥𝑖 , 𝑧𝑖 , 𝑠𝑖 ) : 𝑖 = 1, . . . , 𝑁 } are random draws from the population satisfying Assumptions 2.3.1 and 2.3.2. 2. The assumptions in Lemma 2.4.1 hold. 38 3. A, B, Γ, Δ, A × Δ, B × Δ, and A × B × Γ are compact subsets of R 𝐿 1 , R 𝐿 2 , R 𝐿 3 , R𝑃 , R 𝐿 1 +𝑃 , R 𝐿 2 +𝑃 , and R 𝐿 1 +𝐿 2 +𝐿 3 respectively. 4. 
𝑓1 (𝑦, 𝑥, 𝛼), 𝑓2 (𝑥, 𝛽) and 𝑓3 (𝑦, 𝑥2 , 𝛾) are twice differentiably continuous on A, B, Γ respec- tively for each (𝑦, 𝑥), 𝑥 and (𝑦, 𝑥2 ) in Y × X, X and Y × X2 respectively. 5. 𝐺 (𝑧, 𝛿) is continuous in Δ for each 𝑧 ∈ Z, twice continuously differentiable on 𝑖𝑛𝑡 (Δ), and 𝛿0 ∈ 𝑖𝑛𝑡 (Δ). For some 𝑎 > 0, 𝐺 (𝑧, 𝛿) ≥ 𝑎 for all 𝑧 ∈ Z, 𝛿 ∈ Δ. 6. For all (𝛼, 𝛽, 𝛾) ∈ A × B × Γ, |𝑔 ∗ (𝑦, 𝑥, 𝛼, 𝛽, 𝛾)| ≤ 𝑏(𝑦, 𝑥), where 𝑏(𝑦, 𝑥) ≡ [𝑏 1 (𝑦, 𝑥) 0, 𝑏 2 (𝑥) 0, 𝑏 3 (𝑦, 𝑥2 ) 0] 0 and 𝑏(.) is a function such that E[𝑏(𝑦, 𝑥)] < ∞. 𝑝 Then ( 𝛼, ˆ 𝛽)ˆ −→ (𝛼0 , 𝛽0 ) as 𝑁 → − ∞. The consistency of ( 𝛼, ˆ 𝛽)ˆ follows from standard arguments involving consistency of two-step M- estimators. First, analogous to the discussion in Wooldridge (2002), Lemma 2.4 of Newey & McFadden (1994) applies to show that 𝑔1 (𝛼, 𝛽; 𝛿), 𝑔2 (𝛼, 𝛽; 𝛿) and 𝑔3 (𝛼, 𝛽; 𝛿) satisfy the uniform weak law of large numbers over A × Δ, B × Δ, Γ respectively under Assumptions 1, 3, 4, 5 and 6 of Theorem 2.4.1. Then the averages in (2.4.2) can be shown to converge to E[𝑔 𝑗 (𝛼, 𝛽; 𝛿0 )], 𝑗 = 1, 2, 3, (2.4.4) uniformly over A, B, and Γ respectively. Along with the identification from Lemma 2.4.1, this can be shown to imply consistency of ( 𝛼, ˆ 𝛽)ˆ for (𝛼0 , 𝛽0 ). Now, assuming that E[𝑔(𝛼, 𝛽; 𝛿0 )] is differentiable at (𝛼0 , 𝛽0 ), its derivative is defined as the following.  0  𝐷  11 0   ∗   𝐷 0 ≡ E[∇ (𝛼0,𝛽0)0 𝑔(𝛼, 𝛽; 𝛿0 )| (𝛼,𝛽)=(𝛼 ,𝛽 ) ] = E[∇ (𝛼0,𝛽0)0 𝑔 (𝛼, 𝛽)| (𝛼,𝛽)=(𝛼 ,𝛽 ) ] =  0 𝐷  ,  0 0 0 0 0 22    0  𝐷 31 𝐷 032     (2.4.5) 39 where 𝐷 0𝑗1 = 𝜕𝑔 ∗𝑗 (𝛼, 𝛽)/𝜕𝛼| (𝛼,𝛽)=(𝛼 ,𝛽 ) and 𝐷 0𝑗2 = 𝜕𝑔 ∗𝑗 (𝛼, 𝛽)/𝜕 𝛽| (𝛼,𝛽)=(𝛼 ,𝛽 ) , 𝑗 = 1, 2, 3 and 0 0 0 0 the first equality follows by the standard IPW argument given Assumptions 2.3.1 and 2.3.2. Then the following result gives the asymptotic distribution of ( 𝛼, ˆ 𝛽). ˆ Theorem 2.4.2 (Asymptotic normality) Assume that 1. The assumptions in Theorem 2.4.1 hold. 2. (𝛼0 , 𝛽0 ) ∈ 𝑖𝑛𝑡 (A × B). 3. 𝑔(𝛼, 𝛽; 𝛿) is twice continuously differentiable on 𝑖𝑛𝑡 (A × B × Δ). 4. 𝐷 0 is of full rank 𝐿 1 + 𝐿 2 . 5. E[sup (𝛼,𝛽;𝛿)∈A×B×Δ |∇ (𝛼,𝛽,𝛿) 𝑔(𝛼, 𝛽, 𝛿)|] < ∞. Then, √ 𝑑 𝑁 [( 𝛼ˆ 0, 𝛽ˆ0) 0 − (𝛼00 , 𝛽00 ) 0] −−−−→ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, (𝐷 00𝑊 𝐷 0 ) −1 𝐷 00𝑊 𝐹0𝑊 𝐷 0 (𝐷 00𝑊 𝐷 0 ) −1 ], (2.4.6) where 𝐹0 = E(𝑔𝑖 𝑔𝑖0) − {E(𝑔𝑖 𝑑𝑖0) [E(𝑑𝑖 𝑑𝑖0)] −1 E(𝑑𝑖 𝑔𝑖0)} ◦ 𝑅, 𝑔𝑖 ≡ 𝑔𝑖 (𝛼0 , 𝛽0 ; 𝛿0 ), 𝑑𝑖 ≡ 𝑠𝑖 (∇𝛿 𝐺 0𝑖 /𝐺 𝑖 ) − (1 − 𝑠𝑖 ) [∇𝛿 𝐺 0𝑖 /(1 − 𝐺 𝑖 )] is the 𝑃 × 1 score of the binary response log-likelihood, 𝑅 is a square matrix of order 𝐿 1 + 𝐿 2 + 𝐿 3 with all elements being unity except the lower right 𝐿 3 × 𝐿 3 block which is a 0 matrix,5 𝐺 𝑖 ≡ 𝐺 (𝑧𝑖 , 𝛿0 ), 𝐻0 ≡ E[∇𝛿 𝑔(𝛼0 , 𝛽0 ; 𝛿0 )] and 𝜓(𝑠𝑖 , 𝑧𝑖 ) = −[E(𝑑𝑖 𝑑𝑖0)] −1 𝑑𝑖 . Standard GMM theory dictates that the optimal weight matrix to be used in (2.4.3) is 𝑊ˆ = 𝐹ˆ −1 , where 𝐹ˆ is a consistent estimate of 𝐹0 which can be obtained as  Õ𝑁   Õ𝑁  Õ𝑁  −1  Õ𝑁  𝐹ˆ = 𝑁 −1 𝑔ˆ𝑖 𝑔ˆ𝑖0 − 𝑁 −1 𝑔ˆ𝑖 𝑑ˆ𝑖0 𝑁 −1 𝑑ˆ𝑖 𝑑ˆ𝑖0 𝑁 −1 𝑑ˆ𝑖 𝑔ˆ𝑖0 ◦ 𝑅, (2.4.7) 𝑖=1 𝑖=1 𝑖=1 𝑖=1 where ˆ0 ˆ0     ∇ 𝐺 (𝑧 , 𝛿) ∇ 𝐺 (𝑧 , 𝛿) ˆ 𝛿), ˆ 𝑑ˆ𝑖 ≡ 𝑠𝑖 𝛿 𝑖 𝛿 𝑖 𝑔ˆ𝑖 ≡ 𝑔𝑖 ( 𝛼, ˆ 𝛽; − (1 − 𝑠𝑖 ) . (2.4.8) 𝐺 (𝑧𝑖 , 𝛿)ˆ 1 − 𝐺 (𝑧𝑖 , 𝛿) ˆ Then, the proposed estimator is the optimal GMM estimator based on (2.4.1), as defined below. 5◦ denotes a Hadamard product. 40 Definition 2.4.2: Call the estimator of (𝛼, 𝛽) that minimizes (2.4.3) with 𝑊ˆ = 𝐹ˆ −1 , the weighted joint GMM estimator or ( 𝛼ˆ 𝑊 𝐽 , 𝛽ˆ𝑊 𝐽 ). Because ( 𝛼ˆ 𝑊 𝐽 , 𝛽ˆ𝑊 𝐽 ) uses the optimal weight matrix, the asymptotic variance in (2.4.6) reduces to (𝐷 00 𝐹0−1 𝐷 0 ) −1 . 
A consistent estimator can be obtained using 𝐹ˆ and a consistent estimator of 𝐷 0 defined as Õ𝑁 𝐷ˆ = 𝑁 −1 [∇ (𝛼0,𝛽0)0 𝑔𝑖 ( 𝛼, ˆ 𝛿)]. ˆ 𝛽; ˆ (2.4.9) 𝑖=1 Then the following result follows from Theorem 2.4.2. Theorem 2.4.3 (Asymptotic Normality of the optimal GMM) Let all assumptions of Theorem 2.4.2 hold. Then, √ 0 , 𝛽ˆ0 ) 0 − (𝛼0 , 𝛽0 ) 0] − 𝑑 𝑁 [( 𝛼ˆ 𝑊 𝐽 𝑊𝐽 0 0 −−−→ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, (𝐷 00 𝐹0−1 𝐷 0 ) −1 ], (2.4.10) √ 0 , 𝛽ˆ0 ) 0 − (𝛼0 , 𝛽0 ) 0]} is given by and a consistent estimator of 𝐴𝑣𝑎𝑟 { 𝑁 [( 𝛼ˆ 𝑊 𝐽 𝑊𝐽 0 0 ( 𝐷ˆ 0 𝐹ˆ −1 𝐷) ˆ −1 , (2.4.11) where 𝐹ˆ is given in (2.4.7) and 𝐷ˆ is given in (2.4.9). Further, we can use the standard test of overidentifying restrictions based on the objective function evaluated at the parameter estimates proposed by Hansen (1982). The original result was obtained for a standard GMM. It is straightforward to extend the proof to the case where the moment functions depend on an estimate of 𝛿 from a first step. Proposition 2.4.1: Let all assumptions of Theorem 2.4.2 hold. Then under the null hypothesis that E[𝑔(𝛼0 , 𝛽0 ; 𝛿0 )] = 0, 𝑝 𝑁 𝑔( ˆ 0 𝐹ˆ −1 𝑔( ¯ 𝛼ˆ 𝑊 𝐽 , 𝛽ˆ𝑊 𝐽 ; 𝛿) ˆ −−−−→ 𝜒2 . ¯ 𝛼ˆ 𝑊 𝐽 , 𝛽ˆ𝑊 𝐽 ; 𝛿) (2.4.12) 𝐿3 2.5 Examples The proposed estimator can be applied to many cases relevant for empirical research. I provide two important examples: a binary or fractional 𝑦 and a nonnegative 𝑦, both of which are estimated using quasi-MLE. 41 2.5.1 Models for binary and fractional responses Binary response models are one of the most commonly used nonlinear models in empirical research. Suppose that 𝑦 is a variable taking values in the unit interval, [0, 1]. This includes the case where 𝑦 is binary but also allows 𝑦 to be a continuous proportion. Further, 𝑦 can have both discrete and continuous characteristics (for instance, 𝑦 can be a proportion that takes on zero or one with positive probability). We start by assuming that the mean of 𝑦 conditional on 𝑥 has a probit form. E(𝑦|𝑥 1 , 𝑥2 ) = Φ(𝛼10 𝑥1 + 𝑥 2 𝛼20 ) ≡ Φ(𝑥𝛼0 ), (2.5.1) where 𝑥 1 is a scalar and 𝑥 2 is a 1 × 𝑘 vector. If 𝑥 1 was always observed, we would simply estimate 𝛼0 using quasi-MLE with a Bernoulli log likelihood, which identifies the parameters in a correctly specified conditional mean by the virtue of being in the linear exponential family (Gourieroux et al., 1984). But because 𝑥1 is sometimes missing, now we additionally specify a model to impute 𝑥 1 using 𝑥2 and use it to obtain the reduced form conditional mean of 𝑦 given 𝑥 2 . I consider two cases: where 𝑥 1 is continuous, and where it is binary. 2.5.1.1 Continuous covariate with missing values We assume that the imputation model is linear. 𝑥1 = 𝑥 2 𝜃 0 + 𝑟, (2.5.2) 𝑟 |𝑥 2 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, 𝜎02 𝑒𝑥 𝑝(2𝑥 21𝜆 0 )], (2.5.3) where 𝑥 21 ⊂ 𝑥 2 . That is, 𝑥 1 is assumed to be normally distributed conditional on 𝑥 2 . To make the model more flexible, we allow the error to be heteroskedastic with variance dependent on 𝑥 21 . Typically, 𝑥 21 will include all elements of 𝑥 2 except the constant, so that the case where 𝑟 is homoskedastic with variance 𝜎02 is obtained as a special case by setting 𝜆 0 = 0. The conditional pdf of 𝑥1 is given by (𝑥 1 − 𝑥 2 𝜃 0 ) 2   1 𝑓 (𝑥 1 |𝑥2 , 𝛽0 ) = q 𝑒𝑥 𝑝 − . (2.5.4) 2 2𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆 ) 2𝜋𝜎0 𝑒𝑥 𝑝(2𝑥 21𝜆 0 ) 0 21 0 42 In order to find E(𝑦|𝑥2 ), we integrate out 𝑥 1 from E(𝑦|𝑥 1 , 𝑥2 ) given in (2.5.1) using the density given in (2.5.4). ∫ ∞   𝑥1 − 𝑥2 𝜃 0 E(𝑦|𝑥 2 ) = Φ(𝑥𝛼0 )𝜎0−1 𝑒𝑥 𝑝(−𝑥21𝜆 0 )𝜙 𝑑𝑥1 −∞ 𝜎0 𝑒𝑥 𝑝(𝑥 21𝜆 0 )   𝑥2 (𝛼10 𝜃 0 + 𝛼20 ) =Φ q . 
(2.5.5) 2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆 ) 1 + 𝛼10 0 21 0 We can derive E(𝑦|𝑥2 ) without carrying out the explicit integration as well. Define a binary variable as following. 𝑤 ∗ = 𝛼10 𝑥 1 + 𝑥 2 𝛼20 + 𝑢 ≡ 𝑥𝛼0 + 𝑢, (2.5.6) 𝑢|𝑥 1 , 𝑥2 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 1), (2.5.7) 𝑤 = 1[𝑤 ∗ > 0]. (2.5.8) Next, note that E(𝑤|𝑥1 , 𝑥2 ) = E(𝑦|𝑥 1 , 𝑥2 ) = Φ(𝛼10 𝑥 1 + 𝑥 2 𝛼20 ), (2.5.9) and so, by iterated expectations, E(𝑤|𝑥2 ) = E(𝑦|𝑥 2 ). (2.5.10) (2.5.10) is what allows us to obtain E(𝑦|𝑥 2 ). Substituting (2.5.2) into (2.5.6) gives 𝑤 ∗ = 𝑥 2 (𝛼10 𝜃 0 + 𝛼20 ) + 𝑣, (2.5.11) where 𝑣 ≡ 𝑢 + 𝛼10𝑟 and 𝑣|𝑥 2 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, 1 + 𝛼10 2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆 )] under the assumptions made 0 21 0 so far. Therefore, 𝑤 = 1[𝑥2 (𝛼10 𝜃 0 + 𝛼20 ) + 𝑣 > 0], (2.5.12) which implies E(𝑤|𝑥 2 ) = 𝑃(𝑤 = 1|𝑥2 ) = 𝑃[𝑣 > −𝑥 2 (𝛼10 𝜃 0 + 𝛼20 )|𝑥2 ], (2.5.13) which gives the same expression as (2.5.5). Now we can use quasi-MLE with a Bernoulli log likelihood for both the model of interest (2.5.1) and the reduced form (2.5.5), and full MLE for the 43 imputation model using (2.5.4). The objective functions in (2.2.1)-(2.2.3) are given by 𝑓1 (𝑦, 𝑥, 𝛼) = −𝑙𝑜𝑔{Φ(𝑥𝛼) 𝑦 [1 − Φ(𝑥𝛼)] (1−𝑦) } (𝑥 1 − 𝑥 2 𝜃) 2    1 𝑓2 (𝑥1 , 𝑥2 , 𝛽) = −𝑙𝑜𝑔 p 𝑒𝑥 𝑝 − 2𝜋𝜎 2 𝑒𝑥 𝑝(2𝑥21𝜆) 2𝜎 2 𝑒𝑥 𝑝(2𝑥 21𝜆) 𝑓3 (𝑦, 𝑥2 , 𝛾) = −𝑙𝑜𝑔{Φ[ℎ1 (𝑥 2 , 𝛾)] 𝑦 {1 − Φ[ℎ1 (𝑥 2 , 𝛾)]} (1−𝑦) }, (2.5.14) q where ℎ1 (𝑥 2 , 𝛾) ≡ [𝑥 2 (𝛼1 𝜃 + 𝛼2 )]/ 1 + 𝛼12 𝜎 2 𝑒𝑥 𝑝(2𝑥 21𝜆) and in the general notation of Section 2.2, 𝛽 = (𝜃, 𝜎 2 , 𝜆) and 𝛾 = [(𝛼1 𝜃 + 𝛼2 ), 𝛼12 𝜎 2 , 𝜆]. The issue of defining 𝛾 warrants some discussion. It can be shown that first, the partial derivatives of 𝑓3 (𝑦, 𝑥2 , 𝛾) with respect to (𝛼1 , 𝜃) are linear combinations of those with respect to (𝛼2 , 𝜎 2 , 𝜆). Since we use the the weighted versions of these partial derivatives as moment functions, we should use only those taken with respect to (𝛼2 , 𝜎 2 , 𝜆) to prevent redundancy in the resulting moment conditions. Second, the partial derivatives with respect to (𝛼2 , 𝜎 2 , 𝜆) are just scaled versions of those with respect to 𝛾 as defined above, which makes this definition of 𝛾 preferable both intuitively and for algebraic simplicity. 44 The objective functions in (2.5.14) result in the following score functions. [𝑦 − Φ(𝑥𝛼)]𝜙(𝑥𝛼) 𝑔1∗ (𝑦, 𝑥, 𝛼) = 𝑥 0 Φ(𝑥𝛼) [1 − Φ(𝑥𝛼)]   𝑥 20 (𝑥 1 − 𝑥 2 𝜃)     2 𝜎 𝑒𝑥 𝑝(2𝑥21𝜆)      (𝑥 1 − 𝑥 2 𝜃) 2   ∗ 1  𝑔2 (𝑥 1 , 𝑥2 , 𝛽) =  −   𝑒𝑥 𝑝(2𝑥21𝜆)𝜎 4 𝜎 2        0 (𝑥1 − 𝑥2 𝜃) 2    21 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆) − 1  𝑥   21    q 𝑥 20      1 + 𝛼 2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆)   1 21  (𝜃𝛼 + )     ∗  𝑒𝑥 𝑝(2𝑥 21 𝜆)𝑥 2 1 𝛼 2  𝑦 − Φ[ℎ 1 (𝑥 2 , 𝛾)] 𝑔3 (𝑦, 𝑥2 , 𝛾) =  2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆)] 3/2  𝜙[ℎ1 (𝑥 2 , 𝛾)] Φ[ℎ (𝑥 , 𝛾)]{1 − Φ[ℎ (𝑥 , 𝛾)]} .   [1 + 𝛼 1 21  1 2 1 2  𝑒𝑥 𝑝(2𝑥 21𝜆)𝑥2 (𝜃𝛼1 + 𝛼2 )𝑥 0     21   [1 + 𝛼2 𝜎 2 𝑒𝑥 𝑝(2𝑥 21𝜆)] 3/2     1  (2.5.15) In the case where 𝜆 0 = 0 and hence 𝑟 is homoskedastic, the third elements of 𝑔2∗ (.) and 𝑔3∗ (.), which are the partial derivatives with respect to 𝜆 of 𝑓2 (.) and 𝑓3 (.) respectively go away. Moreover, the second element of 𝑔3∗ (.) in that case is just a linear function of the first element of 𝑔3∗ (.) and hence should be removed to prevent redundancy. Given these score functions and 𝛿ˆ obtained in Section 2.3, it is straightforward to form the moment functions in (2.4.1) and estimate (𝛼0 , 𝛽0 ) by minimizing (2.4.3). 2.5.1.2 Binary covariate with missing values We now consider the case where 𝑥 1 is binary. 
Equations (2.5.2) and (2.5.3) are replaced by 𝑥 1∗ = 𝑥 2 𝜃 0 + 𝑟, (2.5.16) 𝑟 |𝑥 2 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, 𝑒𝑥 𝑝(2𝑥21𝜆 0 )], (2.5.17) 𝑥 1 = 1[𝑥 1∗ > 0], (2.5.18) 45 where 𝑥 21 ⊂ 𝑥 2 . Just as in Section 2.5.1.1, 𝑥 21 typically includes all elements of 𝑥2 except the constant, so that we can get a standard probit with unit variance as a special case by setting 𝜆 0 = 0. Now, (2.5.16)-(2.5.18) imply that 𝑃(𝑥 1 = 1|𝑥2 ) = Φ[𝑒𝑥 𝑝(−𝑥 21𝜆 0 )𝑥 2 𝜃 0 ] ≡ Φ[ℎ2 (𝑥2 , 𝛽0 )], (2.5.19) where in the general notation of Section 2.2, 𝛽 = (𝜃, 𝜆). Using (2.5.1) and iterated expectations, E(𝑦|𝑥 2 ) = E[E(𝑦|𝑥 1 , 𝑥2 )|𝑥2 ] = E(𝑦|𝑥 1 = 1, 𝑥2 )𝑃(𝑥1 = 1|𝑥 2 ) + E(𝑦|𝑥1 = 0, 𝑥2 )𝑃(𝑥1 = 0|𝑥 2 ) = Φ(𝛼10 + 𝑥2 𝛼20 )Φ[𝑒𝑥 𝑝(−𝑥 21𝜆 0 )𝑥 2 𝜃 0 ] + Φ(𝑥 2 𝛼20 ){1 − Φ[𝑒𝑥 𝑝(−𝑥 21𝜆 0 )𝑥 2 𝜃 0 ]} ≡ ℎ3 (𝑥 2 , 𝛾0 ), (2.5.20) where in the general notation of Section 2.2, 𝛾 = (𝛼, 𝛽). Analogous to the previous section, we use quasi-MLE with a Bernoulli log likelihood for the model of interest (2.5.1) and the reduced form (2.5.20), and full MLE for the imputation model using (2.5.19). The objective functions are given by 𝑓1 (𝑦, 𝑥, 𝛼) = −𝑙𝑜𝑔{Φ(𝑥𝛼) 𝑦 [1 − Φ(𝑥𝛼)] (1−𝑦) } 𝑓2 (𝑥 1 , 𝑥2 , 𝛽) = −𝑙𝑜𝑔(Φ[ℎ2 (𝑥2 , 𝛽)] 𝑥1 {1 − Φ[ℎ2 (𝑥 2 , 𝛽)]} (1−𝑥1 ) ) 𝑓3 (𝑦, 𝑥2 , 𝛾) = −𝑙𝑜𝑔{ℎ3 (𝑥 2 , 𝛾) 𝑦 [1 − ℎ3 (𝑥 2 , 𝛾)] (1−𝑦) }. (2.5.21) This results in the following score functions. [𝑦 − Φ(𝑥𝛼)]𝜙(𝑥𝛼) 𝑔1∗ (𝑦, 𝑥, 𝛼) = 𝑥 0 (2.5.22) Φ(𝑥𝛼) [1 − Φ(𝑥𝛼)]   𝑒𝑥 𝑝(−𝑥 𝜆)𝑥 0  {𝑥1 − Φ[ℎ2 (𝑥 2 , 𝛽)]}𝜙[ℎ2 (𝑥 2 , 𝛽)] 21 2 𝑔2∗ (𝑥 1 , 𝑥2 , 𝛽) =   𝜙[ℎ2 (𝑥 2 , 𝛽)] (2.5.23)    ℎ (𝑥 , 𝛽)𝑥 0  Φ[ℎ2 (𝑥2 , 𝛽)]{1 − Φ[ℎ2 (𝑥2 , 𝛽)]}  2 2 21        𝜙(𝛼 1 + 𝑥 2 𝛼 2 )Φ[ℎ 2 (𝑥 2 , 𝛽)]      0 𝑥 𝜙(𝛼 + 𝑥 𝛼 )Φ[ℎ (𝑥 , 𝛽)] + 𝜙(𝑥 𝛼 ){1 − Φ[ℎ (𝑥 , 𝛽)]}  1 2 2 2 2 2 2 2 2 𝑔3∗ (𝑦, 𝑥2 , 𝛾) =  2  ℎ (𝑦, 𝑥2 , 𝛾),    𝑥 0 𝑒𝑥 𝑝(−𝑥 𝜆)𝜙[ℎ (𝑥 , 𝛽)] [Φ(𝛼 + 𝑥 𝛼 ) − Φ(𝑥 𝛼 )]  4  2 21 2 2 1 2 2 2 2      0 𝑥 21 ℎ2 (𝑥 2 , 𝛽)𝜙[ℎ2 (𝑥2 , 𝛽)] [Φ(𝑥 2 𝛼2 ) − Φ(𝛼1 + 𝑥 2 𝛼2 )]     (2.5.24) 46 𝑦 − ℎ3 (𝑥2 , 𝛾) where ℎ4 (𝑦, 𝑥2 , 𝛾) ≡ . ℎ3 (𝑥 2 , 𝛾) [1 − ℎ3 (𝑥 2 , 𝛾)] 2.5.1.3 Average partial effects In a probit, usually the average partial effects (APEs) are the quantities of interest rather than the coefficients themselves. It is important to note that the APEs of interest are still derived from the model of interest in (2.5.1), just as in the case where there is no missing data. The partial effect (PE) of the 𝑗 𝑡ℎ element of 𝑥, 𝑥 ( 𝑗) on E(𝑦|𝑥) is given by6 𝜕 E(𝑦|𝑥) 𝑃𝐸 𝑗 (𝑥) = = 𝛼 ( 𝑗)0 𝜙(𝑥𝛼0 ) = 𝛼 ( 𝑗)0 𝜙(𝛼10 𝑥 1 + 𝑥 2 𝛼20 ). (2.5.25) 𝜕𝑥 ( 𝑗) The average partial effect of 𝑥 ( 𝑗) , 𝐴𝑃𝐸 𝑗 , is the expected value of 𝑃𝐸 𝑗 (𝑥) with respect to 𝑥.   𝜕 E(𝑦|𝑥) 𝐴𝑃𝐸 𝑗 (𝑥) = E𝑥 = 𝛼 ( 𝑗)0 E[𝜙(𝑥𝛼0 )]. (2.5.26) 𝜕𝑥 ( 𝑗) In the absence of missing data, this can be consistently estimated using  𝑁 Õ  𝛼˜ ( 𝑗) 𝑁 −1 ˜ , 𝜙(𝑥𝑖 𝛼) (2.5.27) 𝑖=1 where 𝛼˜ is any consistent estimate of 𝛼0 . That is, one simply computes the partial effect for each unit in the sample and then averages over the entire sample. However, when we have missing data on 𝑥1 , this quantity is not estimable as we cannot calculate the partial effect for individuals with missing 𝑥1 . A quantity that is feasible to compute is the average of partial effects over the complete cases only. This is given by  𝑁  𝑐 Õ [ 𝑗 (𝑥) 𝐴𝑃𝐸 = 𝛼ˆ 𝑊 𝐽 ( 𝑗) 𝑁𝑐−1 𝑠𝑖 𝜙(𝑥𝑖 𝛼ˆ 𝑊 𝐽 ) , 𝑖=1 Í𝑁 where 𝑁𝑐 = 𝑠 𝑖=1 𝑖 is the number of complete cases in the sample. That is, we average the individual partial effects over the complete cases only. This estimator however, is not consistent for 𝑐 𝐴𝑃𝐸 𝑗 (𝑥) unless 𝑠 𝑥 . 
If 𝑠 depends on say 𝑥 2 , then 𝐴𝑃𝐸 |= [ 𝑗 (𝑥) will be inconsistent for 𝐴𝑃𝐸 𝑗 (𝑥). 6If 𝑥 ( 𝑗) is discrete, the derivative is replaced with a difference. 47 The current framework, however, makes it possible to recover 𝐴𝑃𝐸 𝑗 (𝑥) using IPW. E{[𝑠/𝑝(𝑧)]𝜙(𝑥𝛼)} = E{E([𝑠/𝑝(𝑧)]𝜙(𝑥𝛼)|𝑦, 𝑥, 𝑧)} = E{[E(𝑠|𝑦, 𝑥, 𝑧)/𝑝(𝑧)]𝜙(𝑥𝛼)} = E[𝜙(𝑥𝛼)], (2.5.28) where the last equality follows from Assumption 2.3.1. Therefore, a consistent estimator of 𝐴𝑃𝐸 𝑗 (𝑥) is 𝑁 [ 𝑗 (𝑥) = 𝛼ˆ 𝑊 𝐽 ( 𝑗) 𝑁 −1 Õ 𝑠𝑖 𝐴𝑃𝐸 𝜙(𝑥𝑖 𝛼ˆ 𝑊 𝐽 ). (2.5.29) 𝐺 (𝑧 𝑖 , ˆ 𝛿) 𝑖=1 2.5.2 Exponential models Next we consider exponential models for nonnegative responses 𝑦, including but not restricted to count variables. We focus on a continuous 𝑥 1 .7 The model of interest is characterized by the conditional mean E(𝑦|𝑥) = 𝑒𝑥 𝑝(𝛼10 𝑥 1 + 𝑥 2 𝛼20 ) ≡ 𝑒𝑥 𝑝(𝑥𝛼0 ), (2.5.30) where in the absence of missing data, 𝛼0 can be estimated using a Poisson quasi log likelihood. We consider the same linear imputation model as in Section 2.5.1.1. 𝑥1 = 𝑥 2 𝜃 0 + 𝑟, (2.5.31) 𝑟 |𝑥 2 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, 𝜎02 𝑒𝑥 𝑝(2𝑥 21𝜆 0 )]. (2.5.32) The reduced form conditional mean can be obtained using (2.5.30)-(2.5.32) and an iterated expec- tations argument. E(𝑦|𝑥 2 ) = E[𝑒𝑥 𝑝(𝛼10 𝑥 1 + 𝑥 2 𝛼20 )|𝑥2 ] = 𝑒𝑥 𝑝(𝑥2 𝛼20 ) E[𝑒𝑥 𝑝(𝛼10 𝑥 1 )|𝑥 2 ] = 𝑒𝑥 𝑝 [𝑥2 (𝜃 0 𝛼10 + 𝛼20 )] E[𝑒𝑥 𝑝(𝑟𝛼10 )|𝑥 2 ], (2.5.33) where the third equality follows from substituting for 𝑥 1 using (2.5.31). Moreover, (2.5.32) implies that 𝑒𝑥 𝑝(𝑟𝛼10 ) conditional on 𝑥 2 follows a lognormal distribution with E[𝑒𝑥 𝑝(𝑟𝛼10 )|𝑥2 ] = 𝑒𝑥 𝑝 [𝛼10 2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆 )/2]. (2.5.34) 0 21 0 7The discussion for a binary 𝑥 1 follows easily given the discussion in Section 2.5.1.2. 48 Plugging into (2.5.33), we get E(𝑦|𝑥 2 ) = 𝑒𝑥 𝑝 [𝑥2 (𝜃 0 𝛼10 + 𝛼20 ) + 𝛼10 2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆 )/2]. (2.5.35) 0 21 0 Thus, we have 𝛽 = (𝜃, 𝜎 2 , 𝜆), 𝛾 = (𝜃𝛼1 +𝛼2 , 𝜎 2 , 𝜆), ℎ5 (𝑥2 , 𝛾) ≡ 𝑥 2 (𝜃𝛼1 +𝛼2 )+𝛼12 𝜎 2 𝑒𝑥 𝑝(2𝑥 21𝜆)/2 and the objective functions are given by 𝑓1 (𝑦, 𝑥, 𝛼) = 𝑒𝑥 𝑝(𝑥𝛼) − 𝑦𝑥𝛼 (𝑥 1 − 𝑥 2 𝜃) 2    1 𝑓2 (𝑥1 , 𝑥2 , 𝛽) = −𝑙𝑜𝑔 p 𝑒𝑥 𝑝 − 2𝜋𝜎 2 𝑒𝑥 𝑝(2𝑥21𝜆) 2𝜎 2 𝑒𝑥 𝑝(2𝑥 21𝜆) 𝑓3 (𝑦, 𝑥2 , 𝛾) = 𝑒𝑥 𝑝[ℎ5 (𝑥 2 , 𝛾)] − 𝑦[ℎ5 (𝑥 2 , 𝛾)]. (2.5.36) This results in the following score functions. 𝑔1∗ (𝑦, 𝑥, 𝛼) = 𝑥 0 [𝑦 − 𝑒𝑥 𝑝(𝑥𝛼)]   𝑥20 (𝑥 1 − 𝑥 2 𝜃)     2 𝜎 𝑒𝑥 𝑝(2𝑥 21𝜆)     2   ∗  (𝑥 1 − 𝑥 2 𝜃) 1  𝑔2 (𝑥 1 , 𝑥2 , 𝛽) =  −  𝑒𝑥 𝑝(2𝑥 21𝜆)𝜎 4 𝜎 2     (𝑥 1 − 𝑥 2 𝜃) 2     0  𝑥  21 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆) − 1    21  𝑥 20       𝑔3∗ (𝑦, 𝑥2 , 𝛾) =  𝑒𝑥 𝑝(2𝑥 21𝜆)  {𝑦 − 𝑒𝑥 𝑝 [ℎ5 (𝑥 2 , 𝛾)]}.   (2.5.37)    𝑒𝑥 𝑝(2𝑥21𝜆)𝑥 21 0    Similar to Section 2.5.1.1, when 𝜆 0 = 0, the third element of 𝑔2∗ (.) and the second and third elements of 𝑔3∗ (.) become redundant. 2.6 Comparison with related estimators 2.6.1 Complete cases The most common practice when dealing with missing covariate values is to just use the complete cases for estimation; that is, use only the observations for which 𝑥 1 is observed. The inverse 49 probability weighted complete cases estimator has been discussed in detail by Wooldridge (2002). In this section, I show that the weighted joint GMM does no worse than the weighted complete cases estimator in terms of asymptotic variance, and can potentially provide strict efficiency gains. Definition 2.6.1.1. Call the estimator of 𝛼0 that minimizes (2.4.3), where 𝑔(.) contains only 𝑔1 (.) and 𝑊ˆ = 𝐼, the weighted complete cases estimator (or 𝛼ˆ 𝑊 𝑐𝑐 ). Define the upper-left 𝑃1 × 𝑃1 block of 𝐹0 as 𝐹110 ≡ E(𝑔 𝑔0 ) − E(𝑔 𝑑 0) [E(𝑑 𝑑 0)] −1 E(𝑑 𝑔0 ), (2.6.1) 1𝑖 1𝑖 1𝑖 𝑖 𝑖 𝑖 𝑖 1𝑖 where 𝑔𝑖 = [𝑔1𝑖 0 , 𝑔0 , 𝑔0 ] 0. 
Then the asymptotic variance of the weighted complete cases estimator 2𝑖 3𝑖 as derived in Wooldridge (2002) is given in the following lemma, where we have used the fact that 𝐷 011 is symmetric. Lemma 2.6.1.1 Under the assumptions of Theorems 4.1 and 4.2, √ 0 ) −1 𝐷 0 ] −1 . 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑊 𝑐𝑐 − 𝛼0 )] = [𝐷 011 (𝐹11 11 Then we know that 𝛼ˆ 𝑊 𝐽 is no less efficient than 𝛼ˆ 𝑊 𝑐𝑐 , since standard GMM theory dictates that a GMM estimator that uses more valid moment conditions is no less efficient. Proposition 2.6.1.1. Under the assumptions of Theorem 2.4.1 and 2.4.2, √ √ 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑊 𝑐𝑐 − 𝛼0 )] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑊 𝐽 − 𝛼0 )] is positive semidefinite. We can further disaggregate the efficiency gains by 𝛼10 and 𝛼20 . In linear models, the “plug-in" imputation estimators, as discussed in the next section, are generally equivalent to the complete cases estimators for 𝛼10 and may provide some efficiency gains for 𝛼20 .8 Abrevaya & Donald (2017) were the first to propose an estimator that provides potential gains for 𝛼10 as well in the linear case. I extend their result to the case discussed in Section 2.5.1.1 with the simplifying assumption that 𝜆 0 = 0, and show that efficiency gains are possible for both 𝛼10 and 𝛼20 . 8For instance, Abrevaya & Donald (2011) show that in the case where both the main model and the imputation model are linear, the plug-in estimator that estimates the main model using ordinary least squares (OLS) or feasible generalized least squares with missing values being replaced by predicted values using a first step OLS is numerically equivalent to the complete cases estimator for 𝛼10 . 50 Proposition 2.6.1.2. Consider the case in Section 2.5.1.1 with 𝜆 0 = 0. Under the assumptions of Theorems 2.4.1 and 2.4.2, √ √ 1. 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 1𝑊 𝑐𝑐 − 𝛼10 )] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 1𝑊 𝐽 − 𝛼10 )] = 𝐿 01 𝐾 𝐿 1 ≥ 0 √ √ 2. 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 2𝑊 𝑐𝑐 − 𝛼20 )] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 2𝑊 𝐽 − 𝛼20 )] = 𝐿 02 𝐾 𝐿 2 ≥ 0, where 𝐿 1 , 𝐿 2 and 𝐾 are matrices defined in the appendix. I show that 𝐾 is a positive definite matrix and neither 𝐿 1 nor 𝐿 2 are necessarily zero under the assumptions made so far, and hence it is possible to obtain strict efficiency gains for both 𝛼10 and 𝛼20 . 2.6.2 Sequential procedures Traditionally, imputation is done in two steps using a “plug-in" method (Dagenais, 1973). In the first step, the missing values of 𝑥 1 are replaced with predicted values from a regression of 𝑥 1 on 𝑥2 and in the second step, the main model is estimated using the observed values as well as the predicted values. Methods like mean imputation,9 where the missing values are replaced by the sample mean of 𝑥 1 , can be considered a special case of this method where the first step regression only includes the constant as a covariate. Definition 2.6.2.1: Call the estimator of 𝛼0 obtained using the following procedure the plug-in estimator (or 𝛼ˆ 𝑃 ). Step 1: Obtain 𝛽ˆ𝑊 𝑐𝑐 by minimizing (2.4.3) where 𝑔(.) contains only 𝑔2 (𝛽) and 𝑊ˆ = 𝐼. Step 2: Estimate 𝛼0 by minimizing (2.4.3) where 𝑔(.) contains only 𝑔1 ( 𝑥˜1 , 𝑥2 , 𝛼) and 𝑥˜1𝑖 = 𝑠𝑖 𝑥1𝑖 + (1 − 𝑠𝑖 )ℎ(𝑥2𝑖 , 𝛽ˆ𝑊 𝑐𝑐 ) and ℎ(.) is the function defining predicted values. In the first step, 𝛽0 is consistently estimated using only the complete cases and the missing values of 𝑥 1 are replaced with predicted values based on the imputation model. The function ℎ(.) depends on what the imputation model is. For instance, in the linear case, ℎ(𝑥 2𝑖 , 𝛽ˆ𝑊 𝑐𝑐 ) = 𝑥 2𝑖 𝛽ˆ𝑊 𝑐𝑐 . We denote this new variable by 𝑥˜1 . 
In the second step, 𝛼0 is estimated by solving the sample counterpart of (2.2.1) with 𝑥 1 being replaced by 𝑥˜1 . 9(Little & Rubin, 2002) 51 While this procedure can be consistent when the model of interest is linear, contrary to prior claims in the literature (DeCanio & Watkins, 1998), it is generally inconsistent when the model of interest is nonlinear in the parameters.10 This is because under the assumptions made so far, 𝛼0 is generally not a solution to min E[ 𝑓1 (𝑦, 𝑥1∗ , 𝑥2 , 𝛼)], (2.6.2) 𝛼∈A where 𝑥 1∗ = 𝑠𝑥 1 + (1 − 𝑠)ℎ(𝑥2 , 𝛽0 ). To see why this procedure is inconsistent, consider the model in Section 2.5.1.1. Suppose 𝑦 is binary, that is, 𝑦 = 𝑤 (and 𝑦 ∗ ≡ 𝑤 ∗ ). For simplicity, assume that 𝜆 0 = 0 and 𝑧 = 𝑥 2 , that is, the imputation error is homoskedastic and selection is independent of (𝑦, 𝑥1 ) conditional on 𝑥 2 . Since E(𝑥1 |𝑥 2 ) = 𝑥 2 𝜃 0 and 𝜃 0 is consistently estimated by Ordinary Least Squares (OLS) of 𝑥1 on 𝑥2 using the complete cases only (call this estimator 𝜃ˆ𝑐𝑐 ), it is tempting to replace the missing values of 𝑥 1 by 𝑥 2 𝜃ˆ𝑐𝑐 and estimate 𝛼0 from the probit of 𝑦 on 𝑥˜1 ≡ 𝑠𝑥1 + (1 − 𝑠)𝑥 2 𝜃ˆ𝑐𝑐 and 𝑥 2 . Standard two-step M-estimation theory11 states that for this procedure to be consistent, we require that 𝛼0 uniquely solves min − E{𝑦 𝑙𝑜𝑔 Φ(𝛼1 𝑥 1∗ + 𝑥 2 𝛼2 ) + (1 − 𝑦)𝑙𝑜𝑔[1 − Φ(𝛼1 𝑥1∗ + 𝑥 2 𝛼2 )]}, (2.6.3) 𝛼∈A where 𝑥 1∗ ≡ 𝑠𝑥1 + (1 − 𝑠)𝑥2 𝜃 0 . However, 𝛼0 does not minimize (2.6.3) in general since for that to be true, we would need 𝑃(𝑦 = 1|𝑠𝑥1 , 𝑥2 , 𝑠) = Φ(𝛼10 𝑥 1∗ + 𝑥 2 𝛼20 ). (2.6.4) However, (2.5.2) and (2.5.6) imply 𝑦 ∗ = 𝛼10 [𝑠𝑥1 + (1 − 𝑠)𝑥 2 𝜃 0 ] + 𝑥2 𝛼20 + 𝑢 + (1 − 𝑠)𝑟𝛼10 ≡ 𝛼10 𝑥 1∗ + 𝑥 2 𝛼20 + 𝑢 + (1 − 𝑠)𝑟𝛼10 , (2.6.5) and E{1[𝛼10 𝑥 1∗ + 𝑥 2 𝛼20 + 𝑢 + (1 − 𝑠)𝑟𝛼10 ]|𝑠𝑥 1 , 𝑥2 , 𝑠} ≠ Φ(𝛼10 𝑥 1∗ + 𝑥 2 𝛼20 ). (2.6.6) 10This procedure also requires extra caution when the model of interest is nonlinear in the variables, as discussed in Rai (2020). 11Wooldridge (2010) Section 17.4. 52 The core issue is that expectation does not pass through nonlinear operators, in this case the indicator function 1[.]. In fact, in this example, E(𝑦 = 1|𝑠𝑥1 , 𝑥2 , 𝑠) = 𝑃(𝑦 = 1|𝑠𝑥 1 , 𝑥2 , 𝑠) = 𝑃{[𝑢 + (1 − 𝑠)𝑟𝛼10 ] > −(𝛼10 𝑥1∗ + 𝑥 2 𝛼20 )|𝑠𝑥1 , 𝑥2 , 𝑠} 𝛼10 𝑥 1∗ + 𝑥 2 𝛼20   =Φ q , (2.6.7) 1 + (1 − 𝑠)𝛼10 2 𝜎2 0 since 𝑢 + (1 − 𝑠)𝑟𝛼10 |𝑠𝑥 1 , 𝑥2 , 𝑠 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, 1 + (1 − 𝑠)𝛼10 2 𝜎 2 ] under Assumption 2.3.1, which 0 makes the main estimation problem a heteroskedastic probit. The correct log likelihood function is therefore based on (2.6.7), and 𝛼0 is not a solution to (2.6.3). Proposition 2.6.2.1: Consider the case in Section 2.5.1.1. Let Assumptions 2.2.1-2.2.3, 2.3.1, 2.3.2 and the assumptions in Theorems 2.4.1 and 2.4.2 hold. Additionally assume that 𝑧 = 𝑥 2 and 𝜆0 = 0. Then 𝛼ˆ 𝑃 is inconsistent for 𝛼10 unless 𝛼10 = 0. However, 𝛼10 = 0 implies that 𝑥 1 is irrelevant in the model of interest, in which case the best solution is to just drop it from the model. As a second example, consider the exponential model from Section 2.5.2 and again for simplicity, assume that 𝜆 0 = 0 and 𝑧 = 𝑥2 . The plug-in method would entail estimating 𝛼0 using Poisson quasi- MLE with the conditional mean function 𝑒𝑥 𝑝(𝛼1 𝑥˜1 + 𝑥2 𝛼2 ). For this estimator to be consistent, we would require that 𝛼0 uniquely solves min − E[𝑦(𝛼1 𝑥 1∗ + 𝑥 2 𝛼2 ) − 𝑒𝑥 𝑝(𝛼1 𝑥1∗ + 𝑥2 𝛼2 )], (2.6.8) 𝛼∈A which would be true if E(𝑦|𝑠𝑥1 , 𝑥2 , 𝑠) = 𝑒𝑥 𝑝(𝛼10 𝑥1∗ + 𝑥 2 𝛼20 ). 
(2.6.9) However, under Assumption 2.3.1, equations (2.5.30) and (2.5.35) imply that E(𝑦|𝑠𝑥1 , 𝑥2 , 𝑠) = 𝑒𝑥 𝑝{𝛼10 [𝑠𝑥1 + (1 − 𝑠)𝑥 2 𝜃 0 ] + 𝑥 2 𝛼20 + (1 − 𝑠)𝛼102 𝜎 2 /2} 0 ≡ 𝑒𝑥 𝑝 [𝛼10 𝑥 1∗ + 𝑥 2 𝛼20 + (1 − 𝑠)𝛼10 2 𝜎 2 /2]. 0 (2.6.10) 53 Since the log likelihood in (2.6.8) is based on an incorrect specification of the conditional mean of 𝑦, 𝛼0 will generally not solve (2.6.8). Proposition 2.6.2.2: Consider the case in Section 2.5.2. Let Assumptions 2.2.1-2.2.3, 2.3.1, 2.3.2 and the assumptions in Theorems 2.4.1 and 2.4.2 hold. Additionally assume that 𝑧 = 𝑥 2 and 𝜆0 = 0. Then 𝛼ˆ 𝑃 is inconsistent unless 𝛼10 = 0. A sequential procedure that would be consistent is plugging 𝛽ˆ𝑊 𝑐𝑐 in 𝑔3 (.), and estimating 𝛼0 using 𝑔1 (𝛼) and 𝑔3 (𝛼, 𝛽ˆ𝑊 𝑐𝑐 ) in a joint GMM procedure. Definition 2.6.2.2: Call the estimator of 𝛼0 obtained using the following procedure the sequen- tial estimator (or 𝛼ˆ 𝑆𝑒𝑞 ). Step 1: Obtain 𝛽ˆ𝑊 𝑐𝑐 by minimizing (2.4.3) where 𝑔(.) contains only 𝑔2 (𝛽) and 𝑊ˆ = 𝐼. Step 2: Estimate 𝛼0 by minimizing (2.4.3) where 𝑔(.) contains only 𝑔1 (𝛼) and 𝑔3 (𝛼, 𝛽ˆ𝑊 𝑐𝑐 ), and 𝑊ˆ = 𝐹ˆ −1 , where 𝐹ˆ −1 can be obtained using equation (2.4.7) and imposing 𝑔ˆ𝑖 = [𝑔1𝑖 ( 𝛼) ˜ 0 𝑔2𝑖 ( 𝛼, ˜ 𝛽ˆ𝑊 𝑐𝑐 ) 0] 0, 𝛼˜ being a first step consistent estimate of 𝛼0 . Even though 𝛼ˆ 𝑆𝑒𝑞 is consistent, it is going to be less efficient than 𝛼ˆ 𝑊 𝐽 because the former does not utilize the correlation between the moment functions 𝑔1 (.) and 𝑔2 (.). From a GMM perspective, it is well known that a sequential procedure using the same moment conditions is no more efficient than its joint counterpart. Proposition 2.6.2.3. Under Assumptions 2.2.1-2.2.3, 2.3.1, 2.3.2, and the assumptions made in Theorems 2.4.1 and 2.4.2, √ √ 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑆𝑒𝑞 − 𝛼0 )] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑊 𝐽 − 𝛼0 )] is positive semi-definite. Thus, there is no reason to prefer 𝛼ˆ 𝑆𝑒𝑞 over 𝛼ˆ 𝑊 𝐽 other than computational convenience. 2.6.3 Dummy variable method The dummy variable estimator (𝛼ˆ 𝐷 ) replaces the missing values of 𝑥 1 with zeros and uses an indicator for missingness as an additional covariate. Jones (1996) and Rai (2020) show that the 54 resulting estimator is generally inconsistent for 𝛼0 in linear models with exogenous and endogenous 𝑥 1 respectively. This inconsistency continues to hold in nonlinear models. Consider again the example in Section 2.5.1.1 with 𝜆 0 = 0 and 𝑧 = 𝑥 2 . The DVM would entail doing a probit of 𝑦 on (𝑠𝑥1 , 1 − 𝑠, 𝑥2 ). Analogous to the discussion in Section 2.6.2, this estimator would be consistent if 𝑃(𝑦 = 1|𝑠𝑥1 , 𝑥2 , 𝑠) = Φ[𝛼10 𝑠𝑥1 + (1 − 𝑠)𝜃 10 𝛼10 + 𝑥2 𝛼20 ], (2.6.11) which is not true in general. Too see this, let 𝑥 2 = (1, 𝑥22 ) and 𝜃 0 = (𝜃 10 , 𝜃 020 ) 0 and note that we can rewrite equation (2.6.7) as   𝛼10 𝑠𝑥1 + (1 − 𝑠)𝜃 10 𝛼10 + (1 − 𝑠)𝑥 22 𝜃 20 𝛼10 + 𝑥 2 𝛼20 𝑃(𝑦 = 1|𝑠𝑥1 , 𝑥2 , 𝑠) = Φ q . (2.6.12) 2 1 + (1 − 𝑠)𝛼10 𝜎0 2 As can be seen from this equation, 𝛼ˆ 𝐷 is inconsistent for two reasons. The first issue, which is unique to this method, is that it omits the covariates (1 − 𝑠)𝑥22 , leading to endogeneity unless 𝛼10 = 0 and/or 𝜃 20 = 0. The second issue, which is common with the plug-in method, is that it ignores the scale factor in the denominator which remains unless 𝛼10 = 0. Proposition 2.6.3.1: Consider the case in Section 2.5.1.1. Let Assumptions 2.2.1-2.2.3, 2.3.1, 2.3.2 and the assumptions in Theorems 2.4.1 and 2.4.2 hold. Additionally assume that 𝑧 = 𝑥 2 and 𝜆0 = 0. Then 𝛼ˆ 𝐷 is inconsistent unless (i) 𝛼10 = 0 or (ii) 𝜃 20 = 𝜎02 = 0. Similar to Section 2.6.2, if 𝛼10 = 0, the best solution is to drop 𝑥1 . 
The second condition requires that both the imputation coefficients and the imputation error variance are zero at the same time, which is not possible. A second example is the exponential model discussed in Section 2.5.2. Consider again the case where 𝑧 = 𝑥 2 and 𝜆 0 = 0. The DVM would entail using (𝑠𝑥1 , 1 − 𝑠, 𝑥2 ) as covariates for a Poisson quasi-MLE, which would be consistent if 2 𝜎 2 /2) + 𝑥 𝛼 ]. E(𝑦|𝑠𝑥1 , 𝑥2 , 𝑠) = 𝑒𝑥 𝑝 [𝛼10 𝑠𝑥 1 + (1 − 𝑠)(𝜃 10 𝛼10 + 𝛼10 (2.6.13) 0 2 20 However, we can re-write (2.6.10) as 2 𝜎 2 /2) + (1 − 𝑠)𝑥 𝜃 𝛼 + 𝑥 𝛼 ]. (2.6.14) E(𝑦|𝑠𝑥1 , 𝑥2 , 𝑠) = 𝑒𝑥 𝑝[𝛼10 𝑠𝑥1 + (1 − 𝑠)(𝜃 10 𝛼10 + 𝛼10 0 22 20 10 2 20 55 Similar to the probit case, the DVM omits the covariates (1 − 𝑠)𝑥 22 from the above conditional mean function. Proposition 2.6.3.2: Consider the case in Section 2.5.2. Let Assumptions 2.2.1-2.2.3, 2.3.1, 2.3.2 and the assumptions in Theorems 2.4.1 and 2.4.2 hold. Additionally assume that 𝑧 = 𝑥 2 and 𝜆0 = 0. Then 𝛼ˆ 𝐷 is inconsistent unless (i) 𝛼10 = 0 or (ii) 𝜃 20 = 0. That is, 𝛼ˆ 𝐷 is inconsistent unless 𝑥 1 is irrelevant in the model of interest or 𝑥 22 does not help in predicting 𝑥 1 . 2.6.4 Unweighted estimators The key to efficiency gains of 𝛼ˆ 𝑊 𝐽 over 𝛼ˆ 𝑊 𝑐𝑐 is that the former uses the information in the incomplete cases. Weighting the moment functions in (2.4.1) allows for more flexibility in terms of what variables selection can depend on and estimation of interesting parameters in the presence of misspecification, but that core reason for efficiency gains is independent of weighting. In other words, the joint GMM based on the unweighted version of the moment functions in (2.4.1) will still be more efficient than the unweighted complete cases estimator. These two unweighted estimators are defined below. Definition 6.4.1: Call the estimator of 𝛼0 that minimizes (2.4.3) where 𝑔(.) = 𝑠 · 𝑔1∗ (𝑦, 𝑥, 𝛼) and 𝑊ˆ = 𝐼, the unweighted complete cases estimator, or 𝛼ˆ 𝑈𝑐𝑐 . The unweighted joint estimator is based on the following vector of moment conditions. 𝑔1𝑖 (𝛼, 𝛽)   𝑠𝑖 𝑔 ∗ (𝑦𝑖 , 𝑥𝑖 , 𝛼)         1𝑖      𝑔𝑖 (𝛼, 𝛽) = 𝑔2𝑖 (𝛼, 𝛽)  ≡ 𝑠𝑖 𝑔 ∗ (𝑥 1𝑖 , 𝑥2𝑖 , 𝛽)  . (2.6.15)    2𝑖    ∗ 𝑔3𝑖 (𝛼, 𝛽)  𝑔3𝑖 (𝑦𝑖 , 𝑥2𝑖 , 𝛼, 𝛽)        For a generic element from the population (𝑦, 𝑥1 , 𝑥2 , 𝑠), denote this vector of moment functions by 𝑔(𝛼, 𝛽). Then the variance-covariance matrix of 𝑔(𝛼, 𝛽) evaluated at the true parameter values is given by 𝐶0 = E[𝑔(𝛼0 , 𝛽0 ) 𝑔(𝛼0 , 𝛽0 ) 0], (2.6.16) and the optimal GMM estimator based on (2.6.15) is defined as follows. 56 Definition 2.6.4.2. Call the estimator of (𝛼0 , 𝛽0 ) that solves min ¯ 𝑔(𝛼, 𝛽) 0 𝐶ˆ −1 𝑔(𝛼, ¯ 𝛽), (𝛼,𝛽)∈A×B 𝑝 the unweighted joint estimator, or ( 𝛼ˆ 𝑈𝐽 , 𝛽ˆ𝑈𝐽 ), where 𝑔(𝛼, 𝛽) = 𝑁 −1 𝑖=1 𝑔𝑖 (𝛼, 𝛽) and 𝐶ˆ − Í𝑁 ¯ → 𝐶0 . I provide the asymptotic distribution of this estimator in Appendix E. The key point to note is that just like 𝛼ˆ 𝑊 𝐽 is no less efficient than 𝛼ˆ 𝑊 𝑐𝑐 , 𝛼ˆ 𝑈𝐽 is no less efficient than 𝛼ˆ 𝑈𝑐𝑐 . Proposition 2.6.4.1. Under the assumptions of Theorems E.1 and E.2, √ √ 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑈𝑐𝑐 − 𝛼0 )] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑈𝐽 − 𝛼0 )] is positive semidefinite. The proof of this proposition is very similar to that of Proposition 2.6.1.1, and hence is omitted. 
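To make the contrast between the two joint estimators concrete, here is a sketch of how the stacked moment functions are built: the weighted version in (2.4.1) divides the complete-case scores by $G(z_i, \hat{\delta})$, while the unweighted version in (2.6.15) multiplies them by $s_i$ only. Here `g1`, `g2`, `g3` stand for the score functions of $f_1$, $f_2$, $f_3$ (for instance, those in (2.5.15) for the probit example); the code is a sketch and the function names are mine.

```python
import numpy as np

def stack_moments(alpha, beta, y, x1, x2, s, g_hat, g1, g2, g3, weighted=True):
    """Stack the moment functions of (2.4.1) (weighted=True) or (2.6.15) (weighted=False).

    g1, g2, g3 are assumed to return N x L1, N x L2 and N x L3 arrays of scores.
    """
    x1_filled = np.where(s == 1, x1, 0.0)            # placeholder where x1 is missing
    w = s / g_hat if weighted else s.astype(float)   # IPW weights or plain indicator
    m1 = w[:, None] * g1(y, x1_filled, x2, alpha)    # model of interest: complete cases only
    m2 = w[:, None] * g2(x1_filled, x2, beta)        # imputation model: complete cases only
    m3 = g3(y, x2, alpha, beta)                      # reduced form: all observations
    return np.column_stack([m1, m2, m3])             # N x (L1 + L2 + L3)

# The GMM objective (2.4.3) is then g_bar' W_hat g_bar with
# g_bar = stack_moments(...).mean(axis=0).
```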
The natural question that arises then is whether one should weight when using the joint estimator, and whether 𝛼ˆ 𝑊 𝐽 is preferred over 𝛼ˆ 𝑈𝑐𝑐 , which is the most commonly used estimator out of all four.12 The issue of whether to weight has previously been considered in Wooldridge (2002), but the use of an imputation model here brings in some new issues. In looking at these two alternatives to 𝛼ˆ 𝑊 𝐽 , there are two issues to address: consistency and asymptotic efficiency. Start with 𝛼ˆ 𝑈𝐽 . From the point of view of consistency, 𝛼ˆ 𝑊 𝐽 is always preferred over 𝛼ˆ 𝑈𝐽 as the former is always consistent when the latter is, but the converse is not true. This is because while both estimators rule out 𝑧 containing 𝑥1 to be consistent,13 𝛼ˆ 𝑊 𝐽 allows 𝑧 to contain 𝑦 as well as some outside predictors of selection, while 𝛼ˆ 𝑈𝐽 does not. A related issue is that of correct specification of the models underlying 𝑓1 (𝑦, 𝑥, 𝛼), 𝑓2 (𝑥 1 , 𝑥2 , 𝛽), and 𝑓3 (𝑦, 𝑥2 , 𝛾) in (2.2.1)-(2.2.3), by which I mean that (𝛼0 , 𝛽0 , 𝛾0 ) characterize a correctly specified feature of 𝐷 (𝑦|𝑥), 𝐷 (𝑥 1 |𝑥 2 ) and 𝐷 (𝑦|𝑥2 ) respectively.14 For instance, this can be a model of a conditional mean, conditional median, conditional distribution, and so on. When 𝑧 = 𝑥 2 , 𝛼ˆ 𝑊 𝐽 is always consistent for 𝛼0 and 𝛽0 12That is, out of 𝛼ˆ 𝑈𝑐𝑐 , 𝛼ˆ 𝑊 𝑐𝑐 , 𝛼ˆ 𝑈𝐽 and 𝛼ˆ 𝑊 𝐽 . 13 𝛼ˆ 𝑈𝐽 rules out 𝑧 containing 𝑥 1 because it uses the imputation equation in estimation in addition to the main equation. Since unweighted estimators can only allow selection to depend on covariates in order to maintain consistency, 𝑥 1 being the outcome variable in the imputation model means that we cannot allow 𝑠 to depend on 𝑥 1 , conditional on 𝑥2 . This is the cost of getting more efficiency using the imputation model. 𝛼ˆ 𝑊 𝐽 rules out this dependence because the weights cannot be estimated using a variable that contains missing values. Therefore, irrespective of whether one uses the imputation model, weighted estimation cannot allow 𝑧 to contain 𝑥1 . 14I make this notion precise in Assumption B.1. 57 that solve (2.2.1) and (2.2.2) irrespective of whether the underlying models are correctly specified, but 𝛼ˆ 𝑈𝐽 is consistent for 𝛼0 and 𝛽0 only if they characterize some correctly specified feature of the respective distributions. For instance, consider the linear case discussed in Abrevaya & Donald (2017) where the 3 M-estimation problems are given by min E[𝑠 · (𝑦 − 𝛼1 𝑥 1 − 𝑥 2 𝛼2 ) 2 ] (2.6.17) 𝛼∈A min E[𝑠 · (𝑥 1 − 𝑥 2 𝛽) 2 ] (2.6.18) 𝛽∈B min E[(𝑦 − 𝑥 2 𝛾) 2 ] (2.6.19) 𝛾∈Γ where 𝛾 ≡ 𝛼1 𝛽+𝛼2 . Consider first the problem in (2.6.17). Suppose that 𝑦 is binary with a nonlinear conditional mean E(𝑦|𝑥) = Φ(𝑥𝜅 0 ), and the linear projection of 𝑦 on 𝑥 is 𝑥𝛼0 . When 𝑥 1 is always observed, the usual motivation for using a linear model here is that it gives consistent estimates of the linear projection parameters 𝛼0 , and linear projection is the best linear approximation to the true conditional mean Φ(𝑥𝜅 0 ). That is, the solution to min E[(𝑦 − 𝛼1 𝑥 1 − 𝑥 2 𝛼2 ) 2 ] (2.6.20) 𝛼∈A is 𝛼0 . However, this result does not always carry over to the case with missing data. Suppose 𝑠 depends on 𝑥2 . Then the solution to (2.6.17) will generally neither be 𝜅0 and more importantly nor be 𝛼0 (Wooldridge, 2002). 
So by estimating a linear model using only the complete cases, we are not getting consistent estimates of anything interesting in the population. [Footnote 15: An exception is the case where s is independent of both y and x, also known as "missing completely at random"; in this case the solution to (2.6.17) is still α_0. However, this case rarely holds in practice.] In general, if we want the solution to

\min_{\alpha\in\mathcal{A}} E[s \cdot f_1(y, x_1, x_2, \alpha)]    (2.6.21)

to be the conditional mean parameters, we want to make sure that we have correctly specified the conditional mean. In the above example, one way to do that is to use a better model of E(y|x), that is, a probit instead of a linear probability model. This highlights the importance of nonlinear models with missing data, even if one is generally satisfied with using a linear approximation when x_1 is always observed. The weighted estimator, on the other hand, recovers the linear projection parameters even when using only the complete cases. In other words, the solution to

\min_{\alpha\in\mathcal{A}} E\{[s/p(z)](y - \alpha_1 x_1 - x_2\alpha_2)^2\}    (2.6.22)

is α_0. A similar discussion holds for the imputation problem in (2.6.18). If x_1 is binary, then we should either weight the imputation model in order to consistently estimate the linear projection parameters or, if not using weights, impute using a probit.

The second consideration is that of asymptotic efficiency. When z = x_2 and the models underlying f_1(y, x, α), f_2(x_1, x_2, β), and f_3(y, x_2, γ) are correctly specified, both estimators are consistent. A theoretical comparison of the asymptotic variances of the two estimators in this case will likely depend on whether a generalized conditional information matrix equality (GCIME), discussed in Wooldridge (2002), holds for each of the three models underlying (2.2.1)-(2.2.3). For instance, the GCIME always holds for conditional MLE under correct specification of the conditional density and for quasi-MLE in the linear exponential family under the so-called generalized linear models assumption. Wooldridge (2002) shows that in this case, α̂_Ucc is more efficient than α̂_Wcc when the GCIME holds. So it is reasonable to expect that α̂_UJ will be more efficient than α̂_WJ as well. I do not undertake a theoretical comparison here but provide some simulation evidence in the next section in support of this speculated efficiency ranking.

The other unweighted alternative to α̂_WJ is α̂_Ucc, and it is not clear from the perspective of consistency whether it is preferred to α̂_WJ (or the weighted complete cases estimator α̂_Wcc). Suppose that α_0 characterizes a correctly specified feature of D(y|x) in (2.2.1). Then if selection is exogenous and depends on x_1 after conditioning on x_2, that is, z = (x_1, x_2), then α̂_Ucc is consistent for α_0. However, both the weighted estimators α̂_WJ and α̂_Wcc are inconsistent. This is because the estimation of the weights cannot depend on x_1, which is missing for some observations; therefore, the weights will generally not be consistently estimated. However, if selection depends on y after conditioning on x, that is, z contains y, then α̂_Ucc is inconsistent while both α̂_WJ and α̂_Wcc are consistent.

The other consideration is that of correct specification of the model underlying f_1(y, x, α) in (2.2.1). Suppose that z = x_2. Then α̂_WJ will be consistent for α_0, the solution to (2.2.1), whether or not there is any model misspecification. But under misspecification, α̂_Ucc will generally not be consistent for α_0. For instance, let us go back to the linear model given by (2.6.17)-(2.6.19). Suppose E(u|x_1, x_2) = 0; that is, α_0 in (2.6.17) are actually the coefficients in the conditional mean of y given (x_1, x_2). If z = (x_1, x_2), then α̂_Ucc is consistent for α_0, but α̂_WJ and α̂_Wcc are inconsistent since the weights can only be based on x_2. If z = (y, x_2), then α̂_WJ and α̂_Wcc with weights based on (y, x_2) are consistent but α̂_Ucc is inconsistent. On the other hand, suppose z = x_2. Then α̂_WJ will be consistent for α_0, the linear projection parameters, whether or not they are the conditional mean parameters as well. However, α̂_Ucc will be inconsistent for α_0 if they are only the linear projection parameters, and not the conditional mean parameters.

As far as asymptotic efficiency goes, when z = x_2 and the model underlying f_1(y, x, α) is correctly specified, both α̂_Ucc and α̂_WJ are consistent, and we can again expect the efficiency comparison to depend on the GCIME. Again, I do not provide a theoretical comparison, but the next section gives some simulation evidence that when the GCIME holds, α̂_WJ is still more efficient than α̂_Ucc despite the former being a weighted estimator.

In conclusion, one can choose whether or not to weight when using the joint GMM based on the nature of selection and model specification, but in either case, the joint estimator is no less (and generally more) efficient than its complete cases counterpart.
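To make the role of the weights concrete before turning to the application, here is a minimal sketch of the weighted complete-cases regression in (2.6.22) with z = x_2: the selection probability is estimated by a logit of s on x_2, and the complete cases are reweighted by s/p̂(x_2). The data-generating process, variable names, and use of statsmodels are assumptions of the sketch, not part of the text.

# Sketch of the inverse probability weighted complete-cases regression in
# (2.6.22): estimate p(x2) = P(s = 1 | x2) by logit, then run WLS on the
# complete cases with weights 1 / p_hat(x2).  Selection depends only on the
# always-observed covariate, so the weights are estimable ("missing at random").
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000
x2 = np.column_stack([np.ones(n), rng.normal(size=n)])
x1 = x2 @ np.array([0.0, 1.0]) + rng.normal(size=n)
p_y = 1 / (1 + np.exp(-(x1 - 0.5 * x2[:, 1])))           # nonlinear E(y|x)
y = (rng.uniform(size=n) < p_y).astype(float)

p_true = 1 / (1 + np.exp(-(0.5 + 0.8 * x2[:, 1])))        # selection depends on x2 only
s = (rng.uniform(size=n) < p_true).astype(int)

p_hat = sm.Logit(s, x2).fit(disp=0).predict(x2)           # estimated selection probability

X = np.column_stack([x1, x2])
full_lp = sm.OLS(y, X).fit()                              # linear projection (infeasible benchmark)
unweighted_cc = sm.OLS(y[s == 1], X[s == 1]).fit()
weighted_cc = sm.WLS(y[s == 1], X[s == 1], weights=1 / p_hat[s == 1]).fit()

print("full-sample linear projection:", full_lp.params)
print("unweighted complete cases    :", unweighted_cc.params)
print("weighted complete cases      :", weighted_cc.params)

In large samples the weighted complete-cases estimates should track the full-sample linear projection, while the unweighted complete-cases estimates may not when the conditional mean is nonlinear and selection depends on x_2, which is the point of the discussion above.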
2.7 Empirical application

I apply the proposed estimation method to the setting of Sandsor (2020), who studies the association between individuals' grade variance and educational attainment. One measure of individuals' cognitive skills is their grades received in school, which are generally summarized using the grade point average (GPA), the mean of the grades. The author looks at the importance of grade variance for educational attainment at a given level of GPA. That is, is it better to specialize in some subjects or to be a "jack-of-all-subjects"? She finds that grade variance is negatively associated with educational attainment; that is, students who are jacks-of-all-subjects have higher educational attainment.

The data come from the National Longitudinal Survey of Youth, 1979 (NLSY79). The NLSY79 is a nationally representative sample of 12,686 young men and women between the ages of 14 and 22. Following the author, I only use the sub-sample of 6111 respondents representing the non-institutionalized civilian segment of the population. The data include high school transcripts, educational attainment, socio-economic characteristics, and other measures of cognitive and non-cognitive skills. GPA is measured as the mean of all grades received in upper secondary education (grades 9 to 12). The measure of grade variance is the standard deviation of an individual's grades (GSD). The outcome of interest I consider is whether the individual has a four-year college degree at age 30. Again following the author, I restrict the sample to individuals with at least 10 valid grades and with non-missing data on all variables other than family income in 1979, which is the covariate with missing values I focus on. This leaves me with a sample of 3942 individuals, out of which family income is missing for 723 (about 18%) individuals.

I model the relationship between GSD and attainment of a four-year college degree as a probit. Since family income is a continuous variable, we are in the general framework of Section 2.5.1.1.
The model of interest is given by: 𝑦𝑖 = 1[𝛼10 𝑙𝑖𝑛𝑐𝑖 + 𝛼210 𝐺𝑆𝐷 𝑖 + 𝑥22𝑖 𝛼220 + 𝑢𝑖 > 0], (2.7.1) 𝑢𝑖 |𝑥𝑖 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 1), (2.7.2) 61 where 𝑦𝑖 is a binary variable equal to 1 if individual 𝑖 has a college degree by the age of 30 and 0 otherwise. 𝑙𝑖𝑛𝑐𝑖 is the log of family income of individual 𝑖 in 1979, and 𝐺𝑆𝐷 𝑖 is the grade standard deviation of individual 𝑖, the covariate of interest. 𝑥22𝑖 is the vector of other covariates which includes individual’s GPA, gender, race, ethnicity, area of residence, and parental education. It also includes measures of cognitive and noncognitive abilities which are based on the Armed Services Vocational Aptitude Battery (ASVAB) test and a combination of Rotter Locus of Control Scale and Rosenberg Self-Esteem Scale respectively. In our general notation from Section 2.5.1.1, 𝑥 1𝑖 = 𝑙𝑖𝑛𝑐𝑖 , 𝑥 2𝑖 = (𝐺𝑆𝐷 𝑖 , 𝑥22𝑖 ), and 𝑥𝑖 = (𝑥 1𝑖 , 𝑥2𝑖 ). Note that Assumption 2.3.1 in this context states that conditional on 𝑥 2𝑖 , the missingness of 𝑙𝑖𝑛𝑐𝑖 is independent of 𝑙𝑖𝑛𝑐𝑖 itself. This assumption is the basis of many standard procedures used to impute income. For instance, the method of hot decking used by the Current Population Survey is based on this assumption, so is multiple imputation used by the National Health Interview Survey. The standard two-step regression imputation is also based on this assumption. The imputation model is given by 𝑙𝑖𝑛𝑐𝑖 = 𝑥2𝑖 𝜃 0 + 𝑟𝑖 , 𝑟𝑖 |𝑥 2𝑖 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 𝜎 2 ). (2.7.3) Table D2 presents the results. Columns 1 and 2 give the coefficient estimates and standard errors from the complete cases probit and the joint GMM respectively. Columns 3 gives the percentage reduction in standard errors of the joint GMM. The standard errors fall for all coefficients, and quite substantially so for many coefficients. While there is not much gain for the coefficient on log of family income, there is about a 10% reduction in the standard error for 𝐺𝑆𝐷 𝑖 , the variable of interest. The reduction for coefficients on other variables range from about 7% − 12%. The last row of the table gives the Hansen’s J-statistic discussed in Proposition 2.4.1. The null hypothesis of correct specification is not rejected at any reasonable significance level, giving us some confidence in the assumptions underlying the joint GMM. Columns 4 and 5 give the estimates and standard errors for the plug-in method and DVM respectively. In this particular case, both estimators give quite similar results as the joint GMM estimator, which is not surprising given that the coefficient of log(income) is fairly small in 62 magnitude. As the simulations suggest, the plug-in estimator performs similarly to CC and the joint GMM in terms of both bias and efficiency when 𝛼10 is small in magnitude. The DVM also has small biases and a smaller standard deviation than the joint GMM for such values, although that efficiency gain does not seem to be present in this application. Moreover, the joint GMM here still has the additional advantage of providing an overidentification test for the assumptions underlying the imputation procedure. 2.8 Conclusion I have provided a new method of consistently imputing missing covariate values in nonlinear models. The estimator uses the standard assumptions used in the imputation literature, but unlike other imputation estimators based on classical principles, it is consistent in nonlinear models for both the structural parameters and other quantities of interest like average partial effects. 
I have provided two practically important examples: fractional and nonnegative responses with binary or continuous CMV. The proposed estimator provides substantial efficiency gains over the complete cases estimator, and as a byproduct of using GMM, the overidentification test provides a way to test the extra restrictions imposed by the imputation estimator compared to the complete cases estimator. I have also provided a comprehensive framework for imputing using a variety of nonlinear models for cases where a linear model might be unrealistic. I have provided the weighted and unweighted versions of the estimator, both of which provide efficiency gains over their complete cases counterparts. This allows the empirical researcher to choose the version best suited for their particular model and the nature of missingness in their specific data. 63 CHAPTER 3 EFFICIENT ESTIMATION OF LINEAR PANEL DATA MODELS WITH MISSING COVARIATES* 3.1 Introduction The problem of missingness is ubiquitous in empirical research. In this paper, we provide some methods to deal with missing covariate values in linear panel data models with unobserved heterogeneity. Economists use a variety of methods to deal with missing covariate values in panel data. One common method is to just use the “complete cases" - the observations for which all covariates are observed [for instance Cabral et al. (2018), David & Venkateswaran (2019)]. While easy to use, methods based only on complete cases can lead to substantial loss of efficiency when missingness is large because of discarding the potentially useful information in the incomplete cases. This has inspired methods that make use of these incomplete cases. One method used in this regard is the “last observation carried forward" (LOCF), which replaces the missing observations in a given time period with observations from the previous time period [for instance, Doraszelski et al. (2018), Giroud & Rauh (2019)].1 Another method is the dummy variable method (DVM), which replaces the missing values with zeros and includes an indicator for missingness as an additional covariate in the model [for instance, Antecol et al. (2018)]. A third method we consider is regression imputation. This is a two-step method which in the first step, regresses the covariate with missing values (CMV) on the always-observed covariates using complete cases and uses the estimated coefficients to predict missing values of the CMV. In the second step, it estimates the model of interest using all observations with this “composite" CMV, which consists of both observed and predicted values.2 *This chapter is co-authored with Professor Jeffrey Wooldridge. 1Sometimes the missing observations are also replaced with observations from the following time period. 2Moffitt et al. (2020) use this method for imputing a variable which is used to define a covariate in the model of interest. 64 In this paper, we consider the issue of proper imputation specifically when using the fixed effects estimator, which is perhaps the most frequently used method to estimate linear panel data models with unobserved heterogeneity. We propose a new method of imputing when using fixed effects that improves upon the performance of the estimators mentioned above. The choice of method comes down to consistency and relative efficiency. The complete cases fixed effects estimator [as described in Wooldridge (2019)] generally requires the least number of assumptions to be consistent. 
However, as mentioned above, it can be inefficient relative to the estimators that make use of the incomplete cases. LOCF has been shown to be generally biased and inconsistent even under the strongest assumptions on missingness (Lane, 2008). We show that DVM is also generally inconsistent unless some very strong zero restrictions are imposed in the model, including the assumption that the CMV does not contain individual specific unobserved heterogeneity - an assumption generally unlikely to hold in practice. Regression imputation is consistent under less restrictive assumptions than the DVM, but still requires that the CMV does not contain unobserved heterogeneity. The key contribution of this paper therefore is to propose a new imputation estimator which is consistent under assumptions that are much less restrictive than those required by the estimators above. We do not impose the zero restrictions required by the DVM, allow for unobserved heterogeneity in the CMV, and allow for missingness to depend on the always-observed covariates. We propose imputation methods for the cases of both strict as well as sequential exogeneity of the covariates, the latter allowing for things like lagged dependent variables and feedback effects. A second contribution we make is proposing a novel variable addition test (VAT) for exogeneity of missingness. The VATs proposed so far in this context have only been able to test for missingness in other time periods being uncorrelated with unobservables in a given time period (Wooldridge, 2010). We propose a test for missingness in the same time period being uncorrelated with the unobservables in a given time period, which is the kind of exogeneity one is most likely to be concerned about in practice. The rest of the paper proceeds as follows. Section 2 presents the population model of interest 65 and the associated assumptions of strict exogeneity of the covariates. Section 3 describes the missing data scheme and the assumptions on the missingness mechanism. Section 4 presents the proposed estimator and its asymptotic distribution. Section 5 compares the proposed estimator to some commonly used alternatives. Section 6 proposes an imputation estimator under sequential exogeneity of the covariates and the novel VAT for the exogeneity of missingness. Section 7 concludes. Proofs and extensions to the cases of missing vectors and time-varying unobserved heterogeneity are given in the appendix. 3.2 Population model We consider a standard linear model with additive heterogeneity. Assume that an underlying population consists of a large number of units for whom data on 𝑇 time periods are potentially available. We assume random sampling from this population, and let 𝑖 denote a random draw. Along with the outcome 𝑦𝑖𝑡 and covariates 𝑥𝑖𝑡 = [𝑥 1𝑖𝑡 𝑥 2𝑖𝑡 ], we also draw scalars 𝑐𝑖 and 𝑑𝑖 , which are the unobserved heterogeneities in 𝑦𝑖𝑡 and 𝑥 1𝑖𝑡 respectively. The linear model with additive heterogeneity is 𝑦𝑖𝑡 = 𝛽1 𝑥1𝑖𝑡 + 𝑥 2𝑖𝑡 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑡 ≡ 𝑥𝑖𝑡 𝛽 + 𝑐𝑖 + 𝑢𝑖𝑡 , 𝑡 = 1, . . . , 𝑇, (3.2.1) where 𝑥 1𝑖𝑡 is a scalar, 𝑥 2𝑖𝑡 is a 1 × 𝑘 vector which includes the constant term3, and 𝛽 = [𝛽1 𝛽02 ] 0. We are interested in estimators of 𝛽 that allow for correlation between 𝑐𝑖 and the history of the covariates, {𝑥𝑖𝑡 : 𝑡 = 1, . . . , 𝑇 }. We first define the histories of all variables. Let y𝑖 = (𝑦𝑖1 , . . . , 𝑦𝑖𝑇 ), x𝑖 = (𝑥𝑖1 , . . . , 𝑥𝑖𝑇 ), x1𝑖 = (𝑥 1𝑖1 , . . . , 𝑥 1𝑖𝑇 ), x2𝑖 = (𝑥 2𝑖1 , . . . , 𝑥 2𝑖𝑇 ), and u𝑖 = (𝑢𝑖1 , . . . , 𝑢𝑖𝑇 ). 
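As a concrete illustration of the setup, the following is a minimal sketch of one way to draw data consistent with (3.2.1), with the heterogeneity c_i deliberately built from the covariate history so that it is correlated with x_it in every period. All parameter values and names are illustrative and not part of the model.

# Minimal sketch of a draw from a population consistent with (3.2.1): the
# heterogeneity c_i is constructed from the covariate history, so it is
# correlated with x_it in every period.  Values are illustrative only.
import numpy as np

rng = np.random.default_rng(3)
N, T = 1_000, 5
beta1 = 1.0
beta2 = np.array([0.5, -1.0])                    # x_2it = (1, w_it)

w = rng.normal(size=(N, T))
x2 = np.stack([np.ones((N, T)), w], axis=2)      # shape (N, T, 2)
x1 = 0.3 + 0.8 * w + rng.normal(size=(N, T))
c = w.mean(axis=1) + 0.5 * rng.normal(size=N)    # c_i correlated with the covariates
u = rng.normal(size=(N, T))
y = beta1 * x1 + x2 @ beta2 + c[:, None] + u     # equation (3.2.1)
print(y.shape)                                   # (N, T)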
We place the following assumption on the idiosyncratic error u_it in equation (3.2.1).

Assumption 3.2.1. E(\mathbf{x}_i' u_{it}) = 0, t = 1, …, T.

This is a kind of strict exogeneity assumption of the covariates with respect to the idiosyncratic error. It implies that x_is is uncorrelated with u_it, s = 1, …, T. In other words, the idiosyncratic error at time t is uncorrelated with the covariates in all time periods. Note that this assumption does not restrict the relationship between x_i and the unobserved heterogeneity c_i, which can be arbitrarily correlated. [Footnote 3: x_2it can include a full set of time dummies, or other aggregate time variables.]

The model which underlies the gains in efficiency in this paper is the following linear imputation model with unobserved heterogeneity, which explains x_1it in terms of x_2it:

x_{1it} = x_{2it}\pi + d_i + r_{it}.    (3.2.2)

We impose an assumption analogous to Assumption 3.2.1 on the idiosyncratic error r_it.

Assumption 3.2.2: E(\mathbf{x}_{2i}' r_{it}) = 0, t = 1, …, T.

Again, this assumption implies that x_2is is orthogonal to the idiosyncratic error r_it in every time period s = 1, …, T. Moreover, it does not restrict the relation between x_2i and the unobserved heterogeneity d_i.

Using the imputation model, which explains x_1it in terms of x_2it, we are able to obtain a "reduced form" for y_it in terms of only x_2it. Plugging (3.2.2) into (3.2.1) gives

y_{it} = \beta_1(x_{2it}\pi + d_i + r_{it}) + x_{2it}\beta_2 + c_i + u_{it} \equiv x_{2it}\gamma + h_i + v_{it},    (3.2.3)

where γ ≡ β_1π + β_2, h_i ≡ β_1 d_i + c_i, and v_it ≡ β_1 r_it + u_it. As we will discuss in Section 3, we allow x_1it to contain missing values while assuming that x_2it is always observed. Equation (3.2.3) allows us to utilize the observations for which x_1it is not observed but y_it and x_2it are. Note that Assumptions 3.2.1 and 3.2.2 imply that

E(\mathbf{x}_{2i}' v_{it}) = E[\mathbf{x}_{2i}'(\beta_1 r_{it} + u_{it})] = 0.    (3.2.4)

That is, x_2is is orthogonal to the idiosyncratic error v_it in equation (3.2.3) for all s = 1, …, T.

3.3 The missing data mechanism

To allow for unbalanced panels, we introduce a series of selection indicators for each i, s_i = {s_i1, …, s_iT}, where s_it = 1 if x_1it is observed and s_it = 0 otherwise. In this paper, we only allow x_1it to contain missing values. Hence, s_it indicates whether we have a "complete case" for unit i in period t.

Our main estimation method is based on the well-known fixed effects estimator. Define T_i = \sum_{q=1}^{T} s_{iq} as the total number of time periods for which x_1it is observed for individual i. Unlike T, T_i is random, since s_it is random for every t = 1, …, T. We impose the following assumption on T_i.

Assumption 3.3.1. P(T_i = 0) = 0.

This assumption simply says that for every individual i in the population, the probability that x_1it is unobserved in every time period t = 1, …, T is zero.

Further, define the time-demeaned covariates as

\ddot{x}_{it} = x_{it} - T_i^{-1}\sum_{q=1}^{T} s_{iq} x_{iq},

where the time demeaning here has been done using the complete cases only. We can write \ddot{x}_{it} = [\ddot{x}_{1it}\ \ddot{x}_{2it}], where \ddot{x}_{1it} = x_{1it} - T_i^{-1}\sum_{q=1}^{T} s_{iq} x_{1iq} and \ddot{x}_{2it} = x_{2it} - T_i^{-1}\sum_{q=1}^{T} s_{iq} x_{2iq}. Moreover,

\dot{x}_{2it} = x_{2it} - (T - T_i)^{-1}\sum_{q=1}^{T} (1 - s_{iq}) x_{2iq}

are the time-demeaned covariates where the time demeaning has been done using the incomplete cases only. Under Assumption 3.3.1, \ddot{x}_{it} and \dot{x}_{2it} are well defined, as are all other time-demeaned variables defined in this section.
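Computationally, the two demeaning operations just defined amount to within-unit averages taken over the selected (s_it = 1) or the non-selected (s_it = 0) periods only. The following is a minimal numpy sketch; the array shapes and names are assumptions of the sketch.

# Within-unit demeaning using only the periods indicated by a 0/1 selector.
# Passing s gives the complete-case demeaning (x_ddot); passing 1 - s gives the
# incomplete-case demeaning (x_dot).  Assumption 3.3.1 guarantees at least one
# selected period per unit for the complete-case version.
import numpy as np

def demean_selected(v, sel):
    """Demean v over time using only periods with sel == 1.
    v has shape (N, T) or (N, T, k); sel has shape (N, T) with 0/1 entries."""
    sel = sel.astype(float)
    Ti = sel.sum(axis=1)                     # number of selected periods, shape (N,)
    if v.ndim == 3:
        sel = sel[..., None]
        Ti = Ti[:, None]
    vbar = (sel * v).sum(axis=1) / Ti        # selected-period time average
    return v - vbar[:, None]                 # broadcast the average back over T

# Usage (arrays as in the text):
# x1_ddot = demean_selected(x1, s)           # complete-case demeaning of x_1it
# x2_ddot = demean_selected(x2, s)           # complete-case demeaning of x_2it
# x2_dot  = demean_selected(x2, 1 - s)       # incomplete-case demeaning of x_2it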
For consistent estimation in the selected samples using fixed effects, we impose the following assumptions on the population distribution.

Assumption 3.3.2. For every t = 1, …, T, (i) E(s_{it}\ddot{x}_{it}' u_{it}) = 0; (ii) E(s_{it}\ddot{x}_{2it}' r_{it}) = 0; (iii) E[(1 - s_{it})\dot{x}_{2it}' v_{it}] = 0.

One case where this assumption would hold is when s_i ⊥ (x_i, u_i, r_i, c_i, d_i), that is, selection is independent of everything else in the model, a case we will call "missing completely at random" (MCAR). For instance, data will be MCAR when we have a randomly rotating panel. Then part (i) of Assumption 3.3.2 becomes

E(s_{it}\ddot{x}_{it}' u_{it}) = E(s_{it} x_{it}' u_{it}) - E\Big(s_{it} T_i^{-1}\sum_{q=1}^{T} s_{iq} x_{iq}' u_{it}\Big)
 = E(s_{it})\,E(x_{it}' u_{it}) - \sum_{q=1}^{T} E(s_{it} T_i^{-1} s_{iq})\,E(x_{iq}' u_{it})
 = 0.    (3.3.1)

The third equality follows from Assumption 3.2.1, under which E(\mathbf{x}_i' u_{it}) = 0. Similarly, part (ii) of Assumption 3.3.2 becomes

E(s_{it}\ddot{x}_{2it}' r_{it}) = E(s_{it} x_{2it}' r_{it}) - E\Big(s_{it} T_i^{-1}\sum_{q=1}^{T} s_{iq} x_{2iq}' r_{it}\Big)
 = E(s_{it})\,E(x_{2it}' r_{it}) - \sum_{q=1}^{T} E(s_{it} T_i^{-1} s_{iq})\,E(x_{2iq}' r_{it})
 = 0.    (3.3.2)

The third equality follows from Assumption 3.2.2, under which E(\mathbf{x}_{2i}' r_{it}) = 0.

As we will see in Section 4, time demeaning using the complete cases gets rid of the unobserved heterogeneities c_i and d_i in equations (3.2.1) and (3.2.2) respectively. Therefore, Assumption 3.3.2 does not put any restrictions on the unobserved heterogeneities, and we do not need selection to be independent of the unobserved heterogeneities for this assumption to hold. So along with Assumptions 3.2.1 and 3.2.2, MCAR is sufficient for Assumption 3.3.2 to hold, but we can get by with the weaker assumption s_i ⊥ (x_i, u_i, r_i). [Footnote 5: It is, however, hard to think of situations where selection is independent of the covariates and the idiosyncratic errors but not the unobserved heterogeneities.]

We can also allow selection to be a function of the always-observed covariates x_2it or of unobserved random variables outside the model, but we then have to strengthen the exogeneity Assumptions 3.2.1 and 3.2.2 to the following zero conditional mean assumptions.

Assumption 3.2.1'. E(u_{it} | \mathbf{x}_{1i}, \mathbf{x}_{2i}, c_i, \mathbf{s}_i) = 0, t = 1, …, T.

Assumption 3.2.1' is a version of strict exogeneity of selection (along with strict exogeneity of the covariates) conditional on c_i. It implies that observing x_1it in any time period t cannot be systematically related to the idiosyncratic errors u_i. As a practical matter, Assumption 3.2.1' allows selection s_it at time period t to be arbitrarily correlated with (x_1i, x_2i, c_i), that is, with the covariates in any time period and the unobserved heterogeneity in y_it. We also need to strengthen Assumption 3.2.2 to the following zero conditional mean assumption.

Assumption 3.2.2'. E(r_{it} | \mathbf{x}_{2i}, d_i, \mathbf{s}_i) = 0, t = 1, …, T.

Assumption 3.2.2' implies that observing x_1it in any time period t cannot be systematically related to r_i, where r_i = (r_i1, …, r_iT). But it can be arbitrarily correlated with (x_2i, d_i), that is, with the always-observed covariates and the unobserved heterogeneity in x_1it.

Together, Assumptions 3.2.1' and 3.2.2' allow s_it to be arbitrarily correlated with the always-observed covariates x_2i, as well as with the unobserved heterogeneity in both y_it and x_1it, that is, c_i and d_i. But they rule out s_it being a function of the idiosyncratic errors u_i and r_i. To see that Assumption 3.3.2 holds under Assumptions 3.2.1' and 3.2.2', consider part (i) of Assumption 3.3.2:

E(s_{it}\ddot{x}_{it}' u_{it}) = E[E(s_{it}\ddot{x}_{it}' u_{it} \mid \mathbf{x}_i, \mathbf{s}_i)] = E[s_{it}\ddot{x}_{it}'\, E(u_{it} \mid \mathbf{x}_i, \mathbf{s}_i)] = 0.    (3.3.3)
The first equality follows from the Law of Iterated Expectations (LIE), and the third follows from the fact that under Assumption 3.2.1', E(u_{it} | \mathbf{x}_i, \mathbf{s}_i) = 0 by the LIE. Similarly, part (ii) of Assumption 3.3.2 becomes

E(s_{it}\ddot{x}_{2it}' r_{it}) = E[E(s_{it}\ddot{x}_{2it}' r_{it} \mid \mathbf{x}_{2i}, \mathbf{s}_i)] = E[s_{it}\ddot{x}_{2it}'\, E(r_{it} \mid \mathbf{x}_{2i}, \mathbf{s}_i)] = 0,    (3.3.4)

where the third equality follows from the fact that under Assumption 3.2.2', E(r_{it} | \mathbf{x}_{2i}, \mathbf{s}_i) = 0 by the LIE.

3.4 Moment conditions and GMM

It is well known that the fixed effects (within) estimator that uses only the complete cases is generally consistent under Assumption 3.2.1'. One way to characterize this estimator is to multiply equation (3.2.1) through by the selection indicator to get

s_{it} y_{it} = \beta_1 s_{it} x_{1it} + s_{it} x_{2it}\beta_2 + s_{it} c_i + s_{it} u_{it}, \quad t = 1, \dots, T.    (3.4.1)

Averaging this equation across t for each i gives

\bar{y}_i = \beta_1\bar{x}_{1i} + \bar{x}_{2i}\beta_2 + c_i + \bar{u}_i,    (3.4.2)

where \bar{y}_i = T_i^{-1}\sum_{q=1}^{T} s_{iq} y_{iq} is the average of the selected observations; the other averages in (3.4.2) are defined similarly. If we now multiply (3.4.2) by s_it and subtract from (3.4.1), we remove c_i:

s_{it}(y_{it} - \bar{y}_i) = \beta_1 s_{it}(x_{1it} - \bar{x}_{1i}) + s_{it}(x_{2it} - \bar{x}_{2i})\beta_2 + s_{it}(u_{it} - \bar{u}_i), \quad t = 1, \dots, T.    (3.4.3)

Equivalently,

s_{it}\ddot{y}_{it} = \beta_1 s_{it}\ddot{x}_{1it} + s_{it}\ddot{x}_{2it}\beta_2 + s_{it}\ddot{u}_{it}, \quad t = 1, \dots, T,    (3.4.4)

where \ddot{y}_{it} \equiv y_{it} - \bar{y}_i, \ddot{u}_{it} \equiv u_{it} - \bar{u}_i, and \ddot{x}_{1it} and \ddot{x}_{2it} are as defined in Section 3. These are the time-demeaned variables, where the demeaning has been done using the complete cases. Then pooled OLS on (3.4.4) gives consistent estimates of β under part (i) of Assumption 3.3.2. Estimating β using pooled OLS is equivalent to GMM estimation using the following moment conditions:

E[f_{1i}(\beta, \pi)] \equiv E\Big[\sum_{t=1}^{T} s_{it}\ddot{x}_{it}'(\ddot{y}_{it} - \ddot{x}_{1it}\beta_1 - \ddot{x}_{2it}\beta_2)\Big] = 0.    (3.4.5)

These moment conditions give the fixed effects estimator based only on complete cases. Even though this estimator is consistent, it leaves room for gains in efficiency, as it ignores the information contained in those observations for which x_1it is missing but y_it and x_2it are observed. In order to utilize those observations, we augment the above moment conditions with those from the imputation model and the reduced form for y_it.

We can time demean the imputation model (3.2.2) in a similar fashion as (3.2.1), that is, using the complete cases. This gives

s_{it}\ddot{x}_{1it} = s_{it}\ddot{x}_{2it}\pi + s_{it}\ddot{r}_{it}, \quad t = 1, \dots, T,    (3.4.6)

where \ddot{r}_{it} \equiv r_{it} - \bar{r}_i and \bar{r}_i = T_i^{-1}\sum_{q=1}^{T} s_{iq} r_{iq}. Again, the unobserved heterogeneity d_i is eliminated by the time demeaning. Estimating π using pooled OLS in this equation is equivalent to GMM estimation using the moment functions

f_{2i}(\beta, \pi) = \sum_{t=1}^{T} s_{it}\ddot{x}_{2it}'(\ddot{x}_{1it} - \ddot{x}_{2it}\pi).    (3.4.7)

For the reduced form, we use the incomplete cases to time demean the data. Define

\dot{y}_{it} \equiv y_{it} - (T - T_i)^{-1}\sum_{q=1}^{T}(1 - s_{iq}) y_{iq}, \qquad \dot{x}_{2it} \equiv x_{2it} - (T - T_i)^{-1}\sum_{q=1}^{T}(1 - s_{iq}) x_{2iq}.

Then estimating γ using pooled OLS on the equation

(1 - s_{it})\dot{y}_{it} = (1 - s_{it})\dot{x}_{2it}\gamma + (1 - s_{it})\dot{v}_{it}, \quad t = 1, \dots, T,    (3.4.8)

is equivalent to GMM estimation using the following moment functions for the reduced form:

f_{3i}(\beta, \pi) = \sum_{t=1}^{T}(1 - s_{it})\dot{x}_{2it}'[\dot{y}_{it} - \dot{x}_{2it}(\beta_1\pi + \beta_2)].    (3.4.9)

The full vector of moment functions is given by

f_i(\beta, \pi) = \begin{bmatrix} \sum_{t=1}^{T} s_{it}\ddot{x}_{it}'(\ddot{y}_{it} - \ddot{x}_{1it}\beta_1 - \ddot{x}_{2it}\beta_2) \\ \sum_{t=1}^{T} s_{it}\ddot{x}_{2it}'(\ddot{x}_{1it} - \ddot{x}_{2it}\pi) \\ \sum_{t=1}^{T}(1 - s_{it})\dot{x}_{2it}'[\dot{y}_{it} - \dot{x}_{2it}(\beta_1\pi + \beta_2)] \end{bmatrix} \equiv \begin{bmatrix} f_{1i}(\beta, \pi) \\ f_{2i}(\beta, \pi) \\ f_{3i}(\beta, \pi) \end{bmatrix}.    (3.4.10)
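Before stating the main lemmas, here is a brief sketch of how the stacked per-unit moment vector in (3.4.10) might be computed, assuming the demeaned arrays have already been constructed as in Section 3; the shapes and names below are assumptions of the sketch.

# Per-unit moment vector f_i(beta, pi) from (3.4.10), given arrays for one unit:
# y_dd, x1_dd (complete-case demeaned), x2_dd, y_dot, x2_dot (incomplete-case
# demeaned), and the selection indicators s.  Length of the output is 3k + 1.
import numpy as np

def f_i(beta1, beta2, pi, y_dd, x1_dd, x2_dd, y_dot, x2_dot, s):
    """y_dd, x1_dd, y_dot, s: shape (T,); x2_dd, x2_dot: shape (T, k)."""
    x_dd = np.column_stack([x1_dd, x2_dd])                        # (T, k+1)
    e1 = y_dd - x1_dd * beta1 - x2_dd @ beta2                     # outcome-equation residual
    e2 = x1_dd - x2_dd @ pi                                       # imputation residual
    e3 = y_dot - x2_dot @ (beta1 * pi + beta2)                    # reduced-form residual
    f1 = (s[:, None] * x_dd * e1[:, None]).sum(axis=0)            # (k+1,)
    f2 = (s[:, None] * x2_dd * e2[:, None]).sum(axis=0)           # (k,)
    f3 = ((1 - s)[:, None] * x2_dot * e3[:, None]).sum(axis=0)    # (k,)
    return np.concatenate([f1, f2, f3])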
Lemma 3.4.1: Under Assumptions 3.2.1', 3.2.2', 3.3.1 and 3.3.2, E[f_i(β, π)] = 0.

This is a set of 3k + 1 moment conditions with 2k + 1 parameters, giving us k over-identifying restrictions. It is the availability of these over-identifying restrictions that leads to gains in efficiency in this model. As the following result shows, using either f_1i(.) and f_2i(.) or f_1i(.) and f_3i(.) leads to an estimator of β that is identical to the estimator that uses only f_1i(.) and hence utilizes only the complete cases.

Lemma 3.4.2: Under Assumptions 3.2.1', 3.2.2', 3.3.1 and 3.3.2, GMM estimators of β based on the moment functions [f_1i(.)′ f_2i(.)′]′ or the moment functions [f_1i(.)′ f_3i(.)′]′ are identical to that based only on f_1i(.).

Lemma 3.4.2 follows directly from the result in Ahn & Schmidt (1995) [Footnote 6: Theorem 1, p. 3.] that adding an equal number of additional parameters and extra moment conditions does not change the GMM estimate of the original parameters. Both f_2i(.) and f_3i(.) are sets of k moment functions which add the k extra parameters π.

To define the GMM estimator based on the entire vector f_i(.), let \bar{f}(\beta, \pi) = N^{-1}\sum_{i=1}^{N} f_i(\beta, \pi), let Ω be a square matrix of order 3k + 1 that is nonrandom, symmetric, and positive definite, and let Ω̂ be a first-step consistent estimate of Ω. Then the standard two-step GMM minimization problem is given by

\min_{\beta, \pi}\ \bar{f}(\beta, \pi)'\, \hat{\Omega}\, \bar{f}(\beta, \pi).    (3.4.11)

The variance-covariance matrix of the moment functions is given by

C \equiv E[f_i(\beta, \pi)\, f_i(\beta, \pi)'] = \begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{bmatrix},

where

C_{11} = E\Big[\Big(\sum_{t=1}^{T} s_{it}\ddot{x}_{it}' u_{it}\Big)\Big(\sum_{r=1}^{T} s_{ir} u_{ir}\ddot{x}_{ir}\Big)\Big], \quad C_{12} = E\Big[\Big(\sum_{t=1}^{T} s_{it}\ddot{x}_{it}' u_{it}\Big)\Big(\sum_{r=1}^{T} s_{ir} r_{ir}\ddot{x}_{2ir}\Big)\Big],
C_{13} = E\Big[\Big(\sum_{t=1}^{T} s_{it}\ddot{x}_{it}' u_{it}\Big)\Big(\sum_{r=1}^{T} (1 - s_{ir}) v_{ir}\dot{x}_{2ir}\Big)\Big], \quad C_{22} = E\Big[\Big(\sum_{t=1}^{T} s_{it}\ddot{x}_{2it}' r_{it}\Big)\Big(\sum_{r=1}^{T} s_{ir} r_{ir}\ddot{x}_{2ir}\Big)\Big],
C_{23} = E\Big[\Big(\sum_{t=1}^{T} s_{it}\ddot{x}_{2it}' r_{it}\Big)\Big(\sum_{r=1}^{T} (1 - s_{ir}) v_{ir}\dot{x}_{2ir}\Big)\Big], \quad C_{33} = E\Big[\Big(\sum_{t=1}^{T} (1 - s_{it})\dot{x}_{2it}' v_{it}\Big)\Big(\sum_{r=1}^{T} (1 - s_{ir}) v_{ir}\dot{x}_{2ir}\Big)\Big],

and f_i(.) is evaluated at the true values of the parameters. The optimal weight matrix is given by the inverse of C. Let Ĉ be a consistent estimate of C. [Footnote 7: Ĉ can be obtained by replacing the expectations with sample averages and substituting the estimated errors.] Then the joint GMM estimator is defined as follows.

Definition 3.4.1. Call the estimator of [β′ π′]′ that solves (3.4.11) with Ω̂ = Ĉ^{-1} the joint GMM estimator (or [β̂_JointFE′ π̂_JointFE′]′).

Further, define the gradient of the moment functions as

D \equiv E[\nabla f_i(\beta, \pi)] = \begin{bmatrix} D_{11} & 0 \\ 0 & D_{22} \\ D_{31} & D_{32} \end{bmatrix},

where

D_{11} = -E\Big[\sum_{t=1}^{T} s_{it}\ddot{x}_{it}'\ddot{x}_{it}\Big], \quad D_{22} = -E\Big[\sum_{t=1}^{T} s_{it}\ddot{x}_{2it}'\ddot{x}_{2it}\Big],
D_{31} = -E\Big[\sum_{t=1}^{T} (1 - s_{it})\dot{x}_{2it}'\,\big(\dot{x}_{2it}\pi \ \ \dot{x}_{2it}\big)\Big], \quad D_{32} = -E\Big[\sum_{t=1}^{T} (1 - s_{it})\dot{x}_{2it}'\dot{x}_{2it}\beta_1\Big].

We impose the following rank condition on D for identification of β and π.

Assumption 3.4.1: rank(D_11) = k + 1 and rank(D_22) = k.

Under this assumption, f_1i(β) identifies β and f_2i(π) identifies π. Then we have the following result using Hansen (1982).

Theorem 3.4.1. Under standard regularity conditions and Assumptions 3.2.1', 3.2.2', 3.3.1, 3.3.2, and 3.4.1, the estimators [β̂_JointFE′ π̂_JointFE′]′ are consistent and asymptotically normal, with asymptotic variance given by (D'C^{-1}D)^{-1}, and

N\, \bar{f}(\hat{\beta}, \hat{\pi})'\, \hat{C}^{-1}\, \bar{f}(\hat{\beta}, \hat{\pi}) \xrightarrow{d} \chi^2_k.

This statistic can be used for the standard test of over-identifying restrictions.
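As an implementation sketch of this over-identification statistic: given an N × (3k + 1) array of moment contributions f_i evaluated at the two-step estimates, the statistic is the minimized objective scaled by N and is compared against a chi-squared distribution with k degrees of freedom. The function below is a sketch under those assumptions, not a full implementation of the estimator.

# Hansen J statistic from the two-step GMM fit: f_all is an (N, 3k+1) array of
# per-unit moment contributions evaluated at the estimates; the degrees of
# freedom equal the number of over-identifying restrictions (k here).
import numpy as np
from scipy.stats import chi2

def hansen_j(f_all, n_overid):
    N = f_all.shape[0]
    fbar = f_all.mean(axis=0)
    C_hat = f_all.T @ f_all / N                 # sample analogue of C
    J = N * fbar @ np.linalg.inv(C_hat) @ fbar  # minimized GMM objective times N
    return J, chi2.sf(J, df=n_overid)           # statistic and p-value

# Usage: J, pval = hansen_j(f_all, n_overid=k), with k the number of
# always-observed covariates (3k + 1 moments, 2k + 1 parameters).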
Note that this statistic is just the GMM objective function in (3.4.11) evaluated at the efficient values of the parameters, and is distributed as chi-squared with degrees of freedom equal to the number of over-identifying restrictions. 3.5 Comparison to related estimators 3.5.1 Complete cases estimator The most common practice in the presence of missing data is to just use the complete cases for estimation; that is, only use the observations for which 𝑥1 is observed. One estimator that uses only complete cases is a GMM estimator based only on ℎ1𝑖 (.) which is defined as follows. Definition 3.5.1.1 Call the estimator of 𝛽 that solves (3.4.11), where 𝑓𝑖 (.) contains only 𝑓1𝑖 (.) and Ω̂ = 𝐼, the complete cases estimator (or 𝛽ˆ𝐶𝐶 ). Since 𝑓1𝑖 (.) is an exactly identified set of moment functions, the weight matrix is irrelevant for this estimation procedure. The asymptotic variance of this estimator is given in the following result. Lemma 3.5.1.1 Under Assumptions 3.2.1’, 3.3.1, 3.3.2 and 3.4.1, the complete cases estimator 𝛽ˆ𝐶𝐶 has an asymptotic variance given by √ 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛽ˆ𝐶𝐶 − 𝛽)] = (𝐷 011𝐶11 −1 𝐷 ) −1 . 11 74 This estimator simply ignores the information in the observations with missing 𝑥1 . 𝛽ˆ𝐽𝑜𝑖𝑛𝑡𝐹𝐸 allows for utilization of this information, leading to potential efficiency gains. The gain in efficiency just follows from the fact that adding valid moment conditions [in this case, 𝑓2𝑖 (.) and 𝑓3𝑖 (.)] decreases, or at least does not increase, the asymptotic variance of a GMM estimator. Proposition 3.5.1.1 Under Assumptions 3.2.1’, 3.2.2’, 3.3.1, 3.3.2, and 3.4.1, √ √ 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛽ˆ𝐶𝐶 − 𝛽)] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛽ˆ𝐽𝑜𝑖𝑛𝑡𝐹𝐸 − 𝛽)] is positive semi-definite. 3.5.2 Dummy variable method For cross section data, the dummy variable method refers to setting the missing values of the covariate to zero and using an indicator for whether the covariate is missing as an additional covariate. Jones (1996) showed that this generally leads to biased and inconsistent estimates for the case of cross section data. For panel data, one way the dummy variable method could proceed is the following. Note that using (3.2.1) and (3.2.2), we can write 𝑦𝑖𝑡 = 𝛽1 [𝑠𝑖𝑡 𝑥1𝑖𝑡 + (1 − 𝑠𝑖𝑡 )(𝑥2𝑖𝑡 𝜋 + 𝑑𝑖 + 𝑟𝑖𝑡 )] + 𝑥 2𝑖𝑡 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑡 . (3.5.1) Now, separating the intercept in the imputation model (3.2.2), we get 𝑥1𝑖𝑡 = 𝜋1 + 𝑥 22𝑖𝑡 𝜋2 + 𝑑𝑖 + 𝑟𝑖𝑡 , (3.5.2) where 𝑥 2𝑖𝑡 = [1 𝑥 22𝑖𝑡 ]. Substituting (3.5.2) in (3.5.1) and rearranging gives 𝑦𝑖𝑡 = 𝛽1 𝑠𝑖𝑡 𝑥 1𝑖𝑡 + 𝛽1 𝜋1 (1 − 𝑠𝑖𝑡 ) + 𝑥 2𝑖𝑡 𝛽2 + 𝑒𝑖𝑡 , (3.5.3) where 𝑒𝑖𝑡 ≡ 𝛽1 (1 − 𝑠𝑖𝑡 )(𝑥 22𝑖𝑡 𝜋2 + 𝑑𝑖 + 𝑟𝑖𝑡 ) + 𝑐𝑖 + 𝑢𝑖𝑡 . The dummy variable method omits the term (1 − 𝑠𝑖𝑡 )𝑥 22𝑖𝑡 𝜋2 𝛽1 from the model and includes it in the error term. This omitted variable bias is the source of inconsistency of this method, and hence even when the data is missing completely at random, neither POLS nor fixed effects consistently estimates the parameters in the model under the assumptions made so far. 75 As is expected, POLS on (3.5.3) is additionally inconsistent because 𝑒𝑖𝑡 contains 𝑐𝑖 and 𝑑𝑖 which are correlated with 𝑥𝑖𝑡 . But even fixed effects estimation of (3.5.3) is additionally inconsistent as it does not get rid of the term (1 − 𝑠𝑖𝑡 )𝑑𝑖 in the error, which is correlated with 𝑥𝑖𝑡 . The fixed effects estimator where we time demean using all observations proceeds as follows. 
Averaging (3.5.3) across 𝑡 for each 𝑖 and then subtracting the averaged equation from (3.5.3) gives 𝑦` 𝑖𝑡 = 𝛽1 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 + 𝛽1 𝜋1 (1 − 𝑠`𝑖𝑡 ) + 𝑥`2𝑖𝑡 𝛽2 + 𝑒`𝑖𝑡 , (3.5.4) where 𝑦` 𝑖𝑡 = 𝑦𝑖𝑡 − 𝑇 −1 𝑇𝑞=1 𝑦𝑖𝑞 , 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 = 𝑠𝑖𝑡 𝑥1𝑖𝑡 − 𝑇 −1 𝑇𝑞=1 𝑠𝑖𝑞 𝑥1𝑖𝑞 and so on. Estimating this Í Í equation using POLS gives the dummy variable estimator 𝛽ˆ 𝐷 . This estimator is inconsistent unless we impose the restrictions that certain objects are zero in the model. Proposition 3.5.2.1. Under Assumptions 3.2.1’, 3.2.2’, and 3.4.1, 𝛽ˆ 𝐷 is inconsistent unless (i) 𝛽1 = 0 or (ii) 𝜋2 = 0 and 𝑑𝑖 = 0 ∀ 𝑖. The first condition is setting 𝛽1 = 0, which clearly gets rid of both sources of inconsistency in this model. If 𝛽1 = 0, 𝑒`𝑖𝑡 = 𝑢`𝑖𝑡 , which is clearly uncorrelated with the regressors in (3.5.3) under Assumption 3.2.1. Intuitively, this condition implies that 𝑥 1𝑖𝑡 is irrelevant in model of interest (3.2.1). In this case, the best solution is to drop it and use all observations to estimate 𝛽2 in (3.2.1) using a standard fixed effects estimator that is used when there is no missingness. The second condition implies that first, there is no unobserved heterogeneity in the variable with missing values 𝑥 1𝑖𝑡 . As mentioned above, this condition is required because the fixed effects transformation does not get rid of 𝑑𝑖 in (3.5.3) because it is now multiplied by (1 − 𝑠𝑖𝑡 ). But even if 𝑑𝑖 = 0 ∀ 𝑖, this estimator is inconsistent because of omitting the term (1 − 𝑠𝑖𝑡 )𝑥 22𝑖𝑡 𝜋2 𝛽1 . Therefore, we need an additional condition that 𝜋2 = 0, which intuitively means that 𝑥2𝑖𝑡 does not help in predicting 𝑥1𝑖𝑡 . 3.5.3 Regression imputation Regression imputation is a two-step method which proceeds as following. In the first step, estimate 𝜋 in (3.2.2) using POLS and complete cases only (call it 𝜋). ˜ In the second step, plug 𝜋˜ in 76 the equation ∗ + 𝑥 𝜔 + 𝑒𝑟𝑟𝑜𝑟 , 𝑦𝑖𝑡 = 𝜔1 𝑥1𝑖𝑡 (3.5.5) 2𝑖𝑡 2 𝑖𝑡 where 𝑥1𝑖𝑡∗ ≡ 𝑠 𝑥 + (1 − 𝑠 )𝑥 𝜋. This is the “composite" 𝑥 which contains the true values of 𝑖𝑡 1𝑖𝑡 𝑖𝑡 2𝑖𝑡 1 𝑥 1 when it is observed (i.e. when 𝑠𝑖𝑡 = 1) and the predicted values from the imputation equation (3.2.2) when it is missing (i.e. when 𝑠𝑖𝑡 = 0). Then estimate 𝜔1 and 𝜔2 ) using fixed effects. To establish the performance of this estimator, recall that we can write using (3.2.1) and (3.2.2) 𝑦𝑖𝑡 = 𝛽1 [𝑠𝑖𝑡 𝑥1𝑖𝑡 + (1 − 𝑠𝑖𝑡 )(𝑥2𝑖𝑡 𝜋 + 𝑑𝑖 + 𝑟𝑖𝑡 )] + 𝑥 2𝑖𝑡 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑡 . (3.5.6) This boils down to the model of interest (3.2.1) when 𝑠𝑖𝑡 = 1 and to the reduced form (3.2.3) when 𝑠𝑖𝑡 = 0. Re-arrange this and write as 𝑦𝑖𝑡 = 𝛽1 [𝑠𝑖𝑡 𝑥1𝑖𝑡 + (1 − 𝑠𝑖𝑡 )𝑥2𝑖𝑡 𝜋] + 𝑥2𝑖𝑡 𝛽2 + [(1 − 𝑠𝑖𝑡 )𝑑𝑖 𝛽1 + 𝑐𝑖 ] + [(1 − 𝑠𝑖𝑡 )𝑟𝑖𝑡 𝛽1 + 𝑢𝑖𝑡 ] ∗ + 𝑥 𝛽 + [(1 − 𝑠 )𝑑 𝛽 + 𝑐 ] + [(1 − 𝑠 )𝑟 𝛽 + 𝑢 ]. ≡ 𝛽1 𝑥 1𝑖𝑡 (3.5.7) 2𝑖𝑡 2 𝑖𝑡 𝑖 1 𝑖 𝑖𝑡 𝑖𝑡 1 𝑖𝑡 Comparing (3.5.7) with (3.5.5), we note that the error in (3.5.5) contains both of the last two terms in (3.5.7), that is, the term that occurs due to the idiosyncratic errors in the model of interest and the imputation model as well as the term that occurs due to the unobserved heterogeneities in the two models. The issue with plugging 𝜋˜ in (3.5.7) and then estimating using fixed effects is twofold. First, estimating (3.2.2) using POLS and not fixed effects will lead to an inconsistent estimator of 𝜋 due to the presence of 𝑑𝑖 in (3.2.2). 
Second, and more importantly, even if one gets a consistent estimate of 𝜋 using fixed effects on (3.2.2) and plugs it in (3.5.7), a standard fixed effects on this equation does not produce consistent estimates of 𝛽1 and 𝛽2 because the unobserved heterogeneity term [(1 − 𝑠𝑖𝑡 )𝑑𝑖 𝛽1 + 𝑐𝑖 ] is not time constant anymore and hence cannot be eliminated by the standard fixed effects transformation. This method is therefore generally going to be inconsistent due to the presence of 𝑑𝑖 in the imputation model. A sequential estimator that is consistent is the following. First estimate 𝜋 using 𝑓2𝑖 (.), plug the estimated 𝜋 into 𝑓3𝑖 (.), and then estimate 𝛽 using 𝑓1𝑖 (.) and 𝑓3𝑖 (.) together. Definition 3.5.3.1. Call the following two-step estimator the sequential GMM (or [ 𝛽ˆ0𝑆𝑒𝑞 𝜋ˆ 0𝑆𝑒𝑞 ] 0). 77 Step 1: Obtain 𝜋ˆ 𝑆𝑒𝑞 by solving (3.4.11), where 𝑓𝑖 (.) contains only 𝑓2𝑖 (.) and Ω̂ = 𝐼. Step 2: Obtain 𝛽ˆ𝑆𝑒𝑞 by solving (3.4.11), where     Í𝑇 𝑠 𝑥¥ 0 ( 𝑦¥ − 𝑥¥ 𝛽 − 𝑥¥ 𝛽 ) 𝑓 (𝛽, 𝜋) 1𝑖𝑡 1 2𝑖𝑡 2   1𝑖     𝑡=1 𝑖𝑡 𝑖𝑡 𝑖𝑡 𝑓𝑖 (𝛽, 𝜋ˆ 𝑆𝑒𝑞 ) = Í ≡     𝑇 (1 − 𝑠 ) 𝑥¤0 𝑦¤ − 𝑥¤ (𝛽 𝜋ˆ + 𝛽 )    𝑓 (𝛽, 𝜋ˆ )  2𝑖𝑡 1 𝑆𝑒𝑞 2   3𝑖  𝑡=1 𝑖𝑡 2𝑖𝑡 𝑖𝑡   𝑆𝑒𝑞    and " # −1 Õ𝑁 Ω̂ = 𝑁 −1 𝑓𝑖 (𝛽, 𝜋ˆ 𝑆𝑒𝑞 ) 𝑓𝑖 (𝛽, 𝜋ˆ 𝑆𝑒𝑞 ) 0 . 𝑖=1 As is well known, sequential GMM estimators are generally less, or at least no more, efficient than joint GMM estimators that use the same moment conditions. Therefore, 𝛽ˆ𝑆𝑒𝑞 is generally less efficient than 𝛽ˆ𝐽𝑜𝑖𝑛𝑡𝐹𝐸 8 and there would be no reason to choose it other than computational convenience. 3.5.4 Mundlak device In the case of balanced panels, it is well known that the Mundlak device which adds time averages of the covariates as additional explanatory variables in equation (3.2.1) and estimates the model using POLS is numerically equivalent to the fixed effects estimator (Mundlak, 1978). Wooldridge (2019) shows that this numerical equivalence carries over to the case of unbalanced panels as well. In equation (3.2.1), if we include time averages of 𝑥𝑖𝑡 computed using only the complete cases as additional covariates and estimate the model using POLS on complete cases only, then this estimator is numerically equivalent to the complete cases fixed effects estimator 𝛽ˆ𝐶𝐶 . This suggests an alternative to the joint fixed effects GMM estimator introduced in Section 4. Instead of time demeaning each of the equations (3.2.1)-(3.2.3), we can use the Mundlak device for each of them. Consider first equation (3.2.1) and write 𝑐𝑖 = 𝜓1 + 𝜉11 𝑥¯1𝑖 + 𝑥¯2𝑖 𝜉12 + 𝑎 1𝑖 ≡ 𝜓1 + 𝑥¯𝑖 𝜉1 + 𝑎 1𝑖 . (3.5.8) 8Prokhorov and Schmidt (2009), Theorem 2.2, part 5. 78 This is a model that explains the unobserved heterogeneity 𝑐𝑖 in terms of the time averages of covariates in equation (3.2.1), where the averaging has been done using the complete cases only. We impose the following zero conditional mean assumption on the error 𝑎 1𝑖 . Assumption 3.5.4.1. E(𝑎 1𝑖 |x𝑖 , s𝑖 ) = 0. This implies first that E(𝑐𝑖 | 𝑥¯𝑖 ) = 𝜓1 + 𝑥¯𝑖 𝜉1 . Second, it implies that selection in all time periods is uncorrelated with the error 𝑎 1𝑖 . Plugging (3.5.8) into (3.2.1), we get 𝑦𝑖𝑡 = 𝑥𝑖𝑡 𝛽 + 𝜓1 + 𝑥¯𝑖 𝜉1 + 𝑎 1𝑖 + 𝑢𝑖𝑡 . (3.5.9) Let 𝑥´𝑖𝑡 = [1 𝑥𝑖𝑡 𝑥¯𝑖 ]. Estimating this model using POLS with the 𝑠𝑖𝑡 = 1 observations is equivalent to doing GMM with the following moment functions 𝑔1𝑖 (𝛽, 𝜓1 , 𝜉1 ) = 𝑠𝑖𝑡 𝑥´𝑖𝑡0 (𝑦𝑖𝑡 − 𝑥𝑖𝑡 𝛽 − 𝜓1 − 𝑥¯𝑖 𝜉1 ). (3.5.10) Similarly, for the unobserved heterogeneity in the imputation model in equation (3.2.2), we can write 𝑑𝑖 = 𝜓2 + 𝑥¯2𝑖 𝜉2 + 𝑎 2𝑖 . 
(3.5.11) Analogous to Assumption 3.5.4.1, we place the following assumption on the error term 𝑎 2𝑖 , which implies that E(𝑑𝑖 | 𝑥¯2𝑖 ) = 𝜓2 + 𝑥¯2𝑖 𝜉2 and that selection in all time periods is uncorrelated with 𝑎 2𝑖 . Assumption 3.5.4.2. E(𝑎 2𝑖 |x2𝑖 , s𝑖 ) = 0. Plugging (3.5.11) into equation (3.2.2), we get 𝑥 1𝑖𝑡 = 𝑥 2𝑖𝑡 𝜋 + 𝜓2 + 𝑥¯2𝑖 𝜉2 + 𝑎 2𝑖 + 𝑟𝑖𝑡 . (3.5.12) Let 𝑥´2𝑖𝑡 = [1 𝑥 2𝑖𝑡 𝑥¯2𝑖 ]. Estimating this model using POLS with the 𝑠𝑖𝑡 = 1 observations is equivalent to doing GMM with the following moment functions. 0 (𝑥 − 𝑥 𝜋 − 𝜓 − 𝑥¯ 𝜉 ). 𝑔2𝑖 (𝜋, 𝜓2 , 𝜉2 ) = 𝑠𝑖𝑡 𝑥´2𝑖𝑡 (3.5.13) 1𝑖𝑡 2𝑖𝑡 2 2𝑖 2 For the reduced form in equation (3.2.3), we first plug in for the unobserved heterogeneity ℎ𝑖 using (3.5.8) and (3.5.11). Recall that ℎ𝑖 ≡ 𝛽1 𝑑𝑖 + 𝑐𝑖 . We first obtain 𝑐𝑖 as a function of 𝑥¯2𝑖 . To do this, 79 we substitute for 𝑥¯1𝑖 in (3.5.8) using equation (3.2.2). Averaging (3.2.2) over all time periods for which 𝑠𝑖𝑡 = 1, we get 𝑥¯1𝑖 = 𝑥¯2𝑖 𝜋 + 𝑑𝑖 + 𝑟¯𝑖 . (3.5.14) Plugging in for 𝑑𝑖 from (3.5.11) in this equation, we have 𝑥¯1𝑖 = 𝑥¯2𝑖 (𝜋 + 𝜉2 ) + 𝜓2 + 𝑎 2𝑖 + 𝑟¯𝑖 . (3.5.15) Plugging this into equation (3.5.8), 𝑐𝑖 = 𝜓1 + 𝜉11 [𝑥¯2𝑖 (𝜋 + 𝜉2 ) + 𝜓2 + 𝑎 2𝑖 + 𝑟¯𝑖 ] + 𝑥¯2𝑖 𝜉12 + 𝑎 1𝑖 . (3.5.16) Thus, using equations (3.5.11) and (3.5.16), we can write ℎ𝑖 as ℎ𝑖 ≡ 𝛽1 𝑑𝑖 + 𝑐𝑖 = 𝛽1 (𝜓2 + 𝑥¯2𝑖 𝜉2 + 𝑎 2𝑖 ) + 𝜓1 + 𝜉11 [ 𝑥¯2𝑖 (𝜋 + 𝜉2 ) + 𝜓2 + 𝑎 2𝑖 + 𝑟¯𝑖 ] + 𝑥¯2𝑖 𝜉12 + 𝑎 1𝑖 . (3.5.17) Plugging this into equation (3.2.3) and re-arranging, we get 𝑦𝑖𝑡 = 𝑥 2𝑖𝑡 𝛾 + 𝜓 + 𝑥¯2𝑖 𝛿 + 𝑒𝑟𝑟𝑜𝑟𝑖𝑡 . (3.5.18) where 𝜓 ≡ 𝜓1 + 𝜉11 𝜓2 + 𝛽1 𝜓2 , 𝛿 ≡ 𝜉11 (𝜋 + 𝜉2 ) + 𝜉12 + 𝛽1 𝜉2 , and 𝑒𝑟𝑟𝑜𝑟𝑖𝑡 ≡ 𝜉11 (𝑎 2𝑖 + 𝑟¯𝑖 ) + 𝑎 1𝑖 + 𝛽1 𝑎 2𝑖 + 𝑣𝑖𝑡 . Estimating this model using POLS with the 𝑠𝑖𝑡 = 0 observations is equivalent to doing GMM with the following moment functions. 𝑔3𝑖 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) = (1 − 𝑠𝑖𝑡 ) 𝑥´2𝑖𝑡 0 (𝑦 − 𝑥 𝛾 − 𝜓 − 𝑥¯ 𝛿). (3.5.19) 𝑖𝑡 2𝑖𝑡 2𝑖 So the final set of moment functions is given by 0 (𝑦 − 𝑥 𝛽 − 𝜓 − 𝑥¯ 𝜉 )  Í𝑇      𝑡=1 𝑠 𝑖𝑡 ´ 𝑥 𝑖𝑡 𝑖𝑡 𝑖𝑡 1 𝑖 1     𝑔 1𝑖 (𝛽, 𝜓 1 , 𝜉 1 )    Í    𝑔𝑖 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) =   𝑇 𝑠 ´ 𝑥 0 (𝑥 − 𝑥 𝜋 − 𝜓 − ¯ 𝑥 𝜉 )  ≡  𝑔 (𝜋, 𝜓 , 𝜉 ) . 𝑡=1 𝑖𝑡 2𝑖𝑡 1𝑖𝑡 2𝑖𝑡 2 2𝑖 2  2𝑖 2 2  Í     𝑇  𝑡=1 (1 − 𝑠𝑖𝑡 ) 𝑥´2𝑖𝑡 0 (𝑦 − 𝑥 𝛾 − 𝜓 − 𝑥¯ 𝛿)   𝑔 (𝛽, 𝜋, 𝜓 , 𝜓 , 𝜉 , 𝜉 )   𝑖𝑡 2𝑖𝑡 2𝑖   3𝑖  1 2 1 2  (3.5.20) Lemma 3.5.4.1. Under Assumptions 3.5.4.1 and 3.5.4.2, E[𝑔𝑖 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 )] = 0. The rest of the GMM estimation proceeds as usual using the moment conditions in (3.5.21). Define the variance-covariance matrix of the moment functions in (3.5.20) as Λ ≡ E[𝑔𝑖 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 )𝑔𝑖 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) 0]. (3.5.21) 80 and let Λ̂ be a consistent estimate of Λ. Then we define the optimal GMM estimator based on moment conditions (3.5.20) as follows. Definition 3.5.4.1. Call the estimator of [𝛽0 𝜋0 𝜓10 𝜓20 𝜉10 𝜉20 ] 0 that solves min 𝑔(𝛽, ¯ 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) 0 Ω̂ 𝑔(𝛽, ¯ 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ), (3.5.22) 𝛽,𝜋 and Ω̂ = Λ̂−1 , the joint Mundlak Í𝑁 ¯ where 𝑔(𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) = 𝑖=1 𝑖 𝑔 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) estimator. Denote the estimator of 𝛽 from this vector as 𝛽ˆ𝐽𝑜𝑖𝑛𝑡 𝑀𝑢𝑛𝑑𝑙𝑎𝑘 . 3.6 Estimation under sequential exogeneity As is well known, the strict exogeneity Assumption 3.2.1 rules out lagged dependent variables and feedback from past shocks to current covariates in the model of interest (3.2.1).9 For instance, if 𝑥𝑖𝑡 contains a policy variable, then Assumption 3.2.1 imposes that there is no feedback where policy is more likely to occur based on past shocks. 
Or if (3.2.1) is a wage equation and one of the covariates is union status, then it rules out a negative wage shock today leading to someone deciding to join the union next year. Assumption 3.2.2’ imposes these restrictions on the imputation model (3.2.2). In order to allow for such effects, we relax Assumption 3.2.1 and 3.2.2 to sequential exogeneity Assumptions 3.6.1 and 3.6.2. Assumption 3.6.1. E(x𝑖𝑡0𝑢𝑖𝑡 ) = 0, 𝑡 = 1, . . . , 𝑇, where x𝑖𝑡 = (𝑥𝑖𝑡 , 𝑥𝑖,𝑡−1 , . . . , 𝑥𝑖1 ). This assumes correct distributed lag dynamics but is silent on feedback as it allows for 𝑢𝑖𝑡 to be arbitrarily correlated with 𝑥𝑖,𝑡+𝑠 for 𝑠 ∈ {1, . . . , 𝑇 − 𝑡}. For the imputation model (3.2.2), we relax Assumption 3.2.2 to the following. Assumption 3.6.2: E(x𝑡0 𝑟 ) = 0, 2𝑖 𝑖𝑡 where x𝑡2𝑖 = (𝑥 2𝑖𝑡 , 𝑥2𝑖,𝑡−1 , . . . , 𝑥 2𝑖1 ). Under these assumptions, we can use an alternative transformation called “forward orthogonal- ization" suggested by Arellano & Bover (1995). It demeans data using average over future time 9Wooldridge (2010), Chapter 10 81 periods instead of average over all time periods. It thus preserves sequential exogeneity while still using as much data as possible. We begin with the model of interest (3.2.1). At time 𝑡 ≤ 𝑇 − 1, consider the equations for 𝑡 + 1, . . . , 𝑇. 𝑦𝑖,𝑡+1 = 𝛽1 𝑥 1𝑖,𝑡+1 + 𝑥 2𝑖,𝑡+1 𝛽2 + 𝑐𝑖 + 𝑢𝑖,𝑡+1 .. . 𝑦𝑖𝑇 = 𝛽1 𝑥1𝑖𝑇 + 𝑥2𝑖𝑇 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑇 . In order to time demean (3.2.1), we can naturally use only those future time periods for which 𝑥 1 is observed. Define 𝑇 Õ 𝑇𝑖 (𝑡) = 𝑠𝑖𝑞 (3.6.1) 𝑞=𝑡+1 as the number of time periods for which 𝑥 1 is observed after time 𝑡 for unit 𝑖. Multiply each equation for 𝑡 + 1 ≤ 𝑞 ≤ 𝑇 by 𝑠𝑖𝑞 and sum Õ 𝑇  Õ 𝑇   Õ 𝑇   Õ𝑇  𝑠𝑖𝑞 𝑦𝑖𝑞 = 𝛽1 𝑠𝑖𝑞 𝑥1𝑖𝑞 + 𝑠𝑖𝑞 𝑥 2𝑖𝑞 𝛽2 + 𝑇𝑖 (𝑡)𝑐𝑖 + 𝑠𝑖𝑞 𝑢𝑖𝑞 . (3.6.2) 𝑞=𝑡+1 𝑞=𝑡+1 𝑞=𝑡+1 𝑞=𝑡+1 Multiplying through by 𝑇𝑖 (𝑡) −1 gives 𝑦¯ 𝑖 (𝑡) = 𝛽1 𝑥¯1𝑖 (𝑡) + 𝑥¯2𝑖 (𝑡) 𝛽2 + 𝑐𝑖 + 𝑢¯𝑖 (𝑡), (3.6.3) where 𝑦¯ 𝑖 (𝑡) = 𝑇𝑖 (𝑡) −1 𝑇𝑞=𝑡+1 𝑠𝑖𝑞 𝑦𝑖𝑞 is the average of the observed 𝑦𝑖𝑞 after time 𝑡 and 𝑥¯1𝑖 (𝑡), 𝑥¯2𝑖 (𝑡) Í and 𝑢¯𝑖 (𝑡) are defined similarly. Subtracting this equation from (3.2.1), which is the equation at time 𝑡 gives 𝑦𝑖𝑡 − 𝑦¯ 𝑖 (𝑡) = 𝛽1 [𝑥 1𝑖𝑡 − 𝑥¯1𝑖 (𝑡)] + [𝑥 2𝑖𝑡 − 𝑥¯2𝑖 (𝑡)] 𝛽2 + [𝑢𝑖𝑡 − 𝑢¯𝑖 (𝑡)] (3.6.4) or 𝑦˜ 𝑖 (𝑡) = 𝛽1 𝑥˜1𝑖 (𝑡) + 𝑥˜2𝑖 (𝑡) 𝛽2 + 𝑢˜𝑖 (𝑡). (3.6.5) Subtracting the forward averages thus eliminates 𝑐𝑖 just as with the usual within transformation. Now we use 𝑥1𝑖 𝑝 and 𝑥 2𝑖 𝑝 , 𝑝 ≤ 𝑡 as instrumental variables in this equation, and use only those time 82 periods for which 𝑠𝑖𝑡 = 1, i.e. the complete cases. This gives the following moment functions.   𝑠𝑖 𝑝 𝑥 𝑠𝑖𝑡 [ 𝑦˜ 𝑖 (𝑡) − 𝛽 𝑥˜ (𝑡) − 𝑥˜ (𝑡) 𝛽 ]  1𝑖 𝑝 1 1𝑖 2𝑖 2  𝑚 1𝑖 (𝛽) =  𝑝 ≤ 𝑡, 𝑡 = 1, . . . , 𝑇 − 1. (3.6.6)    𝑥 0 𝑠 [ 𝑦˜ (𝑡) − 𝛽 𝑥˜ (𝑡) − 𝑥˜ (𝑡) 𝛽 ]   2𝑖 𝑝 𝑖𝑡 𝑖 1 1𝑖 2𝑖 2    We require an additional selection indicator for the first set of moment conditions here as in addition to 𝑥 1𝑖𝑡 , these moment conditions also require 𝑥 1𝑖 𝑝 to be observed for it to be used as an instrumental variable. Since the moment conditions in (3.6.6) utilize only the complete cases, they leave room for gains in efficiency by utilizing the incomplete cases. We can again implement forward orthogonalization with time demeaning using complete cases to estimate 𝜋 in (3.2.2). Similar to (3.6.4), we can write 𝑥 1𝑖𝑡 − 𝑥¯1𝑖 (𝑡) = [𝑥 2𝑖𝑡 − 𝑥¯2𝑖 (𝑡)]𝜋 + [𝑟𝑖𝑡 − 𝑟¯𝑖 (𝑡)], (3.6.7) where 𝑟¯𝑖 (𝑡) = 𝑇𝑖 (𝑡) −1 𝑇𝑞=𝑡+1 𝑠𝑖𝑞 𝑟𝑖𝑞 . Multiplying through by 𝑇𝑖 (𝑡) −1 , we get Í 𝑥˜1𝑖 (𝑡) = 𝑥˜2𝑖 (𝑡)𝜋 + 𝑟˜𝑖 (𝑡). 
(3.6.8) Using 𝑥2𝑖 𝑝 , 𝑝 ≤ 𝑡 as instrumental variables and using only the complete cases, we get the moment functions 𝑚 2𝑖 (𝜋) = 𝑥2𝑖 0 𝑠 [ 𝑥˜ (𝑡) − 𝑥˜ (𝑡)𝜋] 𝑝 ≤ 𝑡, 𝑡 = 1, . . . , 𝑇 − 1. (3.6.9) 𝑝 𝑖𝑡 1𝑖 2𝑖 Similar to Section 4, the moment conditions that allow gains in efficiency come from the reduced form (3.2.3). Here we do the forward orthogonalization using incomplete cases. Let 𝑇 𝑦˘ 𝑖 (𝑡) = 𝑦𝑖𝑡 − 𝑇 − 𝑡 − 𝑇𝑖 (𝑡) −1  Õ (1 − 𝑠𝑖𝑞 )𝑦𝑖𝑞 𝑞=𝑡+1 𝑇  −1 Õ 𝑥˘2𝑖 (𝑡) = 𝑥2𝑖𝑡 − 𝑇 − 𝑡 − 𝑇𝑖 (𝑡) (1 − 𝑠𝑖𝑞 )𝑥2𝑖𝑞 . 𝑞=𝑡+1 We can then write 𝑦˘ 𝑖 (𝑡) = 𝑥˘2𝑖 (𝑡)𝛾 + 𝑣˘ 𝑖𝑡 . (3.6.10) We estimate 𝛾 ≡ (𝛽1 𝜋 + 𝛽2 ) using incomplete cases as well. This gives moment functions 0 (1 − 𝑠 ) [ 𝑦˘ (𝑡) − 𝑥˘ (𝑡)(𝛽 𝜋 + 𝛽 )] 𝑚 3𝑖 (𝛽, 𝜋) = 𝑥 2𝑖 𝑝 ≤ 𝑡, 𝑡 = 1, . . . , 𝑇 − 1. (3.6.11) 𝑝 𝑖𝑡 𝑖 2𝑖 1 2 83 The full set of moment functions is given by    𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 [ 𝑦˜ 𝑖 (𝑡) − 𝛽1 𝑥˜1𝑖 (𝑡) − 𝑥˜2𝑖 (𝑡) 𝛽2 ]       𝑚 1𝑖 (𝛽)      0   𝑥 2𝑖 𝑝 𝑠𝑖𝑡 [ 𝑦˜ 𝑖 (𝑡) − 𝛽1 𝑥˜1𝑖 (𝑡) − 𝑥˜2𝑖 (𝑡) 𝛽2 ]     𝑚𝑖 (𝛽, 𝜋) =  𝑚 2𝑖 (𝜋)  =   𝑝 ≤ 𝑡, 𝑡 = 1, . . . , 𝑇 − 1. 0     𝑥 2𝑖 𝑝 𝑠𝑖𝑡 [ 𝑥˜1𝑖 (𝑡) − 𝑥˜2𝑖 (𝑡)𝜋]  𝑚 3𝑖 (𝛽, 𝜋)          𝑥 0 (1 − 𝑠 ) [ 𝑦˘ (𝑡) − 𝑥˘ (𝑡)(𝛽 𝜋 + 𝛽 )]   2𝑖 𝑝 𝑖𝑡 𝑖 2𝑖 1 2    (3.6.12) The moment functions 𝑚𝑖 (𝛽, 𝜋) have a zero mean if Assumptions 3.6.1 and 3.6.2 hold and s𝑖 |= (x𝑖 , u𝑖 , r𝑖 ).10 However, if we want to allow the selection to be more general (for instance, depend on x2𝑖 or other unobserved variables), we need to strengthen Assumptions 3.6.1 and 3.6.2 to the following zero conditional mean assumptions. Assumption 3.6.1’: E(𝑢𝑖𝑡 |x𝑖𝑡 , s𝑖 , 𝑐𝑖 ) = 0, 𝑡 = 1, . . . , 𝑇 . Assumption 3.6.2’: E(𝑟𝑖𝑡 |x𝑡2𝑖 , s𝑖 , 𝑑𝑖 ) = 0, 𝑡 = 1, . . . .𝑇 . Note that although Assumptions 3.6.1’ and 3.6.2’ allow the covariates to be sequentially exoge- nous in both the model of interest (3.2.1) and the imputation model (3.2.2), selection is assumed to be strictly exogenous in both models. This is because in the moment functions 𝑚 1𝑖 (𝛽) and 𝑚 2𝑖 (𝜋), 𝑦¯ 𝑖 (𝑡), 𝑥¯1𝑖 (𝑡) and 𝑥¯2𝑖 (𝑡) depend non-linearly on all selection indicators from 𝑡 + 1 to 𝑇 and we use instruments with 𝑝 ≤ 𝑡. Therefore, we need selection to be strictly, and not just sequentially, exogenous for these moment functions to have a zero mean. Moreover, Assumption 3.6.1’ allows selection to be arbitrarily correlated with x𝑖 and 𝑐𝑖 . Assumption 3.6.2’ allows selection to be arbitrarily correlated with x2𝑖 and 𝑑𝑖 , but it rules out selection depending on 𝑥 1 once we condition on 𝑥 2 . Thus together, Assumptions 3.6.1’ and 3.6.2’ allow selection to depend on x2𝑖 , 𝑐𝑖 and 𝑑𝑖 , but not r𝑖 or u𝑖 . We summarize the conditions under which the moment functions in (3.6.12) have an expected value of zero in the following lemma. Lemma 3.6.1: E[𝑚𝑖 (𝛽, 𝜋)] = 0 if either of the following conditions hold. (i) s𝑖 |= (x𝑖 , u𝑖 , r𝑖 ) and Assumptions 3.6.1 and 3.6.2 hold. 10Recall that this is weaker than MCAR as it allows s𝑖 to depend on 𝑐𝑖 and 𝑑𝑖 . 84 (ii) 𝑠𝑖𝑡 is a function of x2𝑖 or some other random variable 𝑤𝑖𝑡 and Assumptions 3.6.1’ and 3.6.2’ hold.. Then, E[𝑚𝑖 (𝛽, 𝜋)] = 0 gives us a set of (3𝑘 + 1)𝑇 (𝑇 − 1)/2 moment conditions with 2𝑘 + 1 parameters and hence number of over-identifying restrictions depends on 𝑇. We can use the regular two-step GMM estimator using these moment conditions. One way to test for exogeneity of s𝑖 with respect to {𝑢𝑖𝑡 : 𝑡 = 1, . . . , 𝑇 } is to include selection indicators from other time periods as covariates in equation (3.2.1) and check for their significance at time 𝑡. 
For instance, one might be concerned that a shock today causes people to drop out from the sample in the next time period. Then one can add 𝑠𝑖,𝑡+1 as a covariate at time 𝑡 (so that the last time period is lost), estimate the model using the moment conditions in (3.6.6), and compute the robust 𝑡-statistic on 𝑠𝑖,𝑡+1 . Another option is to use 𝑠𝑖,𝑡−1 as a covariate, but that does not work in the case of attrition when it is an absorbing state because if 𝑠𝑖𝑡 = 1 for 𝑖, then so is 𝑠𝑖,𝑡−1 . Note that this test can be used even if one is only using the complete cases11, as it does not even require us to write down the imputation equation (3.2.2). But when using the GMM based on full set of moment conditions in (3.6.12), one can also test for the exogeneity of s𝑖 with respect to {𝑟𝑖𝑡 : 𝑡 = 1, . . . , 𝑇 } by including 𝑠𝑖,𝑡+1 as a covariate in the imputation equation (3.2.2) at time 𝑡, estimating the model using moment conditions in (3.6.9), and computing the robust 𝑡-statistic on 𝑠𝑖,𝑡+1 . However, what we are most likely to be concerned about in an application is the contemporaneous selection problem, that is, 𝑠𝑖𝑡 being correlated with 𝑢𝑖𝑡 . But one cannot test for 𝑠𝑖𝑡 by including it as a covariate in either (3.2.1) or (3.2.2). This is because both of these models are estimated using complete cases and hence 𝑠𝑖𝑡 will always equal 1 for the observations used in moment conditions in (3.6.6) and (3.6.9). The reduced form in (3.2.3), however, provides a way to test for 𝑠𝑖𝑡 as it can be used for all observations 𝑖 irrespective of whether 𝑠𝑖𝑡 = 0 or 𝑠𝑖𝑡 = 1. Since 𝑦𝑖𝑡 and 𝑥 2𝑖𝑡 are observed for all observations, instead of (3.6.10), we can use the following 11that is, only the moment conditions in (3.6.6) 85 moment conditions. 0  E[𝑥 2𝑖 𝑝 𝑦˘ 𝑖𝑡 − 𝑥˘2𝑖𝑡 (𝛽1 𝜋 + 𝛽2 ) ] = 0 𝑝 ≤ 𝑡, 𝑡 = 1, . . . , 𝑇 − 1. (3.6.13) We have simply removed the (1 − 𝑠𝑖𝑡 ) from (3.6.10), which means that instead of restricting these moment conditions to the incomplete cases, we are using all observations. Then we can test for the exogeneity of 𝑠𝑖𝑡 with respect to {𝑣𝑖𝑡 : 𝑡 = 1, . . . , 𝑇 } by including 𝑠𝑖𝑡 as a covariate in the reduced form (3.2.3) at time 𝑡, estimating the model using the moment conditions in (3.6.12), and computing the robust 𝑡-statistic on 𝑠𝑖𝑡 . The null hypothesis here is that 𝑠𝑖𝑡 is uncorrelated with 𝑣𝑖𝑡 . Since 𝑣𝑖𝑡 = 𝑢𝑖𝑡 + 𝛽1𝑟𝑖𝑡 , if we reject the null, then we can conclude that 𝑠𝑖𝑡 is correlated with either 𝑢𝑖𝑡 or 𝑟𝑖𝑡 or both. Since we require both of these correlations to be zero in order for the moment conditions in (3.6.12) to be valid, a rejection would bring the validity of this method into question irrespective of which idiosyncratic error 𝑠𝑖𝑡 is correlated with. Finally, we can also use this test for 𝑠𝑖𝑡 in the framework of Section 4 where we are assuming strict exogeneity of the covariates with respect to the idiosyncratic errors. In that case, we simply include 𝑠𝑖𝑡 as a covariate in the reduced form (3.2.3) at time 𝑡 and estimate the model using the following moment conditions Õ 𝑇 E[ 0 𝑦¤ − 𝑥¤ (𝛽 𝜋 + 𝛽 )  ] = 0. 𝑥¤2𝑖𝑡 (3.6.14) 𝑖𝑡 2𝑖𝑡 1 2 𝑡=1 instead of those in (3.6.13), and computing the robust 𝑡-statistic on 𝑠𝑖𝑡 . The moment conditions in (3.6.14) are essentially the same as in (3.4.9) except that we have removed the selection indicator (1 − 𝑠𝑖𝑡 ) just like in the case of sequential exogeneity. Note that all the tests discussed here require that 𝑇 ≥ 3. 
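A sketch of the contemporaneous-selection variable addition test in its strict-exogeneity form: pooled OLS of the incomplete-case-demeaned y on the demeaned x_2 plus s_it, over all observations, with unit-clustered standard errors; the robust t-statistic on s_it is the test statistic. This pooled-OLS formulation is a stand-in for the exactly identified GMM based on (3.6.14); the long-format data layout and names are assumptions of the sketch.

# Variable addition test for contemporaneous selection (strict exogeneity case):
# regress y_dot on x2_dot and s_it using all observations, cluster by unit, and
# inspect the robust t-statistic on s_it.  x2_dot should exclude columns that
# are constant within a unit (they demean to zero).
import numpy as np
import statsmodels.api as sm

def selection_vat(y_dot, x2_dot, s, unit_id):
    """y_dot, s, unit_id: shape (NT,); x2_dot: shape (NT, k_demeaned)."""
    X = np.column_stack([x2_dot, s])             # add s_it as an extra covariate
    res = sm.OLS(y_dot, X).fit(cov_type="cluster", cov_kwds={"groups": unit_id})
    return res.tvalues[-1], res.pvalues[-1]      # robust t-stat and p-value on s_it

A rejection is evidence that s_it is correlated with v_it = u_it + β_1 r_it, in line with the null hypothesis described above.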
3.7 Conclusion

We have provided new methods for consistently imputing missing covariate values in linear panel data models with unobserved heterogeneity when using fixed effects. We provide imputation estimators under both strict and sequential exogeneity of the covariates. We relax some substantial assumptions made by currently used imputation estimators, most notably by allowing the covariate with missing values to contain individual-specific unobserved heterogeneity. We also provide two tests of the assumptions underlying our imputation procedure. The first is a GMM overidentification test of the validity of the moment conditions; the second is a novel variable addition test of whether missingness in a given time period is uncorrelated with the unobservables, both in the same time period and in other time periods.

APPENDICES

APPENDIX A

PROOFS FOR CHAPTER 1

Proof of Proposition 1.5.1.2

We know that
$$\mathrm{Avar}\big(\sqrt{n}\,[(\hat{\beta}',\, \mathrm{vec}(\hat{\Pi})')' - (\beta',\, \mathrm{vec}(\Pi)')']\big) = (D'C^{-1}D)^{-1}.$$
Now,
$$D'C^{-1}D =
\begin{bmatrix} D_{11}' & 0 \\ 0 & D_{22}' \end{bmatrix}
\begin{bmatrix} C_{11} & C_{12} \\ C_{12}' & C_{22} \end{bmatrix}^{-1}
\begin{bmatrix} D_{11} & 0 \\ 0 & D_{22} \end{bmatrix}
+
\begin{bmatrix} 0 & D_{41}' \\ D_{32}' & D_{42}' \end{bmatrix}
\begin{bmatrix} C_{33} & 0 \\ 0 & C_{44} \end{bmatrix}^{-1}
\begin{bmatrix} 0 & D_{32} \\ D_{41} & D_{42} \end{bmatrix}
\equiv G + HIH',$$
where
$$G \equiv \begin{bmatrix} D_{11}' & 0 \\ 0 & D_{22}' \end{bmatrix}
\begin{bmatrix} C_{11} & C_{12} \\ C_{12}' & C_{22} \end{bmatrix}^{-1}
\begin{bmatrix} D_{11} & 0 \\ 0 & D_{22} \end{bmatrix}, \qquad
H \equiv \begin{bmatrix} 0 & D_{41}' \\ D_{32}' & D_{42}' \end{bmatrix}, \qquad
I \equiv \begin{bmatrix} C_{33} & 0 \\ 0 & C_{44} \end{bmatrix}^{-1}.$$
Using the matrix inversion lemma,
$$(D'C^{-1}D)^{-1} = (G + HIH')^{-1} = G^{-1} - G^{-1}H\,(I^{-1} + H'G^{-1}H)^{-1}H'G^{-1},$$
and thus
$$G^{-1} - (D'C^{-1}D)^{-1} = G^{-1}H\,(I^{-1} + H'G^{-1}H)^{-1}H'G^{-1}. \qquad (A.1)$$
Let $E \equiv (I^{-1} + H'G^{-1}H)^{-1}$. Now,
$$G^{-1} = \begin{bmatrix} D_{11}^{-1} C_{11} D_{11}^{-1\prime} & D_{11}^{-1} C_{12} D_{22}^{-1\prime} \\ D_{22}^{-1} C_{21} D_{11}^{-1\prime} & D_{22}^{-1} C_{22} D_{22}^{-1\prime} \end{bmatrix},$$
and the asymptotic variance of the complete cases GMM is given by the upper left $(k+1) \times (k+1)$ block of $G^{-1}$. Therefore, the difference between the asymptotic variances of the complete cases estimator and the proposed estimator is given by the upper left $(k+1) \times (k+1)$ block of the expression on the right hand side of (A.1). For this we need the first $(k+1)$ columns of $H'G^{-1}$, which are given by
$$\begin{bmatrix} D_{32} D_{22}^{-1} C_{21} D_{11}^{-1\prime} \\ D_{41} D_{11}^{-1} C_{11} D_{11}^{-1\prime} + D_{42} D_{22}^{-1} C_{21} D_{11}^{-1\prime} \end{bmatrix}. \qquad (A.2)$$
For the difference corresponding to $\beta_1$, we need the first column of this matrix. To find it, consider
$$D_{11}^{-1} = [\mathrm{E}(s_1 s_2\, x'z)]^{-1}
= \begin{bmatrix} J^{-1} & -J^{-1} K_1 K_2^{-1} \\ -K_2^{-1} K_4 J^{-1} & (K_2 - K_4 K_3^{-1} K_1)^{-1} \end{bmatrix},$$
where $J \equiv \mathrm{E}(s_1 s_2 x_1' z_1) - \mathrm{E}(s_1 s_2 x_1 x_2)[\mathrm{E}(s_1 s_2 x_2' x_2)]^{-1}\mathrm{E}(s_1 s_2 x_2' z_1)$, $K_1 \equiv \mathrm{E}(s_1 s_2 x_1 x_2)$, $K_2 \equiv \mathrm{E}(s_1 s_2 x_2' x_2)$, $K_3 \equiv \mathrm{E}(s_1 s_2 x_1 z_1)$, and $K_4 \equiv \mathrm{E}(s_1 s_2 x_2' z_1)$.
The first column and the last 𝑘 columns of this matrix are given by 𝑊1 and 𝑊2 respectively, where    1   −1 𝑊1 =  (A.0.1)  𝐽 −𝐾 −1 𝐾   2 4     −𝐽 −1 𝐾 𝐾 −1 1   2 𝑊2 =  (A.0.2)     (𝐾 − 𝐾 𝐾 −1 𝐾 ) −1   2 4 3 1    Now, the first column of the matrix in (A.2) is given by     𝐷 𝐷 −1 𝐶 𝑊  𝐴  32 22 21 1   1  ≡     (𝐷 𝐷 −1𝐶 + 𝐷 𝐷 −1𝐶 )𝑊   𝐵   41 11 11 42 22 21 1    1    Similarly, the last 𝑘 columns of the matrix in (A.2) are given by      𝐷 32 𝐷 −122 𝐶 21 𝑊 2  𝐴    2 ≡     (𝐷 𝐷 −1𝐶 + 𝐷 𝐷 −1𝐶 )𝑊   𝐵   41 11 11 42 22 21 2    2    Thus, the difference corresponding to 𝛽 𝑗 , 𝑗 = 1, 2 is   h i 𝐴 𝑗  0 𝐴𝑗 𝐵𝑗 0 𝐸     𝐵   𝑗   90 as stated in the proposition.  Proof of proposition 1.5.2.1 When we have two distinct samples containing (𝑦, 𝑧) and (𝑥, 𝑧), and hence the estimation is based only on 𝑔3 (.) and 𝑔4 (.). Thus     h i 𝐶 33 0   0 𝐷  32 ℎ(𝛽, Π) = 𝑔3 (Π)𝑔4 (𝛽, Π) 𝐶= 𝐷=      .  0 𝐶  𝐷   44   41 𝐷 42      The first step solves 𝑛 1Õ 𝑔3 (𝑥𝑖 , 𝑧𝑖 , 𝑠2𝑖 , Π̆) = 0. 𝑛 𝑖=1 By standard GMM theory, √ 𝑑 0 𝑛( Π̆ − Π) −−−−→ 𝑁 (0, 𝑉2 ) where 𝑉2 = 𝐷 −1 −1 32 𝐶33 𝐷 32 . The second step solves min ℎ¯ 4 (𝛽, Π̆) 0 Ω̆1 ℎ¯ 4 (𝛽, Π̆), 𝛽 1 Í𝑛 where ℎ¯ 4 (𝛽, Π) = 𝑔 (𝑦𝑖 , 𝑧𝑖 , 𝑠1𝑖 , 𝛽, Π). The first order condition is given by 𝑛 𝑖=1 4 𝐷ˆ 41 Ω̆1 ℎ¯ 4 ( 𝛽, ˘ Π̆) = 0 (A.3) 𝜕 ℎ¯ 4 ( 𝛽, ˘ Π̆) 𝑝 where 𝐷ˆ 41 = , Ω̆1 −−−−→ Ω1 , and Ω1 is a general weight matrix. A Taylor expansion of 𝜕𝛽 ¯ℎ4 ( 𝛽, ˘ Π̆) around 𝛽 gives ℎ¯ 4 ( 𝛽, ˘ Π̆) = ℎ¯ 4 (𝛽, Π̆) + 𝐷¯ 41 ( 𝛽˘ − 𝛽), 𝜕 ℎ¯ 4 ( 𝛽, ¯ Π̆) where 𝐷¯ 41 = and 𝛽¯ ∈ [𝛽, 𝛽]. ˘ Substituting in (A.3) 𝜕𝛽 𝐷ˆ 41 Ω̆1 ℎ¯ 4 (𝛽, Π̆) + 𝐷ˆ 41 Ω̆1 𝐷¯ 41 ( 𝛽˘ − 𝛽) = 0. Thus, √ √ 𝑛( 𝛽˘ − 𝛽) = −( 𝐷ˆ 41 Ω̆1 𝐷¯ 41 ) −1 𝐷ˆ 41 Ω̆1 𝑛 ℎ¯ 4 (𝛽, Π̆). 91 Now, a Taylor expansion of ℎ¯ 4 (𝛽, Π̆) around Π gives ℎ¯ 4 (𝛽, Π̆) = ℎ¯ 4 (𝛽, Π) + 𝐷¯ 42 ( Π̆ − Π), 𝜕 ℎ¯ 4 (𝛽, Π̄) where 𝐷¯ 42 = and Π̄ ∈ [Π, Π̆]. Thus, 𝜕 𝑣𝑒𝑐Π √ √ √ 𝑛( 𝛽˘ − 𝛽) = −( 𝐷ˆ 41 Ω̆1 𝐷¯ 41 ) −1 𝐷ˆ 41 Ω̆1 [ 𝑛 ℎ¯ 4 (𝛽, Π) + 𝐷¯ 42 𝑛 ( Π̆ − Π)]. √ √ Now, let 𝑍 ≡ [ 𝑛 ℎ¯ 4 (𝛽, Π) + 𝐷¯ 42 𝑛 ( Π̆ − Π)]. Since √ 𝑑 √ 𝑑 𝑛 ℎ¯ 4 (𝛽, Π) −−−−→ 𝑁 (0, 𝐶44 ) and 𝑛 ( Π̆ − Π) −−−−→ 𝑁 (0, 𝑉2 ), therefore 𝑑 𝑍 −−−−→ 𝑁 (0, Σ) where Σ = 𝐶44 + 𝐷 42𝑉2 𝐷 042 . Moreover, 𝑝 𝑝 𝑝 𝑝 𝐷ˆ 41 −−−−→ 𝐷 41 𝐷¯ 41 −−−−→ 𝐷 41 𝐷¯ 32 −−−−→ 𝐷 42 Ω̆1 −−−−→ Ω1 . Let 𝛽˘ ≡ 𝛽ˆ𝑇 𝑆2𝑆𝐿𝑆−𝑂 . Then, √ 𝑛( 𝛽˘ − 𝛽) = −[( 𝐷ˆ 41 Ω̆1 𝐷¯ 41 ) −1 𝐷ˆ 41 Ω̆1 − (𝐷 41 Ω1 𝐷 41 ) −1 𝐷 41 Ω1 ]𝑍 − (𝐷 41 Ω1 𝐷 41 ) −1 𝐷 41 Ω1 𝑍 = 𝑜 𝑝 (1) − (𝐷 41 Ω1 𝐷 41 ) −1 𝐷 41 Ω1 𝑍 where [( 𝐷ˆ 41 Ω̆1 𝐷¯ 41 ) −1 𝐷ˆ 41 Ω̆1 − (𝐷 41 Ω1 𝐷 41 ) −1 𝐷 41 Ω1 ] is 𝑜 𝑝 (1) because of the Slutsky’s theo- rem, 𝑍 is 𝑂 𝑝 (1), and 𝑜 𝑝 (1).𝑂 𝑝 (1) = 𝑜 𝑝 (1). Then, by the asymptotic equivalence lemma, √ 𝑑 𝑛( 𝛽˘ − 𝛽) −−−−→ 𝑁 (0, 𝑉1 ) where 𝑉1 = (𝐷 041 Ω1 𝐷 41 ) −1 𝐷 041 Ω1 Σ Ω1 𝐷 41 (𝐷 041 Ω1 𝐷 41 ) −1 . By standard GMM theory, the optimal weight matrix for this step is Σ−1 . Using this matrix gives 𝑉1∗ = (𝐷 041 Σ−1 𝐷 41 ) −1 . 92  Proof of proposition 1.5.2.3 √ The asymptotic variance of 𝑛( 𝛽ˆ − 𝛽) is given by the upper left ( 𝑝 + 𝑘) × ( 𝑝 + 𝑘) block of (𝐷 0𝐶 −1 𝐷) −1 . 
Now,       −1   0 𝐷 0  𝐶 −1 0   0 𝐷  41 32   (𝐷 0𝐶 −1 𝐷) −1 =     33            𝐷 0 𝐷 0   0 𝐶 −1  𝐷 𝐷    32 42   44   41 42        −1  𝐷 0 𝐶 −1 𝐷 0 −1 𝐷 41𝐶44 𝐷 42 41  =  41 44     𝐷 0 𝐶 −1 𝐷 0 −1 0 −1  42 44 41 𝐷 32𝐶33 𝐷 32 + 𝐷 42𝐶44 𝐷 42     Using the formula for the inversion of a block matrix, the upper left ( 𝑝 + 𝑘) × ( 𝑝 + 𝑘) block of this inverse is (𝐷 041𝐶44 −1 𝐷 − 𝐷 0 𝐶 −1 𝐷 (𝐷 0 𝐶 −1 𝐷 + 𝐷 0 𝐶 −1 𝐷 ) −1 𝐷 0 𝐶 −1 𝐷 ) −1 41 41 44 42 32 33 32 42 44 42 42 44 41 (A.4) √ On the other hand, we know 𝐴𝑣𝑎𝑟 ( 𝑛( 𝛽ˆ𝑇 𝑆2𝑆𝐿𝑆−𝑂 − 𝛽)) = (𝐷 041 Σ−1 𝐷 41 ) −1 (A.5) = (𝐷 041 (𝐶44 + 𝐷 42 (𝐷 032𝐶33 −1 𝐷 ) −1 𝐷 0 ) −1 𝐷 ) −1 32 42 41 = (𝐷 041 (𝐶44 −1 − 𝐶 −1 𝐷 (𝐷 0 𝐶 −1 𝐷 + 𝐷 0 𝐶 −1 𝐷 ) −1 𝐷 0 𝐶 −1 )𝐷 ) −1 44 42 32 33 32 42 44 42 42 44 41 = (𝐷 041𝐶44−1 𝐷 − 𝐷 0 𝐶 −1 𝐷 (𝐷 0 𝐶 −1 𝐷 + 𝐷 0 𝐶 −1 𝐷 ) −1 𝐷 0 𝐶 −1 𝐷 ) −1 41 41 44 42 32 33 32 42 44 42 42 44 41 (A.0.3) where the third equality uses the matrix inversion lemma. The result follows from the fact that (A.4) = (A.5).  Proof of proposition 1.5.2.5 With exact identification, 𝛽ˆ simply solves   𝑛 𝑔 (.)  1Õ ˆ = 0 where 3 ℎ(𝑦𝑖 , 𝑥𝑖 , 𝑧𝑖 , 𝑠1𝑖 , 𝑠2𝑖 , Π̂, 𝛽) ℎ(.) =    . 𝑛 𝑔 (.)  𝑖=1  4    93 This is the same as first solving 𝑛 1Õ 𝑔3 (𝑥𝑖 , 𝑧𝑖 , 𝑠2𝑖 , Π̆) = 0 𝑛 𝑖=1 for Π̆, and then solving 𝑛 1Õ 𝑔4 (𝑦𝑖 , 𝑧𝑖 , 𝑠1𝑖 , Π̆, 𝛽ˆ𝑇 𝑆2𝑆𝐿𝑆 ) = 0 𝑛 𝑖=1 for 𝛽ˆ𝑇 𝑆2𝑆𝐿𝑆 .  94 APPENDIX B TABLES FOR CHAPTER 1 Table B.1: Monte Carlo simulations, Design 1 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.008 0.035 0.036 -0.000 0.066 0.066 -0.023 0.055 0.059 Complete cases GMM 0.008 0.035 0.036 -0.001 0.066 0.066 -0.023 0.055 0.060 Imputation 0.009 0.029 0.031 -0.013 0.056 0.057 -0.011 0.047 0.048 Dummy variable method 0.008 0.035 0.036 0.154 0.065 0.167 0.153 0.054 0.162 Proposed GMM 0.008 0.027 0.028 -0.004 0.051 0.051 -0.011 0.043 0.044 Table B.2: Monte Carlo simulations, Design 2 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.013 0.069 0.070 -0.008 0.074 0.074 -0.024 0.065 0.069 Complete cases GMM 0.004 0.052 0.052 -0.005 0.067 0.068 -0.018 0.059 0.062 Imputation 0.008 0.060 0.061 -0.019 0.066 0.069 -0.010 0.058 0.059 Dummy variable method 0.013 0.069 0.070 0.146 0.069 0.161 0.151 0.060 0.163 Proposed GMM 0.010 0.041 0.042 -0.012 0.055 0.056 -0.011 0.049 0.050 Table B.3: Monte Carlo simulations, Design 3 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.009 0.076 0.077 0.003 0.083 0.083 -0.018 0.074 0.076 Complete cases GMM 0.002 0.056 0.056 0.004 0.074 0.074 -0.014 0.067 0.068 Imputation 0.006 0.065 0.065 -0.015 0.074 0.075 -0.005 0.064 0.064 Dummy variable method 0.009 0.076 0.077 0.176 0.078 0.193 0.181 0.066 0.193 Proposed GMM 0.010 0.044 0.045 -0.009 0.059 0.060 -0.009 0.053 0.054 95 Table B.4: Monte Carlo simulations, Design 4 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.013 0.069 0.070 -0.008 0.074 0.074 -0.024 0.065 0.069 Complete cases GMM 0.004 0.052 0.052 -0.005 0.067 0.068 -0.018 0.059 0.062 Imputation 0.008 0.060 0.060 -0.015 0.064 0.066 -0.013 0.057 0.058 Dummy variable method 0.013 0.070 0.070 0.001 0.060 0.060 0.003 0.052 0.052 Proposed GMM 0.010 0.040 0.041 -0.011 0.053 0.054 -0.011 0.047 0.049 Table B.5: Monte Carlo simulations, Design 5 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS -0.001 0.118 0.118 -0.006 0.145 0.146 0.023 0.166 0.168 Imputation -0.001 0.118 0.118 -0.004 0.136 0.136 0.019 0.148 0.149 
Proposed GMM 0.000 0.119 0.119 -0.006 0.136 0.136 0.018 0.149 0.150 Table B.6: Monte Carlo simulations, Design 6 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.013 0.180 0.180 -0.018 0.171 0.172 0.035 0.192 0.195 Imputation 0.013 0.180 0.180 -0.023 0.174 0.175 0.020 0.185 0.186 Proposed GMM 0.005 0.155 0.155 -0.012 0.150 0.150 0.030 0.165 0.167 Table B.7: Monte Carlo simulations, Design 7 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.003 0.198 0.198 -0.014 0.186 0.187 0.039 0.206 0.210 Imputation 0.003 0.198 0.198 -0.014 0.192 0.192 0.030 0.199 0.201 Proposed GMM 0.005 0.162 0.162 -0.012 0.155 0.156 0.031 0.172 0.175 96 Table B.8: Effect of physician’s advice on calorie consumption: complete cases versus the proposed estimator Estimator Complete cases GMM Proposed GMM Physician advised to lose weight 0.126 0.119 (0.099) (0.091) Age -0.004 -0.004 (0.00040) (0.00036) Female -0.294 -0.300 (0.011) (0.010) Black -0.054 -0.056 (0.013) (0.011) Other race -0.040 -0.142 (0.013) (0.011) 9 to 12 years of schooling 0.083 0.085 (0.024) (0.021) High school grad or equivalent 0.074 0.074 (0.022) (0.020) Some college or AA 0.049 0.063 (0.021) (0.019) College or above 0.053 0.060 (0.023) (0.021) Married -0.015 -0.019 (0.010) (0.009) Has high BP -0.002 -0.007 (0.015) (0.014) Has high cholesterol 0.005 -0.002 (0.019) (0.016) Has Arthritis -0.0005 0.006 (0.013) (0.012) Has heart condition -0.074 -0.073 (0.025) (0.023) Has Diabetes -0.079 -0.085 (0.020) (0.019) BMI 0.0007 0.0003 (0.003) (0.002) Monthly income < $2100 -0.019 -0.033 (0.016) (0.014) Monthly income between $2100 and $5400 -0.003 -0.013 (0.014) (0.012) Monthly income between $5400 and $8400 -0.017 -0.027 (0.015) (0.013) Is employed 0.086 0.081 (0.011) (0.010) 97 APPENDIX C FIGURES FOR CHAPTER 1 Figure C.1: Some admissible patterns of missingness (shaded areas represent complete cases) 1.1: Partial overlap 1.2: Univariate missing data 1.3: The TS2SLS case 𝑦 𝑥 𝑧 𝑦 𝑥 𝑧 𝑦 𝑥 𝑧 98 APPENDIX D PROOFS FOR CHAPTER 2 Proof of Lemma 2.4.1 Since 𝑊 is nonsingular by assumption, it suffices to show that E[𝑔(𝛼, 𝛽; 𝛿0 )] ≠ 0 for (𝛼, 𝛽) ≠ (𝛼0 , 𝛽0 ).1 We show this element-by-element of 𝑔(𝛼, 𝛽; 𝛿0 ). Starting with the weighted moment functions from the model of interest, given Assumptions 2.3.1 and 2.3.2 and the standard IPW argument, we know that E{[𝑠/𝐺 (𝑧, 𝛿0 )]𝑔1∗ (𝑦, 𝑥, 𝛼0 )} = E{[𝑠/𝑝(𝑧)]𝑔1∗ (𝑦, 𝑥, 𝛼0 )} = E{E([𝑠/𝑝(𝑧)]𝑔1∗ (𝑦, 𝑥, 𝛼0 )|𝑦, 𝑥, 𝑧)} = E{[E(𝑠|𝑦, 𝑥, 𝑧)/𝑝(𝑧)]𝑔1∗ (𝑦, 𝑥, 𝛼0 )} = E[𝑔1∗ (𝑦, 𝑥, 𝛼0 )]. Now, Assumption 2.2.1 implies that E[𝑔1∗ (𝑦, 𝑥, 𝛼)] ≠ 0 for any 𝛼 ≠ 𝛼0 . It follows that for any 𝛼 ≠ 𝛼0 , E{[𝑠/𝐺 (𝑧, 𝛿0 )]𝑔1∗ (𝑦, 𝑥, 𝛼)} ≠ 0. Moving on to the imputation model, first note that by iterated expectations, E(𝑠|𝑥 1 , 𝑥2 , 𝑧) = E[E(𝑠|𝑦, 𝑥1 , 𝑥2 , 𝑧)|𝑥1 , 𝑥2 , 𝑧] = E[E(𝑠|𝑧)|𝑥1 , 𝑥2 , 𝑧] = E(𝑠|𝑧) ≡ 𝑝(𝑧), where the second equality follows from Assumption 2.3.1. Now consider the weighted moment functions from the imputation model. E{[𝑠/𝐺 (𝑧, 𝛿0 )]𝑔2∗ (𝑥1 , 𝑥2 , 𝛽0 )} = E{[𝑠/𝑝(𝑧)]𝑔2∗ (𝑥 1 , 𝑥2 , 𝛽0 )} = E{E([𝑠/𝑝(𝑧)]𝑔2∗ (𝑥 1 , 𝑥2 , 𝛽0 )|𝑥 1 , 𝑥2 , 𝑧)} = E{[E(𝑠|𝑥 1 , 𝑥2 , 𝑧)/𝑝(𝑧)]𝑔2∗ (𝑥 1 , 𝑥2 , 𝛽0 )} = E[𝑔2∗ (𝑥1 , 𝑥2 , 𝛽0 )] = 0 1Note that even though 𝑔3 (𝛼, 𝛽; 𝛿) sometimes only identifies functions of (𝛼0 , 𝛽0 ) and not each element of (𝛼0 , 𝛽0 ) separately, the entire vector 𝑔(𝛼, 𝛽; 𝛿) still identifies (𝛼0 , 𝛽0 ) separately because 𝑔1 (𝛼; 𝛿) identifies 𝛼0 and 𝑔2 (𝛽; 𝛿) identifies 𝛽0 . 99 and the same argument as above applies for identification of 𝛽0 using Assumption 2.2.2. 
For the reduced form moment functions, identification of 𝛾0 simply follows from Assumption 2.2.3. Proof of Theorem 2.4.1 𝑝 Identification of (𝛼0 , 𝛽0 ) follows from Lemma 2.4.1 and 𝛿ˆ − → 𝛿0 follows from Assumption 2.3.2 and standard MLE theory. To complete the proof, we simply show that the objective function satisfies the weak uniform law of large numbers. By 5 and 6, |𝑔1 (𝑦, 𝑥, 𝑧, 𝑠, 𝛼; 𝛿0 )| ≤ 𝑎 −1 𝑏 1 (𝑦, 𝑥), all (𝑧, 𝑠), |𝑔2 (𝑥, 𝑧, 𝑠, 𝛽; 𝛿0 )| ≤ 𝑎 −1 𝑏 2 (𝑥), all (𝑧, 𝑠), |𝑔3 (𝑦, 𝑥2 , 𝛾)| ≤ 𝑏 3 (𝑦, 𝑥2 ). and by 6, E[𝑏(𝑦, 𝑥)] < ∞, where 𝑔1 (𝑦, 𝑥, 𝑧, 𝑠, 𝛼; 𝛿0 ), 𝑔2 (𝑥, 𝑧, 𝑠, 𝛽; 𝛿0 ), and 𝑔3 (𝑦, 𝑥2 , 𝛾) are as defined in (2.4.1). It follows from Lemma 2.4 in Newey and McFadden (1994) that 𝑁 Õ 𝑝 sup 𝑁 −1 ˆ − E[𝑔(𝑦, 𝑥, 𝑧, 𝑠, 𝛼, 𝛽, 𝛾; 𝛿0 )] − 𝑔(𝑦𝑖 , 𝑥𝑖 , 𝑧𝑖 , 𝑠𝑖 , 𝛼, 𝛽, 𝛾; 𝛿) → 0. (𝛼,𝛽,𝛾)∈A×B×Γ 𝑖=1 The rest of the proof is standard, see Wooldridge (2010, Section 12.4.1). Proof of Theorem 2.4.2 For notational convenience, let 𝜏 ≡ (𝛼0, 𝛽0) 0. First we will show that √ 𝑑 𝑁∇𝜏 𝑄(𝜏ˆ 0 ; 𝛿)ˆ − → 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 𝐷 00𝑊 𝐹0𝑊 𝐷 0 ). ˆ Since 𝑄(𝜏; ˆ = 𝑔(𝜏; 𝛿) ¯ 𝛿) ˆ 0 𝑊ˆ 𝑔(𝜏; ¯ 𝛿), ˆ =⇒ ∇𝜏 𝑄(𝜏;ˆ ˆ = [∇𝜏 𝑔(𝜏; 𝛿) ¯ 𝛿)] ˆ 0 𝑊ˆ 𝑔(𝜏; ¯ 𝛿)ˆ √ √ =⇒ 𝑁∇𝜏 𝑄(𝜏 ˆ 0 ; 𝛿)ˆ = [∇𝜏 𝑔(𝜏 ˆ 0 𝑊ˆ 𝑁 𝑔(𝜏 ¯ 0 ; 𝛿)] ˆ ¯ 0 ; 𝛿). 100 √ Carrying out an element-by-element mean value expansion of 𝑁∇𝜏 𝑄(𝜏 ˆ 0 ; 𝛿) ˆ around 𝛿0 gives, √ √ √ 𝑁∇𝜏 𝑄(𝜏 ˆ = [𝐷 0 + 𝑜 𝑝 (1)] 0 𝑊ˆ [ 𝑁 𝑔(𝜏 ˆ 0 ; 𝛿) ¯ 0 ; 𝛿0 ) + ∇𝛿 𝑔(𝜏 ¯ 𝑁 ( 𝛿ˆ − 𝛿0 )] ¯ 0 ; 𝛿) (D.1) √ √ = [𝐷 0 + 𝑜 𝑝 (1)] 0 𝑊ˆ { 𝑁 𝑔(𝜏 ¯ 0 ; 𝛿0 ) + [𝐻0 + 𝑜 𝑝 (1)] 𝑁 ( 𝛿ˆ − 𝛿0 )} √ 𝑁 −1 Õ = [𝐷 0 + 𝑜 𝑝 (1)] 0 ˆ 𝑊 { 𝑁 𝑔(𝜏 ¯ 0 ; 𝛿0 ) + [𝐻0 + 𝑜 𝑝 (1)] [𝑁 2 𝜓(𝑠𝑖 , 𝑧𝑖 ) + 𝑜 𝑝 (1)]} 𝑖=1 1Õ 𝑁 − = 𝐷 00 𝑊 {𝑁 2 [𝑔𝑖 + 𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )]} + 𝑜 𝑝 (1), (D.0.1) 𝑖=1 𝑝 where 𝛿¯ lies between 𝛿ˆ and 𝛿0 (thus 𝛿¯ − → 𝛿0 ), 𝐻0 ≡ E[∇𝛿 𝑔(𝜏0 , 𝛿0 )] and 𝜓(𝑠𝑖 , 𝑧𝑖 ) = −[E(𝑑𝑖 𝑑𝑖0)] −1 𝑑𝑖 √ is the influence function for 𝑁 ( 𝛿ˆ − 𝛿0 ). Moreover, by central limit theorem, 𝑁 −1 Õ 𝑑 𝑁 2 [𝑔𝑖 + 𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )]} −→ 𝑁 (0, 𝐹0 ), 𝑖=1 where 𝐹0 ≡ E[𝑔𝑖 𝑔𝑖0 + 𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )𝑔𝑖0 + 𝑔𝑖 𝜓(𝑠𝑖 , 𝑧𝑖 ) 0 𝐻00 + 𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )𝜓(𝑠𝑖 , 𝑧𝑖 ) 0 𝐻00 ]. Now, note that by definition, −(𝑠𝑖 /𝐺 𝑖 )𝑔 ∗ (∇𝛿 𝐺 𝑖 /𝐺 𝑖 )     1𝑖  𝐻0 = E −(𝑠𝑖 /𝐺 𝑖 )𝑔 (∇𝛿 𝐺 𝑖 /𝐺 𝑖 )  = − E( 𝑔˜𝑖 𝑑𝑖0),    ∗  2𝑖  0       where 𝑔˜𝑖 ≡ (𝑔1𝑖0 , 𝑔0 , 0) 0 and the third element is a 1 × 𝐿 zero vector. This is because 2𝑖 3  (𝑠𝑖 /𝐺 𝑖 )𝑔 ∗ {𝑠𝑖 (∇𝛿 𝐺 𝑖 /𝐺 𝑖 ) − (1 − 𝑠𝑖 ) [∇𝛿 𝐺 𝑖 /(1 − 𝐺 𝑖 )]}    1𝑖  0   ∗ E( 𝑔˜𝑖 𝑑𝑖 ) = E  (𝑠𝑖 /𝐺 𝑖 )𝑔 {𝑠𝑖 (∇𝛿 𝐺 𝑖 /𝐺 𝑖 ) − (1 − 𝑠𝑖 ) [∇𝛿 𝐺 𝑖 /(1 − 𝐺 𝑖 )]}   2𝑖  0        (𝑠𝑖 /𝐺 𝑖 )𝑔 ∗ (∇𝛿 𝐺 𝑖 /𝐺 𝑖 )     1𝑖    ∗ = E  (𝑠𝑖 /𝐺 𝑖 )𝑔 (∇𝛿 𝐺 𝑖 /𝐺 𝑖 )  ,   2𝑖  0       since 𝑠𝑖2 = 𝑠𝑖 and (1 − 𝑠𝑖 ) 2 = (1 − 𝑠𝑖 ). This implies E[𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )𝑔𝑖0] = − E( 𝑔˜𝑖 𝑑𝑖0) [E(𝑑𝑖 𝑑𝑖0)] −1 E(𝑑𝑖 𝑔𝑖0), 101 and E[𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )𝜓(𝑠𝑖 , 𝑧𝑖 ) 0 𝐻00 ] = E( 𝑔˜𝑖 𝑑𝑖0) [E(𝑑𝑖 𝑑𝑖0)] −1 E(𝑑𝑖 𝑔˜𝑖0). Therefore, 𝐹0 = E(𝑔𝑖 𝑔𝑖0) − {E(𝑔𝑖 𝑑𝑖0) [E(𝑑𝑖 𝑑𝑖0)] −1 E(𝑑𝑖 𝑔𝑖0) ◦ 𝑅}, where 𝑅 is a square matrix of order 𝐿 1 + 𝐿 2 + 𝐿 3 with all elements being unity except the lower right 𝐿 3 × 𝐿 3 block which is a 0 matrix, and ◦ denotes Hadamard product. Then using (D.1) and the asymptotic equivalence lemma, √ 𝑑 𝑁∇𝜏 𝑄(𝜏ˆ 0 ; 𝛿) → 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 𝐷 00𝑊 𝐹0𝑊 𝐷 0 ). ˆ − (D.2) Next, an element-by-element mean value expansion of ∇𝜏 𝑄( ˆ 𝜏; ˆ around 𝜏0 gives, ˆ 𝛿) ∇𝜏 𝑄( ˆ 𝜏; ˆ 𝛿)ˆ = ∇𝜏 𝑄(𝜏 ˆ + [𝐷 0 𝑊 𝐷 0 + 𝑜 𝑝 (1)] ( 𝜏ˆ − 𝜏0 ) ˆ 0 ; 𝛿) 0 √ √ =⇒ 𝑁 ( 𝜏ˆ − 𝜏0 ) = −(𝐷 00𝑊 𝐷 0 ) −1 𝑁∇𝜏 𝑄(𝜏 ˆ 0 ; 𝛿) ˆ + 𝑜 𝑝 (1). 
(D.3) Combining (D.2) and (D.3) and using the asymptotic equivalence lemma gives √ 𝑑 𝑁 ( 𝜏ˆ − 𝜏0 ) −−−−→ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, (𝐷 00𝑊 𝐷 0 ) −1 𝐷 00𝑊 𝐹0𝑊 𝐷 0 (𝐷 00𝑊 𝐷 0 ) −1 ], (D.0.2) which is the desired result. Proof of Proposition 2.4.1 For notational convenience, let 𝜏ˆ𝑊 𝐽 ≡ ( 𝛼ˆ 𝑊 0 , 𝛽ˆ0 ) 0. We want to show that under the null 𝐽 𝑊𝐽 𝑑 hypothesis , 𝑁 𝑄( ˆ 𝜏ˆ𝑊 𝐽 ; 𝛿) ˆ − → 𝜒2 , where 𝑊ˆ = 𝐹ˆ −1 . 𝐿3 First note that a mean value expansion around 𝛿0 yields √ √ √ 𝑑 ¯ 0 ; 𝛿) 𝑁 𝑔(𝜏 ˆ = ¯ 0 ; 𝛿0 ) + ∇𝛿 𝑔(𝜏 𝑁 𝑔(𝜏 ¯ 0 ; 𝛿)¯ 𝑁 ( 𝛿ˆ − 𝛿0 ) − → 𝑁 (0, 𝐹0 ). by equation (A.9). This implies −1 √ 𝑑 − 𝐹0 2 𝑁 𝑔(𝜏 ˆ ¯ 0 ; 𝛿) = 𝑈𝑁 − → 𝑈 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 𝐼). (D.4) 102 Moreover, the first order conditions for the objective function in (4.3) imply that √ √ 𝑁∇𝜏 𝑄(ˆ 𝜏; ˆ = [∇𝜏 𝑔( ˆ 𝛿) ¯ 𝜏; ˆ 0 𝐹ˆ −1 𝑁 𝑔( ˆ 𝛿)] ¯ 𝜏; ˆ 𝛿)ˆ =0 (D.5) √ =⇒ 𝐷 00 𝐹0−1 𝑁 𝑔( ¯ 𝜏; ˆ + 𝑜 𝑝 (1) = 0 ˆ 𝛿) 1 √ =⇒ 𝐷 00 𝐹0−1 [−𝐹02 𝑈 𝑁 + 𝐷 0 𝑁 ( 𝜏ˆ − 𝜏0 )] + 𝑜 𝑝 (1) = 0 √ 0 −1 −1 0 −1 =⇒ 𝑁 ( 𝜏ˆ − 𝜏0 ) = (𝐷 0 𝐹0 𝐷 0 ) 𝐷 0 𝐹0 2 𝑈 𝑁 + 𝑜 𝑝 (1). (D.0.3) Now, a mean value expansion of the sample moments around 𝜏0 gives √ √ √ ¯ 𝜏; 𝑁 𝑔( ˆ = ˆ 𝛿) ¯ 0 ; 𝛿) 𝑁 𝑔(𝜏 ˆ + ∇𝜏 𝑔( ¯ 𝜏; ¯ 𝛿) ˆ 𝑁 ( 𝜏ˆ − 𝜏0 ), (D.6) where 𝜏¯ lies between 𝜏ˆ and 𝜏0 . Substituting (D.4) and (D.5) into (D.6), we get √ − 1 − 1 ¯ 𝜏; 𝑁 𝑔( ˆ = −𝐹 2 𝑈 𝑁 + 𝐷 0 (𝐷 0 𝐹 −1 𝐷 0 ) −1 𝐷 0 𝐹 2 𝑈 𝑁 + 𝑜 𝑝 (1) ˆ 𝛿) 0 0 0 0 0 − 1 = −𝐹0 2 𝑅0𝑈 𝑁 + 𝑜 𝑝 (1), − 1 − 1 where 𝑅0 = 𝐼 − 𝐹0 2 𝐷 0 (𝐷 00 𝐹0−1 𝐷 0 ) −1 𝐷 00 𝐹0 2 is idempotent of rank 𝐿 3 . Then, 𝑑 𝑁 𝑄(ˆ 𝜏; ˆ = 𝑈 0 𝑅0𝑈 𝑁 + 𝑜 𝑝 (1) − ˆ 𝛿) → 𝜒𝐿2 . 𝑁 3 Proof of Proposition 2.6.1.1 I drop the 0 subscripts/superscripts for notational convenience, but all expressions in this proof are evaluated at the true values of the parameters, that is, at (𝛼0 , 𝛽0 , 𝛾0 ). First note that the GMM estimator of 𝛼0 that minimizes (2.4.3) with 𝑔(𝛼, 𝛽; 𝛿) ˆ = [𝑔1 (𝛼; 𝛿)ˆ 0, 𝑔2 (𝛽; 𝛿) ˆ 0] 0 and 𝑊ˆ = 𝐼 is numerically equivalent to 𝛼ˆ 𝑊 𝑐𝑐 , which is based only on 𝑔1 (𝛼; 𝛿). ˆ This is because ˆ simply adds equal number of parameters to be estimated and moment conditions to the 𝑔2 (𝛽; 𝛿) system.2 To characterize the asymptotic variance of this estimator, first define the following 2Ahu & Schmidt (1995) 103 quantities.       𝐷 11 0  h i 𝐹 11 𝐹 12  𝐹   13  𝐷1 ≡  𝐷 2 ≡ 𝐷 31 𝐷 32 𝐹1 ≡  𝐹2 ≡   𝐹3 ≡ 𝐹33 ,        0 𝐷  𝐹 0 𝐹  𝐹   22   12 22   23        (D.0.4) with 𝐹 𝑗𝑛 = E(𝑔 𝑗 𝑔0𝑛 ) − E(𝑔 𝑗 𝑑 0) [E(𝑑𝑑 0)] −1 E(𝑑𝑔0𝑛 ), 𝑗, 𝑛 = 1, 2, 3 except 𝐹33 which equals E(𝑔3 𝑔30 ). Then the asymptotic variance of this estimator is given by (𝐷 01 𝐹1−1 𝐷 1 ) −1 , and the required differ- ence in the proposition is given by the upper-left 𝐿 1 × 𝐿 1 block of (𝐷 01 𝐹1−1 𝐷 1 ) −1 − (𝐷 0 𝐹 −1 𝐷) −1 . We will now characterize this difference. First note that   −1     𝐹 𝐹  𝐹 −1 (𝐼 + 𝐹 𝐻𝐹 0 𝐹 −1 ) −𝐹 −1 𝐹 𝐻  𝐷  −1 1 2 1 2 2 1 1 2  1 𝐹 =  = ,𝐷 =  , (D.0.5)     𝐹 0 𝐹   −𝐻𝐹20 𝐹1−1 𝐻  𝐷   2 3    2       where 𝐻 ≡ (𝐹3 − 𝐹20 𝐹 −1 𝐹2 ) −1 . Therefore,     i 𝐹 −1 (𝐼 + 𝐹2 𝐻𝐹 0 𝐹 −1 ) −𝐹 −1 𝐹2 𝐻  𝐷   1 h 1 2 1 1 𝐷 0 𝐹 −1 𝐷 = 𝐷 0 𝐷 0  = 𝐷 1 𝐹1−1 𝐷 1 + 𝐽 0 𝐻𝐽,      1 2   −𝐻𝐹20 𝐹1−1 𝐻   𝐷   2     (D.0.6) where 𝐽 ≡ 𝐹20 𝐹1−1 𝐷 1 − 𝐷 2 . 
Therefore, using the Sherman Morrison formula, (𝐷 0 𝐹 −1 𝐷) −1 = (𝐷 01 𝐹1−1 𝐷 1 + 𝐽 0 𝐻𝐽) −1 = (𝐷 01 𝐹1−1 𝐷 1 ) −1 − (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0 [𝐻 −1 + 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0] −1 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 , (D.0.7) which implies that (𝐷 01 𝐹1−1 𝐷 1 ) −1 − (𝐷 0 𝐹 −1 𝐷) −1 = (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0 [𝐻 −1 + 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0] −1 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 ≡ (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0 𝐾 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 , (D.0.8) where 𝐾 ≡ [𝐻 −1 + 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0] −1 is a positive definite matrix. The matrix in (A.32) is clearly positive semidefinite, which proves the proposition. 104 For use in the next proof, we want to characterize the difference corresponding specifically to 𝛼0 , which is given by the upper-left 𝐿 1 × 𝐿 1 block of the matrix in (A.32). For this difference, we focus on the first 𝐿 1 columns of 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 . Note that 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 = (𝐹20 𝐹1−1 𝐷 1 − 𝐷 2 )(𝐷 01 𝐹1−1 𝐷 1 ) −1 = (𝐹20 𝐹1−1 𝐷 1 − 𝐷 2 )𝐷 −1 −1 1 𝐹1 𝐷 1 = (𝐹20 − 𝐷 2 𝐷 −1 −1 1 𝐹1 )𝐷 1 , (D.0.9) where we have used the fact that 𝐷 1 is symmetric. Substituting the definitions of 𝐹1 , 𝐹2 , 𝐷 1 , and 𝐷 2 , we get 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 equals h i 0 − 𝐷 𝐷 −1 𝐹 − 𝐷 𝐷 −1 𝐹 0 )𝐷 −1 (𝐹 0 − 𝐷 𝐷 −1 𝐹 − 𝐷 𝐷 −1 𝐹 )𝐷 −1 . (D.0.10) (𝐹13 31 11 11 32 22 12 11 23 31 11 12 32 22 22 22 The first 𝐿 1 columns of this matrix are given by the left block, which is 0 𝐷 −1 − 𝐷 𝐷 −1 𝐹 𝐷 −1 − 𝐷 𝐷 −1 𝐹 0 𝐷 −1 . 𝐿 ≡ 𝐹13 (D.0.11) 11 31 11 11 11 32 22 12 11 Let 𝐿 = [𝐿 1 𝐿 2 ], where 𝐿 1 is the first column of 𝐿 and 𝐿 2 is the matrix of last 𝐿 1 − 1 columns of 𝐿. Then the difference in asymptotic variances corresponding to 𝛼1 and 𝛼2 is 𝐿 01 𝐾 𝐿 1 and 𝐿 02 𝐾 𝐿 2 respectively. Proof of Proposition 6.1.2 We want to show that neither 𝐿 1 nor 𝐿 2 derived in the proof of Proposition 6.1.1 is zero in general. For notational simplicity, I drop the 0 sub/superscripts in this proof, but all expressions are evaluated at the true parameter values. By standard second order conditions for a probit and a normal MLE,   𝜎 −2 𝑥 0 𝑥 0 2 2  𝐷 11 = − E(𝑥 0𝑥𝑒 1 ) 𝐷 22 = E       0 𝜎 −4 /2   𝐷 31 = − E(𝑥 20 𝑥2 𝑒 2 )ℎ𝛼 𝐷 32 = − E(𝑥20 𝑥 2 𝑒 2 )ℎ 𝛽 (D.0.12) 105 where ℎ𝛼 = [ℎ𝛼1 ℎ𝛼2 ]  𝜃 − (𝜃𝛼 + 𝛼 )(1 + 𝛼2 𝜎 2 ) −1 𝛼 𝜎 2 1   1 2 1 1  ≡  q q 𝐼𝑘  1 + 𝛼12 𝜎 2 1 + 𝛼12 𝜎 2      ℎ 𝛽 = [ℎ 𝜃 ℎ 𝜎 2 ]  𝛼1 (𝜃𝛼1 + 𝛼2 )𝛼12  𝐼𝑘 −  ≡  q 2 𝜎 2 ) 3/2  ,  (D.0.13)  1 + 𝛼1 𝜎2 2 2(1 + 𝛼 1    𝑒 1 ≡ [𝜙(𝑥𝛼)] 2 /{Φ(𝑥𝛼) [1 − Φ(𝑥𝛼)]}, 𝑒 2 ≡ [𝜙(𝑥 2 𝛾)] 2 /{Φ(𝑥 2 𝛾) [1 − Φ(𝑥 2 𝛾)]}. Then we can write   𝑥 0 𝑥 𝑒 𝑥 0 𝑥 𝑒   1 1 1 2 1 𝐷 11 = − E  1  (D.0.14) 𝑥 0 𝑥 𝑒 𝑥 0 𝑥 𝑒   2 1 1 2 2 1   0 0 2 2 Let E(𝑥 2 𝑥 2 𝑒 1 ) ≡ Γ1 , E(𝑥2𝑟𝑒 1 ) ≡ Γ2 , and E(𝑟 𝑒 1 ) = 𝜎 2 . Then using 𝑥 1 = 𝑥2 𝜃 + 𝑟, we can write 𝑟 𝑒 1 𝜃 Γ1 𝜃 + 2Γ0 𝜃 + 𝜎 2 𝜃 0Γ1 + Γ02   0  2 2 𝑟 𝑒1 𝐷 11 = −  (D.0.15)      Γ1 𝜃 + Γ2 Γ1    Let Γ3 ≡ (𝜎 22 − Γ02 Γ−1 1 2 Γ ). Using the partitioned inverse formula, we can write 𝑟 𝑒 1    Γ −1 −Γ −1 (𝜃 0 + Γ0 Γ−1 )  3 3 2 1 𝐷 −1 = (D.0.16)   11  −Γ−1 (𝜃 + Γ−1 Γ ) Γ−1 + (𝜃 + Γ−1 Γ )Γ−1 (𝜃 0 + Γ0 Γ−1 )    3  1 2 1 1 2 3 2 1  To calculate the first term in (A.35), we begin by deriving 𝐹13 . 𝐹13 = E(𝑔1 𝑔30 ) − E(𝑔1 𝑑 0) [E(𝑑𝑑 0)] −1 E(𝑑𝑔30 ). (D.0.17) Let 𝑢 1 ≡ [𝑦−Φ(𝑥𝛼)]𝜙(𝑥𝛼)/Φ(𝑥𝛼) [1−Φ(𝑥𝛼)] be the generalized residual for the model of interest, 𝑣 1 ≡ [𝑦 − Φ(𝑥2 𝛾)]𝜙(𝑥 2 𝛾)/Φ(𝑥2 𝛾) [1 − Φ(𝑥 2 𝛾)] be the generalized residual for the reduced form, Ω𝑢1 𝑣 1 ≡ E{[𝑠/𝑝(𝑧)]𝑥 20 𝑥 2 𝑢 1 𝑣 1 }, Ω𝑟𝑢1 𝑣 1 ≡ E{[𝑠/𝑝(𝑧)]𝑥 2𝑟𝑢 1 𝑣 1 }, Ω𝑣 1 𝑑 ≡ E(𝑥20 𝑣 1 𝑑 0), Ω𝑢1 𝑑 ≡ E{[𝑠/𝑝(𝑧)]𝑥 20 𝑢 1 𝑑 0 }, and Ω𝑟𝑢1 𝑑 ≡ E{[𝑠/𝑝(𝑧) 2 ]𝑟𝑢 1 𝑑 0 }. 
Then  0 𝜃 Ω𝑢 𝑣 + Ω𝑟𝑢 𝑣 − (𝜃 0Ω𝑢 𝑑 + Ω𝑟𝑢 𝑑 ) [E(𝑑𝑑 0)] −1 Ω0   1 1 1 1 1 1 𝑣1 𝑑  𝐹13 =  (D.0.18)     Ω𝑢1 𝑣 1 − Ω𝑢1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑣 𝑑    1  106 Using the definitions of 𝐹13 and 𝐷 −1 11 , we get that the first column of 𝐹13 0 𝐷 −1 is 11 𝑄 11 ≡ {Ω0𝑟𝑢 𝑣 − Ω𝑣 1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 }Γ−1 0 0 −1 0 3 − {Ω𝑢 1 𝑣 1 − Ω𝑣 1 𝑑 [E(𝑑𝑑 )] Ω𝑢 1 𝑑 }Γ3 Γ1 Γ2 −1 −1 1 1 1 (D.0.19) and the last 𝑘 columns of 𝐹13 0 𝐷 −1 are 11 𝑄 12 ≡ −{Ω0𝑟𝑢 𝑣 − Ω𝑣 1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 }Γ−1 0 0 −1 3 (𝜃 + Γ2 Γ1 ) 1 1 1 + {Ω0𝑢 𝑣 − Ω𝑣 1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 }Γ−1 −1 0 1 [𝐼 𝑘 + Γ2 Γ3 (𝜃 + Γ2 Γ1 )] 0 −1 (D.0.20) 1 1 1 Next we derive the second term in (A.35). ℎ𝛼 𝐷 −1 −1 11 = [ℎ 𝛼1 ℎ 𝛼2 ]𝐷 11 = [ Γ−1 3 [ℎ𝛼1 −ℎ𝛼2 (𝜃+Γ−1 Γ2 )] −ℎ𝛼1 Γ−1 (𝜃 0+Γ0 Γ−1 )+ℎ𝛼2 [Γ−1 +(𝜃+Γ−1 Γ2 )Γ−1 (𝜃 0+Γ0 Γ−1 )] ] 1 3 2 1 1 1 3 2 1 (D.0.21) Let ℎ ≡ ℎ𝛼1 − ℎ𝛼2 𝜃 and Γ4 ≡ Γ−1 3 (ℎ − ℎ𝛼2 Γ−1 1 2 Γ ). Then h i ℎ𝛼 𝐷 −111 = Γ4 −Γ 4 (𝜃 0 + Γ0 Γ−1 ) + ℎ Γ−1 𝛼2 1 (D.0.22) 2 1 Now consider 𝐹11 . Let Ω 2 ≡ E{[𝑠/𝑝(𝑧) 2 ]𝑥 20 𝑥 2 𝑢 21 }, Ω 2 ≡ E{[𝑠/𝑝(𝑧) 2 ]𝑥20 𝑟𝑢 21 }, and Ω 2 2 ≡ 𝑢 𝑟𝑢 𝑟 𝑢 1 1 1 E{[𝑠/𝑝(𝑧) 2 ]𝑟 2 𝑢 21 }. Then, 𝐹11 = [𝐹111 𝐹112 ] (D.0.23) where  0 𝜃 Ω 2 𝜃 + 2𝜃 0Ω 2 + Ω 0 0 −1 0 0  2 𝑢 2 − (𝜃 Ω𝑢 1 𝑑 + Ω𝑟𝑢 1 𝑑 ) [E(𝑑𝑑 )] (Ω𝑢 1 𝑑 𝜃 + Ω𝑟𝑢 1 𝑑 )    𝑢 𝑟𝑢 𝑟 𝐹111 ≡   1 1 1  (D.0.24) 0 0 −1 0 0   Ω 𝜃 + Ω 2 − Ω𝑢 𝑑 [E(𝑑𝑑 )] (Ω 𝜃+Ω )   𝑢 2 𝑟𝑢 1 𝑢 1 𝑑 𝑟𝑢 1 𝑑    1 1    0 𝜃 Ω 2 + Ω0 − (𝜃 0Ω𝑢 𝑑 + Ω𝑟𝑢 𝑑 ) [E(𝑑𝑑 0)] −1 Ω0    𝑢 1 𝑟𝑢 2 1 1 𝑢1 𝑑  𝐹112 ≡  1  (D.0.25)   0 Ω 2 − Ω𝑢1 𝑑 [E(𝑑𝑑 )] Ω𝑢 𝑑 −1 0    𝑢 1 1   107 Using the definitions of ℎ𝛼 𝐷 −1 11 , and 𝐹11 , we find that the first column of ℎ𝛼 𝐷 −1 𝐹 𝐷 −1 is given 11 11 11 by 𝑄 ∗21 ≡ Γ4 Γ−1 0 −1 0 3 ({Ω𝑟 2 𝑢 2 − Ω𝑟𝑢 1 𝑑 [E(𝑑𝑑 )] Ω𝑟𝑢 1 𝑑 } 1 − {Ω0 2 − Ω𝑟𝑢1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 }Γ−1 1 Γ2 − Ω𝑟𝑢 2 𝜃) 0 𝑟𝑢 1 1 1 − (Γ4 Γ02 + ℎ𝛼2 )Γ−1 −1 0 −1 0 1 Γ3 [{Ω𝑟𝑢 2 − Ω𝑢 1 𝑑 [E(𝑑𝑑 )] Ω𝑟𝑢 1 𝑑 } 1 − {Ω 2 − Ω𝑢1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 }Γ−1 1 Γ2 − (𝜃 + Γ1 Γ2 )] −1 (D.0.26) 𝑢 1 1 and the last 𝑘 columns of ℎ𝛼 𝐷 −1 𝐹 𝐷 −1 are given by 11 11 11 𝑄 ∗22 ≡ −Γ4 Γ−1 0 −1 0 0 −1 0 0 −1 3 [{Ω𝑟 2 𝑢 2 − Ω𝑟𝑢 1 𝑑 [E(𝑑𝑑 )] Ω𝑟𝑢 1 𝑑 } − Ω𝑟𝑢 2 (𝜃 + Γ1 Γ2 )] (𝜃 + Γ2 Γ1 ) 1 1 + (Γ4 Γ02 + ℎ𝛼2 )Γ−1 0 −1 0 −1 0 0 −1 0 −1 1 {Ω𝑟𝑢 2 − Ω𝑢 1 𝑑 [E(𝑑𝑑 )] Ω𝑟𝑢 1 𝑑 }Γ3 (𝜃 + Γ2 Γ1 ) + Γ4 Ω𝑟𝑢 2 Γ1 1 1 − [Γ4 Ω𝑟𝑢1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 + (Γ4 Γ02 + ℎ𝛼2 )Γ−1 0 −1 0 1 {Ω𝑢 2 − Ω𝑢 1 𝑑 [E(𝑑𝑑 )] Ω𝑢 1 𝑑 }] 1 1 Γ−1 −1 0 0 −1 1 [𝐼 𝑘 + Γ2 Γ3 (𝜃 + Γ2 Γ1 )] (D.0.27) Let 𝑄 21 ≡ − E(𝑥20 𝑥2 𝑒 2 )𝑄 ∗21 and 𝑄 22 ≡ − E(𝑥 20 𝑥 2 𝑒 2 )𝑄 ∗22 . Then, 𝐷 31 𝐷 −1 −1 11 𝐹11 𝐷 11 = [𝑄 21 𝑄 22 ] (D.0.28) Clearly, neither 𝑄 21 nor 𝑄 22 is zero. Next we want to find 𝐷 32 𝐷 −1 𝐹 0 𝐷 −1 . First note that 22 12 11   𝜎 2 E(𝑥 0 𝑥 ) −1 0  2 2 𝐷 −1 22 =  (D.0.29)   . 0 2𝜎 4     Further, let Ω𝑟 𝑑 ≡ E{[𝑠/𝑝(𝑧)]𝑥20 𝑟𝑑 0 }, Ω𝑟 2 𝑑 ≡ E{[𝑠/𝑝(𝑧)] (𝑟 2 𝜎 −2 − 1)𝑑 0 }, Ω𝑢1𝑟 ≡ E{[𝑠/𝑝(𝑧) 2 ] 𝑥 20 𝑥 2 𝑢 1𝑟}, Ω𝑢 𝑟 2 ≡ E{[𝑠/𝑝(𝑧) 2 ]𝑥 20 𝑢 1𝑟 2 }, and Ω𝑢 𝑟 3 ≡ E{[𝑠/𝑝(𝑧) 2 ]𝑢 1𝑟 3 }. Then 1 1 (Ω0𝑢 𝑟 𝜃+Ω0 )−[Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 (Ω0 𝜃+Ω0 )] Ω0𝑢 𝑟 −Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 Ω0 " # 0 = 1 𝑢1𝑟 2 𝑢1 𝑑 𝑟𝑢 1 𝑑 1 𝑢1 𝑑 𝐹12 0 −2 0 −1 0 0 . (Ω 𝜃+Ω )𝜎 −Ω 2 [E(𝑑𝑑 )] (Ω 𝜃+Ω ) 𝜎 Ω𝑢 𝑟 −Ω 2 [E(𝑑𝑑 )] Ω0 −2 0 0 −1 𝑢1𝑟 2 𝑢1𝑟 3 𝑟 𝑑 𝑢1 𝑑 𝑟𝑢 1 𝑑 1 𝑟 𝑑 𝑢1 𝑑 (D.0.30) 108 Next note that ℎ 𝛽 𝐷 −1 22 = [ℎ 𝜃 𝜎 2 E(𝑥20 𝑥 2 ) −1 2ℎ 𝜎 2 𝜎 4 ]. 
Using the definitions of 𝐹12 and 𝐷 −1 22 , we get that the first column of ℎ 𝛽 𝐷 −1 𝐹 0 𝐷 −1 is 22 12 11 𝑄 ∗31 ≡ ℎ 𝜃 𝜎 2 E(𝑥 20 𝑥 2 ) −1 Γ−1 3 (Ω0 − Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 − {Ω0𝑢 𝑟 − Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 }Γ−1 1 Γ2 ) 𝑢1𝑟 2 1 1 1 + 2ℎ 𝜎 2 𝜎 4 Γ−1 3 (𝜎 −2 Ω𝑢 𝑟 3 − Ω𝑟 2 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 − {𝜎 −2 Ω0 2 − Ω𝑟 2 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 }Γ−1 1 Γ2 ) 1 1 𝑢1𝑟 1 (D.0.31) and the last 𝑘 columns of ℎ 𝛽 𝐷 −1 𝐹 0 𝐷 −1 are 22 12 11 𝑄 ∗32 ≡ 𝜎 2 E(𝑥 20 𝑥 2 ) −1 ([−ℎ 𝜃 {Ω0 − Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 } 𝑢1𝑟 2 1 − 2ℎ 𝜎 2 𝜎 2 {Ω0 𝜎 −2 − Ω𝑟 2 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 }Γ−1 0 0 −1 3 (𝜃 + Γ2 Γ1 )] 𝑢1𝑟 3 1 + {(ℎ 𝜃 {Ω0𝑢 𝑟 − Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 } 1 1 − 2ℎ 𝜎 2 𝜎 2 {Ω0𝑢 𝑟 − Ω𝑟 2 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 })Γ−1 −1 0 0 −1 1 [𝐼 𝑘 + Γ2 Γ3 (𝜃 + Γ2 Γ1 )]}) (D.0.32) 1 1 Let 𝑄 31 ≡ − E(𝑥20 𝑥2 𝑒 2 )𝑄 ∗31 and 𝑄 32 ≡ − E(𝑥 20 𝑥 2 𝑒 2 )𝑄 ∗32 . Then, 𝐷 32 𝐷 −1 0 −1 22 𝐹12 𝐷 11 = [𝑄 31 𝑄 32 ] (D.0.33) Clearly, neither 𝑄 31 nor 𝑄 32 is zero. Thus, 𝐿 1 = 𝑄 11 + 𝑄 21 + 𝑄 31 ≠ 0 (D.0.34) 𝐿 2 = 𝑄 12 + 𝑄 22 + 𝑄 32 ≠ 0 (D.0.35) which implies that it is possible to obtain strict efficiency gains for both 𝛼1 and 𝛼2 . Proof of Proposition E.1. We first show that 𝛼0 is a solution to 𝑚𝑖𝑛𝛼∈A E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)]. First note that for any 𝛼 ∈ A, E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)] = E{E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥]} = E{𝑝(𝑥 2 ) E[ 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥]}, 109 where the second equality follows by iterated expectations. E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥] = E{E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)|𝑦, 𝑥]|𝑥} = E[E(𝑠|𝑦, 𝑥) 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥] = E[ 𝑝(𝑥2 ) 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥] = 𝑝(𝑥2 ) E[ 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥], where the third equality follows from part 2 of Assumption E.2. Because 𝑝(𝑥 2 ) ≥ 0 ∀𝑥 2 ∈ X2 , and 𝛼0 minimizes E[ 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥] for all 𝑥 ∈ X, 𝑝(𝑥 2 ) E[ 𝑓1 (𝑦, 𝑥, 𝛼0 )|𝑥] ≤ 𝑝(𝑥 2 ) E[ 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥], 𝑥 ∈ X, 𝛼 ∈ A. The result follows from taking an expectation with respect to 𝑥. A similar argument can be used to verify that 𝛽0 solves min 𝛽∈B E[𝑠 · 𝑓2 (𝑥 1 , 𝑥2 , 𝛽)] and noting that E(𝑠|𝑥) = 𝑝(𝑥 2 ) under part 2 of Assumption E.2. For the reduced form, part 1 of Assumption E.1 implies using iterated expectations that 𝛾0 minimizes E[ 𝑓3 (𝑦, 𝑥2 , 𝛾)]. 110 APPENDIX E ASYMPTOTIC THEORY FOR UNWEIGHTED ESTIMATION The notion of econometric models underlying the objective functions in (2.2.1)-(2.2.3) being correctly specified, and sample selection being based on 𝑥 2 is formalized in the following two assumptions. Assumption E.1. Assume that 1. For each 𝑥 ∈ X, 𝛼0 solves min𝛼∈A E[ 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥]. For each 𝑥2 ∈ X2 , 𝛽0 and 𝛾0 solve min 𝛽∈B E[ 𝑓2 (𝑥 1 , 𝑥2 , 𝛽)|𝑥 2 ] and min𝛾∈Γ E[ 𝑓3 (𝑦, 𝑥2 , 𝛾)|𝑥2 ] respectively. 2. 𝛼0 , 𝛽0 , and 𝛾0 are the unique solutions to min𝛼∈A E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)] and min 𝛽∈B E[𝑠 · 𝑓2 (𝑥1 , 𝑥2 , 𝛽)] respectively. Part 1 of this assumption practically means that the underlying model is correctly specified. Part 2 is needed to ensure that the selected subpopulation is sufficiently rich to identify the respective parameters. The notion that 𝑠 depends on 𝑥2 is formalized in part 2 of the following assumption. Assumption E.2. Assume that 1. 𝑥 1 is observed whenever 𝑠 = 1, (𝑦, 𝑥2 ) are always observed. 2. 𝑃(𝑠 = 1|𝑦, 𝑥1 , 𝑥2 ) = 𝑃(𝑠 = 1|𝑥 2 ) ≡ 𝑝(𝑥 2 ). It is simple to show that Assumptions E.1 and E.2 along with regularity conditions, imply consistency of ( 𝛼ˆ 𝑈𝐽 , 𝛽ˆ𝑈𝐽 ). I show that the following proposition holds. Proposition E.1. Under Assumptions E.1 and E.2, 𝛼0 , 𝛽0 , and 𝛾0 solve min𝛼∈A E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)], min 𝛽∈B E[𝑠 · 𝑓2 (𝑥 1 , 𝑥2 , 𝛽)] and min𝛾∈Γ E[ 𝑓3 (𝑦, 𝑥2 , 𝛾)] respectively. 
The proof (given in Appendix C) simply follows from an iterated expectations argument and is an extension of that in Wooldridge (2002). Theorem E.1. Assume that 111 1. {(𝑦𝑖 , 𝑥𝑖 , 𝑠𝑖 ) : 𝑖 = 1, . . . , 𝑁 } are random draws from the population satisfying Assumption E.2. 2. Assumption E.1 holds. 3. Parts 3 (except the assumptions on Δ), 4, and 6 of Theorem 2.4.1 hold. 𝑝 Then ( 𝛼ˆ 𝑈𝐽 , 𝛽ˆ𝑈𝐽 ) −→ (𝛼0 , 𝛽0 ) as 𝑁 → − ∞. Once we verify that (𝛼0 , 𝛽0 ) are identified in the subpopulations defined by 𝑠 = 1, the proof of Theorem E.1 is very similar to that of Theorem 2.4.1, and hence is omitted. To derive the asymptotic distribution of ( 𝛼ˆ 𝑈𝐽 , 𝛽ˆ𝑈𝐽 ), we assume that E[𝑔(𝛼, 𝛽)] is differentiable at (𝛼0 , 𝛽0 ) with the derivative defined as the following.  0  𝐷  𝑈11 0     𝐷 𝑈0 ≡ E[∇ (𝛼0,𝛽0)0 𝑔(𝛼, 𝛽)| (𝛼,𝛽)=(𝛼 ,𝛽 ) ] =  0  𝐷 0 , (E.0.1) 0 0 𝑈22     0 0   𝐷 𝑈31 𝐷 𝑈32    where 𝐷 𝑈 0 = 𝜕𝑔 𝑗 (𝛼, 𝛽)/𝜕𝛼| (𝛼,𝛽)=(𝛼 ,𝛽 ) and 𝐷 𝑈 0 = 𝜕𝑔 𝑗 (𝛼, 𝛽)/𝜕 𝛽| (𝛼,𝛽)=(𝛼 ,𝛽 ) , 𝑗 = 1, 2, 3. 𝑗1 0 0 𝑗2 0 0 √ Then the following theorem gives the 𝑁−asymptotic normality result. Theorem E.2.(Asymptotic Normality): Assume that 1. The assumptions in Theorem E.1 hold 2. (𝛼0 , 𝛽0 ) ∈ 𝑖𝑛𝑡 (A × B). 3. 𝑔(𝛼, 𝛽) is twice continuously differentiable on 𝑖𝑛𝑡 (A × B). 4. 𝐷 𝑈0 is of full rank 𝐿 1 + 𝐿 2 . Then, √ 0 , 𝛽ˆ0 ) 0 − (𝛼0 , 𝛽0 ) 0] − 𝑑 0 𝐶 −1 𝐷 ) −1 ]. 𝑁 [( 𝛼ˆ 𝑈𝐽 𝑈𝐽 0 0 −−−→ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, (𝐷 𝑈0 0 𝑈0 The proof follows in a straightforward manner from Theorem 3.4 of Newey and McFadden (1994) and hence is omitted. 112 APPENDIX F TABLES FOR CHAPTER 2 Table F.1: Summary of missing data methods used in 5 highly ranked economics journals from 2018 to August 2020. Total % Missingness % CC % DVM % RI % Other American Economic Review 319 20.69 71.21 16.67 15.15 15.15 Quarterly Journal of Economics 109 28.44 74.19 9.68 9.68 29.68 Journal of Labor Economics 109 35.78 58.97 15.38 10.26 17.95 Journal of Human Resources 98 43.88 46.51 32.56 11.63 16.28 Journal of Political Economy 211 19.91 59.52 16.67 21.43 14.29 Total 846 26.12 62.44 18.55 14.03 17.65 1 Column 1 shows the total number of papers published. Column 2 shows the percentage of papers that reported missing values. Columns 3-6 show the percentage of papers that used the complete cases estimator, the dummy variable method, the two-step regression imputation, and other methods respectively. 2 The row percentages add to more than 100 because some papers use multiple methods. 3 The articles that do not explicitly mention the method of imputation are included in the two-step regression imputation category since this is the most frequently used method within the imputation category. 113 Table F.2: Effect of grade variance on probability of having a 4 year college degree. Complete cases Joint GMM % ↓ in s.e. 
Plug-in DVM Log(income) 0.148 0.148 0.149 0.150 (0.042) (0.041) 2.38 (0.042) (0.042) GSD -0.146 -0.140 -0.138 -0.139 (0.039) (0.035) 10.26 (0.037) (0.035) GPA 0.329 0.331 0.338 0.339 (0.049) (0.043) 12.24 (0.043) (0.043) Black 0.413 0.407 0.386 0.395 (0.128) (0.116) 9.38 (0.114) (0.114) Hispanic 0.539 0.445 0.404 0.419 (0.147) (0.135) 8.16 (0.138) (0.135) Live in south 0.140 0.149 0.144 0.137 (0.065) (0.058) 10.77 (0.057) (0.057) Lived in urban area 0.093 0.080 0.082 0.083 (0.068) (0.061) 10.29 (0.062) (0.060) Mother’s education 0.060 0.057 0.055 0.056 (0.015) (0.014) 6.67 (0.014) (0.014) Father’s education 0.063 0.063 0.062 0.062 (0.011) (0.010) 9.09 (0.010) (0.010) Female -0.128 -0.132 -0.138 -0.146 (0.059) (0.053) 10.17 (0.052) (0.052) Cognitive skills 0.436 0.420 0.400 0.404 (0.050) (0.044) 12 (0.044) (0.044) Non-Cognitive skills 0.012 0.015 0.018 0.016 (0.030) (0.027) 10 (0.028) (0.027) N 3219 3942 3942 3942 p-value for J stat 0.590 114 APPENDIX G PROOFS FOR CHAPTER 3 Proof of Lemma 3.4.1 Starting with 𝑓1𝑖 (.), we want to show that E Í𝑇 0  Í𝑇 0 𝑡=1 𝑠𝑖𝑡 𝑥¥𝑖𝑡 𝑢¥𝑖𝑡 = 0. Since 𝑡=1 𝑠𝑖𝑡 𝑥¥𝑖𝑡 𝑢¥𝑖𝑡 = Í𝑇 0 Í𝑇 0  𝑡=1 𝑠𝑖𝑡 𝑥¥𝑖𝑡 𝑢𝑖𝑡 , we want to show that E 𝑡=1 𝑠𝑖𝑡 𝑥¥𝑖𝑡 𝑢𝑖𝑡 = 0.  First, Assumption 3.2.1 implies by the law of iterated expectations (LIE) that E 𝑢𝑖𝑡 |𝑥𝑖 , 𝑠𝑖 = 0. Now, ∀ 𝑡 = 1, . . . , 𝑇 𝐸 (𝑠𝑖𝑡 𝑥¥𝑖𝑡0 𝑢𝑖𝑡 ) = E[E 𝑠𝑖𝑡 𝑥¥𝑖𝑡0 𝑢𝑖𝑡 |𝑥𝑖 , 𝑠𝑖 ] = E[𝑠𝑖𝑡 𝑥¥𝑖𝑡0 E 𝑢𝑖𝑡 |𝑥𝑖 , 𝑠𝑖 ] = 0.   Therefore, E Í𝑇 𝑠 ¥ 𝑥 0 𝑢  = 0. 𝑡=1 𝑖𝑡 𝑖𝑡 𝑖𝑡 Using a similar argument for 𝑓2𝑖 (.), we want to show that E Í𝑇 𝑠 ¥ 𝑥 0 𝑟  = 0. Now, ∀ 𝑡=1 𝑖𝑡 2𝑖𝑡 𝑖𝑡 𝑡 = 1, . . . , 𝑇 0 𝑟 ) = E[E 𝑠 𝑥¥0 𝑟 |𝑥 , 𝑠  ] = E[𝑠 𝑥¥0 E 𝑟 |𝑥 , 𝑠  ] = 0. E(𝑠𝑖𝑡 𝑥¥2𝑖𝑡 𝑖𝑡 𝑖𝑡 2𝑖𝑡 𝑖𝑡 2𝑖 𝑖 𝑖𝑡 2𝑖𝑡 𝑖𝑡 2𝑖 𝑖  The last equality follows from E 𝑟𝑖𝑡 |𝑥 2𝑖 , 𝑠𝑖 = 0 which follows from Assumption 3.2.2 and LIE. Í𝑇 0  Therefore, E 𝑡=1 𝑠𝑖𝑡 𝑥¥2𝑖𝑡 𝑟𝑖𝑡 = 0. Í For 𝑓3𝑖 (.), we want to show that E[ 𝑇𝑡=1 (1 − 𝑠𝑖𝑡 ) 𝑥¤2𝑖𝑡 0 𝑣 ] = 0. First, note that using the LIE, 𝑖𝑡   Assumption 3.2.1 implies that E 𝑢𝑖𝑡 |𝑥 2𝑖 , 𝑠𝑖 = 0. This combined with E 𝑟𝑖𝑡 |𝑥 2𝑖 , 𝑠𝑖 = 0 implies   that E 𝑣𝑖𝑡 |𝑥 2𝑖 , 𝑠𝑖 = E 𝛽1𝑟𝑖𝑡 + 𝑢𝑖𝑡 |𝑥2𝑖 , 𝑠𝑖 = 0. Now, ∀ 𝑡 = 1, . . . , 𝑇 0 𝑣 ] = E{E[(1 − 𝑠 ) 𝑥¤0 𝑣 |𝑥 , 𝑠 ]} = E[(1 − 𝑠 ) 𝑥¤0 E(𝑣 |𝑥 , 𝑠 )] = 0. E[(1 − 𝑠𝑖𝑡 ) 𝑥¤2𝑖𝑡 𝑖𝑡 𝑖𝑡 2𝑖𝑡 𝑖𝑡 2𝑖 𝑖 𝑖𝑡 2𝑖𝑡 𝑖𝑡 2𝑖 𝑖 Í and hence E[ 𝑇𝑡=1 (1 − 𝑠𝑖𝑡 ) 𝑥¤2𝑖𝑡 0 𝑣 ] = 0. 𝑖𝑡 Proof of Proposition 3.4.2.1 𝛽ˆ 𝐷 is obtained by estimating the parameters in equation (3.4.3) using POLS. POLS will be consistent if 0    Í𝑇    𝑔1𝑖 (.)   𝑡=1 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑒`𝑖𝑡  0         Í    E 𝑔2𝑖 (.)  ≡ E  𝑇 (1 − 𝑠`𝑖𝑡 ) 𝑒`𝑖𝑡  = 0 . (F.1)    𝑡=1     Í𝑇 0 𝑔3𝑖 (.)  ` ` 0        𝑡=1 𝑥 2𝑖𝑡 𝑒 𝑖𝑡        115 We are going to show that each of these holds true iff either 𝛽1 = 0 or 𝜋2 = 𝑑𝑖 = 0 ∀ 𝑖. First, note that 𝑒`𝑖𝑡 = 𝑒𝑖𝑡 − 𝑒¯𝑖 =[(1 − 𝑠𝑖𝑡 )𝑥 22𝑖𝑡 − (1 − 𝑠𝑖 )𝑥 22𝑖 ]𝜋2 𝛽1 + [(1 − 𝑠𝑖𝑡 )𝑑𝑖 − (1 − 𝑠𝑖 )𝑑𝑖 ] 𝛽1 + [(1 − 𝑠𝑖𝑡 )𝑟𝑖𝑡 − (1 − 𝑠𝑖 )𝑟𝑖 ] 𝛽1 + [𝑢𝑖𝑡 − 𝑢¯𝑖 ], where Õ 𝑇 (1 − 𝑠𝑖 )𝑥 22𝑖 = 𝑇𝑖−1 (1 − 𝑠𝑖𝑞 )𝑥 22𝑖𝑞 𝑞=1 Õ 𝑇 (1 − 𝑠𝑖 )𝑑𝑖 = 𝑇𝑖−1 (1 − 𝑠𝑖𝑞 )𝑑𝑖 = (1 − 𝑇 −1𝑇𝑖 )𝑑𝑖 𝑞=1 Õ𝑇 (1 − 𝑠𝑖 )𝑟𝑖 = 𝑇𝑖−1 (1 − 𝑠𝑖𝑞 )𝑟𝑖𝑞 . 𝑞=1 Starting with 𝑔1𝑖 , the first term is Õ𝑇 Õ𝑇 0 [(1− 𝑠 )𝑥 0 [(1− 𝑠 )𝑥 E{ 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑖𝑡 22𝑖𝑡 − (1 − 𝑠𝑖 )𝑥 22𝑖 ]𝜋2 𝛽1 } = E{ 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑖𝑡 22𝑖𝑡 − (1 − 𝑠𝑖 )𝑥 22𝑖 ]𝜋2 𝛽1 }. 𝑡=1 𝑡=1 Consider this expectation for each 𝑡 separately. It is 0 iff either 𝜋2 = 0 or 𝛽1 = 0 or both are 0. If neither of these conditions holds, then this term will be a non-zero number, except by fluke. For the second term, 0 [(1 − 𝑠 ) − (1 − 𝑠 )𝑑 ]𝑑 𝛽 } E{ 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑖𝑡 𝑖 𝑖 𝑖 1 is zero ∀ 𝑡 iff 𝛽1 = 0 or 𝑑𝑖 = 0 ∀ 𝑖 or both. 
For the third term, 0 [(1 − 𝑠 )𝑟 − (1 − 𝑠 )𝑟 ] 𝛽 } E{ 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑖𝑡 𝑖𝑡 𝑖 𝑖 1 is zero ∀ 𝑡 iff 𝛽1 = 0. For the fourth term, 0 (𝑢 − 𝑢¯ )] E[ 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑖𝑡 𝑖 is zero ∀ 𝑡 under Assumption 3.2.1. Moving on to 𝑔2𝑖 , for the first term E{[(1 − 𝑠𝑖𝑡 ) − (1 − 𝑇 −1𝑇𝑖 )] [(1 − 𝑠𝑖𝑡 )𝑥22𝑖𝑡 − (1 − 𝑠𝑖 )𝑥 22𝑖 ]𝜋2 𝛽1 } 116 is zero ∀ 𝑡 iff 𝜋2 = 0 or 𝛽1 = 0 or both. For the second term, E{[(1 − 𝑠𝑖𝑡 ) − (1 − 𝑇 −1𝑇𝑖 )] 2 𝑑𝑖 𝛽1 } is zero ∀ 𝑡 iff 𝛽1 = 0 or 𝑑𝑖 = 0 ∀ 𝑖 or both. For the third term, E{[(1 − 𝑠𝑖𝑡 ) − (1 − 𝑇 −1𝑇𝑖 )] [(1 − 𝑠𝑖𝑡 )𝑟𝑖𝑡 − (1 − 𝑠𝑖 )𝑟𝑖 ] 𝛽1 } is zero under Assumption 3.2.2. For the fourth term, E[[(1 − 𝑠𝑖𝑡 ) − (1 − 𝑇 −1𝑇𝑖 )] (𝑢𝑖𝑡 − 𝑢¯𝑖 )] is zero ∀ 𝑡 under Assumption 3.2.1. Moving on to 𝑔3𝑖 , for the first term 0 [(1 − 𝑠 )𝑥 E{𝑥`2𝑖𝑡 𝑖𝑡 22𝑖𝑡 − (1 − 𝑠𝑖 )𝑥 22𝑖 ]𝜋2 𝛽1 } is zero ∀ 𝑡 iff 𝜋2 = 0 or 𝛽1 = 0 or both. For the second term, 0 [(1 − 𝑠 ) − (1 − 𝑇 −1𝑇 )]𝑑 𝛽 } E{𝑥`2𝑖𝑡 𝑖𝑡 𝑖 𝑖 1 is zero ∀ 𝑡 iff 𝛽1 = 0 or 𝑑𝑖 = 0 ∀ 𝑖 or both. For the third term, 0 [(1 − 𝑠 )𝑟 − (1 − 𝑠 )𝑟 ] 𝛽 } E{𝑥`2𝑖𝑡 𝑖𝑡 𝑖𝑡 𝑖 𝑖 1 is zero under Assumption 3.2.2. For the fourth term, 0 (𝑢 − 𝑢¯ )] E[ 𝑥`2𝑖𝑡 𝑖𝑡 𝑖 is zero ∀ 𝑡 under Assumption 3.2.1. Thus, for each of the moment conditions in (F.1) to be zero, we need either 𝛽1 = 0 or 𝜋2 = 𝑑𝑖 = 0 ∀ 𝑖. Proof of Proposition 3.4.3.1 Let the error 𝛽1 (1 − 𝑠𝑖𝑡 ) [ 𝑥¥2𝑖𝑡 (𝜋 − 𝜋ˆ 𝐼𝑚 𝑝 ) + 𝑟¥𝑖𝑡 ] + 𝑢¥𝑖𝑡 ≡ 𝑒¥𝑖𝑡 and let the set of regressors [𝑠𝑖𝑡 𝑥¥1𝑖𝑡 + (1 − 𝑠𝑖𝑡 ) 𝑥¥2𝑖𝑡 𝜋ˆ 𝐼𝑚 𝑝 𝑥¥2𝑖𝑡 ] ≡ 𝑧¥𝑖𝑡 117 The POLS estimator is 𝑧¥0𝑖𝑡 𝑧¥𝑖𝑡 −1 ÕÕ ÕÕ 𝛽ˆ 𝐼𝑚 𝑝 = 𝑧¥0𝑖𝑡 𝑦¥𝑖𝑡  (F.2) 𝑖 𝑡 𝑖 𝑡 ÕÕ  −1 ÕÕ = 𝛽 + 𝑁 −1 𝑧¥0𝑖𝑡 𝑧¥𝑖𝑡 𝑁 −1 𝑧¥0𝑖𝑡 𝑒¥𝑖𝑡 (G.0.1) 𝑖 𝑡 𝑖 𝑡 Consider the probability limit of the term 𝑁 −1 𝑖 𝑡 𝑧¥0𝑖𝑡 𝑒¥𝑖𝑡 . Plugging in the definitions of 𝑧¥𝑖𝑡 Í Í and 𝑒¥𝑖𝑡 , the first term is Õ Õ Õ 𝑝𝑙𝑖𝑚 𝑁 −1 𝑠𝑖𝑡 𝑥¥1𝑖𝑡 𝑢¥𝑖𝑡 = E(𝑠𝑖𝑡 𝑥¥1𝑖𝑡 𝑢¥𝑖𝑡 ) = 0 𝑡 𝑖 𝑡 This last equality is due to Assumption 3.2.1. The second term is Õ Õ Õ 𝑝𝑙𝑖𝑚 𝑁 −1 (1 − 𝑠𝑖𝑡 ) 𝜋ˆ 0𝐼𝑚 𝑝 𝑥¥2𝑖𝑡0 𝑢¥ = 𝑖𝑡 E[(1 − 𝑠𝑖𝑡 )𝜋0𝑥¥2𝑖𝑡 0 𝑢¥ ] = 0 𝑖𝑡 𝑡 𝑖 𝑡 where the last equality again holds because of Assumption 3.2.1. The third term is Õ Õ 𝑝𝑙𝑖𝑚 𝑁 −1 (1 − 𝑠𝑖𝑡 ) 𝜋ˆ 0𝐼𝑚 𝑝 𝑥¥2𝑖𝑡 0 [𝑟¥ + 𝑥¥ (𝜋 − 𝜋)] 𝑖𝑡 2𝑖𝑡 ˆ 𝛽1 𝑡 𝑖 Õ = E[(1 − 𝑠𝑖𝑡 )𝜋0𝑥¥2𝑖𝑡 0 𝑟¥ ] 𝛽 = 0 𝑖𝑡 1 𝑡 The second equality here holds because 𝜋ˆ 𝐼𝑚 𝑝 is a consistent estimator of 𝜋, and the third holds due to Assumption 3.2.2. The fourth term is Õ Õ Õ 𝑝𝑙𝑖𝑚 𝑁 −1 0 𝑢¥ = 𝑥¥2𝑖𝑡 𝑖𝑡 0 𝑢¥ ) = 0 E( 𝑥¥2𝑖𝑡 𝑖𝑡 𝑡 𝑖 𝑡 where the last equality holds due to Assumption 3.2.1. Finally, Õ Õ 𝑝𝑙𝑖𝑚 𝑁 −1 0 (1 − 𝑠 ) [𝑟¥ + 𝑥¥ (𝜋 − 𝜋)] 𝑥¥2𝑖𝑡 𝑖𝑡 𝑖𝑡 2𝑖𝑡 ˆ 𝛽1 𝑡 𝑖 Õ = E[(1 − 𝑠𝑖𝑡 ) 𝑥¥2𝑖𝑡 0 𝑟¥ ] 𝛽 = 0 𝑖𝑡 1 𝑡 as proved above. Since 𝑝𝑙𝑖𝑚 𝑁 −1 𝑖 𝑡 𝑧¥0𝑖𝑡 𝑒¥𝑖𝑡 = 0, from (F.2), 𝑝𝑙𝑖𝑚 𝛽ˆ 𝐼𝑚 𝑝 = 𝛽. Í Í Proof of Lemma 3.6.1 118 (i) Start with E[𝑚 1𝑖 (𝛽, 𝜋)] = 0. This will hold true if     𝑠𝑖 𝑝 𝑥 𝑠𝑖𝑡 𝑢˜𝑖 (𝑡)  0 1𝑖 𝑝 =      E  𝑥 0 𝑠 𝑢˜ (𝑡)  0  2𝑖 𝑝 𝑖𝑡 𝑖        Now, we can write  𝑇 Õ  E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢˜𝑖 (𝑡)] = E(𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢𝑖𝑡 ) − E 𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 𝑢𝑖𝑞 𝑞=𝑡+1 𝑇 Õ = E(𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢𝑖𝑡 ) − E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 𝑢𝑖𝑞 ] 𝑞=𝑡+1 𝑇 Õ = E(𝑠𝑖 𝑝 𝑠𝑖𝑡 ) E(𝑥 1𝑖 𝑝 𝑢𝑖𝑡 ) − E[𝑠𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 ] E(𝑥 1𝑖 𝑝 𝑢𝑖𝑞 ) 𝑞=𝑡+1 =0 The first equality follows from just the definition of 𝑢𝑖 (𝑡), the third follows from s𝑖 |= (x𝑖 , u𝑖 , r𝑖 ) and 0 𝑠 𝑢˜ (𝑡)] = 0. the last one follows from Assumption 3.6.1. Similarly, E[𝑥 2𝑖 𝑝 𝑖𝑡 𝑖 0 𝑠 𝑟˜ ] = 0. 
We can write Moving on to E[𝑚 2𝑖 (𝛽, 𝜋)] = 0, we need E[𝑥 2𝑖 𝑝 𝑖𝑡 𝑖𝑡  𝑇 Õ  0 𝑠 𝑟˜ (𝑡)] E[𝑥2𝑖 = 0 𝑠 𝑟 ) E(𝑥 2𝑖 −E 0 𝑠 𝑇 (𝑡) −1 𝑥 2𝑖 𝑠𝑖𝑞 𝑟𝑖𝑞 𝑝 𝑖𝑡 𝑖 𝑝 𝑖𝑡 𝑖𝑡 𝑝 𝑖𝑡 𝑖 𝑞=𝑡+1 𝑇 Õ   = 0 𝑠 𝑟 ) E(𝑥 2𝑖 − E 0 𝑠 𝑇 (𝑡) −1 𝑠 𝑟 𝑥 2𝑖 𝑝 𝑖𝑡 𝑖𝑡 𝑝 𝑖𝑡 𝑖 𝑖𝑞 𝑖𝑞 𝑞=𝑡+1 𝑇 Õ = 0 𝑟 ) E(𝑠𝑖𝑡 ) E(𝑥 2𝑖 − E[𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 ] E(𝑥 2𝑖 0 𝑟 ) 𝑝 𝑖𝑡 𝑝 𝑖𝑞 𝑞=𝑡+1 =0 The third equality follows from s𝑖 |= (x𝑖 , u𝑖 , r𝑖 ) and the last one follows from Assumption 3.6.2. Finally we consider the third set of moment conditions E[𝑚 2𝑖 (𝛽, 𝜋)] = 0, for which we need 119 0 (1 − 𝑠 ) 𝑣˘ ] = 0. We can write E[𝑥 0 (1 − 𝑠 ) 𝑣˘ (𝑡)] equals E[𝑥2𝑖 𝑝 𝑖𝑡 𝑖𝑡 2𝑖 𝑝 𝑖𝑡 𝑖 Õ𝑇 E[𝑥 2𝑖0 (1 − 𝑠 )𝑣 ] − E[𝑥 2𝑖 0 (1 − 𝑠 )(𝑇 − 𝑡 − 𝑇𝑖 (𝑡)) −1 (1 − 𝑠𝑖𝑞 )𝑣𝑖𝑞 ] 𝑝 𝑖𝑡 𝑖𝑡 𝑝 𝑖𝑡 𝑞=𝑡+1 Õ 𝑇 = E[𝑥2𝑖 0 (1 − 𝑠 )𝑣 ] − E[𝑥2𝑖 0 (1 − 𝑠 )(𝑇 − 𝑡 − 𝑇 (𝑡)) −1 (1 − 𝑠 )𝑣 ] 𝑝 𝑖𝑡 𝑖𝑡 𝑝 𝑖𝑡 𝑖 𝑖𝑞 𝑖𝑞 𝑞=𝑡+1 Õ𝑇 = E(1 − 𝑠𝑖𝑡 ) E(𝑥 2𝑖 0 𝑣 )− E[(1 − 𝑠𝑖𝑡 )(𝑇 − 𝑡 − 𝑇𝑖 (𝑡)) −1 (1 − 𝑠𝑖𝑞 )] E(𝑥 2𝑖 0 𝑣 ) 𝑝 𝑖𝑡 𝑝 𝑖𝑞 𝑞=𝑡+1 =0 where the last equality follows from E(𝑥 2𝑖 0 𝑣 ) = 0 which follows from Assumptions 3.6.1 and 𝑝 𝑖𝑞 3.6.2. (ii) Starting with E[𝑚 1𝑖 (𝛽, 𝜋)] = 0, we can first write Õ𝑇 E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢˜𝑖 (𝑡)] = E(𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢𝑖𝑡 ) − E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 𝑢𝑖𝑞 ] 𝑞=𝑡+1 Õ𝑇 = E[E(𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢𝑖𝑡 |x𝑖𝑡 , s𝑖 )] − E{E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 𝑢𝑖𝑞 |x𝑖𝑡 , s𝑖 ]} 𝑞=𝑡+1 Õ𝑇 = E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 E(𝑢𝑖𝑡 |x𝑖𝑡 , s𝑖 )] − E{𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 E[𝑢𝑖𝑞 |x𝑖𝑡 , s𝑖 ]} 𝑞=𝑡+1 =0 The second equality follows from the LIE and the fourth follows from Assumption 3.6.1’. This is because using the LIE, Assumption 3.6.1’ implies that E(𝑢𝑖𝑡 |x𝑖𝑡 , s𝑖 ) = 0 for every 𝑡 = 1, . . . , 𝑇. 𝑞 Moreover, since E(𝑢𝑖𝑞 |x𝑖 , s𝑖 ) = 0 for 𝑞 = 𝑡 + 1, . . . , 𝑇, using the LIE implies that E(𝑢𝑖𝑞 |x𝑖𝑡 , s𝑖 ) = 0 for any 𝑡 < 𝑞. Similarly, E[𝑥 2𝑖 0 𝑠 𝑢˜ (𝑡)] = 0. 𝑝 𝑖𝑡 𝑖 We can write a similar proof for E[𝑚 2𝑖 (𝛽, 𝜋)] = 0 using the LIE and Assumption 3.6.2’. For 120 E[𝑚 3𝑖 (𝛽, 𝜋)] = 0, write Õ𝑇 0 (1 − 𝑠 ) 𝑣˘ (𝑡)] E[𝑥2𝑖 = 0 (1 − 𝑠 )𝑣 ] E[𝑥 2𝑖 − 0 (1 − 𝑠 )(𝑇 − 𝑡 − 𝑇 (𝑡)) −1 (1 − 𝑠 )𝑣 ] E[𝑥 2𝑖 𝑝 𝑖𝑡 𝑖 𝑝 𝑖𝑡 𝑖𝑡 𝑝 𝑖𝑡 𝑖 𝑖𝑞 𝑖𝑞 𝑞=𝑡+1 = E{E[𝑥 2𝑖0 (1 − 𝑠 )𝑣 |𝑥 𝑡 , s ]} 𝑝 𝑖𝑡 𝑖𝑡 2𝑖 𝑖 Õ𝑇 − E{E[𝑥 2𝑖 0 (1 − 𝑠 )(𝑇 − 𝑡 − 𝑇 (𝑡)) −1 (1 − 𝑠 )𝑣 |𝑥 𝑡 , s ]} 𝑝 𝑖𝑡 𝑖 𝑖𝑞 𝑖𝑞 2𝑖 𝑖 𝑞=𝑡+1 0 (1 − 𝑠 ) E[𝑣 |𝑥 𝑡 , s ]} = E{𝑥 2𝑖 𝑝 𝑖𝑡 𝑖𝑡 2𝑖 𝑖 Õ𝑇 − 0 (1 − 𝑠 )(𝑇 − 𝑡 − 𝑇 (𝑡)) −1 (1 − 𝑠 ) E[𝑣 |𝑥 𝑡 , s ]} E{𝑥2𝑖 𝑝 𝑖𝑡 𝑖 𝑖𝑞 𝑖𝑞 2𝑖 𝑖 𝑞=𝑡+1 =0 where the second equality follows from the LIE and the fourth from Assumptions 3.6.1’ and 3.6.2’. This is because using the LIE and the fact that 𝑣𝑖𝑡 = 𝛽1𝑟𝑖𝑡 + 𝑢𝑖𝑡 , Assumptions 3.6.1’ and 3.6.2’ imply 𝑞 that E(𝑣𝑖𝑡 |x𝑡2𝑖 , s𝑖 ) = 0 for every 𝑡 = 1, . . . , 𝑇. Moreover, since E(𝑣𝑖𝑞 |x2𝑖 , s𝑖 ) = 0 for 𝑞 = 𝑡 + 1, . . . , 𝑇, using the LIE implies that E(𝑣𝑖𝑞 |x𝑡2𝑖 , s𝑖 ) = 0 for any 𝑡 < 𝑞. 121 APPENDIX H EXTENSIONS TO CHAPTER 3 H.1 Missing vectors In the model of interest (3.2.1), we assumed that 𝑥 1𝑖𝑡 is a scalar. We can extend this framework to the case where 𝑥 1𝑖𝑡 is a 𝑚 × 1 vector, all elements of which are missing at the same time. In other words, if one element of 𝑥1𝑖𝑡 is missing for observation 𝑖 at time 𝑡, then so are all the other elements of 𝑥 1𝑖𝑡 . This does not fundamentally change the analysis and the single missing data indicator 𝑠𝑖𝑡 is still sufficient to characterize missingness. The population model is given by 𝑦𝑖𝑡 = 𝑥 1𝑖𝑡 𝛽1 + 𝑥 2𝑖𝑡 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑡 ≡ 𝑥𝑖𝑡 𝛽 + 𝑐𝑖 + 𝑢𝑖𝑡 , 𝑡 = 1, . . . , 𝑇, (H.1.1) which is the same as equation (3.2.1) except 𝑥 1𝑖𝑡 is a 1 × 𝑚 vector now. The imputation equations are a set of 𝑚 equations (one for each element in 𝑥1𝑖𝑡 ). 
𝑥1𝑖𝑡 = 𝑥 2𝑖𝑡 Π + 𝑑𝑖 + 𝑟𝑖𝑡 (H.1.2) where Π is a 𝑘 × 𝑚 matrix and 𝑑𝑖 is a 1 × 𝑚 vector. The reduced form is 𝑦𝑖𝑡 = (𝑥 2𝑖𝑡 Π + 𝑑𝑖 + 𝑟𝑖𝑡 ) 𝛽1 + 𝑥 2𝑖𝑡 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑡 ≡ 𝑥 2𝑖𝑡 𝛾 + ℎ𝑖 + 𝑣𝑖𝑡 , (H.1.3) where 𝛾 ≡ Π𝛽1 + 𝛽2 , ℎ𝑖 ≡ 𝑑𝑖 𝛽1 + 𝑐𝑖 , and 𝑣𝑖𝑡 ≡ 𝑟𝑖𝑡 𝛽1 + 𝑢𝑖𝑡 . Since all elements of 𝑥 1𝑖𝑡 are missing at the same time, the definition of the missing data indicator given in section 3 is still sufficient to characterize missingness. That is, 𝑠𝑖𝑡 = 1 if 𝑥 1𝑖𝑡 is observed and 0 otherwise. Then the joint GMM is based on the following set of moment functions. 0  Í𝑇    𝑡=1 𝑠𝑖𝑡 𝑥¥𝑖𝑡 ( 𝑦¥𝑖𝑡 − 𝑥¥1𝑖𝑡 𝛽1 − 𝑥¥2𝑖𝑡 𝛽2 )    𝑓1𝑖 (𝛽, Π)          𝑓𝑖 (𝛽, Π) =  Í 𝑇 𝑠 𝑥¥0 ⊗ ( 𝑥¥ − 𝑥¥ Π) 0  ≡  𝑓 (𝛽, Π)  (H.1.4) 𝑡=1 𝑖𝑡 2𝑖𝑡 1𝑖𝑡 2𝑖𝑡   2𝑖  Í     𝑇 0 𝑦¤ − 𝑥¤ (𝛽 Π + 𝛽 )   𝑡=1 (1 − 𝑠𝑖𝑡 ) 𝑥¤2𝑖𝑡    𝑓 (𝛽, Π)   𝑖𝑡 2𝑖𝑡 1 2    3𝑖   122 This is a set of 𝑘 (2 + 𝑚) + 𝑚 moment conditions with 𝑘 (1 + 𝑚) + 𝑚 parameters to estimate. Thus the number of over-identifying restrictions still equals 𝑘. Note that 𝑓2𝑖 (.) is still a set of exactly identified moment functions, and hence Lemma 3.4.2 is still valid. The rest of the GMM estimation proceeds the same way as in Section 4, except the matrices 𝐶 and 𝐷 are now based on the moment conditions in (G.4). This framework can further be extended to the case where the elements of 𝑥1𝑖𝑡 are not missing at the same time. Although it leads to loss of some information in this case, it is still more efficient than using the complete case analysis. For instance, consider the case where in equation (3.2.1), 𝑥 1𝑖𝑡 = [𝑤𝑖𝑡 𝑤𝑖,𝑡−1 ], where 𝑤𝑖𝑡 is a policy variable. If 𝑤𝑖𝑡 contains missing values, then so does 𝑤𝑖,𝑡−1 . In this case, the missingness cannot be entirely characterized with a single missing data indicator as 𝑤𝑖𝑡 and 𝑤𝑖,𝑡−1 are missing in different time periods for observation 𝑖. We define the selection indicators as the following.   1 if both 𝑤𝑖𝑡 and 𝑤𝑖,𝑡−1 are observed 𝑡 = 1, ..., 𝑇    𝑠1𝑖𝑡 =  0 otherwise      1 if neither 𝑤𝑖𝑡 nor 𝑤𝑖,𝑡−1 is observed 𝑡 = 1, ..., 𝑇    𝑠2𝑖𝑡 =  0 otherwise    Thus, the complete cases are those time periods for individual 𝑖 for which 𝑤𝑖 is observed in both the current and the previous period, and are characterized by 𝑠1𝑖𝑡 = 1. One option in this case is to estimate 𝛽 using the complete cases fixed effects, as discussed in Section 4. However, we can also use the joint GMM by utilizing the observations for which 𝑠2𝑖𝑡 = 1. Note that 𝑠2𝑖𝑡 does not characterize all the incomplete cases. It is equal to 1 only for the observations for which neither 𝑤𝑖𝑡 nor 𝑤𝑖,𝑡−1 is observed, and 0 for both the complete cases as well as the observations for which either 𝑤𝑖𝑡 or 𝑤𝑖,𝑡−1 is observed. It thus does not make use of the observations for which both 𝑠1𝑖𝑡 and 𝑠2𝑖𝑡 are 0. We impose the following assumption on the population distribution. 123 Assumption G.1 For every 𝑡 = 1, . . . , 𝑇, (i) E(𝑠1𝑖𝑡 𝑥¥𝑖𝑡0 𝑢𝑖𝑡 ) = 0 (ii) E(𝑠1𝑖𝑡 𝑥¥2𝑖𝑡 0 𝑟 ) = 0 (iii) 𝑖𝑡 0 𝑣 )=0 E(𝑠2𝑖𝑡 𝑥¤2𝑖𝑡 𝑖𝑡 The joint GMM is then based on the following moment functions. 
0 ( 𝑦¥ − 𝑥¥ 𝛽 − 𝑥¥ 𝛽 )   Í𝑇     𝑠 1𝑖𝑡 ¥ 𝑥 𝑖𝑡 1𝑖𝑡 1 2𝑖𝑡 2  𝑓1𝑖 (𝛽, Π)   𝑡=1 𝑖𝑡        𝑓𝑖 (𝛽, 𝜋) =   Í𝑇 𝑠 ¥ 𝑥 0 ( ¥ 𝑥 − 𝑥¥ 𝜋)  ≡  𝑓2𝑖 (𝛽, Π)   (H.1.5) 𝑡=1 1𝑖𝑡 2𝑖𝑡 1𝑖𝑡 2𝑖𝑡  Í  𝑇 0      𝑡=1 𝑠2𝑖𝑡 𝑥¤2𝑖𝑡 𝑦¤ 𝑖𝑡 − 𝑥¤2𝑖𝑡 (𝛽1 𝜋 + 𝛽2 )   𝑓3𝑖 (𝛽, Π)       where Õ 𝑇 Õ 𝑇 𝑥¥𝑖𝑡 = 𝑥𝑖𝑡 − ( 𝑠1𝑖𝑡 ) −1 𝑠1𝑖𝑞 𝑥𝑖𝑞 𝑞=1 𝑞=1 Õ 𝑇 Õ 𝑇 𝑦¥𝑖𝑡 = 𝑦𝑖𝑡 − ( 𝑠1𝑖𝑡 ) −1 𝑠1𝑖𝑞 𝑦𝑖𝑞 𝑞=1 𝑞=1 Õ 𝑇 Õ 𝑇 𝑥¤𝑖𝑡 = 𝑥𝑖𝑡 − ( 𝑠2𝑖𝑡 ) −1 𝑠2𝑖𝑞 𝑥𝑖𝑞 𝑞=1 𝑞=1 Õ 𝑇 Õ 𝑇 𝑦¤ 𝑖𝑡 = 𝑦𝑖𝑡 − ( 𝑠2𝑖𝑡 ) −1 𝑠2𝑖𝑞 𝑦𝑖𝑞 𝑞=1 𝑞=1 That is, for 𝑓1𝑖 (.) and 𝑓2𝑖 (.), the variables are still time demeaned using the complete cases, but for 𝑓3𝑖 (.), they are time demeaned using only the observations for which neither 𝑤𝑖𝑡 nor 𝑤𝑖,𝑡−1 is observed. Note that the moment functions 𝑓2𝑖 (.) imply that both 𝑤𝑖𝑡 and 𝑤𝑖,𝑡−1 will be imputed using the same covariates 𝑥 2𝑖𝑡 The rest of the GMM estimation proceeds in the usual fashion using the moment functions in (G.5). In order to utilize all the incomplete cases, we can further extend this framework by introducing a separate selection indicator for 𝑤𝑖𝑡 and 𝑤𝑖,𝑡−1 and writing a separate imputation equation (with different sets of covariates) for each of these. 124 H.2 Time varying unobserved heterogeneity We can extend the basic model in Section 2 to allow for the unobserved heterogeneity to vary over time. So instead of equation (3.2.1), our model of interest is now 𝑦𝑖𝑡 = 𝑥𝑖𝑡 𝛽 + 𝜂𝑡 𝑐𝑖 + 𝑢𝑖𝑡 , 𝑡 = 1, . . . , 𝑇 . (H.2.1) The coefficients of 𝑐𝑖 are now 𝜂𝑡 which are time-varying parameters to be estimated. We also allow for time-varying heterogeneity in the imputation model. The new model is 𝑥 1𝑖𝑡 = 𝑥 2𝑖𝑡 𝜋 + 𝜁𝑡 𝑑𝑖 + 𝑟𝑖𝑡 , 𝑡 = 1, . . . , 𝑇 . (H.2.2) The reduced form then becomes 𝑦𝑖𝑡 = 𝑥 2𝑖𝑡 𝛾 + ℎ𝑖𝑡 + 𝑣𝑖𝑡 , 𝑡 = 1, . . . , 𝑇 (H.2.3) where 𝛾 ≡ 𝛽1 𝜋 + 𝛽2 , ℎ𝑖𝑡 ≡ 𝛽1 𝜁𝑡 𝑑𝑖 + 𝜂𝑡 𝑐𝑖 , and 𝑣𝑖𝑡 ≡ 𝛽1𝑟𝑖𝑡 + 𝑢𝑖𝑡 .1 The question we consider here is that under what assumptions will the joint GMM defined in Section 3 consistently estimates 𝛽 and 𝜋. Starting with equation (G.6), if we time demean using the complete cases, we get 𝑦¥𝑖𝑡 = 𝑥¥𝑖𝑡 𝛽 + 𝜂¥𝑡 𝑐𝑖 + 𝑢¥𝑖𝑡 , 𝑡 = 1, . . . , 𝑇, (H.2.4) where 𝑦¥𝑖𝑡 , 𝑥¥𝑖𝑡 , and 𝑢¥𝑖𝑡 are defined in the same way as in Section 3. But now, this transformation does not eliminate 𝑐𝑖 . Therefore, for the moment conditions E[ 𝑓1𝑖 (𝛽)] = 0 in (3.4.10) to be valid, we need for every 𝑡 = 1, . . . , 𝑇 E[𝑠𝑖𝑡 𝑥¥𝑖𝑡0 ( 𝜂¥𝑡 𝑐𝑖 + 𝑢¥𝑖𝑡 )] = 0. (H.2.5) We know that for every 𝑡 = 1, . . . , 𝑇 E(𝑠𝑖𝑡 𝑥¥𝑖𝑡0 𝑢¥𝑖𝑡 ) = 0 (H.2.6) under Assumption 3.3.2. We additionally need that for every 𝑡 = 1, . . . , 𝑇 E(𝑠𝑖𝑡 𝑥¥𝑖𝑡0 𝜂¥𝑡 𝑐𝑖 ) = 0. (H.2.7) 1Note that the definitions of 𝛾 and 𝑣𝑖𝑡 are the same as those in Section 2. Only the unobserved heterogeneity has changed. 125 A sufficient condition for this to hold is that for every 𝑡 = 1, . . . , 𝑇 E(𝑐𝑖 | 𝑥¥𝑖𝑡 , 𝑠𝑖 ) = 0. (H.2.8) This says that at time 𝑡, the unobserved heterogeneity 𝑐𝑖 is mean independent of the time deviated 𝑥𝑖𝑡 and selection in all time periods. This is clearly stronger than Assumption 3.3.2 which did not put any restriction on the relationship between 𝑠𝑖 and 𝑐𝑖 . However, it is weaker than assuming 𝑐𝑖 is mean independent of 𝑥𝑖𝑡 . We are only assuming that it is mean independent of the time deviated 𝑥𝑖𝑡 , that is 𝑥¥𝑖𝑡 . Similarly, when we time demean the new imputation model (G.7), we get 𝑥¥1𝑖𝑡 = 𝑥¥2𝑖𝑡 𝜋 + 𝜁¥𝑡 𝑑𝑖 + 𝑟¥𝑖𝑡 , 𝑡 = 1, . . . , 𝑇 . (H.2.9) For the moment conditions E[ 𝑓2𝑖 (𝜋)] = 0 in (3.4.10) to be valid, we need that for every 𝑡 = 1, . . . , 𝑇 0 ( 𝜁¥ 𝑑 + 𝑟¥ )] = 0 E[𝑠𝑖𝑡 𝑥¥2𝑖𝑡 (H.2.10) 𝑡 𝑖 𝑖𝑡 for which we need to assume that for every 𝑡 = 1, . . . 
, 𝑇

E(𝑑𝑖 | 𝑥¥2𝑖𝑡 , 𝑠𝑖 ) = 0 (H.2.11)

in addition to Assumption 3.3.2. Similarly, the time-deviated reduced form is

𝑦¥𝑖𝑡 = 𝑥¥2𝑖𝑡 𝛾 + ℎ¥𝑖𝑡 + 𝑣¥𝑖𝑡 , 𝑡 = 1, . . . , 𝑇 . (H.2.12)

It is easy to see that, given equation (G.17), Assumptions (G.13) and (G.16) along with Assumption 3.3.2 are sufficient for the moment conditions E[𝑓3𝑖(𝛽, 𝜋)] = 0 in (3.4.10) to be valid.

REFERENCES

Abrevaya, J., & Donald, S. G. (2011). A GMM approach for dealing with missing data on regressors and instruments. Unpublished manuscript.

Abrevaya, J., & Donald, S. G. (2017). A GMM approach for dealing with missing data on regressors. Review of Economics and Statistics, 99(4), 657–662.

Ahn, S. C., & Schmidt, P. (1995). A separability result for GMM estimation, with applications to GLS prediction and conditional moment tests. Econometric Reviews, 14(1), 19–34.

Angrist, J. D., & Krueger, A. B. (1992). The effect of age at school entry on educational attainment: An application of instrumental variables with moments from two samples. Journal of the American Statistical Association, 87(418), 328–336.

Angrist, J. D., & Krueger, A. B. (1995). Split-sample instrumental variables estimates of the return to schooling. Journal of Business & Economic Statistics, 13(2), 225–235.

Arellano, M., & Meghir, C. (1992). Female labour supply and on-the-job search: An empirical model estimated using complementary data sets. The Review of Economic Studies, 59(3), 537–559.

Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling. National Bureau of Economic Research, Cambridge, Mass., USA.

Dagenais, M. G. (1973). The use of incomplete observations in multiple regression analysis: A generalized least squares approach. Journal of Econometrics, 1(4), 317–328.

Devereux, P. J., & Hart, R. A. (2010). Forced to be rich? Returns to compulsory schooling in Britain. The Economic Journal, 120(549), 1345–1364.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, 1029–1054.

Hellerstein, J. K., & Imbens, G. W. (1999). Imposing moment restrictions from auxiliary data by weighting. Review of Economics and Statistics, 81(1), 1–14.

Hentschel, J., Lanjouw, J. O., Lanjouw, P., & Poggi, J. (2000). Combining census and survey data to trace the spatial dimensions of poverty: A case study of Ecuador. The World Bank Economic Review, 14(1), 147–165.

Inoue, A., & Solon, G. (2005). Two-sample instrumental variables estimators. NBER Working Paper (t0311).

Inoue, A., & Solon, G. (2010). Two-sample instrumental variables estimators. The Review of Economics and Statistics, 92(3), 557–561.

Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91(433), 222–230.

Klevan, S., Weinberg, S. L., & Middleton, J. A. (2016). Why the boys are missing: Using social capital to explain gender differences in college enrollment for public high school students. Research in Higher Education, 57(2), 223–257.

Klevmarken, N. A. (1982). Missing variables and two-stage least-squares estimation from more than one data set (Tech. Rep.). Research Institute of Industrial Economics.

Little, R. J. (1992). Regression with missing X's: A review. Journal of the American Statistical Association, 87(420), 1227–1237.

Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data (Vol. 333). John Wiley & Sons.

Loureiro, M. L., & Nayga Jr, R. M. (2006). Obesity, weight loss, and physician's advice. Social Science & Medicine, 62(10), 2458–2468.
Loureiro, M. L., & Nayga Jr, R. M. (2007). Physician's advice affects adoption of desirable dietary behaviors. Review of Agricultural Economics, 29(2), 318–330.

MaCurdy, T., Mroz, T., & Gritz, R. M. (1998). An evaluation of the National Longitudinal Survey on Youth. The Journal of Human Resources, 33(2), 345–436.

McDonough, I. K., & Millimet, D. L. (2017). Missing data, imputation, and endogeneity. Journal of Econometrics, 199(2), 141–155.

Mogstad, M., & Wiswall, M. (2012). Instrumental variables estimation with partially missing instruments. Economics Letters, 114(2), 186–189.

Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245.

Ortega-Sanchez, R., Jimenez-Mena, C., Cordoba-Garcia, R., Muñoz-Lopez, J., Garcia-Machado, M. L., & Vilaseca-Canals, J. (2004). The effect of office-based physician's advice on adolescent exercise behavior. Preventive Medicine, 38(2), 219–226.

Pacini, D., & Windmeijer, F. (2016). Robust inference for the two-sample 2SLS estimator. Economics Letters, 146, 50–54.

Prokhorov, A., & Schmidt, P. (2009). GMM redundancy results for general missing data problems. Journal of Econometrics, 151(1), 47–55.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.

Rubin, D. B. (1987). Multiple imputation for non-response in surveys. John Wiley & Sons.

Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8(1), 3–15.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147.

Secker-Walker, R. H., Solomon, L. J., Flynn, B. S., Skelly, J. M., & Mead, P. B. (1998). Reducing smoking during pregnancy and postpartum: Physician's advice supported by individual counseling. Preventive Medicine, 27(3), 422–430.

White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399.

Wooldridge, J. M. (2007). Inverse probability weighted estimation for general missing data problems. Journal of Econometrics, 141(2), 1281–1301.

Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT Press.