EFFICIENT ESTIMATION WITH MISSING VALUES IN CROSS SECTION AND PANEL DATA

By

Bhavna Rai

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics – Doctor of Philosophy

2021

ABSTRACT

EFFICIENT ESTIMATION WITH MISSING VALUES IN CROSS SECTION AND PANEL DATA

By Bhavna Rai

Chapter 1: Efficient Estimation with Missing Data and Endogeneity

I study the problem of missing values in both the outcome and the covariates in linear models with endogenous covariates. I propose an estimator that improves efficiency relative to a Two Stage Least Squares (2SLS) estimator based only on the complete cases. My framework also unifies the literature on missing data and combining data sets, and includes the “Two-Sample 2SLS” as a special case. The method is an extension of Abrevaya and Donald (2017), who provide methods of improving efficiency over complete-cases estimators in linear models with cross-section data and missing covariates. I also provide guidance on dealing with missing values in the instruments and in commonly used nonlinear functions of the endogenous covariates, like squares and interactions, without introducing inconsistency in the estimates.

Chapter 2: Imputing Missing Covariate Values in Nonlinear Models

I study the problem of missing covariate values in nonlinear models with continuous or discrete covariates. In order to use the information in the incomplete cases, I propose an inverse probability weighted one-step imputation estimator that provides gains in efficiency relative to the complete-cases estimator by using a reduced form for the outcome in terms of the always-observed covariates. Unlike the two-step imputation and dummy variable methods commonly used in empirical work, my estimator is consistent for a wide class of nonlinear models. It relies only on the commonly used “missing at random” assumption, and provides a specification test for the resulting restrictions. I show how the results apply to nonlinear models for fractional and nonnegative responses.

Chapter 3: Efficient Estimation of Linear Panel Data Models with Missing Covariates

We study the problem of missing covariates in the context of linear, unobserved effects panel data models. In order to use the information in the incomplete cases, we propose generalized method of moments (GMM) estimation. By using information on the incomplete cases from all time periods, the proposed estimators provide gains in efficiency relative to the fixed effects (and Mundlak) estimators that use only the complete cases. The method is an extension of Abrevaya and Donald (2017), who consider a linear model with cross-sectional data and incorporate the linear imputation method in the set of moment conditions to obtain gains in efficiency. Our first proposed estimator uses the assumption of strict exogeneity of the covariates as well as the selection, while allowing the selection to be correlated with the observed covariates and unobserved heterogeneity in both the outcome equation and the imputation equation. We also consider the case in which the covariates are only sequentially exogenous and propose an estimator based on the method of forward orthogonal deviations introduced by Arellano and Bover (1995). Our framework suggests a simple test for whether selection is correlated with unobserved shocks, both contemporaneous and those in other time periods.
ACKNOWLEDGEMENTS

My sincere gratitude to my adviser Jeffrey Wooldridge, not only for lending his expertise to my dissertation but also for his patience and motivation throughout my Ph.D. I would also like to thank my committee members Peter Schmidt, Todd Elder, and Vincenzina Caputo for their insightful comments and discussions. My deep gratitude to my parents and brother for supporting me both materially and spiritually throughout the program. I am also thankful to my friends and fellow graduate students Katie Bollman, Marissa Eckrote, Pallavi Pal, and Ruonan Xu for their constant support and all the fun times. The financial support received from the Department of Economics, the Graduate School, and the College of Social Science has been instrumental in the completion of this work. Finally, the administrative support received from Lori Jean Nichols and Jay Feight has greatly facilitated navigating the program.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 EFFICIENT ESTIMATION WITH MISSING DATA AND ENDOGENEITY
1.1 Introduction
1.2 The population model and assumptions
1.3 The missing data scheme
1.4 Moment conditions and GMM estimation
1.5 Comparison with related estimators
1.5.1 Complete cases estimator
1.5.2 Estimators combining different data sets
1.5.3 Sequential estimators
1.5.4 Dummy variable method
1.6 Missing instruments
1.7 Nonlinearity in covariates and instruments
1.7.1 Missingness in outcome and covariates
1.7.2 Missingness in instruments
1.8 Monte Carlo simulations
1.8.1 Missingness in outcome and covariates
1.8.2 Missingness in instruments
1.9 Empirical application
1.10 Conclusion

CHAPTER 2 IMPUTING MISSING COVARIATE VALUES IN NONLINEAR MODELS
2.1 Introduction
2.2 The population optimization problems
2.3 Non random sampling and inverse probability weighting
2.4 Moment conditions and GMM
2.5 Examples
2.5.1 Models for binary and fractional responses
2.5.1.1 Continuous covariate with missing values
2.5.1.2 Binary covariate with missing values
2.5.1.3 Average partial effects
2.5.2 Exponential models
2.6 Comparison with related estimators
2.6.1 Complete cases
2.6.2 Sequential procedures
2.6.3 Dummy variable method
2.6.4 Unweighted estimators
2.7 Empirical application
2.8 Conclusion

CHAPTER 3 EFFICIENT ESTIMATION OF LINEAR PANEL DATA MODELS WITH MISSING COVARIATES*
3.1 Introduction
3.2 Population model
3.3 The missing data mechanism
3.4 Moment conditions and GMM
3.5 Comparison to related estimators
3.5.1 Complete cases estimator
3.5.2 Dummy variable method
3.5.3 Regression imputation
3.5.4 Mundlak device
3.6 Estimation under sequential exogeneity
3.7 Conclusion

APPENDICES
APPENDIX A PROOFS FOR CHAPTER 1
APPENDIX B TABLES FOR CHAPTER 1
APPENDIX C FIGURES FOR CHAPTER 1
APPENDIX D PROOFS FOR CHAPTER 2
APPENDIX E ASYMPTOTIC THEORY FOR UNWEIGHTED ESTIMATION
APPENDIX F TABLES FOR CHAPTER 2
APPENDIX G PROOFS FOR CHAPTER 3
APPENDIX H EXTENSIONS TO CHAPTER 3

REFERENCES

LIST OF TABLES

Table B.1: Monte Carlo simulations, Design 1
Table B.2: Monte Carlo simulations, Design 2
Table B.3: Monte Carlo simulations, Design 3
Table B.4: Monte Carlo simulations, Design 4
Table B.5: Monte Carlo simulations, Design 5
Table B.6: Monte Carlo simulations, Design 6
Table B.7: Monte Carlo simulations, Design 7
Table B.8: Effect of physician's advice on calorie consumption: complete cases versus the proposed estimator
Table F.1: Summary of missing data methods used in 5 highly ranked economics journals from 2018 to August 2020
Table F.2: Effect of grade variance on probability of having a 4 year college degree

LIST OF FIGURES

Figure C.1: Some admissible patterns of missingness (shaded areas represent complete cases)

CHAPTER 1
EFFICIENT ESTIMATION WITH MISSING DATA AND ENDOGENEITY

1.1 Introduction

The problem of missing data is highly prevalent in empirical research. While there is a vast literature on methods to deal with missing data, the issue of endogeneity of the covariates with missing values has not been explicitly addressed in the majority of it.1 In linear models with endogenous covariates and missing values in either the outcome or the endogenous covariates, a frequently used method is a 2SLS that only uses the “complete cases”, that is, the observations for which all the variables are observed.2 While consistent under commonly used assumptions, this method can lead to a substantial loss of efficiency due to discarding the information in the incomplete cases.

Recent literature has considered the case of missingness only in the endogenous covariates and has suggested some methods that make use of these incomplete cases. The first set of methods is based on “imputation”. For instance, McDonough & Millimet (2017) discuss an estimator which replaces the missing covariate values with fitted values from a first stage regression of the endogenous covariate on the instruments. A more efficient estimator is suggested by Abrevaya & Donald (2011), who use the incomplete cases via a reduced form for the outcome in terms of the instruments. The first contribution of this paper is to extend the framework of Abrevaya & Donald (2011) to allow for missingness in both the outcome and the endogenous covariates. I show that it is possible to obtain strict gains in efficiency for all coefficients relative to the complete cases 2SLS.

My framework also unifies the literature on missing data and that on combining data sets with missing variables. Empirical researchers sometimes have two distinct data sets, one of which contains only the outcome and the instruments, and the other contains only the endogenous covariates and the instruments. A commonly used estimator that combines the two is the “Two-Sample 2SLS” (henceforth TS2SLS).3 I relax assumptions traditionally used by this estimator and also provide a framework for combining more than two data sets with more general patterns of missing variables.

A second method that makes use of the incomplete cases is the so-called “dummy variable method”, which replaces the missing covariate values with zeros and includes an indicator for missingness as an additional covariate in the model. When the covariates are exogenous, Jones (1996) shows that this method produces inconsistent estimates unless some zero restrictions are imposed in the population. I show that this inconsistency carries over to the case of endogenous covariates.

1 For a comprehensive discussion of methods used to deal with missing data, see Schafer & Graham (2002).
2 Wooldridge (2010), Section 17.2.1.
One can also encounter missing values in the instruments, in which case interest lies in continuing to use the observations with missing instruments instead of discarding them. Mogstad & Wiswall (2012) discuss an estimator that imputes missing instrument values. This is a two-step estimator that in the first step replaces the missing instrument values with predicted values from a regression of the instrument on the always-observed exogenous covariates, and in the second step estimates the main model using a 2SLS with both the actual and imputed instrument values. They show that the resulting estimator for the coefficient on the endogenous covariate is numerically equivalent to a complete cases 2SLS. A second contribution of this paper is to propose an imputation estimator for the instruments that can achieve strict gains in efficiency over the complete cases 2SLS for all coefficients.4 This estimator includes as a special case the estimator suggested by Abrevaya & Donald (2017) in the case where the covariates are exogenous.

Finally, I show how to impute commonly used nonlinear functions of the endogenous covariates like squares and interactions. I show that two-step procedures which in the first step replace the missing values of the nonlinear functions of the covariates with the same nonlinear functions of the imputed values generally produce inconsistent estimates. A third contribution of this paper is to propose a consistent imputation estimator in this context that improves upon the efficiency of the complete cases 2SLS.

3 TS2SLS was first introduced by Klevmarken (1982), and more recently used by Angrist & Krueger (1995). Inoue & Solon (2010) show that the TS2SLS is more efficient than the related Two-Sample IV estimator. Inoue & Solon (2005) consider GMM estimation with arbitrary heteroskedasticity and stratification. Pacini & Windmeijer (2016) obtain robust standard errors for the traditional TS2SLS with arbitrary heteroskedasticity.
4 Abrevaya & Donald (2011) also propose an estimator for the case of missing instruments. My estimator is based on different moment conditions and is no less efficient than theirs.

The rest of the paper is organized as follows. Section 1.2 presents the population model of interest and associated assumptions. Section 1.3 describes the missing data scheme and the assumptions on the missingness mechanism for the case of missingness in the outcome and the endogenous covariates. Section 1.4 describes the resulting moment conditions and the asymptotic distribution of the proposed GMM estimator. Section 1.5 discusses four related estimators: the complete cases 2SLS, the TS2SLS, the imputation estimator, and the dummy variable estimator. Section 1.6 discusses the case of missingness in the instruments. Section 1.7 discusses the case of nonlinearity in the covariates. Section 1.8 presents results from Monte Carlo simulations comparing the proposed estimator with related estimators. Section 1.9 presents an empirical application to the effect of physician's advice on individuals' calorie consumption. Section 1.10 concludes. The Appendices include the proofs and tables.

1.2 The population model and assumptions

Consider the standard linear regression model
$$y = x_1\beta_1 + x_2\beta_2 + u \equiv x\beta + u, \quad (1.2.1)$$
where $x = (x_1, x_2)$ is the $1 \times (p+k)$ vector of covariates, $x_1$ is a $1 \times p$ vector of potentially endogenous covariates, and $x_2$ is a $1 \times k$ vector of exogenous covariates (including the constant). That is,
$$\mathrm{E}(x_2'u) = 0, \quad (1.2.2)$$
and we allow for $\mathrm{E}(x_1'u) \neq 0$.
We are interested in estimating $\beta = (\beta_1', \beta_2')'$, where $\beta_1$ and $\beta_2$ are $p \times 1$ and $k \times 1$ respectively. As is well known, OLS is inconsistent for $\beta$ under (1.2.2). Suppose we have a set of instruments $z = (z_1, x_2)$, where $z_1$ is a $1 \times q$ ($q \geq p$) vector of excluded instruments, such that
$$\mathrm{E}(z'u) = 0. \quad (1.2.3)$$
The first stage is given by the linear projection
$$x = z_1\Pi_1 + x_2\Pi_2 + r \equiv z\Pi + r, \quad (1.2.4)$$
where $\Pi$ is the $(q+k) \times (p+k)$ matrix of all the first stage coefficients, and $\Pi_1$ and $\Pi_2$ are the $q \times (p+k)$ and $k \times (p+k)$ matrices of coefficients on $z_1$ and $x_2$ respectively. By definition,
$$\mathrm{E}(z'r) = 0, \quad (1.2.5)$$
and by assumption $\Pi \neq 0$. Then, given a random sample and a rank condition, we can use 2SLS to consistently estimate $\beta$. Note that the errors $u$ and $r$ are assumed only to satisfy zero correlation with the instruments in (1.2.3) and (1.2.5); no other assumptions, such as homoskedasticity or a zero conditional mean, have been imposed on them.

Now, using (1.2.1) and (1.2.4), we get a reduced form for $y$ given by
$$y = z\Pi\beta + v, \qquad v \equiv r\beta + u, \quad (1.2.6)$$
and using (1.2.3) and (1.2.5), we have
$$\mathrm{E}(z'v) = 0. \quad (1.2.7)$$
Under the missing data scheme described in the next section, equation (1.2.6) allows us to use the incomplete cases for estimating $\beta$. When there is no missing data, the information in this equation is redundant given equations (1.2.1)-(1.2.5).

1.3 The missing data scheme

I characterize the potential missingness of the data using selection indicators. For any random draw $(x_i, y_i, z_i)$ from the population, we also draw the selection indicators $(s_{1i}, s_{2i})$ defined as follows:
$$s_{1i} = \begin{cases} 1 & \text{if } y_i \text{ is observed} \\ 0 & \text{otherwise} \end{cases} \qquad\qquad s_{2i} = \begin{cases} 1 & \text{if } x_i \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
Two things should be noted. First, I am assuming that $z_i$ is always observed. Since $z_i = (z_{1i}, x_{2i})$, I am allowing for missingness only in the endogenous covariates $x_{1i}$.5 Second, I am assuming that either all or none of the elements in $x_{1i}$ are observed. Then our “data” consists of $\{(y_i, x_i, z_i, s_{1i}, s_{2i}) : i = 1, \ldots, n\}$. Because identification is properly studied in the population, let $s_1$ and $s_2$ denote random variables with the distributions of $s_{1i}$ and $s_{2i}$ respectively for all $i$. In other words, $(y, x, z, s_1, s_2)$ now denotes the population.

This framework allows for several kinds of missing data patterns that arise frequently in practice. Figure 1 shows some of these cases. First, it allows for the case where we have a single sample in which $y$ is missing for certain observations, $x$ is missing for certain other observations, and for the rest of the observations all of $y$, $x$ and $z$ are observed (Figure 1.1). Another case is where only $x$ is missing for certain observations (Figure 1.2).6 In both of these cases, using only the complete cases may lead to a substantial loss of information. A third case is where $y$ and $x$ are missing for disjoint observations such that there are no complete cases (Figure 1.3). Such a sample is typically obtained by combining two samples such that only $y$ and $z$ are observed in one sample and only $x$ and $z$ in the other. The most commonly used estimator in this case is the TS2SLS, which is a special case of the estimator I propose in the next section.

To determine the properties of any estimation procedure using selected samples, we need to know how $s_1$ and $s_2$ are related to $(y, x, z)$. I place the following assumptions on the missingness indicators.

5 I discuss the case of missingness in exogenous covariates in Section 1.6.
6 This case has briefly been considered in Abrevaya & Donald (2011).
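Before turning to those assumptions, note that the selection indicators are computed directly from the raw sample. The short sketch below is only an illustration of this bookkeeping, not code from the dissertation; it assumes missing entries are coded as NaN in NumPy arrays, and all names are mine.

```python
import numpy as np

def selection_indicators(y, X1):
    """Construct s1 and s2 from a sample in which missing entries are NaN.

    y  : (n,) outcome, possibly containing NaN
    X1 : (n, p) potentially endogenous covariates, possibly containing NaN
    (z, including x2, is assumed always observed)
    """
    s1 = (~np.isnan(y)).astype(int)                 # 1 if y_i is observed
    s2 = (~np.isnan(X1)).all(axis=1).astype(int)    # 1 if all of x_1i is observed
    complete = (s1 * s2).astype(bool)               # complete cases
    only_x = ((1 - s1) * s2).astype(bool)           # x observed, y missing
    only_y = (s1 * (1 - s2)).astype(bool)           # y observed, x missing
    return s1, s2, complete, only_x, only_y
```

The three boolean masks correspond to the subsamples used by the moment functions introduced in Section 1.4.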
Assumption 1.3.1:
(i) $\mathrm{E}(s_1 s_2 z'u) = 0$; (ii) $\mathrm{E}(s_1 s_2 z'r) = 0$; (iii) $\mathrm{E}(s_1 z'u) = 0$; (iv) $\mathrm{E}(s_1 z'r) = 0$; (v) $\mathrm{E}(s_2 z'r) = 0$.

This assumption essentially implies that the orthogonality assumptions on the errors given in (1.2.3), (1.2.5) and (1.2.7) hold in the selected sub-populations as well. For instance, the first part of this assumption, which is the weakest possible assumption required for the consistency of a 2SLS based only on the complete cases, can be written as
$$\mathrm{E}(s_1 s_2 z'u) = \mathrm{E}[\mathrm{E}(s_1 s_2 z'u \mid s_1 s_2)] = P(s_1 s_2 = 1)\,\mathrm{E}(z'u \mid s_1 s_2 = 1) = 0, \quad (1.3.1)$$
where the first equality holds by the law of iterated expectations (LIE). If we assume that $P(s_1 s_2 = 1)$ is strictly positive, then we need the population orthogonality condition $\mathrm{E}(z'u) = 0$ to hold in the sub-population where $s_1 = s_2 = 1$ for this assumption to be true. The other parts of this assumption impose similar restrictions on the errors in (1.2.1) and (1.2.4) for different sub-populations.

Sufficient for Assumption 1.3.1 to hold is that $(s_1, s_2) \perp\!\!\!\perp (z, u, r)$, for which a sufficient condition is that $(s_1, s_2) \perp\!\!\!\perp (x, y, z)$. That is, selection is independent of everything else in the model. This is generally known as “missing completely at random” (MCAR) in the missing data literature.7 For instance, consider the first part of Assumption 1.3.1:
$$\mathrm{E}(s_1 s_2 z'u) = \mathrm{E}(s_1 s_2)\,\mathrm{E}(z'u) = 0, \quad (1.3.2)$$
and similarly for the other parts.

Assumption 1.3.1 also holds if we have correctly specified conditional means and selection is independent of the errors in both the model of interest and the first stage, conditional on the instruments. That is, strengthening the exogeneity conditions in (1.2.3) and (1.2.5) to $\mathrm{E}(u \mid z) = 0$ and $\mathrm{E}(r \mid z) = 0$ respectively and assuming $(s_1, s_2) \perp\!\!\!\perp (u, r) \mid z$ is sufficient. Again, consider the first part of Assumption 1.3.1:
$$\mathrm{E}(s_1 s_2 z'u) = \mathrm{E}[\mathrm{E}(s_1 s_2 z'u \mid z, s_1 s_2)] = \mathrm{E}[s_1 s_2 z'\,\mathrm{E}(u \mid z, s_1 s_2)] = \mathrm{E}[s_1 s_2 z'\,\mathrm{E}(u \mid z)] = 0, \quad (1.3.3)$$
where the third equality holds because of the conditional independence and the last one holds because of the zero conditional mean of the errors. An important special case is when selection is a deterministic function of $z$. But it can also depend on other unobservable random variables under certain conditions. For instance, we can let
$$s_1 s_2 = f(z, w), \quad (1.3.4)$$
where $w$ is an unobserved random variable. Then $\mathrm{E}(s_1 s_2 z'u) = 0$ holds if $\mathrm{E}(u \mid z) = 0$ and $w \perp\!\!\!\perp (z, u)$, as
$$\mathrm{E}(s_1 s_2 z'u) = \mathrm{E}[\mathrm{E}(s_1 s_2 z'u \mid z, w)] = \mathrm{E}[s_1 s_2 z'\,\mathrm{E}(u \mid z, w)] = \mathrm{E}[s_1 s_2 z'\,\mathrm{E}(u \mid z)] = 0. \quad (1.3.5)$$
What Assumption 1.3.1 rules out is $(s_1, s_2)$ depending on the errors $u$ and $r$. That is, selection cannot depend on the idiosyncratic errors in either $y$ or $x$. Whether or not this holds in an empirical application should be carefully considered by the researcher.

7 We do not require $s_1$ and $s_2$ to be independent of each other for Assumption 1.3.1 to hold.

1.4 Moment conditions and GMM estimation

Using equations (1.2.1)-(1.2.7) along with Assumption 1.3.1, I define the vector of moment functions as follows:
$$g(\beta, \Pi) = \begin{pmatrix} s_1 s_2\, z'(y - x\beta) \\ s_1 s_2\, z' \otimes (x - z\Pi)' \\ (1 - s_1) s_2\, z' \otimes (x - z\Pi)' \\ s_1 (1 - s_2)\, z'(y - z\Pi\beta) \end{pmatrix} \equiv \begin{pmatrix} g_1(\beta, \Pi) \\ g_2(\beta, \Pi) \\ g_3(\beta, \Pi) \\ g_4(\beta, \Pi) \end{pmatrix}, \quad (1.4.1)$$
where I suppress $(y, x, z, s_1, s_2)$ from $g(\cdot)$ for notational convenience. In the vector $g(\cdot)$, $g_1(\cdot)$ and $g_2(\cdot)$ use the information contained in the complete cases, $g_3(\cdot)$ uses the observations for which $x$ is observed but $y$ is not, while $g_4(\cdot)$
uses the observations for which $y$ is observed but $x$ is not.8 Then, the following result holds for $g(\cdot)$.

Lemma 1.4.1. Under Assumption 1.3.1, $\mathrm{E}[g(\beta, \Pi)] = 0$.

8 Note that equations (1.2.1)-(1.2.7) and our missing data scheme suggest 5 different moment functions: $g_1(\cdot)$-$g_4(\cdot)$ along with $g_5(\cdot) = s_1 s_2 z'(y - z\Pi\beta)$. However, since $g_5(\cdot)$ is a linear combination of $g_1(\cdot)$ and $g_2(\cdot)$, it is redundant given $g_1(\cdot)$-$g_4(\cdot)$ and hence I exclude it from the set of relevant moment functions.

This gives us a vector of $2(q+k)(1+p+k)$ moment conditions satisfied by the population parameter values $(\beta, \Pi)$. We have $(p+k)(1+q+k)$ parameters to estimate, giving us $2(q+k) + (p+k)(q+k-1)$ overidentifying restrictions.

Let $\bar{g}(\beta, \Pi) = n^{-1}\sum_{i=1}^{n} g(y_i, x_i, z_i, s_{1i}, s_{2i}, \beta, \Pi)$, let $\Omega$ be a square matrix of order $2(q+k)(1+p+k)$ that is nonrandom, symmetric, and positive definite, and let $\hat{\Omega}$ be a first-step consistent estimate of $\Omega$. Then, the standard two-step GMM estimator minimizes the objective function
$$\bar{g}(\beta, \Pi)'\,\hat{\Omega}\,\bar{g}(\beta, \Pi). \quad (1.4.2)$$
The variance-covariance matrix of the moment functions is given by
$$C \equiv \mathrm{E}[g(\beta, \Pi)\, g(\beta, \Pi)'] = \begin{pmatrix} C_{11} & C_{12} & 0 & 0 \\ C_{12}' & C_{22} & 0 & 0 \\ 0 & 0 & C_{33} & 0 \\ 0 & 0 & 0 & C_{44} \end{pmatrix},$$
where
$$C_{11} = \mathrm{E}(s_1 s_2 u^2 z'z), \quad C_{12} = \mathrm{E}(s_1 s_2\, z'u\; z \otimes r), \quad C_{22} = \mathrm{E}(s_1 s_2\; z' \otimes r'\; z \otimes r),$$
$$C_{33} = \mathrm{E}[(1 - s_1) s_2\; z' \otimes r'\; z \otimes r], \qquad C_{44} = \mathrm{E}[s_1(1 - s_2)\, v^2 z'z], \quad (1.4.3)$$
and $g(\cdot)$ is evaluated at the true value of the parameters. The optimal weight matrix is given by the inverse of $C$. Let $\hat{C}$ be a consistent estimate of $C$, which can be obtained by replacing the expectations by sample averages in the definition of $C$ above and replacing $u$, $r$ and $v$ by consistent estimates obtained using, for instance, GMM estimators that use $g_1(\cdot)$ only, $g_2(\cdot)$ and $g_3(\cdot)$ only, and $g_4(\cdot)$ only, respectively. Then, the optimal GMM estimator is defined as the following.

Definition 1.4.1. Call the estimators of $\beta$ and $\Pi$ that minimize (1.4.2) with the optimal weight matrix $\hat{\Omega} = \hat{C}^{-1}$, $\hat{\beta}$ and $\hat{\Pi}$ respectively.

Further, define the $(k+q)(2+k+p) \times (k+p)(1+k+q)$ matrix of expected derivatives of $g(\cdot)$ with respect to $(\beta', \mathrm{vec}(\Pi)')'$:
$$D \equiv \mathrm{E}[\nabla g(\beta, \Pi)] = \begin{pmatrix} D_{11} & 0 \\ 0 & D_{22} \\ 0 & D_{32} \\ D_{41} & D_{42} \end{pmatrix},$$
where
$$D_{11} = -\mathrm{E}(s_1 s_2 z'x), \qquad D_{22} = -\mathrm{E}[s_1 s_2 (z'z \otimes c_1, \ldots, z'z \otimes c_{p+k})],$$
$$D_{32} = -\mathrm{E}[(1 - s_1) s_2 (z'z \otimes c_1, \ldots, z'z \otimes c_{p+k})], \qquad D_{41} = -\mathrm{E}[s_1(1 - s_2) z'z\Pi], \qquad D_{42} = -\mathrm{E}[s_1(1 - s_2)\,\beta' \otimes z'z], \quad (1.4.4)$$
where $c_m$ is a $(p+k) \times 1$ vector with one in the $m$th row and all other rows being zero, $m = 1, \ldots, (p+k)$. I impose the following rank condition on $D$ for identification of $\beta$ and $\Pi$.

Assumption 1.3.2: $\mathrm{rank}(D) = (p+k)(1+q+k)$.

If $P(s_1 s_2 = 1) > 0$, then sufficient for this assumption to hold is that $\mathrm{E}(z'x \mid s_1 s_2 = 1)$ and $\mathrm{E}(z'z \mid s_1 s_2 = 1)$ have full column ranks $(p+k)$ and $(p+k)(q+k)$ respectively. In this case, $\mathrm{E}[g_1(\beta)] = 0$ identifies $\beta$ and $\mathrm{E}[g_2(\Pi)] = 0$ identifies $\Pi$. If $P(s_1 s_2 = 1) = 0$, for instance in the TS2SLS case, then sufficient is that $\mathrm{E}(z'z \mid s_2 = 1)$ and $\mathrm{E}(z'x \mid s_1 = 1)$ have full column ranks $(p+k)(q+k)$ and $(p+k)$ respectively. In this case, $\mathrm{E}[g_3(\Pi)] = 0$ identifies $\Pi$ and $\mathrm{E}[g_4(\beta)] = 0$ identifies $\beta$, since for the purpose of identification we can treat $\Pi$ as known. Then, we have the following result using Hansen (1982).

Theorem 1.4.1. Under standard regularity conditions and Assumptions 1.3.1 and 1.3.2,
$$\sqrt{n}\left[\left(\hat{\beta}', \mathrm{vec}(\hat{\Pi})'\right)' - \left(\beta', \mathrm{vec}(\Pi)'\right)'\right] \xrightarrow{d} N\left(0, (D'C^{-1}D)^{-1}\right)$$
and
$$n\,\bar{g}(\hat{\beta}, \hat{\Pi})'\,\hat{C}^{-1}\,\bar{g}(\hat{\beta}, \hat{\Pi}) \xrightarrow{d} \chi^2_{2(q+k)+(p+k)(q+k-1)}.$$
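Both the estimator in Definition 1.4.1 and the statistic in Theorem 1.4.1 are straightforward to compute numerically. The following sketch is purely illustrative and not part of the dissertation (all function and variable names are mine): it stacks the four blocks of (1.4.1), uses an identity-weighted first step to form $\hat{C}$, and then minimizes (1.4.2) with $\hat{\Omega} = \hat{C}^{-1}$, returning $\hat{\beta}$ and the overidentification statistic.

```python
import numpy as np
from scipy.optimize import minimize

def moments(theta, y, X, Z, s1, s2, p_k, q_k):
    """Stack the four moment blocks in (1.4.1) observation by observation.

    theta = (beta, vec(Pi)); beta has length p+k, Pi is (q+k) x (p+k).
    Missing entries of y and X may be NaN; they are zeroed out before use,
    which is harmless because they are multiplied by zero selection indicators."""
    beta = theta[:p_k]
    Pi = theta[p_k:].reshape(q_k, p_k)
    y0, X0 = np.nan_to_num(y), np.nan_to_num(X)
    e1 = y0 - X0 @ beta              # structural residual (used when s1*s2 = 1)
    R = X0 - Z @ Pi                  # first-stage residuals (used when s2 = 1)
    v = y0 - Z @ (Pi @ beta)         # reduced-form residual (used when s1 = 1)
    ZR = np.einsum('ij,ik->ijk', Z, R).reshape(len(y), -1)
    g1 = (s1 * s2)[:, None] * Z * e1[:, None]
    g2 = (s1 * s2)[:, None] * ZR
    g3 = ((1 - s1) * s2)[:, None] * ZR
    g4 = (s1 * (1 - s2))[:, None] * Z * v[:, None]
    return np.hstack([g1, g2, g3, g4])          # n x [2(q+k)(1+p+k)]

def gmm_objective(theta, W, *data):
    gbar = moments(theta, *data).mean(axis=0)
    return gbar @ W @ gbar

def two_step_gmm(y, X, Z, s1, s2):
    n, p_k = X.shape
    q_k = Z.shape[1]
    data = (y, X, Z, s1, s2, p_k, q_k)
    theta0 = np.zeros(p_k + q_k * p_k)           # a complete-case 2SLS start would also work
    # Step 1: identity weighting gives consistent preliminary estimates.
    W1 = np.eye(2 * q_k * (1 + p_k))
    step1 = minimize(gmm_objective, theta0, args=(W1,) + data, method='BFGS')
    # Step 2: optimal weighting with C-hat built from first-step residuals.
    G = moments(step1.x, *data)
    W2 = np.linalg.pinv(G.T @ G / n)
    step2 = minimize(gmm_objective, step1.x, args=(W2,) + data, method='BFGS')
    J = n * gmm_objective(step2.x, W2, *data)    # overidentification statistic
    return step2.x[:p_k], J
```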
The $\chi^2$ statistic in Theorem 1.4.1 can be used for the standard test of overidentifying restrictions. Note that this statistic is just the GMM objective function in (1.4.2) evaluated at the efficient estimates of the parameters and is distributed as chi-squared with degrees of freedom equal to the number of overidentifying restrictions.

1.5 Comparison with related estimators

1.5.1 Complete cases estimator

The most common practice in the presence of missing data is to just use the complete cases for estimation; that is, to only use the observations for which both $y$ and $x$ are observed. In the current framework, the first and most commonly used estimator that uses only the complete cases is the standard 2SLS. This estimator uses only $g_1(\cdot)$ in estimation, as it requires $s_1 = s_2 = 1$, and uses a weight matrix that is optimal when $u$ is homoskedastic.

Definition 1.5.1.1. Call the estimator of $\beta$ that minimizes (1.4.2), where $g(\cdot)$ contains only $g_1(\cdot)$ and $\hat{\Omega} = (n^{-1}\sum_{i=1}^{n} s_{1i}s_{2i}z_i'z_i)^{-1}$, the complete cases 2SLS (or $\hat{\beta}_{CC-2SLS}$).

The weight matrix used by $\hat{\beta}_{CC-2SLS}$ is optimal if $\mathrm{E}(u^2 \mid z, s_1, s_2) = \sigma^2$. When this assumption is violated, a more efficient complete cases estimator can be obtained by using optimal weighting.

Definition 1.5.1.2. Call the estimator of $\beta$ that minimizes (1.4.2), where $g(\cdot)$ contains only $g_1(\cdot)$ and $\hat{\Omega} = \hat{C}_{11}^{-1}$, the complete cases GMM (or $\hat{\beta}_{CC-GMM}$).

This is the optimal GMM estimator based only on the complete cases. Its asymptotic variance is easily obtained using standard GMM theory.

Lemma 1.5.1.1. Under Assumption 1.3.1, the complete cases GMM has an asymptotic variance given by
$$\mathrm{Avar}\left[\sqrt{n}(\hat{\beta}_{CC-GMM} - \beta)\right] = (D_{11}'C_{11}^{-1}D_{11})^{-1}.$$
Comparing the asymptotic variances of $\hat{\beta}$ and $\hat{\beta}_{CC-GMM}$, the former is no less efficient than the latter because it uses the information contained in the incomplete cases, while the latter simply
ignores this information. The gain in efficiency follows from the fact that adding valid moment conditions decreases, or at least does not increase, the asymptotic variance of a GMM estimator.9

Proposition 1.5.1.1. Under Assumption 1.3.1, $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{CC-GMM} - \beta)] - \mathrm{Avar}[\sqrt{n}(\hat{\beta} - \beta)]$ is positive semi-definite.

Further, I break down the gains in efficiency by $\beta_1$ and $\beta_2$, the coefficients on the potentially missing endogenous covariates $x_1$ and the always observed exogenous covariates $x_2$ respectively. For algebraic convenience, I consider the case where both $x_1$ and $z_1$ are scalars.10

Proposition 1.5.1.2. Let $p = q = 1$. Under Assumption 1.3.1,
(i) $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{1,CC-GMM} - \beta_1)] - \mathrm{Avar}[\sqrt{n}(\hat{\beta}_1 - \beta_1)] = \begin{pmatrix} A_1 & B_1 \end{pmatrix} E \begin{pmatrix} A_1' \\ B_1' \end{pmatrix} \geq 0$;
(ii) $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{2,CC-GMM} - \beta_2)] - \mathrm{Avar}[\sqrt{n}(\hat{\beta}_2 - \beta_2)] = \begin{pmatrix} A_2 & B_2 \end{pmatrix} E \begin{pmatrix} A_2' \\ B_2' \end{pmatrix} \geq 0$,
where $A_j = D_{32}D_{22}^{-1}C_{21}W_j$, $B_j = (D_{41}D_{11}^{-1}C_{11} + D_{42}D_{22}^{-1}C_{21})W_j$, $j = 1, 2$, and $W_1$, $W_2$ and $E$ are matrices defined in the appendix, $E$ being a positive definite matrix.

Starting with the first part of Proposition 1.5.1.2, since $E$ is positive definite, the difference is 0 if and only if $A_1 = B_1 = 0$. The corresponding difference for $\beta_2$ is 0 if and only if $A_2 = B_2 = 0$. Since neither $A_j$ nor $B_j$ is necessarily 0 under the assumptions made so far, it is possible to obtain strict gains in efficiency for both $\beta_1$ and $\beta_2$.

Finally, when there is no missingness, the moment conditions in (1.4.1) just give us the standard 2SLS estimator. Because $s_1$ and $s_2$ are always 1, $g_3(\cdot)$ and $g_4(\cdot)$ are always zero and we are left with $g_1(\cdot)$ and $g_2(\cdot)$. Since $g_2(\cdot)$ adds an equal number of additional parameters as additional moment functions to $g_1(\cdot)$, the GMM estimator of $\beta$ from $g_1(\cdot)$ will be the same as that from $g_1(\cdot)$ and $g_2(\cdot)$.11 Thus, estimation is based only on $g_1(\cdot) = z'(y - x\beta)$, which is the usual moment function used by 2SLS, along with a weight matrix constructed under homoskedasticity of $u$.

9 Wooldridge (2010), Section 8.6.
10 The proof for this proposition is an extension of the proof of Proposition 2 in Abrevaya & Donald (2017).
11 Ahn & Schmidt (1995), Theorem 1.

Proposition 1.5.1.3. If $P(s_1 = 1) = P(s_2 = 1) = 1$ and $\hat{\Omega}_{11} = (n^{-1}\sum_{i=1}^{n} z_i'z_i)^{-1}$, where $\hat{\Omega}_{11}$ is the upper left $(q+k) \times (q+k)$ block of $\hat{\Omega}$, then $\hat{\beta}$ equals the standard 2SLS estimator.

1.5.2 Estimators combining different data sets

A special case of missingness occurs when data are combined from more than one data set, one or more of which do not contain either $y$ or some or all elements of $x$. For instance, the pattern of missingness in Figure 1.1 can result from combining three data sets, one of which contains all of $y$, $x$ and $z$, a second is missing $y$, and a third is missing $x$. In this case, one can just use the first data set to estimate $\beta$, but the second and third can be used to achieve efficiency gains using the framework in Section 1.4. One does have to be careful in making sure that Assumption 1.3.1 holds in order to ensure consistency. For instance, a sufficient condition would be that the different data sets being combined are just random samples of different variables from the same population.

There may also be cases where estimation using a single data set is not possible at all. A prominent example is when one data set contains only $y$ and $z$, while the second contains only $x$ and $z$. The most commonly used estimator in this case is the TS2SLS.12 The TS2SLS is a sequential GMM estimator based only on $g_3(\cdot)$ and $g_4(\cdot)$, since $s_1 s_2 = 0$ in this case.

12 This estimator is discussed in detail in a GMM context by Inoue & Solon (2010).

Definition 1.5.2.1. Call the estimator of $\beta$ obtained by the following sequential procedure the two-sample two stage least squares (or $\hat{\beta}_{TS2SLS}$).
Step 1: Obtain $\breve{\Pi}$ by minimizing (1.4.2), where $g(\cdot)$ contains only $g_3(\cdot)$ and $\hat{\Omega} = I$.
Step 2: Estimate $\beta$ by minimizing (1.4.2), where $g(\cdot)$ contains only $g_4(\cdot)$, $\hat{\Omega} = (n^{-1}\sum_{i=1}^{n} s_{1i}z_i'z_i)^{-1}$, and $\Pi = \breve{\Pi}$ is treated as given.

There are two differences between $\hat{\beta}$ and $\hat{\beta}_{TS2SLS}$. The first is in terms of the assumptions made by the two estimators. The traditional analysis of $\hat{\beta}_{TS2SLS}$ or the related two-sample IV (TSIV) estimator either assumes MCAR (Angrist & Krueger, 1995), or imposes restrictions on $z$ and $x$ that essentially follow from assuming MCAR. For instance, Angrist & Krueger (1992), in using the TSIV estimator, assume that $\mathrm{E}(z'x \mid s_1 = 1) = \mathrm{E}(z'x \mid s_2 = 1)$. Inoue & Solon (2010) make the same assumption, along with $\mathrm{E}(z'z \mid s_1 = 1) = \mathrm{E}(z'z \mid s_2 = 1)$ and that the fourth moments of $z$ conditional on $s_1$ and $s_2$ are equal. The framework presented in this paper allows for relaxation of these restrictive assumptions. By allowing $s_1$ and $s_2$ to depend on $z$, I allow for the distribution of $z$ (and $x$) to be different conditional on $s_1$ and $s_2$. However, the coefficient in the linear projection of $x$ on $z$ (that is, $\Pi$) remains the same conditional on $s_1$ and $s_2$ under Assumption 1.3.1.13 The second difference is in terms of the weight matrix used.
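For concreteness, the two steps in Definition 1.5.2.1 amount to an exactly identified first-stage regression in the sample that observes $x$, followed by a regression of $y$ on $z\breve{\Pi}$ in the sample that observes $y$. The sketch below is my own illustration (hypothetical names, not code from the dissertation), assuming the two samples are stacked into common arrays with $s_1$ and $s_2$ indicating which variables are observed.

```python
import numpy as np

def ts2sls(y, X, Z, s1, s2):
    """Two-sample 2SLS of Definition 1.5.2.1 (illustrative sketch).

    Sample 1 (s1 == 1) observes (y, z); sample 2 (s2 == 1) observes (x, z).
    """
    # Step 1: first-stage coefficients Pi from the sample that observes x,
    # i.e. the exactly identified GMM problem based on g3 only.
    Zx, Xx = Z[s2 == 1], X[s2 == 1]
    Pi = np.linalg.solve(Zx.T @ Zx, Zx.T @ Xx)      # (q+k) x (p+k)
    # Step 2: with the stated weight matrix, minimizing the g4 block is
    # equivalent to regressing y on z*Pi in the sample that observes y.
    Zy, yy = Z[s1 == 1], y[s1 == 1]
    Xhat = Zy @ Pi
    beta = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ yy)
    return beta, Pi
```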
Note that the weight matrix used in Step 2 of Definition 1.5.2.1 is the sample counterpart of $C_{44}^{-1}$ (divided by the variance of $v$, which is just a constant) when $v$ satisfies the following assumption:
$$\mathrm{E}(v^2 \mid z, s_1) = \sigma_v^2. \quad (1.5.1)$$
That is, the variance of $v$ is constant conditional on both the instruments $z$ and $s_1$. If this assumption is not true, then $\hat{\beta}_{TS2SLS}$ uses a sub-optimal weight matrix in Step 2 and efficiency gains are possible by using the optimal weight matrix.14 Let
$$\hat{D}_{32} = -\frac{1}{n}\sum_i \left[s_{2i}\,(z_i'z_i \otimes c_1, \ldots, z_i'z_i \otimes c_{p+k})\right], \qquad \hat{D}_{42} = -\frac{1}{n}\sum_i \left(s_{1i}\,\hat{\beta}' \otimes z_i'z_i\right) \quad (1.5.2)$$
be consistent estimates of $D_{32}$ and $D_{42}$ respectively, and let $\hat{C}_{44}$ and $\hat{C}_{33}$ be as defined in Section 1.4, where consistent estimates of $\beta$ and $\Pi$ can now be obtained using $\hat{\beta}_{TS2SLS}$.

13 Note that for $\hat{\beta}_{TS2SLS}$ to be consistent, we only need $\Pi$ to be the same conditional on $s_1$ and $s_2$, and not the individual moments involved in the calculation of $\Pi$.
14 Two things should be noted here:
• Because $s_1 s_2 = 0$, $g_3(\cdot) = s_2\, z' \otimes (x - z\Pi)'$ and $g_4(\cdot) = s_1\, z'(y - z\Pi\beta)$.
• Since $\mathrm{E}[g_2(\cdot)] = 0$ is an exactly identified set of moment conditions, the weight matrix does not matter for estimation in Step 1 of Definition 1.5.2.1.

Definition 1.5.2.2. Call the estimator of $\beta$ obtained by replacing
$$\hat{\Omega} = \left[\hat{C}_{44} + \hat{D}_{42}\left(\hat{D}_{32}'\,\hat{C}_{33}^{-1}\,\hat{D}_{32}\right)^{-1}\hat{D}_{42}'\right]^{-1}$$
in Step 2 of the procedure in Definition 1.5.2.1 the optimal TS2SLS estimator (or $\hat{\beta}_{TS2SLS-O}$).

This is the optimal sequential GMM estimator under the assumptions made so far, and its asymptotic variance is given in the following result.

Proposition 1.5.2.1. Under Assumption 1.3.1, $\hat{\beta}_{TS2SLS-O}$ is the optimal sequential GMM estimator of $\beta$, and has an asymptotic variance given by
$$\mathrm{Avar}\left[\sqrt{n}(\hat{\beta}_{TS2SLS-O} - \beta)\right] = \left\{D_{41}'\left[C_{44} + D_{42}\left(D_{32}'\,C_{33}^{-1}\,D_{32}\right)^{-1}D_{42}'\right]^{-1}D_{41}\right\}^{-1}.$$
Since $\hat{\beta}_{TS2SLS}$ uses a sub-optimal weight matrix as opposed to $\hat{\beta}_{TS2SLS-O}$, the latter will be no less efficient than the former.

Proposition 1.5.2.2. Under Assumption 1.3.1, $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{TS2SLS} - \beta)] - \mathrm{Avar}[\sqrt{n}(\hat{\beta}_{TS2SLS-O} - \beta)]$ is positive semi-definite.

The proposed estimator $\hat{\beta}$ is then equally efficient as $\hat{\beta}_{TS2SLS-O}$.

Proposition 1.5.2.3. Under Assumption 1.3.1, $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{TS2SLS-O} - \beta)] = \mathrm{Avar}[\sqrt{n}(\hat{\beta} - \beta)]$.

From Propositions 1.5.2.2 and 1.5.2.3, we can conclude that $\hat{\beta}$ is no less efficient than $\hat{\beta}_{TS2SLS}$.

Proposition 1.5.2.4. Under Assumption 1.3.1, $\mathrm{Avar}[\sqrt{n}(\hat{\beta}_{TS2SLS} - \beta)] - \mathrm{Avar}[\sqrt{n}(\hat{\beta} - \beta)]$ is positive semi-definite.

Inoue & Solon (2005) address the issue of optimal weighting using a joint GMM and allowing for conditional heteroskedasticity. Their framework, however, is more restrictive than necessary. First, they start with zero conditional means of the errors in (1.2.1) and (1.2.4), which rules out the important case where (1.2.1) and (1.2.4) are just linear projections and the data are MCAR. Second, they impose restrictions on the second and third moments of $x$ and $z$, which this framework does not.
For this case, McDonough & Millimet (2017) discuss a sequential estimator which is the counterpart of linear imputation in the case where 𝑥 is exogenous in equation (1.2.1). Definition 1.5.3.1 Call the estimator of 𝛽 obtained by the following procedure the imputation estimator (or 𝛽ˆ 𝐼𝑚 𝑝 ). Step 1: Obtain Π̂ by minimizing (1.4.2), where 𝑔(.) contains only 𝑔2 (.) and Ω̂ = 𝐼. Step 2: Estimate 𝛽 by minimizing (1.4.2), where 𝑔(.) = 𝑔5 (.) = 𝑧0 {𝑦 − [𝑠𝑥 + (1 − 𝑠)𝑧Π̂] 𝛽}, Ω̂ = [𝑛−1 𝑖=1 𝑔5𝑖 (.)𝑔5𝑖 (.) 0] −1 and Π̂ is treated as given. Í𝑛 So in the first step, we estimate the first stage coefficients Π. We then replace the missing values of 𝑥 with 𝑧Π̂ and estimate 𝛽 in the second step using 2SLS on the full sample and treating the fitted values of 𝑥 as given. It is straightforward to show that this estimator is no more efficient than 𝛽. ˆ Consider the sequential estimator of 𝛽 that first estimates Π using 𝑔2 (.) and then estimates 𝛽 using 𝑔1 (.) and 𝑔4 (.), where 𝑔4 (.) uses the estimated Π from the first step. Definition 1.5.3.2 Call the estimator of 𝛽 obtained by the following procedure the sequential estimator (or 𝛽ˆ𝑆𝑒𝑞 ). Step 1: Same as Step 1 in Definition 1.5.3.1. 15 Step 2: Estimate 𝛽 by minimizing (1.4.2), where 𝑛 𝑔(𝛽, Π̂) = 𝑔1 (𝛽) 0, 𝑔4 (𝛽, Π̂) 0 0, Ω̂ = [𝑛−1 Õ 𝑔𝑖 (.)𝑔𝑖 (.) 0] −1  𝑖=1 and Π̂ is treated as given. By standard GMM theory, we know that 𝛽ˆ is no less efficient than 𝛽ˆ𝑆𝑒𝑞 , since it is a sequential estimator (as opposed to a joint estimator) based on the same moment conditions as 𝛽.15 ˆ Moreover, 𝑔5 (.), which is the moment condition used in Step 2 of Definition 1.5.3.1 can be obtained by adding 𝑔1 (𝛽) and 𝑔4 (𝛽, Π̂), which are the moment conditions used in step 2 of Definition 1.5.3.2. Since 𝛽ˆ𝑆𝑒𝑞 uses 𝑔5 (.) and an additional moment condition, it is no less efficient than 𝛽ˆ 𝐼𝑚 𝑝 . Thus we can conclude that 𝛽ˆ is no less efficient than 𝛽ˆ 𝐼𝑚 𝑝 and there is no reason to choose the latter over the former other than computational convenience. 1.5.4 Dummy variable method A common method used to deal with missingness in 𝑥 in the case where 𝑥 is exogenous is the dummy variable method, which entails replacing the missing values of 𝑥 with zeros and including an indicator for missingness as a covariate. As shown by Abrevaya & Donald (2017), this method is inconsistent unless some zero restrictions are imposed in the population. This method continues to be inconsistent in the current framework where 𝑥 is endogenous. Let 𝑃(𝑠1 = 1) = 1, that is, 𝑦 is always observed. Also note that (1.2.4) implies 𝑥 1 = 𝑧1 Π11 + 𝑥2 Π21 + 𝑟 1 , (1.5.3) where Π11 , Π21 and 𝑟 1 constitute the first 𝑝 columns of Π1 , Π2 and 𝑟 respectively.16 Then (1.2.1) and (1.5.2) imply 𝑦 = [𝑠2 𝑥1 + (1 − 𝑠2 )(𝑧1 Π11 + 𝑥 2 Π21 + 𝑟 1 )] 𝛽1 + 𝑥 2 𝛽2 + 𝑢. (1.5.4) 15Prokhorov & Schmidt (2009), Theorem 2.2, part 5. 16One can similarly write 𝑥2 = 𝑧1 Π12 + 𝑥2 Π22 + 𝑟 2 . However, it is clear that both Π12 and 𝑟 2 are identically 0 and Π22 is a 𝑘 × 𝑘 identity matrix. 16 Since 𝑥 2 contains the constant, write 𝑥 2 = (1, 𝑥22 ) where 𝑥 22 constitutes the last (𝑘 − 1) columns of 𝑥 2 . Correspondingly, write Π21 = (Π0211 , Π0212 ) 0, where Π211 is the first row of Π21 and Π212 constitutes the last (𝑘 − 1) rows of Π21 . Plugging this into (1.5.3) and re-arranging gives  𝑦 = 𝑠2 𝑥 1 𝛽1 + (1 − 𝑠2 ) 𝑧1 Π11 + Π211 + 𝑥 22 Π212 + 𝑟 1 𝛽1 + 𝑥 2 𝛽2 + 𝑢. 
The dummy variable method omits the covariates $(1 - s_2)z_1$ and $(1 - s_2)x_{22}$ from equation (1.5.5) and estimates by 2SLS the equation
$$y = s_2 x_1\beta_1 + (1 - s_2)\Pi_{211} + x_2\beta_2 + e \quad (1.5.6)$$
using instruments $(s_2 z_1, 1 - s_2, x_2)$, where $e \equiv (1 - s_2)(z_1\Pi_{11} + x_{22}\Pi_{212})\beta_1 + r_1\beta_1 + u$. However, since each of these instruments is now correlated with the new error $e$, 2SLS will not yield consistent estimates in general unless we impose some zero restrictions in the population.

Proposition 1.5.4.1. The 2SLS estimators of $\beta$ from equation (1.5.6) using instruments $(s_2 z_1, 1 - s_2, x_2)$ are inconsistent unless (i) $\beta_1 = 0$ or (ii) $\Pi_{11} = \Pi_{212} = 0$.

The first condition implies that $x_1$ is irrelevant in the model of interest (1.2.1), so the best solution is to drop it. The second implies that neither the excluded instruments $z_1$ nor the always observed covariates $x_{22}$ help in explaining $x_1$, in which case any estimation method based on $z_1$ cannot be used at all.

1.6 Missing instruments

In Sections 1.2-1.5, I discussed the case where $y$ and the endogenous elements of $x$ (that is, $x_1$) contain missing values, while the instruments $z$ are always observed. In this section, I consider the case where the excluded instruments $z_1$ contain missing values. This includes as a special case missingness in covariates when all the covariates are exogenous. Starting with the population model in Section 1.2, I now additionally introduce a linear projection of the excluded instruments $z_1$ on the always observed exogenous covariates $x_2$:
$$z_1 = x_2\Gamma + e, \quad (1.6.1)$$
where, by definition of a linear projection,
$$\mathrm{E}(x_2'e) = 0. \quad (1.6.2)$$
As discussed in Section 1.5.4, (1.2.4) implies that
$$x_1 = z_1\Pi_{11} + x_2\Pi_{21} + r_1. \quad (1.6.3)$$
Plugging (1.6.1) into (1.6.3) gives us a first stage in terms of $x_2$ only:
$$x_1 = x_2(\Gamma\Pi_{11} + \Pi_{21}) + (e\Pi_{11} + r_1). \quad (1.6.4)$$
Plugging (1.6.4) into (1.2.1) gives us a reduced form for $y$ in terms of $x_2$ only:
$$y = x_2(\Gamma\Pi_{11}\beta_1 + \Pi_{21}\beta_1 + \beta_2) + (e\Pi_{11}\beta_1 + r_1\beta_1 + u). \quad (1.6.5)$$
Now, for observation $i$, let
$$s_{3i} = \begin{cases} 1 & \text{if } z_{1i} \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$
I impose the following assumptions on the missingness mechanism, which can be interpreted in a similar way as Assumption 1.3.1.

Assumption 1.6.1: (i) $\mathrm{E}(s_3 z'u) = 0$; (ii) $\mathrm{E}(s_3 z'r) = 0$; (iii) $\mathrm{E}(s_3 x_2'e) = 0$.

This gives us the following moment functions:
$$h(\beta, \Pi, \Gamma) = \begin{pmatrix} s_3\, z'(y - x\beta) \\ s_3\, z' \otimes (x_1 - z_1\Pi_{11} - x_2\Pi_{21})' \\ s_3\, x_2' \otimes (z_1 - x_2\Gamma)' \\ (1 - s_3)\, x_2' \otimes [x_1 - x_2(\Gamma\Pi_{11} + \Pi_{21})]' \\ (1 - s_3)\, x_2'\,[y - x_2(\Gamma\Pi_{11}\beta_1 + \Pi_{21}\beta_1 + \beta_2)] \end{pmatrix} \equiv \begin{pmatrix} h_1(\beta, \Pi, \Gamma) \\ h_2(\beta, \Pi, \Gamma) \\ h_3(\beta, \Pi, \Gamma) \\ h_4(\beta, \Pi, \Gamma) \\ h_5(\beta, \Pi, \Gamma) \end{pmatrix} \quad (1.6.6)$$
This vector of moment functions is basically using the original model of interest and first stage when $z_1$ is observed ($h_1(\cdot)$ and $h_2(\cdot)$). When $z_1$ is missing, it uses the reduced forms for $x_1$ and $y$
This gives us a set of 2𝑘 (1 + 𝑝) + 𝑞(1 + 𝑝 + 𝑘) moment conditions for ( 𝑝 + 𝑘)(1 + 𝑞 + 𝑘) + 𝑘𝑞 parameters, giving us 𝑘 (1 + 𝑝) + 𝑞 − 𝑝 overidentifying restrictions.17 Then, let ℎ(𝛽, ¯ Π, Γ) = 𝑛−1 𝑖=1 Í𝑛 ℎ(𝑦𝑖 , 𝑥𝑖 , 𝑧𝑖 , 𝑠3𝑖 , 𝛽, Π, Γ), Λ be a square matrix of order 2𝑘 (1 + 𝑝) + 𝑞(1 + 𝑝 + 𝑘) that is nonrandom, symmetric, and positive definite, and Λ̃ be a first step consistent estimate of Λ. Then, 𝛽˜0, 𝑣𝑒𝑐( Π̃) 0, 𝑣𝑒𝑐( Γ̃) 0 is the standard two-step GMM estimator that minimizes the objective  function ¯ ℎ(𝛽, Π, Γ) 0 Λ̃ ℎ(𝛽, ¯ Π, Γ). (1.6.7) Let 𝛽˜𝑐𝑐 be the complete cases GMM that minimizes (1.6.7) with ℎ(.) = ℎ1 (.) and Λ̃ is a consistent estimate of [E(ℎ1 (.)ℎ1 (.) 0] −1 . Then we know that 𝛽˜ is no less efficient than 𝛽˜𝑐𝑐 because the former uses more moment conditions. Proposition 1.6.1. Under Assumption 1.6.1, √ √ 𝑛( 𝛽˜𝑐𝑐 − 𝛽) − 𝐴𝑣𝑎𝑟 𝑛( 𝛽˜ − 𝛽) is positive semi-definite.   𝐴𝑣𝑎𝑟 Similar to Section 1.5, we can break down the efficiency gains by 𝛽1 and 𝛽2 , the coefficients on the endogenous and exogenous elements of 𝑥 respectively, and show that it is possible to obtain strict gains in efficiency for both 𝛽1 and 𝛽2 .18 This is in contrast with the sequential estimator discussed in Mogstad & Wiswall (2012). They consider the case where 𝑝 = 𝑞 = 1 and the estimator proceeds in two steps. In the first step, it estimates Γ using ℎ3 (.). It then replaces the missing values of 𝑧1 by the imputed values 𝑥 2 Γ̂, where Γ̂ is the first step estimate of Γ, and then in the second step estimates 𝛽 by minimizing (1.6.7) where ℎ(.) = 𝑧∗0 (𝑦 − 𝑥 𝛽) and 𝑧∗ = 𝑠3 𝑧1 + (1 − 𝑠3 )𝑥2 Γ̂, 𝑥2 .19 They show that the estimate of 𝛽1 using  17Since Π21 = 0 and Π22 = 𝐼, the only elements of Π that are being estimated are Π11 and Π21 . 18This proof is analogous to that of Proposition 5.1.2 and is available upon request. 19The weight matrix is irrelevant in this case due to exact identification. 19 this estimator is numerically equivalent to that using complete cases estimator 𝛽˜𝑐𝑐 . Thus, 𝛽˜ does better than this estimator as it is possible to obtain strict gains in efficiency for both 𝛽1 and 𝛽2 . Abrevaya & Donald (2011) also propose a GMM estimator to deal with missingness in 𝑧1 . Their estimator is based on the moment functions ℎ 𝐴 (𝛽) = 𝑧0𝐴 (𝑦 − 𝑥 𝛽), (1.6.8) where 𝑧 𝐴 = (𝑥 2 , (1 − 𝑠3 )𝑥2 , 𝑠3 𝑧1 ). It is clear that the moment functions in (1.6.6) contain (1.6.8) as a linear combination plus some additional moment conditions. Thus, 𝛽˜ is no less efficient than their estimator. Now, when 𝑥 1 is exogenous in equation (1.2.1) in the sense that E(𝑥 10 𝑢) = 0, (1.6.9) then 𝑥1 = 𝑧1 . In this case, ℎ2 (.) = 0 and ℎ4 (.) cannot be used anymore.20 So our vector of moment conditions is 0 (𝑦 − 𝑥 𝛽)      𝑠 3 𝑥  0         E  𝑠3 𝑥 20 ⊗ (𝑥 1 − 𝑥 2 Γ) 0  = 0    (1.6.10)      (1 − 𝑠3 )𝑥 20 [𝑦 − 𝑥 2 (Γ𝛽1 + 𝛽2 )]  0         These are the moment conditions used by Abrevaya & Donald (2017) who consider the case of missingness in a single exogenous covariate. Thus, the framework presented here encompasses theirs as a special case when when 𝑥 1 is exogenous and 𝑝 = 1. 1.7 Nonlinearity in covariates and instruments Nonlinear functions of the covariates, like squares and interactions, are frequently used in empirical work. If these covariates are endogenous, one generally uses nonlinear functions of the instruments as well. 
In general, any sequential procedure that plugs the fitted values of the covariates or the instruments from a first step into nonlinear functions of these variables produces inconsistent estimates. For instance, traditional imputation used when the covariates are exogenous will result in inconsistency if one replaces the missing value of, say, the square of a covariate with the square of the imputed value of that covariate. In this section, I provide estimators that are consistent as well as more efficient than these sequential procedures and the complete case methods.

Suppose that the model of interest is now given by
$$y = F_1(x_1, x_2)\beta + u, \quad (1.7.1)$$
where $x_1$ is a $1 \times p$ vector of potentially endogenous covariates, $x_2$ is a $1 \times k$ vector of exogenous covariates, $x = (x_1, x_2)$, and $F_1(x_1, x_2)$ is a $1 \times j_1$ vector of potentially nonlinear functions of $x_1$ and $x_2$, $j_1 \geq (p+k)$. For instance, suppose $p = k = 1$. Then $F_1(x_1, x_2)$ could equal $(x_1, x_1^2, x_1 x_2, x_2)$. We also have a $1 \times q$ vector of instruments $z_1$ for $x_1$, $q \geq p$. I assume
$$\mathrm{E}(u \mid z_1, x_2) = 0, \quad (1.7.2)$$
and allow for $\mathrm{E}(x_1'u) \neq 0$. So I now assume that $u$ has a zero mean conditional on $z_1$ and $x_2$.21 The first stage is given by the linear projection
$$F_1(x_1, x_2) = F_2(z_1, x_2)\Pi + r. \quad (1.7.3)$$
$F_2(z_1, x_2)$ is a $1 \times j_2$ vector of instruments, where $F_2(\cdot)$ is chosen by the researcher, and $\Pi$ is a $j_2 \times j_1$ matrix of coefficients. Because $F_1(x_1, x_2)$ contains nonlinear functions of $x_1$, $F_2(z_1, x_2)$ will most likely also contain nonlinear functions of $z_1$ and $x_2$. For instance, as discussed in Wooldridge (2010),22 if $F_1(x_1, x_2) = (x_1, x_1^2, x_1 x_2, x_2)$, one might want to choose $F_2(z_1, x_2) = (z_1, z_1^2, z_1 x_2, x_2, x_2^2)$. By definition,
$$\mathrm{E}[F_2(z_1, x_2)'r] = 0. \quad (1.7.4)$$
From equations (1.7.1) and (1.7.3), we get a reduced form for $y$ in terms of only $z_1$ and $x_2$:
$$y = F_2(z_1, x_2)\Pi\beta + v, \quad (1.7.5)$$
where $v \equiv r\beta + u$. Using (1.7.2) and (1.7.4), we have that
$$\mathrm{E}[F_2(z_1, x_2)'v] = 0. \quad (1.7.6)$$

21 This is a standard assumption made in the literature when the model includes nonlinear functions of covariates, and it motivates the choice of instruments.
22 Section 9.5.

1.7.1 Missingness in outcome and covariates

Starting with the case of missingness in $y$ and $x_1$, let the scheme of missingness be the same as described in Section 1.3. That is, both $y$ and $x_1$ contain missing values, while $z_1$ and $x_2$ are always observed. In this case, what seems like the natural extension of the sequential estimator discussed in McDonough & Millimet (2017) will be inconsistent for $\beta$ because it performs the “forbidden regression” discussed in Wooldridge (2010).23 For instance, let $F_1(x_1, x_2) = (x_1, x_1^2, x_1 x_2, x_2)$. The sequential estimator would regress $x_1$ on $F_2(z_1, x_2)$ and obtain the fitted values (say $\hat{x}_1$) in the first step, replace the missing values of $x_1$, $x_1^2$ and $x_1 x_2$ with $\hat{x}_1$, $(\hat{x}_1)^2$ and $\hat{x}_1 x_2$ respectively, and then estimate $\beta$ using 2SLS in the second step, treating the fitted values as data. The inconsistency is a result of replacing nonlinear functions of $x_1$ with the same nonlinear functions of fitted values. The correct way to proceed is to simultaneously estimate the first stage parameters $\Pi$ and the parameters of interest $\beta$. I first impose the following assumption on the missingness mechanism.

Assumption 1.7.1.1.
(i) $\mathrm{E}[s_1 s_2 F_2(z_1, x_2)'u] = 0$; (ii) $\mathrm{E}[s_1 s_2 F_2(z_1, x_2)'r] = 0$; (iii) $\mathrm{E}[s_1 F_2(z_1, x_2)'u] = 0$; (iv) $\mathrm{E}[s_1 F_2(z_1, x_2)'r] = 0$; (v) $\mathrm{E}[s_2 F_2(z_1, x_2)'r] = 0$.

This gives us the following moment conditions:
$$\mathrm{E}[g_{NL}(\beta, \Pi)] = \mathrm{E}\begin{pmatrix} s_1 s_2\, F_2(z_1, x_2)'\,[y - F_1(x_1, x_2)\beta] \\ s_1 s_2\, F_2(z_1, x_2)' \otimes [F_1(x_1, x_2) - F_2(z_1, x_2)\Pi]' \\ (1 - s_1) s_2\, F_2(z_1, x_2)' \otimes [F_1(x_1, x_2) - F_2(z_1, x_2)\Pi]' \\ s_1(1 - s_2)\, F_2(z_1, x_2)'\,[y - F_2(z_1, x_2)\Pi\beta] \end{pmatrix} = 0. \quad (1.7.7)$$
Compared to Section 1.4, we have simply replaced $x$ with $F_1(x_1, x_2)$ and $z$ with $F_2(z_1, x_2)$. Unlike Section 1.4, though, the traditional imputation is not consistent.

23 Section 9.5.2.

1.7.2 Missingness in instruments

Next we move to the missing data scenario of Section 1.6. That is, the only variables that contain missing values are the excluded instruments $z_1$. We re-write equation (1.7.3) as follows, by breaking up $F_2(z_1, x_2)$ into elements that do and do not depend on $z_1$:
$$F_1(x_1, x_2) = F_{21}(z_1, x_2)\Pi_a + F_{22}(x_2)\Pi_b + r, \quad (1.7.8)$$
where $F_2(z_1, x_2)\Pi \equiv F_{21}(z_1, x_2)\Pi_a + F_{22}(x_2)\Pi_b$, $F_{21}(z_1, x_2)$ is a $1 \times j_{21}$ vector that includes all elements of $F_2(z_1, x_2)$ that are functions of $z_1$, $F_{22}(x_2)$ is a $1 \times j_{22}$ vector that includes all elements of $F_2(z_1, x_2)$ that are functions only of $x_2$, and $j_2 = j_{21} + j_{22}$. From our example in Section 1.7.1, if $F_2(z_1, x_2) = (z_1, z_1^2, z_1 x_2, x_2, x_2^2)$, then $F_{21}(z_1, x_2) = (z_1, z_1^2, z_1 x_2)$ and $F_{22}(x_2) = (x_2, x_2^2)$. To handle missingness in $z_1$, we also need a linear projection of each of the instruments on $F_{22}(x_2)$:24
$$F_{21}(z_1, x_2) = F_{22}(x_2)\Gamma + e, \quad (1.7.9)$$
where by definition
$$\mathrm{E}[F_{22}(x_2)'e] = 0. \quad (1.7.10)$$
This gives us the reduced forms of $F_1(x_1, x_2)$ and $y$ in terms of $x_2$ only. Plugging (1.7.9) into (1.7.8), we get
$$F_1(x_1, x_2) = F_{22}(x_2)(\Gamma\Pi_a + \Pi_b) + e\Pi_a + r. \quad (1.7.11)$$
Similarly, plugging (1.7.11) into (1.7.1), we get
$$y = F_{22}(x_2)(\Gamma\Pi_a + \Pi_b)\beta + (e\Pi_a + r)\beta + u. \quad (1.7.12)$$
Next, I impose the following assumption on the missingness mechanism.

Assumption 1.7.2.1. (i) $\mathrm{E}[s_3 F_2(z_1, x_2)'u] = 0$; (ii) $\mathrm{E}[s_3 F_2(z_1, x_2)'r] = 0$; (iii) $\mathrm{E}[s_3 F_{22}(x_2)'e] = 0$.

24 Based on the exact functional form of $F_1(\cdot)$, one might want to choose different functions of $x_2$ in equation (1.7.9) than those in $F_{22}(\cdot)$. The framework can easily be extended to allow for that by replacing $F_{22}(x_2)$ with a different function $F_3(x_2)$ in (1.7.9) and deriving the reduced forms in (1.7.11) and (1.7.12) accordingly. For ease of exposition, I stick here with the same functions of $x_2$ in both (1.7.8) and (1.7.9).
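To fix ideas, the instrument sets of the running example and the projection in (1.7.9) can be built directly from the data; the sketch below is only my own illustration (names are hypothetical, not from the dissertation), for the $p = k = 1$ case with a single excluded instrument.

```python
import numpy as np

def build_F1(x1, x2):
    # F1(x1, x2) = (x1, x1^2, x1*x2, x2) in the running example
    return np.column_stack([x1, x1**2, x1 * x2, x2])

def build_F2(z1, x2):
    # F2(z1, x2) = (z1, z1^2, z1*x2, x2, x2^2); the first three columns depend
    # on z1 (F21), the last two are functions of x2 only (F22), as in (1.7.8)
    F21 = np.column_stack([z1, z1**2, z1 * x2])
    F22 = np.column_stack([x2, x2**2])
    return F21, F22

def project_F21_on_F22(F21, F22, s3):
    # Linear projection (1.7.9), estimated on the cases with z1 observed
    obs = s3 == 1
    Gamma = np.linalg.lstsq(F22[obs], F21[obs], rcond=None)[0]   # j22 x j21
    return Gamma
```

The point of the joint moment conditions that follow is that the incomplete cases enter only through the reduced forms in $F_{22}(x_2)$; at no point are imputed values of $z_1$ plugged into the nonlinear functions themselves.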
Assumption 1.7.2.1 gives us the following moment conditions:
$$\mathrm{E}[h_{NL}(\beta, \Pi, \Gamma)] = \mathrm{E}\begin{pmatrix} s_3\, F_2(z_1, x_2)'\,[y - F_1(x_1, x_2)\beta] \\ s_3\, F_2(z_1, x_2)' \otimes [F_1(x_1, x_2) - F_2(z_1, x_2)\Pi]' \\ s_3\, F_{22}(x_2)' \otimes [F_{21}(z_1, x_2) - F_{22}(x_2)\Gamma]' \\ (1 - s_3)\, F_{22}(x_2)' \otimes [F_1(x_1, x_2) - F_{22}(x_2)(\Gamma\Pi_a + \Pi_b)]' \\ (1 - s_3)\, F_{22}(x_2)'\,[y - F_{22}(x_2)(\Gamma\Pi_a + \Pi_b)\beta] \end{pmatrix} = 0. \quad (1.7.13)$$
In the case where $x_1$ is exogenous (and hence $x_1 = z_1$), this reduces to
$$\mathrm{E}[h_{NL}(\beta, \Pi, \Gamma)] = \mathrm{E}\begin{pmatrix} s_3\, F_2(z_1, x_2)'\,[y - F_1(x_1, x_2)\beta] \\ s_3\, F_{22}(x_2)' \otimes [F_{21}(x_1, x_2) - F_{22}(x_2)\Gamma]' \\ (1 - s_3)\, F_{22}(x_2)'\,[y - F_{22}(x_2)(\Gamma\Pi_a + \Pi_b)\beta] \end{pmatrix} = 0. \quad (1.7.14)$$
As discussed in Abrevaya & Donald (2017), when $x_1$ is exogenous, the second most commonly used method after the complete cases OLS is linear imputation. In the example we have been carrying along, where $F_1(x_1, x_2) = (x_1, x_1^2, x_1 x_2, x_2)$, it proceeds as follows. In the first step, it regresses $x_1$ on $x_2$ and obtains the fitted values (say $\tilde{x}_1$). In the second step, it replaces the missing values of $x_1$, $x_1^2$ and $x_1 x_2$ with $\tilde{x}_1$, $\tilde{x}_1^2$, and $\tilde{x}_1 x_2$ respectively. Not only does this method fail to use the optimal instruments for $x_1$ (as it fails to include the nonlinear functions of $x_2$ in the imputation equation), it performs a forbidden regression in the second step, and hence results in inconsistent estimates of $\beta$.

1.8 Monte Carlo simulations

1.8.1 Missingness in outcome and covariates

The data generating process is as follows:
$$y = 1 + x_1\beta_1 + x_2\beta_2 + u,$$
where $x_1$ is a scalar and $x_2 = [1\;\; x_{22}\;\; x_{23}]$ is a $1 \times 3$ vector. Moreover,
$$\begin{pmatrix} x_{22} \\ x_{23} \end{pmatrix} \sim N\left(\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 2 & 0.1 \\ 0.1 & 3 \end{pmatrix}\right).$$
The third is the imputation estimator 25 discussed in Section 1.5.3, followed by the dummy variable method and finally the proposed estimator. The first thing to note is that the proposed estimator works best in terms of efficiency in all cases, with substantial reductions in the standard deviation relative to other estimators. This is true not only for 𝛽22 and 𝛽23 , the coefficients on 𝑥2 , but also for 𝛽1 , the coefficient on the covariate with missing values. The pattern on bias relative to other estimators is less clear, but the proposed estimator still has the smallest root mean squared error out of all the estimators in all cases. The gains in efficiency of the proposed estimator are more pronounced when we have het- eroskedasticity. Relative to the complete cases GMM, the gains increase as the percentage of complete cases decreases, which is to be expected as the proposed estimator now incorporates more additional information into estimation. The gains remain substantial in the case where the coefficient on the covariate with missing values is small. The complete cases GMM is more efficient than the complete cases 2SLS when there is heteroskedasticity because of the optimal weighting, as expected. Yet it is less efficient than the proposed estimator in all cases, including when the error in the model of interest is homoskedastic. The imputation estimator on the other hand is not guaranteed to bring any efficiency gains relative to the complete cases GMM, and hence has no reason to be preferred over the former. The dummy variable method shows severe bias in all but the last design where the coefficient on the variable with missing value is close to 0, and does not even guarantee gains in efficiency over the complete cases GMM. Thus, this estimator cannot be recommended either. 1.8.2 Missingness in instruments The data generating process is as follows. 𝑦 = 1 + 𝑥 1 𝛽1 + 𝑥 2 𝛽2 + 𝑢. 26 where 𝑥 1 is a scalar and 𝑥 2 = [1 𝑥 22 𝑥 23 ] is a 1 × 3 vector. Moreover,      ! 𝑥  1  2 0.2 22   ∼𝑁  ,        𝑥  1 0.2 1   23            𝛽1 = 1, 𝛽2 = (𝛽21 , 𝛽22 , 𝛽23 ) 0 is fixed at (1, 1, 1) 0 throughout all designs. The error is 𝑢 = 𝜎𝑢 𝑢 ∗ , where 𝑢 ∗ is a standard normal, and 𝜎𝑢 will be used to vary the error variance. We have a single instrument 𝑧1 such that 𝑧1 = 𝑥 2 Γ + 𝑒, where Γ = (1, 0.5, 0.5) 0 and 𝑒 is standard normal. The first stage is given by 𝑥 1 = 𝑧1 Π11 + 𝑥2 Π21 + 𝑟 1 , where Π11 = 1, Π21 = (1, 0.5, 0.5) 0 and 𝑟 1 = 𝑟 1∗ + 𝑢 ∗ , where 𝑟 1∗ is a standard normal and 𝑢 ∗ is the part of 𝑥 1 that is correlated with 𝑢. The missingness is based on a uniform random variable, making the data MCAR. 𝑠∗ ∼ U (0, 1), 𝑠3 = 1[𝑠∗ > 𝑎]. I consider 3 designs. Design 5: 𝜎𝑢 = 4, 𝑎 = 0.5. q Design 6: 𝜎𝑢 = 𝑒𝑥 𝑝(𝑧211 ), 𝑎 = 0.5. q Design 7: 𝜎𝑢 = 𝑒𝑥 𝑝(𝑧211 ), 𝑎 = 0.4. Design 5 is the case of homoskedasticity in the model of interest, design 6 allows for 𝑢 to be heteroskedastic, and design 7 increases the percentage of complete cases. For all the designs, I do 1000 iterations with 𝑛 = 2000. The results are qualitatively similar to those in the previous sub-section. The proposed estimator substantially improves efficiency and has a lower root mean squared error relative to the complete cases 2SLS in all cases including that of homoskedasticity.25 The gains are more pronounced in the case of heteroskedasticity and increase with a reduction in the percentage of complete cases. 
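To make the simulation design concrete, the following is a minimal Python sketch (not part of the original code) of the Design 5 data generating process and of the complete-cases 2SLS benchmark it is compared against; the variable names and the use of a just-identified IV formula for the benchmark are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Design 5 of Section 1.8.2 (sigma_u = 4, a = 0.5); parameter values follow the
# text, the implementation details are my own.
beta1 = 1.0
beta2 = np.array([1.0, 1.0, 1.0])                       # coefficients on (1, x22, x23)
x2 = rng.multivariate_normal([1, 1], [[2, 0.2], [0.2, 1]], size=n)
X2 = np.column_stack([np.ones(n), x2])                  # x2 = [1, x22, x23]

u_star = rng.standard_normal(n)
u = 4.0 * u_star                                        # homoskedastic error, sigma_u = 4
e = rng.standard_normal(n)
z1 = X2 @ np.array([1.0, 0.5, 0.5]) + e                 # instrument equation z1 = x2*Gamma + e
r1 = rng.standard_normal(n) + u_star                    # endogeneity enters through u*
x1 = 1.0 * z1 + X2 @ np.array([1.0, 0.5, 0.5]) + r1     # first stage

y = 1.0 + beta1 * x1 + X2 @ beta2 + u

s3 = rng.uniform(size=n) > 0.5                          # z1 observed (MCAR), s3 = 1[s* > a]

# Complete-cases 2SLS benchmark: just identified, so beta_hat = (W'X)^{-1} W'y.
X = np.column_stack([X2, x1])[s3]
W = np.column_stack([X2, z1])[s3]
beta_hat = np.linalg.solve(W.T @ X, W.T @ y[s3])
print("complete-cases 2SLS (const, x22, x23, x1):", beta_hat.round(3))
```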
25The only exception is 𝛽1 in the case of homoskedasticity, where the two estimators perform equally well. 27 The imputation estimator for 𝛽1 is numerically equivalent to the complete cases 2SLS, as noted in Mogstad & Wiswall (2012). For 𝛽22 and 𝛽23 , this estimator always does no better than the proposed estimator, and sometimes does worse than even the complete cases 2SLS. Since it does not guarantee efficiency gains over the complete cases or the proposed estimator, there is no reason to prefer it over either of the two. 1.9 Empirical application I estimate the effect of physician’s advice to reduce weight on calorie consumption by individuals using the estimator proposed in Section 1.4. As noted by Joshi and Wooldridge (2020), physician’s advice is a low cost and precisely targeted intervention that can affect food consumption habits of individuals. The effect of physician’s advice on outcomes like smoking, dietary and exercise behavior has been considered by Loureiro & Nayga Jr (2006), Loureiro & Nayga Jr (2007), Secker- Walker et al. (1998), and Ortega-Sanchez et al. (2004), among others. The data comes from five most recent cycles of National Health and Nutritional Examination Survey (NHANES): 2007-08, 2009-2010, 2011-12, 2013-14, and 2015-16.26 The NHANES is designed to assess the health and nutritional status of adults and children in the US. It examines a nationally representative sample of about 5000 persons each year and contains demographic, socioeconomic, dietary, and health-related questions. The dependent variable (𝑦) is the log of calorie intake of individuals. The endogenous covariate (𝑥 1 ) is a binary variable which equals one if the physician advised the individual to lose weight. The excluded instruments (𝑧1 ) are binary variables indicating whether the individual has health insurance and a regular source of care. Other explanatory variables (𝑥 2 ) include demographic variables like age, gender, race, education, and income of the individual as well as health-related variables such as the individual’s body mass index (BMI), and indicators for whether they have high blood pressure, high cholesterol, Arthritis, a heart condition and Diabetes. Also included are year fixed effects and all variables have been demeaned. 26I would like to thank Riju Joshi for providing me with neatly compiled and cleaned data. 28 I restrict the sample to overweight individuals, that is, those with BMI greater than or equal to 25. I also exclude from the sample women who are pregnant, and individuals for whom the covariates 𝑥 2 or the excluded instruments 𝑧1 are missing. The final sample consists of 11,512 observations with 𝑦 missing for 952 observations and 𝑥 1 missing for 2173 observations. Table B8 reports the results for two estimators: the complete cases GMM and the estimator proposed in Section 1.4 which uses the incomplete cases. The former results in the coefficient of interest being insignificant, which continues to hold true with the reduced standard error resulting from the proposed estimator. The standard errors for all other coefficients are smaller as well using the proposed estimator, while the coefficients for most variables remain similar to those obtained using the complete cases GMM. 1.10 Conclusion I have offered some simple GMM estimators that improve efficiency over the currently used methods in the presence of missing data in linear regression models with endogenous covariates. 
I consider the cases of missingness in the outcomes and the endogenous covariates as well as that of missingness in the instruments. The latter includes the missingness in exogenous covariates as a special case. I also consider models that are nonlinear in the covariates and need a more careful treatment to ensure consistency. Thus, my framework can be used to deal with missingness in a wide variety of models frequently used in empirical work. In ongoing work, I am extending these methods to the case of panel data and models nonlinear in the parameters. 29 CHAPTER 2 IMPUTING MISSING COVARIATE VALUES IN NONLINEAR MODELS 2.1 Introduction Nonlinear models are widely considered better suited to explain limited dependent variables than linear models. With missing covariate values - a ubiquitous problem in empirical research - nonlinear models become even more important because unlike the case where all variables are observed, estimates from linear models are now not necessarily consistent for parameters in the best linear approximations to nonlinear models.1 Yet not much of the vast literature on missing data has explicitly addressed the unique issues that arise when dealing with missingness in nonlinear models. Economists deal with missing covariate values predominantly in three ways. The most common thing to do is to just use the “complete cases" - the observations for which all the covariates are observed. While easy to use, this method can lead to substantial loss of efficiency because of discarding the incomplete cases. This has inspired methods that make use of these incomplete cases. The first commonly used method in this regard is the dummy variable method (DVM), which replaces the missing values with 0 and includes an indicator for missingness as an additional covariate. The second commonly used method is two-step regression imputation. In the first step, it regresses the covariate with missing values (CMV) on the always-observed covariates using the complete cases and uses the estimated coefficients to predict missing values of the CMV. In the second step, it estimates the model of interest using all observations with this “composite" CMV, which consists of both observed and predicted values (Dagenais, 1973). Table D1 summarizes the usage of these methods in 5 highly ranked economics journals in the last 3 years. Out of 846 papers, about 26% reported having missing data. Out of these, about 62%, 19% and 14% used the complete cases estimator, the DVM and the two-step regression imputation respectively.2 1I discuss this issue in detail in Section 2.6.4. Also see Wooldridge (2002). 2Of all the other methods used, no single category stood out. About 18% of the papers use other methods, most 30 The choice of method comes down to consistency and relative efficiency. The complete cases estimator generally requires the least number of assumptions in both linear and nonlinear models to be consistent. For instance, when the econometric model is correctly specified, say a model of a mean or a distribution conditional on the covariates, it only requires that the missingness depends only on the covariates (Wooldridge, 2002). However, as mentioned above, it can be inefficient relative to the other two estimators that use the incomplete cases. The DVM on the other hand is generally inconsistent even in linear models (Jones, 1996) and as I show in this paper, in nonlinear models as well, unless some very strong zero assumptions are imposed. 
Even with these assumptions, it does not guarantee efficiency improvements over the complete cases estimator (Abrevaya & Donald, 2017). Yet this method is still widely used as is evident from Table D1, perhaps because of its ease of use. Two-step regression imputation also imposes additional assumptions on the model relative to the complete cases estimator, but these assumptions are much more plausible than those imposed by DVM. Practically, the most important one is ruling out the dependence of missingness on the CMV itself. Under this assumption, it is generally consistent in linear models. However, in this paper I show that even under this assumption, this method is generally in- consistent in nonlinear models. Most notable are models based on conditional means, including commonly used models like probit, tobit, and Poisson regression. The reason for inconsistency is that this method simply plugs the imputed values in the same objective function that one would minimize if there were no missing values. However, in nonlinear models, this objective function does not necessarily capture the correct relationship between the observed variables in observations with missing values. The core issue is that conditional expectation does not pass through nonlinear functions, unlike linear ones. For instance, in binary choice models, simply plugging imputed values in the standard probit response probability and maximizing the resulting log likelihood will generally result in inconsistency in estimators of both the structural parameters and other quantities of which are ad-hoc. This includes methods like replacing missing values with observations from the previous or following time period in case of panel data (5%), replacing missing values with 0 (4%), and dropping or combining variables with missingness (2%). Some papers also used hot deck (3%) and context specific imputation methods (2%). There were 2 instances each of multiple imputation and weighting. 31 of interest, such as average partial effects. To my knowledge, this issue has not been addressed in the literature and on the contrary, it has been claimed that this method is consistent in binary choice models (DeCanio & Watkins, 1998). The key contribution of this paper is to propose a one-step imputation estimator which relies on the same assumptions as two-step imputation, but is consistent in nonlinear models. It simulta- neously estimates the model of interest and the imputation model using the complete cases and a “reduced form" using all observations. The reduced form is a version of the main model in which we have “integrated out" the CMV using the imputation model, and hence it is able to make use of the incomplete cases. The key is that it correctly captures the relationship between the observed variables when the CMV is missing. The estimator provides potentially strict efficiency gains over the complete cases estimator for all coefficients, and using a generalized method of moments (GMM) framework provides the overidentification test as a test for underlying restrictions. The method is an extension of Abrevaya & Donald (2017), who proposed a one-step imputation estimator for linear models. I provide a unified treatment of linear and nonlinear models using an M-estimation framework. Special cases include linear and nonlinear least squares, conditional maximum likelihood, and quasi maximum likelihood methods. A second contribution is that I allow for nonlinearity in the imputation model itself. 
As mentioned above, the presence of missing data heightens the concerns about using linear models for limited dependent variables. Therefore, when imputing, say, a binary CMV, a probit may be more appropriate than a linear probability model. To my knowledge, the regression imputation literature has focused solely on linear imputation models, though some nonlinear models have been discussed in the context of multiple imputation, which is a Bayesian approach to imputation (Rubin, 1987; Van Buuren, 2007).

The rest of this paper is organized as follows. Section 2.2 lays out the population minimization problems obtained from the underlying model of interest and the imputation model. Section 2.3 describes the selection problem and the estimation of selection probabilities. Section 2.4 derives the proposed estimator, its asymptotic distribution, and a simple estimator of the asymptotic variance. Section 2.5 discusses two practically important examples: nonlinear models for fractional responses and for nonnegative responses, including count responses. Within each model, I consider a continuous and a binary CMV. Section 2.6 compares the proposed estimator to three other estimators: complete cases, two-step imputation, and DVM. Section 2.7 provides simulation results showing the relative performance of these estimators. Section 2.8 provides an empirical application to the estimation of the association between grade variance and educational attainment, as considered in Sandsor (2020). Section 2.9 concludes. Proofs, tables and figures are given in the appendices.

2.2 The population optimization problems

We start with the population optimization problem which defines the parameters of interest. Let $y$ be a $1 \times J$ random vector taking values in $\mathcal{Y} \subset \mathbb{R}^J$ and $x$ be a $1 \times (K+1)$ random vector taking values in $\mathcal{X} \subset \mathbb{R}^{K+1}$. We are interested in explaining $y$ in terms of $x$. Some aspect of the joint distribution of $(y, x)$ depends on an $L_1 \times 1$ parameter vector, $\alpha$, contained in a parameter space $\mathcal{A} \subset \mathbb{R}^{L_1}$. Let $f_1(y, x_1, x_2, \alpha)$ denote an objective function.

Assumption 2.2.1. $\alpha_0$ is the unique solution to the population minimization problem
\[
\min_{\alpha \in \mathcal{A}} \mathrm{E}[f_1(y, x_1, x_2, \alpha)]. \qquad (2.2.1)
\]
Often, $\alpha_0$ indexes some correctly specified feature of the distribution of $y$ conditional on $x$, such as a conditional mean or a conditional median. But we will derive consistency and asymptotic normality results for a general class of problems in which the underlying population model can be misspecified in some way.

Next, let $x = (x_1, x_2)$, where $x_1$ is a scalar³ and $x_2$ is a $1 \times K$ random vector, taking values in $\mathcal{X}_1 \subset \mathbb{R}$ and $\mathcal{X}_2 \subset \mathbb{R}^K$ respectively, with $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$. As discussed in Section 2.3, we will allow $x_1$ to contain missing values and assume that $(y, x_2)$ is always observed. Thus, we are interested in imputing $x_1$ using $x_2$. Let some aspect of the joint distribution of $(x_1, x_2)$ depend on an $L_2 \times 1$ parameter vector $\beta$, contained in a parameter space $\mathcal{B} \subset \mathbb{R}^{L_2}$. Let $f_2(x_1, x_2, \beta)$ denote an objective function, and consider the population optimization problem which characterizes the imputation parameters.

³ The discussion for a random vector $x_1$, all elements of which are missing and observed at the same time, is essentially the same.

Assumption 2.2.2. $\beta_0$ is the unique solution to the population minimization problem
\[
\min_{\beta \in \mathcal{B}} \mathrm{E}[f_2(x_1, x_2, \beta)]. \qquad (2.2.2)
\]
Similar to the model of interest, the underlying population model here can be misspecified in some way.
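To fix ideas, here is a sketch of the sample analogues of (2.2.1) and (2.2.2) for the probit / linear-normal pairing treated later in Section 2.5.1.1 (with a homoskedastic imputation error). The code is purely illustrative and the function and variable names are mine.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def f1_bar(alpha, y, x):
    """Sample analogue of (2.2.1): negative Bernoulli quasi-log-likelihood with E(y|x) = Phi(x @ alpha)."""
    p = np.clip(norm.cdf(x @ alpha), 1e-10, 1 - 1e-10)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def f2_bar(beta, x1, x2):
    """Sample analogue of (2.2.2): negative normal log-likelihood (up to a constant) for x1 | x2."""
    theta, log_sigma = beta[:-1], beta[-1]
    r = x1 - x2 @ theta
    return np.mean(log_sigma + 0.5 * r ** 2 / np.exp(2 * log_sigma))

# Illustration on synthetic data with no missing values:
rng = np.random.default_rng(0)
n = 500
x2 = np.column_stack([np.ones(n), rng.standard_normal(n)])
x1 = x2 @ np.array([0.5, 1.0]) + rng.standard_normal(n)
x = np.column_stack([x1, x2])
y = (x @ np.array([1.0, -0.5, 0.5]) + rng.standard_normal(n) > 0).astype(float)

alpha_hat = minimize(f1_bar, np.zeros(3), args=(y, x)).x
beta_hat = minimize(f2_bar, np.zeros(3), args=(x1, x2)).x
```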
The case that has been well studied in the classical imputation literature is where the underlying models for both 𝑓1 (𝑦, 𝑥, 𝛼) and 𝑓2 (𝑥1 , 𝑥2 , 𝛽) are linear. The framework presented here allows for both the underlying models to be nonlinear as long as they are estimable using M-estimators, which includes maximum likelihood, quasi-maximum likelihood, nonlinear least squares, and many other procedures. For instance, if both 𝑦 and 𝑥1 are binary, we can let both 𝑓1 (𝑦, 𝑥, 𝛼) and 𝑓2 (𝑥 1 , 𝑥2 , 𝛽) be negative of probit log-likelihoods, instead of basing them on linear models. Alternatively, 𝑦 could be a nonnegative count variable and 𝑥 1 could be continuous, in which case we can let 𝑓1 (𝑦, 𝑥, 𝛼) be the negative of Poisson log-likelihood and let 𝑓2 (𝑥 1 , 𝑥2 , 𝛽) come from a linear model. We consider these examples in detail in Section 2.5. Next, we define a reduced form M-estimation problem which is based only on the always- observed variables (𝑦, 𝑥2 ). This reduced form is what allows us to use the incomplete cases, and hence is the key to the efficiency gains of the proposed estimator. Let 𝛾 = 𝑞(𝛼, 𝛽) be a (potentially nonlinear) 𝐿 3 × 1 function of the parameters of interest 𝛼 and the imputation parameters 𝛽, where 𝛾 is contained in a parameter space Γ ⊂ R 𝐿 3 and 𝐿 3 ≤ 𝐿 1 + 𝐿 2 . We assume that we can obtain a “reduced form" objective function 𝑓3 (𝑦, 𝑥2 , 𝛾) in terms of the always-observed variables 𝑦 and 𝑥 2 as well as 𝛾 such that 𝛾0 = 𝑞(𝛼0 , 𝛽0 ) uniquely minimizes this function. Assumption 2.2.3. 𝛾0 is the unique solution to the population minimization problem min E[ 𝑓3 (𝑦, 𝑥2 , 𝛾)]. (2.2.3) 𝛾∈Γ 34 The reduced form model underlying 𝑓3 (𝑦, 𝑥2 , 𝛾) is derived by “integrating out" 𝑥1 from the model of interest using the imputation model. When the model of interest is a linear projection or a model of conditional mean linear in the parameters, the reduced form can be derived using iterated projections or iterated expectations properties without having to do explicit integration. This is the case considered in Abrevaya & Donald (2017). In commonly used models nonlinear in the parameters like probit and Poisson regression, “substituting" for 𝑥 1 using the imputation model eliminates the need for explicit integration. We consider these examples in Section 2.5. The dimension of 𝛾 warrants some discussion. It is possible that 𝐿 3 < 𝐿 1 + 𝐿 2 , that is, the reduced form only identifies certain functions of 𝛼0 and 𝛽0 , and not each element of 𝛼0 and 𝛽0 separately. Some examples are the case of linear projections considered in Abrevaya & Donald (2017) and the case of probit with continuous 𝑥 1 considered in Section 5.1.1. It is however, also possible that 𝐿 3 = 𝐿 1 + 𝐿 2 , in which case 𝛾0 = (𝛼00 , 𝛽00 ) 0, for instance in the case of probit with binary 𝑥 1 considered in Section 2.5.1.2.4 Assumptions (2.2.1)-(2.2.3) imply that (𝛼0 , 𝛽0 ) is the unique solution to the following equations, provided that we can interchange the expectation and the derivative.  ∗    𝑔 (𝑦, 𝑥1 , 𝑥2 , 𝛼)  0  1    E[𝑔 ∗ (𝑦, 𝑥1 , 𝑥2 , 𝛼, 𝛽)] = E  𝑔 ∗ (𝑥 1 , 𝑥2 , 𝛽)  = 0 ,     (2.2.4)  2     ∗  𝑔3 (𝑦, 𝑥2 , 𝛼, 𝛽)  0        where 𝑔1∗ (𝑦, 𝑥1 , 𝑥2 , 𝛼) ≡ ∇𝛼 𝑓1 (𝑦, 𝑥1 , 𝑥2 , 𝛼) 0 is the 𝐿 1 × 1 score of 𝑓1 (𝑦, 𝑥1 , 𝑥2 , 𝛼) , 𝑔2∗ (𝑥 1 , 𝑥2 , 𝛽) ≡ ∇ 𝛽 𝑓2 (𝑥 1 , 𝑥2 , 𝛽) 0 is the 𝐿 2 × 1 score of 𝑓2 (𝑥 1 , 𝑥2 , 𝛽), and 𝑔3∗ (𝑦, 𝑥2 , 𝛼, 𝛽) ≡ ∇𝛾 𝑓3 (𝑦, 𝑥2 , 𝛾) 0 is the 𝐿 3 × 1 score of 𝑓3 (𝑦, 𝑥2 , 𝛾). 
Equation (2.2.4) gives us a set of moment conditions, a transformation of which will be the basis of the proposed estimator, as discussed in Section 2.4.

⁴ As a note on notation, I express functions as explicitly depending on $\gamma$ only when it is necessary to take into account the nature of $\gamma$. For instance, when $L_3 < L_1 + L_2$, the score of $f_3(y, x_2, \gamma)$ should contain only partial derivatives with respect to $\gamma$, and not with respect to individual elements of $\alpha$ and $\beta$, to prevent redundancy in the resulting moment conditions. But for the most part, when looking at the derivatives of $f_1(\cdot)$, $f_2(\cdot)$, and $f_3(\cdot)$, we only need to acknowledge the fact that they are functions of $(\alpha, \beta)$.

2.3 Non random sampling and inverse probability weighting

I characterize nonrandom sampling through a selection indicator. For any random draw $(y_i, x_{1i}, x_{2i})$ from the population, we also draw $s_i$, a binary indicator equal to unity if $x_{1i}$ is observed, and zero otherwise. We assume that $y_i$ and $x_{2i}$ are always observed. A generic element from the population is now denoted $(y, x_1, x_2, s)$. Then the following assumption characterizes the nature of selection.

Assumption 2.3.1 (i) $x_1$ is observed whenever $s = 1$; $(y, x_2)$ is always observed. (ii) There is a random vector $z$ such that $P(s = 1|y, x, z) = P(s = 1|z) \equiv p(z)$. (iii) For all $z \in \mathcal{Z} \subset \mathbb{R}^M$, $p(z) > 0$. (iv) $z$ is always observed.

Part (i) simply defines data observability. Parts (ii) and (iii) are the key assumptions. They state that selection is based on observable variables. This is the same as the "missing at random" assumption used in the statistics literature (Rubin, 1976). Part (ii) states that $s$ is independent of $(y, x)$ conditional on $z$. Because the only variable assumed to contain missing values is $x_1$, we can, at a minimum, allow $z$ to contain $(y, x_2)$. Apart from this, $z$ can also contain some "outside" variables that are good predictors of selection and are always observed. Assumption 2.3.1 is therefore more general than allowing $s$ to depend only on the covariates $x_2$, which is the case considered in Abrevaya & Donald (2017) in the context of linear models. Moreover, the framework presented here can also be used when $y$ contains missing values: we simply redefine $s$ to equal 1 when both $y$ and $x_1$ are observed, and rule out $z$ containing $y$ (in addition to ruling out $z$ containing $x_1$). The proposed estimator discussed in the next section will then impute using the observations for which only $x_1$ is missing, and discard the $y$-missing observations.

For selection as described in Assumption 2.3.1, note that the first and second moment functions in (2.2.4) can only use the $s = 1$ observations since they depend on $x_1$, while the third moment function is able to use the $s = 0$ observations as well. We will weight the moment functions by the inverse of the appropriate probabilities in order to account for this selection. To this end, we specify a model for the selection probability. We assume that a conditional density determining selection is correctly specified, and that the standard regularity conditions required for maximum likelihood estimation (MLE) of the selection model are satisfied. Let $D(\cdot|\cdot)$ denote a conditional distribution.

Assumption 2.3.2 (i) $G(z, \delta)$ is a parametric model for $p(z)$, where $\delta \in \Delta \subset \mathbb{R}^P$ and $G(z, \delta) > 0$ for all $z \in \mathcal{Z} \subset \mathbb{R}^M$, $\delta \in \Delta$. (ii) There exists $\delta_0 \in \Delta$ such that $p(z) = G(z, \delta_0)$. (iii) The estimator $\hat{\delta}$ solves the binary response problem
\[
\max_{\delta \in \Delta} \sum_{i=1}^{N} \{s_i \log[G(z_i, \delta)] + (1 - s_i)\log[1 - G(z_i, \delta)]\}. \qquad (2.3.1)
\]
Given $\hat{\delta}$, we can form $G(z_i, \hat{\delta})$ for all $i$.
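The first step is therefore just a binary-response MLE of the selection indicator on $z$. Below is a minimal sketch of this step, assuming (purely for illustration) a logit specification for $G(z, \delta)$; any strictly positive parametric model estimable by the problem in (2.3.1) would be handled in the same way. The function names are mine.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(delta, s, z):
    """Negative of the binary-response log-likelihood in (2.3.1) under a logit G(z, delta)."""
    g = 1.0 / (1.0 + np.exp(-(z @ delta)))
    g = np.clip(g, 1e-10, 1 - 1e-10)
    return -np.sum(s * np.log(g) + (1 - s) * np.log(1 - g))

def fit_selection_model(s, z):
    """Return delta_hat and the fitted probabilities G(z_i, delta_hat)."""
    delta_hat = minimize(neg_loglik, np.zeros(z.shape[1]), args=(s, z)).x
    g_hat = 1.0 / (1.0 + np.exp(-(z @ delta_hat)))
    return delta_hat, g_hat

# Here z collects (y, x2) and any outside predictors of selection; the fitted
# probabilities g_hat are the weights used in the moment functions of Section 2.4.
```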
This leads us to the problem of estimation. 2.4 Moment conditions and GMM The proposed estimator is a GMM estimator based on the following transformation of the moment functions in (2.2.4). 𝑔1𝑖 (𝛼, 𝛽; 𝛿)   [𝑠𝑖 /𝐺 (𝑧𝑖 , 𝛿)]𝑔 ∗ (𝑦𝑖 , 𝑥1𝑖 , 𝑥2𝑖 , 𝛼)         1      𝑔𝑖 (𝛼, 𝛽; 𝛿) = 𝑔2𝑖 (𝛼, 𝛽; 𝛿)  ≡  [𝑠𝑖 /𝐺 (𝑧𝑖 , 𝛿)]𝑔 ∗ (𝑥 1𝑖 , 𝑥2𝑖 , 𝛽)  . (2.4.1)    2  𝑔3𝑖 (𝛼, 𝛽; 𝛿)      ∗ 𝑔3 (𝑦𝑖 , 𝑥2𝑖 , 𝛼, 𝛽)       Because both 𝑔1∗ (𝑦𝑖 , 𝑥1𝑖 , 𝑥2𝑖 , 𝛼) and 𝑔2∗ (𝑥 1𝑖 , 𝑥2𝑖 , 𝛽) are functions of 𝑥 1𝑖 , they can only use the complete cases - the observations for which 𝑠𝑖 = 1. We thus multiply these by 𝑠𝑖 and weight by the inverse of selection probability in the usual inverse probability weighting (IPW) fashion (Wooldridge, 2002, 2007). Since 𝑔3∗ (𝑦𝑖 , 𝑥2𝑖 , 𝛼, 𝛽) is a function only of the always-observed variables 𝑦𝑖 and 𝑥2𝑖 , it can use all the observations including the incomplete cases and hence we do not need to weight it. For a generic element from the population (𝑦, 𝑥1 , 𝑥2 , 𝑧, 𝑠), denote this vector of moment func- tions by 𝑔(𝛼, 𝛽; 𝛿) and its individual elements by 𝑔 𝑗 (𝛼, 𝛽; 𝛿), 𝑗 = 1, 2, 3. This is a set of overi- dentified moment functions. 𝑔1 (.) exactly identifies the parameters of interest 𝛼0 and 𝑔2 (.) exactly identifies the imputation parameters 𝛽0 . The overidentification (and hence the efficiency gains) in the system come from 𝑔3 (.). The number of overidentifying restrictions is 𝐿 3 , the dimension of the reduced form parameters 𝛾0 . Given the first step estimate 𝛿, ˆ we can write the sample analogue 37 of moment conditions based on (2.4.1) as Õ𝑁 ˆ = 𝑁 −1 𝑔¯ 𝑗 (𝛼, 𝛽; 𝛿) ˆ 𝑔 𝑗𝑖 (𝛼, 𝛽; 𝛿), 𝑗 = 1, 2, 3, (2.4.2) 𝑖=1 and 𝑔(𝛼, ¯ ˆ = [ 𝑔¯ 1 (𝛼, 𝛽; 𝛿) 𝛽; 𝛿) ˆ 0, 𝑔¯ 2 (𝛼, 𝛽; 𝛿) ˆ 0, 𝑔¯ 3 (𝛼, 𝛽; 𝛿) ˆ 0] 0. A GMM estimator based on (2.4.1) minimizes the following objective function with respect to (𝛼, 𝛽). ˆ 𝑄(𝛼, ˆ = 𝑔(𝛼, 𝛽; 𝛿) ¯ ˆ 0 𝑊ˆ 𝑔(𝛼, 𝛽; 𝛿) ¯ ˆ 𝛽; 𝛿), (2.4.3) 𝑝 where 𝑊ˆ is an estimated weight matrix such that 𝑊ˆ − → 𝑊. We first discuss identification of (𝛼0 , 𝛽0 ). The limit function for 𝑄(𝛼, ˆ ˆ is 𝑄(𝛼, 𝛽; 𝛿0 ) = 𝛽; 𝛿) E[𝑔(𝛼, 𝛽; 𝛿0 )] 0 𝑊 E[𝑔(𝛼, 𝛽; 𝛿0 )]. Lemma 2.4.1. (Identification) Assume that 𝑊 is a symmetric positive definite matrix. Then under Assumptions 2.2.1-2.2.3, 2.3.1, and 2.3.2, 𝑄(𝛼, 𝛽; 𝛿0 ) has a unique minimum at (𝛼0 , 𝛽0 ). For a nonsingular 𝑊, the GMM identification condition reduces to E[𝑔(𝛼, 𝛽; 𝛿0 )] ≠ 0 if (𝛼, 𝛽) ≠ (𝛼0 , 𝛽0 ). Sufficient is to show that a corresponding condition holds for each element of 𝑔(𝛼, 𝛽; 𝛿0 ). For instance, E[𝑔1 (𝛼, 𝛽; 𝛿0 )] ≠ 0 if 𝛼 ≠ 𝛼0 follows from identification of 𝛼0 in the population (Assumption 2.2.1) and the assumptions on selection (Assumptions 2.3.1 and 2.3.2). A formal proof of Lemma 2.4.1, along with all other proofs in the rest of the paper are given in the appendix. The GMM estimator based on the general weight matrix 𝑊ˆ is defined as the following. Definition 2.4.1: Call the estimator of (𝛼, 𝛽) that minimizes (2.4.3), ( 𝛼, ˆ ˆ 𝛽). Consistency of ( 𝛼, ˆ 𝛽)ˆ follows from Lemma 2.4.1 and standard regularity conditions given in the following theorem. Theorem 2.4.1 (Consistency) Assume that 1. {(𝑦𝑖 , 𝑥𝑖 , 𝑧𝑖 , 𝑠𝑖 ) : 𝑖 = 1, . . . , 𝑁 } are random draws from the population satisfying Assumptions 2.3.1 and 2.3.2. 2. The assumptions in Lemma 2.4.1 hold. 38 3. A, B, Γ, Δ, A × Δ, B × Δ, and A × B × Γ are compact subsets of R 𝐿 1 , R 𝐿 2 , R 𝐿 3 , R𝑃 , R 𝐿 1 +𝑃 , R 𝐿 2 +𝑃 , and R 𝐿 1 +𝐿 2 +𝐿 3 respectively. 4. 
𝑓1 (𝑦, 𝑥, 𝛼), 𝑓2 (𝑥, 𝛽) and 𝑓3 (𝑦, 𝑥2 , 𝛾) are twice differentiably continuous on A, B, Γ respec- tively for each (𝑦, 𝑥), 𝑥 and (𝑦, 𝑥2 ) in Y × X, X and Y × X2 respectively. 5. 𝐺 (𝑧, 𝛿) is continuous in Δ for each 𝑧 ∈ Z, twice continuously differentiable on 𝑖𝑛𝑡 (Δ), and 𝛿0 ∈ 𝑖𝑛𝑡 (Δ). For some 𝑎 > 0, 𝐺 (𝑧, 𝛿) ≥ 𝑎 for all 𝑧 ∈ Z, 𝛿 ∈ Δ. 6. For all (𝛼, 𝛽, 𝛾) ∈ A × B × Γ, |𝑔 ∗ (𝑦, 𝑥, 𝛼, 𝛽, 𝛾)| ≤ 𝑏(𝑦, 𝑥), where 𝑏(𝑦, 𝑥) ≡ [𝑏 1 (𝑦, 𝑥) 0, 𝑏 2 (𝑥) 0, 𝑏 3 (𝑦, 𝑥2 ) 0] 0 and 𝑏(.) is a function such that E[𝑏(𝑦, 𝑥)] < ∞. 𝑝 Then ( 𝛼, ˆ 𝛽)ˆ −→ (𝛼0 , 𝛽0 ) as 𝑁 → − ∞. The consistency of ( 𝛼, ˆ 𝛽)ˆ follows from standard arguments involving consistency of two-step M- estimators. First, analogous to the discussion in Wooldridge (2002), Lemma 2.4 of Newey & McFadden (1994) applies to show that 𝑔1 (𝛼, 𝛽; 𝛿), 𝑔2 (𝛼, 𝛽; 𝛿) and 𝑔3 (𝛼, 𝛽; 𝛿) satisfy the uniform weak law of large numbers over A × Δ, B × Δ, Γ respectively under Assumptions 1, 3, 4, 5 and 6 of Theorem 2.4.1. Then the averages in (2.4.2) can be shown to converge to E[𝑔 𝑗 (𝛼, 𝛽; 𝛿0 )], 𝑗 = 1, 2, 3, (2.4.4) uniformly over A, B, and Γ respectively. Along with the identification from Lemma 2.4.1, this can be shown to imply consistency of ( 𝛼, ˆ 𝛽)ˆ for (𝛼0 , 𝛽0 ). Now, assuming that E[𝑔(𝛼, 𝛽; 𝛿0 )] is differentiable at (𝛼0 , 𝛽0 ), its derivative is defined as the following.  0  𝐷  11 0   ∗   𝐷 0 ≡ E[∇ (𝛼0,𝛽0)0 𝑔(𝛼, 𝛽; 𝛿0 )| (𝛼,𝛽)=(𝛼 ,𝛽 ) ] = E[∇ (𝛼0,𝛽0)0 𝑔 (𝛼, 𝛽)| (𝛼,𝛽)=(𝛼 ,𝛽 ) ] =  0 𝐷  ,  0 0 0 0 0 22    0  𝐷 31 𝐷 032     (2.4.5) 39 where 𝐷 0𝑗1 = 𝜕𝑔 ∗𝑗 (𝛼, 𝛽)/𝜕𝛼| (𝛼,𝛽)=(𝛼 ,𝛽 ) and 𝐷 0𝑗2 = 𝜕𝑔 ∗𝑗 (𝛼, 𝛽)/𝜕 𝛽| (𝛼,𝛽)=(𝛼 ,𝛽 ) , 𝑗 = 1, 2, 3 and 0 0 0 0 the first equality follows by the standard IPW argument given Assumptions 2.3.1 and 2.3.2. Then the following result gives the asymptotic distribution of ( 𝛼, ˆ 𝛽). ˆ Theorem 2.4.2 (Asymptotic normality) Assume that 1. The assumptions in Theorem 2.4.1 hold. 2. (𝛼0 , 𝛽0 ) ∈ 𝑖𝑛𝑡 (A × B). 3. 𝑔(𝛼, 𝛽; 𝛿) is twice continuously differentiable on 𝑖𝑛𝑡 (A × B × Δ). 4. 𝐷 0 is of full rank 𝐿 1 + 𝐿 2 . 5. E[sup (𝛼,𝛽;𝛿)∈A×B×Δ |∇ (𝛼,𝛽,𝛿) 𝑔(𝛼, 𝛽, 𝛿)|] < ∞. Then, √ 𝑑 𝑁 [( 𝛼ˆ 0, 𝛽ˆ0) 0 − (𝛼00 , 𝛽00 ) 0] −−−−→ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, (𝐷 00𝑊 𝐷 0 ) −1 𝐷 00𝑊 𝐹0𝑊 𝐷 0 (𝐷 00𝑊 𝐷 0 ) −1 ], (2.4.6) where 𝐹0 = E(𝑔𝑖 𝑔𝑖0) − {E(𝑔𝑖 𝑑𝑖0) [E(𝑑𝑖 𝑑𝑖0)] −1 E(𝑑𝑖 𝑔𝑖0)} ◦ 𝑅, 𝑔𝑖 ≡ 𝑔𝑖 (𝛼0 , 𝛽0 ; 𝛿0 ), 𝑑𝑖 ≡ 𝑠𝑖 (∇𝛿 𝐺 0𝑖 /𝐺 𝑖 ) − (1 − 𝑠𝑖 ) [∇𝛿 𝐺 0𝑖 /(1 − 𝐺 𝑖 )] is the 𝑃 × 1 score of the binary response log-likelihood, 𝑅 is a square matrix of order 𝐿 1 + 𝐿 2 + 𝐿 3 with all elements being unity except the lower right 𝐿 3 × 𝐿 3 block which is a 0 matrix,5 𝐺 𝑖 ≡ 𝐺 (𝑧𝑖 , 𝛿0 ), 𝐻0 ≡ E[∇𝛿 𝑔(𝛼0 , 𝛽0 ; 𝛿0 )] and 𝜓(𝑠𝑖 , 𝑧𝑖 ) = −[E(𝑑𝑖 𝑑𝑖0)] −1 𝑑𝑖 . Standard GMM theory dictates that the optimal weight matrix to be used in (2.4.3) is 𝑊ˆ = 𝐹ˆ −1 , where 𝐹ˆ is a consistent estimate of 𝐹0 which can be obtained as  Õ𝑁   Õ𝑁  Õ𝑁  −1  Õ𝑁  𝐹ˆ = 𝑁 −1 𝑔ˆ𝑖 𝑔ˆ𝑖0 − 𝑁 −1 𝑔ˆ𝑖 𝑑ˆ𝑖0 𝑁 −1 𝑑ˆ𝑖 𝑑ˆ𝑖0 𝑁 −1 𝑑ˆ𝑖 𝑔ˆ𝑖0 ◦ 𝑅, (2.4.7) 𝑖=1 𝑖=1 𝑖=1 𝑖=1 where ˆ0 ˆ0     ∇ 𝐺 (𝑧 , 𝛿) ∇ 𝐺 (𝑧 , 𝛿) ˆ 𝛿), ˆ 𝑑ˆ𝑖 ≡ 𝑠𝑖 𝛿 𝑖 𝛿 𝑖 𝑔ˆ𝑖 ≡ 𝑔𝑖 ( 𝛼, ˆ 𝛽; − (1 − 𝑠𝑖 ) . (2.4.8) 𝐺 (𝑧𝑖 , 𝛿)ˆ 1 − 𝐺 (𝑧𝑖 , 𝛿) ˆ Then, the proposed estimator is the optimal GMM estimator based on (2.4.1), as defined below. 5◦ denotes a Hadamard product. 40 Definition 2.4.2: Call the estimator of (𝛼, 𝛽) that minimizes (2.4.3) with 𝑊ˆ = 𝐹ˆ −1 , the weighted joint GMM estimator or ( 𝛼ˆ 𝑊 𝐽 , 𝛽ˆ𝑊 𝐽 ). Because ( 𝛼ˆ 𝑊 𝐽 , 𝛽ˆ𝑊 𝐽 ) uses the optimal weight matrix, the asymptotic variance in (2.4.6) reduces to (𝐷 00 𝐹0−1 𝐷 0 ) −1 . 
A consistent estimator can be obtained using 𝐹ˆ and a consistent estimator of 𝐷 0 defined as Õ𝑁 𝐷ˆ = 𝑁 −1 [∇ (𝛼0,𝛽0)0 𝑔𝑖 ( 𝛼, ˆ 𝛿)]. ˆ 𝛽; ˆ (2.4.9) 𝑖=1 Then the following result follows from Theorem 2.4.2. Theorem 2.4.3 (Asymptotic Normality of the optimal GMM) Let all assumptions of Theorem 2.4.2 hold. Then, √ 0 , 𝛽ˆ0 ) 0 − (𝛼0 , 𝛽0 ) 0] − 𝑑 𝑁 [( 𝛼ˆ 𝑊 𝐽 𝑊𝐽 0 0 −−−→ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, (𝐷 00 𝐹0−1 𝐷 0 ) −1 ], (2.4.10) √ 0 , 𝛽ˆ0 ) 0 − (𝛼0 , 𝛽0 ) 0]} is given by and a consistent estimator of 𝐴𝑣𝑎𝑟 { 𝑁 [( 𝛼ˆ 𝑊 𝐽 𝑊𝐽 0 0 ( 𝐷ˆ 0 𝐹ˆ −1 𝐷) ˆ −1 , (2.4.11) where 𝐹ˆ is given in (2.4.7) and 𝐷ˆ is given in (2.4.9). Further, we can use the standard test of overidentifying restrictions based on the objective function evaluated at the parameter estimates proposed by Hansen (1982). The original result was obtained for a standard GMM. It is straightforward to extend the proof to the case where the moment functions depend on an estimate of 𝛿 from a first step. Proposition 2.4.1: Let all assumptions of Theorem 2.4.2 hold. Then under the null hypothesis that E[𝑔(𝛼0 , 𝛽0 ; 𝛿0 )] = 0, 𝑝 𝑁 𝑔( ˆ 0 𝐹ˆ −1 𝑔( ¯ 𝛼ˆ 𝑊 𝐽 , 𝛽ˆ𝑊 𝐽 ; 𝛿) ˆ −−−−→ 𝜒2 . ¯ 𝛼ˆ 𝑊 𝐽 , 𝛽ˆ𝑊 𝐽 ; 𝛿) (2.4.12) 𝐿3 2.5 Examples The proposed estimator can be applied to many cases relevant for empirical research. I provide two important examples: a binary or fractional 𝑦 and a nonnegative 𝑦, both of which are estimated using quasi-MLE. 41 2.5.1 Models for binary and fractional responses Binary response models are one of the most commonly used nonlinear models in empirical research. Suppose that 𝑦 is a variable taking values in the unit interval, [0, 1]. This includes the case where 𝑦 is binary but also allows 𝑦 to be a continuous proportion. Further, 𝑦 can have both discrete and continuous characteristics (for instance, 𝑦 can be a proportion that takes on zero or one with positive probability). We start by assuming that the mean of 𝑦 conditional on 𝑥 has a probit form. E(𝑦|𝑥 1 , 𝑥2 ) = Φ(𝛼10 𝑥1 + 𝑥 2 𝛼20 ) ≡ Φ(𝑥𝛼0 ), (2.5.1) where 𝑥 1 is a scalar and 𝑥 2 is a 1 × 𝑘 vector. If 𝑥 1 was always observed, we would simply estimate 𝛼0 using quasi-MLE with a Bernoulli log likelihood, which identifies the parameters in a correctly specified conditional mean by the virtue of being in the linear exponential family (Gourieroux et al., 1984). But because 𝑥1 is sometimes missing, now we additionally specify a model to impute 𝑥 1 using 𝑥2 and use it to obtain the reduced form conditional mean of 𝑦 given 𝑥 2 . I consider two cases: where 𝑥 1 is continuous, and where it is binary. 2.5.1.1 Continuous covariate with missing values We assume that the imputation model is linear. 𝑥1 = 𝑥 2 𝜃 0 + 𝑟, (2.5.2) 𝑟 |𝑥 2 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, 𝜎02 𝑒𝑥 𝑝(2𝑥 21𝜆 0 )], (2.5.3) where 𝑥 21 ⊂ 𝑥 2 . That is, 𝑥 1 is assumed to be normally distributed conditional on 𝑥 2 . To make the model more flexible, we allow the error to be heteroskedastic with variance dependent on 𝑥 21 . Typically, 𝑥 21 will include all elements of 𝑥 2 except the constant, so that the case where 𝑟 is homoskedastic with variance 𝜎02 is obtained as a special case by setting 𝜆 0 = 0. The conditional pdf of 𝑥1 is given by (𝑥 1 − 𝑥 2 𝜃 0 ) 2   1 𝑓 (𝑥 1 |𝑥2 , 𝛽0 ) = q 𝑒𝑥 𝑝 − . (2.5.4) 2 2𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆 ) 2𝜋𝜎0 𝑒𝑥 𝑝(2𝑥 21𝜆 0 ) 0 21 0 42 In order to find E(𝑦|𝑥2 ), we integrate out 𝑥 1 from E(𝑦|𝑥 1 , 𝑥2 ) given in (2.5.1) using the density given in (2.5.4). ∫ ∞   𝑥1 − 𝑥2 𝜃 0 E(𝑦|𝑥 2 ) = Φ(𝑥𝛼0 )𝜎0−1 𝑒𝑥 𝑝(−𝑥21𝜆 0 )𝜙 𝑑𝑥1 −∞ 𝜎0 𝑒𝑥 𝑝(𝑥 21𝜆 0 )   𝑥2 (𝛼10 𝜃 0 + 𝛼20 ) =Φ q . 
(2.5.5) 2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆 ) 1 + 𝛼10 0 21 0 We can derive E(𝑦|𝑥2 ) without carrying out the explicit integration as well. Define a binary variable as following. 𝑤 ∗ = 𝛼10 𝑥 1 + 𝑥 2 𝛼20 + 𝑢 ≡ 𝑥𝛼0 + 𝑢, (2.5.6) 𝑢|𝑥 1 , 𝑥2 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 1), (2.5.7) 𝑤 = 1[𝑤 ∗ > 0]. (2.5.8) Next, note that E(𝑤|𝑥1 , 𝑥2 ) = E(𝑦|𝑥 1 , 𝑥2 ) = Φ(𝛼10 𝑥 1 + 𝑥 2 𝛼20 ), (2.5.9) and so, by iterated expectations, E(𝑤|𝑥2 ) = E(𝑦|𝑥 2 ). (2.5.10) (2.5.10) is what allows us to obtain E(𝑦|𝑥 2 ). Substituting (2.5.2) into (2.5.6) gives 𝑤 ∗ = 𝑥 2 (𝛼10 𝜃 0 + 𝛼20 ) + 𝑣, (2.5.11) where 𝑣 ≡ 𝑢 + 𝛼10𝑟 and 𝑣|𝑥 2 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, 1 + 𝛼10 2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆 )] under the assumptions made 0 21 0 so far. Therefore, 𝑤 = 1[𝑥2 (𝛼10 𝜃 0 + 𝛼20 ) + 𝑣 > 0], (2.5.12) which implies E(𝑤|𝑥 2 ) = 𝑃(𝑤 = 1|𝑥2 ) = 𝑃[𝑣 > −𝑥 2 (𝛼10 𝜃 0 + 𝛼20 )|𝑥2 ], (2.5.13) which gives the same expression as (2.5.5). Now we can use quasi-MLE with a Bernoulli log likelihood for both the model of interest (2.5.1) and the reduced form (2.5.5), and full MLE for the 43 imputation model using (2.5.4). The objective functions in (2.2.1)-(2.2.3) are given by 𝑓1 (𝑦, 𝑥, 𝛼) = −𝑙𝑜𝑔{Φ(𝑥𝛼) 𝑦 [1 − Φ(𝑥𝛼)] (1−𝑦) } (𝑥 1 − 𝑥 2 𝜃) 2    1 𝑓2 (𝑥1 , 𝑥2 , 𝛽) = −𝑙𝑜𝑔 p 𝑒𝑥 𝑝 − 2𝜋𝜎 2 𝑒𝑥 𝑝(2𝑥21𝜆) 2𝜎 2 𝑒𝑥 𝑝(2𝑥 21𝜆) 𝑓3 (𝑦, 𝑥2 , 𝛾) = −𝑙𝑜𝑔{Φ[ℎ1 (𝑥 2 , 𝛾)] 𝑦 {1 − Φ[ℎ1 (𝑥 2 , 𝛾)]} (1−𝑦) }, (2.5.14) q where ℎ1 (𝑥 2 , 𝛾) ≡ [𝑥 2 (𝛼1 𝜃 + 𝛼2 )]/ 1 + 𝛼12 𝜎 2 𝑒𝑥 𝑝(2𝑥 21𝜆) and in the general notation of Section 2.2, 𝛽 = (𝜃, 𝜎 2 , 𝜆) and 𝛾 = [(𝛼1 𝜃 + 𝛼2 ), 𝛼12 𝜎 2 , 𝜆]. The issue of defining 𝛾 warrants some discussion. It can be shown that first, the partial derivatives of 𝑓3 (𝑦, 𝑥2 , 𝛾) with respect to (𝛼1 , 𝜃) are linear combinations of those with respect to (𝛼2 , 𝜎 2 , 𝜆). Since we use the the weighted versions of these partial derivatives as moment functions, we should use only those taken with respect to (𝛼2 , 𝜎 2 , 𝜆) to prevent redundancy in the resulting moment conditions. Second, the partial derivatives with respect to (𝛼2 , 𝜎 2 , 𝜆) are just scaled versions of those with respect to 𝛾 as defined above, which makes this definition of 𝛾 preferable both intuitively and for algebraic simplicity. 44 The objective functions in (2.5.14) result in the following score functions. [𝑦 − Φ(𝑥𝛼)]𝜙(𝑥𝛼) 𝑔1∗ (𝑦, 𝑥, 𝛼) = 𝑥 0 Φ(𝑥𝛼) [1 − Φ(𝑥𝛼)]   𝑥 20 (𝑥 1 − 𝑥 2 𝜃)     2 𝜎 𝑒𝑥 𝑝(2𝑥21𝜆)      (𝑥 1 − 𝑥 2 𝜃) 2   ∗ 1  𝑔2 (𝑥 1 , 𝑥2 , 𝛽) =  −   𝑒𝑥 𝑝(2𝑥21𝜆)𝜎 4 𝜎 2        0 (𝑥1 − 𝑥2 𝜃) 2    21 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆) − 1  𝑥   21    q 𝑥 20      1 + 𝛼 2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆)   1 21  (𝜃𝛼 + )     ∗  𝑒𝑥 𝑝(2𝑥 21 𝜆)𝑥 2 1 𝛼 2  𝑦 − Φ[ℎ 1 (𝑥 2 , 𝛾)] 𝑔3 (𝑦, 𝑥2 , 𝛾) =  2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆)] 3/2  𝜙[ℎ1 (𝑥 2 , 𝛾)] Φ[ℎ (𝑥 , 𝛾)]{1 − Φ[ℎ (𝑥 , 𝛾)]} .   [1 + 𝛼 1 21  1 2 1 2  𝑒𝑥 𝑝(2𝑥 21𝜆)𝑥2 (𝜃𝛼1 + 𝛼2 )𝑥 0     21   [1 + 𝛼2 𝜎 2 𝑒𝑥 𝑝(2𝑥 21𝜆)] 3/2     1  (2.5.15) In the case where 𝜆 0 = 0 and hence 𝑟 is homoskedastic, the third elements of 𝑔2∗ (.) and 𝑔3∗ (.), which are the partial derivatives with respect to 𝜆 of 𝑓2 (.) and 𝑓3 (.) respectively go away. Moreover, the second element of 𝑔3∗ (.) in that case is just a linear function of the first element of 𝑔3∗ (.) and hence should be removed to prevent redundancy. Given these score functions and 𝛿ˆ obtained in Section 2.3, it is straightforward to form the moment functions in (2.4.1) and estimate (𝛼0 , 𝛽0 ) by minimizing (2.4.3). 2.5.1.2 Binary covariate with missing values We now consider the case where 𝑥 1 is binary. 
Equations (2.5.2) and (2.5.3) are replaced by 𝑥 1∗ = 𝑥 2 𝜃 0 + 𝑟, (2.5.16) 𝑟 |𝑥 2 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, 𝑒𝑥 𝑝(2𝑥21𝜆 0 )], (2.5.17) 𝑥 1 = 1[𝑥 1∗ > 0], (2.5.18) 45 where 𝑥 21 ⊂ 𝑥 2 . Just as in Section 2.5.1.1, 𝑥 21 typically includes all elements of 𝑥2 except the constant, so that we can get a standard probit with unit variance as a special case by setting 𝜆 0 = 0. Now, (2.5.16)-(2.5.18) imply that 𝑃(𝑥 1 = 1|𝑥2 ) = Φ[𝑒𝑥 𝑝(−𝑥 21𝜆 0 )𝑥 2 𝜃 0 ] ≡ Φ[ℎ2 (𝑥2 , 𝛽0 )], (2.5.19) where in the general notation of Section 2.2, 𝛽 = (𝜃, 𝜆). Using (2.5.1) and iterated expectations, E(𝑦|𝑥 2 ) = E[E(𝑦|𝑥 1 , 𝑥2 )|𝑥2 ] = E(𝑦|𝑥 1 = 1, 𝑥2 )𝑃(𝑥1 = 1|𝑥 2 ) + E(𝑦|𝑥1 = 0, 𝑥2 )𝑃(𝑥1 = 0|𝑥 2 ) = Φ(𝛼10 + 𝑥2 𝛼20 )Φ[𝑒𝑥 𝑝(−𝑥 21𝜆 0 )𝑥 2 𝜃 0 ] + Φ(𝑥 2 𝛼20 ){1 − Φ[𝑒𝑥 𝑝(−𝑥 21𝜆 0 )𝑥 2 𝜃 0 ]} ≡ ℎ3 (𝑥 2 , 𝛾0 ), (2.5.20) where in the general notation of Section 2.2, 𝛾 = (𝛼, 𝛽). Analogous to the previous section, we use quasi-MLE with a Bernoulli log likelihood for the model of interest (2.5.1) and the reduced form (2.5.20), and full MLE for the imputation model using (2.5.19). The objective functions are given by 𝑓1 (𝑦, 𝑥, 𝛼) = −𝑙𝑜𝑔{Φ(𝑥𝛼) 𝑦 [1 − Φ(𝑥𝛼)] (1−𝑦) } 𝑓2 (𝑥 1 , 𝑥2 , 𝛽) = −𝑙𝑜𝑔(Φ[ℎ2 (𝑥2 , 𝛽)] 𝑥1 {1 − Φ[ℎ2 (𝑥 2 , 𝛽)]} (1−𝑥1 ) ) 𝑓3 (𝑦, 𝑥2 , 𝛾) = −𝑙𝑜𝑔{ℎ3 (𝑥 2 , 𝛾) 𝑦 [1 − ℎ3 (𝑥 2 , 𝛾)] (1−𝑦) }. (2.5.21) This results in the following score functions. [𝑦 − Φ(𝑥𝛼)]𝜙(𝑥𝛼) 𝑔1∗ (𝑦, 𝑥, 𝛼) = 𝑥 0 (2.5.22) Φ(𝑥𝛼) [1 − Φ(𝑥𝛼)]   𝑒𝑥 𝑝(−𝑥 𝜆)𝑥 0  {𝑥1 − Φ[ℎ2 (𝑥 2 , 𝛽)]}𝜙[ℎ2 (𝑥 2 , 𝛽)] 21 2 𝑔2∗ (𝑥 1 , 𝑥2 , 𝛽) =   𝜙[ℎ2 (𝑥 2 , 𝛽)] (2.5.23)    ℎ (𝑥 , 𝛽)𝑥 0  Φ[ℎ2 (𝑥2 , 𝛽)]{1 − Φ[ℎ2 (𝑥2 , 𝛽)]}  2 2 21        𝜙(𝛼 1 + 𝑥 2 𝛼 2 )Φ[ℎ 2 (𝑥 2 , 𝛽)]      0 𝑥 𝜙(𝛼 + 𝑥 𝛼 )Φ[ℎ (𝑥 , 𝛽)] + 𝜙(𝑥 𝛼 ){1 − Φ[ℎ (𝑥 , 𝛽)]}  1 2 2 2 2 2 2 2 2 𝑔3∗ (𝑦, 𝑥2 , 𝛾) =  2  ℎ (𝑦, 𝑥2 , 𝛾),    𝑥 0 𝑒𝑥 𝑝(−𝑥 𝜆)𝜙[ℎ (𝑥 , 𝛽)] [Φ(𝛼 + 𝑥 𝛼 ) − Φ(𝑥 𝛼 )]  4  2 21 2 2 1 2 2 2 2      0 𝑥 21 ℎ2 (𝑥 2 , 𝛽)𝜙[ℎ2 (𝑥2 , 𝛽)] [Φ(𝑥 2 𝛼2 ) − Φ(𝛼1 + 𝑥 2 𝛼2 )]     (2.5.24) 46 𝑦 − ℎ3 (𝑥2 , 𝛾) where ℎ4 (𝑦, 𝑥2 , 𝛾) ≡ . ℎ3 (𝑥 2 , 𝛾) [1 − ℎ3 (𝑥 2 , 𝛾)] 2.5.1.3 Average partial effects In a probit, usually the average partial effects (APEs) are the quantities of interest rather than the coefficients themselves. It is important to note that the APEs of interest are still derived from the model of interest in (2.5.1), just as in the case where there is no missing data. The partial effect (PE) of the 𝑗 𝑡ℎ element of 𝑥, 𝑥 ( 𝑗) on E(𝑦|𝑥) is given by6 𝜕 E(𝑦|𝑥) 𝑃𝐸 𝑗 (𝑥) = = 𝛼 ( 𝑗)0 𝜙(𝑥𝛼0 ) = 𝛼 ( 𝑗)0 𝜙(𝛼10 𝑥 1 + 𝑥 2 𝛼20 ). (2.5.25) 𝜕𝑥 ( 𝑗) The average partial effect of 𝑥 ( 𝑗) , 𝐴𝑃𝐸 𝑗 , is the expected value of 𝑃𝐸 𝑗 (𝑥) with respect to 𝑥.   𝜕 E(𝑦|𝑥) 𝐴𝑃𝐸 𝑗 (𝑥) = E𝑥 = 𝛼 ( 𝑗)0 E[𝜙(𝑥𝛼0 )]. (2.5.26) 𝜕𝑥 ( 𝑗) In the absence of missing data, this can be consistently estimated using  𝑁 Õ  𝛼˜ ( 𝑗) 𝑁 −1 ˜ , 𝜙(𝑥𝑖 𝛼) (2.5.27) 𝑖=1 where 𝛼˜ is any consistent estimate of 𝛼0 . That is, one simply computes the partial effect for each unit in the sample and then averages over the entire sample. However, when we have missing data on 𝑥1 , this quantity is not estimable as we cannot calculate the partial effect for individuals with missing 𝑥1 . A quantity that is feasible to compute is the average of partial effects over the complete cases only. This is given by  𝑁  𝑐 Õ [ 𝑗 (𝑥) 𝐴𝑃𝐸 = 𝛼ˆ 𝑊 𝐽 ( 𝑗) 𝑁𝑐−1 𝑠𝑖 𝜙(𝑥𝑖 𝛼ˆ 𝑊 𝐽 ) , 𝑖=1 Í𝑁 where 𝑁𝑐 = 𝑠 𝑖=1 𝑖 is the number of complete cases in the sample. That is, we average the individual partial effects over the complete cases only. This estimator however, is not consistent for 𝑐 𝐴𝑃𝐸 𝑗 (𝑥) unless 𝑠 𝑥 . 
If 𝑠 depends on say 𝑥 2 , then 𝐴𝑃𝐸 |= [ 𝑗 (𝑥) will be inconsistent for 𝐴𝑃𝐸 𝑗 (𝑥). 6If 𝑥 ( 𝑗) is discrete, the derivative is replaced with a difference. 47 The current framework, however, makes it possible to recover 𝐴𝑃𝐸 𝑗 (𝑥) using IPW. E{[𝑠/𝑝(𝑧)]𝜙(𝑥𝛼)} = E{E([𝑠/𝑝(𝑧)]𝜙(𝑥𝛼)|𝑦, 𝑥, 𝑧)} = E{[E(𝑠|𝑦, 𝑥, 𝑧)/𝑝(𝑧)]𝜙(𝑥𝛼)} = E[𝜙(𝑥𝛼)], (2.5.28) where the last equality follows from Assumption 2.3.1. Therefore, a consistent estimator of 𝐴𝑃𝐸 𝑗 (𝑥) is 𝑁 [ 𝑗 (𝑥) = 𝛼ˆ 𝑊 𝐽 ( 𝑗) 𝑁 −1 Õ 𝑠𝑖 𝐴𝑃𝐸 𝜙(𝑥𝑖 𝛼ˆ 𝑊 𝐽 ). (2.5.29) 𝐺 (𝑧 𝑖 , ˆ 𝛿) 𝑖=1 2.5.2 Exponential models Next we consider exponential models for nonnegative responses 𝑦, including but not restricted to count variables. We focus on a continuous 𝑥 1 .7 The model of interest is characterized by the conditional mean E(𝑦|𝑥) = 𝑒𝑥 𝑝(𝛼10 𝑥 1 + 𝑥 2 𝛼20 ) ≡ 𝑒𝑥 𝑝(𝑥𝛼0 ), (2.5.30) where in the absence of missing data, 𝛼0 can be estimated using a Poisson quasi log likelihood. We consider the same linear imputation model as in Section 2.5.1.1. 𝑥1 = 𝑥 2 𝜃 0 + 𝑟, (2.5.31) 𝑟 |𝑥 2 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, 𝜎02 𝑒𝑥 𝑝(2𝑥 21𝜆 0 )]. (2.5.32) The reduced form conditional mean can be obtained using (2.5.30)-(2.5.32) and an iterated expec- tations argument. E(𝑦|𝑥 2 ) = E[𝑒𝑥 𝑝(𝛼10 𝑥 1 + 𝑥 2 𝛼20 )|𝑥2 ] = 𝑒𝑥 𝑝(𝑥2 𝛼20 ) E[𝑒𝑥 𝑝(𝛼10 𝑥 1 )|𝑥 2 ] = 𝑒𝑥 𝑝 [𝑥2 (𝜃 0 𝛼10 + 𝛼20 )] E[𝑒𝑥 𝑝(𝑟𝛼10 )|𝑥 2 ], (2.5.33) where the third equality follows from substituting for 𝑥 1 using (2.5.31). Moreover, (2.5.32) implies that 𝑒𝑥 𝑝(𝑟𝛼10 ) conditional on 𝑥 2 follows a lognormal distribution with E[𝑒𝑥 𝑝(𝑟𝛼10 )|𝑥2 ] = 𝑒𝑥 𝑝 [𝛼10 2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆 )/2]. (2.5.34) 0 21 0 7The discussion for a binary 𝑥 1 follows easily given the discussion in Section 2.5.1.2. 48 Plugging into (2.5.33), we get E(𝑦|𝑥 2 ) = 𝑒𝑥 𝑝 [𝑥2 (𝜃 0 𝛼10 + 𝛼20 ) + 𝛼10 2 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆 )/2]. (2.5.35) 0 21 0 Thus, we have 𝛽 = (𝜃, 𝜎 2 , 𝜆), 𝛾 = (𝜃𝛼1 +𝛼2 , 𝜎 2 , 𝜆), ℎ5 (𝑥2 , 𝛾) ≡ 𝑥 2 (𝜃𝛼1 +𝛼2 )+𝛼12 𝜎 2 𝑒𝑥 𝑝(2𝑥 21𝜆)/2 and the objective functions are given by 𝑓1 (𝑦, 𝑥, 𝛼) = 𝑒𝑥 𝑝(𝑥𝛼) − 𝑦𝑥𝛼 (𝑥 1 − 𝑥 2 𝜃) 2    1 𝑓2 (𝑥1 , 𝑥2 , 𝛽) = −𝑙𝑜𝑔 p 𝑒𝑥 𝑝 − 2𝜋𝜎 2 𝑒𝑥 𝑝(2𝑥21𝜆) 2𝜎 2 𝑒𝑥 𝑝(2𝑥 21𝜆) 𝑓3 (𝑦, 𝑥2 , 𝛾) = 𝑒𝑥 𝑝[ℎ5 (𝑥 2 , 𝛾)] − 𝑦[ℎ5 (𝑥 2 , 𝛾)]. (2.5.36) This results in the following score functions. 𝑔1∗ (𝑦, 𝑥, 𝛼) = 𝑥 0 [𝑦 − 𝑒𝑥 𝑝(𝑥𝛼)]   𝑥20 (𝑥 1 − 𝑥 2 𝜃)     2 𝜎 𝑒𝑥 𝑝(2𝑥 21𝜆)     2   ∗  (𝑥 1 − 𝑥 2 𝜃) 1  𝑔2 (𝑥 1 , 𝑥2 , 𝛽) =  −  𝑒𝑥 𝑝(2𝑥 21𝜆)𝜎 4 𝜎 2     (𝑥 1 − 𝑥 2 𝜃) 2     0  𝑥  21 𝜎 2 𝑒𝑥 𝑝(2𝑥 𝜆) − 1    21  𝑥 20       𝑔3∗ (𝑦, 𝑥2 , 𝛾) =  𝑒𝑥 𝑝(2𝑥 21𝜆)  {𝑦 − 𝑒𝑥 𝑝 [ℎ5 (𝑥 2 , 𝛾)]}.   (2.5.37)    𝑒𝑥 𝑝(2𝑥21𝜆)𝑥 21 0    Similar to Section 2.5.1.1, when 𝜆 0 = 0, the third element of 𝑔2∗ (.) and the second and third elements of 𝑔3∗ (.) become redundant. 2.6 Comparison with related estimators 2.6.1 Complete cases The most common practice when dealing with missing covariate values is to just use the complete cases for estimation; that is, use only the observations for which 𝑥 1 is observed. The inverse 49 probability weighted complete cases estimator has been discussed in detail by Wooldridge (2002). In this section, I show that the weighted joint GMM does no worse than the weighted complete cases estimator in terms of asymptotic variance, and can potentially provide strict efficiency gains. Definition 2.6.1.1. Call the estimator of 𝛼0 that minimizes (2.4.3), where 𝑔(.) contains only 𝑔1 (.) and 𝑊ˆ = 𝐼, the weighted complete cases estimator (or 𝛼ˆ 𝑊 𝑐𝑐 ). Define the upper-left 𝑃1 × 𝑃1 block of 𝐹0 as 𝐹110 ≡ E(𝑔 𝑔0 ) − E(𝑔 𝑑 0) [E(𝑑 𝑑 0)] −1 E(𝑑 𝑔0 ), (2.6.1) 1𝑖 1𝑖 1𝑖 𝑖 𝑖 𝑖 𝑖 1𝑖 where 𝑔𝑖 = [𝑔1𝑖 0 , 𝑔0 , 𝑔0 ] 0. 
Then the asymptotic variance of the weighted complete cases estimator 2𝑖 3𝑖 as derived in Wooldridge (2002) is given in the following lemma, where we have used the fact that 𝐷 011 is symmetric. Lemma 2.6.1.1 Under the assumptions of Theorems 4.1 and 4.2, √ 0 ) −1 𝐷 0 ] −1 . 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑊 𝑐𝑐 − 𝛼0 )] = [𝐷 011 (𝐹11 11 Then we know that 𝛼ˆ 𝑊 𝐽 is no less efficient than 𝛼ˆ 𝑊 𝑐𝑐 , since standard GMM theory dictates that a GMM estimator that uses more valid moment conditions is no less efficient. Proposition 2.6.1.1. Under the assumptions of Theorem 2.4.1 and 2.4.2, √ √ 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑊 𝑐𝑐 − 𝛼0 )] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑊 𝐽 − 𝛼0 )] is positive semidefinite. We can further disaggregate the efficiency gains by 𝛼10 and 𝛼20 . In linear models, the “plug-in" imputation estimators, as discussed in the next section, are generally equivalent to the complete cases estimators for 𝛼10 and may provide some efficiency gains for 𝛼20 .8 Abrevaya & Donald (2017) were the first to propose an estimator that provides potential gains for 𝛼10 as well in the linear case. I extend their result to the case discussed in Section 2.5.1.1 with the simplifying assumption that 𝜆 0 = 0, and show that efficiency gains are possible for both 𝛼10 and 𝛼20 . 8For instance, Abrevaya & Donald (2011) show that in the case where both the main model and the imputation model are linear, the plug-in estimator that estimates the main model using ordinary least squares (OLS) or feasible generalized least squares with missing values being replaced by predicted values using a first step OLS is numerically equivalent to the complete cases estimator for 𝛼10 . 50 Proposition 2.6.1.2. Consider the case in Section 2.5.1.1 with 𝜆 0 = 0. Under the assumptions of Theorems 2.4.1 and 2.4.2, √ √ 1. 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 1𝑊 𝑐𝑐 − 𝛼10 )] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 1𝑊 𝐽 − 𝛼10 )] = 𝐿 01 𝐾 𝐿 1 ≥ 0 √ √ 2. 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 2𝑊 𝑐𝑐 − 𝛼20 )] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 2𝑊 𝐽 − 𝛼20 )] = 𝐿 02 𝐾 𝐿 2 ≥ 0, where 𝐿 1 , 𝐿 2 and 𝐾 are matrices defined in the appendix. I show that 𝐾 is a positive definite matrix and neither 𝐿 1 nor 𝐿 2 are necessarily zero under the assumptions made so far, and hence it is possible to obtain strict efficiency gains for both 𝛼10 and 𝛼20 . 2.6.2 Sequential procedures Traditionally, imputation is done in two steps using a “plug-in" method (Dagenais, 1973). In the first step, the missing values of 𝑥 1 are replaced with predicted values from a regression of 𝑥 1 on 𝑥2 and in the second step, the main model is estimated using the observed values as well as the predicted values. Methods like mean imputation,9 where the missing values are replaced by the sample mean of 𝑥 1 , can be considered a special case of this method where the first step regression only includes the constant as a covariate. Definition 2.6.2.1: Call the estimator of 𝛼0 obtained using the following procedure the plug-in estimator (or 𝛼ˆ 𝑃 ). Step 1: Obtain 𝛽ˆ𝑊 𝑐𝑐 by minimizing (2.4.3) where 𝑔(.) contains only 𝑔2 (𝛽) and 𝑊ˆ = 𝐼. Step 2: Estimate 𝛼0 by minimizing (2.4.3) where 𝑔(.) contains only 𝑔1 ( 𝑥˜1 , 𝑥2 , 𝛼) and 𝑥˜1𝑖 = 𝑠𝑖 𝑥1𝑖 + (1 − 𝑠𝑖 )ℎ(𝑥2𝑖 , 𝛽ˆ𝑊 𝑐𝑐 ) and ℎ(.) is the function defining predicted values. In the first step, 𝛽0 is consistently estimated using only the complete cases and the missing values of 𝑥 1 are replaced with predicted values based on the imputation model. The function ℎ(.) depends on what the imputation model is. For instance, in the linear case, ℎ(𝑥 2𝑖 , 𝛽ˆ𝑊 𝑐𝑐 ) = 𝑥 2𝑖 𝛽ˆ𝑊 𝑐𝑐 . We denote this new variable by 𝑥˜1 . 
In the second step, 𝛼0 is estimated by solving the sample counterpart of (2.2.1) with 𝑥 1 being replaced by 𝑥˜1 . 9(Little & Rubin, 2002) 51 While this procedure can be consistent when the model of interest is linear, contrary to prior claims in the literature (DeCanio & Watkins, 1998), it is generally inconsistent when the model of interest is nonlinear in the parameters.10 This is because under the assumptions made so far, 𝛼0 is generally not a solution to min E[ 𝑓1 (𝑦, 𝑥1∗ , 𝑥2 , 𝛼)], (2.6.2) 𝛼∈A where 𝑥 1∗ = 𝑠𝑥 1 + (1 − 𝑠)ℎ(𝑥2 , 𝛽0 ). To see why this procedure is inconsistent, consider the model in Section 2.5.1.1. Suppose 𝑦 is binary, that is, 𝑦 = 𝑤 (and 𝑦 ∗ ≡ 𝑤 ∗ ). For simplicity, assume that 𝜆 0 = 0 and 𝑧 = 𝑥 2 , that is, the imputation error is homoskedastic and selection is independent of (𝑦, 𝑥1 ) conditional on 𝑥 2 . Since E(𝑥1 |𝑥 2 ) = 𝑥 2 𝜃 0 and 𝜃 0 is consistently estimated by Ordinary Least Squares (OLS) of 𝑥1 on 𝑥2 using the complete cases only (call this estimator 𝜃ˆ𝑐𝑐 ), it is tempting to replace the missing values of 𝑥 1 by 𝑥 2 𝜃ˆ𝑐𝑐 and estimate 𝛼0 from the probit of 𝑦 on 𝑥˜1 ≡ 𝑠𝑥1 + (1 − 𝑠)𝑥 2 𝜃ˆ𝑐𝑐 and 𝑥 2 . Standard two-step M-estimation theory11 states that for this procedure to be consistent, we require that 𝛼0 uniquely solves min − E{𝑦 𝑙𝑜𝑔 Φ(𝛼1 𝑥 1∗ + 𝑥 2 𝛼2 ) + (1 − 𝑦)𝑙𝑜𝑔[1 − Φ(𝛼1 𝑥1∗ + 𝑥 2 𝛼2 )]}, (2.6.3) 𝛼∈A where 𝑥 1∗ ≡ 𝑠𝑥1 + (1 − 𝑠)𝑥2 𝜃 0 . However, 𝛼0 does not minimize (2.6.3) in general since for that to be true, we would need 𝑃(𝑦 = 1|𝑠𝑥1 , 𝑥2 , 𝑠) = Φ(𝛼10 𝑥 1∗ + 𝑥 2 𝛼20 ). (2.6.4) However, (2.5.2) and (2.5.6) imply 𝑦 ∗ = 𝛼10 [𝑠𝑥1 + (1 − 𝑠)𝑥 2 𝜃 0 ] + 𝑥2 𝛼20 + 𝑢 + (1 − 𝑠)𝑟𝛼10 ≡ 𝛼10 𝑥 1∗ + 𝑥 2 𝛼20 + 𝑢 + (1 − 𝑠)𝑟𝛼10 , (2.6.5) and E{1[𝛼10 𝑥 1∗ + 𝑥 2 𝛼20 + 𝑢 + (1 − 𝑠)𝑟𝛼10 ]|𝑠𝑥 1 , 𝑥2 , 𝑠} ≠ Φ(𝛼10 𝑥 1∗ + 𝑥 2 𝛼20 ). (2.6.6) 10This procedure also requires extra caution when the model of interest is nonlinear in the variables, as discussed in Rai (2020). 11Wooldridge (2010) Section 17.4. 52 The core issue is that expectation does not pass through nonlinear operators, in this case the indicator function 1[.]. In fact, in this example, E(𝑦 = 1|𝑠𝑥1 , 𝑥2 , 𝑠) = 𝑃(𝑦 = 1|𝑠𝑥 1 , 𝑥2 , 𝑠) = 𝑃{[𝑢 + (1 − 𝑠)𝑟𝛼10 ] > −(𝛼10 𝑥1∗ + 𝑥 2 𝛼20 )|𝑠𝑥1 , 𝑥2 , 𝑠} 𝛼10 𝑥 1∗ + 𝑥 2 𝛼20   =Φ q , (2.6.7) 1 + (1 − 𝑠)𝛼10 2 𝜎2 0 since 𝑢 + (1 − 𝑠)𝑟𝛼10 |𝑠𝑥 1 , 𝑥2 , 𝑠 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, 1 + (1 − 𝑠)𝛼10 2 𝜎 2 ] under Assumption 2.3.1, which 0 makes the main estimation problem a heteroskedastic probit. The correct log likelihood function is therefore based on (2.6.7), and 𝛼0 is not a solution to (2.6.3). Proposition 2.6.2.1: Consider the case in Section 2.5.1.1. Let Assumptions 2.2.1-2.2.3, 2.3.1, 2.3.2 and the assumptions in Theorems 2.4.1 and 2.4.2 hold. Additionally assume that 𝑧 = 𝑥 2 and 𝜆0 = 0. Then 𝛼ˆ 𝑃 is inconsistent for 𝛼10 unless 𝛼10 = 0. However, 𝛼10 = 0 implies that 𝑥 1 is irrelevant in the model of interest, in which case the best solution is to just drop it from the model. As a second example, consider the exponential model from Section 2.5.2 and again for simplicity, assume that 𝜆 0 = 0 and 𝑧 = 𝑥2 . The plug-in method would entail estimating 𝛼0 using Poisson quasi- MLE with the conditional mean function 𝑒𝑥 𝑝(𝛼1 𝑥˜1 + 𝑥2 𝛼2 ). For this estimator to be consistent, we would require that 𝛼0 uniquely solves min − E[𝑦(𝛼1 𝑥 1∗ + 𝑥 2 𝛼2 ) − 𝑒𝑥 𝑝(𝛼1 𝑥1∗ + 𝑥2 𝛼2 )], (2.6.8) 𝛼∈A which would be true if E(𝑦|𝑠𝑥1 , 𝑥2 , 𝑠) = 𝑒𝑥 𝑝(𝛼10 𝑥1∗ + 𝑥 2 𝛼20 ). 
(2.6.9) However, under Assumption 2.3.1, equations (2.5.30) and (2.5.35) imply that E(𝑦|𝑠𝑥1 , 𝑥2 , 𝑠) = 𝑒𝑥 𝑝{𝛼10 [𝑠𝑥1 + (1 − 𝑠)𝑥 2 𝜃 0 ] + 𝑥 2 𝛼20 + (1 − 𝑠)𝛼102 𝜎 2 /2} 0 ≡ 𝑒𝑥 𝑝 [𝛼10 𝑥 1∗ + 𝑥 2 𝛼20 + (1 − 𝑠)𝛼10 2 𝜎 2 /2]. 0 (2.6.10) 53 Since the log likelihood in (2.6.8) is based on an incorrect specification of the conditional mean of 𝑦, 𝛼0 will generally not solve (2.6.8). Proposition 2.6.2.2: Consider the case in Section 2.5.2. Let Assumptions 2.2.1-2.2.3, 2.3.1, 2.3.2 and the assumptions in Theorems 2.4.1 and 2.4.2 hold. Additionally assume that 𝑧 = 𝑥 2 and 𝜆0 = 0. Then 𝛼ˆ 𝑃 is inconsistent unless 𝛼10 = 0. A sequential procedure that would be consistent is plugging 𝛽ˆ𝑊 𝑐𝑐 in 𝑔3 (.), and estimating 𝛼0 using 𝑔1 (𝛼) and 𝑔3 (𝛼, 𝛽ˆ𝑊 𝑐𝑐 ) in a joint GMM procedure. Definition 2.6.2.2: Call the estimator of 𝛼0 obtained using the following procedure the sequen- tial estimator (or 𝛼ˆ 𝑆𝑒𝑞 ). Step 1: Obtain 𝛽ˆ𝑊 𝑐𝑐 by minimizing (2.4.3) where 𝑔(.) contains only 𝑔2 (𝛽) and 𝑊ˆ = 𝐼. Step 2: Estimate 𝛼0 by minimizing (2.4.3) where 𝑔(.) contains only 𝑔1 (𝛼) and 𝑔3 (𝛼, 𝛽ˆ𝑊 𝑐𝑐 ), and 𝑊ˆ = 𝐹ˆ −1 , where 𝐹ˆ −1 can be obtained using equation (2.4.7) and imposing 𝑔ˆ𝑖 = [𝑔1𝑖 ( 𝛼) ˜ 0 𝑔2𝑖 ( 𝛼, ˜ 𝛽ˆ𝑊 𝑐𝑐 ) 0] 0, 𝛼˜ being a first step consistent estimate of 𝛼0 . Even though 𝛼ˆ 𝑆𝑒𝑞 is consistent, it is going to be less efficient than 𝛼ˆ 𝑊 𝐽 because the former does not utilize the correlation between the moment functions 𝑔1 (.) and 𝑔2 (.). From a GMM perspective, it is well known that a sequential procedure using the same moment conditions is no more efficient than its joint counterpart. Proposition 2.6.2.3. Under Assumptions 2.2.1-2.2.3, 2.3.1, 2.3.2, and the assumptions made in Theorems 2.4.1 and 2.4.2, √ √ 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑆𝑒𝑞 − 𝛼0 )] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑊 𝐽 − 𝛼0 )] is positive semi-definite. Thus, there is no reason to prefer 𝛼ˆ 𝑆𝑒𝑞 over 𝛼ˆ 𝑊 𝐽 other than computational convenience. 2.6.3 Dummy variable method The dummy variable estimator (𝛼ˆ 𝐷 ) replaces the missing values of 𝑥 1 with zeros and uses an indicator for missingness as an additional covariate. Jones (1996) and Rai (2020) show that the 54 resulting estimator is generally inconsistent for 𝛼0 in linear models with exogenous and endogenous 𝑥 1 respectively. This inconsistency continues to hold in nonlinear models. Consider again the example in Section 2.5.1.1 with 𝜆 0 = 0 and 𝑧 = 𝑥 2 . The DVM would entail doing a probit of 𝑦 on (𝑠𝑥1 , 1 − 𝑠, 𝑥2 ). Analogous to the discussion in Section 2.6.2, this estimator would be consistent if 𝑃(𝑦 = 1|𝑠𝑥1 , 𝑥2 , 𝑠) = Φ[𝛼10 𝑠𝑥1 + (1 − 𝑠)𝜃 10 𝛼10 + 𝑥2 𝛼20 ], (2.6.11) which is not true in general. Too see this, let 𝑥 2 = (1, 𝑥22 ) and 𝜃 0 = (𝜃 10 , 𝜃 020 ) 0 and note that we can rewrite equation (2.6.7) as   𝛼10 𝑠𝑥1 + (1 − 𝑠)𝜃 10 𝛼10 + (1 − 𝑠)𝑥 22 𝜃 20 𝛼10 + 𝑥 2 𝛼20 𝑃(𝑦 = 1|𝑠𝑥1 , 𝑥2 , 𝑠) = Φ q . (2.6.12) 2 1 + (1 − 𝑠)𝛼10 𝜎0 2 As can be seen from this equation, 𝛼ˆ 𝐷 is inconsistent for two reasons. The first issue, which is unique to this method, is that it omits the covariates (1 − 𝑠)𝑥22 , leading to endogeneity unless 𝛼10 = 0 and/or 𝜃 20 = 0. The second issue, which is common with the plug-in method, is that it ignores the scale factor in the denominator which remains unless 𝛼10 = 0. Proposition 2.6.3.1: Consider the case in Section 2.5.1.1. Let Assumptions 2.2.1-2.2.3, 2.3.1, 2.3.2 and the assumptions in Theorems 2.4.1 and 2.4.2 hold. Additionally assume that 𝑧 = 𝑥 2 and 𝜆0 = 0. Then 𝛼ˆ 𝐷 is inconsistent unless (i) 𝛼10 = 0 or (ii) 𝜃 20 = 𝜎02 = 0. Similar to Section 2.6.2, if 𝛼10 = 0, the best solution is to drop 𝑥1 . 
The second condition requires that both the imputation coefficients and the imputation error variance are zero at the same time, which is not possible. A second example is the exponential model discussed in Section 2.5.2. Consider again the case where 𝑧 = 𝑥 2 and 𝜆 0 = 0. The DVM would entail using (𝑠𝑥1 , 1 − 𝑠, 𝑥2 ) as covariates for a Poisson quasi-MLE, which would be consistent if 2 𝜎 2 /2) + 𝑥 𝛼 ]. E(𝑦|𝑠𝑥1 , 𝑥2 , 𝑠) = 𝑒𝑥 𝑝 [𝛼10 𝑠𝑥 1 + (1 − 𝑠)(𝜃 10 𝛼10 + 𝛼10 (2.6.13) 0 2 20 However, we can re-write (2.6.10) as 2 𝜎 2 /2) + (1 − 𝑠)𝑥 𝜃 𝛼 + 𝑥 𝛼 ]. (2.6.14) E(𝑦|𝑠𝑥1 , 𝑥2 , 𝑠) = 𝑒𝑥 𝑝[𝛼10 𝑠𝑥1 + (1 − 𝑠)(𝜃 10 𝛼10 + 𝛼10 0 22 20 10 2 20 55 Similar to the probit case, the DVM omits the covariates (1 − 𝑠)𝑥 22 from the above conditional mean function. Proposition 2.6.3.2: Consider the case in Section 2.5.2. Let Assumptions 2.2.1-2.2.3, 2.3.1, 2.3.2 and the assumptions in Theorems 2.4.1 and 2.4.2 hold. Additionally assume that 𝑧 = 𝑥 2 and 𝜆0 = 0. Then 𝛼ˆ 𝐷 is inconsistent unless (i) 𝛼10 = 0 or (ii) 𝜃 20 = 0. That is, 𝛼ˆ 𝐷 is inconsistent unless 𝑥 1 is irrelevant in the model of interest or 𝑥 22 does not help in predicting 𝑥 1 . 2.6.4 Unweighted estimators The key to efficiency gains of 𝛼ˆ 𝑊 𝐽 over 𝛼ˆ 𝑊 𝑐𝑐 is that the former uses the information in the incomplete cases. Weighting the moment functions in (2.4.1) allows for more flexibility in terms of what variables selection can depend on and estimation of interesting parameters in the presence of misspecification, but that core reason for efficiency gains is independent of weighting. In other words, the joint GMM based on the unweighted version of the moment functions in (2.4.1) will still be more efficient than the unweighted complete cases estimator. These two unweighted estimators are defined below. Definition 6.4.1: Call the estimator of 𝛼0 that minimizes (2.4.3) where 𝑔(.) = 𝑠 · 𝑔1∗ (𝑦, 𝑥, 𝛼) and 𝑊ˆ = 𝐼, the unweighted complete cases estimator, or 𝛼ˆ 𝑈𝑐𝑐 . The unweighted joint estimator is based on the following vector of moment conditions. 𝑔1𝑖 (𝛼, 𝛽)   𝑠𝑖 𝑔 ∗ (𝑦𝑖 , 𝑥𝑖 , 𝛼)         1𝑖      𝑔𝑖 (𝛼, 𝛽) = 𝑔2𝑖 (𝛼, 𝛽)  ≡ 𝑠𝑖 𝑔 ∗ (𝑥 1𝑖 , 𝑥2𝑖 , 𝛽)  . (2.6.15)    2𝑖    ∗ 𝑔3𝑖 (𝛼, 𝛽)  𝑔3𝑖 (𝑦𝑖 , 𝑥2𝑖 , 𝛼, 𝛽)        For a generic element from the population (𝑦, 𝑥1 , 𝑥2 , 𝑠), denote this vector of moment functions by 𝑔(𝛼, 𝛽). Then the variance-covariance matrix of 𝑔(𝛼, 𝛽) evaluated at the true parameter values is given by 𝐶0 = E[𝑔(𝛼0 , 𝛽0 ) 𝑔(𝛼0 , 𝛽0 ) 0], (2.6.16) and the optimal GMM estimator based on (2.6.15) is defined as follows. 56 Definition 2.6.4.2. Call the estimator of (𝛼0 , 𝛽0 ) that solves min ¯ 𝑔(𝛼, 𝛽) 0 𝐶ˆ −1 𝑔(𝛼, ¯ 𝛽), (𝛼,𝛽)∈A×B 𝑝 the unweighted joint estimator, or ( 𝛼ˆ 𝑈𝐽 , 𝛽ˆ𝑈𝐽 ), where 𝑔(𝛼, 𝛽) = 𝑁 −1 𝑖=1 𝑔𝑖 (𝛼, 𝛽) and 𝐶ˆ − Í𝑁 ¯ → 𝐶0 . I provide the asymptotic distribution of this estimator in Appendix E. The key point to note is that just like 𝛼ˆ 𝑊 𝐽 is no less efficient than 𝛼ˆ 𝑊 𝑐𝑐 , 𝛼ˆ 𝑈𝐽 is no less efficient than 𝛼ˆ 𝑈𝑐𝑐 . Proposition 2.6.4.1. Under the assumptions of Theorems E.1 and E.2, √ √ 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑈𝑐𝑐 − 𝛼0 )] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛼ˆ 𝑈𝐽 − 𝛼0 )] is positive semidefinite. The proof of this proposition is very similar to that of Proposition 2.6.1.1, and hence is omitted. 
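To make the contrast between the two joint estimators concrete, here is a sketch of how the stacked moment functions are built: the weighted version in (2.4.1) divides the complete-case scores by $G(z_i, \hat{\delta})$, while the unweighted version in (2.6.15) multiplies them by $s_i$ only. Here `g1`, `g2`, `g3` stand for the score functions of $f_1$, $f_2$, $f_3$ (for instance, those in (2.5.15) for the probit example); the code is a sketch and the function names are mine.

```python
import numpy as np

def stack_moments(alpha, beta, y, x1, x2, s, g_hat, g1, g2, g3, weighted=True):
    """Stack the moment functions of (2.4.1) (weighted=True) or (2.6.15) (weighted=False).

    g1, g2, g3 are assumed to return N x L1, N x L2 and N x L3 arrays of scores.
    """
    x1_filled = np.where(s == 1, x1, 0.0)            # placeholder where x1 is missing
    w = s / g_hat if weighted else s.astype(float)   # IPW weights or plain indicator
    m1 = w[:, None] * g1(y, x1_filled, x2, alpha)    # model of interest: complete cases only
    m2 = w[:, None] * g2(x1_filled, x2, beta)        # imputation model: complete cases only
    m3 = g3(y, x2, alpha, beta)                      # reduced form: all observations
    return np.column_stack([m1, m2, m3])             # N x (L1 + L2 + L3)

# The GMM objective (2.4.3) is then g_bar' W_hat g_bar with
# g_bar = stack_moments(...).mean(axis=0).
```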
The natural question that arises then is whether one should weight when using the joint estimator, and whether 𝛼ˆ 𝑊 𝐽 is preferred over 𝛼ˆ 𝑈𝑐𝑐 , which is the most commonly used estimator out of all four.12 The issue of whether to weight has previously been considered in Wooldridge (2002), but the use of an imputation model here brings in some new issues. In looking at these two alternatives to 𝛼ˆ 𝑊 𝐽 , there are two issues to address: consistency and asymptotic efficiency. Start with 𝛼ˆ 𝑈𝐽 . From the point of view of consistency, 𝛼ˆ 𝑊 𝐽 is always preferred over 𝛼ˆ 𝑈𝐽 as the former is always consistent when the latter is, but the converse is not true. This is because while both estimators rule out 𝑧 containing 𝑥1 to be consistent,13 𝛼ˆ 𝑊 𝐽 allows 𝑧 to contain 𝑦 as well as some outside predictors of selection, while 𝛼ˆ 𝑈𝐽 does not. A related issue is that of correct specification of the models underlying 𝑓1 (𝑦, 𝑥, 𝛼), 𝑓2 (𝑥 1 , 𝑥2 , 𝛽), and 𝑓3 (𝑦, 𝑥2 , 𝛾) in (2.2.1)-(2.2.3), by which I mean that (𝛼0 , 𝛽0 , 𝛾0 ) characterize a correctly specified feature of 𝐷 (𝑦|𝑥), 𝐷 (𝑥 1 |𝑥 2 ) and 𝐷 (𝑦|𝑥2 ) respectively.14 For instance, this can be a model of a conditional mean, conditional median, conditional distribution, and so on. When 𝑧 = 𝑥 2 , 𝛼ˆ 𝑊 𝐽 is always consistent for 𝛼0 and 𝛽0 12That is, out of 𝛼ˆ 𝑈𝑐𝑐 , 𝛼ˆ 𝑊 𝑐𝑐 , 𝛼ˆ 𝑈𝐽 and 𝛼ˆ 𝑊 𝐽 . 13 𝛼ˆ 𝑈𝐽 rules out 𝑧 containing 𝑥 1 because it uses the imputation equation in estimation in addition to the main equation. Since unweighted estimators can only allow selection to depend on covariates in order to maintain consistency, 𝑥 1 being the outcome variable in the imputation model means that we cannot allow 𝑠 to depend on 𝑥 1 , conditional on 𝑥2 . This is the cost of getting more efficiency using the imputation model. 𝛼ˆ 𝑊 𝐽 rules out this dependence because the weights cannot be estimated using a variable that contains missing values. Therefore, irrespective of whether one uses the imputation model, weighted estimation cannot allow 𝑧 to contain 𝑥1 . 14I make this notion precise in Assumption B.1. 57 that solve (2.2.1) and (2.2.2) irrespective of whether the underlying models are correctly specified, but 𝛼ˆ 𝑈𝐽 is consistent for 𝛼0 and 𝛽0 only if they characterize some correctly specified feature of the respective distributions. For instance, consider the linear case discussed in Abrevaya & Donald (2017) where the 3 M-estimation problems are given by min E[𝑠 · (𝑦 − 𝛼1 𝑥 1 − 𝑥 2 𝛼2 ) 2 ] (2.6.17) 𝛼∈A min E[𝑠 · (𝑥 1 − 𝑥 2 𝛽) 2 ] (2.6.18) 𝛽∈B min E[(𝑦 − 𝑥 2 𝛾) 2 ] (2.6.19) 𝛾∈Γ where 𝛾 ≡ 𝛼1 𝛽+𝛼2 . Consider first the problem in (2.6.17). Suppose that 𝑦 is binary with a nonlinear conditional mean E(𝑦|𝑥) = Φ(𝑥𝜅 0 ), and the linear projection of 𝑦 on 𝑥 is 𝑥𝛼0 . When 𝑥 1 is always observed, the usual motivation for using a linear model here is that it gives consistent estimates of the linear projection parameters 𝛼0 , and linear projection is the best linear approximation to the true conditional mean Φ(𝑥𝜅 0 ). That is, the solution to min E[(𝑦 − 𝛼1 𝑥 1 − 𝑥 2 𝛼2 ) 2 ] (2.6.20) 𝛼∈A is 𝛼0 . However, this result does not always carry over to the case with missing data. Suppose 𝑠 depends on 𝑥2 . Then the solution to (2.6.17) will generally neither be 𝜅0 and more importantly nor be 𝛼0 (Wooldridge, 2002). 
So by estimating a linear model using only the complete cases, we are not getting consistent estimates of anything interesting in the population. [Footnote 15: An exception is the case where s is independent of both y and x, also known as "missing completely at random"; in this case the solution to (2.6.17) is still α_0. However, this case rarely holds in practice.] In general, if we want the solution to

\min_{\alpha\in\mathcal{A}} E[s \cdot f_1(y, x_1, x_2, \alpha)]    (2.6.21)

to be the conditional mean parameters, we want to make sure that we have correctly specified the conditional mean. In the above example, one way to do that is to use a better model of E(y|x), that is, a probit instead of a linear probability model. This highlights the importance of nonlinear models with missing data, even if one is generally satisfied with using a linear approximation when x_1 is always observed. The weighted estimator, on the other hand, recovers the linear projection parameters even when using only the complete cases. In other words, the solution to

\min_{\alpha\in\mathcal{A}} E\{[s/p(z)](y - \alpha_1 x_1 - x_2\alpha_2)^2\}    (2.6.22)

is α_0. A similar discussion holds for the imputation problem in (2.6.18). If x_1 is binary, then we should either weight the imputation model in order to consistently estimate the linear projection parameters or, if not using weights, impute using a probit.

The second consideration is that of asymptotic efficiency. When z = x_2 and the models underlying f_1(y, x, α), f_2(x_1, x_2, β), and f_3(y, x_2, γ) are correctly specified, both estimators are consistent. A theoretical comparison of the asymptotic variances of the two estimators in this case will likely depend on whether a generalized conditional information matrix equality (GCIME), discussed in Wooldridge (2002), holds for each of the three models underlying (2.2.1)-(2.2.3). For instance, the GCIME always holds for conditional MLE under correct specification of the conditional density and for quasi-MLE in the linear exponential family under the so-called generalized linear models assumption. Wooldridge (2002) shows that in this case, α̂_Ucc is more efficient than α̂_Wcc when the GCIME holds. So it is reasonable to expect that α̂_UJ will be more efficient than α̂_WJ as well. I do not undertake a theoretical comparison here but provide some simulation evidence in the next section in support of this speculated efficiency ranking.

The other unweighted alternative to α̂_WJ is α̂_Ucc, and it is not clear from the perspective of consistency whether it is preferred to α̂_WJ (or the weighted complete cases estimator α̂_Wcc). Suppose that α_0 characterizes a correctly specified feature of D(y|x) in (2.2.1). Then if selection is exogenous and depends on x_1 after conditioning on x_2, that is, z = (x_1, x_2), then α̂_Ucc is consistent for α_0. However, both the weighted estimators α̂_WJ and α̂_Wcc are inconsistent. This is because the estimation of the weights cannot depend on x_1, which is missing for some observations; therefore, the weights will generally not be consistently estimated. However, if selection depends on y after conditioning on x, that is, z contains y, then α̂_Ucc is inconsistent while both α̂_WJ and α̂_Wcc are consistent.

The other consideration is that of correct specification of the model underlying f_1(y, x, α) in (2.2.1). Suppose that z = x_2. Then α̂_WJ will be consistent for α_0, the solution to (2.2.1), whether or not there is any model misspecification. But under misspecification, α̂_Ucc will generally not be consistent for α_0. For instance, let us go back to the linear model given by (2.6.17)-(2.6.19). Suppose E(u|x_1, x_2) = 0; that is, α_0 in (2.6.17) are actually the coefficients in the conditional mean of y given (x_1, x_2). If z = (x_1, x_2), then α̂_Ucc is consistent for α_0, but α̂_WJ and α̂_Wcc are inconsistent since the weights can only be based on x_2. If z = (y, x_2), then α̂_WJ and α̂_Wcc with weights based on (y, x_2) are consistent but α̂_Ucc is inconsistent. On the other hand, suppose z = x_2. Then α̂_WJ will be consistent for α_0, the linear projection parameters, whether or not they are the conditional mean parameters as well. However, α̂_Ucc will be inconsistent for α_0 if they are only the linear projection parameters, and not the conditional mean parameters.

As far as asymptotic efficiency goes, when z = x_2 and the model underlying f_1(y, x, α) is correctly specified, both α̂_Ucc and α̂_WJ are consistent, and we can again expect the efficiency comparison to depend on the GCIME. Again, I do not provide a theoretical comparison, but the next section gives some simulation evidence that when the GCIME holds, α̂_WJ is still more efficient than α̂_Ucc despite the former being a weighted estimator.

In conclusion, one can choose whether or not to weight when using the joint GMM based on the nature of selection and model specification, but in either case, the joint estimator is no less (and generally more) efficient than its complete cases counterpart.
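To make the role of the weights concrete before turning to the application, here is a minimal sketch of the weighted complete-cases regression in (2.6.22) with z = x_2: the selection probability is estimated by a logit of s on x_2, and the complete cases are reweighted by s/p̂(x_2). The data-generating process, variable names, and use of statsmodels are assumptions of the sketch, not part of the text.

# Sketch of the inverse probability weighted complete-cases regression in
# (2.6.22): estimate p(x2) = P(s = 1 | x2) by logit, then run WLS on the
# complete cases with weights 1 / p_hat(x2).  Selection depends only on the
# always-observed covariate, so the weights are estimable ("missing at random").
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000
x2 = np.column_stack([np.ones(n), rng.normal(size=n)])
x1 = x2 @ np.array([0.0, 1.0]) + rng.normal(size=n)
p_y = 1 / (1 + np.exp(-(x1 - 0.5 * x2[:, 1])))           # nonlinear E(y|x)
y = (rng.uniform(size=n) < p_y).astype(float)

p_true = 1 / (1 + np.exp(-(0.5 + 0.8 * x2[:, 1])))        # selection depends on x2 only
s = (rng.uniform(size=n) < p_true).astype(int)

p_hat = sm.Logit(s, x2).fit(disp=0).predict(x2)           # estimated selection probability

X = np.column_stack([x1, x2])
full_lp = sm.OLS(y, X).fit()                              # linear projection (infeasible benchmark)
unweighted_cc = sm.OLS(y[s == 1], X[s == 1]).fit()
weighted_cc = sm.WLS(y[s == 1], X[s == 1], weights=1 / p_hat[s == 1]).fit()

print("full-sample linear projection:", full_lp.params)
print("unweighted complete cases    :", unweighted_cc.params)
print("weighted complete cases      :", weighted_cc.params)

In large samples the weighted complete-cases estimates should track the full-sample linear projection, while the unweighted complete-cases estimates may not when the conditional mean is nonlinear and selection depends on x_2, which is the point of the discussion above.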
2.7 Empirical application

I apply the proposed estimation method to the setting of Sandsor (2020), who studies the association between individuals' grade variance and educational attainment. One measure of individuals' cognitive skills is their grades received in school, which are generally summarized using the grade point average (GPA), the mean of the grades. The author looks at the importance of grade variance for educational attainment at a given level of GPA. That is, is it better to specialize in some subjects or to be a "jack-of-all-subjects"? She finds that grade variance is negatively associated with educational attainment; that is, students who are jacks-of-all-subjects have higher educational attainment.

The data come from the National Longitudinal Survey of Youth, 1979 (NLSY79). The NLSY79 is a nationally representative sample of 12,686 young men and women between the ages of 14 and 22. Following the author, I only use the sub-sample of 6111 respondents representing the non-institutionalized civilian segment of the population. The data include high school transcripts, educational attainment, socio-economic characteristics, and other measures of cognitive and non-cognitive skills. GPA is measured as the mean of all grades received in upper secondary education (grades 9 to 12). The measure of grade variance is the standard deviation of an individual's grades (GSD). The outcome of interest I consider is whether the individual has a four-year college degree at age 30. Again following the author, I restrict the sample to individuals with at least 10 valid grades and with non-missing data on all variables other than family income in 1979, which is the covariate with missing values I focus on. This leaves me with a sample of 3942 individuals, out of which family income is missing for 723 (about 18%) individuals.

I model the relationship between GSD and attainment of a four-year college degree as a probit. Since family income is a continuous variable, we are in the general framework of Section 2.5.1.1.
The model of interest is given by: 𝑦𝑖 = 1[𝛼10 𝑙𝑖𝑛𝑐𝑖 + 𝛼210 𝐺𝑆𝐷 𝑖 + 𝑥22𝑖 𝛼220 + 𝑢𝑖 > 0], (2.7.1) 𝑢𝑖 |𝑥𝑖 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 1), (2.7.2) 61 where 𝑦𝑖 is a binary variable equal to 1 if individual 𝑖 has a college degree by the age of 30 and 0 otherwise. 𝑙𝑖𝑛𝑐𝑖 is the log of family income of individual 𝑖 in 1979, and 𝐺𝑆𝐷 𝑖 is the grade standard deviation of individual 𝑖, the covariate of interest. 𝑥22𝑖 is the vector of other covariates which includes individual’s GPA, gender, race, ethnicity, area of residence, and parental education. It also includes measures of cognitive and noncognitive abilities which are based on the Armed Services Vocational Aptitude Battery (ASVAB) test and a combination of Rotter Locus of Control Scale and Rosenberg Self-Esteem Scale respectively. In our general notation from Section 2.5.1.1, 𝑥 1𝑖 = 𝑙𝑖𝑛𝑐𝑖 , 𝑥 2𝑖 = (𝐺𝑆𝐷 𝑖 , 𝑥22𝑖 ), and 𝑥𝑖 = (𝑥 1𝑖 , 𝑥2𝑖 ). Note that Assumption 2.3.1 in this context states that conditional on 𝑥 2𝑖 , the missingness of 𝑙𝑖𝑛𝑐𝑖 is independent of 𝑙𝑖𝑛𝑐𝑖 itself. This assumption is the basis of many standard procedures used to impute income. For instance, the method of hot decking used by the Current Population Survey is based on this assumption, so is multiple imputation used by the National Health Interview Survey. The standard two-step regression imputation is also based on this assumption. The imputation model is given by 𝑙𝑖𝑛𝑐𝑖 = 𝑥2𝑖 𝜃 0 + 𝑟𝑖 , 𝑟𝑖 |𝑥 2𝑖 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 𝜎 2 ). (2.7.3) Table D2 presents the results. Columns 1 and 2 give the coefficient estimates and standard errors from the complete cases probit and the joint GMM respectively. Columns 3 gives the percentage reduction in standard errors of the joint GMM. The standard errors fall for all coefficients, and quite substantially so for many coefficients. While there is not much gain for the coefficient on log of family income, there is about a 10% reduction in the standard error for 𝐺𝑆𝐷 𝑖 , the variable of interest. The reduction for coefficients on other variables range from about 7% − 12%. The last row of the table gives the Hansen’s J-statistic discussed in Proposition 2.4.1. The null hypothesis of correct specification is not rejected at any reasonable significance level, giving us some confidence in the assumptions underlying the joint GMM. Columns 4 and 5 give the estimates and standard errors for the plug-in method and DVM respectively. In this particular case, both estimators give quite similar results as the joint GMM estimator, which is not surprising given that the coefficient of log(income) is fairly small in 62 magnitude. As the simulations suggest, the plug-in estimator performs similarly to CC and the joint GMM in terms of both bias and efficiency when 𝛼10 is small in magnitude. The DVM also has small biases and a smaller standard deviation than the joint GMM for such values, although that efficiency gain does not seem to be present in this application. Moreover, the joint GMM here still has the additional advantage of providing an overidentification test for the assumptions underlying the imputation procedure. 2.8 Conclusion I have provided a new method of consistently imputing missing covariate values in nonlinear models. The estimator uses the standard assumptions used in the imputation literature, but unlike other imputation estimators based on classical principles, it is consistent in nonlinear models for both the structural parameters and other quantities of interest like average partial effects. 
I have provided two practically important examples: fractional and nonnegative responses with binary or continuous CMV. The proposed estimator provides substantial efficiency gains over the complete cases estimator, and as a byproduct of using GMM, the overidentification test provides a way to test the extra restrictions imposed by the imputation estimator compared to the complete cases estimator. I have also provided a comprehensive framework for imputing using a variety of nonlinear models for cases where a linear model might be unrealistic. I have provided the weighted and unweighted versions of the estimator, both of which provide efficiency gains over their complete cases counterparts. This allows the empirical researcher to choose the version best suited for their particular model and the nature of missingness in their specific data. 63 CHAPTER 3 EFFICIENT ESTIMATION OF LINEAR PANEL DATA MODELS WITH MISSING COVARIATES* 3.1 Introduction The problem of missingness is ubiquitous in empirical research. In this paper, we provide some methods to deal with missing covariate values in linear panel data models with unobserved heterogeneity. Economists use a variety of methods to deal with missing covariate values in panel data. One common method is to just use the “complete cases" - the observations for which all covariates are observed [for instance Cabral et al. (2018), David & Venkateswaran (2019)]. While easy to use, methods based only on complete cases can lead to substantial loss of efficiency when missingness is large because of discarding the potentially useful information in the incomplete cases. This has inspired methods that make use of these incomplete cases. One method used in this regard is the “last observation carried forward" (LOCF), which replaces the missing observations in a given time period with observations from the previous time period [for instance, Doraszelski et al. (2018), Giroud & Rauh (2019)].1 Another method is the dummy variable method (DVM), which replaces the missing values with zeros and includes an indicator for missingness as an additional covariate in the model [for instance, Antecol et al. (2018)]. A third method we consider is regression imputation. This is a two-step method which in the first step, regresses the covariate with missing values (CMV) on the always-observed covariates using complete cases and uses the estimated coefficients to predict missing values of the CMV. In the second step, it estimates the model of interest using all observations with this “composite" CMV, which consists of both observed and predicted values.2 *This chapter is co-authored with Professor Jeffrey Wooldridge. 1Sometimes the missing observations are also replaced with observations from the following time period. 2Moffitt et al. (2020) use this method for imputing a variable which is used to define a covariate in the model of interest. 64 In this paper, we consider the issue of proper imputation specifically when using the fixed effects estimator, which is perhaps the most frequently used method to estimate linear panel data models with unobserved heterogeneity. We propose a new method of imputing when using fixed effects that improves upon the performance of the estimators mentioned above. The choice of method comes down to consistency and relative efficiency. The complete cases fixed effects estimator [as described in Wooldridge (2019)] generally requires the least number of assumptions to be consistent. 
However, as mentioned above, it can be inefficient relative to the estimators that make use of the incomplete cases. LOCF has been shown to be generally biased and inconsistent even under the strongest assumptions on missingness (Lane, 2008). We show that DVM is also generally inconsistent unless some very strong zero restrictions are imposed in the model, including the assumption that the CMV does not contain individual specific unobserved heterogeneity - an assumption generally unlikely to hold in practice. Regression imputation is consistent under less restrictive assumptions than the DVM, but still requires that the CMV does not contain unobserved heterogeneity. The key contribution of this paper therefore is to propose a new imputation estimator which is consistent under assumptions that are much less restrictive than those required by the estimators above. We do not impose the zero restrictions required by the DVM, allow for unobserved heterogeneity in the CMV, and allow for missingness to depend on the always-observed covariates. We propose imputation methods for the cases of both strict as well as sequential exogeneity of the covariates, the latter allowing for things like lagged dependent variables and feedback effects. A second contribution we make is proposing a novel variable addition test (VAT) for exogeneity of missingness. The VATs proposed so far in this context have only been able to test for missingness in other time periods being uncorrelated with unobservables in a given time period (Wooldridge, 2010). We propose a test for missingness in the same time period being uncorrelated with the unobservables in a given time period, which is the kind of exogeneity one is most likely to be concerned about in practice. The rest of the paper proceeds as follows. Section 2 presents the population model of interest 65 and the associated assumptions of strict exogeneity of the covariates. Section 3 describes the missing data scheme and the assumptions on the missingness mechanism. Section 4 presents the proposed estimator and its asymptotic distribution. Section 5 compares the proposed estimator to some commonly used alternatives. Section 6 proposes an imputation estimator under sequential exogeneity of the covariates and the novel VAT for the exogeneity of missingness. Section 7 concludes. Proofs and extensions to the cases of missing vectors and time-varying unobserved heterogeneity are given in the appendix. 3.2 Population model We consider a standard linear model with additive heterogeneity. Assume that an underlying population consists of a large number of units for whom data on 𝑇 time periods are potentially available. We assume random sampling from this population, and let 𝑖 denote a random draw. Along with the outcome 𝑦𝑖𝑡 and covariates 𝑥𝑖𝑡 = [𝑥 1𝑖𝑡 𝑥 2𝑖𝑡 ], we also draw scalars 𝑐𝑖 and 𝑑𝑖 , which are the unobserved heterogeneities in 𝑦𝑖𝑡 and 𝑥 1𝑖𝑡 respectively. The linear model with additive heterogeneity is 𝑦𝑖𝑡 = 𝛽1 𝑥1𝑖𝑡 + 𝑥 2𝑖𝑡 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑡 ≡ 𝑥𝑖𝑡 𝛽 + 𝑐𝑖 + 𝑢𝑖𝑡 , 𝑡 = 1, . . . , 𝑇, (3.2.1) where 𝑥 1𝑖𝑡 is a scalar, 𝑥 2𝑖𝑡 is a 1 × 𝑘 vector which includes the constant term3, and 𝛽 = [𝛽1 𝛽02 ] 0. We are interested in estimators of 𝛽 that allow for correlation between 𝑐𝑖 and the history of the covariates, {𝑥𝑖𝑡 : 𝑡 = 1, . . . , 𝑇 }. We first define the histories of all variables. Let y𝑖 = (𝑦𝑖1 , . . . , 𝑦𝑖𝑇 ), x𝑖 = (𝑥𝑖1 , . . . , 𝑥𝑖𝑇 ), x1𝑖 = (𝑥 1𝑖1 , . . . , 𝑥 1𝑖𝑇 ), x2𝑖 = (𝑥 2𝑖1 , . . . , 𝑥 2𝑖𝑇 ), and u𝑖 = (𝑢𝑖1 , . . . , 𝑢𝑖𝑇 ). 
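As a concrete illustration of the setup, the following is a minimal sketch of one way to draw data consistent with (3.2.1), with the heterogeneity c_i deliberately built from the covariate history so that it is correlated with x_it in every period. All parameter values and names are illustrative and not part of the model.

# Minimal sketch of a draw from a population consistent with (3.2.1): the
# heterogeneity c_i is constructed from the covariate history, so it is
# correlated with x_it in every period.  Values are illustrative only.
import numpy as np

rng = np.random.default_rng(3)
N, T = 1_000, 5
beta1 = 1.0
beta2 = np.array([0.5, -1.0])                    # x_2it = (1, w_it)

w = rng.normal(size=(N, T))
x2 = np.stack([np.ones((N, T)), w], axis=2)      # shape (N, T, 2)
x1 = 0.3 + 0.8 * w + rng.normal(size=(N, T))
c = w.mean(axis=1) + 0.5 * rng.normal(size=N)    # c_i correlated with the covariates
u = rng.normal(size=(N, T))
y = beta1 * x1 + x2 @ beta2 + c[:, None] + u     # equation (3.2.1)
print(y.shape)                                   # (N, T)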
We place the following assumption on the idiosyncratic error u_it in equation (3.2.1).

Assumption 3.2.1. E(\mathbf{x}_i' u_{it}) = 0, t = 1, …, T.

This is a kind of strict exogeneity assumption of the covariates with respect to the idiosyncratic error. It implies that x_is is uncorrelated with u_it, s = 1, …, T. In other words, the idiosyncratic error at time t is uncorrelated with the covariates in all time periods. Note that this assumption does not restrict the relationship between x_i and the unobserved heterogeneity c_i, which can be arbitrarily correlated. [Footnote 3: x_2it can include a full set of time dummies, or other aggregate time variables.]

The model which underlies the gains in efficiency in this paper is the following linear imputation model with unobserved heterogeneity, which explains x_1it in terms of x_2it:

x_{1it} = x_{2it}\pi + d_i + r_{it}.    (3.2.2)

We impose an assumption analogous to Assumption 3.2.1 on the idiosyncratic error r_it.

Assumption 3.2.2: E(\mathbf{x}_{2i}' r_{it}) = 0, t = 1, …, T.

Again, this assumption implies that x_2is is orthogonal to the idiosyncratic error r_it in every time period s = 1, …, T. Moreover, it does not restrict the relation between x_2i and the unobserved heterogeneity d_i.

Using the imputation model, which explains x_1it in terms of x_2it, we are able to obtain a "reduced form" for y_it in terms of only x_2it. Plugging (3.2.2) into (3.2.1) gives

y_{it} = \beta_1(x_{2it}\pi + d_i + r_{it}) + x_{2it}\beta_2 + c_i + u_{it} \equiv x_{2it}\gamma + h_i + v_{it},    (3.2.3)

where γ ≡ β_1π + β_2, h_i ≡ β_1 d_i + c_i, and v_it ≡ β_1 r_it + u_it. As we will discuss in Section 3, we allow x_1it to contain missing values while assuming that x_2it is always observed. Equation (3.2.3) allows us to utilize the observations for which x_1it is not observed but y_it and x_2it are. Note that Assumptions 3.2.1 and 3.2.2 imply that

E(\mathbf{x}_{2i}' v_{it}) = E[\mathbf{x}_{2i}'(\beta_1 r_{it} + u_{it})] = 0.    (3.2.4)

That is, x_2is is orthogonal to the idiosyncratic error v_it in equation (3.2.3) for all s = 1, …, T.

3.3 The missing data mechanism

To allow for unbalanced panels, we introduce a series of selection indicators for each i, s_i = {s_i1, …, s_iT}, where s_it = 1 if x_1it is observed and s_it = 0 otherwise. In this paper, we only allow x_1it to contain missing values. Hence, s_it indicates whether we have a "complete case" for unit i in period t.

Our main estimation method is based on the well-known fixed effects estimator. Define T_i = \sum_{q=1}^{T} s_{iq} as the total number of time periods for which x_1it is observed for individual i. Unlike T, T_i is random, since s_it is random for every t = 1, …, T. We impose the following assumption on T_i.

Assumption 3.3.1. P(T_i = 0) = 0.

This assumption simply says that for every individual i in the population, the probability that x_1it is unobserved in every time period t = 1, …, T is zero.

Further, define the time-demeaned covariates as

\ddot{x}_{it} = x_{it} - T_i^{-1}\sum_{q=1}^{T} s_{iq} x_{iq},

where the time demeaning here has been done using the complete cases only. We can write \ddot{x}_{it} = [\ddot{x}_{1it}\ \ddot{x}_{2it}], where \ddot{x}_{1it} = x_{1it} - T_i^{-1}\sum_{q=1}^{T} s_{iq} x_{1iq} and \ddot{x}_{2it} = x_{2it} - T_i^{-1}\sum_{q=1}^{T} s_{iq} x_{2iq}. Moreover,

\dot{x}_{2it} = x_{2it} - (T - T_i)^{-1}\sum_{q=1}^{T} (1 - s_{iq}) x_{2iq}

are the time-demeaned covariates where the time demeaning has been done using the incomplete cases only. Under Assumption 3.3.1, \ddot{x}_{it} and \dot{x}_{2it} are well defined, as are all other time-demeaned variables defined in this section.
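Computationally, the two demeaning operations just defined amount to within-unit averages taken over the selected (s_it = 1) or the non-selected (s_it = 0) periods only. The following is a minimal numpy sketch; the array shapes and names are assumptions of the sketch.

# Within-unit demeaning using only the periods indicated by a 0/1 selector.
# Passing s gives the complete-case demeaning (x_ddot); passing 1 - s gives the
# incomplete-case demeaning (x_dot).  Assumption 3.3.1 guarantees at least one
# selected period per unit for the complete-case version.
import numpy as np

def demean_selected(v, sel):
    """Demean v over time using only periods with sel == 1.
    v has shape (N, T) or (N, T, k); sel has shape (N, T) with 0/1 entries."""
    sel = sel.astype(float)
    Ti = sel.sum(axis=1)                     # number of selected periods, shape (N,)
    if v.ndim == 3:
        sel = sel[..., None]
        Ti = Ti[:, None]
    vbar = (sel * v).sum(axis=1) / Ti        # selected-period time average
    return v - vbar[:, None]                 # broadcast the average back over T

# Usage (arrays as in the text):
# x1_ddot = demean_selected(x1, s)           # complete-case demeaning of x_1it
# x2_ddot = demean_selected(x2, s)           # complete-case demeaning of x_2it
# x2_dot  = demean_selected(x2, 1 - s)       # incomplete-case demeaning of x_2it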
For consistent estimation in the selected samples using fixed effects, we impose the following assumptions on the population distribution.

Assumption 3.3.2. For every t = 1, …, T, (i) E(s_{it}\ddot{x}_{it}' u_{it}) = 0; (ii) E(s_{it}\ddot{x}_{2it}' r_{it}) = 0; (iii) E[(1 - s_{it})\dot{x}_{2it}' v_{it}] = 0.

One case where this assumption would hold is when s_i ⊥ (x_i, u_i, r_i, c_i, d_i), that is, selection is independent of everything else in the model, a case we will call "missing completely at random" (MCAR). For instance, data will be MCAR when we have a randomly rotating panel. Then part (i) of Assumption 3.3.2 becomes

E(s_{it}\ddot{x}_{it}' u_{it}) = E(s_{it} x_{it}' u_{it}) - E\Big(s_{it} T_i^{-1}\sum_{q=1}^{T} s_{iq} x_{iq}' u_{it}\Big)
 = E(s_{it})\,E(x_{it}' u_{it}) - \sum_{q=1}^{T} E(s_{it} T_i^{-1} s_{iq})\,E(x_{iq}' u_{it})
 = 0.    (3.3.1)

The third equality follows from Assumption 3.2.1, under which E(\mathbf{x}_i' u_{it}) = 0. Similarly, part (ii) of Assumption 3.3.2 becomes

E(s_{it}\ddot{x}_{2it}' r_{it}) = E(s_{it} x_{2it}' r_{it}) - E\Big(s_{it} T_i^{-1}\sum_{q=1}^{T} s_{iq} x_{2iq}' r_{it}\Big)
 = E(s_{it})\,E(x_{2it}' r_{it}) - \sum_{q=1}^{T} E(s_{it} T_i^{-1} s_{iq})\,E(x_{2iq}' r_{it})
 = 0.    (3.3.2)

The third equality follows from Assumption 3.2.2, under which E(\mathbf{x}_{2i}' r_{it}) = 0.

As we will see in Section 4, time demeaning using the complete cases gets rid of the unobserved heterogeneities c_i and d_i in equations (3.2.1) and (3.2.2) respectively. Therefore, Assumption 3.3.2 does not put any restrictions on the unobserved heterogeneities, and we do not need selection to be independent of the unobserved heterogeneities for this assumption to hold. So along with Assumptions 3.2.1 and 3.2.2, MCAR is sufficient for Assumption 3.3.2 to hold, but we can get by with the weaker assumption s_i ⊥ (x_i, u_i, r_i). [Footnote 5: It is, however, hard to think of situations where selection is independent of the covariates and the idiosyncratic errors but not the unobserved heterogeneities.]

We can also allow selection to be a function of the always-observed covariates x_2it or of unobserved random variables outside the model, but we then have to strengthen the exogeneity Assumptions 3.2.1 and 3.2.2 to the following zero conditional mean assumptions.

Assumption 3.2.1'. E(u_{it} | \mathbf{x}_{1i}, \mathbf{x}_{2i}, c_i, \mathbf{s}_i) = 0, t = 1, …, T.

Assumption 3.2.1' is a version of strict exogeneity of selection (along with strict exogeneity of the covariates) conditional on c_i. It implies that observing x_1it in any time period t cannot be systematically related to the idiosyncratic errors u_i. As a practical matter, Assumption 3.2.1' allows selection s_it at time period t to be arbitrarily correlated with (x_1i, x_2i, c_i), that is, with the covariates in any time period and the unobserved heterogeneity in y_it. We also need to strengthen Assumption 3.2.2 to the following zero conditional mean assumption.

Assumption 3.2.2'. E(r_{it} | \mathbf{x}_{2i}, d_i, \mathbf{s}_i) = 0, t = 1, …, T.

Assumption 3.2.2' implies that observing x_1it in any time period t cannot be systematically related to r_i, where r_i = (r_i1, …, r_iT). But it can be arbitrarily correlated with (x_2i, d_i), that is, with the always-observed covariates and the unobserved heterogeneity in x_1it.

Together, Assumptions 3.2.1' and 3.2.2' allow s_it to be arbitrarily correlated with the always-observed covariates x_2i, as well as with the unobserved heterogeneity in both y_it and x_1it, that is, c_i and d_i. But they rule out s_it being a function of the idiosyncratic errors u_i and r_i. To see that Assumption 3.3.2 holds under Assumptions 3.2.1' and 3.2.2', consider part (i) of Assumption 3.3.2:

E(s_{it}\ddot{x}_{it}' u_{it}) = E[E(s_{it}\ddot{x}_{it}' u_{it} \mid \mathbf{x}_i, \mathbf{s}_i)] = E[s_{it}\ddot{x}_{it}'\, E(u_{it} \mid \mathbf{x}_i, \mathbf{s}_i)] = 0.    (3.3.3)
The first equality follows from the Law of Iterated Expectations (LIE), and the third follows from the fact that under Assumption 3.2.1', E(u_{it} | \mathbf{x}_i, \mathbf{s}_i) = 0 by the LIE. Similarly, part (ii) of Assumption 3.3.2 becomes

E(s_{it}\ddot{x}_{2it}' r_{it}) = E[E(s_{it}\ddot{x}_{2it}' r_{it} \mid \mathbf{x}_{2i}, \mathbf{s}_i)] = E[s_{it}\ddot{x}_{2it}'\, E(r_{it} \mid \mathbf{x}_{2i}, \mathbf{s}_i)] = 0,    (3.3.4)

where the third equality follows from the fact that under Assumption 3.2.2', E(r_{it} | \mathbf{x}_{2i}, \mathbf{s}_i) = 0 by the LIE.

3.4 Moment conditions and GMM

It is well known that the fixed effects (within) estimator that uses only the complete cases is generally consistent under Assumption 3.2.1'. One way to characterize this estimator is to multiply equation (3.2.1) through by the selection indicator to get

s_{it} y_{it} = \beta_1 s_{it} x_{1it} + s_{it} x_{2it}\beta_2 + s_{it} c_i + s_{it} u_{it}, \quad t = 1, \dots, T.    (3.4.1)

Averaging this equation across t for each i gives

\bar{y}_i = \beta_1\bar{x}_{1i} + \bar{x}_{2i}\beta_2 + c_i + \bar{u}_i,    (3.4.2)

where \bar{y}_i = T_i^{-1}\sum_{q=1}^{T} s_{iq} y_{iq} is the average of the selected observations; the other averages in (3.4.2) are defined similarly. If we now multiply (3.4.2) by s_it and subtract from (3.4.1), we remove c_i:

s_{it}(y_{it} - \bar{y}_i) = \beta_1 s_{it}(x_{1it} - \bar{x}_{1i}) + s_{it}(x_{2it} - \bar{x}_{2i})\beta_2 + s_{it}(u_{it} - \bar{u}_i), \quad t = 1, \dots, T.    (3.4.3)

Equivalently,

s_{it}\ddot{y}_{it} = \beta_1 s_{it}\ddot{x}_{1it} + s_{it}\ddot{x}_{2it}\beta_2 + s_{it}\ddot{u}_{it}, \quad t = 1, \dots, T,    (3.4.4)

where \ddot{y}_{it} \equiv y_{it} - \bar{y}_i, \ddot{u}_{it} \equiv u_{it} - \bar{u}_i, and \ddot{x}_{1it} and \ddot{x}_{2it} are as defined in Section 3. These are the time-demeaned variables, where the demeaning has been done using the complete cases. Then pooled OLS on (3.4.4) gives consistent estimates of β under part (i) of Assumption 3.3.2. Estimating β using pooled OLS is equivalent to GMM estimation using the following moment conditions:

E[f_{1i}(\beta, \pi)] \equiv E\Big[\sum_{t=1}^{T} s_{it}\ddot{x}_{it}'(\ddot{y}_{it} - \ddot{x}_{1it}\beta_1 - \ddot{x}_{2it}\beta_2)\Big] = 0.    (3.4.5)

These moment conditions give the fixed effects estimator based only on complete cases. Even though this estimator is consistent, it leaves room for gains in efficiency, as it ignores the information contained in those observations for which x_1it is missing but y_it and x_2it are observed. In order to utilize those observations, we augment the above moment conditions with those from the imputation model and the reduced form for y_it.

We can time demean the imputation model (3.2.2) in a similar fashion as (3.2.1), that is, using the complete cases. This gives

s_{it}\ddot{x}_{1it} = s_{it}\ddot{x}_{2it}\pi + s_{it}\ddot{r}_{it}, \quad t = 1, \dots, T,    (3.4.6)

where \ddot{r}_{it} \equiv r_{it} - \bar{r}_i and \bar{r}_i = T_i^{-1}\sum_{q=1}^{T} s_{iq} r_{iq}. Again, the unobserved heterogeneity d_i is eliminated by the time demeaning. Estimating π using pooled OLS in this equation is equivalent to GMM estimation using the moment functions

f_{2i}(\beta, \pi) = \sum_{t=1}^{T} s_{it}\ddot{x}_{2it}'(\ddot{x}_{1it} - \ddot{x}_{2it}\pi).    (3.4.7)

For the reduced form, we use the incomplete cases to time demean the data. Define

\dot{y}_{it} \equiv y_{it} - (T - T_i)^{-1}\sum_{q=1}^{T}(1 - s_{iq}) y_{iq}, \qquad \dot{x}_{2it} \equiv x_{2it} - (T - T_i)^{-1}\sum_{q=1}^{T}(1 - s_{iq}) x_{2iq}.

Then estimating γ using pooled OLS on the equation

(1 - s_{it})\dot{y}_{it} = (1 - s_{it})\dot{x}_{2it}\gamma + (1 - s_{it})\dot{v}_{it}, \quad t = 1, \dots, T,    (3.4.8)

is equivalent to GMM estimation using the following moment functions for the reduced form:

f_{3i}(\beta, \pi) = \sum_{t=1}^{T}(1 - s_{it})\dot{x}_{2it}'[\dot{y}_{it} - \dot{x}_{2it}(\beta_1\pi + \beta_2)].    (3.4.9)

The full vector of moment functions is given by

f_i(\beta, \pi) = \begin{bmatrix} \sum_{t=1}^{T} s_{it}\ddot{x}_{it}'(\ddot{y}_{it} - \ddot{x}_{1it}\beta_1 - \ddot{x}_{2it}\beta_2) \\ \sum_{t=1}^{T} s_{it}\ddot{x}_{2it}'(\ddot{x}_{1it} - \ddot{x}_{2it}\pi) \\ \sum_{t=1}^{T}(1 - s_{it})\dot{x}_{2it}'[\dot{y}_{it} - \dot{x}_{2it}(\beta_1\pi + \beta_2)] \end{bmatrix} \equiv \begin{bmatrix} f_{1i}(\beta, \pi) \\ f_{2i}(\beta, \pi) \\ f_{3i}(\beta, \pi) \end{bmatrix}.    (3.4.10)
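Before stating the main lemmas, here is a brief sketch of how the stacked per-unit moment vector in (3.4.10) might be computed, assuming the demeaned arrays have already been constructed as in Section 3; the shapes and names below are assumptions of the sketch.

# Per-unit moment vector f_i(beta, pi) from (3.4.10), given arrays for one unit:
# y_dd, x1_dd (complete-case demeaned), x2_dd, y_dot, x2_dot (incomplete-case
# demeaned), and the selection indicators s.  Length of the output is 3k + 1.
import numpy as np

def f_i(beta1, beta2, pi, y_dd, x1_dd, x2_dd, y_dot, x2_dot, s):
    """y_dd, x1_dd, y_dot, s: shape (T,); x2_dd, x2_dot: shape (T, k)."""
    x_dd = np.column_stack([x1_dd, x2_dd])                        # (T, k+1)
    e1 = y_dd - x1_dd * beta1 - x2_dd @ beta2                     # outcome-equation residual
    e2 = x1_dd - x2_dd @ pi                                       # imputation residual
    e3 = y_dot - x2_dot @ (beta1 * pi + beta2)                    # reduced-form residual
    f1 = (s[:, None] * x_dd * e1[:, None]).sum(axis=0)            # (k+1,)
    f2 = (s[:, None] * x2_dd * e2[:, None]).sum(axis=0)           # (k,)
    f3 = ((1 - s)[:, None] * x2_dot * e3[:, None]).sum(axis=0)    # (k,)
    return np.concatenate([f1, f2, f3])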
Lemma 3.4.1: Under Assumptions 3.2.1', 3.2.2', 3.3.1 and 3.3.2, E[f_i(β, π)] = 0.

This is a set of 3k + 1 moment conditions with 2k + 1 parameters, giving us k over-identifying restrictions. It is the availability of these over-identifying restrictions that leads to gains in efficiency in this model. As the following result shows, using either f_1i(.) and f_2i(.) or f_1i(.) and f_3i(.) leads to an estimator of β that is identical to the estimator that uses only f_1i(.) and hence utilizes only the complete cases.

Lemma 3.4.2: Under Assumptions 3.2.1', 3.2.2', 3.3.1 and 3.3.2, GMM estimators of β based on the moment functions [f_1i(.)′ f_2i(.)′]′ or the moment functions [f_1i(.)′ f_3i(.)′]′ are identical to that based only on f_1i(.).

Lemma 3.4.2 follows directly from the result in Ahn & Schmidt (1995) [Footnote 6: Theorem 1, p. 3.] that adding an equal number of additional parameters and extra moment conditions does not change the GMM estimate of the original parameters. Both f_2i(.) and f_3i(.) are sets of k moment functions which add the k extra parameters π.

To define the GMM estimator based on the entire vector f_i(.), let \bar{f}(\beta, \pi) = N^{-1}\sum_{i=1}^{N} f_i(\beta, \pi), let Ω be a square matrix of order 3k + 1 that is nonrandom, symmetric, and positive definite, and let Ω̂ be a first-step consistent estimate of Ω. Then the standard two-step GMM minimization problem is given by

\min_{\beta, \pi}\ \bar{f}(\beta, \pi)'\, \hat{\Omega}\, \bar{f}(\beta, \pi).    (3.4.11)

The variance-covariance matrix of the moment functions is given by

C \equiv E[f_i(\beta, \pi)\, f_i(\beta, \pi)'] = \begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{bmatrix},

where

C_{11} = E\Big[\Big(\sum_{t=1}^{T} s_{it}\ddot{x}_{it}' u_{it}\Big)\Big(\sum_{r=1}^{T} s_{ir} u_{ir}\ddot{x}_{ir}\Big)\Big], \quad C_{12} = E\Big[\Big(\sum_{t=1}^{T} s_{it}\ddot{x}_{it}' u_{it}\Big)\Big(\sum_{r=1}^{T} s_{ir} r_{ir}\ddot{x}_{2ir}\Big)\Big],
C_{13} = E\Big[\Big(\sum_{t=1}^{T} s_{it}\ddot{x}_{it}' u_{it}\Big)\Big(\sum_{r=1}^{T} (1 - s_{ir}) v_{ir}\dot{x}_{2ir}\Big)\Big], \quad C_{22} = E\Big[\Big(\sum_{t=1}^{T} s_{it}\ddot{x}_{2it}' r_{it}\Big)\Big(\sum_{r=1}^{T} s_{ir} r_{ir}\ddot{x}_{2ir}\Big)\Big],
C_{23} = E\Big[\Big(\sum_{t=1}^{T} s_{it}\ddot{x}_{2it}' r_{it}\Big)\Big(\sum_{r=1}^{T} (1 - s_{ir}) v_{ir}\dot{x}_{2ir}\Big)\Big], \quad C_{33} = E\Big[\Big(\sum_{t=1}^{T} (1 - s_{it})\dot{x}_{2it}' v_{it}\Big)\Big(\sum_{r=1}^{T} (1 - s_{ir}) v_{ir}\dot{x}_{2ir}\Big)\Big],

and f_i(.) is evaluated at the true values of the parameters. The optimal weight matrix is given by the inverse of C. Let Ĉ be a consistent estimate of C. [Footnote 7: Ĉ can be obtained by replacing the expectations with sample averages and substituting the estimated errors.] Then the joint GMM estimator is defined as follows.

Definition 3.4.1. Call the estimator of [β′ π′]′ that solves (3.4.11) with Ω̂ = Ĉ^{-1} the joint GMM estimator (or [β̂_JointFE′ π̂_JointFE′]′).

Further, define the gradient of the moment functions as

D \equiv E[\nabla f_i(\beta, \pi)] = \begin{bmatrix} D_{11} & 0 \\ 0 & D_{22} \\ D_{31} & D_{32} \end{bmatrix},

where

D_{11} = -E\Big[\sum_{t=1}^{T} s_{it}\ddot{x}_{it}'\ddot{x}_{it}\Big], \quad D_{22} = -E\Big[\sum_{t=1}^{T} s_{it}\ddot{x}_{2it}'\ddot{x}_{2it}\Big],
D_{31} = -E\Big[\sum_{t=1}^{T} (1 - s_{it})\dot{x}_{2it}'\,\big(\dot{x}_{2it}\pi \ \ \dot{x}_{2it}\big)\Big], \quad D_{32} = -E\Big[\sum_{t=1}^{T} (1 - s_{it})\dot{x}_{2it}'\dot{x}_{2it}\beta_1\Big].

We impose the following rank condition on D for identification of β and π.

Assumption 3.4.1: rank(D_11) = k + 1 and rank(D_22) = k.

Under this assumption, f_1i(β) identifies β and f_2i(π) identifies π. Then we have the following result using Hansen (1982).

Theorem 3.4.1. Under standard regularity conditions and Assumptions 3.2.1', 3.2.2', 3.3.1, 3.3.2, and 3.4.1, the estimators [β̂_JointFE′ π̂_JointFE′]′ are consistent and asymptotically normal, with asymptotic variance given by (D'C^{-1}D)^{-1}, and

N\, \bar{f}(\hat{\beta}, \hat{\pi})'\, \hat{C}^{-1}\, \bar{f}(\hat{\beta}, \hat{\pi}) \xrightarrow{d} \chi^2_k.

This statistic can be used for the standard test of over-identifying restrictions.
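As an implementation sketch of this over-identification statistic: given an N × (3k + 1) array of moment contributions f_i evaluated at the two-step estimates, the statistic is the minimized objective scaled by N and is compared against a chi-squared distribution with k degrees of freedom. The function below is a sketch under those assumptions, not a full implementation of the estimator.

# Hansen J statistic from the two-step GMM fit: f_all is an (N, 3k+1) array of
# per-unit moment contributions evaluated at the estimates; the degrees of
# freedom equal the number of over-identifying restrictions (k here).
import numpy as np
from scipy.stats import chi2

def hansen_j(f_all, n_overid):
    N = f_all.shape[0]
    fbar = f_all.mean(axis=0)
    C_hat = f_all.T @ f_all / N                 # sample analogue of C
    J = N * fbar @ np.linalg.inv(C_hat) @ fbar  # minimized GMM objective times N
    return J, chi2.sf(J, df=n_overid)           # statistic and p-value

# Usage: J, pval = hansen_j(f_all, n_overid=k), with k the number of
# always-observed covariates (3k + 1 moments, 2k + 1 parameters).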
Note that this statistic is just the GMM objective function in (3.4.11) evaluated at the efficient values of the parameters, and is distributed as chi-squared with degrees of freedom equal to the number of over-identifying restrictions. 3.5 Comparison to related estimators 3.5.1 Complete cases estimator The most common practice in the presence of missing data is to just use the complete cases for estimation; that is, only use the observations for which 𝑥1 is observed. One estimator that uses only complete cases is a GMM estimator based only on ℎ1𝑖 (.) which is defined as follows. Definition 3.5.1.1 Call the estimator of 𝛽 that solves (3.4.11), where 𝑓𝑖 (.) contains only 𝑓1𝑖 (.) and Ω̂ = 𝐼, the complete cases estimator (or 𝛽ˆ𝐶𝐶 ). Since 𝑓1𝑖 (.) is an exactly identified set of moment functions, the weight matrix is irrelevant for this estimation procedure. The asymptotic variance of this estimator is given in the following result. Lemma 3.5.1.1 Under Assumptions 3.2.1’, 3.3.1, 3.3.2 and 3.4.1, the complete cases estimator 𝛽ˆ𝐶𝐶 has an asymptotic variance given by √ 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛽ˆ𝐶𝐶 − 𝛽)] = (𝐷 011𝐶11 −1 𝐷 ) −1 . 11 74 This estimator simply ignores the information in the observations with missing 𝑥1 . 𝛽ˆ𝐽𝑜𝑖𝑛𝑡𝐹𝐸 allows for utilization of this information, leading to potential efficiency gains. The gain in efficiency just follows from the fact that adding valid moment conditions [in this case, 𝑓2𝑖 (.) and 𝑓3𝑖 (.)] decreases, or at least does not increase, the asymptotic variance of a GMM estimator. Proposition 3.5.1.1 Under Assumptions 3.2.1’, 3.2.2’, 3.3.1, 3.3.2, and 3.4.1, √ √ 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛽ˆ𝐶𝐶 − 𝛽)] − 𝐴𝑣𝑎𝑟 [ 𝑁 ( 𝛽ˆ𝐽𝑜𝑖𝑛𝑡𝐹𝐸 − 𝛽)] is positive semi-definite. 3.5.2 Dummy variable method For cross section data, the dummy variable method refers to setting the missing values of the covariate to zero and using an indicator for whether the covariate is missing as an additional covariate. Jones (1996) showed that this generally leads to biased and inconsistent estimates for the case of cross section data. For panel data, one way the dummy variable method could proceed is the following. Note that using (3.2.1) and (3.2.2), we can write 𝑦𝑖𝑡 = 𝛽1 [𝑠𝑖𝑡 𝑥1𝑖𝑡 + (1 − 𝑠𝑖𝑡 )(𝑥2𝑖𝑡 𝜋 + 𝑑𝑖 + 𝑟𝑖𝑡 )] + 𝑥 2𝑖𝑡 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑡 . (3.5.1) Now, separating the intercept in the imputation model (3.2.2), we get 𝑥1𝑖𝑡 = 𝜋1 + 𝑥 22𝑖𝑡 𝜋2 + 𝑑𝑖 + 𝑟𝑖𝑡 , (3.5.2) where 𝑥 2𝑖𝑡 = [1 𝑥 22𝑖𝑡 ]. Substituting (3.5.2) in (3.5.1) and rearranging gives 𝑦𝑖𝑡 = 𝛽1 𝑠𝑖𝑡 𝑥 1𝑖𝑡 + 𝛽1 𝜋1 (1 − 𝑠𝑖𝑡 ) + 𝑥 2𝑖𝑡 𝛽2 + 𝑒𝑖𝑡 , (3.5.3) where 𝑒𝑖𝑡 ≡ 𝛽1 (1 − 𝑠𝑖𝑡 )(𝑥 22𝑖𝑡 𝜋2 + 𝑑𝑖 + 𝑟𝑖𝑡 ) + 𝑐𝑖 + 𝑢𝑖𝑡 . The dummy variable method omits the term (1 − 𝑠𝑖𝑡 )𝑥 22𝑖𝑡 𝜋2 𝛽1 from the model and includes it in the error term. This omitted variable bias is the source of inconsistency of this method, and hence even when the data is missing completely at random, neither POLS nor fixed effects consistently estimates the parameters in the model under the assumptions made so far. 75 As is expected, POLS on (3.5.3) is additionally inconsistent because 𝑒𝑖𝑡 contains 𝑐𝑖 and 𝑑𝑖 which are correlated with 𝑥𝑖𝑡 . But even fixed effects estimation of (3.5.3) is additionally inconsistent as it does not get rid of the term (1 − 𝑠𝑖𝑡 )𝑑𝑖 in the error, which is correlated with 𝑥𝑖𝑡 . The fixed effects estimator where we time demean using all observations proceeds as follows. 
Averaging (3.5.3) across 𝑡 for each 𝑖 and then subtracting the averaged equation from (3.5.3) gives 𝑦` 𝑖𝑡 = 𝛽1 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 + 𝛽1 𝜋1 (1 − 𝑠`𝑖𝑡 ) + 𝑥`2𝑖𝑡 𝛽2 + 𝑒`𝑖𝑡 , (3.5.4) where 𝑦` 𝑖𝑡 = 𝑦𝑖𝑡 − 𝑇 −1 𝑇𝑞=1 𝑦𝑖𝑞 , 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 = 𝑠𝑖𝑡 𝑥1𝑖𝑡 − 𝑇 −1 𝑇𝑞=1 𝑠𝑖𝑞 𝑥1𝑖𝑞 and so on. Estimating this Í Í equation using POLS gives the dummy variable estimator 𝛽ˆ 𝐷 . This estimator is inconsistent unless we impose the restrictions that certain objects are zero in the model. Proposition 3.5.2.1. Under Assumptions 3.2.1’, 3.2.2’, and 3.4.1, 𝛽ˆ 𝐷 is inconsistent unless (i) 𝛽1 = 0 or (ii) 𝜋2 = 0 and 𝑑𝑖 = 0 ∀ 𝑖. The first condition is setting 𝛽1 = 0, which clearly gets rid of both sources of inconsistency in this model. If 𝛽1 = 0, 𝑒`𝑖𝑡 = 𝑢`𝑖𝑡 , which is clearly uncorrelated with the regressors in (3.5.3) under Assumption 3.2.1. Intuitively, this condition implies that 𝑥 1𝑖𝑡 is irrelevant in model of interest (3.2.1). In this case, the best solution is to drop it and use all observations to estimate 𝛽2 in (3.2.1) using a standard fixed effects estimator that is used when there is no missingness. The second condition implies that first, there is no unobserved heterogeneity in the variable with missing values 𝑥 1𝑖𝑡 . As mentioned above, this condition is required because the fixed effects transformation does not get rid of 𝑑𝑖 in (3.5.3) because it is now multiplied by (1 − 𝑠𝑖𝑡 ). But even if 𝑑𝑖 = 0 ∀ 𝑖, this estimator is inconsistent because of omitting the term (1 − 𝑠𝑖𝑡 )𝑥 22𝑖𝑡 𝜋2 𝛽1 . Therefore, we need an additional condition that 𝜋2 = 0, which intuitively means that 𝑥2𝑖𝑡 does not help in predicting 𝑥1𝑖𝑡 . 3.5.3 Regression imputation Regression imputation is a two-step method which proceeds as following. In the first step, estimate 𝜋 in (3.2.2) using POLS and complete cases only (call it 𝜋). ˜ In the second step, plug 𝜋˜ in 76 the equation ∗ + 𝑥 𝜔 + 𝑒𝑟𝑟𝑜𝑟 , 𝑦𝑖𝑡 = 𝜔1 𝑥1𝑖𝑡 (3.5.5) 2𝑖𝑡 2 𝑖𝑡 where 𝑥1𝑖𝑡∗ ≡ 𝑠 𝑥 + (1 − 𝑠 )𝑥 𝜋. This is the “composite" 𝑥 which contains the true values of 𝑖𝑡 1𝑖𝑡 𝑖𝑡 2𝑖𝑡 1 𝑥 1 when it is observed (i.e. when 𝑠𝑖𝑡 = 1) and the predicted values from the imputation equation (3.2.2) when it is missing (i.e. when 𝑠𝑖𝑡 = 0). Then estimate 𝜔1 and 𝜔2 ) using fixed effects. To establish the performance of this estimator, recall that we can write using (3.2.1) and (3.2.2) 𝑦𝑖𝑡 = 𝛽1 [𝑠𝑖𝑡 𝑥1𝑖𝑡 + (1 − 𝑠𝑖𝑡 )(𝑥2𝑖𝑡 𝜋 + 𝑑𝑖 + 𝑟𝑖𝑡 )] + 𝑥 2𝑖𝑡 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑡 . (3.5.6) This boils down to the model of interest (3.2.1) when 𝑠𝑖𝑡 = 1 and to the reduced form (3.2.3) when 𝑠𝑖𝑡 = 0. Re-arrange this and write as 𝑦𝑖𝑡 = 𝛽1 [𝑠𝑖𝑡 𝑥1𝑖𝑡 + (1 − 𝑠𝑖𝑡 )𝑥2𝑖𝑡 𝜋] + 𝑥2𝑖𝑡 𝛽2 + [(1 − 𝑠𝑖𝑡 )𝑑𝑖 𝛽1 + 𝑐𝑖 ] + [(1 − 𝑠𝑖𝑡 )𝑟𝑖𝑡 𝛽1 + 𝑢𝑖𝑡 ] ∗ + 𝑥 𝛽 + [(1 − 𝑠 )𝑑 𝛽 + 𝑐 ] + [(1 − 𝑠 )𝑟 𝛽 + 𝑢 ]. ≡ 𝛽1 𝑥 1𝑖𝑡 (3.5.7) 2𝑖𝑡 2 𝑖𝑡 𝑖 1 𝑖 𝑖𝑡 𝑖𝑡 1 𝑖𝑡 Comparing (3.5.7) with (3.5.5), we note that the error in (3.5.5) contains both of the last two terms in (3.5.7), that is, the term that occurs due to the idiosyncratic errors in the model of interest and the imputation model as well as the term that occurs due to the unobserved heterogeneities in the two models. The issue with plugging 𝜋˜ in (3.5.7) and then estimating using fixed effects is twofold. First, estimating (3.2.2) using POLS and not fixed effects will lead to an inconsistent estimator of 𝜋 due to the presence of 𝑑𝑖 in (3.2.2). 
Second, and more importantly, even if one gets a consistent estimate of 𝜋 using fixed effects on (3.2.2) and plugs it in (3.5.7), a standard fixed effects on this equation does not produce consistent estimates of 𝛽1 and 𝛽2 because the unobserved heterogeneity term [(1 − 𝑠𝑖𝑡 )𝑑𝑖 𝛽1 + 𝑐𝑖 ] is not time constant anymore and hence cannot be eliminated by the standard fixed effects transformation. This method is therefore generally going to be inconsistent due to the presence of 𝑑𝑖 in the imputation model. A sequential estimator that is consistent is the following. First estimate 𝜋 using 𝑓2𝑖 (.), plug the estimated 𝜋 into 𝑓3𝑖 (.), and then estimate 𝛽 using 𝑓1𝑖 (.) and 𝑓3𝑖 (.) together. Definition 3.5.3.1. Call the following two-step estimator the sequential GMM (or [ 𝛽ˆ0𝑆𝑒𝑞 𝜋ˆ 0𝑆𝑒𝑞 ] 0). 77 Step 1: Obtain 𝜋ˆ 𝑆𝑒𝑞 by solving (3.4.11), where 𝑓𝑖 (.) contains only 𝑓2𝑖 (.) and Ω̂ = 𝐼. Step 2: Obtain 𝛽ˆ𝑆𝑒𝑞 by solving (3.4.11), where     Í𝑇 𝑠 𝑥¥ 0 ( 𝑦¥ − 𝑥¥ 𝛽 − 𝑥¥ 𝛽 ) 𝑓 (𝛽, 𝜋) 1𝑖𝑡 1 2𝑖𝑡 2   1𝑖     𝑡=1 𝑖𝑡 𝑖𝑡 𝑖𝑡 𝑓𝑖 (𝛽, 𝜋ˆ 𝑆𝑒𝑞 ) = Í ≡     𝑇 (1 − 𝑠 ) 𝑥¤0 𝑦¤ − 𝑥¤ (𝛽 𝜋ˆ + 𝛽 )    𝑓 (𝛽, 𝜋ˆ )  2𝑖𝑡 1 𝑆𝑒𝑞 2   3𝑖  𝑡=1 𝑖𝑡 2𝑖𝑡 𝑖𝑡   𝑆𝑒𝑞    and " # −1 Õ𝑁 Ω̂ = 𝑁 −1 𝑓𝑖 (𝛽, 𝜋ˆ 𝑆𝑒𝑞 ) 𝑓𝑖 (𝛽, 𝜋ˆ 𝑆𝑒𝑞 ) 0 . 𝑖=1 As is well known, sequential GMM estimators are generally less, or at least no more, efficient than joint GMM estimators that use the same moment conditions. Therefore, 𝛽ˆ𝑆𝑒𝑞 is generally less efficient than 𝛽ˆ𝐽𝑜𝑖𝑛𝑡𝐹𝐸 8 and there would be no reason to choose it other than computational convenience. 3.5.4 Mundlak device In the case of balanced panels, it is well known that the Mundlak device which adds time averages of the covariates as additional explanatory variables in equation (3.2.1) and estimates the model using POLS is numerically equivalent to the fixed effects estimator (Mundlak, 1978). Wooldridge (2019) shows that this numerical equivalence carries over to the case of unbalanced panels as well. In equation (3.2.1), if we include time averages of 𝑥𝑖𝑡 computed using only the complete cases as additional covariates and estimate the model using POLS on complete cases only, then this estimator is numerically equivalent to the complete cases fixed effects estimator 𝛽ˆ𝐶𝐶 . This suggests an alternative to the joint fixed effects GMM estimator introduced in Section 4. Instead of time demeaning each of the equations (3.2.1)-(3.2.3), we can use the Mundlak device for each of them. Consider first equation (3.2.1) and write 𝑐𝑖 = 𝜓1 + 𝜉11 𝑥¯1𝑖 + 𝑥¯2𝑖 𝜉12 + 𝑎 1𝑖 ≡ 𝜓1 + 𝑥¯𝑖 𝜉1 + 𝑎 1𝑖 . (3.5.8) 8Prokhorov and Schmidt (2009), Theorem 2.2, part 5. 78 This is a model that explains the unobserved heterogeneity 𝑐𝑖 in terms of the time averages of covariates in equation (3.2.1), where the averaging has been done using the complete cases only. We impose the following zero conditional mean assumption on the error 𝑎 1𝑖 . Assumption 3.5.4.1. E(𝑎 1𝑖 |x𝑖 , s𝑖 ) = 0. This implies first that E(𝑐𝑖 | 𝑥¯𝑖 ) = 𝜓1 + 𝑥¯𝑖 𝜉1 . Second, it implies that selection in all time periods is uncorrelated with the error 𝑎 1𝑖 . Plugging (3.5.8) into (3.2.1), we get 𝑦𝑖𝑡 = 𝑥𝑖𝑡 𝛽 + 𝜓1 + 𝑥¯𝑖 𝜉1 + 𝑎 1𝑖 + 𝑢𝑖𝑡 . (3.5.9) Let 𝑥´𝑖𝑡 = [1 𝑥𝑖𝑡 𝑥¯𝑖 ]. Estimating this model using POLS with the 𝑠𝑖𝑡 = 1 observations is equivalent to doing GMM with the following moment functions 𝑔1𝑖 (𝛽, 𝜓1 , 𝜉1 ) = 𝑠𝑖𝑡 𝑥´𝑖𝑡0 (𝑦𝑖𝑡 − 𝑥𝑖𝑡 𝛽 − 𝜓1 − 𝑥¯𝑖 𝜉1 ). (3.5.10) Similarly, for the unobserved heterogeneity in the imputation model in equation (3.2.2), we can write 𝑑𝑖 = 𝜓2 + 𝑥¯2𝑖 𝜉2 + 𝑎 2𝑖 . 
(3.5.11) Analogous to Assumption 3.5.4.1, we place the following assumption on the error term 𝑎 2𝑖 , which implies that E(𝑑𝑖 | 𝑥¯2𝑖 ) = 𝜓2 + 𝑥¯2𝑖 𝜉2 and that selection in all time periods is uncorrelated with 𝑎 2𝑖 . Assumption 3.5.4.2. E(𝑎 2𝑖 |x2𝑖 , s𝑖 ) = 0. Plugging (3.5.11) into equation (3.2.2), we get 𝑥 1𝑖𝑡 = 𝑥 2𝑖𝑡 𝜋 + 𝜓2 + 𝑥¯2𝑖 𝜉2 + 𝑎 2𝑖 + 𝑟𝑖𝑡 . (3.5.12) Let 𝑥´2𝑖𝑡 = [1 𝑥 2𝑖𝑡 𝑥¯2𝑖 ]. Estimating this model using POLS with the 𝑠𝑖𝑡 = 1 observations is equivalent to doing GMM with the following moment functions. 0 (𝑥 − 𝑥 𝜋 − 𝜓 − 𝑥¯ 𝜉 ). 𝑔2𝑖 (𝜋, 𝜓2 , 𝜉2 ) = 𝑠𝑖𝑡 𝑥´2𝑖𝑡 (3.5.13) 1𝑖𝑡 2𝑖𝑡 2 2𝑖 2 For the reduced form in equation (3.2.3), we first plug in for the unobserved heterogeneity ℎ𝑖 using (3.5.8) and (3.5.11). Recall that ℎ𝑖 ≡ 𝛽1 𝑑𝑖 + 𝑐𝑖 . We first obtain 𝑐𝑖 as a function of 𝑥¯2𝑖 . To do this, 79 we substitute for 𝑥¯1𝑖 in (3.5.8) using equation (3.2.2). Averaging (3.2.2) over all time periods for which 𝑠𝑖𝑡 = 1, we get 𝑥¯1𝑖 = 𝑥¯2𝑖 𝜋 + 𝑑𝑖 + 𝑟¯𝑖 . (3.5.14) Plugging in for 𝑑𝑖 from (3.5.11) in this equation, we have 𝑥¯1𝑖 = 𝑥¯2𝑖 (𝜋 + 𝜉2 ) + 𝜓2 + 𝑎 2𝑖 + 𝑟¯𝑖 . (3.5.15) Plugging this into equation (3.5.8), 𝑐𝑖 = 𝜓1 + 𝜉11 [𝑥¯2𝑖 (𝜋 + 𝜉2 ) + 𝜓2 + 𝑎 2𝑖 + 𝑟¯𝑖 ] + 𝑥¯2𝑖 𝜉12 + 𝑎 1𝑖 . (3.5.16) Thus, using equations (3.5.11) and (3.5.16), we can write ℎ𝑖 as ℎ𝑖 ≡ 𝛽1 𝑑𝑖 + 𝑐𝑖 = 𝛽1 (𝜓2 + 𝑥¯2𝑖 𝜉2 + 𝑎 2𝑖 ) + 𝜓1 + 𝜉11 [ 𝑥¯2𝑖 (𝜋 + 𝜉2 ) + 𝜓2 + 𝑎 2𝑖 + 𝑟¯𝑖 ] + 𝑥¯2𝑖 𝜉12 + 𝑎 1𝑖 . (3.5.17) Plugging this into equation (3.2.3) and re-arranging, we get 𝑦𝑖𝑡 = 𝑥 2𝑖𝑡 𝛾 + 𝜓 + 𝑥¯2𝑖 𝛿 + 𝑒𝑟𝑟𝑜𝑟𝑖𝑡 . (3.5.18) where 𝜓 ≡ 𝜓1 + 𝜉11 𝜓2 + 𝛽1 𝜓2 , 𝛿 ≡ 𝜉11 (𝜋 + 𝜉2 ) + 𝜉12 + 𝛽1 𝜉2 , and 𝑒𝑟𝑟𝑜𝑟𝑖𝑡 ≡ 𝜉11 (𝑎 2𝑖 + 𝑟¯𝑖 ) + 𝑎 1𝑖 + 𝛽1 𝑎 2𝑖 + 𝑣𝑖𝑡 . Estimating this model using POLS with the 𝑠𝑖𝑡 = 0 observations is equivalent to doing GMM with the following moment functions. 𝑔3𝑖 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) = (1 − 𝑠𝑖𝑡 ) 𝑥´2𝑖𝑡 0 (𝑦 − 𝑥 𝛾 − 𝜓 − 𝑥¯ 𝛿). (3.5.19) 𝑖𝑡 2𝑖𝑡 2𝑖 So the final set of moment functions is given by 0 (𝑦 − 𝑥 𝛽 − 𝜓 − 𝑥¯ 𝜉 )  Í𝑇      𝑡=1 𝑠 𝑖𝑡 ´ 𝑥 𝑖𝑡 𝑖𝑡 𝑖𝑡 1 𝑖 1     𝑔 1𝑖 (𝛽, 𝜓 1 , 𝜉 1 )    Í    𝑔𝑖 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) =   𝑇 𝑠 ´ 𝑥 0 (𝑥 − 𝑥 𝜋 − 𝜓 − ¯ 𝑥 𝜉 )  ≡  𝑔 (𝜋, 𝜓 , 𝜉 ) . 𝑡=1 𝑖𝑡 2𝑖𝑡 1𝑖𝑡 2𝑖𝑡 2 2𝑖 2  2𝑖 2 2  Í     𝑇  𝑡=1 (1 − 𝑠𝑖𝑡 ) 𝑥´2𝑖𝑡 0 (𝑦 − 𝑥 𝛾 − 𝜓 − 𝑥¯ 𝛿)   𝑔 (𝛽, 𝜋, 𝜓 , 𝜓 , 𝜉 , 𝜉 )   𝑖𝑡 2𝑖𝑡 2𝑖   3𝑖  1 2 1 2  (3.5.20) Lemma 3.5.4.1. Under Assumptions 3.5.4.1 and 3.5.4.2, E[𝑔𝑖 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 )] = 0. The rest of the GMM estimation proceeds as usual using the moment conditions in (3.5.21). Define the variance-covariance matrix of the moment functions in (3.5.20) as Λ ≡ E[𝑔𝑖 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 )𝑔𝑖 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) 0]. (3.5.21) 80 and let Λ̂ be a consistent estimate of Λ. Then we define the optimal GMM estimator based on moment conditions (3.5.20) as follows. Definition 3.5.4.1. Call the estimator of [𝛽0 𝜋0 𝜓10 𝜓20 𝜉10 𝜉20 ] 0 that solves min 𝑔(𝛽, ¯ 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) 0 Ω̂ 𝑔(𝛽, ¯ 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ), (3.5.22) 𝛽,𝜋 and Ω̂ = Λ̂−1 , the joint Mundlak Í𝑁 ¯ where 𝑔(𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) = 𝑖=1 𝑖 𝑔 (𝛽, 𝜋, 𝜓1 , 𝜓2 , 𝜉1 , 𝜉2 ) estimator. Denote the estimator of 𝛽 from this vector as 𝛽ˆ𝐽𝑜𝑖𝑛𝑡 𝑀𝑢𝑛𝑑𝑙𝑎𝑘 . 3.6 Estimation under sequential exogeneity As is well known, the strict exogeneity Assumption 3.2.1 rules out lagged dependent variables and feedback from past shocks to current covariates in the model of interest (3.2.1).9 For instance, if 𝑥𝑖𝑡 contains a policy variable, then Assumption 3.2.1 imposes that there is no feedback where policy is more likely to occur based on past shocks. 
Or if (3.2.1) is a wage equation and one of the covariates is union status, then it rules out a negative wage shock today leading to someone deciding to join the union next year. Assumption 3.2.2’ imposes these restrictions on the imputation model (3.2.2). In order to allow for such effects, we relax Assumption 3.2.1 and 3.2.2 to sequential exogeneity Assumptions 3.6.1 and 3.6.2. Assumption 3.6.1. E(x𝑖𝑡0𝑢𝑖𝑡 ) = 0, 𝑡 = 1, . . . , 𝑇, where x𝑖𝑡 = (𝑥𝑖𝑡 , 𝑥𝑖,𝑡−1 , . . . , 𝑥𝑖1 ). This assumes correct distributed lag dynamics but is silent on feedback as it allows for 𝑢𝑖𝑡 to be arbitrarily correlated with 𝑥𝑖,𝑡+𝑠 for 𝑠 ∈ {1, . . . , 𝑇 − 𝑡}. For the imputation model (3.2.2), we relax Assumption 3.2.2 to the following. Assumption 3.6.2: E(x𝑡0 𝑟 ) = 0, 2𝑖 𝑖𝑡 where x𝑡2𝑖 = (𝑥 2𝑖𝑡 , 𝑥2𝑖,𝑡−1 , . . . , 𝑥 2𝑖1 ). Under these assumptions, we can use an alternative transformation called “forward orthogonal- ization" suggested by Arellano & Bover (1995). It demeans data using average over future time 9Wooldridge (2010), Chapter 10 81 periods instead of average over all time periods. It thus preserves sequential exogeneity while still using as much data as possible. We begin with the model of interest (3.2.1). At time 𝑡 ≤ 𝑇 − 1, consider the equations for 𝑡 + 1, . . . , 𝑇. 𝑦𝑖,𝑡+1 = 𝛽1 𝑥 1𝑖,𝑡+1 + 𝑥 2𝑖,𝑡+1 𝛽2 + 𝑐𝑖 + 𝑢𝑖,𝑡+1 .. . 𝑦𝑖𝑇 = 𝛽1 𝑥1𝑖𝑇 + 𝑥2𝑖𝑇 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑇 . In order to time demean (3.2.1), we can naturally use only those future time periods for which 𝑥 1 is observed. Define 𝑇 Õ 𝑇𝑖 (𝑡) = 𝑠𝑖𝑞 (3.6.1) 𝑞=𝑡+1 as the number of time periods for which 𝑥 1 is observed after time 𝑡 for unit 𝑖. Multiply each equation for 𝑡 + 1 ≤ 𝑞 ≤ 𝑇 by 𝑠𝑖𝑞 and sum Õ 𝑇  Õ 𝑇   Õ 𝑇   Õ𝑇  𝑠𝑖𝑞 𝑦𝑖𝑞 = 𝛽1 𝑠𝑖𝑞 𝑥1𝑖𝑞 + 𝑠𝑖𝑞 𝑥 2𝑖𝑞 𝛽2 + 𝑇𝑖 (𝑡)𝑐𝑖 + 𝑠𝑖𝑞 𝑢𝑖𝑞 . (3.6.2) 𝑞=𝑡+1 𝑞=𝑡+1 𝑞=𝑡+1 𝑞=𝑡+1 Multiplying through by 𝑇𝑖 (𝑡) −1 gives 𝑦¯ 𝑖 (𝑡) = 𝛽1 𝑥¯1𝑖 (𝑡) + 𝑥¯2𝑖 (𝑡) 𝛽2 + 𝑐𝑖 + 𝑢¯𝑖 (𝑡), (3.6.3) where 𝑦¯ 𝑖 (𝑡) = 𝑇𝑖 (𝑡) −1 𝑇𝑞=𝑡+1 𝑠𝑖𝑞 𝑦𝑖𝑞 is the average of the observed 𝑦𝑖𝑞 after time 𝑡 and 𝑥¯1𝑖 (𝑡), 𝑥¯2𝑖 (𝑡) Í and 𝑢¯𝑖 (𝑡) are defined similarly. Subtracting this equation from (3.2.1), which is the equation at time 𝑡 gives 𝑦𝑖𝑡 − 𝑦¯ 𝑖 (𝑡) = 𝛽1 [𝑥 1𝑖𝑡 − 𝑥¯1𝑖 (𝑡)] + [𝑥 2𝑖𝑡 − 𝑥¯2𝑖 (𝑡)] 𝛽2 + [𝑢𝑖𝑡 − 𝑢¯𝑖 (𝑡)] (3.6.4) or 𝑦˜ 𝑖 (𝑡) = 𝛽1 𝑥˜1𝑖 (𝑡) + 𝑥˜2𝑖 (𝑡) 𝛽2 + 𝑢˜𝑖 (𝑡). (3.6.5) Subtracting the forward averages thus eliminates 𝑐𝑖 just as with the usual within transformation. Now we use 𝑥1𝑖 𝑝 and 𝑥 2𝑖 𝑝 , 𝑝 ≤ 𝑡 as instrumental variables in this equation, and use only those time 82 periods for which 𝑠𝑖𝑡 = 1, i.e. the complete cases. This gives the following moment functions.   𝑠𝑖 𝑝 𝑥 𝑠𝑖𝑡 [ 𝑦˜ 𝑖 (𝑡) − 𝛽 𝑥˜ (𝑡) − 𝑥˜ (𝑡) 𝛽 ]  1𝑖 𝑝 1 1𝑖 2𝑖 2  𝑚 1𝑖 (𝛽) =  𝑝 ≤ 𝑡, 𝑡 = 1, . . . , 𝑇 − 1. (3.6.6)    𝑥 0 𝑠 [ 𝑦˜ (𝑡) − 𝛽 𝑥˜ (𝑡) − 𝑥˜ (𝑡) 𝛽 ]   2𝑖 𝑝 𝑖𝑡 𝑖 1 1𝑖 2𝑖 2    We require an additional selection indicator for the first set of moment conditions here as in addition to 𝑥 1𝑖𝑡 , these moment conditions also require 𝑥 1𝑖 𝑝 to be observed for it to be used as an instrumental variable. Since the moment conditions in (3.6.6) utilize only the complete cases, they leave room for gains in efficiency by utilizing the incomplete cases. We can again implement forward orthogonalization with time demeaning using complete cases to estimate 𝜋 in (3.2.2). Similar to (3.6.4), we can write 𝑥 1𝑖𝑡 − 𝑥¯1𝑖 (𝑡) = [𝑥 2𝑖𝑡 − 𝑥¯2𝑖 (𝑡)]𝜋 + [𝑟𝑖𝑡 − 𝑟¯𝑖 (𝑡)], (3.6.7) where 𝑟¯𝑖 (𝑡) = 𝑇𝑖 (𝑡) −1 𝑇𝑞=𝑡+1 𝑠𝑖𝑞 𝑟𝑖𝑞 . Multiplying through by 𝑇𝑖 (𝑡) −1 , we get Í 𝑥˜1𝑖 (𝑡) = 𝑥˜2𝑖 (𝑡)𝜋 + 𝑟˜𝑖 (𝑡). 
(3.6.8) Using 𝑥2𝑖 𝑝 , 𝑝 ≤ 𝑡 as instrumental variables and using only the complete cases, we get the moment functions 𝑚 2𝑖 (𝜋) = 𝑥2𝑖 0 𝑠 [ 𝑥˜ (𝑡) − 𝑥˜ (𝑡)𝜋] 𝑝 ≤ 𝑡, 𝑡 = 1, . . . , 𝑇 − 1. (3.6.9) 𝑝 𝑖𝑡 1𝑖 2𝑖 Similar to Section 4, the moment conditions that allow gains in efficiency come from the reduced form (3.2.3). Here we do the forward orthogonalization using incomplete cases. Let 𝑇 𝑦˘ 𝑖 (𝑡) = 𝑦𝑖𝑡 − 𝑇 − 𝑡 − 𝑇𝑖 (𝑡) −1  Õ (1 − 𝑠𝑖𝑞 )𝑦𝑖𝑞 𝑞=𝑡+1 𝑇  −1 Õ 𝑥˘2𝑖 (𝑡) = 𝑥2𝑖𝑡 − 𝑇 − 𝑡 − 𝑇𝑖 (𝑡) (1 − 𝑠𝑖𝑞 )𝑥2𝑖𝑞 . 𝑞=𝑡+1 We can then write 𝑦˘ 𝑖 (𝑡) = 𝑥˘2𝑖 (𝑡)𝛾 + 𝑣˘ 𝑖𝑡 . (3.6.10) We estimate 𝛾 ≡ (𝛽1 𝜋 + 𝛽2 ) using incomplete cases as well. This gives moment functions 0 (1 − 𝑠 ) [ 𝑦˘ (𝑡) − 𝑥˘ (𝑡)(𝛽 𝜋 + 𝛽 )] 𝑚 3𝑖 (𝛽, 𝜋) = 𝑥 2𝑖 𝑝 ≤ 𝑡, 𝑡 = 1, . . . , 𝑇 − 1. (3.6.11) 𝑝 𝑖𝑡 𝑖 2𝑖 1 2 83 The full set of moment functions is given by    𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 [ 𝑦˜ 𝑖 (𝑡) − 𝛽1 𝑥˜1𝑖 (𝑡) − 𝑥˜2𝑖 (𝑡) 𝛽2 ]       𝑚 1𝑖 (𝛽)      0   𝑥 2𝑖 𝑝 𝑠𝑖𝑡 [ 𝑦˜ 𝑖 (𝑡) − 𝛽1 𝑥˜1𝑖 (𝑡) − 𝑥˜2𝑖 (𝑡) 𝛽2 ]     𝑚𝑖 (𝛽, 𝜋) =  𝑚 2𝑖 (𝜋)  =   𝑝 ≤ 𝑡, 𝑡 = 1, . . . , 𝑇 − 1. 0     𝑥 2𝑖 𝑝 𝑠𝑖𝑡 [ 𝑥˜1𝑖 (𝑡) − 𝑥˜2𝑖 (𝑡)𝜋]  𝑚 3𝑖 (𝛽, 𝜋)          𝑥 0 (1 − 𝑠 ) [ 𝑦˘ (𝑡) − 𝑥˘ (𝑡)(𝛽 𝜋 + 𝛽 )]   2𝑖 𝑝 𝑖𝑡 𝑖 2𝑖 1 2    (3.6.12) The moment functions 𝑚𝑖 (𝛽, 𝜋) have a zero mean if Assumptions 3.6.1 and 3.6.2 hold and s𝑖 |= (x𝑖 , u𝑖 , r𝑖 ).10 However, if we want to allow the selection to be more general (for instance, depend on x2𝑖 or other unobserved variables), we need to strengthen Assumptions 3.6.1 and 3.6.2 to the following zero conditional mean assumptions. Assumption 3.6.1’: E(𝑢𝑖𝑡 |x𝑖𝑡 , s𝑖 , 𝑐𝑖 ) = 0, 𝑡 = 1, . . . , 𝑇 . Assumption 3.6.2’: E(𝑟𝑖𝑡 |x𝑡2𝑖 , s𝑖 , 𝑑𝑖 ) = 0, 𝑡 = 1, . . . .𝑇 . Note that although Assumptions 3.6.1’ and 3.6.2’ allow the covariates to be sequentially exoge- nous in both the model of interest (3.2.1) and the imputation model (3.2.2), selection is assumed to be strictly exogenous in both models. This is because in the moment functions 𝑚 1𝑖 (𝛽) and 𝑚 2𝑖 (𝜋), 𝑦¯ 𝑖 (𝑡), 𝑥¯1𝑖 (𝑡) and 𝑥¯2𝑖 (𝑡) depend non-linearly on all selection indicators from 𝑡 + 1 to 𝑇 and we use instruments with 𝑝 ≤ 𝑡. Therefore, we need selection to be strictly, and not just sequentially, exogenous for these moment functions to have a zero mean. Moreover, Assumption 3.6.1’ allows selection to be arbitrarily correlated with x𝑖 and 𝑐𝑖 . Assumption 3.6.2’ allows selection to be arbitrarily correlated with x2𝑖 and 𝑑𝑖 , but it rules out selection depending on 𝑥 1 once we condition on 𝑥 2 . Thus together, Assumptions 3.6.1’ and 3.6.2’ allow selection to depend on x2𝑖 , 𝑐𝑖 and 𝑑𝑖 , but not r𝑖 or u𝑖 . We summarize the conditions under which the moment functions in (3.6.12) have an expected value of zero in the following lemma. Lemma 3.6.1: E[𝑚𝑖 (𝛽, 𝜋)] = 0 if either of the following conditions hold. (i) s𝑖 |= (x𝑖 , u𝑖 , r𝑖 ) and Assumptions 3.6.1 and 3.6.2 hold. 10Recall that this is weaker than MCAR as it allows s𝑖 to depend on 𝑐𝑖 and 𝑑𝑖 . 84 (ii) 𝑠𝑖𝑡 is a function of x2𝑖 or some other random variable 𝑤𝑖𝑡 and Assumptions 3.6.1’ and 3.6.2’ hold.. Then, E[𝑚𝑖 (𝛽, 𝜋)] = 0 gives us a set of (3𝑘 + 1)𝑇 (𝑇 − 1)/2 moment conditions with 2𝑘 + 1 parameters and hence number of over-identifying restrictions depends on 𝑇. We can use the regular two-step GMM estimator using these moment conditions. One way to test for exogeneity of s𝑖 with respect to {𝑢𝑖𝑡 : 𝑡 = 1, . . . , 𝑇 } is to include selection indicators from other time periods as covariates in equation (3.2.1) and check for their significance at time 𝑡. 
For instance, one might be concerned that a shock today causes people to drop out from the sample in the next time period. Then one can add 𝑠𝑖,𝑡+1 as a covariate at time 𝑡 (so that the last time period is lost), estimate the model using the moment conditions in (3.6.6), and compute the robust 𝑡-statistic on 𝑠𝑖,𝑡+1 . Another option is to use 𝑠𝑖,𝑡−1 as a covariate, but that does not work in the case of attrition when it is an absorbing state because if 𝑠𝑖𝑡 = 1 for 𝑖, then so is 𝑠𝑖,𝑡−1 . Note that this test can be used even if one is only using the complete cases11, as it does not even require us to write down the imputation equation (3.2.2). But when using the GMM based on full set of moment conditions in (3.6.12), one can also test for the exogeneity of s𝑖 with respect to {𝑟𝑖𝑡 : 𝑡 = 1, . . . , 𝑇 } by including 𝑠𝑖,𝑡+1 as a covariate in the imputation equation (3.2.2) at time 𝑡, estimating the model using moment conditions in (3.6.9), and computing the robust 𝑡-statistic on 𝑠𝑖,𝑡+1 . However, what we are most likely to be concerned about in an application is the contemporaneous selection problem, that is, 𝑠𝑖𝑡 being correlated with 𝑢𝑖𝑡 . But one cannot test for 𝑠𝑖𝑡 by including it as a covariate in either (3.2.1) or (3.2.2). This is because both of these models are estimated using complete cases and hence 𝑠𝑖𝑡 will always equal 1 for the observations used in moment conditions in (3.6.6) and (3.6.9). The reduced form in (3.2.3), however, provides a way to test for 𝑠𝑖𝑡 as it can be used for all observations 𝑖 irrespective of whether 𝑠𝑖𝑡 = 0 or 𝑠𝑖𝑡 = 1. Since 𝑦𝑖𝑡 and 𝑥 2𝑖𝑡 are observed for all observations, instead of (3.6.10), we can use the following 11that is, only the moment conditions in (3.6.6) 85 moment conditions. 0  E[𝑥 2𝑖 𝑝 𝑦˘ 𝑖𝑡 − 𝑥˘2𝑖𝑡 (𝛽1 𝜋 + 𝛽2 ) ] = 0 𝑝 ≤ 𝑡, 𝑡 = 1, . . . , 𝑇 − 1. (3.6.13) We have simply removed the (1 − 𝑠𝑖𝑡 ) from (3.6.10), which means that instead of restricting these moment conditions to the incomplete cases, we are using all observations. Then we can test for the exogeneity of 𝑠𝑖𝑡 with respect to {𝑣𝑖𝑡 : 𝑡 = 1, . . . , 𝑇 } by including 𝑠𝑖𝑡 as a covariate in the reduced form (3.2.3) at time 𝑡, estimating the model using the moment conditions in (3.6.12), and computing the robust 𝑡-statistic on 𝑠𝑖𝑡 . The null hypothesis here is that 𝑠𝑖𝑡 is uncorrelated with 𝑣𝑖𝑡 . Since 𝑣𝑖𝑡 = 𝑢𝑖𝑡 + 𝛽1𝑟𝑖𝑡 , if we reject the null, then we can conclude that 𝑠𝑖𝑡 is correlated with either 𝑢𝑖𝑡 or 𝑟𝑖𝑡 or both. Since we require both of these correlations to be zero in order for the moment conditions in (3.6.12) to be valid, a rejection would bring the validity of this method into question irrespective of which idiosyncratic error 𝑠𝑖𝑡 is correlated with. Finally, we can also use this test for 𝑠𝑖𝑡 in the framework of Section 4 where we are assuming strict exogeneity of the covariates with respect to the idiosyncratic errors. In that case, we simply include 𝑠𝑖𝑡 as a covariate in the reduced form (3.2.3) at time 𝑡 and estimate the model using the following moment conditions Õ 𝑇 E[ 0 𝑦¤ − 𝑥¤ (𝛽 𝜋 + 𝛽 )  ] = 0. 𝑥¤2𝑖𝑡 (3.6.14) 𝑖𝑡 2𝑖𝑡 1 2 𝑡=1 instead of those in (3.6.13), and computing the robust 𝑡-statistic on 𝑠𝑖𝑡 . The moment conditions in (3.6.14) are essentially the same as in (3.4.9) except that we have removed the selection indicator (1 − 𝑠𝑖𝑡 ) just like in the case of sequential exogeneity. Note that all the tests discussed here require that 𝑇 ≥ 3. 
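A sketch of the contemporaneous-selection variable addition test in its strict-exogeneity form: pooled OLS of the incomplete-case-demeaned y on the demeaned x_2 plus s_it, over all observations, with unit-clustered standard errors; the robust t-statistic on s_it is the test statistic. This pooled-OLS formulation is a stand-in for the exactly identified GMM based on (3.6.14); the long-format data layout and names are assumptions of the sketch.

# Variable addition test for contemporaneous selection (strict exogeneity case):
# regress y_dot on x2_dot and s_it using all observations, cluster by unit, and
# inspect the robust t-statistic on s_it.  x2_dot should exclude columns that
# are constant within a unit (they demean to zero).
import numpy as np
import statsmodels.api as sm

def selection_vat(y_dot, x2_dot, s, unit_id):
    """y_dot, s, unit_id: shape (NT,); x2_dot: shape (NT, k_demeaned)."""
    X = np.column_stack([x2_dot, s])             # add s_it as an extra covariate
    res = sm.OLS(y_dot, X).fit(cov_type="cluster", cov_kwds={"groups": unit_id})
    return res.tvalues[-1], res.pvalues[-1]      # robust t-stat and p-value on s_it

A rejection is evidence that s_it is correlated with v_it = u_it + β_1 r_it, in line with the null hypothesis described above.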
3.7 Conclusion

We have provided new methods for consistently imputing missing covariate values in linear panel data models with unobserved heterogeneity when using fixed effects. We provide imputation estimators under both strict and sequential exogeneity of the covariates. We relax some substantial assumptions made by currently used imputation estimators, most notably by allowing the covariate with missing values to contain individual-specific unobserved heterogeneity. We also provide two tests of the assumptions underlying our imputation procedure. The first is a GMM overidentification test of the validity of the moment conditions; the second is a novel variable addition test of whether missingness in a given time period is uncorrelated with the unobservables, both in the same time period and in other time periods.

APPENDICES

APPENDIX A

PROOFS FOR CHAPTER 1

Proof of Proposition 1.5.1.2

We know that
$$\mathrm{Avar}\big(\sqrt{n}\,[(\hat{\beta}',\, \mathrm{vec}(\hat{\Pi})')' - (\beta',\, \mathrm{vec}(\Pi)')']\big) = (D'C^{-1}D)^{-1}.$$
Now,
$$D'C^{-1}D =
\begin{bmatrix} D_{11}' & 0 \\ 0 & D_{22}' \end{bmatrix}
\begin{bmatrix} C_{11} & C_{12} \\ C_{12}' & C_{22} \end{bmatrix}^{-1}
\begin{bmatrix} D_{11} & 0 \\ 0 & D_{22} \end{bmatrix}
+
\begin{bmatrix} 0 & D_{41}' \\ D_{32}' & D_{42}' \end{bmatrix}
\begin{bmatrix} C_{33} & 0 \\ 0 & C_{44} \end{bmatrix}^{-1}
\begin{bmatrix} 0 & D_{32} \\ D_{41} & D_{42} \end{bmatrix}
\equiv G + HIH',$$
where
$$G \equiv \begin{bmatrix} D_{11}' & 0 \\ 0 & D_{22}' \end{bmatrix}
\begin{bmatrix} C_{11} & C_{12} \\ C_{12}' & C_{22} \end{bmatrix}^{-1}
\begin{bmatrix} D_{11} & 0 \\ 0 & D_{22} \end{bmatrix}, \qquad
H \equiv \begin{bmatrix} 0 & D_{41}' \\ D_{32}' & D_{42}' \end{bmatrix}, \qquad
I \equiv \begin{bmatrix} C_{33} & 0 \\ 0 & C_{44} \end{bmatrix}^{-1}.$$
Using the matrix inversion lemma,
$$(D'C^{-1}D)^{-1} = (G + HIH')^{-1} = G^{-1} - G^{-1}H\,(I^{-1} + H'G^{-1}H)^{-1}H'G^{-1},$$
and thus
$$G^{-1} - (D'C^{-1}D)^{-1} = G^{-1}H\,(I^{-1} + H'G^{-1}H)^{-1}H'G^{-1}. \qquad (A.1)$$
Let $E \equiv (I^{-1} + H'G^{-1}H)^{-1}$. Now,
$$G^{-1} = \begin{bmatrix} D_{11}^{-1} C_{11} D_{11}^{-1\prime} & D_{11}^{-1} C_{12} D_{22}^{-1\prime} \\ D_{22}^{-1} C_{21} D_{11}^{-1\prime} & D_{22}^{-1} C_{22} D_{22}^{-1\prime} \end{bmatrix},$$
and the asymptotic variance of the complete cases GMM is given by the upper left $(k+1) \times (k+1)$ block of $G^{-1}$. Therefore, the difference between the asymptotic variances of the complete cases estimator and the proposed estimator is given by the upper left $(k+1) \times (k+1)$ block of the expression on the right hand side of (A.1). For this we need the first $(k+1)$ columns of $H'G^{-1}$, which are given by
$$\begin{bmatrix} D_{32} D_{22}^{-1} C_{21} D_{11}^{-1\prime} \\ D_{41} D_{11}^{-1} C_{11} D_{11}^{-1\prime} + D_{42} D_{22}^{-1} C_{21} D_{11}^{-1\prime} \end{bmatrix}. \qquad (A.2)$$
For the difference corresponding to $\beta_1$, we need the first column of this matrix. To find it, consider
$$D_{11}^{-1} = [\mathrm{E}(s_1 s_2\, x'z)]^{-1}
= \begin{bmatrix} J^{-1} & -J^{-1} K_1 K_2^{-1} \\ -K_2^{-1} K_4 J^{-1} & (K_2 - K_4 K_3^{-1} K_1)^{-1} \end{bmatrix},$$
where $J \equiv \mathrm{E}(s_1 s_2 x_1' z_1) - \mathrm{E}(s_1 s_2 x_1 x_2)[\mathrm{E}(s_1 s_2 x_2' x_2)]^{-1}\mathrm{E}(s_1 s_2 x_2' z_1)$, $K_1 \equiv \mathrm{E}(s_1 s_2 x_1 x_2)$, $K_2 \equiv \mathrm{E}(s_1 s_2 x_2' x_2)$, $K_3 \equiv \mathrm{E}(s_1 s_2 x_1 z_1)$, and $K_4 \equiv \mathrm{E}(s_1 s_2 x_2' z_1)$.
The first column and the last 𝑘 columns of this matrix are given by 𝑊1 and 𝑊2 respectively, where    1   −1 𝑊1 =  (A.0.1)  𝐽 −𝐾 −1 𝐾   2 4     −𝐽 −1 𝐾 𝐾 −1 1   2 𝑊2 =  (A.0.2)     (𝐾 − 𝐾 𝐾 −1 𝐾 ) −1   2 4 3 1    Now, the first column of the matrix in (A.2) is given by     𝐷 𝐷 −1 𝐶 𝑊  𝐴  32 22 21 1   1  ≡     (𝐷 𝐷 −1𝐶 + 𝐷 𝐷 −1𝐶 )𝑊   𝐵   41 11 11 42 22 21 1    1    Similarly, the last 𝑘 columns of the matrix in (A.2) are given by      𝐷 32 𝐷 −122 𝐶 21 𝑊 2  𝐴    2 ≡     (𝐷 𝐷 −1𝐶 + 𝐷 𝐷 −1𝐶 )𝑊   𝐵   41 11 11 42 22 21 2    2    Thus, the difference corresponding to 𝛽 𝑗 , 𝑗 = 1, 2 is   h i 𝐴 𝑗  0 𝐴𝑗 𝐵𝑗 0 𝐸     𝐵   𝑗   90 as stated in the proposition.  Proof of proposition 1.5.2.1 When we have two distinct samples containing (𝑦, 𝑧) and (𝑥, 𝑧), and hence the estimation is based only on 𝑔3 (.) and 𝑔4 (.). Thus     h i 𝐶 33 0   0 𝐷  32 ℎ(𝛽, Π) = 𝑔3 (Π)𝑔4 (𝛽, Π) 𝐶= 𝐷=      .  0 𝐶  𝐷   44   41 𝐷 42      The first step solves 𝑛 1Õ 𝑔3 (𝑥𝑖 , 𝑧𝑖 , 𝑠2𝑖 , Π̆) = 0. 𝑛 𝑖=1 By standard GMM theory, √ 𝑑 0 𝑛( Π̆ − Π) −−−−→ 𝑁 (0, 𝑉2 ) where 𝑉2 = 𝐷 −1 −1 32 𝐶33 𝐷 32 . The second step solves min ℎ¯ 4 (𝛽, Π̆) 0 Ω̆1 ℎ¯ 4 (𝛽, Π̆), 𝛽 1 Í𝑛 where ℎ¯ 4 (𝛽, Π) = 𝑔 (𝑦𝑖 , 𝑧𝑖 , 𝑠1𝑖 , 𝛽, Π). The first order condition is given by 𝑛 𝑖=1 4 𝐷ˆ 41 Ω̆1 ℎ¯ 4 ( 𝛽, ˘ Π̆) = 0 (A.3) 𝜕 ℎ¯ 4 ( 𝛽, ˘ Π̆) 𝑝 where 𝐷ˆ 41 = , Ω̆1 −−−−→ Ω1 , and Ω1 is a general weight matrix. A Taylor expansion of 𝜕𝛽 ¯ℎ4 ( 𝛽, ˘ Π̆) around 𝛽 gives ℎ¯ 4 ( 𝛽, ˘ Π̆) = ℎ¯ 4 (𝛽, Π̆) + 𝐷¯ 41 ( 𝛽˘ − 𝛽), 𝜕 ℎ¯ 4 ( 𝛽, ¯ Π̆) where 𝐷¯ 41 = and 𝛽¯ ∈ [𝛽, 𝛽]. ˘ Substituting in (A.3) 𝜕𝛽 𝐷ˆ 41 Ω̆1 ℎ¯ 4 (𝛽, Π̆) + 𝐷ˆ 41 Ω̆1 𝐷¯ 41 ( 𝛽˘ − 𝛽) = 0. Thus, √ √ 𝑛( 𝛽˘ − 𝛽) = −( 𝐷ˆ 41 Ω̆1 𝐷¯ 41 ) −1 𝐷ˆ 41 Ω̆1 𝑛 ℎ¯ 4 (𝛽, Π̆). 91 Now, a Taylor expansion of ℎ¯ 4 (𝛽, Π̆) around Π gives ℎ¯ 4 (𝛽, Π̆) = ℎ¯ 4 (𝛽, Π) + 𝐷¯ 42 ( Π̆ − Π), 𝜕 ℎ¯ 4 (𝛽, Π̄) where 𝐷¯ 42 = and Π̄ ∈ [Π, Π̆]. Thus, 𝜕 𝑣𝑒𝑐Π √ √ √ 𝑛( 𝛽˘ − 𝛽) = −( 𝐷ˆ 41 Ω̆1 𝐷¯ 41 ) −1 𝐷ˆ 41 Ω̆1 [ 𝑛 ℎ¯ 4 (𝛽, Π) + 𝐷¯ 42 𝑛 ( Π̆ − Π)]. √ √ Now, let 𝑍 ≡ [ 𝑛 ℎ¯ 4 (𝛽, Π) + 𝐷¯ 42 𝑛 ( Π̆ − Π)]. Since √ 𝑑 √ 𝑑 𝑛 ℎ¯ 4 (𝛽, Π) −−−−→ 𝑁 (0, 𝐶44 ) and 𝑛 ( Π̆ − Π) −−−−→ 𝑁 (0, 𝑉2 ), therefore 𝑑 𝑍 −−−−→ 𝑁 (0, Σ) where Σ = 𝐶44 + 𝐷 42𝑉2 𝐷 042 . Moreover, 𝑝 𝑝 𝑝 𝑝 𝐷ˆ 41 −−−−→ 𝐷 41 𝐷¯ 41 −−−−→ 𝐷 41 𝐷¯ 32 −−−−→ 𝐷 42 Ω̆1 −−−−→ Ω1 . Let 𝛽˘ ≡ 𝛽ˆ𝑇 𝑆2𝑆𝐿𝑆−𝑂 . Then, √ 𝑛( 𝛽˘ − 𝛽) = −[( 𝐷ˆ 41 Ω̆1 𝐷¯ 41 ) −1 𝐷ˆ 41 Ω̆1 − (𝐷 41 Ω1 𝐷 41 ) −1 𝐷 41 Ω1 ]𝑍 − (𝐷 41 Ω1 𝐷 41 ) −1 𝐷 41 Ω1 𝑍 = 𝑜 𝑝 (1) − (𝐷 41 Ω1 𝐷 41 ) −1 𝐷 41 Ω1 𝑍 where [( 𝐷ˆ 41 Ω̆1 𝐷¯ 41 ) −1 𝐷ˆ 41 Ω̆1 − (𝐷 41 Ω1 𝐷 41 ) −1 𝐷 41 Ω1 ] is 𝑜 𝑝 (1) because of the Slutsky’s theo- rem, 𝑍 is 𝑂 𝑝 (1), and 𝑜 𝑝 (1).𝑂 𝑝 (1) = 𝑜 𝑝 (1). Then, by the asymptotic equivalence lemma, √ 𝑑 𝑛( 𝛽˘ − 𝛽) −−−−→ 𝑁 (0, 𝑉1 ) where 𝑉1 = (𝐷 041 Ω1 𝐷 41 ) −1 𝐷 041 Ω1 Σ Ω1 𝐷 41 (𝐷 041 Ω1 𝐷 41 ) −1 . By standard GMM theory, the optimal weight matrix for this step is Σ−1 . Using this matrix gives 𝑉1∗ = (𝐷 041 Σ−1 𝐷 41 ) −1 . 92  Proof of proposition 1.5.2.3 √ The asymptotic variance of 𝑛( 𝛽ˆ − 𝛽) is given by the upper left ( 𝑝 + 𝑘) × ( 𝑝 + 𝑘) block of (𝐷 0𝐶 −1 𝐷) −1 . 
Now,       −1   0 𝐷 0  𝐶 −1 0   0 𝐷  41 32   (𝐷 0𝐶 −1 𝐷) −1 =     33            𝐷 0 𝐷 0   0 𝐶 −1  𝐷 𝐷    32 42   44   41 42        −1  𝐷 0 𝐶 −1 𝐷 0 −1 𝐷 41𝐶44 𝐷 42 41  =  41 44     𝐷 0 𝐶 −1 𝐷 0 −1 0 −1  42 44 41 𝐷 32𝐶33 𝐷 32 + 𝐷 42𝐶44 𝐷 42     Using the formula for the inversion of a block matrix, the upper left ( 𝑝 + 𝑘) × ( 𝑝 + 𝑘) block of this inverse is (𝐷 041𝐶44 −1 𝐷 − 𝐷 0 𝐶 −1 𝐷 (𝐷 0 𝐶 −1 𝐷 + 𝐷 0 𝐶 −1 𝐷 ) −1 𝐷 0 𝐶 −1 𝐷 ) −1 41 41 44 42 32 33 32 42 44 42 42 44 41 (A.4) √ On the other hand, we know 𝐴𝑣𝑎𝑟 ( 𝑛( 𝛽ˆ𝑇 𝑆2𝑆𝐿𝑆−𝑂 − 𝛽)) = (𝐷 041 Σ−1 𝐷 41 ) −1 (A.5) = (𝐷 041 (𝐶44 + 𝐷 42 (𝐷 032𝐶33 −1 𝐷 ) −1 𝐷 0 ) −1 𝐷 ) −1 32 42 41 = (𝐷 041 (𝐶44 −1 − 𝐶 −1 𝐷 (𝐷 0 𝐶 −1 𝐷 + 𝐷 0 𝐶 −1 𝐷 ) −1 𝐷 0 𝐶 −1 )𝐷 ) −1 44 42 32 33 32 42 44 42 42 44 41 = (𝐷 041𝐶44−1 𝐷 − 𝐷 0 𝐶 −1 𝐷 (𝐷 0 𝐶 −1 𝐷 + 𝐷 0 𝐶 −1 𝐷 ) −1 𝐷 0 𝐶 −1 𝐷 ) −1 41 41 44 42 32 33 32 42 44 42 42 44 41 (A.0.3) where the third equality uses the matrix inversion lemma. The result follows from the fact that (A.4) = (A.5).  Proof of proposition 1.5.2.5 With exact identification, 𝛽ˆ simply solves   𝑛 𝑔 (.)  1Õ ˆ = 0 where 3 ℎ(𝑦𝑖 , 𝑥𝑖 , 𝑧𝑖 , 𝑠1𝑖 , 𝑠2𝑖 , Π̂, 𝛽) ℎ(.) =    . 𝑛 𝑔 (.)  𝑖=1  4    93 This is the same as first solving 𝑛 1Õ 𝑔3 (𝑥𝑖 , 𝑧𝑖 , 𝑠2𝑖 , Π̆) = 0 𝑛 𝑖=1 for Π̆, and then solving 𝑛 1Õ 𝑔4 (𝑦𝑖 , 𝑧𝑖 , 𝑠1𝑖 , Π̆, 𝛽ˆ𝑇 𝑆2𝑆𝐿𝑆 ) = 0 𝑛 𝑖=1 for 𝛽ˆ𝑇 𝑆2𝑆𝐿𝑆 .  94 APPENDIX B TABLES FOR CHAPTER 1 Table B.1: Monte Carlo simulations, Design 1 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.008 0.035 0.036 -0.000 0.066 0.066 -0.023 0.055 0.059 Complete cases GMM 0.008 0.035 0.036 -0.001 0.066 0.066 -0.023 0.055 0.060 Imputation 0.009 0.029 0.031 -0.013 0.056 0.057 -0.011 0.047 0.048 Dummy variable method 0.008 0.035 0.036 0.154 0.065 0.167 0.153 0.054 0.162 Proposed GMM 0.008 0.027 0.028 -0.004 0.051 0.051 -0.011 0.043 0.044 Table B.2: Monte Carlo simulations, Design 2 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.013 0.069 0.070 -0.008 0.074 0.074 -0.024 0.065 0.069 Complete cases GMM 0.004 0.052 0.052 -0.005 0.067 0.068 -0.018 0.059 0.062 Imputation 0.008 0.060 0.061 -0.019 0.066 0.069 -0.010 0.058 0.059 Dummy variable method 0.013 0.069 0.070 0.146 0.069 0.161 0.151 0.060 0.163 Proposed GMM 0.010 0.041 0.042 -0.012 0.055 0.056 -0.011 0.049 0.050 Table B.3: Monte Carlo simulations, Design 3 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.009 0.076 0.077 0.003 0.083 0.083 -0.018 0.074 0.076 Complete cases GMM 0.002 0.056 0.056 0.004 0.074 0.074 -0.014 0.067 0.068 Imputation 0.006 0.065 0.065 -0.015 0.074 0.075 -0.005 0.064 0.064 Dummy variable method 0.009 0.076 0.077 0.176 0.078 0.193 0.181 0.066 0.193 Proposed GMM 0.010 0.044 0.045 -0.009 0.059 0.060 -0.009 0.053 0.054 95 Table B.4: Monte Carlo simulations, Design 4 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.013 0.069 0.070 -0.008 0.074 0.074 -0.024 0.065 0.069 Complete cases GMM 0.004 0.052 0.052 -0.005 0.067 0.068 -0.018 0.059 0.062 Imputation 0.008 0.060 0.060 -0.015 0.064 0.066 -0.013 0.057 0.058 Dummy variable method 0.013 0.070 0.070 0.001 0.060 0.060 0.003 0.052 0.052 Proposed GMM 0.010 0.040 0.041 -0.011 0.053 0.054 -0.011 0.047 0.049 Table B.5: Monte Carlo simulations, Design 5 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS -0.001 0.118 0.118 -0.006 0.145 0.146 0.023 0.166 0.168 Imputation -0.001 0.118 0.118 -0.004 0.136 0.136 0.019 0.148 0.149 
Proposed GMM 0.000 0.119 0.119 -0.006 0.136 0.136 0.018 0.149 0.150 Table B.6: Monte Carlo simulations, Design 6 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.013 0.180 0.180 -0.018 0.171 0.172 0.035 0.192 0.195 Imputation 0.013 0.180 0.180 -0.023 0.174 0.175 0.020 0.185 0.186 Proposed GMM 0.005 0.155 0.155 -0.012 0.150 0.150 0.030 0.165 0.167 Table B.7: Monte Carlo simulations, Design 7 𝛽1 𝛽22 𝛽23 Estimator Bias SD RMSE Bias SD RMSE Bias SD RMSE Complete cases 2SLS 0.003 0.198 0.198 -0.014 0.186 0.187 0.039 0.206 0.210 Imputation 0.003 0.198 0.198 -0.014 0.192 0.192 0.030 0.199 0.201 Proposed GMM 0.005 0.162 0.162 -0.012 0.155 0.156 0.031 0.172 0.175 96 Table B.8: Effect of physician’s advice on calorie consumption: complete cases versus the proposed estimator Estimator Complete cases GMM Proposed GMM Physician advised to lose weight 0.126 0.119 (0.099) (0.091) Age -0.004 -0.004 (0.00040) (0.00036) Female -0.294 -0.300 (0.011) (0.010) Black -0.054 -0.056 (0.013) (0.011) Other race -0.040 -0.142 (0.013) (0.011) 9 to 12 years of schooling 0.083 0.085 (0.024) (0.021) High school grad or equivalent 0.074 0.074 (0.022) (0.020) Some college or AA 0.049 0.063 (0.021) (0.019) College or above 0.053 0.060 (0.023) (0.021) Married -0.015 -0.019 (0.010) (0.009) Has high BP -0.002 -0.007 (0.015) (0.014) Has high cholesterol 0.005 -0.002 (0.019) (0.016) Has Arthritis -0.0005 0.006 (0.013) (0.012) Has heart condition -0.074 -0.073 (0.025) (0.023) Has Diabetes -0.079 -0.085 (0.020) (0.019) BMI 0.0007 0.0003 (0.003) (0.002) Monthly income < $2100 -0.019 -0.033 (0.016) (0.014) Monthly income between $2100 and $5400 -0.003 -0.013 (0.014) (0.012) Monthly income between $5400 and $8400 -0.017 -0.027 (0.015) (0.013) Is employed 0.086 0.081 (0.011) (0.010) 97 APPENDIX C FIGURES FOR CHAPTER 1 Figure C.1: Some admissible patterns of missingness (shaded areas represent complete cases) 1.1: Partial overlap 1.2: Univariate missing data 1.3: The TS2SLS case 𝑦 𝑥 𝑧 𝑦 𝑥 𝑧 𝑦 𝑥 𝑧 98 APPENDIX D PROOFS FOR CHAPTER 2 Proof of Lemma 2.4.1 Since 𝑊 is nonsingular by assumption, it suffices to show that E[𝑔(𝛼, 𝛽; 𝛿0 )] ≠ 0 for (𝛼, 𝛽) ≠ (𝛼0 , 𝛽0 ).1 We show this element-by-element of 𝑔(𝛼, 𝛽; 𝛿0 ). Starting with the weighted moment functions from the model of interest, given Assumptions 2.3.1 and 2.3.2 and the standard IPW argument, we know that E{[𝑠/𝐺 (𝑧, 𝛿0 )]𝑔1∗ (𝑦, 𝑥, 𝛼0 )} = E{[𝑠/𝑝(𝑧)]𝑔1∗ (𝑦, 𝑥, 𝛼0 )} = E{E([𝑠/𝑝(𝑧)]𝑔1∗ (𝑦, 𝑥, 𝛼0 )|𝑦, 𝑥, 𝑧)} = E{[E(𝑠|𝑦, 𝑥, 𝑧)/𝑝(𝑧)]𝑔1∗ (𝑦, 𝑥, 𝛼0 )} = E[𝑔1∗ (𝑦, 𝑥, 𝛼0 )]. Now, Assumption 2.2.1 implies that E[𝑔1∗ (𝑦, 𝑥, 𝛼)] ≠ 0 for any 𝛼 ≠ 𝛼0 . It follows that for any 𝛼 ≠ 𝛼0 , E{[𝑠/𝐺 (𝑧, 𝛿0 )]𝑔1∗ (𝑦, 𝑥, 𝛼)} ≠ 0. Moving on to the imputation model, first note that by iterated expectations, E(𝑠|𝑥 1 , 𝑥2 , 𝑧) = E[E(𝑠|𝑦, 𝑥1 , 𝑥2 , 𝑧)|𝑥1 , 𝑥2 , 𝑧] = E[E(𝑠|𝑧)|𝑥1 , 𝑥2 , 𝑧] = E(𝑠|𝑧) ≡ 𝑝(𝑧), where the second equality follows from Assumption 2.3.1. Now consider the weighted moment functions from the imputation model. E{[𝑠/𝐺 (𝑧, 𝛿0 )]𝑔2∗ (𝑥1 , 𝑥2 , 𝛽0 )} = E{[𝑠/𝑝(𝑧)]𝑔2∗ (𝑥 1 , 𝑥2 , 𝛽0 )} = E{E([𝑠/𝑝(𝑧)]𝑔2∗ (𝑥 1 , 𝑥2 , 𝛽0 )|𝑥 1 , 𝑥2 , 𝑧)} = E{[E(𝑠|𝑥 1 , 𝑥2 , 𝑧)/𝑝(𝑧)]𝑔2∗ (𝑥 1 , 𝑥2 , 𝛽0 )} = E[𝑔2∗ (𝑥1 , 𝑥2 , 𝛽0 )] = 0 1Note that even though 𝑔3 (𝛼, 𝛽; 𝛿) sometimes only identifies functions of (𝛼0 , 𝛽0 ) and not each element of (𝛼0 , 𝛽0 ) separately, the entire vector 𝑔(𝛼, 𝛽; 𝛿) still identifies (𝛼0 , 𝛽0 ) separately because 𝑔1 (𝛼; 𝛿) identifies 𝛼0 and 𝑔2 (𝛽; 𝛿) identifies 𝛽0 . 99 and the same argument as above applies for identification of 𝛽0 using Assumption 2.2.2. 
For the reduced form moment functions, identification of 𝛾0 simply follows from Assumption 2.2.3. Proof of Theorem 2.4.1 𝑝 Identification of (𝛼0 , 𝛽0 ) follows from Lemma 2.4.1 and 𝛿ˆ − → 𝛿0 follows from Assumption 2.3.2 and standard MLE theory. To complete the proof, we simply show that the objective function satisfies the weak uniform law of large numbers. By 5 and 6, |𝑔1 (𝑦, 𝑥, 𝑧, 𝑠, 𝛼; 𝛿0 )| ≤ 𝑎 −1 𝑏 1 (𝑦, 𝑥), all (𝑧, 𝑠), |𝑔2 (𝑥, 𝑧, 𝑠, 𝛽; 𝛿0 )| ≤ 𝑎 −1 𝑏 2 (𝑥), all (𝑧, 𝑠), |𝑔3 (𝑦, 𝑥2 , 𝛾)| ≤ 𝑏 3 (𝑦, 𝑥2 ). and by 6, E[𝑏(𝑦, 𝑥)] < ∞, where 𝑔1 (𝑦, 𝑥, 𝑧, 𝑠, 𝛼; 𝛿0 ), 𝑔2 (𝑥, 𝑧, 𝑠, 𝛽; 𝛿0 ), and 𝑔3 (𝑦, 𝑥2 , 𝛾) are as defined in (2.4.1). It follows from Lemma 2.4 in Newey and McFadden (1994) that 𝑁 Õ 𝑝 sup 𝑁 −1 ˆ − E[𝑔(𝑦, 𝑥, 𝑧, 𝑠, 𝛼, 𝛽, 𝛾; 𝛿0 )] − 𝑔(𝑦𝑖 , 𝑥𝑖 , 𝑧𝑖 , 𝑠𝑖 , 𝛼, 𝛽, 𝛾; 𝛿) → 0. (𝛼,𝛽,𝛾)∈A×B×Γ 𝑖=1 The rest of the proof is standard, see Wooldridge (2010, Section 12.4.1). Proof of Theorem 2.4.2 For notational convenience, let 𝜏 ≡ (𝛼0, 𝛽0) 0. First we will show that √ 𝑑 𝑁∇𝜏 𝑄(𝜏ˆ 0 ; 𝛿)ˆ − → 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 𝐷 00𝑊 𝐹0𝑊 𝐷 0 ). ˆ Since 𝑄(𝜏; ˆ = 𝑔(𝜏; 𝛿) ¯ 𝛿) ˆ 0 𝑊ˆ 𝑔(𝜏; ¯ 𝛿), ˆ =⇒ ∇𝜏 𝑄(𝜏;ˆ ˆ = [∇𝜏 𝑔(𝜏; 𝛿) ¯ 𝛿)] ˆ 0 𝑊ˆ 𝑔(𝜏; ¯ 𝛿)ˆ √ √ =⇒ 𝑁∇𝜏 𝑄(𝜏 ˆ 0 ; 𝛿)ˆ = [∇𝜏 𝑔(𝜏 ˆ 0 𝑊ˆ 𝑁 𝑔(𝜏 ¯ 0 ; 𝛿)] ˆ ¯ 0 ; 𝛿). 100 √ Carrying out an element-by-element mean value expansion of 𝑁∇𝜏 𝑄(𝜏 ˆ 0 ; 𝛿) ˆ around 𝛿0 gives, √ √ √ 𝑁∇𝜏 𝑄(𝜏 ˆ = [𝐷 0 + 𝑜 𝑝 (1)] 0 𝑊ˆ [ 𝑁 𝑔(𝜏 ˆ 0 ; 𝛿) ¯ 0 ; 𝛿0 ) + ∇𝛿 𝑔(𝜏 ¯ 𝑁 ( 𝛿ˆ − 𝛿0 )] ¯ 0 ; 𝛿) (D.1) √ √ = [𝐷 0 + 𝑜 𝑝 (1)] 0 𝑊ˆ { 𝑁 𝑔(𝜏 ¯ 0 ; 𝛿0 ) + [𝐻0 + 𝑜 𝑝 (1)] 𝑁 ( 𝛿ˆ − 𝛿0 )} √ 𝑁 −1 Õ = [𝐷 0 + 𝑜 𝑝 (1)] 0 ˆ 𝑊 { 𝑁 𝑔(𝜏 ¯ 0 ; 𝛿0 ) + [𝐻0 + 𝑜 𝑝 (1)] [𝑁 2 𝜓(𝑠𝑖 , 𝑧𝑖 ) + 𝑜 𝑝 (1)]} 𝑖=1 1Õ 𝑁 − = 𝐷 00 𝑊 {𝑁 2 [𝑔𝑖 + 𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )]} + 𝑜 𝑝 (1), (D.0.1) 𝑖=1 𝑝 where 𝛿¯ lies between 𝛿ˆ and 𝛿0 (thus 𝛿¯ − → 𝛿0 ), 𝐻0 ≡ E[∇𝛿 𝑔(𝜏0 , 𝛿0 )] and 𝜓(𝑠𝑖 , 𝑧𝑖 ) = −[E(𝑑𝑖 𝑑𝑖0)] −1 𝑑𝑖 √ is the influence function for 𝑁 ( 𝛿ˆ − 𝛿0 ). Moreover, by central limit theorem, 𝑁 −1 Õ 𝑑 𝑁 2 [𝑔𝑖 + 𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )]} −→ 𝑁 (0, 𝐹0 ), 𝑖=1 where 𝐹0 ≡ E[𝑔𝑖 𝑔𝑖0 + 𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )𝑔𝑖0 + 𝑔𝑖 𝜓(𝑠𝑖 , 𝑧𝑖 ) 0 𝐻00 + 𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )𝜓(𝑠𝑖 , 𝑧𝑖 ) 0 𝐻00 ]. Now, note that by definition, −(𝑠𝑖 /𝐺 𝑖 )𝑔 ∗ (∇𝛿 𝐺 𝑖 /𝐺 𝑖 )     1𝑖  𝐻0 = E −(𝑠𝑖 /𝐺 𝑖 )𝑔 (∇𝛿 𝐺 𝑖 /𝐺 𝑖 )  = − E( 𝑔˜𝑖 𝑑𝑖0),    ∗  2𝑖  0       where 𝑔˜𝑖 ≡ (𝑔1𝑖0 , 𝑔0 , 0) 0 and the third element is a 1 × 𝐿 zero vector. This is because 2𝑖 3  (𝑠𝑖 /𝐺 𝑖 )𝑔 ∗ {𝑠𝑖 (∇𝛿 𝐺 𝑖 /𝐺 𝑖 ) − (1 − 𝑠𝑖 ) [∇𝛿 𝐺 𝑖 /(1 − 𝐺 𝑖 )]}    1𝑖  0   ∗ E( 𝑔˜𝑖 𝑑𝑖 ) = E  (𝑠𝑖 /𝐺 𝑖 )𝑔 {𝑠𝑖 (∇𝛿 𝐺 𝑖 /𝐺 𝑖 ) − (1 − 𝑠𝑖 ) [∇𝛿 𝐺 𝑖 /(1 − 𝐺 𝑖 )]}   2𝑖  0        (𝑠𝑖 /𝐺 𝑖 )𝑔 ∗ (∇𝛿 𝐺 𝑖 /𝐺 𝑖 )     1𝑖    ∗ = E  (𝑠𝑖 /𝐺 𝑖 )𝑔 (∇𝛿 𝐺 𝑖 /𝐺 𝑖 )  ,   2𝑖  0       since 𝑠𝑖2 = 𝑠𝑖 and (1 − 𝑠𝑖 ) 2 = (1 − 𝑠𝑖 ). This implies E[𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )𝑔𝑖0] = − E( 𝑔˜𝑖 𝑑𝑖0) [E(𝑑𝑖 𝑑𝑖0)] −1 E(𝑑𝑖 𝑔𝑖0), 101 and E[𝐻0 𝜓(𝑠𝑖 , 𝑧𝑖 )𝜓(𝑠𝑖 , 𝑧𝑖 ) 0 𝐻00 ] = E( 𝑔˜𝑖 𝑑𝑖0) [E(𝑑𝑖 𝑑𝑖0)] −1 E(𝑑𝑖 𝑔˜𝑖0). Therefore, 𝐹0 = E(𝑔𝑖 𝑔𝑖0) − {E(𝑔𝑖 𝑑𝑖0) [E(𝑑𝑖 𝑑𝑖0)] −1 E(𝑑𝑖 𝑔𝑖0) ◦ 𝑅}, where 𝑅 is a square matrix of order 𝐿 1 + 𝐿 2 + 𝐿 3 with all elements being unity except the lower right 𝐿 3 × 𝐿 3 block which is a 0 matrix, and ◦ denotes Hadamard product. Then using (D.1) and the asymptotic equivalence lemma, √ 𝑑 𝑁∇𝜏 𝑄(𝜏ˆ 0 ; 𝛿) → 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 𝐷 00𝑊 𝐹0𝑊 𝐷 0 ). ˆ − (D.2) Next, an element-by-element mean value expansion of ∇𝜏 𝑄( ˆ 𝜏; ˆ around 𝜏0 gives, ˆ 𝛿) ∇𝜏 𝑄( ˆ 𝜏; ˆ 𝛿)ˆ = ∇𝜏 𝑄(𝜏 ˆ + [𝐷 0 𝑊 𝐷 0 + 𝑜 𝑝 (1)] ( 𝜏ˆ − 𝜏0 ) ˆ 0 ; 𝛿) 0 √ √ =⇒ 𝑁 ( 𝜏ˆ − 𝜏0 ) = −(𝐷 00𝑊 𝐷 0 ) −1 𝑁∇𝜏 𝑄(𝜏 ˆ 0 ; 𝛿) ˆ + 𝑜 𝑝 (1). 
(D.3) Combining (D.2) and (D.3) and using the asymptotic equivalence lemma gives √ 𝑑 𝑁 ( 𝜏ˆ − 𝜏0 ) −−−−→ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, (𝐷 00𝑊 𝐷 0 ) −1 𝐷 00𝑊 𝐹0𝑊 𝐷 0 (𝐷 00𝑊 𝐷 0 ) −1 ], (D.0.2) which is the desired result. Proof of Proposition 2.4.1 For notational convenience, let 𝜏ˆ𝑊 𝐽 ≡ ( 𝛼ˆ 𝑊 0 , 𝛽ˆ0 ) 0. We want to show that under the null 𝐽 𝑊𝐽 𝑑 hypothesis , 𝑁 𝑄( ˆ 𝜏ˆ𝑊 𝐽 ; 𝛿) ˆ − → 𝜒2 , where 𝑊ˆ = 𝐹ˆ −1 . 𝐿3 First note that a mean value expansion around 𝛿0 yields √ √ √ 𝑑 ¯ 0 ; 𝛿) 𝑁 𝑔(𝜏 ˆ = ¯ 0 ; 𝛿0 ) + ∇𝛿 𝑔(𝜏 𝑁 𝑔(𝜏 ¯ 0 ; 𝛿)¯ 𝑁 ( 𝛿ˆ − 𝛿0 ) − → 𝑁 (0, 𝐹0 ). by equation (A.9). This implies −1 √ 𝑑 − 𝐹0 2 𝑁 𝑔(𝜏 ˆ ¯ 0 ; 𝛿) = 𝑈𝑁 − → 𝑈 ∼ 𝑁𝑜𝑟𝑚𝑎𝑙 (0, 𝐼). (D.4) 102 Moreover, the first order conditions for the objective function in (4.3) imply that √ √ 𝑁∇𝜏 𝑄(ˆ 𝜏; ˆ = [∇𝜏 𝑔( ˆ 𝛿) ¯ 𝜏; ˆ 0 𝐹ˆ −1 𝑁 𝑔( ˆ 𝛿)] ¯ 𝜏; ˆ 𝛿)ˆ =0 (D.5) √ =⇒ 𝐷 00 𝐹0−1 𝑁 𝑔( ¯ 𝜏; ˆ + 𝑜 𝑝 (1) = 0 ˆ 𝛿) 1 √ =⇒ 𝐷 00 𝐹0−1 [−𝐹02 𝑈 𝑁 + 𝐷 0 𝑁 ( 𝜏ˆ − 𝜏0 )] + 𝑜 𝑝 (1) = 0 √ 0 −1 −1 0 −1 =⇒ 𝑁 ( 𝜏ˆ − 𝜏0 ) = (𝐷 0 𝐹0 𝐷 0 ) 𝐷 0 𝐹0 2 𝑈 𝑁 + 𝑜 𝑝 (1). (D.0.3) Now, a mean value expansion of the sample moments around 𝜏0 gives √ √ √ ¯ 𝜏; 𝑁 𝑔( ˆ = ˆ 𝛿) ¯ 0 ; 𝛿) 𝑁 𝑔(𝜏 ˆ + ∇𝜏 𝑔( ¯ 𝜏; ¯ 𝛿) ˆ 𝑁 ( 𝜏ˆ − 𝜏0 ), (D.6) where 𝜏¯ lies between 𝜏ˆ and 𝜏0 . Substituting (D.4) and (D.5) into (D.6), we get √ − 1 − 1 ¯ 𝜏; 𝑁 𝑔( ˆ = −𝐹 2 𝑈 𝑁 + 𝐷 0 (𝐷 0 𝐹 −1 𝐷 0 ) −1 𝐷 0 𝐹 2 𝑈 𝑁 + 𝑜 𝑝 (1) ˆ 𝛿) 0 0 0 0 0 − 1 = −𝐹0 2 𝑅0𝑈 𝑁 + 𝑜 𝑝 (1), − 1 − 1 where 𝑅0 = 𝐼 − 𝐹0 2 𝐷 0 (𝐷 00 𝐹0−1 𝐷 0 ) −1 𝐷 00 𝐹0 2 is idempotent of rank 𝐿 3 . Then, 𝑑 𝑁 𝑄(ˆ 𝜏; ˆ = 𝑈 0 𝑅0𝑈 𝑁 + 𝑜 𝑝 (1) − ˆ 𝛿) → 𝜒𝐿2 . 𝑁 3 Proof of Proposition 2.6.1.1 I drop the 0 subscripts/superscripts for notational convenience, but all expressions in this proof are evaluated at the true values of the parameters, that is, at (𝛼0 , 𝛽0 , 𝛾0 ). First note that the GMM estimator of 𝛼0 that minimizes (2.4.3) with 𝑔(𝛼, 𝛽; 𝛿) ˆ = [𝑔1 (𝛼; 𝛿)ˆ 0, 𝑔2 (𝛽; 𝛿) ˆ 0] 0 and 𝑊ˆ = 𝐼 is numerically equivalent to 𝛼ˆ 𝑊 𝑐𝑐 , which is based only on 𝑔1 (𝛼; 𝛿). ˆ This is because ˆ simply adds equal number of parameters to be estimated and moment conditions to the 𝑔2 (𝛽; 𝛿) system.2 To characterize the asymptotic variance of this estimator, first define the following 2Ahu & Schmidt (1995) 103 quantities.       𝐷 11 0  h i 𝐹 11 𝐹 12  𝐹   13  𝐷1 ≡  𝐷 2 ≡ 𝐷 31 𝐷 32 𝐹1 ≡  𝐹2 ≡   𝐹3 ≡ 𝐹33 ,        0 𝐷  𝐹 0 𝐹  𝐹   22   12 22   23        (D.0.4) with 𝐹 𝑗𝑛 = E(𝑔 𝑗 𝑔0𝑛 ) − E(𝑔 𝑗 𝑑 0) [E(𝑑𝑑 0)] −1 E(𝑑𝑔0𝑛 ), 𝑗, 𝑛 = 1, 2, 3 except 𝐹33 which equals E(𝑔3 𝑔30 ). Then the asymptotic variance of this estimator is given by (𝐷 01 𝐹1−1 𝐷 1 ) −1 , and the required differ- ence in the proposition is given by the upper-left 𝐿 1 × 𝐿 1 block of (𝐷 01 𝐹1−1 𝐷 1 ) −1 − (𝐷 0 𝐹 −1 𝐷) −1 . We will now characterize this difference. First note that   −1     𝐹 𝐹  𝐹 −1 (𝐼 + 𝐹 𝐻𝐹 0 𝐹 −1 ) −𝐹 −1 𝐹 𝐻  𝐷  −1 1 2 1 2 2 1 1 2  1 𝐹 =  = ,𝐷 =  , (D.0.5)     𝐹 0 𝐹   −𝐻𝐹20 𝐹1−1 𝐻  𝐷   2 3    2       where 𝐻 ≡ (𝐹3 − 𝐹20 𝐹 −1 𝐹2 ) −1 . Therefore,     i 𝐹 −1 (𝐼 + 𝐹2 𝐻𝐹 0 𝐹 −1 ) −𝐹 −1 𝐹2 𝐻  𝐷   1 h 1 2 1 1 𝐷 0 𝐹 −1 𝐷 = 𝐷 0 𝐷 0  = 𝐷 1 𝐹1−1 𝐷 1 + 𝐽 0 𝐻𝐽,      1 2   −𝐻𝐹20 𝐹1−1 𝐻   𝐷   2     (D.0.6) where 𝐽 ≡ 𝐹20 𝐹1−1 𝐷 1 − 𝐷 2 . 
Therefore, using the Sherman Morrison formula, (𝐷 0 𝐹 −1 𝐷) −1 = (𝐷 01 𝐹1−1 𝐷 1 + 𝐽 0 𝐻𝐽) −1 = (𝐷 01 𝐹1−1 𝐷 1 ) −1 − (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0 [𝐻 −1 + 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0] −1 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 , (D.0.7) which implies that (𝐷 01 𝐹1−1 𝐷 1 ) −1 − (𝐷 0 𝐹 −1 𝐷) −1 = (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0 [𝐻 −1 + 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0] −1 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 ≡ (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0 𝐾 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 , (D.0.8) where 𝐾 ≡ [𝐻 −1 + 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 𝐽 0] −1 is a positive definite matrix. The matrix in (A.32) is clearly positive semidefinite, which proves the proposition. 104 For use in the next proof, we want to characterize the difference corresponding specifically to 𝛼0 , which is given by the upper-left 𝐿 1 × 𝐿 1 block of the matrix in (A.32). For this difference, we focus on the first 𝐿 1 columns of 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 . Note that 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 = (𝐹20 𝐹1−1 𝐷 1 − 𝐷 2 )(𝐷 01 𝐹1−1 𝐷 1 ) −1 = (𝐹20 𝐹1−1 𝐷 1 − 𝐷 2 )𝐷 −1 −1 1 𝐹1 𝐷 1 = (𝐹20 − 𝐷 2 𝐷 −1 −1 1 𝐹1 )𝐷 1 , (D.0.9) where we have used the fact that 𝐷 1 is symmetric. Substituting the definitions of 𝐹1 , 𝐹2 , 𝐷 1 , and 𝐷 2 , we get 𝐽 (𝐷 01 𝐹1−1 𝐷 1 ) −1 equals h i 0 − 𝐷 𝐷 −1 𝐹 − 𝐷 𝐷 −1 𝐹 0 )𝐷 −1 (𝐹 0 − 𝐷 𝐷 −1 𝐹 − 𝐷 𝐷 −1 𝐹 )𝐷 −1 . (D.0.10) (𝐹13 31 11 11 32 22 12 11 23 31 11 12 32 22 22 22 The first 𝐿 1 columns of this matrix are given by the left block, which is 0 𝐷 −1 − 𝐷 𝐷 −1 𝐹 𝐷 −1 − 𝐷 𝐷 −1 𝐹 0 𝐷 −1 . 𝐿 ≡ 𝐹13 (D.0.11) 11 31 11 11 11 32 22 12 11 Let 𝐿 = [𝐿 1 𝐿 2 ], where 𝐿 1 is the first column of 𝐿 and 𝐿 2 is the matrix of last 𝐿 1 − 1 columns of 𝐿. Then the difference in asymptotic variances corresponding to 𝛼1 and 𝛼2 is 𝐿 01 𝐾 𝐿 1 and 𝐿 02 𝐾 𝐿 2 respectively. Proof of Proposition 6.1.2 We want to show that neither 𝐿 1 nor 𝐿 2 derived in the proof of Proposition 6.1.1 is zero in general. For notational simplicity, I drop the 0 sub/superscripts in this proof, but all expressions are evaluated at the true parameter values. By standard second order conditions for a probit and a normal MLE,   𝜎 −2 𝑥 0 𝑥 0 2 2  𝐷 11 = − E(𝑥 0𝑥𝑒 1 ) 𝐷 22 = E       0 𝜎 −4 /2   𝐷 31 = − E(𝑥 20 𝑥2 𝑒 2 )ℎ𝛼 𝐷 32 = − E(𝑥20 𝑥 2 𝑒 2 )ℎ 𝛽 (D.0.12) 105 where ℎ𝛼 = [ℎ𝛼1 ℎ𝛼2 ]  𝜃 − (𝜃𝛼 + 𝛼 )(1 + 𝛼2 𝜎 2 ) −1 𝛼 𝜎 2 1   1 2 1 1  ≡  q q 𝐼𝑘  1 + 𝛼12 𝜎 2 1 + 𝛼12 𝜎 2      ℎ 𝛽 = [ℎ 𝜃 ℎ 𝜎 2 ]  𝛼1 (𝜃𝛼1 + 𝛼2 )𝛼12  𝐼𝑘 −  ≡  q 2 𝜎 2 ) 3/2  ,  (D.0.13)  1 + 𝛼1 𝜎2 2 2(1 + 𝛼 1    𝑒 1 ≡ [𝜙(𝑥𝛼)] 2 /{Φ(𝑥𝛼) [1 − Φ(𝑥𝛼)]}, 𝑒 2 ≡ [𝜙(𝑥 2 𝛾)] 2 /{Φ(𝑥 2 𝛾) [1 − Φ(𝑥 2 𝛾)]}. Then we can write   𝑥 0 𝑥 𝑒 𝑥 0 𝑥 𝑒   1 1 1 2 1 𝐷 11 = − E  1  (D.0.14) 𝑥 0 𝑥 𝑒 𝑥 0 𝑥 𝑒   2 1 1 2 2 1   0 0 2 2 Let E(𝑥 2 𝑥 2 𝑒 1 ) ≡ Γ1 , E(𝑥2𝑟𝑒 1 ) ≡ Γ2 , and E(𝑟 𝑒 1 ) = 𝜎 2 . Then using 𝑥 1 = 𝑥2 𝜃 + 𝑟, we can write 𝑟 𝑒 1 𝜃 Γ1 𝜃 + 2Γ0 𝜃 + 𝜎 2 𝜃 0Γ1 + Γ02   0  2 2 𝑟 𝑒1 𝐷 11 = −  (D.0.15)      Γ1 𝜃 + Γ2 Γ1    Let Γ3 ≡ (𝜎 22 − Γ02 Γ−1 1 2 Γ ). Using the partitioned inverse formula, we can write 𝑟 𝑒 1    Γ −1 −Γ −1 (𝜃 0 + Γ0 Γ−1 )  3 3 2 1 𝐷 −1 = (D.0.16)   11  −Γ−1 (𝜃 + Γ−1 Γ ) Γ−1 + (𝜃 + Γ−1 Γ )Γ−1 (𝜃 0 + Γ0 Γ−1 )    3  1 2 1 1 2 3 2 1  To calculate the first term in (A.35), we begin by deriving 𝐹13 . 𝐹13 = E(𝑔1 𝑔30 ) − E(𝑔1 𝑑 0) [E(𝑑𝑑 0)] −1 E(𝑑𝑔30 ). (D.0.17) Let 𝑢 1 ≡ [𝑦−Φ(𝑥𝛼)]𝜙(𝑥𝛼)/Φ(𝑥𝛼) [1−Φ(𝑥𝛼)] be the generalized residual for the model of interest, 𝑣 1 ≡ [𝑦 − Φ(𝑥2 𝛾)]𝜙(𝑥 2 𝛾)/Φ(𝑥2 𝛾) [1 − Φ(𝑥 2 𝛾)] be the generalized residual for the reduced form, Ω𝑢1 𝑣 1 ≡ E{[𝑠/𝑝(𝑧)]𝑥 20 𝑥 2 𝑢 1 𝑣 1 }, Ω𝑟𝑢1 𝑣 1 ≡ E{[𝑠/𝑝(𝑧)]𝑥 2𝑟𝑢 1 𝑣 1 }, Ω𝑣 1 𝑑 ≡ E(𝑥20 𝑣 1 𝑑 0), Ω𝑢1 𝑑 ≡ E{[𝑠/𝑝(𝑧)]𝑥 20 𝑢 1 𝑑 0 }, and Ω𝑟𝑢1 𝑑 ≡ E{[𝑠/𝑝(𝑧) 2 ]𝑟𝑢 1 𝑑 0 }. 
Then  0 𝜃 Ω𝑢 𝑣 + Ω𝑟𝑢 𝑣 − (𝜃 0Ω𝑢 𝑑 + Ω𝑟𝑢 𝑑 ) [E(𝑑𝑑 0)] −1 Ω0   1 1 1 1 1 1 𝑣1 𝑑  𝐹13 =  (D.0.18)     Ω𝑢1 𝑣 1 − Ω𝑢1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑣 𝑑    1  106 Using the definitions of 𝐹13 and 𝐷 −1 11 , we get that the first column of 𝐹13 0 𝐷 −1 is 11 𝑄 11 ≡ {Ω0𝑟𝑢 𝑣 − Ω𝑣 1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 }Γ−1 0 0 −1 0 3 − {Ω𝑢 1 𝑣 1 − Ω𝑣 1 𝑑 [E(𝑑𝑑 )] Ω𝑢 1 𝑑 }Γ3 Γ1 Γ2 −1 −1 1 1 1 (D.0.19) and the last 𝑘 columns of 𝐹13 0 𝐷 −1 are 11 𝑄 12 ≡ −{Ω0𝑟𝑢 𝑣 − Ω𝑣 1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 }Γ−1 0 0 −1 3 (𝜃 + Γ2 Γ1 ) 1 1 1 + {Ω0𝑢 𝑣 − Ω𝑣 1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 }Γ−1 −1 0 1 [𝐼 𝑘 + Γ2 Γ3 (𝜃 + Γ2 Γ1 )] 0 −1 (D.0.20) 1 1 1 Next we derive the second term in (A.35). ℎ𝛼 𝐷 −1 −1 11 = [ℎ 𝛼1 ℎ 𝛼2 ]𝐷 11 = [ Γ−1 3 [ℎ𝛼1 −ℎ𝛼2 (𝜃+Γ−1 Γ2 )] −ℎ𝛼1 Γ−1 (𝜃 0+Γ0 Γ−1 )+ℎ𝛼2 [Γ−1 +(𝜃+Γ−1 Γ2 )Γ−1 (𝜃 0+Γ0 Γ−1 )] ] 1 3 2 1 1 1 3 2 1 (D.0.21) Let ℎ ≡ ℎ𝛼1 − ℎ𝛼2 𝜃 and Γ4 ≡ Γ−1 3 (ℎ − ℎ𝛼2 Γ−1 1 2 Γ ). Then h i ℎ𝛼 𝐷 −111 = Γ4 −Γ 4 (𝜃 0 + Γ0 Γ−1 ) + ℎ Γ−1 𝛼2 1 (D.0.22) 2 1 Now consider 𝐹11 . Let Ω 2 ≡ E{[𝑠/𝑝(𝑧) 2 ]𝑥 20 𝑥 2 𝑢 21 }, Ω 2 ≡ E{[𝑠/𝑝(𝑧) 2 ]𝑥20 𝑟𝑢 21 }, and Ω 2 2 ≡ 𝑢 𝑟𝑢 𝑟 𝑢 1 1 1 E{[𝑠/𝑝(𝑧) 2 ]𝑟 2 𝑢 21 }. Then, 𝐹11 = [𝐹111 𝐹112 ] (D.0.23) where  0 𝜃 Ω 2 𝜃 + 2𝜃 0Ω 2 + Ω 0 0 −1 0 0  2 𝑢 2 − (𝜃 Ω𝑢 1 𝑑 + Ω𝑟𝑢 1 𝑑 ) [E(𝑑𝑑 )] (Ω𝑢 1 𝑑 𝜃 + Ω𝑟𝑢 1 𝑑 )    𝑢 𝑟𝑢 𝑟 𝐹111 ≡   1 1 1  (D.0.24) 0 0 −1 0 0   Ω 𝜃 + Ω 2 − Ω𝑢 𝑑 [E(𝑑𝑑 )] (Ω 𝜃+Ω )   𝑢 2 𝑟𝑢 1 𝑢 1 𝑑 𝑟𝑢 1 𝑑    1 1    0 𝜃 Ω 2 + Ω0 − (𝜃 0Ω𝑢 𝑑 + Ω𝑟𝑢 𝑑 ) [E(𝑑𝑑 0)] −1 Ω0    𝑢 1 𝑟𝑢 2 1 1 𝑢1 𝑑  𝐹112 ≡  1  (D.0.25)   0 Ω 2 − Ω𝑢1 𝑑 [E(𝑑𝑑 )] Ω𝑢 𝑑 −1 0    𝑢 1 1   107 Using the definitions of ℎ𝛼 𝐷 −1 11 , and 𝐹11 , we find that the first column of ℎ𝛼 𝐷 −1 𝐹 𝐷 −1 is given 11 11 11 by 𝑄 ∗21 ≡ Γ4 Γ−1 0 −1 0 3 ({Ω𝑟 2 𝑢 2 − Ω𝑟𝑢 1 𝑑 [E(𝑑𝑑 )] Ω𝑟𝑢 1 𝑑 } 1 − {Ω0 2 − Ω𝑟𝑢1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 }Γ−1 1 Γ2 − Ω𝑟𝑢 2 𝜃) 0 𝑟𝑢 1 1 1 − (Γ4 Γ02 + ℎ𝛼2 )Γ−1 −1 0 −1 0 1 Γ3 [{Ω𝑟𝑢 2 − Ω𝑢 1 𝑑 [E(𝑑𝑑 )] Ω𝑟𝑢 1 𝑑 } 1 − {Ω 2 − Ω𝑢1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 }Γ−1 1 Γ2 − (𝜃 + Γ1 Γ2 )] −1 (D.0.26) 𝑢 1 1 and the last 𝑘 columns of ℎ𝛼 𝐷 −1 𝐹 𝐷 −1 are given by 11 11 11 𝑄 ∗22 ≡ −Γ4 Γ−1 0 −1 0 0 −1 0 0 −1 3 [{Ω𝑟 2 𝑢 2 − Ω𝑟𝑢 1 𝑑 [E(𝑑𝑑 )] Ω𝑟𝑢 1 𝑑 } − Ω𝑟𝑢 2 (𝜃 + Γ1 Γ2 )] (𝜃 + Γ2 Γ1 ) 1 1 + (Γ4 Γ02 + ℎ𝛼2 )Γ−1 0 −1 0 −1 0 0 −1 0 −1 1 {Ω𝑟𝑢 2 − Ω𝑢 1 𝑑 [E(𝑑𝑑 )] Ω𝑟𝑢 1 𝑑 }Γ3 (𝜃 + Γ2 Γ1 ) + Γ4 Ω𝑟𝑢 2 Γ1 1 1 − [Γ4 Ω𝑟𝑢1 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 + (Γ4 Γ02 + ℎ𝛼2 )Γ−1 0 −1 0 1 {Ω𝑢 2 − Ω𝑢 1 𝑑 [E(𝑑𝑑 )] Ω𝑢 1 𝑑 }] 1 1 Γ−1 −1 0 0 −1 1 [𝐼 𝑘 + Γ2 Γ3 (𝜃 + Γ2 Γ1 )] (D.0.27) Let 𝑄 21 ≡ − E(𝑥20 𝑥2 𝑒 2 )𝑄 ∗21 and 𝑄 22 ≡ − E(𝑥 20 𝑥 2 𝑒 2 )𝑄 ∗22 . Then, 𝐷 31 𝐷 −1 −1 11 𝐹11 𝐷 11 = [𝑄 21 𝑄 22 ] (D.0.28) Clearly, neither 𝑄 21 nor 𝑄 22 is zero. Next we want to find 𝐷 32 𝐷 −1 𝐹 0 𝐷 −1 . First note that 22 12 11   𝜎 2 E(𝑥 0 𝑥 ) −1 0  2 2 𝐷 −1 22 =  (D.0.29)   . 0 2𝜎 4     Further, let Ω𝑟 𝑑 ≡ E{[𝑠/𝑝(𝑧)]𝑥20 𝑟𝑑 0 }, Ω𝑟 2 𝑑 ≡ E{[𝑠/𝑝(𝑧)] (𝑟 2 𝜎 −2 − 1)𝑑 0 }, Ω𝑢1𝑟 ≡ E{[𝑠/𝑝(𝑧) 2 ] 𝑥 20 𝑥 2 𝑢 1𝑟}, Ω𝑢 𝑟 2 ≡ E{[𝑠/𝑝(𝑧) 2 ]𝑥 20 𝑢 1𝑟 2 }, and Ω𝑢 𝑟 3 ≡ E{[𝑠/𝑝(𝑧) 2 ]𝑢 1𝑟 3 }. Then 1 1 (Ω0𝑢 𝑟 𝜃+Ω0 )−[Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 (Ω0 𝜃+Ω0 )] Ω0𝑢 𝑟 −Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 Ω0 " # 0 = 1 𝑢1𝑟 2 𝑢1 𝑑 𝑟𝑢 1 𝑑 1 𝑢1 𝑑 𝐹12 0 −2 0 −1 0 0 . (Ω 𝜃+Ω )𝜎 −Ω 2 [E(𝑑𝑑 )] (Ω 𝜃+Ω ) 𝜎 Ω𝑢 𝑟 −Ω 2 [E(𝑑𝑑 )] Ω0 −2 0 0 −1 𝑢1𝑟 2 𝑢1𝑟 3 𝑟 𝑑 𝑢1 𝑑 𝑟𝑢 1 𝑑 1 𝑟 𝑑 𝑢1 𝑑 (D.0.30) 108 Next note that ℎ 𝛽 𝐷 −1 22 = [ℎ 𝜃 𝜎 2 E(𝑥20 𝑥 2 ) −1 2ℎ 𝜎 2 𝜎 4 ]. 
Using the definitions of 𝐹12 and 𝐷 −1 22 , we get that the first column of ℎ 𝛽 𝐷 −1 𝐹 0 𝐷 −1 is 22 12 11 𝑄 ∗31 ≡ ℎ 𝜃 𝜎 2 E(𝑥 20 𝑥 2 ) −1 Γ−1 3 (Ω0 − Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 − {Ω0𝑢 𝑟 − Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 }Γ−1 1 Γ2 ) 𝑢1𝑟 2 1 1 1 + 2ℎ 𝜎 2 𝜎 4 Γ−1 3 (𝜎 −2 Ω𝑢 𝑟 3 − Ω𝑟 2 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 − {𝜎 −2 Ω0 2 − Ω𝑟 2 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 }Γ−1 1 Γ2 ) 1 1 𝑢1𝑟 1 (D.0.31) and the last 𝑘 columns of ℎ 𝛽 𝐷 −1 𝐹 0 𝐷 −1 are 22 12 11 𝑄 ∗32 ≡ 𝜎 2 E(𝑥 20 𝑥 2 ) −1 ([−ℎ 𝜃 {Ω0 − Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 } 𝑢1𝑟 2 1 − 2ℎ 𝜎 2 𝜎 2 {Ω0 𝜎 −2 − Ω𝑟 2 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑟𝑢 𝑑 }Γ−1 0 0 −1 3 (𝜃 + Γ2 Γ1 )] 𝑢1𝑟 3 1 + {(ℎ 𝜃 {Ω0𝑢 𝑟 − Ω𝑟 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 } 1 1 − 2ℎ 𝜎 2 𝜎 2 {Ω0𝑢 𝑟 − Ω𝑟 2 𝑑 [E(𝑑𝑑 0)] −1 Ω0𝑢 𝑑 })Γ−1 −1 0 0 −1 1 [𝐼 𝑘 + Γ2 Γ3 (𝜃 + Γ2 Γ1 )]}) (D.0.32) 1 1 Let 𝑄 31 ≡ − E(𝑥20 𝑥2 𝑒 2 )𝑄 ∗31 and 𝑄 32 ≡ − E(𝑥 20 𝑥 2 𝑒 2 )𝑄 ∗32 . Then, 𝐷 32 𝐷 −1 0 −1 22 𝐹12 𝐷 11 = [𝑄 31 𝑄 32 ] (D.0.33) Clearly, neither 𝑄 31 nor 𝑄 32 is zero. Thus, 𝐿 1 = 𝑄 11 + 𝑄 21 + 𝑄 31 ≠ 0 (D.0.34) 𝐿 2 = 𝑄 12 + 𝑄 22 + 𝑄 32 ≠ 0 (D.0.35) which implies that it is possible to obtain strict efficiency gains for both 𝛼1 and 𝛼2 . Proof of Proposition E.1. We first show that 𝛼0 is a solution to 𝑚𝑖𝑛𝛼∈A E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)]. First note that for any 𝛼 ∈ A, E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)] = E{E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥]} = E{𝑝(𝑥 2 ) E[ 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥]}, 109 where the second equality follows by iterated expectations. E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥] = E{E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)|𝑦, 𝑥]|𝑥} = E[E(𝑠|𝑦, 𝑥) 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥] = E[ 𝑝(𝑥2 ) 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥] = 𝑝(𝑥2 ) E[ 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥], where the third equality follows from part 2 of Assumption E.2. Because 𝑝(𝑥 2 ) ≥ 0 ∀𝑥 2 ∈ X2 , and 𝛼0 minimizes E[ 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥] for all 𝑥 ∈ X, 𝑝(𝑥 2 ) E[ 𝑓1 (𝑦, 𝑥, 𝛼0 )|𝑥] ≤ 𝑝(𝑥 2 ) E[ 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥], 𝑥 ∈ X, 𝛼 ∈ A. The result follows from taking an expectation with respect to 𝑥. A similar argument can be used to verify that 𝛽0 solves min 𝛽∈B E[𝑠 · 𝑓2 (𝑥 1 , 𝑥2 , 𝛽)] and noting that E(𝑠|𝑥) = 𝑝(𝑥 2 ) under part 2 of Assumption E.2. For the reduced form, part 1 of Assumption E.1 implies using iterated expectations that 𝛾0 minimizes E[ 𝑓3 (𝑦, 𝑥2 , 𝛾)]. 110 APPENDIX E ASYMPTOTIC THEORY FOR UNWEIGHTED ESTIMATION The notion of econometric models underlying the objective functions in (2.2.1)-(2.2.3) being correctly specified, and sample selection being based on 𝑥 2 is formalized in the following two assumptions. Assumption E.1. Assume that 1. For each 𝑥 ∈ X, 𝛼0 solves min𝛼∈A E[ 𝑓1 (𝑦, 𝑥, 𝛼)|𝑥]. For each 𝑥2 ∈ X2 , 𝛽0 and 𝛾0 solve min 𝛽∈B E[ 𝑓2 (𝑥 1 , 𝑥2 , 𝛽)|𝑥 2 ] and min𝛾∈Γ E[ 𝑓3 (𝑦, 𝑥2 , 𝛾)|𝑥2 ] respectively. 2. 𝛼0 , 𝛽0 , and 𝛾0 are the unique solutions to min𝛼∈A E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)] and min 𝛽∈B E[𝑠 · 𝑓2 (𝑥1 , 𝑥2 , 𝛽)] respectively. Part 1 of this assumption practically means that the underlying model is correctly specified. Part 2 is needed to ensure that the selected subpopulation is sufficiently rich to identify the respective parameters. The notion that 𝑠 depends on 𝑥2 is formalized in part 2 of the following assumption. Assumption E.2. Assume that 1. 𝑥 1 is observed whenever 𝑠 = 1, (𝑦, 𝑥2 ) are always observed. 2. 𝑃(𝑠 = 1|𝑦, 𝑥1 , 𝑥2 ) = 𝑃(𝑠 = 1|𝑥 2 ) ≡ 𝑝(𝑥 2 ). It is simple to show that Assumptions E.1 and E.2 along with regularity conditions, imply consistency of ( 𝛼ˆ 𝑈𝐽 , 𝛽ˆ𝑈𝐽 ). I show that the following proposition holds. Proposition E.1. Under Assumptions E.1 and E.2, 𝛼0 , 𝛽0 , and 𝛾0 solve min𝛼∈A E[𝑠 · 𝑓1 (𝑦, 𝑥, 𝛼)], min 𝛽∈B E[𝑠 · 𝑓2 (𝑥 1 , 𝑥2 , 𝛽)] and min𝛾∈Γ E[ 𝑓3 (𝑦, 𝑥2 , 𝛾)] respectively. 
The proof (given in Appendix C) simply follows from an iterated expectations argument and is an extension of that in Wooldridge (2002). Theorem E.1. Assume that 111 1. {(𝑦𝑖 , 𝑥𝑖 , 𝑠𝑖 ) : 𝑖 = 1, . . . , 𝑁 } are random draws from the population satisfying Assumption E.2. 2. Assumption E.1 holds. 3. Parts 3 (except the assumptions on Δ), 4, and 6 of Theorem 2.4.1 hold. 𝑝 Then ( 𝛼ˆ 𝑈𝐽 , 𝛽ˆ𝑈𝐽 ) −→ (𝛼0 , 𝛽0 ) as 𝑁 → − ∞. Once we verify that (𝛼0 , 𝛽0 ) are identified in the subpopulations defined by 𝑠 = 1, the proof of Theorem E.1 is very similar to that of Theorem 2.4.1, and hence is omitted. To derive the asymptotic distribution of ( 𝛼ˆ 𝑈𝐽 , 𝛽ˆ𝑈𝐽 ), we assume that E[𝑔(𝛼, 𝛽)] is differentiable at (𝛼0 , 𝛽0 ) with the derivative defined as the following.  0  𝐷  𝑈11 0     𝐷 𝑈0 ≡ E[∇ (𝛼0,𝛽0)0 𝑔(𝛼, 𝛽)| (𝛼,𝛽)=(𝛼 ,𝛽 ) ] =  0  𝐷 0 , (E.0.1) 0 0 𝑈22     0 0   𝐷 𝑈31 𝐷 𝑈32    where 𝐷 𝑈 0 = 𝜕𝑔 𝑗 (𝛼, 𝛽)/𝜕𝛼| (𝛼,𝛽)=(𝛼 ,𝛽 ) and 𝐷 𝑈 0 = 𝜕𝑔 𝑗 (𝛼, 𝛽)/𝜕 𝛽| (𝛼,𝛽)=(𝛼 ,𝛽 ) , 𝑗 = 1, 2, 3. 𝑗1 0 0 𝑗2 0 0 √ Then the following theorem gives the 𝑁−asymptotic normality result. Theorem E.2.(Asymptotic Normality): Assume that 1. The assumptions in Theorem E.1 hold 2. (𝛼0 , 𝛽0 ) ∈ 𝑖𝑛𝑡 (A × B). 3. 𝑔(𝛼, 𝛽) is twice continuously differentiable on 𝑖𝑛𝑡 (A × B). 4. 𝐷 𝑈0 is of full rank 𝐿 1 + 𝐿 2 . Then, √ 0 , 𝛽ˆ0 ) 0 − (𝛼0 , 𝛽0 ) 0] − 𝑑 0 𝐶 −1 𝐷 ) −1 ]. 𝑁 [( 𝛼ˆ 𝑈𝐽 𝑈𝐽 0 0 −−−→ 𝑁𝑜𝑟𝑚𝑎𝑙 [0, (𝐷 𝑈0 0 𝑈0 The proof follows in a straightforward manner from Theorem 3.4 of Newey and McFadden (1994) and hence is omitted. 112 APPENDIX F TABLES FOR CHAPTER 2 Table F.1: Summary of missing data methods used in 5 highly ranked economics journals from 2018 to August 2020. Total % Missingness % CC % DVM % RI % Other American Economic Review 319 20.69 71.21 16.67 15.15 15.15 Quarterly Journal of Economics 109 28.44 74.19 9.68 9.68 29.68 Journal of Labor Economics 109 35.78 58.97 15.38 10.26 17.95 Journal of Human Resources 98 43.88 46.51 32.56 11.63 16.28 Journal of Political Economy 211 19.91 59.52 16.67 21.43 14.29 Total 846 26.12 62.44 18.55 14.03 17.65 1 Column 1 shows the total number of papers published. Column 2 shows the percentage of papers that reported missing values. Columns 3-6 show the percentage of papers that used the complete cases estimator, the dummy variable method, the two-step regression imputation, and other methods respectively. 2 The row percentages add to more than 100 because some papers use multiple methods. 3 The articles that do not explicitly mention the method of imputation are included in the two-step regression imputation category since this is the most frequently used method within the imputation category. 113 Table F.2: Effect of grade variance on probability of having a 4 year college degree. Complete cases Joint GMM % ↓ in s.e. 
Plug-in DVM Log(income) 0.148 0.148 0.149 0.150 (0.042) (0.041) 2.38 (0.042) (0.042) GSD -0.146 -0.140 -0.138 -0.139 (0.039) (0.035) 10.26 (0.037) (0.035) GPA 0.329 0.331 0.338 0.339 (0.049) (0.043) 12.24 (0.043) (0.043) Black 0.413 0.407 0.386 0.395 (0.128) (0.116) 9.38 (0.114) (0.114) Hispanic 0.539 0.445 0.404 0.419 (0.147) (0.135) 8.16 (0.138) (0.135) Live in south 0.140 0.149 0.144 0.137 (0.065) (0.058) 10.77 (0.057) (0.057) Lived in urban area 0.093 0.080 0.082 0.083 (0.068) (0.061) 10.29 (0.062) (0.060) Mother’s education 0.060 0.057 0.055 0.056 (0.015) (0.014) 6.67 (0.014) (0.014) Father’s education 0.063 0.063 0.062 0.062 (0.011) (0.010) 9.09 (0.010) (0.010) Female -0.128 -0.132 -0.138 -0.146 (0.059) (0.053) 10.17 (0.052) (0.052) Cognitive skills 0.436 0.420 0.400 0.404 (0.050) (0.044) 12 (0.044) (0.044) Non-Cognitive skills 0.012 0.015 0.018 0.016 (0.030) (0.027) 10 (0.028) (0.027) N 3219 3942 3942 3942 p-value for J stat 0.590 114 APPENDIX G PROOFS FOR CHAPTER 3 Proof of Lemma 3.4.1 Starting with 𝑓1𝑖 (.), we want to show that E Í𝑇 0  Í𝑇 0 𝑡=1 𝑠𝑖𝑡 𝑥¥𝑖𝑡 𝑢¥𝑖𝑡 = 0. Since 𝑡=1 𝑠𝑖𝑡 𝑥¥𝑖𝑡 𝑢¥𝑖𝑡 = Í𝑇 0 Í𝑇 0  𝑡=1 𝑠𝑖𝑡 𝑥¥𝑖𝑡 𝑢𝑖𝑡 , we want to show that E 𝑡=1 𝑠𝑖𝑡 𝑥¥𝑖𝑡 𝑢𝑖𝑡 = 0.  First, Assumption 3.2.1 implies by the law of iterated expectations (LIE) that E 𝑢𝑖𝑡 |𝑥𝑖 , 𝑠𝑖 = 0. Now, ∀ 𝑡 = 1, . . . , 𝑇 𝐸 (𝑠𝑖𝑡 𝑥¥𝑖𝑡0 𝑢𝑖𝑡 ) = E[E 𝑠𝑖𝑡 𝑥¥𝑖𝑡0 𝑢𝑖𝑡 |𝑥𝑖 , 𝑠𝑖 ] = E[𝑠𝑖𝑡 𝑥¥𝑖𝑡0 E 𝑢𝑖𝑡 |𝑥𝑖 , 𝑠𝑖 ] = 0.   Therefore, E Í𝑇 𝑠 ¥ 𝑥 0 𝑢  = 0. 𝑡=1 𝑖𝑡 𝑖𝑡 𝑖𝑡 Using a similar argument for 𝑓2𝑖 (.), we want to show that E Í𝑇 𝑠 ¥ 𝑥 0 𝑟  = 0. Now, ∀ 𝑡=1 𝑖𝑡 2𝑖𝑡 𝑖𝑡 𝑡 = 1, . . . , 𝑇 0 𝑟 ) = E[E 𝑠 𝑥¥0 𝑟 |𝑥 , 𝑠  ] = E[𝑠 𝑥¥0 E 𝑟 |𝑥 , 𝑠  ] = 0. E(𝑠𝑖𝑡 𝑥¥2𝑖𝑡 𝑖𝑡 𝑖𝑡 2𝑖𝑡 𝑖𝑡 2𝑖 𝑖 𝑖𝑡 2𝑖𝑡 𝑖𝑡 2𝑖 𝑖  The last equality follows from E 𝑟𝑖𝑡 |𝑥 2𝑖 , 𝑠𝑖 = 0 which follows from Assumption 3.2.2 and LIE. Í𝑇 0  Therefore, E 𝑡=1 𝑠𝑖𝑡 𝑥¥2𝑖𝑡 𝑟𝑖𝑡 = 0. Í For 𝑓3𝑖 (.), we want to show that E[ 𝑇𝑡=1 (1 − 𝑠𝑖𝑡 ) 𝑥¤2𝑖𝑡 0 𝑣 ] = 0. First, note that using the LIE, 𝑖𝑡   Assumption 3.2.1 implies that E 𝑢𝑖𝑡 |𝑥 2𝑖 , 𝑠𝑖 = 0. This combined with E 𝑟𝑖𝑡 |𝑥 2𝑖 , 𝑠𝑖 = 0 implies   that E 𝑣𝑖𝑡 |𝑥 2𝑖 , 𝑠𝑖 = E 𝛽1𝑟𝑖𝑡 + 𝑢𝑖𝑡 |𝑥2𝑖 , 𝑠𝑖 = 0. Now, ∀ 𝑡 = 1, . . . , 𝑇 0 𝑣 ] = E{E[(1 − 𝑠 ) 𝑥¤0 𝑣 |𝑥 , 𝑠 ]} = E[(1 − 𝑠 ) 𝑥¤0 E(𝑣 |𝑥 , 𝑠 )] = 0. E[(1 − 𝑠𝑖𝑡 ) 𝑥¤2𝑖𝑡 𝑖𝑡 𝑖𝑡 2𝑖𝑡 𝑖𝑡 2𝑖 𝑖 𝑖𝑡 2𝑖𝑡 𝑖𝑡 2𝑖 𝑖 Í and hence E[ 𝑇𝑡=1 (1 − 𝑠𝑖𝑡 ) 𝑥¤2𝑖𝑡 0 𝑣 ] = 0. 𝑖𝑡 Proof of Proposition 3.4.2.1 𝛽ˆ 𝐷 is obtained by estimating the parameters in equation (3.4.3) using POLS. POLS will be consistent if 0    Í𝑇    𝑔1𝑖 (.)   𝑡=1 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑒`𝑖𝑡  0         Í    E 𝑔2𝑖 (.)  ≡ E  𝑇 (1 − 𝑠`𝑖𝑡 ) 𝑒`𝑖𝑡  = 0 . (F.1)    𝑡=1     Í𝑇 0 𝑔3𝑖 (.)  ` ` 0        𝑡=1 𝑥 2𝑖𝑡 𝑒 𝑖𝑡        115 We are going to show that each of these holds true iff either 𝛽1 = 0 or 𝜋2 = 𝑑𝑖 = 0 ∀ 𝑖. First, note that 𝑒`𝑖𝑡 = 𝑒𝑖𝑡 − 𝑒¯𝑖 =[(1 − 𝑠𝑖𝑡 )𝑥 22𝑖𝑡 − (1 − 𝑠𝑖 )𝑥 22𝑖 ]𝜋2 𝛽1 + [(1 − 𝑠𝑖𝑡 )𝑑𝑖 − (1 − 𝑠𝑖 )𝑑𝑖 ] 𝛽1 + [(1 − 𝑠𝑖𝑡 )𝑟𝑖𝑡 − (1 − 𝑠𝑖 )𝑟𝑖 ] 𝛽1 + [𝑢𝑖𝑡 − 𝑢¯𝑖 ], where Õ 𝑇 (1 − 𝑠𝑖 )𝑥 22𝑖 = 𝑇𝑖−1 (1 − 𝑠𝑖𝑞 )𝑥 22𝑖𝑞 𝑞=1 Õ 𝑇 (1 − 𝑠𝑖 )𝑑𝑖 = 𝑇𝑖−1 (1 − 𝑠𝑖𝑞 )𝑑𝑖 = (1 − 𝑇 −1𝑇𝑖 )𝑑𝑖 𝑞=1 Õ𝑇 (1 − 𝑠𝑖 )𝑟𝑖 = 𝑇𝑖−1 (1 − 𝑠𝑖𝑞 )𝑟𝑖𝑞 . 𝑞=1 Starting with 𝑔1𝑖 , the first term is Õ𝑇 Õ𝑇 0 [(1− 𝑠 )𝑥 0 [(1− 𝑠 )𝑥 E{ 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑖𝑡 22𝑖𝑡 − (1 − 𝑠𝑖 )𝑥 22𝑖 ]𝜋2 𝛽1 } = E{ 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑖𝑡 22𝑖𝑡 − (1 − 𝑠𝑖 )𝑥 22𝑖 ]𝜋2 𝛽1 }. 𝑡=1 𝑡=1 Consider this expectation for each 𝑡 separately. It is 0 iff either 𝜋2 = 0 or 𝛽1 = 0 or both are 0. If neither of these conditions holds, then this term will be a non-zero number, except by fluke. For the second term, 0 [(1 − 𝑠 ) − (1 − 𝑠 )𝑑 ]𝑑 𝛽 } E{ 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑖𝑡 𝑖 𝑖 𝑖 1 is zero ∀ 𝑡 iff 𝛽1 = 0 or 𝑑𝑖 = 0 ∀ 𝑖 or both. 
For the third term, 0 [(1 − 𝑠 )𝑟 − (1 − 𝑠 )𝑟 ] 𝛽 } E{ 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑖𝑡 𝑖𝑡 𝑖 𝑖 1 is zero ∀ 𝑡 iff 𝛽1 = 0. For the fourth term, 0 (𝑢 − 𝑢¯ )] E[ 𝑠`𝑖𝑡 𝑥`1𝑖𝑡 𝑖𝑡 𝑖 is zero ∀ 𝑡 under Assumption 3.2.1. Moving on to 𝑔2𝑖 , for the first term E{[(1 − 𝑠𝑖𝑡 ) − (1 − 𝑇 −1𝑇𝑖 )] [(1 − 𝑠𝑖𝑡 )𝑥22𝑖𝑡 − (1 − 𝑠𝑖 )𝑥 22𝑖 ]𝜋2 𝛽1 } 116 is zero ∀ 𝑡 iff 𝜋2 = 0 or 𝛽1 = 0 or both. For the second term, E{[(1 − 𝑠𝑖𝑡 ) − (1 − 𝑇 −1𝑇𝑖 )] 2 𝑑𝑖 𝛽1 } is zero ∀ 𝑡 iff 𝛽1 = 0 or 𝑑𝑖 = 0 ∀ 𝑖 or both. For the third term, E{[(1 − 𝑠𝑖𝑡 ) − (1 − 𝑇 −1𝑇𝑖 )] [(1 − 𝑠𝑖𝑡 )𝑟𝑖𝑡 − (1 − 𝑠𝑖 )𝑟𝑖 ] 𝛽1 } is zero under Assumption 3.2.2. For the fourth term, E[[(1 − 𝑠𝑖𝑡 ) − (1 − 𝑇 −1𝑇𝑖 )] (𝑢𝑖𝑡 − 𝑢¯𝑖 )] is zero ∀ 𝑡 under Assumption 3.2.1. Moving on to 𝑔3𝑖 , for the first term 0 [(1 − 𝑠 )𝑥 E{𝑥`2𝑖𝑡 𝑖𝑡 22𝑖𝑡 − (1 − 𝑠𝑖 )𝑥 22𝑖 ]𝜋2 𝛽1 } is zero ∀ 𝑡 iff 𝜋2 = 0 or 𝛽1 = 0 or both. For the second term, 0 [(1 − 𝑠 ) − (1 − 𝑇 −1𝑇 )]𝑑 𝛽 } E{𝑥`2𝑖𝑡 𝑖𝑡 𝑖 𝑖 1 is zero ∀ 𝑡 iff 𝛽1 = 0 or 𝑑𝑖 = 0 ∀ 𝑖 or both. For the third term, 0 [(1 − 𝑠 )𝑟 − (1 − 𝑠 )𝑟 ] 𝛽 } E{𝑥`2𝑖𝑡 𝑖𝑡 𝑖𝑡 𝑖 𝑖 1 is zero under Assumption 3.2.2. For the fourth term, 0 (𝑢 − 𝑢¯ )] E[ 𝑥`2𝑖𝑡 𝑖𝑡 𝑖 is zero ∀ 𝑡 under Assumption 3.2.1. Thus, for each of the moment conditions in (F.1) to be zero, we need either 𝛽1 = 0 or 𝜋2 = 𝑑𝑖 = 0 ∀ 𝑖. Proof of Proposition 3.4.3.1 Let the error 𝛽1 (1 − 𝑠𝑖𝑡 ) [ 𝑥¥2𝑖𝑡 (𝜋 − 𝜋ˆ 𝐼𝑚 𝑝 ) + 𝑟¥𝑖𝑡 ] + 𝑢¥𝑖𝑡 ≡ 𝑒¥𝑖𝑡 and let the set of regressors [𝑠𝑖𝑡 𝑥¥1𝑖𝑡 + (1 − 𝑠𝑖𝑡 ) 𝑥¥2𝑖𝑡 𝜋ˆ 𝐼𝑚 𝑝 𝑥¥2𝑖𝑡 ] ≡ 𝑧¥𝑖𝑡 117 The POLS estimator is 𝑧¥0𝑖𝑡 𝑧¥𝑖𝑡 −1 ÕÕ ÕÕ 𝛽ˆ 𝐼𝑚 𝑝 = 𝑧¥0𝑖𝑡 𝑦¥𝑖𝑡  (F.2) 𝑖 𝑡 𝑖 𝑡 ÕÕ  −1 ÕÕ = 𝛽 + 𝑁 −1 𝑧¥0𝑖𝑡 𝑧¥𝑖𝑡 𝑁 −1 𝑧¥0𝑖𝑡 𝑒¥𝑖𝑡 (G.0.1) 𝑖 𝑡 𝑖 𝑡 Consider the probability limit of the term 𝑁 −1 𝑖 𝑡 𝑧¥0𝑖𝑡 𝑒¥𝑖𝑡 . Plugging in the definitions of 𝑧¥𝑖𝑡 Í Í and 𝑒¥𝑖𝑡 , the first term is Õ Õ Õ 𝑝𝑙𝑖𝑚 𝑁 −1 𝑠𝑖𝑡 𝑥¥1𝑖𝑡 𝑢¥𝑖𝑡 = E(𝑠𝑖𝑡 𝑥¥1𝑖𝑡 𝑢¥𝑖𝑡 ) = 0 𝑡 𝑖 𝑡 This last equality is due to Assumption 3.2.1. The second term is Õ Õ Õ 𝑝𝑙𝑖𝑚 𝑁 −1 (1 − 𝑠𝑖𝑡 ) 𝜋ˆ 0𝐼𝑚 𝑝 𝑥¥2𝑖𝑡0 𝑢¥ = 𝑖𝑡 E[(1 − 𝑠𝑖𝑡 )𝜋0𝑥¥2𝑖𝑡 0 𝑢¥ ] = 0 𝑖𝑡 𝑡 𝑖 𝑡 where the last equality again holds because of Assumption 3.2.1. The third term is Õ Õ 𝑝𝑙𝑖𝑚 𝑁 −1 (1 − 𝑠𝑖𝑡 ) 𝜋ˆ 0𝐼𝑚 𝑝 𝑥¥2𝑖𝑡 0 [𝑟¥ + 𝑥¥ (𝜋 − 𝜋)] 𝑖𝑡 2𝑖𝑡 ˆ 𝛽1 𝑡 𝑖 Õ = E[(1 − 𝑠𝑖𝑡 )𝜋0𝑥¥2𝑖𝑡 0 𝑟¥ ] 𝛽 = 0 𝑖𝑡 1 𝑡 The second equality here holds because 𝜋ˆ 𝐼𝑚 𝑝 is a consistent estimator of 𝜋, and the third holds due to Assumption 3.2.2. The fourth term is Õ Õ Õ 𝑝𝑙𝑖𝑚 𝑁 −1 0 𝑢¥ = 𝑥¥2𝑖𝑡 𝑖𝑡 0 𝑢¥ ) = 0 E( 𝑥¥2𝑖𝑡 𝑖𝑡 𝑡 𝑖 𝑡 where the last equality holds due to Assumption 3.2.1. Finally, Õ Õ 𝑝𝑙𝑖𝑚 𝑁 −1 0 (1 − 𝑠 ) [𝑟¥ + 𝑥¥ (𝜋 − 𝜋)] 𝑥¥2𝑖𝑡 𝑖𝑡 𝑖𝑡 2𝑖𝑡 ˆ 𝛽1 𝑡 𝑖 Õ = E[(1 − 𝑠𝑖𝑡 ) 𝑥¥2𝑖𝑡 0 𝑟¥ ] 𝛽 = 0 𝑖𝑡 1 𝑡 as proved above. Since 𝑝𝑙𝑖𝑚 𝑁 −1 𝑖 𝑡 𝑧¥0𝑖𝑡 𝑒¥𝑖𝑡 = 0, from (F.2), 𝑝𝑙𝑖𝑚 𝛽ˆ 𝐼𝑚 𝑝 = 𝛽. Í Í Proof of Lemma 3.6.1 118 (i) Start with E[𝑚 1𝑖 (𝛽, 𝜋)] = 0. This will hold true if     𝑠𝑖 𝑝 𝑥 𝑠𝑖𝑡 𝑢˜𝑖 (𝑡)  0 1𝑖 𝑝 =      E  𝑥 0 𝑠 𝑢˜ (𝑡)  0  2𝑖 𝑝 𝑖𝑡 𝑖        Now, we can write  𝑇 Õ  E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢˜𝑖 (𝑡)] = E(𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢𝑖𝑡 ) − E 𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 𝑢𝑖𝑞 𝑞=𝑡+1 𝑇 Õ = E(𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢𝑖𝑡 ) − E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 𝑢𝑖𝑞 ] 𝑞=𝑡+1 𝑇 Õ = E(𝑠𝑖 𝑝 𝑠𝑖𝑡 ) E(𝑥 1𝑖 𝑝 𝑢𝑖𝑡 ) − E[𝑠𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 ] E(𝑥 1𝑖 𝑝 𝑢𝑖𝑞 ) 𝑞=𝑡+1 =0 The first equality follows from just the definition of 𝑢𝑖 (𝑡), the third follows from s𝑖 |= (x𝑖 , u𝑖 , r𝑖 ) and 0 𝑠 𝑢˜ (𝑡)] = 0. the last one follows from Assumption 3.6.1. Similarly, E[𝑥 2𝑖 𝑝 𝑖𝑡 𝑖 0 𝑠 𝑟˜ ] = 0. 
We can write Moving on to E[𝑚 2𝑖 (𝛽, 𝜋)] = 0, we need E[𝑥 2𝑖 𝑝 𝑖𝑡 𝑖𝑡  𝑇 Õ  0 𝑠 𝑟˜ (𝑡)] E[𝑥2𝑖 = 0 𝑠 𝑟 ) E(𝑥 2𝑖 −E 0 𝑠 𝑇 (𝑡) −1 𝑥 2𝑖 𝑠𝑖𝑞 𝑟𝑖𝑞 𝑝 𝑖𝑡 𝑖 𝑝 𝑖𝑡 𝑖𝑡 𝑝 𝑖𝑡 𝑖 𝑞=𝑡+1 𝑇 Õ   = 0 𝑠 𝑟 ) E(𝑥 2𝑖 − E 0 𝑠 𝑇 (𝑡) −1 𝑠 𝑟 𝑥 2𝑖 𝑝 𝑖𝑡 𝑖𝑡 𝑝 𝑖𝑡 𝑖 𝑖𝑞 𝑖𝑞 𝑞=𝑡+1 𝑇 Õ = 0 𝑟 ) E(𝑠𝑖𝑡 ) E(𝑥 2𝑖 − E[𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 ] E(𝑥 2𝑖 0 𝑟 ) 𝑝 𝑖𝑡 𝑝 𝑖𝑞 𝑞=𝑡+1 =0 The third equality follows from s𝑖 |= (x𝑖 , u𝑖 , r𝑖 ) and the last one follows from Assumption 3.6.2. Finally we consider the third set of moment conditions E[𝑚 2𝑖 (𝛽, 𝜋)] = 0, for which we need 119 0 (1 − 𝑠 ) 𝑣˘ ] = 0. We can write E[𝑥 0 (1 − 𝑠 ) 𝑣˘ (𝑡)] equals E[𝑥2𝑖 𝑝 𝑖𝑡 𝑖𝑡 2𝑖 𝑝 𝑖𝑡 𝑖 Õ𝑇 E[𝑥 2𝑖0 (1 − 𝑠 )𝑣 ] − E[𝑥 2𝑖 0 (1 − 𝑠 )(𝑇 − 𝑡 − 𝑇𝑖 (𝑡)) −1 (1 − 𝑠𝑖𝑞 )𝑣𝑖𝑞 ] 𝑝 𝑖𝑡 𝑖𝑡 𝑝 𝑖𝑡 𝑞=𝑡+1 Õ 𝑇 = E[𝑥2𝑖 0 (1 − 𝑠 )𝑣 ] − E[𝑥2𝑖 0 (1 − 𝑠 )(𝑇 − 𝑡 − 𝑇 (𝑡)) −1 (1 − 𝑠 )𝑣 ] 𝑝 𝑖𝑡 𝑖𝑡 𝑝 𝑖𝑡 𝑖 𝑖𝑞 𝑖𝑞 𝑞=𝑡+1 Õ𝑇 = E(1 − 𝑠𝑖𝑡 ) E(𝑥 2𝑖 0 𝑣 )− E[(1 − 𝑠𝑖𝑡 )(𝑇 − 𝑡 − 𝑇𝑖 (𝑡)) −1 (1 − 𝑠𝑖𝑞 )] E(𝑥 2𝑖 0 𝑣 ) 𝑝 𝑖𝑡 𝑝 𝑖𝑞 𝑞=𝑡+1 =0 where the last equality follows from E(𝑥 2𝑖 0 𝑣 ) = 0 which follows from Assumptions 3.6.1 and 𝑝 𝑖𝑞 3.6.2. (ii) Starting with E[𝑚 1𝑖 (𝛽, 𝜋)] = 0, we can first write Õ𝑇 E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢˜𝑖 (𝑡)] = E(𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢𝑖𝑡 ) − E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 𝑢𝑖𝑞 ] 𝑞=𝑡+1 Õ𝑇 = E[E(𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑢𝑖𝑡 |x𝑖𝑡 , s𝑖 )] − E{E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 𝑢𝑖𝑞 |x𝑖𝑡 , s𝑖 ]} 𝑞=𝑡+1 Õ𝑇 = E[𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 E(𝑢𝑖𝑡 |x𝑖𝑡 , s𝑖 )] − E{𝑠𝑖 𝑝 𝑥 1𝑖 𝑝 𝑠𝑖𝑡 𝑇𝑖 (𝑡) −1 𝑠𝑖𝑞 E[𝑢𝑖𝑞 |x𝑖𝑡 , s𝑖 ]} 𝑞=𝑡+1 =0 The second equality follows from the LIE and the fourth follows from Assumption 3.6.1’. This is because using the LIE, Assumption 3.6.1’ implies that E(𝑢𝑖𝑡 |x𝑖𝑡 , s𝑖 ) = 0 for every 𝑡 = 1, . . . , 𝑇. 𝑞 Moreover, since E(𝑢𝑖𝑞 |x𝑖 , s𝑖 ) = 0 for 𝑞 = 𝑡 + 1, . . . , 𝑇, using the LIE implies that E(𝑢𝑖𝑞 |x𝑖𝑡 , s𝑖 ) = 0 for any 𝑡 < 𝑞. Similarly, E[𝑥 2𝑖 0 𝑠 𝑢˜ (𝑡)] = 0. 𝑝 𝑖𝑡 𝑖 We can write a similar proof for E[𝑚 2𝑖 (𝛽, 𝜋)] = 0 using the LIE and Assumption 3.6.2’. For 120 E[𝑚 3𝑖 (𝛽, 𝜋)] = 0, write Õ𝑇 0 (1 − 𝑠 ) 𝑣˘ (𝑡)] E[𝑥2𝑖 = 0 (1 − 𝑠 )𝑣 ] E[𝑥 2𝑖 − 0 (1 − 𝑠 )(𝑇 − 𝑡 − 𝑇 (𝑡)) −1 (1 − 𝑠 )𝑣 ] E[𝑥 2𝑖 𝑝 𝑖𝑡 𝑖 𝑝 𝑖𝑡 𝑖𝑡 𝑝 𝑖𝑡 𝑖 𝑖𝑞 𝑖𝑞 𝑞=𝑡+1 = E{E[𝑥 2𝑖0 (1 − 𝑠 )𝑣 |𝑥 𝑡 , s ]} 𝑝 𝑖𝑡 𝑖𝑡 2𝑖 𝑖 Õ𝑇 − E{E[𝑥 2𝑖 0 (1 − 𝑠 )(𝑇 − 𝑡 − 𝑇 (𝑡)) −1 (1 − 𝑠 )𝑣 |𝑥 𝑡 , s ]} 𝑝 𝑖𝑡 𝑖 𝑖𝑞 𝑖𝑞 2𝑖 𝑖 𝑞=𝑡+1 0 (1 − 𝑠 ) E[𝑣 |𝑥 𝑡 , s ]} = E{𝑥 2𝑖 𝑝 𝑖𝑡 𝑖𝑡 2𝑖 𝑖 Õ𝑇 − 0 (1 − 𝑠 )(𝑇 − 𝑡 − 𝑇 (𝑡)) −1 (1 − 𝑠 ) E[𝑣 |𝑥 𝑡 , s ]} E{𝑥2𝑖 𝑝 𝑖𝑡 𝑖 𝑖𝑞 𝑖𝑞 2𝑖 𝑖 𝑞=𝑡+1 =0 where the second equality follows from the LIE and the fourth from Assumptions 3.6.1’ and 3.6.2’. This is because using the LIE and the fact that 𝑣𝑖𝑡 = 𝛽1𝑟𝑖𝑡 + 𝑢𝑖𝑡 , Assumptions 3.6.1’ and 3.6.2’ imply 𝑞 that E(𝑣𝑖𝑡 |x𝑡2𝑖 , s𝑖 ) = 0 for every 𝑡 = 1, . . . , 𝑇. Moreover, since E(𝑣𝑖𝑞 |x2𝑖 , s𝑖 ) = 0 for 𝑞 = 𝑡 + 1, . . . , 𝑇, using the LIE implies that E(𝑣𝑖𝑞 |x𝑡2𝑖 , s𝑖 ) = 0 for any 𝑡 < 𝑞. 121 APPENDIX H EXTENSIONS TO CHAPTER 3 H.1 Missing vectors In the model of interest (3.2.1), we assumed that 𝑥 1𝑖𝑡 is a scalar. We can extend this framework to the case where 𝑥 1𝑖𝑡 is a 𝑚 × 1 vector, all elements of which are missing at the same time. In other words, if one element of 𝑥1𝑖𝑡 is missing for observation 𝑖 at time 𝑡, then so are all the other elements of 𝑥 1𝑖𝑡 . This does not fundamentally change the analysis and the single missing data indicator 𝑠𝑖𝑡 is still sufficient to characterize missingness. The population model is given by 𝑦𝑖𝑡 = 𝑥 1𝑖𝑡 𝛽1 + 𝑥 2𝑖𝑡 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑡 ≡ 𝑥𝑖𝑡 𝛽 + 𝑐𝑖 + 𝑢𝑖𝑡 , 𝑡 = 1, . . . , 𝑇, (H.1.1) which is the same as equation (3.2.1) except 𝑥 1𝑖𝑡 is a 1 × 𝑚 vector now. The imputation equations are a set of 𝑚 equations (one for each element in 𝑥1𝑖𝑡 ). 
𝑥1𝑖𝑡 = 𝑥 2𝑖𝑡 Π + 𝑑𝑖 + 𝑟𝑖𝑡 (H.1.2) where Π is a 𝑘 × 𝑚 matrix and 𝑑𝑖 is a 1 × 𝑚 vector. The reduced form is 𝑦𝑖𝑡 = (𝑥 2𝑖𝑡 Π + 𝑑𝑖 + 𝑟𝑖𝑡 ) 𝛽1 + 𝑥 2𝑖𝑡 𝛽2 + 𝑐𝑖 + 𝑢𝑖𝑡 ≡ 𝑥 2𝑖𝑡 𝛾 + ℎ𝑖 + 𝑣𝑖𝑡 , (H.1.3) where 𝛾 ≡ Π𝛽1 + 𝛽2 , ℎ𝑖 ≡ 𝑑𝑖 𝛽1 + 𝑐𝑖 , and 𝑣𝑖𝑡 ≡ 𝑟𝑖𝑡 𝛽1 + 𝑢𝑖𝑡 . Since all elements of 𝑥 1𝑖𝑡 are missing at the same time, the definition of the missing data indicator given in section 3 is still sufficient to characterize missingness. That is, 𝑠𝑖𝑡 = 1 if 𝑥 1𝑖𝑡 is observed and 0 otherwise. Then the joint GMM is based on the following set of moment functions. 0  Í𝑇    𝑡=1 𝑠𝑖𝑡 𝑥¥𝑖𝑡 ( 𝑦¥𝑖𝑡 − 𝑥¥1𝑖𝑡 𝛽1 − 𝑥¥2𝑖𝑡 𝛽2 )    𝑓1𝑖 (𝛽, Π)          𝑓𝑖 (𝛽, Π) =  Í 𝑇 𝑠 𝑥¥0 ⊗ ( 𝑥¥ − 𝑥¥ Π) 0  ≡  𝑓 (𝛽, Π)  (H.1.4) 𝑡=1 𝑖𝑡 2𝑖𝑡 1𝑖𝑡 2𝑖𝑡   2𝑖  Í     𝑇 0 𝑦¤ − 𝑥¤ (𝛽 Π + 𝛽 )   𝑡=1 (1 − 𝑠𝑖𝑡 ) 𝑥¤2𝑖𝑡    𝑓 (𝛽, Π)   𝑖𝑡 2𝑖𝑡 1 2    3𝑖   122 This is a set of 𝑘 (2 + 𝑚) + 𝑚 moment conditions with 𝑘 (1 + 𝑚) + 𝑚 parameters to estimate. Thus the number of over-identifying restrictions still equals 𝑘. Note that 𝑓2𝑖 (.) is still a set of exactly identified moment functions, and hence Lemma 3.4.2 is still valid. The rest of the GMM estimation proceeds the same way as in Section 4, except the matrices 𝐶 and 𝐷 are now based on the moment conditions in (G.4). This framework can further be extended to the case where the elements of 𝑥1𝑖𝑡 are not missing at the same time. Although it leads to loss of some information in this case, it is still more efficient than using the complete case analysis. For instance, consider the case where in equation (3.2.1), 𝑥 1𝑖𝑡 = [𝑤𝑖𝑡 𝑤𝑖,𝑡−1 ], where 𝑤𝑖𝑡 is a policy variable. If 𝑤𝑖𝑡 contains missing values, then so does 𝑤𝑖,𝑡−1 . In this case, the missingness cannot be entirely characterized with a single missing data indicator as 𝑤𝑖𝑡 and 𝑤𝑖,𝑡−1 are missing in different time periods for observation 𝑖. We define the selection indicators as the following.   1 if both 𝑤𝑖𝑡 and 𝑤𝑖,𝑡−1 are observed 𝑡 = 1, ..., 𝑇    𝑠1𝑖𝑡 =  0 otherwise      1 if neither 𝑤𝑖𝑡 nor 𝑤𝑖,𝑡−1 is observed 𝑡 = 1, ..., 𝑇    𝑠2𝑖𝑡 =  0 otherwise    Thus, the complete cases are those time periods for individual 𝑖 for which 𝑤𝑖 is observed in both the current and the previous period, and are characterized by 𝑠1𝑖𝑡 = 1. One option in this case is to estimate 𝛽 using the complete cases fixed effects, as discussed in Section 4. However, we can also use the joint GMM by utilizing the observations for which 𝑠2𝑖𝑡 = 1. Note that 𝑠2𝑖𝑡 does not characterize all the incomplete cases. It is equal to 1 only for the observations for which neither 𝑤𝑖𝑡 nor 𝑤𝑖,𝑡−1 is observed, and 0 for both the complete cases as well as the observations for which either 𝑤𝑖𝑡 or 𝑤𝑖,𝑡−1 is observed. It thus does not make use of the observations for which both 𝑠1𝑖𝑡 and 𝑠2𝑖𝑡 are 0. We impose the following assumption on the population distribution. 123 Assumption G.1 For every 𝑡 = 1, . . . , 𝑇, (i) E(𝑠1𝑖𝑡 𝑥¥𝑖𝑡0 𝑢𝑖𝑡 ) = 0 (ii) E(𝑠1𝑖𝑡 𝑥¥2𝑖𝑡 0 𝑟 ) = 0 (iii) 𝑖𝑡 0 𝑣 )=0 E(𝑠2𝑖𝑡 𝑥¤2𝑖𝑡 𝑖𝑡 The joint GMM is then based on the following moment functions. 
0 ( 𝑦¥ − 𝑥¥ 𝛽 − 𝑥¥ 𝛽 )   Í𝑇     𝑠 1𝑖𝑡 ¥ 𝑥 𝑖𝑡 1𝑖𝑡 1 2𝑖𝑡 2  𝑓1𝑖 (𝛽, Π)   𝑡=1 𝑖𝑡        𝑓𝑖 (𝛽, 𝜋) =   Í𝑇 𝑠 ¥ 𝑥 0 ( ¥ 𝑥 − 𝑥¥ 𝜋)  ≡  𝑓2𝑖 (𝛽, Π)   (H.1.5) 𝑡=1 1𝑖𝑡 2𝑖𝑡 1𝑖𝑡 2𝑖𝑡  Í  𝑇 0      𝑡=1 𝑠2𝑖𝑡 𝑥¤2𝑖𝑡 𝑦¤ 𝑖𝑡 − 𝑥¤2𝑖𝑡 (𝛽1 𝜋 + 𝛽2 )   𝑓3𝑖 (𝛽, Π)       where Õ 𝑇 Õ 𝑇 𝑥¥𝑖𝑡 = 𝑥𝑖𝑡 − ( 𝑠1𝑖𝑡 ) −1 𝑠1𝑖𝑞 𝑥𝑖𝑞 𝑞=1 𝑞=1 Õ 𝑇 Õ 𝑇 𝑦¥𝑖𝑡 = 𝑦𝑖𝑡 − ( 𝑠1𝑖𝑡 ) −1 𝑠1𝑖𝑞 𝑦𝑖𝑞 𝑞=1 𝑞=1 Õ 𝑇 Õ 𝑇 𝑥¤𝑖𝑡 = 𝑥𝑖𝑡 − ( 𝑠2𝑖𝑡 ) −1 𝑠2𝑖𝑞 𝑥𝑖𝑞 𝑞=1 𝑞=1 Õ 𝑇 Õ 𝑇 𝑦¤ 𝑖𝑡 = 𝑦𝑖𝑡 − ( 𝑠2𝑖𝑡 ) −1 𝑠2𝑖𝑞 𝑦𝑖𝑞 𝑞=1 𝑞=1 That is, for 𝑓1𝑖 (.) and 𝑓2𝑖 (.), the variables are still time demeaned using the complete cases, but for 𝑓3𝑖 (.), they are time demeaned using only the observations for which neither 𝑤𝑖𝑡 nor 𝑤𝑖,𝑡−1 is observed. Note that the moment functions 𝑓2𝑖 (.) imply that both 𝑤𝑖𝑡 and 𝑤𝑖,𝑡−1 will be imputed using the same covariates 𝑥 2𝑖𝑡 The rest of the GMM estimation proceeds in the usual fashion using the moment functions in (G.5). In order to utilize all the incomplete cases, we can further extend this framework by introducing a separate selection indicator for 𝑤𝑖𝑡 and 𝑤𝑖,𝑡−1 and writing a separate imputation equation (with different sets of covariates) for each of these. 124 H.2 Time varying unobserved heterogeneity We can extend the basic model in Section 2 to allow for the unobserved heterogeneity to vary over time. So instead of equation (3.2.1), our model of interest is now 𝑦𝑖𝑡 = 𝑥𝑖𝑡 𝛽 + 𝜂𝑡 𝑐𝑖 + 𝑢𝑖𝑡 , 𝑡 = 1, . . . , 𝑇 . (H.2.1) The coefficients of 𝑐𝑖 are now 𝜂𝑡 which are time-varying parameters to be estimated. We also allow for time-varying heterogeneity in the imputation model. The new model is 𝑥 1𝑖𝑡 = 𝑥 2𝑖𝑡 𝜋 + 𝜁𝑡 𝑑𝑖 + 𝑟𝑖𝑡 , 𝑡 = 1, . . . , 𝑇 . (H.2.2) The reduced form then becomes 𝑦𝑖𝑡 = 𝑥 2𝑖𝑡 𝛾 + ℎ𝑖𝑡 + 𝑣𝑖𝑡 , 𝑡 = 1, . . . , 𝑇 (H.2.3) where 𝛾 ≡ 𝛽1 𝜋 + 𝛽2 , ℎ𝑖𝑡 ≡ 𝛽1 𝜁𝑡 𝑑𝑖 + 𝜂𝑡 𝑐𝑖 , and 𝑣𝑖𝑡 ≡ 𝛽1𝑟𝑖𝑡 + 𝑢𝑖𝑡 .1 The question we consider here is that under what assumptions will the joint GMM defined in Section 3 consistently estimates 𝛽 and 𝜋. Starting with equation (G.6), if we time demean using the complete cases, we get 𝑦¥𝑖𝑡 = 𝑥¥𝑖𝑡 𝛽 + 𝜂¥𝑡 𝑐𝑖 + 𝑢¥𝑖𝑡 , 𝑡 = 1, . . . , 𝑇, (H.2.4) where 𝑦¥𝑖𝑡 , 𝑥¥𝑖𝑡 , and 𝑢¥𝑖𝑡 are defined in the same way as in Section 3. But now, this transformation does not eliminate 𝑐𝑖 . Therefore, for the moment conditions E[ 𝑓1𝑖 (𝛽)] = 0 in (3.4.10) to be valid, we need for every 𝑡 = 1, . . . , 𝑇 E[𝑠𝑖𝑡 𝑥¥𝑖𝑡0 ( 𝜂¥𝑡 𝑐𝑖 + 𝑢¥𝑖𝑡 )] = 0. (H.2.5) We know that for every 𝑡 = 1, . . . , 𝑇 E(𝑠𝑖𝑡 𝑥¥𝑖𝑡0 𝑢¥𝑖𝑡 ) = 0 (H.2.6) under Assumption 3.3.2. We additionally need that for every 𝑡 = 1, . . . , 𝑇 E(𝑠𝑖𝑡 𝑥¥𝑖𝑡0 𝜂¥𝑡 𝑐𝑖 ) = 0. (H.2.7) 1Note that the definitions of 𝛾 and 𝑣𝑖𝑡 are the same as those in Section 2. Only the unobserved heterogeneity has changed. 125 A sufficient condition for this to hold is that for every 𝑡 = 1, . . . , 𝑇 E(𝑐𝑖 | 𝑥¥𝑖𝑡 , 𝑠𝑖 ) = 0. (H.2.8) This says that at time 𝑡, the unobserved heterogeneity 𝑐𝑖 is mean independent of the time deviated 𝑥𝑖𝑡 and selection in all time periods. This is clearly stronger than Assumption 3.3.2 which did not put any restriction on the relationship between 𝑠𝑖 and 𝑐𝑖 . However, it is weaker than assuming 𝑐𝑖 is mean independent of 𝑥𝑖𝑡 . We are only assuming that it is mean independent of the time deviated 𝑥𝑖𝑡 , that is 𝑥¥𝑖𝑡 . Similarly, when we time demean the new imputation model (G.7), we get 𝑥¥1𝑖𝑡 = 𝑥¥2𝑖𝑡 𝜋 + 𝜁¥𝑡 𝑑𝑖 + 𝑟¥𝑖𝑡 , 𝑡 = 1, . . . , 𝑇 . (H.2.9) For the moment conditions E[ 𝑓2𝑖 (𝜋)] = 0 in (3.4.10) to be valid, we need that for every 𝑡 = 1, . . . , 𝑇 0 ( 𝜁¥ 𝑑 + 𝑟¥ )] = 0 E[𝑠𝑖𝑡 𝑥¥2𝑖𝑡 (H.2.10) 𝑡 𝑖 𝑖𝑡 for which we need to assume that for every 𝑡 = 1, . . . 
, 𝑇

E(𝑑𝑖 | 𝑥¥2𝑖𝑡 , 𝑠𝑖 ) = 0 (H.2.11)

in addition to Assumption 3.3.2. Similarly, the time-deviated reduced form is

𝑦¥𝑖𝑡 = 𝑥¥2𝑖𝑡 𝛾 + ℎ¥𝑖𝑡 + 𝑣¥𝑖𝑡 , 𝑡 = 1, . . . , 𝑇 . (H.2.12)

It is easy to see that, given equation (G.17), Assumptions (G.13) and (G.16) along with Assumption 3.3.2 are sufficient for the moment conditions E[𝑓3𝑖(𝛽, 𝜋)] = 0 in (3.4.10) to be valid.

REFERENCES

Abrevaya, J., & Donald, S. G. (2011). A GMM approach for dealing with missing data on regressors and instruments. Unpublished manuscript.

Abrevaya, J., & Donald, S. G. (2017). A GMM approach for dealing with missing data on regressors. Review of Economics and Statistics, 99(4), 657–662.

Ahn, S. C., & Schmidt, P. (1995). A separability result for GMM estimation, with applications to GLS prediction and conditional moment tests. Econometric Reviews, 14(1), 19–34.

Angrist, J. D., & Krueger, A. B. (1992). The effect of age at school entry on educational attainment: An application of instrumental variables with moments from two samples. Journal of the American Statistical Association, 87(418), 328–336.

Angrist, J. D., & Krueger, A. B. (1995). Split-sample instrumental variables estimates of the return to schooling. Journal of Business & Economic Statistics, 13(2), 225–235.

Arellano, M., & Meghir, C. (1992). Female labour supply and on-the-job search: An empirical model estimated using complementary data sets. The Review of Economic Studies, 59(3), 537–559.

Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling. National Bureau of Economic Research, Cambridge, Mass., USA.

Dagenais, M. G. (1973). The use of incomplete observations in multiple regression analysis: A generalized least squares approach. Journal of Econometrics, 1(4), 317–328.

Devereux, P. J., & Hart, R. A. (2010). Forced to be rich? Returns to compulsory schooling in Britain. The Economic Journal, 120(549), 1345–1364.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, 1029–1054.

Hellerstein, J. K., & Imbens, G. W. (1999). Imposing moment restrictions from auxiliary data by weighting. Review of Economics and Statistics, 81(1), 1–14.

Hentschel, J., Lanjouw, J. O., Lanjouw, P., & Poggi, J. (2000). Combining census and survey data to trace the spatial dimensions of poverty: A case study of Ecuador. The World Bank Economic Review, 14(1), 147–165.

Inoue, A., & Solon, G. (2005). Two-sample instrumental variables estimators. NBER Working Paper (t0311).

Inoue, A., & Solon, G. (2010). Two-sample instrumental variables estimators. The Review of Economics and Statistics, 92(3), 557–561.

Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91(433), 222–230.

Klevan, S., Weinberg, S. L., & Middleton, J. A. (2016). Why the boys are missing: Using social capital to explain gender differences in college enrollment for public high school students. Research in Higher Education, 57(2), 223–257.

Klevmarken, N. A. (1982). Missing variables and two-stage least-squares estimation from more than one data set (Tech. Rep.). Research Institute of Industrial Economics.

Little, R. J. (1992). Regression with missing X's: A review. Journal of the American Statistical Association, 87(420), 1227–1237.

Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data (Vol. 333). John Wiley & Sons.

Loureiro, M. L., & Nayga Jr, R. M. (2006). Obesity, weight loss, and physician's advice. Social Science & Medicine, 62(10), 2458–2468.
Loureiro, M. L., & Nayga Jr, R. M. (2007). Physician's advice affects adoption of desirable dietary behaviors. Review of Agricultural Economics, 29(2), 318–330.

MaCurdy, T., Mroz, T., & Gritz, R. M. (1998). An evaluation of the National Longitudinal Survey on Youth. The Journal of Human Resources, 33(2), 345–436.

McDonough, I. K., & Millimet, D. L. (2017). Missing data, imputation, and endogeneity. Journal of Econometrics, 199(2), 141–155.

Mogstad, M., & Wiswall, M. (2012). Instrumental variables estimation with partially missing instruments. Economics Letters, 114(2), 186–189.

Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245.

Ortega-Sanchez, R., Jimenez-Mena, C., Cordoba-Garcia, R., Muñoz-Lopez, J., Garcia-Machado, M. L., & Vilaseca-Canals, J. (2004). The effect of office-based physician's advice on adolescent exercise behavior. Preventive Medicine, 38(2), 219–226.

Pacini, D., & Windmeijer, F. (2016). Robust inference for the two-sample 2SLS estimator. Economics Letters, 146, 50–54.

Prokhorov, A., & Schmidt, P. (2009). GMM redundancy results for general missing data problems. Journal of Econometrics, 151(1), 47–55.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.

Rubin, D. B. (1987). Multiple imputation for non-response in surveys. John Wiley & Sons.

Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8(1), 3–15.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147.

Secker-Walker, R. H., Solomon, L. J., Flynn, B. S., Skelly, J. M., & Mead, P. B. (1998). Reducing smoking during pregnancy and postpartum: Physician's advice supported by individual counseling. Preventive Medicine, 27(3), 422–430.

White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399.

Wooldridge, J. M. (2007). Inverse probability weighted estimation for general missing data problems. Journal of Econometrics, 141(2), 1281–1301.

Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT Press.