QUASI-MAXIMUM LIKELIHOOD ESTIMATION METHODS WITH A CONTROL FUNCTION APPROACH TO ENDOGENEITY

By

Doosoo Kim

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics – Doctor of Philosophy

2017

ABSTRACT

QUASI-MAXIMUM LIKELIHOOD ESTIMATION METHODS WITH A CONTROL FUNCTION APPROACH TO ENDOGENEITY

By

Doosoo Kim

One of the fundamental problems in econometrics is the potential endogeneity in non-experimental data. This work focuses on econometric methods taking a control function approach to endogeneity. The agenda consists of two parts. In the first part, I study a general class of conditional mean regression methods with a control function, and their relative asymptotic efficiency relationship. Unlike previous results in the literature, the likelihood for the response variables can be incorrect up to the regression functions. My results provide more practical and general guidance on the choice of an estimator. In the second part, I propose a generalized Chamberlain device as a control function approach to time-invariant endogeneity in linear panel data quantile regression models with a finite time dimension. The new correlated effect (CE) estimator has substantial advantages compared to existing methods: (i) it is free of an incidental parameters problem, (ii) the correlated effect is not restricted to a linear functional form, and (iii) arbitrary within-group dependence of regression errors is allowed. Due to the high dimensionality of the control function, a nonconvex penalized estimator is adopted for sparse model selection.

In the first chapter, I study the asymptotic relative efficiency relationship among estimators based on a quasi-limited information likelihood (QLIL). First, I show that there exists a generalized method of moments estimator (GMM-QLIML) based on all the available quasi-scores.
Second, the quasi-limited information maximum likelihood estimator (QLIML) is shown to be as efficient as GMM-QLIML under a set of generalized information matrix equalities. Third, I show that in a fully robust estimation of correctly specified conditional mean functions, QLIML is efficient relative to a two-step control function approach when the generalized linear model variance assumptions hold with a scaling restriction.

When a limited information structure is over-identified, the classical minimum distance (MD) estimator is often proposed as an estimation method. The purpose of the second chapter is to study its relative asymptotic efficiency relationship with respect to QLIML and the two-step control function (CF) approach. First, I show that the MD estimator is asymptotically efficient relative to the two other estimators. Second, I prove that the concentration of reduced form equation estimates does not affect the asymptotic efficiency of the structural parameter estimates in the MD estimation. Third, in a class of models, an if-and-only-if condition is derived for MD and the other estimators to be asymptotically equivalent under the null hypothesis of exogeneity.

In the third chapter, I propose a point-identifying restriction and estimation procedure for a linear panel data quantile regression model with a fixed time dimension. The proposed model restriction reasonably accounts for the τ-quantile-specific time-invariant heterogeneity, and allows arbitrary within-group dependence of regression errors. The generalized Chamberlain device is taken analogously as a control function to capture τ-quantile-specific time-invariant endogenous variations. Since the sieve-approximated control function is high-dimensional, the estimation procedure adopts penalization techniques under the sparsity assumption. A transformation of the sieve elements into a generalized Mundlak form is considered to make the sparsity assumption more plausible in some cases.
The empirical application to birth weight analysis demonstrates a convincing case where the proposed estimator works as intended in real data.

Copyright by
DOOSOO KIM
2017

Dedicated to my parents, and Kyuseon.

ACKNOWLEDGEMENTS

First of all, I'd like to thank my family members for their endless support. My wonderful parents and my great wife, Kyuseon (Kristy), always believed in me and encouraged me even when I was in deep trouble. All the support they have provided me over the years has made this work possible.

I would like to thank my advisor Jeffrey M. Wooldridge for his valuable advice and support. In addition to his insightful feedback on details of my research, his strong encouragement and optimistic view were essential ingredients for my work. He always led me to push my limits and achieve beyond my imagination. I deeply appreciate him giving me such great inspiration.

I am grateful to my dissertation committee members, Peter Schmidt and Kyoo il Kim, for their rigorous and helpful feedback. Thanks to their brilliant comments, I could significantly improve the quality of my work. They were also a huge inspiration to me at all times.

I appreciate the wonderful support from the Department of Economics at Michigan State University. The comfortable work environment the department provided was crucial. Special thanks to the Chairs of the Department and Graduate Directors: Carl Davidson, Tim Vogelsang, Leslie E. Papke, and Todd Elder. I am also grateful to the following university staff: Belen Feight, Jay Feight, Margaret Lynch, Lori Jean Nichols, and Dean Olson for their unfailing support and assistance.

Among the great Ph.D. students in the Department of Economics, very special gratitude goes to my study group members: Muzna Alvi, Patrick Burke, Annie Chou, Po-Chun Huang, Riju Joshi, and Danielle Kaminski. It was fantastic to have the opportunity to work and interact with them. The moments we shared together greatly enriched my life in East Lansing.
Thank you all!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS
CHAPTER 1  RELATIVE EFFICIENCY OF QUASI-LIMITED INFORMATION MAXIMUM LIKELIHOOD ESTIMATOR
  1.1 Introduction
  1.2 Model Restrictions
  1.3 Relative Efficiency Comparison
  1.4 Concluding Remarks
CHAPTER 2  EFFICIENT MINIMUM DISTANCE ESTIMATOR BASED ON QUASI-LIMITED INFORMATION LIKELIHOOD
  2.1 Introduction
  2.2 Model Restrictions
  2.3 Minimum Distance Estimators: MD/cMD-QLIML
  2.4 Example 1: Linear Regression Model with Endogenous Explanatory Variables
  2.5 Example 2: Probit with Endogenous Explanatory Variables
  2.6 Monte Carlo Simulation on Probit Model with EEV
CHAPTER 3  SHORT PANEL DATA QUANTILE REGRESSION MODEL WITH SPARSE CORRELATED EFFECTS
  3.1 Introduction
  3.2 Literature on Linear Panel Data Quantile Regression
  3.3 Identification
    3.3.1 Generalized Chamberlain Device
    3.3.2 Model Restriction and Identification
    3.3.3 Case of Unbalanced Panel Data with Time-constant Endogeneity
  3.4 Estimation
    3.4.1 Sieve-approximated Correlated Effect
    3.4.2 Penalized Estimation via Non-convex Penalty Functions
      3.4.2.1 Choice of Thresholding Parameter
      3.4.2.2 Computation of Variance Estimators
  3.5 Monte Carlo Simulation
  3.6 Application: The Effect of Smoking on Birth Outcomes
  3.7 Concluding Remarks
APPENDICES
  APPENDIX A  An Appendix for Chapter 1
  APPENDIX B  An Appendix for Chapter 2
  APPENDIX C  An Appendix for Chapter 3
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1  Root Mean Squared Error and Standard Deviation (η = 0.6)
Table 3.1  Selection Performance, DGP 1 and 2
Table 3.2  Estimator performance, DGP 1 and 2, β1
Table 3.3  Birthweight, mean and median regression, all moms (unit: grams)
Table 3.4  Birthweight, quantile regression with CE, all moms (unit: grams)
Table 3.5  Coefficient Estimates on 'Smoke' using Different ICs (unit: grams)
Table A.1  Estimator performance, DGP 1, β2
Table A.2  Estimator performance, DGP 2, β2
Table A.3  Selection Performance, DGP 3
Table A.4  Estimator performance, DGP 3, β1
Table A.5  Estimator performance, DGP 3, β2
Table A.6  Selection Performance, DGP 4
Table A.7  Estimator performance, DGP 4, β1
Table A.8  Estimator performance, DGP 4, β2
Table A.9  Selection Performance, DGP 5
Table A.10 Estimator performance, DGP 5, β1
Table A.11 Estimator performance, DGP 5, β2
Table A.12 Birthweight, pooled quantile regression, all moms (unit: grams)
Table A.13 Birthweight, quantile regression with Classical CRE, all moms (unit: grams)

LIST OF FIGURES

Figure A.1  Pooled Birth Weights
KEY TO ABBREVIATIONS

2SLS Two-Stage Least Squares
AGLS Amemiya's Generalized Least Squares
AIC Akaike Information Criterion
BIC Bayesian Information Criterion
CAN Consistent and Asymptotically Normal
CE Correlated Effect
CF Control Function
DGP Data Generating Process
GMM Generalized Method of Moments
LEF Linear Exponential Family
LIL Limited Information Likelihood
LIML Limited Information Maximum Likelihood
MCP Minimax Concave Penalty
MD Minimum Distance
QLIML Quasi-Limited Information Maximum Likelihood
QMLE Quasi-Maximum Likelihood Estimator
RMSE Root-Mean-Square Error
SCAD Smoothly Clipped Absolute Deviation
SD Standard Deviation

CHAPTER 1
RELATIVE EFFICIENCY OF QUASI-LIMITED INFORMATION MAXIMUM LIKELIHOOD ESTIMATOR

1.1 Introduction

Limited information likelihood (LIL)-based estimators have been widely used in instrumental variable estimation. The limited information maximum likelihood (LIML) estimator (Anderson and Rubin, 1949) and the two-stage least squares (2SLS) estimator (Theil, 1953; Basmann, 1957; Sargan, 1958) for linear models are workhorses of many empirical studies. In simultaneous Probit models, the analogously proposed LIML and two-stage conditional maximum likelihood estimators (Rivers and Vuong, 1988) are useful extensions of LIML and 2SLS to a nonlinear model. While correct specification of likelihoods has been assumed in the LIL literature, it is known that a certain class of maximum likelihood estimators has nice robustness against misspecification: the quasi-maximum likelihood estimator (QMLE) is fully robust for a correctly specified conditional mean if and only if the likelihood is in the linear exponential family (LEF), under mild regularity conditions (Gouriéroux, Monfort and Trognon, 1984; White, 1994). Based on this result, Wooldridge (2014) reinterprets the LIL as a quasi-limited information likelihood (QLIL) and expands its applicability, noting that correctly specified regression functions are the key assumptions for consistency in the LEF.
Apart from the robustness of QLIL-based estimators, their relative efficiency relationship is another important issue. Relative efficiency analyses of LIML or equivalent estimators in previous work assume away potentially misspecified likelihoods for both the structural and reduced form equations. When the likelihood is allowed to be misspecified, a relative efficiency comparison based on correct specification of the likelihood is no longer valid. Analysis accounting for potential misspecification of the likelihood is more useful to empirical researchers because economic theories usually do not imply a full characterization of distributions, and there is no solid reason to believe that QMLE achieves the same asymptotic efficiency as the maximum likelihood estimator.

The purpose of this chapter is to study the asymptotic relative efficiency relationship among estimators based on the QLIL. Considering a research question raised by Wooldridge (2014), I focus on sufficient conditions for relative efficiency of the QLIL maximizer with respect to the two-step conditional quasi-likelihood maximizer, which will be called the QLIML estimator and the control function (CF) approach, respectively. The CF estimator is naturally defined once we take the conventional decomposition of the QLIL into structural and reduced form components. The model restriction imposed on the QLIL is general enough to include nonlinear models and, in particular, misspecification of the likelihoods is allowed up to correctly specified regression functions when fully robust estimation is considered.

The main contributions of this chapter are the following. First, I show there exists a generalized method of moments estimator (GMM-QLIML) based on all the available quasi-scores. The asymptotic variance of the GMM-QLIML estimator constitutes a lower bound for those of QLIML and CF in the matrix positive semidefinite sense. Second, the QLIML estimator is proved to be as efficient as the GMM-QLIML estimator under a set of generalized information matrix equalities.
Third, the asymptotic equivalence of LIML and 2SLS is established via the linearity of the regression functions and the $L_2$ loss function incorporated in the normal density. This new proof clearly shows why the equivalence holds without normality or conditional homoskedasticity, which is often assumed in the assertion. Sufficient conditions for general equivalence of QLIML and CF are also found. Fourth, in fully robust estimation of correctly specified conditional mean functions, the QLIML estimator is shown to be efficient relative to the CF estimator if the generalized linear model variance assumptions hold with a scaling restriction. In particular, correctly specified conditional moments up to second order are sufficient.

The rest of this chapter is organized as follows. In Section 1.2, basic model restrictions are given with a GMM interpretation of the QLIML and CF estimators. In Section 1.3, the GMM-QLIML estimator is defined and relative efficiency results for the QLIML and CF estimators are presented. Section 1.4 contains concluding remarks.

1.2 Model Restrictions

Assume random sampling from a population. For a random draw $i$, consider the system of equations
\[
y_{i1} = f(y_{i2}, z_{i1}, u_{i1}; \theta_1, \theta_2) \tag{1.1}
\]
\[
y_{i2} = g(z_i, v_{i2}; \theta_2) \tag{1.2}
\]
where the functions $f$ and $g$ are known up to the $(p_1 + p_2) \times 1$ vector of parameters $\theta = (\theta_1', \theta_2')'$, $y_{i1}$ is a scalar response variable, $y_{i2}$ is a $1 \times r$ vector of potentially endogenous variables, $z_i = (z_{i1}, z_{i2})$ is a $1 \times k$ vector of included/excluded exogenous instruments with $k = k_1 + k_2$, and $(u_{i1}, v_{i2})$ is a $(1 + r) \times 1$ vector of unobservables. Under potentially incorrect distributional assumptions for $u_1$ and $v_2$, taking the log of the decomposed quasi-joint likelihood $l(y_{i1} \mid y_{i2}, z_i; \theta_1, \theta_2)\, l(y_{i2} \mid z_i; \theta_2)$ delivers the QLIL
\[
q_{QLIL} = q_1(y_{i1}, y_{i2}, z_i, \theta_1, \theta_2) + q_2(y_{i2}, z_i, \theta_2) \tag{1.3}
\]
which offers flexible model specifications. The decomposition is more of a 'composition' in the sense that $q_1$ and $q_2$ do not need to be derived from a single joint quasi-likelihood.
For example, a Poisson log-likelihood $q_1$ and a normal log-likelihood $q_2$ can be used as long as the quasi-likelihood-driven regression functions are correct: Wooldridge (2014) showed that, in the LEF, the key model restrictions for consistent estimation of the conditional mean $E_o[y_{i1} \mid y_{i2}, z_i]$ are
\[
E_o[y_{i1} \mid y_{i2}, z_i] = E_q[f(y_{i2}, z_{i1}, u_{i1}; \theta_{o1}, \theta_{o2}) \mid y_{i2}, z_i] \tag{1.4}
\]
\[
E_o[y_{i2} \mid z_i] = E_q[g(z_i, v_{i2}; \theta_{o2}) \mid z_i] \tag{1.5}
\]
where the subscripts 'o' and 'q' denote (expectation) operators based on the true and quasi-likelihood, respectively. As long as (1.4) and (1.5) hold, even failure of (1.1) is allowed in consistent estimation, as shown by Example 1.2.1 below. In the linear model with quasi-normality (Anderson and Rubin, 1949), linear projection operators $L_o[\,\cdot \mid \cdot\,]$ with appropriate regressors can replace the expectation operators $E_o[\,\cdot \mid \cdot\,]$ when the objects of interest are linear projections rather than conditional mean functions. The decomposed nature of the QLIL and correctly specified regression functions typically involve the existence of a control function. See Wooldridge (2014) for details. The following example demonstrates the derivation of the QLIL in linear and Probit models, and discusses their robustness properties.

Example 1.2.1 (models with quasi-normality of $(u_{i1}, v_{i2})$) Consider the following simultaneous equation systems:
\[
\text{Linear Model: } y_{i1} = y_{i2}\alpha + z_{i1}\delta_1 + u_{i1}, \quad y_{i2} = z_i\delta_2 + v_{i2} \tag{1.6}
\]
\[
\text{Probit Model: } y_{i1} = 1[y_{i2}\alpha + z_{i1}\delta_1 + u_{i1} > 0], \quad y_{i2} = z_i\delta_2 + v_{i2} \tag{1.7}
\]
Assume $(u_{i1}, v_{i2}) \mid z_i \sim_q N(0, \Sigma)$ where $V_q(u_{i1} \mid z_i) = \Sigma_{11}$, $\mathrm{cov}_q(u_{i1}, v_{i2} \mid z_i) = \Sigma_{12} = \Sigma_{21}'$, $V_q(v_{i2} \mid z_i) = \Sigma_{22}$, $\delta_2 = (\delta_{21}', \delta_{22}')'$ is a $k \times r$ matrix, and the other parameters are defined conformably. In the notation '$X \sim_q \Psi$', the subscript $q$ indicates that the distributional assumption '$X \sim \Psi$' is allowed to be incorrect and is used only for deriving the quasi-likelihood of $X$ or its transformation.
The decomposed quasi-likelihoods are easily derived noting that $e_{i1} \mid (y_{i2}, z_i) \sim_q N\left(0, \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$, where $e_{i1} \equiv u_{i1} - v_{i2}\Sigma_{22}^{-1}\Sigma_{21}$. In the Probit model, it is assumed that $V_q(e_{i1} \mid y_{i2}, z_i) = 1$ for normalization. The quasi-likelihoods for the linear and Probit models are given explicitly in Examples 1.3.5 and 1.3.15, respectively.

Concerning robustness properties, there are three things to be mentioned. First, since $q_{i1}$ and $q_{i2}$ in both models belong to the LEF, a correctly specified conditional mean can be consistently estimated by QLIL-based estimators regardless of the true distribution. Second, in the linear model, the conditional mean functions derived from the quasi-likelihood have an interpretation as true linear projections. In particular, the quasi-likelihood-based conditional mean of $y_{i1}$ conditioned on $(y_{i2}, z_i)$ can be regarded as the linear projection of $y_{i1}$ on $(y_{i2}, z_{i1}, v_{i2})$:
\[
E_q[y_{i1} \mid y_{i2}, z_i] = y_{i2}\alpha + z_{i1}\delta_1 + v_{i2}\Sigma_{22}^{-1}\Sigma_{21} = L_o[y_{i1} \mid y_{i2}, z_{i1}, v_{i2}]
\]
where $\Sigma_{22}^{-1}\Sigma_{21}$ can be reparameterized as $\eta$ for convenience. Since this interpretation is definitional through quasi-scores, even when the conditional mean functions are incorrectly specified, $(\alpha, \delta_1)$ is consistently estimated as linear projection coefficients under regularity conditions. Third, the $y_{i1}$ equation in (1.7) of the Probit model is not a restrictive condition for consistency. When $y_{i1}$ is a fractional response taking values in $[0, 1]$, the equation may not hold for some observations. Such failure of the $y_{i1}$ equation does not necessarily harm consistent estimation of the conditional mean function if the Probit response function is correct:
\[
E_q[y_{i1} \mid y_{i2}, z_i] = \Phi\left(y_{i2}\alpha + z_{i1}\delta_1 + v_{i2}\Sigma_{22}^{-1}\Sigma_{21}\right) = E_o[y_{i1} \mid y_{i2}, z_i]
\]
However, the Probit response function does not have the robust interpretation of a linear projection, as in the linear model, when it is incorrectly specified.
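To make the two-step logic of Example 1.2.1 concrete, here is a minimal numpy simulation sketch of the linear-model control function estimator. This is illustrative only (the DGP, sample size, and variable names are mine, not the dissertation's): the first-stage OLS residuals enter the second stage as an extra regressor, and their coefficient absorbs the endogenous variation $v_{i2}\Sigma_{22}^{-1}\Sigma_{21}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Instruments: z1 is included in the structural equation, z2 is excluded
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
Z = np.column_stack([z1, z2])

# Correlated unobservables create endogeneity: cov(u1, v2) = 0.5
u1, v2 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n).T

y2 = Z @ np.array([0.7, 1.2]) + v2   # reduced form, as in (1.6)
y1 = 1.0 * y2 + 0.5 * z1 + u1        # structural equation: alpha = 1, delta1 = 0.5

ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: reduced-form OLS; the residuals play the role of the control function
v2_hat = y2 - Z @ ols(Z, y2)

# Step 2: add v2_hat as a regressor; its coefficient estimates Sigma22^{-1} Sigma21
b_cf = ols(np.column_stack([y2, z1, v2_hat]), y1)

# Naive OLS that ignores endogeneity is inconsistent for alpha
b_ols = ols(np.column_stack([y2, z1]), y1)

print("CF  alpha_hat:", b_cf[0])   # close to 1.0
print("OLS alpha_hat:", b_ols[0])  # noticeably biased upward
```

With this DGP the naive OLS slope on $y_2$ is inflated by roughly $\mathrm{cov}(u_1, v_2)$ over the residual variance of $y_2$, while the CF estimate recovers the structural coefficient.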
Given the QLIL, the QLIML and CF estimators are defined as
\[
\hat\theta_{QLIML} = \arg\max_{\theta} \sum_{i=1}^N \left[ q_{i1}(\theta_1, \theta_2) + q_{i2}(\theta_2) \right]
\]
and
\[
\hat\theta_{2,CF} = \arg\max_{\theta_2} \sum_{i=1}^N q_{i2}(\theta_2), \qquad
\hat\theta_{1,CF} = \arg\max_{\theta_1} \sum_{i=1}^N q_{i1}\!\left(\theta_1, \hat\theta_{2,CF}\right),
\]
respectively. Focusing on the relative efficiency comparison of these two, it is assumed that both the QLIML and CF estimators are consistent and asymptotically normal (CAN) for the true parameter values. Also, we assume that the expected quasi-scores uniquely determine the true parameters so that the GMM interpretation of the QLIML and CF estimators is valid. This is a mild assumption since the necessity of the LEF for fully robust estimation is shown under enough differentiability of the likelihood and interiority of a population maximizer (White, 1994, Theorem 5.6). Consequently, the QLIML and CF estimators can be defined as GMM estimators based on the quasi-score moment conditions
\[
E_o \begin{bmatrix} \frac{\partial}{\partial \theta_1} q_{i1}(\theta_{o1}, \theta_{o2}) \\[4pt] \frac{\partial}{\partial \theta_2} q_{i1}(\theta_{o1}, \theta_{o2}) + \frac{\partial}{\partial \theta_2} q_{i2}(\theta_{o2}) \end{bmatrix} = 0
\quad \text{and} \quad
E_o \begin{bmatrix} \frac{\partial}{\partial \theta_1} q_{i1}(\theta_{o1}, \theta_{o2}) \\[4pt] \frac{\partial}{\partial \theta_2} q_{i2}(\theta_{o2}) \end{bmatrix} = 0,
\]
respectively. Appendix A.1 contains the relevant standard regularity conditions (Assumptions 1–12). These assumptions are maintained for simplicity. They can be relaxed, for example, to allow non-smooth $q_{i1}$ or $q_{i2}$ via smoothness in the limit and stochastic differentiability (Pollard, 1985).
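For intuition, the joint maximization defining QLIML can be mimicked numerically in the quasi-normal linear model of Example 1.2.1. The sketch below is mine, under an illustrative DGP, with the quasi-likelihood constants dropped and the reparameterization $\eta = \Sigma_{22}^{-1}\Sigma_{21}$; it is not the dissertation's implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 10_000
z1 = rng.normal(size=n)
z2 = rng.normal(size=(n, 2))
Z = np.column_stack([z1, z2])
u1, v2 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n).T
y2 = Z @ np.array([0.6, 1.0, -0.7]) + v2
y1 = 1.0 * y2 + 0.5 * z1 + u1   # true alpha = 1.0

def neg_qlil(p):
    # p = (alpha, delta1, eta, log_s1, delta2 (3 entries), log_s2)
    a, d1, eta, ls1 = p[:4]
    d2, ls2 = p[4:7], p[7]
    v = y2 - Z @ d2                       # reduced-form error v_i2(delta2)
    e = y1 - a * y2 - d1 * z1 - eta * v   # structural error net of the CF term
    s1, s2 = np.exp(ls1), np.exp(ls2)
    q1 = -0.5 * np.log(s1) - 0.5 * e**2 / s1   # quasi-normal q_i1 (constants dropped)
    q2 = -0.5 * np.log(s2) - 0.5 * v**2 / s2   # quasi-normal q_i2
    return -(q1.sum() + q2.sum()) / n          # QLIML maximizes q1 + q2 jointly

res = minimize(neg_qlil, np.zeros(8), method="BFGS")
alpha_qliml = res.x[0]
print("QLIML alpha_hat:", alpha_qliml)  # close to the true value 1.0
```

The CF estimator would instead maximize the two factors sequentially: first `q2` over `(delta2, log_s2)`, then `q1` over `(alpha, delta1, eta, log_s1)` holding the first-step estimates fixed.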
Recall the definition of linear independence in the context of moment function space along with its well-known relationship with variance matrix in the following remark. Definition 1.3.1 A set of scalar moment functions fhl .wi ; Â/gL lD1 is linearly independent at  if Á PL P lD1 ˛l . / hl .wi ;  / D 0 D 1 implies ˛l . / D 0 for all l where ˛l .Â/ is arbitrary real-valued function of Â: Remark 1.3.2 fhl .wi ; Â/gL lD1 is linearly independent at  if and only if the variance matrix of fhl .wi ;  /gL lD1 is invertible, assuming that second moments are finite. Now, consider stacking all available quasi-scores in (2.1): 3 2 @ q . ;  / 6 @Â1 i1 1 2 7 7 6 6 @ q . ;  / 7 6 @Â2 i1 1 2 7 5 4 @ q . / @ i 2 2 (1.8) 2 The vector of moment functions (1.8) constitutes, when taken summation or integral, all available first order conditions from factor-by-factor QLIL maximization problem. We might hope conducting efficient GMM on these moment functions yields an estimator efficient relative to QLIML and CF. However, it turns out that (1.8) typically has a singluar variance matrix since the @ q . ;  / is linearly dependent. The singularity is closely related to set of moment functions in @ i1 1 2 the fundamental reason why we need simultaneous equation system: the quasi-likelihood function qi1 alone cannot identify Âo1 and Âo2 in general: To avoid such linear dependence; a maximal 6 linearly idependent set in (1.8) can be used instead. Since moment functions in @Â@ qi1 .Â1 ; Â2 / and 1 @ @Â2 q2 .Â2 / are assumed to be linearly independent by rank condition of CF (Assumption 12), a maximal linearly idependent set can be found by extending the set of CF moment functions. 
Definition 1.3.3 GMM-QLIML is an efficient GMM estimator based on a maximal linearly independent set of moment functions at $(\theta_{o1}, \theta_{o2})$ in $\left( \frac{\partial q_1(\theta_1, \theta_2)}{\partial \theta_1}, \frac{\partial q_1(\theta_1, \theta_2)}{\partial \theta_2}, \frac{\partial q_2(\theta_2)}{\partial \theta_2} \right)$:
\[
\begin{bmatrix} \frac{\partial}{\partial \theta_1} q_1(\theta_1, \theta_2) \\[4pt] \frac{\partial}{\partial \theta_{22}} q_1(\theta_1, \theta_2) \\[4pt] \frac{\partial}{\partial \theta_2} q_2(\theta_2) \end{bmatrix} \tag{1.9}
\]
where $\theta_2 = (\theta_{21}', \theta_{22}')'$, $\theta_{21} \in \mathbb{R}^{p_{21}}$, $\theta_{22} \in \mathbb{R}^{p_{22}}$ with $p_2 = p_{21} + p_{22}$, and $\theta_{22}$ can be empty.

The following proposition shows that the GMM-QLIML estimator is asymptotically normal without additional model restrictions other than those of CF and QLIML.

Proposition 1.3.4 Under the regularity and identification conditions for CF and QLIML (Assumptions 1–12), the GMM-QLIML estimator is asymptotically normal:
\[
\sqrt{N}\left( \hat\theta_{GMM\text{-}QLIML} - \theta_o \right) \xrightarrow{d} N\left( 0, \left( A' B^{-1} A \right)^{-1} \right)
\]
where
\[
A = E \begin{bmatrix} \frac{\partial^2 q_{i1}(\theta_o)}{\partial \theta_1 \partial \theta'} \\[4pt] \frac{\partial^2 q_{i1}(\theta_o)}{\partial \theta_{22} \partial \theta'} \\[4pt] \frac{\partial^2 q_{i2}(\theta_{o2})}{\partial \theta_2 \partial \theta'} \end{bmatrix}
\quad \text{and} \quad
B = V \begin{pmatrix} \frac{\partial q_{i1}(\theta_o)}{\partial \theta_1} \\[4pt] \frac{\partial q_{i1}(\theta_o)}{\partial \theta_{22}} \\[4pt] \frac{\partial q_{i2}(\theta_{o2})}{\partial \theta_2} \end{pmatrix}
\]
Typically, the extra moment functions $\frac{\partial}{\partial \theta_{22}} q_1(\theta_1, \theta_2)$ are orthogonality conditions between the exogenous part of the structural error and the $(k_2 - r)$ overidentifying instruments. The example below shows the determination of $\frac{\partial}{\partial \theta_{22}} q_1(\theta_1, \theta_2)$ in the linear model setting.
The quasi-log-likelihoods are qi1 .Â1 ; Â2 / D qi 2 .Â2 / D 1 ln 2 2 k ln 2 2 1 1 ln 11j2 .Â/ ei .Â/2 11j2 .Â/ 1 2 2 1 1 ln j†22 j vi 2 .ı 2 / †221 vi 2 .ı 2 /0 2 2 and quasi-scores can be expressed as follows 2 .Â/ 1 ei .Â/ y02 6 11j2 6 1 0 6 @q1 6 11j2 .Â/ ei .Â/ z1 D6 6 @Â1 2 0 1 1 1 6 11j2 .Â/ hi .Â/ †22 †21 C 11j2 .Â/ ei .Â/ †22 vi 2 .ı 2 / 4 1 2 2 11j2 .Â/ hi .Â/ 2 h i 3 1 1 0 @q1 11j2 .Â/ ei .Â/ †22 †21 ˝ z 7 6 D4 i h 5 @Â2 L † 1 ˝ † 1 D .Â/ r 2 22 ˝ z0 3 7 7 7 7 7 7 7 5 22 3 †221 v2 .ı 2 /0 @q2 6 Ir D4 @Â2 1 L vec † 1 v .ı /0 v .ı / † 1 i2 2 2 r 22 i 2 2 22 †21 ˝ vi 2 .ı 2 /0 where D .Â/ D Œ†21 ˝ †21  12 11j2 .Â/ 2 hi .Â/ 0 ˛ 0 ; ı10 ; †021 ; †11 , Â2 D vec .ı 2 /0 ; vech .†22 /0 †221 7 Á 5 0 and Lr is a r.rC1/ 2 11j2 .Â/ 1e i .Â/ ; Â1 D r 2 elemination matrix @q . / (Section 5.7.3, Turkington, 2014) : To determine @Â1 ; we should find a set of moment functions 22 @q @q @q in @Â1 that cannot be expressed by a linear combination of moment functions in @Â1 and @Â2 at 2 1 2 @q1 true parameter values. If †o21 D 0; then @ D 0 and Â22 is empty. For now, assume †o21 6D 0 2 8 1† and, at least one element of †o22 o21 ; say io th component; is nonzero. Then, by some tedious algebra1; the extra moments can be shown to be at most @q1 .Â/ D @Â22 Á 1 e .Â/ † 1 † 0 .Â/ i 1j2 22 21 i z2; r o (1.10) where ‘ r’ denotes ‘leaving r instruments out’ in z2 . Suppose there exists enough variation in @q . / z2 so that (1.10) is indeed @Â1 : Since GMM-QLIML moment functions constitute a basis of 22 linear vector space spanned by (1.8), it can be shown that any choice of extra moments yields asymptotically equivalent estimator. If the model is just-identified .k2 D r/; or if there exists no endogeneity .†o21 D 0/; then Â22 is empty, and GMM-QLIML, CF and QLIML are asymptotically equivalent to each other. Example 1.3.5 illustrates the following general proposition. 
Proposition 1.3.6 Under regularitiy conditions and identification conditions (Assumption 1–12), @q .Â/ yields asymptotically equivalent GMM-QLIML estimator. (b) If Â22 is (a) Each choice of @Â1 22 empty, then GMM-QLIML, QLIML and CF are asymptotically equivalent. Example 1.3.5 shows that a preliminary step is required for GMM-QLIML to be used in practice. In the example, it is necessary to test whether there exists a component of †221 †21 significantly different from zero. Then, the extra moment functions will be chosen correspondingly. This preliminary procedure probably is not very appealing to practitioners. A practical approach in general would be to employ a generalized inverse matrix for optimal weighting and resolve singularity issue. Also, in this specific example, we can consider including moment condition 1 It @q @q1 @ı 1 @q1 @Â1 @q @q @q is easy to see that @† 1 (the second part of @Â1 ) is a linear combination of @† 1 and @† 1 22 2 21 11 @q1 @q1 @q1 (the third and fourth part of @ /:In @ı (the first part of @ ); all moments with z1 can be generated by (the second part of 1 2 2 ): So, we are left with moments with z2 : Among these, due to explicit @q . / linear relationship y2 D Zi ı 2 Cvi 2 .ı 2 / ; only .k2 r/ moments at most can be included in @Â1 : 22 For all .k2 r/ moments to be included, we need enough variation in instruments. 9 Á without †221 †21 term io @q1 .Â/ D @Â22 11j2 .Â/ 1e i .Â/ z02; r (1.11) and the resulting optimal GMM estimator is more efficient relative to GMM-QLIML though it may require additional model restrictions in general. GMM-QLIML has an important role in relative efficiency study while GMM with (1.11) has more practical usage. As a basis of linear space spanned by QLIML and CF moment functions, the asymptotic variance of GMM-QLIML forms a sharper lower bound for those of QLIML and CF. 
Since eliminating $\left( \Sigma_{22}^{-1}\Sigma_{21} \right)_{i_o}$ from (1.10) is equivalent to adding extra information that is not used by either QLIML or CF when $\Sigma_{o21} = 0$, the asymptotic variance of the optimal GMM estimator using (1.11) can be strictly smaller than that of GMM-QLIML in the matrix positive definite sense. This delicate distinction offers a convenient general framework for relative efficiency comparison.

The potential relative efficiency gain of GMM-QLIML with respect to QLIML and CF is clear from its definition. It is worth noting that such potential improvement is not based on additional model restrictions, as shown in Proposition 1.3.4. When an efficiency gain is present, it is implied that QLIML and CF make use of only a part of the information that GMM-QLIML uses. In such a case, the relative efficiency comparison of QLIML and CF is not obvious in general. When GMM-QLIML is equivalent to either QLIML or CF, one can conclude that the one equivalent to GMM-QLIML is superior to the other. Conditions under which GMM-QLIML is equivalent to each estimator can be derived by applying moment redundancy conditions (Breusch, Qian, Schmidt and Wyhowski, 1999; BQSW). In the following propositions, denote $V_{est} \equiv \mathrm{Avar}\left( \sqrt{N}\left( \hat\theta_{est} - \theta_o \right) \right)$ and $V_{est}^{S} \equiv \mathrm{Avar}\left( \sqrt{N}\left( \hat\theta_{S,est} - \theta_{oS} \right) \right)$ for a partition $\theta = (\theta_S, \theta_{-S})$, and $\partial q_{il}^o = \partial q_{il}(\theta_o)$ for $l = 1, 2$.

Proposition 1.3.7 Assume that Assumptions 1–12 hold and that $\theta_{22}$ is nonempty. Then

(a) $V_{GMM\text{-}QLIML} \leq V_{QLIML},\, V_{CF}$.

(b) $V_{GMM\text{-}QLIML} = V_{CF}$ if and only if
\[
E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_{22}\, \partial \theta'} \right]
= \mathrm{cov}_o\!\left( \frac{\partial q_{i1}^o}{\partial \theta_{22}},
\begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial q_{i2}^o}{\partial \theta_2} \end{pmatrix} \right)
V_o\!\begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial q_{i2}^o}{\partial \theta_2} \end{pmatrix}^{-1}
E_o\!\left[ \frac{\partial}{\partial \theta'} \begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial q_{i2}^o}{\partial \theta_2} \end{pmatrix} \right]
\]

(c) $V_{GMM\text{-}QLIML} = V_{QLIML}$ if and only if
\[
E_o\!\left[ \frac{\partial^2 \left( q_{i1}^o + q_{i2}^o \right)}{\partial \tilde\theta_2\, \partial \theta'} \right]
= \mathrm{cov}_o\!\left( \frac{\partial \left( q_{i1}^o + q_{i2}^o \right)}{\partial \tilde\theta_2},
\begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial \left( q_{i1}^o + q_{i2}^o \right)}{\partial \theta_2} \end{pmatrix} \right)
V_o\!\begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial \left( q_{i1}^o + q_{i2}^o \right)}{\partial \theta_2} \end{pmatrix}^{-1}
E_o\!\left[ \frac{\partial}{\partial \theta'} \begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial \left( q_{i1}^o + q_{i2}^o \right)}{\partial \theta_2} \end{pmatrix} \right]
\]
where $\tilde\theta_2$ is a subvector of $\theta_2$ such that $\frac{\partial q_{i1}^o}{\partial \theta_1}$ and $\frac{\partial q_{i1}^o}{\partial \tilde\theta_2} + \frac{\partial q_{i2}^o}{\partial \tilde\theta_2}$ are maximal linearly independent.
Remark 1.3.8 (b) and (c) can be derived for an arbitrary subvector $\theta_S$ of $\theta = (\theta_S, \theta_{-S})$. Corresponding results are given in the appendix.

The equivalence conditions (b) and (c) characterize when the extra moments in GMM-QLIML contain no useful information about the parameters. Rigorously put, they describe cases where the orthogonal complement of the QLIML or CF moment functions in the linear span of (1.8) does not contain additional information about the parameters. One interesting implication of (c) is that a set of generalized information matrix equalities (GIME; Wooldridge, 2010) for $q_1$, $q_2$ and $q_1 + q_2$ with some common scaling factor $\sigma_o^2 > 0$ is sufficient for QLIML to be efficient relative to CF:
\[
V_o\!\left( \frac{\partial q_1^o}{\partial \theta} \right) = \sigma_o^2\, E_o\!\left[ -\frac{\partial^2 q_1^o}{\partial \theta\, \partial \theta'} \right]
\]
\[
V_o\!\left( \frac{\partial q_2^o}{\partial \theta_2} \right) = \sigma_o^2\, E_o\!\left[ -\frac{\partial^2 q_2^o}{\partial \theta_2\, \partial \theta_2'} \right]
\]
\[
V_o\!\left( \frac{\partial \left( q_1^o + q_2^o \right)}{\partial \theta} \right) = \sigma_o^2\, E_o\!\left[ -\frac{\partial^2 \left( q_1^o + q_2^o \right)}{\partial \theta\, \partial \theta'} \right]
\]
Note that this result is stronger than the one in previous studies under correctly specified likelihoods. Even if QLIML is not a maximum likelihood estimator, it is efficient relative to CF whenever the finite number of moment conditions in the GIMEs are met.

The following corollary contains relevant implications of Proposition 1.3.7 regarding the GIMEs. In particular, it claims an if-and-only-if condition for the CF and QLIML estimators of $\theta_{11}$ to be asymptotically equivalent under the GIMEs, where $\theta_1 = (\theta_{11}, \theta_{12})$.

Corollary 1.3.9 Assume that Assumptions 1–12 hold and that $\theta_{22}$ is nonempty. If the generalized information matrix equalities hold for each factor of the likelihood and the joint likelihood with the same scaling factor, we have

(a) $V_{GMM\text{-}QLIML} = V_{QLIML} \leq V_{CF}$ (in particular, $V_{QLIML} \neq V_{CF}$);

(b) $V_{GMM\text{-}QLIML}^{11} = V_{QLIML}^{11} = V_{CF}^{11}$ if and only if
\[
0_{p_{22} \times p_{11}} = \left( E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_{22}\, \partial \theta_2'} \right]
- E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_{22}\, \partial \theta_1'} \right]
\left( V_o\!\left( \frac{\partial q_{i1}^o}{\partial \theta_1} \right) \right)^{-1}
E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_1\, \partial \theta_2'} \right] \right)
\left( R_{21}\, E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_{12}\, \partial \theta_{11}'} \right]
+ R_{22}\, E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_2\, \partial \theta_{11}'} \right] \right)
\]
where $R_{21}$ and $R_{22}$ are defined in the proof.
These results are useful for studying the asymptotic equivalence of QLIML and CF since, if there exists a case where QLIML and CF are asymptotically equivalent in general, it must also be the case under GIMEs. Result (a) of Corollary 1.3.9 shows that, when $\theta_{22}$ is nonempty, QLIML and CF are never asymptotically equivalent for all elements of $\theta$. But this does not rule out the case where QLIML and CF are asymptotically equivalent for a strict subvector of $\theta$. The formula in Corollary 1.3.9 (b) (and another one given in Proposition 1.3.13 later) identifies the key conditions for QLIML and CF to be asymptotically equivalent for a subvector $\theta_{11}$ of $\theta_1$: it appears that some part of the expected cross partials $E_o\big[\partial^2 q^o_{i1}/\partial\theta_{12}\,\partial\theta_{11}'\big]$ and $E_o\big[\partial^2 q^o_{i1}/\partial\theta_2\,\partial\theta_{11}'\big]$ must vanish to obtain general equivalence. Based on this observation, the following proposition explicitly states a condition under which QLIML and CF are asymptotically equivalent for a subvector of $\theta$. The well-known asymptotic equivalence of LIML and 2SLS is an implication.

Proposition 1.3.10 Assume that Assumptions 1-12 hold. Let $(\gamma_1, \gamma_2)$ be a partition of $\theta$. If there exist $p\times p$ invertible matrices $T_1(\theta)$ and $T_2(\theta)$ such that
$$T_1(\theta)\begin{pmatrix} \frac{\partial}{\partial\theta_1} q_1(\gamma_1,\gamma_2) \\ \frac{\partial}{\partial\theta_2} q_1(\gamma_1,\gamma_2) + \frac{\partial}{\partial\theta_2} q_{i2}(\gamma_2) \end{pmatrix} = \begin{pmatrix} m_1(\gamma_1,\gamma_2) \\ m_2(\gamma_1,\gamma_2) \end{pmatrix}$$
$$T_2(\theta)\begin{pmatrix} \frac{\partial}{\partial\gamma_1} q_1(\gamma_1,\gamma_2) \\ \frac{\partial}{\partial\gamma_2} q_2(\gamma_2) \end{pmatrix} = \begin{pmatrix} m_1(\gamma_1,\gamma_2) \\ m_3(\gamma_1,\gamma_2) \end{pmatrix}$$
where $m_1(\gamma_1,\gamma_2)$ identifies $\gamma_{o1}$ given $\gamma_{o2}$, $E_o\big[\partial m_1(\gamma_{o1},\gamma_{o2})/\partial\gamma_2'\big] = 0$, and $E_o\big[\partial m_{ig}(\gamma_{o1},\gamma_{o2})/\partial\gamma_2'\big]$ is invertible for $g = 2, 3$, then the QLIML and CF estimators of $\gamma_1$ are asymptotically equivalent.

Corollary 1.3.11 LIML and 2SLS are asymptotically equivalent for $(\alpha, \delta_1)$.

The asymptotic equivalence of LIML and 2SLS is mainly due to the linearity of the regression functions and the $L_2$ loss function embedded in the normal density.
The intuition behind the proof is that $v_2$ does not need to be controlled for as a regressor in order to estimate $(\alpha, \delta_1)$: the orthogonality conditions $\partial q_1/\partial\alpha$ and $\partial q_1/\partial\delta_1$ in Example 1.3.5 can be transformed into
$$\begin{pmatrix} (y_{i1} - y_{i2}\alpha - z_{i1}\delta_1)\,(z_{2i}\delta_{22})' \\ (y_{i1} - y_{i2}\alpha - z_{i1}\delta_1)\, z_{1i}' \end{pmatrix} \qquad (1.12)$$
by an invertible linear map. Treating $(\alpha', \delta_1')'$ as $\gamma_1$ in Proposition 1.3.10, the equivalence follows. Clearly, neither normality nor conditional homoskedasticity is needed for this result, which is not very well recognized in the literature. Amemiya (1984) proves the equivalence under conditionally homoskedastic non-normal errors and non-random instruments, but his argument is, in fact, valid without assuming conditional homoskedasticity. In nonlinear models such as probit, the regression function does not allow the control function part to vanish as the linear model does in (1.12). Also, when the loss function is other than $L_2$, for example $L_1$ as in median regression with the tick-exponential family (Komunjer, 2009), then even if the regression function is linear, there again exists no invertible linear transformation of the quasi-scores that eliminates the control function part in general. Thus the equivalence of QLIML and CF does not seem to hold for nonlinear regression models.

Apart from linearity of the regression function and the $L_2$ loss function, another condition for general asymptotic equivalence of QLIML and CF for $\theta_1$ is
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_2'}\right] = \mathrm{cov}_o\left(\frac{\partial q^o_1}{\partial\theta_1},\ \frac{\partial q^o_2}{\partial\theta_2}\right) \qquad (1.13)$$
together with $\mathrm{cov}_o\big(\partial q^o_1/\partial\theta_1,\ \partial q^o_2/\partial\theta_2\big) = 0$. The equivalence is easily proved by taking $T_1(\theta) = T_2(\theta) = I_p$, $\gamma_1 = \theta_1$ and $\gamma_2 = \theta_2$ in Proposition 1.3.10. A set of sufficient conditions for (1.13) is well known: (a) $q_2$ is a correctly specified log-likelihood for $w_{i2}$, and (b) $w_{i1} \perp w_{i2} \mid z_i$, where $q_1 = q_1(w_{i1}, w_{i2}, z_i, \theta_1, \theta_2)$ and $q_2 = q_2(w_{i2}, z_i, \theta_2)$ for some random variables $(w_{i1}, w_{i2})$. This is a fairly general condition applicable to numerous models.
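The claim that neither normality nor conditional homoskedasticity is needed for the LIML/2SLS equivalence can be checked numerically. The sketch below is illustrative only: the data-generating process (non-normal, conditionally heteroskedastic errors) and sample size are assumptions, and LIML is computed through the standard k-class characterization. At a large sample size the two estimates nearly coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k2 = 20000, 4                          # over-identified: 4 excluded instruments, 1 endogenous y2
z1 = rng.standard_normal((n, 1))          # included exogenous regressor
z2 = rng.standard_normal((n, k2))         # excluded instruments

# Non-normal errors; v2 heteroskedastic in z1; u enters v2 -> endogeneity
u = rng.standard_exponential(n) - 1.0
v2 = 0.5 * u + (rng.standard_exponential(n) - 1.0) * (1.0 + 0.3 * np.abs(z1[:, 0]))
y2 = z1[:, 0] + z2.sum(axis=1) + v2
y1 = 1.0 * y2 + 1.0 * z1[:, 0] + u        # true (alpha, delta1) = (1, 1)

X = np.column_stack([y2, z1])             # structural regressors
Z = np.column_stack([z1, z2])             # full instrument set

def proj(A, B):
    """Project columns of B onto the column space of A."""
    return A @ np.linalg.lstsq(A, B, rcond=None)[0]

# 2SLS: regress y1 on fitted structural regressors
b_2sls = np.linalg.lstsq(proj(Z, X), y1, rcond=None)[0]

# LIML: kappa = smallest eigenvalue of (W'M_{z1}W)(W'M_Z W)^{-1} with W = [y1, y2],
# then the usual k-class formula
W = np.column_stack([y1, y2])
M1W = W - proj(z1, W)                     # annihilate included exogenous only
MW = W - proj(Z, W)                       # annihilate all instruments
kappa = np.min(np.real(np.linalg.eigvals(np.linalg.solve(MW.T @ MW, M1W.T @ M1W))))
MZX = X - proj(Z, X)
MZy = y1 - proj(Z, y1)
b_liml = np.linalg.solve(X.T @ X - kappa * (MZX.T @ MZX),
                         X.T @ y1 - kappa * (MZX.T @ MZy))
```

Despite the skewed, heteroskedastic errors, `b_2sls` and `b_liml` agree to well within sampling error, as the asymptotic equivalence for $(\alpha, \delta_1)$ predicts.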
However, it should be noted that $w_{i1}$ cannot be a latent error term such as $u_{i1}$ or $u_{i1} - v_{i2}\eta$ in the probit model of Example 1.2.1, since $q_1$ is required to be a quasi-log-likelihood of $w_{i1}$ given $(w_{i2}, z_i)$.

The next two propositions refine the GIMEs to derive weaker conditions for relative efficiency of QLIML. Proposition 1.3.12 helps reduce the number of conditions in the GIMEs by treating nuisance parameters as known. The multivariate normal log-likelihood becomes a member of the LEF when this result is applicable to its variance parameters. Proposition 1.3.13 relaxes the common scaling factor in the GIMEs. When different scaling factors for $q_1$ and $q_2$ are allowed, $\tau_1 \le \tau_2$ is shown to be sufficient for relative efficiency of QLIML for $\theta_1$. Note that, with different scaling factors, having the GIME hold in both models does not necessarily imply asymptotic equivalence of QLIML and GMM-QLIML. Following Zhang (2005), the Schur complement of $B$ in $A$ is denoted $A/B$ for notational convenience.

Proposition 1.3.12 Assume that Assumptions 1-12 hold. Suppose there exist $(l_1 + l_2)$ nuisance parameters $\lambda = (\lambda_1, \lambda_2)$ such that
$$E_o\left[\frac{\partial^2 q_{i1}(\theta_{o1}, \theta_{o2}, \lambda_{o1}, \lambda_{o2})}{\partial\theta\,\partial(\lambda_1', \lambda_2')}\right] = 0_{p\times(l_1+l_2)}, \qquad E_o\left[\frac{\partial^2 q_{i2}(\theta_{o2}, \lambda_{o2})}{\partial\theta_2\,\partial\lambda_2'}\right] = 0_{p_2\times l_2}.$$
Then $V^{QLIML}$ and $V^{CF}$ are not affected by treating $\lambda$ as known and redefining $\tilde q_{i1}(\theta) = q_{i1}(\theta, \lambda_o)$ and $\tilde q_{i2}(\theta_2) = q_{i2}(\theta_2, \lambda_{o2})$. Moreover, if the GMM-QLIML moment function (1.9) contains exactly $(l_1 + l_2)$ scores with respect to $\lambda$, then $V^{GMM\text{-}QLIML}$ is also not affected by the redefinition.

Proposition 1.3.13 Assume that Assumptions 1-12 hold. Suppose the GIMEs hold with scaling factors $\tau_1$ and $\tau_2$ for the quasi-log-likelihoods $q_1$ and $q_2$, respectively. Also, assume $\mathrm{cov}_o\big(\partial q^o_1/\partial\theta_2,\ \partial q^o_2/\partial\theta_2\big) = 0$. Then $V^{CF}_{\theta_1} - V^{QLIML}_{\theta_1}$ is equal to
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]^{-1} E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_2'}\right]\big[\tau_2 W_1 + (\tau_1 - \tau_2) W_2\big]\, E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_2\,\partial\theta_1'}\right] E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]^{-1}$$
where
$$W_1 = E_o\left[\frac{\partial^2 q^o_2}{\partial\theta_2\,\partial\theta_2'}\right]^{-1} - \left(E_o\left[\frac{\partial^2 (q^o_1+q^o_2)}{\partial\theta\,\partial\theta'}\right]\Big/ E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]\right)^{-1}$$
$$W_2 = \left(E_o\left[\frac{\partial^2 (q^o_1+q^o_2)}{\partial\theta\,\partial\theta'}\right]\Big/ E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]\right)^{-1} E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_2\,\partial\theta_1'}\right] E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]^{-1} E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_2'}\right] \left(E_o\left[\frac{\partial^2 (q^o_1+q^o_2)}{\partial\theta\,\partial\theta'}\right]\Big/ E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]\right)^{-1}$$
In particular, $\tau_1 \le \tau_2$ implies $V^{CF}_{\theta_1} \ge V^{QLIML}_{\theta_1}$.

When $\tau_1 \ne \tau_2$, the GIME for $(q_1 + q_2)$ is not met, and QLIML does not optimally weight $\partial q_1/\partial\theta_2$ and $\partial q_2/\partial\theta_2$ as GMM-QLIML does. In this sense, Proposition 1.3.13 helps us understand where the complete set of GIMEs starts to break down. In contrast to the unambiguous case $\tau_1 \le \tau_2$ in the proposition, when $\tau_1 > \tau_2$ the expression $\tau_2 W_1 + (\tau_1 - \tau_2) W_2$ is indefinite in general. This observation explains why a general efficiency ordering of QLIML and CF is not obvious without some form of the GIMEs.

The next proposition shows how the general theory applies in a class of fully robust models specified with a multivariate normal $q_2$. It is one of the most frequently used specifications attaining fully robust estimation, but it is not the only class of models to which the results apply. It is shown that correct specification of the conditional means together with the GLM variance assumptions and a restriction on the scaling factors is sufficient for relative efficiency of QLIML for the structural parameters. In particular, correctly specified conditional moments up to second order are sufficient.

Proposition 1.3.14 Assume that Assumptions 1-12 hold. Suppose that $q_1$ is a member of the LEF with conditional mean $G(y_2, z_1, v_2, \theta_1)$, and that $q_2$ is a multivariate normal density for the linear reduced form equations. In other words,
$$q_{i1}(\theta_1, \theta_2) = a\big(G(y_2, z_1, v_2, \theta_1)\big) + b(y_{i1}) + y_{i1}\, c\big(G(y_2, z_1, v_2, \theta_1)\big)$$
$$q_{i2}(\theta_2) = -\frac{k}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_{22}| - \frac{1}{2}\, v_{i2}\,\Sigma_{22}^{-1} v_{i2}'$$
where $a$, $b$, $c$ and $G$ are sufficiently smooth functions, $v_2 = y_2 - z\delta_2$ and $\theta_2 = \big(\mathrm{vec}(\delta_2)', \mathrm{vech}(\Sigma_{22})'\big)'$. Assume that $E_q(y_1|y_2, z)$ and $E_q(y_2|z)$ are correctly specified.
Then $V_o(y_1|y_2, z) = \tau_1 V_q(y_1|y_2, z)$ and $V_o(y_2|z) = \tau_2 V_q(y_2|z)$ with $0 < \tau_1 \le \tau_2$ is sufficient for QLIML to be efficient relative to CF for $\theta_1$.

As a special case of Proposition 1.3.14, the next example considers a probit response function with endogenous explanatory variables. Specifically, Proposition 1.3.14 implies that relative efficiency of QLIML holds under a much weaker condition than the correct specification of the likelihood assumed in Rivers and Vuong (1988), and this result is new in the literature.

Example 1.3.15 (Rivers-Vuong, 1988) Consider the probit model in Example 2.1. Note that $y_1$ is not restricted to be a binary response as long as the probit response function is correct. Assume regularity and identification conditions (Assumptions 1-12). For computational convenience, impose the reparametrization $\eta \equiv \Sigma_{22}^{-1}\Sigma_{21}$ along with the normalization $e_1 = u_1 - v_2\eta$. Then the quasi-likelihood can be simplified as
$$q_1(\theta_1, \theta_2) = (1 - y_1)\log\big[1 - \Phi(w(\theta))\big] + y_1\log\Phi(w(\theta))$$
$$q_{i2}(\theta_2) = -\frac{k}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_{22}| - \frac{1}{2}\, v_{i2}(\delta_2)\,\Sigma_{22}^{-1} v_{i2}(\delta_2)'$$
where $\theta_1 = (\alpha', \delta_1', \eta')'$, $\theta_2 = \big(\mathrm{vec}(\delta_2)', \mathrm{vech}(\Sigma_{22})'\big)'$, $x = (y_2, z_1, v_2)$ and $w(\theta) = x\theta_1$. Taking derivatives, the quasi-scores can be expressed as
$$\frac{\partial q_1}{\partial\theta_1} = \frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\begin{pmatrix} y_2' \\ z_1' \\ v_2' \end{pmatrix}$$
$$\frac{\partial q_1}{\partial\theta_2} = -\frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\begin{pmatrix} \eta \otimes z' \\ 0_{\frac{r(r+1)}{2}\times 1} \end{pmatrix}$$
$$\frac{\partial q_2}{\partial\theta_2} = \begin{pmatrix} \big(I_r \otimes z'\big)\,\Sigma_{22}^{-1} v_2(\delta_2)' \\ \frac{1}{2}\, L_r\, \mathrm{vec}\big(\Sigma_{22}^{-1} v_{i2}(\delta_2)' v_{i2}(\delta_2)\,\Sigma_{22}^{-1} - \Sigma_{22}^{-1}\big) \end{pmatrix}$$
Assume the GMM-QLIML extra moment functions are
$$\frac{\partial q_1}{\partial\theta_{22}} = -\frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\,\eta_{i_o}\, z_{2,-r}'$$
where $\eta_{oi_o} \ne 0$.
To derive conditions under which QLIML is efficient relative to CF, first note that $\Sigma_{22}$ is a nuisance parameter; that is, under correctly specified regression functions,
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\mathrm{vech}(\Sigma_{22})'}\right] = 0, \qquad E_o\left[\frac{\partial^2 q^o_2}{\partial\mathrm{vec}(\delta_2)\,\partial\mathrm{vech}(\Sigma_{22})'}\right] = E_o\Big[\big(I_r \otimes z'\big)\big(\Sigma_{o22}^{-1} \otimes \Sigma_{o22}^{-1} v_2\big)\Big] L_r' = 0.$$
Therefore, we may treat $\Sigma_{22}$ as known; that is, redefine $\theta_2 = \delta_2$. The expected Hessian and score outer-product matrices are then
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right] = -E_o\left[\frac{[\phi(w(\theta_o))]^2}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\, x'x\right]$$
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_2'}\right] = E_o\left[\frac{[\phi(w(\theta_o))]^2}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\, x'\,\big(\eta_o' \otimes z\big)\right]$$
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_2\,\partial\theta_2'}\right] = -E_o\left[\frac{[\phi(w(\theta_o))]^2}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\, \big(\eta_o\eta_o'\big) \otimes \big(z'z\big)\right]$$
$$E_o\left[\frac{\partial^2 q^o_2}{\partial\mathrm{vec}(\delta_2)\,\partial\mathrm{vec}(\delta_2)'}\right] = -E_o\Big[\Sigma_{o22}^{-1} \otimes \big(z'z\big)\Big]$$
and
$$V_o\left(\frac{\partial q^o_1}{\partial\theta_1}\right) = E_o\left[\left(\frac{y_1 - \Phi(w(\theta_o))}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\right)^2 [\phi(w(\theta_o))]^2\, x'x\right]$$
$$\mathrm{cov}_o\left(\frac{\partial q^o_1}{\partial\theta_1}, \frac{\partial q^o_1}{\partial\theta_2}\right) = -E_o\left[\left(\frac{y_1 - \Phi(w(\theta_o))}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\right)^2 [\phi(w(\theta_o))]^2\, x'\,\big(\eta_o' \otimes z\big)\right]$$
$$V_o\left(\frac{\partial q^o_1}{\partial\theta_2}\right) = E_o\left[\left(\frac{y_1 - \Phi(w(\theta_o))}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\right)^2 [\phi(w(\theta_o))]^2\, \big(\eta_o \otimes z'\big)\big(\eta_o' \otimes z\big)\right]$$
$$V_o\left(\frac{\partial q^o_2}{\partial\mathrm{vec}(\delta_2)}\right) = E_o\Big[\big(I_r \otimes z'\big)\,\Sigma_{o22}^{-1} v_2' v_2\,\Sigma_{o22}^{-1}\big(I_r \otimes z\big)\Big]$$
The orthogonality between the scores holds when the conditional means are correct, since
$$E_o\left[\frac{\partial q^o_1}{\partial\theta}\frac{\partial q^o_2}{\partial\theta_2'}\right] = E_o\left[E_o\left(\frac{\partial q^o_1}{\partial\theta}\,\bigg|\, y_2, z_1, v_2\right)\frac{\partial q^o_2}{\partial\theta_2'}\right] = 0.$$
Then, by Proposition 1.3.14, the following conditions are sufficient for QLIML to be efficient relative to CF for $\theta_1$:
$$E_o[y_1|y_2, z] = \Phi(w(\theta_o)), \qquad V_o[y_1|y_2, z] = \tau_1\,\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big],$$
$$E_o[y_2|z] = z\delta_{o2}, \qquad V_o[y_2|z] = \tau_2\,\Sigma_{o22}, \qquad \text{with } \tau_1 \le \tau_2.$$
The restriction $\tau_1 \le \tau_2$ is especially plausible for applications of the probit model to a fractional response $y_1 \in [0, 1]$. Note that, with a correctly specified conditional mean function, the conditional variance of $y_1$ is then bounded above by $V_q[y_1|y_2, z]$, since $y_1^2 \le y_1$ on $[0,1]$:
$$V_o[y_1|y_2, z] = E\big[y_1^2\,\big|\,y_2, z\big] - [\Phi(w(\theta_o))]^2 \le E[y_1|y_2, z] - [\Phi(w(\theta_o))]^2 = \Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big].$$
And $\tau_1$ often appears to be very small in practice when $\tau_2$ is normalized to 1.
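The two-step CF estimator discussed in this example can be sketched in a few lines. The data-generating process, variable names, and parameter values below are assumptions chosen purely for illustration; the second step maximizes the Bernoulli quasi-log-likelihood $q_1$ with the first-stage residuals included as a regressor.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal((n, 3))              # z[:, 0] = z1 (included), z[:, 1:] = z2 (excluded)
v2 = rng.standard_normal(n)
y2 = z @ np.array([1.0, 1.0, 1.0]) + v2      # linear reduced form
e1 = rng.standard_normal(n)                  # e1 = u1 - v2*eta, independent of v2 by construction
y1 = (0.5 * y2 + 0.5 * z[:, 0] + 0.6 * v2 + e1 > 0).astype(float)   # eta_o = 0.6

# Step 1: OLS reduced form; the control function is the residual v2hat
delta2_hat = np.linalg.lstsq(z, y2, rcond=None)[0]
v2hat = y2 - z @ delta2_hat

# Step 2: maximize the Bernoulli quasi-log-likelihood q1 with x = (y2, z1, v2hat)
X = np.column_stack([y2, z[:, 0], v2hat])
def neg_q1(theta1):
    p = np.clip(norm.cdf(X @ theta1), 1e-12, 1.0 - 1e-12)
    return -np.sum(y1 * np.log(p) + (1.0 - y1) * np.log(1.0 - p))

theta1_cf = minimize(neg_q1, np.zeros(3), method="BFGS").x   # (alpha, delta1, eta)
```

The same second step applies unchanged when `y1` is a fractional response in $[0,1]$, which is exactly the setting where the $\tau_1 \le \tau_2$ restriction is most plausible.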
Another example where the relative efficiency conditions are applicable is the Poisson regression model for a positive response (such as count data) with endogenous explanatory variables.

Example 1.3.16 (exponential model) In the simultaneous equation system
$$y_1 = \exp(y_2\alpha + z_1\delta_1 + v_2\eta)\, u_1, \qquad y_2 = z\delta_2 + v_2,$$
assume $y_1|z, y_2 \sim_q \mathrm{Poisson}\big(\exp(y_2\alpha + z_1\delta_1 + v_2\eta)\big)$ and $y_2|z \sim_q \mathrm{Normal}(z\delta_2, \Sigma_{22})$. Then the quasi-log-likelihoods are
$$q_1(\theta_1, \theta_2) = -\exp(y_2\alpha + z_1\delta_1 + v_2\eta) + y_1\,(y_2\alpha + z_1\delta_1 + v_2\eta) - \log(y_1!)$$
$$q_2(\theta_2) = -\frac{k}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_{22}| - \frac{1}{2}\, v_{i2}(\delta_2)\,\Sigma_{22}^{-1} v_{i2}(\delta_2)'$$
Since the Poisson likelihood also belongs to the LEF, by Proposition 1.3.14 a set of sufficient conditions for relative efficiency of QLIML for $\theta_1$ is
$$E_o[y_1|y_2, z] = \exp(y_2\alpha_o + z_1\delta_{o1} + v_2\eta_o), \qquad V_o[y_1|y_2, z] = \tau_1\exp(y_2\alpha_o + z_1\delta_{o1} + v_2\eta_o),$$
$$E_o[y_2|z] = z\delta_{o2}, \qquad V_o[y_2|z] = \tau_2\,\Sigma_{o22}, \qquad \text{with } \tau_1 \le \tau_2.$$

1.4 Concluding Remarks

I show that, when both the QLIML and CF estimators are consistent and asymptotically normal, there exists an efficient GMM estimator, GMM-QLIML, whose asymptotic variance constitutes a lower bound for those of the QLIML and CF estimators. In particular, a set of generalized information matrix equalities is shown to be sufficient for the QLIML estimator to be as efficient as GMM-QLIML. In fully robust estimation of correctly specified conditional means, the condition is further weakened to the GLM variance assumptions with a scaling restriction. As Example 1.3.15 demonstrates, this condition is especially appealing for the probit model applied to a fractional response.

Still, there are remaining questions to be answered. Regarding Proposition 1.3.13, can we derive a refined condition for relative efficiency of the QLIML estimator in the $\tau_1 > \tau_2$ case? As discussed for the Poisson regression model in Example 1.3.16, there are models that often exhibit a large $\tau_1$, and this refinement (if possible) would be useful.
Also, we cannot rule out the possibility of conditions even weaker than the GIMEs with the scaling restriction given in Proposition 1.3.13. Direct comparison of asymptotic variances does not seem to work well in that direction of research. Moreover, relative efficiency relationships with other QLIL-based estimators can be studied. For example, when a reduced form model for $y_1$ is available, the minimum distance estimator suggested by Amemiya (1978, 1979) can be used. Newey (1987) showed its asymptotic efficiency in a limited information structure with normal errors. It would be interesting to study its relative efficiency relationship when it is based on a QLIL.

CHAPTER 2

EFFICIENT MINIMUM DISTANCE ESTIMATOR BASED ON QUASI-LIMITED INFORMATION LIKELIHOOD

2.1 Introduction

When a model is over-identified in a limited information simultaneous system, the classical minimum distance estimator is often proposed as an estimation method. Amemiya (1978, 1979) first introduced its application to probit and Tobit models with endogenous explanatory variables and gave it an interpretation as 'generalized least squares'. Newey (1987) called this estimator 'Amemiya's GLS (AGLS)' and showed its asymptotic efficiency under correct specification of the likelihood in a general class of limited information structures. Recent work by Wooldridge (2014) implies that, in the linear exponential family (LEF), correct specification of the regression functions of the reduced form model guarantees robustness of AGLS. Still, its relative efficiency relationship has not been clarified for the case of a potentially misspecified likelihood.

The purpose of this chapter is to study the asymptotic behavior of the minimum distance estimator based on a quasi-limited information likelihood. The primary focus is on its relative efficiency relationship with respect to the quasi-limited information maximum likelihood (QLIML) estimator and the two-step control function (CF) approach.
This chapter takes the quasi-limited information framework from Wooldridge (2014) and relies on results from Chapter 1. The main contributions of this chapter are the following. First, AGLS is interpreted as a concentrated estimator (cMD-QLIML) and a 'full' minimum distance estimator (MD-QLIML) is proposed. Based on a result analogous to one in Crépon, Kramarz and Trognon (1997), cMD-QLIML is shown to be asymptotically equivalent to MD-QLIML for the structural parameters. Second, given a quasi-limited information likelihood, cMD-QLIML is proved to be asymptotically efficient relative to QLIML and CF. In particular, cMD-QLIML can be strictly more efficient than QLIML in Newey's framework if a sufficient degree of misspecification is present in the likelihood. Third, an if-and-only-if condition for cMD-QLIML and the other estimators to be asymptotically equivalent under the null hypothesis of exogeneity is derived. An immediate implication is that a set of generalized information matrix equalities for the reduced form model is sufficient. Fourth, an explicit formula for the cMD-QLIML estimator in the linear model is derived. It is the same as GMM but with a different weighting matrix derived from the reduced form parameter estimates.

The rest of this chapter is organized as follows. In Section 2.2, basic model restrictions are given. In Section 2.3, MD-QLIML and cMD-QLIML are defined, and the relative efficiency relationships are presented. Section 2.4 contains applications to the linear model and the quantile regression model with endogenous explanatory variables.

2.2 Model Restrictions

Assume random sampling from a population.
Model restrictions start from the decomposed quasi-limited information log-likelihood framework in Wooldridge (2014):
$$QLL = q_1(y_{i1}, y_{i2}, z_i, \theta_1, \theta_2) + q_2(y_{i2}, z_i, \theta_2) \qquad (2.1)$$
where $\theta = (\theta_1', \theta_2')'$ is a $(p_1 + p_2)$-dimensional parameter vector, $y_{i1}$ is the $i$th observation of a scalar response variable, $y_{i2}$ is a $1\times r$ vector of potentially endogenous variables, and $z_i = (z_{i1}, z_{i2})$ is a $1\times k$ vector of included/excluded exogenous instruments with $k = k_1 + k_2$. For details, see Wooldridge (2014). The QLIML and CF estimators are initially given as
$$\hat\theta_{QLIML} = \arg\max_{\theta} \sum_{i=1}^{N} \big[q_{i1}(\theta) + q_{i2}(\theta_2)\big]$$
and
$$\hat\theta_{2,CF} = \arg\max_{\theta_2} \sum_{i=1}^{N} q_{i2}(\theta_2), \qquad \hat\theta_{1,CF} = \arg\max_{\theta_1} \sum_{i=1}^{N} q_{i1}\big(\theta_1, \hat\theta_{2,CF}\big),$$
and are redefined as GMM estimators based upon the first order conditions
$$\sum_{i=1}^{N}\begin{pmatrix} \frac{\partial}{\partial\theta_1} q_{i1}\big(\hat\theta_{QLIML}\big) \\ \frac{\partial}{\partial\theta_2} q_{i1}\big(\hat\theta_{QLIML}\big) + \frac{\partial}{\partial\theta_2} q_{i2}\big(\hat\theta_{2,QLIML}\big) \end{pmatrix} = 0 \quad\text{and}\quad \sum_{i=1}^{N}\begin{pmatrix} \frac{\partial}{\partial\theta_1} q_{i1}\big(\hat\theta_{CF}\big) \\ \frac{\partial}{\partial\theta_2} q_{i2}\big(\hat\theta_{2,CF}\big) \end{pmatrix} = 0,$$
respectively. Finite sample estimates of the above extremum estimators and their GMM-interpreted counterparts may not coincide. Such numerical discrepancies do not harm our asymptotic analysis since the estimators are asymptotically equivalent under regularity conditions. Both the QLIML and CF estimators are assumed to be $\sqrt{N}$-consistent for the true parameter values and asymptotically normal. The essential model restriction for validity of the relative efficiency results in this chapter is that the asymptotic variance of each estimator is in the standard sandwich form. To explicitly account for some cases of non-smooth objective functions, the Jacobian matrix of the expected score is used rather than the expected Jacobian matrix of the score. Note that the redundancy conditions in Breusch, Qian, Schmidt and Wyhowski (1999) are compatible with such a modification. Generalized information matrix equalities (GIMEs) are also defined accordingly. A set of standard regularity conditions for the GMM-interpreted estimators (Assumptions 1-13) is given in Appendix B.1.
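The joint maximization defining QLIML and the sequential maximization defining CF can be contrasted in a minimal sketch. Everything below is an illustrative assumption: a Poisson $q_1$ in the spirit of Example 1.3.16 and a normal $q_2$ with $\Sigma_{22}$ fixed at 1 (so $\theta_2 = \delta_2$), with arbitrary parameter values.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 4000
z = rng.standard_normal((n, 2))                 # z[:, 0] included, z[:, 1] excluded
v2 = rng.standard_normal(n)
y2 = z @ np.array([0.8, 0.8]) + v2
mu = np.exp(0.5 * y2 + 0.5 * z[:, 0] + 0.4 * v2)
y1 = rng.poisson(mu).astype(float)

def q1(theta1, v):
    """Poisson quasi-log-likelihood (LEF), constants dropped; index clipped for stability."""
    w = np.clip(np.column_stack([y2, z[:, 0], v]) @ theta1, -30.0, 30.0)
    return np.sum(y1 * w - np.exp(w))

def q2(delta2):
    """Normal quasi-log-likelihood for the reduced form with Sigma22 fixed at 1."""
    v = y2 - z @ delta2
    return -0.5 * np.sum(v ** 2)

# Two-step CF: maximize q2 first (OLS), then q1 with the residuals plugged in
delta2_cf = np.linalg.lstsq(z, y2, rcond=None)[0]
theta1_cf = minimize(lambda t: -q1(t, y2 - z @ delta2_cf),
                     np.zeros(3), method="BFGS").x

# QLIML: maximize q1 + q2 jointly over (theta1, delta2)
def neg_qll(p):
    t1, d2 = p[:3], p[3:]
    return -(q1(t1, y2 - z @ d2) + q2(d2))

p0 = np.concatenate([theta1_cf, delta2_cf])
theta1_qliml = minimize(neg_qll, p0, method="BFGS").x[:3]
```

With this correctly specified design both estimators are consistent for the same $\theta_1$; the chapter's results concern how their asymptotic variances compare when the likelihoods are misspecified.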
It is easy to show that, under these conditions, Proposition 1.3.4 in Chapter 1 holds with Jacobian matrices of the expected score in the sandwich form.

To consider a distance minimization problem, a reduced form model should be defined. The existence of a link function $\pi: \Theta \to \Gamma \subset \mathbb{R}^g$ is essential in the following characterization of a reduced form model.

Definition 2.2.1 A reduced form model is a pair $\big(q^R_{i1}(\pi, \theta_2),\ \pi(\theta)\big)$ such that
$$q^R_{i1}(\pi(\theta), \theta_2) = q_{i1}(\theta_1, \theta_2) \ \text{ a.s. for all } \theta \in \Theta \qquad (2.2)$$
$$\frac{\partial q^R_{i1}}{\partial\theta_2}(\pi, \theta_2) = C(\theta)\,\frac{\partial q^R_{i1}}{\partial\pi}(\pi, \theta_2) \ \text{ a.s.} \qquad (2.3)$$
for some $p_2 \times g$ matrix $C(\theta)$ whose elements are real-valued functions of $\theta$.

The link function $\pi(\theta)$ represents the functional relationship between the structural parameters and the reduced form parameters. It relates the likelihoods of the structural and reduced form models as in the first condition (2.2). Note that the decomposed likelihood $q^R_{i1}(\pi, \theta_2) + q_2(\theta_2)$ for a reduced form model still belongs to the QLIML framework, and $q^R_1$ alone cannot identify $\pi$ without the help of $q_2$. Based on the relative efficiency results in Chapter 1, this model is 'reduced' in the sense that GMM-QLIML for this model has no additional moment functions from $\partial q^R_{i1}/\partial\theta_2$. In other words, all nonredundant effects of $(z_i, y_{i2})$ on $y_1$ are captured in $\pi$. This is characterized by the second condition (2.3). In turn, the QLIML and CF estimators are asymptotically equivalent for the reduced form model parameters $(\pi, \theta_2)$. A set of standard regularity conditions for a reduced form model and a link function (Assumptions 14-17) is given in Appendix B.1.

2.3 Minimum Distance Estimators: MD/cMD-QLIML

Given reduced form estimates $(\hat\pi, \hat\theta_2)$, MD-QLIML is defined as the minimum distance estimator of $\theta$ minimizing the optimally weighted sum of the distances $\hat\pi - \pi(\theta)$ and $\hat\theta_2 - \theta_2$.

Definition 2.3.1 Let $(\hat\pi, \hat\theta_2)$ be reduced form parameter estimates and suppose
$$\sqrt{N}\left(\begin{pmatrix} \hat\pi \\ \hat\theta_2 \end{pmatrix} - \begin{pmatrix} \pi_o \\ \theta_{o2} \end{pmatrix}\right) \xrightarrow{d} N(0, \Omega_R).$$
Then the MD-QLIML estimator $\hat\theta_{MD\text{-}QLIML}$ is a solution to
$$\min_{\theta}\ \left[\begin{pmatrix} \hat\pi \\ \hat\theta_2 \end{pmatrix} - h(\theta)\right]' \hat\Omega_R^{-1} \left[\begin{pmatrix} \hat\pi \\ \hat\theta_2 \end{pmatrix} - h(\theta)\right] \qquad (2.4)$$
where $h(\theta) = \big(\pi(\theta)', \theta_2'\big)'$ and $\hat\Omega_R \xrightarrow{p} \Omega_R$.

Hence, MD-QLIML is a two-step procedure:

1. Estimate $(\hat\pi, \hat\theta_2)$ by solving the just-identified reduced form model; $(\hat\pi, \hat\theta_2)$ mainly represents the estimated mean responsiveness of $y_{i1}$ and $y_{i2}$ with respect to all available exogenous variation in the instruments.

2. Compress the information contained in $(\hat\pi, \hat\theta_2)$ into structural parameter estimates by solving the distance minimization problem (2.4).

Later, we will see that this two-step estimation procedure can remarkably enhance finite sample performance compared to the asymptotically equivalent optimal GMM estimator when the model is over-identified. The next proposition states the asymptotic distribution of MD-QLIML.

Proposition 2.3.2 Assume Assumptions 1-17. MD-QLIML is asymptotically normal:
$$\sqrt{N}\big(\hat\theta_{MD\text{-}QLIML} - \theta_o\big) \xrightarrow{d} N\Big(0,\ \big(H_o'\,\Omega_R^{-1} H_o\big)^{-1}\Big), \qquad \text{where } H_o = \frac{\partial}{\partial\theta'}\, h(\theta_o).$$

One significant distinction of MD-QLIML from Amemiya's GLS estimator is that the distance minimization problem of MD-QLIML considers the almost redundant-looking distance $\hat\theta_2 - \theta_2$, while that of AGLS imposes the constraint $\theta_2 = \hat\theta_2$ with a corresponding adjustment of the weighting matrix. In fact, AGLS can be interpreted as a 'concentrated MD-QLIML', where 'concentration' means $\theta_2$ is regarded as an implicit function of $\theta_1$ in the solution space of the minimization problem. This interpretation is based on the following general result, which is analogous to Proposition 1 in Crépon, Kramarz and Trognon (1997).

Proposition 2.3.3 (Concentrated Minimum Distance Estimator) Assume (1) $h(\theta) = (h_1', h_2')'$ is continuously differentiable in $\theta = (\theta_1, \theta_2)$, where $\theta_1 \in \mathbb{R}^{p_1}$, $\theta_2 \in \mathbb{R}^{p_2}$, $h_1 \in \mathbb{R}^g$, $h_2 \in \mathbb{R}^{p_2}$ with $g \ge p_1$; (2) $\partial h(\theta_o)/\partial\theta'$ has full column rank; (3) $h(\theta) \ne \pi_o$ if $\theta \ne \theta_o$, where $\pi_o = (\pi_{o1}, \pi_{o2})$; (4) $\sqrt{N}(\hat\pi - \pi_o) \xrightarrow{d} N(0, \Omega_o)$ with $\hat\pi_1 \in \mathbb{R}^g$ and $\hat\pi_2 \in \mathbb{R}^{p_2}$; (5) $\det\big(\partial h_2(\theta)/\partial\theta_2'\big) \ne 0$ for all $\theta \in \Theta$. Then $\varphi_N(\theta_1)$ is well-defined by $\hat\pi_2 - h_2(\theta_1, \varphi_N(\theta_1)) = 0$ for each $(\theta_1, N)$, and the estimator $\hat\theta_{c,1}$ derived from
$$\min_{\theta_1}\ \big(\hat\pi_1 - h_1(\theta_1, \varphi_N(\theta_1))\big)'\, \hat W_1\, \big(\hat\pi_1 - h_1(\theta_1, \varphi_N(\theta_1))\big)$$
is asymptotically equivalent to the minimum distance estimator of $\theta_1$ which solves
$$\min_{\theta}\ \big(\hat\pi - h(\theta)\big)'\, \hat W\, \big(\hat\pi - h(\theta)\big)$$
where $\hat W_1 \xrightarrow{p} \big(S_o\,\Omega_o\, S_o'\big)^{-1}$, $S_o = \left[I_g \ \ -\dfrac{\partial h_1(\theta_o)}{\partial\theta_2'}\left[\dfrac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}\right]$ and $\hat W \xrightarrow{p} \Omega_o^{-1}$. Moreover, the asymptotic distribution of $\hat\theta_{c,1}$ is
$$\sqrt{N}\big(\hat\theta_{c,1} - \theta_{o1}\big) \xrightarrow{d} N\Big(0,\ \big(H_c'\,[S_o\,\Omega_o\,S_o']^{-1} H_c\big)^{-1}\Big)$$
where
$$H_c = \frac{\partial h_1(\theta_o)}{\partial\theta_1'} - \frac{\partial h_1(\theta_o)}{\partial\theta_2'}\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}\frac{\partial h_2(\theta_o)}{\partial\theta_1'}.$$

Proposition 2.3.3 provides a method for constructing an asymptotically equivalent minimum distance estimator of $\theta_1$ by concentrating $\theta_2$ out. The key condition is that the dimension of the concentrated parameter $\theta_2$ equals that of the concentrating equation $\hat\pi_2 - h_2(\theta_1, \theta_2) = 0$. This condition is precisely satisfied for $\theta_1$ and $\theta_2$ in the QLIML framework. Applying Proposition 2.3.3 to the QLIML framework, we have $h_2(\theta_1, \theta_2) = \theta_2$, and the implicit function in the proposition is merely $\varphi_N(\theta_1) = \hat\theta_2$, a constant function of $\theta_1$. In turn, AGLS can be defined as concentrated MD-QLIML (cMD-QLIML), and its asymptotic distribution and asymptotic equivalence to MD-QLIML follow as a corollary.

Definition 2.3.4 cMD-QLIML (= AGLS) is defined to be a solution $\hat\theta_{1,cMD\text{-}QLIML}$ of
$$\min_{\theta_1}\ \big(\hat\pi - \pi(\theta_1, \hat\theta_2)\big)'\, \hat W_1\, \big(\hat\pi - \pi(\theta_1, \hat\theta_2)\big) \qquad (2.5)$$
where $\hat W_1 \xrightarrow{p} \big(S_o\,\Omega_R\, S_o'\big)^{-1}$ and $S_o = \left[I_g \ \ -\dfrac{\partial\pi(\theta_o)}{\partial\theta_2'}\right]$.

Corollary 2.3.5 Assume Assumptions 1-17. Then $V^{MD\text{-}QLIML}_{\theta_1} = V^{cMD\text{-}QLIML}_{\theta_1}$ and
$$\sqrt{N}\big(\hat\theta_{1,cMD\text{-}QLIML} - \theta_{o1}\big) \xrightarrow{d} N\Big(0,\ \big(H_c'\,[S_o\,\Omega_R\,S_o']^{-1} H_c\big)^{-1}\Big)$$
where $H_c = \dfrac{\partial\pi(\theta_o)}{\partial\theta_1'}$ and $S_o = \left[I_g \ \ -\dfrac{\partial\pi(\theta_o)}{\partial\theta_2'}\right]$.

To study the relative efficiency relationship between GMM-QLIML and AGLS (cMD-QLIML), it is useful to consider the following GMM counterpart of MD-QLIML.

Definition 2.3.6 (MD-QLIML-equivalent GMM) mGMM-QLIML is defined to be the optimal GMM estimator based on
$$\begin{pmatrix} \frac{\partial}{\partial\theta}\, q^R_1\big(\pi(\theta), \theta_2\big) \\ \frac{\partial}{\partial\theta_2}\, q_2(\theta_2) \end{pmatrix} \qquad (2.6)$$

The moments in (2.6) are the same first order conditions used in the first stage of MD-QLIML, except that $\pi$ is treated as a function of $\theta$. This estimator can be understood as combining the two-step procedure of MD-QLIML into one step: accounting for the mean responsiveness of $y_1$ and $y_2$ with respect to all exogenous variation in the instruments, choose $\theta$ optimally. Under the regularity and identification conditions for MD-QLIML (Assumptions 1-17), this estimator is well-defined and asymptotically equivalent to MD-QLIML.

Proposition 2.3.7 Assume Assumptions 1-17. Then MD-QLIML and mGMM-QLIML are asymptotically equivalent.

mGMM-QLIML is efficient relative to GMM-QLIML in the matrix positive semi-definite sense. Proposition 2.3.8 formalizes this result along with a condition under which mGMM-QLIML and GMM-QLIML are asymptotically equivalent.

Proposition 2.3.8 Assume Assumptions 1-17. Then $V^{mGMM\text{-}QLIML} \le V^{GMM\text{-}QLIML}$, where the inequality becomes an equality if $p_1 + p_{22} = g$.

The following proposition summarizes the results in the framework of the linear index model, which is the most frequently used class of models to which the results apply, though not the only one.

Proposition 2.3.9 Assume Assumptions 1-18.
Suppose
$$q_{i1}(\theta_1, \theta_2) = l\big(y_{i1},\ y_2\alpha + z_1\delta_1 + v_2\eta,\ \gamma\big)$$
$$q_{i2}(\theta_2) = -\frac{k}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_{22}| - \frac{1}{2}\, v_{i2}(\delta_2)\,\Sigma_{22}^{-1} v_{i2}(\delta_2)'$$
where $\theta_1 = (\alpha', \delta_1', \eta', \gamma')'$, $\theta_2 = \big(\mathrm{vec}(\delta_2)', \mathrm{vech}(\Sigma_{22})'\big)'$ and $v_{i2}(\delta_2) \equiv y_{i2} - z_i\delta_2$. Then the following results hold:

(a) $V^{MD\text{-}QLIML} \le V^{GMM\text{-}QLIML} \le V^{QLIML},\ V^{CF}$

(b) If $k_2 = r$, then $V^{MD\text{-}QLIML} = V^{GMM\text{-}QLIML} = V^{QLIML} = V^{CF}$

(c) If $\eta_o \ne 0$, then $V^{MD\text{-}QLIML} = V^{GMM\text{-}QLIML}$

(d) If $\eta_o = 0$, then $V^{MD\text{-}QLIML} \le V^{GMM\text{-}QLIML} = V^{QLIML} = V^{CF}$

(e) If $\eta_o = 0$ and $k_2 > r$, then $V^{MD\text{-}QLIML} = V^{GMM\text{-}QLIML}$ if and only if
$$E\left[\frac{\partial}{\partial\theta'}\frac{\partial \tilde q_{i1}}{\partial\tilde\pi}\bigg|_{(\tilde\pi', \theta_2')' = (\tilde\pi_o', \theta_{o2}')'}\right] = \mathrm{cov}_o\left(\frac{\partial \tilde q^o_{i1}}{\partial\tilde\pi},\ \begin{pmatrix} \frac{\partial q^o_{i1}}{\partial\theta_1} \\ \frac{\partial q^o_{i1}}{\partial\theta_2} + \frac{\partial q^o_{i2}}{\partial\theta_2} \end{pmatrix}\right)\left[V_o\begin{pmatrix} \frac{\partial q^o_{i1}}{\partial\theta_1} \\ \frac{\partial q^o_{i1}}{\partial\theta_2} + \frac{\partial q^o_{i2}}{\partial\theta_2} \end{pmatrix}\right]^{-1} E_o\begin{bmatrix} \frac{\partial^2 q^o_{i1}}{\partial\theta_1\,\partial\theta'} \\ \frac{\partial^2 (q^o_{i1}+q^o_{i2})}{\partial\theta_2\,\partial\theta'} \end{bmatrix}$$
where $\partial \tilde q_{i1}/\partial\tilde\pi$ is such that optimal GMM based on it together with the QLIML moment functions is asymptotically equivalent to mGMM-QLIML. A sufficient condition is that the GIMEs for $q_1$, $q_2$ and $q_1 + q_2$ hold with the same scaling factor for the reduced form model.

Result (a) also holds for cases where the index contains higher order terms of $y_2$. However, the asymptotic equivalence of MD-QLIML and GMM-QLIML in (b) and (c) does not hold in general when higher order terms of $y_2$ are present. (b) is a well-known property and, in fact, the estimators are numerically equivalent. (c) is the typical relative efficiency relationship when endogeneity is present. If a set of GIMEs with a common scaling factor holds, QLIML will also be asymptotically equivalent to MD-QLIML. (d) and (e) show the relative efficiency of MD-QLIML when there is no endogeneity. With over-identification, there exists a potential efficiency gain, which vanishes under a set of GIMEs for the reduced form model.
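To fix ideas before the examples, the two-step MD procedure can be sketched numerically in the simplest linear case with scalar $y_2$. The data-generating process and all names below are illustrative assumptions, and an identity weighting matrix is used for simplicity in place of the optimal $\hat W_1 \xrightarrow{p} (S_o\Omega_R S_o')^{-1}$ (consistent, though not efficient).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k1, k2 = 5000, 1, 3                     # k2 > r = 1: over-identified
z = rng.standard_normal((n, k1 + k2))
v2 = rng.standard_normal(n)
y2 = z @ np.ones(k1 + k2) + v2
# structural equation with endogeneity: error = 0.6*v2 + noise, so eta = 0.6
y1 = 0.5 * y2 + 0.5 * z[:, 0] + 0.6 * v2 + rng.standard_normal(n)

def ols(A, b):
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Step 1: just-identified reduced form estimates (pi_hat, delta2_hat)
delta2_hat = ols(z, y2)
v2hat = y2 - z @ delta2_hat
pi_hat = ols(np.column_stack([z, v2hat]), y1)   # (pi1, pi2, pi3)

# Step 2: minimum distance; pi(theta) = H11(delta2) theta1 with
# pi1 = delta21*alpha + delta1, pi2 = delta22*alpha, pi3 = alpha + eta
H11 = np.zeros((k1 + k2 + 1, 3))
H11[:k1 + k2, 0] = delta2_hat              # alpha column
H11[:k1, 1] = 1.0                          # delta1 column
H11[-1, 0] = 1.0                           # alpha enters pi3
H11[-1, 2] = 1.0                           # eta column
W1 = np.eye(k1 + k2 + 1)                   # identity weight (illustrative, not optimal)
theta1_md = np.linalg.solve(H11.T @ W1 @ H11, H11.T @ W1 @ pi_hat)
```

With the linear link $\pi(\theta) = H_{11}(\theta_2)\theta_1$, the second step reduces to the closed form $\hat\theta_1 = (H_{11}'\hat W_1 H_{11})^{-1} H_{11}'\hat W_1\hat\pi$, mirroring the derivation in the next section.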
2.4 Example 1: Linear Regression Model with Endogenous Explanatory Variables

From the equation $y_{i2} = z_i\delta_2 + v_{i2}$ and the uniqueness of linear projection, it is clear that $L(y_{i1}|y_{i2}, z_i) = L(y_{i1}|v_{i2}, z_i)$. Substituting for $y_2$ in $q_1$, or equivalently, substituting into the regression equation for $y_{i1}$, yields
$$y_{i1} = y_{i2}\alpha + z_{i1}\delta_1 + v_{i2}\,\Sigma_{22}^{-1}\Sigma_{21} + e_{i1} = z_{1i}(\delta_{21}\alpha + \delta_1) + z_{2i}\,\delta_{22}\alpha + v_{i2}\big(\alpha + \Sigma_{22}^{-1}\Sigma_{21}\big) + e_{i1} \equiv z_{1i}\pi_1 + z_{2i}\pi_2 + v_{i2}\pi_3 + e_{i1}.$$
Along with $\pi_4 \equiv \sigma_{11|2}(\theta)$, $\pi(\theta)$ is naturally defined as
$$\pi(\theta) = \Big((\delta_{21}\alpha + \delta_1)',\ (\delta_{22}\alpha)',\ \big(\alpha + \Sigma_{22}^{-1}\Sigma_{21}\big)',\ \sigma_{11|2}(\theta)\Big)'$$
while the dependence of $q_{i1}$ on $\theta_2$ runs through the control function $v_{i2}(\delta_2)$. Consistency and asymptotic normality of the estimator of $(\pi, \theta_2)$ are implied by the invertibility of $E[z'z]$ and the other regularity conditions assumed for the QLIML and CF estimators. However, additional assumptions are needed in nonlinear models in general. For example, when we allow higher order terms of $y_2$ in the regression function, as in
$$y_{i1} = y_{i2}^2\,\alpha + z_{i1}\delta_1 + v_{i2}\,\Sigma_{22}^{-1}\Sigma_{21} + e_{i1},$$
it is required to impose additional orthogonality and rank conditions on the structural and reduced form models; i.e., (a) the regression function is specified with the conditional mean operator rather than the linear projection operator, and (b) linear independence of the higher order terms of the instruments is assumed.

The following demonstrates a computationally useful reparameterization of classical LIML under which explicit expressions for $H_o$, $H_c$, $S_o$ and the closed form of the cMD-QLIML estimator are available. Consider the reparametrization
$$\eta \equiv \Sigma_{22}^{-1}\Sigma_{21}, \qquad \sigma_{11|2} \equiv \sigma_{11} - \Sigma_{12}\,\Sigma_{22}^{-1}\Sigma_{21}.$$
It can easily be shown that this reparameterization does not alter the other parameter estimates of any estimation method discussed in this chapter.
Taking the derivative of $h(\theta)$ at $\theta_o$, we have
$$H_o = \begin{pmatrix} H_{11}(\theta_{o2}) & H_{12}(\theta_{o1}) \\ 0 & I_{p_2} \end{pmatrix}$$
where
$$H_{11}(\theta_2) = \begin{pmatrix} \delta_{21} & I_{k_1} & 0 & 0 \\ \delta_{22} & 0 & 0 & 0 \\ I_r & 0 & I_r & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad H_{12}(\theta_1) = \begin{pmatrix} \alpha' \otimes I_k & 0_{k\times\frac{r(r+1)}{2}} \\ 0_{r\times rk} & 0 \\ 0_{1\times rk} & 0 \end{pmatrix}.$$
Hence, we have $H_c = H_{11}(\theta_{o2})$ and $S_o = \big[I_g\ \ -H_{12}(\theta_{o1})\big]$. Preliminary estimates of $\alpha$, needed to construct the weighting matrix for cMD-QLIML, can be calculated by QLIML or CF. Moreover, since $\pi(\theta) = H_{11}(\theta_2)\theta_1$, taking the first order condition of (2.5) yields the closed form
$$\hat\theta_{1,cMD\text{-}QLIML} = \Big[H_{11}\big(\hat\theta_2\big)'\,\hat W_1\, H_{11}\big(\hat\theta_2\big)\Big]^{-1} H_{11}\big(\hat\theta_2\big)'\,\hat W_1\,\hat\pi.$$
Note that $e_i(\theta) = e_i(\pi(\theta))$, where
$$e_i(\theta) \equiv y_{i1} - y_{i2}\alpha - z_{i1}\delta_1 - v_{i2}\,\Sigma_{22}^{-1}\Sigma_{21}, \qquad e_i(\pi(\theta)) \equiv y_{i1} - z_{1i}\pi_1(\theta) - z_{2i}\pi_2(\theta) - v_{i2}\pi_3(\theta).$$
By replacing $e_i(\theta)$ with $e_i(\pi)$ in $q_{i1}$, along with the reparametrized $\pi_4 = \sigma_{11|2}(\theta)$, a reduced form model likelihood $q^R_1(\pi, \theta_2)$ is derived:
$$q^R_{i1}(\pi, \theta_2) = -\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\pi_4 - \frac{1}{2\pi_4}\, e_i(\pi)^2.$$
Taking derivatives with respect to $\pi$, we have
$$\frac{\partial q^R_{i1}(\pi, \theta_2)}{\partial\pi} = \begin{pmatrix} \pi_4^{-1}\, e_i(\pi)\, z' \\ \pi_4^{-1}\, e_i(\pi)\, v_2' \\ -\frac{1}{2\pi_4} + \frac{1}{2\pi_4^2}\, e_i(\pi)^2 \end{pmatrix}$$
Then the mGMM-QLIML moment functions are derived by treating $\pi(\theta)$ as a function of $\theta$:
$$\frac{\partial q^R_{i1}(\pi(\theta), \theta_2)}{\partial\pi} = \begin{pmatrix} \sigma_{11|2}(\theta)^{-1}\, e_i(\theta)\, z' \\ \sigma_{11|2}(\theta)^{-1}\, e_i(\theta)\, v_2' \\ -\frac{1}{2}\sigma_{11|2}(\theta)^{-1} + \frac{1}{2}\sigma_{11|2}(\theta)^{-2}\, e_i(\theta)^2 \end{pmatrix}$$
It is not difficult to see that mGMM-QLIML is asymptotically equivalent to the GMM-QLIML whose $\partial q_1/\partial\theta_{22}$ is replaced with (1.11), as discussed in the previous section. To see the existence of $C(\theta)$, assume without loss of generality that the first $r$ instruments in $z_2$ are chosen in $\partial q_1/\partial\theta_{22}$; it is then implied that
$$C(\theta) = \eta_{i_o}\Big[\,0_{(k_2-r)\times k_1}\quad I_{k_2-r}\quad 0_{(k_2-r)\times(2r+1)}\,\Big]$$
when $\eta_{i_o}$ is nonzero for the chosen $i_o$ and $k_2 > r$.

2.5 Example 2: Probit with Endogenous Explanatory Variables

Consider the probit model with endogeneity:
$$y_{i1} = 1\big[y_{i2}\alpha + z_{i1}\delta_1 + u_{i1} > 0\big], \qquad y_{i2} = z_i\delta_2 + v_{i2}.$$
Assume regularity and identification conditions (Assumptions 1-17).
For computational convenience, a reparametrization similar to that in Example 1 is imposed along with the normalization $e_1 = u_1 - v_2\eta$. Also, $\Sigma_{22}$ is dropped from $q_2$, since its exclusion does not affect the other parameter estimates. The quasi-likelihood can then be simplified as
$$q_1(\theta_1, \theta_2) = (1 - y_1)\log\big[1 - \Phi(w(\theta))\big] + y_1\log\Phi(w(\theta))$$
$$q_{i2}(\theta_2) = -\frac{1}{2}\, v_{i2}(\delta_2)\, v_{i2}(\delta_2)'$$
where $\theta_1 = (\alpha', \delta_1', \eta')'$, $\theta_2 = \mathrm{vec}(\delta_2)$, and $w(\theta) = y_2\alpha + z_1\delta_1 + v_2\eta$. (By an invertible transformation of the mGMM moment functions and the separability condition of GMM, it can also be shown that, in the linear model of Example 1, mGMM-QLIML is asymptotically equivalent to the optimal GMM estimator based on $E[z'u] = 0$.) Taking derivatives, the quasi-scores can be expressed as
$$\frac{\partial q_1}{\partial\theta_1} = \frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\begin{pmatrix} y_2' \\ z_1' \\ v_2' \end{pmatrix}$$
$$\frac{\partial q_1}{\partial\theta_2} = -\frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\,\big(\eta \otimes z'\big)$$
$$\frac{\partial q_2}{\partial\theta_2} = \big(I_r \otimes z'\big)\,(y_2 - z\delta_2)'$$
It is easy to see that the GMM-QLIML extra moment functions are
$$\frac{\partial q_1}{\partial\theta_{22}} = -\frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\,\eta_{i_o}\, z_{2,-r}'$$
where $\eta_{i_o} \ne 0$. The components of $H_o$, $H_c$ and $S_o$ can be calculated similarly as in Example 1:
$$H_{11}(\theta_2) = \begin{pmatrix} \delta_{21} & I_{k_1} & 0 \\ \delta_{22} & 0 & 0 \\ I_r & 0 & I_r \end{pmatrix}, \qquad H_{12}(\theta_1) = \begin{pmatrix} \alpha' \otimes I_k \\ 0_{r\times rk} \end{pmatrix}.$$
To derive the mGMM moment functions, note that
$$w(\theta) = y_2\alpha + z_1\delta_1 + v_2\eta = z_1\pi_1 + z_2\pi_2 + v_2\pi_3 = w(\pi(\theta)).$$
Then, differentiating with respect to the reduced form parameters, we have
$$\frac{\partial q^R_1(\pi(\theta), \theta_2)}{\partial\pi} = \frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\begin{pmatrix} z' \\ v_2' \end{pmatrix}$$
which shows that this model is in the LL class, as claimed by Proposition 2.3.9.

2.6 Monte Carlo Simulation on the Probit Model with EEV

In this section, Monte Carlo simulations are conducted for the probit model under several specifications. The purpose is to investigate the effects of the GIMEs on the finite sample performance of the estimators when the model is over-identified.
Based on the relative asymptotic efficiency results in the previous sections, it is expected that MD and its equivalent estimators outperform CF and QLIML (in terms of standard deviation) at a large enough sample size when enough misspecification is present in an overidentified model. Root mean squared error (RMSE) is used as the main performance measure in this study to take account of bias as well as mean deviation. The assumptions on data generation are as follows: all data points are independently and identically generated. The instruments {z_k}_{k=1}^{10} are mutually independent and z_k ~ SBin(10^4, 1/3) for each 1 ≤ k ≤ 10, where SBin(n, p) ≡ [Bin(n, p) − np] / sqrt(np(1−p)). The regression equation for the scalar y_2 is y_2 = z_1 + ... + z_10 + v_2. For the specification of the GIMEs of q_1 and q_2, the following restrictions were imposed in each case:

- Probit y_1 (GIME holds): y_1 = 1[ z_1 + y_2 + η v_2 + e_2 > 0 ]
- Fractional y_1 (GIME fails): y_1 = (1/4) 1[ z_1 + y_2 + η v_2 + e_2 > 0 ] + (3/4) 1[ z_1 + y_2 + η v_2 + (e_2 + e_3)/√2 > 0 ]
- Homoskedastic v_2: v_2 ~ SBin(10^4, 1/3)
- Heteroskedastic v_2: v_2 ~ z_1 z_2 · SBin(10^4, 1/3)

where e_2 and e_3 each follow independent standard normal distributions. Due to the normalized variance of v_2, it can be shown that, by considering q_2 without the Σ_22 term, only homoskedasticity is needed for the GIME to hold for q_2. The fractional response y_1 fails the GIME for q_1 due to the correlation between e_2 and (e_2 + e_3)/√2. Since the relevant random variables are all discrete and have bounded supports, the RMSEs for all estimators are well-defined and can be used in comparison. The number of repetitions is 10^4, and mGMM was estimated by the iterative (continuously updating) method. The simulation program was written in the ado/Mata language of Stata 13, and it was executed using High Performance Computing Center (HPCC) resources provided by the Institute for Cyber-Enabled Research (iCER) at Michigan State University. Table 2.1 shows the results.
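As an illustration of this design, the following Python sketch draws the standardized-binomial instruments and one observation from the binary/homoskedastic specification (Simulation I). It is not the chapter's Stata/Mata program; n = 100 is used here instead of 10^4 purely to keep the sketch fast, and η is set to the value 0.6 used in Table 2.1:

```python
import random

def sbin(n, p, rng):
    """One draw of the standardized binomial SBin(n, p) = (Bin(n, p) - np) / sqrt(np(1-p))."""
    b = sum(rng.random() < p for _ in range(n))  # Bin(n, p)
    return (b - n * p) / (n * p * (1 - p)) ** 0.5

def draw_obs(rng, eta=0.6, n=100):
    """One observation (y1, y2, z, v2) from the binary/homoskedastic design."""
    z = [sbin(n, 1 / 3, rng) for _ in range(10)]  # 10 mutually independent instruments
    v2 = sbin(n, 1 / 3, rng)
    y2 = sum(z) + v2                              # reduced form: y2 = z1 + ... + z10 + v2
    e2 = rng.gauss(0.0, 1.0)
    y1 = 1 if z[0] + y2 + eta * v2 + e2 > 0 else 0
    return y1, y2, z, v2

rng = random.Random(42)
sample = [draw_obs(rng) for _ in range(500)]
share = sum(obs[0] for obs in sample) / len(sample)
assert 0.0 < share < 1.0  # both outcomes occur
```

The fractional-response and heteroskedastic cases only change the last two lines of `draw_obs` accordingly.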
Simulation I is a case where the complete set of GIMEs holds, so that MD, cMD, mGMM and QLIML are all asymptotically equivalent.

Table 2.1 Root Mean Squared Error and Standard Deviation (η = 0.6). [The table reports the RMSE and SD of α̂, δ̂, and η̂ for the CF, QLIML, mGMM, cMD, and MD estimators under specifications I (binary y_1, homoskedastic v_2), II (binary, heteroskedastic), III (fractional, homoskedastic), and IV (fractional, heteroskedastic).]

In Simulation I, thus, all estimators are asymptotically equivalent, including CF. The other simulations (II, III, IV) have at least one GIME failing, and MD, cMD and mGMM are efficient relative to both QLIML and CF. Standard deviations are also presented along with the RMSEs. There are some points to be mentioned: 1) These results show that there can be cases where minimum distance and its equivalent estimators outperform CF and QLIML in finite samples. 2) The minimum distance estimators and their equivalents, except mGMM, behave quite similarly to each other in Simulation I, as predicted.
3) MD-QLIML performs remarkably well, while the asymptotically equivalent mGMM had poor finite-sample behavior. Compared to the other estimators, MD-QLIML usually has the best performance and, even when it is second best, the RMSE difference from the best is not large.

CHAPTER 3 SHORT PANEL DATA QUANTILE REGRESSION MODEL WITH SPARSE CORRELATED EFFECTS

3.1 Introduction

Application of quantile regression to panel data is attractive to empirical researchers. Compared to conditional mean regression, quantile regression by nature provides a more thorough description of the population distribution. With its application to panel data, the unobserved individual effects can be accounted for, so that a potential source of endogenous variation is eliminated. A natural quantile analogue of a linear panel data model, however, suffers from the well-known incidental parameters problem, as in generic nonlinear models (Neyman and Scott, 1954). Rosen (2012) shows that with the time dimension fixed, the conditional quantile restriction alone cannot identify the regression coefficients in general. Additional point-identifying restrictions considered in the literature so far assume at least one of the following: (i) an infinite time dimension, (ii) pure location-shifting unobserved effects, or (iii) a certain degree of within-group independence of the regression errors (e.g., Koenker, 2004; Rosen, 2012; Lamarche, 2010; Canay, 2011; see Section 3.2). Depending on the empirical context, these assumptions may not be credible for short panel data analysis, and any breakdown of such identifying restrictions will result in inconsistent estimation. The purpose of this chapter is to study an alternative point-identifying model restriction and a feasible estimation procedure for linear panel data quantile regression with a fixed time dimension. The main contributions of this chapter are as follows.
First, I propose a new point-identifying restriction for a linear panel data quantile regression model with a finite time dimension. The new model restriction reasonably accounts for the τ-quantile-specific time-invariant heterogeneity and allows arbitrary within-group dependence of the regression errors. The generalized Chamberlain device is taken analogously as a control function to capture τ-quantile-specific time-invariant endogenous variations. Endogeneity due to an observability pattern in unbalanced panel data can be accounted for as well. Second, the asymptotic properties of a nonconvex penalized estimator are studied. To treat the high-dimensional nature of the generalized Chamberlain device, a nonconvex penalized estimator is adopted. Compared to the exact sparse models for cross-sectional data in Wang, Wu and Li (2012; WWL) and Sherwood and Wang (2016; SW), the model in consideration accounts for an approximation error in the sparse model and for within-group dependence of panel data. The convergence rate and asymptotic distribution of the oracle estimator are studied under both exact and approximate sparsity assumptions. A sparse version of the standard partially linear semiparametric model asymptotics is derived under approximate sparsity. The proposed penalized estimator is shown to have an oracle property in the sense that the estimator based on the true sparse model belongs to the local minima of the penalized quasi-likelihood with probability tending to one. The lower-bound condition on the smallest magnitude of the nonzero coefficients, the so-called beta-min condition, is relaxed compared to the one given in SW. Third, a transformation of the sieve-approximated correlated effects into a generalized Mundlak form is proposed to make the sparsity assumption more plausible in some cases. Given a choice of sieve basis elements, the approximating terms are transformed into time averages and deviations.
Whenever the sieve elements contain a first-order polynomial term, both the classical Chamberlain and the Mundlak form of correlated effects are nested by the transformed approximating terms as special cases of true sparse models. Fourth, Monte Carlo simulations show that, depending on the true model, the estimator using a generalized Chamberlain form can outperform the one using a Mundlak form, and vice versa. Fifth, an empirical application to birth weight analysis demonstrates a convincing case where the proposed estimator works as intended in real data. The rest of this chapter is organized as follows: Section 3.2 gives a brief literature review of linear panel data quantile regression. In Section 3.3, the new identifying restriction is explained and formalized. Along with the sieve-approximated correlated effects, nonconvex penalized estimation and its asymptotic properties are presented in Section 3.4. Simulation results are discussed in Section 3.5. The empirical application to birth weight analysis is in Section 3.6. Section 3.7 contains concluding remarks.

3.2 Literature on Linear Panel Data Quantile Regression

The literature on linear panel data quantile regression models has been growing rapidly in recent years. First, there are several studies where both the time dimension T and the sample size N are assumed to be large. Koenker (2004) proposed penalized estimation under the pure location shift restriction and large (T, N) asymptotics. Lamarche (2010) showed that the penalized estimator is unbiased under a zero median condition on the fixed effects, and derived an optimal choice of the penalty parameter. Harding and Lamarche (2016) considered a semiparametric correlated effects model in a framework similar to Koenker (2004) and Lamarche (2010). Kato, Galvao Jr. and Montes-Rojas (2012) formally studied asymptotic results when (T, N) tends to infinity.
They relaxed the intertemporal independence assumption in Koenker (2004), and found that, for asymptotic normality, the rate condition imposed on T is more restrictive than the one found in generic nonlinear models due to the non-smoothness of the loss function. This result indicates that its short panel data application is even less appealing. Second, point-identifying restrictions and estimation methods for the fixed-T case have been studied. Rosen (2012) showed that weak conditional independence of the regression errors across time, together with some support and tail conditions, implies point-identification. Canay (2011) showed that an alternative conditional independence restriction in the random coefficient framework is also sufficient for identification, and he proposed a simple estimation method when the unobserved effects are pure location shifters. When the independence assumption is strengthened to i.i.d., Graham, Hahn and Powell (2009) showed that there is no incidental parameters problem since the first-differenced regression errors have a zero conditional median. Abrevaya and Dahl (2008), without explicitly setting up rigorous model restrictions, applied a quantile analogue of a correlated random effects model to analyze the effects of birth inputs on birth weight. Apart from linear panel data quantile regression models, there are several related works on panel data models. Wooldridge and Zhu (2016, manuscript) proposed a high-dimensional probit model with sparse correlated effects under fixed T. Arellano and Bonhomme (2016) considered a class of nonlinear panel data models under fixed T where the unobserved heterogeneity is nonparametrically modelled. Graham, Hahn, Poirier and Powell (2015, manuscript) extend the correlated random coefficients representation of linear quantile regression to panel data under fixed T. Chernozhukov, Fernández-Val, Hahn and Newey (2013) studied a general nonseparable model assuming time-homogeneous errors and large (T, N).
3.3 Identification

3.3.1 Generalized Chamberlain Device

One of the essential advantages of using panel data is to resolve the potential endogeneity problem that arises from unobserved time-constant heterogeneity. The unobserved effects are typically specified as unknown coefficient parameters on individual dummy variables. In the linear panel data conditional mean model, such a specification is useful: both the differencing method and direct control of the dummies yield consistent estimators under mild conditions. Unfortunately, panel data quantile regression with individual dummies suffers from an incidental parameters problem in general. I propose a generalized Chamberlain device as an alternative approach to eliminating time-invariant endogeneity in the spirit of a control function approach. The idea is to control the time-constant endogenous variation (regressor-correlated variation) only, not the whole heterogeneous individual effect in the unobserved error. A well-known example in the conditional mean model clearly demonstrates this idea: Suppose, for 1 ≤ i ≤ N and 1 ≤ t ≤ T,

y_it = x_it β + c_i + v_it    (3.1)

where x_it ∈ R^K, x_i ≡ (x_i1, ..., x_iT), E[v_it | x_i, c_i] = 0, and x_it is assumed to be time-varying and continuously distributed. Here, the unobserved time-invariant effect is denoted as c_i following Chamberlain (1984). By taking the conditional expectation of y_it given x_i, we have

E[y_it | x_i] = x_it β + g(x_i)    (3.2)

for some measurable function g : R^{TK} → R. Note that the unknown arbitrary function g does not depend on the time index t. Then, regression with a sieve-approximated g(x_i), for example, yields a consistent estimator of β. In this sense, the conditional moment restriction (3.2) can be viewed as a control function counterpart for the linear panel data model (see Section 19.8.2 of Li and Racine (2007) for details). Such a control function g will be called a "generalized Chamberlain device" or "correlated effect" in this chapter.
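The conditional-mean example (3.1)-(3.2) can be demonstrated in a few lines. The following Python sketch (illustrative only; the specific g and the sample sizes are invented for the demonstration) generates c_i correlated with x_i, adds sieve terms in x_i that span E[c_i | x_i] to the pooled regression, and recovers β by ordinary least squares:

```python
import random

rng = random.Random(0)
N, T, beta = 4000, 2, 1.5

# c_i depends on x_i = (x_i1, x_i2): endogeneity through E[c_i | x_i] = g(x_i)
rows, ys = [], []
for i in range(N):
    x = [rng.gauss(0, 1), rng.gauss(0, 1)]
    c = 0.8 * (x[0] + x[1]) + 0.3 * x[0] * x[1] + rng.gauss(0, 1)
    for t in range(T):
        ys.append(x[t] * beta + c + rng.gauss(0, 1))
        # regressors: x_it plus sieve terms in x_i approximating g(x_i)
        rows.append([x[t], 1.0, x[0], x[1], x[0] * x[1]])

# pooled OLS via Gauss-Jordan elimination on the normal equations
K = len(rows[0])
A = [[sum(r[a] * r[b] for r in rows) for b in range(K)] for a in range(K)]
b = [sum(r[a] * y for r, y in zip(rows, ys)) for a in range(K)]
for a in range(K):
    piv = A[a][a]
    A[a] = [v / piv for v in A[a]]
    b[a] /= piv
    for j in range(K):
        if j != a:
            f = A[j][a]
            A[j] = [vj - f * va for vj, va in zip(A[j], A[a])]
            b[j] -= f * b[a]

beta_hat = b[0]
assert abs(beta_hat - beta) < 0.1  # g(x_i) is absorbed by the sieve terms
```

Dropping the sieve terms in x_i from `rows` reintroduces the omitted-variable bias, which is exactly the endogeneity that the correlated effect g is meant to absorb.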
To date, this approach has not been considered seriously in the linear panel data literature, since the methods based on direct control or removal of the individual effect c_i eliminate the potential endogeneity equally well without much difficulty. The generalized Chamberlain device is taken analogously in the quantile regression setting. Suppose, for 1 ≤ i ≤ N and 1 ≤ t ≤ T, the structural equation is

y_it = x_it β + u_it    (3.3)

where, for simplicity, x_it is assumed to be time-varying and continuously distributed. We consider balanced panel data from now on unless explicitly mentioned. For each τ ∈ (0, 1), we can write

Q_τ(y_it | x_i) = x_it β(τ) + g_t(x_i, τ)    (3.4)

for some measurable function g_t : R^{TK} × (0, 1) → R. The function g_t(x_i, τ) represents the τ-quantile-specific endogenous variation contained in u_it, which is allowed to vary across time given x_i. Unfortunately, such g_t is not separately identifiable from x_it β(τ) in general. Now, assume that any endogenous variation contained in u_it is time-constant in the sense that g_t(x_i, τ) does not depend on t but is allowed to have a constant level difference across time. Then, for some constants k_t(τ), we have

Q_τ(y_it | x_i) = x_it β(τ) + g(x_i, τ) + k_t(τ)    (3.5)

where k_T(τ) = 0 is imposed for normalization. Note that (3.5) is a quantile analogue of (3.2) with the introduction of time effects, and that it formalizes the "time-constant endogeneity" assumption for u_it. It relies on neither additivity of the composite error, c_i + v_it, nor the widely used conditional quantile restriction Q_τ(v_it | x_i) = 0.

3.3.2 Model Restriction and Identification

In this subsection, the time-constant endogeneity assumption is used to derive the generalized Chamberlain device for a formal structural equation. Together with the derived control function, a set of model restrictions that attains point-identification is presented.
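Estimation of conditional quantile restrictions such as (3.4)-(3.5) rests on the check function ρ_τ(u) = u(τ − 1[u < 0]), whose expected value is minimized at the τ-quantile. The following self-contained Python sketch (an illustration, not part of the formal development) verifies this fact on a simulated sample by grid search:

```python
import random

def rho(u, tau):
    """Check loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))

# The minimizer of sum_i rho_tau(y_i - q) over q is the empirical tau-quantile.
rng = random.Random(1)
y = sorted(rng.gauss(0, 1) for _ in range(2001))
tau = 0.25
grid = [i / 100 - 2 for i in range(401)]  # candidate q values in [-2, 2]
obj = [(sum(rho(v - q, tau) for v in y), q) for q in grid]
q_hat = min(obj)[1]                        # grid minimizer of the check loss
q_emp = y[int(tau * len(y))]               # empirical tau-quantile
assert abs(q_hat - q_emp) < 0.05
```

In the panel setting, the same loss is applied to y_it − w_it β̄ − g(x_i, z_i) with g replaced by its sieve approximation, as developed in Section 3.4.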
For each i = 1, ..., N and t = 1, ..., T, we observe (y_it, x_it, z_i, v_t). The response variable is y_it ∈ R, and the covariates are time/individual-varying variables x_it ∈ R^{K1}, time-constant variables z_i ∈ R^{K2}, and individual-constant variables v_t ∈ R^{K3}. The covariates are allowed to contain both continuous and discrete variables, which will be notated by tilde and dot accents, respectively. Specifically, x̃_it ∈ R^{K1c} and z̃_i ∈ R^{K2c} are continuous while ẋ_it ∈ R^{K1d} and ż_i ∈ R^{K2d} are discrete, where K1 = K1c + K1d and K2 = K2c + K2d by construction. For the individual-varying variables, we assume random sampling conditional on the individual-constant variables.

Assumption 1 (Random Sample) {y_it, x_it, z_i}_{t=1}^T are i.i.d. across i conditional on {v_t}_{t=1}^T.

Since we consider linear quantile regression models, it is natural to assume the structural equation for y_it to be a linear function of the observed covariates. Throughout the paper, the structural equation is defined as follows: for i = 1, ..., N and t = 1, ..., T,

y_it = x_it β + z_i η + v_t λ + u_it    (3.6)

where u_it is an unobserved error. Equation (3.6) describes the data generating process of the response variable y_it, which is typically implied by economic theories and specific empirical contexts. We may also think of it as an equation in the researcher's mind. Following Hurwicz (1950) and Koopmans and Reiersøl (1950), it constitutes a 'structure' when paired with a joint distribution function of ({u_it, x_it}_{t=1}^T, z_i) conditional on {v_t}_{t=1}^T,

F_{{u_it, x_it}_{t=1}^T, z_i | {v_t}_{t=1}^T}(u_i1, ..., u_iT, x_i1, ..., x_iT, z_i | v_1, ..., v_T).    (3.7)

Depending on the model restrictions imposed on (3.6) and (3.7), the interpretation of the parameters (β, η, λ) changes.
For example, conditional quantile restrictions on u_it with different values of τ ∈ (0, 1) will change the value and interpretation of (β, η, λ) in general. Under the model restriction of a generalized Chamberlain device, neither η nor λ can be identified. However, the following argument shows that it is important to include the time-constant regressors z_i when the control function g is constructed. Taking the conditional τ-quantile of y_it, we have

Q_τ(y_it | x_i, z_i, {v_t}_{t=1}^T) = x_it β(τ) + z_i η(τ) + v_t λ(τ) + f_t(x_i, z_i, {v_t}_{t=1}^T, τ)    (3.8)

for some measurable function f_t : R^{T(K1+K3)+K2} × (0, 1) → R. Note that the effect of {v_t}_{t=1}^T on f_t and that of λ on f_t are confounded. Thus, without loss of generality, we can write f_t(x_i, z_i, {v_t}_{t=1}^T, τ) = h_t(x_i, z_i, τ) for some h_t. (The notation neglects randomness arising from {v_t}_{t=1}^T since {v_t}_{t=1}^T is always fixed in this chapter.) The time-constant endogeneity assumption then implies h_t(x_i, z_i, τ) = h(x_i, z_i, τ) + m_t(τ) for some function h and some constants m_t. The conditional quantile of y_it can be written as

Q_τ(y_it | x_i, z_i, {v_t}_{t=1}^T) = x_it β(τ) + g(x_i, z_i, τ) + k_t(τ)    (3.9)

where g(x_i, z_i, τ) = z_i η(τ) + h(x_i, z_i, τ) and k_t(τ) = v_t λ(τ) + m_t(τ). Assumption 2 below summarizes the model restriction of the generalized Chamberlain device. From now on, we will drop {v_t}_{t=1}^T from the conditioning and treat the k_t's as parameters to be estimated. The parameters' dependence on τ will also be omitted. The time dummies are denoted as d_t for t = 1, ..., T − 1, and they will be considered together with x_it as in w_it = [ x_it  d_1  ...  d_{T−1} ]. The kth element of w_it is written as w_itk. The corresponding coefficient parameters are defined as β̄ = (β', κ')' ∈ R^{K1+(T−1)}, where κ = (k_1, ..., k_{T−1})'.

Assumption 2 (Correlated Effect) There exists a measurable function g : R^{K1 T + K2} →
R such that, for all (i, t),

Q_τ(y_it | x_i, z_i) = w_it β̄ + g(x_i, z_i).    (3.10)

Assumption 2 is a new model restriction that takes a control function approach for the linear panel data quantile regression model. Note that the correlated effect g depends on the time-constant variables z_i that enter the structural equation. Although the causal effects of z_i on y_it are not identified, it is important to include z_i as arguments of g. Also, there may be some deterministic relationships among (x_i, z_i) which should be dealt with. For example, we may have x_itk = x_it'k for some t, t', k with probability one. Then, having both variables is redundant for g, and one should be removed from the specification. Similarly, if there is a functional relationship between the covariates, such as x_itk2 = (x_itk1)^2, only the one containing the finer information, x_itk1, should remain. Throughout the paper, such redundancy is assumed away for simplicity. The following theorem shows that a certain degree of richness in the support of {w_it}_t and a well-behaved error distribution are sufficient for point-identification of β̄ under Assumptions 1-2. Define ε_it ≡ y_it − w_it β̄ − g(x_i, z_i), and let f_it(ε) be the density of ε_it conditional on (x_i, z_i). The condition imposed on f_it below is parts (i) and (ii) of Assumption 3 in Section 3.4.2.

Theorem 3.3.1 (Identification of β̄) Let W̄_i = ((w_i2 − w_i1)', ..., (w_iT − w_i(T−1))')'. Assume Assumptions 1-2 and that f_it(ε) is continuous and uniformly bounded away from 0 and ∞ in a neighborhood of 0.
Suppose that the support of (w_i1, ..., w_iT) contains J points (w_i1^(j), ..., w_iT^(j)), 1 ≤ j ≤ J, such that the J(T − 1) × (K1 + T − 1) matrix

[ W̄_i^(1)'  ...  W̄_i^(J)' ]'    (3.11)

has full column rank, the pmf of (ẋ_it, ẋ_it') satisfies p(ẋ_it^(j), ẋ_it'^(j)) > 0 for all j, and the pdf of (x̃_it, x̃_it') satisfies f_{(x̃_it, x̃_it') | (ẋ_it, ẋ_it')}(x̃_it^(j), x̃_it'^(j) | ẋ_it^(j), ẋ_it'^(j)) > 0 for all j, where f_{(x̃_it, x̃_it') | (ẋ_it, ẋ_it')} has a continuous extension at each (w_it^(j), w_it'^(j)). Then β̄ is identified.

The result above is not surprising since the current specification does not have incidental parameters. It shows that a specification with an unknown function common to every individual has more identifying power than one with unknown parameters unique to each individual. For the rest of the paper, we assume point-identification.

3.3.3 Case of Unbalanced Panel Data with Time-constant Endogeneity

When some time periods are missing for some individuals in the observed panel data, potential endogeneity related to observability should be treated as well. We assume the structural equations for all observed units and time periods are homogeneous. Then, with the introduction of selection indicators and auxiliary balanced data, the generalized Chamberlain device can be modified to account for time-constant endogeneity related to observability. The approach can be regarded as a nonparametric version of the correlated random effects models studied by Wooldridge (2009). First, define the selection indicator s_it to be a binary function that takes the value 1 if (y_it, x_it, z_i) is observed at t, and 0 otherwise. In addition, consider auxiliary balanced panel data (s_it y_it, s_it x_it, s_it z_i, s_it) for i = 1, ..., N and t = 1, ..., T. The corresponding structural equation is assumed to be

s_it y_it = s_it x_it β + s_it z_i η + s_it v_t λ + s_it u_it    (3.12)

which is derived by multiplying the original structural equation by s_it.
Equation (3.12) is restrictive only for the observed time periods, and it assumes the structural equations are homogeneous across all observed units and time periods. The conditional τ-quantile of s_it y_it given S_T ≡ ({s_it x_it}_{t=1}^T, s_it z_i, {s_it v_t}_{t=1}^T, s_i) is

Q_τ(s_it y_it | S_T) = s_it x_it β + s_it z_i η + s_it v_t λ + s_it g_t({s_it x_it}_{t=1}^T, s_it z_i, {s_it v_t}_{t=1}^T, s_i)    (3.13)

for some g_t, where s_i = (s_i1, ..., s_iT). Then, based on the previous argument, the time-constant endogeneity assumption implies a conditional quantile restriction

Q_τ(s_it y_it | {s_it x_it}_{t=1}^T, s_it z_i, s_i) = s_it x_it β + s_it g({s_it x_it}_{t=1}^T, s_it z_i, s_i) + s_it k_t(s_i)    (3.14)

where {s_it v_t}_{t=1}^T is omitted and (g(·), k_t) depend on the selection indicator s_i. Since (3.14) is not restrictive when s_it = 0, we may write an equivalent model restriction for (i, t) with s_it = 1 as

Q_τ(y_it | {x_it}_{t: s_it=1}, z_i, s_i) = x_it β + g({x_it}_{t: s_it=1}, z_i, s_i) + k_t(s_i).    (3.15)

Since the generalized Chamberlain device g({x_it}_{t: s_it=1}, z_i, s_i) in (3.15) is a function of the selection indicator s_i, it now accounts for time-constant endogenous variation due to observability. Note that s_i acts as a classification device for the observed pattern of each individual. If there is no endogeneity related to observability, g and k_t will not depend on s_i when g is assumed to have the additive form (3.16) in the next section. Identification of β (not β̄) can be trivially achieved under the conditions in Theorem 3.3.1 for P({w_it}_{t: s_it=1} | s_i = s̃_i) with s̃_i such that P(s_i = s̃_i) > 0 and Σ_{t=1}^T s̃_it ≥ 2. In other words, if there exists a positive fraction of observable units with a certain number of multiple periods, and if the support of the regressors is rich enough, the parameter of interest is identified. The estimation procedure will be applied to the auxiliary balanced panel data.
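Constructing the auxiliary balanced panel is mechanical. The following Python sketch (the toy data layout is invented for illustration) builds (s_it y_it, s_it x_it, s_it) from an unbalanced panel and extracts the observability pattern s_i that classifies individuals in (3.14)-(3.15):

```python
# Build auxiliary balanced arrays (s_it*y_it, s_it*x_it, s_it) from an
# unbalanced panel stored as {(i, t): (y, x)}.
T = 3
unbalanced = {(0, 1): (1.0, 2.0), (0, 3): (0.5, 1.0),   # unit 0 misses t = 2
              (1, 1): (2.0, 0.0), (1, 2): (1.5, 1.0), (1, 3): (1.0, 2.0)}

units = sorted({i for (i, t) in unbalanced})
s, sy, sx = {}, {}, {}
for i in units:
    for t in range(1, T + 1):
        obs = unbalanced.get((i, t))
        s[i, t] = 1 if obs is not None else 0   # selection indicator s_it
        sy[i, t] = obs[0] if obs else 0.0       # s_it * y_it
        sx[i, t] = obs[1] if obs else 0.0       # s_it * x_it

# s_i is the observability pattern used by g(., s_i) and k_t(s_i)
pattern = {i: tuple(s[i, t] for t in range(1, T + 1)) for i in units}
assert pattern[0] == (1, 0, 1) and pattern[1] == (1, 1, 1)
```

Each distinct value of `pattern[i]` with positive probability then indexes its own correlated-effect components in the estimation step.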
When g has an additive form, the essential difference in estimation is that each fraction of individuals with a different observability pattern s_i is allowed to have different additive components of g for each x_itk and z_ik.

3.4 Estimation

For estimation, the correlated effect g is approximated by sieve spaces. The approximated g is high-dimensional for three reasons. First, the number of approximating terms is infinite in general. Second, the number of arguments in g is TK1 + K2, which can grow fast in T. Besides the problem that the truncation choice for the sieve approximation can be limited with a large number of arguments, the existence of discrete variables can introduce further nontrivial problems. In particular, if the discrete variables (ẋ_i, ż_i) have rich enough supports, the total number of approximating terms can be comparable to, or larger than, N even after we impose an additive functional form on g and truncate the approximating terms for the additive components of the continuous variables (x̃_i, z̃_i). Such a "too many regressors" problem arises from the fact that it is not obvious how to truncate the approximating terms related to discrete variables in general. Note that N is the maximal number of linearly independent time-invariant regressors. Third, when the panel data is unbalanced, more complex observability patterns result in a larger number of nonparametric nuisance components to be approximated, since we allow different functional forms of g for each group of individuals with a different pattern. The reasons for high-dimensionality mentioned above indicate that standard sieve truncation via information criteria or cross-validation is not always usable and effective. For high-dimensional models, penalized estimation of a sparse model with the Least Absolute Shrinkage and Selection Operator (LASSO; Tibshirani, 1996) and its variants is popular due to its prediction accuracy and computational feasibility.
For high-dimensional quantile regression models, Belloni and Chernozhukov (2011; BC), Wang, Wu and Li (2012; WWL) and Sherwood and Wang (2016; SW) studied the properties of penalized estimators with certain penalty functions. Since the nonconvex penalty functions used in WWL (2012) and SW (2016) have the oracle property under mild conditions, the asymptotic distribution of the resulting penalized estimators can be studied via that of the oracle estimator. This is a big benefit of using a nonconvex penalty function compared to the LASSO, which has the oracle property only under a quite restrictive condition. Another practical benefit is that a relaxed estimation procedure such as "post-LASSO estimation" is not necessary for nonconvex penalized estimators applied to a large sample. In this chapter, two nonconvex penalty functions are considered: the Smoothly Clipped Absolute Deviation (SCAD; Fan and Li, 2001) and the Minimax Concave Penalty (MCP; Zhang, 2010). For details on a general class of penalty functions, see Fan and Lv (2009) and Lv and Fan (2009), for example. Besides overcoming a nontrivial high-dimensionality problem, the penalized estimator is expected to improve over standard truncated sieve estimators under the sparsity assumption (Belloni and Chernozhukov, 2011a). Note that the penalized estimator selects only the relevant sparse terms, while relevant terms can be excluded from the first K elements selected by standard truncated sieve estimators. To make the sparsity assumption more plausible in some cases, a transformation of the approximated correlated effect into a generalized Mundlak form is proposed.

3.4.1 Sieve-approximated Correlated Effect

In this chapter, the specific sieve space in which g lies is not assumed. The theoretical framework is quite flexible and can accommodate various specifications. As long as the true function g can be sparsely approximated by a collection of terms that satisfies the regularity conditions, such a collection of terms can be used.
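For concreteness, the two penalty functions can be written down directly from Fan and Li (2001) and Zhang (2010). The Python sketch below implements them (with their commonly used default shape parameters a = 3.7 for SCAD and a = 3 for MCP; these defaults are illustrative, not choices made in this chapter) and checks the key property that distinguishes them from the LASSO: the penalty is flat for large coefficients, so big signals are not shrunk:

```python
def scad(t, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001); a = 3.7 is their suggested default."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

def mcp(t, lam, a=3.0):
    """Minimax concave penalty of Zhang (2010)."""
    t = abs(t)
    if t <= a * lam:
        return lam * t - t * t / (2 * a)
    return a * lam * lam / 2

lam = 0.5
# both penalties are constant beyond a*lam (no shrinkage of large coefficients),
# unlike the LASSO penalty lam*|t|, which grows without bound
assert scad(10.0, lam) == scad(100.0, lam)
assert mcp(10.0, lam) == mcp(100.0, lam)
# near zero, both behave like the LASSO, which is what induces sparsity
assert abs(scad(0.1, lam) - lam * 0.1) < 1e-12
```

Both functions are continuous at the regime boundaries |t| = λ and |t| = aλ, which can be verified by plugging those points into the adjacent branches.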
Here, we briefly cover one useful example of the sieve approximation of g, with an additive form, smoothness of the additive components of g for continuous covariates, and finiteness of support for discrete covariates. While a smoothness assumption is quite standard, the additive function space can be replaced by a tensor product sieve space in general. We may also consider using multiple sieve spaces together so that basis elements can be mixed (Bunea, Tsybakov and Wegkamp, 2007; Belloni, Chen, Chernozhukov and Hansen, 2012). The additivity requires g to be represented by the sum of univariate functions of each argument,

g(x_i, z_i) = g_0 + Σ_{t=1}^T Σ_{k=1}^{K1} g_tk^x(x_itk) + Σ_{k=1}^{K2} g_k^z(z_ik)    (3.16)

where g_0 ∈ R is a constant. For identification purposes, E[g_tk^x(x_itk)] = E[g_k^z(z_ik)] = 0 for all t, k is typically assumed, but we may instead drop a constant term (if there is any) in the sieve elements for each g_tk^x and g_k^z. Given additivity, a smoothness restriction is imposed on the additive components of g with continuous covariates, (g_tk^x(x̃_itk), g_k^z(z̃_ik)). The Hölder condition of a certain order is the most popular choice (see Chen, 2007). Finite supports are assumed for the components with discrete covariates, (g_tk^x(ẋ_itk), g_k^z(ż_ik)), which implies that the relevant approximating errors will be exactly zero for large enough N. For approximating g_tk^x(x̃_it) and g_k^z(z̃_i), B-spline elements are frequently used; see Schumaker (2007) for details. The following shows how such a basis can be implemented in practice. Given a knot sequence 0 = t_0 < t_1 < ... < t_{J_N} < t_{J_N+1} = 1, the degree-p spline elements are 1, x, ..., x^p, (x − t_1)_+^p, ..., (x − t_{J_N})_+^p, where the range of the continuously distributed x is assumed to be [0, 1] and x_+^p ≡ (max{0, x})^p. Then the approximated g_tk^x(x̃_itk), for example, can be written as

s_tk^x(x̃_itk) = Σ_{q=1}^p γ_tkq x̃_itk^q + Σ_{j=1}^{J_N} γ_tk(p+j) (x̃_itk − t_j)_+^p    (3.17)

where the constant term is removed for identification. In several contexts, nonparametric or semiparametric conditional quantile estimators using B-splines are shown to achieve the optimal convergence rate with J_N ∝ N^{1/(2r+1)} under regularity conditions, where r denotes the order of the Hölder condition. For example, He, Zhu and Fung (2002) showed the result for a univariate semiparametric component in a panel data model with an unspecified dependence structure. If the optimal growth rate is the same for the additive semiparametric model, Assumption 5 in the next subsection is conservatively satisfied when we impose p_N ∝ N^{1/(2r+1)} and r > 1. For the discrete ẋ_itk and ż_ik, we do not rely on a smoothness assumption in general. The corresponding g_tk^x and g_k^z functions can be exactly expressed as linear combinations of indicator functions that take the value 1 on the support elements. Suppose that a discrete random variable ẋ_itk has realized support elements {a_s}_{s=1}^{S_tk,N^x} in the given sample. Then, without loss of generality, the function g_tk^x(ẋ_itk) can be written (or approximated) as

s_tk^x(ẋ_itk) = Σ_{s=1}^{S_tk,N^x} γ_tks 1[ẋ_itk = a_s]    (3.18)

where the indicator functions 1[ẋ_itk = a_s] act as sieve basis elements of the function space in which g_tk^x(ẋ_itk) lies. To meet the identification condition E[g_tk^x(ẋ_itk)] = 0, we may equivalently drop one of the indicator terms. Since ẋ_itk is assumed to have finite support, for some S_tk^x < ∞ we have S_tk,N^x → S_tk^x as N tends to infinity, and the approximation error becomes zero.
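The two basis constructions just described, the truncated-power spline elements of (3.17) for a continuous covariate and the indicator elements of (3.18) for a discrete one, can be sketched in a few lines of Python (illustrative helper names; the constant term is dropped as in the text):

```python
def power_spline_basis(x, p, knots):
    """Degree-p truncated power elements for x in [0, 1]:
    x, ..., x**p, (x - t1)_+**p, ..., (x - tJ)_+**p (constant dropped)."""
    terms = [x ** q for q in range(1, p + 1)]
    terms += [max(0.0, x - t) ** p for t in knots]
    return terms

def indicator_basis(x, support):
    """Indicator elements 1[x == a_s] for a discrete covariate."""
    return [1.0 if x == a else 0.0 for a in support]

b = power_spline_basis(0.6, 3, [0.25, 0.5, 0.75])
assert len(b) == 6 and b[0] == 0.6 and b[5] == 0.0   # (0.6 - 0.75)_+^3 = 0
d = indicator_basis(2, [0, 1, 2, 3])
assert d == [0.0, 0.0, 1.0, 0.0]
```

Each additive component of g contributes one such list of terms; stacking them across (t, k) produces the (potentially very long) vector of approximating regressors discussed next.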
However, since there can be multiple discrete variables with S_tk,N^x (or S_k,N^z) whose total is quite large relative to N, model selection is inevitable in some cases. Note that the approximating terms for the discrete components do not have a natural way to regularize the dimension, as in the case of splines with increasing knots under a smoothness assumption. Unless some additional assumption is employed, all of the terms in (3.18) should be included as regressors in principle. As discussed at the beginning, sparsity of the approximated correlated effect is assumed for feasible inference. Sparsity in our context means that only a small number of approximating terms have true nonzero coefficients. In other words, the correlated effect is regular enough that we need only a small number of variables to describe it well. Recently, the sparsity assumption has been gaining more credibility as the 'bet on sparsity' principle is understood better (Hastie, Tibshirani and Wainwright, 2015). Since the validity of sparsity depends on the choice of basis, a specific basis (or a mixture of them) should be selected carefully. Given a set of approximating terms, a transformation into a generalized Mundlak form is proposed for the time-varying regressor parts, the g_tk^x's. The idea is to take the time averages and deviations given the common approximating terms of g_tk^x for t = 1, ..., T.

Definition 3.4.1 (Generalized Mundlak Form) Suppose the approximated g_tk^x(x_itk) of the correlated effect is s_tk^x(x_itk) = Σ_{s=1}^S γ_tks p_ks(x_itk) for t = 1, ..., T. Then, given t_0, define the transformed p̃_ks(x_itk) as

p̃_ks(x_itk) = (1/T) Σ_{t=1}^T p_ks(x_itk)                    if t = t_0,
p̃_ks(x_itk) = p_ks(x_itk) − (1/T) Σ_{t=1}^T p_ks(x_itk)      if t ≠ t_0.    (3.19)

If one of the approximating terms contains a first-order polynomial, that is, p_ks(x_itk) = x_itk for some s, then both the classical Chamberlain and the Mundlak device are nested in the transformed p̃_ks(x_itk) as special cases of sparse models.
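The transformation (3.19) is a one-line reindexing of each term's time series into its time average (placed at t_0) and deviations from that average. A minimal Python sketch (helper name and toy values are illustrative):

```python
def mundlak_transform(p_vals, t0=0):
    """Transform {p_ks(x_itk)}_{t=1}^T into its time average (at position t0)
    and deviations from the average, as in (3.19)."""
    T = len(p_vals)
    avg = sum(p_vals) / T
    return [avg if t == t0 else p_vals[t] - avg for t in range(T)]

p = [1.0, 2.0, 3.0, 6.0]            # p_ks(x_itk) for t = 1, ..., 4
q = mundlak_transform(p)
assert q[0] == 3.0                   # time average
assert q[1:] == [-1.0, 0.0, 3.0]     # deviations from the average
```

If the true coefficients on p_ks(x_itk) are equal across t, all deviation terms have zero coefficients and only the time-average term survives, which is exactly the sparser Mundlak-type submodel described below.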
Note that the basis elements in (3.18) can also be easily transformed into a form that contains a first-order polynomial. The choice of $t_0$ can be avoided if $p_{t_0 ks}(x_{i t_0 k}) - \frac{1}{T}\sum_{t=1}^{T} p_{tks}(x_{itk})$ is also included in the approximating terms, which will be defined as 'dictionary variables' in Subsection 3.4.2.

The rationale for the transformation (3.19) is the following. In empirical studies, it is often found that estimators based on the classical Chamberlain and Mundlak devices do not differ much, even though the Chamberlain device contains many more terms. If this is because the true coefficients of $x_{itk}$ in the Chamberlain device are the same for all $t$, then the true coefficients of the time-deviation terms, $x_{itk} - \frac{1}{T}\sum_{t=1}^{T} x_{itk}$, in the generalized Mundlak form (3.19) are zero. Similarly, if the true coefficients of $p_{ks}(x_{itk})$ in the approximated correlated effect are the same for all $t$, then the true coefficients of the time deviations, $p_{ks}(x_{itk}) - \frac{1}{T}\sum_{t=1}^{T} p_{ks}(x_{itk})$, are zero, and the generalized Mundlak form has far fewer approximating terms. This indicates that selection over the generalized Mundlak form can yield a sparser submodel of the correlated effects.

3.4.2 Penalized Estimation via Non-convex Penalty Functions

Given the approximating terms for $g$, a nonconvex penalized estimator is proposed along with its asymptotic properties. The convergence rate and asymptotic distribution of the penalized estimator are studied indirectly by deriving those of an estimator based on the true sparse model. In turn, inference is conducted as if the estimated submodel were the true sparse model. This is mainly justified by two facts: (i) the proposed nonconvex penalized estimator is shown to have the oracle property, and (ii) any submodel can be interpreted as an approximation to the true model by construction.
In the following, the asymptotic properties of the true sparse estimator are discussed first, and then the oracle property of the penalized estimator is presented. Two approaches are considered: (i) an exactly sparse model of $g$ and (ii) an approximately sparse model of $g$.

The approximating terms can in general be divided into two groups: a group to be penalized and a group not to be penalized. The two groups of variables are denoted as $\Pi(x_i, z_i) \in \mathbb{R}^{p_N}$ and $\bar{\Pi}(x_i, z_i) \in \mathbb{R}^{\bar{p}_N}$, respectively. Note that the number of unpenalized terms $\bar{p}_N$ is fixed and not allowed to depend on $N$, while the total number of dictionary variables $p_N$ can be very large relative to the sample size $N$ (i.e. $p_N \gg N$), since ultra-high dimensionality is allowed for the proposed estimator. Then $g$ can be written as
\[
g(x_i, z_i) = \bar{\Pi}(x_i, z_i)\bar{\gamma} + \Pi(x_i, z_i)\gamma + r_i \tag{3.20}
\]
where $(\bar{\gamma}, \gamma)$ has conformable dimension and $r_i$ is an approximation error. The terms to be penalized, $\Pi(x_i, z_i)$, will be called 'dictionary variables' from now on. There is no hard guideline about whether a given approximating variable should be penalized or not. However, a constant term, $g_0$ in our setting, is typically not recommended to be penalized and will not be treated as a dictionary variable in this paper. Given the choice of dictionary variables, $w_{it}$ is redefined as
\[
w_{it} = [\, x_{it} \;\; d_1 \;\; \cdots \;\; d_{T-1} \;\; \bar{\Pi}(x_i, z_i) \,] \tag{3.21}
\]
where $w_{it} \in \mathbb{R}^{K_4}$ and $\bar{\Pi}(x_i, z_i)$ is assumed to contain a constant by default. $\beta$ is redefined accordingly. Also, the dictionary variables $\Pi(x_i, z_i)$ should be rescaled to have unit (pooled) sample variance; otherwise, selection will be affected by the scales of the variables.

Under the sparsity assumption, only a small subset of dictionary variables have nonzero true coefficients. The cardinality of the sparse coefficients is allowed to increase as $N$ tends to infinity, and its growth rate is governed by a true sparse model given a sequence of dictionary variables.
With increasing cardinality, the sparse approximation tends to the true function by construction. The framework is similar in spirit to the "approximate sparsity model" proposed in Belloni and Chernozhukov (2011a), Belloni, Chen, Chernozhukov, and Hansen (2012), and Belloni, Chernozhukov, and Hansen (2014). A difference is that the sparse model is assumed to be exact, in the sense that the corresponding approximation error is not explicitly considered. Application of penalized estimation to high-dimensional nonparametric modeling is also discussed by Fan and Li (2001). Note that, in contrast to the theoretical framework of Sherwood and Wang (2016), the parameter of interest $\beta$ is not penalized in this chapter. In turn, there is no pathological case in which a parameter of interest is not selected in a penalized estimate.

The estimator based on the true sparse model is often called the "oracle estimator" in the high-dimensional statistics literature. The corresponding true sparse model will be referred to as the "oracle model" in this chapter. Let $A$ be the index set of sparse coefficients given $\Pi(x_i, z_i)$,
\[
A = A_N = \{\, 1 \le j \le p_N : \gamma_{oj} \neq 0 \,\} \tag{3.22}
\]
and let its cardinality be $q_N = |A|$. By rearranging $\Pi(x_i, z_i)$, we may assume the first $q_N$ elements of $\gamma_o$ are nonzero and the remaining $p_N - q_N$ components are zero, i.e. $\gamma_o = (\gamma_{oA}', 0_{p_N - q_N}')'$ and $\Pi(x_i, z_i) = (\Pi_A(x_i, z_i), \Pi_{A^c}(x_i, z_i))$. Then we can define the oracle estimator as follows.

Definition 3.4.2 (Oracle Estimator)
\[
(\hat{\beta}, \hat{\gamma}_A) = \arg\min_{(\beta, \gamma_A)} \frac{1}{N} \sum_{t=1}^{T} \sum_{i=1}^{N} \rho_\tau\big(y_{it} - w_{it}\beta - \Pi_A(x_i, z_i)\gamma_A\big) \tag{3.23}
\]
where $\rho_\tau(u) = u(\tau - 1[u < 0])$.

Regularity conditions for the oracle estimator and the penalized estimator are given below for the exactly sparse model.
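The check function $\rho_\tau$ and the pooled objective of (3.23) translate directly into code; a minimal sketch (names are illustrative, with $(\beta, \gamma_A)$ stacked into one vector and all $(i,t)$ observations stacked into one design matrix):

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - 1[u < 0])."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0.0))

def pooled_objective(theta, y, W, tau):
    """Pooled check-loss objective of (3.23): y and the stacked regressor
    matrix W contain all (i, t) observations; theta stacks (beta, gamma_A)."""
    resid = y - W @ theta
    return check_loss(resid, tau).mean()
```

Minimizing `pooled_objective` over `theta` given the oracle regressor set reproduces the oracle estimator; in practice the minimization is done by linear-programming-based quantile regression routines.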
In the following, let $F_{it}(\varepsilon) = F(\varepsilon \mid x_i, z_i)$ be the conditional cdf of $\varepsilon_{it}$ given $(x_i, z_i)$, and let $e_{it} \equiv y_{it} - w_{it}\beta - \Pi_A(x_i, z_i)\gamma_A$ be the approximated regression error. The vector of all regressors is written as $\tilde{w}_{it}^A = (w_{it}, \Pi_A(x_i, z_i))$, while its stacked versions are denoted $\tilde{W}_i^A = (\tilde{w}_{i1}^{A\prime}, \ldots, \tilde{w}_{iT}^{A\prime})'$ and $\tilde{W}_A = (\tilde{W}_1^{A\prime}, \ldots, \tilde{W}_N^{A\prime})'$.

Assumption 3 (Regression Error) (i) $\varepsilon_{it}$ has a continuous conditional density function $f_{it}$; (ii) $f_{it}$ is uniformly bounded away from $0$ and $\infty$ in a neighborhood of $0$, $\forall t$; (iii) $f'_{it}$ has a uniform upper bound in a neighborhood of $0$, $\forall t$.

Assumption 4 (Covariates) (i) $\exists M_1 > 0$ such that $|\tilde{w}_{itk}| \le M_1$ $\forall (i,t,k)$; (ii) $\exists C_1 > 0$, $C_2 > 0$ such that, with probability one, $C_1 \le \lambda_{\min}\big(\tfrac{1}{N}\tilde{W}_A'\tilde{W}_A\big) \le \lambda_{\max}\big(\tfrac{1}{N}\tilde{W}_A'\tilde{W}_A\big) \le C_2$.

Assumption 5 (Sparse Model Size) $q_N = O(N^{C_3})$ where $C_3 < \tfrac{1}{3}$.

Assumption 6 (Exact Sparsity) $g(x_i, z_i) = \Pi_A(x_i, z_i)\gamma_A$ for each $N$.

Assumption 3 is a fairly standard regularity condition on the regression error $\varepsilon_{it}$. Note that the within-group dependence of $\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{iT})$ conditional on $(x_i, z_i)$ is allowed to be arbitrary under Assumptions 1 and 3. Assumption 4 imposes boundedness on the regressors and on the eigenvalues of the Gram matrix $N^{-1}\tilde{W}_A'\tilde{W}_A$ in the oracle model. Assumption 5 restricts the growth rate of the oracle model size; it is required for a given dictionary variable sequence to be valid. Assumption 6 is the essential condition that characterizes an exactly sparse $g$: it means that for each sample size, the model with $q_N$ terms exactly describes the correlated effects. Under these conditions, the convergence rate and asymptotic normality results for the oracle estimator of $\theta_A = (\beta', \gamma_A')'$ are shown below.

Theorem 3.4.3 (Convergence Rate of Oracle Estimator) Suppose Assumptions 1–6. Then,
\[
\|\hat{\theta}_A - \theta_A^o\| = O_p\Big(\sqrt{N^{-1} q_N}\Big) \tag{3.24}
\]
Proof. The result follows from Lemmas C.1.1 and C.1.4 in Appendix C.1.2.

Theorem 3.4.4 (Asymptotic Normality of Oracle Estimator) Suppose Assumptions 1–6.
Let $G_N$ be an $l \times q_N$ matrix with $l$ fixed and $G_N G_N' \to G$, a positive definite matrix. Then,
\[
\sqrt{N}\, G_N \Sigma_N^{-1/2} (\hat{\theta}_A - \theta_A^o) \xrightarrow{d} N(0_l, G) \tag{3.25}
\]
where $\psi_\tau(\varepsilon_{it}) = \tau - 1(\varepsilon_{it} < 0)$, $\Psi_\tau(\varepsilon_i) = (\psi_\tau(\varepsilon_{i1}), \ldots, \psi_\tau(\varepsilon_{iT}))'$, $B_N = \mathrm{diag}(\{\{f_{it}(0)\}_t\}_i)$, $K_N = \frac{1}{N}\tilde{W}_A' B_N \tilde{W}_A$, $S_N = \frac{1}{N}\sum_{i=1}^{N} \tilde{W}_i^{A\prime} \Psi_\tau(\varepsilon_i) \Psi_\tau(\varepsilon_i)' \tilde{W}_i^A$, and $\Sigma_N = K_N^{-1} S_N K_N^{-1}$.

Proof. The result follows from Lemmas C.1.1 and C.1.4 in Appendix C.1.2.

In Theorems 3.4.3 and 3.4.4, we do not assume primitive conditions regarding the sieve basis nature of $\Pi_A(x_i, z_i)$. It is only assumed that, given a collection of dictionary variables $\Pi(x_i, z_i)$, true sparse regressors $\Pi_A(x_i, z_i)$ exist and satisfy the stated assumptions, especially the exact sparse model condition. With additional assumptions on $\Pi_A(x_i, z_i)$, it is possible to derive a sparse version of the standard partially linear semiparametric model results accounting for a nonzero approximation error $r_i$.

The additional assumptions on $\Pi_A(x_i, z_i)$ require some definitions and notation. First, let $\mathcal{G}$ be a function space to which $g$ belongs. For example, if $g$ is assumed to be additive, and if $(x_i, z_i)$ contains continuous variables only, then an additive function space, $\mathcal{H}_1 + \cdots + \mathcal{H}_{K_4}$, where $\mathcal{H}$ is the Hölder space of a certain degree, is a popular choice.
Define $\bar{\mathcal{G}}$ to be the subspace of $\mathcal{G}$ whose elements can be expressed by active elements of the dictionary variable sequence $\{\Pi_A(x_i, z_i)\}_{N=1}^{\infty}$. Then we can consider a weighted projection of each regressor in $w_{it}$ onto $\bar{\mathcal{G}}$:
\[
h_k \equiv \arg\inf_{h \in \bar{\mathcal{G}}} \sum_{i=1}^{N} \sum_{t=1}^{T} E\Big[ f_{it}(0) \big( w_{itk} - h(x_i, z_i) \big)^2 \Big] \tag{3.26}
\]
where $w_{itk}$ is the $k$th element of $w_{it}$. The corresponding population residual is written as $\eta_{itk} = w_{itk} - h_k$. Stacked versions of $h_k$ and $\eta_{itk}$ are denoted as follows: $h_i = (h_1(x_i, z_i), \ldots, h_{K_4}(x_i, z_i))$, $1_T = (1, \ldots, 1)' \in \mathbb{R}^T$, $H = (h_1', \ldots, h_N')' \otimes 1_T \in \mathcal{M}_{NT \times K_4}$, $\eta_{it} = (\eta_{it1}, \ldots, \eta_{itK_4})$, $\eta_i = (\eta_{i1}', \ldots, \eta_{iT}')'$, and $\eta = (\eta_1', \ldots, \eta_N')'$, so that $W = H + \eta$ where $W = (w_{11}', w_{12}', \ldots, w_{NT}')'$. Then it is easy to check that $\hat{h}_k(x_i, z_i) = \Pi_A(x_i, z_i)\hat{\varphi}_k$ where $W_k = (w_{11k}, \ldots, w_{NTk})'$, $\Pi_A = (\Pi_A(x_1, z_1)', \ldots, \Pi_A(x_N, z_N)')' \otimes 1_T$, and $\hat{\varphi}_k = (\Pi_A' B_N \Pi_A)^{-1} \Pi_A' B_N W_k$. Additional conditions on $\Pi_A(x_i, z_i)$ are given below.

Assumption 7 (Covariates) $\exists M_2 > 0$ such that $E[\eta_{itk}^4] \le M_2$ $\forall (i,t,k)$.

Assumption 8 (Approximate Sparse Correlated Effects) (i) $\sup_i |r_i| = O(N^{-1/2} q_N^{1/2})$; (ii) $N^{-1} \sum_{i=1}^{N} \big[\hat{h}_k(x_i, z_i) - h_k(x_i, z_i)\big]^2 = o_p(1)$ $\forall k$.

Assumption 7 restricts the population residual of $w_{itk}$, with $h_k$ projected out, to have a finite fourth-order moment. Assumption 8 is the essential condition that characterizes $\{\Pi(x_i, z_i)\}$ as sieve basis elements attaining a well-behaved sparse submodel. Part (i) assumes that the order of the approximation error is uniformly dominated by $1/\sqrt{N}$. Part (ii) is a high-level assumption stating that the sample analogue estimator of $h_k$ converges to the true function with respect to the empirical $L_2$-norm. Since the convergence rate is not restricted and $f_{it}(0)$ is uniformly bounded, this is a fairly standard property of sieve basis elements.
For example, when $q_N$ diverges, it can be shown that the uniform approximation property of $\Pi_A(x_i, z_i)$ (Newey, 1997),
\[
\sup_{x_i, z_i} \big| h_k(x_i, z_i) - \Pi_A(x_i, z_i)\varphi_{o,k} \big| = O(q_N^{-\alpha}) \tag{3.27}
\]
for some $\alpha > 0$, implies Assumption 8 (ii) under Assumptions 1–5.

With Assumptions 7 and 8 additionally imposed, the theorems below show a sparse version of the standard partially linear semiparametric model asymptotics. Denote $\hat{g}(x_i, z_i) = \Pi_A(x_i, z_i)\hat{\gamma}_A$.

Theorem 3.4.5 (Convergence Rate of Oracle Estimator) Suppose Assumptions 1–5, 7 and 8. Then,
\[
\|\hat{\beta} - \beta_o\| = O_p(N^{-\frac{1}{2}}) \tag{3.28}
\]
\[
N^{-1} \sum_{i=1}^{N} \big[ \hat{g}(x_i, z_i) - g_o(x_i, z_i) \big]^2 = O_p\big(N^{-1} q_N\big) \tag{3.29}
\]
Proof. See Appendix C.1.3.

Theorem 3.4.6 (Asymptotic Normality of Oracle Estimator) Suppose Assumptions 1–5, 7 and 8. Then,
\[
\sqrt{N}\, \Sigma_N^{-1/2} (\hat{\beta} - \beta_o) \xrightarrow{d} N(0, I) \tag{3.30}
\]
where $\Sigma_N = K_N^{-1} S_N K_N^{-1}$ with $K_N = N^{-1} \eta' B_N \eta$ and $S_N = N^{-1} \sum_{i=1}^{N} \eta_i' \Psi_\tau(\varepsilon_i) \Psi_\tau(\varepsilon_i)' \eta_i$.

Proof. The result follows from Lemmas C.1.9 and C.1.11 in Appendix C.1.3.

The convergence rate of $\hat{\beta}$ is the parametric rate, as in typical partially linear semiparametric models. On the other hand, $\hat{g}(x_i, z_i)$ has a convergence rate of $N^{-1} q_N$, which depends on the sparsity of the true submodel given the dictionary variable sequence. One obvious implication is that the performance of the oracle estimator will depend on the choice of the dictionary variable sequence. Theorems 3.4.4 and 3.4.6 each provide an asymptotic distribution result that can be used to approximate the distribution of the oracle estimator in a finite sample. Note that the sample analogue estimators of $N^{-1}\Sigma_N$ in the two theorems coincide for approximating $\hat{V}(\hat{\beta})$. The computation of the variance estimators is presented in Subsubsection 3.4.2.2.

Since the true nonzero coefficients are unknown, the sparse set is estimated by penalizing the coefficients of all dictionary variables in the sample optimization problem.
In multiple contexts, penalized estimators using nonconvex penalty functions such as SCAD (Fan and Li, 2001),
\[
p_\lambda(|\theta|) =
\begin{cases}
\lambda |\theta| & 0 \le |\theta| \le \lambda \\[4pt]
\dfrac{a\lambda|\theta| - (\theta^2 + \lambda^2)/2}{a - 1} & \lambda \le |\theta| \le a\lambda \\[4pt]
\dfrac{(a+1)\lambda^2}{2} & a\lambda \le |\theta|
\end{cases}
\qquad \text{for some } a > 2, \tag{3.31}
\]
and MCP (Zhang, 2010),
\[
p_\lambda(|\theta|) =
\begin{cases}
\lambda|\theta| - \dfrac{\theta^2}{2a} & 0 \le |\theta| \le a\lambda \\[4pt]
\dfrac{a\lambda^2}{2} & a\lambda \le |\theta|
\end{cases}
\qquad \text{for some } a > 1, \tag{3.32}
\]
are shown to yield the oracle estimator among the local minima of a penalized objective function with probability tending to one. This feature is called "a (weak) oracle property" of the nonconvex penalized estimator. To present the oracle property for the current model setting, the penalized estimator is defined as follows.

Definition 3.4.7 (Penalized Estimator)
\[
(\hat{\beta}_\lambda, \hat{\gamma}_\lambda) = \arg\min_{(\beta, \gamma)} \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_\tau\big(y_{it} - w_{it}\beta - \Pi(x_i, z_i)\gamma\big) + \sum_{j=1}^{p_N} p_\lambda(|\gamma_j|) \tag{3.33}
\]
where $p_\lambda(\cdot)$ is either a SCAD or MCP penalty function.

The oracle property is shown with one additional condition on the true sparse coefficients. Assumption 9 below is often called a 'beta-min' condition; it basically assumes that the minimum magnitude of the nonzero coefficients in the oracle model is sufficiently large. In our context, the lower bound on the coefficient magnitude can be understood as a truncation cut-off for the approximating surrogate function $\Pi(x_i, z_i)\gamma$ given a sequence of dictionary variables.

Assumption 9 (Nonzero Coefficients) There exist positive constants $C_4$ and $C_5$ such that $C_3 < C_4 \le 1$ and $N^{(1-C_4)/2} \min_{1 \le j \le q_N} |\gamma_{oj}| \ge C_5$.

Theorem 3.4.8 (Oracle Property of Penalized Estimator) Suppose Assumptions 1–5 and 9, together with Assumption 6 or with Assumptions 7 and 8. If $\lambda = o\big(N^{-(1-C_4)/2}\big)$, $N^{-1/2} q_N^{1/2} = o(\lambda)$, and $\log(p_N) = o(N\lambda^2)$, then
\[
\lim_{N \to \infty} P\big( (\hat{\beta}_\lambda, \hat{\gamma}_\lambda) \in \mathcal{E}_N(\lambda) \big) = 1 \tag{3.34}
\]
where $\mathcal{E}_N(\lambda)$ is the set of local minima of the objective function in (3.33).

Proof. See Appendix.

The rate conditions on $q_N$ and $\lambda$ in Theorem 3.4.8 are weaker than those given in Sherwood and Wang (2016). In turn, the necessary requirement on $C_4$ in the beta-min condition is also weaker.
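The three-branch SCAD penalty of (3.31) is easy to get wrong at the branch boundaries; the sketch below implements it elementwise (the default $a = 3.7$ is the value suggested by Fan and Li (2001); the function name is mine):

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of (3.31), applied elementwise to theta.
    Linear up to lam, quadratic taper up to a*lam, constant beyond,
    so large coefficients are not shrunk (near-unbiasedness)."""
    t = np.abs(np.asarray(theta, dtype=float))
    small = lam * t
    mid = (a * lam * t - (t**2 + lam**2) / 2.0) / (a - 1.0)
    large = (a + 1.0) * lam**2 / 2.0
    return np.where(t <= lam, small, np.where(t <= a * lam, mid, large))
```

A quick continuity check at the knots confirms the formula: at $|\theta| = \lambda$ both branches give $\lambda^2$, and at $|\theta| = a\lambda$ both give $(a+1)\lambda^2/2$, which is why the penalty is flat beyond $a\lambda$.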
3.4.2.1 Choice of Thresholding Parameter

Lee, Noh and Park (2014; LNP) recently proposed a modified Bayesian information criterion for linear quantile regression with cross-sectional data when the dimension of the dictionary variables diverges and the dimension of the true model is constant. Sherwood and Wang (2016) adapt LNP's criterion to the case where the dimension of the true model may diverge. Its pooled-information version for panel data in the current setting can be considered as follows:
\[
QBIC_L(\lambda) = 2TN \log\left( \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_\tau\big(e_{it}(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)\big) \right) + |S_\lambda|\, C_N \log(TN) \tag{3.35}
\]
where $e_{it}(\beta, \gamma) = y_{it} - w_{it}\beta - \Pi(x_i, z_i)\gamma$, $|S_\lambda|$ is the degrees of freedom of the fitted model, and $C_N$ is chosen as $\log(p_N)$ in Sherwood and Wang (2016). Note that the goodness-of-fit measure in (3.35) is derived from the quasi-likelihood of the asymmetric Laplace distribution with scaling parameter $\sigma$ (see LNP for details). When $\sigma = 1$ is imposed, the resulting measure coincides with the conventional check loss function without the logarithm and $TN$-scaling. Then an alternative form of high-dimensional BIC can be written as
\[
BIC_L(\lambda) = 2 \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_\tau\big(e_{it}(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)\big) + |S_\lambda|\, C_N \log(TN). \tag{3.36}
\]
To take into account the clustered information of panel data, it is useful to think of clustering as a kind of misspecification problem in quasi-likelihood. From this perspective, the generalized BIC (GBIC) and GBIC$_p$ studied by Lv and Liu (2014; LL) can be considered. They explicitly incorporate model misspecification using a second-order term in the asymptotic expansion of the Bayesian principle under generalized linear model settings. The final result is claimed to be general enough to apply in other contexts. Adding the second-order term of GBIC studied by LL to (3.35), we have
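The two criteria in (3.35) and (3.36) differ only in the goodness-of-fit term; a minimal sketch of both (function names and the argument layout are mine):

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - 1[u < 0])."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0.0))

def bic_l(resid, n_selected, p_N, T, N, tau):
    """High-dimensional BIC of (3.36): twice the pooled check loss plus a
    model-size penalty with C_N = log(p_N), following Sherwood and Wang
    (2016). resid holds the T*N fitted residuals e_it."""
    fit = 2.0 * check_loss(resid, tau).sum()
    return fit + n_selected * np.log(p_N) * np.log(T * N)

def qbic_l(resid, n_selected, p_N, T, N, tau):
    """Quasi-likelihood version of (3.35): log of the summed check loss,
    scaled by 2*T*N, plus the same penalty term."""
    fit = 2.0 * T * N * np.log(check_loss(resid, tau).sum())
    return fit + n_selected * np.log(p_N) * np.log(T * N)
```

In practice one evaluates the criterion for each candidate threshold parameter $\lambda$ on a grid and keeps the minimizer.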
\[
GQBIC_L(\lambda) = 2TN \log\left( \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_\tau\big(e_{it}(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)\big) \right) + |S_\lambda|\, C_N \log(TN) - \log \det H_{\lambda,N} \tag{3.37}
\]
where $H_{\lambda,N} = \hat{K}_{\lambda,N}^{-1} \hat{S}_{\lambda,N}$ is a covariance contrast matrix evaluated at $(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)$. Note that the second-order term can be negative. The same modification can be applied to $BIC_L$. For further details about GBIC and GBIC$_p$, see LL (2014). Note that if the correction term $\log \det H_{\lambda,N}$ is asymptotically bounded, then the first two terms of the information criterion will be dominant as $N$ tends to infinity. Thus, if GBIC is indeed a valid criterion for selection consistency, then regular BICs without the correction term must be valid as well in such cases.

3.4.2.2 Computation of Variance Estimators

The sample analogue estimators for the sandwich forms $K_N^{-1} S_N K_N^{-1}$ in Theorems 3.4.4 and 3.4.6 are computed using the set of selected variables, $\hat{A}(\lambda)$, given the penalized estimator $(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)$. This is mainly justified by the facts that (i) the penalized estimator has the oracle property, and (ii) any submodel constitutes an approximation of the true model. Here, the estimators are constructed following the cluster-robust variance estimator proposed by Wooldridge (2010). Let $M^+$ denote the Moore–Penrose generalized inverse of $M$. First, the residual $\hat{e}_{it}$ is computed by plugging the estimates $(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)$ into the formula:
\[
\hat{e}_{it} = y_{it} - w_{it}\hat{\beta}_\lambda - \Pi_{\hat{A}(\lambda)}(x_i, z_i)\hat{\gamma}_\lambda. \tag{3.38}
\]
The $\Pi_{\hat{A}}$-projected-out regressor $\eta_{it}$ can be estimated as, for a sequence $h_N$ tending to $0$,
\[
\hat{\eta}_{it} = w_{it} - \Pi_{\hat{A},i} \left( \sum_{i=1}^{N} \sum_{t=1}^{T} 1[|\hat{e}_{it}| \le h_N]\, \Pi_{\hat{A},i}' \Pi_{\hat{A},i} \right)^{+} \left( \sum_{i=1}^{N} \sum_{t=1}^{T} 1[|\hat{e}_{it}| \le h_N]\, \Pi_{\hat{A},i}' w_{it} \right) \tag{3.39}
\]
where the conditional density $f_{it}(0)$ is approximated via a uniform kernel.
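The cluster-robust sandwich computation just described can be sketched in a few lines of numpy (the function name and array layout are mine; a uniform kernel approximates $f_{it}(0)$ and a Moore–Penrose inverse replaces the regular inverse, as in the text):

```python
import numpy as np

def sandwich_variance(eta_hat, e_hat, tau, h_N):
    """Cluster-robust sandwich variance estimate, a sketch of the
    construction described above. eta_hat has shape (N, T, K): the
    projected-out regressors; e_hat has shape (N, T): fitted residuals.
    f_it(0) is approximated via a uniform kernel of bandwidth h_N, and
    np.linalg.pinv guards against linearly dependent selected terms."""
    N, T, K = eta_hat.shape
    w = (np.abs(e_hat) <= h_N).astype(float)          # uniform kernel weights
    # K_hat = (1/(2 N h_N)) sum_{i,t} 1[|e_it| <= h_N] eta_it' eta_it
    K_hat = np.einsum('it,itj,itk->jk', w, eta_hat, eta_hat) / (2.0 * N * h_N)
    # S_hat = (1/N) sum_i sum_{t,t'} psi(e_it) psi(e_it') eta_it' eta_it'
    psi = tau - (e_hat < 0.0)
    score = np.einsum('it,itk->ik', psi, eta_hat)     # cluster-level scores
    S_hat = score.T @ score / N
    K_inv = np.linalg.pinv(K_hat)
    return K_inv @ S_hat @ K_inv
```

The double sum over $(t, t')$ within each cluster is what makes the estimator robust to arbitrary within-group dependence; collapsing the scores to the cluster level before taking the outer product computes it implicitly.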
Then the sample analogue estimator of $\Sigma_N$ can be written as $\hat{\Sigma}_N = \hat{K}_N^{+} \hat{S}_N \hat{K}_N^{+}$ with
\[
\hat{K}_N = \frac{1}{2N h_N} \sum_{i=1}^{N} \sum_{t=1}^{T} 1[|\hat{e}_{it}| \le h_N]\, \hat{\eta}_{it}' \hat{\eta}_{it} \tag{3.40}
\]
\[
\hat{S}_N = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{t'=1}^{T} \psi_\tau(\hat{e}_{it}) \psi_\tau(\hat{e}_{it'})\, \hat{\eta}_{it}' \hat{\eta}_{it'}. \tag{3.41}
\]
The generalized inverse is used instead of the regular inverse since $\hat{A}$ may contain a set of linearly dependent variables given the sample size $N$ and threshold parameter $\lambda$. Note that the Moore–Penrose inverse coincides with the ordinary inverse whenever the matrix is invertible. The corresponding estimator for the full regressor vector can be computed similarly by replacing $\hat{\eta}_{it}$ with $\tilde{w}_{it}$ in (3.40) and (3.41). It can be shown that the estimated variances of $\hat{\beta}$ based on the two versions are numerically equivalent if $\hat{K}_N$ is invertible. For the choice of the sequence $h_N$, see Parente and Santos Silva (2010), for example.

3.5 Monte Carlo Simulation

A set of Monte Carlo simulations is conducted to study the selection performance and estimator performance in simple location shift and location-scale shift models. With a 3-period panel structure, 5 specifications are considered:
\begin{align*}
\text{DGP 1:}\;\; & y_{it} = x_{it1} + x_{it2} + x_{i11} + x_{i12} + x_{i13} + x_{i21} + x_{i22} + x_{i23} + u_{it} \\
\text{DGP 2:}\;\; & y_{it} = (8 + x_{it1} + x_{it2} + x_{i11} + x_{i12} + x_{i13} + x_{i21} + x_{i22} + x_{i23})(u_{it} + 1) \\
\text{DGP 3:}\;\; & y_{it} = 14 + x_{it1} + x_{it2} + \sum_{k \in K} (x_{i11}^k + x_{i12}^k + x_{i13}^k + x_{i21}^k + x_{i22}^k + x_{i23}^k) + u_{it} \\
\text{DGP 4:}\;\; & y_{it} = \Big\{14 + x_{it1} + x_{it2} + \sum_{k \in K} (x_{i11}^k + x_{i12}^k + x_{i13}^k + x_{i21}^k + x_{i22}^k + x_{i23}^k)\Big\}(u_{it} + 1) \\
\text{DGP 5:}\;\; & y_{it} = \Big\{14 + x_{it1} + x_{it2} + \sum_{k \in K} (x_{i11}^k + 2x_{i12}^k + x_{i21}^k + 2x_{i22}^k)\Big\}(u_{it} + 1)
\end{align*}
where $T = 3$; $N = 300$ or $1000$; $x_{\cdot t1} \sim U(-1,1)$; $x_{\cdot t2} \sim U(-1,1)$; $u_{it} \sim U(0,1)$; and $K = \{1, 2, 7\}$. DGPs 1 and 3 are location shift models, and DGPs 2, 4 and 5 are location-scale shift models. Note that the location-scale shift models have heteroskedastic regression error terms.
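For concreteness, the first two designs can be simulated directly; the sketch below (function name and layout are illustrative, and only DGPs 1 and 2 are covered) generates the panel with the correlated effect entering through all leads and lags:

```python
import numpy as np

def simulate_dgp(dgp, N=300, T=3, rng=None):
    """Simulate the location shift DGP 1 or the location-scale shift
    DGP 2. x[:, t, j] is regressor j in period t; the correlated effect
    is the sum of all x_i11, ..., x_i23, so every lead and lag of the
    regressors enters each period's outcome."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(-1.0, 1.0, size=(N, T, 2))
    u = rng.uniform(0.0, 1.0, size=(N, T))
    ce = x.sum(axis=(1, 2))                           # x_i11 + ... + x_i23
    base = x[:, :, 0] + x[:, :, 1] + ce[:, None]
    if dgp == 1:
        y = base + u                                   # location shift
    elif dgp == 2:
        y = (8.0 + base) * (u + 1.0)                   # location-scale shift
    else:
        raise ValueError("only DGPs 1 and 2 are sketched here")
    return y, x, u
```

Because the correlated effect depends on regressors from every period, pooled quantile regression that ignores it is biased, which is what the CE terms are meant to absorb in the simulations.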
DGPs 1 and 2 impose a Chamberlain specification, while DGPs 3, 4 and 5 introduce nonlinearity or different coefficients on the correlated effect terms across time periods. Note that the rescaled true parameters for higher-order polynomial terms are smaller, since all dictionary variables are rescaled to have unit sample variance in estimation. For example, in DGP 3 the normalized 1st- and 7th-order polynomial terms $x_{i11}$ and $x_{i11}^7$ have true parameter values of about .58 and .26, respectively. In turn, higher-order terms are harder to select as relevant variables at the given sample sizes. The number of simulated draws is 1000.

Tables 3.1–3.2 and Tables A.1–A.11 in Appendix C.2 contain the results. Along with QBIC$_L$ and BIC$_L$, their non-high-dimensional versions, QBIC and BIC, are also considered. AIC1 and AIC2 are AIC counterparts of QBIC and BIC that use different goodness-of-fit measures. "$p_N$" denotes the number of dictionary variables, and "$q_o$" denotes the number of terms in the true models. "TV" and "FV" are the average numbers of true and false coefficients selected, respectively. "True" denotes the true-model hit rate.
Table 3.1 Selection Performance, DGP 1 and 2

DGP 1 (columns per quantile: TV, FV, True)

Method  pN   qo  N     IC       τ=0.1               τ=0.5               τ=0.9
gMund   136  2   300   QBIC     2.00  0.13  0.89    2.00  0.17  0.88    2.00  0.17  0.87
gMund   136  2   300   QBICL    2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   300   BIC      2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   300   BICL     2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   300   AIC1     2.00  2.18  0.35    2.00  3.78  0.29    2.00  4.05  0.23
gMund   136  2   300   AIC2     2.00  0.00  1.00    2.00  0.01  0.99    2.00  0.00  1.00
gMund   136  2   1000  QBIC     2.00  0.08  0.94    2.00  0.10  0.93    2.00  0.11  0.92
gMund   136  2   1000  QBICL    2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   1000  BIC      2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   1000  BICL     2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   1000  AIC1     2.00  2.57  0.39    2.00  7.22  0.23    2.00  6.34  0.24
gMund   136  2   1000  AIC2     2.00  0.00  1.00    2.00  0.01  0.99    2.00  0.00  1.00
gCham   102  6   300   QBIC     6.00  0.18  0.87    6.00  0.19  0.85    6.00  0.22  0.85
gCham   102  6   300   QBICL    6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   300   BIC      6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   300   BICL     6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   300   AIC1     6.00  5.71  0.24    6.00  3.77  0.27    6.00  4.36  0.27
gCham   102  6   300   AIC2     6.00  0.00  1.00    6.00  0.01  0.99    6.00  0.00  1.00
gCham   102  6   1000  QBIC     6.00  0.12  0.92    6.00  0.13  0.91    6.00  0.16  0.90
gCham   102  6   1000  QBICL    6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   1000  BIC      6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   1000  BICL     6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   1000  AIC1     6.00  7.49  0.23    6.00  7.42  0.20    6.00  7.87  0.21
gCham   102  6   1000  AIC2     6.00  0.00  1.00    6.00  0.01  0.99    6.00  0.00  1.00

DGP 2 (columns per quantile: TV, FV, True)

Method  pN   qo  N     IC       τ=0.1               τ=0.5               τ=0.9
gMund   136  2   300   QBIC     2.00  0.85  0.55    2.00  1.16  0.40    2.00  1.16  0.44
gMund   136  2   300   QBICL    2.00  0.00  1.00    1.98  0.02  0.98    2.00  0.01  1.00
gMund   136  2   300   BIC      2.00  0.02  0.98    2.00  1.15  0.41    2.00  0.02  0.98
gMund   136  2   300   BICL     2.00  0.00  1.00    1.98  0.02  0.98    2.00  0.01  1.00
gMund   136  2   300   AIC1     2.00  6.63  0.03    2.00  6.75  0.00    2.00  7.91  0.00
gMund   136  2   300   AIC2     2.00  1.42  0.39    2.00  6.65  0.00    2.00  1.74  0.30
gMund   136  2   1000  QBIC     2.00  0.72  0.59    2.00  0.86  0.51    2.00  0.80  0.54
gMund   136  2   1000  QBICL    2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   1000  BIC      2.00  0.01  0.99    2.00  0.85  0.51    2.00  0.01  0.99
gMund   136  2   1000  BICL     2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   1000  AIC1     2.00  7.59  0.00    2.00  8.03  0.01    2.00  8.90  0.00
gMund   136  2   1000  AIC2     2.00  1.68  0.31    2.00  7.98  0.01    2.00  1.89  0.26
gCham   102  6   300   QBIC     5.85  0.80  0.55    5.61  1.01  0.37    6.00  0.82  0.54
gCham   102  6   300   QBICL    5.74  0.27  0.77    5.46  0.49  0.57    5.94  0.14  0.87
gCham   102  6   300   BIC      5.75  0.28  0.77    5.61  0.98  0.38    5.98  0.15  0.87
gCham   102  6   300   BICL     0.86  0.11  0.06    5.46  0.49  0.57    5.86  0.14  0.87
gCham   102  6   300   AIC1     5.96  7.33  0.02    5.95  6.24  0.01    6.00  6.40  0.01
gCham   102  6   300   AIC2     5.88  1.24  0.44    5.95  6.07  0.01    6.00  1.31  0.40
gCham   102  6   1000  QBIC     6.00  0.46  0.71    6.00  0.48  0.68    6.00  0.51  0.70
gCham   102  6   1000  QBICL    5.99  0.01  0.98    5.96  0.04  0.96    6.00  0.01  0.99
gCham   102  6   1000  BIC      6.00  0.02  0.98    6.00  0.48  0.69    6.00  0.02  0.98
gCham   102  6   1000  BICL     5.99  0.01  0.98    5.96  0.04  0.96    6.00  0.01  0.99
gCham   102  6   1000  AIC1     6.00  8.95  0.01    6.00  7.59  0.01    6.00  8.35  0.01
gCham   102  6   1000  AIC2     6.00  1.31  0.44    6.00  7.55  0.01    6.00  1.36  0.42

Table 3.2 Estimator performance, DGP 1 and 2, β₁

DGP 1 (columns per quantile: Bias, SD, RMSE)

Method  N     IC       τ=0.1                        τ=0.5                        τ=0.9
gMund   300   QBIC     0.0000   0.0197  0.0197      0.0009  0.0345  0.0345     -0.0000  0.0215  0.0215
gMund   300   QBICL   -0.0001   0.0197  0.0197      0.0008  0.0345  0.0345     -0.0000  0.0215  0.0215
gMund   300   BIC     -0.0001   0.0197  0.0197      0.0008  0.0345  0.0345     -0.0000  0.0215  0.0215
gMund   300   BICL    -0.0001   0.0197  0.0197      0.0008  0.0345  0.0345     -0.0000  0.0215  0.0215
gMund   300   AIC1    -0.0001   0.0201  0.0200      0.0021  0.0346  0.0346     -0.0002  0.0215  0.0215
gMund   300   AIC2    -0.0001   0.0197  0.0197      0.0008  0.0345  0.0345     -0.0000  0.0215  0.0215
gMund   1000  QBIC    -0.0002   0.0119  0.0119      0.0013  0.0191  0.0192     -0.0001  0.0114  0.0114
gMund   1000  QBICL   -0.0001   0.0119  0.0119      0.0013  0.0191  0.0191     -0.0001  0.0114  0.0114
gMund   1000  BIC     -0.0001   0.0119  0.0119      0.0013  0.0191  0.0191     -0.0001  0.0114  0.0114
gMund   1000  BICL    -0.0001   0.0119  0.0119      0.0013  0.0191  0.0191     -0.0001  0.0114  0.0114
gMund   1000  AIC1    -0.0001   0.0118  0.0118      0.0015  0.0194  0.0195     -0.0001  0.0114  0.0114
gMund   1000  AIC2    -0.0001   0.0119  0.0119      0.0013  0.0191  0.0191     -0.0001  0.0114  0.0114
gCham   300   QBIC     0.0007   0.0211  0.0211      0.0042  0.0358  0.0360     -0.0005  0.0212  0.0212
gCham   300   QBICL    0.0005   0.0212  0.0212      0.0040  0.0358  0.0360     -0.0006  0.0208  0.0208
gCham   300   BIC      0.0005   0.0212  0.0212      0.0040  0.0358  0.0360     -0.0006  0.0208  0.0208
gCham   300   BICL     0.0005   0.0212  0.0212      0.0040  0.0358  0.0360     -0.0006  0.0208  0.0208
gCham   300   AIC1     0.0014   0.0208  0.0209      0.0058  0.0355  0.0359     -0.0002  0.0212  0.0212
gCham   300   AIC2     0.0005   0.0212  0.0212      0.0039  0.0359  0.0361     -0.0006  0.0208  0.0208
gCham   1000  QBIC     0.0004   0.0116  0.0116     -0.0004  0.0186  0.0186      0.0002  0.0117  0.0117
gCham   1000  QBICL    0.0003   0.0115  0.0115     -0.0005  0.0186  0.0186      0.0002  0.0117  0.0117
gCham   1000  BIC      0.0003   0.0115  0.0115     -0.0005  0.0186  0.0186      0.0002  0.0117  0.0117
gCham   1000  BICL     0.0003   0.0115  0.0115     -0.0005  0.0186  0.0186      0.0002  0.0117  0.0117
gCham   1000  AIC1     0.0004   0.0115  0.0115     -0.0001  0.0186  0.0186      0.0005  0.0117  0.0117
gCham   1000  AIC2     0.0003   0.0115  0.0115     -0.0005  0.0186  0.0186      0.0002  0.0117  0.0117

DGP 2 (columns per quantile: Bias, SD, RMSE)

Method  N     IC       τ=0.1                        τ=0.5                        τ=0.9
gMund   300   QBIC     0.0073   0.1551  0.1552      0.0048  0.2746  0.2745      0.0028  0.1648  0.1648
gMund   300   QBICL    0.0066   0.1565  0.1565      0.0064  0.2778  0.2777      0.0013  0.1647  0.1646
gMund   300   BIC      0.0066   0.1562  0.1562      0.0049  0.2747  0.2746      0.0017  0.1651  0.1650
gMund   300   BICL     0.0066   0.1565  0.1565      0.0064  0.2778  0.2777      0.0013  0.1647  0.1646
gMund   300   AIC1     0.0160   0.1559  0.1567     -0.0009  0.2657  0.2656     -0.0033  0.1612  0.1612
gMund   300   AIC2     0.0073   0.1551  0.1552     -0.0007  0.2658  0.2657      0.0028  0.1650  0.1649
gMund   1000  QBIC     0.0066   0.0877  0.0879     -0.0051  0.1479  0.1479      0.0063  0.0854  0.0856
gMund   1000  QBICL    0.0056   0.0877  0.0878     -0.0053  0.1467  0.1467      0.0048  0.0858  0.0859
gMund   1000  BIC      0.0056   0.0878  0.0879     -0.0053  0.1478  0.1478      0.0048  0.0858  0.0859
gMund   1000  BICL     0.0056   0.0877  0.0878     -0.0053  0.1467  0.1467      0.0048  0.0858  0.0859
gMund   1000  AIC1     0.0103   0.0893  0.0898     -0.0026  0.1479  0.1478      0.0036  0.0865  0.0865
gMund   1000  AIC2     0.0071   0.0881  0.0883     -0.0028  0.1480  0.1479      0.0054  0.0851  0.0852
gCham   300   QBIC     0.0105   0.1679  0.1682      0.0367  0.2562  0.2587     -0.0015  0.1625  0.1624
gCham   300   QBICL    0.0148   0.1668  0.1674      0.0503  0.2738  0.2783      0.0027  0.1627  0.1627
gCham   300   BIC      0.0141   0.1669  0.1674      0.0367  0.2565  0.2589      0.0003  0.1626  0.1625
gCham   300   BICL     1.0494   0.4584  1.1450      0.0508  0.2745  0.2790      0.0077  0.1622  0.1623
gCham   300   AIC1     0.0135   0.1640  0.1645      0.0289  0.2590  0.2605     -0.0070  0.1570  0.1571
gCham   300   AIC2     0.0101   0.1660  0.1663      0.0284  0.2591  0.2605     -0.0022  0.1611  0.1610
gCham   1000  QBIC     0.0033   0.0882  0.0883     -0.0004  0.1477  0.1476     -0.0006  0.0880  0.0880
gCham   1000  QBICL    0.0040   0.0879  0.0879      0.0017  0.1490  0.1489     -0.0002  0.0891  0.0891
gCham   1000  BIC      0.0038   0.0878  0.0878     -0.0004  0.1482  0.1481     -0.0002  0.0891  0.0891
gCham   1000  BICL     0.0042   0.0879  0.0880      0.0017  0.1490  0.1489     -0.0002  0.0891  0.0891
gCham   1000  AIC1     0.0047   0.0870  0.0871     -0.0008  0.1487  0.1486     -0.0035  0.0880  0.0880
gCham   1000  AIC2     0.0039   0.0875  0.0875     -0.0006  0.1488  0.1487     -0.0007  0.0878  0.0878

There are some useful findings to be mentioned. First, the root mean squared error (RMSE) of the $\beta$ estimates decreases and TV increases as the sample size increases, for all DGPs, quantiles and information criteria considered. Second, FV decreases as the sample size increases in DGPs 1, 2 and 3 with the non-high-dimensional BIC-type criteria. With AIC, or in the other models, FV may increase. In DGPs 4 and 5, FV increases for the BIC-type criteria, but not as much as for the AICs. Third, TV seems to be the key element determining estimator performance; neither FV nor the true-model hit rate seems to matter much.
This can easily be seen by comparing the AICs with the BICs: AIC typically involves a much higher FV and a lower true-model hit rate, yet often shows the smallest bias, SD and RMSE. Fourth, the estimators using the generalized Mundlak form and the generalized Chamberlain form can each outperform the other. When the coefficients on the correlated effect terms are constant across time periods, as in DGPs 1, 2, 3 and 4, the Mundlak form yields a smaller number of nonzero terms to be selected, and the corresponding estimator often has better performance. But when the coefficients differ across time, as in DGP 5, the generalized Chamberlain form can have a sparser selected submodel than the generalized Mundlak form, and the corresponding estimator often outperforms it.

3.6 Application: The Effect of Smoking on Birth Outcomes

In this section, the proposed estimator is applied to an empirical example of birth weight analysis. The data in use is the matched panel¹ #3 of Abrevaya (2006), in which a mean regression analysis was conducted accounting for unobserved heterogeneity across mothers. First, the median regression with a correlated effect shows convincing evidence that the correlated effect estimator works as intended. Second, the corresponding quantile regression results show that, for lower quantiles, the impact of smoking on birth weight is smaller in absolute magnitude but can be larger relative to the fitted quantile birth weights. Third, some computational issues are reported regarding optimization of the nonconvex objective function.

¹ The data do not have a panel structure in the strict sense, since each mother is observed at a different time point when she gave birth. Still, the data set is clustered in general, and the proposed method is valid as long as the set of assumptions holds.

The matched panel data #3 of Abrevaya (2006) contain information on 129,569 two-birth mothers and 12,360 three-birth mothers.
Note that in these data, quantile regression with individual fixed effects would cost an additional 141,929 dummies. One of the main benefits of the correlated effect estimator is to reduce the number of additional terms while individual heterogeneity is still treated properly. The results below show that fewer than 400 additional terms are spent to obtain reasonable estimates; the number of additional terms is less than 0.3% of that of the fixed effect estimator. The structural equation is taken from Abrevaya (2006) as
\begin{align}
BW_{ib} ={}& \beta_1\, smoke_{ib} + \beta_2\, male_{ib} + \beta_3\, age_{ib} + \beta_4\, age_{ib}^2 + \beta_5\, adeqcode2_{ib} + \beta_6\, adeqcode3_{ib} \nonumber \\
& + \beta_7\, novisit_{ib} + \beta_8\, pretri2_{ib} + \beta_9\, pretri3_{ib} + \gamma_1\, \_Inlbnl\_1_{ib} + \cdots + \gamma_{15}\, \_Inlbnl\_15_{ib} \nonumber \\
& + \delta_1\, \_Iyear\_1_{ib} + \cdots + \delta_8\, \_Iyear\_8_{ib} + const + \varepsilon_{ib} \tag{3.42}
\end{align}
where $i$ is an individual mother index, $b$ is an observed birth index, "adeqcode#" is the Kessner index of #, "novisit" is an indicator of no prenatal visit during pregnancy, "pretri#" is an indicator of the first prenatal visit occurring in the #th trimester, and the remaining terms are live birth order and year effects. The observed birth index $b$ corresponds to the time index $t$ in our setting. All right-hand-side variables in (3.42) constitute $x_{ib}$, following the previous notation. Besides $x_{ib}$, there are several within-group-constant variables $z_i$ that are used to construct the correlated effects: binary dummies for high school graduate, some college experience, college graduate, marital status, being black, and state of residence. Since all variables except "age" are treated as discrete, the sparsity assumption is essentially imposed on the "age" component of the correlated effect. For the sieve approximation, polynomials of order up to 10 are used by default. To account for potential endogeneity due to observability, the selection indicator is used to construct the correlated effects. Basically, there are two observed patterns: 2-birth mothers and 3-birth mothers.
Then the vector of selection indicators can be written as s_i = (1, 1, 0)′ or s_i = (1, 1, 1)′. For notational convenience, denote s^[2] = (1, 1, 0)′ and s^[3] = (1, 1, 1)′. Then the conditional quantiles with the generalized Chamberlain device can be succinctly written as

Q_τ( s_ib y_ib | {x_it}²_{t=1}, z_i, s_i = s^[2] ) = s_ib ( x_ib β + g₂(x_i1, x_i2, z_i) + k_2b ),   (3.43)
Q_τ( s_ib y_ib | {x_it}³_{t=1}, z_i, s_i = s^[3] ) = s_ib ( x_ib β + g₃({x_it}³_{t=1}, z_i) + k_3b )   (3.44)

for some g₂, g₃, k_2b's and k_3b's. Assuming additivity of the g's, the following transformation is considered:

g₂(x_i1, x_i2, z_i) + k_2b = g_2,0 + Σ_{b′=1}^{3} Σ_{k=1}^{K₁} gˣ_2b′k(x_ib′k) + Σ_{k=1}^{K₂} gᶻ_2k(z_ik) + k_2b,   (3.45)

g₃({x_it}³_{t=1}, z_i) + k_3b = g_2,0 + Σ_{b′=1}^{3} Σ_{k=1}^{K₁} gˣ_2b′k(x_ib′k) + Σ_{k=1}^{K₂} gᶻ_2k(z_ik) + k_2b   (3.46)
                             + h_3,0 + Σ_{b′=1}^{3} Σ_{k=1}^{K₁} hˣ_b′k(x_ib′k) + Σ_{k=1}^{K₂} hᶻ_2k(z_ik) + l_b,   (3.47)

where gˣ_2b′k(x_ib′k) = 0 for b′ = 3, h_3,0 = g_3,0 − g_2,0, hˣ_b′k(x_ib′k) = gˣ_3b′k(x_ib′k) − gˣ_2b′k(x_ib′k), hᶻ_2k(z_ik) = gᶻ_3k(z_ik) − gᶻ_2k(z_ik), and l_b = k_3b − k_2b. In estimation, the interactions of the 3-birth-mom dummy with the approximating terms for gˣ_2b′k (b′ = 1, 2) and gᶻ_2k are included. The constant terms g_2,0 and h_3,0 and the time effects k_2b and l_b are also included but not penalized; l₂ is excluded to avoid multicollinearity. If there is no systematic difference in the g components between the two-birth mothers and the three-birth mothers, the corresponding h components will be zero and fewer terms will be selected in the final estimates.

In Table 3.3, the OLS estimator, the FE estimator, the pooled median regression estimator and the CE estimators are compared. BIC and BICL were used to choose the threshold parameters of the CE1/CE2 and CE3 estimates, respectively, where the candidate threshold parameters were 50 equi-spaced points between 0 and 0.01. The CE2 estimator uses polynomials up to order 40. The "rqPen" and "pracma" packages for R were used for computing the penalized estimates and Moore–Penrose inverse matrices, respectively.

Table 3.3 Birthweight, mean and median regression, all moms (unit: grams; standard errors in parentheses)

                                        Mean                          Median
                                  OLS            FE            Pooled         CE1            CE2            CE3
Smoke                          -243.27 (3.20) -144.04 (4.75) -238.49 (3.79) -138.26 (6.31) -138.55 (6.39) -140.34 (6.45)
Male                            126.70 (1.88)  133.58 (2.08)  131.27 (2.10)  138.87 (2.51)  139.38 (2.54)  139.45 (2.52)
Age                               7.06 (1.77)  -15.98 (3.96)    2.59 (2.10)  -13.37 (5.38)  -8.32 (27.04)   -8.74 (4.57)
Age2                             -0.12 (0.03)    0.32 (0.05)   -0.04 (0.04)    0.35 (0.07)    0.26 (0.29)    2.75 (0.07)
High-school graduate             60.52 (4.12)       -          64.19 (4.98)       -              -              -
Some college                     91.34 (4.52)       -          96.65 (5.55)       -              -              -
College graduate                100.89 (4.73)       -         102.84 (5.79)       -              -              -
Married                          64.43 (3.65)       -          55.23 (4.32)       -              -              -
Black                          -252.04 (4.36)       -        -239.28 (5.07)       -              -              -
Kessner index = 2              -100.93 (4.19)  -84.43 (4.45)  -81.71 (4.56)  -79.17 (5.66)  -74.12 (6.14)  -69.96 (5.55)
Kessner index = 3             -176.48 (10.20)-143.91 (10.28)-149.85 (12.48)-163.42 (15.67)-154.35 (15.52)-150.94 (15.50)
No prenatal visit              -26.49 (18.00) -42.35 (16.57)   7.87 (21.29) -32.02 (24.99) -47.25 (27.03) -52.70 (26.88)
First prenatal visit, 2nd tri.   89.12 (4.96)   66.56 (5.27)   72.21 (5.40)   67.38 (6.73)   62.80 (9.27)   58.66 (6.67)
First prenatal visit, 3rd tri. 154.66 (12.03) 111.90 (12.49) 119.48 (14.34) 109.92 (18.82) 111.45 (22.84) 118.90 (18.41)
Information criterion                -              -              -            BIC            BIC           BICL
# of dictionary var.                 -              -              -            301            451            301
# of selected var.                   -              -              -            298            300            169

Table 3.4 Birthweight, quantile regression with CE, all moms (unit: grams; standard errors in parentheses)

Quantile                            0.1             0.25            0.5             0.75            0.9
Smoke                          -129.99 (10.75)  -136.67 (7.29)  -138.26 (6.31)  -148.34 (7.63)  -152.86 (9.80)
Male                            102.44 (3.79)    122.63 (2.91)   138.87 (2.51)   150.31 (2.87)   162.12 (3.41)
Age                              -1.25 (8.34)     -6.75 (5.38)   -13.37 (5.38)    -9.05 (5.27)    -4.47 (3.04)
Age2                              1.94 (0.13)      0.26 (0.08)     0.35 (0.07)     0.25 (0.08)     1.12 (0.05)
Kessner index = 2              -121.12 (10.29)  -106.11 (6.90)   -79.17 (5.66)   -69.06 (6.45)   -75.41 (8.65)
Kessner index = 3              -255.81 (27.03) -220.18 (17.05) -163.42 (15.67) -138.40 (17.80) -120.40 (19.91)
No prenatal visit              -169.44 (49.00)  -40.80 (27.40)  -32.02 (24.99)   -9.20 (30.33)  -88.82 (48.29)
First prenatal visit, 2nd tri.   92.05 (12.20)    88.90 (7.97)    67.38 (6.73)    53.81 (7.59)   69.25 (10.72)
First prenatal visit, 3rd tri.  260.68 (36.80)  178.45 (20.16)  109.92 (18.82)   97.94 (21.59)  112.61 (25.68)
# of dictionary var.               301              301             301             301             301
# of selected var.                 150              246             298             269             141

As noted by Abrevaya (2006), the FE coefficient estimate on the "smoke" variable is approximately 100g smaller in magnitude than the OLS estimate, which is consistent with the basic omitted-variables story. The CE coefficient estimates on the "smoke" variable are also smaller in magnitude than the pooled median regression estimate by a similar amount. Moreover, the FE and CE coefficient estimates on the other variables show similar patterns of change from the OLS and pooled median regression estimates, respectively. For example, the OLS/FE coefficient estimates on age and age² alternate in sign, and similar patterns are found in the pooled median regression and CE estimates on age and age². Overall, the CE estimates are quite close to the FE estimates, and they can be regarded as a median analogue of the FE estimates. This is sensible because, given the nature of the dependent variable, we expect the conditional distribution of the regression errors to be fairly symmetric, and because the CE estimator is the control-function analogue of the FE estimator. For the unconditional distribution, the mean/median pairs of birth weights for b = 1, 2 and 3 are 3426g/3430g, 3482g/3487g and 3517g/3520g, respectively. Figure A.1 in Appendix C.2 shows a frequency histogram of the pooled birth weights across all births.
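The closeness of the mean/median pairs above reflects a distribution that is nearly symmetric with a thin low-birth-weight tail. The toy sample below (fabricated numbers, not the Abrevaya data) illustrates how such a tail pulls the mean a few grams below the median while keeping the two close:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical birth-weight-like sample: a symmetric bulk plus a small
# low-weight component, mimicking the reported mean/median pattern.
bulk = rng.normal(3450, 450, size=100_000)
tail = rng.normal(2000, 400, size=1_500)
w = np.concatenate([bulk, tail])
print(round(np.mean(w)), round(np.median(w)))
# The thin left tail drags the mean slightly below the median,
# just as 3426g (mean) sits below 3430g (median) for first births.
```

Under exact symmetry the two would coincide, which is why the CE median estimates can reasonably be read as a median analogue of the FE mean estimates.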
Table 3.4 contains the CE estimates for the 10th, 25th, 50th, 75th and 90th percentiles.² The same set of dictionary variables is used, with BIC, in all cases. Evidently, the magnitude of the coefficient estimate on the "smoke" variable declines as the percentile decreases. Note that the pooled quantile regression results in Appendix C.2 show exactly the opposite relationship, which indicates that the impact of smoking is more severely overestimated in the pooled regressions at the lower quantiles. Although the absolute magnitude of the impact declines, its proportionate impact can be larger at lower quantiles. For example, relative to the fitted values for a two-birth mom who had two female babies at ages 27 and 28 (with all other dummy variables equal to zero), the proportionate impacts of smoking at the 10th, 25th, 50th, 75th and 90th percentiles are -5.13%, -4.60%, -4.03%, -3.98% and -3.92%, respectively.

² The estimates in Table A.13 were computed with the classical CRE (after dropping linearly dependent terms). The CRE estimator is less robust than the proposed CE estimator by construction.

There are some computational issues that need to be addressed. First, the main computational challenge in the SCAD or MCP penalized estimation lies in the nonconvex nature of the objective function. The numerical algorithms studied so far use some version of an approximated objective function. In this chapter, all estimates are computed by iterative quantile regression on an augmented data set, based on a local linear approximation of the SCAD penalty function (Sherwood and Wang, 2016). Second, when Sherwood and Wang (2016)'s iterative quantile regression method is used on the given data set, the selection path can vanish for small enough threshold parameters at quantiles close enough to 0 or 1. That is, the penalized estimator is essentially not computable for a small enough threshold parameter at high-end or low-end quantiles. The 0.1 and 0.9 quantile results could have had more selected terms were it not for this problem.

Table 3.5 Coefficient estimates on "Smoke" using different ICs (unit: grams; number of selected variables in brackets)

Quantile      QBIC            QBICL           BIC             BICL
0.1          -132.51 [88]    -140.61 [37]    -129.99 [150]   -129.99 [150]
0.25         -145.97 [75]    -158.10 [37]    -136.66 [246]   -140.18 [179]
0.5          -145.27 [45]    -153.97 [30]    -138.26 [298]   -140.34 [169]
0.75         -147.74 [40]    -152.33 [27]    -148.34 [269]   -146.47 [244]
0.9          -152.41 [93]    -245.82 [30]    -152.86 [141]   -152.86 [141]

Table 3.5 shows the coefficient estimates on the "smoke" variable based on four different Bayesian-type information criteria. The bracketed numbers are the numbers of selected variables out of the 301 dictionary variables. BIC and BICL show no difference at the 10th and 90th percentiles. Unreported results show that the 0.15 and 0.85 quantiles do not suffer from the vanishing-path problem, so it appears that the numerical algorithm for the quantile regression matters: for the given data set, Koenker and d'Orey's (1987; KO) algorithm yields more stable results than Armstrong, Frome and Kung's (1979; AFK) algorithm.

3.7 Concluding Remarks

I propose a new model restriction and estimation procedure for a linear panel data quantile regression model with fixed T. By introducing a nonparametric correlated effect, the new model restriction reasonably accounts for the τ-quantile-specific time-invariant heterogeneity and allows arbitrary within-group dependence of the regression errors. A nonconvex penalized estimation procedure is employed under a sparsity assumption on the correlated effect. To make the sparsity assumption more plausible in some cases, a transformation of the approximated correlated effect into a generalized Mundlak form is proposed. There are interesting questions to be answered in future research. First, it would be useful to study Bayesian-type information criteria that allow diverging p_N and q_N and attain selection consistency under a certain degree of misspecification.
Second, the numerical algorithm for the nonconvex objective function can be improved for more stable and efficient computation. Third, extending the current framework to account for a censored response variable and time-varying endogeneity is another interesting direction to pursue. The extended estimator is expected to have advantages similar to those of the estimator studied in this chapter.

APPENDICES

APPENDIX A

AN APPENDIX FOR CHAPTER 1

A.1 Assumptions

Throughout the appendices, ( a ; b ) denotes the vertical stacking of the vectors a and b.

Assumptions
(1) (y_i1, y_i2, z_i) are i.i.d.
(2) Θ ⊂ R^p is compact.
(3) q₁ : Θ × W → R and q₂ : Θ₂ × W → R, where (y_i1, y_i2, z_i) ∈ W.
(4) θ_o ∈ int(Θ); let N be a neighborhood of θ_o.
(5) With probability one, q₁(y_i1, y_i2, z_i; θ₁, θ₂) and q₂(y_i2, z_i; θ₂) are continuously differentiable at each θ ∈ Θ and twice continuously differentiable on N.
(6) E[ sup_{θ∈Θ} ‖ ∂(q₁+q₂)/∂θ ‖ ] < ∞.
(7) Each element of ∂q₁(y_i1, y_i2, z_i; θ_o1, θ_o2)/∂θ and ∂q₂(y_i2, z_i; θ_o2)/∂θ₂ has a finite second moment.
(8) E[ sup_{θ∈N} ‖ ∂²(q₁+q₂)/∂θ∂θ′ ‖ ] < ∞ and E[ sup_{θ∈N} ‖ ∂²q₂/∂θ₂∂θ₂′ ‖ ] < ∞.
(9) (QLIML f.o.c.) {θ_o} = { θ ∈ Θ : E[ ∂(q₁+q₂)(θ)/∂θ ] = 0 }.
(10) (CF f.o.c.) {θ_o} = { θ ∈ Θ : E[ ( ∂q₁(θ)/∂θ₁ ; ∂q₂(θ₂)/∂θ₂ ) ] = 0 }.
(11) (QLIML rank condition) E[ ∂²(q₁+q₂)(θ_o)/∂θ∂θ′ ] and V( ∂(q₁+q₂)(θ_o)/∂θ ) are invertible.
(12) (CF rank condition) E[ ∂/∂θ′ ( ∂q₁(θ_o)/∂θ₁ ; ∂q₂(θ_o2)/∂θ₂ ) ] and V( ( ∂q₁(θ_o)/∂θ₁ ; ∂q₂(θ_o2)/∂θ₂ ) ) are invertible.

A.2 Proofs

A.2.1 Proof of Proposition 1.3.4

Under the regularity conditions, it suffices to show the following (Newey and McFadden, 1994):

(a) E[ sup_{θ∈Θ} ‖ ( ∂q₁(θ)/∂θ₁ ; ∂q₁(θ)/∂θ₂₂ ; ∂q₂(θ₂)/∂θ₂ ) ‖ ] < ∞;
(b) E[ sup_{θ∈N} ‖ ∂/∂θ′ ( ∂q₁(θ)/∂θ₁ ; ∂q₁(θ)/∂θ₂₂ ; ∂q₂(θ₂)/∂θ₂ ) ‖ ] < ∞;
(c) {θ_o} = { θ ∈ Θ : E[ ( ∂q₁(θ)/∂θ₁ ; ∂q₁(θ)/∂θ₂₂ ; ∂q₂(θ₂)/∂θ₂ ) ] = 0 };
(d) E[ ∂/∂θ′ ( ∂q₁(θ_o)/∂θ₁ ; ∂q₁(θ_o)/∂θ₂₂ ; ∂q₂(θ_o2)/∂θ₂ ) ] has full column rank;
(e) V( ( ∂q₁(θ_o)/∂θ₁ ; ∂q₁(θ_o)/∂θ₂₂ ; ∂q₂(θ_o2)/∂θ₂ ) ) is invertible.

(c) and (e) are direct implications of the definition of GMM-QLIML. (d) can be shown from Assumption 12, since adding extra rows does not affect the column rank. (a) and (b) are implied by the triangle inequality together with Assumptions 6 and 8; e.g.,

‖ ∂q₁(θ)/∂θ₂₂ ‖ ≤ ‖ ∂(q₁+q₂)(θ)/∂θ ‖ + ‖ ∂q₂(θ₂)/∂θ₂ ‖,

and similarly for the second derivatives.

A.2.2 Proof of Proposition 1.3.6

(a) and (b) are directly implied by the following lemma.

Lemma A.2.1 Let G be the linear span of the moment functions {g_i} in (1.8). Optimal GMM based on each maximal linearly independent subset of G at the true parameter values yields asymptotically equivalent estimators.

Proof. Suppose {g̃_i(θ_o)} and {ĝ_i(θ_o)} are maximal linearly independent subsets of the linear span of the moment functions {g_i(θ_o)}. First, the number of moments in {g̃_i(θ_o)} and {ĝ_i(θ_o)} is the same, since both are bases for span({g_i(θ_o)}) and span({g_i(θ_o)}) is a finite-dimensional vector space. Second, by the definition of a basis, there exists an invertible linear map A(θ_o) such that

A(θ_o) ( g̃₁(θ_o) ; … ; g̃_k(θ_o) ) = ( ĝ₁(θ_o) ; … ; ĝ_k(θ_o) ).

By Proposition 1.3.4, the efficient GMM estimators based on {g̃_i(θ_o)} and {ĝ_i(θ_o)} are well defined and asymptotically normal.
Then it is easy to see that

E[ ∂ĝ(θ_o)′/∂θ ] V( ĝ(θ_o) )⁻¹ E[ ∂ĝ(θ_o)/∂θ′ ]
= E[ ∂g̃(θ_o)′/∂θ ] A(θ_o)′ ( A(θ_o) V( g̃(θ_o) ) A(θ_o)′ )⁻¹ A(θ_o) E[ ∂g̃(θ_o)/∂θ′ ]
= E[ ∂g̃(θ_o)′/∂θ ] V( g̃(θ_o) )⁻¹ E[ ∂g̃(θ_o)/∂θ′ ].

A.2.3 Statements (d), (e) and Proof of Proposition 1.3.7

Write q°_i1 = q_i1(θ_o) and q°_i2 = q_i2(θ_o2), and let θ_S denote the structural subvector of θ. The statements are the Breusch–Qian–Schmidt–Wyhowski (BQSW) partial-redundancy conditions for the respective extra moment blocks:

(d) V^S_GMM-QLIML = V^S_QLIML if and only if

E_o[ ∂(∂q°_i2/∂θ₂)/∂θ_S′ ] = cov_o( ∂q°_i2/∂θ₂ , ( ∂q°_i1/∂θ₁ ; ∂q°_i1/∂θ₂ + ∂q°_i2/∂θ₂ ) ) W_o E_o[ ∂/∂θ_S′ ( ∂q°_i1/∂θ₁ ; ∂q°_i1/∂θ₂ + ∂q°_i2/∂θ₂ ) ],

where W_o = V_o( ( ∂q°_i1/∂θ₁ ; ∂q°_i1/∂θ₂ + ∂q°_i2/∂θ₂ ) )⁻¹.

(e) V^S_GMM-QLIML = V^S_CF if and only if

E_o[ ∂(∂q°_i1/∂θ₂₂)/∂θ_S′ ] = cov_o( ∂q°_i1/∂θ₂₂ , ( ∂q°_i1/∂θ₁ ; ∂q°_i2/∂θ₂ ) ) V_o( ( ∂q°_i1/∂θ₁ ; ∂q°_i2/∂θ₂ ) )⁻¹ E_o[ ∂/∂θ_S′ ( ∂q°_i1/∂θ₁ ; ∂q°_i2/∂θ₂ ) ].

Proof. (a) V_GMM-QLIML ⪯ V_CF is trivial. To see V_GMM-QLIML ⪯ V_QLIML, note first that, at the true parameter value,

( ∂q₁(θ₁, θ₂)/∂θ₁ ; ∂q₁(θ₁, θ₂)/∂θ₂ + ∂q₂(θ₂)/∂θ₂ )

is linearly independent by Assumption 11.
Thus there exists an extension to a basis

( ∂q₁(θ₁, θ₂)/∂θ₁ ; ∂q₁(θ₁, θ₂)/∂θ₂ + ∂q₂(θ₂)/∂θ₂ ; ∂q₁(θ)/∂θ₂ ),   (A.1)

which is an invertible linear transformation of (1.9). Hence the result follows by Lemma C.1.

(b) Apply the BQSW redundancy condition to (1.9).

(c) Apply the BQSW redundancy condition to (A.1).

(d), (e) Apply the BQSW partial-redundancy results to (1.9) and (A.1), respectively.

A.2.4 Proof of Corollary 1.3.9

(a) Suppose the GIMEs hold:

V_o( ∂q°₁/∂θ ) = −E_o[ ∂²q°₁/∂θ∂θ′ ]  and  V_o( ∂q°₂/∂θ₂ ) = −E_o[ ∂²q°₂/∂θ₂∂θ₂′ ].

It is implied that cov_o( ∂q°₁/∂θ , ∂q°₂/∂θ₂ ) = 0. Then condition (c) in Proposition 1.3.7 follows. To see that V_QLIML ≠ V_CF, consider the submatrix

V( ∂q_i1/∂θ₂₂ ) − cov( ∂q_i1/∂θ₂₂ , ∂q_i1/∂θ₁ ) V( ∂q_i1/∂θ₁ )⁻¹ cov( ∂q_i1/∂θ₁ , ∂q_i1/∂θ₂₂ ).

The last expression can be interpreted as the difference of the outer products of ∂q_i1/∂θ₂₂ and its linear projection L( ∂q_i1/∂θ₂₂ | ∂q_i1/∂θ₁ ). Since ∂q_i1/∂θ₁ and ∂q_i1/∂θ₂₂ are assumed to be linearly independent at the true parameter, it cannot be zero.

(b) Suppose the GIMEs hold.
Without loss of generality, σ = 1 is assumed. Then result (e) of Proposition 1.3.7 implies

(LHS) = E_o[ ∂²q°_i1/∂θ₂₂∂θ₁₁′ ] − cov_o( ∂q°_i1/∂θ₂₂ , ( ∂q°_i1/∂θ₁ ; ∂q°_i2/∂θ₂ ) ) V_o( ( ∂q°_i1/∂θ₁ ; ∂q°_i2/∂θ₂ ) )⁻¹ E_o[ ( ∂²q°_i1/∂θ₁∂θ₁₁′ ; ∂²q°_i2/∂θ₂∂θ₁₁′ ) ] = 0_{p₂₂×p₁₁}.

For the RHS, evaluating its three parts under the GIMEs gives

(1st part of RHS) = [ 0_{p₂₂×p₁₂} , cov( ∂q_i1/∂θ₂₂ , ∂q_i1/∂θ₂ ) − cov( ∂q_i1/∂θ₂₂ , ∂q_i1/∂θ₁ ) V( ∂q_i1/∂θ₁ )⁻¹ cov( ∂q_i1/∂θ₁ , ∂q_i1/∂θ₂ ) ],

(2nd part of RHS) = [ [ V( ∂q_i1/∂θ₁₂ ) , cov( ∂q_i1/∂θ₁₂ , ∂q_i1/∂θ₂ ) ] ;
                     [ cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁₂ ) , cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁ ) V( ∂q_i1/∂θ₁ )⁻¹ cov( ∂q_i1/∂θ₁ , ∂q_i1/∂θ₂ ) + V( ∂q_i2/∂θ₂ ) ] ],

(3rd part of RHS) = ( cov( ∂q_i1/∂θ₁₂ , ∂q_i1/∂θ₁₁ ) ; cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁₁ ) ).

Denote the inverse of the second part by [ [R₁₁, R₁₂] ; [R₂₁, R₂₂] ]. Then the RHS can be expressed as

(RHS) = [ cov( ∂q_i1/∂θ₂₂ , ∂q_i1/∂θ₂ ) − cov( ∂q_i1/∂θ₂₂ , ∂q_i1/∂θ₁ ) V( ∂q_i1/∂θ₁ )⁻¹ cov( ∂q_i1/∂θ₁ , ∂q_i1/∂θ₂ ) ] [ R₂₁ cov( ∂q_i1/∂θ₁₂ , ∂q_i1/∂θ₁₁ ) + R₂₂ cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁₁ ) ],

where, by the partitioned-inverse formula,

R₂₂ = [ M_o − cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁₂ ) V( ∂q_i1/∂θ₁₂ )⁻¹ cov( ∂q_i1/∂θ₁₂ , ∂q_i1/∂θ₂ ) ]⁻¹,
R₂₁ = − R₂₂ cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁₂ ) V( ∂q_i1/∂θ₁₂ )⁻¹,
M_o = cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁ ) V( ∂q_i1/∂θ₁ )⁻¹ cov( ∂q_i1/∂θ₁ , ∂q_i1/∂θ₂ ) + V( ∂q_i2/∂θ₂ ).

A.2.5 Proof of Proposition 1.3.10

For ( m₁(γ₁, γ₂) ; m₂(γ₁, γ₂) ), asymptotically equivalent linearized moment functions are

l_i1(γ₁, γ₂) = m_i1(γ_o1, γ_o2) + E[ ∂m_i1(γ_o1, γ_o2)/∂γ₁′ ](γ₁ − γ_o1) + E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ](γ₂ − γ_o2),
l_i2(γ₁, γ₂) = m_i2(γ_o1, γ_o2) + E[ ∂m_i2(γ_o1, γ_o2)/∂γ₁′ ](γ₁ − γ_o1) + E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ](γ₂ − γ_o2).

By subtracting E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ] ( E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ] )⁻¹ l_i2(γ₁, γ₂) from l_i1(γ₁, γ₂), we get

l⁰_i1(γ₁) = l_i1(γ₁, γ₂) − E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ] ( E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ] )⁻¹ l_i2(γ₁, γ₂),

where l⁰_i1(γ₁) is a function of γ₁ only. Then standard asymptotics from l⁰_i1(γ₁) yields

V¹_QLIML = A₁⁻¹ B₁ A₁⁻¹′,

where

A₁ = E[ ∂m_i1(γ_o1, γ_o2)/∂γ₁′ ] − E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ] ( E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ] )⁻¹ E[ ∂m_i2(γ_o1, γ_o2)/∂γ₁′ ],
B₁ = V( m_i1(γ_o1, γ_o2) − E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ] ( E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ] )⁻¹ m_i2(γ_o1, γ_o2) ).

Since E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ] = 0, this simply reduces to

A₁ = E[ ∂m_i1(γ_o1, γ_o2)/∂γ₁′ ]  and  B₁ = V( m_i1(γ_o1, γ_o2) ),

and the result is the same for the case of ( m₁(γ₁, γ₂) ; m₃(γ₁, γ₂) ).

A.2.6 Proof of Corollary 1.3.11

First, consider the following reparameterizations with

η ≡ Σ₂₂⁻¹ Σ₂₁  and  σ₁₁|₂ ≡ Σ₁₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁,

where θ₁ = (α′, δ₁′, η′, σ₁₁|₂)′ and θ₂ = (vec(δ₂)′, vech(Σ₂₂)′)′. (A similar proof can be done with the original scores as well.) These modifications do not change the parameter estimates (other than Σ₁₁ and Σ₂₁) of any of the methods studied in this chapter, because the reparameterizations impose no restriction on the parameter space. Now the quasi-scores of q₁ are modified to

∂q₁/∂θ₁ = ( σ₁₁|₂⁻¹ e_i(θ) x′ ; (1/(2σ₁₁|₂²)) h_i(θ) ),
∂q₁/∂θ₂ = ( σ₁₁|₂⁻¹ e_i(θ) (η′ ⊗ z′) ; 0_{r(r+1)/2} ),

where x = [ y₂  z₁  v₂(δ₂) ] and (e_i(θ), h_i(θ)) are defined correspondingly. The moment functions for LIML and CF are

E[ σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) y₂′ ;
   σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) z₁′ ;
   σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (1/(2σ₁₁|₂²)) ( (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ) ;
   vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( z′(y₁ − y₂α − z₁δ₁ − v₂η) η′ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0   (A.2)

and

E[ σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) y₂′ ;
   σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) z₁′ ;
   σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (1/(2σ₁₁|₂²)) ( (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ) ;
   vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0,   (A.3)

respectively. If η = 0, the result is trivial. Suppose that there exists at least one nonzero element of η. Also, assume over-identification.
By substitution of y₂ and an invertible linear transformation, these can be equivalently expressed as

E[ (y₁ − y₂α − z₁δ₁ − v₂η) δ₂₂′z₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) z₁′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ;
   vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( z′(y₁ − y₂α − z₁δ₁ − v₂η) η′ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0   (A.4)

and

E[ (y₁ − y₂α − z₁δ₁ − v₂η) δ₂₂′z₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) z₁′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ;
   vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0,   (A.5)

respectively. We can show that (A.4) and (A.5) can be transformed by an invertible linear transformation into

E[ (y₁ − y₂α − z₁δ₁) δ₂₂′z₂′ ;
   (y₁ − y₂α − z₁δ₁) z₁′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ;
   vec( z₁′(y₂ − zδ₂) Σ₂₂⁻¹ ) ;
   vec( z₂′(y₂ − zδ₂) Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( z₂′(y₁ − y₂α − z₁δ₁ − v₂η) η′ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0   (A.6)

and

E[ (y₁ − y₂α − z₁δ₁) δ₂₂′z₂′ ;
   (y₁ − y₂α − z₁δ₁) z₁′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ;
   vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0,   (A.7)

respectively. Then the result follows by Proposition 1.3.10 and by a similar argument as in Lemma C.1.
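The repeated appeals in this proof to invertible linear transformations of the moment conditions rest on the invariance of efficient GMM to such transformations. The numeric sketch below checks the population identity (G′V⁻¹G)⁻¹ = ((AG)′(AVA′)⁻¹(AG))⁻¹ with arbitrary stand-in matrices, not the model's actual moments:

```python
import numpy as np

rng = np.random.default_rng(3)
q, p = 6, 3
G = rng.normal(size=(q, p))                  # stand-in for E[dg/dtheta']
S = rng.normal(size=(q, q))
V = S @ S.T + q * np.eye(q)                  # stand-in for V(g), positive definite
A = rng.normal(size=(q, q)) + q * np.eye(q)  # invertible transformation g -> A g

def gmm_avar(G, V):
    """Asymptotic variance of efficient GMM: (G' V^-1 G)^-1."""
    return np.linalg.inv(G.T @ np.linalg.solve(V, G))

v_orig = gmm_avar(G, V)
v_trans = gmm_avar(A @ G, A @ V @ A.T)  # Jacobian and variance of transformed moments
print(np.allclose(v_orig, v_trans))     # → True
```

Algebraically, the A and A′ factors cancel inside (AG)′(AVA′)⁻¹(AG), which is why rewriting a moment vector in a different basis leaves the efficient GMM variance unchanged.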
Equivalence in the CF case is clear, since

E[ vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) ] = 0

implies

E[ ( δ₂₂′z₂′ ; z₁′ ) v₂η ] = 0.

To see that (A.4) implies (A.6), note that the second k₂ × r part of the fifth moments implies

E[ vec( δ₂₂′z₂′(y₂ − zδ₂) Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( δ₂₂′z₂′(y₁ − y₂α − z₁δ₁ − v₂η) η′ ) ] = 0,

and, by adding the first moments after multiplying by σ₁₁|₂⁻¹ and the elements of η, we have

E[ vec( δ₂₂′z₂′(y₂ − zδ₂) Σ₂₂⁻¹ ) ] = 0,

which implies E[ δ₂₂′z₂′v₂η ] = 0. Similarly, by adding the second moments after multiplying by the scalar σ₁₁|₂⁻¹ and the elements of η to the first k₁ × r part of the fifth moments, we have

E[ vec( z₁′(y₂ − zδ₂) Σ₂₂⁻¹ ) ] = 0,

and it implies E[ z₁′v₂η ] = 0. The converse can be shown as follows:

E[ vec( δ₂₂′z₂′(y₂ − zδ₂) Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( δ₂₂′z₂′(y₁ − y₂α − z₁δ₁ − v₂η) η′ ) ]
= E[ vec( δ₂₂′z₂′v₂ Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( δ₂₂′z₂′v₂ηη′ ) ] = 0.

Multiplying by Σ₂₂η from the right, we have

E[ vec( δ₂₂′z₂′v₂η ) ( 1 + σ₁₁|₂⁻¹ η′Σ₂₂η ) ] = 0,

where 1 + σ₁₁|₂⁻¹ η′Σ₂₂η is a strictly positive scalar, which implies

E[ δ₂₂′z₂′v₂η ] = 0.

And, again, seeing that E[ vec( z₁′(y₂ − zδ₂) Σ₂₂⁻¹ ) ] = 0 implies E[ z₁′v₂η ] = 0 delivers the result. The invertibility of E[ ∂m_i3(γ_o1, γ_o2)/∂γ₂′ ] and E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ] can easily be derived from the identification conditions of LIML and CF. Hence, it is shown that there exist such T₁(θ) and T₂(θ) as in Proposition 1.3.10.

A.2.7 Proof of Proposition 1.3.12

Lemma A.2.2 below proves the results when g₁(w, θ, γ) and g₂(w, θ, γ) are taken properly:

QLIML: g₂ = ∂q₁/∂γ + ∂q₂/∂γ;  CF: g₂ = ∂q₂/∂γ;  GMM-QLIML: g₂ = ∂q₂/∂γ,

with g₁ chosen to be the rest of the moment functions in each GMM-interpreted estimator.

Lemma A.2.2 Let θ ∈ R^p and γ ∈ R^r, and let the moment functions g = ( g₁(w, θ, γ) ; g₂(w, θ, γ) ) be R^{q+r}-valued with q ≥ p.
Assume the regularity conditions for well-definedness of the relevant GMM estimators below. Suppose

E[ ( ∂g₁(θ_o, γ_o)/∂(θ′, γ′) ; ∂g₂(θ_o, γ_o)/∂(θ′, γ′) ) ] = [ [ G¹¹_{q×p} , 0_{q×r} ] ; [ 0_{r×p} , G²²_{r×r} ] ],

where both G²² and V( g(w, θ_o, γ_o) ) are invertible. Then the asymptotic variance of the optimal GMM estimator of θ based on ( g₁(θ, γ) ; g₂(θ, γ) ) is the same as that of the optimal GMM estimator of θ based on g₁(θ, γ_o), treating γ_o as a known value.

Proof. Let

G = [ [ G¹¹_{q×p} , 0_{q×r} ] ; [ 0_{r×p} , G²²_{r×r} ] ]

and

V⁻¹ = V( ( g₁(w, θ_o, γ_o) ; g₂(w, θ_o, γ_o) ) )⁻¹ = [ [ V₁₁,{q×q} , V₁₂,{q×r} ] ; [ V₂₁,{r×q} , V₂₂,{r×r} ] ]⁻¹ = [ [ B¹¹_{q×q} , B¹²_{q×r} ] ; [ B²¹_{r×q} , B²²_{r×r} ] ].

Then

G′V⁻¹G = [ [ G¹¹′B¹¹G¹¹ , G¹¹′B¹²G²² ] ; [ G²²′B²¹G¹¹ , G²²′B²²G²² ] ].

Now it suffices to show that

G¹¹′B¹¹G¹¹ − G¹¹′B¹²G²² [ G²²′B²²G²² ]⁻¹ G²²′B²¹G¹¹ = G¹¹′ V₁₁⁻¹ G¹¹.

This is true since

G¹¹′B¹¹G¹¹ − G¹¹′B¹²G²² [ G²²′B²²G²² ]⁻¹ G²²′B²¹G¹¹
= G¹¹′ [ B¹¹ − B¹² G²² (G²²)⁻¹ (B²²)⁻¹ (G²²′)⁻¹ G²²′ B²¹ ] G¹¹
= G¹¹′ [ B¹¹ − B¹² (B²²)⁻¹ B²¹ ] G¹¹
= G¹¹′ V₁₁⁻¹ G¹¹,

where the last equality is the partitioned-inverse identity B¹¹ − B¹²(B²²)⁻¹B²¹ = V₁₁⁻¹.

A.2.8 Proof of Proposition 1.3.13

Define the Schur complements

A/(A₂₂ + C₂₂) ≡ A₁₁ − A₁₂ (A₂₂ + C₂₂)⁻¹ A₂₁  and  A/A₁₁ ≡ A₂₂ + C₂₂ − A₂₁ A₁₁⁻¹ A₁₂,

where

A = [ [A₁₁, A₁₂] ; [A₂₁, A₂₂ + C₂₂] ],  A₁₁ = −E[ ∂²q₁/∂θ₁∂θ₁′ ],  A₁₂ = A₂₁′ = −E[ ∂²q₁/∂θ₁∂θ₂′ ],
A₂₂ = −E[ ∂²q₁/∂θ₂∂θ₂′ ]  and  C₂₂ = −E[ ∂²q₂/∂θ₂∂θ₂′ ].

Assume the GIMEs, i.e.,

V( ∂q₁/∂θ₁ ) = σ₁A₁₁,  cov( ∂q₁/∂θ₁, ∂q₁/∂θ₂ ) = σ₁A₁₂,  V( ∂q₁/∂θ₂ ) = σ₁A₂₂,
V( ∂q₂/∂θ₂ ) = σ₂C₂₂  and  cov( ∂q₁/∂θ, ∂q₂/∂θ₂ ) = 0.

Then what needs to be shown is

V¹_CF − V¹_QLIML = A₁₁⁻¹ A₁₂ [ σ₂ W₁ + (σ₁ − σ₂) W₂ ] A₂₁ A₁₁⁻¹,

where

W₁ = C₂₂⁻¹ − [A/A₁₁]⁻¹  and  W₂ = −[A/A₁₁]⁻¹ ( A₂₂ − A₂₁A₁₁⁻¹A₁₂ ) [A/A₁₁]⁻¹.

First, by the argument used in the proof of Proposition 1.3.10, the variance difference is

V¹_CF − V¹_QLIML = A₁₁⁻¹ B₂ A₁₁⁻¹ − [ A/(A₂₂+C₂₂) ]⁻¹ B₁ [ A/(A₂₂+C₂₂) ]⁻¹,

where

B₁ = V( ∂q₁/∂θ₁ − A₁₂(A₂₂+C₂₂)⁻¹( ∂q₁/∂θ₂ + ∂q₂/∂θ₂ ) )
   = σ₁A₁₁ + A₁₂(A₂₂+C₂₂)⁻¹( σ₁A₂₂ + σ₂C₂₂ )(A₂₂+C₂₂)⁻¹A₂₁ − 2σ₁A₁₂(A₂₂+C₂₂)⁻¹A₂₁
   = σ₁ [ A/(A₂₂+C₂₂) ] + (σ₂ − σ₁) A₁₂ D A₂₁,

with D ≡ (A₂₂+C₂₂)⁻¹ C₂₂ (A₂₂+C₂₂)⁻¹, and

B₂ = V( ∂q₁/∂θ₁ − A₁₂C₂₂⁻¹ ∂q₂/∂θ₂ ) = σ₁A₁₁ + σ₂A₁₂C₂₂⁻¹A₂₁.

Substituting these and using the identity

[ A/(A₂₂+C₂₂) ]⁻¹ = A₁₁⁻¹ + A₁₁⁻¹A₁₂[A/A₁₁]⁻¹A₂₁A₁₁⁻¹,

the σ₂ part of the difference collects into A₁₁⁻¹A₁₂( C₂₂⁻¹ − [A/A₁₁]⁻¹ )A₂₁A₁₁⁻¹ = A₁₁⁻¹A₁₂W₁A₂₁A₁₁⁻¹, while the (σ₁ − σ₂) part collects into

[ A/(A₂₂+C₂₂) ]⁻¹ A₁₂ D A₂₁ [ A/(A₂₂+C₂₂) ]⁻¹ − A₁₁⁻¹A₁₂[A/A₁₁]⁻¹A₂₁A₁₁⁻¹.

For the first term,

[ A/(A₂₂+C₂₂) ]⁻¹ A₁₂ D A₂₁ [ A/(A₂₂+C₂₂) ]⁻¹
= ( A₁₁⁻¹ + A₁₁⁻¹A₁₂[A/A₁₁]⁻¹A₂₁A₁₁⁻¹ ) A₁₂ D A₂₁ ( A₁₁⁻¹ + A₁₁⁻¹A₁₂[A/A₁₁]⁻¹A₂₁A₁₁⁻¹ )
= A₁₁⁻¹A₁₂ [A/A₁₁]⁻¹ ( [A/A₁₁] + A₂₁A₁₁⁻¹A₁₂ ) D ( [A/A₁₁] + A₂₁A₁₁⁻¹A₁₂ ) [A/A₁₁]⁻¹ A₂₁A₁₁⁻¹
= A₁₁⁻¹A₁₂ [A/A₁₁]⁻¹ [A₂₂+C₂₂] D [A₂₂+C₂₂] [A/A₁₁]⁻¹ A₂₁A₁₁⁻¹
= A₁₁⁻¹A₁₂ [A/A₁₁]⁻¹ C₂₂ [A/A₁₁]⁻¹ A₂₁A₁₁⁻¹.

Hence the (σ₁ − σ₂) part equals

A₁₁⁻¹A₁₂ [A/A₁₁]⁻¹ ( C₂₂ − [A/A₁₁] ) [A/A₁₁]⁻¹ A₂₁A₁₁⁻¹ = − A₁₁⁻¹A₁₂ [A/A₁₁]⁻¹ ( A₂₂ − A₂₁A₁₁⁻¹A₁₂ ) [A/A₁₁]⁻¹ A₂₁A₁₁⁻¹ = A₁₁⁻¹ A₁₂ W₂ A₂₁ A₁₁⁻¹,

hence the result. To see the positive semi-definiteness of W₁, note that

C₂₂⁻¹ − [A/A₁₁]⁻¹ ⪰ 0 ⟺ [A/A₁₁] = A₂₂ + C₂₂ − A₂₁A₁₁⁻¹A₁₂ ⪰ C₂₂ ⟺ A₂₂ − A₂₁A₁₁⁻¹A₁₂ ⪰ 0 ⟺ [ [A₁₁, A₁₂] ; [A₂₁, A₂₂] ] ⪰ 0 (given A₁₁ ≻ 0),

where the last equivalence is the Schur-complement condition for positive semi-definiteness. Since

A₁₁ = σ₁⁻¹ V_o( ∂q°₁/∂θ₁ ) ≻ 0 (as ∂q₁/∂θ₁ is linearly independent at θ_o)  and  [ [A₁₁, A₁₂] ; [A₂₁, A₂₂] ] = σ₁⁻¹ V_o( ∂q°₁/∂θ ) ⪰ 0,

the statement is shown. The negative semi-definiteness of W₂ follows from the same Schur-complement condition, since A₂₂ − A₂₁A₁₁⁻¹A₁₂ ⪰ 0.

A.2.9 Proof of Proposition 1.3.14

Note that

E[ ∂²q°₁ / ∂θ₁ ∂vech(Σ₂₂)′ ] = 0  and  E[ ∂²q°₂ / ∂vec(δ₂) ∂vech(Σ₂₂)′ ] = 0.

Thus we can treat Σ₂₂ as a known value by Proposition 1.3.12. Then the redefined q̃₂ is also a member of the linear exponential family (Gouriéroux, Monfort and Trognon, 1984). Now it suffices to show that the GLM variance assumptions imply the GIMEs with the corresponding scaling factors in the linear exponential family.
Let m(θ) ≡ G(y₂, z₁, v₂, θ₁). Based on the general forms of the score and Hessian (Wooldridge, 2010), it is easy to see that

−E_o[ ∂²q₁/∂θ∂θ′ ] = E_o[ (1/V_q(y₁ | y₂, z)) (∂m(θ)/∂θ)(∂m(θ)/∂θ′) ]

and

E_o[ (∂q₁/∂θ)(∂q₁/∂θ′) ] = E_o[ ( [y₁ − m(θ)]² / V_q(y₁ | y₂, z)² ) (∂m(θ)/∂θ)(∂m(θ)/∂θ′) ]
 = E_o[ ( E_o[ [y₁ − m(θ)]² | y₂, z ] / V_q(y₁ | y₂, z)² ) (∂m(θ)/∂θ)(∂m(θ)/∂θ′) ]
 = −σ₁ E_o[ ∂²q₁/∂θ∂θ′ ].

The GIMEs for q̃₂ can be shown similarly:

−E_o[ ∂²q̃₂ / ∂vec(δ₂)∂vec(δ₂)′ ] = E_o[ Σ₂₂⁻¹ ⊗ z′z ],
V_o( ∂q̃₂/∂vec(δ₂) ) = E_o[ [I_r ⊗ z]′ Σ₂₂⁻¹ E_o[ v₂′v₂ | z ] Σ₂₂⁻¹ [I_r ⊗ z] ] = σ₂ E_o[ Σ₂₂⁻¹ ⊗ z′z ] = −σ₂ E_o[ ∂²q̃₂ / ∂vec(δ₂)∂vec(δ₂)′ ].

The orthogonality of the scores holds under correct specification of the conditional means, since

E_o[ (∂q₁/∂θ)(∂q₂/∂θ₂′) ] = E_o[ ( E_o[ y₁ − m(θ) | y₂, z ] / V_q(y₁ | y₂, z) ) (∂m(θ)/∂θ)(∂q₂/∂θ₂′) ] = 0.

Then, by Proposition 1.3.13, QLIML is efficient relative to CF for θ₁.

APPENDIX B

AN APPENDIX FOR CHAPTER 2

B.1 Regularity conditions

Assumptions (consistency and asymptotic normality of the QLIML and CF estimators)
(1) (y_i1, y_i2, z_i) are i.i.d.
(2) Θ ⊂ R^p is compact.
(3) q₁ : Θ × W → R and q₂ : Θ₂ × W → R, where (y_i1, y_i2, z_i) ∈ W.
(4) θ_o ∈ int(Θ); let N be a neighborhood of θ_o.
(5) q_i1(θ₁, θ₂) and q_i2(θ₂) are continuously differentiable at each θ ∈ Θ with probability one.
(6) E[ sup_{θ∈Θ} ‖ ( ∂(q₁+q₂)/∂θ ; ∂q₂/∂θ₂ ) ‖ ] < ∞.
(7) E[ ‖ ( ∂q₁(θ_o)/∂θ ; ∂q₂(θ_o2)/∂θ₂ ) ‖² ] < ∞.
(8) E[ ( ∂(q₁(θ) + q₂(θ₂))/∂θ ; ∂q₂(θ₂)/∂θ₂ ) ] is differentiable with respect to θ at θ_o.
(9) (QLIML orthogonality) {θ_o} = { θ ∈ Θ : E[ ∂(q₁+q₂)(θ)/∂θ ] = 0 }.
(10) (CF orthogonality) {θ_o} = { θ ∈ Θ : E[ ( ∂q₁(θ)/∂θ₁ ; ∂q₂(θ₂)/∂θ₂ ) ] = 0 }.
(11) (QLIML rank) ∂/∂θ′ E[ ∂(q₁(θ)+q₂(θ))/∂θ ] |_{θ=θ_o} and V( ∂(q₁(θ)+q₂(θ))/∂θ |_{θ=θ_o} ) are invertible.
(12) (CF rank) ∂/∂θ′ E[ ( ∂q₁(θ)/∂θ₁ ; ∂q₂(θ)/∂θ₂ ) ] |_{θ=θ_o} and V( ( ∂q₁(θ_o)/∂θ₁ ; ∂q₂(θ_o2)/∂θ₂ ) ) are invertible.
(13) (stochastic differentiability) For j = 1, 2 and any δ_N → 0,

sup_{‖θ − θ_o‖ ≤ δ_N}  √N ‖ ĝ^j_N(θ) − ĝ^j_N(θ_o) − E[ ĝ^j_N(θ) ] ‖ / ( 1 + √N ‖θ − θ_o‖ )  →_p  0,
(13) (stochastic differentiability) For $j=1,2$ and any $\delta_N\to 0$,
\[
\sup_{\|\theta-\theta_o\|\le\delta_N}
\frac{\sqrt{N}\,\big\|\hat g_N^j(\theta)-\hat g_N^j(\theta_o)-E\big[\hat g_N^j(\theta)\big]\big\|}{1+\sqrt{N}\,\|\theta-\theta_o\|}
\xrightarrow{p}0,
\]
where
\[
\hat g_N^1(\theta)=\frac1N\sum_{i=1}^N\frac{\partial\big(q_{i1}(\theta)+q_{i2}(\theta)\big)}{\partial\theta}
\qquad\text{and}\qquad
\hat g_N^2(\theta)=\frac1N\sum_{i=1}^N
\begin{bmatrix}\partial q_{i1}(\theta)/\partial\theta_1\\ \partial q_{i2}(\theta)/\partial\theta_2\end{bmatrix}.
\]

Assumptions (consistency and asymptotic normality of minimum distance estimators):

(14) $\pi(\theta_1,\theta_{o2})=\pi_o$ if and only if $\theta_1=\theta_{o1}$.
(15) $\partial\pi(\theta_{o1},\theta_{o2})/\partial\theta_1'$ has full rank.
(16) $\pi:\Theta\to\Gamma$ is continuous at each $\theta\in\Theta$ and continuously differentiable in $\mathcal{N}$.
(17) (asymptotic normality of reduced-form parameters)
\[
\sqrt{N}\Big(\big(\hat\pi',\hat\theta_2'\big)'-\big(\pi_o',\theta_{o2}'\big)'\Big)
\xrightarrow{d}N\!\left(0,\big(A_R'B_R^{-1}A_R\big)^{-1}\right),
\]
where
\[
A_R=\frac{\partial}{\partial(\pi',\theta_2')}E
\begin{bmatrix}\partial q_1(\pi,\theta_2)/\partial\pi\\ \partial q_2(\theta_2)/\partial\theta_2\end{bmatrix}
\Bigg|_{(\pi,\theta_2)=(\pi_o,\theta_{o2})}
\qquad\text{and}\qquad
B_R=V\begin{bmatrix}\partial q_1(\pi_o,\theta_{o2})/\partial\pi\\ \partial q_2(\theta_{o2})/\partial\theta_2\end{bmatrix}.
\]
(18) Each component of $\partial q_2(\theta_{o2})/\partial\theta_2$ cannot be expressed as a linear combination of the components of $\partial q_1(\theta)/\partial\pi$.

B.2 Proof of Proposition 2.3.3

First, it is shown that a well-defined minimum distance estimator and its linearized version are asymptotically equivalent.

Lemma B.2.1 (linearized minimum distance estimator) Assume: (1) $h$ is continuously differentiable in $\theta$, where $\theta\in\mathbb{R}^p$, $\gamma\in\mathbb{R}^g$ and $g\ge p$; (2) $\gamma_o-h(\theta)\ne 0$ if $\theta\ne\theta_o$; (3) $\partial h(\theta_o)/\partial\theta'$ has full column rank; (4) $\sqrt{N}(\hat\gamma-\gamma_o)\xrightarrow{d}N(0,\Omega_o)$. Then efficient MD on a linearized link function yields an estimator asymptotically equivalent to efficient MD with the original link function.

Proof. Consider a first-order expansion of $h$ around $\theta_o$:
\[
h(\theta)\approx h(\theta_o)+\frac{\partial h(\theta_o)}{\partial\theta'}(\theta-\theta_o).
\]
The minimization problem is
\[
\min_\theta\;\Big(\hat\gamma-h(\theta_o)-\frac{\partial h(\theta_o)}{\partial\theta'}(\theta-\theta_o)\Big)'\hat W\Big(\hat\gamma-h(\theta_o)-\frac{\partial h(\theta_o)}{\partial\theta'}(\theta-\theta_o)\Big),
\tag{A.1}
\]
where $\hat W\xrightarrow{p}W_o$. The first-order condition is
\[
\frac{\partial h(\theta_o)'}{\partial\theta}\,\hat W\,\Big(\hat\gamma-h(\theta_o)-\frac{\partial h(\theta_o)}{\partial\theta'}(\hat\theta-\theta_o)\Big)=0.
\]
Then, writing $H_o\equiv\partial h(\theta_o)/\partial\theta'$,
\[
\hat\theta=\theta_o+\big(H_o'\hat WH_o\big)^{-1}H_o'\hat W\big(\hat\gamma-h(\theta_o)\big).
\]
The asymptotic distribution is
\begin{align*}
\sqrt{N}\big(\hat\theta-\theta_o\big)
&=\big(H_o'\hat WH_o\big)^{-1}H_o'\hat W\,\sqrt{N}\big(\hat\gamma-h(\theta_o)\big)\\
&=\big(H_o'W_oH_o\big)^{-1}H_o'W_o\,\sqrt{N}\big(\hat\gamma-h(\theta_o)\big)+o_p(1)\\
&=\big(H_o'W_oH_o\big)^{-1}H_o'W_o\,\sqrt{N}\big(\hat\gamma-\gamma_o\big)+o_p(1).
\end{align*}
Then the optimal weighting matrix is such that $W_o=\Omega_o^{-1}$. Hence the proof.

Now, consider an auxiliary model.

Lemma B.2.2 There exists an auxiliary asymptotic model whose GLS estimator is asymptotically equivalent to efficient linearized MD.

Proof. Consider
\[
\underbrace{\hat\gamma-h(\theta_o)+\frac{\partial h(\theta_o)}{\partial\theta'}\theta_o}_{y}
=\underbrace{\frac{\partial h(\theta_o)}{\partial\theta'}}_{X}\theta+u_n,
\qquad E[u_n]=0,\quad V(u_n)=\Omega_o.
\]
Then the GLS estimator of $\theta$ in this auxiliary asymptotic model is asymptotically equivalent to efficient linearized MD, since the GLS estimator solves (A.1). Here, the mean and variance of $u_n$ are simply restrictions imposed in the auxiliary model, not derived from the original model (Gouriéroux, Monfort and Trognon, 1985).

Next, the above two lemmas are applied to a minimum distance problem with a partitioned link function. Consider
\[
\{(\theta_{o1},\theta_{o2})\}
=\left\{(\theta_1,\theta_2):
\begin{bmatrix}\gamma_{o1}\\ \gamma_{o2}\end{bmatrix}
=\begin{bmatrix}h_1(\theta_1,\theta_2)\\ h_2(\theta_1,\theta_2)\end{bmatrix}\right\},
\]
where $\gamma=(\gamma_1',\gamma_2')'$, $h=(h_1',h_2')'$, $\gamma_1\in\mathbb{R}^{g_1}$, $\gamma_2\in\mathbb{R}^{g_2}$, $\theta_1\in\mathbb{R}^{p_1}$, $\theta_2\in\mathbb{R}^{p_2}$, and $p_2=g_2$. Then, by a first-order expansion, the linearized partitioned model is
\[
\begin{bmatrix}h_1(\theta_1,\theta_2)\\ h_2(\theta_1,\theta_2)\end{bmatrix}
\approx
\begin{bmatrix}h_1(\theta_{o1},\theta_{o2})\\ h_2(\theta_{o1},\theta_{o2})\end{bmatrix}
+\begin{bmatrix}
\partial h_1(\theta_o)/\partial\theta_1' & \partial h_1(\theta_o)/\partial\theta_2'\\
\partial h_2(\theta_o)/\partial\theta_1' & \partial h_2(\theta_o)/\partial\theta_2'
\end{bmatrix}
\left(\begin{bmatrix}\theta_1\\ \theta_2\end{bmatrix}-\begin{bmatrix}\theta_{o1}\\ \theta_{o2}\end{bmatrix}\right),
\]
and the corresponding auxiliary asymptotic model is
\[
\begin{bmatrix}\hat\gamma_1\\ \hat\gamma_2\end{bmatrix}
=\begin{bmatrix}h_1(\theta_{o1},\theta_{o2})\\ h_2(\theta_{o1},\theta_{o2})\end{bmatrix}
+\begin{bmatrix}
\partial h_1(\theta_o)/\partial\theta_1' & \partial h_1(\theta_o)/\partial\theta_2'\\
\partial h_2(\theta_o)/\partial\theta_1' & \partial h_2(\theta_o)/\partial\theta_2'
\end{bmatrix}
\left(\begin{bmatrix}\theta_1\\ \theta_2\end{bmatrix}-\begin{bmatrix}\theta_{o1}\\ \theta_{o2}\end{bmatrix}\right)
+\begin{bmatrix}u_{1n}\\ u_{2n}\end{bmatrix},
\]
where
\[
E\begin{bmatrix}u_{1n}\\ u_{2n}\end{bmatrix}=0
\qquad\text{and}\qquad
V\begin{bmatrix}u_{1n}\\ u_{2n}\end{bmatrix}=\Omega_o.
\]
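The content of Lemmas B.2.1 and B.2.2 can be checked numerically: with the optimal weight $W_o=\Omega_o^{-1}$, the efficient linearized MD estimator coincides with GLS on the auxiliary model $y=H\theta+u_n$. The following is a minimal sketch under hypothetical dimensions and a made-up Jacobian (none of these objects come from the dissertation).

```python
import numpy as np

rng = np.random.default_rng(1)
g, p = 5, 2                                        # reduced-form and structural dims (g >= p)
H = rng.normal(size=(g, p))                        # Jacobian dh(theta_o)/dtheta' (full column rank)
theta_o = np.array([1.0, -0.5])
Omega = 0.3 * np.eye(g) + 0.1 * np.ones((g, g))    # avar of the reduced-form estimates (PD)
h_o = rng.normal(size=g)                           # h(theta_o), arbitrary for the illustration
gamma_hat = h_o + H @ theta_o + 0.05 * rng.normal(size=g)

# Auxiliary regression: y = H theta + u_n with y := gamma_hat - h(theta_o) + H theta_o
y = gamma_hat - h_o + H @ theta_o

# Efficient linearized MD: theta = (H' W H)^{-1} H' W y with W = Omega^{-1}
W = np.linalg.inv(Omega)
theta_md = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)

# GLS via whitening with the Cholesky factor of Omega (Omega = C C')
C = np.linalg.cholesky(Omega)
Hw, yw = np.linalg.solve(C, H), np.linalg.solve(C, y)
theta_gls = np.linalg.lstsq(Hw, yw, rcond=None)[0]

assert np.allclose(theta_md, theta_gls)
```

Both routes solve the same quadratic problem (A.1), which is the point of the auxiliary-model construction.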
Define
\[
y_1\equiv\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+\frac{\partial h_1(\theta_o)}{\partial\theta_1'}\theta_{o1}+\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\theta_{o2},
\qquad
y_2\equiv\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})+\frac{\partial h_2(\theta_o)}{\partial\theta_1'}\theta_{o1}+\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\theta_{o2},
\]
\[
X_1\equiv\frac{\partial h_1(\theta_o)}{\partial\theta_1'},\quad
Z_1\equiv\frac{\partial h_1(\theta_o)}{\partial\theta_2'},\quad
X_2\equiv\frac{\partial h_2(\theta_o)}{\partial\theta_1'},\quad
Z_2\equiv\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\;(\text{invertible, }p_2\times p_2),
\]
where $E[(u_{1n}',u_{2n}')']=0$ and $V[(u_{1n}',u_{2n}')']=\Omega_o$. Then we have the system
\[
\begin{cases}
y_1=X_1\theta_1+Z_1\theta_2+u_{1n}\\
y_2=X_2\theta_1+Z_2\theta_2+u_{2n}.
\end{cases}
\]
Since $Z_2$ is invertible, we have
\[
Z_2^{-1}y_2=Z_2^{-1}X_2\theta_1+\theta_2+Z_2^{-1}u_{2n}
\;\Longleftrightarrow\;
\theta_2=Z_2^{-1}y_2-Z_2^{-1}X_2\theta_1-Z_2^{-1}u_{2n}.
\]
Thus
\[
y_1=X_1\theta_1+Z_1\big(Z_2^{-1}y_2-Z_2^{-1}X_2\theta_1-Z_2^{-1}u_{2n}\big)+u_{1n}
\;\Longleftrightarrow\;
y_1-Z_1Z_2^{-1}y_2=\big(X_1-Z_1Z_2^{-1}X_2\big)\theta_1+u_{1n}-Z_1Z_2^{-1}u_{2n}.
\]
Then an equivalent system is
\[
\begin{cases}
y_1-Z_1Z_2^{-1}y_2=\big(X_1-Z_1Z_2^{-1}X_2\big)\theta_1+u_{1n}-Z_1Z_2^{-1}u_{2n}\\
y_2=X_2\theta_1+Z_2\theta_2+u_{2n}.
\end{cases}
\]
Moreover, define
\[
y_1^*\equiv y_1-Z_1Z_2^{-1}y_2,\qquad
X_1^*\equiv X_1-Z_1Z_2^{-1}X_2,\qquad
u_{1n}^*\equiv u_{1n}-Z_1Z_2^{-1}u_{2n},
\]
\[
y_2^*\equiv y_2-Ay_1^*,\qquad
X_2^*\equiv X_2-AX_1^*,\qquad
u_{2n}^*\equiv u_{2n}-L\big(u_{2n}\mid u_{1n}^*\big)=u_{2n}-Au_{1n}^*,
\]
where the linear projection $L(\cdot\mid\cdot)$ is defined in the auxiliary population space. Then we have another equivalent system
\[
\begin{cases}
y_1^*=X_1^*\theta_1+u_{1n}^*\\
y_2^*=X_2^*\theta_1+Z_2\theta_2+u_{2n}^*.
\end{cases}
\]
Since $u_{1n}^*$ and $u_{2n}^*$ are orthogonal here, GLS on the first part only is equivalent to joint GLS for $\theta_1$.

From now on, it will be proved that concentrated MD is asymptotically equivalent to running GLS on the first part only.

Consistency. The concentrating equation is
\[
\hat\gamma_2-h_2(\theta_1,\theta_2)=0,
\]
and, by the implicit function theorem, $\theta_2=\varphi_n(\theta_1)$ is well-defined and continuously differentiable at each $\theta_1$. The concentrated MD estimator is derived by minimizing the distance $\hat\gamma_1-h_1(\theta_1,\varphi_n(\theta_1))$, and consistency of concentrated MD follows easily from the fact that $\varphi_n(\theta_1)$ is well-defined and smooth enough for each $\theta_1$.
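The substitution step that eliminates $\theta_2$ can be illustrated numerically: in a noiseless version of the system, the transformed first block alone recovers $\theta_1$ exactly. A minimal sketch with hypothetical dimensions (the matrices below are illustrative, not objects from the dissertation):

```python
import numpy as np

rng = np.random.default_rng(2)
g1, p1, p2 = 4, 2, 3
X1, Z1 = rng.normal(size=(g1, p1)), rng.normal(size=(g1, p2))
X2 = rng.normal(size=(p2, p1))
Z2 = rng.normal(size=(p2, p2)) + 3 * np.eye(p2)   # invertible p2 x p2 block
theta1 = np.array([0.7, -1.2])
theta2 = rng.normal(size=p2)

# Noiseless partitioned system
y1 = X1 @ theta1 + Z1 @ theta2
y2 = X2 @ theta1 + Z2 @ theta2

# Substitute theta2 = Z2^{-1}(y2 - X2 theta1) into the first block:
# y1 - Z1 Z2^{-1} y2 = (X1 - Z1 Z2^{-1} X2) theta1
y_star = y1 - Z1 @ np.linalg.solve(Z2, y2)
X_star = X1 - Z1 @ np.linalg.solve(Z2, X2)

theta1_rec = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
assert np.allclose(theta1_rec, theta1)
```

With errors added, the same transformation produces the first equation of the equivalent system above, on which GLS is run.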
Optimal weight calculation. First, differentiating both sides of the concentration identity gives
\[
\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_1'}
+\underbrace{\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_2'}}_{\text{invertible}}
\frac{\partial\varphi_n(\theta_1)}{\partial\theta_1'}=0.
\]
Hence
\[
\frac{\partial\varphi_n(\theta_1)}{\partial\theta_1'}
=-\left[\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_2'}\right]^{-1}
\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_1'}.
\]
In the minimization problem
\[
\min_{\theta_1}\;\big[\hat\gamma_1-h_1(\theta_1,\varphi_n(\theta_1))\big]'\hat W\big[\hat\gamma_1-h_1(\theta_1,\varphi_n(\theta_1))\big],
\]
taking the first-order condition and defining
\[
H_n(\theta_1)\equiv
\frac{\partial h_1(\theta_1,\varphi_n(\theta_1))}{\partial\theta_1'}
-\frac{\partial h_1(\theta_1,\varphi_n(\theta_1))}{\partial\theta_2'}
\left[\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_2'}\right]^{-1}
\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_1'},
\]
we have
\[
0=H_n(\hat\theta_1)'\,\hat W\,\big[\hat\gamma_1-h_1(\hat\theta_1,\varphi_n(\hat\theta_1))\big].
\]
Note that, by a mean value expansion,
\[
h_1(\hat\theta_1,\varphi_n(\hat\theta_1))
=h_1(\theta_{o1},\varphi_n(\theta_{o1}))+H_n(\bar\theta_1)\big(\hat\theta_1-\theta_{o1}\big),
\]
where $\bar\theta_1$ lies on the segment connecting $\hat\theta_1$ and $\theta_{o1}$. Hence
\begin{align*}
\sqrt{N}\big(\hat\theta_1-\theta_{o1}\big)
&=\big[H_n(\hat\theta_1)'\hat WH_n(\bar\theta_1)\big]^{-1}H_n(\hat\theta_1)'\hat W\,
\sqrt{N}\big[\hat\gamma_1-h_1(\theta_{o1},\varphi_n(\theta_{o1}))\big]\\
&=\big(H_o'W_oH_o\big)^{-1}H_o'W_o\,\Big[\sqrt{N}\big(\hat\gamma_1-\gamma_{o1}\big)
-\sqrt{N}\big(h_1(\theta_{o1},\varphi_n(\theta_{o1}))-h_1(\theta_{o1},\theta_{o2})\big)\Big]+o_p(1).
\end{align*}
Note
\[
h_1(\theta_{o1},\varphi_n(\theta_{o1}))
=h_1(\theta_{o1},\theta_{o2})+\frac{\partial h_1(\theta_{o1},\bar\theta_2)}{\partial\theta_2'}\big(\varphi_n(\theta_{o1})-\theta_{o2}\big)
\]
and
\[
0=\hat\gamma_2-h_2(\theta_{o1},\varphi_n(\theta_{o1}))
=\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})-\frac{\partial h_2(\theta_{o1},\tilde\theta_2)}{\partial\theta_2'}\big(\varphi_n(\theta_{o1})-\theta_{o2}\big),
\]
where $\bar\theta_2$ and $\tilde\theta_2$ lie on the segment connecting $\varphi_n(\theta_{o1})$ and $\theta_{o2}$, which implies
\[
\varphi_n(\theta_{o1})-\theta_{o2}
=\left[\frac{\partial h_2(\theta_{o1},\tilde\theta_2)}{\partial\theta_2'}\right]^{-1}
\big(\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})\big)
=\left[\frac{\partial h_2(\theta_{o1},\tilde\theta_2)}{\partial\theta_2'}\right]^{-1}
\big(\hat\gamma_2-\gamma_{o2}\big).
\]
Thus
\begin{align*}
\sqrt{N}\big(\hat\theta_1-\theta_{o1}\big)
&=\big(H_o'W_oH_o\big)^{-1}H_o'W_o
\left[\sqrt{N}\big(\hat\gamma_1-\gamma_{o1}\big)
-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}
\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}
\sqrt{N}\big(\hat\gamma_2-\gamma_{o2}\big)\right]+o_p(1)\\
&=\big(H_o'W_oH_o\big)^{-1}H_o'W_o
\underbrace{\left[I_{g_1}\;\;-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\Big[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\Big]^{-1}\right]
\sqrt{N}\begin{bmatrix}\hat\gamma_1-\gamma_{o1}\\ \hat\gamma_2-\gamma_{o2}\end{bmatrix}}_{\text{the inverse of the asymptotic variance of this expression is the optimal weight}}
+o_p(1),
\end{align*}
where
\[
H_o=\frac{\partial h_1(\theta_o)}{\partial\theta_1'}
-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}
\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}
\frac{\partial h_2(\theta_o)}{\partial\theta_1'}.
\]
Therefore, the optimal weight is
\[
W_o=\left(
\left[I_{g_1}\;\;-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\Big[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\Big]^{-1}\right]
\Omega_o
\left[I_{g_1}\;\;-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\Big[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\Big]^{-1}\right]'\right)^{-1}.
\]

Linearized cMD and the auxiliary asymptotic model. The linearized cMD criterion, based on
\[
\hat\gamma_1-h_1(\theta_{o1},\varphi_n(\theta_{o1}))-H_o\big(\theta_1-\theta_{o1}\big)
\]
with the weights calculated above, is asymptotically equivalent to concentrated MD. The corresponding auxiliary asymptotic model is
\[
\hat\gamma_1-h_1(\theta_{o1},\varphi_n(\theta_{o1}))+H_o\theta_{o1}
=H_o\theta_1+u_{1n}-Z_1Z_2^{-1}u_{2n},
\]
where the error $u_{1n}-Z_1Z_2^{-1}u_{2n}$ is derived from the optimal weight calculation. Moreover, since we know
\[
X_1-Z_1Z_2^{-1}X_2=H_o,
\]
to see its equivalence to
\[
y_1-Z_1Z_2^{-1}y_2=\big(X_1-Z_1Z_2^{-1}X_2\big)\theta_1+u_{1n}-Z_1Z_2^{-1}u_{2n},
\]
it suffices to show that we can replace $\hat\gamma_1-h_1(\theta_{o1},\varphi_n(\theta_{o1}))+H_o\theta_{o1}$ with $y_1-Z_1Z_2^{-1}y_2$. Note that we have
\begin{align*}
y_1-Z_1Z_2^{-1}y_2
&=\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+\frac{\partial h_1(\theta_o)}{\partial\theta_1'}\theta_{o1}+\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\theta_{o2}\\
&\qquad-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}
\left[\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})+\frac{\partial h_2(\theta_o)}{\partial\theta_1'}\theta_{o1}+\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\theta_{o2}\right]\\
&=\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+H_o\theta_{o1}
-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}
\big[\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})\big],
\end{align*}
along with
\begin{align*}
\hat\gamma_1-h_1(\theta_{o1},\varphi_n(\theta_{o1}))+H_o\theta_{o1}
&=\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+H_o\theta_{o1}
+\big[h_1(\theta_{o1},\theta_{o2})-h_1(\theta_{o1},\varphi_n(\theta_{o1}))\big]\\
&=\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+H_o\theta_{o1}
-\frac{\partial h_1(\theta_{o1},\bar\theta_2)}{\partial\theta_2'}
\left[\frac{\partial h_2(\theta_{o1},\tilde\theta_2)}{\partial\theta_2'}\right]^{-1}
\big(\hat\gamma_2-\gamma_{o2}\big)\\
&=\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+H_o\theta_{o1}
-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}
\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}
\big[\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})\big]
+o_p\big(n^{-1/2}\big).
\end{align*}
Hence the result.

B.3 Proof of Proposition 2.3.7

First, note that
\[
V_{MD\text{-}QLIML}=\Big(H_o'A_R'B_R^{-1}A_RH_o\Big)^{-1},
\]
where
\[
H_o=\begin{bmatrix}\dfrac{\partial\pi(\theta_o)}{\partial\theta_1'} & \dfrac{\partial\pi(\theta_o)}{\partial\theta_2'}\\[4pt] 0 & I_{p_2}\end{bmatrix},
\qquad
A_R=\frac{\partial}{\partial(\pi',\theta_2')}E
\begin{bmatrix}\partial q_1(\pi,\theta_2)/\partial\pi\\ \partial q_2(\theta_2)/\partial\theta_2\end{bmatrix}
\Bigg|_{(\pi,\theta_2)=(\pi_o,\theta_{o2})},
\qquad
B_R=V\begin{bmatrix}\partial q_{i1}(\pi_o,\theta_{o2})/\partial\pi\\ \partial q_{i2}(\theta_{o2})/\partial\theta_2\end{bmatrix}.
\]
Next, by the product rule of differentiation, we have
\[
\frac{\partial}{\partial\theta'}E
\begin{bmatrix}\partial q_1(\pi(\theta),\theta_2)/\partial\pi\\ \partial q_2(\theta_2)/\partial\theta_2\end{bmatrix}
\Bigg|_{\theta=\theta_o}
=\underbrace{\frac{\partial}{\partial(\pi',\theta_2')}E
\begin{bmatrix}\partial q_1(\pi,\theta_2)/\partial\pi\\ \partial q_2(\theta_2)/\partial\theta_2\end{bmatrix}
\Bigg|_{(\pi,\theta_2)=(\pi_o,\theta_{o2})}}_{=A_R}
\underbrace{\begin{bmatrix}\dfrac{\partial\pi(\theta_o)}{\partial\theta_1'} & \dfrac{\partial\pi(\theta_o)}{\partial\theta_2'}\\[4pt] 0 & I_{p_2}\end{bmatrix}}_{=H_o}.
\]
Hence
\[
V_{mGMM\text{-}QLIML}=\Big(H_o'A_R'B_R^{-1}A_RH_o\Big)^{-1}.
\]

B.4 Proof of Proposition 2.3.8

Note that the quasi-likelihoods for the reduced-form model are $q_1(\pi(\theta_1,\theta_2),\theta_2)$ and $q_2(\theta_2)$. We will show $V_{mGMM\text{-}QLIML}^{-1}-V_{GMM\text{-}QLIML}^{-1}\succeq 0$. First, note that
\[
V_{mGMM\text{-}QLIML}=\Big(H_o'A_R'B_R^{-1}A_RH_o\Big)^{-1},
\]
with $H_o$, $A_R$ and $B_R$ as defined in B.3. It will be shown that $V_{GMM\text{-}QLIML}^{-1}$ can be expressed in terms of $H_o$, $A_R$, $B_R$ and an additional linear transformation. Suppose $\theta_{22}$ is not empty. With probability one, for some $p_{22}\times g$ matrix $C_2(\theta)$,
\[
\begin{bmatrix}
\partial q_{i1}(\theta)/\partial\theta_1\\
\partial q_{i1}(\theta)/\partial\theta_{22}\\
\partial q_{i2}(\theta_2)/\partial\theta_2
\end{bmatrix}
=\begin{bmatrix}
\frac{\partial\pi'}{\partial\theta_1}\frac{\partial}{\partial\pi}q_{i1}(\pi,\theta_2)\\[2pt]
\big[\frac{\partial\pi'}{\partial\theta_{22}}+C_2(\theta)\big]\frac{\partial}{\partial\pi}q_{i1}(\pi,\theta_2)\\[2pt]
\frac{\partial}{\partial\theta_2}q_{i2}(\theta_2)
\end{bmatrix}
=\underbrace{\begin{bmatrix}
\frac{\partial\pi'}{\partial\theta_1} & 0\\
\frac{\partial\pi'}{\partial\theta_{22}}+C_2(\theta) & 0\\
0 & I_{p_2}
\end{bmatrix}}_{\equiv W(\theta):\;(p+p_{22})\times(g+p_2)}
\begin{bmatrix}
\frac{\partial}{\partial\pi}q_{i1}(\pi,\theta_2)\\[2pt]
\frac{\partial}{\partial\theta_2}q_{i2}(\theta_2)
\end{bmatrix}.
\]
By the product rule of differentiation,
\[
\frac{d}{d\theta'}\left(W(\theta)\,E\begin{bmatrix}\partial q_{i1}(\pi,\theta_2)/\partial\pi\\ \partial q_{i2}(\theta_2)/\partial\theta_2\end{bmatrix}\right)
=\underbrace{\left(E\Big[\big(\partial q_{i1}(\pi,\theta_2)/\partial\pi',\;\partial q_{i2}(\theta_2)/\partial\theta_2'\big)\Big]\otimes I_{p+p_{22}}\right)\frac{d}{d\theta'}\operatorname{vec}\big[W(\theta)\big]}_{\text{vanishes in expectation at the true parameters}}
+\,W(\theta)\,\frac{d}{d\theta'}E\begin{bmatrix}\partial q_{i1}(\pi,\theta_2)/\partial\pi\\ \partial q_{i2}(\theta_2)/\partial\theta_2\end{bmatrix},
\]
where, again by the product rule,
\[
\frac{d}{d\theta'}E\begin{bmatrix}\partial q_{i1}(\pi,\theta_2)/\partial\pi\\ \partial q_{i2}(\theta_2)/\partial\theta_2\end{bmatrix}
=A_RH_o\quad\text{when }\theta=\theta_o.
\]
Then we have
\[
\frac{d}{d\theta'}E
\begin{bmatrix}
\partial q_{i1}(\theta)/\partial\theta_1\\
\partial q_{i1}(\theta)/\partial\theta_{22}\\
\partial q_{i2}(\theta_2)/\partial\theta_2
\end{bmatrix}\Bigg|_{\theta=\theta_o}
=W_oA_RH_o,
\]
where $W_o=W(\theta_o)$. Also, it is easy to see
\[
V\begin{bmatrix}
\partial q_{i1}(\theta_{o1},\theta_{o2})/\partial\theta_1\\
\partial q_{i1}(\theta_{o1},\theta_{o2})/\partial\theta_{22}\\
\partial q_{i2}(\theta_{o2})/\partial\theta_2
\end{bmatrix}
=W_oB_RW_o'.
\]
Hence
\[
V_{GMM\text{-}QLIML}=\Big(H_o'A_R'W_o'\big(W_oB_RW_o'\big)^{-1}W_oA_RH_o\Big)^{-1}.
\]
To see the relative efficiency of MD-QLIML,
\begin{align*}
V_{mGMM\text{-}QLIML}^{-1}-V_{GMM\text{-}QLIML}^{-1}
&=H_o'A_R'B_R^{-1}A_RH_o-H_o'A_R'W_o'\big(W_oB_RW_o'\big)^{-1}W_oA_RH_o\\
&=H_o'A_R'\Big[B_R^{-1}-W_o'\big(W_oB_RW_o'\big)^{-1}W_o\Big]A_RH_o\\
&=H_o'A_R'B_R^{-1/2\prime}\Big[I-B_R^{1/2\prime}W_o'\big(W_oB_R^{1/2}B_R^{1/2\prime}W_o'\big)^{-1}W_oB_R^{1/2}\Big]B_R^{-1/2}A_RH_o\;\succeq\;0,
\end{align*}
where $B_R=B_R^{1/2}B_R^{1/2\prime}$. When $p_1+p_{22}=g$ holds, $W_o$ is invertible and we have
\[
I-B_R^{1/2\prime}W_o'\big(W_oB_RW_o'\big)^{-1}W_oB_R^{1/2}=0.
\]
In the case where $\theta_{22}$ is empty, the result can be shown with a slight modification of the above proof.

B.5 Proof of Proposition 2.3.9

(a) (Well-definedness of the reduced-form model) The reduced-form likelihood is
\begin{align*}
q_{i1}(\theta_1,\theta_2)
&=l\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\\
&=l\big(y_{i1},\,z_1(\delta_{21}\alpha+\delta_1)+z_2\delta_{22}\alpha+v_2(\alpha+\eta),\,\lambda\big)\\
&=l\big(y_{i1},\,z\pi_1(\theta)+v_2\pi_2(\theta),\,\pi_3(\theta)\big)
=q_1\big(\pi(\theta),\theta_2\big).
\end{align*}
Since $q_{i1}$ depends on $\theta_2$ only through $\delta_2$, it suffices to show that each element of $\frac{\partial q_1}{\partial\delta_2}$ can be expressed as a linear combination of $\frac{\partial q_1}{\partial\pi_1}$. Note
\[
\frac{\partial q_1(\theta_1,\theta_2)}{\partial\delta_2}
=-s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2(\delta_2)\eta,\,\lambda\big)\,\eta\otimes z',
\]
\[
\frac{\partial q_1(\pi,\theta_2)}{\partial\pi_1}
=s\big(y_{i1},\,z\pi_1+v_2(\delta_2)\pi_2,\,\pi_3\big)\,z'
=s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2(\delta_2)\eta,\,\lambda\big)\,z',
\]
where $s(y_{i1},\Upsilon,\theta_1)=\partial l(y_{i1},\Upsilon,\theta_1)/\partial\Upsilon$. Hence the proof.

(b) Suppose $k_2=r$. [To show $V_{GMM\text{-}QLIML}=V_{QLIML}=V_{CF}$] The quasi-scores are
\[
\frac{\partial q_1(\theta_1,\theta_2)}{\partial\theta_1}
=\begin{bmatrix}
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,y_2'\\
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,z_1'\\
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,v_2'\\
\partial l\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)/\partial\lambda
\end{bmatrix}
\tag{A.2}
\]
\[
\frac{\partial q_1(\theta_1,\theta_2)}{\partial\theta_2}
=\begin{bmatrix}
-s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,\eta\otimes z'\\
0_{r(r+1)/2}
\end{bmatrix}
\tag{A.3}
\]
The QLIML and CF rank conditions imply that the $k_2\times r$ matrix $\delta_{o22}$ is required to have full column rank. Since $k_2=r$, $\delta_{o22}$ is an invertible matrix. Also, noting that
\[
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,y_2'
=s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,(z_1\delta_{21}+z_2\delta_{22}+v_2)',
\]
any moment function in $\frac{\partial q_1(\theta_1,\theta_2)}{\partial\theta_2}$ can be expressed as a linear combination of $\frac{\partial q_1(\theta_1,\theta_2)}{\partial\theta_1}$. Thus $\theta_{22}$ is empty and the result follows by Proposition 2.3.6 (b).

[To show $V_{MD\text{-}QLIML}=V_{GMM\text{-}QLIML}$] It suffices to show $p_1+p_{22}=g$. Let $\pi_3\in\mathbb{R}^l$. As shown above, $p_{22}=0$. Then $p_1=r+k_1+r+l$ and $g=k_1+k_2+r+l$. Since $k_2=r$, the result follows.

(c) Suppose $\eta_o\ne 0$. [To show $V_{MD\text{-}QLIML}=V_{GMM\text{-}QLIML}$] As in (b), $p_1=r+k_1+r+l$ and $g=k_1+k_2+r+l$. It suffices to show $p_1+p_{22}=g$ or, equivalently, $p_{22}=k_2-r$. The case $k_2=r$ was shown in (b). The case $k_2<r$ is ruled out by the order condition; i.e., $\delta_{o22}$ cannot have full column rank with $k_2<r$. Suppose $k_2>r$. By the rank condition of the reduced-form model, we know linear independence of the components of the score
\[
\frac{\partial q_1(\pi(\theta),\theta_2)}{\partial\pi}
=\begin{bmatrix}
s\big(y_{i1},\,z\pi_1+v_2(\delta_2)\pi_2,\,\pi_3\big)\,z'\\
s\big(y_{i1},\,z\pi_1+v_2(\delta_2)\pi_2,\,\pi_3\big)\,v_2'\\
\partial q_1/\partial\pi_3
\end{bmatrix}
=\begin{bmatrix}
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2(\delta_2)\eta,\,\lambda\big)\,z'\\
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2(\delta_2)\eta,\,\lambda\big)\,v_2'\\
\partial q_1/\partial\pi_3
\end{bmatrix}.
\]
Then, due to the explicit linear relationship $y_2=z_1\delta_{21}+z_2\delta_{22}+v_2$, a maximal linearly independent set in $\{sy_2',\,sz_1',\,sz_2',\,sv_2',\,\partial q_1/\partial\pi_3\}$ always contains $(k_1+k_2+r+l)$ elements. Hence, a maximal linearly independent set in $\{\partial q_1(\theta_o)/\partial\theta_1,\,\partial q_1(\theta_o)/\partial\theta_2\}$ contains $(k_1+k_2+r+l)$ elements whenever there exists at least one nonzero element in $\eta_o$. Since $\partial q_1(\theta_o)/\partial\theta_1$ contains $k_1+2r+l$ moment functions, it is implied that $p_{22}=k_2-r$.

[To show $V_{GMM\text{-}QLIML}\le V_{QLIML},V_{CF}$] The result follows from Proposition 2.3.7.

(d) Suppose $\eta_o=0$. In (A.3), we have $\partial q_1(\theta_1,\theta_2)/\partial\theta_2=0_{p_2\times 1}$ and $\theta_{22}$ is empty. Then $V_{GMM\text{-}QLIML}=V_{QLIML}=V_{CF}$ by Proposition 1.3.6 in Chapter 1.

(e) Let $M$ be the linear span generated by the mGMM-QLIML moment functions
\[
\begin{bmatrix}\partial q_1(\pi(\theta_o),\theta_2)/\partial\pi\\ \partial q_2(\theta_2)/\partial\theta_2\end{bmatrix}.
\tag{A.4}
\]
When $\eta_o=0$, $\theta_{22}$ is empty and $V_{GMM\text{-}QLIML}=V_{QLIML}=V_{CF}$ as shown in (d). Hence, the GMM-QLIML moment functions are
\[
\begin{bmatrix}\dfrac{\partial q_1(\theta_o)}{\partial\theta_1}\\[4pt] \dfrac{\partial q_2(\theta_{o2})}{\partial\theta_2}\end{bmatrix}
=\begin{bmatrix}\dfrac{\partial\pi(\theta_o)'}{\partial\theta_1} & 0\\ 0 & I_{p_2}\end{bmatrix}
\underbrace{\begin{bmatrix}\dfrac{\partial q_1(\pi(\theta_o),\theta_{o2})}{\partial\pi}\\[4pt] \dfrac{\partial q_2(\theta_{o2})}{\partial\theta_2}\end{bmatrix}}_{\text{mGMM-QLIML moments, }(g+p_2)\times 1},
\]
where the transformation matrix is of full column rank by Assumption 15. Hence, the GMM-QLIML moment functions form a linearly independent set in $M$. Also, since $\partial q_1(\theta_o)/\partial\theta_2=0$ when $\eta_o=0$, clearly we have
\[
\begin{bmatrix}\dfrac{\partial q_1(\theta_o)}{\partial\theta_1}\\[4pt] \dfrac{\partial q_2(\theta_{o2})}{\partial\theta_2}\end{bmatrix}
=\begin{bmatrix}\dfrac{\partial q_1(\theta_o)}{\partial\theta_1}\\[4pt] \dfrac{\partial q_1(\theta_o)}{\partial\theta_2}+\dfrac{\partial q_2(\theta_{o2})}{\partial\theta_2}\end{bmatrix},
\]
and it is implied that the QLIML moment functions are also linearly independent in $M$. Since we are assuming $k_2>r$, we have $g=k_1+k_2+r+l>r+k_1+r+l=p_1$. Thus, the dimension of $M$ is larger than the number of GMM-QLIML (or QLIML) moment functions, $p$. Relative efficiency of mGMM-QLIML is obvious. To find a condition for asymptotic equivalence, consider the QLIML moment functions
\[
\begin{bmatrix}\dfrac{\partial q_1(\theta)}{\partial\theta_1}\\[4pt] \dfrac{\partial q_1(\theta)}{\partial\theta_2}+\dfrac{\partial q_2(\theta_2)}{\partial\theta_2}\end{bmatrix}.
\]
By the replacement theorem (Thm 1.10, Friedberg et al., 2003), there exist $k_2-r$ elements of (A.4) with which the QLIML moment functions constitute a basis of $M$ at the true parameter values. Denote such $k_2-r$ elements as $\frac{\partial q_1(\pi(\theta),\theta_2)}{\partial\pi_\iota}$, where $\iota$ indexes a $(k_2-r)\times 1$ subvector of $\pi$. Then optimal GMM on
\[
\begin{bmatrix}
\partial q_1(\theta)/\partial\theta_1\\
\partial q_1(\theta)/\partial\theta_2+\partial q_2(\theta_2)/\partial\theta_2\\
\partial q_1(\pi(\theta),\theta_2)/\partial\pi_\iota
\end{bmatrix}
\tag{A.5}
\]
is asymptotically equivalent to mGMM-QLIML by reasoning similar to Lemma C.1. The equivalence condition follows by applying the BQSW redundancy condition to (A.5). To see the sufficiency of the GIMEs for the reduced-form model, note that the GIMEs
\[
V\!\left(\frac{\partial q_1^o}{\partial(\pi',\theta_2')'}\right)
=-E\!\left[\frac{\partial^2 q_1}{\partial(\pi',\theta_2')'\,\partial(\pi',\theta_2')}\right]\Bigg|_{(\pi',\theta_2')'=(\pi_o',\theta_{o2}')'}
\qquad\text{and}\qquad
V\!\left(\frac{\partial q_2^o}{\partial\theta_2}\right)
=-E\!\left[\frac{\partial^2 q_2}{\partial\theta_2\,\partial\theta_2'}\right]\Bigg|_{\theta_2=\theta_{o2}}
\]
imply
\[
\operatorname{cov}\!\left(\frac{\partial q_1^o}{\partial(\pi',\theta_2')'},\,\frac{\partial q_2^o}{\partial\theta_2}\right)=0
\]
and the GIMEs for the structural model. The result then follows by some algebra.

APPENDIX C

AN APPENDIX FOR CHAPTER 3

C.1 Appendix: Proofs

Many proof ideas and steps used in Theorems 3.4.7–3.4.15 are similar to Wang, Wu and Li (2012) and Sherwood and Wang (2016). In the following, $C$ denotes a constant that does not depend on $N$; it is allowed to take different values in different places.

C.1.1 Proof of Theorem 3.3.1

The model structure (or admissible structure) $S$ under consideration can be defined as follows. Denote $\varepsilon_{it}=y_{it}-w_{it}\tilde\beta-\tilde g(x_i,z_i)$. Let
\[
S=\Big\{(\tilde\beta,\tilde g,\tilde F_{\{\varepsilon_{it},x_{it}\}_{t=1}^T,z_i}):\;
\tilde\beta\in\mathbb{R}^{K_1+T-1},\;
\tilde g:\mathbb{R}^{K_1T+K_2}\to\mathbb{R}
\text{ is measurable},\;
\tilde F_{\{\varepsilon_{it},x_{it}\}_{t=1}^T,z_i}\text{ is a distribution function on }\mathbb{R}^{K_1T+K_2+T}
\]
such that $Q_\tau(\varepsilon_{it}\mid x_i,z_i)=0$ for each $t$, and the support condition on $(w_{it})_{t=1}^T$ given in the premise is satisfied$\Big\}$.

Suppose $(\beta^*,g^*,F^*_{\{\varepsilon_{it},x_{it}\}_{t=1}^T,z_i})\in S$. Then, for each $t=2,\dots,T$, we can write
\[
Q_\tau\big[y_{it}\mid x_i,z_i;\,\beta^*,g^*,F^*\big]=w_{it}\beta^*+g^*(x_i,z_i)
\tag{A.1}
\]
\[
Q_\tau\big[y_{i(t-1)}\mid x_i,z_i;\,\beta^*,g^*,F^*\big]=w_{i(t-1)}\beta^*+g^*(x_i,z_i)
\tag{A.2}
\]
Note that the conditional quantiles of $y_{it}$ and $y_{i(t-1)}$ are unique. By taking differences across time periods, we have
\[
Q_\tau\big[y_{it}\mid x_i,z_i\big]-Q_\tau\big[y_{i(t-1)}\mid x_i,z_i\big]
=\big(w_{it}-w_{i(t-1)}\big)\beta^*.
\tag{A.3}
\]
The full rank condition on (3.11) will be shown to imply point identification of $\beta^*$. First, note that there exists a square invertible submatrix of (3.11); let this matrix be $\bar W$. By continuity of the matrix determinant, there exists a neighborhood around $(\tilde x_{it}^{(j)},\tilde x_{i(t-1)}^{(j)})_{j=1}^J$ each of whose elements, along with $(\dot x_{it}^{(j)},\dot x_{i(t-1)}^{(j)})_{j=1}^J$ and the time dummies, constitutes a perturbed version of $\bar W$ that is still invertible. Since
\[
f_{(\tilde x_{it},\tilde x_{i(t-1)})\mid(\dot x_{it},\dot x_{i(t-1)})}
\big(\tilde x_{it}^{(j)},\tilde x_{i(t-1)}^{(j)}\mid\dot x_{it}^{(j)},\dot x_{i(t-1)}^{(j)}\big)>0\quad\forall j,
\]
and it is continuously extendable, the probability of observing such a collection of support points is positive. (Equivalently, a change in $\beta^*$ implies a nontrivial change in $F_{\{y_{it},x_{it}\}_{t=1}^T,z_i}$.) Hence the proof.

C.1.2 Proof of Theorems 3.4.3 and 3.4.4

Define $\theta_A=(\beta',\gamma_A')'$, $\mathring w_{it}=\frac{1}{\sqrt N}\tilde w_{it}^A$, and $\Psi(e)=\big(\Psi(e_1)',\dots,\Psi(e_N)'\big)'$. Also, $\|A\|=\sqrt{\lambda_{\max}(A'A)}$ denotes the spectral norm for a matrix $A$, $\|v\|$ denotes the Euclidean norm for a vector $v$, and write $E_w(\cdot)=E(\cdot\mid x_i,z_i)$ and $P_w(\cdot)=P(\cdot\mid x_i,z_i)$.
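The objective functions in this subsection are built from the quantile-regression check function $\rho_\tau(u)=u(\tau-1[u<0])$. As a quick, self-contained illustration (not part of the dissertation), the $\tau$-th sample quantile minimizes the empirical check-function loss:

```python
import numpy as np

def rho(u, tau):
    """Check function rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (u < 0))

rng = np.random.default_rng(3)
y = rng.normal(size=1000)
tau = 0.25

# Minimize the empirical check loss over a fine grid of candidate quantiles.
grid = np.linspace(-3, 3, 2001)
losses = np.array([rho(y - q, tau).sum() for q in grid])
q_star = grid[losses.argmin()]

# The minimizer matches the tau-th sample quantile (up to grid/interpolation error).
assert abs(q_star - np.quantile(y, tau)) < 0.05
```

The same loss, evaluated at $y_{it}-\tilde w_{it}^A\theta_A$, is what the oracle estimator below minimizes.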
Consider the following reparameterized objective function:
\[
\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\big(e_{it}-\mathring w_{it}\delta\big)
=\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\left(y_{it}-\tilde w_{it}^A\Big(\theta_{oA}+\frac{1}{\sqrt N}\delta\Big)\right).
\]
Let $\hat\delta$ be the reparameterized oracle estimator
\[
\hat\delta=\arg\min_\delta\;\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\big(e_{it}-\mathring w_{it}\delta\big),
\]
where $\hat\delta=\sqrt N\big(\hat\theta_A-\theta_{oA}\big)$ holds. Its Bahadur representation can be written as
\[
\tilde\delta=\Big(\tfrac1N\tilde W_A'B_N\tilde W_A\Big)^{-1}\frac{1}{\sqrt N}\tilde W_A'\Psi(\varepsilon).
\]

Lemma C.1.1 (i) If Assumptions 1–4 hold, then $\|\tilde\delta\|=O_p(\sqrt{q_N})$. (ii) If Assumptions 1–5 hold, then $G_N\Sigma_N^{-1/2}\tilde\delta\xrightarrow{d}N(0,G)$.

Proof. (i) Since
\[
\lambda_{\max}\Big(\tfrac1N\tilde W_A'B_N\tilde W_A\Big)
\le\lambda_{\max}(B_N)\,\lambda_{\max}\Big(\tfrac1N\tilde W_A'\tilde W_A\Big)
\]
is bounded above, and similarly $\lambda_{\min}\big(\tfrac1N\tilde W_A'B_N\tilde W_A\big)$ can be shown to be bounded below by some positive constant, we have
\[
\|\tilde\delta\|
=\left\|\Big(\tfrac1N\tilde W_A'B_N\tilde W_A\Big)^{-1}\frac{1}{\sqrt N}\tilde W_A'\Psi(\varepsilon)\right\|
\le C\left\|\frac{1}{\sqrt N}\tilde W_A'\Psi(\varepsilon)\right\|
=O_p(\sqrt{q_N}).
\]
(ii) Write
\[
G_N\Sigma_N^{-1/2}\tilde\delta
=G_N\Sigma_N^{-1/2}K_N^{-1}N^{-1/2}\tilde W_A'\Psi(\varepsilon),
\qquad
D_{Nit}\equiv G_N\Sigma_N^{-1/2}K_N^{-1}N^{-1/2}\tilde w_{it}^{A\prime}.
\]
G GN N N 110 0 5 GN To check Lindeberg-Feller condition, fix " > 0: By Assumption 3, 4, and 5, 2 33 4 2 2 N T N T T X X X 6 X X 1 7 E DN i t E4 DN i t 1 4 DN i t > "55 Ä 2 " i D1 Ä Ä t D1 1 N 2 "2 C N 2 "2 N X T X T ˇ X ˇ @ E ˇ ."i t / tD1 t 0 D1 0 E@ T X T X tD1 t 0 D1 i D1 12 1=2 1 QA w i t KN †N tD1 1=2 0 G † GN N N N T X C X 1=2 1 Q A0 Ä 2 2 w E GN †N KN it0 N " 0 i D1 t D1 0 1 4 N T X C B1 X C QA Ä E w A D Op @ i t 2 N N" i D1 t D1 12 ˇ ˇA 1=2 0 1=2 1 Q A0 QA GN GN †N KN1 w "i t 0 w i t KN †N it0ˇ 0 i D1 N X i D1 tD1 4 A Q A0 KN1 w it0 N T 4 X C X 1=2 4 1 Q A0 KN w Ä 2 2 E kGN k †N it0 N " 0 i D1 2 qN N 4 t D1 ! D op .1/ 0 G 0 where the last inequality follows from max GN N D max GN GN ! c as N ! 1: Lemma C.1.2 (i) Assume Assumption 1–6. Then, for any finite constant M; ˇ ˇ ˇ ˇN i h Ái 1 h ˇ ˇX 0 0 ı KN ı ıQ KN ıQ .1 C o .1//ˇˇ D op .1/ sup ˇˇ Ew QQ i ı; ıQ 2 ˇ ˇ Q ı ı ÄM i D1 Á P h where QQ i ı; ıQ D TtD1 .ei t L i t ı/ w ei t L i t ıQ w Ái Á Á L i t ık D O N 1=2 qN L i t ıQ D Op N 1=2 qN by Lemma C.1.1 (i), and kw Proof. Note that w by construction. Also, we have .ei t L i t ı/ D w ."i t L i t ı/ by Assumption 6. Then, w applying Knight’s identity, N X h Ew QQ i ı; ıQ Ái D i D1 D Ew h ."i t L i t ı/ w "i t L i t ıQ w Ái iD1 tD1 N X T Z w L it ı X L i t ıQ i D1 tD1 w 1h 0 D ı KN ı 2 N X T X .Fi t .s/ Ä N T 1 XX L i t ı/2 Fi t .0// ds D fi t .0/ .w 2 i D1 tD1 Q0 ı KN ıQ i 1 C op .1/ 111 L i t ıQ w Á2 1 C op .1/ L i t ıQ where the third inequality is followed from w L i t ık D o .1/ : Hence the D op .1/ and kw result. Lemma C.1.3 Assume Assumption 1–6. Then, for any given positive constant M; ˇ ˇ ˇX Áˇˇ ˇN Ai ı; ıQ ˇˇ D op .1/ sup ˇˇ ˇ ı ıQ ÄM ˇi D1 Á Á Q Q Q where Ai ı; ı D Qi ı; ı h Á i P Á T Q Q Q L it ı ı .ei t / : E Qi ı; ı jxi ; zi C tD1 w q q L Proof. By Assumption 4, max kwi t k Ä ˛1 NN for some positive constant ˛1 : (Fn1 in Sherwood i;t and Wang (2016) has probability one here.) 
It suffices to show that for all $\epsilon>0$,
\[
P\left(\sup_{\|\delta-\tilde\delta\|\le M}\left|\sum_{i=1}^N A_i(\delta,\tilde\delta)\right|>\epsilon\right)\to 0.
\]
Let $\bar\Theta=\{\delta\in\mathbb{R}^{q_N}:\|\delta-\tilde\delta\|\le M\}$. We can partition $\bar\Theta$ into disjoint sets $\bar\Theta_1,\dots,\bar\Theta_{D_N}$ such that the diameter of each set does not exceed
\[
m_0=\frac{\epsilon}{10\,T\alpha_1\sqrt{Nq_N}},
\]
and the cardinality of the partition satisfies $D_N\le\big(C\sqrt{Nq_N}/\epsilon\big)^{q_N}$ (for example, by an argument similar to Lemma 5.2 of Vershynin, 2011). Pick an arbitrary $\delta_d\in\bar\Theta_d$ for $1\le d\le D_N$. Then
\[
P\left(\sup_{\|\delta-\tilde\delta\|\le M}\left|\sum_{i=1}^N A_i(\delta,\tilde\delta)\right|>\epsilon\right)
\le\sum_{d=1}^{D_N}P\left(\left|\sum_{i=1}^N A_i(\delta_d,\tilde\delta)\right|
+\sup_{\delta\in\bar\Theta_d}\left|\sum_{i=1}^N\big[A_i(\delta,\tilde\delta)-A_i(\delta_d,\tilde\delta)\big]\right|>\epsilon\right).
\]
Since $u\,1[u<0]=\frac12u-\frac12|u|$, we have
\[
\big|\tilde Q_i(\delta,\tilde\delta)-\tilde Q_i(\delta_d,\tilde\delta)\big|
=\left|\sum_{t=1}^T\big[\rho_\tau(e_{it}-\mathring w_{it}\delta)-\rho_\tau(e_{it}-\mathring w_{it}\delta_d)\big]\right|
\le 2T\max_{i,t}\|\mathring w_{it}\|\,\sup_{\delta\in\bar\Theta_d}\|\delta-\delta_d\|.
\]
Thus
\begin{align*}
\sup_{\delta\in\bar\Theta_d}\left|\sum_{i=1}^N\big[A_i(\delta,\tilde\delta)-A_i(\delta_d,\tilde\delta)\big]\right|
&\le 5NT\max_{i,t}\|\mathring w_{it}\|\,\sup_{\delta\in\bar\Theta_d}\|\delta-\delta_d\|
\le 5NT\alpha_1\sqrt{\frac{q_N}{N}}\,m_0=\frac{\epsilon}{2}.
\end{align*}
Therefore, it now suffices to show that $\sum_{d=1}^{D_N}P_w\big(\big|\sum_{i=1}^N A_i(\delta_d,\tilde\delta)\big|>\epsilon/2\big)$ has a vanishing upper bound that does not depend on $(x_i,z_i)$. Bernstein's inequality is used.
To evaluate the maximum, using $\rho_\tau(u)=\big(\tau-\frac12\big)u+\frac12|u|$, we can write
\[
\max_{i,d}\big|A_i(\delta_d,\tilde\delta)\big|
\le 3T\max_{i,t}\|\mathring w_{it}\|\,\max_d\|\delta_d-\tilde\delta\|
\le C\sqrt{\frac{q_N}{N}}.
\]
To evaluate the variance, applying Knight's identity, we have
\[
\tilde Q_i(\delta_d,\tilde\delta)+\sum_{t=1}^T\mathring w_{it}\big(\delta_d-\tilde\delta\big)\psi_\tau(e_{it})
=\sum_{t=1}^T\int_{\mathring w_{it}\tilde\delta}^{\mathring w_{it}\delta_d}\big(1[e_{it}<s]-1[e_{it}<0]\big)\,ds.
\]
Thus
\[
A_i(\delta_d,\tilde\delta)
=\sum_{t=1}^T\int_{\mathring w_{it}\tilde\delta}^{\mathring w_{it}\delta_d}\big(1[e_{it}<s]-1[e_{it}<0]\big)\,ds
-E_w\left[\sum_{t=1}^T\int_{\mathring w_{it}\tilde\delta}^{\mathring w_{it}\delta_d}\big(1[e_{it}<s]-1[e_{it}<0]\big)\,ds\right],
\]
and it implies
\begin{align*}
\sum_{i=1}^N\operatorname{Var}\big(A_i(\delta_d,\tilde\delta)\mid x_i,z_i\big)
&\le\sum_{i=1}^N E_w\left[\Big(\sum_{t=1}^T\int_{\mathring w_{it}\tilde\delta}^{\mathring w_{it}\delta_d}\big(1[e_{it}<s]-1[e_{it}<0]\big)\,ds\Big)^2\right]\\
&\le C\sqrt{\frac{q_N}{N}}\sum_{i=1}^N\sum_{t=1}^T E_w\left[\max_{i,t}\big|\mathring w_{it}\delta_d-\mathring w_{it}\tilde\delta\big|\int_{\mathring w_{it}\tilde\delta}^{\mathring w_{it}\delta_d}\big(F_{it}(s)-F_{it}(0)\big)\,ds\right]
\le C\sqrt{\frac{q_N}{N}}\,(1+o(1))
\end{align*}
by a similar argument as in Lemma C.1.2. Then, by Bernstein's inequality and Assumption 5,
\[
\sum_{d=1}^{D_N}P_w\left(\left|\sum_{i=1}^N A_i(\delta_d,\tilde\delta)\right|>\frac{\epsilon}{2}\right)
\le 2\sum_{d=1}^{D_N}\exp\left(-\frac{\epsilon^2/4}{C\sqrt{q_N/N}+C\epsilon\sqrt{q_N/N}}\right)
\le C\exp\left(Cq_N\log N-C\sqrt{\frac{N}{q_N}}\right)\to 0.
\]
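Knight's identity, used repeatedly in these proofs, states that $\rho_\tau(u-v)-\rho_\tau(u)=-v\,\psi_\tau(u)+\int_0^v\big(1[u\le s]-1[u\le 0]\big)\,ds$ with $\psi_\tau(u)=\tau-1[u<0]$. A minimal numerical check (illustrative only; the midpoint-rule integrator is an assumption of the sketch, not part of the dissertation):

```python
import numpy as np

def rho(u, tau):
    """Check function rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (u < 0))

def psi(u, tau):
    """Directional derivative term psi_tau(u) = tau - 1[u < 0]."""
    return tau - (u < 0)

def knight_rhs(u, v, tau, n=200000):
    """Right-hand side of Knight's identity; integral by the midpoint rule."""
    s = (np.arange(n) + 0.5) * (v / n)          # midpoints from 0 to v (signed)
    integral = ((u <= s).astype(float) - (u <= 0)).sum() * (v / n)
    return -v * psi(u, tau) + integral

tau = 0.3
for u, v in [(1.3, 0.4), (-0.7, 1.1), (0.2, -0.9), (-0.1, -0.5)]:
    lhs = rho(u - v, tau) - rho(u, tau)
    assert abs(lhs - knight_rhs(u, v, tau)) < 1e-3
```

The identity splits the change in check loss into a linear score term and a nonnegative "curvature" integral, which is exactly how the quadratic approximations in Lemmas C.1.2 and C.1.3 arise.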
Lemma C.1.4 (asymptotic equivalence with the Bahadur representation) Assume Assumptions 1–6. Then we have $\tilde\delta-\hat\delta=o_p(1)$.

Proof. It suffices to show that, for any positive constant $M$,
\[
P\left(\inf_{\|\delta-\tilde\delta\|\ge M}\sum_{i=1}^N\tilde Q_i(\delta,\tilde\delta)>0\right)\to 1,
\]
since we have $\sum_{i=1}^N\tilde Q_i(\hat\delta,\tilde\delta)\le 0$. By Lemma C.1.3,
\[
\sup_{\|\delta-\tilde\delta\|\le M}
\left|\sum_{i=1}^N\Big[\tilde Q_i(\delta,\tilde\delta)-E_w\big[\tilde Q_i(\delta,\tilde\delta)\big]\Big]
+\sum_{i=1}^N\sum_{t=1}^T\mathring w_{it}\big(\delta-\tilde\delta\big)\psi_\tau(e_{it})\right|=o_p(1).
\]
Then, by Lemma C.1.2,
\[
\sup_{\|\delta-\tilde\delta\|\le M}
\left|\sum_{i=1}^N\tilde Q_i(\delta,\tilde\delta)
-\frac12\Big[\delta'K_N\delta-\tilde\delta'K_N\tilde\delta\Big]\big(1+o_p(1)\big)
+\sum_{i=1}^N\sum_{t=1}^T\mathring w_{it}\big(\delta-\tilde\delta\big)\psi_\tau(e_{it})\right|=o_p(1).
\tag{A.5}
\]
And since
\[
\tilde\delta=\Big(\tfrac1N\tilde W_A'B_N\tilde W_A\Big)^{-1}\frac{1}{\sqrt N}\tilde W_A'\Psi(e)
=K_N^{-1}\frac{1}{\sqrt N}\tilde W_A'\Psi(e),
\]
we have
\[
\sum_{i=1}^N\sum_{t=1}^T\mathring w_{it}\big(\delta-\tilde\delta\big)\psi_\tau(e_{it})
=\big(\delta-\tilde\delta\big)'\frac{1}{\sqrt N}\tilde W_A'\Psi(e)
=\big(\delta-\tilde\delta\big)'K_N\tilde\delta.
\tag{A.6}
\]
Combining (A.5) and (A.6), we have
\[
\sup_{\|\delta-\tilde\delta\|\le M}
\left|\sum_{i=1}^N\tilde Q_i(\delta,\tilde\delta)
-\frac12\Big[\delta'K_N\delta-\tilde\delta'K_N\tilde\delta\Big]\big(1+o_p(1)\big)
+\big(\delta-\tilde\delta\big)'K_N\tilde\delta\right|=o_p(1),
\]
which implies
\[
\sup_{\|\delta-\tilde\delta\|\le M}
\left|\sum_{i=1}^N\tilde Q_i(\delta,\tilde\delta)
-\frac12\big(\delta-\tilde\delta\big)'K_N\big(\delta-\tilde\delta\big)\right|=o_p(1).
\]
Then, for any finite constant M1 and M2 ; ˇ ˇ ˇN ˇ h Ái h i X ˇ ˇ 1 0 0 ˇ R N ı a ıQ a K R N ıQ a .1 C o .1//ˇ Ew QQ i ı a ; ıQ a ; ıb ıa K sup ˇ ˇ 2 p ˇ ı a ıQ a ÄM1 ;kıb kÄM2 qN ˇi D1 is op .1/ where T Á X Q Q Qi ı a ; ı a ; ıb D Œ ."i t C ri R ait ı a w R a0 ;w iT ; R a0 ;w NT wR ibt ıb / t D1 RaD w R a0 W 11 ; RN DW R 0a BN W Ra K 117 0 ."i t C ri R ait ıQ a w wR ibt ıb / Á L i t ıQ D Op N 1=2 qN is implied by Lemma C.1.1 (i), and kw L i t ık D Proof. Note that w Á Á R ait ı a D R ait ıQ a D Op N 1=2 qN and w O N 1=2 qN by construction. Similarly, we have w Á Op N 1=2 qN : Then, applying Knight’s identity, N X h Ái Ew QQ i ı a ; ıQ a ; ıb iD1 D D D D D N X T X Ew Œ ."i t C ri R ait ı a w wR b ıb / ."i t C ri R ait ıQ a w wR ibt ıb / i D1 tD1 N X T Z w R a ı a CwR b ıb ri X it .Fi t .s/ Fi t .0// ds a ıQ CwR ı r R w a b b i i D1 tD1 it Ä N T Á2 Á2 1 XX R ait ı a C wR ibt ıb ri R ait ıQ a C wR ibt ıb ri w 1 C op .1/ fi t .0/ w 2 iD1 tD1 N T h i 1 XX R ait ı a /2 .w R ait ıQ a /2 C 2.wR ibt ıb ri /w R ait .ı a ıQ a / 1 C op .1/ fi t .0/ .w 2 iD1 tD1 N X T i X 1h 0 R 0 R Q 0 Q Q R a0 .ı a ı a / ı KN ı a ı a KN ı a 1 C op .1/ fi t .0/ ri w it 2 a i D1 tD1 Note that N X T X R ait .ı a fi t .0/ ri w ıQ a / iD1 tD1 Ä N T Á 1 1 XX Q a .ı a ıQ a / Q ait WAb0 BN W Dp fi t .0/ ri w WAb0 BN WAb A N iD1 tD1 Á 1 1 a b0 b R a C op .1/ where  R a denotes the Q a  D p1  p Q and that Œw WA BN WA WAb0 BN W it A N it N it R a is well-defined by Assumption 5. To show population projection error. 
 it N T 1 XX R ait .ı a sup fi t .0/ ri w p N i D1 tD1 p ı a ıQ a ÄM1 ;kıb kÄM2 qN , note that by Markov inequality, we have ˇ 0ˇ ˇ ˇ N T X X ˇ 1 ˇ a R .ı a ıQ a /ˇ fi t .0/ ri  P @ˇˇ p it ˇ ˇ N i D1 tD1 ˇ 1 "A Ä 118 ıQ a / D op .1/ h PN PT Ra E p1 i D1 tD1 fi t .0/ ri i t .ı a N "2 i2 ıQ a / where 2 N X T X 32 1 R ait .ı a ıQ a /5 E 4p fi t .0/ ri w N iD1 tD1 3 0 2 2 N X N T T X 1 XX 1 a Q 5 @ 4 4 R i t .ı a ı a / C E p R ait .ı a fi t .0/ ri w fi t .0/ ri w DV p N i D1 tD1 N i D1 tD1 The first part: 2 N T 1 XX 4 R ait .ı a V p fi t .0/ ri w N iD1 tD1 3 2 ıQ a /5 D V 4 T X 312 ıQ a /5A 3 R ait .ı a fi t .0/ ri w ıQ a /5 t D1 ÄC ÄC ÄC T X tD1 T X tD1 T X h R ait .ı a V fi t .0/ ri w i ıQ a / h R ait .ı a E fi t .0/ ri w i2 ıQ a / h R ait .ı a E fi t .0/ w i2 ıQ a / .sup jri j/2 tD1 ! 0; The second part: ˇ 2 ˇ N X T X ˇ ˇE 4 p1 R ait .ı a fi t .0/ ri w ˇ N ˇ i D1 t D1 3ˇ ˇ ˇ ˇ T h ˇ ˇp X Qı a /5ˇ D ˇ N R ait .ı a E fi t .0/ ri w ˇ ˇ ˇ ˇ tD1 Ä p N T X ˇ ˇ R ait .ı a E ˇfi t .0/ ri w ˇ iˇˇ ıQ a / ˇˇ ˇ ˇ ˇ ıQ a /ˇ t D1 Ä p N sup jri j T X ˇ ˇ R ait .ı a E ˇfi t .0/ w tD1 Hence the result. Lemma C.1.6 Assume Assumption 1–5, 7 and 8. Then, for any positive constant L; ˇ ˇ ˇN ˇ X ˇ ˇ p 1 qN sup ˇˇ Di .ı ab ; qN /ˇˇ D op .1/ ˇ kı ab kÄL ˇi D1 119 ˇ ˇ ıQ a /ˇ where Qi .qN / D T X R ait ı a qN w ."i t C ri qN wR ibt ıb / tD1 Di .ı; qN / D Qi .qN / Qi .0/ R ait ı a C wR b ıb C qN w Ew ŒQi .qN / Qi .0/ ."i t / Proof. It suffices to show for all " > 0; ˇ ˇ 1 0 ˇ ˇX N ˇ ˇ p P @qN1 sup ˇˇ Di .ı ab ; L qN /ˇˇ > "A ! 
First, consider a constant $\alpha_1$ such that $\max\!\left(\left\|\check w^a_{it}\right\|,\left\|\check w^b_{it}\right\|\right)\le\alpha_1\left(N^{-1}q_N\right)^{1/2}$. Partition $B=\{\delta:\|\delta\|\le1\}$ into $M_N$ disjoint sets $B_1,\dots,B_{M_N}$, each of diameter less than $m_0=\frac{\varepsilon}{4\alpha_1L\sqrt N}$, where $M_N\le\left(C+C\frac{\sqrt N}{\varepsilon}\right)^{q_N+1}$. Let $d_m=\left(d_m^a,d_m^b\right)\in B_m$ for $1\le m\le M_N$. Then
$$P\!\left(q_N^{-1}\sup_{\delta_{ab}\in B}\left|\sum_{i=1}^N D_i\!\left(\delta_{ab},L\sqrt{q_N}\right)\right|>\varepsilon\right)\le\sum_{m=1}^{M_N}P\!\left(\left|\sum_{i=1}^N D_i\!\left(d_m,L\sqrt{q_N}\right)\right|+\sup_{\delta_{ab}\in B_m}\left|\sum_{i=1}^N\left[D_i\!\left(\delta_{ab},L\sqrt{q_N}\right)-D_i\!\left(d_m,L\sqrt{q_N}\right)\right]\right|>\varepsilon q_N\right).$$
Since $\rho_\tau(u)=\left(\tau-\frac12\right)u+\frac12|u|$, each difference of check-function terms is Lipschitz in the index, so that, deterministically,
$$\sup_{\delta_{ab}\in B_m}\left|\sum_{i=1}^N\left[D_i\!\left(\delta_{ab},L\sqrt{q_N}\right)-D_i\!\left(d_m,L\sqrt{q_N}\right)\right]\right|\le2NLm_0\sqrt{q_N}\max_{i,t}\left(\left\|\check w^a_{it}\right\|,\left\|\check w^b_{it}\right\|\right)\le2\alpha_1L\sqrt N\,q_N\,m_0=\frac{\varepsilon q_N}2.$$
Now, it suffices to show
$$\sum_{m=1}^{M_N}P\!\left(\left|\sum_{i=1}^N D_i\!\left(d_m,L\sqrt{q_N}\right)\right|>\frac{\varepsilon q_N}2\right)\to0.$$
Bernstein's inequality will be used.
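The tail bound to be applied is the classical Bernstein inequality for independent, mean-zero variables bounded by $M$ with total variance $\sigma^2$: $P\!\left(\left|\sum_i X_i\right|>t\right)\le2\exp\!\left(-\frac{t^2/2}{\sigma^2+Mt/3}\right)$. A minimal Monte Carlo sketch with hypothetical bounded variables (not the $D_i$ of the proof), confirming that the bound dominates the empirical tail:

```python
import numpy as np

def bernstein_bound(t, var_total, M):
    # P(|sum X_i| > t) <= 2 exp(-(t^2/2) / (var_total + M*t/3))
    return 2.0 * np.exp(-0.5 * t**2 / (var_total + M * t / 3.0))

rng = np.random.default_rng(0)
n, reps, t = 100, 20000, 20.0
X = rng.uniform(-1.0, 1.0, size=(reps, n))   # mean 0, |X| <= 1, Var = 1/3 each
tail_freq = np.mean(np.abs(X.sum(axis=1)) > t)
bound = bernstein_bound(t, var_total=n / 3.0, M=1.0)
print(tail_freq <= bound)  # True: the empirical tail sits below the bound
```

The variance term in the denominator is what makes the exponent scale like $\sqrt{N/q_N}$ in the display that follows: the variance bound, not the worst-case range, drives the rate.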
To evaluate the maximum, note
$$\max_i\left|D_i\!\left(d_m,L\sqrt{q_N}\right)\right|\le\max_i\left|\rho_\tau\!\left(\varepsilon_{it}+r_i-L\sqrt{q_N}\,\check w^a_{it}\delta_a-L\sqrt{q_N}\,\check w^b_{it}\delta_b\right)-\rho_\tau\!\left(\varepsilon_{it}+r_i\right)\right|+\max_i\left|L\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)\psi_\tau(\varepsilon_{it})\right|\le2L\sqrt{q_N}\max_{i,t}\left(\left\|\check w^a_{it}\right\|,\left\|\check w^b_{it}\right\|\right)\le Cq_NN^{-1/2}.$$
To evaluate an upper bound of the variance, first note that, by Knight's identity,
$$Q_i\!\left(L\sqrt{q_N}\right)-Q_i(0)+L\sqrt{q_N}\sum_{t=1}^T\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)\psi_\tau(\varepsilon_{it})$$
$$=L\sqrt{q_N}\sum_{t=1}^T\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)\left[1\!\left(\varepsilon_{it}+r_i<0\right)-1\!\left(\varepsilon_{it}<0\right)\right]+\sum_{t=1}^T\int_0^{L\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)}\left[1\!\left(\varepsilon_{it}+r_i<s\right)-1\!\left(\varepsilon_{it}+r_i<0\right)\right]ds\equiv V_{i1}+V_{i2}.$$
Then, the second-order conditional moments can be bounded as
$$\sum_{i=1}^NE_w\!\left[V_{i1}^2\right]\le\sum_{i=1}^N2L^2q_N\max\!\left(\left\|\check w^a_{it}\right\|,\left\|\check w^b_{it}\right\|\right)^2E_w\!\left[1\!\left(0\le|\varepsilon_{it}|\le|r_i|\right)\right]\le Cq_N^2N^{-1}\sum_{i=1}^N\int_{-|r_i|}^{|r_i|}f_{it}(s)\,ds\le Cq_N^2\sqrt{\frac{q_N}N}$$
and
$$\sum_{i=1}^NE_w\!\left[V_{i2}^2\right]\le Cq_NN^{-1/2}\sum_{i=1}^N\int_0^{L\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)}\left[F_{it}(s-r_i)-F_{it}(-r_i)\right]ds\le Cq_N^2N^{-1/2}\left[\delta_a'\sum_{i=1}^Nf_{it}(0)\,\check w^{a\prime}_{it}\check w^a_{it}\,\delta_a+\sum_{i=1}^Nf_{it}(0)\left(\check w^b_{it}\delta_b\right)^2\right](1+o(1))\le Cq_N^2N^{-1/2}(1+o(1)).$$
Since the bounds do not depend on $\tilde w$, we have
$$\sum_{i=1}^N\mathrm{Var}\!\left(D_i\!\left(d_m,L\sqrt{q_N}\right)\right)\le Cq_N^2\sqrt{\frac{q_N}N}.$$
Then, Bernstein's inequality implies
$$\sum_{m=1}^{M_N}P_s\!\left(\left|\sum_{i=1}^ND_i\!\left(d_m,L\sqrt{q_N}\right)\right|>\frac{q_N\varepsilon}2\right)\le2\sum_{m=1}^{M_N}\exp\!\left(-\frac{q_N^2\varepsilon^2/4}{Cq_N^2\sqrt{q_N/N}+C\varepsilon q_N^2N^{-1/2}}\right)\le2M_N\exp\!\left(-C\sqrt{\frac N{q_N}}\right)\le C\exp\!\left(C(q_N+1)\log N-C\sqrt{\frac N{q_N}}\right)\to0.$$
Lemma C.1.7 Assume Assumptions 1–5, 7 and 8. Then, for all $\eta>0$, there exists an $L>0$ such that
$$P\!\left(\inf_{\|\delta_{ab}\|=L}q_N^{-1}\sum_{i=1}^N\left(Q_i\!\left(\sqrt{q_N}\right)-Q_i(0)\right)>0\right)\ge1-\eta.$$
Proof.
Consider
$$q_N^{-1}\sum_{i=1}^N\left(Q_i\!\left(\sqrt{q_N}\right)-Q_i(0)\right)=q_N^{-1}\sum_{i=1}^ND_i\!\left(\delta_{ab},\sqrt{q_N}\right)+q_N^{-1}\sum_{i=1}^NE_w\!\left[Q_i\!\left(\sqrt{q_N}\right)-Q_i(0)\right]-q_N^{-1/2}\sum_{i=1}^N\sum_{t=1}^T\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)\psi_\tau(\varepsilon_{it})\equiv G_{N1}+G_{N2}+G_{N3}.$$
By Lemma C.1.6, we have $\sup_{\|\delta_{ab}\|\le L}|G_{N1}|=o_p(1)$. Also, note that $E\!\left[G_{N3}\right]=0$ and that
$$E\!\left[G_{N3}^2\right]\le Cq_N^{-1}E\!\left[\delta_a'\check W_a'\check W_a\delta_a+\delta_b^2\,\check W_b'\check W_b\right]=O\!\left(q_N^{-1}\right)\|\delta_{ab}\|^2.$$
Thus, $G_{N3}=O_p\!\left(q_N^{-1/2}\right)\|\delta_{ab}\|$. By applying Knight's identity,
$$G_{N2}=q_N^{-1}\sum_{i,t}E_w\!\left[\int_{-r_i}^{\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)-r_i}\left[1\!\left(\varepsilon_{it}<s\right)-1\!\left(\varepsilon_{it}<0\right)\right]ds\right]=q_N^{-1}\sum_{i,t}f_{it}(0)\int_{-r_i}^{\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)-r_i}s\,ds\,(1+o(1))$$
$$=\frac1{2q_N}\sum_{i,t}f_{it}(0)\left[q_N\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)^2-2r_i\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)\right](1+o(1))$$
$$\ge C\,\delta_a'\check W_a'B_N\check W_a\,\delta_a\,(1+o(1))+C\delta_b^2\,(1+o(1))-Cq_N^{-1/2}\sum_{i,t}f_{it}(0)\,r_i\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right).$$
Note that there exists a constant $M$ such that $\delta_a'\check W_a'B_N\check W_a\,\delta_a+\delta_b^2\ge M\|\delta_{ab}\|^2$. Let $R_N=(r_1,\dots,r_N)'\in\mathbb R^N$; then $\|R_N\|=O\!\left(\sqrt{q_N}\right)$. By the Cauchy–Schwarz inequality,
$$q_N^{-1/2}\sum_{i,t}f_{it}(0)\,r_i\,\check w^a_{it}\delta_a=q_N^{-1/2}\,\delta_a'\check W_a'B_NR_N\le q_N^{-1/2}\left\|\delta_a'\check W_a'\right\|\left\|B_NR_N\right\|=O_p\!\left(q_N^{-1/2}q_N^{1/2}\right)\|\delta_a\|=O_p\!\left(\|\delta_{ab}\|\right).$$
Similarly,
$$q_N^{-1/2}\sum_{i,t}f_{it}(0)\,r_i\,\check w^b_{it}\delta_b=q_N^{-1/2}\,\delta_b\check W_b'B_NR_N\le q_N^{-1/2}\left\|\delta_b\check W_b'B_N^{1/2}\right\|\left\|B_N^{1/2}R_N\right\|=O_p\!\left(\|\delta_{ab}\|\right).$$
Therefore, for $L$ sufficiently large, $q_N^{-1}\sum_{i=1}^N\left(Q_i\!\left(\sqrt{q_N}\right)-Q_i(0)\right)$ has asymptotically a positive lower bound $cL^2$ on $\|\delta_{ab}\|=L$: the quadratic term dominates the terms that are linear in $\|\delta_{ab}\|$.

Lemma C.1.8 Assume Assumptions 1–5, 7 and 8.
Then, for any given positive constants $M_1$ and $M_2$,
$$\sup_{\|\delta_a-\tilde\delta_a\|\le M_1,\ \|\delta_b\|\le M_2\sqrt{q_N}}\left|\sum_{i=1}^NA_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\right|=o_p(1),$$
where $A_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)=\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-E\!\left[\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\,\middle|\,x_i,z_i\right]+\sum_{t=1}^T\check w^a_{it}\left(\delta_a-\tilde\delta_a\right)\psi_\tau(\varepsilon_{it})$.

Proof. By Assumption 4, $\max_{i,t}\left\|\check w_{it}\right\|\le\alpha_1\sqrt{\frac{q_N}N}$ for some positive constant $\alpha_1$. By Assumption 8 (i), $\max_i|r_i|\le\alpha_2\sqrt{\frac{q_N}N}$ for some positive constant $\alpha_2$. It suffices to show, for all $\varepsilon>0$,
$$P\!\left(\sup_{\|\delta_a-\tilde\delta_a\|\le M_1,\ \|\delta_b\|\le M_2\sqrt{q_N}}\left|\sum_{i=1}^NA_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\right|>\varepsilon\right)\to0.$$
Let
$$\bar\Theta^a=\left\{\delta_a\in\mathbb R^{q_N}:\left\|\delta_a-\tilde\delta_a\right\|\le M_1\right\},\qquad\bar\Theta^b=\left\{\delta_b\in\mathbb R:|\delta_b|\le M_2\sqrt{q_N}\right\}.$$
We can partition $\bar\Theta^a$ ($\bar\Theta^b$) into disjoint sets $\bar\Theta^a_1,\dots,\bar\Theta^a_{D^a_N}$ ($\bar\Theta^b_1,\dots,\bar\Theta^b_{D^b_N}$) such that the diameter of each set does not exceed $m_0=\frac{\varepsilon}{10T\alpha_1\sqrt{Nq_N}}$ and the cardinalities of the partitions satisfy $D^a_N\le\left(\frac{C\sqrt{Nq_N}}{2\varepsilon}\right)^{q_N}$ and $D^b_N\le\frac{C\sqrt{Nq_N}}{2\varepsilon}$ (for example, by a similar argument to that used in Lemma 5.2 of Vershynin, 2011). Pick arbitrary $\delta^k_a\in\bar\Theta^a_k$ for $1\le k\le D^a_N$ and $\delta^l_b\in\bar\Theta^b_l$ for $1\le l\le D^b_N$. Then
$$P\!\left(\sup\left|\sum_{i=1}^NA_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\right|>\varepsilon\right)\le\sum_{k=1}^{D^a_N}\sum_{l=1}^{D^b_N}P\!\left(\left|\sum_{i=1}^NA_i\!\left(\delta^k_a,\tilde\delta_a,\delta^l_b\right)\right|+\sup_{\delta_a\in\bar\Theta^a_k,\ \delta_b\in\bar\Theta^b_l}\left|\sum_{i=1}^N\left[A_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-A_i\!\left(\delta^k_a,\tilde\delta_a,\delta^l_b\right)\right]\right|>\varepsilon\right).$$
Since $\rho_\tau(u)=\left(\tau-\frac12\right)u+\frac12|u|$, we have
$$\left|\tilde Q_i\!\left(\delta^k_a,\tilde\delta_a,\delta^l_b\right)-\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\right|\le2T\max_{i,t}\left\|\check w_{it}\right\|\sup_{\delta_a\in\bar\Theta^a_k,\ \delta_b\in\bar\Theta^b_l}\left[\left\|\delta_a-\delta^k_a\right\|+\left|\delta_b-\delta^l_b\right|\right].$$
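The partition of $\bar\Theta^a\times\bar\Theta^b$ into sets of small diameter is an $\varepsilon$-net argument: the number of cells grows like $(C/\varepsilon)^{q_N}$, which the Bernstein exponent must beat. A toy sketch (hypothetical dimension and radius, unrelated to $q_N$) building a grid net of the unit ball in $\mathbb R^2$ and checking the covering property:

```python
import numpy as np
from itertools import product

def grid_net(dim, eps):
    # A grid with spacing eps/sqrt(dim) is an eps-net of [-1, 1]^dim:
    # every point lies within eps of some grid point; the cardinality
    # grows like (C/eps)^dim, mirroring the D_N^a bound in the proof.
    step = eps / np.sqrt(dim)
    ticks = np.arange(-1.0, 1.0 + step, step)
    return [np.array(p) for p in product(ticks, repeat=dim)]

eps = 0.2
net = grid_net(2, eps)
rng = np.random.default_rng(3)
pts = rng.uniform(-1.0, 1.0, size=(200, 2))
pts = pts[np.linalg.norm(pts, axis=1) <= 1.0]  # keep points in the unit ball
cover = all(min(np.linalg.norm(x - c) for c in net) <= eps for x in pts)
print(cover)  # True
```

The half-diagonal of each grid cell is $\varepsilon/2$, so the covering holds deterministically; only the cardinality, not the randomness, matters for the union bound.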
l 2 N b ˇi D1 ı a 2 k b ˇ l ˇN T h i X ˇX ˇ Q Q Q Q R ait ı a fQi .ı a ; ı a ; ıb / Ew Qi .ı a ; ı a ; ıb / C w D sup ˇ N ˇi D1 ı2 t D1 d h QQ i .ı kd ; ıQ a ; ıbl / C Ew QQ i .ı ka ; ıQ a ; ıbl / i T X R ait ı ka w tD1 D sup N ı2 d C T X ˇ ˇN ˇX ˇ fQQ i .ı a ; ıQ a ; ıb / ˇ ˇi D1 L i t ıa w ı ka r Ä 5N T ˛1 sup N a ;ı l 2 Nb ık a 2 k b l h ı a ı ka C ıb qN " m0 D N 2 126 ıbl i Á ."i t / ˇ ˇ ˇ ."i t /ˇˇ ˇ h Ew QQ i .ı a ; ıQ a ; ıb / ˇ ˇ ˇ ."i t /gˇˇ ˇ Á tD1 R ait Ä 5N T max w i;t QQ i .ı kd ; ıQ a ; ıbl / ıQ a Á ıQ a i QQ i .ı ka ; ıQ a ; ıbl / ˇP Á Áˇ a PD b PDN ˇ N ˇ " N l k Q Therefore, now it suffices to show kD1 lD1 Pw ˇ iD1 Ai ı a ; ı; ıb ˇ > 2 has a vanishing upper bound that does not depend on .xi ; zi /. Berstein’s inequality is used. To evaluate maximum, Á 1 u C 1 juj, we can write using .u/ D 2 2 Q ıl maxAi ı ka ; ı; b Á i;k;l T i X k l Q Q R ait ı ka E Qi .ı a ; ı a ; ıb /jxi ; zi C w h D maxŒQQ i .ı ka ; ıQ a ; ıbl / i;k;l T h X t D1 T h X tD1 T X D max i;k;l t D1 T X t D1 C T X R ait ı ka w ei t i;k;l Ew Á ."i t / tD1 D max Ew Œ ıQ a wR b ıbl R ait ı ka w ei t wR b ıbl Á ei t Á ei t 1 ˇˇ Œˇei t 2 1 ˇˇ Œˇei t 2 R ait ı ka w R ait ı ka w ıQ a Á ˇ ˇ ˇei t ˇ ˇ wR b ıbl ˇ ˇ ˇ wR b ıbl ˇ ˇ ˇ ˇei t R ait ıQ a w R ait ıQ a w ."i t / tD1 i;t R ait ıQ a w wR b ıb Ái Ái C T X R ait w ıQ a ıa Á tD1 R ait ı a w Ä 3T max wR b ıbl R ait ıQ a w R ait w max ıQ a k ı ka r ÄC qN N 127 ˇ ˇ wR b ıbl ˇ C ˇ ˇ wR b ıbl ˇ C   à 1 R ait ıQ a w 2 à 1 R ait ıQ a w 2 ı ka ı ka Á  Á  ."i t / To evaluate variance, applying Knight’s identity, we have QQ i .ı ka ; ıQ a ; ıbl / C T X R ait ı ka w ıQ a Á ."i t / t D1 D D C C T h X tD1 T X R ait ı ka w ei t Á R ait ıQ a w ei t wR b ıbl Ái C T X R ait ı ka w ıQ a Á ."i t / tD1 R ait ı ka C wR b ıbl .w t D1 T X R ait ıQ a C wR b ıbl .w t D1 T X wR b ıbl R ait ı ka w ıQ a Á ri / ."i t / C tD1 0 T Z w R a ıQ a CwR b ı l ri X it b ."i t / ri / l T Z w R a ık X i t a CwR b ıb ri t D1 0 ŒI ."i t < t/ ŒI ."i t < t/ I."i t < 0/dt 
I."i t < 0/ dt ."i t / t D1 l T Z w R a ık X i t a CwR b ıb ri D ŒI a ıQ CwR ı l r R w a b b i t D1 it ."i t < t/ I."i t < 0/dt Thus, Ai ı ka ; ıQ a ; ıbl D QQ i Á ı ka ; ıQ a ; ıbl Á h E QQ i ı ka ; ıQ a ; ıbl Á i jxi ; zi C T X R ait ı ka w ıQ a Á ."i t / t D1 l T Z w R a ık X i t a CwR b ıb ri D ŒI ."i t < t/ a ıQ CwR ı l r R w a b b i tD1 2 it Z a l T R ık w X i t a CwR b ıb ri Ew 4 ŒI ."i t R a ıQ a CwR b ı l ri t D1 w it b T Á X R ait ı ka ıQ a ."i t / C w tD1 D l T Z w R a ık X i t a CwR b ıb ri ŒI ."i t < t/ a ıQ CwR ı l r R w a b b i tD1 2 it Z a l T R ık w X i t a CwR b ıb ri Ew 4 ŒI ."i t a ıQ CwR ı l r R w b b i t D1 it a I."i t < 0/dt T X R ait ı ka w ıQ a Á ."i t / tD1 < t/ I."i t < 0/dt T X 3 R ait ı ka w ıQ a Á ."i t /5 t D1 I."i t < 0/dt 3 < t/ 2 I."i t < 0/dt 5 C Ew 4 T X t D1 128 3 R ait ı ka w ıQ a Á ."i t /5 And it implies N X Var Ai ı ka ; ıQ a ; ıbl Á jxi ; zi Á i D1 Ä N X i D1 Ä N X i D1 12 3 7 I."i t < 0/dt A 5 20 l T Z Ra k 6@X wi t ı a CwR b ıb ri ŒI ."i t < t/ Ew 4 R a ıQ a CwR b ı l ri tD1 w it b ˇ ˇ a R i t ı ka T max ˇw i;t r ÄC l T Z w Áˇ X R a ık a CwR b ıb ri i t ˇ ŒFi t .t / ıQ a ˇ a l R ıQ a CwR b ı ri t D1 w it b  N T qN X X R ait ı ka C wR b ıbl fi t .0/ w N ri Á2 R ait ıQ a w Fi t .0/dt C wR b ıbl ri Á2 à .1 C o .1// iD1 tD1 r ÄC qN .1 C o .1// N by similar argument as in Lemma C.1.2. Then, by Bernstein’s inequality and Assumption 5, a Db DN N XX ˇ 1 0ˇ ˇN ˇ Á ˇX ˇ " Q ıl ˇ > A Pw @ˇˇ Ai ı ka ; ı; b ˇ ˇi D1 ˇ 2 kD1 lD1 0 1 a Db a Db s ! DN DN N N 2 XX X X N " =4 B C Ä exp @ q exp C q AÄ qN qN qN C C "C kD1 lD1 kD1 lD1 N N s ! Áq Áq p p N N N C N qN exp C Ä C exp C qN log N Ä C C N qN qN s N qN !! !0 Lemma C.1.9 (Asymptotic Equivalence with Bahadur Representation) Assume Assumption 1–5, 7 and 8. Then, we have ıQ a ıO a D op .1/ : Proof. Note that PN i D1 ŒQi p p qN ıL a ; qN ıLb Á Qi .0; 0/ Ä 0 where incides with oracle estimator ıO ab . Then, Lemma C.1.7 implies ıO ab p Á p qN ıL a ; qN ıLb cop D Op qN . 
Now, it suffices to show that, for any positive constants $M_1$ and $M_2$,
$$P\!\left(\inf_{\|\delta_a-\tilde\delta_a\|\ge M_1,\ \|\delta_b\|\le M_2\sqrt{q_N}}\sum_{i=1}^N\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)>0\right)\to1,$$
since we have $\sum_{i=1}^N\tilde Q_i\!\left(\hat\delta_a,\tilde\delta_a,\hat\delta_b\right)\le0$. Let $B=\left\{\delta_{ab}:\left\|\delta_a-\tilde\delta_a\right\|\le M_1,\ \|\delta_b\|\le M_2\sqrt{q_N}\right\}$. By Lemma C.1.8,
$$\sup_{\delta_{ab}\in B}\left|\sum_{i=1}^N\left[\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-E\!\left[\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\,\middle|\,x_i,z_i\right]\right]+\sum_{i,t}\check w^a_{it}\left(\delta_a-\tilde\delta_a\right)\psi_\tau(\varepsilon_{it})\right|=o_p(1).$$
Then, by Lemma C.1.5,
$$\sup_{\delta_{ab}\in B}\left|\sum_{i=1}^N\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-\frac12\left[\delta_a'K_N\delta_a-\tilde\delta_a'K_N\tilde\delta_a\right](1+o(1))+\sum_{i,t}\check w^a_{it}\left(\delta_a-\tilde\delta_a\right)\psi_\tau(\varepsilon_{it})\right|=o_p(1).\qquad(A.7)$$
And since
$$\tilde\delta_a=\left(\check W_a'B_N\check W_a\right)^{-1}\check W_a'\Psi_\tau(\varepsilon)=K_N^{-1}\check W_a'\Psi_\tau(\varepsilon),\qquad(A.8)$$
we have
$$\sum_{i,t}\check w^a_{it}\left(\delta_a-\tilde\delta_a\right)\psi_\tau(\varepsilon_{it})=\left(\delta_a-\tilde\delta_a\right)'\check W_a'\Psi_\tau(\varepsilon)=\left(\delta_a-\tilde\delta_a\right)'K_N\tilde\delta_a.\qquad(A.9)$$
Combining (A.7) and (A.9), we have
$$\sup_{\delta_{ab}\in B}\left|\sum_{i=1}^N\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-\frac12\left[\delta_a'K_N\delta_a-\tilde\delta_a'K_N\tilde\delta_a\right]+\left(\delta_a-\tilde\delta_a\right)'K_N\tilde\delta_a\right|=o_p(1),$$
which implies
$$\sup_{\delta_{ab}\in B}\left|\sum_{i=1}^N\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-\frac12\left(\delta_a-\tilde\delta_a\right)'K_N\left(\delta_a-\tilde\delta_a\right)\right|=o_p(1).$$
By Assumption 4, for any $\left\|\delta_a-\tilde\delta_a\right\|>M$, $\frac12\left(\delta_a-\tilde\delta_a\right)'K_N\left(\delta_a-\tilde\delta_a\right)>CM$ for some positive $C$. Since the partition $\left(\tilde w^b_{it},\tilde w^a_{it}\right)$ is arbitrary, Lemma C.1.9 implies $\left\|\tilde\delta-\hat\delta\right\|=o_p(1)$.

1) Convergence rate of $\hat\beta$: Consider the Bahadur representation $\tilde\delta$ in Lemma C.1.1. Let $\tilde\delta_1$ be the subvector of the first $K_4$ components of $\tilde\delta$. Then, by the partitioned-matrix formula, we can write
$$\tilde\delta_1=\underbrace{\left[\frac1NW'B_NW-\frac1NW'B_N\Pi_A\left(\frac1N\Pi_A'B_N\Pi_A\right)^{-1}\frac1N\Pi_A'B_NW\right]^{-1}}_{(a)}\left[\frac1{\sqrt N}W'\Psi_\tau(\varepsilon)-\frac1NW'B_N\Pi_A\left(\frac1N\Pi_A'B_N\Pi_A\right)^{-1}\frac1{\sqrt N}\Pi_A'\Psi_\tau(\varepsilon)\right].$$
Since part (a) is the upper-left submatrix of $\left(\frac1N\tilde W'B_N\tilde W\right)^{-1}$, its maximum eigenvalue is bounded above by an argument used in Lemma C.1.1.
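The eigenvalue bounds used here, and again in the $\hat g$-rate argument, are instances of the Rayleigh-quotient inequality $v'Av\le\lambda_{\max}(A)\|v\|^2$ for symmetric positive semidefinite $A$. A quick numerical illustration with random stand-in matrices (not the dissertation's data):

```python
import numpy as np

rng = np.random.default_rng(2)
N, q = 500, 8
Pi = rng.normal(size=(N, q))   # stand-in for a sieve regressor matrix
v = rng.normal(size=q)         # stand-in for an estimation error vector
A = Pi.T @ Pi / N              # symmetric PSD Gram matrix
lhs = np.mean((Pi @ v) ** 2)   # (1/N) sum_i (pi_i' v)^2 = v' A v
lam_max = np.linalg.eigvalsh(A).max()
print(lhs <= lam_max * (v @ v) + 1e-12)  # True: Rayleigh-quotient bound
```

Because the inequality is exact linear algebra, it holds for every draw; in the proof it converts a mean-squared fitted error into $\lambda_{\max}\left(\frac1N\Pi_A'\Pi_A\right)\left\|\hat\theta_A-\theta_{oA}\right\|^2$.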
Then, by Lemma C.1.10,
$$\left\|\tilde\delta_1\right\|\le C\left\|\frac1{\sqrt N}W'\left[I_{NT}-B_N\Pi_A\left(\Pi_A'B_N\Pi_A\right)^{-1}\Pi_A'\right]\Psi_\tau(\varepsilon)\right\|\le C\left\|\frac1{\sqrt N}\Phi'\Psi_\tau(\varepsilon)\right\|+o_p(1)=O_p(1).$$
Thus, $\hat\beta-\beta_o=O_p\!\left(N^{-1/2}\right)$.

2) Convergence rate of $\hat g$: By Lemmas C.1.1 and C.1.9,
$$\left\|\hat\theta_A-\theta_{oA}\right\|\le\left\|\begin{pmatrix}\hat\beta-\beta_o\\\hat\theta_A-\theta_{oA}\end{pmatrix}\right\|=\frac1{\sqrt N}\left\|\hat\delta\right\|=O_p\!\left(\sqrt{\frac{q_N}N}\right).$$
And, by Assumption 8, we have
$$\frac1N\sum_{i=1}^N\left[\hat g(x_i,z_i)-g_o(x_i,z_i)\right]^2\le\frac2N\sum_{i=1}^N\left(\pi_{iA}\left(\hat\theta_A-\theta_{oA}\right)\right)^2+\frac2N\sum_{i=1}^N\left(\pi_{iA}\theta_{oA}-g_o(x_i,z_i)\right)^2\le\frac2N\sum_{i=1}^N\left(\pi_{iA}\left(\hat\theta_A-\theta_{oA}\right)\right)^2+O\!\left(\frac{q_N}N\right).$$
Now, it suffices to show $\frac1N\sum_{i=1}^N\left(\pi_{iA}\left(\hat\theta_A-\theta_{oA}\right)\right)^2=O_p\!\left(\frac{q_N}N\right)$. First note that
$$\frac1N\sum_{i=1}^N\left(\pi_{iA}\left(\hat\theta_A-\theta_{oA}\right)\right)^2=\left(\hat\theta_A-\theta_{oA}\right)'\left(\frac1N\Pi_A'\Pi_A\right)\left(\hat\theta_A-\theta_{oA}\right)\le\lambda_{\max}\!\left(\frac1N\Pi_A'\Pi_A\right)\left\|\hat\theta_A-\theta_{oA}\right\|^2=O_p\!\left(\frac{q_N}N\right),$$
since $\lambda_{\max}\!\left(\frac1N\Pi_A'\Pi_A\right)=O_p(1)$. The result follows.

Lemma C.1.10 Assume Assumptions 1–5, 7 and 8. Then (i) $N^{-1/2}\bar W=N^{-1/2}\Phi+o_p(1)$ and (ii) $N^{-1}\bar W'B_N\bar W=K_N+o_p(1)$, where $\bar W=\left[I_{NT}-\Pi_A\left(\Pi_A'B_N\Pi_A\right)^{-1}\Pi_A'B_N\right]W$ and $K_N=\frac1N\Phi'B_N\Phi$.

Proof. (i) Let $P=\Pi_A\left(\Pi_A'B_N\Pi_A\right)^{-1}\Pi_A'B_N$. We can write
$$N^{-1/2}\bar W=N^{-1/2}\left[W-PW\right]=N^{-1/2}\Phi+N^{-1/2}\left(H-PW\right).$$
Note that
$$\left\|N^{-1/2}\left(H-PW\right)\right\|^2\le N^{-1}\,\mathrm{tr}\!\left(\left(H-PW\right)'\left(H-PW\right)\right)\le CN^{-1}T\sum_{i=1}^N\sum_{t=1}^T\sum_{k=1}^{K_1}\left[\hat h_k(x_i,z_i)-h_k(x_i,z_i)\right]^2=o_p(1),$$
where the last equality follows from Assumption 8. (ii) Note
$$N^{-1}\bar W'B_N\bar W=\left(N^{-1/2}\Phi+o_p(1)\right)'B_N\left(N^{-1/2}\Phi+o_p(1)\right)=N^{-1}\Phi'B_N\Phi+o_p(1)=K_N+o_p(1).$$

Lemma C.1.11 Assume Assumptions 1–5, 7 and 8. Then,
$$\Sigma_N^{-1/2}\tilde\delta_1\overset d\to N(0,I).$$
Proof.
We can write
$$\tilde\delta_1=\left(N^{-1}\bar W'B_N\bar W\right)^{-1}N^{-1/2}\bar W'\Psi_\tau(\varepsilon),\qquad N^{-1}\bar W'B_N\bar W=K_N+o_p(1).$$
Note that a similar argument to that used in Lemma C.1.10 yields
$$N^{-1/2}\bar W'\Psi_\tau(\varepsilon)=N^{-1/2}\Phi'\Psi_\tau(\varepsilon)+N^{-1/2}\left(H-PW\right)'\Psi_\tau(\varepsilon)=N^{-1/2}\Phi'\Psi_\tau(\varepsilon)+o_p(1),$$
so that we have
$$\tilde\delta_1=K_N^{-1}N^{-1/2}\Phi'\Psi_\tau(\varepsilon)+o_p(1).$$
Define $D_{Nit}=\Sigma_N^{-1/2}K_N^{-1}N^{-1/2}\phi_{it}'\psi_\tau(\varepsilon_{it})$. Then $\sum_{i=1}^N\sum_{t=1}^TD_{Nit}=\Sigma_N^{-1/2}K_N^{-1}N^{-1/2}\Phi'\Psi_\tau(\varepsilon)$. Similarly as in Lemma C.1.1, note that $E\!\left[D_{Nit}\right]=0$ and $E\!\left[\sum_{t=1}^TD_{Nit}\right]=0$. Also,
$$\sum_{i=1}^NE\!\left[\left(\sum_{t=1}^TD_{Nit}\right)\left(\sum_{t=1}^TD_{Nit}\right)'\right]=E\!\left[\Sigma_N^{-1/2}K_N^{-1}\left(N^{-1}\sum_{i=1}^N\left(\sum_{t=1}^T\phi_{it}'\psi_\tau(\varepsilon_{it})\right)\left(\sum_{t=1}^T\psi_\tau(\varepsilon_{it})\phi_{it}\right)\right)K_N^{-1}\Sigma_N^{-1/2}\right]=\Sigma_N^{-1/2}K_N^{-1}S_NK_N^{-1}\Sigma_N^{-1/2}=I.$$
To check the Lindeberg–Feller condition, fix $\varepsilon>0$:
$$\sum_{i=1}^NE\!\left[\left\|\sum_{t=1}^TD_{Nit}\right\|^2 1\!\left(\left\|\sum_{t=1}^TD_{Nit}\right\|>\varepsilon\right)\right]\le\frac1{\varepsilon^2}\sum_{i=1}^NE\left\|\sum_{t=1}^TD_{Nit}\right\|^4\le\frac M{\varepsilon^2N^2}\sum_{i=1}^NE\!\left[\sum_{t=1}^T\left\|\Sigma_N^{-1/2}K_N^{-1}\phi_{it}'\right\|\right]^4\le\frac C{\varepsilon^2N^2}\sum_{i=1}^N\sum_{t=1}^TE\left\|\phi_{it}\right\|^4=O\!\left(\frac1N\right)=o(1).$$

Proof of Theorem 3.4.8 The penalized objective function in (3.33) can be expressed as a difference of two convex functions $k(\beta,\theta)$ and $l(\beta,\theta)$:
$$\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\!\left(y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta\right)+\sum_{j=1}^{p_N}p_\lambda\!\left(|\theta_j|\right)=k(\beta,\theta)-l(\beta,\theta),$$
where
$$k(\beta,\theta)=\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\!\left(y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta\right)+\lambda\sum_{j=1}^{p_N}|\theta_j|,\qquad l(\beta,\theta)=\sum_{j=1}^{p_N}\left[\lambda|\theta_j|-p_\lambda\!\left(|\theta_j|\right)\right]\equiv\sum_{j=1}^{p_N}L\!\left(\theta_j\right)$$
for some $L(\cdot)$. The function $L$ for SCAD and MCP is defined as
$$L\!\left(\theta_j\right)=\begin{cases}0&0\le|\theta_j|\le\lambda\\[4pt]\dfrac{\left(|\theta_j|-\lambda\right)^2}{2(a-1)}&\lambda\le|\theta_j|\le a\lambda\\[4pt]\lambda|\theta_j|-\dfrac{(a+1)\lambda^2}2&|\theta_j|>a\lambda\end{cases}$$
with $a>2$, and
$$L\!\left(\theta_j\right)=\begin{cases}\dfrac{\theta_j^2}{2a}&0\le|\theta_j|<a\lambda\\[4pt]\lambda|\theta_j|-\dfrac{a\lambda^2}2&|\theta_j|\ge a\lambda\end{cases}$$
with $a>1$, respectively.
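The decomposition writes the folded-concave penalty as $p_\lambda(t)=\lambda t-L(t)$ so that both $k$ and $l$ are convex. A small sketch using the standard SCAD (Fan and Li, 2001) and MCP (Zhang, 2010) formulas — the values of $\lambda$ and $a$ below are illustrative — verifying that the piecewise $L$ above equals $\lambda|\theta|-p_\lambda(|\theta|)$:

```python
import numpy as np

def p_scad(t, lam, a=3.7):
    # SCAD penalty, evaluated at t = |theta| >= 0
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return (a + 1) * lam**2 / 2

def L_scad(t, lam, a=3.7):
    # piecewise form of L(theta) = lam*|theta| - p_lam(|theta|) from the text
    if t <= lam:
        return 0.0
    if t <= a * lam:
        return (t - lam)**2 / (2 * (a - 1))
    return lam * t - (a + 1) * lam**2 / 2

def p_mcp(t, lam, a=3.0):
    # MCP penalty, evaluated at t = |theta| >= 0
    return lam * t - t**2 / (2 * a) if t <= a * lam else a * lam**2 / 2

def L_mcp(t, lam, a=3.0):
    return t**2 / (2 * a) if t < a * lam else lam * t - a * lam**2 / 2

lam = 0.5
for t in np.linspace(0.0, 4.0, 81):
    assert abs(L_scad(t, lam) - (lam * t - p_scad(t, lam))) < 1e-12
    assert abs(L_mcp(t, lam) - (lam * t - p_mcp(t, lam))) < 1e-12
print("decomposition verified")
```

Both $L$ functions are convex and continuously differentiable except at branch points, which is what makes $\partial l$ a regular derivative in the next step.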
The subdifferential of $f$ at $\eta_o$ is defined to be the set
$$\partial f(\eta_o)=\left\{t:f(\eta)\ge f(\eta_o)+(\eta-\eta_o)'t,\ \forall\eta\right\}.$$
Then $\partial l$, the subdifferential of $l$, is merely a regular derivative:
$$\partial l(\beta,\theta)=\left\{\mu\in\mathbb R^{K_4+p_N}:\mu_j=0\text{ for }1\le j\le K_4,\ \mu_j=\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}}\text{ for }K_4+1\le j\le K_4+p_N\right\},$$
where
$$\frac{\partial l(\beta,\theta)}{\partial\theta_j}=\begin{cases}0&0\le|\theta_j|<\lambda\\[4pt]\dfrac{\theta_j-\lambda\,\mathrm{sgn}(\theta_j)}{a-1}&\lambda\le|\theta_j|\le a\lambda\\[4pt]\lambda\,\mathrm{sgn}(\theta_j)&|\theta_j|>a\lambda\end{cases}$$
for SCAD, and
$$\frac{\partial l(\beta,\theta)}{\partial\theta_j}=\begin{cases}\dfrac{\theta_j}a&0\le|\theta_j|<a\lambda\\[4pt]\lambda\,\mathrm{sgn}(\theta_j)&|\theta_j|\ge a\lambda\end{cases}$$
for MCP. Before we derive the subdifferential of $k$, first consider the subgradient of the unpenalized objective function $\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\!\left(y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta\right)$. For $1\le j\le K_4+p_N$,
$$s_j(\beta,\theta)=-\frac\tau N\sum_{i,t}\tilde w_{itj}\,1\!\left[y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta>0\right]-\frac{\tau-1}N\sum_{i,t}\tilde w_{itj}\,1\!\left[y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta<0\right]-\frac1N\sum_{i,t}\tilde w_{itj}\,a_{it},$$
where $\tilde w_{itj}$ denotes the $j$th component of $\left(w_{it},\pi(x_i,z_i)\right)$, $a_{it}=0$ if $y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta\ne0$, and $a_{it}\in[\tau-1,\tau]$ otherwise. The subgradient of $k$ coincides with $s(\beta,\theta)$ with an additional term $\lambda l_j$ introduced for $K_4+1\le j\le K_4+p_N$:
$$\partial k(\beta,\theta)=\left\{\left(\kappa_1,\dots,\kappa_{K_4+p_N}\right):\kappa_j=s_j(\beta,\theta)\text{ if }1\le j\le K_4;\ \kappa_j=s_j(\beta,\theta)+\lambda l_j\text{ otherwise}\right\},$$
where $l_j=\mathrm{sgn}\!\left(\theta_{j-K_4}\right)$ if $\theta_{j-K_4}\ne0$ and $l_j\in[-1,1]$ otherwise. Let $\left(\hat\beta,\hat\theta\right)$ be the oracle estimator. Define $\mathcal K$ to be the collection of vectors $\kappa=\left(\kappa_1,\dots,\kappa_{K_4+p_N}\right)$ such that
$$\kappa_j=\begin{cases}0&\text{if }j=1,\dots,K_4\\\lambda\,\mathrm{sgn}\!\left(\hat\theta_{j-K_4}\right)&\text{if }j=K_4+1,\dots,K_4+q_N\\s_j\!\left(\hat\beta,\hat\theta\right)+\lambda l_j&\text{if }j=K_4+q_N+1,\dots,K_4+p_N\end{cases}$$
with $l_j\in[-1,1]$. Then, Lemma C.1.12 and Lemma C.1.13 [Lemma C.1.16] below deliver the result. First note that, by Lemma C.1.13 [Lemma C.1.16], $\lim_{N\to\infty}P\!\left(\mathcal K\subset\partial k\!\left(\hat\beta,\hat\theta\right)\right)=1$, since $s_j\!\left(\hat\beta,\hat\theta\right)=0$ and $l_j=\mathrm{sgn}\!\left(\hat\theta_{j-K_4}\right)$ for $j=K_4+1,\dots,K_4+q_N$. Now, consider a point $(\beta,\theta)\in B\!\left(\left(\hat\beta,\hat\theta\right),\frac\lambda2\right)$. By Lemma C.1.12, it suffices to show that there exists $\kappa\in\mathcal K$ such that $P\!\left(\kappa\in\partial l(\beta,\theta)\right)\to1$; i.e., (A.10) and (A.11) below hold.
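At any point where no residual is exactly zero, the $a_{it}$ terms vanish and $s_j$ reduces to an ordinary derivative of the check-function objective. A small sketch with simulated data and hypothetical parameter values, comparing $s_j$ against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, N, p = 0.25, 60, 2
W = rng.normal(size=(N, p))
y = W @ np.array([1.0, -0.5]) + rng.normal(size=N)
beta = np.array([0.9, -0.4])  # residuals are nonzero a.s. at this point

def objective(b):
    u = y - W @ b
    return np.mean(u * (tau - (u < 0)))  # mean check-function loss

def subgrad(b):
    # s_j = -(1/N) sum_i w_ij (tau - 1[y_i - w_i b < 0]);
    # valid as a derivative when no residual equals zero
    u = y - W @ b
    return -(W * (tau - (u < 0))[:, None]).mean(axis=0)

h = 1e-6
e0 = np.array([h, 0.0])
fd = (objective(beta + e0) - objective(beta - e0)) / (2 * h)
print(abs(fd - subgrad(beta)[0]) < 1e-5)
```

Away from kinks the objective is locally linear in $\beta$, so the finite difference is exact up to floating error; at a kink the subdifferential is the interval traced out by $a_{it}\in[\tau-1,\tau]$, exactly as in the text.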
$$\lim_{N\to\infty}P\!\left(\kappa_j=\frac{\partial l(\beta,\theta)}{\partial\beta_j},\ j=1,\dots,K_4\right)=1\qquad(A.10)$$
$$\lim_{N\to\infty}P\!\left(\kappa_j=\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}},\ j=K_4+1,\dots,K_4+p_N\right)=1\qquad(A.11)$$
Since $\frac{\partial l(\beta,\theta)}{\partial\beta_j}=0$ for $j=1,\dots,K_4$, (A.10) holds by the definition of $\mathcal K$. To show (A.11): For both the SCAD and MCP penalty functions, $\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}}=\lambda\,\mathrm{sgn}\!\left(\theta_{j-K_4}\right)$ for $\left|\theta_{j-K_4}\right|\ge a\lambda$ (the case with $\left|\theta_{j-K_4}\right|=a\lambda$ can be easily checked). By Lemma C.1.13 [Lemma C.1.16], it is implied that
$$\min_{1\le j\le q_N}\left|\theta_j\right|\ge\min_{1\le j\le q_N}\left|\hat\theta_j\right|-\max_{1\le j\le q_N}\left|\hat\theta_j-\theta_j\right|\ge\left(a+\frac12\right)\lambda-\frac\lambda2=a\lambda$$
with probability approaching one. Thus, $\lim_{N\to\infty}P\!\left(\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}}=\lambda\,\mathrm{sgn}\!\left(\theta_{j-K_4}\right)\right)=1$ for $j=K_4+1,\dots,K_4+p_N$ with $\left|\theta_{j-K_4}\right|\ge a\lambda$. For $K_4+1\le j\le K_4+q_N$, it then suffices to show
$$\lim_{N\to\infty}P\!\left(\mathrm{sgn}\!\left(\hat\theta_{j-K_4}\right)=\mathrm{sgn}\!\left(\theta_{j-K_4}\right)\right)=1,$$
since $\kappa_j=\lambda\,\mathrm{sgn}\!\left(\hat\theta_{j-K_4}\right)$ for $K_4+1\le j\le K_4+q_N$. From the fact that $\hat\theta_j-\theta_{oj}=O_p\!\left(N^{-1/2}q_N^{1/2}\right)=o_p(\lambda)$ for $1\le j\le q_N$, where $\theta_{oj}>0$, and that $\left|\hat\theta_j-\theta_j\right|<\frac\lambda2$, it is implied that $\hat\theta_{j-K_4}$ and $\theta_{j-K_4}$ have the same sign for $K_4+1\le j\le K_4+q_N$ with probability tending to one. For $K_4+q_N+1\le j\le K_4+p_N$, we have $\hat\theta_{j-K_4}=0$ by the definition of the oracle estimator. Then,
$$\left|\theta_{j-K_4}\right|\le\left|\theta_{j-K_4}-\hat\theta_{j-K_4}\right|+\left|\hat\theta_{j-K_4}\right|=\left|\theta_{j-K_4}-\hat\theta_{j-K_4}\right|<\frac\lambda2.$$
For $\left|\theta_j\right|<\lambda$, $\frac{\partial l(\beta,\theta)}{\partial\theta_j}=0$ for SCAD and $\frac{\partial l(\beta,\theta)}{\partial\theta_j}=\frac{\theta_j}a$ for MCP, which implies $\left|\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}}\right|\le\frac\lambda2$ for $K_4+q_N+1\le j\le K_4+p_N$, for both penalty functions. Also, by Lemma C.1.13 [Lemma C.1.16], it is implied that $\left|s_j\!\left(\hat\beta,\hat\theta\right)\right|\le\frac\lambda2$ with probability tending to one for $K_4+q_N+1\le j\le K_4+p_N$. Therefore, there exists $l_j\in[-1,1]$ such that
$$\lim_{N\to\infty}P\!\left(s_j\!\left(\hat\beta,\hat\theta\right)+\lambda l_j=\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}},\ j=K_4+q_N+1,\dots,K_4+p_N\right)=1.$$
Take $\kappa_j=s_j\!\left(\hat\beta,\hat\theta\right)+\lambda l_j$ with such $l_j$. Then the result follows.

Lemma C.1.12 (Tao and An, 1997) Consider the function $k(\eta)-l(\eta)$, where both $k$ and $l$ are convex with subdifferentials $\partial k(\eta)$ and $\partial l(\eta)$. Let $\eta^*$ be a point that has a neighborhood $U$ such that $\partial l(\eta)\cap\partial k(\eta^*)\ne\emptyset$ for all $\eta\in U\cap\mathrm{dom}(k)$. Then $\eta^*$ is a local minimizer of $k(\eta)-l(\eta)$.

Lemma C.1.13 Assume Assumptions 1–6 and 9. Suppose $\lambda=o\!\left(N^{-(1-C_4)/2}\right)$, $N^{-1/2}q_N^{1/2}=o(\lambda)$, and $\log(p_N)=o\!\left(N\lambda^2\right)$. Let $\left(\hat\beta,\hat\theta\right)$ be the oracle estimator. Then, there exists $a^*_{it}$, with $a^*_{it}=0$ if $y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\ne0$ and $a^*_{it}\in[\tau-1,\tau]$ otherwise, such that, for $s_j(\beta,\theta)$ with $a_{it}=a^*_{it}$, with probability approaching one,
$$s_j\!\left(\hat\beta,\hat\theta\right)=0,\quad j=1,\dots,K_4+q_N;\qquad(A.12)$$
$$\left|\hat\theta_j\right|\ge\left(a+\frac12\right)\lambda,\quad j=1,\dots,q_N;\qquad(A.13)$$
$$\left|s_j\!\left(\hat\beta,\hat\theta\right)\right|\le c\lambda,\quad\forall c>0,\quad j=K_4+q_N+1,\dots,K_4+p_N.\qquad(A.14)$$
Proof. Define $D=\left\{(i,t):y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A=0\right\}$. To show (A.12), note that, with probability tending to one, $\left(y_{it},\tilde w_{it}\right)$ is in general position, i.e., $|D|=K_4+q_N$, since $y_{it}$ has a continuous density (Section 2.2.2, Koenker, 2005). Thus, there exists $\left\{a^*_{it}\right\}$ with $K_4+q_N$ nonzero elements such that (A.12) holds. (Alternatively, optimality of $\left(\hat\beta,\hat\theta_A\right)$ implies $0_{K_4+p_N}\in\partial\sum_{i,t}\rho_\tau\!\left(y_{it}-w_{it}\beta-\pi_A(x_i,z_i)\theta_A\right)$ at $\left(\hat\beta,\hat\theta\right)$, so that such $a^*_{it}$ exist.) To show (A.13), note that
$$\min_{1\le j\le q_N}\left|\hat\theta_j\right|\ge\min_{1\le j\le q_N}\left|\theta_{oj}\right|-\max_{1\le j\le q_N}\left|\hat\theta_j-\theta_{oj}\right|.$$
By Assumption 9, $\min_{1\le j\le q_N}\left|\theta_{oj}\right|\ge C_5N^{-(1-C_4)/2}$. Then, by Lemma C.1.1 and Lemma C.1.4, $\max_{1\le j\le q_N}\left|\hat\theta_j-\theta_{oj}\right|=O_p\!\left(\sqrt{\frac{q_N}N}\right)=o_p\!\left(N^{-(1-C_4)/2}\right)$. The result follows from $\lambda=o\!\left(N^{-(1-C_4)/2}\right)$. To show (A.14), define $J_3\equiv\left\{j:K_4+q_N+1\le j\le K_4+p_N\right\}$. Then, for $j\in J_3$, by the definition of $s_j\!\left(\hat\beta,\hat\theta\right)$, it is implied that
$$s_j\!\left(\hat\beta,\hat\theta\right)=-\frac1N\sum_{i,t}\tilde w_{itj}\left\{\tau-1\!\left[y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\le0\right]\right\}-\frac1N\sum_{(i,t)\in D}\tilde w_{itj}\,a^*_{it},$$
where $a^*_{it}$ satisfies the given condition. Thus, $\frac1N\sum_{(i,t)\in D}\tilde w_{itj}\,a^*_{it}=O_p\!\left(\frac{q_N}N\right)=o_p(\lambda).$
This holds since $|D|=K_4+q_N$ with probability tending to one. It will be shown that
$$\lim_{N\to\infty}P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left\{1\!\left[y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\le0\right]-\tau\right\}\right|>c\lambda\right)=0.$$
Define
$$I_{it}(\theta)=1\!\left[y_{it}-\tilde w^A_{it}\theta\le0\right],\qquad P_{it}(\theta)=P\!\left(y_{it}-\tilde w^A_{it}\theta\le0\,\middle|\,\tilde w^A_{it}\right),\qquad H_{it}(\theta)=I_{it}(\theta)-I_{it}(\theta_o)-P_{it}(\theta)+P_{it}(\theta_o).$$
Then, note
$$P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left\{1\!\left[\,\cdot\le0\right]-\tau\right\}\right|>c\lambda\right)\le P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(I_{it}\!\left(\hat\theta_A\right)-I_{it}(\theta_{oA})\right)\right|>\frac{c\lambda}2\right)+P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_{oA})-\tau\right)\right|>\frac{c\lambda}2\right)$$
$$\le P\!\left(\max_{j\in J_3}\sup_{\|\theta_A-\theta_{oA}\|\le C\sqrt{q_N/N}}\left|\frac1N\sum_{i,t}\tilde w_{itj}\,H_{it}(\theta_A)\right|>\frac{c\lambda}4\right)+P\!\left(\max_{j\in J_3}\sup_{\|\theta_A-\theta_{oA}\|\le C\sqrt{q_N/N}}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(P_{it}(\theta_A)-P_{it}(\theta_{oA})\right)\right|>\frac{c\lambda}4\right)+o_p(1),$$
where the second inequality follows by Lemma C.1.14. Here, note that
$$\max_{j\in J_3}\sup_{\|\theta_A-\theta_{oA}\|\le C\sqrt{q_N/N}}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(P_{it}(\theta_A)-P_{it}(\theta_{oA})\right)\right|=\max_{j\in J_3}\sup\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(F_{it}\!\left(\tilde w^A_{it}\left(\theta_A-\theta_{oA}\right)\right)-F_{it}(0)\right)\right|$$
$$\le C\sup_{\|\theta_A-\theta_{oA}\|\le C\sqrt{q_N/N}}\frac1N\sum_{i,t}\left|\tilde w^A_{it}\left(\theta_A-\theta_{oA}\right)\right|\le C\sqrt{\lambda_{\max}\!\left(\frac1N\tilde W_A'\tilde W_A\right)}\sup\left\|\theta_A-\theta_{oA}\right\|\le C\sqrt{\frac{q_N}N}=o(\lambda).$$
Thus, it now suffices to show that
$$P\!\left(\max_{j\in J_3}\sup_{\theta_A\in L_N}\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_A)-I_{it}(\theta_{oA})-P_{it}(\theta_A)+P_{it}(\theta_{oA})\right)\right|>N\lambda\right)$$
tends to 0 as $N$ goes to infinity, where $L_N=\left\{\theta_A:\left\|\theta_A-\theta_{oA}\right\|\le C\sqrt{\frac{q_N}N}\right\}$. It is implied by Lemma C.1.15.
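The union bound over $j\in J_3$ works because each per-coordinate tail is exponentially small in $N\lambda^2$ while there are only $p_N=\exp\!\left(o(N\lambda^2)\right)$ coordinates. A minimal Monte Carlo sketch of the Hoeffding step, with hypothetical variables bounded in $[0,1]$ (for the mean of $n$ such variables, $P\!\left(|\bar X-E\bar X|>t\right)\le2e^{-2nt^2}$):

```python
import numpy as np

def hoeffding_bound(n, t):
    # variables in [0, 1]: P(|mean - E[mean]| > t) <= 2 exp(-2 n t^2)
    return 2.0 * np.exp(-2.0 * n * t**2)

rng = np.random.default_rng(1)
n, reps, t = 200, 20000, 0.1
U = rng.uniform(0.0, 1.0, size=(reps, n))   # mean 1/2
tail_freq = np.mean(np.abs(U.mean(axis=1) - 0.5) > t)
print(tail_freq <= hoeffding_bound(n, t))  # True
```

With $t\asymp\lambda$ the per-$j$ bound is $2e^{-CN\lambda^2}$, and summing over $p_N$ coordinates gives $2\exp\!\left(\log p_N-CN\lambda^2\right)\to0$, which is exactly the display in the lemma below.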
Lemma C.1.14 Assume Assumptions 1–6. Suppose $\log(p_N)=o\!\left(N\lambda^2\right)$ and $N\lambda^2\to\infty$. Then
$$P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_{oA})-\tau\right)\right|>\frac{c\lambda}2\right)\to0.$$
Proof. The argument is similar to Wang, Wu, and Li (2012). Note that, for some constant $C$, the random variables $\frac1C\sum_{t=1}^T\tilde w_{itj}I_{it}(\theta_{oA})$ are independent across $i$ and bounded in the interval $[0,1]$. Then, by Hoeffding's inequality, it is implied that
$$P\!\left(\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_{oA})-\tau\right)\right|>\frac{c\lambda}2\right)\le2\exp\!\left(-CN\lambda^2\right),$$
so that
$$P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_{oA})-\tau\right)\right|>\frac{c\lambda}2\right)\le2p_N\exp\!\left(-CN\lambda^2\right)=2\exp\!\left(\log p_N-CN\lambda^2\right)\to0.$$

Lemma C.1.15 Assume Assumptions 1–6 and 9. Suppose $\lambda=o\!\left(N^{-(1-C_4)/2}\right)$, $N^{-1/2}q_N^{1/2}=o(\lambda)$, and $\log(p_N)=o\!\left(N\lambda^2\right)$. Then, for any given positive constant $C$,
$$\lim_{N\to\infty}P\!\left(\max_{j\in J_3}\sup_{\theta_A\in B}\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_A)-I_{it}(\theta_{oA})-P_{it}(\theta_A)+P_{it}(\theta_{oA})\right)\right|>N\lambda\right)=0,$$
where $B=\left\{\theta_A:\left\|\theta_A-\theta_{oA}\right\|\le C\sqrt{\frac{q_N}N}\right\}$.
Proof. $B$ can be covered with a net of balls with radius $C\sqrt{\frac{q_N}{N^5}}$ and with cardinality $N_1\equiv|\mathcal B|\le CN^{2q_N}$. Denote the $N_1$ balls centered at $t_m$ by $b(t_m)$, for $m=1,\dots,N_1$. Then
$$P\!\left(\sup_{\theta_A\in B}\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_A)-I_{it}(\theta_{oA})-P_{it}(\theta_A)+P_{it}(\theta_{oA})\right)\right|>N\lambda\right)$$
$$\le\sum_{m=1}^{N_1}P\!\left(\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(t_m)-I_{it}(\theta_{oA})-P_{it}(t_m)+P_{it}(\theta_{oA})\right)\right|>\frac{N\lambda}2\right)+\sum_{m=1}^{N_1}P\!\left(\sup_{\|\tilde\theta_A-t_m\|\le C\sqrt{q_N/N^5}}\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(\tilde\theta_A)-I_{it}(t_m)-P_{it}(\tilde\theta_A)+P_{it}(t_m)\right)\right|>\frac{N\lambda}2\right)\equiv I_{Nj1}+I_{Nj2}.$$
To evaluate $I_{Nj1}$, define $v_{ij}=\sum_{t=1}^T\tilde w_{itj}\left(I_{it}(t_m)-I_{it}(\theta_{oA})-P_{it}(t_m)+P_{it}(\theta_{oA})\right)$, which are bounded, independent, mean-zero random variables. First, note that $I_{it}(t_m)-I_{it}(\theta_{oA})$ is nonzero only if ($I_{it}(t_m)=1$ and $I_{it}(\theta_{oA})=0$) or ($I_{it}(t_m)=0$ and $I_{it}(\theta_{oA})=1$), which implies $|e_{it}|<\left|\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right|$ a.s., where $e_{it}=y_{it}-\tilde w^A_{it}\theta_{oA}$.
Then,
$$V\!\left(v_{ij}\right)\le C\sum_{t=1}^TE\!\left[\tilde w^2_{itj}\left(I_{it}(t_m)-I_{it}(\theta_{oA})\right)^2\right]\le C\sum_{t=1}^TP\!\left(|e_{it}|<\left|\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right|\right)\le C\sum_{t=1}^TE\!\left[F_{it}\!\left(\left|\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right|\right)-F_{it}\!\left(-\left|\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right|\right)\right]\le CE\!\left[\sum_{t=1}^T\left|\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right|\right].$$
Therefore,
$$\sum_{i=1}^NV\!\left(v_{ij}\right)\le CNE\!\left[\sqrt{\frac1N\sum_{i,t}\left(\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right)^2}\right]\le CNE\!\left[\sqrt{\lambda_{\max}\!\left(\frac1N\tilde W_A\tilde W_A'\right)}\left\|t_m-\theta_{oA}\right\|\right]\le C\sqrt{Nq_N}.$$
Applying Bernstein's inequality,
$$I_{Nj1}\le N_1\exp\!\left(-\frac{N^2\lambda^2/8}{C\sqrt{Nq_N}+CN\lambda}\right)\le N_1\exp\!\left(-CN\lambda\right)\le\exp\!\left(Cq_N\log N-CN\lambda\right).$$
To evaluate $I_{Nj2}$, note that
$$I_{it}(\theta_A)=1\!\left[y_{it}-\tilde w^A_{it}\theta_A\le0\right]=1\!\left[y_{it}-\tilde w^A_{it}t_m\le\tilde w^A_{it}\left(\theta_A-t_m\right)\right]$$
and that an indicator function is increasing in its threshold. Define $I_{it}(\theta,e)=1\!\left[y_{it}-\tilde w^A_{it}\theta\le e\right]$ and $P_{it}(\theta,e)=P\!\left(y_{it}-\tilde w^A_{it}\theta\le e\,\middle|\,\tilde w^A_{it}\right)$. Then, we have
$$\sup_{\|\tilde\theta_A-t_m\|\le C\sqrt{q_N/N^5}}\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(\tilde\theta_A)-I_{it}(t_m)-P_{it}(\tilde\theta_A)+P_{it}(t_m)\right)\right|$$
$$\le\sum_{i,t}\left|\tilde w_{itj}\right|\left[I_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)-I_{it}(t_m)-P_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)+P_{it}(t_m)\right]+2\sum_{i,t}\left|\tilde w_{itj}\right|\left[P_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)-P_{it}(t_m)\right].$$
Note that
$$\sum_{i,t}\left|\tilde w_{itj}\right|\left[P_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)-P_{it}(t_m)\right]=\sum_{i,t}\left|\tilde w_{itj}\right|\left[F_{it}\!\left(\tilde w^A_{it}\left(t_m-\theta_{oA}\right)+\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)-F_{it}\!\left(\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right)\right]\le C\sum_{i,t}\left\|\tilde w^A_{it}\right\|\sqrt{\tfrac{q_N}{N^5}}\le Cq_NN^{-3/2}=o(N\lambda).$$
Hence, for all $N$ sufficiently large, $I_{Nj2}\le\sum_{m=1}^{N_1}P\!\left(\sum_{i=1}^N\alpha_{mi}\ge\frac{N\lambda}4\right)$, where
$$\alpha_{mi}=\sum_{t=1}^T\left|\tilde w_{itj}\right|\left[I_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)-I_{it}(t_m)-P_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)+P_{it}(t_m)\right].$$
Since the $\alpha_{mi}$ are independent, bounded random variables with mean zero, similarly as in the evaluation of $I_{Nj1}$, we can show that
$$V\!\left(\alpha_{mi}\right)\le C\sum_{t=1}^TE\!\left[\sqrt{\tfrac{q_N}{N^5}}\left\|\tilde w^A_{it}\right\|\right]\le Cq_NN^{-5/2}.$$
Applying Bernstein's inequality,
$$I_{Nj2}\le\sum_{m=1}^{N_1}P\!\left(\sum_{i=1}^N\alpha_{mi}\ge\frac{N\lambda}4\right)\le N_1\exp\!\left(-\frac{N^2\lambda^2/32}{Cq_NN^{-3/2}+CN\lambda}\right)\le N_1\exp\!\left(-CN\lambda\right)\le C\exp\!\left(Cq_N\log N-CN\lambda\right).$$
Therefore, we have
$$\sum_{j\in J_3}\left(I_{Nj1}+I_{Nj2}\right)\le C\exp\!\left(\log p_N+Cq_N\log N-CN\lambda\right)=o(1),$$
since $N^{-1/2}q_N^{1/2}\log N=o(1)$ is implied by the conditions given. Hence the result.

Lemma C.1.16 Assume Assumptions 1–5 and 7–9. Suppose $\lambda=o\!\left(N^{-(1-C_4)/2}\right)$, $N^{-1/2}q_N^{1/2}=o(\lambda)$, and $\log(p_N)=o\!\left(N\lambda^2\right)$. Let $\left(\hat\beta,\hat\theta\right)$ be the oracle estimator. Then, there exists $a^*_{it}$, with $a^*_{it}=0$ if $y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta\ne0$ and $a^*_{it}\in[\tau-1,\tau]$ otherwise, such that, for $s_j(\beta,\theta)$ with $a_{it}=a^*_{it}$, with probability approaching one,
$$s_j\!\left(\hat\beta,\hat\theta\right)=0,\quad j=1,\dots,K_4+q_N;\qquad(A.15)$$
$$\left|\hat\theta_j\right|\ge\left(a+\frac12\right)\lambda,\quad j=1,\dots,q_N;\qquad(A.16)$$
$$\left|s_j\!\left(\hat\beta,\hat\theta\right)\right|\le c\lambda,\quad\forall c>0,\quad j=K_4+q_N+1,\dots,K_4+p_N.\qquad(A.17)$$
Proof. Define $D=\left\{(i,t):y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A=0\right\}$. To show (A.15), note that, with probability tending to one, $\left(y_{it},\tilde w_{it}\right)$ is in general position, i.e., $|D|=K_4+q_N$, since $y_{it}$ has a continuous density (Section 2.2.2, Koenker, 2005). Thus, there exists $\left\{a^*_{it}\right\}$ with $K_4+q_N$ nonzero elements such that (A.15) holds. (Alternatively, optimality of $\left(\hat\beta,\hat\theta_A\right)$ implies $0_{K_4+p_N}\in\partial\sum_{i,t}\rho_\tau\!\left(y_{it}-w_{it}\beta-\pi_A(x_i,z_i)\theta_A\right)$ at $\left(\hat\beta,\hat\theta\right)$, so that such $a^*_{it}$ exist.) To show (A.16), note that
$$\min_{1\le j\le q_N}\left|\hat\theta_j\right|\ge\min_{1\le j\le q_N}\left|\theta_{oj}\right|-\max_{1\le j\le q_N}\left|\hat\theta_j-\theta_{oj}\right|.$$
By Assumption 9, $\min_{1\le j\le q_N}\left|\theta_{oj}\right|\ge C_5N^{-(1-C_4)/2}$. Then, by Lemma C.1.1 and Lemma C.1.9, $\max_{1\le j\le q_N}\left|\hat\theta_j-\theta_{oj}\right|=O_p\!\left(\sqrt{\frac{q_N}N}\right)=o_p\!\left(N^{-(1-C_4)/2}\right)$. The result follows from $\lambda=o\!\left(N^{-(1-C_4)/2}\right)$.
To show (A.17), define $J_3\equiv\left\{j:K_4+q_N+1\le j\le K_4+p_N\right\}$. Then, for $j\in J_3$, by the definition of $s_j\!\left(\hat\beta,\hat\theta\right)$, it is implied that
$$s_j\!\left(\hat\beta,\hat\theta\right)=-\frac1N\sum_{i,t}\tilde w_{itj}\left\{\tau-1\!\left[y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\le0\right]\right\}-\frac1N\sum_{(i,t)\in D}\tilde w_{itj}\,a^*_{it},$$
where $a^*_{it}$ satisfies the given condition, and $\frac1N\sum_{(i,t)\in D}\tilde w_{itj}\,a^*_{it}=O_p\!\left(\frac{q_N}N\right)=o_p(\lambda)$, since $|D|=K_4+q_N$ with probability tending to one. It will be shown that
$$\lim_{N\to\infty}P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left\{1\!\left[y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\le0\right]-\tau\right\}\right|>c\lambda\right)=0.$$
Define
$$I_{it}(\theta)=1\!\left[y_{it}-\tilde w^A_{it}\theta\le0\right],\qquad I^o_{it}=1\!\left[y_{it}-w_{it}\beta_o-g_o(x_i,z_i)\le0\right],$$
$$P_{it}(\theta)=P\!\left(y_{it}-\tilde w^A_{it}\theta\le0\,\middle|\,\tilde w^A_{it}\right),\qquad P^o_{it}=P\!\left(y_{it}-w_{it}\beta_o-g_o(x_i,z_i)\le0\,\middle|\,\tilde w^A_{it}\right).$$
Then, note
$$P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left\{1\!\left[y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\le0\right]-\tau\right\}\right|>c\lambda\right)$$
/ N N ÄC sup Thus, now it suffices to show 0 ˇ ˇ X T ˇ1 N X B ˇ P @ max sup q wQ i tj .Ii t . A / ˇ j 2J3 qN ˇ N i D1 tD1 k A  oA kÄC N Iiot 1 ˇ ˇ ˇ C Pi t . A / C Piot /ˇˇ > c =4A ˇ tends to 0 as N goes to infinity, which is implied by Lemma C.1.18. 146 Á Lemma C.1.17 Assume Assumption 1–5, and 7–9. Suppose log .pN / D o N 2 and N 2 ! 1: Then ˇ ˇ N X T ˇ1 X P @ max ˇˇ wQ i tj Iiot N j 2J3 ˇ i D1 t D1 ˇ 1 ˇ ˇ ˇ > c A ! 0: ˇ 2 ˇ 0 Proof. The argument is similar to Wang, Wu, Li (2012). Note that, for some constant C , the P random variable C1 TtD1 wQ i tj Iiot is independent across i , bounded by the interval Œ0; 1 : Then, by Hoeffding’s inequality, it is implied 0 N X T X 1 P@ wQ itj Iiot N 1 > c A Ä exp 2 i D1 tD1 CN 2 Á so that ˇ ˇ X T ˇ1 N X wQ i tj Iiot P @ max ˇˇ j 2J3 ˇ N i D1 tD1 Á Ä 2pN exp CN 2 Á D 2exp log pN CN 2 ! 0 0 ˇ 1 ˇ ˇ ˇ>c A ˇ 2 ˇ Á 1=2 Lemma C.1.18 Assume Assumption 1–5, and 7–9. Suppose D o N .1 C4 /=2 ; N 1=2 qN D Á o . / ; and log .pN / D o N 2 : Then, for any given positive constant C; 0 ˇ ˇN T ˇX X B ˇ lim P @ max sup q wQ i tj .Ii t . A / ˇ N !1 j 2J3 qN ˇi D1 t D1 k A  oA kÄC N Iiot 1 ˇ ˇ ˇ C Pi t . A / C Piot /ˇˇ > N A D 0 ˇ q q Proof. Let B D  A W k A  oA k Ä C NN : B can be covered with a net of balls with radius qq N with cardinality N Á jBj Ä CN 2qN . Denote the N balls centered at t D .t ; t / C m 1 1 m1 m2 5 N 147 ; N1 : by b .t m / for m D 1; 0 ˇ ˇ ˇN T ˇ ˇX X ˇ B o o ˇ ˇ . / . / w Q .I I P C P / P@ sup q i tj i t i t A A it it ˇ > N ˇ ˇ qN ˇi D1 t D1 k A  oA kÄC N ˇ 0ˇ 1 ˇ ˇN T N1 X ˇ ˇX X Ä P @ˇˇ wQ i tj .Ii t .t m / Iiot Pi t .t m / C Piot /ˇˇ > N =2A ˇ ˇiD1 tD1 mD1 0 ˇP Á PT ˇ N1 N B Q X B Ii t .t m / ˇ i D1 tD1 wQ i tj .Ii t  A ˇ C PB sup r >N Á ˇ @ Q qN Pi t  A C Pi t .t m //ˇ mD1 ÂQ t ÄC A m 1 C A 1 C C =2C A N5 Á INj1 C INj 2 : To evaluate INj1 ; define vij D PT Q i tj .Ii t tD1 w .t m / Iiot Pi t .t m /CPiot /; which are bounded, in- dependent mean zero random variables. 
First, note that Ii t .t m / Iiot is nonzero only if (Ii t .t m / D 1 and Iiot D 0) or (Ii t .t m / D 0 and Iiot D 1), which implies j"i t j < jwi t .t m1 ˇ o / ri j a.s: Then, V vij Ä C ÄC ÄC ÄC ÄC T X V wQ i tj .Ii t .t m / tD1 T X Iiot P tD1 T X Ii t .t m / Iiot 2 Iiot E wQ i2tj Ii t .t m / tD1 T X Pi t .t m / C Piot / 2 Á Á D1 P .j"i t j < jwi t .t m1 ˇ o / ri j/ tD1 T X E .ŒFi t .jwi t .t m1 ˇ o / ri j/ tD1 0 Ä CE @ T X 1 jwi t .t m1 ˇ o / ri jA t D1 148 Fi t . jwi t .t m1 ˇ o / ri j// Therefore, 0v 1 u N X T u1 X .wi t .t m1 ˇ o / ri /2 A V vij Ä CE @ jwi t .t m1 ˇ o / ri jA Ä CNE @t N i D1 i D1 t D1 i D1 tD1 s # " Ã!  p 1 Q Q0  C sup Ä C W W Ä CN E N qN kt k jr j m max i oA A A N 0 N X 1 N X T X Applying Bernstein’s inequality, INj1 Ä N1 exp N 2 2 =8 p C N qN C CN ! Ä N1 exp . CN / Ä exp .C qN log .N / CN / To evaluate INj 2 ; note that h Ii t . A / D 1 yi t i h QA Â Ä 0 D 1 yi t w it A Á QA w Â Ä e and Pi t .Â; e/ D it and that an indicator function is increasing. Define Ii t .Â; e/ D 1 yi t Á A : Then, we have QA Q P yi t w Â Ä ej w it it ÂQ A ˇ ˇN T Á ˇX X ˇ Q sup r w Q .I  i tj i t A ˇ qN ˇiD1 tD1 t m ÄC Ii t .t m / N5 r Ä Â Ã N X T ˇ X ˇ q N A ˇwQ i tj ˇ Ii t t m ; w Q it C Ä N5 D C i D1 t D1 N X T ˇ X r Ã Ä Â ˇ q N A ˇwQ i tj ˇ Ii t t m ; w Q it C N5 i D1 tD1 N X T ˇ X i D1 t D1 r Ä Â Ã ˇ q N A ˇwQ i tj ˇ Pi t t m ; w Q it C N5 ˇ ˇ ˇ Pi t ÂQ A C Pi t .t m //ˇˇ ˇ Ii t .t m / Á  Pi t t m ;  Ii t .t m /  Pi t t m ; 149 i t m/ QA QA w i t . 
A it tm Ä w Pi t t m ; QA w it QA w it QA w it r C qN N5 r C r C à qN N5 qN N5 à C Pi t .t m / à C Pi t .t m / Note that r Ã Ä Â N X T ˇ X ˇ q N A ˇwQ i tj ˇ Pi t t m ; w Q it C N5  Pi t t m ; r QA w it qN N5 C i D1 tD1 N X T ˇ X D i D1 t D1 r Ã Ä Â ˇ q N A ˇwQ i tj ˇ Fi t wi t .t m1 ˇ / ri C w Q it C o N5  QA w it Fi t wi t .t m1 ˇ o / ri N X T X ÄC à QA w it r i D1 tD1 r C qN N5 à qN Ä C qN N 3=2 D o .N / N5 Hence, for all N sufficiently large, INj 2 Ä PN1 mD1 P r Ä Â Ã T ˇ X ˇ qN A ˇ ˇ Q it C wQ i tj Ii t t m ; w ˛mi D N5 PN iD1 ˛mi  Ii t .t m / Pi t t m ; N 4 Á where QA w it r C tD1 qN N5 à C Pi t .t m / Since ˛mi are independent bounded random variables with mean zero, similarly as in the evaluation of INj1 ; we can show that V .˛mi / Ä C T X à Âq A 5 Q it qN =N w Ä C qN N 5=2 E tD1 Applying Bernstein’s inequality, 0 N1 N X X @ P ˛mi mD1 i D1 1 N A Ä N1 exp 4 N 2 2 =32 ! C qN N 3=2 C CN Ä N1 exp . CN / Ä C exp .C qN log N CN / Therefore, we have X INj1 C INj 2 Ä C exp .log pN C C qN log .N / CN / D o .1/ j 2J3 since N 1 1 2q2 N log N D o .1/ is implied by conditions given. Hence the result. 
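The two concentration lemmas above share one mechanism: a union bound over the \(p_N\) candidate columns combined with an exponential tail bound (Hoeffding or Bernstein) for each bounded, mean-zero column average, which makes the maximum of order \(\sqrt{\log p_N / N}\) and hence \(o_p(\lambda)\) whenever \(\log p_N = o(N\lambda^2)\). A minimal numerical sketch of this rate (illustrative only; the Rademacher design and names such as `p_N` and `max_avg` are assumptions of the example, not objects from the text):

```python
import numpy as np

# Hoeffding + union bound: for p_N columns of i.i.d. scores bounded in [-1, 1]
# with mean zero, P(max_j |avg_j| > t) <= 2 p_N exp(-N t^2 / 2), so the maximum
# of the column averages concentrates at rate sqrt(2 log(2 p_N) / N).
rng = np.random.default_rng(0)
N, p_N = 2000, 500
V = rng.choice([-1.0, 1.0], size=(N, p_N))   # bounded, mean-zero scores
max_avg = np.abs(V.mean(axis=0)).max()       # max_j |N^{-1} sum_i v_ij|
bound = np.sqrt(2 * np.log(2 * p_N) / N)     # high-probability envelope
# The event {max_avg <= 2 * bound} fails with probability at most (2 p_N)^(-3).
print(max_avg, 2 * bound)
```

For fixed \(\log p_N / N\), the envelope shrinks at the \(\sqrt{\log p_N / N}\) rate as \(N\) grows, which is exactly why the penalty level \(\lambda\) only needs to dominate \(\sqrt{\log p_N / N}\) for the irrelevant columns in \(J_3\) to be screened out.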
150 C.2 Appendix: Supplementary Tables Figure A.1 Pooled Birth Weights 151 Table A.1 Estimator performance, DGP 1, ˇ2 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 -0.0005 -0.0006 -0.0006 -0.0006 -0.0000 -0.0006 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0217 0.0216 0.0216 0.0216 0.0217 0.0216 0.0111 0.0112 0.0112 0.0112 0.0112 0.0112 0.0216 0.0216 0.0216 0.0216 0.0217 0.0216 0.0111 0.0112 0.0112 0.0112 0.0112 0.0112 0.0010 0.0011 0.0011 0.0011 0.0019 0.0011 0.0012 0.0011 0.0011 0.0011 0.0011 0.0012 0.0356 0.0357 0.0357 0.0357 0.0355 0.0358 0.0185 0.0186 0.0186 0.0186 0.0183 0.0186 0.0356 0.0357 0.0357 0.0357 0.0356 0.0358 0.0186 0.0186 0.0186 0.0186 0.0183 0.0186 -0.0003 -0.0003 -0.0003 -0.0003 -0.0000 -0.0003 -0.0000 -0.0001 -0.0001 -0.0001 -0.0001 -0.0001 0.0215 0.0215 0.0215 0.0215 0.0213 0.0215 0.0111 0.0111 0.0111 0.0111 0.0110 0.0111 0.0215 0.0215 0.0215 0.0215 0.0212 0.0215 0.0111 0.0111 0.0111 0.0111 0.0110 0.0111 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0009 0.0008 0.0008 0.0008 0.0010 0.0008 0.0002 0.0002 0.0002 0.0002 0.0001 0.0002 0.0201 0.0200 0.0200 0.0200 0.0205 0.0200 0.0110 0.0110 0.0110 0.0110 0.0109 0.0110 0.0201 0.0200 0.0200 0.0200 0.0205 0.0200 0.0110 0.0110 0.0110 0.0110 0.0109 0.0110 -0.0005 -0.0008 -0.0008 -0.0008 -0.0002 -0.0007 0.0008 0.0007 0.0007 0.0007 0.0010 0.0007 0.0347 0.0346 0.0346 0.0346 0.0346 0.0346 0.0196 0.0196 0.0196 0.0196 0.0196 0.0196 0.0347 0.0346 0.0346 0.0346 0.0346 0.0346 0.0196 0.0196 0.0196 0.0196 0.0196 0.0196 0.0010 0.0009 0.0009 0.0009 0.0008 0.0009 0.0004 0.0005 0.0005 0.0005 0.0007 0.0005 0.0195 0.0196 0.0196 0.0196 0.0191 0.0196 0.0115 0.0116 0.0116 0.0116 
0.0117 0.0116 0.0195 0.0196 0.0196 0.0196 0.0191 0.0196 0.0115 0.0116 0.0116 0.0116 0.0117 0.0116 152 Table A.2 Estimator performance, DGP 2, ˇ2 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0105 0.0085 0.0082 0.0085 0.0186 0.0119 0.0057 0.0045 0.0045 0.0045 0.0081 0.0068 0.1643 0.1642 0.1638 0.1642 0.1665 0.1648 0.0887 0.0882 0.0883 0.0882 0.0878 0.0895 0.1645 0.1643 0.1639 0.1643 0.1674 0.1652 0.0889 0.0883 0.0884 0.0883 0.0882 0.0897 -0.0078 -0.0021 -0.0075 -0.0021 -0.0041 -0.0039 0.0093 0.0101 0.0091 0.0101 0.0080 0.0078 0.2587 0.2635 0.2587 0.2635 0.2580 0.2579 0.1473 0.1479 0.1474 0.1479 0.1478 0.1477 0.2587 0.2634 0.2587 0.2634 0.2579 0.2578 0.1476 0.1481 0.1476 0.1481 0.1480 0.1479 -0.0078 -0.0063 -0.0065 -0.0063 -0.0119 -0.0080 -0.0054 -0.0049 -0.0048 -0.0049 -0.0067 -0.0045 0.1596 0.1621 0.1622 0.1621 0.1598 0.1610 0.0876 0.0857 0.0856 0.0857 0.0878 0.0876 0.1597 0.1622 0.1622 0.1622 0.1601 0.1612 0.0878 0.0858 0.0857 0.0858 0.0880 0.0876 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0238 0.0269 0.0266 1.0523 0.0240 0.0237 0.0004 0.0011 0.0006 0.0014 0.0004 0.0015 0.1623 0.1639 0.1640 0.4443 0.1593 0.1613 0.0891 0.0886 0.0886 0.0886 0.0878 0.0881 0.1640 0.1660 0.1660 1.1421 0.1610 0.1629 0.0890 0.0885 0.0886 0.0886 0.0878 0.0880 0.0129 0.0243 0.0135 0.0239 0.0055 0.0059 -0.0078 -0.0061 -0.0078 -0.0061 -0.0097 -0.0098 0.2653 0.2830 0.2650 0.2825 0.2659 0.2659 0.1433 0.1445 0.1433 0.1445 0.1432 0.1434 0.2655 0.2839 0.2652 0.2834 0.2658 0.2658 0.1434 0.1446 0.1434 0.1446 0.1435 0.1437 -0.0121 -0.0089 -0.0109 -0.0036 -0.0201 -0.0143 0.0047 0.0049 0.0050 0.0050 0.0035 0.0040 0.1608 0.1644 0.1622 0.1691 0.1646 
0.1623 0.0865 0.0859 0.0859 0.0862 0.0867 0.0863 0.1612 0.1646 0.1625 0.1691 0.1657 0.1628 0.0866 0.0860 0.0860 0.0863 0.0867 0.0863 153 Table A.3 Selection Performance, DGP 3 D 0:1 D 0:5 D 0:9 Method pN qo N IC TV FV True TV FV True TV FV True gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund 136 136 136 136 136 136 136 136 136 136 136 136 6 6 6 6 6 6 6 6 6 6 6 6 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 5.14 4.86 4.64 4.42 5.15 4.76 5.52 5.53 5.15 4.85 5.62 5.52 2.61 1.52 1.36 1.12 4.33 1.42 1.81 1.54 1.05 1.15 3.06 1.53 0.07 0.07 0.07 0.07 0.06 0.07 0.19 0.20 0.20 0.20 0.13 0.20 4.98 4.74 4.71 4.64 5.30 4.81 5.89 5.51 5.41 5.35 5.89 5.83 2.19 1.58 1.49 1.36 4.83 1.82 0.93 0.65 0.63 0.65 4.38 0.74 0.10 0.11 0.11 0.11 0.07 0.11 0.42 0.46 0.46 0.46 0.23 0.45 5.72 5.73 5.70 5.37 5.43 5.72 5.92 5.96 5.92 5.91 5.79 5.96 0.80 0.36 0.31 0.44 3.62 0.33 0.28 0.12 0.09 0.09 3.35 0.12 0.59 0.71 0.71 0.67 0.16 0.71 0.83 0.92 0.92 0.92 0.23 0.92 gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham 102 102 102 102 102 102 102 102 102 102 102 102 18 18 18 18 18 18 18 18 18 18 18 18 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 14.89 14.31 13.49 2.11 15.02 14.11 16.61 16.39 15.66 15.29 16.50 16.35 5.64 4.51 4.54 0.40 9.44 4.44 3.98 2.45 2.43 2.50 7.18 2.36 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.05 0.05 0.05 0.02 0.05 13.46 12.65 12.52 8.89 14.11 12.96 15.59 14.34 14.11 13.43 16.63 14.92 8.86 5.84 5.61 3.89 11.92 6.95 5.12 4.67 4.71 4.70 7.98 4.67 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.71 14.22 13.32 2.76 14.88 14.08 16.89 16.62 15.91 15.40 16.65 16.57 6.45 5.04 4.87 5.65 10.49 4.96 3.31 2.24 2.20 2.44 7.30 2.15 0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.10 0.10 0.10 0.02 0.10 154 Table A.4 Estimator performance, DGP 3, ˇ1 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund 
gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0015 0.0005 0.0004 0.0074 0.0016 0.0005 0.0007 0.0007 0.0008 0.0008 0.0009 0.0007 0.0210 0.0212 0.0214 0.0404 0.0207 0.0214 0.0116 0.0116 0.0120 0.0123 0.0117 0.0116 0.0210 0.0212 0.0214 0.0411 0.0208 0.0214 0.0116 0.0116 0.0120 0.0123 0.0117 0.0116 0.0016 0.0012 0.0009 0.0002 0.0021 0.0013 0.0016 0.0015 0.0013 0.0013 0.0015 0.0016 0.0364 0.0366 0.0363 0.0359 0.0364 0.0362 0.0191 0.0192 0.0192 0.0192 0.0188 0.0191 0.0365 0.0366 0.0363 0.0359 0.0365 0.0362 0.0191 0.0193 0.0193 0.0193 0.0189 0.0192 -0.0001 -0.0001 -0.0002 0.0100 0.0001 -0.0001 -0.0000 0.0000 -0.0001 -0.0001 -0.0001 0.0000 0.0200 0.0203 0.0204 0.0419 0.0202 0.0204 0.0115 0.0116 0.0115 0.0115 0.0115 0.0116 0.0200 0.0203 0.0204 0.0431 0.0202 0.0203 0.0115 0.0116 0.0115 0.0115 0.0115 0.0116 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0012 0.0010 0.0011 0.8448 0.0006 0.0010 0.0001 0.0001 0.0002 0.0010 -0.0000 0.0001 0.0211 0.0215 0.0235 0.5553 0.0209 0.0220 0.0116 0.0118 0.0119 0.0141 0.0116 0.0117 0.0212 0.0216 0.0235 1.0109 0.0209 0.0220 0.0116 0.0118 0.0119 0.0142 0.0116 0.0117 0.0033 0.0004 -0.0003 0.0629 0.0037 0.0017 0.0004 0.0001 0.0002 -0.0004 0.0004 0.0003 0.0332 0.0335 0.0331 0.0530 0.0332 0.0336 0.0190 0.0189 0.0189 0.0186 0.0191 0.0189 0.0333 0.0335 0.0331 0.0823 0.0334 0.0337 0.0190 0.0189 0.0189 0.0186 0.0191 0.0189 -0.0004 -0.0007 -0.0002 0.1379 -0.0005 -0.0008 0.0001 0.0002 0.0002 0.0039 0.0002 0.0002 0.0226 0.0229 0.0247 0.0987 0.0219 0.0234 0.0110 0.0110 0.0114 0.0249 0.0110 0.0111 0.0226 0.0229 0.0247 0.1695 0.0219 0.0234 0.0110 0.0110 0.0114 0.0252 0.0110 0.0111 155 Table A.5 Estimator performance, DGP 3, ˇ2 D 0:1 Method N gMund gMund gMund gMund gMund 
gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0015 0.0011 0.0004 0.0060 0.0012 0.0008 0.0005 0.0005 0.0004 0.0004 0.0005 0.0005 0.0215 0.0219 0.0222 0.0404 0.0211 0.0222 0.0117 0.0117 0.0117 0.0118 0.0117 0.0117 0.0215 0.0219 0.0222 0.0409 0.0211 0.0222 0.0117 0.0117 0.0117 0.0118 0.0117 0.0117 0.0015 0.0008 0.0006 0.0004 0.0015 0.0011 0.0001 -0.0002 -0.0001 -0.0001 0.0002 0.0001 0.0355 0.0356 0.0355 0.0352 0.0359 0.0355 0.0195 0.0195 0.0195 0.0195 0.0196 0.0196 0.0355 0.0356 0.0355 0.0352 0.0359 0.0355 0.0195 0.0195 0.0195 0.0195 0.0196 0.0196 0.0010 0.0010 0.0011 0.0121 0.0011 0.0010 0.0011 0.0011 0.0010 0.0010 0.0011 0.0011 0.0212 0.0211 0.0214 0.0443 0.0209 0.0212 0.0114 0.0115 0.0114 0.0114 0.0115 0.0115 0.0212 0.0211 0.0214 0.0459 0.0209 0.0212 0.0115 0.0115 0.0114 0.0114 0.0116 0.0115 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 -0.0002 0.0005 0.0009 0.8354 -0.0005 0.0008 -0.0006 -0.0004 -0.0005 0.0004 -0.0007 -0.0006 0.0216 0.0223 0.0236 0.5497 0.0213 0.0227 0.0118 0.0119 0.0120 0.0148 0.0120 0.0119 0.0216 0.0223 0.0237 0.9999 0.0213 0.0227 0.0119 0.0119 0.0120 0.0148 0.0120 0.0119 0.0037 0.0011 0.0006 0.0606 0.0044 0.0022 0.0008 0.0007 0.0008 0.0005 0.0008 0.0007 0.0345 0.0329 0.0329 0.0533 0.0348 0.0333 0.0191 0.0190 0.0189 0.0191 0.0191 0.0192 0.0347 0.0329 0.0329 0.0806 0.0350 0.0334 0.0191 0.0190 0.0189 0.0191 0.0191 0.0192 -0.0001 0.0003 0.0004 0.1367 -0.0005 0.0004 0.0002 0.0000 0.0002 0.0040 -0.0000 0.0000 0.0220 0.0227 0.0240 0.0998 0.0217 0.0227 0.0113 0.0113 0.0115 0.0264 0.0115 0.0113 0.0220 0.0227 0.0240 0.1692 0.0217 0.0227 0.0113 0.0113 0.0115 0.0267 0.0115 0.0113 156 Table A.6 Selection Performance, DGP 4 D 0:1 D 0:5 D 0:9 Method pN qo N IC 
TV FV True TV FV True TV FV True gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund 136 136 136 136 136 136 136 136 136 136 136 136 6 6 6 6 6 6 6 6 6 6 6 6 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 3.41 1.65 3.08 1.38 4.20 4.04 4.11 3.17 4.00 2.54 4.70 4.57 2.23 1.01 1.63 0.94 7.48 5.49 2.40 1.34 2.06 1.11 8.67 6.39 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.67 1.17 3.21 1.66 3.62 4.05 3.65 1.97 4.02 3.23 4.52 4.94 2.40 1.05 4.42 1.30 7.22 11.82 2.52 1.43 4.30 1.95 8.66 16.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.71 2.31 3.60 1.58 3.99 3.90 4.38 3.90 4.30 3.66 4.66 4.56 3.36 2.06 2.78 1.93 10.00 7.45 3.13 2.04 2.60 2.02 10.86 7.98 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham 102 102 102 102 102 102 102 102 102 102 102 102 18 18 18 18 18 18 18 18 18 18 18 18 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 6.56 1.39 5.75 0.46 9.18 8.53 9.79 4.52 8.85 3.99 11.88 11.55 4.06 1.13 3.46 0.43 8.27 6.57 5.52 2.81 4.85 2.66 10.11 8.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.83 0.93 6.11 3.37 7.31 8.69 7.84 3.39 9.34 4.61 10.41 11.45 3.82 0.88 5.06 2.77 6.96 10.22 5.37 3.15 6.88 3.66 9.05 12.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.83 3.18 6.84 2.86 9.79 9.44 11.51 7.19 11.15 5.12 12.38 12.17 7.27 4.04 6.49 3.79 11.46 10.01 7.60 6.56 7.29 6.07 12.00 10.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 157 Table A.7 Estimator performance, DGP 4, ˇ1 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 
AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0259 0.0796 0.0322 0.1409 0.0177 0.0177 0.0060 0.0443 0.0086 0.0625 0.0053 0.0069 0.3429 0.3854 0.3489 0.4682 0.3421 0.3431 0.1788 0.1950 0.1805 0.1943 0.1760 0.1783 0.3437 0.3933 0.3503 0.4887 0.3424 0.3434 0.1788 0.1999 0.1806 0.2040 0.1760 0.1784 0.0434 0.1254 0.0324 0.0866 0.0293 0.0170 0.0186 0.0763 0.0100 0.0313 0.0063 0.0086 0.5665 0.6198 0.5584 0.5692 0.5506 0.5498 0.3064 0.3240 0.3030 0.3152 0.3036 0.3025 0.5678 0.6321 0.5591 0.5755 0.5511 0.5498 0.3068 0.3327 0.3030 0.3166 0.3035 0.3025 -0.0069 0.0966 0.0008 0.1341 0.0005 -0.0009 -0.0023 0.0088 -0.0027 0.0336 -0.0030 -0.0031 0.3193 0.3656 0.3270 0.3740 0.3233 0.3214 0.1883 0.1979 0.1877 0.2165 0.1929 0.1940 0.3192 0.3780 0.3268 0.3971 0.3232 0.3213 0.1882 0.1980 0.1876 0.2190 0.1928 0.1940 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICLP BIC BICLP AIC1 AIC2 QBIC QBICLP BIC BICLP AIC1 AIC2 0.1006 0.9157 0.1263 1.2698 0.0437 0.0517 0.0305 0.0775 0.0426 0.0831 0.0098 0.0128 0.3474 0.6470 0.3550 0.5288 0.3407 0.3419 0.1835 0.1901 0.1863 0.1974 0.1815 0.1810 0.3615 1.1210 0.3766 1.3754 0.3434 0.3456 0.1859 0.2052 0.1910 0.2140 0.1817 0.1814 0.1191 1.3293 0.0694 0.3097 0.0429 0.0225 0.0531 0.1170 0.0348 0.0951 0.0276 0.0230 0.5735 0.8436 0.5610 0.7033 0.5478 0.5443 0.3109 0.3379 0.3064 0.3092 0.3046 0.3024 0.5855 1.5742 0.5650 0.7681 0.5492 0.5445 0.3153 0.3575 0.3082 0.3234 0.3057 0.3032 0.0432 0.1974 0.0725 0.2272 0.0062 0.0107 0.0011 0.1160 0.0036 0.1638 -0.0052 -0.0028 0.3505 0.4299 0.3544 0.4578 0.3382 0.3414 0.1881 0.2168 0.1900 0.2254 0.1890 0.1902 0.3530 0.4728 0.3615 0.5109 0.3381 0.3414 0.1880 0.2458 0.1899 0.2785 0.1890 0.1902 158 Table A.8 Estimator performance, DGP 4, ˇ2 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL 
BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0423 0.0882 0.0542 0.1200 0.0342 0.0332 0.0148 0.0477 0.0164 0.0637 0.0159 0.0170 0.3284 0.3400 0.3286 0.3883 0.3250 0.3238 0.1835 0.1956 0.1861 0.1998 0.1819 0.1829 0.3309 0.3511 0.3329 0.4062 0.3267 0.3253 0.1840 0.2012 0.1867 0.2096 0.1825 0.1836 0.0432 0.1255 0.0313 0.0873 0.0288 0.0177 0.0238 0.0883 0.0175 0.0387 0.0181 0.0162 0.5459 0.5864 0.5319 0.5363 0.5270 0.5267 0.3133 0.3228 0.3102 0.3205 0.3070 0.3046 0.5474 0.5993 0.5325 0.5431 0.5275 0.5267 0.3141 0.3345 0.3105 0.3227 0.3074 0.3049 0.0055 0.1222 0.0072 0.1595 -0.0005 -0.0004 0.0051 0.0160 0.0047 0.0433 0.0051 0.0042 0.3433 0.3856 0.3433 0.3912 0.3371 0.3392 0.1823 0.1915 0.1799 0.2108 0.1783 0.1800 0.3432 0.4043 0.3432 0.4223 0.3369 0.3390 0.1823 0.1921 0.1798 0.2151 0.1783 0.1799 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICLP BIC BICLP AIC1 AIC2 QBIC QBICLP BIC BICLP AIC1 AIC2 0.0723 0.8832 0.0982 1.2474 0.0127 0.0284 0.0249 0.0757 0.0387 0.0784 0.0060 0.0056 0.3467 0.6648 0.3581 0.5287 0.3366 0.3383 0.1887 0.1926 0.1891 0.1966 0.1902 0.1915 0.3540 1.1052 0.3712 1.3547 0.3367 0.3393 0.1902 0.2068 0.1930 0.2116 0.1902 0.1915 0.0962 1.2828 0.0447 0.2693 0.0280 0.0091 0.0317 0.0948 0.0171 0.0794 0.0111 0.0068 0.5598 0.8614 0.5466 0.6714 0.5434 0.5350 0.3141 0.3322 0.3109 0.3121 0.3101 0.3080 0.5677 1.5449 0.5481 0.7230 0.5439 0.5348 0.3156 0.3453 0.3113 0.3219 0.3102 0.3079 0.0496 0.1973 0.0728 0.2520 0.0062 0.0071 -0.0009 0.1151 0.0053 0.1628 -0.0059 -0.0054 0.3642 0.4430 0.3712 0.4945 0.3535 0.3532 0.1844 0.2145 0.1856 0.2199 0.1852 0.1850 0.3674 0.4847 0.3781 0.5548 0.3534 0.3531 0.1843 0.2433 0.1855 0.2735 0.1852 0.1850 159 Table A.9 Selection Performance, DGP 5 D 0:1 D 0:5 D 0:9 Method pN qo N IC TV FV True TV FV True TV FV True gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund 136 136 136 136 136 136 136 136 136 136 136 136 18 18 18 18 18 18 18 18 18 18 18 18 300 300 300 300 300 300 1000 1000 1000 1000 
1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 7.54 3.20 6.87 2.56 9.21 8.72 9.65 7.43 9.23 5.93 11.72 11.22 3.67 1.44 3.08 1.22 7.09 5.70 4.82 2.17 4.20 1.49 8.97 7.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.01 2.35 7.26 3.65 8.13 9.23 8.44 4.56 9.45 7.17 10.65 11.85 3.88 1.75 5.30 2.39 6.76 9.66 4.93 2.52 6.24 3.81 8.74 12.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 8.77 4.20 8.00 2.92 10.57 10.11 12.12 7.87 11.72 6.99 13.08 12.87 6.34 4.30 5.70 3.82 10.81 9.03 7.14 5.17 6.67 5.08 11.36 9.64 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham 102 102 102 102 102 102 102 102 102 102 102 102 12 12 12 12 12 12 12 12 12 12 12 12 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 5.96 3.09 5.54 2.28 7.23 6.92 7.51 5.51 7.23 5.24 8.54 8.31 3.42 1.08 2.72 0.83 8.79 6.75 4.17 1.96 3.71 1.55 10.60 8.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.85 2.08 5.67 3.59 6.39 7.27 6.53 4.61 7.36 5.40 8.20 9.03 3.34 1.20 4.96 2.12 7.43 11.71 4.10 2.51 5.65 3.02 8.74 15.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.65 3.81 6.29 3.18 7.47 7.27 8.41 6.97 8.24 6.12 8.95 8.81 5.16 3.47 4.63 3.34 9.64 7.80 5.03 3.99 4.64 3.84 10.53 8.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 160 Table A.10 Estimator performance, DGP 5, ˇ1 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0459 0.1934 0.0575 0.3393 0.0346 0.0371 0.0152 0.0339 0.0155 0.0513 0.0153 0.0151 0.3642 0.4944 0.3677 0.6274 0.3594 0.3593 0.1861 0.1926 0.1856 0.2024 0.1836 0.1832 0.3669 0.5306 
0.3720 0.7130 0.3609 0.3610 0.1866 0.1955 0.1862 0.2087 0.1841 0.1838 -0.0024 0.1810 -0.0092 0.0659 -0.0166 -0.0193 0.0038 0.0632 0.0041 0.0134 0.0045 0.0044 0.5246 0.6825 0.5188 0.5413 0.5249 0.5259 0.3030 0.3224 0.3028 0.3038 0.3034 0.3056 0.5244 0.7057 0.5186 0.5451 0.5249 0.5260 0.3028 0.3284 0.3027 0.3040 0.3033 0.3055 0.0090 0.1306 0.0202 0.1669 0.0017 0.0042 -0.0013 0.0338 -0.0019 0.0634 -0.0054 -0.0040 0.3442 0.4318 0.3543 0.4681 0.3366 0.3374 0.1798 0.2153 0.1792 0.2334 0.1789 0.1786 0.3442 0.4510 0.3547 0.4968 0.3365 0.3372 0.1797 0.2179 0.1791 0.2418 0.1789 0.1786 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0202 0.2321 0.0277 0.3751 0.0009 0.0091 0.0141 0.0255 0.0161 0.0256 0.0137 0.0100 0.3090 0.4315 0.3097 0.4462 0.3242 0.3209 0.1704 0.1691 0.1713 0.1704 0.1785 0.1755 0.3095 0.4898 0.3108 0.5828 0.3240 0.3208 0.1709 0.1710 0.1719 0.1722 0.1790 0.1757 0.0732 0.4136 0.0538 0.1377 0.0529 0.0462 0.0226 0.0413 0.0177 0.0317 0.0210 0.0232 0.5035 0.6022 0.4978 0.5489 0.5055 0.5086 0.2826 0.2899 0.2830 0.2777 0.2877 0.2894 0.5086 0.7303 0.5005 0.5657 0.5080 0.5104 0.2834 0.2927 0.2834 0.2794 0.2883 0.2902 0.0128 0.0861 0.0221 0.1168 -0.0024 0.0023 -0.0037 0.0202 -0.0031 0.0359 -0.0052 -0.0051 0.3246 0.3633 0.3240 0.3952 0.3244 0.3201 0.1704 0.1730 0.1706 0.1766 0.1772 0.1763 0.3247 0.3732 0.3245 0.4119 0.3242 0.3199 0.1703 0.1741 0.1705 0.1801 0.1772 0.1763 161 Table A.11 Estimator performance, DGP 5, ˇ2 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0468 0.1815 0.0535 0.3173 0.0305 0.0372 0.0150 0.0308 0.0140 0.0473 0.0144 0.0164 0.3428 0.4963 0.3491 0.6270 0.3319 0.3360 0.1813 0.1889 0.1840 0.2013 0.1819 0.1817 
0.3458 0.5282 0.3530 0.7024 0.3331 0.3379 0.1818 0.1913 0.1844 0.2066 0.1824 0.1823 0.0297 0.1784 0.0139 0.0868 0.0090 0.0029 0.0081 0.0646 0.0065 0.0132 0.0069 0.0072 0.5341 0.6529 0.5313 0.5283 0.5312 0.5309 0.2922 0.3073 0.2914 0.3000 0.2923 0.2953 0.5347 0.6765 0.5312 0.5351 0.5310 0.5306 0.2922 0.3138 0.2913 0.3001 0.2923 0.2952 -0.0059 0.1127 0.0040 0.1521 -0.0101 -0.0105 -0.0015 0.0344 -0.0019 0.0679 -0.0041 -0.0046 0.3508 0.4217 0.3557 0.4538 0.3416 0.3444 0.1866 0.2085 0.1854 0.2284 0.1848 0.1870 0.3506 0.4363 0.3555 0.4784 0.3416 0.3444 0.1866 0.2112 0.1853 0.2382 0.1848 0.1869 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0467 0.2552 0.0488 0.3876 0.0268 0.0289 0.0133 0.0269 0.0134 0.0294 0.0102 0.0084 0.3172 0.4229 0.3241 0.4361 0.3250 0.3246 0.1727 0.1707 0.1722 0.1744 0.1821 0.1818 0.3204 0.4937 0.3276 0.5833 0.3259 0.3257 0.1732 0.1727 0.1726 0.1768 0.1823 0.1819 0.0185 0.3702 0.0067 0.0823 -0.0023 -0.0070 0.0266 0.0450 0.0230 0.0342 0.0232 0.0226 0.5172 0.6165 0.5211 0.5595 0.5236 0.5261 0.2898 0.2962 0.2964 0.2886 0.3005 0.3061 0.5173 0.7188 0.5209 0.5652 0.5233 0.5259 0.2908 0.2995 0.2972 0.2905 0.3012 0.3068 0.0130 0.0782 0.0153 0.1005 0.0022 0.0048 0.0003 0.0240 -0.0007 0.0420 0.0010 0.0006 0.3223 0.3827 0.3261 0.4033 0.3292 0.3278 0.1708 0.1743 0.1683 0.1802 0.1768 0.1748 0.3224 0.3904 0.3263 0.4154 0.3291 0.3277 0.1708 0.1759 0.1682 0.1849 0.1767 0.1747

Table A.12 Birthweight, pooled quantile regression, all moms (unit: grams; standard errors in parentheses)

Quantile                               0.1              0.25             0.5              0.75             0.9
Smoke                                  -258.82 (6.57)   -250.75 (4.38)   -238.49 (3.79)   -234.27 (4.21)   -227.57 (5.09)
Male                                   93.59 (3.52)     114.26 (2.38)    131.27 (2.10)    145.08 (2.35)    157.47 (3.05)
Age                                    18.11 (3.73)     7.28 (2.41)      2.59 (2.10)      -0.10 (2.38)     -2.81 (2.89)
Age2                                   -0.34 (0.06)     -0.13 (0.04)     -0.04 (0.04)     0.02 (0.04)      0.09 (0.05)
Kessner index = 2                      -157.10 (8.50)   -108.39 (5.39)   -81.71 (4.56)    -66.64 (4.84)    -63.02 (6.22)
Kessner index = 3                      -297.62 (24.05)  -212.68 (15.39)  -149.85 (12.48)  -120.34 (11.17)  -91.31 (17.79)
No prenatal visit                      -118.36 (40.77)  0.77 (24.07)     7.87 (21.29)     30.44 (17.82)    25.41 (29.10)
First prenatal visit in 2nd trimester  139.51 (10.11)   93.27 (6.32)     72.21 (5.40)     60.64 (5.89)     57.45 (7.52)
First prenatal visit in 3rd trimester  282.43 (27.28)   194.33 (17.32)   119.48 (14.34)   89.75 (14.15)    56.94 (20.60)

Table A.13 Birthweight, quantile regression with Classical CRE, all moms (unit: grams; standard errors in parentheses)

Quantile                               0.1              0.25             0.5              0.75             0.9
Smoke                                  -144.27 (10.43)  -145.56 (7.51)   -147.59 (6.62)   -147.18 (7.56)   -147.37 (9.87)
Male                                   98.49 (4.59)     122.15 (3.30)    139.71 (2.91)    153.22 (3.32)    162.48 (4.34)
Age                                    -15.43 (8.69)    -20.22 (6.26)    -22.62 (5.51)    -17.66 (6.30)    -13.34 (8.22)
Age2                                   0.30 (0.12)      0.35 (0.08)      0.41 (0.07)      0.32 (0.08)      0.34 (0.11)
Kessner index = 2                      -139.90 (10.29)  -88.61 (7.06)    -60.36 (6.22)    -57.38 (7.11)    -50.37 (9.28)
Kessner index = 3                      -257.35 (22.65)  -173.21 (16.31)  -126.93 (14.36)  -93.61 (16.42)   -65.79 (21.42)
No prenatal visit                      -133.51 (36.44)  -31.59 (26.23)   3.45 (23.10)     14.55 (26.41)    57.79 (34.46)
First prenatal visit in 2nd trimester  109.02 (11.60)   66.49 (8.35)     43.00 (7.36)     40.34 (8.41)     39.56 (10.97)
First prenatal visit in 3rd trimester  209.63 (27.53)   137.03 (19.82)   84.09 (17.46)    60.87 (19.96)    55.49 (26.04)

BIBLIOGRAPHY

Abrevaya, Jason. 2006. Estimating the effect of smoking on birth outcomes using a matched panel data approach. Journal of Applied Econometrics 21(4). 489–519. doi:10.1002/jae.851.

Abrevaya, Jason & Christian M. Dahl. 2008. The Effects of Birth Inputs on Birthweight. Journal of Business & Economic Statistics 26(4). 379–397. doi:10.1198/073500107000000269.

Amemiya, Takeshi. 1985. Advanced econometrics. Harvard University Press.

Amemiya, Takeshi. 1978. The Estimation of a Simultaneous Equation Generalized Probit Model. Econometrica 46(5). 1193–1205. doi:10.2307/1911443.

Amemiya, Takeshi. 1979. The estimation of a simultaneous-equation Tobit model. International Economic Review 20(1). 169–181.

Belloni, Alexandre, D. Chen, Victor Chernozhukov & Christian Hansen. 2012a. Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain. Econometrica 80(6). 2369–2429. doi:10.3982/ECTA9626.

Belloni, Alexandre, Victor Chernozhukov & Christian Hansen. 2012b. Inference for high-dimensional sparse econometric models. arXiv:1201.0220v1. 1–41. doi:10.1017/CBO9781139060035.008.

Belloni, Alexandre & Victor Chernozhukov. 2011a. High dimensional sparse econometric models: An introduction. In Pierre Alquier, Eric Gautier & Gilles Stoltz (eds.), Inverse problems and high-dimensional estimation (Lecture Notes in Statistics 203), 121–156. Springer-Verlag. doi:10.1007/978-3-642-19989-9.

Belloni, Alexandre & Victor Chernozhukov. 2011b. l1-Penalized Quantile Regression in High-Dimensional Sparse Models. The Annals of Statistics 39(1). 82–130. doi:10.1214/10-AOS827.

Belloni, Alexandre, Victor Chernozhukov & Christian Hansen. 2014a. Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies 81(2). 608–650. doi:10.1093/restud/rdt044.

Belloni, Alexandre, Victor Chernozhukov, Christian Hansen & Damian Kozbur. 2014b. Inference in High Dimensional Panel Models with an Application to Gun Control. arXiv:1411.6507v1. 1–80.

Breusch, Trevor, Hailong Qian, Peter Schmidt & Donald Wyhowski. 1999. Redundancy of moment conditions. Journal of Econometrics 91(1). 89–111. doi:10.1016/S0304-4076(98)00050-5.

Bunea, Florentina, Alexandre Tsybakov & Marten Wegkamp. 2007. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics 1. 169–194. doi:10.1214/07-EJS008.
Canay, Ivan A. 2011. A simple approach to quantile regression for panel data. Econometrics Journal 14(3). 368–386. doi:10.1111/j.1368-423X.2011.00349.x.

Chamberlain, Gary. 1980. Analysis of Covariance with Qualitative Data. Review of Economic Studies 47(146). 225–238.

Chamberlain, Gary. 1984. Panel Data. In Zvi Griliches & Michael D. Intriligator (eds.), Handbook of econometrics, chap. 22, 1247–1318. Amsterdam: North-Holland.

Chernozhukov, Victor, Ivan Fernandez-Val, Jinyong Hahn & Whitney Newey. 2013. Average and Quantile Effects in Nonseparable Panel Models. Econometrica 81(2). 535–580. doi:10.3982/ECTA8405.

Crepon, Bruno, Francis Kramarz & Alain Trognon. 1997. Parameters of Interest, Nuisance Parameters and Orthogonality Conditions: An Application to Autoregressive Error Component Models. Journal of Econometrics 82(1). 135–156. doi:10.1016/S0304-4076(97)00054-7.

Fan, Jianqing & Runze Li. 2001. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association 96(456). 1348–1360.

Fan, Jianqing & Jinchi Lv. 2010. A Selective Overview of Variable Selection in High Dimensional Feature Space. Statistica Sinica 20(1). 101–148.

Fan, Jianqing & Jinchi Lv. 2011. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory 57(8). 5467–5484. doi:10.1109/TIT.2011.2158486.

Fan, Yingying & Cheng Yong Tang. 2013. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B 75(3). 531–552.

Friedberg, Stephen H., Arnold J. Insel & Lawrence E. Spence. 2003. Linear algebra. 4th edn. Prentice Hall.

Gouriéroux, Christian, Alain Monfort & Alain Trognon. 1984. Pseudo maximum likelihood methods: Theory. Econometrica 52(3). 681–700.

Graham, Bryan, Jinyong Hahn & James Powell. 2009a. A Quantile Correlated Random Coefficient Panel Data Model. Manuscript.

Graham, Bryan S., Jinyong Hahn & James L. Powell. 2009b. The incidental parameter problem in a non-differentiable panel data model. Economics Letters 105(2). 181–182. doi:10.1016/j.econlet.2009.07.015.

Harding, Matthew & Carlos Lamarche. 2016. Penalized Quantile Regression with Semiparametric Correlated Effects: An Application with Heterogeneous Preferences. Journal of Applied Econometrics. doi:10.1002/jae.2520.

Hastie, Trevor, Robert Tibshirani & Martin Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.

He, Xuming & Peide Shi. 1994. Convergence rate of B-spline estimators of nonparametric conditional quantile functions. Journal of Nonparametric Statistics 3(3–4). 299–308. doi:10.1080/10485259408832589.

He, Xuming & Peide Shi. 1996. Bivariate Tensor-Product B-Splines in a Partly Linear Model. Journal of Multivariate Analysis 58(2). 162–181. doi:10.1006/jmva.1996.0045.

He, Xuming, Zhong Yi Zhu & Wing Kam Fung. 2002. Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika 89(3). 579–590. doi:10.1093/biomet/89.3.579.

Horowitz, Joel L. & Sokbae Lee. 2005. Nonparametric Estimation of an Additive Quantile Regression Model. Journal of the American Statistical Association 100(472). 1238–1249. doi:10.1198/016214505000000583.

Hurwicz, Leonid. 1950. Generalization of the Concept of Identification. In Statistical inference in dynamic economic models. New York: John Wiley.

Jun, Sung Jae, Yoonseok Lee & Youngki Shin. 2016. Treatment Effects With Unobserved Heterogeneity: A Set Identification Approach. Journal of Business & Economic Statistics 34(2). 302–311. doi:10.1080/07350015.2015.1044008.

Kato, Kengo, Antonio F. Galvao Jr. & Gabriel V. Montes-Rojas. 2012. Asymptotics for panel quantile regression models with individual effects. Journal of Econometrics 170(1). 76–91. doi:10.1016/j.jeconom.2012.02.007.

Kim, Tae-Hwan & Halbert White. 2003. Estimation, Inference, and Specification Testing for Possibly Misspecified Quantile Regression. In Maximum likelihood estimation of misspecified models: Twenty years later (Advances in Econometrics 17), 107–132. Emerald Group Publishing Limited.

Kim, Yongdai & Sunghoon Kwon. 2012. Global optimality of nonconvex penalized estimators. Biometrika 99(2). 315–325. doi:10.1093/biomet/asr084.

Koenker, Roger. 2004. Quantile regression for longitudinal data. Journal of Multivariate Analysis 91(1). 74–89. doi:10.1016/j.jmva.2004.05.006.

Koenker, Roger. 2005. Quantile regression. Cambridge University Press.

Koenker, Roger & Gilbert Bassett Jr. 1982. Robust Tests for Heteroscedasticity Based on Regression Quantiles. Econometrica 50(1). 43–61. doi:10.2307/1912528.

Lamarche, Carlos. 2010. Robust penalized quantile regression estimation for panel data. Journal of Econometrics 157(2). 396–408. doi:10.1016/j.jeconom.2010.03.042.

Lee, Sokbae. 2007. Endogeneity in quantile regression models: A control function approach. Journal of Econometrics 141(2). 1131–1158. doi:10.1016/j.jeconom.2007.01.014.

Li, Qi & Jeffrey S. Racine. 2007. Nonparametric Econometrics: Theory and Practice. Princeton University Press.

Lv, Jinchi & Yingying Fan. 2009. A unified approach to model selection and sparse recovery using regularized least squares. Annals of Statistics 37(6A). 3498–3528. doi:10.1214/09-AOS683.

Matzkin, Rosa. 2007. Nonparametric identification. In Handbook of econometrics, vol. 6B. Elsevier.

Mundlak, Yair. 1978. On the pooling of time series and cross section data. Econometrica 46(1). 69–85.

Newey, Whitney K. 1987. Efficient Estimation of Limited Dependent Variable Models with Endogenous Explanatory Variables. Journal of Econometrics 36(3). 231–250. doi:10.1016/0304-4076(87)90001-7.

Newey, Whitney K. & Daniel McFadden. 1994. Large sample estimation and hypothesis testing. In Handbook of econometrics, vol. 4, chap. 36. doi:10.1016/S1573-4412(05)80005-4.

Noh, Hoh Suk & Byeong U. Park. 2010. Sparse Varying Coefficient Models for Longitudinal Data. Statistica Sinica 20. 1183–1202.

Parente, Paulo & J. M. C. Santos Silva. 2016. Quantile regression with clustered data. Journal of Econometric Methods 5(1). 1–15. doi:10.1515/jem-2014-0011.

Pollard, David. 1985. New Ways to Prove Central Limit Theorems. Econometric Theory 1(3). 295–313. doi:10.1017/S0266466600011233.

Rivers, Douglas & Quang H. Vuong. 1988. Limited information estimators and exogeneity tests for simultaneous probit models. Journal of Econometrics 39(3). 347–366. doi:10.1016/0304-4076(88)90063-2.

Rosen, Adam M. 2012. Set identification via quantile restrictions in short panels. Journal of Econometrics 166(1). 127–137. doi:10.1016/j.jeconom.2011.06.011.

Schumaker, Larry L. 2007. Spline functions: Basic theory. Cambridge University Press.

Sherwood, Ben & Lan Wang. 2016. Partially linear additive quantile regression in ultra-high dimension. Annals of Statistics 44(1). 288–317.

Tao, Pham Dinh & Le Thi Hoai An. 1997. Convex analysis approach to D.C. programming: Theory, algorithms and applications. Acta Mathematica Vietnamica 22(1). 289–355.

Tibshirani, Ryan J. 2013. The lasso problem and uniqueness. Electronic Journal of Statistics 7. 1456–1490.
doi:10.1214/13-EJS815. Turkington, Darrell A. 2013. Generalized vectorization, cross-products, and matrix calculus. Cambridge. Vershynin, R. 2010. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 http://arxiv.org/abs/1011.3027. Wang, Lan, Yichao Wu & Runze Li. 2012. Quantile Regression for Analyzing Heterogeneity in Ultra-high Dimension. Journal of the American Statistical Association 107(497). 214–222. doi:10.1080/01621459.2012.656014. http://www.pubmedcentral.nih.gov/ articlerender.fcgi?artid=3471246{&}tool=pmcentrez{&}rendertype=abstract. Welsh, A.H. 1989. On M-Processes and M-Estimation. The Annals of Statistics 17(1). 337–361. White, Halbert. 1994. Estimation, Inference and Specification Analysis. Cambridge University Press. Wooldridge, Jeffrey M. 2009. Correlated random effects models with unbalanced panels. Manuscript (version July 2009) Michigan State http://citeseerx.ist.psu.edu/viewdoc/ download?doi=10.1.1.472.4787{&}rep=rep1{&}type=pdf. Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. MIT Press 2nd edn. doi:10.1515/humr.2003.021. Wooldridge, Jeffrey M. 2014. Quasi-maximum Likelihood Estimation and Testing for Nonlinear Models with Endogenous Explanatory Variables. Journal of Econometrics 182(1). 226–234. doi: 10.1016/j.jeconom.2014.04.020. http://dx.doi.org/10.1016/j.jeconom.2014.04.020. Wooldridge, Jeffrey M. & Ying Zhu. 2016. L1-Regularized Quasi-Maximum likelihood Estimation and Inference in High-Dimensional Correlated Random Effects Probit. manuscript . Zhang, Cun-Hui. 2010. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics 38(2). 894–942. doi:10.1214/09-AOS729. Zhang, F. 2005. The Schur complement and its applications, vol. 4. Springer Science & Business Media. doi:10.1007/b105056. Zou, Hui. 2006. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101(476). 1418–1429. 
doi:10.1198/016214506000000735. 170