QUASI-MAXIMUM LIKELIHOOD ESTIMATION METHODS WITH A CONTROL FUNCTION APPROACH TO ENDOGENEITY

By

Doosoo Kim

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics – Doctor of Philosophy

2017

ABSTRACT

QUASI-MAXIMUM LIKELIHOOD ESTIMATION METHODS WITH A CONTROL FUNCTION APPROACH TO ENDOGENEITY

By

Doosoo Kim

One of the fundamental problems in econometrics is the potential endogeneity in non-experimental data. This work focuses on econometric methods taking a control function approach to endogeneity. The agenda consists of two parts. In the first part, I study a general class of conditional mean regression methods with a control function, and their relative asymptotic efficiency relationship. Unlike previous results in the literature, the likelihood for the response variables can be incorrect up to the regression functions. My results provide more practical and general guidance on the choice of an estimator. In the second part, I propose a generalized Chamberlain device as a control function approach to time-invariant endogeneity in linear panel data quantile regression models with a finite time dimension. The new correlated effect (CE) estimator has substantial advantages compared to existing methods: (i) it is free of an incidental parameters problem, (ii) the correlated effect is not restricted to a linear functional form, and (iii) arbitrary within-group dependence of regression errors is allowed. Due to the high dimensionality of the control function, a nonconvex penalized estimator is adopted for sparse model selection.

In the first chapter, I study the asymptotic relative efficiency relationship among estimators based on a quasi-limited information likelihood (QLIL). First, I show that there exists a generalized method of moments estimator (GMM-QLIML) based on all the available quasi-scores.
Second, the quasi-limited information maximum likelihood estimator (QLIML) is shown to be as efficient as GMM-QLIML under a set of generalized information matrix equalities. Third, I show that in a fully robust estimation of correctly specified conditional mean functions, QLIML is efficient relative to a two-step control function approach when the generalized linear model variance assumptions hold with a scaling restriction.

When a limited information structure is over-identified, the classical minimum distance (MD) estimator is often proposed as an estimation method. The purpose of the second chapter is to study its relative asymptotic efficiency relationship with respect to QLIML and the two-step control function (CF) approach. First, I show that the MD estimator is asymptotically efficient relative to the two other estimators. Second, I prove that the concentration of reduced form equation estimates does not affect the asymptotic efficiency of the structural parameter estimates in the MD estimation. Third, in a class of models, an if-and-only-if condition is derived for MD and the other estimators to be asymptotically equivalent under the null hypothesis of exogeneity.

In the third chapter, I propose a point-identifying restriction and estimation procedure for a linear panel data quantile regression model with a fixed time dimension. The proposed model restriction reasonably accounts for the τ-quantile-specific time-invariant heterogeneity, and allows arbitrary within-group dependence of regression errors. The generalized Chamberlain device is taken analogously as a control function to capture τ-quantile-specific time-invariant endogenous variations. Since the sieve-approximated control function is high-dimensional, the estimation procedure adopts penalization techniques under the sparsity assumption. A transformation of the sieve elements into a generalized Mundlak form is considered to make the sparsity assumption more plausible in some cases.
The empirical application to birth weight analysis demonstrates a convincing case where the proposed estimator works as intended in real data.

Copyright by
DOOSOO KIM
2017

Dedicated to my parents, and Kyuseon.

ACKNOWLEDGEMENTS

First of all, I'd like to thank my family members for their endless support. My wonderful parents and my great wife, Kyuseon (Kristy), always believed in me and encouraged me even when I was in deep trouble. All the support they have provided me over the years has made this work possible.

I would like to thank my advisor Jeffrey M. Wooldridge for his valuable advice and support. In addition to his insightful feedback on details of my research, his strong encouragement and optimistic view were essential ingredients for my work. He always led me to push my limits and achieve beyond my imagination. I deeply appreciate him giving me such great inspiration.

I am grateful to my dissertation committee members, Peter Schmidt and Kyoo il Kim, for their rigorous and helpful feedback. Thanks to their brilliant comments, I could significantly improve the quality of my work. They were also a huge inspiration to me at all times.

I appreciate the wonderful support from the Department of Economics at Michigan State University. The comfortable work environment the department provided was crucial. Special thanks to the Chairs of the Department and Graduate Directors: Carl Davidson, Tim Vogelsang, Leslie E. Papke, and Todd Elder. I am also grateful to the following university staff: Belen Feight, Jay Feight, Margaret Lynch, Lori Jean Nichols, and Dean Olson for their unfailing support and assistance.

Among the great Ph.D. students in the Department of Economics, very special gratitude goes to my study group members: Muzna Alvi, Patrick Burke, Annie Chou, Po-Chun Huang, Riju Joshi, and Danielle Kaminski. It was fantastic to have the opportunity to work and interact with them. The moments we shared together greatly enriched my life in East Lansing.
Thank you all!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS
CHAPTER 1  RELATIVE EFFICIENCY OF QUASI-LIMITED INFORMATION MAXIMUM LIKELIHOOD ESTIMATOR
  1.1 Introduction
  1.2 Model Restrictions
  1.3 Relative Efficiency Comparison
  1.4 Concluding Remarks
CHAPTER 2  EFFICIENT MINIMUM DISTANCE ESTIMATOR BASED ON QUASI-LIMITED INFORMATION LIKELIHOOD
  2.1 Introduction
  2.2 Model Restrictions
  2.3 Minimum Distance Estimators: MD/cMD-QLIML
  2.4 Example 1: Linear Regression Model with Endogenous Explanatory Variables
  2.5 Example 2: Probit with Endogenous Explanatory Variables
  2.6 Monte Carlo Simulation on Probit Model with EEV
CHAPTER 3  SHORT PANEL DATA QUANTILE REGRESSION MODEL WITH SPARSE CORRELATED EFFECTS
  3.1 Introduction
  3.2 Literature on Linear Panel Data Quantile Regression
  3.3 Identification
    3.3.1 Generalized Chamberlain Device
    3.3.2 Model Restriction and Identification
    3.3.3 Case of Unbalanced Panel Data with Time-constant Endogeneity
  3.4 Estimation
    3.4.1 Sieve-approximated Correlated Effect
    3.4.2 Penalized Estimation via Non-convex Penalty Functions
      3.4.2.1 Choice of Thresholding Parameter
      3.4.2.2 Computation of Variance Estimators
  3.5 Monte Carlo Simulation
  3.6 Application: The Effect of Smoking on Birth Outcomes
  3.7 Concluding Remarks
APPENDICES
  APPENDIX A  An Appendix for Chapter 1
  APPENDIX B  An Appendix for Chapter 2
  APPENDIX C  An Appendix for Chapter 3
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1  Root Mean Squared Error and Standard Deviation (η = 0.6)
Table 3.1  Selection Performance, DGP 1 and 2
Table 3.2  Estimator performance, DGP 1 and 2, β1
Table 3.3  Birthweight, mean and median regression, all moms (unit: grams)
Table 3.4  Birthweight, quantile regression with CE, all moms (unit: grams)
Table 3.5  Coefficient Estimates on 'Smoke' using Different ICs (unit: grams)
Table A.1  Estimator performance, DGP 1, β2
Table A.2  Estimator performance, DGP 2, β2
Table A.3  Selection Performance, DGP 3
Table A.4  Estimator performance, DGP 3, β1
Table A.5  Estimator performance, DGP 3, β2
Table A.6  Selection Performance, DGP 4
Table A.7  Estimator performance, DGP 4, β1
Table A.8  Estimator performance, DGP 4, β2
Table A.9  Selection Performance, DGP 5
Table A.10 Estimator performance, DGP 5, β1
Table A.11 Estimator performance, DGP 5, β2
Table A.12 Birthweight, pooled quantile regression, all moms (unit: grams)
Table A.13 Birthweight, quantile regression with Classical CRE, all moms (unit: grams)

LIST OF FIGURES

Figure A.1  Pooled Birth Weights
KEY TO ABBREVIATIONS

2SLS Two-Stage Least Squares
AGLS Amemiya's Generalized Least Squares
AIC Akaike Information Criterion
BIC Bayesian Information Criterion
CAN Consistent and Asymptotically Normal
CE Correlated Effect
CF Control Function
DGP Data Generating Process
GMM Generalized Method of Moments
LEF Linear Exponential Family
LIL Limited Information Likelihood
LIML Limited Information Maximum Likelihood
MCP Minimax Concave Penalty
MD Minimum Distance
QLIML Quasi-Limited Information Maximum Likelihood
QMLE Quasi-Maximum Likelihood Estimator
RMSE Root-Mean-Square Error
SCAD Smoothly Clipped Absolute Deviation
SD Standard Deviation

CHAPTER 1
RELATIVE EFFICIENCY OF QUASI-LIMITED INFORMATION MAXIMUM LIKELIHOOD ESTIMATOR

1.1 Introduction

Limited information likelihood (LIL)-based estimators have been widely used in instrumental variable estimation. The limited information maximum likelihood (LIML) estimator (Anderson and Rubin, 1949) and the two-stage least squares (2SLS) estimator (Theil, 1953; Basmann, 1957; Sargan, 1958) for linear models are workhorses of many empirical studies. In simultaneous Probit models, the analogously proposed LIML and two-stage conditional maximum likelihood estimators (Rivers and Vuong, 1988) are useful extensions of LIML and 2SLS to a nonlinear model. While correct specification of likelihoods has been assumed in the LIL literature, it is known that a certain class of maximum likelihood estimators has nice robustness against misspecification: the quasi-maximum likelihood estimator (QMLE) is fully robust for a correctly specified conditional mean if and only if the likelihood is in the linear exponential family (LEF), under mild regularity conditions (Gouriéroux, Monfort and Trognon, 1984; White, 1994). Based on this result, Wooldridge (2014) reinterprets the LIL as a quasi-limited information likelihood (QLIL) and expands its applicability, noting that correctly specified regression functions are the key assumptions for consistency in the LEF.
Apart from the robustness of QLIL-based estimators, their relative efficiency relationship is another important issue. Relative efficiency analyses of LIML or equivalent estimators in previous work assume away potentially misspecified likelihoods for both the structural and reduced form equations. When the likelihood is allowed to be misspecified, a relative efficiency comparison based on correct specification of the likelihood is no longer valid. Analysis accounting for potential misspecification of the likelihood is more useful to empirical researchers because economic theories usually do not imply a full characterization of distributions, and there is no solid reason to believe that QMLE achieves the same asymptotic efficiency as the maximum likelihood estimator.

The purpose of this chapter is to study the asymptotic relative efficiency relationship among estimators based on the QLIL. Considering a research question raised by Wooldridge (2014), I focus on sufficient conditions for relative efficiency of the QLIL maximizer with respect to the two-step conditional quasi-likelihood maximizer, which will be called the QLIML estimator and the control function (CF) approach, respectively. The CF estimator is naturally defined once we take the conventional decomposition of the QLIL into structural and reduced form components. The model restriction imposed on the QLIL is general enough to include nonlinear models and, in particular, misspecification of the likelihoods is allowed up to correctly specified regression functions when fully robust estimation is considered.

The main contributions of this chapter are the following. First, I show there exists a generalized method of moments estimator (GMM-QLIML) based on all the available quasi-scores. The asymptotic variance of the GMM-QLIML estimator constitutes a lower bound for those of QLIML and CF in the matrix positive semidefinite sense. Second, the QLIML estimator is proved to be as efficient as the GMM-QLIML estimator under a set of generalized information matrix equalities.
Third, the asymptotic equivalence of LIML and 2SLS is established via the linearity of the regression functions and the $L_2$ loss function incorporated in the normal density. This new proof clearly shows why the equivalence holds without normality or conditional homoskedasticity, which is often assumed in the assertion. Sufficient conditions for general equivalence of QLIML and CF are also found. Fourth, in fully robust estimation of correctly specified conditional mean functions, the QLIML estimator is shown to be efficient relative to the CF estimator if the generalized linear model variance assumptions hold with a scaling restriction. In particular, correctly specified conditional moments up to second order are sufficient.

The rest of this chapter is organized as follows. In Section 1.2, basic model restrictions are given with a GMM interpretation of the QLIML and CF estimators. In Section 1.3, the GMM-QLIML estimator is defined and relative efficiency results for the QLIML and CF estimators are presented. Section 1.4 contains concluding remarks.

1.2 Model Restrictions

Assume random sampling from a population. For a random draw $i$, consider the system of equations
\[
y_{i1} = f(y_{i2}, z_{i1}, u_{i1}; \theta_1, \theta_2) \tag{1.1}
\]
\[
y_{i2} = g(z_i, v_{i2}; \theta_2) \tag{1.2}
\]
where the functions $f$ and $g$ are known up to the $(p_1 + p_2) \times 1$ vector of parameters $\theta = (\theta_1', \theta_2')'$, $y_{i1}$ is a scalar response variable, $y_{i2}$ is a $1 \times r$ vector of potentially endogenous variables, $z_i = (z_{i1}, z_{i2})$ is a $1 \times k$ vector of included/excluded exogenous instruments with $k = k_1 + k_2$, and $(u_{i1}, v_{i2})$ is a $(1 + r) \times 1$ vector of unobservables. Under potentially incorrect distributional assumptions for $u_1$ and $v_2$, taking the log of the decomposed quasi-joint likelihood $l(y_{i1} \mid y_{i2}, z_i; \theta_1, \theta_2)\, l(y_{i2} \mid z_i; \theta_2)$ delivers the QLIL
\[
q_{QLIL} = q_1(y_{i1}, y_{i2}, z_i, \theta_1, \theta_2) + q_2(y_{i2}, z_i, \theta_2) \tag{1.3}
\]
which offers flexible model specifications. The decomposition is more of a 'composition' in the sense that $q_1$ and $q_2$ do not need to be derived from a single joint quasi-likelihood.
For example, a Poisson log-likelihood $q_1$ and a normal log-likelihood $q_2$ can be used as long as the quasi-likelihood-driven regression functions are correct: Wooldridge (2014) showed that, in the LEF, the key model restrictions for consistent estimation of the conditional mean $E_o[y_{i1} \mid y_{i2}, z_i]$ are
\[
E_o[y_{i1} \mid y_{i2}, z_i] = E_q[f(y_{i2}, z_{i1}, u_{i1}; \theta_{o1}, \theta_{o2}) \mid y_{i2}, z_i] \tag{1.4}
\]
\[
E_o[y_{i2} \mid z_i] = E_q[g(z_i, v_{i2}; \theta_{o2}) \mid z_i] \tag{1.5}
\]
where the subscripts 'o' and 'q' denote (expectation) operators based on the true and quasi-likelihood, respectively. As long as (1.4) and (1.5) hold, even failure of (1.1) is allowed in consistent estimation, as shown by Example 1.2.1 below. In the linear model with quasi-normality (Anderson and Rubin, 1949), linear projection operators $L_o[\,\cdot \mid \cdot\,]$ with appropriate regressors can replace the expectation operators $E_o[\,\cdot \mid \cdot\,]$ when the objects of interest are linear projections rather than conditional mean functions. The decomposed nature of the QLIL and correctly specified regression functions typically involve the existence of a control function. See Wooldridge (2014) for details. The following example demonstrates the derivation of the QLIL in linear and Probit models, and discusses their robustness properties.

Example 1.2.1 (models with quasi-normality of $(u_{i1}, v_{i2})$) Consider the following simultaneous equation systems:
\[
\text{Linear Model: } y_{i1} = y_{i2}\alpha + z_{i1}\delta_1 + u_{i1}, \quad y_{i2} = z_i\delta_2 + v_{i2} \tag{1.6}
\]
\[
\text{Probit Model: } y_{i1} = 1[y_{i2}\alpha + z_{i1}\delta_1 + u_{i1} > 0], \quad y_{i2} = z_i\delta_2 + v_{i2} \tag{1.7}
\]
Assume $(u_{i1}, v_{i2}) \mid z_i \sim_q N(0, \Sigma)$ where $V_q(u_{i1} \mid z_i) = \Sigma_{11}$, $\mathrm{cov}_q(u_{i1}, v_{i2} \mid z_i) = \Sigma_{12} = \Sigma_{21}'$, $V_q(v_{i2} \mid z_i) = \Sigma_{22}$, $\delta_2 = (\delta_{21}', \delta_{22}')'$ is a $k \times r$ matrix, and the other parameters are defined conformably. In the notation '$X \sim_q \Psi$', the subscript $q$ indicates that the distributional assumption '$X \sim \Psi$' is allowed to be incorrect and is used only for deriving the quasi-likelihood of $X$ or its transformation.
The decomposed quasi-likelihoods are easily derived noting that $e_{i1} \mid (y_{i2}, z_i) \sim_q N\left(0, \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$, where $e_{i1} \equiv u_{i1} - v_{i2}\Sigma_{22}^{-1}\Sigma_{21}$. In the Probit model, it is assumed that $V_q(e_{i1} \mid y_{i2}, z_i) = 1$ for normalization. The quasi-likelihoods for the linear and Probit models are given explicitly in Examples 1.3.5 and 1.3.15, respectively.

Concerning robustness properties, there are three things to be mentioned. First, since $q_{i1}$ and $q_{i2}$ in both models belong to the LEF, a correctly specified conditional mean can be consistently estimated by QLIL-based estimators regardless of the true distribution. Second, in the linear model, the conditional mean functions derived from the quasi-likelihood have an interpretation as true linear projections. In particular, the quasi-likelihood-based conditional mean of $y_{i1}$ conditioned on $(y_{i2}, z_i)$ can be regarded as the linear projection of $y_{i1}$ on $(y_{i2}, z_{i1}, v_{i2})$:
\[
E_q[y_{i1} \mid y_{i2}, z_i] = y_{i2}\alpha + z_{i1}\delta_1 + v_{i2}\Sigma_{22}^{-1}\Sigma_{21} = L_o[y_{i1} \mid y_{i2}, z_{i1}, v_{i2}]
\]
where $\Sigma_{22}^{-1}\Sigma_{21}$ can be reparameterized as $\eta$ for convenience. Since this interpretation is definitional through quasi-scores, even when the conditional mean functions are incorrectly specified, $(\alpha, \delta_1)$ is consistently estimated as linear projection coefficients under regularity conditions. Third, the $y_{i1}$ equation in (1.7) of the Probit model is not a restrictive condition for consistency. When $y_{i1}$ is a fractional response taking values in $[0, 1]$, the equation may not hold for some observations. Such failure of the $y_{i1}$ equation does not necessarily harm consistent estimation of the conditional mean function if the Probit response function is correct:
\[
E_q[y_{i1} \mid y_{i2}, z_i] = \Phi\left(y_{i2}\alpha + z_{i1}\delta_1 + v_{i2}\Sigma_{22}^{-1}\Sigma_{21}\right) = E_o[y_{i1} \mid y_{i2}, z_i]
\]
However, the Probit response function does not have the robust interpretation of a linear projection, as in the linear model, when it is incorrectly specified.
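To make the two-step logic of Example 1.2.1 concrete, here is a minimal numpy simulation sketch of the linear-model control function estimator. This is illustrative only (the DGP, sample size, and variable names are mine, not the dissertation's): the first-stage OLS residuals enter the second stage as an extra regressor, and their coefficient absorbs the endogenous variation $v_{i2}\Sigma_{22}^{-1}\Sigma_{21}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Instruments: z1 is included in the structural equation, z2 is excluded
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
Z = np.column_stack([z1, z2])

# Correlated unobservables create endogeneity: cov(u1, v2) = 0.5
u1, v2 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n).T

y2 = Z @ np.array([0.7, 1.2]) + v2   # reduced form, as in (1.6)
y1 = 1.0 * y2 + 0.5 * z1 + u1        # structural equation: alpha = 1, delta1 = 0.5

ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: reduced-form OLS; the residuals play the role of the control function
v2_hat = y2 - Z @ ols(Z, y2)

# Step 2: add v2_hat as a regressor; its coefficient estimates Sigma22^{-1} Sigma21
b_cf = ols(np.column_stack([y2, z1, v2_hat]), y1)

# Naive OLS that ignores endogeneity is inconsistent for alpha
b_ols = ols(np.column_stack([y2, z1]), y1)

print("CF  alpha_hat:", b_cf[0])   # close to 1.0
print("OLS alpha_hat:", b_ols[0])  # noticeably biased upward
```

With this DGP the naive OLS slope on $y_2$ is inflated by roughly $\mathrm{cov}(u_1, v_2)$ over the residual variance of $y_2$, while the CF estimate recovers the structural coefficient.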
Given the QLIL, the QLIML and CF estimators are defined as
\[
\hat\theta_{QLIML} = \arg\max_{\theta} \sum_{i=1}^N \left[ q_{i1}(\theta_1, \theta_2) + q_{i2}(\theta_2) \right]
\]
and
\[
\hat\theta_{2,CF} = \arg\max_{\theta_2} \sum_{i=1}^N q_{i2}(\theta_2), \qquad
\hat\theta_{1,CF} = \arg\max_{\theta_1} \sum_{i=1}^N q_{i1}\!\left(\theta_1, \hat\theta_{2,CF}\right),
\]
respectively. Focusing on the relative efficiency comparison of these two, it is assumed that both the QLIML and CF estimators are consistent and asymptotically normal (CAN) for the true parameter values. Also, we assume that the expected quasi-scores uniquely determine the true parameters so that the GMM interpretation of the QLIML and CF estimators is valid. This is a mild assumption since the necessity of the LEF for fully robust estimation is shown under enough differentiability of the likelihood and interiority of a population maximizer (White, 1994, Theorem 5.6). Consequently, the QLIML and CF estimators can be defined as GMM estimators based on the quasi-score moment conditions
\[
E_o \begin{bmatrix} \frac{\partial}{\partial \theta_1} q_{i1}(\theta_{o1}, \theta_{o2}) \\[4pt] \frac{\partial}{\partial \theta_2} q_{i1}(\theta_{o1}, \theta_{o2}) + \frac{\partial}{\partial \theta_2} q_{i2}(\theta_{o2}) \end{bmatrix} = 0
\quad \text{and} \quad
E_o \begin{bmatrix} \frac{\partial}{\partial \theta_1} q_{i1}(\theta_{o1}, \theta_{o2}) \\[4pt] \frac{\partial}{\partial \theta_2} q_{i2}(\theta_{o2}) \end{bmatrix} = 0,
\]
respectively. Appendix A.1 contains the relevant standard regularity conditions (Assumptions 1–12). These assumptions are maintained for simplicity. They can be relaxed, for example, to allow non-smooth $q_{i1}$ or $q_{i2}$ via smoothness in the limit and stochastic differentiability (Pollard, 1985).
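For intuition, the joint maximization defining QLIML can be mimicked numerically in the quasi-normal linear model of Example 1.2.1. The sketch below is mine, under an illustrative DGP, with the quasi-likelihood constants dropped and the reparameterization $\eta = \Sigma_{22}^{-1}\Sigma_{21}$; it is not the dissertation's implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 10_000
z1 = rng.normal(size=n)
z2 = rng.normal(size=(n, 2))
Z = np.column_stack([z1, z2])
u1, v2 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n).T
y2 = Z @ np.array([0.6, 1.0, -0.7]) + v2
y1 = 1.0 * y2 + 0.5 * z1 + u1   # true alpha = 1.0

def neg_qlil(p):
    # p = (alpha, delta1, eta, log_s1, delta2 (3 entries), log_s2)
    a, d1, eta, ls1 = p[:4]
    d2, ls2 = p[4:7], p[7]
    v = y2 - Z @ d2                       # reduced-form error v_i2(delta2)
    e = y1 - a * y2 - d1 * z1 - eta * v   # structural error net of the CF term
    s1, s2 = np.exp(ls1), np.exp(ls2)
    q1 = -0.5 * np.log(s1) - 0.5 * e**2 / s1   # quasi-normal q_i1 (constants dropped)
    q2 = -0.5 * np.log(s2) - 0.5 * v**2 / s2   # quasi-normal q_i2
    return -(q1.sum() + q2.sum()) / n          # QLIML maximizes q1 + q2 jointly

res = minimize(neg_qlil, np.zeros(8), method="BFGS")
alpha_qliml = res.x[0]
print("QLIML alpha_hat:", alpha_qliml)  # close to the true value 1.0
```

The CF estimator would instead maximize the two factors sequentially: first `q2` over `(delta2, log_s2)`, then `q1` over `(alpha, delta1, eta, log_s1)` holding the first-step estimates fixed.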
Recall the definition of linear independence in the context of moment function space along with its well-known relationship with variance matrix in the following remark. Definition 1.3.1 A set of scalar moment functions fhl .wi ; Â/gL lD1 is linearly independent at  if Á PL P lD1 ˛l . / hl .wi ;  / D 0 D 1 implies ˛l . / D 0 for all l where ˛l .Â/ is arbitrary real-valued function of Â: Remark 1.3.2 fhl .wi ; Â/gL lD1 is linearly independent at  if and only if the variance matrix of fhl .wi ;  /gL lD1 is invertible, assuming that second moments are finite. Now, consider stacking all available quasi-scores in (2.1): 3 2 @ q . ;  / 6 @Â1 i1 1 2 7 7 6 6 @ q . ;  / 7 6 @Â2 i1 1 2 7 5 4 @ q . / @ i 2 2 (1.8) 2 The vector of moment functions (1.8) constitutes, when taken summation or integral, all available first order conditions from factor-by-factor QLIL maximization problem. We might hope conducting efficient GMM on these moment functions yields an estimator efficient relative to QLIML and CF. However, it turns out that (1.8) typically has a singluar variance matrix since the @ q . ;  / is linearly dependent. The singularity is closely related to set of moment functions in @ i1 1 2 the fundamental reason why we need simultaneous equation system: the quasi-likelihood function qi1 alone cannot identify Âo1 and Âo2 in general: To avoid such linear dependence; a maximal 6 linearly idependent set in (1.8) can be used instead. Since moment functions in @Â@ qi1 .Â1 ; Â2 / and 1 @ @Â2 q2 .Â2 / are assumed to be linearly independent by rank condition of CF (Assumption 12), a maximal linearly idependent set can be found by extending the set of CF moment functions. 
Definition 1.3.3 GMM-QLIML is an efficient GMM estimator based on a maximal linearly independent set of moment functions at $(\theta_{o1}, \theta_{o2})$ in $\left( \frac{\partial q_1(\theta_1, \theta_2)}{\partial \theta_1}, \frac{\partial q_1(\theta_1, \theta_2)}{\partial \theta_2}, \frac{\partial q_2(\theta_2)}{\partial \theta_2} \right)$:
\[
\begin{bmatrix} \frac{\partial}{\partial \theta_1} q_1(\theta_1, \theta_2) \\[4pt] \frac{\partial}{\partial \theta_{22}} q_1(\theta_1, \theta_2) \\[4pt] \frac{\partial}{\partial \theta_2} q_2(\theta_2) \end{bmatrix} \tag{1.9}
\]
where $\theta_2 = (\theta_{21}', \theta_{22}')'$, $\theta_{21} \in \mathbb{R}^{p_{21}}$, $\theta_{22} \in \mathbb{R}^{p_{22}}$ with $p_2 = p_{21} + p_{22}$, and $\theta_{22}$ can be empty.

The following proposition shows that the GMM-QLIML estimator is asymptotically normal without additional model restrictions other than those of CF and QLIML.

Proposition 1.3.4 Under the regularity and identification conditions for CF and QLIML (Assumptions 1–12), the GMM-QLIML estimator is asymptotically normal:
\[
\sqrt{N}\left( \hat\theta_{GMM\text{-}QLIML} - \theta_o \right) \xrightarrow{d} N\left( 0, \left( A' B^{-1} A \right)^{-1} \right)
\]
where
\[
A = E \begin{bmatrix} \frac{\partial^2 q_{i1}(\theta_o)}{\partial \theta_1 \partial \theta'} \\[4pt] \frac{\partial^2 q_{i1}(\theta_o)}{\partial \theta_{22} \partial \theta'} \\[4pt] \frac{\partial^2 q_{i2}(\theta_{o2})}{\partial \theta_2 \partial \theta'} \end{bmatrix}
\quad \text{and} \quad
B = V \begin{pmatrix} \frac{\partial q_{i1}(\theta_o)}{\partial \theta_1} \\[4pt] \frac{\partial q_{i1}(\theta_o)}{\partial \theta_{22}} \\[4pt] \frac{\partial q_{i2}(\theta_{o2})}{\partial \theta_2} \end{pmatrix}
\]
Typically, the extra moment functions $\frac{\partial}{\partial \theta_{22}} q_1(\theta_1, \theta_2)$ are orthogonality conditions between the exogenous part of the structural error and the $(k_2 - r)$ overidentifying instruments. The example below shows the determination of $\frac{\partial}{\partial \theta_{22}} q_1(\theta_1, \theta_2)$ in the linear model setting.
The quasi-log-likelihoods are qi1 .Â1 ; Â2 / D qi 2 .Â2 / D 1 ln 2 2 k ln 2 2 1 1 ln 11j2 .Â/ ei .Â/2 11j2 .Â/ 1 2 2 1 1 ln j†22 j vi 2 .ı 2 / †221 vi 2 .ı 2 /0 2 2 and quasi-scores can be expressed as follows 2 .Â/ 1 ei .Â/ y02 6 11j2 6 1 0 6 @q1 6 11j2 .Â/ ei .Â/ z1 D6 6 @Â1 2 0 1 1 1 6 11j2 .Â/ hi .Â/ †22 †21 C 11j2 .Â/ ei .Â/ †22 vi 2 .ı 2 / 4 1 2 2 11j2 .Â/ hi .Â/ 2 h i 3 1 1 0 @q1 11j2 .Â/ ei .Â/ †22 †21 ˝ z 7 6 D4 i h 5 @Â2 L † 1 ˝ † 1 D .Â/ r 2 22 ˝ z0 3 7 7 7 7 7 7 7 5 22 3 †221 v2 .ı 2 /0 @q2 6 Ir D4 @Â2 1 L vec † 1 v .ı /0 v .ı / † 1 i2 2 2 r 22 i 2 2 22 †21 ˝ vi 2 .ı 2 /0 where D .Â/ D Œ†21 ˝ †21  12 11j2 .Â/ 2 hi .Â/ 0 ˛ 0 ; ı10 ; †021 ; †11 , Â2 D vec .ı 2 /0 ; vech .†22 /0 †221 7 Á 5 0 and Lr is a r.rC1/ 2 11j2 .Â/ 1e i .Â/ ; Â1 D r 2 elemination matrix @q . / (Section 5.7.3, Turkington, 2014) : To determine @Â1 ; we should find a set of moment functions 22 @q @q @q in @Â1 that cannot be expressed by a linear combination of moment functions in @Â1 and @Â2 at 2 1 2 @q1 true parameter values. If †o21 D 0; then @ D 0 and Â22 is empty. For now, assume †o21 6D 0 2 8 1† and, at least one element of †o22 o21 ; say io th component; is nonzero. Then, by some tedious algebra1; the extra moments can be shown to be at most @q1 .Â/ D @Â22 Á 1 e .Â/ † 1 † 0 .Â/ i 1j2 22 21 i z2; r o (1.10) where ‘ r’ denotes ‘leaving r instruments out’ in z2 . Suppose there exists enough variation in @q . / z2 so that (1.10) is indeed @Â1 : Since GMM-QLIML moment functions constitute a basis of 22 linear vector space spanned by (1.8), it can be shown that any choice of extra moments yields asymptotically equivalent estimator. If the model is just-identified .k2 D r/; or if there exists no endogeneity .†o21 D 0/; then Â22 is empty, and GMM-QLIML, CF and QLIML are asymptotically equivalent to each other. Example 1.3.5 illustrates the following general proposition. 
Proposition 1.3.6 Under regularitiy conditions and identification conditions (Assumption 1–12), @q .Â/ yields asymptotically equivalent GMM-QLIML estimator. (b) If Â22 is (a) Each choice of @Â1 22 empty, then GMM-QLIML, QLIML and CF are asymptotically equivalent. Example 1.3.5 shows that a preliminary step is required for GMM-QLIML to be used in practice. In the example, it is necessary to test whether there exists a component of †221 †21 significantly different from zero. Then, the extra moment functions will be chosen correspondingly. This preliminary procedure probably is not very appealing to practitioners. A practical approach in general would be to employ a generalized inverse matrix for optimal weighting and resolve singularity issue. Also, in this specific example, we can consider including moment condition 1 It @q @q1 @ı 1 @q1 @Â1 @q @q @q is easy to see that @† 1 (the second part of @Â1 ) is a linear combination of @† 1 and @† 1 22 2 21 11 @q1 @q1 @q1 (the third and fourth part of @ /:In @ı (the first part of @ ); all moments with z1 can be generated by (the second part of 1 2 2 ): So, we are left with moments with z2 : Among these, due to explicit @q . / linear relationship y2 D Zi ı 2 Cvi 2 .ı 2 / ; only .k2 r/ moments at most can be included in @Â1 : 22 For all .k2 r/ moments to be included, we need enough variation in instruments. 9 Á without †221 †21 term io @q1 .Â/ D @Â22 11j2 .Â/ 1e i .Â/ z02; r (1.11) and the resulting optimal GMM estimator is more efficient relative to GMM-QLIML though it may require additional model restrictions in general. GMM-QLIML has an important role in relative efficiency study while GMM with (1.11) has more practical usage. As a basis of linear space spanned by QLIML and CF moment functions, the asymptotic variance of GMM-QLIML forms a sharper lower bound for those of QLIML and CF. 
Since eliminating $\left( \Sigma_{22}^{-1}\Sigma_{21} \right)_{i_o}$ from (1.10) is equivalent to adding extra information that is not used by either QLIML or CF when $\Sigma_{o21} = 0$, the asymptotic variance of the optimal GMM estimator using (1.11) can be strictly smaller than that of GMM-QLIML in the matrix positive definite sense. This delicate distinction offers a convenient general framework for relative efficiency comparison.

The potential relative efficiency gain of GMM-QLIML with respect to QLIML and CF is clear from its definition. It is worth noting that such potential improvement is not based on additional model restrictions, as shown in Proposition 1.3.4. When an efficiency gain is present, it is implied that QLIML and CF make use of only a part of the information that GMM-QLIML uses. In such a case, the relative efficiency comparison of QLIML and CF is not obvious in general. When GMM-QLIML is equivalent to either QLIML or CF, one can conclude that the one equivalent to GMM-QLIML is superior to the other. Conditions under which GMM-QLIML is equivalent to each estimator can be derived by applying moment redundancy conditions (Breusch, Qian, Schmidt and Wyhowski, 1999; BQSW). In the following propositions, denote $V_{est} \equiv \mathrm{Avar}\left( \sqrt{N}\left( \hat\theta_{est} - \theta_o \right) \right)$ and $V_{est}^{S} \equiv \mathrm{Avar}\left( \sqrt{N}\left( \hat\theta_{S,est} - \theta_{oS} \right) \right)$ for a partition $\theta = (\theta_S, \theta_{-S})$, and $\partial q_{il}^o = \partial q_{il}(\theta_o)$ for $l = 1, 2$.

Proposition 1.3.7 Assume that Assumptions 1–12 hold and that $\theta_{22}$ is nonempty. Then

(a) $V_{GMM\text{-}QLIML} \leq V_{QLIML},\, V_{CF}$.

(b) $V_{GMM\text{-}QLIML} = V_{CF}$ if and only if
\[
E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_{22}\, \partial \theta'} \right]
= \mathrm{cov}_o\!\left( \frac{\partial q_{i1}^o}{\partial \theta_{22}},
\begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial q_{i2}^o}{\partial \theta_2} \end{pmatrix} \right)
V_o\!\begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial q_{i2}^o}{\partial \theta_2} \end{pmatrix}^{-1}
E_o\!\left[ \frac{\partial}{\partial \theta'} \begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial q_{i2}^o}{\partial \theta_2} \end{pmatrix} \right]
\]

(c) $V_{GMM\text{-}QLIML} = V_{QLIML}$ if and only if
\[
E_o\!\left[ \frac{\partial^2 \left( q_{i1}^o + q_{i2}^o \right)}{\partial \tilde\theta_2\, \partial \theta'} \right]
= \mathrm{cov}_o\!\left( \frac{\partial \left( q_{i1}^o + q_{i2}^o \right)}{\partial \tilde\theta_2},
\begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial \left( q_{i1}^o + q_{i2}^o \right)}{\partial \theta_2} \end{pmatrix} \right)
V_o\!\begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial \left( q_{i1}^o + q_{i2}^o \right)}{\partial \theta_2} \end{pmatrix}^{-1}
E_o\!\left[ \frac{\partial}{\partial \theta'} \begin{pmatrix} \frac{\partial q_{i1}^o}{\partial \theta_1} \\[2pt] \frac{\partial \left( q_{i1}^o + q_{i2}^o \right)}{\partial \theta_2} \end{pmatrix} \right]
\]
where $\tilde\theta_2$ is a subvector of $\theta_2$ such that $\frac{\partial q_{i1}^o}{\partial \theta_1}$ and $\frac{\partial q_{i1}^o}{\partial \tilde\theta_2} + \frac{\partial q_{i2}^o}{\partial \tilde\theta_2}$ are maximal linearly independent.
Remark 1.3.8 (b) and (c) can be derived for an arbitrary subvector $\theta_S$ of $\theta = (\theta_S, \theta_{-S})$. Corresponding results are given in the appendix.

The equivalence conditions (b) and (c) characterize when the extra moments in GMM-QLIML contain no useful information about the parameters. Rigorously put, they describe cases where the orthogonal complement of the QLIML or CF moment functions in the linear span of (1.8) does not contain additional information about the parameters. One interesting implication of (c) is that a set of generalized information matrix equalities (GIME; Wooldridge, 2010) for $q_1$, $q_2$ and $q_1 + q_2$ with some common scaling factor $\sigma_o^2 > 0$ is sufficient for QLIML to be efficient relative to CF:
\[
V_o\!\left( \frac{\partial q_1^o}{\partial \theta} \right) = \sigma_o^2\, E_o\!\left[ -\frac{\partial^2 q_1^o}{\partial \theta\, \partial \theta'} \right]
\]
\[
V_o\!\left( \frac{\partial q_2^o}{\partial \theta_2} \right) = \sigma_o^2\, E_o\!\left[ -\frac{\partial^2 q_2^o}{\partial \theta_2\, \partial \theta_2'} \right]
\]
\[
V_o\!\left( \frac{\partial \left( q_1^o + q_2^o \right)}{\partial \theta} \right) = \sigma_o^2\, E_o\!\left[ -\frac{\partial^2 \left( q_1^o + q_2^o \right)}{\partial \theta\, \partial \theta'} \right]
\]
Note that this result is stronger than the one in previous studies under correctly specified likelihoods. Even if QLIML is not a maximum likelihood estimator, it is efficient relative to CF whenever the finite number of moment conditions in the GIMEs are met.

The following corollary contains relevant implications of Proposition 1.3.7 regarding the GIMEs. In particular, it claims an if-and-only-if condition for the CF and QLIML estimators of $\theta_{11}$ to be asymptotically equivalent under the GIMEs, where $\theta_1 = (\theta_{11}, \theta_{12})$.

Corollary 1.3.9 Assume that Assumptions 1–12 hold and that $\theta_{22}$ is nonempty. If the generalized information matrix equalities hold for each factor of the likelihood and the joint likelihood with the same scaling factor, we have

(a) $V_{GMM\text{-}QLIML} = V_{QLIML} \leq V_{CF}$ (in particular, $V_{QLIML} \neq V_{CF}$);

(b) $V_{GMM\text{-}QLIML}^{11} = V_{QLIML}^{11} = V_{CF}^{11}$ if and only if
\[
0_{p_{22} \times p_{11}} = \left( E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_{22}\, \partial \theta_2'} \right]
- E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_{22}\, \partial \theta_1'} \right]
\left( V_o\!\left( \frac{\partial q_{i1}^o}{\partial \theta_1} \right) \right)^{-1}
E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_1\, \partial \theta_2'} \right] \right)
\left( R_{21}\, E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_{12}\, \partial \theta_{11}'} \right]
+ R_{22}\, E_o\!\left[ \frac{\partial^2 q_{i1}^o}{\partial \theta_2\, \partial \theta_{11}'} \right] \right)
\]
where $R_{21}$ and $R_{22}$ are defined in the proof.
These results are useful for studying the asymptotic equivalence of QLIML and CF since, if there exists a case where QLIML and CF are asymptotically equivalent in general, it must also be the case under GIMEs. Result (a) of Corollary 1.3.9 shows that, when $\theta_{22}$ is nonempty, QLIML and CF are never asymptotically equivalent for all elements of $\theta$. But this does not rule out the case where QLIML and CF are asymptotically equivalent for a strict subvector of $\theta$. The formula in Corollary 1.3.9 (b) (and another one given in Proposition 1.3.13 later) identifies the key conditions for QLIML and CF to be asymptotically equivalent for a subvector $\theta_{11}$ of $\theta_1$: it appears that some part of the expected cross partials $E_o\big[\partial^2 q^o_{i1}/\partial\theta_{12}\,\partial\theta_{11}'\big]$ and $E_o\big[\partial^2 q^o_{i1}/\partial\theta_2\,\partial\theta_{11}'\big]$ must vanish to obtain general equivalence. Based on this observation, the following proposition explicitly states a condition under which QLIML and CF are asymptotically equivalent for a subvector of $\theta$. The well-known asymptotic equivalence of LIML and 2SLS is an implication.

Proposition 1.3.10 Assume that Assumptions 1-12 hold. Let $(\gamma_1, \gamma_2)$ be a partition of $\theta$. If there exist $p\times p$ invertible matrices $T_1(\theta)$ and $T_2(\theta)$ such that
$$T_1(\theta)\begin{pmatrix} \frac{\partial}{\partial\theta_1} q_1(\gamma_1,\gamma_2) \\ \frac{\partial}{\partial\theta_2} q_1(\gamma_1,\gamma_2) + \frac{\partial}{\partial\theta_2} q_{i2}(\gamma_2) \end{pmatrix} = \begin{pmatrix} m_1(\gamma_1,\gamma_2) \\ m_2(\gamma_1,\gamma_2) \end{pmatrix}$$
$$T_2(\theta)\begin{pmatrix} \frac{\partial}{\partial\gamma_1} q_1(\gamma_1,\gamma_2) \\ \frac{\partial}{\partial\gamma_2} q_2(\gamma_2) \end{pmatrix} = \begin{pmatrix} m_1(\gamma_1,\gamma_2) \\ m_3(\gamma_1,\gamma_2) \end{pmatrix}$$
where $m_1(\gamma_1,\gamma_2)$ identifies $\gamma_{o1}$ given $\gamma_{o2}$, $E_o\big[\partial m_1(\gamma_{o1},\gamma_{o2})/\partial\gamma_2'\big] = 0$, and $E_o\big[\partial m_{ig}(\gamma_{o1},\gamma_{o2})/\partial\gamma_2'\big]$ is invertible for $g = 2, 3$, then the QLIML and CF estimators of $\gamma_1$ are asymptotically equivalent.

Corollary 1.3.11 LIML and 2SLS are asymptotically equivalent for $(\alpha, \delta_1)$.

The asymptotic equivalence of LIML and 2SLS is mainly due to the linearity of the regression functions and the $L_2$ loss function embedded in the normal density.
The intuition behind the proof is that $v_2$ does not need to be controlled for as a regressor in order to estimate $(\alpha, \delta_1)$: the orthogonality conditions $\partial q_1/\partial\alpha$ and $\partial q_1/\partial\delta_1$ in Example 1.3.5 can be transformed into
$$\begin{pmatrix} (y_{i1} - y_{i2}\alpha - z_{i1}\delta_1)\,(z_{2i}\delta_{22})' \\ (y_{i1} - y_{i2}\alpha - z_{i1}\delta_1)\, z_{1i}' \end{pmatrix} \qquad (1.12)$$
by an invertible linear map. Treating $(\alpha', \delta_1')'$ as $\gamma_1$ in Proposition 1.3.10, the equivalence follows. Clearly, neither normality nor conditional homoskedasticity is needed for this result, which is not very well recognized in the literature. Amemiya (1984) proves the equivalence under conditionally homoskedastic non-normal errors and non-random instruments, but his argument is, in fact, valid without assuming conditional homoskedasticity. In nonlinear models such as probit, the regression function does not allow the control function part to vanish as the linear model does in (1.12). Also, when the loss function is other than $L_2$, for example $L_1$ as in median regression with the tick-exponential family (Komunjer, 2009), then even if the regression function is linear, there again exists no invertible linear transformation of the quasi-scores that eliminates the control function part in general. Thus the equivalence of QLIML and CF does not seem to hold for nonlinear regression models.

Apart from linearity of the regression function and the $L_2$ loss function, another condition for general asymptotic equivalence of QLIML and CF for $\theta_1$ is
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_2'}\right] = \mathrm{cov}_o\left(\frac{\partial q^o_1}{\partial\theta_1},\ \frac{\partial q^o_2}{\partial\theta_2}\right) \qquad (1.13)$$
together with $\mathrm{cov}_o\big(\partial q^o_1/\partial\theta_1,\ \partial q^o_2/\partial\theta_2\big) = 0$. The equivalence is easily proved by taking $T_1(\theta) = T_2(\theta) = I_p$, $\gamma_1 = \theta_1$ and $\gamma_2 = \theta_2$ in Proposition 1.3.10. A set of sufficient conditions for (1.13) is well known: (a) $q_2$ is a correctly specified log-likelihood for $w_{i2}$, and (b) $w_{i1} \perp w_{i2} \mid z_i$, where $q_1 = q_1(w_{i1}, w_{i2}, z_i, \theta_1, \theta_2)$ and $q_2 = q_2(w_{i2}, z_i, \theta_2)$ for some random variables $(w_{i1}, w_{i2})$. This is a fairly general condition applicable to numerous models.
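The claim that neither normality nor conditional homoskedasticity is needed for the LIML/2SLS equivalence can be checked numerically. The sketch below is illustrative only: the data-generating process (non-normal, conditionally heteroskedastic errors) and sample size are assumptions, and LIML is computed through the standard k-class characterization. At a large sample size the two estimates nearly coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k2 = 20000, 4                          # over-identified: 4 excluded instruments, 1 endogenous y2
z1 = rng.standard_normal((n, 1))          # included exogenous regressor
z2 = rng.standard_normal((n, k2))         # excluded instruments

# Non-normal errors; v2 heteroskedastic in z1; u enters v2 -> endogeneity
u = rng.standard_exponential(n) - 1.0
v2 = 0.5 * u + (rng.standard_exponential(n) - 1.0) * (1.0 + 0.3 * np.abs(z1[:, 0]))
y2 = z1[:, 0] + z2.sum(axis=1) + v2
y1 = 1.0 * y2 + 1.0 * z1[:, 0] + u        # true (alpha, delta1) = (1, 1)

X = np.column_stack([y2, z1])             # structural regressors
Z = np.column_stack([z1, z2])             # full instrument set

def proj(A, B):
    """Project columns of B onto the column space of A."""
    return A @ np.linalg.lstsq(A, B, rcond=None)[0]

# 2SLS: regress y1 on fitted structural regressors
b_2sls = np.linalg.lstsq(proj(Z, X), y1, rcond=None)[0]

# LIML: kappa = smallest eigenvalue of (W'M_{z1}W)(W'M_Z W)^{-1} with W = [y1, y2],
# then the usual k-class formula
W = np.column_stack([y1, y2])
M1W = W - proj(z1, W)                     # annihilate included exogenous only
MW = W - proj(Z, W)                       # annihilate all instruments
kappa = np.min(np.real(np.linalg.eigvals(np.linalg.solve(MW.T @ MW, M1W.T @ M1W))))
MZX = X - proj(Z, X)
MZy = y1 - proj(Z, y1)
b_liml = np.linalg.solve(X.T @ X - kappa * (MZX.T @ MZX),
                         X.T @ y1 - kappa * (MZX.T @ MZy))
```

Despite the skewed, heteroskedastic errors, `b_2sls` and `b_liml` agree to well within sampling error, as the asymptotic equivalence for $(\alpha, \delta_1)$ predicts.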
However, it should be noted that $w_{i1}$ cannot be a latent error term such as $u_{i1}$ or $u_{i1} - v_{i2}\eta$ in the probit model of Example 1.2.1, since $q_1$ is required to be a quasi-log-likelihood of $w_{i1}$ given $(w_{i2}, z_i)$.

The next two propositions refine the GIMEs to derive weaker conditions for relative efficiency of QLIML. Proposition 1.3.12 helps reduce the number of conditions in the GIMEs by treating nuisance parameters as known. The multivariate normal log-likelihood becomes a member of the LEF when this result is applicable to its variance parameters. Proposition 1.3.13 relaxes the common scaling factor in the GIMEs. When different scaling factors for $q_1$ and $q_2$ are allowed, $\tau_1 \le \tau_2$ is shown to be sufficient for relative efficiency of QLIML for $\theta_1$. Note that, with different scaling factors, having the GIME hold in both models does not necessarily imply asymptotic equivalence of QLIML and GMM-QLIML. Following Zhang (2005), the Schur complement of $B$ in $A$ is denoted $A/B$ for notational convenience.

Proposition 1.3.12 Assume that Assumptions 1-12 hold. Suppose there exist $(l_1 + l_2)$ nuisance parameters $\lambda = (\lambda_1, \lambda_2)$ such that
$$E_o\left[\frac{\partial^2 q_{i1}(\theta_{o1}, \theta_{o2}, \lambda_{o1}, \lambda_{o2})}{\partial\theta\,\partial(\lambda_1', \lambda_2')}\right] = 0_{p\times(l_1+l_2)}, \qquad E_o\left[\frac{\partial^2 q_{i2}(\theta_{o2}, \lambda_{o2})}{\partial\theta_2\,\partial\lambda_2'}\right] = 0_{p_2\times l_2}.$$
Then $V^{QLIML}$ and $V^{CF}$ are not affected by treating $\lambda$ as known and redefining $\tilde q_{i1}(\theta) = q_{i1}(\theta, \lambda_o)$ and $\tilde q_{i2}(\theta_2) = q_{i2}(\theta_2, \lambda_{o2})$. Moreover, if the GMM-QLIML moment function (1.9) contains exactly $(l_1 + l_2)$ scores with respect to $\lambda$, then $V^{GMM\text{-}QLIML}$ is also not affected by the redefinition.

Proposition 1.3.13 Assume that Assumptions 1-12 hold. Suppose the GIMEs hold with scaling factors $\tau_1$ and $\tau_2$ for the quasi-log-likelihoods $q_1$ and $q_2$, respectively. Also, assume $\mathrm{cov}_o\big(\partial q^o_1/\partial\theta_2,\ \partial q^o_2/\partial\theta_2\big) = 0$. Then $V^{CF}_{\theta_1} - V^{QLIML}_{\theta_1}$ is equal to
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]^{-1} E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_2'}\right]\big[\tau_2 W_1 + (\tau_1 - \tau_2) W_2\big]\, E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_2\,\partial\theta_1'}\right] E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]^{-1}$$
where
$$W_1 = E_o\left[\frac{\partial^2 q^o_2}{\partial\theta_2\,\partial\theta_2'}\right]^{-1} - \left(E_o\left[\frac{\partial^2 (q^o_1+q^o_2)}{\partial\theta\,\partial\theta'}\right]\Big/ E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]\right)^{-1}$$
$$W_2 = \left(E_o\left[\frac{\partial^2 (q^o_1+q^o_2)}{\partial\theta\,\partial\theta'}\right]\Big/ E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]\right)^{-1} E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_2\,\partial\theta_1'}\right] E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]^{-1} E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_2'}\right] \left(E_o\left[\frac{\partial^2 (q^o_1+q^o_2)}{\partial\theta\,\partial\theta'}\right]\Big/ E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right]\right)^{-1}$$
In particular, $\tau_1 \le \tau_2$ implies $V^{CF}_{\theta_1} \ge V^{QLIML}_{\theta_1}$.

When $\tau_1 \ne \tau_2$, the GIME for $(q_1 + q_2)$ is not met, and QLIML does not optimally weight $\partial q_1/\partial\theta_2$ and $\partial q_2/\partial\theta_2$ as GMM-QLIML does. In this sense, Proposition 1.3.13 helps us understand where the complete set of GIMEs starts to break down. In contrast to the unambiguous case $\tau_1 \le \tau_2$ in the proposition, when $\tau_1 > \tau_2$ the expression $\tau_2 W_1 + (\tau_1 - \tau_2) W_2$ is indefinite in general. This observation explains why a general efficiency ordering of QLIML and CF is not obvious without some form of the GIMEs.

The next proposition shows how the general theory applies in a class of fully robust models specified with a multivariate normal $q_2$. It is one of the most frequently used specifications attaining fully robust estimation, but it is not the only class of models to which the results apply. It is shown that correct specification of the conditional means together with the GLM variance assumptions and a restriction on the scaling factors is sufficient for relative efficiency of QLIML for the structural parameters. In particular, correctly specified conditional moments up to second order are sufficient.

Proposition 1.3.14 Assume that Assumptions 1-12 hold. Suppose that $q_1$ is a member of the LEF with conditional mean $G(y_2, z_1, v_2, \theta_1)$, and that $q_2$ is a multivariate normal density for the linear reduced form equations. In other words,
$$q_{i1}(\theta_1, \theta_2) = a\big(G(y_2, z_1, v_2, \theta_1)\big) + b(y_{i1}) + y_{i1}\, c\big(G(y_2, z_1, v_2, \theta_1)\big)$$
$$q_{i2}(\theta_2) = -\frac{k}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_{22}| - \frac{1}{2}\, v_{i2}\,\Sigma_{22}^{-1} v_{i2}'$$
where $a$, $b$, $c$ and $G$ are sufficiently smooth functions, $v_2 = y_2 - z\delta_2$ and $\theta_2 = \big(\mathrm{vec}(\delta_2)', \mathrm{vech}(\Sigma_{22})'\big)'$. Assume that $E_q(y_1|y_2, z)$ and $E_q(y_2|z)$ are correctly specified.
Then $V_o(y_1|y_2, z) = \tau_1 V_q(y_1|y_2, z)$ and $V_o(y_2|z) = \tau_2 V_q(y_2|z)$ with $0 < \tau_1 \le \tau_2$ is sufficient for QLIML to be efficient relative to CF for $\theta_1$.

As a special case of Proposition 1.3.14, the next example considers a probit response function with endogenous explanatory variables. Specifically, Proposition 1.3.14 implies that relative efficiency of QLIML holds under a much weaker condition than the correct specification of the likelihood assumed in Rivers and Vuong (1988), and this result is new in the literature.

Example 1.3.15 (Rivers-Vuong, 1988) Consider the probit model in Example 2.1. Note that $y_1$ is not restricted to be a binary response as long as the probit response function is correct. Assume regularity and identification conditions (Assumptions 1-12). For computational convenience, impose the reparametrization $\eta \equiv \Sigma_{22}^{-1}\Sigma_{21}$ along with the normalization $e_1 = u_1 - v_2\eta$. Then the quasi-likelihood can be simplified as
$$q_1(\theta_1, \theta_2) = (1 - y_1)\log\big[1 - \Phi(w(\theta))\big] + y_1\log\Phi(w(\theta))$$
$$q_{i2}(\theta_2) = -\frac{k}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_{22}| - \frac{1}{2}\, v_{i2}(\delta_2)\,\Sigma_{22}^{-1} v_{i2}(\delta_2)'$$
where $\theta_1 = (\alpha', \delta_1', \eta')'$, $\theta_2 = \big(\mathrm{vec}(\delta_2)', \mathrm{vech}(\Sigma_{22})'\big)'$, $x = (y_2, z_1, v_2)$ and $w(\theta) = x\theta_1$. Taking derivatives, the quasi-scores can be expressed as
$$\frac{\partial q_1}{\partial\theta_1} = \frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\begin{pmatrix} y_2' \\ z_1' \\ v_2' \end{pmatrix}$$
$$\frac{\partial q_1}{\partial\theta_2} = -\frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\begin{pmatrix} \eta \otimes z' \\ 0_{\frac{r(r+1)}{2}\times 1} \end{pmatrix}$$
$$\frac{\partial q_2}{\partial\theta_2} = \begin{pmatrix} \big(I_r \otimes z'\big)\,\Sigma_{22}^{-1} v_2(\delta_2)' \\ \frac{1}{2}\, L_r\, \mathrm{vec}\big(\Sigma_{22}^{-1} v_{i2}(\delta_2)' v_{i2}(\delta_2)\,\Sigma_{22}^{-1} - \Sigma_{22}^{-1}\big) \end{pmatrix}$$
Assume the GMM-QLIML extra moment functions are
$$\frac{\partial q_1}{\partial\theta_{22}} = -\frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\,\eta_{i_o}\, z_{2,-r}'$$
where $\eta_{oi_o} \ne 0$.
To derive conditions under which QLIML is efficient relative to CF, first note that $\Sigma_{22}$ is a nuisance parameter; that is, under correctly specified regression functions,
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\mathrm{vech}(\Sigma_{22})'}\right] = 0, \qquad E_o\left[\frac{\partial^2 q^o_2}{\partial\mathrm{vec}(\delta_2)\,\partial\mathrm{vech}(\Sigma_{22})'}\right] = E_o\Big[\big(I_r \otimes z'\big)\big(\Sigma_{o22}^{-1} \otimes \Sigma_{o22}^{-1} v_2\big)\Big] L_r' = 0.$$
Therefore, we may treat $\Sigma_{22}$ as known; that is, redefine $\theta_2 = \delta_2$. The expected Hessian and score outer-product matrices are then
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_1'}\right] = -E_o\left[\frac{[\phi(w(\theta_o))]^2}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\, x'x\right]$$
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_1\,\partial\theta_2'}\right] = E_o\left[\frac{[\phi(w(\theta_o))]^2}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\, x'\,\big(\eta_o' \otimes z\big)\right]$$
$$E_o\left[\frac{\partial^2 q^o_1}{\partial\theta_2\,\partial\theta_2'}\right] = -E_o\left[\frac{[\phi(w(\theta_o))]^2}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\, \big(\eta_o\eta_o'\big) \otimes \big(z'z\big)\right]$$
$$E_o\left[\frac{\partial^2 q^o_2}{\partial\mathrm{vec}(\delta_2)\,\partial\mathrm{vec}(\delta_2)'}\right] = -E_o\Big[\Sigma_{o22}^{-1} \otimes \big(z'z\big)\Big]$$
and
$$V_o\left(\frac{\partial q^o_1}{\partial\theta_1}\right) = E_o\left[\left(\frac{y_1 - \Phi(w(\theta_o))}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\right)^2 [\phi(w(\theta_o))]^2\, x'x\right]$$
$$\mathrm{cov}_o\left(\frac{\partial q^o_1}{\partial\theta_1}, \frac{\partial q^o_1}{\partial\theta_2}\right) = -E_o\left[\left(\frac{y_1 - \Phi(w(\theta_o))}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\right)^2 [\phi(w(\theta_o))]^2\, x'\,\big(\eta_o' \otimes z\big)\right]$$
$$V_o\left(\frac{\partial q^o_1}{\partial\theta_2}\right) = E_o\left[\left(\frac{y_1 - \Phi(w(\theta_o))}{\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big]}\right)^2 [\phi(w(\theta_o))]^2\, \big(\eta_o \otimes z'\big)\big(\eta_o' \otimes z\big)\right]$$
$$V_o\left(\frac{\partial q^o_2}{\partial\mathrm{vec}(\delta_2)}\right) = E_o\Big[\big(I_r \otimes z'\big)\,\Sigma_{o22}^{-1} v_2' v_2\,\Sigma_{o22}^{-1}\big(I_r \otimes z\big)\Big]$$
The orthogonality between the scores holds when the conditional means are correct, since
$$E_o\left[\frac{\partial q^o_1}{\partial\theta}\frac{\partial q^o_2}{\partial\theta_2'}\right] = E_o\left[E_o\left(\frac{\partial q^o_1}{\partial\theta}\,\bigg|\, y_2, z_1, v_2\right)\frac{\partial q^o_2}{\partial\theta_2'}\right] = 0.$$
Then, by Proposition 1.3.14, the following conditions are sufficient for QLIML to be efficient relative to CF for $\theta_1$:
$$E_o[y_1|y_2, z] = \Phi(w(\theta_o)), \qquad V_o[y_1|y_2, z] = \tau_1\,\Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big],$$
$$E_o[y_2|z] = z\delta_{o2}, \qquad V_o[y_2|z] = \tau_2\,\Sigma_{o22}, \qquad \text{with } \tau_1 \le \tau_2.$$
The restriction $\tau_1 \le \tau_2$ is especially plausible for applications of the probit model to a fractional response $y_1 \in [0, 1]$. Note that, with a correctly specified conditional mean function, the conditional variance of $y_1$ is then bounded above by $V_q[y_1|y_2, z]$, since $y_1^2 \le y_1$ on $[0,1]$:
$$V_o[y_1|y_2, z] = E\big[y_1^2\,\big|\,y_2, z\big] - [\Phi(w(\theta_o))]^2 \le E[y_1|y_2, z] - [\Phi(w(\theta_o))]^2 = \Phi(w(\theta_o))\big[1 - \Phi(w(\theta_o))\big].$$
And $\tau_1$ often appears to be very small in practice when $\tau_2$ is normalized to 1.
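The two-step CF estimator discussed in this example can be sketched in a few lines. The data-generating process, variable names, and parameter values below are assumptions chosen purely for illustration; the second step maximizes the Bernoulli quasi-log-likelihood $q_1$ with the first-stage residuals included as a regressor.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal((n, 3))              # z[:, 0] = z1 (included), z[:, 1:] = z2 (excluded)
v2 = rng.standard_normal(n)
y2 = z @ np.array([1.0, 1.0, 1.0]) + v2      # linear reduced form
e1 = rng.standard_normal(n)                  # e1 = u1 - v2*eta, independent of v2 by construction
y1 = (0.5 * y2 + 0.5 * z[:, 0] + 0.6 * v2 + e1 > 0).astype(float)   # eta_o = 0.6

# Step 1: OLS reduced form; the control function is the residual v2hat
delta2_hat = np.linalg.lstsq(z, y2, rcond=None)[0]
v2hat = y2 - z @ delta2_hat

# Step 2: maximize the Bernoulli quasi-log-likelihood q1 with x = (y2, z1, v2hat)
X = np.column_stack([y2, z[:, 0], v2hat])
def neg_q1(theta1):
    p = np.clip(norm.cdf(X @ theta1), 1e-12, 1.0 - 1e-12)
    return -np.sum(y1 * np.log(p) + (1.0 - y1) * np.log(1.0 - p))

theta1_cf = minimize(neg_q1, np.zeros(3), method="BFGS").x   # (alpha, delta1, eta)
```

The same second step applies unchanged when `y1` is a fractional response in $[0,1]$, which is exactly the setting where the $\tau_1 \le \tau_2$ restriction is most plausible.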
Another example where the relative efficiency conditions are applicable is the Poisson regression model for a positive response (such as count data) with endogenous explanatory variables.

Example 1.3.16 (exponential model) In the simultaneous equation system
$$y_1 = \exp(y_2\alpha + z_1\delta_1 + v_2\eta)\, u_1, \qquad y_2 = z\delta_2 + v_2,$$
assume $y_1|z, y_2 \sim_q \mathrm{Poisson}\big(\exp(y_2\alpha + z_1\delta_1 + v_2\eta)\big)$ and $y_2|z \sim_q \mathrm{Normal}(z\delta_2, \Sigma_{22})$. Then the quasi-log-likelihoods are
$$q_1(\theta_1, \theta_2) = -\exp(y_2\alpha + z_1\delta_1 + v_2\eta) + y_1\,(y_2\alpha + z_1\delta_1 + v_2\eta) - \log(y_1!)$$
$$q_2(\theta_2) = -\frac{k}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_{22}| - \frac{1}{2}\, v_{i2}(\delta_2)\,\Sigma_{22}^{-1} v_{i2}(\delta_2)'$$
Since the Poisson likelihood also belongs to the LEF, by Proposition 1.3.14 a set of sufficient conditions for relative efficiency of QLIML for $\theta_1$ is
$$E_o[y_1|y_2, z] = \exp(y_2\alpha_o + z_1\delta_{o1} + v_2\eta_o), \qquad V_o[y_1|y_2, z] = \tau_1\exp(y_2\alpha_o + z_1\delta_{o1} + v_2\eta_o),$$
$$E_o[y_2|z] = z\delta_{o2}, \qquad V_o[y_2|z] = \tau_2\,\Sigma_{o22}, \qquad \text{with } \tau_1 \le \tau_2.$$

1.4 Concluding Remarks

I show that, when both the QLIML and CF estimators are consistent and asymptotically normal, there exists an efficient GMM estimator, GMM-QLIML, whose asymptotic variance constitutes a lower bound for those of the QLIML and CF estimators. In particular, a set of generalized information matrix equalities is shown to be sufficient for the QLIML estimator to be as efficient as GMM-QLIML. In fully robust estimation of correctly specified conditional means, the condition is further weakened to the GLM variance assumptions with a scaling restriction. As Example 1.3.15 demonstrates, this condition is especially appealing for the probit model applied to a fractional response.

Still, there are remaining questions to be answered. Regarding Proposition 1.3.13, can we derive a refined condition for relative efficiency of the QLIML estimator in the $\tau_1 > \tau_2$ case? As discussed for the Poisson regression model in Example 1.3.16, there are models that often exhibit a large $\tau_1$, and this refinement (if possible) would be useful.
Also, we cannot rule out the possibility of conditions even weaker than the GIMEs with the scaling restriction given in Proposition 1.3.13. Direct comparison of asymptotic variances does not seem to work well in that direction of research. Moreover, relative efficiency relationships with other QLIL-based estimators can be studied. For example, when a reduced form model for $y_1$ is available, the minimum distance estimator suggested by Amemiya (1978, 1979) can be used. Newey (1987) showed its asymptotic efficiency in a limited information structure with normal errors. It would be interesting to study its relative efficiency relationship when it is based on a QLIL.

CHAPTER 2

EFFICIENT MINIMUM DISTANCE ESTIMATOR BASED ON QUASI-LIMITED INFORMATION LIKELIHOOD

2.1 Introduction

When a model is over-identified in a limited information simultaneous system, the classical minimum distance estimator is often proposed as an estimation method. Amemiya (1978, 1979) first introduced its application to probit and Tobit models with endogenous explanatory variables and gave it an interpretation as 'generalized least squares'. Newey (1987) called this estimator 'Amemiya's GLS (AGLS)' and showed its asymptotic efficiency under correct specification of the likelihood in a general class of limited information structures. Recent work by Wooldridge (2014) implies that, in the linear exponential family (LEF), correct specification of the regression functions of the reduced form model guarantees robustness of AGLS. Still, its relative efficiency relationship has not been clarified for the case of a potentially misspecified likelihood.

The purpose of this chapter is to study the asymptotic behavior of the minimum distance estimator based on a quasi-limited information likelihood. The primary focus is on its relative efficiency relationship with respect to the quasi-limited information maximum likelihood (QLIML) estimator and the two-step control function (CF) approach.
This chapter takes the quasi-limited information framework from Wooldridge (2014) and relies on results from Chapter 1. The main contributions of this chapter are the following. First, AGLS is interpreted as a concentrated estimator (cMD-QLIML) and a 'full' minimum distance estimator (MD-QLIML) is proposed. Based on a result analogous to one in Crépon, Kramarz and Trognon (1997), cMD-QLIML is shown to be asymptotically equivalent to MD-QLIML for the structural parameters. Second, given a quasi-limited information likelihood, cMD-QLIML is proved to be asymptotically efficient relative to QLIML and CF. In particular, cMD-QLIML can be strictly more efficient than QLIML in Newey's framework if a sufficient degree of misspecification is present in the likelihood. Third, an if-and-only-if condition for cMD-QLIML and the other estimators to be asymptotically equivalent under the null hypothesis of exogeneity is derived. An immediate implication is that a set of generalized information matrix equalities for the reduced form model is sufficient. Fourth, an explicit formula for the cMD-QLIML estimator in the linear model is derived. It is the same as GMM but with a different weighting matrix derived from the reduced form parameter estimates.

The rest of this chapter is organized as follows. In Section 2.2, basic model restrictions are given. In Section 2.3, MD-QLIML and cMD-QLIML are defined, and the relative efficiency relationships are presented. Section 2.4 contains applications to the linear model and the quantile regression model with endogenous explanatory variables.

2.2 Model Restrictions

Assume random sampling from a population.
Model restrictions start from the decomposed quasi-limited information log-likelihood framework in Wooldridge (2014):
$$QLL = q_1(y_{i1}, y_{i2}, z_i, \theta_1, \theta_2) + q_2(y_{i2}, z_i, \theta_2) \qquad (2.1)$$
where $\theta = (\theta_1', \theta_2')'$ is a $(p_1 + p_2)$-dimensional parameter vector, $y_{i1}$ is the $i$th observation of a scalar response variable, $y_{i2}$ is a $1\times r$ vector of potentially endogenous variables, and $z_i = (z_{i1}, z_{i2})$ is a $1\times k$ vector of included/excluded exogenous instruments with $k = k_1 + k_2$. For details, see Wooldridge (2014). The QLIML and CF estimators are initially given as
$$\hat\theta_{QLIML} = \arg\max_{\theta} \sum_{i=1}^{N} \big[q_{i1}(\theta) + q_{i2}(\theta_2)\big]$$
and
$$\hat\theta_{2,CF} = \arg\max_{\theta_2} \sum_{i=1}^{N} q_{i2}(\theta_2), \qquad \hat\theta_{1,CF} = \arg\max_{\theta_1} \sum_{i=1}^{N} q_{i1}\big(\theta_1, \hat\theta_{2,CF}\big),$$
and are redefined as GMM estimators based upon the first order conditions
$$\sum_{i=1}^{N}\begin{pmatrix} \frac{\partial}{\partial\theta_1} q_{i1}\big(\hat\theta_{QLIML}\big) \\ \frac{\partial}{\partial\theta_2} q_{i1}\big(\hat\theta_{QLIML}\big) + \frac{\partial}{\partial\theta_2} q_{i2}\big(\hat\theta_{2,QLIML}\big) \end{pmatrix} = 0 \quad\text{and}\quad \sum_{i=1}^{N}\begin{pmatrix} \frac{\partial}{\partial\theta_1} q_{i1}\big(\hat\theta_{CF}\big) \\ \frac{\partial}{\partial\theta_2} q_{i2}\big(\hat\theta_{2,CF}\big) \end{pmatrix} = 0,$$
respectively. Finite sample estimates of the above extremum estimators and their GMM-interpreted counterparts may not coincide. Such numerical discrepancies do not harm our asymptotic analysis since the estimators are asymptotically equivalent under regularity conditions. Both the QLIML and CF estimators are assumed to be $\sqrt{N}$-consistent for the true parameter values and asymptotically normal. The essential model restriction for validity of the relative efficiency results in this chapter is that the asymptotic variance of each estimator is in the standard sandwich form. To explicitly account for some cases of non-smooth objective functions, the Jacobian matrix of the expected score is used rather than the expected Jacobian matrix of the score. Note that the redundancy conditions in Breusch, Qian, Schmidt and Wyhowski (1999) are compatible with such a modification. Generalized information matrix equalities (GIMEs) are also defined accordingly. A set of standard regularity conditions for the GMM-interpreted estimators (Assumptions 1-13) is given in Appendix B.1.
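The joint maximization defining QLIML and the sequential maximization defining CF can be contrasted in a minimal sketch. Everything below is an illustrative assumption: a Poisson $q_1$ in the spirit of Example 1.3.16 and a normal $q_2$ with $\Sigma_{22}$ fixed at 1 (so $\theta_2 = \delta_2$), with arbitrary parameter values.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 4000
z = rng.standard_normal((n, 2))                 # z[:, 0] included, z[:, 1] excluded
v2 = rng.standard_normal(n)
y2 = z @ np.array([0.8, 0.8]) + v2
mu = np.exp(0.5 * y2 + 0.5 * z[:, 0] + 0.4 * v2)
y1 = rng.poisson(mu).astype(float)

def q1(theta1, v):
    """Poisson quasi-log-likelihood (LEF), constants dropped; index clipped for stability."""
    w = np.clip(np.column_stack([y2, z[:, 0], v]) @ theta1, -30.0, 30.0)
    return np.sum(y1 * w - np.exp(w))

def q2(delta2):
    """Normal quasi-log-likelihood for the reduced form with Sigma22 fixed at 1."""
    v = y2 - z @ delta2
    return -0.5 * np.sum(v ** 2)

# Two-step CF: maximize q2 first (OLS), then q1 with the residuals plugged in
delta2_cf = np.linalg.lstsq(z, y2, rcond=None)[0]
theta1_cf = minimize(lambda t: -q1(t, y2 - z @ delta2_cf),
                     np.zeros(3), method="BFGS").x

# QLIML: maximize q1 + q2 jointly over (theta1, delta2)
def neg_qll(p):
    t1, d2 = p[:3], p[3:]
    return -(q1(t1, y2 - z @ d2) + q2(d2))

p0 = np.concatenate([theta1_cf, delta2_cf])
theta1_qliml = minimize(neg_qll, p0, method="BFGS").x[:3]
```

With this correctly specified design both estimators are consistent for the same $\theta_1$; the chapter's results concern how their asymptotic variances compare when the likelihoods are misspecified.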
It is easy to show that, under these conditions, Proposition 1.3.4 in Chapter 1 holds with Jacobian matrices of the expected score in the sandwich form.

To consider a distance minimization problem, a reduced form model should be defined. The existence of a link function $\pi: \Theta \to \Gamma \subset \mathbb{R}^g$ is essential in the following characterization of a reduced form model.

Definition 2.2.1 A reduced form model is a pair $\big(q^R_{i1}(\pi, \theta_2),\ \pi(\theta)\big)$ such that
$$q^R_{i1}(\pi(\theta), \theta_2) = q_{i1}(\theta_1, \theta_2) \ \text{ a.s. for all } \theta \in \Theta \qquad (2.2)$$
$$\frac{\partial q^R_{i1}}{\partial\theta_2}(\pi, \theta_2) = C(\theta)\,\frac{\partial q^R_{i1}}{\partial\pi}(\pi, \theta_2) \ \text{ a.s.} \qquad (2.3)$$
for some $p_2 \times g$ matrix $C(\theta)$ whose elements are real-valued functions of $\theta$.

The link function $\pi(\theta)$ represents the functional relationship between the structural parameters and the reduced form parameters. It relates the likelihoods of the structural and reduced form models as in the first condition (2.2). Note that the decomposed likelihood $q^R_{i1}(\pi, \theta_2) + q_2(\theta_2)$ for a reduced form model still belongs to the QLIML framework, and $q^R_1$ alone cannot identify $\pi$ without the help of $q_2$. Based on the relative efficiency results in Chapter 1, this model is 'reduced' in the sense that GMM-QLIML for this model has no additional moment functions from $\partial q^R_{i1}/\partial\theta_2$. In other words, all nonredundant effects of $(z_i, y_{i2})$ on $y_1$ are captured in $\pi$. This is characterized by the second condition (2.3). In turn, the QLIML and CF estimators are asymptotically equivalent for the reduced form model parameters $(\pi, \theta_2)$. A set of standard regularity conditions for a reduced form model and a link function (Assumptions 14-17) is given in Appendix B.1.

2.3 Minimum Distance Estimators: MD/cMD-QLIML

Given reduced form estimates $(\hat\pi, \hat\theta_2)$, MD-QLIML is defined as the minimum distance estimator of $\theta$ minimizing the optimally weighted sum of the distances $\hat\pi - \pi(\theta)$ and $\hat\theta_2 - \theta_2$.

Definition 2.3.1 Let $(\hat\pi, \hat\theta_2)$ be reduced form parameter estimates and suppose
$$\sqrt{N}\left(\begin{pmatrix} \hat\pi \\ \hat\theta_2 \end{pmatrix} - \begin{pmatrix} \pi_o \\ \theta_{o2} \end{pmatrix}\right) \xrightarrow{d} N(0, \Omega_R).$$
Then the MD-QLIML estimator $\hat\theta_{MD\text{-}QLIML}$ is a solution to
$$\min_{\theta}\ \left[\begin{pmatrix} \hat\pi \\ \hat\theta_2 \end{pmatrix} - h(\theta)\right]' \hat\Omega_R^{-1} \left[\begin{pmatrix} \hat\pi \\ \hat\theta_2 \end{pmatrix} - h(\theta)\right] \qquad (2.4)$$
where $h(\theta) = \big(\pi(\theta)', \theta_2'\big)'$ and $\hat\Omega_R \xrightarrow{p} \Omega_R$.

Hence, MD-QLIML is a two-step procedure:

1. Estimate $(\hat\pi, \hat\theta_2)$ by solving the just-identified reduced form model; $(\hat\pi, \hat\theta_2)$ mainly represents the estimated mean responsiveness of $y_{i1}$ and $y_{i2}$ with respect to all available exogenous variation in the instruments.

2. Compress the information contained in $(\hat\pi, \hat\theta_2)$ into structural parameter estimates by solving the distance minimization problem (2.4).

Later, we will see that this two-step estimation procedure can remarkably enhance finite sample performance compared to the asymptotically equivalent optimal GMM estimator when the model is over-identified. The next proposition states the asymptotic distribution of MD-QLIML.

Proposition 2.3.2 Assume Assumptions 1-17. MD-QLIML is asymptotically normal:
$$\sqrt{N}\big(\hat\theta_{MD\text{-}QLIML} - \theta_o\big) \xrightarrow{d} N\Big(0,\ \big(H_o'\,\Omega_R^{-1} H_o\big)^{-1}\Big), \qquad \text{where } H_o = \frac{\partial}{\partial\theta'}\, h(\theta_o).$$

One significant distinction of MD-QLIML from Amemiya's GLS estimator is that the distance minimization problem of MD-QLIML considers the almost redundant-looking distance $\hat\theta_2 - \theta_2$, while that of AGLS imposes the constraint $\theta_2 = \hat\theta_2$ with a corresponding adjustment of the weighting matrix. In fact, AGLS can be interpreted as a 'concentrated MD-QLIML', where 'concentration' means $\theta_2$ is regarded as an implicit function of $\theta_1$ in the solution space of the minimization problem. This interpretation is based on the following general result, which is analogous to Proposition 1 in Crépon, Kramarz and Trognon (1997).

Proposition 2.3.3 (Concentrated Minimum Distance Estimator) Assume (1) $h(\theta) = (h_1', h_2')'$ is continuously differentiable in $\theta = (\theta_1, \theta_2)$, where $\theta_1 \in \mathbb{R}^{p_1}$, $\theta_2 \in \mathbb{R}^{p_2}$, $h_1 \in \mathbb{R}^g$, $h_2 \in \mathbb{R}^{p_2}$ with $g \ge p_1$; (2) $\partial h(\theta_o)/\partial\theta'$ has full column rank; (3) $h(\theta) \ne \pi_o$ if $\theta \ne \theta_o$, where $\pi_o = (\pi_{o1}, \pi_{o2})$; (4) $\sqrt{N}(\hat\pi - \pi_o) \xrightarrow{d} N(0, \Omega_o)$ with $\hat\pi_1 \in \mathbb{R}^g$ and $\hat\pi_2 \in \mathbb{R}^{p_2}$; (5) $\det\big(\partial h_2(\theta)/\partial\theta_2'\big) \ne 0$ for all $\theta \in \Theta$. Then $\varphi_N(\theta_1)$ is well-defined by $\hat\pi_2 - h_2(\theta_1, \varphi_N(\theta_1)) = 0$ for each $(\theta_1, N)$, and the estimator $\hat\theta_{c,1}$ derived from
$$\min_{\theta_1}\ \big(\hat\pi_1 - h_1(\theta_1, \varphi_N(\theta_1))\big)'\, \hat W_1\, \big(\hat\pi_1 - h_1(\theta_1, \varphi_N(\theta_1))\big)$$
is asymptotically equivalent to the minimum distance estimator of $\theta_1$ which solves
$$\min_{\theta}\ \big(\hat\pi - h(\theta)\big)'\, \hat W\, \big(\hat\pi - h(\theta)\big)$$
where $\hat W_1 \xrightarrow{p} \big(S_o\,\Omega_o\, S_o'\big)^{-1}$, $S_o = \left[I_g \ \ -\dfrac{\partial h_1(\theta_o)}{\partial\theta_2'}\left[\dfrac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}\right]$ and $\hat W \xrightarrow{p} \Omega_o^{-1}$. Moreover, the asymptotic distribution of $\hat\theta_{c,1}$ is
$$\sqrt{N}\big(\hat\theta_{c,1} - \theta_{o1}\big) \xrightarrow{d} N\Big(0,\ \big(H_c'\,[S_o\,\Omega_o\,S_o']^{-1} H_c\big)^{-1}\Big)$$
where
$$H_c = \frac{\partial h_1(\theta_o)}{\partial\theta_1'} - \frac{\partial h_1(\theta_o)}{\partial\theta_2'}\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}\frac{\partial h_2(\theta_o)}{\partial\theta_1'}.$$

Proposition 2.3.3 provides a method for constructing an asymptotically equivalent minimum distance estimator of $\theta_1$ by concentrating $\theta_2$ out. The key condition is that the dimension of the concentrated parameter $\theta_2$ equals that of the concentrating equation $\hat\pi_2 - h_2(\theta_1, \theta_2) = 0$. This condition is precisely satisfied for $\theta_1$ and $\theta_2$ in the QLIML framework. Applying Proposition 2.3.3 to the QLIML framework, we have $h_2(\theta_1, \theta_2) = \theta_2$, and the implicit function in the proposition is merely $\varphi_N(\theta_1) = \hat\theta_2$, a constant function of $\theta_1$. In turn, AGLS can be defined as concentrated MD-QLIML (cMD-QLIML), and its asymptotic distribution and asymptotic equivalence to MD-QLIML follow as a corollary.

Definition 2.3.4 cMD-QLIML (= AGLS) is defined to be a solution $\hat\theta_{1,cMD\text{-}QLIML}$ of
$$\min_{\theta_1}\ \big(\hat\pi - \pi(\theta_1, \hat\theta_2)\big)'\, \hat W_1\, \big(\hat\pi - \pi(\theta_1, \hat\theta_2)\big) \qquad (2.5)$$
where $\hat W_1 \xrightarrow{p} \big(S_o\,\Omega_R\, S_o'\big)^{-1}$ and $S_o = \left[I_g \ \ -\dfrac{\partial\pi(\theta_o)}{\partial\theta_2'}\right]$.

Corollary 2.3.5 Assume Assumptions 1-17. Then $V^{MD\text{-}QLIML}_{\theta_1} = V^{cMD\text{-}QLIML}_{\theta_1}$ and
$$\sqrt{N}\big(\hat\theta_{1,cMD\text{-}QLIML} - \theta_{o1}\big) \xrightarrow{d} N\Big(0,\ \big(H_c'\,[S_o\,\Omega_R\,S_o']^{-1} H_c\big)^{-1}\Big)$$
where $H_c = \dfrac{\partial\pi(\theta_o)}{\partial\theta_1'}$ and $S_o = \left[I_g \ \ -\dfrac{\partial\pi(\theta_o)}{\partial\theta_2'}\right]$.

To study the relative efficiency relationship between GMM-QLIML and AGLS (cMD-QLIML), it is useful to consider the following GMM counterpart of MD-QLIML.

Definition 2.3.6 (MD-QLIML-equivalent GMM) mGMM-QLIML is defined to be the optimal GMM estimator based on
$$\begin{pmatrix} \frac{\partial}{\partial\theta}\, q^R_1\big(\pi(\theta), \theta_2\big) \\ \frac{\partial}{\partial\theta_2}\, q_2(\theta_2) \end{pmatrix} \qquad (2.6)$$

The moments in (2.6) are the same first order conditions used in the first stage of MD-QLIML, except that $\pi$ is treated as a function of $\theta$. This estimator can be understood as combining the two-step procedure of MD-QLIML into one step: accounting for the mean responsiveness of $y_1$ and $y_2$ with respect to all exogenous variation in the instruments, choose $\theta$ optimally. Under the regularity and identification conditions for MD-QLIML (Assumptions 1-17), this estimator is well-defined and asymptotically equivalent to MD-QLIML.

Proposition 2.3.7 Assume Assumptions 1-17. Then MD-QLIML and mGMM-QLIML are asymptotically equivalent.

mGMM-QLIML is efficient relative to GMM-QLIML in the matrix positive semi-definite sense. Proposition 2.3.8 formalizes this result along with a condition under which mGMM-QLIML and GMM-QLIML are asymptotically equivalent.

Proposition 2.3.8 Assume Assumptions 1-17. Then $V^{mGMM\text{-}QLIML} \le V^{GMM\text{-}QLIML}$, where the inequality becomes an equality if $p_1 + p_{22} = g$.

The following proposition summarizes the results in the framework of the linear index model, which is the most frequently used class of models to which the results apply, though not the only one.

Proposition 2.3.9 Assume Assumptions 1-18.
Suppose
$$q_{i1}(\theta_1, \theta_2) = l\big(y_{i1},\ y_2\alpha + z_1\delta_1 + v_2\eta,\ \gamma\big)$$
$$q_{i2}(\theta_2) = -\frac{k}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_{22}| - \frac{1}{2}\, v_{i2}(\delta_2)\,\Sigma_{22}^{-1} v_{i2}(\delta_2)'$$
where $\theta_1 = (\alpha', \delta_1', \eta', \gamma')'$, $\theta_2 = \big(\mathrm{vec}(\delta_2)', \mathrm{vech}(\Sigma_{22})'\big)'$ and $v_{i2}(\delta_2) \equiv y_{i2} - z_i\delta_2$. Then the following results hold:

(a) $V^{MD\text{-}QLIML} \le V^{GMM\text{-}QLIML} \le V^{QLIML},\ V^{CF}$

(b) If $k_2 = r$, then $V^{MD\text{-}QLIML} = V^{GMM\text{-}QLIML} = V^{QLIML} = V^{CF}$

(c) If $\eta_o \ne 0$, then $V^{MD\text{-}QLIML} = V^{GMM\text{-}QLIML}$

(d) If $\eta_o = 0$, then $V^{MD\text{-}QLIML} \le V^{GMM\text{-}QLIML} = V^{QLIML} = V^{CF}$

(e) If $\eta_o = 0$ and $k_2 > r$, then $V^{MD\text{-}QLIML} = V^{GMM\text{-}QLIML}$ if and only if
$$E\left[\frac{\partial}{\partial\theta'}\frac{\partial \tilde q_{i1}}{\partial\tilde\pi}\bigg|_{(\tilde\pi', \theta_2')' = (\tilde\pi_o', \theta_{o2}')'}\right] = \mathrm{cov}_o\left(\frac{\partial \tilde q^o_{i1}}{\partial\tilde\pi},\ \begin{pmatrix} \frac{\partial q^o_{i1}}{\partial\theta_1} \\ \frac{\partial q^o_{i1}}{\partial\theta_2} + \frac{\partial q^o_{i2}}{\partial\theta_2} \end{pmatrix}\right)\left[V_o\begin{pmatrix} \frac{\partial q^o_{i1}}{\partial\theta_1} \\ \frac{\partial q^o_{i1}}{\partial\theta_2} + \frac{\partial q^o_{i2}}{\partial\theta_2} \end{pmatrix}\right]^{-1} E_o\begin{bmatrix} \frac{\partial^2 q^o_{i1}}{\partial\theta_1\,\partial\theta'} \\ \frac{\partial^2 (q^o_{i1}+q^o_{i2})}{\partial\theta_2\,\partial\theta'} \end{bmatrix}$$
where $\partial \tilde q_{i1}/\partial\tilde\pi$ is such that optimal GMM based on it together with the QLIML moment functions is asymptotically equivalent to mGMM-QLIML. A sufficient condition is that the GIMEs for $q_1$, $q_2$ and $q_1 + q_2$ hold with the same scaling factor for the reduced form model.

Result (a) also holds for cases where the index contains higher order terms of $y_2$. However, the asymptotic equivalence of MD-QLIML and GMM-QLIML in (b) and (c) does not hold in general when higher order terms of $y_2$ are present. (b) is a well-known property and, in fact, the estimators are numerically equivalent. (c) is the typical relative efficiency relationship when endogeneity is present. If a set of GIMEs with a common scaling factor holds, QLIML will also be asymptotically equivalent to MD-QLIML. (d) and (e) show the relative efficiency of MD-QLIML when there is no endogeneity. With over-identification, there exists a potential efficiency gain, which vanishes under a set of GIMEs for the reduced form model.
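To fix ideas before the examples, the two-step MD procedure can be sketched numerically in the simplest linear case with scalar $y_2$. The data-generating process and all names below are illustrative assumptions, and an identity weighting matrix is used for simplicity in place of the optimal $\hat W_1 \xrightarrow{p} (S_o\Omega_R S_o')^{-1}$ (consistent, though not efficient).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k1, k2 = 5000, 1, 3                     # k2 > r = 1: over-identified
z = rng.standard_normal((n, k1 + k2))
v2 = rng.standard_normal(n)
y2 = z @ np.ones(k1 + k2) + v2
# structural equation with endogeneity: error = 0.6*v2 + noise, so eta = 0.6
y1 = 0.5 * y2 + 0.5 * z[:, 0] + 0.6 * v2 + rng.standard_normal(n)

def ols(A, b):
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Step 1: just-identified reduced form estimates (pi_hat, delta2_hat)
delta2_hat = ols(z, y2)
v2hat = y2 - z @ delta2_hat
pi_hat = ols(np.column_stack([z, v2hat]), y1)   # (pi1, pi2, pi3)

# Step 2: minimum distance; pi(theta) = H11(delta2) theta1 with
# pi1 = delta21*alpha + delta1, pi2 = delta22*alpha, pi3 = alpha + eta
H11 = np.zeros((k1 + k2 + 1, 3))
H11[:k1 + k2, 0] = delta2_hat              # alpha column
H11[:k1, 1] = 1.0                          # delta1 column
H11[-1, 0] = 1.0                           # alpha enters pi3
H11[-1, 2] = 1.0                           # eta column
W1 = np.eye(k1 + k2 + 1)                   # identity weight (illustrative, not optimal)
theta1_md = np.linalg.solve(H11.T @ W1 @ H11, H11.T @ W1 @ pi_hat)
```

With the linear link $\pi(\theta) = H_{11}(\theta_2)\theta_1$, the second step reduces to the closed form $\hat\theta_1 = (H_{11}'\hat W_1 H_{11})^{-1} H_{11}'\hat W_1\hat\pi$, mirroring the derivation in the next section.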
2.4 Example 1: Linear Regression Model with Endogenous Explanatory Variables

From the equation $y_{i2} = z_i\delta_2 + v_{i2}$ and the uniqueness of linear projection, it is clear that $L(y_{i1}|y_{i2}, z_i) = L(y_{i1}|v_{i2}, z_i)$. Substituting for $y_2$ in $q_1$, or equivalently, substituting into the regression equation for $y_{i1}$, yields
$$y_{i1} = y_{i2}\alpha + z_{i1}\delta_1 + v_{i2}\,\Sigma_{22}^{-1}\Sigma_{21} + e_{i1} = z_{1i}(\delta_{21}\alpha + \delta_1) + z_{2i}\,\delta_{22}\alpha + v_{i2}\big(\alpha + \Sigma_{22}^{-1}\Sigma_{21}\big) + e_{i1} \equiv z_{1i}\pi_1 + z_{2i}\pi_2 + v_{i2}\pi_3 + e_{i1}.$$
Along with $\pi_4 \equiv \sigma_{11|2}(\theta)$, $\pi(\theta)$ is naturally defined as
$$\pi(\theta) = \Big((\delta_{21}\alpha + \delta_1)',\ (\delta_{22}\alpha)',\ \big(\alpha + \Sigma_{22}^{-1}\Sigma_{21}\big)',\ \sigma_{11|2}(\theta)\Big)'$$
while the dependence of $q_{i1}$ on $\theta_2$ runs through the control function $v_{i2}(\delta_2)$. Consistency and asymptotic normality of the estimator of $(\pi, \theta_2)$ are implied by the invertibility of $E[z'z]$ and the other regularity conditions assumed for the QLIML and CF estimators. However, additional assumptions are needed in nonlinear models in general. For example, when we allow higher order terms of $y_2$ in the regression function, as in
$$y_{i1} = y_{i2}^2\,\alpha + z_{i1}\delta_1 + v_{i2}\,\Sigma_{22}^{-1}\Sigma_{21} + e_{i1},$$
it is required to impose additional orthogonality and rank conditions on the structural and reduced form models; i.e., (a) the regression function is specified with the conditional mean operator rather than the linear projection operator, and (b) linear independence of the higher order terms of the instruments is assumed.

The following demonstrates a computationally useful reparameterization of classical LIML under which explicit expressions for $H_o$, $H_c$, $S_o$ and the closed form of the cMD-QLIML estimator are available. Consider the reparametrization
$$\eta \equiv \Sigma_{22}^{-1}\Sigma_{21}, \qquad \sigma_{11|2} \equiv \sigma_{11} - \Sigma_{12}\,\Sigma_{22}^{-1}\Sigma_{21}.$$
It can easily be shown that this reparameterization does not alter the other parameter estimates of any estimation method discussed in this chapter.
Taking the derivative of $h(\theta)$ at $\theta_o$, we have
$$H_o = \begin{pmatrix} H_{11}(\theta_{o2}) & H_{12}(\theta_{o1}) \\ 0 & I_{p_2} \end{pmatrix}$$
where
$$H_{11}(\theta_2) = \begin{pmatrix} \delta_{21} & I_{k_1} & 0 & 0 \\ \delta_{22} & 0 & 0 & 0 \\ I_r & 0 & I_r & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad H_{12}(\theta_1) = \begin{pmatrix} \alpha' \otimes I_k & 0_{k\times\frac{r(r+1)}{2}} \\ 0_{r\times rk} & 0 \\ 0_{1\times rk} & 0 \end{pmatrix}.$$
Hence, we have $H_c = H_{11}(\theta_{o2})$ and $S_o = \big[I_g\ \ -H_{12}(\theta_{o1})\big]$. Preliminary estimates of $\alpha$, needed to construct the weighting matrix for cMD-QLIML, can be calculated by QLIML or CF. Moreover, since $\pi(\theta) = H_{11}(\theta_2)\theta_1$, taking the first order condition of (2.5) yields the closed form
$$\hat\theta_{1,cMD\text{-}QLIML} = \Big[H_{11}\big(\hat\theta_2\big)'\,\hat W_1\, H_{11}\big(\hat\theta_2\big)\Big]^{-1} H_{11}\big(\hat\theta_2\big)'\,\hat W_1\,\hat\pi.$$
Note that $e_i(\theta) = e_i(\pi(\theta))$, where
$$e_i(\theta) \equiv y_{i1} - y_{i2}\alpha - z_{i1}\delta_1 - v_{i2}\,\Sigma_{22}^{-1}\Sigma_{21}, \qquad e_i(\pi(\theta)) \equiv y_{i1} - z_{1i}\pi_1(\theta) - z_{2i}\pi_2(\theta) - v_{i2}\pi_3(\theta).$$
By replacing $e_i(\theta)$ with $e_i(\pi)$ in $q_{i1}$, along with the reparametrized $\pi_4 = \sigma_{11|2}(\theta)$, a reduced form model likelihood $q^R_1(\pi, \theta_2)$ is derived:
$$q^R_{i1}(\pi, \theta_2) = -\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\pi_4 - \frac{1}{2\pi_4}\, e_i(\pi)^2.$$
Taking derivatives with respect to $\pi$, we have
$$\frac{\partial q^R_{i1}(\pi, \theta_2)}{\partial\pi} = \begin{pmatrix} \pi_4^{-1}\, e_i(\pi)\, z' \\ \pi_4^{-1}\, e_i(\pi)\, v_2' \\ -\frac{1}{2\pi_4} + \frac{1}{2\pi_4^2}\, e_i(\pi)^2 \end{pmatrix}$$
Then the mGMM-QLIML moment functions are derived by treating $\pi(\theta)$ as a function of $\theta$:
$$\frac{\partial q^R_{i1}(\pi(\theta), \theta_2)}{\partial\pi} = \begin{pmatrix} \sigma_{11|2}(\theta)^{-1}\, e_i(\theta)\, z' \\ \sigma_{11|2}(\theta)^{-1}\, e_i(\theta)\, v_2' \\ -\frac{1}{2}\sigma_{11|2}(\theta)^{-1} + \frac{1}{2}\sigma_{11|2}(\theta)^{-2}\, e_i(\theta)^2 \end{pmatrix}$$
It is not difficult to see that mGMM-QLIML is asymptotically equivalent to the GMM-QLIML whose $\partial q_1/\partial\theta_{22}$ is replaced with (1.11), as discussed in the previous section. To see the existence of $C(\theta)$, assume without loss of generality that the first $r$ instruments in $z_2$ are chosen in $\partial q_1/\partial\theta_{22}$; it is then implied that
$$C(\theta) = \eta_{i_o}\Big[\,0_{(k_2-r)\times k_1}\quad I_{k_2-r}\quad 0_{(k_2-r)\times(2r+1)}\,\Big]$$
when $\eta_{i_o}$ is nonzero for the chosen $i_o$ and $k_2 > r$.

2.5 Example 2: Probit with Endogenous Explanatory Variables

Consider the probit model with endogeneity:
$$y_{i1} = 1\big[y_{i2}\alpha + z_{i1}\delta_1 + u_{i1} > 0\big], \qquad y_{i2} = z_i\delta_2 + v_{i2}.$$
Assume regularity and identification conditions (Assumptions 1-17).
For computational convenience, a reparametrization similar to that in Example 1 is imposed along with the normalization $e_1 = u_1 - v_2\eta$. Also, $\Sigma_{22}$ is dropped from $q_2$, since its exclusion does not affect the other parameter estimates. The quasi-likelihood can then be simplified as
$$q_1(\theta_1, \theta_2) = (1 - y_1)\log\big[1 - \Phi(w(\theta))\big] + y_1\log\Phi(w(\theta))$$
$$q_{i2}(\theta_2) = -\frac{1}{2}\, v_{i2}(\delta_2)\, v_{i2}(\delta_2)'$$
where $\theta_1 = (\alpha', \delta_1', \eta')'$, $\theta_2 = \mathrm{vec}(\delta_2)$, and $w(\theta) = y_2\alpha + z_1\delta_1 + v_2\eta$. (By an invertible transformation of the mGMM moment functions and the separability condition of GMM, it can also be shown that, in the linear model of Example 1, mGMM-QLIML is asymptotically equivalent to the optimal GMM estimator based on $E[z'u] = 0$.) Taking derivatives, the quasi-scores can be expressed as
$$\frac{\partial q_1}{\partial\theta_1} = \frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\begin{pmatrix} y_2' \\ z_1' \\ v_2' \end{pmatrix}$$
$$\frac{\partial q_1}{\partial\theta_2} = -\frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\,\big(\eta \otimes z'\big)$$
$$\frac{\partial q_2}{\partial\theta_2} = \big(I_r \otimes z'\big)\,(y_2 - z\delta_2)'$$
It is easy to see that the GMM-QLIML extra moment functions are
$$\frac{\partial q_1}{\partial\theta_{22}} = -\frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\,\eta_{i_o}\, z_{2,-r}'$$
where $\eta_{i_o} \ne 0$. The components of $H_o$, $H_c$ and $S_o$ can be calculated similarly as in Example 1:
$$H_{11}(\theta_2) = \begin{pmatrix} \delta_{21} & I_{k_1} & 0 \\ \delta_{22} & 0 & 0 \\ I_r & 0 & I_r \end{pmatrix}, \qquad H_{12}(\theta_1) = \begin{pmatrix} \alpha' \otimes I_k \\ 0_{r\times rk} \end{pmatrix}.$$
To derive the mGMM moment functions, note that
$$w(\theta) = y_2\alpha + z_1\delta_1 + v_2\eta = z_1\pi_1 + z_2\pi_2 + v_2\pi_3 = w(\pi(\theta)).$$
Then, differentiating with respect to the reduced form parameters, we have
$$\frac{\partial q^R_1(\pi(\theta), \theta_2)}{\partial\pi} = \frac{y_1 - \Phi(w(\theta))}{\Phi(w(\theta))\big[1 - \Phi(w(\theta))\big]}\,\phi(w(\theta))\begin{pmatrix} z' \\ v_2' \end{pmatrix}$$
which shows that this model is in the LL class, as claimed by Proposition 2.3.9.

2.6 Monte Carlo Simulation on the Probit Model with EEV

In this section, Monte Carlo simulations are conducted for the probit model under several specifications. The purpose is to investigate the effects of the GIMEs on the finite sample performance of the estimators when the model is over-identified.
Based on the relative asymptotic efficiency results in the previous sections, it is expected that MD and its equivalent estimators outperform CF and QLIML (in terms of standard deviation) at a large enough sample size when enough misspecification is present in an overidentified model. Root mean squared error (RMSE) is used as the main performance measure in this study to take account of bias as well as mean deviation. The assumptions on data generation are as follows: all data points are independently and identically generated. The instruments {z_k}_{k=1}^{10} are mutually independent and z_k ~ SBin(10^4, 1/3) for each 1 ≤ k ≤ 10, where SBin(n, p) ≡ [Bin(n, p) − np] / sqrt(np(1−p)). The regression equation for the scalar y_2 is y_2 = z_1 + ... + z_10 + v_2. For the specification of the GIMEs of q_1 and q_2, the following restrictions were imposed in each case:

- Probit y_1 (GIME holds): y_1 = 1[ z_1 + y_2 + η v_2 + e_2 > 0 ]
- Fractional y_1 (GIME fails): y_1 = (1/4) 1[ z_1 + y_2 + η v_2 + e_2 > 0 ] + (3/4) 1[ z_1 + y_2 + η v_2 + (e_2 + e_3)/√2 > 0 ]
- Homoskedastic v_2: v_2 ~ SBin(10^4, 1/3)
- Heteroskedastic v_2: v_2 ~ z_1 z_2 · SBin(10^4, 1/3)

where e_2 and e_3 each follow independent standard normal distributions. Due to the normalized variance of v_2, it can be shown that, by considering q_2 without the Σ_22 term, only homoskedasticity is needed for the GIME to hold for q_2. The fractional response y_1 fails the GIME for q_1 due to the correlation between e_2 and (e_2 + e_3)/√2. Since the relevant random variables are all discrete and have bounded supports, the RMSEs for all estimators are well-defined and can be used in comparison. The number of repetitions is 10^4, and mGMM was estimated by the iterative (continuously updating) method. The simulation program was written in the ado/Mata language of Stata 13, and it was executed using High Performance Computing Center (HPCC) resources provided by the Institute for Cyber-Enabled Research (iCER) at Michigan State University. Table 2.1 shows the results.
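As an illustration of this design, the following Python sketch draws the standardized-binomial instruments and one observation from the binary/homoskedastic specification (Simulation I). It is not the chapter's Stata/Mata program; n = 100 is used here instead of 10^4 purely to keep the sketch fast, and η is set to the value 0.6 used in Table 2.1:

```python
import random

def sbin(n, p, rng):
    """One draw of the standardized binomial SBin(n, p) = (Bin(n, p) - np) / sqrt(np(1-p))."""
    b = sum(rng.random() < p for _ in range(n))  # Bin(n, p)
    return (b - n * p) / (n * p * (1 - p)) ** 0.5

def draw_obs(rng, eta=0.6, n=100):
    """One observation (y1, y2, z, v2) from the binary/homoskedastic design."""
    z = [sbin(n, 1 / 3, rng) for _ in range(10)]  # 10 mutually independent instruments
    v2 = sbin(n, 1 / 3, rng)
    y2 = sum(z) + v2                              # reduced form: y2 = z1 + ... + z10 + v2
    e2 = rng.gauss(0.0, 1.0)
    y1 = 1 if z[0] + y2 + eta * v2 + e2 > 0 else 0
    return y1, y2, z, v2

rng = random.Random(42)
sample = [draw_obs(rng) for _ in range(500)]
share = sum(obs[0] for obs in sample) / len(sample)
assert 0.0 < share < 1.0  # both outcomes occur
```

The fractional-response and heteroskedastic cases only change the last two lines of `draw_obs` accordingly.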
Simulation I is a case where the complete set of GIMEs holds, so that MD, cMD, mGMM and QLIML are all asymptotically equivalent.

Table 2.1 Root Mean Squared Error and Standard Deviation (η = 0.6). [The table reports the RMSE and SD of α̂, δ̂, and η̂ for the CF, QLIML, mGMM, cMD, and MD estimators under specifications I (binary y_1, homoskedastic v_2), II (binary, heteroskedastic), III (fractional, homoskedastic), and IV (fractional, heteroskedastic).]

In Simulation I, thus, all estimators are asymptotically equivalent, including CF. The other simulations (II, III, IV) have at least one GIME failing, and MD, cMD and mGMM are efficient relative to both QLIML and CF. Standard deviations are also presented along with the RMSEs. There are some points to be mentioned: 1) These results show that there can be cases where minimum distance and its equivalent estimators outperform CF and QLIML in finite samples. 2) The minimum distance estimators and their equivalents, except mGMM, behave quite similarly to each other in Simulation I, as predicted.
3) MD-QLIML performs remarkably well, while the asymptotically equivalent mGMM had poor finite-sample behavior. Compared to the other estimators, MD-QLIML usually has the best performance and, even when it is second best, the RMSE difference from the best is not large.

CHAPTER 3 SHORT PANEL DATA QUANTILE REGRESSION MODEL WITH SPARSE CORRELATED EFFECTS

3.1 Introduction

Application of quantile regression to panel data is attractive to empirical researchers. Compared to conditional mean regression, quantile regression by nature provides a more thorough description of the population distribution. With its application to panel data, the unobserved individual effects can be accounted for, so that a potential source of endogenous variation is eliminated. A natural quantile analogue of a linear panel data model, however, suffers from the well-known incidental parameters problem, as in generic nonlinear models (Neyman and Scott, 1954). Rosen (2012) shows that with the time dimension fixed, the conditional quantile restriction alone cannot identify the regression coefficients in general. Additional point-identifying restrictions considered in the literature so far assume at least one of the following: (i) an infinite time dimension, (ii) pure location-shifting unobserved effects, or (iii) a certain degree of within-group independence of the regression errors (e.g., Koenker, 2004; Rosen, 2012; Lamarche, 2010; Canay, 2011; see Section 3.2). Depending on the empirical context, these assumptions may not be credible for short panel data analysis, and any breakdown of such identifying restrictions will result in inconsistent estimation. The purpose of this chapter is to study an alternative point-identifying model restriction and a feasible estimation procedure for linear panel data quantile regression with a fixed time dimension. The main contributions of this chapter are as follows.
First, I propose a new point-identifying restriction for a linear panel data quantile regression model with a finite time dimension. The new model restriction reasonably accounts for the τ-quantile-specific time-invariant heterogeneity and allows arbitrary within-group dependence of the regression errors. The generalized Chamberlain device is taken analogously as a control function to capture τ-quantile-specific time-invariant endogenous variations. Endogeneity due to an observability pattern in unbalanced panel data can be accounted for as well. Second, the asymptotic properties of a nonconvex penalized estimator are studied. To treat the high-dimensional nature of the generalized Chamberlain device, a nonconvex penalized estimator is adopted. Compared to the exact sparse models for cross-sectional data in Wang, Wu and Li (2012; WWL) and Sherwood and Wang (2016; SW), the model in consideration accounts for an approximation error in the sparse model and for within-group dependence of panel data. The convergence rate and asymptotic distribution of the oracle estimator are studied under both exact and approximate sparsity assumptions. A sparse version of the standard partially linear semiparametric model asymptotics is derived under approximate sparsity. The proposed penalized estimator is shown to have an oracle property in the sense that the estimator based on the true sparse model belongs to the local minima of the penalized quasi-likelihood with probability tending to one. The lower-bound condition on the smallest magnitude of the nonzero coefficients, the so-called beta-min condition, is relaxed compared to the one given in SW. Third, a transformation of the sieve-approximated correlated effects into a generalized Mundlak form is proposed to make the sparsity assumption more plausible in some cases. Given a choice of sieve basis elements, the approximating terms are transformed into time averages and deviations.
Whenever the sieve elements contain a first-order polynomial term, both the classical Chamberlain and the Mundlak form of correlated effects are nested by the transformed approximating terms as special cases of true sparse models. Fourth, Monte Carlo simulations show that, depending on the true model, the estimator using a generalized Chamberlain form can outperform the one using a Mundlak form, and vice versa. Fifth, an empirical application to birth weight analysis demonstrates a convincing case where the proposed estimator works as intended in real data. The rest of this chapter is organized as follows: Section 3.2 gives a brief literature review of linear panel data quantile regression. In Section 3.3, the new identifying restriction is explained and formalized. Along with the sieve-approximated correlated effects, nonconvex penalized estimation and its asymptotic properties are presented in Section 3.4. Simulation results are discussed in Section 3.5. The empirical application to birth weight analysis is in Section 3.6. Section 3.7 contains concluding remarks.

3.2 Literature on Linear Panel Data Quantile Regression

The literature on linear panel data quantile regression models has been growing rapidly in recent years. First, there are several studies where both the time dimension T and the sample size N are assumed to be large. Koenker (2004) proposed penalized estimation under the pure location shift restriction and large (T, N) asymptotics. Lamarche (2010) showed that the penalized estimator is unbiased under a zero median condition on the fixed effects, and derived an optimal choice of the penalty parameter. Harding and Lamarche (2016) considered a semiparametric correlated effects model in a framework similar to Koenker (2004) and Lamarche (2010). Kato, Galvao Jr. and Montes-Rojas (2012) formally studied asymptotic results when (T, N) tends to infinity.
They relaxed the intertemporal independence assumption in Koenker (2004), and found that, for asymptotic normality, the rate condition imposed on T is more restrictive than the one found in generic nonlinear models due to the non-smoothness of the loss function. This result indicates that its short panel data application is even less appealing. Second, point-identifying restrictions and estimation methods for the fixed-T case have been studied. Rosen (2012) showed that weak conditional independence of the regression errors across time, together with some support and tail conditions, implies point-identification. Canay (2011) showed that an alternative conditional independence restriction in the random coefficient framework is also sufficient for identification, and he proposed a simple estimation method when the unobserved effects are pure location shifters. When the independence assumption is strengthened to i.i.d., Graham, Hahn and Powell (2009) showed that there is no incidental parameters problem since the first-differenced regression errors have a zero conditional median. Abrevaya and Dahl (2008), without explicitly setting up rigorous model restrictions, applied a quantile analogue of a correlated random effects model to analyze the effects of birth inputs on birth weight. Apart from linear panel data quantile regression models, there are several related works on panel data models. Wooldridge and Zhu (2016, manuscript) proposed a high-dimensional probit model with sparse correlated effects under fixed T. Arellano and Bonhomme (2016) considered a class of nonlinear panel data models under fixed T where the unobserved heterogeneity is nonparametrically modelled. Graham, Hahn, Poirier and Powell (2015, manuscript) extend the correlated random coefficients representation of linear quantile regression to panel data under fixed T. Chernozhukov, Fernández-Val, Hahn and Newey (2013) studied a general nonseparable model assuming time-homogeneous errors and large (T, N).
3.3 Identification

3.3.1 Generalized Chamberlain Device

One of the essential advantages of using panel data is to resolve the potential endogeneity problem that arises from unobserved time-constant heterogeneity. The unobserved effects are typically specified as unknown coefficient parameters on individual dummy variables. In the linear panel data conditional mean model, such a specification is useful: both the differencing method and direct control of the dummies yield consistent estimators under mild conditions. Unfortunately, panel data quantile regression with individual dummies suffers from an incidental parameters problem in general. I propose a generalized Chamberlain device as an alternative approach to eliminating time-invariant endogeneity in the spirit of a control function approach. The idea is to control the time-constant endogenous variation (regressor-correlated variation) only, not the whole heterogeneous individual effect in the unobserved error. A well-known example in the conditional mean model clearly demonstrates this idea: Suppose, for 1 ≤ i ≤ N and 1 ≤ t ≤ T,

y_it = x_it β + c_i + v_it    (3.1)

where x_it ∈ R^K, x_i ≡ (x_i1, ..., x_iT), E[v_it | x_i, c_i] = 0, and x_it is assumed to be time-varying and continuously distributed. Here, the unobserved time-invariant effect is denoted as c_i following Chamberlain (1984). By taking the conditional expectation of y_it given x_i, we have

E[y_it | x_i] = x_it β + g(x_i)    (3.2)

for some measurable function g : R^{TK} → R. Note that the unknown arbitrary function g does not depend on the time index t. Then, regression with a sieve-approximated g(x_i), for example, yields a consistent estimator of β. In this sense, the conditional moment restriction (3.2) can be viewed as a control function counterpart for the linear panel data model (see Section 19.8.2 of Li and Racine (2007) for details). Such a control function g will be called a "generalized Chamberlain device" or "correlated effect" in this chapter.
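The conditional-mean example (3.1)-(3.2) can be demonstrated in a few lines. The following Python sketch (illustrative only; the specific g and the sample sizes are invented for the demonstration) generates c_i correlated with x_i, adds sieve terms in x_i that span E[c_i | x_i] to the pooled regression, and recovers β by ordinary least squares:

```python
import random

rng = random.Random(0)
N, T, beta = 4000, 2, 1.5

# c_i depends on x_i = (x_i1, x_i2): endogeneity through E[c_i | x_i] = g(x_i)
rows, ys = [], []
for i in range(N):
    x = [rng.gauss(0, 1), rng.gauss(0, 1)]
    c = 0.8 * (x[0] + x[1]) + 0.3 * x[0] * x[1] + rng.gauss(0, 1)
    for t in range(T):
        ys.append(x[t] * beta + c + rng.gauss(0, 1))
        # regressors: x_it plus sieve terms in x_i approximating g(x_i)
        rows.append([x[t], 1.0, x[0], x[1], x[0] * x[1]])

# pooled OLS via Gauss-Jordan elimination on the normal equations
K = len(rows[0])
A = [[sum(r[a] * r[b] for r in rows) for b in range(K)] for a in range(K)]
b = [sum(r[a] * y for r, y in zip(rows, ys)) for a in range(K)]
for a in range(K):
    piv = A[a][a]
    A[a] = [v / piv for v in A[a]]
    b[a] /= piv
    for j in range(K):
        if j != a:
            f = A[j][a]
            A[j] = [vj - f * va for vj, va in zip(A[j], A[a])]
            b[j] -= f * b[a]

beta_hat = b[0]
assert abs(beta_hat - beta) < 0.1  # g(x_i) is absorbed by the sieve terms
```

Dropping the sieve terms in x_i from `rows` reintroduces the omitted-variable bias, which is exactly the endogeneity that the correlated effect g is meant to absorb.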
To date, this approach has not been considered seriously in the linear panel data literature, since the methods based on direct control or removal of the individual effect c_i eliminate the potential endogeneity equally well without much difficulty. The generalized Chamberlain device is taken analogously in the quantile regression setting. Suppose, for 1 ≤ i ≤ N and 1 ≤ t ≤ T, the structural equation is

y_it = x_it β + u_it    (3.3)

where, for simplicity, x_it is assumed to be time-varying and continuously distributed. We consider balanced panel data from now on unless explicitly mentioned. For each τ ∈ (0, 1), we can write

Q_τ(y_it | x_i) = x_it β(τ) + g_t(x_i, τ)    (3.4)

for some measurable function g_t : R^{TK} × (0, 1) → R. The function g_t(x_i, τ) represents the τ-quantile-specific endogenous variation contained in u_it, which is allowed to vary across time given x_i. Unfortunately, such g_t is not separately identifiable from x_it β(τ) in general. Now, assume that any endogenous variation contained in u_it is time-constant in the sense that g_t(x_i, τ) does not depend on t but is allowed to have a constant level difference across time. Then, for some constants k_t(τ), we have

Q_τ(y_it | x_i) = x_it β(τ) + g(x_i, τ) + k_t(τ)    (3.5)

where k_T(τ) = 0 is imposed for normalization. Note that (3.5) is a quantile analogue of (3.2) with the introduction of time effects, and that it formalizes the "time-constant endogeneity" assumption for u_it. It relies on neither additivity of the composite error, c_i + v_it, nor the widely used conditional quantile restriction Q_τ(v_it | x_i) = 0.

3.3.2 Model Restriction and Identification

In this subsection, the time-constant endogeneity assumption is used to derive the generalized Chamberlain device for a formal structural equation. Together with the derived control function, a set of model restrictions that attains point-identification is presented.
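Estimation of conditional quantile restrictions such as (3.4)-(3.5) rests on the check function ρ_τ(u) = u(τ − 1[u < 0]), whose expected value is minimized at the τ-quantile. The following self-contained Python sketch (an illustration, not part of the formal development) verifies this fact on a simulated sample by grid search:

```python
import random

def rho(u, tau):
    """Check loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))

# The minimizer of sum_i rho_tau(y_i - q) over q is the empirical tau-quantile.
rng = random.Random(1)
y = sorted(rng.gauss(0, 1) for _ in range(2001))
tau = 0.25
grid = [i / 100 - 2 for i in range(401)]  # candidate q values in [-2, 2]
obj = [(sum(rho(v - q, tau) for v in y), q) for q in grid]
q_hat = min(obj)[1]                        # grid minimizer of the check loss
q_emp = y[int(tau * len(y))]               # empirical tau-quantile
assert abs(q_hat - q_emp) < 0.05
```

In the panel setting, the same loss is applied to y_it − w_it β̄ − g(x_i, z_i) with g replaced by its sieve approximation, as developed in Section 3.4.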
For each i = 1, ..., N and t = 1, ..., T, we observe (y_it, x_it, z_i, v_t). The response variable is y_it ∈ R, and the covariates are time/individual-varying variables x_it ∈ R^{K1}, time-constant variables z_i ∈ R^{K2}, and individual-constant variables v_t ∈ R^{K3}. The covariates are allowed to contain both continuous and discrete variables, which will be notated by tilde and dot accents, respectively. Specifically, x̃_it ∈ R^{K1c} and z̃_i ∈ R^{K2c} are continuous while ẋ_it ∈ R^{K1d} and ż_i ∈ R^{K2d} are discrete, where K1 = K1c + K1d and K2 = K2c + K2d by construction. For the individual-varying variables, we assume random sampling conditional on the individual-constant variables.

Assumption 1 (Random Sample) {y_it, x_it, z_i}_{t=1}^T are i.i.d. across i conditional on {v_t}_{t=1}^T.

Since we consider linear quantile regression models, it is natural to assume the structural equation for y_it to be a linear function of the observed covariates. Throughout the paper, the structural equation is defined as follows: for i = 1, ..., N and t = 1, ..., T,

y_it = x_it β + z_i η + v_t λ + u_it    (3.6)

where u_it is an unobserved error. Equation (3.6) describes the data generating process of the response variable y_it, which is typically implied by economic theories and specific empirical contexts. We may also think of it as an equation in the researcher's mind. Following Hurwicz (1950) and Koopmans and Reiersøl (1950), it constitutes a 'structure' when paired with a joint distribution function of ({u_it, x_it}_{t=1}^T, z_i) conditional on {v_t}_{t=1}^T,

F_{{u_it, x_it}_{t=1}^T, z_i | {v_t}_{t=1}^T}(u_i1, ..., u_iT, x_i1, ..., x_iT, z_i | v_1, ..., v_T).    (3.7)

Depending on the model restrictions imposed on (3.6) and (3.7), the interpretation of the parameters (β, η, λ) changes.
For example, conditional quantile restrictions on u_it with different values of τ ∈ (0, 1) will change the value and interpretation of (β, η, λ) in general. Under the model restriction of a generalized Chamberlain device, neither η nor λ can be identified. However, the following argument shows that it is important to include the time-constant regressors z_i when the control function g is constructed. Taking the conditional τ-quantile of y_it, we have

Q_τ(y_it | x_i, z_i, {v_t}_{t=1}^T) = x_it β(τ) + z_i η(τ) + v_t λ(τ) + f_t(x_i, z_i, {v_t}_{t=1}^T, τ)    (3.8)

for some measurable function f_t : R^{T(K1+K3)+K2} × (0, 1) → R. Note that the effect of {v_t}_{t=1}^T on f_t and that of λ on f_t are confounded. Thus, without loss of generality, we can write f_t(x_i, z_i, {v_t}_{t=1}^T, τ) = h_t(x_i, z_i, τ) for some h_t. (The notation neglects randomness arising from {v_t}_{t=1}^T since {v_t}_{t=1}^T is always fixed in this chapter.) The time-constant endogeneity assumption then implies h_t(x_i, z_i, τ) = h(x_i, z_i, τ) + m_t(τ) for some function h and some constants m_t. The conditional quantile of y_it can be written as

Q_τ(y_it | x_i, z_i, {v_t}_{t=1}^T) = x_it β(τ) + g(x_i, z_i, τ) + k_t(τ)    (3.9)

where g(x_i, z_i, τ) = z_i η(τ) + h(x_i, z_i, τ) and k_t(τ) = v_t λ(τ) + m_t(τ). Assumption 2 below summarizes the model restriction of the generalized Chamberlain device. From now on, we will drop {v_t}_{t=1}^T from the conditioning and treat the k_t's as parameters to be estimated. The parameters' dependence on τ will also be omitted. The time dummies are denoted as d_t for t = 1, ..., T − 1, and they will be considered together with x_it as in w_it = [ x_it  d_1  ...  d_{T−1} ]. The kth element of w_it is written as w_itk. The corresponding coefficient parameters are defined as β̄ = (β', κ')' ∈ R^{K1+(T−1)}, where κ = (k_1, ..., k_{T−1})'.

Assumption 2 (Correlated Effect) There exists a measurable function g : R^{K1 T + K2} →
R such that, for all (i, t),

Q_τ(y_it | x_i, z_i) = w_it β̄ + g(x_i, z_i).    (3.10)

Assumption 2 is a new model restriction that takes a control function approach for the linear panel data quantile regression model. Note that the correlated effect g depends on the time-constant variables z_i that enter the structural equation. Although the causal effects of z_i on y_it are not identified, it is important to include z_i as arguments of g. Also, there may be some deterministic relationships among (x_i, z_i) which should be dealt with. For example, we may have x_itk = x_it'k for some t, t', k with probability one. Then, having both variables is redundant for g, and one should be removed from the specification. Similarly, if there is a functional relationship between the covariates, such as x_itk2 = (x_itk1)^2, only the one containing the finer information, x_itk1, should remain. Throughout the paper, such redundancy is assumed away for simplicity. The following theorem shows that a certain degree of richness in the support of {w_it}_t and a well-behaved error distribution are sufficient for point-identification of β̄ under Assumptions 1-2. Define ε_it ≡ y_it − w_it β̄ − g(x_i, z_i), and let f_it(ε) be the density of ε_it conditional on (x_i, z_i). The condition imposed on f_it below is parts (i) and (ii) of Assumption 3 in Section 3.4.2.

Theorem 3.3.1 (Identification of β̄) Let W̄_i = ((w_i2 − w_i1)', ..., (w_iT − w_i(T−1))')'. Assume Assumptions 1-2 and that f_it(ε) is continuous and uniformly bounded away from 0 and ∞ in a neighborhood of 0.
Suppose that the support of (w_i1, ..., w_iT) contains J points (w_i1^(j), ..., w_iT^(j)), 1 ≤ j ≤ J, such that the J(T − 1) × (K1 + T − 1) matrix

[ W̄_i^(1)'  ...  W̄_i^(J)' ]'    (3.11)

has full column rank, the pmf of (ẋ_it, ẋ_it') satisfies p(ẋ_it^(j), ẋ_it'^(j)) > 0 for all j, and the pdf of (x̃_it, x̃_it') satisfies f_{(x̃_it, x̃_it') | (ẋ_it, ẋ_it')}(x̃_it^(j), x̃_it'^(j) | ẋ_it^(j), ẋ_it'^(j)) > 0 for all j, where f_{(x̃_it, x̃_it') | (ẋ_it, ẋ_it')} has a continuous extension at each (w_it^(j), w_it'^(j)). Then β̄ is identified.

The result above is not surprising since the current specification does not have incidental parameters. It shows that a specification with an unknown function common to every individual has more identifying power than one with unknown parameters unique to each individual. For the rest of the paper, we assume point-identification.

3.3.3 Case of Unbalanced Panel Data with Time-constant Endogeneity

When some time periods are missing for some individuals in the observed panel data, potential endogeneity related to observability should be treated as well. We assume the structural equations for all observed units and time periods are homogeneous. Then, with the introduction of selection indicators and auxiliary balanced data, the generalized Chamberlain device can be modified to account for time-constant endogeneity related to observability. The approach can be regarded as a nonparametric version of the correlated random effects models studied by Wooldridge (2009). First, define the selection indicator s_it to be a binary function that takes the value 1 if (y_it, x_it, z_i) is observed at t, and 0 otherwise. In addition, consider auxiliary balanced panel data (s_it y_it, s_it x_it, s_it z_i, s_it) for i = 1, ..., N and t = 1, ..., T. The corresponding structural equation is assumed to be

s_it y_it = s_it x_it β + s_it z_i η + s_it v_t λ + s_it u_it    (3.12)

which is derived by multiplying the original structural equation by s_it.
Equation (3.12) is restrictive only for the observed time periods, and it assumes the structural equations are homogeneous across all observed units and time periods. The conditional τ-quantile of s_it y_it given S_T ≡ ({s_it x_it}_{t=1}^T, s_it z_i, {s_it v_t}_{t=1}^T, s_i) is

Q_τ(s_it y_it | S_T) = s_it x_it β + s_it z_i η + s_it v_t λ + s_it g_t({s_it x_it}_{t=1}^T, s_it z_i, {s_it v_t}_{t=1}^T, s_i)    (3.13)

for some g_t, where s_i = (s_i1, ..., s_iT). Then, based on the previous argument, the time-constant endogeneity assumption implies a conditional quantile restriction

Q_τ(s_it y_it | {s_it x_it}_{t=1}^T, s_it z_i, s_i) = s_it x_it β + s_it g({s_it x_it}_{t=1}^T, s_it z_i, s_i) + s_it k_t(s_i)    (3.14)

where {s_it v_t}_{t=1}^T is omitted and (g(·), k_t) depend on the selection indicator s_i. Since (3.14) is not restrictive when s_it = 0, we may write an equivalent model restriction for (i, t) with s_it = 1 as

Q_τ(y_it | {x_it}_{t: s_it=1}, z_i, s_i) = x_it β + g({x_it}_{t: s_it=1}, z_i, s_i) + k_t(s_i).    (3.15)

Since the generalized Chamberlain device g({x_it}_{t: s_it=1}, z_i, s_i) in (3.15) is a function of the selection indicator s_i, it now accounts for time-constant endogenous variation due to observability. Note that s_i acts as a classification device for the observed pattern of each individual. If there is no endogeneity related to observability, g and k_t will not depend on s_i when g is assumed to have the additive form (3.16) in the next section. Identification of β (not β̄) can be trivially achieved under the conditions in Theorem 3.3.1 for P({w_it}_{t: s_it=1} | s_i = s̃_i) with s̃_i such that P(s_i = s̃_i) > 0 and Σ_{t=1}^T s̃_it ≥ 2. In other words, if there exists a positive fraction of observable units with a certain number of multiple periods, and if the support of the regressors is rich enough, the parameter of interest is identified. The estimation procedure will be applied to the auxiliary balanced panel data.
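Constructing the auxiliary balanced panel is mechanical. The following Python sketch (the toy data layout is invented for illustration) builds (s_it y_it, s_it x_it, s_it) from an unbalanced panel and extracts the observability pattern s_i that classifies individuals in (3.14)-(3.15):

```python
# Build auxiliary balanced arrays (s_it*y_it, s_it*x_it, s_it) from an
# unbalanced panel stored as {(i, t): (y, x)}.
T = 3
unbalanced = {(0, 1): (1.0, 2.0), (0, 3): (0.5, 1.0),   # unit 0 misses t = 2
              (1, 1): (2.0, 0.0), (1, 2): (1.5, 1.0), (1, 3): (1.0, 2.0)}

units = sorted({i for (i, t) in unbalanced})
s, sy, sx = {}, {}, {}
for i in units:
    for t in range(1, T + 1):
        obs = unbalanced.get((i, t))
        s[i, t] = 1 if obs is not None else 0   # selection indicator s_it
        sy[i, t] = obs[0] if obs else 0.0       # s_it * y_it
        sx[i, t] = obs[1] if obs else 0.0       # s_it * x_it

# s_i is the observability pattern used by g(., s_i) and k_t(s_i)
pattern = {i: tuple(s[i, t] for t in range(1, T + 1)) for i in units}
assert pattern[0] == (1, 0, 1) and pattern[1] == (1, 1, 1)
```

Each distinct value of `pattern[i]` with positive probability then indexes its own correlated-effect components in the estimation step.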
When g has an additive form, the essential difference in estimation is that each fraction of individuals with a different observability pattern s_i is allowed to have different additive components of g for each x_itk and z_ik.

3.4 Estimation

For estimation, the correlated effect g is approximated by sieve spaces. The approximated g is high-dimensional for three reasons. First, the number of approximating terms is infinite in general. Second, the number of arguments in g is TK1 + K2, which can grow fast in T. Besides the problem that the truncation choice for the sieve approximation can be limited with a large number of arguments, the existence of discrete variables can introduce further nontrivial problems. In particular, if the discrete variables (ẋ_i, ż_i) have rich enough supports, the total number of approximating terms can be comparable to, or larger than, N even after we impose an additive functional form on g and truncate the approximating terms for the additive components of the continuous variables (x̃_i, z̃_i). Such a "too many regressors" problem arises from the fact that it is not obvious how to truncate the approximating terms related to discrete variables in general. Note that N is the maximal number of linearly independent time-invariant regressors. Third, when the panel data is unbalanced, more complex observability patterns result in a larger number of nonparametric nuisance components to be approximated, since we allow different functional forms of g for each group of individuals with a different pattern. The reasons for high-dimensionality mentioned above indicate that standard sieve truncation via information criteria or cross-validation is not always usable and effective. For high-dimensional models, penalized estimation of a sparse model with the Least Absolute Shrinkage and Selection Operator (LASSO; Tibshirani, 1996) and its variants is popular due to its prediction accuracy and computational feasibility.
For high-dimensional quantile regression models, Belloni and Chernozhukov (2011; BC), Wang, Wu and Li (2012; WWL) and Sherwood and Wang (2016; SW) studied the properties of penalized estimators with certain penalty functions. Since the nonconvex penalty functions used in WWL (2012) and SW (2016) have the oracle property under mild conditions, the asymptotic distribution of the resulting penalized estimators can be studied via that of the oracle estimator. This is a big benefit of using a nonconvex penalty function compared to the LASSO, which has the oracle property only under a quite restrictive condition. Another practical benefit is that a relaxed estimation procedure such as "post-LASSO estimation" is not necessary for nonconvex penalized estimators applied to a large sample. In this chapter, two nonconvex penalty functions are considered: the Smoothly Clipped Absolute Deviation (SCAD; Fan and Li, 2001) and the Minimax Concave Penalty (MCP; Zhang, 2010). For details on a general class of penalty functions, see Fan and Lv (2009) and Lv and Fan (2009), for example. Besides overcoming a nontrivial high-dimensionality problem, the penalized estimator is expected to improve over standard truncated sieve estimators under the sparsity assumption (Belloni and Chernozhukov, 2011a). Note that the penalized estimator selects only the relevant sparse terms, while relevant terms can be excluded from the first K elements selected by standard truncated sieve estimators. To make the sparsity assumption more plausible in some cases, a transformation of the approximated correlated effect into a generalized Mundlak form is proposed.

3.4.1 Sieve-approximated Correlated Effect

In this chapter, the specific sieve space in which g lies is not assumed. The theoretical framework is quite flexible and can accommodate various specifications. As long as the true function g can be sparsely approximated by a collection of terms that satisfies the regularity conditions, such a collection of terms can be used.
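For concreteness, the two penalty functions can be written down directly from Fan and Li (2001) and Zhang (2010). The Python sketch below implements them (with their commonly used default shape parameters a = 3.7 for SCAD and a = 3 for MCP; these defaults are illustrative, not choices made in this chapter) and checks the key property that distinguishes them from the LASSO: the penalty is flat for large coefficients, so big signals are not shrunk:

```python
def scad(t, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001); a = 3.7 is their suggested default."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

def mcp(t, lam, a=3.0):
    """Minimax concave penalty of Zhang (2010)."""
    t = abs(t)
    if t <= a * lam:
        return lam * t - t * t / (2 * a)
    return a * lam * lam / 2

lam = 0.5
# both penalties are constant beyond a*lam (no shrinkage of large coefficients),
# unlike the LASSO penalty lam*|t|, which grows without bound
assert scad(10.0, lam) == scad(100.0, lam)
assert mcp(10.0, lam) == mcp(100.0, lam)
# near zero, both behave like the LASSO, which is what induces sparsity
assert abs(scad(0.1, lam) - lam * 0.1) < 1e-12
```

Both functions are continuous at the regime boundaries |t| = λ and |t| = aλ, which can be verified by plugging those points into the adjacent branches.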
Here, we briefly cover one useful example of the sieve approximation of g, with an additive form, smoothness of the additive components of g for continuous covariates, and finiteness of support for discrete covariates. While a smoothness assumption is quite standard, the additive function space can be replaced by a tensor product sieve space in general. We may also consider using multiple sieve spaces together so that basis elements can be mixed (Bunea, Tsybakov and Wegkamp, 2007; Belloni, Chen, Chernozhukov and Hansen, 2012). The additivity requires g to be represented by the sum of univariate functions of each argument,

g(x_i, z_i) = g_0 + Σ_{t=1}^T Σ_{k=1}^{K1} g_tk^x(x_itk) + Σ_{k=1}^{K2} g_k^z(z_ik)    (3.16)

where g_0 ∈ R is a constant. For identification purposes, E[g_tk^x(x_itk)] = E[g_k^z(z_ik)] = 0 for all t, k is typically assumed, but we may instead drop a constant term (if there is any) in the sieve elements for each g_tk^x and g_k^z. Given additivity, a smoothness restriction is imposed on the additive components of g with continuous covariates, (g_tk^x(x̃_itk), g_k^z(z̃_ik)). The Hölder condition of a certain order is the most popular choice (see Chen, 2007). Finite supports are assumed for the components with discrete covariates, (g_tk^x(ẋ_itk), g_k^z(ż_ik)), which implies that the relevant approximating errors will be exactly zero for large enough N. For approximating g_tk^x(x̃_it) and g_k^z(z̃_i), B-spline elements are frequently used; see Schumaker (2007) for details. The following shows how such a basis can be implemented in practice. Given a knot sequence 0 = t_0 < t_1 < ... < t_{J_N} < t_{J_N+1} = 1, the degree-p spline elements are 1, x, ..., x^p, (x − t_1)_+^p, ..., (x − t_{J_N})_+^p, where the range of the continuously distributed x is assumed to be [0, 1] and x_+^p ≡ (max{0, x})^p. Then the approximated g_tk^x(x̃_itk), for example, can be written as

s_tk^x(x̃_itk) = Σ_{q=1}^p γ_tkq x̃_itk^q + Σ_{j=1}^{J_N} γ_tk(p+j) (x̃_itk − t_j)_+^p    (3.17)

where the constant term is removed for identification. In several contexts, nonparametric or semiparametric conditional quantile estimators using B-splines are shown to achieve the optimal convergence rate with J_N ∝ N^{1/(2r+1)} under regularity conditions, where r denotes the order of the Hölder condition. For example, He, Zhu and Fung (2002) showed the result for a univariate semiparametric component in a panel data model with an unspecified dependence structure. If the optimal growth rate is the same for the additive semiparametric model, Assumption 5 in the next subsection is conservatively satisfied when we impose p_N ∝ N^{1/(2r+1)} and r > 1. For the discrete ẋ_itk and ż_ik, we do not rely on a smoothness assumption in general. The corresponding g_tk^x and g_k^z functions can be exactly expressed as linear combinations of indicator functions that take the value 1 on the support elements. Suppose that a discrete random variable ẋ_itk has realized support elements {a_s}_{s=1}^{S_tk,N^x} in the given sample. Then, without loss of generality, the function g_tk^x(ẋ_itk) can be written (or approximated) as

s_tk^x(ẋ_itk) = Σ_{s=1}^{S_tk,N^x} γ_tks 1[ẋ_itk = a_s]    (3.18)

where the indicator functions 1[ẋ_itk = a_s] act as sieve basis elements of the function space in which g_tk^x(ẋ_itk) lies. To meet the identification condition E[g_tk^x(ẋ_itk)] = 0, we may equivalently drop one of the indicator terms. Since ẋ_itk is assumed to have finite support, for some S_tk^x < ∞ we have S_tk,N^x → S_tk^x as N tends to infinity, and the approximation error becomes zero.
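The two basis constructions just described, the truncated-power spline elements of (3.17) for a continuous covariate and the indicator elements of (3.18) for a discrete one, can be sketched in a few lines of Python (illustrative helper names; the constant term is dropped as in the text):

```python
def power_spline_basis(x, p, knots):
    """Degree-p truncated power elements for x in [0, 1]:
    x, ..., x**p, (x - t1)_+**p, ..., (x - tJ)_+**p (constant dropped)."""
    terms = [x ** q for q in range(1, p + 1)]
    terms += [max(0.0, x - t) ** p for t in knots]
    return terms

def indicator_basis(x, support):
    """Indicator elements 1[x == a_s] for a discrete covariate."""
    return [1.0 if x == a else 0.0 for a in support]

b = power_spline_basis(0.6, 3, [0.25, 0.5, 0.75])
assert len(b) == 6 and b[0] == 0.6 and b[5] == 0.0   # (0.6 - 0.75)_+^3 = 0
d = indicator_basis(2, [0, 1, 2, 3])
assert d == [0.0, 0.0, 1.0, 0.0]
```

Each additive component of g contributes one such list of terms; stacking them across (t, k) produces the (potentially very long) vector of approximating regressors discussed next.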
However, since there can be multiple discrete variables with S_tk,N^x (or S_k,N^z) whose total is quite large relative to N, model selection is inevitable in some cases. Note that the approximating terms for the discrete components do not have a natural way to regularize the dimension, as in the case of splines with increasing knots under a smoothness assumption. Unless some additional assumption is employed, all of the terms in (3.18) should be included as regressors in principle. As discussed at the beginning, sparsity of the approximated correlated effect is assumed for feasible inference. Sparsity in our context means that only a small number of approximating terms have true nonzero coefficients. In other words, the correlated effect is regular enough that we need only a small number of variables to describe it well. Recently, the sparsity assumption has been gaining more credibility as the 'bet on sparsity' principle is understood better (Hastie, Tibshirani and Wainwright, 2015). Since the validity of sparsity depends on the choice of basis, a specific basis (or a mixture of them) should be selected carefully. Given a set of approximating terms, a transformation into a generalized Mundlak form is proposed for the time-varying regressor parts, the g_tk^x's. The idea is to take the time averages and deviations given the common approximating terms of g_tk^x for t = 1, ..., T.

Definition 3.4.1 (Generalized Mundlak Form) Suppose the approximated g_tk^x(x_itk) of the correlated effect is s_tk^x(x_itk) = Σ_{s=1}^S γ_tks p_ks(x_itk) for t = 1, ..., T. Then, given t_0, define the transformed p̃_ks(x_itk) as

p̃_ks(x_itk) = (1/T) Σ_{t=1}^T p_ks(x_itk)                    if t = t_0,
p̃_ks(x_itk) = p_ks(x_itk) − (1/T) Σ_{t=1}^T p_ks(x_itk)      if t ≠ t_0.    (3.19)

If one of the approximating terms contains a first-order polynomial, that is, p_ks(x_itk) = x_itk for some s, then both the classical Chamberlain and the Mundlak device are nested in the transformed p̃_ks(x_itk) as special cases of sparse models.
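The transformation (3.19) is a one-line reindexing of each term's time series into its time average (placed at t_0) and deviations from that average. A minimal Python sketch (helper name and toy values are illustrative):

```python
def mundlak_transform(p_vals, t0=0):
    """Transform {p_ks(x_itk)}_{t=1}^T into its time average (at position t0)
    and deviations from the average, as in (3.19)."""
    T = len(p_vals)
    avg = sum(p_vals) / T
    return [avg if t == t0 else p_vals[t] - avg for t in range(T)]

p = [1.0, 2.0, 3.0, 6.0]            # p_ks(x_itk) for t = 1, ..., 4
q = mundlak_transform(p)
assert q[0] == 3.0                   # time average
assert q[1:] == [-1.0, 0.0, 3.0]     # deviations from the average
```

If the true coefficients on p_ks(x_itk) are equal across t, all deviation terms have zero coefficients and only the time-average term survives, which is exactly the sparser Mundlak-type submodel described below.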
Note that the basis elements in (3.18) can also be easily transformed into a form that contains a first-order polynomial. The choice of $t_0$ can be avoided if $p_{t_0 ks}(x_{i t_0 k}) - \frac{1}{T}\sum_{t=1}^{T} p_{tks}(x_{itk})$ is also included in the approximating terms, which will be defined as 'dictionary variables' in Subsection 3.4.2.

The rationale for the transformation (3.19) is the following. In empirical studies, it is often found that estimators based on the classical Chamberlain and Mundlak devices do not differ much, even though the Chamberlain device contains many more terms. If this is because the true coefficients of $x_{itk}$ in the Chamberlain device are the same for all $t$, then the true coefficients of the time-deviation terms, $x_{itk} - \frac{1}{T}\sum_{t=1}^{T} x_{itk}$, in the generalized Mundlak form (3.19) are zero. Similarly, if the true coefficients of $p_{ks}(x_{itk})$ in the approximated correlated effect are the same for all $t$, then the true coefficients of the time deviations, $p_{ks}(x_{itk}) - \frac{1}{T}\sum_{t=1}^{T} p_{ks}(x_{itk})$, are zero, and the generalized Mundlak form has far fewer approximating terms. This indicates that selection over the generalized Mundlak form can yield a sparser submodel of the correlated effects.

3.4.2 Penalized Estimation via Non-convex Penalty Functions

Given the approximating terms for $g$, a nonconvex penalized estimator is proposed along with its asymptotic properties. The convergence rate and asymptotic distribution of the penalized estimator are studied indirectly by deriving those of an estimator based on the true sparse model. In turn, inference is conducted as if the estimated submodel were the true sparse model. This is mainly justified by two facts: (i) the proposed nonconvex penalized estimator is shown to have the oracle property, and (ii) any submodel can be interpreted as an approximation to the true model by construction.
In the following, the asymptotic properties of the true sparse estimator are discussed first, and then the oracle property of the penalized estimator is presented. Two approaches are considered: (i) an exactly sparse model of $g$ and (ii) an approximately sparse model of $g$.

The approximating terms can in general be divided into two groups: a group to be penalized and a group not to be penalized. The two groups of variables are denoted as $\Pi(x_i, z_i) \in \mathbb{R}^{p_N}$ and $\bar{\Pi}(x_i, z_i) \in \mathbb{R}^{\bar{p}_N}$, respectively. Note that the number of unpenalized terms $\bar{p}_N$ is fixed and not allowed to depend on $N$, while the total number of dictionary variables $p_N$ can be very large relative to the sample size $N$ (i.e. $p_N \gg N$), since ultra-high dimensionality is allowed for the proposed estimator. Then $g$ can be written as
\[
g(x_i, z_i) = \bar{\Pi}(x_i, z_i)\bar{\gamma} + \Pi(x_i, z_i)\gamma + r_i \tag{3.20}
\]
where $(\bar{\gamma}, \gamma)$ has conformable dimension and $r_i$ is an approximation error. The terms to be penalized, $\Pi(x_i, z_i)$, will be called 'dictionary variables' from now on. There is no hard guideline about whether a given approximating variable should be penalized or not. However, a constant term, $g_0$ in our setting, is typically not recommended to be penalized and will not be treated as a dictionary variable in this paper. Given the choice of dictionary variables, $w_{it}$ is redefined as
\[
w_{it} = [\, x_{it} \;\; d_1 \;\; \cdots \;\; d_{T-1} \;\; \bar{\Pi}(x_i, z_i) \,] \tag{3.21}
\]
where $w_{it} \in \mathbb{R}^{K_4}$ and $\bar{\Pi}(x_i, z_i)$ is assumed to contain a constant by default. $\beta$ is redefined accordingly. Also, the dictionary variables $\Pi(x_i, z_i)$ should be rescaled to have unit (pooled) sample variance; otherwise, selection will be affected by the scales of the variables.

Under the sparsity assumption, only a small subset of dictionary variables have nonzero true coefficients. The cardinality of the sparse coefficients is allowed to increase as $N$ tends to infinity, and its growth rate is governed by a true sparse model given a sequence of dictionary variables.
With increasing cardinality, the sparse approximation tends to the true function by construction. The framework is similar in spirit to the "approximate sparsity model" proposed in Belloni and Chernozhukov (2011a), Belloni, Chen, Chernozhukov, and Hansen (2012), and Belloni, Chernozhukov, and Hansen (2014). A difference is that the sparse model is assumed to be exact, in the sense that the corresponding approximation error is not explicitly considered. Application of penalized estimation to high-dimensional nonparametric modeling is also discussed by Fan and Li (2001). Note that, in contrast to the theoretical framework of Sherwood and Wang (2016), the parameter of interest $\beta$ is not penalized in this chapter. In turn, there is no pathological case in which a parameter of interest is not selected in a penalized estimate.

The estimator based on the true sparse model is often called the "oracle estimator" in the high-dimensional statistics literature. The corresponding true sparse model will be referred to as the "oracle model" in this chapter. Let $A$ be the index set of sparse coefficients given $\Pi(x_i, z_i)$,
\[
A = A_N = \{\, 1 \le j \le p_N : \gamma_{oj} \neq 0 \,\} \tag{3.22}
\]
and let its cardinality be $q_N = |A|$. By rearranging $\Pi(x_i, z_i)$, we may assume the first $q_N$ elements of $\gamma_o$ are nonzero and the remaining $p_N - q_N$ components are zero, i.e. $\gamma_o = (\gamma_{oA}', 0_{p_N - q_N}')'$ and $\Pi(x_i, z_i) = (\Pi_A(x_i, z_i), \Pi_{A^c}(x_i, z_i))$. Then we can define the oracle estimator as follows.

Definition 3.4.2 (Oracle Estimator)
\[
(\hat{\beta}, \hat{\gamma}_A) = \arg\min_{(\beta, \gamma_A)} \frac{1}{N} \sum_{t=1}^{T} \sum_{i=1}^{N} \rho_\tau\big(y_{it} - w_{it}\beta - \Pi_A(x_i, z_i)\gamma_A\big) \tag{3.23}
\]
where $\rho_\tau(u) = u(\tau - 1[u < 0])$.

Regularity conditions for the oracle estimator and the penalized estimator are given below for the exactly sparse model.
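The check function $\rho_\tau$ and the pooled objective of (3.23) translate directly into code; a minimal sketch (names are illustrative, with $(\beta, \gamma_A)$ stacked into one vector and all $(i,t)$ observations stacked into one design matrix):

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - 1[u < 0])."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0.0))

def pooled_objective(theta, y, W, tau):
    """Pooled check-loss objective of (3.23): y and the stacked regressor
    matrix W contain all (i, t) observations; theta stacks (beta, gamma_A)."""
    resid = y - W @ theta
    return check_loss(resid, tau).mean()
```

Minimizing `pooled_objective` over `theta` given the oracle regressor set reproduces the oracle estimator; in practice the minimization is done by linear-programming-based quantile regression routines.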
In the following, let $F_{it}(\varepsilon) = F(\varepsilon \mid x_i, z_i)$ be the conditional cdf of $\varepsilon_{it}$ given $(x_i, z_i)$, and let $e_{it} \equiv y_{it} - w_{it}\beta - \Pi_A(x_i, z_i)\gamma_A$ be the approximated regression error. The vector of all regressors is written as $\tilde{w}_{it}^A = (w_{it}, \Pi_A(x_i, z_i))$, while its stacked versions are denoted $\tilde{W}_i^A = (\tilde{w}_{i1}^{A\prime}, \ldots, \tilde{w}_{iT}^{A\prime})'$ and $\tilde{W}_A = (\tilde{W}_1^{A\prime}, \ldots, \tilde{W}_N^{A\prime})'$.

Assumption 3 (Regression Error) (i) $\varepsilon_{it}$ has a continuous conditional density function $f_{it}$; (ii) $f_{it}$ is uniformly bounded away from $0$ and $\infty$ in a neighborhood of $0$, $\forall t$; (iii) $f'_{it}$ has a uniform upper bound in a neighborhood of $0$, $\forall t$.

Assumption 4 (Covariates) (i) $\exists M_1 > 0$ such that $|\tilde{w}_{itk}| \le M_1$ $\forall (i,t,k)$; (ii) $\exists C_1 > 0$, $C_2 > 0$ such that, with probability one, $C_1 \le \lambda_{\min}\big(\tfrac{1}{N}\tilde{W}_A'\tilde{W}_A\big) \le \lambda_{\max}\big(\tfrac{1}{N}\tilde{W}_A'\tilde{W}_A\big) \le C_2$.

Assumption 5 (Sparse Model Size) $q_N = O(N^{C_3})$ where $C_3 < \tfrac{1}{3}$.

Assumption 6 (Exact Sparsity) $g(x_i, z_i) = \Pi_A(x_i, z_i)\gamma_A$ for each $N$.

Assumption 3 is a fairly standard regularity condition on the regression error $\varepsilon_{it}$. Note that the within-group dependence of $\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{iT})$ conditional on $(x_i, z_i)$ is allowed to be arbitrary under Assumptions 1 and 3. Assumption 4 imposes boundedness on the regressors and on the eigenvalues of the Gram matrix $N^{-1}\tilde{W}_A'\tilde{W}_A$ in the oracle model. Assumption 5 restricts the growth rate of the oracle model size; it is required for a given dictionary variable sequence to be valid. Assumption 6 is the essential condition that characterizes an exactly sparse $g$: it means that for each sample size, the model with $q_N$ terms exactly describes the correlated effects. Under these conditions, the convergence rate and asymptotic normality results for the oracle estimator of $\theta_A = (\beta', \gamma_A')'$ are shown below.

Theorem 3.4.3 (Convergence Rate of Oracle Estimator) Suppose Assumptions 1–6. Then,
\[
\|\hat{\theta}_A - \theta_A^o\| = O_p\Big(\sqrt{N^{-1} q_N}\Big) \tag{3.24}
\]
Proof. The result follows from Lemmas C.1.1 and C.1.4 in Appendix C.1.2.

Theorem 3.4.4 (Asymptotic Normality of Oracle Estimator) Suppose Assumptions 1–6.
Let $G_N$ be an $l \times q_N$ matrix with $l$ fixed and $G_N G_N' \to G$, a positive definite matrix. Then,
\[
\sqrt{N}\, G_N \Sigma_N^{-1/2} (\hat{\theta}_A - \theta_A^o) \xrightarrow{d} N(0_l, G) \tag{3.25}
\]
where $\psi_\tau(\varepsilon_{it}) = \tau - 1(\varepsilon_{it} < 0)$, $\Psi_\tau(\varepsilon_i) = (\psi_\tau(\varepsilon_{i1}), \ldots, \psi_\tau(\varepsilon_{iT}))'$, $B_N = \mathrm{diag}(\{\{f_{it}(0)\}_t\}_i)$, $K_N = \frac{1}{N}\tilde{W}_A' B_N \tilde{W}_A$, $S_N = \frac{1}{N}\sum_{i=1}^{N} \tilde{W}_i^{A\prime} \Psi_\tau(\varepsilon_i) \Psi_\tau(\varepsilon_i)' \tilde{W}_i^A$, and $\Sigma_N = K_N^{-1} S_N K_N^{-1}$.

Proof. The result follows from Lemmas C.1.1 and C.1.4 in Appendix C.1.2.

In Theorems 3.4.3 and 3.4.4, we do not assume primitive conditions regarding the sieve basis nature of $\Pi_A(x_i, z_i)$. It is only assumed that, given a collection of dictionary variables $\Pi(x_i, z_i)$, true sparse regressors $\Pi_A(x_i, z_i)$ exist and satisfy the stated assumptions, especially the exact sparse model condition. With additional assumptions on $\Pi_A(x_i, z_i)$, it is possible to derive a sparse version of the standard partially linear semiparametric model results accounting for a nonzero approximation error $r_i$.

The additional assumptions on $\Pi_A(x_i, z_i)$ require some definitions and notation. First, let $\mathcal{G}$ be a function space to which $g$ belongs. For example, if $g$ is assumed to be additive, and if $(x_i, z_i)$ contains continuous variables only, then an additive function space, $\mathcal{H}_1 + \cdots + \mathcal{H}_{K_4}$, where $\mathcal{H}$ is the Hölder space of a certain degree, is a popular choice.
Define $\bar{\mathcal{G}}$ to be the subspace of $\mathcal{G}$ whose elements can be expressed by active elements of the dictionary variable sequence $\{\Pi_A(x_i, z_i)\}_{N=1}^{\infty}$. Then we can consider a weighted projection of each regressor in $w_{it}$ onto $\bar{\mathcal{G}}$:
\[
h_k \equiv \arg\inf_{h \in \bar{\mathcal{G}}} \sum_{i=1}^{N} \sum_{t=1}^{T} E\Big[ f_{it}(0) \big( w_{itk} - h(x_i, z_i) \big)^2 \Big] \tag{3.26}
\]
where $w_{itk}$ is the $k$th element of $w_{it}$. The corresponding population residual is written as $\eta_{itk} = w_{itk} - h_k$. Stacked versions of $h_k$ and $\eta_{itk}$ are denoted as follows: $h_i = (h_1(x_i, z_i), \ldots, h_{K_4}(x_i, z_i))$, $1_T = (1, \ldots, 1)' \in \mathbb{R}^T$, $H = (h_1', \ldots, h_N')' \otimes 1_T \in \mathcal{M}_{NT \times K_4}$, $\eta_{it} = (\eta_{it1}, \ldots, \eta_{itK_4})$, $\eta_i = (\eta_{i1}', \ldots, \eta_{iT}')'$, and $\eta = (\eta_1', \ldots, \eta_N')'$, so that $W = H + \eta$ where $W = (w_{11}', w_{12}', \ldots, w_{NT}')'$. Then it is easy to check that $\hat{h}_k(x_i, z_i) = \Pi_A(x_i, z_i)\hat{\varphi}_k$ where $W_k = (w_{11k}, \ldots, w_{NTk})'$, $\Pi_A = (\Pi_A(x_1, z_1)', \ldots, \Pi_A(x_N, z_N)')' \otimes 1_T$, and $\hat{\varphi}_k = (\Pi_A' B_N \Pi_A)^{-1} \Pi_A' B_N W_k$. Additional conditions on $\Pi_A(x_i, z_i)$ are given below.

Assumption 7 (Covariates) $\exists M_2 > 0$ such that $E[\eta_{itk}^4] \le M_2$ $\forall (i,t,k)$.

Assumption 8 (Approximate Sparse Correlated Effects) (i) $\sup_i |r_i| = O(N^{-1/2} q_N^{1/2})$; (ii) $N^{-1} \sum_{i=1}^{N} \big[\hat{h}_k(x_i, z_i) - h_k(x_i, z_i)\big]^2 = o_p(1)$ $\forall k$.

Assumption 7 restricts the population residual of $w_{itk}$, with $h_k$ projected out, to have a finite fourth-order moment. Assumption 8 is the essential condition that characterizes $\{\Pi(x_i, z_i)\}$ as sieve basis elements attaining a well-behaved sparse submodel. Part (i) assumes that the order of the approximation error is uniformly dominated by $1/\sqrt{N}$. Part (ii) is a high-level assumption stating that the sample analogue estimator of $h_k$ converges to the true function with respect to the empirical $L_2$-norm. Since the convergence rate is not restricted and $f_{it}(0)$ is uniformly bounded, this is a fairly standard property of sieve basis elements.
For example, when $q_N$ diverges, it can be shown that the uniform approximation property of $\Pi_A(x_i, z_i)$ (Newey, 1997),
\[
\sup_{x_i, z_i} \big| h_k(x_i, z_i) - \Pi_A(x_i, z_i)\varphi_{o,k} \big| = O(q_N^{-\alpha}) \tag{3.27}
\]
for some $\alpha > 0$, implies Assumption 8 (ii) under Assumptions 1–5.

With Assumptions 7 and 8 additionally imposed, the theorems below show a sparse version of the standard partially linear semiparametric model asymptotics. Denote $\hat{g}(x_i, z_i) = \Pi_A(x_i, z_i)\hat{\gamma}_A$.

Theorem 3.4.5 (Convergence Rate of Oracle Estimator) Suppose Assumptions 1–5, 7 and 8. Then,
\[
\|\hat{\beta} - \beta_o\| = O_p(N^{-\frac{1}{2}}) \tag{3.28}
\]
\[
N^{-1} \sum_{i=1}^{N} \big[ \hat{g}(x_i, z_i) - g_o(x_i, z_i) \big]^2 = O_p\big(N^{-1} q_N\big) \tag{3.29}
\]
Proof. See Appendix C.1.3.

Theorem 3.4.6 (Asymptotic Normality of Oracle Estimator) Suppose Assumptions 1–5, 7 and 8. Then,
\[
\sqrt{N}\, \Sigma_N^{-1/2} (\hat{\beta} - \beta_o) \xrightarrow{d} N(0, I) \tag{3.30}
\]
where $\Sigma_N = K_N^{-1} S_N K_N^{-1}$ with $K_N = N^{-1} \eta' B_N \eta$ and $S_N = N^{-1} \sum_{i=1}^{N} \eta_i' \Psi_\tau(\varepsilon_i) \Psi_\tau(\varepsilon_i)' \eta_i$.

Proof. The result follows from Lemmas C.1.9 and C.1.11 in Appendix C.1.3.

The convergence rate of $\hat{\beta}$ is the parametric rate, as in typical partially linear semiparametric models. On the other hand, $\hat{g}(x_i, z_i)$ has a convergence rate of $N^{-1} q_N$, which depends on the sparsity of the true submodel given the dictionary variable sequence. One obvious implication is that the performance of the oracle estimator will depend on the choice of the dictionary variable sequence. Theorems 3.4.4 and 3.4.6 each provide an asymptotic distribution result that can be used to approximate the distribution of the oracle estimator in a finite sample. Note that the sample analogue estimators of $N^{-1}\Sigma_N$ in the two theorems coincide for approximating $\hat{V}(\hat{\beta})$. The computation of the variance estimators is presented in Subsubsection 3.4.2.2.

Since the true nonzero coefficients are unknown, the sparse set is estimated by penalizing the coefficients of all dictionary variables in the sample optimization problem.
In multiple contexts, penalized estimators using nonconvex penalty functions such as SCAD (Fan and Li, 2001),
\[
p_\lambda(|\theta|) =
\begin{cases}
\lambda |\theta| & 0 \le |\theta| \le \lambda \\[4pt]
\dfrac{a\lambda|\theta| - (\theta^2 + \lambda^2)/2}{a - 1} & \lambda \le |\theta| \le a\lambda \\[4pt]
\dfrac{(a+1)\lambda^2}{2} & a\lambda \le |\theta|
\end{cases}
\qquad \text{for some } a > 2, \tag{3.31}
\]
and MCP (Zhang, 2010),
\[
p_\lambda(|\theta|) =
\begin{cases}
\lambda|\theta| - \dfrac{\theta^2}{2a} & 0 \le |\theta| \le a\lambda \\[4pt]
\dfrac{a\lambda^2}{2} & a\lambda \le |\theta|
\end{cases}
\qquad \text{for some } a > 1, \tag{3.32}
\]
are shown to yield the oracle estimator among the local minima of a penalized objective function with probability tending to one. This feature is called "a (weak) oracle property" of the nonconvex penalized estimator. To present the oracle property for the current model setting, the penalized estimator is defined as follows.

Definition 3.4.7 (Penalized Estimator)
\[
(\hat{\beta}_\lambda, \hat{\gamma}_\lambda) = \arg\min_{(\beta, \gamma)} \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_\tau\big(y_{it} - w_{it}\beta - \Pi(x_i, z_i)\gamma\big) + \sum_{j=1}^{p_N} p_\lambda(|\gamma_j|) \tag{3.33}
\]
where $p_\lambda(\cdot)$ is either a SCAD or MCP penalty function.

The oracle property is shown with one additional condition on the true sparse coefficients. Assumption 9 below is often called a 'beta-min' condition; it basically assumes that the minimum magnitude of the nonzero coefficients in the oracle model is sufficiently large. In our context, the lower bound on the coefficient magnitude can be understood as a truncation cut-off for the approximating surrogate function $\Pi(x_i, z_i)\gamma$ given a sequence of dictionary variables.

Assumption 9 (Nonzero Coefficients) There exist positive constants $C_4$ and $C_5$ such that $C_3 < C_4 \le 1$ and $N^{(1-C_4)/2} \min_{1 \le j \le q_N} |\gamma_{oj}| \ge C_5$.

Theorem 3.4.8 (Oracle Property of Penalized Estimator) Suppose Assumptions 1–5 and 9, together with Assumption 6 or with Assumptions 7 and 8. If $\lambda = o\big(N^{-(1-C_4)/2}\big)$, $N^{-1/2} q_N^{1/2} = o(\lambda)$, and $\log(p_N) = o(N\lambda^2)$, then
\[
\lim_{N \to \infty} P\big( (\hat{\beta}_\lambda, \hat{\gamma}_\lambda) \in \mathcal{E}_N(\lambda) \big) = 1 \tag{3.34}
\]
where $\mathcal{E}_N(\lambda)$ is the set of local minima of the objective function in (3.33).

Proof. See Appendix.

The rate conditions on $q_N$ and $\lambda$ in Theorem 3.4.8 are weaker than those given in Sherwood and Wang (2016). In turn, the necessary requirement on $C_4$ in the beta-min condition is also weaker.
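The three-branch SCAD penalty of (3.31) is easy to get wrong at the branch boundaries; the sketch below implements it elementwise (the default $a = 3.7$ is the value suggested by Fan and Li (2001); the function name is mine):

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of (3.31), applied elementwise to theta.
    Linear up to lam, quadratic taper up to a*lam, constant beyond,
    so large coefficients are not shrunk (near-unbiasedness)."""
    t = np.abs(np.asarray(theta, dtype=float))
    small = lam * t
    mid = (a * lam * t - (t**2 + lam**2) / 2.0) / (a - 1.0)
    large = (a + 1.0) * lam**2 / 2.0
    return np.where(t <= lam, small, np.where(t <= a * lam, mid, large))
```

A quick continuity check at the knots confirms the formula: at $|\theta| = \lambda$ both branches give $\lambda^2$, and at $|\theta| = a\lambda$ both give $(a+1)\lambda^2/2$, which is why the penalty is flat beyond $a\lambda$.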
3.4.2.1 Choice of Thresholding Parameter

Lee, Noh and Park (2014; LNP) recently proposed a modified Bayesian information criterion for linear quantile regression with cross-sectional data when the dimension of the dictionary variables diverges and the dimension of the true model is constant. Sherwood and Wang (2016) adapt LNP's criterion to the case where the dimension of the true model may diverge. Its pooled-information version for panel data in the current setting can be considered as follows:
\[
QBIC_L(\lambda) = 2TN \log\left( \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_\tau\big(e_{it}(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)\big) \right) + |S_\lambda|\, C_N \log(TN) \tag{3.35}
\]
where $e_{it}(\beta, \gamma) = y_{it} - w_{it}\beta - \Pi(x_i, z_i)\gamma$, $|S_\lambda|$ is the degrees of freedom of the fitted model, and $C_N$ is chosen as $\log(p_N)$ in Sherwood and Wang (2016). Note that the goodness-of-fit measure in (3.35) is derived from the quasi-likelihood of the asymmetric Laplace distribution with scaling parameter $\sigma$ (see LNP for details). When $\sigma = 1$ is imposed, the resulting measure coincides with the conventional check loss function without the logarithm and $TN$-scaling. Then an alternative form of high-dimensional BIC can be written as
\[
BIC_L(\lambda) = 2 \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_\tau\big(e_{it}(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)\big) + |S_\lambda|\, C_N \log(TN). \tag{3.36}
\]
To take into account the clustered information of panel data, it is useful to think of clustering as a kind of misspecification problem in quasi-likelihood. From this perspective, the generalized BIC (GBIC) and GBIC$_p$ studied by Lv and Liu (2014; LL) can be considered. They explicitly incorporate model misspecification using a second-order term in the asymptotic expansion of the Bayesian principle under generalized linear model settings. The final result is claimed to be general enough to apply in other contexts. Adding the second-order term of GBIC studied by LL to (3.35), we have
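The two criteria in (3.35) and (3.36) differ only in the goodness-of-fit term; a minimal sketch of both (function names and the argument layout are mine):

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - 1[u < 0])."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0.0))

def bic_l(resid, n_selected, p_N, T, N, tau):
    """High-dimensional BIC of (3.36): twice the pooled check loss plus a
    model-size penalty with C_N = log(p_N), following Sherwood and Wang
    (2016). resid holds the T*N fitted residuals e_it."""
    fit = 2.0 * check_loss(resid, tau).sum()
    return fit + n_selected * np.log(p_N) * np.log(T * N)

def qbic_l(resid, n_selected, p_N, T, N, tau):
    """Quasi-likelihood version of (3.35): log of the summed check loss,
    scaled by 2*T*N, plus the same penalty term."""
    fit = 2.0 * T * N * np.log(check_loss(resid, tau).sum())
    return fit + n_selected * np.log(p_N) * np.log(T * N)
```

In practice one evaluates the criterion for each candidate threshold parameter $\lambda$ on a grid and keeps the minimizer.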
\[
GQBIC_L(\lambda) = 2TN \log\left( \sum_{i=1}^{N} \sum_{t=1}^{T} \rho_\tau\big(e_{it}(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)\big) \right) + |S_\lambda|\, C_N \log(TN) - \log \det H_{\lambda,N} \tag{3.37}
\]
where $H_{\lambda,N} = \hat{K}_{\lambda,N}^{-1} \hat{S}_{\lambda,N}$ is a covariance contrast matrix evaluated at $(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)$. Note that the second-order term can be negative. The same modification can be applied to $BIC_L$. For further details about GBIC and GBIC$_p$, see LL (2014). Note that if the correction term $\log \det H_{\lambda,N}$ is asymptotically bounded, then the first two terms of the information criterion will be dominant as $N$ tends to infinity. Thus, if GBIC is indeed a valid criterion for selection consistency, then regular BICs without the correction term must be valid as well in such cases.

3.4.2.2 Computation of Variance Estimators

The sample analogue estimators for the sandwich forms $K_N^{-1} S_N K_N^{-1}$ in Theorems 3.4.4 and 3.4.6 are computed using the set of selected variables, $\hat{A}(\lambda)$, given the penalized estimator $(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)$. This is mainly justified by the facts that (i) the penalized estimator has the oracle property, and (ii) any submodel constitutes an approximation of the true model. Here, the estimators are constructed following the cluster-robust variance estimator proposed by Wooldridge (2010). Let $M^+$ denote the Moore–Penrose generalized inverse of $M$. First, the residual $\hat{e}_{it}$ is computed by plugging the estimates $(\hat{\beta}_\lambda, \hat{\gamma}_\lambda)$ into the formula:
\[
\hat{e}_{it} = y_{it} - w_{it}\hat{\beta}_\lambda - \Pi_{\hat{A}(\lambda)}(x_i, z_i)\hat{\gamma}_\lambda. \tag{3.38}
\]
The $\Pi_{\hat{A}}$-projected-out regressor $\eta_{it}$ can be estimated as, for a sequence $h_N$ tending to $0$,
\[
\hat{\eta}_{it} = w_{it} - \Pi_{\hat{A},i} \left( \sum_{i=1}^{N} \sum_{t=1}^{T} 1[|\hat{e}_{it}| \le h_N]\, \Pi_{\hat{A},i}' \Pi_{\hat{A},i} \right)^{+} \left( \sum_{i=1}^{N} \sum_{t=1}^{T} 1[|\hat{e}_{it}| \le h_N]\, \Pi_{\hat{A},i}' w_{it} \right) \tag{3.39}
\]
where the conditional density $f_{it}(0)$ is approximated via a uniform kernel.
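The cluster-robust sandwich computation just described can be sketched in a few lines of numpy (the function name and array layout are mine; a uniform kernel approximates $f_{it}(0)$ and a Moore–Penrose inverse replaces the regular inverse, as in the text):

```python
import numpy as np

def sandwich_variance(eta_hat, e_hat, tau, h_N):
    """Cluster-robust sandwich variance estimate, a sketch of the
    construction described above. eta_hat has shape (N, T, K): the
    projected-out regressors; e_hat has shape (N, T): fitted residuals.
    f_it(0) is approximated via a uniform kernel of bandwidth h_N, and
    np.linalg.pinv guards against linearly dependent selected terms."""
    N, T, K = eta_hat.shape
    w = (np.abs(e_hat) <= h_N).astype(float)          # uniform kernel weights
    # K_hat = (1/(2 N h_N)) sum_{i,t} 1[|e_it| <= h_N] eta_it' eta_it
    K_hat = np.einsum('it,itj,itk->jk', w, eta_hat, eta_hat) / (2.0 * N * h_N)
    # S_hat = (1/N) sum_i sum_{t,t'} psi(e_it) psi(e_it') eta_it' eta_it'
    psi = tau - (e_hat < 0.0)
    score = np.einsum('it,itk->ik', psi, eta_hat)     # cluster-level scores
    S_hat = score.T @ score / N
    K_inv = np.linalg.pinv(K_hat)
    return K_inv @ S_hat @ K_inv
```

The double sum over $(t, t')$ within each cluster is what makes the estimator robust to arbitrary within-group dependence; collapsing the scores to the cluster level before taking the outer product computes it implicitly.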
Then the sample analogue estimator of $\Sigma_N$ can be written as $\hat{\Sigma}_N = \hat{K}_N^{+} \hat{S}_N \hat{K}_N^{+}$ with
\[
\hat{K}_N = \frac{1}{2N h_N} \sum_{i=1}^{N} \sum_{t=1}^{T} 1[|\hat{e}_{it}| \le h_N]\, \hat{\eta}_{it}' \hat{\eta}_{it} \tag{3.40}
\]
\[
\hat{S}_N = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{t'=1}^{T} \psi_\tau(\hat{e}_{it}) \psi_\tau(\hat{e}_{it'})\, \hat{\eta}_{it}' \hat{\eta}_{it'}. \tag{3.41}
\]
The generalized inverse is used instead of the regular inverse since $\hat{A}$ may contain a set of linearly dependent variables given the sample size $N$ and threshold parameter $\lambda$. Note that the Moore–Penrose inverse coincides with the ordinary inverse whenever the matrix is invertible. The corresponding estimator for the full regressor vector can be computed similarly by replacing $\hat{\eta}_{it}$ with $\tilde{w}_{it}$ in (3.40) and (3.41). It can be shown that the estimated variances of $\hat{\beta}$ based on the two versions are numerically equivalent if $\hat{K}_N$ is invertible. For the choice of the sequence $h_N$, see Parente and Santos Silva (2010), for example.

3.5 Monte Carlo Simulation

A set of Monte Carlo simulations is conducted to study the selection performance and estimator performance in simple location shift and location-scale shift models. With a 3-period panel structure, 5 specifications are considered:
\begin{align*}
\text{DGP 1:}\;\; & y_{it} = x_{it1} + x_{it2} + x_{i11} + x_{i12} + x_{i13} + x_{i21} + x_{i22} + x_{i23} + u_{it} \\
\text{DGP 2:}\;\; & y_{it} = (8 + x_{it1} + x_{it2} + x_{i11} + x_{i12} + x_{i13} + x_{i21} + x_{i22} + x_{i23})(u_{it} + 1) \\
\text{DGP 3:}\;\; & y_{it} = 14 + x_{it1} + x_{it2} + \sum_{k \in K} (x_{i11}^k + x_{i12}^k + x_{i13}^k + x_{i21}^k + x_{i22}^k + x_{i23}^k) + u_{it} \\
\text{DGP 4:}\;\; & y_{it} = \Big\{14 + x_{it1} + x_{it2} + \sum_{k \in K} (x_{i11}^k + x_{i12}^k + x_{i13}^k + x_{i21}^k + x_{i22}^k + x_{i23}^k)\Big\}(u_{it} + 1) \\
\text{DGP 5:}\;\; & y_{it} = \Big\{14 + x_{it1} + x_{it2} + \sum_{k \in K} (x_{i11}^k + 2x_{i12}^k + x_{i21}^k + 2x_{i22}^k)\Big\}(u_{it} + 1)
\end{align*}
where $T = 3$; $N = 300$ or $1000$; $x_{\cdot t1} \sim U(-1,1)$; $x_{\cdot t2} \sim U(-1,1)$; $u_{it} \sim U(0,1)$; and $K = \{1, 2, 7\}$. DGPs 1 and 3 are location shift models, and DGPs 2, 4 and 5 are location-scale shift models. Note that the location-scale shift models have heteroskedastic regression error terms.
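For concreteness, the first two designs can be simulated directly; the sketch below (function name and layout are illustrative, and only DGPs 1 and 2 are covered) generates the panel with the correlated effect entering through all leads and lags:

```python
import numpy as np

def simulate_dgp(dgp, N=300, T=3, rng=None):
    """Simulate the location shift DGP 1 or the location-scale shift
    DGP 2. x[:, t, j] is regressor j in period t; the correlated effect
    is the sum of all x_i11, ..., x_i23, so every lead and lag of the
    regressors enters each period's outcome."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(-1.0, 1.0, size=(N, T, 2))
    u = rng.uniform(0.0, 1.0, size=(N, T))
    ce = x.sum(axis=(1, 2))                           # x_i11 + ... + x_i23
    base = x[:, :, 0] + x[:, :, 1] + ce[:, None]
    if dgp == 1:
        y = base + u                                   # location shift
    elif dgp == 2:
        y = (8.0 + base) * (u + 1.0)                   # location-scale shift
    else:
        raise ValueError("only DGPs 1 and 2 are sketched here")
    return y, x, u
```

Because the correlated effect depends on regressors from every period, pooled quantile regression that ignores it is biased, which is what the CE terms are meant to absorb in the simulations.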
DGPs 1 and 2 impose a Chamberlain specification, while DGPs 3, 4 and 5 introduce nonlinearity or different coefficients on the correlated effect terms across time periods. Note that the rescaled true parameters for higher-order polynomial terms are smaller, since all dictionary variables are rescaled to have unit sample variance in estimation. For example, in DGP 3 the normalized 1st- and 7th-order polynomial terms $x_{i11}$ and $x_{i11}^7$ have true parameter values of about .58 and .26, respectively. In turn, higher-order terms are harder to select as relevant variables at the given sample sizes. The number of simulated draws is 1000.

Tables 3.1–3.2 and Tables A.1–A.11 in Appendix C.2 contain the results. Along with QBIC$_L$ and BIC$_L$, their non-high-dimensional versions, QBIC and BIC, are also considered. AIC1 and AIC2 are AIC counterparts of QBIC and BIC that use different goodness-of-fit measures. "$p_N$" denotes the number of dictionary variables, and "$q_o$" denotes the number of terms in the true models. "TV" and "FV" are the average numbers of true and false coefficients selected, respectively. "True" denotes the true-model hit rate.
Table 3.1 Selection Performance, DGP 1 and 2

DGP 1 (columns per quantile: TV, FV, True)

Method  pN   qo  N     IC       τ=0.1               τ=0.5               τ=0.9
gMund   136  2   300   QBIC     2.00  0.13  0.89    2.00  0.17  0.88    2.00  0.17  0.87
gMund   136  2   300   QBICL    2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   300   BIC      2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   300   BICL     2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   300   AIC1     2.00  2.18  0.35    2.00  3.78  0.29    2.00  4.05  0.23
gMund   136  2   300   AIC2     2.00  0.00  1.00    2.00  0.01  0.99    2.00  0.00  1.00
gMund   136  2   1000  QBIC     2.00  0.08  0.94    2.00  0.10  0.93    2.00  0.11  0.92
gMund   136  2   1000  QBICL    2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   1000  BIC      2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   1000  BICL     2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   1000  AIC1     2.00  2.57  0.39    2.00  7.22  0.23    2.00  6.34  0.24
gMund   136  2   1000  AIC2     2.00  0.00  1.00    2.00  0.01  0.99    2.00  0.00  1.00
gCham   102  6   300   QBIC     6.00  0.18  0.87    6.00  0.19  0.85    6.00  0.22  0.85
gCham   102  6   300   QBICL    6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   300   BIC      6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   300   BICL     6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   300   AIC1     6.00  5.71  0.24    6.00  3.77  0.27    6.00  4.36  0.27
gCham   102  6   300   AIC2     6.00  0.00  1.00    6.00  0.01  0.99    6.00  0.00  1.00
gCham   102  6   1000  QBIC     6.00  0.12  0.92    6.00  0.13  0.91    6.00  0.16  0.90
gCham   102  6   1000  QBICL    6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   1000  BIC      6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   1000  BICL     6.00  0.00  1.00    6.00  0.00  1.00    6.00  0.00  1.00
gCham   102  6   1000  AIC1     6.00  7.49  0.23    6.00  7.42  0.20    6.00  7.87  0.21
gCham   102  6   1000  AIC2     6.00  0.00  1.00    6.00  0.01  0.99    6.00  0.00  1.00

DGP 2 (columns per quantile: TV, FV, True)

Method  pN   qo  N     IC       τ=0.1               τ=0.5               τ=0.9
gMund   136  2   300   QBIC     2.00  0.85  0.55    2.00  1.16  0.40    2.00  1.16  0.44
gMund   136  2   300   QBICL    2.00  0.00  1.00    1.98  0.02  0.98    2.00  0.01  1.00
gMund   136  2   300   BIC      2.00  0.02  0.98    2.00  1.15  0.41    2.00  0.02  0.98
gMund   136  2   300   BICL     2.00  0.00  1.00    1.98  0.02  0.98    2.00  0.01  1.00
gMund   136  2   300   AIC1     2.00  6.63  0.03    2.00  6.75  0.00    2.00  7.91  0.00
gMund   136  2   300   AIC2     2.00  1.42  0.39    2.00  6.65  0.00    2.00  1.74  0.30
gMund   136  2   1000  QBIC     2.00  0.72  0.59    2.00  0.86  0.51    2.00  0.80  0.54
gMund   136  2   1000  QBICL    2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   1000  BIC      2.00  0.01  0.99    2.00  0.85  0.51    2.00  0.01  0.99
gMund   136  2   1000  BICL     2.00  0.00  1.00    2.00  0.00  1.00    2.00  0.00  1.00
gMund   136  2   1000  AIC1     2.00  7.59  0.00    2.00  8.03  0.01    2.00  8.90  0.00
gMund   136  2   1000  AIC2     2.00  1.68  0.31    2.00  7.98  0.01    2.00  1.89  0.26
gCham   102  6   300   QBIC     5.85  0.80  0.55    5.61  1.01  0.37    6.00  0.82  0.54
gCham   102  6   300   QBICL    5.74  0.27  0.77    5.46  0.49  0.57    5.94  0.14  0.87
gCham   102  6   300   BIC      5.75  0.28  0.77    5.61  0.98  0.38    5.98  0.15  0.87
gCham   102  6   300   BICL     0.86  0.11  0.06    5.46  0.49  0.57    5.86  0.14  0.87
gCham   102  6   300   AIC1     5.96  7.33  0.02    5.95  6.24  0.01    6.00  6.40  0.01
gCham   102  6   300   AIC2     5.88  1.24  0.44    5.95  6.07  0.01    6.00  1.31  0.40
gCham   102  6   1000  QBIC     6.00  0.46  0.71    6.00  0.48  0.68    6.00  0.51  0.70
gCham   102  6   1000  QBICL    5.99  0.01  0.98    5.96  0.04  0.96    6.00  0.01  0.99
gCham   102  6   1000  BIC      6.00  0.02  0.98    6.00  0.48  0.69    6.00  0.02  0.98
gCham   102  6   1000  BICL     5.99  0.01  0.98    5.96  0.04  0.96    6.00  0.01  0.99
gCham   102  6   1000  AIC1     6.00  8.95  0.01    6.00  7.59  0.01    6.00  8.35  0.01
gCham   102  6   1000  AIC2     6.00  1.31  0.44    6.00  7.55  0.01    6.00  1.36  0.42

Table 3.2 Estimator performance, DGP 1 and 2, β₁

DGP 1 (columns per quantile: Bias, SD, RMSE)

Method  N     IC       τ=0.1                        τ=0.5                        τ=0.9
gMund   300   QBIC     0.0000   0.0197  0.0197      0.0009  0.0345  0.0345     -0.0000  0.0215  0.0215
gMund   300   QBICL   -0.0001   0.0197  0.0197      0.0008  0.0345  0.0345     -0.0000  0.0215  0.0215
gMund   300   BIC     -0.0001   0.0197  0.0197      0.0008  0.0345  0.0345     -0.0000  0.0215  0.0215
gMund   300   BICL    -0.0001   0.0197  0.0197      0.0008  0.0345  0.0345     -0.0000  0.0215  0.0215
gMund   300   AIC1    -0.0001   0.0201  0.0200      0.0021  0.0346  0.0346     -0.0002  0.0215  0.0215
gMund   300   AIC2    -0.0001   0.0197  0.0197      0.0008  0.0345  0.0345     -0.0000  0.0215  0.0215
gMund   1000  QBIC    -0.0002   0.0119  0.0119      0.0013  0.0191  0.0192     -0.0001  0.0114  0.0114
gMund   1000  QBICL   -0.0001   0.0119  0.0119      0.0013  0.0191  0.0191     -0.0001  0.0114  0.0114
gMund   1000  BIC     -0.0001   0.0119  0.0119      0.0013  0.0191  0.0191     -0.0001  0.0114  0.0114
gMund   1000  BICL    -0.0001   0.0119  0.0119      0.0013  0.0191  0.0191     -0.0001  0.0114  0.0114
gMund   1000  AIC1    -0.0001   0.0118  0.0118      0.0015  0.0194  0.0195     -0.0001  0.0114  0.0114
gMund   1000  AIC2    -0.0001   0.0119  0.0119      0.0013  0.0191  0.0191     -0.0001  0.0114  0.0114
gCham   300   QBIC     0.0007   0.0211  0.0211      0.0042  0.0358  0.0360     -0.0005  0.0212  0.0212
gCham   300   QBICL    0.0005   0.0212  0.0212      0.0040  0.0358  0.0360     -0.0006  0.0208  0.0208
gCham   300   BIC      0.0005   0.0212  0.0212      0.0040  0.0358  0.0360     -0.0006  0.0208  0.0208
gCham   300   BICL     0.0005   0.0212  0.0212      0.0040  0.0358  0.0360     -0.0006  0.0208  0.0208
gCham   300   AIC1     0.0014   0.0208  0.0209      0.0058  0.0355  0.0359     -0.0002  0.0212  0.0212
gCham   300   AIC2     0.0005   0.0212  0.0212      0.0039  0.0359  0.0361     -0.0006  0.0208  0.0208
gCham   1000  QBIC     0.0004   0.0116  0.0116     -0.0004  0.0186  0.0186      0.0002  0.0117  0.0117
gCham   1000  QBICL    0.0003   0.0115  0.0115     -0.0005  0.0186  0.0186      0.0002  0.0117  0.0117
gCham   1000  BIC      0.0003   0.0115  0.0115     -0.0005  0.0186  0.0186      0.0002  0.0117  0.0117
gCham   1000  BICL     0.0003   0.0115  0.0115     -0.0005  0.0186  0.0186      0.0002  0.0117  0.0117
gCham   1000  AIC1     0.0004   0.0115  0.0115     -0.0001  0.0186  0.0186      0.0005  0.0117  0.0117
gCham   1000  AIC2     0.0003   0.0115  0.0115     -0.0005  0.0186  0.0186      0.0002  0.0117  0.0117

DGP 2 (columns per quantile: Bias, SD, RMSE)

Method  N     IC       τ=0.1                        τ=0.5                        τ=0.9
gMund   300   QBIC     0.0073   0.1551  0.1552      0.0048  0.2746  0.2745      0.0028  0.1648  0.1648
gMund   300   QBICL    0.0066   0.1565  0.1565      0.0064  0.2778  0.2777      0.0013  0.1647  0.1646
gMund   300   BIC      0.0066   0.1562  0.1562      0.0049  0.2747  0.2746      0.0017  0.1651  0.1650
gMund   300   BICL     0.0066   0.1565  0.1565      0.0064  0.2778  0.2777      0.0013  0.1647  0.1646
gMund   300   AIC1     0.0160   0.1559  0.1567     -0.0009  0.2657  0.2656     -0.0033  0.1612  0.1612
gMund   300   AIC2     0.0073   0.1551  0.1552     -0.0007  0.2658  0.2657      0.0028  0.1650  0.1649
gMund   1000  QBIC     0.0066   0.0877  0.0879     -0.0051  0.1479  0.1479      0.0063  0.0854  0.0856
gMund   1000  QBICL    0.0056   0.0877  0.0878     -0.0053  0.1467  0.1467      0.0048  0.0858  0.0859
gMund   1000  BIC      0.0056   0.0878  0.0879     -0.0053  0.1478  0.1478      0.0048  0.0858  0.0859
gMund   1000  BICL     0.0056   0.0877  0.0878     -0.0053  0.1467  0.1467      0.0048  0.0858  0.0859
gMund   1000  AIC1     0.0103   0.0893  0.0898     -0.0026  0.1479  0.1478      0.0036  0.0865  0.0865
gMund   1000  AIC2     0.0071   0.0881  0.0883     -0.0028  0.1480  0.1479      0.0054  0.0851  0.0852
gCham   300   QBIC     0.0105   0.1679  0.1682      0.0367  0.2562  0.2587     -0.0015  0.1625  0.1624
gCham   300   QBICL    0.0148   0.1668  0.1674      0.0503  0.2738  0.2783      0.0027  0.1627  0.1627
gCham   300   BIC      0.0141   0.1669  0.1674      0.0367  0.2565  0.2589      0.0003  0.1626  0.1625
gCham   300   BICL     1.0494   0.4584  1.1450      0.0508  0.2745  0.2790      0.0077  0.1622  0.1623
gCham   300   AIC1     0.0135   0.1640  0.1645      0.0289  0.2590  0.2605     -0.0070  0.1570  0.1571
gCham   300   AIC2     0.0101   0.1660  0.1663      0.0284  0.2591  0.2605     -0.0022  0.1611  0.1610
gCham   1000  QBIC     0.0033   0.0882  0.0883     -0.0004  0.1477  0.1476     -0.0006  0.0880  0.0880
gCham   1000  QBICL    0.0040   0.0879  0.0879      0.0017  0.1490  0.1489     -0.0002  0.0891  0.0891
gCham   1000  BIC      0.0038   0.0878  0.0878     -0.0004  0.1482  0.1481     -0.0002  0.0891  0.0891
gCham   1000  BICL     0.0042   0.0879  0.0880      0.0017  0.1490  0.1489     -0.0002  0.0891  0.0891
gCham   1000  AIC1     0.0047   0.0870  0.0871     -0.0008  0.1487  0.1486     -0.0035  0.0880  0.0880
gCham   1000  AIC2     0.0039   0.0875  0.0875     -0.0006  0.1488  0.1487     -0.0007  0.0878  0.0878

There are some useful findings to be mentioned. First, the root mean squared error (RMSE) of the $\beta$ estimates decreases and TV increases as the sample size increases, for all DGPs, quantiles and information criteria considered. Second, FV decreases as the sample size increases in DGPs 1, 2 and 3 with the non-high-dimensional BIC-type criteria. With AIC, or in the other models, FV may increase. In DGPs 4 and 5, FV increases for the BIC-type criteria, but not as much as for the AICs. Third, TV seems to be the key element determining estimator performance; neither FV nor the true-model hit rate seems to matter much.
This can easily be seen by comparing the AICs with the BICs: AIC typically involves a much higher FV and a lower true-model hit rate, yet often shows the smallest bias, SD and RMSE. Fourth, the estimators using the generalized Mundlak form and the generalized Chamberlain form can each outperform the other. When the coefficients on the correlated effect terms are constant across time periods, as in DGPs 1, 2, 3 and 4, the Mundlak form yields a smaller number of nonzero terms to be selected, and the corresponding estimator often has better performance. But when the coefficients differ across time, as in DGP 5, the generalized Chamberlain form can have a sparser selected submodel than the generalized Mundlak form, and the corresponding estimator often outperforms it.

3.6 Application: The Effect of Smoking on Birth Outcomes

In this section, the proposed estimator is applied to an empirical example of birth weight analysis. The data in use is the matched panel¹ #3 of Abrevaya (2006), in which a mean regression analysis was conducted accounting for unobserved heterogeneity across mothers. First, the median regression with a correlated effect shows convincing evidence that the correlated effect estimator works as intended. Second, the corresponding quantile regression results show that, for lower quantiles, the impact of smoking on birth weight is smaller in absolute magnitude but can be larger relative to the fitted quantile birth weights. Third, some computational issues are reported regarding optimization of the nonconvex objective function.

¹ The data do not have a panel structure in the strict sense, since each mother is observed at a different time point when she gave birth. Still, the data set is clustered in general, and the proposed method is valid as long as the set of assumptions holds.

The matched panel data #3 of Abrevaya (2006) contain information on 129,569 two-birth mothers and 12,360 three-birth mothers.
Note that in these data, quantile regression with individual fixed effects would cost an additional 141,929 dummies. One of the main benefits of the correlated effect estimator is to reduce the number of additional terms while individual heterogeneity is still treated properly. The results below show that fewer than 400 additional terms are spent to obtain reasonable estimates; the number of additional terms is less than 0.3% of that of the fixed effect estimator. The structural equation is taken from Abrevaya (2006) as
\begin{align}
BW_{ib} ={}& \beta_1\, smoke_{ib} + \beta_2\, male_{ib} + \beta_3\, age_{ib} + \beta_4\, age_{ib}^2 + \beta_5\, adeqcode2_{ib} + \beta_6\, adeqcode3_{ib} \nonumber \\
& + \beta_7\, novisit_{ib} + \beta_8\, pretri2_{ib} + \beta_9\, pretri3_{ib} + \gamma_1\, \_Inlbnl\_1_{ib} + \cdots + \gamma_{15}\, \_Inlbnl\_15_{ib} \nonumber \\
& + \delta_1\, \_Iyear\_1_{ib} + \cdots + \delta_8\, \_Iyear\_8_{ib} + const + \varepsilon_{ib} \tag{3.42}
\end{align}
where $i$ is an individual mother index, $b$ is an observed birth index, "adeqcode#" is the Kessner index of #, "novisit" is an indicator of no prenatal visit during pregnancy, "pretri#" is an indicator of the first prenatal visit occurring in the #th trimester, and the remaining terms are live birth order and year effects. The observed birth index $b$ corresponds to the time index $t$ in our setting. All right-hand-side variables in (3.42) constitute $x_{ib}$, following the previous notation. Besides $x_{ib}$, there are several within-group-constant variables $z_i$ that are used to construct the correlated effects: binary dummies for high school graduate, some college experience, college graduate, marital status, being black, and state of residence. Since all variables except "age" are treated as discrete, the sparsity assumption is essentially imposed on the "age" component of the correlated effect. For the sieve approximation, polynomials of order up to 10 are used by default. To account for potential endogeneity due to observability, the selection indicator is used to construct the correlated effects. Basically, there are two observed patterns: 2-birth mothers and 3-birth mothers.
Then the vector of selection indicators can be written as s_i = (1, 1, 0)′ or s_i = (1, 1, 1)′. For notational convenience, denote s^[2] = (1, 1, 0)′ and s^[3] = (1, 1, 1)′. Then the conditional quantiles with the generalized Chamberlain device can be succinctly written as

Q_τ( s_ib y_ib | {x_it}²_{t=1}, z_i, s_i = s^[2] ) = s_ib ( x_ib β + g₂(x_i1, x_i2, z_i) + k_2b ),   (3.43)
Q_τ( s_ib y_ib | {x_it}³_{t=1}, z_i, s_i = s^[3] ) = s_ib ( x_ib β + g₃({x_it}³_{t=1}, z_i) + k_3b )   (3.44)

for some g₂, g₃, k_2b's and k_3b's. Assuming additivity of the g's, the following transformation is considered:

g₂(x_i1, x_i2, z_i) + k_2b = g_2,0 + Σ_{b′=1}^{3} Σ_{k=1}^{K₁} gˣ_2b′k(x_ib′k) + Σ_{k=1}^{K₂} gᶻ_2k(z_ik) + k_2b,   (3.45)

g₃({x_it}³_{t=1}, z_i) + k_3b = g_2,0 + Σ_{b′=1}^{3} Σ_{k=1}^{K₁} gˣ_2b′k(x_ib′k) + Σ_{k=1}^{K₂} gᶻ_2k(z_ik) + k_2b   (3.46)
                             + h_3,0 + Σ_{b′=1}^{3} Σ_{k=1}^{K₁} hˣ_b′k(x_ib′k) + Σ_{k=1}^{K₂} hᶻ_2k(z_ik) + l_b,   (3.47)

where gˣ_2b′k(x_ib′k) = 0 for b′ = 3, h_3,0 = g_3,0 − g_2,0, hˣ_b′k(x_ib′k) = gˣ_3b′k(x_ib′k) − gˣ_2b′k(x_ib′k), hᶻ_2k(z_ik) = gᶻ_3k(z_ik) − gᶻ_2k(z_ik), and l_b = k_3b − k_2b. In estimation, the interactions of the 3-birth-mom dummy with the approximating terms for gˣ_2b′k (b′ = 1, 2) and gᶻ_2k are included. The constant terms g_2,0 and h_3,0 and the time effects k_2b and l_b are also included but not penalized; l₂ is excluded to avoid multicollinearity. If there is no systematic difference in the g components between the two-birth mothers and the three-birth mothers, the corresponding h components will be zero and fewer terms will be selected in the final estimates.

In Table 3.3, the OLS estimator, the FE estimator, the pooled median regression estimator and the CE estimators are compared. BIC and BICL were used to choose the threshold parameters of the CE1/CE2 and CE3 estimates, respectively, where the candidate threshold parameters were 50 equi-spaced points between 0 and 0.01. The CE2 estimator uses polynomials up to order 40. The "rqPen" and "pracma" packages for R were used for computing the penalized estimates and Moore–Penrose inverse matrices, respectively.

Table 3.3 Birthweight, mean and median regression, all moms (unit: grams; standard errors in parentheses)

                                        Mean                          Median
                                  OLS            FE            Pooled         CE1            CE2            CE3
Smoke                          -243.27 (3.20) -144.04 (4.75) -238.49 (3.79) -138.26 (6.31) -138.55 (6.39) -140.34 (6.45)
Male                            126.70 (1.88)  133.58 (2.08)  131.27 (2.10)  138.87 (2.51)  139.38 (2.54)  139.45 (2.52)
Age                               7.06 (1.77)  -15.98 (3.96)    2.59 (2.10)  -13.37 (5.38)  -8.32 (27.04)   -8.74 (4.57)
Age2                             -0.12 (0.03)    0.32 (0.05)   -0.04 (0.04)    0.35 (0.07)    0.26 (0.29)    2.75 (0.07)
High-school graduate             60.52 (4.12)       -          64.19 (4.98)       -              -              -
Some college                     91.34 (4.52)       -          96.65 (5.55)       -              -              -
College graduate                100.89 (4.73)       -         102.84 (5.79)       -              -              -
Married                          64.43 (3.65)       -          55.23 (4.32)       -              -              -
Black                          -252.04 (4.36)       -        -239.28 (5.07)       -              -              -
Kessner index = 2              -100.93 (4.19)  -84.43 (4.45)  -81.71 (4.56)  -79.17 (5.66)  -74.12 (6.14)  -69.96 (5.55)
Kessner index = 3             -176.48 (10.20)-143.91 (10.28)-149.85 (12.48)-163.42 (15.67)-154.35 (15.52)-150.94 (15.50)
No prenatal visit              -26.49 (18.00) -42.35 (16.57)   7.87 (21.29) -32.02 (24.99) -47.25 (27.03) -52.70 (26.88)
First prenatal visit, 2nd tri.   89.12 (4.96)   66.56 (5.27)   72.21 (5.40)   67.38 (6.73)   62.80 (9.27)   58.66 (6.67)
First prenatal visit, 3rd tri. 154.66 (12.03) 111.90 (12.49) 119.48 (14.34) 109.92 (18.82) 111.45 (22.84) 118.90 (18.41)
Information criterion                -              -              -            BIC            BIC           BICL
# of dictionary var.                 -              -              -            301            451            301
# of selected var.                   -              -              -            298            300            169

Table 3.4 Birthweight, quantile regression with CE, all moms (unit: grams; standard errors in parentheses)

Quantile                            0.1             0.25            0.5             0.75            0.9
Smoke                          -129.99 (10.75)  -136.67 (7.29)  -138.26 (6.31)  -148.34 (7.63)  -152.86 (9.80)
Male                            102.44 (3.79)    122.63 (2.91)   138.87 (2.51)   150.31 (2.87)   162.12 (3.41)
Age                              -1.25 (8.34)     -6.75 (5.38)   -13.37 (5.38)    -9.05 (5.27)    -4.47 (3.04)
Age2                              1.94 (0.13)      0.26 (0.08)     0.35 (0.07)     0.25 (0.08)     1.12 (0.05)
Kessner index = 2              -121.12 (10.29)  -106.11 (6.90)   -79.17 (5.66)   -69.06 (6.45)   -75.41 (8.65)
Kessner index = 3              -255.81 (27.03) -220.18 (17.05) -163.42 (15.67) -138.40 (17.80) -120.40 (19.91)
No prenatal visit              -169.44 (49.00)  -40.80 (27.40)  -32.02 (24.99)   -9.20 (30.33)  -88.82 (48.29)
First prenatal visit, 2nd tri.   92.05 (12.20)    88.90 (7.97)    67.38 (6.73)    53.81 (7.59)   69.25 (10.72)
First prenatal visit, 3rd tri.  260.68 (36.80)  178.45 (20.16)  109.92 (18.82)   97.94 (21.59)  112.61 (25.68)
# of dictionary var.               301              301             301             301             301
# of selected var.                 150              246             298             269             141

As noted by Abrevaya (2006), the FE coefficient estimate on the "smoke" variable is approximately 100g smaller in magnitude than the OLS estimate, which is consistent with the basic omitted-variables story. The CE coefficient estimates on the "smoke" variable are also smaller in magnitude than the pooled median regression estimate by a similar amount. Moreover, the FE and CE coefficient estimates on the other variables show similar patterns of change from the OLS and pooled median regression estimates, respectively. For example, the OLS/FE coefficient estimates on age and age² alternate in sign, and similar patterns are found in the pooled median regression and CE estimates on age and age². Overall, the CE estimates are quite close to the FE estimates, and they can be regarded as a median analogue of the FE estimates. This is sensible because, given the nature of the dependent variable, we expect the conditional distribution of the regression errors to be fairly symmetric, and because the CE estimator is the control-function analogue of the FE estimator. For the unconditional distribution, the mean/median pairs of birth weights for b = 1, 2 and 3 are 3426g/3430g, 3482g/3487g and 3517g/3520g, respectively. Figure A.1 in Appendix C.2 shows a frequency histogram of the pooled birth weights across all births.
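The closeness of the mean/median pairs above reflects a distribution that is nearly symmetric with a thin low-birth-weight tail. The toy sample below (fabricated numbers, not the Abrevaya data) illustrates how such a tail pulls the mean a few grams below the median while keeping the two close:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical birth-weight-like sample: a symmetric bulk plus a small
# low-weight component, mimicking the reported mean/median pattern.
bulk = rng.normal(3450, 450, size=100_000)
tail = rng.normal(2000, 400, size=1_500)
w = np.concatenate([bulk, tail])
print(round(np.mean(w)), round(np.median(w)))
# The thin left tail drags the mean slightly below the median,
# just as 3426g (mean) sits below 3430g (median) for first births.
```

Under exact symmetry the two would coincide, which is why the CE median estimates can reasonably be read as a median analogue of the FE mean estimates.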
Table 3.4 contains the CE estimates for the 10th, 25th, 50th, 75th and 90th percentiles.² The same set of dictionary variables is used, with BIC, in all cases. Evidently, the magnitude of the coefficient estimate on the "smoke" variable declines as the percentile decreases. Note that the pooled quantile regression results in Appendix C.2 show exactly the opposite relationship, which indicates that the impact of smoking is more severely overestimated in the pooled regressions at the lower quantiles. Although the absolute magnitude of the impact declines, its proportionate impact can be larger at lower quantiles. For example, relative to the fitted values for a two-birth mom who had two female babies at ages 27 and 28 (with all other dummy variables equal to zero), the proportionate impacts of smoking at the 10th, 25th, 50th, 75th and 90th percentiles are -5.13%, -4.60%, -4.03%, -3.98% and -3.92%, respectively.

² The estimates in Table A.13 were computed with the classical CRE (after dropping linearly dependent terms). The CRE estimator is less robust than the proposed CE estimator by construction.

There are some computational issues that need to be addressed. First, the main computational challenge in the SCAD or MCP penalized estimation lies in the nonconvex nature of the objective function. The numerical algorithms studied so far use some version of an approximated objective function. In this chapter, all estimates are computed by iterative quantile regression on an augmented data set, based on a local linear approximation of the SCAD penalty function (Sherwood and Wang, 2016). Second, when Sherwood and Wang (2016)'s iterative quantile regression method is used on the given data set, the selection path can vanish for small enough threshold parameters at quantiles close enough to 0 or 1. That is, the penalized estimator is essentially not computable for a small enough threshold parameter at high-end or low-end quantiles. The 0.1 and 0.9 quantile results could have had more selected terms were it not for this problem.

Table 3.5 Coefficient estimates on "Smoke" using different ICs (unit: grams; number of selected variables in brackets)

Quantile      QBIC            QBICL           BIC             BICL
0.1          -132.51 [88]    -140.61 [37]    -129.99 [150]   -129.99 [150]
0.25         -145.97 [75]    -158.10 [37]    -136.66 [246]   -140.18 [179]
0.5          -145.27 [45]    -153.97 [30]    -138.26 [298]   -140.34 [169]
0.75         -147.74 [40]    -152.33 [27]    -148.34 [269]   -146.47 [244]
0.9          -152.41 [93]    -245.82 [30]    -152.86 [141]   -152.86 [141]

Table 3.5 shows the coefficient estimates on the "smoke" variable based on four different Bayesian-type information criteria. The bracketed numbers are the numbers of selected variables out of the 301 dictionary variables. BIC and BICL show no difference at the 10th and 90th percentiles. Unreported results show that the 0.15 and 0.85 quantiles do not suffer from the vanishing-path problem, so it appears that the numerical algorithm for the quantile regression matters: for the given data set, Koenker and d'Orey's (1987; KO) algorithm yields more stable results than Armstrong, Frome and Kung's (1979; AFK) algorithm.

3.7 Concluding Remarks

I propose a new model restriction and estimation procedure for a linear panel data quantile regression model with fixed T. By introducing a nonparametric correlated effect, the new model restriction reasonably accounts for the τ-quantile-specific time-invariant heterogeneity and allows arbitrary within-group dependence of the regression errors. A nonconvex penalized estimation procedure is employed under a sparsity assumption on the correlated effect. To make the sparsity assumption more plausible in some cases, a transformation of the approximated correlated effect into a generalized Mundlak form is proposed. There are interesting questions to be answered in future research. First, it would be useful to study Bayesian-type information criteria that allow diverging p_N and q_N and attain selection consistency under a certain degree of misspecification.
Second, the numerical algorithm for the nonconvex objective function can be improved for more stable and efficient computation. Third, extending the current framework to account for a censored response variable and time-varying endogeneity is another interesting direction to pursue. The extended estimator is expected to have advantages similar to those of the estimator studied in this chapter.

APPENDICES

APPENDIX A

AN APPENDIX FOR CHAPTER 1

A.1 Assumptions

Throughout the appendices, ( a ; b ) denotes the vertical stacking of the vectors a and b.

Assumptions
(1) (y_i1, y_i2, z_i) are i.i.d.
(2) Θ ⊂ R^p is compact.
(3) q₁ : Θ × W → R and q₂ : Θ₂ × W → R, where (y_i1, y_i2, z_i) ∈ W.
(4) θ_o ∈ int(Θ); let N be a neighborhood of θ_o.
(5) With probability one, q₁(y_i1, y_i2, z_i; θ₁, θ₂) and q₂(y_i2, z_i; θ₂) are continuously differentiable at each θ ∈ Θ and twice continuously differentiable on N.
(6) E[ sup_{θ∈Θ} ‖ ∂(q₁+q₂)/∂θ ‖ ] < ∞.
(7) Each element of ∂q₁(y_i1, y_i2, z_i; θ_o1, θ_o2)/∂θ and ∂q₂(y_i2, z_i; θ_o2)/∂θ₂ has a finite second moment.
(8) E[ sup_{θ∈N} ‖ ∂²(q₁+q₂)/∂θ∂θ′ ‖ ] < ∞ and E[ sup_{θ∈N} ‖ ∂²q₂/∂θ₂∂θ₂′ ‖ ] < ∞.
(9) (QLIML f.o.c.) {θ_o} = { θ ∈ Θ : E[ ∂(q₁+q₂)(θ)/∂θ ] = 0 }.
(10) (CF f.o.c.) {θ_o} = { θ ∈ Θ : E[ ( ∂q₁(θ)/∂θ₁ ; ∂q₂(θ₂)/∂θ₂ ) ] = 0 }.
(11) (QLIML rank condition) E[ ∂²(q₁+q₂)(θ_o)/∂θ∂θ′ ] and V( ∂(q₁+q₂)(θ_o)/∂θ ) are invertible.
(12) (CF rank condition) E[ ∂/∂θ′ ( ∂q₁(θ_o)/∂θ₁ ; ∂q₂(θ_o2)/∂θ₂ ) ] and V( ( ∂q₁(θ_o)/∂θ₁ ; ∂q₂(θ_o2)/∂θ₂ ) ) are invertible.

A.2 Proofs

A.2.1 Proof of Proposition 1.3.4

Under the regularity conditions, it suffices to show the following (Newey and McFadden, 1994):

(a) E[ sup_{θ∈Θ} ‖ ( ∂q₁(θ)/∂θ₁ ; ∂q₁(θ)/∂θ₂₂ ; ∂q₂(θ₂)/∂θ₂ ) ‖ ] < ∞;
(b) E[ sup_{θ∈N} ‖ ∂/∂θ′ ( ∂q₁(θ)/∂θ₁ ; ∂q₁(θ)/∂θ₂₂ ; ∂q₂(θ₂)/∂θ₂ ) ‖ ] < ∞;
(c) {θ_o} = { θ ∈ Θ : E[ ( ∂q₁(θ)/∂θ₁ ; ∂q₁(θ)/∂θ₂₂ ; ∂q₂(θ₂)/∂θ₂ ) ] = 0 };
(d) E[ ∂/∂θ′ ( ∂q₁(θ_o)/∂θ₁ ; ∂q₁(θ_o)/∂θ₂₂ ; ∂q₂(θ_o2)/∂θ₂ ) ] has full column rank;
(e) V( ( ∂q₁(θ_o)/∂θ₁ ; ∂q₁(θ_o)/∂θ₂₂ ; ∂q₂(θ_o2)/∂θ₂ ) ) is invertible.

(c) and (e) are direct implications of the definition of GMM-QLIML. (d) can be shown from Assumption 12, since adding extra rows does not affect the column rank. (a) and (b) are implied by the triangle inequality together with Assumptions 6 and 8; e.g.,

‖ ∂q₁(θ)/∂θ₂₂ ‖ ≤ ‖ ∂(q₁+q₂)(θ)/∂θ ‖ + ‖ ∂q₂(θ₂)/∂θ₂ ‖,

and similarly for the second derivatives.

A.2.2 Proof of Proposition 1.3.6

(a) and (b) are directly implied by the following lemma.

Lemma A.2.1 Let G be the linear span of the moment functions {g_i} in (1.8). Optimal GMM based on each maximal linearly independent subset of G at the true parameter values yields asymptotically equivalent estimators.

Proof. Suppose {g̃_i(θ_o)} and {ĝ_i(θ_o)} are maximal linearly independent subsets of the linear span of the moment functions {g_i(θ_o)}. First, the number of moments in {g̃_i(θ_o)} and {ĝ_i(θ_o)} is the same, since both are bases for span({g_i(θ_o)}) and span({g_i(θ_o)}) is a finite-dimensional vector space. Second, by the definition of a basis, there exists an invertible linear map A(θ_o) such that

A(θ_o) ( g̃₁(θ_o) ; … ; g̃_k(θ_o) ) = ( ĝ₁(θ_o) ; … ; ĝ_k(θ_o) ).

By Proposition 1.3.4, the efficient GMM estimators based on {g̃_i(θ_o)} and {ĝ_i(θ_o)} are well defined and asymptotically normal.
Then it is easy to see that

E[ ∂ĝ(θ_o)′/∂θ ] V( ĝ(θ_o) )⁻¹ E[ ∂ĝ(θ_o)/∂θ′ ]
= E[ ∂g̃(θ_o)′/∂θ ] A(θ_o)′ ( A(θ_o) V( g̃(θ_o) ) A(θ_o)′ )⁻¹ A(θ_o) E[ ∂g̃(θ_o)/∂θ′ ]
= E[ ∂g̃(θ_o)′/∂θ ] V( g̃(θ_o) )⁻¹ E[ ∂g̃(θ_o)/∂θ′ ].

A.2.3 Statements (d), (e) and Proof of Proposition 1.3.7

Write q°_i1 = q_i1(θ_o) and q°_i2 = q_i2(θ_o2), and let θ_S denote the structural subvector of θ. The statements are the Breusch–Qian–Schmidt–Wyhowski (BQSW) partial-redundancy conditions for the respective extra moment blocks:

(d) V^S_GMM-QLIML = V^S_QLIML if and only if

E_o[ ∂(∂q°_i2/∂θ₂)/∂θ_S′ ] = cov_o( ∂q°_i2/∂θ₂ , ( ∂q°_i1/∂θ₁ ; ∂q°_i1/∂θ₂ + ∂q°_i2/∂θ₂ ) ) W_o E_o[ ∂/∂θ_S′ ( ∂q°_i1/∂θ₁ ; ∂q°_i1/∂θ₂ + ∂q°_i2/∂θ₂ ) ],

where W_o = V_o( ( ∂q°_i1/∂θ₁ ; ∂q°_i1/∂θ₂ + ∂q°_i2/∂θ₂ ) )⁻¹.

(e) V^S_GMM-QLIML = V^S_CF if and only if

E_o[ ∂(∂q°_i1/∂θ₂₂)/∂θ_S′ ] = cov_o( ∂q°_i1/∂θ₂₂ , ( ∂q°_i1/∂θ₁ ; ∂q°_i2/∂θ₂ ) ) V_o( ( ∂q°_i1/∂θ₁ ; ∂q°_i2/∂θ₂ ) )⁻¹ E_o[ ∂/∂θ_S′ ( ∂q°_i1/∂θ₁ ; ∂q°_i2/∂θ₂ ) ].

Proof. (a) V_GMM-QLIML ⪯ V_CF is trivial. To see V_GMM-QLIML ⪯ V_QLIML, note first that, at the true parameter value,

( ∂q₁(θ₁, θ₂)/∂θ₁ ; ∂q₁(θ₁, θ₂)/∂θ₂ + ∂q₂(θ₂)/∂θ₂ )

is linearly independent by Assumption 11.
Thus there exists an extension to a basis

( ∂q₁(θ₁, θ₂)/∂θ₁ ; ∂q₁(θ₁, θ₂)/∂θ₂ + ∂q₂(θ₂)/∂θ₂ ; ∂q₁(θ)/∂θ₂ ),   (A.1)

which is an invertible linear transformation of (1.9). Hence the result follows by Lemma C.1.

(b) Apply the BQSW redundancy condition to (1.9).

(c) Apply the BQSW redundancy condition to (A.1).

(d), (e) Apply the BQSW partial-redundancy results to (1.9) and (A.1), respectively.

A.2.4 Proof of Corollary 1.3.9

(a) Suppose the GIMEs hold:

V_o( ∂q°₁/∂θ ) = −E_o[ ∂²q°₁/∂θ∂θ′ ]  and  V_o( ∂q°₂/∂θ₂ ) = −E_o[ ∂²q°₂/∂θ₂∂θ₂′ ].

It is implied that cov_o( ∂q°₁/∂θ , ∂q°₂/∂θ₂ ) = 0. Then condition (c) in Proposition 1.3.7 follows. To see that V_QLIML ≠ V_CF, consider the submatrix

V( ∂q_i1/∂θ₂₂ ) − cov( ∂q_i1/∂θ₂₂ , ∂q_i1/∂θ₁ ) V( ∂q_i1/∂θ₁ )⁻¹ cov( ∂q_i1/∂θ₁ , ∂q_i1/∂θ₂₂ ).

The last expression can be interpreted as the difference of the outer products of ∂q_i1/∂θ₂₂ and its linear projection L( ∂q_i1/∂θ₂₂ | ∂q_i1/∂θ₁ ). Since ∂q_i1/∂θ₁ and ∂q_i1/∂θ₂₂ are assumed to be linearly independent at the true parameter, it cannot be zero.

(b) Suppose the GIMEs hold.
Without loss of generality, σ = 1 is assumed. Then result (e) of Proposition 1.3.7 implies

(LHS) = E_o[ ∂²q°_i1/∂θ₂₂∂θ₁₁′ ] − cov_o( ∂q°_i1/∂θ₂₂ , ( ∂q°_i1/∂θ₁ ; ∂q°_i2/∂θ₂ ) ) V_o( ( ∂q°_i1/∂θ₁ ; ∂q°_i2/∂θ₂ ) )⁻¹ E_o[ ( ∂²q°_i1/∂θ₁∂θ₁₁′ ; ∂²q°_i2/∂θ₂∂θ₁₁′ ) ] = 0_{p₂₂×p₁₁}.

For the RHS, evaluating its three parts under the GIMEs gives

(1st part of RHS) = [ 0_{p₂₂×p₁₂} , cov( ∂q_i1/∂θ₂₂ , ∂q_i1/∂θ₂ ) − cov( ∂q_i1/∂θ₂₂ , ∂q_i1/∂θ₁ ) V( ∂q_i1/∂θ₁ )⁻¹ cov( ∂q_i1/∂θ₁ , ∂q_i1/∂θ₂ ) ],

(2nd part of RHS) = [ [ V( ∂q_i1/∂θ₁₂ ) , cov( ∂q_i1/∂θ₁₂ , ∂q_i1/∂θ₂ ) ] ;
                     [ cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁₂ ) , cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁ ) V( ∂q_i1/∂θ₁ )⁻¹ cov( ∂q_i1/∂θ₁ , ∂q_i1/∂θ₂ ) + V( ∂q_i2/∂θ₂ ) ] ],

(3rd part of RHS) = ( cov( ∂q_i1/∂θ₁₂ , ∂q_i1/∂θ₁₁ ) ; cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁₁ ) ).

Denote the inverse of the second part by [ [R₁₁, R₁₂] ; [R₂₁, R₂₂] ]. Then the RHS can be expressed as

(RHS) = [ cov( ∂q_i1/∂θ₂₂ , ∂q_i1/∂θ₂ ) − cov( ∂q_i1/∂θ₂₂ , ∂q_i1/∂θ₁ ) V( ∂q_i1/∂θ₁ )⁻¹ cov( ∂q_i1/∂θ₁ , ∂q_i1/∂θ₂ ) ] [ R₂₁ cov( ∂q_i1/∂θ₁₂ , ∂q_i1/∂θ₁₁ ) + R₂₂ cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁₁ ) ],

where, by the partitioned-inverse formula,

R₂₂ = [ M_o − cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁₂ ) V( ∂q_i1/∂θ₁₂ )⁻¹ cov( ∂q_i1/∂θ₁₂ , ∂q_i1/∂θ₂ ) ]⁻¹,
R₂₁ = − R₂₂ cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁₂ ) V( ∂q_i1/∂θ₁₂ )⁻¹,
M_o = cov( ∂q_i1/∂θ₂ , ∂q_i1/∂θ₁ ) V( ∂q_i1/∂θ₁ )⁻¹ cov( ∂q_i1/∂θ₁ , ∂q_i1/∂θ₂ ) + V( ∂q_i2/∂θ₂ ).

A.2.5 Proof of Proposition 1.3.10

For ( m₁(γ₁, γ₂) ; m₂(γ₁, γ₂) ), asymptotically equivalent linearized moment functions are

l_i1(γ₁, γ₂) = m_i1(γ_o1, γ_o2) + E[ ∂m_i1(γ_o1, γ_o2)/∂γ₁′ ](γ₁ − γ_o1) + E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ](γ₂ − γ_o2),
l_i2(γ₁, γ₂) = m_i2(γ_o1, γ_o2) + E[ ∂m_i2(γ_o1, γ_o2)/∂γ₁′ ](γ₁ − γ_o1) + E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ](γ₂ − γ_o2).

By subtracting E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ] ( E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ] )⁻¹ l_i2(γ₁, γ₂) from l_i1(γ₁, γ₂), we get

l⁰_i1(γ₁) = l_i1(γ₁, γ₂) − E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ] ( E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ] )⁻¹ l_i2(γ₁, γ₂),

where l⁰_i1(γ₁) is a function of γ₁ only. Then standard asymptotics from l⁰_i1(γ₁) yields

V¹_QLIML = A₁⁻¹ B₁ A₁⁻¹′,

where

A₁ = E[ ∂m_i1(γ_o1, γ_o2)/∂γ₁′ ] − E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ] ( E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ] )⁻¹ E[ ∂m_i2(γ_o1, γ_o2)/∂γ₁′ ],
B₁ = V( m_i1(γ_o1, γ_o2) − E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ] ( E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ] )⁻¹ m_i2(γ_o1, γ_o2) ).

Since E[ ∂m_i1(γ_o1, γ_o2)/∂γ₂′ ] = 0, this simply reduces to

A₁ = E[ ∂m_i1(γ_o1, γ_o2)/∂γ₁′ ]  and  B₁ = V( m_i1(γ_o1, γ_o2) ),

and the result is the same for the case of ( m₁(γ₁, γ₂) ; m₃(γ₁, γ₂) ).

A.2.6 Proof of Corollary 1.3.11

First, consider the following reparameterizations with

η ≡ Σ₂₂⁻¹ Σ₂₁  and  σ₁₁|₂ ≡ Σ₁₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁,

where θ₁ = (α′, δ₁′, η′, σ₁₁|₂)′ and θ₂ = (vec(δ₂)′, vech(Σ₂₂)′)′. (A similar proof can be done with the original scores as well.) These modifications do not change the parameter estimates (other than Σ₁₁ and Σ₂₁) of any of the methods studied in this chapter, because the reparameterizations impose no restriction on the parameter space. Now the quasi-scores of q₁ are modified to

∂q₁/∂θ₁ = ( σ₁₁|₂⁻¹ e_i(θ) x′ ; (1/(2σ₁₁|₂²)) h_i(θ) ),
∂q₁/∂θ₂ = ( σ₁₁|₂⁻¹ e_i(θ) (η′ ⊗ z′) ; 0_{r(r+1)/2} ),

where x = [ y₂  z₁  v₂(δ₂) ] and (e_i(θ), h_i(θ)) are defined correspondingly. The moment functions for LIML and CF are

E[ σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) y₂′ ;
   σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) z₁′ ;
   σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (1/(2σ₁₁|₂²)) ( (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ) ;
   vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( z′(y₁ − y₂α − z₁δ₁ − v₂η) η′ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0   (A.2)

and

E[ σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) y₂′ ;
   σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) z₁′ ;
   σ₁₁|₂⁻¹ (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (1/(2σ₁₁|₂²)) ( (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ) ;
   vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0,   (A.3)

respectively. If η = 0, the result is trivial. Suppose that there exists at least one nonzero element of η. Also, assume over-identification.
By substitution of y₂ and an invertible linear transformation, these can be equivalently expressed as

E[ (y₁ − y₂α − z₁δ₁ − v₂η) δ₂₂′z₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) z₁′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ;
   vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( z′(y₁ − y₂α − z₁δ₁ − v₂η) η′ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0   (A.4)

and

E[ (y₁ − y₂α − z₁δ₁ − v₂η) δ₂₂′z₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) z₁′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ;
   vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0,   (A.5)

respectively. We can show that (A.4) and (A.5) can be transformed by an invertible linear transformation into

E[ (y₁ − y₂α − z₁δ₁) δ₂₂′z₂′ ;
   (y₁ − y₂α − z₁δ₁) z₁′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ;
   vec( z₁′(y₂ − zδ₂) Σ₂₂⁻¹ ) ;
   vec( z₂′(y₂ − zδ₂) Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( z₂′(y₁ − y₂α − z₁δ₁ − v₂η) η′ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0   (A.6)

and

E[ (y₁ − y₂α − z₁δ₁) δ₂₂′z₂′ ;
   (y₁ − y₂α − z₁δ₁) z₁′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η) v₂′ ;
   (y₁ − y₂α − z₁δ₁ − v₂η)² − σ₁₁|₂ ;
   vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) ;
   L_r vec( Σ₂₂⁻¹ v_i2(δ₂)′ v_i2(δ₂) Σ₂₂⁻¹ − Σ₂₂⁻¹ ) ] = 0,   (A.7)

respectively. Then the result follows by Proposition 1.3.10 and by a similar argument as in Lemma C.1.
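The repeated appeals in this proof to invertible linear transformations of the moment conditions rest on the invariance of efficient GMM to such transformations. The numeric sketch below checks the population identity (G′V⁻¹G)⁻¹ = ((AG)′(AVA′)⁻¹(AG))⁻¹ with arbitrary stand-in matrices, not the model's actual moments:

```python
import numpy as np

rng = np.random.default_rng(3)
q, p = 6, 3
G = rng.normal(size=(q, p))                  # stand-in for E[dg/dtheta']
S = rng.normal(size=(q, q))
V = S @ S.T + q * np.eye(q)                  # stand-in for V(g), positive definite
A = rng.normal(size=(q, q)) + q * np.eye(q)  # invertible transformation g -> A g

def gmm_avar(G, V):
    """Asymptotic variance of efficient GMM: (G' V^-1 G)^-1."""
    return np.linalg.inv(G.T @ np.linalg.solve(V, G))

v_orig = gmm_avar(G, V)
v_trans = gmm_avar(A @ G, A @ V @ A.T)  # Jacobian and variance of transformed moments
print(np.allclose(v_orig, v_trans))     # → True
```

Algebraically, the A and A′ factors cancel inside (AG)′(AVA′)⁻¹(AG), which is why rewriting a moment vector in a different basis leaves the efficient GMM variance unchanged.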
Equivalence in the CF case is clear, since

E[ vec( z′(y₂ − zδ₂) Σ₂₂⁻¹ ) ] = 0

implies

E[ ( δ₂₂′z₂′ ; z₁′ ) v₂η ] = 0.

To see that (A.4) implies (A.6), note that the second k₂ × r part of the fifth moments implies

E[ vec( δ₂₂′z₂′(y₂ − zδ₂) Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( δ₂₂′z₂′(y₁ − y₂α − z₁δ₁ − v₂η) η′ ) ] = 0,

and, by adding the first moments after multiplying by σ₁₁|₂⁻¹ and the elements of η, we have

E[ vec( δ₂₂′z₂′(y₂ − zδ₂) Σ₂₂⁻¹ ) ] = 0,

which implies E[ δ₂₂′z₂′v₂η ] = 0. Similarly, by adding the second moments after multiplying by the scalar σ₁₁|₂⁻¹ and the elements of η to the first k₁ × r part of the fifth moments, we have

E[ vec( z₁′(y₂ − zδ₂) Σ₂₂⁻¹ ) ] = 0,

and it implies E[ z₁′v₂η ] = 0. The converse can be shown as follows:

E[ vec( δ₂₂′z₂′(y₂ − zδ₂) Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( δ₂₂′z₂′(y₁ − y₂α − z₁δ₁ − v₂η) η′ ) ]
= E[ vec( δ₂₂′z₂′v₂ Σ₂₂⁻¹ ) + σ₁₁|₂⁻¹ vec( δ₂₂′z₂′v₂ηη′ ) ] = 0.

Multiplying by Σ₂₂η from the right, we have

E[ vec( δ₂₂′z₂′v₂η ) ( 1 + σ₁₁|₂⁻¹ η′Σ₂₂η ) ] = 0,

where 1 + σ₁₁|₂⁻¹ η′Σ₂₂η is a strictly positive scalar, which implies

E[ δ₂₂′z₂′v₂η ] = 0.

And, again, seeing that E[ vec( z₁′(y₂ − zδ₂) Σ₂₂⁻¹ ) ] = 0 implies E[ z₁′v₂η ] = 0 delivers the result. The invertibility of E[ ∂m_i3(γ_o1, γ_o2)/∂γ₂′ ] and E[ ∂m_i2(γ_o1, γ_o2)/∂γ₂′ ] can easily be derived from the identification conditions of LIML and CF. Hence, it is shown that there exist such T₁(θ) and T₂(θ) as in Proposition 1.3.10.

A.2.7 Proof of Proposition 1.3.12

Lemma A.2.2 below proves the results when g₁(w, θ, γ) and g₂(w, θ, γ) are taken properly:

QLIML: g₂ = ∂q₁/∂γ + ∂q₂/∂γ;  CF: g₂ = ∂q₂/∂γ;  GMM-QLIML: g₂ = ∂q₂/∂γ,

with g₁ chosen to be the rest of the moment functions in each GMM-interpreted estimator.

Lemma A.2.2 Let θ ∈ R^p and γ ∈ R^r, and let the moment functions g = ( g₁(w, θ, γ) ; g₂(w, θ, γ) ) be R^{q+r}-valued with q ≥ p.
Assume the regularity conditions for well-definedness of the relevant GMM estimators below. Suppose

E[ ( ∂g₁(θ_o, γ_o)/∂(θ′, γ′) ; ∂g₂(θ_o, γ_o)/∂(θ′, γ′) ) ] = [ [ G¹¹_{q×p} , 0_{q×r} ] ; [ 0_{r×p} , G²²_{r×r} ] ],

where both G²² and V( g(w, θ_o, γ_o) ) are invertible. Then the asymptotic variance of the optimal GMM estimator of θ based on ( g₁(θ, γ) ; g₂(θ, γ) ) is the same as that of the optimal GMM estimator of θ based on g₁(θ, γ_o), treating γ_o as a known value.

Proof. Let

G = [ [ G¹¹_{q×p} , 0_{q×r} ] ; [ 0_{r×p} , G²²_{r×r} ] ]

and

V⁻¹ = V( ( g₁(w, θ_o, γ_o) ; g₂(w, θ_o, γ_o) ) )⁻¹ = [ [ V₁₁,{q×q} , V₁₂,{q×r} ] ; [ V₂₁,{r×q} , V₂₂,{r×r} ] ]⁻¹ = [ [ B¹¹_{q×q} , B¹²_{q×r} ] ; [ B²¹_{r×q} , B²²_{r×r} ] ].

Then

G′V⁻¹G = [ [ G¹¹′B¹¹G¹¹ , G¹¹′B¹²G²² ] ; [ G²²′B²¹G¹¹ , G²²′B²²G²² ] ].

Now it suffices to show that

G¹¹′B¹¹G¹¹ − G¹¹′B¹²G²² [ G²²′B²²G²² ]⁻¹ G²²′B²¹G¹¹ = G¹¹′ V₁₁⁻¹ G¹¹.

This is true since

G¹¹′B¹¹G¹¹ − G¹¹′B¹²G²² [ G²²′B²²G²² ]⁻¹ G²²′B²¹G¹¹
= G¹¹′ [ B¹¹ − B¹² G²² (G²²)⁻¹ (B²²)⁻¹ (G²²′)⁻¹ G²²′ B²¹ ] G¹¹
= G¹¹′ [ B¹¹ − B¹² (B²²)⁻¹ B²¹ ] G¹¹
= G¹¹′ V₁₁⁻¹ G¹¹,

where the last equality is the partitioned-inverse identity B¹¹ − B¹²(B²²)⁻¹B²¹ = V₁₁⁻¹.

A.2.8 Proof of Proposition 1.3.13

Define the Schur complements

A/(A₂₂ + C₂₂) ≡ A₁₁ − A₁₂ (A₂₂ + C₂₂)⁻¹ A₂₁  and  A/A₁₁ ≡ A₂₂ + C₂₂ − A₂₁ A₁₁⁻¹ A₁₂,

where

A = [ [A₁₁, A₁₂] ; [A₂₁, A₂₂ + C₂₂] ],  A₁₁ = −E[ ∂²q₁/∂θ₁∂θ₁′ ],  A₁₂ = A₂₁′ = −E[ ∂²q₁/∂θ₁∂θ₂′ ],
A₂₂ = −E[ ∂²q₁/∂θ₂∂θ₂′ ]  and  C₂₂ = −E[ ∂²q₂/∂θ₂∂θ₂′ ].

Assume the GIMEs, i.e.,

V( ∂q₁/∂θ₁ ) = σ₁A₁₁,  cov( ∂q₁/∂θ₁, ∂q₁/∂θ₂ ) = σ₁A₁₂,  V( ∂q₁/∂θ₂ ) = σ₁A₂₂,
V( ∂q₂/∂θ₂ ) = σ₂C₂₂  and  cov( ∂q₁/∂θ, ∂q₂/∂θ₂ ) = 0.

Then what needs to be shown is

V¹_CF − V¹_QLIML = A₁₁⁻¹ A₁₂ [ σ₂ W₁ + (σ₁ − σ₂) W₂ ] A₂₁ A₁₁⁻¹,

where

W₁ = C₂₂⁻¹ − [A/A₁₁]⁻¹  and  W₂ = −[A/A₁₁]⁻¹ ( A₂₂ − A₂₁A₁₁⁻¹A₁₂ ) [A/A₁₁]⁻¹.

First, by the argument used in the proof of Proposition 1.3.10, the variance difference is

V¹_CF − V¹_QLIML = A₁₁⁻¹ B₂ A₁₁⁻¹ − [ A/(A₂₂+C₂₂) ]⁻¹ B₁ [ A/(A₂₂+C₂₂) ]⁻¹,

where

B₁ = V( ∂q₁/∂θ₁ − A₁₂(A₂₂+C₂₂)⁻¹( ∂q₁/∂θ₂ + ∂q₂/∂θ₂ ) )
   = σ₁A₁₁ + A₁₂(A₂₂+C₂₂)⁻¹( σ₁A₂₂ + σ₂C₂₂ )(A₂₂+C₂₂)⁻¹A₂₁ − 2σ₁A₁₂(A₂₂+C₂₂)⁻¹A₂₁
   = σ₁ [ A/(A₂₂+C₂₂) ] + (σ₂ − σ₁) A₁₂ D A₂₁,

with D ≡ (A₂₂+C₂₂)⁻¹ C₂₂ (A₂₂+C₂₂)⁻¹, and

B₂ = V( ∂q₁/∂θ₁ − A₁₂C₂₂⁻¹ ∂q₂/∂θ₂ ) = σ₁A₁₁ + σ₂A₁₂C₂₂⁻¹A₂₁.

Substituting these and using the identity

[ A/(A₂₂+C₂₂) ]⁻¹ = A₁₁⁻¹ + A₁₁⁻¹A₁₂[A/A₁₁]⁻¹A₂₁A₁₁⁻¹,

the σ₂ part of the difference collects into A₁₁⁻¹A₁₂( C₂₂⁻¹ − [A/A₁₁]⁻¹ )A₂₁A₁₁⁻¹ = A₁₁⁻¹A₁₂W₁A₂₁A₁₁⁻¹, while the (σ₁ − σ₂) part collects into

[ A/(A₂₂+C₂₂) ]⁻¹ A₁₂ D A₂₁ [ A/(A₂₂+C₂₂) ]⁻¹ − A₁₁⁻¹A₁₂[A/A₁₁]⁻¹A₂₁A₁₁⁻¹.

For the first term,

[ A/(A₂₂+C₂₂) ]⁻¹ A₁₂ D A₂₁ [ A/(A₂₂+C₂₂) ]⁻¹
= ( A₁₁⁻¹ + A₁₁⁻¹A₁₂[A/A₁₁]⁻¹A₂₁A₁₁⁻¹ ) A₁₂ D A₂₁ ( A₁₁⁻¹ + A₁₁⁻¹A₁₂[A/A₁₁]⁻¹A₂₁A₁₁⁻¹ )
= A₁₁⁻¹A₁₂ [A/A₁₁]⁻¹ ( [A/A₁₁] + A₂₁A₁₁⁻¹A₁₂ ) D ( [A/A₁₁] + A₂₁A₁₁⁻¹A₁₂ ) [A/A₁₁]⁻¹ A₂₁A₁₁⁻¹
= A₁₁⁻¹A₁₂ [A/A₁₁]⁻¹ [A₂₂+C₂₂] D [A₂₂+C₂₂] [A/A₁₁]⁻¹ A₂₁A₁₁⁻¹
= A₁₁⁻¹A₁₂ [A/A₁₁]⁻¹ C₂₂ [A/A₁₁]⁻¹ A₂₁A₁₁⁻¹.

Hence the (σ₁ − σ₂) part equals

A₁₁⁻¹A₁₂ [A/A₁₁]⁻¹ ( C₂₂ − [A/A₁₁] ) [A/A₁₁]⁻¹ A₂₁A₁₁⁻¹ = − A₁₁⁻¹A₁₂ [A/A₁₁]⁻¹ ( A₂₂ − A₂₁A₁₁⁻¹A₁₂ ) [A/A₁₁]⁻¹ A₂₁A₁₁⁻¹ = A₁₁⁻¹ A₁₂ W₂ A₂₁ A₁₁⁻¹,

hence the result. To see the positive semi-definiteness of W₁, note that

C₂₂⁻¹ − [A/A₁₁]⁻¹ ⪰ 0 ⟺ [A/A₁₁] = A₂₂ + C₂₂ − A₂₁A₁₁⁻¹A₁₂ ⪰ C₂₂ ⟺ A₂₂ − A₂₁A₁₁⁻¹A₁₂ ⪰ 0 ⟺ [ [A₁₁, A₁₂] ; [A₂₁, A₂₂] ] ⪰ 0 (given A₁₁ ≻ 0),

where the last equivalence is the Schur-complement condition for positive semi-definiteness. Since

A₁₁ = σ₁⁻¹ V_o( ∂q°₁/∂θ₁ ) ≻ 0 (as ∂q₁/∂θ₁ is linearly independent at θ_o)  and  [ [A₁₁, A₁₂] ; [A₂₁, A₂₂] ] = σ₁⁻¹ V_o( ∂q°₁/∂θ ) ⪰ 0,

the statement is shown. The negative semi-definiteness of W₂ follows from the same Schur-complement condition, since A₂₂ − A₂₁A₁₁⁻¹A₁₂ ⪰ 0.

A.2.9 Proof of Proposition 1.3.14

Note that

E[ ∂²q°₁ / ∂θ₁ ∂vech(Σ₂₂)′ ] = 0  and  E[ ∂²q°₂ / ∂vec(δ₂) ∂vech(Σ₂₂)′ ] = 0.

Thus we can treat Σ₂₂ as a known value by Proposition 1.3.12. Then the redefined q̃₂ is also a member of the linear exponential family (Gouriéroux, Monfort and Trognon, 1984). Now it suffices to show that the GLM variance assumptions imply the GIMEs with the corresponding scaling factors in the linear exponential family.
Let m(θ) ≡ G(y₂, z₁, v₂, θ₁). Based on the general forms of the score and Hessian (Wooldridge, 2010), it is easy to see that

−E_o[ ∂²q₁/∂θ∂θ′ ] = E_o[ (1/V_q(y₁ | y₂, z)) (∂m(θ)/∂θ)(∂m(θ)/∂θ′) ]

and

E_o[ (∂q₁/∂θ)(∂q₁/∂θ′) ] = E_o[ ( [y₁ − m(θ)]² / V_q(y₁ | y₂, z)² ) (∂m(θ)/∂θ)(∂m(θ)/∂θ′) ]
 = E_o[ ( E_o[ [y₁ − m(θ)]² | y₂, z ] / V_q(y₁ | y₂, z)² ) (∂m(θ)/∂θ)(∂m(θ)/∂θ′) ]
 = −σ₁ E_o[ ∂²q₁/∂θ∂θ′ ].

The GIMEs for q̃₂ can be shown similarly:

−E_o[ ∂²q̃₂ / ∂vec(δ₂)∂vec(δ₂)′ ] = E_o[ Σ₂₂⁻¹ ⊗ z′z ],
V_o( ∂q̃₂/∂vec(δ₂) ) = E_o[ [I_r ⊗ z]′ Σ₂₂⁻¹ E_o[ v₂′v₂ | z ] Σ₂₂⁻¹ [I_r ⊗ z] ] = σ₂ E_o[ Σ₂₂⁻¹ ⊗ z′z ] = −σ₂ E_o[ ∂²q̃₂ / ∂vec(δ₂)∂vec(δ₂)′ ].

The orthogonality of the scores holds under correct specification of the conditional means, since

E_o[ (∂q₁/∂θ)(∂q₂/∂θ₂′) ] = E_o[ ( E_o[ y₁ − m(θ) | y₂, z ] / V_q(y₁ | y₂, z) ) (∂m(θ)/∂θ)(∂q₂/∂θ₂′) ] = 0.

Then, by Proposition 1.3.13, QLIML is efficient relative to CF for θ₁.

APPENDIX B

AN APPENDIX FOR CHAPTER 2

B.1 Regularity conditions

Assumptions (consistency and asymptotic normality of the QLIML and CF estimators)
(1) (y_i1, y_i2, z_i) are i.i.d.
(2) Θ ⊂ R^p is compact.
(3) q₁ : Θ × W → R and q₂ : Θ₂ × W → R, where (y_i1, y_i2, z_i) ∈ W.
(4) θ_o ∈ int(Θ); let N be a neighborhood of θ_o.
(5) q_i1(θ₁, θ₂) and q_i2(θ₂) are continuously differentiable at each θ ∈ Θ with probability one.
(6) E[ sup_{θ∈Θ} ‖ ( ∂(q₁+q₂)/∂θ ; ∂q₂/∂θ₂ ) ‖ ] < ∞.
(7) E[ ‖ ( ∂q₁(θ_o)/∂θ ; ∂q₂(θ_o2)/∂θ₂ ) ‖² ] < ∞.
(8) E[ ( ∂(q₁(θ) + q₂(θ₂))/∂θ ; ∂q₂(θ₂)/∂θ₂ ) ] is differentiable with respect to θ at θ_o.
(9) (QLIML orthogonality) {θ_o} = { θ ∈ Θ : E[ ∂(q₁+q₂)(θ)/∂θ ] = 0 }.
(10) (CF orthogonality) {θ_o} = { θ ∈ Θ : E[ ( ∂q₁(θ)/∂θ₁ ; ∂q₂(θ₂)/∂θ₂ ) ] = 0 }.
(11) (QLIML rank) ∂/∂θ′ E[ ∂(q₁(θ)+q₂(θ))/∂θ ] |_{θ=θ_o} and V( ∂(q₁(θ)+q₂(θ))/∂θ |_{θ=θ_o} ) are invertible.
(12) (CF rank) ∂/∂θ′ E[ ( ∂q₁(θ)/∂θ₁ ; ∂q₂(θ)/∂θ₂ ) ] |_{θ=θ_o} and V( ( ∂q₁(θ_o)/∂θ₁ ; ∂q₂(θ_o2)/∂θ₂ ) ) are invertible.
(13) (stochastic differentiability) For j = 1, 2 and any δ_N → 0,

sup_{‖θ − θ_o‖ ≤ δ_N}  √N ‖ ĝ^j_N(θ) − ĝ^j_N(θ_o) − E[ ĝ^j_N(θ) ] ‖ / ( 1 + √N ‖θ − θ_o‖ )  →_p  0,
(13) (stochastic differentiability) For $j=1,2$ and any $\delta_N\to 0$,
\[
\sup_{\|\theta-\theta_o\|\le\delta_N}
\frac{\sqrt{N}\,\big\|\hat g_N^j(\theta)-\hat g_N^j(\theta_o)-E\big[\hat g_N^j(\theta)\big]\big\|}{1+\sqrt{N}\,\|\theta-\theta_o\|}
\xrightarrow{p}0,
\]
where
\[
\hat g_N^1(\theta)=\frac1N\sum_{i=1}^N\frac{\partial\big(q_{i1}(\theta)+q_{i2}(\theta)\big)}{\partial\theta}
\qquad\text{and}\qquad
\hat g_N^2(\theta)=\frac1N\sum_{i=1}^N
\begin{bmatrix}\partial q_{i1}(\theta)/\partial\theta_1\\ \partial q_{i2}(\theta)/\partial\theta_2\end{bmatrix}.
\]

Assumptions (consistency and asymptotic normality of minimum distance estimators):

(14) $\pi(\theta_1,\theta_{o2})=\pi_o$ if and only if $\theta_1=\theta_{o1}$.
(15) $\partial\pi(\theta_{o1},\theta_{o2})/\partial\theta_1'$ has full rank.
(16) $\pi:\Theta\to\Gamma$ is continuous at each $\theta\in\Theta$ and continuously differentiable in $\mathcal{N}$.
(17) (asymptotic normality of reduced-form parameters)
\[
\sqrt{N}\Big(\big(\hat\pi',\hat\theta_2'\big)'-\big(\pi_o',\theta_{o2}'\big)'\Big)
\xrightarrow{d}N\!\left(0,\big(A_R'B_R^{-1}A_R\big)^{-1}\right),
\]
where
\[
A_R=\frac{\partial}{\partial(\pi',\theta_2')}E
\begin{bmatrix}\partial q_1(\pi,\theta_2)/\partial\pi\\ \partial q_2(\theta_2)/\partial\theta_2\end{bmatrix}
\Bigg|_{(\pi,\theta_2)=(\pi_o,\theta_{o2})}
\qquad\text{and}\qquad
B_R=V\begin{bmatrix}\partial q_1(\pi_o,\theta_{o2})/\partial\pi\\ \partial q_2(\theta_{o2})/\partial\theta_2\end{bmatrix}.
\]
(18) Each component of $\partial q_2(\theta_{o2})/\partial\theta_2$ cannot be expressed as a linear combination of the components of $\partial q_1(\theta)/\partial\pi$.

B.2 Proof of Proposition 2.3.3

First, it is shown that a well-defined minimum distance estimator and its linearized version are asymptotically equivalent.

Lemma B.2.1 (linearized minimum distance estimator) Assume: (1) $h$ is continuously differentiable in $\theta$, where $\theta\in\mathbb{R}^p$, $\gamma\in\mathbb{R}^g$ and $g\ge p$; (2) $\gamma_o-h(\theta)\ne 0$ if $\theta\ne\theta_o$; (3) $\partial h(\theta_o)/\partial\theta'$ has full column rank; (4) $\sqrt{N}(\hat\gamma-\gamma_o)\xrightarrow{d}N(0,\Omega_o)$. Then efficient MD on a linearized link function yields an estimator asymptotically equivalent to efficient MD with the original link function.

Proof. Consider a first-order expansion of $h$ around $\theta_o$:
\[
h(\theta)\approx h(\theta_o)+\frac{\partial h(\theta_o)}{\partial\theta'}(\theta-\theta_o).
\]
The minimization problem is
\[
\min_\theta\;\Big(\hat\gamma-h(\theta_o)-\frac{\partial h(\theta_o)}{\partial\theta'}(\theta-\theta_o)\Big)'\hat W\Big(\hat\gamma-h(\theta_o)-\frac{\partial h(\theta_o)}{\partial\theta'}(\theta-\theta_o)\Big),
\tag{A.1}
\]
where $\hat W\xrightarrow{p}W_o$. The first-order condition is
\[
\frac{\partial h(\theta_o)'}{\partial\theta}\,\hat W\,\Big(\hat\gamma-h(\theta_o)-\frac{\partial h(\theta_o)}{\partial\theta'}(\hat\theta-\theta_o)\Big)=0.
\]
Then, writing $H_o\equiv\partial h(\theta_o)/\partial\theta'$,
\[
\hat\theta=\theta_o+\big(H_o'\hat WH_o\big)^{-1}H_o'\hat W\big(\hat\gamma-h(\theta_o)\big).
\]
The asymptotic distribution is
\begin{align*}
\sqrt{N}\big(\hat\theta-\theta_o\big)
&=\big(H_o'\hat WH_o\big)^{-1}H_o'\hat W\,\sqrt{N}\big(\hat\gamma-h(\theta_o)\big)\\
&=\big(H_o'W_oH_o\big)^{-1}H_o'W_o\,\sqrt{N}\big(\hat\gamma-h(\theta_o)\big)+o_p(1)\\
&=\big(H_o'W_oH_o\big)^{-1}H_o'W_o\,\sqrt{N}\big(\hat\gamma-\gamma_o\big)+o_p(1).
\end{align*}
Then the optimal weighting matrix is such that $W_o=\Omega_o^{-1}$. Hence the proof.

Now, consider an auxiliary model.

Lemma B.2.2 There exists an auxiliary asymptotic model whose GLS estimator is asymptotically equivalent to efficient linearized MD.

Proof. Consider
\[
\underbrace{\hat\gamma-h(\theta_o)+\frac{\partial h(\theta_o)}{\partial\theta'}\theta_o}_{y}
=\underbrace{\frac{\partial h(\theta_o)}{\partial\theta'}}_{X}\theta+u_n,
\qquad E[u_n]=0,\quad V(u_n)=\Omega_o.
\]
Then the GLS estimator of $\theta$ in this auxiliary asymptotic model is asymptotically equivalent to efficient linearized MD, since the GLS estimator solves (A.1). Here, the mean and variance of $u_n$ are simply restrictions imposed in the auxiliary model, not derived from the original model (Gouriéroux, Monfort and Trognon, 1985).

Next, the above two lemmas are applied to a minimum distance problem with a partitioned link function. Consider
\[
\{(\theta_{o1},\theta_{o2})\}
=\left\{(\theta_1,\theta_2):
\begin{bmatrix}\gamma_{o1}\\ \gamma_{o2}\end{bmatrix}
=\begin{bmatrix}h_1(\theta_1,\theta_2)\\ h_2(\theta_1,\theta_2)\end{bmatrix}\right\},
\]
where $\gamma=(\gamma_1',\gamma_2')'$, $h=(h_1',h_2')'$, $\gamma_1\in\mathbb{R}^{g_1}$, $\gamma_2\in\mathbb{R}^{g_2}$, $\theta_1\in\mathbb{R}^{p_1}$, $\theta_2\in\mathbb{R}^{p_2}$, and $p_2=g_2$. Then, by a first-order expansion, the linearized partitioned model is
\[
\begin{bmatrix}h_1(\theta_1,\theta_2)\\ h_2(\theta_1,\theta_2)\end{bmatrix}
\approx
\begin{bmatrix}h_1(\theta_{o1},\theta_{o2})\\ h_2(\theta_{o1},\theta_{o2})\end{bmatrix}
+\begin{bmatrix}
\partial h_1(\theta_o)/\partial\theta_1' & \partial h_1(\theta_o)/\partial\theta_2'\\
\partial h_2(\theta_o)/\partial\theta_1' & \partial h_2(\theta_o)/\partial\theta_2'
\end{bmatrix}
\left(\begin{bmatrix}\theta_1\\ \theta_2\end{bmatrix}-\begin{bmatrix}\theta_{o1}\\ \theta_{o2}\end{bmatrix}\right),
\]
and the corresponding auxiliary asymptotic model is
\[
\begin{bmatrix}\hat\gamma_1\\ \hat\gamma_2\end{bmatrix}
=\begin{bmatrix}h_1(\theta_{o1},\theta_{o2})\\ h_2(\theta_{o1},\theta_{o2})\end{bmatrix}
+\begin{bmatrix}
\partial h_1(\theta_o)/\partial\theta_1' & \partial h_1(\theta_o)/\partial\theta_2'\\
\partial h_2(\theta_o)/\partial\theta_1' & \partial h_2(\theta_o)/\partial\theta_2'
\end{bmatrix}
\left(\begin{bmatrix}\theta_1\\ \theta_2\end{bmatrix}-\begin{bmatrix}\theta_{o1}\\ \theta_{o2}\end{bmatrix}\right)
+\begin{bmatrix}u_{1n}\\ u_{2n}\end{bmatrix},
\]
where
\[
E\begin{bmatrix}u_{1n}\\ u_{2n}\end{bmatrix}=0
\qquad\text{and}\qquad
V\begin{bmatrix}u_{1n}\\ u_{2n}\end{bmatrix}=\Omega_o.
\]
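The content of Lemmas B.2.1 and B.2.2 can be checked numerically: with the optimal weight $W_o=\Omega_o^{-1}$, the efficient linearized MD estimator coincides with GLS on the auxiliary model $y=H\theta+u_n$. The following is a minimal sketch under hypothetical dimensions and a made-up Jacobian (none of these objects come from the dissertation).

```python
import numpy as np

rng = np.random.default_rng(1)
g, p = 5, 2                                        # reduced-form and structural dims (g >= p)
H = rng.normal(size=(g, p))                        # Jacobian dh(theta_o)/dtheta' (full column rank)
theta_o = np.array([1.0, -0.5])
Omega = 0.3 * np.eye(g) + 0.1 * np.ones((g, g))    # avar of the reduced-form estimates (PD)
h_o = rng.normal(size=g)                           # h(theta_o), arbitrary for the illustration
gamma_hat = h_o + H @ theta_o + 0.05 * rng.normal(size=g)

# Auxiliary regression: y = H theta + u_n with y := gamma_hat - h(theta_o) + H theta_o
y = gamma_hat - h_o + H @ theta_o

# Efficient linearized MD: theta = (H' W H)^{-1} H' W y with W = Omega^{-1}
W = np.linalg.inv(Omega)
theta_md = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)

# GLS via whitening with the Cholesky factor of Omega (Omega = C C')
C = np.linalg.cholesky(Omega)
Hw, yw = np.linalg.solve(C, H), np.linalg.solve(C, y)
theta_gls = np.linalg.lstsq(Hw, yw, rcond=None)[0]

assert np.allclose(theta_md, theta_gls)
```

Both routes solve the same quadratic problem (A.1), which is the point of the auxiliary-model construction.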
Define
\[
y_1\equiv\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+\frac{\partial h_1(\theta_o)}{\partial\theta_1'}\theta_{o1}+\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\theta_{o2},
\qquad
y_2\equiv\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})+\frac{\partial h_2(\theta_o)}{\partial\theta_1'}\theta_{o1}+\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\theta_{o2},
\]
\[
X_1\equiv\frac{\partial h_1(\theta_o)}{\partial\theta_1'},\quad
Z_1\equiv\frac{\partial h_1(\theta_o)}{\partial\theta_2'},\quad
X_2\equiv\frac{\partial h_2(\theta_o)}{\partial\theta_1'},\quad
Z_2\equiv\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\;(\text{invertible, }p_2\times p_2),
\]
where $E[(u_{1n}',u_{2n}')']=0$ and $V[(u_{1n}',u_{2n}')']=\Omega_o$. Then we have the system
\[
\begin{cases}
y_1=X_1\theta_1+Z_1\theta_2+u_{1n}\\
y_2=X_2\theta_1+Z_2\theta_2+u_{2n}.
\end{cases}
\]
Since $Z_2$ is invertible, we have
\[
Z_2^{-1}y_2=Z_2^{-1}X_2\theta_1+\theta_2+Z_2^{-1}u_{2n}
\;\Longleftrightarrow\;
\theta_2=Z_2^{-1}y_2-Z_2^{-1}X_2\theta_1-Z_2^{-1}u_{2n}.
\]
Thus
\[
y_1=X_1\theta_1+Z_1\big(Z_2^{-1}y_2-Z_2^{-1}X_2\theta_1-Z_2^{-1}u_{2n}\big)+u_{1n}
\;\Longleftrightarrow\;
y_1-Z_1Z_2^{-1}y_2=\big(X_1-Z_1Z_2^{-1}X_2\big)\theta_1+u_{1n}-Z_1Z_2^{-1}u_{2n}.
\]
Then an equivalent system is
\[
\begin{cases}
y_1-Z_1Z_2^{-1}y_2=\big(X_1-Z_1Z_2^{-1}X_2\big)\theta_1+u_{1n}-Z_1Z_2^{-1}u_{2n}\\
y_2=X_2\theta_1+Z_2\theta_2+u_{2n}.
\end{cases}
\]
Moreover, define
\[
y_1^*\equiv y_1-Z_1Z_2^{-1}y_2,\qquad
X_1^*\equiv X_1-Z_1Z_2^{-1}X_2,\qquad
u_{1n}^*\equiv u_{1n}-Z_1Z_2^{-1}u_{2n},
\]
\[
y_2^*\equiv y_2-Ay_1^*,\qquad
X_2^*\equiv X_2-AX_1^*,\qquad
u_{2n}^*\equiv u_{2n}-L\big(u_{2n}\mid u_{1n}^*\big)=u_{2n}-Au_{1n}^*,
\]
where the linear projection $L(\cdot\mid\cdot)$ is defined in the auxiliary population space. Then we have another equivalent system
\[
\begin{cases}
y_1^*=X_1^*\theta_1+u_{1n}^*\\
y_2^*=X_2^*\theta_1+Z_2\theta_2+u_{2n}^*.
\end{cases}
\]
Since $u_{1n}^*$ and $u_{2n}^*$ are orthogonal here, GLS on the first part only is equivalent to joint GLS for $\theta_1$.

From now on, it will be proved that concentrated MD is asymptotically equivalent to running GLS on the first part only.

Consistency. The concentrating equation is
\[
\hat\gamma_2-h_2(\theta_1,\theta_2)=0,
\]
and, by the implicit function theorem, $\theta_2=\varphi_n(\theta_1)$ is well-defined and continuously differentiable at each $\theta_1$. The concentrated MD estimator is derived by minimizing the distance $\hat\gamma_1-h_1(\theta_1,\varphi_n(\theta_1))$, and consistency of concentrated MD follows easily from the fact that $\varphi_n(\theta_1)$ is well-defined and smooth enough for each $\theta_1$.
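The substitution step that eliminates $\theta_2$ can be illustrated numerically: in a noiseless version of the system, the transformed first block alone recovers $\theta_1$ exactly. A minimal sketch with hypothetical dimensions (the matrices below are illustrative, not objects from the dissertation):

```python
import numpy as np

rng = np.random.default_rng(2)
g1, p1, p2 = 4, 2, 3
X1, Z1 = rng.normal(size=(g1, p1)), rng.normal(size=(g1, p2))
X2 = rng.normal(size=(p2, p1))
Z2 = rng.normal(size=(p2, p2)) + 3 * np.eye(p2)   # invertible p2 x p2 block
theta1 = np.array([0.7, -1.2])
theta2 = rng.normal(size=p2)

# Noiseless partitioned system
y1 = X1 @ theta1 + Z1 @ theta2
y2 = X2 @ theta1 + Z2 @ theta2

# Substitute theta2 = Z2^{-1}(y2 - X2 theta1) into the first block:
# y1 - Z1 Z2^{-1} y2 = (X1 - Z1 Z2^{-1} X2) theta1
y_star = y1 - Z1 @ np.linalg.solve(Z2, y2)
X_star = X1 - Z1 @ np.linalg.solve(Z2, X2)

theta1_rec = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
assert np.allclose(theta1_rec, theta1)
```

With errors added, the same transformation produces the first equation of the equivalent system above, on which GLS is run.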
Optimal weight calculation. First, differentiating both sides of the concentration identity gives
\[
\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_1'}
+\underbrace{\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_2'}}_{\text{invertible}}
\frac{\partial\varphi_n(\theta_1)}{\partial\theta_1'}=0.
\]
Hence
\[
\frac{\partial\varphi_n(\theta_1)}{\partial\theta_1'}
=-\left[\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_2'}\right]^{-1}
\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_1'}.
\]
In the minimization problem
\[
\min_{\theta_1}\;\big[\hat\gamma_1-h_1(\theta_1,\varphi_n(\theta_1))\big]'\hat W\big[\hat\gamma_1-h_1(\theta_1,\varphi_n(\theta_1))\big],
\]
taking the first-order condition and defining
\[
H_n(\theta_1)\equiv
\frac{\partial h_1(\theta_1,\varphi_n(\theta_1))}{\partial\theta_1'}
-\frac{\partial h_1(\theta_1,\varphi_n(\theta_1))}{\partial\theta_2'}
\left[\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_2'}\right]^{-1}
\frac{\partial h_2(\theta_1,\varphi_n(\theta_1))}{\partial\theta_1'},
\]
we have
\[
0=H_n(\hat\theta_1)'\,\hat W\,\big[\hat\gamma_1-h_1(\hat\theta_1,\varphi_n(\hat\theta_1))\big].
\]
Note that, by a mean value expansion,
\[
h_1(\hat\theta_1,\varphi_n(\hat\theta_1))
=h_1(\theta_{o1},\varphi_n(\theta_{o1}))+H_n(\bar\theta_1)\big(\hat\theta_1-\theta_{o1}\big),
\]
where $\bar\theta_1$ lies on the segment connecting $\hat\theta_1$ and $\theta_{o1}$. Hence
\begin{align*}
\sqrt{N}\big(\hat\theta_1-\theta_{o1}\big)
&=\big[H_n(\hat\theta_1)'\hat WH_n(\bar\theta_1)\big]^{-1}H_n(\hat\theta_1)'\hat W\,
\sqrt{N}\big[\hat\gamma_1-h_1(\theta_{o1},\varphi_n(\theta_{o1}))\big]\\
&=\big(H_o'W_oH_o\big)^{-1}H_o'W_o\,\Big[\sqrt{N}\big(\hat\gamma_1-\gamma_{o1}\big)
-\sqrt{N}\big(h_1(\theta_{o1},\varphi_n(\theta_{o1}))-h_1(\theta_{o1},\theta_{o2})\big)\Big]+o_p(1).
\end{align*}
Note
\[
h_1(\theta_{o1},\varphi_n(\theta_{o1}))
=h_1(\theta_{o1},\theta_{o2})+\frac{\partial h_1(\theta_{o1},\bar\theta_2)}{\partial\theta_2'}\big(\varphi_n(\theta_{o1})-\theta_{o2}\big)
\]
and
\[
0=\hat\gamma_2-h_2(\theta_{o1},\varphi_n(\theta_{o1}))
=\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})-\frac{\partial h_2(\theta_{o1},\tilde\theta_2)}{\partial\theta_2'}\big(\varphi_n(\theta_{o1})-\theta_{o2}\big),
\]
where $\bar\theta_2$ and $\tilde\theta_2$ lie on the segment connecting $\varphi_n(\theta_{o1})$ and $\theta_{o2}$, which implies
\[
\varphi_n(\theta_{o1})-\theta_{o2}
=\left[\frac{\partial h_2(\theta_{o1},\tilde\theta_2)}{\partial\theta_2'}\right]^{-1}
\big(\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})\big)
=\left[\frac{\partial h_2(\theta_{o1},\tilde\theta_2)}{\partial\theta_2'}\right]^{-1}
\big(\hat\gamma_2-\gamma_{o2}\big).
\]
Thus
\begin{align*}
\sqrt{N}\big(\hat\theta_1-\theta_{o1}\big)
&=\big(H_o'W_oH_o\big)^{-1}H_o'W_o
\left[\sqrt{N}\big(\hat\gamma_1-\gamma_{o1}\big)
-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}
\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}
\sqrt{N}\big(\hat\gamma_2-\gamma_{o2}\big)\right]+o_p(1)\\
&=\big(H_o'W_oH_o\big)^{-1}H_o'W_o
\underbrace{\left[I_{g_1}\;\;-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\Big[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\Big]^{-1}\right]
\sqrt{N}\begin{bmatrix}\hat\gamma_1-\gamma_{o1}\\ \hat\gamma_2-\gamma_{o2}\end{bmatrix}}_{\text{the inverse of the asymptotic variance of this expression is the optimal weight}}
+o_p(1),
\end{align*}
where
\[
H_o=\frac{\partial h_1(\theta_o)}{\partial\theta_1'}
-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}
\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}
\frac{\partial h_2(\theta_o)}{\partial\theta_1'}.
\]
Therefore, the optimal weight is
\[
W_o=\left(
\left[I_{g_1}\;\;-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\Big[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\Big]^{-1}\right]
\Omega_o
\left[I_{g_1}\;\;-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\Big[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\Big]^{-1}\right]'\right)^{-1}.
\]

Linearized cMD and the auxiliary asymptotic model. The linearized cMD criterion, based on
\[
\hat\gamma_1-h_1(\theta_{o1},\varphi_n(\theta_{o1}))-H_o\big(\theta_1-\theta_{o1}\big)
\]
with the weights calculated above, is asymptotically equivalent to concentrated MD. The corresponding auxiliary asymptotic model is
\[
\hat\gamma_1-h_1(\theta_{o1},\varphi_n(\theta_{o1}))+H_o\theta_{o1}
=H_o\theta_1+u_{1n}-Z_1Z_2^{-1}u_{2n},
\]
where the error $u_{1n}-Z_1Z_2^{-1}u_{2n}$ is derived from the optimal weight calculation. Moreover, since we know
\[
X_1-Z_1Z_2^{-1}X_2=H_o,
\]
to see its equivalence to
\[
y_1-Z_1Z_2^{-1}y_2=\big(X_1-Z_1Z_2^{-1}X_2\big)\theta_1+u_{1n}-Z_1Z_2^{-1}u_{2n},
\]
it suffices to show that we can replace $\hat\gamma_1-h_1(\theta_{o1},\varphi_n(\theta_{o1}))+H_o\theta_{o1}$ with $y_1-Z_1Z_2^{-1}y_2$. Note that we have
\begin{align*}
y_1-Z_1Z_2^{-1}y_2
&=\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+\frac{\partial h_1(\theta_o)}{\partial\theta_1'}\theta_{o1}+\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\theta_{o2}\\
&\qquad-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}
\left[\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})+\frac{\partial h_2(\theta_o)}{\partial\theta_1'}\theta_{o1}+\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\theta_{o2}\right]\\
&=\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+H_o\theta_{o1}
-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}
\big[\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})\big],
\end{align*}
along with
\begin{align*}
\hat\gamma_1-h_1(\theta_{o1},\varphi_n(\theta_{o1}))+H_o\theta_{o1}
&=\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+H_o\theta_{o1}
+\big[h_1(\theta_{o1},\theta_{o2})-h_1(\theta_{o1},\varphi_n(\theta_{o1}))\big]\\
&=\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+H_o\theta_{o1}
-\frac{\partial h_1(\theta_{o1},\bar\theta_2)}{\partial\theta_2'}
\left[\frac{\partial h_2(\theta_{o1},\tilde\theta_2)}{\partial\theta_2'}\right]^{-1}
\big(\hat\gamma_2-\gamma_{o2}\big)\\
&=\hat\gamma_1-h_1(\theta_{o1},\theta_{o2})+H_o\theta_{o1}
-\frac{\partial h_1(\theta_o)}{\partial\theta_2'}
\left[\frac{\partial h_2(\theta_o)}{\partial\theta_2'}\right]^{-1}
\big[\hat\gamma_2-h_2(\theta_{o1},\theta_{o2})\big]
+o_p\big(n^{-1/2}\big).
\end{align*}
Hence the result.

B.3 Proof of Proposition 2.3.7

First, note that
\[
V_{MD\text{-}QLIML}=\Big(H_o'A_R'B_R^{-1}A_RH_o\Big)^{-1},
\]
where
\[
H_o=\begin{bmatrix}\dfrac{\partial\pi(\theta_o)}{\partial\theta_1'} & \dfrac{\partial\pi(\theta_o)}{\partial\theta_2'}\\[4pt] 0 & I_{p_2}\end{bmatrix},
\qquad
A_R=\frac{\partial}{\partial(\pi',\theta_2')}E
\begin{bmatrix}\partial q_1(\pi,\theta_2)/\partial\pi\\ \partial q_2(\theta_2)/\partial\theta_2\end{bmatrix}
\Bigg|_{(\pi,\theta_2)=(\pi_o,\theta_{o2})},
\qquad
B_R=V\begin{bmatrix}\partial q_{i1}(\pi_o,\theta_{o2})/\partial\pi\\ \partial q_{i2}(\theta_{o2})/\partial\theta_2\end{bmatrix}.
\]
Next, by the product rule of differentiation, we have
\[
\frac{\partial}{\partial\theta'}E
\begin{bmatrix}\partial q_1(\pi(\theta),\theta_2)/\partial\pi\\ \partial q_2(\theta_2)/\partial\theta_2\end{bmatrix}
\Bigg|_{\theta=\theta_o}
=\underbrace{\frac{\partial}{\partial(\pi',\theta_2')}E
\begin{bmatrix}\partial q_1(\pi,\theta_2)/\partial\pi\\ \partial q_2(\theta_2)/\partial\theta_2\end{bmatrix}
\Bigg|_{(\pi,\theta_2)=(\pi_o,\theta_{o2})}}_{=A_R}
\underbrace{\begin{bmatrix}\dfrac{\partial\pi(\theta_o)}{\partial\theta_1'} & \dfrac{\partial\pi(\theta_o)}{\partial\theta_2'}\\[4pt] 0 & I_{p_2}\end{bmatrix}}_{=H_o}.
\]
Hence
\[
V_{mGMM\text{-}QLIML}=\Big(H_o'A_R'B_R^{-1}A_RH_o\Big)^{-1}.
\]

B.4 Proof of Proposition 2.3.8

Note that the quasi-likelihoods for the reduced-form model are $q_1(\pi(\theta_1,\theta_2),\theta_2)$ and $q_2(\theta_2)$. We will show $V_{mGMM\text{-}QLIML}^{-1}-V_{GMM\text{-}QLIML}^{-1}\succeq 0$. First, note that
\[
V_{mGMM\text{-}QLIML}=\Big(H_o'A_R'B_R^{-1}A_RH_o\Big)^{-1},
\]
with $H_o$, $A_R$ and $B_R$ as defined in B.3. It will be shown that $V_{GMM\text{-}QLIML}^{-1}$ can be expressed in terms of $H_o$, $A_R$, $B_R$ and an additional linear transformation. Suppose $\theta_{22}$ is not empty. With probability one, for some $p_{22}\times g$ matrix $C_2(\theta)$,
\[
\begin{bmatrix}
\partial q_{i1}(\theta)/\partial\theta_1\\
\partial q_{i1}(\theta)/\partial\theta_{22}\\
\partial q_{i2}(\theta_2)/\partial\theta_2
\end{bmatrix}
=\begin{bmatrix}
\frac{\partial\pi'}{\partial\theta_1}\frac{\partial}{\partial\pi}q_{i1}(\pi,\theta_2)\\[2pt]
\big[\frac{\partial\pi'}{\partial\theta_{22}}+C_2(\theta)\big]\frac{\partial}{\partial\pi}q_{i1}(\pi,\theta_2)\\[2pt]
\frac{\partial}{\partial\theta_2}q_{i2}(\theta_2)
\end{bmatrix}
=\underbrace{\begin{bmatrix}
\frac{\partial\pi'}{\partial\theta_1} & 0\\
\frac{\partial\pi'}{\partial\theta_{22}}+C_2(\theta) & 0\\
0 & I_{p_2}
\end{bmatrix}}_{\equiv W(\theta):\;(p+p_{22})\times(g+p_2)}
\begin{bmatrix}
\frac{\partial}{\partial\pi}q_{i1}(\pi,\theta_2)\\[2pt]
\frac{\partial}{\partial\theta_2}q_{i2}(\theta_2)
\end{bmatrix}.
\]
By the product rule of differentiation,
\[
\frac{d}{d\theta'}\left(W(\theta)\,E\begin{bmatrix}\partial q_{i1}(\pi,\theta_2)/\partial\pi\\ \partial q_{i2}(\theta_2)/\partial\theta_2\end{bmatrix}\right)
=\underbrace{\left(E\Big[\big(\partial q_{i1}(\pi,\theta_2)/\partial\pi',\;\partial q_{i2}(\theta_2)/\partial\theta_2'\big)\Big]\otimes I_{p+p_{22}}\right)\frac{d}{d\theta'}\operatorname{vec}\big[W(\theta)\big]}_{\text{vanishes in expectation at the true parameters}}
+\,W(\theta)\,\frac{d}{d\theta'}E\begin{bmatrix}\partial q_{i1}(\pi,\theta_2)/\partial\pi\\ \partial q_{i2}(\theta_2)/\partial\theta_2\end{bmatrix},
\]
where, again by the product rule,
\[
\frac{d}{d\theta'}E\begin{bmatrix}\partial q_{i1}(\pi,\theta_2)/\partial\pi\\ \partial q_{i2}(\theta_2)/\partial\theta_2\end{bmatrix}
=A_RH_o\quad\text{when }\theta=\theta_o.
\]
Then we have
\[
\frac{d}{d\theta'}E
\begin{bmatrix}
\partial q_{i1}(\theta)/\partial\theta_1\\
\partial q_{i1}(\theta)/\partial\theta_{22}\\
\partial q_{i2}(\theta_2)/\partial\theta_2
\end{bmatrix}\Bigg|_{\theta=\theta_o}
=W_oA_RH_o,
\]
where $W_o=W(\theta_o)$. Also, it is easy to see
\[
V\begin{bmatrix}
\partial q_{i1}(\theta_{o1},\theta_{o2})/\partial\theta_1\\
\partial q_{i1}(\theta_{o1},\theta_{o2})/\partial\theta_{22}\\
\partial q_{i2}(\theta_{o2})/\partial\theta_2
\end{bmatrix}
=W_oB_RW_o'.
\]
Hence
\[
V_{GMM\text{-}QLIML}=\Big(H_o'A_R'W_o'\big(W_oB_RW_o'\big)^{-1}W_oA_RH_o\Big)^{-1}.
\]
To see the relative efficiency of MD-QLIML,
\begin{align*}
V_{mGMM\text{-}QLIML}^{-1}-V_{GMM\text{-}QLIML}^{-1}
&=H_o'A_R'B_R^{-1}A_RH_o-H_o'A_R'W_o'\big(W_oB_RW_o'\big)^{-1}W_oA_RH_o\\
&=H_o'A_R'\Big[B_R^{-1}-W_o'\big(W_oB_RW_o'\big)^{-1}W_o\Big]A_RH_o\\
&=H_o'A_R'B_R^{-1/2\prime}\Big[I-B_R^{1/2\prime}W_o'\big(W_oB_R^{1/2}B_R^{1/2\prime}W_o'\big)^{-1}W_oB_R^{1/2}\Big]B_R^{-1/2}A_RH_o\;\succeq\;0,
\end{align*}
where $B_R=B_R^{1/2}B_R^{1/2\prime}$. When $p_1+p_{22}=g$ holds, $W_o$ is invertible and we have
\[
I-B_R^{1/2\prime}W_o'\big(W_oB_RW_o'\big)^{-1}W_oB_R^{1/2}=0.
\]
In the case where $\theta_{22}$ is empty, the result can be shown with a slight modification of the above proof.

B.5 Proof of Proposition 2.3.9

(a) (Well-definedness of the reduced-form model) The reduced-form likelihood is
\begin{align*}
q_{i1}(\theta_1,\theta_2)
&=l\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\\
&=l\big(y_{i1},\,z_1(\delta_{21}\alpha+\delta_1)+z_2\delta_{22}\alpha+v_2(\alpha+\eta),\,\lambda\big)\\
&=l\big(y_{i1},\,z\pi_1(\theta)+v_2\pi_2(\theta),\,\pi_3(\theta)\big)
=q_1\big(\pi(\theta),\theta_2\big).
\end{align*}
Since $q_{i1}$ depends on $\theta_2$ only through $\delta_2$, it suffices to show that each element of $\frac{\partial q_1}{\partial\delta_2}$ can be expressed as a linear combination of $\frac{\partial q_1}{\partial\pi_1}$. Note
\[
\frac{\partial q_1(\theta_1,\theta_2)}{\partial\delta_2}
=-s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2(\delta_2)\eta,\,\lambda\big)\,\eta\otimes z',
\]
\[
\frac{\partial q_1(\pi,\theta_2)}{\partial\pi_1}
=s\big(y_{i1},\,z\pi_1+v_2(\delta_2)\pi_2,\,\pi_3\big)\,z'
=s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2(\delta_2)\eta,\,\lambda\big)\,z',
\]
where $s(y_{i1},\Upsilon,\theta_1)=\partial l(y_{i1},\Upsilon,\theta_1)/\partial\Upsilon$. Hence the proof.

(b) Suppose $k_2=r$. [To show $V_{GMM\text{-}QLIML}=V_{QLIML}=V_{CF}$] The quasi-scores are
\[
\frac{\partial q_1(\theta_1,\theta_2)}{\partial\theta_1}
=\begin{bmatrix}
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,y_2'\\
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,z_1'\\
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,v_2'\\
\partial l\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)/\partial\lambda
\end{bmatrix}
\tag{A.2}
\]
\[
\frac{\partial q_1(\theta_1,\theta_2)}{\partial\theta_2}
=\begin{bmatrix}
-s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,\eta\otimes z'\\
0_{r(r+1)/2}
\end{bmatrix}
\tag{A.3}
\]
The QLIML and CF rank conditions imply that the $k_2\times r$ matrix $\delta_{o22}$ is required to have full column rank. Since $k_2=r$, $\delta_{o22}$ is an invertible matrix. Also, noting that
\[
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,y_2'
=s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2\eta,\,\lambda\big)\,(z_1\delta_{21}+z_2\delta_{22}+v_2)',
\]
any moment function in $\frac{\partial q_1(\theta_1,\theta_2)}{\partial\theta_2}$ can be expressed as a linear combination of $\frac{\partial q_1(\theta_1,\theta_2)}{\partial\theta_1}$. Thus $\theta_{22}$ is empty and the result follows by Proposition 2.3.6 (b).

[To show $V_{MD\text{-}QLIML}=V_{GMM\text{-}QLIML}$] It suffices to show $p_1+p_{22}=g$. Let $\pi_3\in\mathbb{R}^l$. As shown above, $p_{22}=0$. Then $p_1=r+k_1+r+l$ and $g=k_1+k_2+r+l$. Since $k_2=r$, the result follows.

(c) Suppose $\eta_o\ne 0$. [To show $V_{MD\text{-}QLIML}=V_{GMM\text{-}QLIML}$] As in (b), $p_1=r+k_1+r+l$ and $g=k_1+k_2+r+l$. It suffices to show $p_1+p_{22}=g$ or, equivalently, $p_{22}=k_2-r$. The case $k_2=r$ was shown in (b). The case $k_2<r$ is ruled out by the order condition; i.e., $\delta_{o22}$ cannot have full column rank with $k_2<r$. Suppose $k_2>r$. By the rank condition of the reduced-form model, we know linear independence of the components of the score
\[
\frac{\partial q_1(\pi(\theta),\theta_2)}{\partial\pi}
=\begin{bmatrix}
s\big(y_{i1},\,z\pi_1+v_2(\delta_2)\pi_2,\,\pi_3\big)\,z'\\
s\big(y_{i1},\,z\pi_1+v_2(\delta_2)\pi_2,\,\pi_3\big)\,v_2'\\
\partial q_1/\partial\pi_3
\end{bmatrix}
=\begin{bmatrix}
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2(\delta_2)\eta,\,\lambda\big)\,z'\\
s\big(y_{i1},\,y_2\alpha+z_1\delta_1+v_2(\delta_2)\eta,\,\lambda\big)\,v_2'\\
\partial q_1/\partial\pi_3
\end{bmatrix}.
\]
Then, due to the explicit linear relationship $y_2=z_1\delta_{21}+z_2\delta_{22}+v_2$, a maximal linearly independent set in $\{sy_2',\,sz_1',\,sz_2',\,sv_2',\,\partial q_1/\partial\pi_3\}$ always contains $(k_1+k_2+r+l)$ elements. Hence, a maximal linearly independent set in $\{\partial q_1(\theta_o)/\partial\theta_1,\,\partial q_1(\theta_o)/\partial\theta_2\}$ contains $(k_1+k_2+r+l)$ elements whenever there exists at least one nonzero element in $\eta_o$. Since $\partial q_1(\theta_o)/\partial\theta_1$ contains $k_1+2r+l$ moment functions, it is implied that $p_{22}=k_2-r$.

[To show $V_{GMM\text{-}QLIML}\le V_{QLIML},V_{CF}$] The result follows from Proposition 2.3.7.

(d) Suppose $\eta_o=0$. In (A.3), we have $\partial q_1(\theta_1,\theta_2)/\partial\theta_2=0_{p_2\times 1}$ and $\theta_{22}$ is empty. Then $V_{GMM\text{-}QLIML}=V_{QLIML}=V_{CF}$ by Proposition 1.3.6 in Chapter 1.

(e) Let $M$ be the linear span generated by the mGMM-QLIML moment functions
\[
\begin{bmatrix}\partial q_1(\pi(\theta_o),\theta_2)/\partial\pi\\ \partial q_2(\theta_2)/\partial\theta_2\end{bmatrix}.
\tag{A.4}
\]
When $\eta_o=0$, $\theta_{22}$ is empty and $V_{GMM\text{-}QLIML}=V_{QLIML}=V_{CF}$ as shown in (d). Hence, the GMM-QLIML moment functions are
\[
\begin{bmatrix}\dfrac{\partial q_1(\theta_o)}{\partial\theta_1}\\[4pt] \dfrac{\partial q_2(\theta_{o2})}{\partial\theta_2}\end{bmatrix}
=\begin{bmatrix}\dfrac{\partial\pi(\theta_o)'}{\partial\theta_1} & 0\\ 0 & I_{p_2}\end{bmatrix}
\underbrace{\begin{bmatrix}\dfrac{\partial q_1(\pi(\theta_o),\theta_{o2})}{\partial\pi}\\[4pt] \dfrac{\partial q_2(\theta_{o2})}{\partial\theta_2}\end{bmatrix}}_{\text{mGMM-QLIML moments, }(g+p_2)\times 1},
\]
where the transformation matrix is of full column rank by Assumption 15. Hence, the GMM-QLIML moment functions form a linearly independent set in $M$. Also, since $\partial q_1(\theta_o)/\partial\theta_2=0$ when $\eta_o=0$, clearly we have
\[
\begin{bmatrix}\dfrac{\partial q_1(\theta_o)}{\partial\theta_1}\\[4pt] \dfrac{\partial q_2(\theta_{o2})}{\partial\theta_2}\end{bmatrix}
=\begin{bmatrix}\dfrac{\partial q_1(\theta_o)}{\partial\theta_1}\\[4pt] \dfrac{\partial q_1(\theta_o)}{\partial\theta_2}+\dfrac{\partial q_2(\theta_{o2})}{\partial\theta_2}\end{bmatrix},
\]
and it is implied that the QLIML moment functions are also linearly independent in $M$. Since we are assuming $k_2>r$, we have $g=k_1+k_2+r+l>r+k_1+r+l=p_1$. Thus, the dimension of $M$ is larger than the number of GMM-QLIML (or QLIML) moment functions, $p$. Relative efficiency of mGMM-QLIML is obvious. To find a condition for asymptotic equivalence, consider the QLIML moment functions
\[
\begin{bmatrix}\dfrac{\partial q_1(\theta)}{\partial\theta_1}\\[4pt] \dfrac{\partial q_1(\theta)}{\partial\theta_2}+\dfrac{\partial q_2(\theta_2)}{\partial\theta_2}\end{bmatrix}.
\]
By the replacement theorem (Thm 1.10, Friedberg et al., 2003), there exist $k_2-r$ elements of (A.4) with which the QLIML moment functions constitute a basis of $M$ at the true parameter values. Denote such $k_2-r$ elements as $\frac{\partial q_1(\pi(\theta),\theta_2)}{\partial\pi_\iota}$, where $\iota$ indexes a $(k_2-r)\times 1$ subvector of $\pi$. Then optimal GMM on
\[
\begin{bmatrix}
\partial q_1(\theta)/\partial\theta_1\\
\partial q_1(\theta)/\partial\theta_2+\partial q_2(\theta_2)/\partial\theta_2\\
\partial q_1(\pi(\theta),\theta_2)/\partial\pi_\iota
\end{bmatrix}
\tag{A.5}
\]
is asymptotically equivalent to mGMM-QLIML by reasoning similar to Lemma C.1. The equivalence condition follows by applying the BQSW redundancy condition to (A.5). To see the sufficiency of the GIMEs for the reduced-form model, note that the GIMEs
\[
V\!\left(\frac{\partial q_1^o}{\partial(\pi',\theta_2')'}\right)
=-E\!\left[\frac{\partial^2 q_1}{\partial(\pi',\theta_2')'\,\partial(\pi',\theta_2')}\right]\Bigg|_{(\pi',\theta_2')'=(\pi_o',\theta_{o2}')'}
\qquad\text{and}\qquad
V\!\left(\frac{\partial q_2^o}{\partial\theta_2}\right)
=-E\!\left[\frac{\partial^2 q_2}{\partial\theta_2\,\partial\theta_2'}\right]\Bigg|_{\theta_2=\theta_{o2}}
\]
imply
\[
\operatorname{cov}\!\left(\frac{\partial q_1^o}{\partial(\pi',\theta_2')'},\,\frac{\partial q_2^o}{\partial\theta_2}\right)=0
\]
and the GIMEs for the structural model. The result then follows by some algebra.

APPENDIX C

AN APPENDIX FOR CHAPTER 3

C.1 Appendix: Proofs

Many proof ideas and steps used in Theorems 3.4.7–3.4.15 are similar to Wang, Wu and Li (2012) and Sherwood and Wang (2016). In the following, $C$ denotes a constant that does not depend on $N$; it is allowed to take different values in different places.

C.1.1 Proof of Theorem 3.3.1

The model structure (or admissible structure) $S$ under consideration can be defined as follows. Denote $\varepsilon_{it}=y_{it}-w_{it}\tilde\beta-\tilde g(x_i,z_i)$. Let
\[
S=\Big\{(\tilde\beta,\tilde g,\tilde F_{\{\varepsilon_{it},x_{it}\}_{t=1}^T,z_i}):\;
\tilde\beta\in\mathbb{R}^{K_1+T-1},\;
\tilde g:\mathbb{R}^{K_1T+K_2}\to\mathbb{R}
\text{ is measurable},\;
\tilde F_{\{\varepsilon_{it},x_{it}\}_{t=1}^T,z_i}\text{ is a distribution function on }\mathbb{R}^{K_1T+K_2+T}
\]
such that $Q_\tau(\varepsilon_{it}\mid x_i,z_i)=0$ for each $t$, and the support condition on $(w_{it})_{t=1}^T$ given in the premise is satisfied$\Big\}$.

Suppose $(\beta^*,g^*,F^*_{\{\varepsilon_{it},x_{it}\}_{t=1}^T,z_i})\in S$. Then, for each $t=2,\dots,T$, we can write
\[
Q_\tau\big[y_{it}\mid x_i,z_i;\,\beta^*,g^*,F^*\big]=w_{it}\beta^*+g^*(x_i,z_i)
\tag{A.1}
\]
\[
Q_\tau\big[y_{i(t-1)}\mid x_i,z_i;\,\beta^*,g^*,F^*\big]=w_{i(t-1)}\beta^*+g^*(x_i,z_i)
\tag{A.2}
\]
Note that the conditional quantiles of $y_{it}$ and $y_{i(t-1)}$ are unique. By taking differences across time periods, we have
\[
Q_\tau\big[y_{it}\mid x_i,z_i\big]-Q_\tau\big[y_{i(t-1)}\mid x_i,z_i\big]
=\big(w_{it}-w_{i(t-1)}\big)\beta^*.
\tag{A.3}
\]
The full rank condition on (3.11) will be shown to imply point identification of $\beta^*$. First, note that there exists a square invertible submatrix of (3.11); let this matrix be $\bar W$. By continuity of the matrix determinant, there exists a neighborhood around $(\tilde x_{it}^{(j)},\tilde x_{i(t-1)}^{(j)})_{j=1}^J$ each of whose elements, along with $(\dot x_{it}^{(j)},\dot x_{i(t-1)}^{(j)})_{j=1}^J$ and the time dummies, constitutes a perturbed version of $\bar W$ that is still invertible. Since
\[
f_{(\tilde x_{it},\tilde x_{i(t-1)})\mid(\dot x_{it},\dot x_{i(t-1)})}
\big(\tilde x_{it}^{(j)},\tilde x_{i(t-1)}^{(j)}\mid\dot x_{it}^{(j)},\dot x_{i(t-1)}^{(j)}\big)>0\quad\forall j,
\]
and it is continuously extendable, the probability of observing such a collection of support points is positive. (Equivalently, a change in $\beta^*$ implies a nontrivial change in $F_{\{y_{it},x_{it}\}_{t=1}^T,z_i}$.) Hence the proof.

C.1.2 Proof of Theorems 3.4.3 and 3.4.4

Define $\theta_A=(\beta',\gamma_A')'$, $\mathring w_{it}=\frac{1}{\sqrt N}\tilde w_{it}^A$, and $\Psi(e)=\big(\Psi(e_1)',\dots,\Psi(e_N)'\big)'$. Also, $\|A\|=\sqrt{\lambda_{\max}(A'A)}$ denotes the spectral norm for a matrix $A$, $\|v\|$ denotes the Euclidean norm for a vector $v$, and write $E_w(\cdot)=E(\cdot\mid x_i,z_i)$ and $P_w(\cdot)=P(\cdot\mid x_i,z_i)$.
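The objective functions in this subsection are built from the quantile-regression check function $\rho_\tau(u)=u(\tau-1[u<0])$. As a quick, self-contained illustration (not part of the dissertation), the $\tau$-th sample quantile minimizes the empirical check-function loss:

```python
import numpy as np

def rho(u, tau):
    """Check function rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (u < 0))

rng = np.random.default_rng(3)
y = rng.normal(size=1000)
tau = 0.25

# Minimize the empirical check loss over a fine grid of candidate quantiles.
grid = np.linspace(-3, 3, 2001)
losses = np.array([rho(y - q, tau).sum() for q in grid])
q_star = grid[losses.argmin()]

# The minimizer matches the tau-th sample quantile (up to grid/interpolation error).
assert abs(q_star - np.quantile(y, tau)) < 0.05
```

The same loss, evaluated at $y_{it}-\tilde w_{it}^A\theta_A$, is what the oracle estimator below minimizes.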
Consider the following reparameterized objective function:
\[
\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\big(e_{it}-\mathring w_{it}\delta\big)
=\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\left(y_{it}-\tilde w_{it}^A\Big(\theta_{oA}+\frac{1}{\sqrt N}\delta\Big)\right).
\]
Let $\hat\delta$ be the reparameterized oracle estimator
\[
\hat\delta=\arg\min_\delta\;\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\big(e_{it}-\mathring w_{it}\delta\big),
\]
where $\hat\delta=\sqrt N\big(\hat\theta_A-\theta_{oA}\big)$ holds. Its Bahadur representation can be written as
\[
\tilde\delta=\Big(\tfrac1N\tilde W_A'B_N\tilde W_A\Big)^{-1}\frac{1}{\sqrt N}\tilde W_A'\Psi(\varepsilon).
\]

Lemma C.1.1 (i) If Assumptions 1–4 hold, then $\|\tilde\delta\|=O_p(\sqrt{q_N})$. (ii) If Assumptions 1–5 hold, then $G_N\Sigma_N^{-1/2}\tilde\delta\xrightarrow{d}N(0,G)$.

Proof. (i) Since
\[
\lambda_{\max}\Big(\tfrac1N\tilde W_A'B_N\tilde W_A\Big)
\le\lambda_{\max}(B_N)\,\lambda_{\max}\Big(\tfrac1N\tilde W_A'\tilde W_A\Big)
\]
is bounded above, and similarly $\lambda_{\min}\big(\tfrac1N\tilde W_A'B_N\tilde W_A\big)$ can be shown to be bounded below by some positive constant, we have
\[
\|\tilde\delta\|
=\left\|\Big(\tfrac1N\tilde W_A'B_N\tilde W_A\Big)^{-1}\frac{1}{\sqrt N}\tilde W_A'\Psi(\varepsilon)\right\|
\le C\left\|\frac{1}{\sqrt N}\tilde W_A'\Psi(\varepsilon)\right\|
=O_p(\sqrt{q_N}).
\]
(ii) Write
\[
G_N\Sigma_N^{-1/2}\tilde\delta
=G_N\Sigma_N^{-1/2}K_N^{-1}N^{-1/2}\tilde W_A'\Psi(\varepsilon),
\qquad
D_{Nit}\equiv G_N\Sigma_N^{-1/2}K_N^{-1}N^{-1/2}\tilde w_{it}^{A\prime}.
\]
G GN N N 110 0 5 GN To check Lindeberg-Feller condition, fix " > 0: By Assumption 3, 4, and 5, 2 33 4 2 2 N T N T T X X X 6 X X 1 7 E DN i t E4 DN i t 1 4 DN i t > "55 Ä 2 " i D1 Ä Ä t D1 1 N 2 "2 C N 2 "2 N X T X T ˇ X ˇ @ E ˇ ."i t / tD1 t 0 D1 0 E@ T X T X tD1 t 0 D1 i D1 12 1=2 1 QA w i t KN †N tD1 1=2 0 G † GN N N N T X C X 1=2 1 Q A0 Ä 2 2 w E GN †N KN it0 N " 0 i D1 t D1 0 1 4 N T X C B1 X C QA Ä E w A D Op @ i t 2 N N" i D1 t D1 12 ˇ ˇA 1=2 0 1=2 1 Q A0 QA GN GN †N KN1 w "i t 0 w i t KN †N it0ˇ 0 i D1 N X i D1 tD1 4 A Q A0 KN1 w it0 N T 4 X C X 1=2 4 1 Q A0 KN w Ä 2 2 E kGN k †N it0 N " 0 i D1 2 qN N 4 t D1 ! D op .1/ 0 G 0 where the last inequality follows from max GN N D max GN GN ! c as N ! 1: Lemma C.1.2 (i) Assume Assumption 1–6. Then, for any finite constant M; ˇ ˇ ˇ ˇN i h Ái 1 h ˇ ˇX 0 0 ı KN ı ıQ KN ıQ .1 C o .1//ˇˇ D op .1/ sup ˇˇ Ew QQ i ı; ıQ 2 ˇ ˇ Q ı ı ÄM i D1 Á P h where QQ i ı; ıQ D TtD1 .ei t L i t ı/ w ei t L i t ıQ w Ái Á Á L i t ık D O N 1=2 qN L i t ıQ D Op N 1=2 qN by Lemma C.1.1 (i), and kw Proof. Note that w by construction. Also, we have .ei t L i t ı/ D w ."i t L i t ı/ by Assumption 6. Then, w applying Knight’s identity, N X h Ew QQ i ı; ıQ Ái D i D1 D Ew h ."i t L i t ı/ w "i t L i t ıQ w Ái iD1 tD1 N X T Z w L it ı X L i t ıQ i D1 tD1 w 1h 0 D ı KN ı 2 N X T X .Fi t .s/ Ä N T 1 XX L i t ı/2 Fi t .0// ds D fi t .0/ .w 2 i D1 tD1 Q0 ı KN ıQ i 1 C op .1/ 111 L i t ıQ w Á2 1 C op .1/ L i t ıQ where the third inequality is followed from w L i t ık D o .1/ : Hence the D op .1/ and kw result. Lemma C.1.3 Assume Assumption 1–6. Then, for any given positive constant M; ˇ ˇ ˇX Áˇˇ ˇN Ai ı; ıQ ˇˇ D op .1/ sup ˇˇ ˇ ı ıQ ÄM ˇi D1 Á Á Q Q Q where Ai ı; ı D Qi ı; ı h Á i P Á T Q Q Q L it ı ı .ei t / : E Qi ı; ı jxi ; zi C tD1 w q q L Proof. By Assumption 4, max kwi t k Ä ˛1 NN for some positive constant ˛1 : (Fn1 in Sherwood i;t and Wang (2016) has probability one here.) 
It suffices to show that for all $\epsilon>0$,
\[
P\left(\sup_{\|\delta-\tilde\delta\|\le M}\left|\sum_{i=1}^N A_i(\delta,\tilde\delta)\right|>\epsilon\right)\to 0.
\]
Let $\bar\Theta=\{\delta\in\mathbb{R}^{q_N}:\|\delta-\tilde\delta\|\le M\}$. We can partition $\bar\Theta$ into disjoint sets $\bar\Theta_1,\dots,\bar\Theta_{D_N}$ such that the diameter of each set does not exceed
\[
m_0=\frac{\epsilon}{10\,T\alpha_1\sqrt{Nq_N}},
\]
and the cardinality of the partition satisfies $D_N\le\big(C\sqrt{Nq_N}/\epsilon\big)^{q_N}$ (for example, by an argument similar to Lemma 5.2 of Vershynin, 2011). Pick an arbitrary $\delta_d\in\bar\Theta_d$ for $1\le d\le D_N$. Then
\[
P\left(\sup_{\|\delta-\tilde\delta\|\le M}\left|\sum_{i=1}^N A_i(\delta,\tilde\delta)\right|>\epsilon\right)
\le\sum_{d=1}^{D_N}P\left(\left|\sum_{i=1}^N A_i(\delta_d,\tilde\delta)\right|
+\sup_{\delta\in\bar\Theta_d}\left|\sum_{i=1}^N\big[A_i(\delta,\tilde\delta)-A_i(\delta_d,\tilde\delta)\big]\right|>\epsilon\right).
\]
Since $u\,1[u<0]=\frac12u-\frac12|u|$, we have
\[
\big|\tilde Q_i(\delta,\tilde\delta)-\tilde Q_i(\delta_d,\tilde\delta)\big|
=\left|\sum_{t=1}^T\big[\rho_\tau(e_{it}-\mathring w_{it}\delta)-\rho_\tau(e_{it}-\mathring w_{it}\delta_d)\big]\right|
\le 2T\max_{i,t}\|\mathring w_{it}\|\,\sup_{\delta\in\bar\Theta_d}\|\delta-\delta_d\|.
\]
Thus
\begin{align*}
\sup_{\delta\in\bar\Theta_d}\left|\sum_{i=1}^N\big[A_i(\delta,\tilde\delta)-A_i(\delta_d,\tilde\delta)\big]\right|
&\le 5NT\max_{i,t}\|\mathring w_{it}\|\,\sup_{\delta\in\bar\Theta_d}\|\delta-\delta_d\|
\le 5NT\alpha_1\sqrt{\frac{q_N}{N}}\,m_0=\frac{\epsilon}{2}.
\end{align*}
Therefore, it now suffices to show that $\sum_{d=1}^{D_N}P_w\big(\big|\sum_{i=1}^N A_i(\delta_d,\tilde\delta)\big|>\epsilon/2\big)$ has a vanishing upper bound that does not depend on $(x_i,z_i)$. Bernstein's inequality is used.
To evaluate the maximum, using $\rho_\tau(u)=\big(\tau-\frac12\big)u+\frac12|u|$, we can write
\[
\max_{i,d}\big|A_i(\delta_d,\tilde\delta)\big|
\le 3T\max_{i,t}\|\mathring w_{it}\|\,\max_d\|\delta_d-\tilde\delta\|
\le C\sqrt{\frac{q_N}{N}}.
\]
To evaluate the variance, applying Knight's identity, we have
\[
\tilde Q_i(\delta_d,\tilde\delta)+\sum_{t=1}^T\mathring w_{it}\big(\delta_d-\tilde\delta\big)\psi_\tau(e_{it})
=\sum_{t=1}^T\int_{\mathring w_{it}\tilde\delta}^{\mathring w_{it}\delta_d}\big(1[e_{it}<s]-1[e_{it}<0]\big)\,ds.
\]
Thus
\[
A_i(\delta_d,\tilde\delta)
=\sum_{t=1}^T\int_{\mathring w_{it}\tilde\delta}^{\mathring w_{it}\delta_d}\big(1[e_{it}<s]-1[e_{it}<0]\big)\,ds
-E_w\left[\sum_{t=1}^T\int_{\mathring w_{it}\tilde\delta}^{\mathring w_{it}\delta_d}\big(1[e_{it}<s]-1[e_{it}<0]\big)\,ds\right],
\]
and it implies
\begin{align*}
\sum_{i=1}^N\operatorname{Var}\big(A_i(\delta_d,\tilde\delta)\mid x_i,z_i\big)
&\le\sum_{i=1}^N E_w\left[\Big(\sum_{t=1}^T\int_{\mathring w_{it}\tilde\delta}^{\mathring w_{it}\delta_d}\big(1[e_{it}<s]-1[e_{it}<0]\big)\,ds\Big)^2\right]\\
&\le C\sqrt{\frac{q_N}{N}}\sum_{i=1}^N\sum_{t=1}^T E_w\left[\max_{i,t}\big|\mathring w_{it}\delta_d-\mathring w_{it}\tilde\delta\big|\int_{\mathring w_{it}\tilde\delta}^{\mathring w_{it}\delta_d}\big(F_{it}(s)-F_{it}(0)\big)\,ds\right]
\le C\sqrt{\frac{q_N}{N}}\,(1+o(1))
\end{align*}
by a similar argument as in Lemma C.1.2. Then, by Bernstein's inequality and Assumption 5,
\[
\sum_{d=1}^{D_N}P_w\left(\left|\sum_{i=1}^N A_i(\delta_d,\tilde\delta)\right|>\frac{\epsilon}{2}\right)
\le 2\sum_{d=1}^{D_N}\exp\left(-\frac{\epsilon^2/4}{C\sqrt{q_N/N}+C\epsilon\sqrt{q_N/N}}\right)
\le C\exp\left(Cq_N\log N-C\sqrt{\frac{N}{q_N}}\right)\to 0.
\]
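Knight's identity, used repeatedly in these proofs, states that $\rho_\tau(u-v)-\rho_\tau(u)=-v\,\psi_\tau(u)+\int_0^v\big(1[u\le s]-1[u\le 0]\big)\,ds$ with $\psi_\tau(u)=\tau-1[u<0]$. A minimal numerical check (illustrative only; the midpoint-rule integrator is an assumption of the sketch, not part of the dissertation):

```python
import numpy as np

def rho(u, tau):
    """Check function rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (u < 0))

def psi(u, tau):
    """Directional derivative term psi_tau(u) = tau - 1[u < 0]."""
    return tau - (u < 0)

def knight_rhs(u, v, tau, n=200000):
    """Right-hand side of Knight's identity; integral by the midpoint rule."""
    s = (np.arange(n) + 0.5) * (v / n)          # midpoints from 0 to v (signed)
    integral = ((u <= s).astype(float) - (u <= 0)).sum() * (v / n)
    return -v * psi(u, tau) + integral

tau = 0.3
for u, v in [(1.3, 0.4), (-0.7, 1.1), (0.2, -0.9), (-0.1, -0.5)]:
    lhs = rho(u - v, tau) - rho(u, tau)
    assert abs(lhs - knight_rhs(u, v, tau)) < 1e-3
```

The identity splits the change in check loss into a linear score term and a nonnegative "curvature" integral, which is exactly how the quadratic approximations in Lemmas C.1.2 and C.1.3 arise.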
Lemma C.1.4 (asymptotic equivalence with the Bahadur representation) Assume Assumptions 1–6. Then we have $\tilde\delta-\hat\delta=o_p(1)$.

Proof. It suffices to show that, for any positive constant $M$,
\[
P\left(\inf_{\|\delta-\tilde\delta\|\ge M}\sum_{i=1}^N\tilde Q_i(\delta,\tilde\delta)>0\right)\to 1,
\]
since we have $\sum_{i=1}^N\tilde Q_i(\hat\delta,\tilde\delta)\le 0$. By Lemma C.1.3,
\[
\sup_{\|\delta-\tilde\delta\|\le M}
\left|\sum_{i=1}^N\Big[\tilde Q_i(\delta,\tilde\delta)-E_w\big[\tilde Q_i(\delta,\tilde\delta)\big]\Big]
+\sum_{i=1}^N\sum_{t=1}^T\mathring w_{it}\big(\delta-\tilde\delta\big)\psi_\tau(e_{it})\right|=o_p(1).
\]
Then, by Lemma C.1.2,
\[
\sup_{\|\delta-\tilde\delta\|\le M}
\left|\sum_{i=1}^N\tilde Q_i(\delta,\tilde\delta)
-\frac12\Big[\delta'K_N\delta-\tilde\delta'K_N\tilde\delta\Big]\big(1+o_p(1)\big)
+\sum_{i=1}^N\sum_{t=1}^T\mathring w_{it}\big(\delta-\tilde\delta\big)\psi_\tau(e_{it})\right|=o_p(1).
\tag{A.5}
\]
And since
\[
\tilde\delta=\Big(\tfrac1N\tilde W_A'B_N\tilde W_A\Big)^{-1}\frac{1}{\sqrt N}\tilde W_A'\Psi(e)
=K_N^{-1}\frac{1}{\sqrt N}\tilde W_A'\Psi(e),
\]
we have
\[
\sum_{i=1}^N\sum_{t=1}^T\mathring w_{it}\big(\delta-\tilde\delta\big)\psi_\tau(e_{it})
=\big(\delta-\tilde\delta\big)'\frac{1}{\sqrt N}\tilde W_A'\Psi(e)
=\big(\delta-\tilde\delta\big)'K_N\tilde\delta.
\tag{A.6}
\]
Combining (A.5) and (A.6), we have
\[
\sup_{\|\delta-\tilde\delta\|\le M}
\left|\sum_{i=1}^N\tilde Q_i(\delta,\tilde\delta)
-\frac12\Big[\delta'K_N\delta-\tilde\delta'K_N\tilde\delta\Big]\big(1+o_p(1)\big)
+\big(\delta-\tilde\delta\big)'K_N\tilde\delta\right|=o_p(1),
\]
which implies
\[
\sup_{\|\delta-\tilde\delta\|\le M}
\left|\sum_{i=1}^N\tilde Q_i(\delta,\tilde\delta)
-\frac12\big(\delta-\tilde\delta\big)'K_N\big(\delta-\tilde\delta\big)\right|=o_p(1).
\]
Then, for any finite constant M1 and M2 ; ˇ ˇ ˇN ˇ h Ái h i X ˇ ˇ 1 0 0 ˇ R N ı a ıQ a K R N ıQ a .1 C o .1//ˇ Ew QQ i ı a ; ıQ a ; ıb ıa K sup ˇ ˇ 2 p ˇ ı a ıQ a ÄM1 ;kıb kÄM2 qN ˇi D1 is op .1/ where T Á X Q Q Qi ı a ; ı a ; ıb D Œ ."i t C ri R ait ı a w R a0 ;w iT ; R a0 ;w NT wR ibt ıb / t D1 RaD w R a0 W 11 ; RN DW R 0a BN W Ra K 117 0 ."i t C ri R ait ıQ a w wR ibt ıb / Á L i t ıQ D Op N 1=2 qN is implied by Lemma C.1.1 (i), and kw L i t ık D Proof. Note that w Á Á R ait ı a D R ait ıQ a D Op N 1=2 qN and w O N 1=2 qN by construction. Similarly, we have w Á Op N 1=2 qN : Then, applying Knight’s identity, N X h Ái Ew QQ i ı a ; ıQ a ; ıb iD1 D D D D D N X T X Ew Œ ."i t C ri R ait ı a w wR b ıb / ."i t C ri R ait ıQ a w wR ibt ıb / i D1 tD1 N X T Z w R a ı a CwR b ıb ri X it .Fi t .s/ Fi t .0// ds a ıQ CwR ı r R w a b b i i D1 tD1 it Ä N T Á2 Á2 1 XX R ait ı a C wR ibt ıb ri R ait ıQ a C wR ibt ıb ri w 1 C op .1/ fi t .0/ w 2 iD1 tD1 N T h i 1 XX R ait ı a /2 .w R ait ıQ a /2 C 2.wR ibt ıb ri /w R ait .ı a ıQ a / 1 C op .1/ fi t .0/ .w 2 iD1 tD1 N X T i X 1h 0 R 0 R Q 0 Q Q R a0 .ı a ı a / ı KN ı a ı a KN ı a 1 C op .1/ fi t .0/ ri w it 2 a i D1 tD1 Note that N X T X R ait .ı a fi t .0/ ri w ıQ a / iD1 tD1 Ä N T Á 1 1 XX Q a .ı a ıQ a / Q ait WAb0 BN W Dp fi t .0/ ri w WAb0 BN WAb A N iD1 tD1 Á 1 1 a b0 b R a C op .1/ where  R a denotes the Q a  D p1  p Q and that Œw WA BN WA WAb0 BN W it A N it N it R a is well-defined by Assumption 5. To show population projection error. 
 it N T 1 XX R ait .ı a sup fi t .0/ ri w p N i D1 tD1 p ı a ıQ a ÄM1 ;kıb kÄM2 qN , note that by Markov inequality, we have ˇ 0ˇ ˇ ˇ N T X X ˇ 1 ˇ a R .ı a ıQ a /ˇ fi t .0/ ri  P @ˇˇ p it ˇ ˇ N i D1 tD1 ˇ 1 "A Ä 118 ıQ a / D op .1/ h PN PT Ra E p1 i D1 tD1 fi t .0/ ri i t .ı a N "2 i2 ıQ a / where 2 N X T X 32 1 R ait .ı a ıQ a /5 E 4p fi t .0/ ri w N iD1 tD1 3 0 2 2 N X N T T X 1 XX 1 a Q 5 @ 4 4 R i t .ı a ı a / C E p R ait .ı a fi t .0/ ri w fi t .0/ ri w DV p N i D1 tD1 N i D1 tD1 The first part: 2 N T 1 XX 4 R ait .ı a V p fi t .0/ ri w N iD1 tD1 3 2 ıQ a /5 D V 4 T X 312 ıQ a /5A 3 R ait .ı a fi t .0/ ri w ıQ a /5 t D1 ÄC ÄC ÄC T X tD1 T X tD1 T X h R ait .ı a V fi t .0/ ri w i ıQ a / h R ait .ı a E fi t .0/ ri w i2 ıQ a / h R ait .ı a E fi t .0/ w i2 ıQ a / .sup jri j/2 tD1 ! 0; The second part: ˇ 2 ˇ N X T X ˇ ˇE 4 p1 R ait .ı a fi t .0/ ri w ˇ N ˇ i D1 t D1 3ˇ ˇ ˇ ˇ T h ˇ ˇp X Qı a /5ˇ D ˇ N R ait .ı a E fi t .0/ ri w ˇ ˇ ˇ ˇ tD1 Ä p N T X ˇ ˇ R ait .ı a E ˇfi t .0/ ri w ˇ iˇˇ ıQ a / ˇˇ ˇ ˇ ˇ ıQ a /ˇ t D1 Ä p N sup jri j T X ˇ ˇ R ait .ı a E ˇfi t .0/ w tD1 Hence the result. Lemma C.1.6 Assume Assumption 1–5, 7 and 8. Then, for any positive constant L; ˇ ˇ ˇN ˇ X ˇ ˇ p 1 qN sup ˇˇ Di .ı ab ; qN /ˇˇ D op .1/ ˇ kı ab kÄL ˇi D1 119 ˇ ˇ ıQ a /ˇ where Qi .qN / D T X R ait ı a qN w ."i t C ri qN wR ibt ıb / tD1 Di .ı; qN / D Qi .qN / Qi .0/ R ait ı a C wR b ıb C qN w Ew ŒQi .qN / Qi .0/ ."i t / Proof. It suffices to show for all " > 0; ˇ ˇ 1 0 ˇ ˇX N ˇ ˇ p P @qN1 sup ˇˇ Di .ı ab ; L qN /ˇˇ > "A ! 
First, consider a constant $\alpha_1$ such that $\max\!\left(\left\|\check w^a_{it}\right\|,\left\|\check w^b_{it}\right\|\right)\le\alpha_1\left(N^{-1}q_N\right)^{1/2}$. Partition $B=\{\delta:\|\delta\|\le1\}$ into $M_N$ disjoint sets $B_1,\dots,B_{M_N}$, each of diameter less than $m_0=\frac{\varepsilon}{4\alpha_1L\sqrt N}$, where $M_N\le\left(C+C\frac{\sqrt N}{\varepsilon}\right)^{q_N+1}$. Let $d_m=\left(d_m^a,d_m^b\right)\in B_m$ for $1\le m\le M_N$. Then
$$P\!\left(q_N^{-1}\sup_{\delta_{ab}\in B}\left|\sum_{i=1}^N D_i\!\left(\delta_{ab},L\sqrt{q_N}\right)\right|>\varepsilon\right)\le\sum_{m=1}^{M_N}P\!\left(\left|\sum_{i=1}^N D_i\!\left(d_m,L\sqrt{q_N}\right)\right|+\sup_{\delta_{ab}\in B_m}\left|\sum_{i=1}^N\left[D_i\!\left(\delta_{ab},L\sqrt{q_N}\right)-D_i\!\left(d_m,L\sqrt{q_N}\right)\right]\right|>\varepsilon q_N\right).$$
Since $\rho_\tau(u)=\left(\tau-\frac12\right)u+\frac12|u|$, each difference of check-function terms is Lipschitz in the index, so that, deterministically,
$$\sup_{\delta_{ab}\in B_m}\left|\sum_{i=1}^N\left[D_i\!\left(\delta_{ab},L\sqrt{q_N}\right)-D_i\!\left(d_m,L\sqrt{q_N}\right)\right]\right|\le2NLm_0\sqrt{q_N}\max_{i,t}\left(\left\|\check w^a_{it}\right\|,\left\|\check w^b_{it}\right\|\right)\le2\alpha_1L\sqrt N\,q_N\,m_0=\frac{\varepsilon q_N}2.$$
Now, it suffices to show
$$\sum_{m=1}^{M_N}P\!\left(\left|\sum_{i=1}^N D_i\!\left(d_m,L\sqrt{q_N}\right)\right|>\frac{\varepsilon q_N}2\right)\to0.$$
Bernstein's inequality will be used.
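The tail bound to be applied is the classical Bernstein inequality for independent, mean-zero variables bounded by $M$ with total variance $\sigma^2$: $P\!\left(\left|\sum_i X_i\right|>t\right)\le2\exp\!\left(-\frac{t^2/2}{\sigma^2+Mt/3}\right)$. A minimal Monte Carlo sketch with hypothetical bounded variables (not the $D_i$ of the proof), confirming that the bound dominates the empirical tail:

```python
import numpy as np

def bernstein_bound(t, var_total, M):
    # P(|sum X_i| > t) <= 2 exp(-(t^2/2) / (var_total + M*t/3))
    return 2.0 * np.exp(-0.5 * t**2 / (var_total + M * t / 3.0))

rng = np.random.default_rng(0)
n, reps, t = 100, 20000, 20.0
X = rng.uniform(-1.0, 1.0, size=(reps, n))   # mean 0, |X| <= 1, Var = 1/3 each
tail_freq = np.mean(np.abs(X.sum(axis=1)) > t)
bound = bernstein_bound(t, var_total=n / 3.0, M=1.0)
print(tail_freq <= bound)  # True: the empirical tail sits below the bound
```

The variance term in the denominator is what makes the exponent scale like $\sqrt{N/q_N}$ in the display that follows: the variance bound, not the worst-case range, drives the rate.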
To evaluate the maximum, note
$$\max_i\left|D_i\!\left(d_m,L\sqrt{q_N}\right)\right|\le\max_i\left|\rho_\tau\!\left(\varepsilon_{it}+r_i-L\sqrt{q_N}\,\check w^a_{it}\delta_a-L\sqrt{q_N}\,\check w^b_{it}\delta_b\right)-\rho_\tau\!\left(\varepsilon_{it}+r_i\right)\right|+\max_i\left|L\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)\psi_\tau(\varepsilon_{it})\right|\le2L\sqrt{q_N}\max_{i,t}\left(\left\|\check w^a_{it}\right\|,\left\|\check w^b_{it}\right\|\right)\le Cq_NN^{-1/2}.$$
To evaluate an upper bound of the variance, first note that, by Knight's identity,
$$Q_i\!\left(L\sqrt{q_N}\right)-Q_i(0)+L\sqrt{q_N}\sum_{t=1}^T\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)\psi_\tau(\varepsilon_{it})$$
$$=L\sqrt{q_N}\sum_{t=1}^T\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)\left[1\!\left(\varepsilon_{it}+r_i<0\right)-1\!\left(\varepsilon_{it}<0\right)\right]+\sum_{t=1}^T\int_0^{L\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)}\left[1\!\left(\varepsilon_{it}+r_i<s\right)-1\!\left(\varepsilon_{it}+r_i<0\right)\right]ds\equiv V_{i1}+V_{i2}.$$
Then, the second-order conditional moments can be bounded as
$$\sum_{i=1}^NE_w\!\left[V_{i1}^2\right]\le\sum_{i=1}^N2L^2q_N\max\!\left(\left\|\check w^a_{it}\right\|,\left\|\check w^b_{it}\right\|\right)^2E_w\!\left[1\!\left(0\le|\varepsilon_{it}|\le|r_i|\right)\right]\le Cq_N^2N^{-1}\sum_{i=1}^N\int_{-|r_i|}^{|r_i|}f_{it}(s)\,ds\le Cq_N^2\sqrt{\frac{q_N}N}$$
and
$$\sum_{i=1}^NE_w\!\left[V_{i2}^2\right]\le Cq_NN^{-1/2}\sum_{i=1}^N\int_0^{L\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)}\left[F_{it}(s-r_i)-F_{it}(-r_i)\right]ds\le Cq_N^2N^{-1/2}\left[\delta_a'\sum_{i=1}^Nf_{it}(0)\,\check w^{a\prime}_{it}\check w^a_{it}\,\delta_a+\sum_{i=1}^Nf_{it}(0)\left(\check w^b_{it}\delta_b\right)^2\right](1+o(1))\le Cq_N^2N^{-1/2}(1+o(1)).$$
Since the bounds do not depend on $\tilde w$, we have
$$\sum_{i=1}^N\mathrm{Var}\!\left(D_i\!\left(d_m,L\sqrt{q_N}\right)\right)\le Cq_N^2\sqrt{\frac{q_N}N}.$$
Then, Bernstein's inequality implies
$$\sum_{m=1}^{M_N}P_s\!\left(\left|\sum_{i=1}^ND_i\!\left(d_m,L\sqrt{q_N}\right)\right|>\frac{q_N\varepsilon}2\right)\le2\sum_{m=1}^{M_N}\exp\!\left(-\frac{q_N^2\varepsilon^2/4}{Cq_N^2\sqrt{q_N/N}+C\varepsilon q_N^2N^{-1/2}}\right)\le2M_N\exp\!\left(-C\sqrt{\frac N{q_N}}\right)\le C\exp\!\left(C(q_N+1)\log N-C\sqrt{\frac N{q_N}}\right)\to0.$$
Lemma C.1.7 Assume Assumptions 1–5, 7 and 8. Then, for all $\eta>0$, there exists an $L>0$ such that
$$P\!\left(\inf_{\|\delta_{ab}\|=L}q_N^{-1}\sum_{i=1}^N\left(Q_i\!\left(\sqrt{q_N}\right)-Q_i(0)\right)>0\right)\ge1-\eta.$$
Proof.
Consider
$$q_N^{-1}\sum_{i=1}^N\left(Q_i\!\left(\sqrt{q_N}\right)-Q_i(0)\right)=q_N^{-1}\sum_{i=1}^ND_i\!\left(\delta_{ab},\sqrt{q_N}\right)+q_N^{-1}\sum_{i=1}^NE_w\!\left[Q_i\!\left(\sqrt{q_N}\right)-Q_i(0)\right]-q_N^{-1/2}\sum_{i=1}^N\sum_{t=1}^T\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)\psi_\tau(\varepsilon_{it})\equiv G_{N1}+G_{N2}+G_{N3}.$$
By Lemma C.1.6, we have $\sup_{\|\delta_{ab}\|\le L}|G_{N1}|=o_p(1)$. Also, note that $E\!\left[G_{N3}\right]=0$ and that
$$E\!\left[G_{N3}^2\right]\le Cq_N^{-1}E\!\left[\delta_a'\check W_a'\check W_a\delta_a+\delta_b^2\,\check W_b'\check W_b\right]=O\!\left(q_N^{-1}\right)\|\delta_{ab}\|^2.$$
Thus, $G_{N3}=O_p\!\left(q_N^{-1/2}\right)\|\delta_{ab}\|$. By applying Knight's identity,
$$G_{N2}=q_N^{-1}\sum_{i,t}E_w\!\left[\int_{-r_i}^{\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)-r_i}\left[1\!\left(\varepsilon_{it}<s\right)-1\!\left(\varepsilon_{it}<0\right)\right]ds\right]=q_N^{-1}\sum_{i,t}f_{it}(0)\int_{-r_i}^{\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)-r_i}s\,ds\,(1+o(1))$$
$$=\frac1{2q_N}\sum_{i,t}f_{it}(0)\left[q_N\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)^2-2r_i\sqrt{q_N}\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right)\right](1+o(1))$$
$$\ge C\,\delta_a'\check W_a'B_N\check W_a\,\delta_a\,(1+o(1))+C\delta_b^2\,(1+o(1))-Cq_N^{-1/2}\sum_{i,t}f_{it}(0)\,r_i\left(\check w^a_{it}\delta_a+\check w^b_{it}\delta_b\right).$$
Note that there exists a constant $M$ such that $\delta_a'\check W_a'B_N\check W_a\,\delta_a+\delta_b^2\ge M\|\delta_{ab}\|^2$. Let $R_N=(r_1,\dots,r_N)'\in\mathbb R^N$; then $\|R_N\|=O\!\left(\sqrt{q_N}\right)$. By the Cauchy–Schwarz inequality,
$$q_N^{-1/2}\sum_{i,t}f_{it}(0)\,r_i\,\check w^a_{it}\delta_a=q_N^{-1/2}\,\delta_a'\check W_a'B_NR_N\le q_N^{-1/2}\left\|\delta_a'\check W_a'\right\|\left\|B_NR_N\right\|=O_p\!\left(q_N^{-1/2}q_N^{1/2}\right)\|\delta_a\|=O_p\!\left(\|\delta_{ab}\|\right).$$
Similarly,
$$q_N^{-1/2}\sum_{i,t}f_{it}(0)\,r_i\,\check w^b_{it}\delta_b=q_N^{-1/2}\,\delta_b\check W_b'B_NR_N\le q_N^{-1/2}\left\|\delta_b\check W_b'B_N^{1/2}\right\|\left\|B_N^{1/2}R_N\right\|=O_p\!\left(\|\delta_{ab}\|\right).$$
Therefore, for $L$ sufficiently large, $q_N^{-1}\sum_{i=1}^N\left(Q_i\!\left(\sqrt{q_N}\right)-Q_i(0)\right)$ has asymptotically a positive lower bound $cL^2$ on $\|\delta_{ab}\|=L$: the quadratic term dominates the terms that are linear in $\|\delta_{ab}\|$.

Lemma C.1.8 Assume Assumptions 1–5, 7 and 8.
Then, for any given positive constants $M_1$ and $M_2$,
$$\sup_{\|\delta_a-\tilde\delta_a\|\le M_1,\ \|\delta_b\|\le M_2\sqrt{q_N}}\left|\sum_{i=1}^NA_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\right|=o_p(1),$$
where $A_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)=\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-E\!\left[\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\,\middle|\,x_i,z_i\right]+\sum_{t=1}^T\check w^a_{it}\left(\delta_a-\tilde\delta_a\right)\psi_\tau(\varepsilon_{it})$.

Proof. By Assumption 4, $\max_{i,t}\left\|\check w_{it}\right\|\le\alpha_1\sqrt{\frac{q_N}N}$ for some positive constant $\alpha_1$. By Assumption 8 (i), $\max_i|r_i|\le\alpha_2\sqrt{\frac{q_N}N}$ for some positive constant $\alpha_2$. It suffices to show, for all $\varepsilon>0$,
$$P\!\left(\sup_{\|\delta_a-\tilde\delta_a\|\le M_1,\ \|\delta_b\|\le M_2\sqrt{q_N}}\left|\sum_{i=1}^NA_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\right|>\varepsilon\right)\to0.$$
Let
$$\bar\Theta^a=\left\{\delta_a\in\mathbb R^{q_N}:\left\|\delta_a-\tilde\delta_a\right\|\le M_1\right\},\qquad\bar\Theta^b=\left\{\delta_b\in\mathbb R:|\delta_b|\le M_2\sqrt{q_N}\right\}.$$
We can partition $\bar\Theta^a$ ($\bar\Theta^b$) into disjoint sets $\bar\Theta^a_1,\dots,\bar\Theta^a_{D^a_N}$ ($\bar\Theta^b_1,\dots,\bar\Theta^b_{D^b_N}$) such that the diameter of each set does not exceed $m_0=\frac{\varepsilon}{10T\alpha_1\sqrt{Nq_N}}$ and the cardinalities of the partitions satisfy $D^a_N\le\left(\frac{C\sqrt{Nq_N}}{2\varepsilon}\right)^{q_N}$ and $D^b_N\le\frac{C\sqrt{Nq_N}}{2\varepsilon}$ (for example, by a similar argument to that used in Lemma 5.2 of Vershynin, 2011). Pick arbitrary $\delta^k_a\in\bar\Theta^a_k$ for $1\le k\le D^a_N$ and $\delta^l_b\in\bar\Theta^b_l$ for $1\le l\le D^b_N$. Then
$$P\!\left(\sup\left|\sum_{i=1}^NA_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\right|>\varepsilon\right)\le\sum_{k=1}^{D^a_N}\sum_{l=1}^{D^b_N}P\!\left(\left|\sum_{i=1}^NA_i\!\left(\delta^k_a,\tilde\delta_a,\delta^l_b\right)\right|+\sup_{\delta_a\in\bar\Theta^a_k,\ \delta_b\in\bar\Theta^b_l}\left|\sum_{i=1}^N\left[A_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-A_i\!\left(\delta^k_a,\tilde\delta_a,\delta^l_b\right)\right]\right|>\varepsilon\right).$$
Since $\rho_\tau(u)=\left(\tau-\frac12\right)u+\frac12|u|$, we have
$$\left|\tilde Q_i\!\left(\delta^k_a,\tilde\delta_a,\delta^l_b\right)-\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\right|\le2T\max_{i,t}\left\|\check w_{it}\right\|\sup_{\delta_a\in\bar\Theta^a_k,\ \delta_b\in\bar\Theta^b_l}\left[\left\|\delta_a-\delta^k_a\right\|+\left|\delta_b-\delta^l_b\right|\right].$$
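The partition of $\bar\Theta^a\times\bar\Theta^b$ into sets of small diameter is an $\varepsilon$-net argument: the number of cells grows like $(C/\varepsilon)^{q_N}$, which the Bernstein exponent must beat. A toy sketch (hypothetical dimension and radius, unrelated to $q_N$) building a grid net of the unit ball in $\mathbb R^2$ and checking the covering property:

```python
import numpy as np
from itertools import product

def grid_net(dim, eps):
    # A grid with spacing eps/sqrt(dim) is an eps-net of [-1, 1]^dim:
    # every point lies within eps of some grid point; the cardinality
    # grows like (C/eps)^dim, mirroring the D_N^a bound in the proof.
    step = eps / np.sqrt(dim)
    ticks = np.arange(-1.0, 1.0 + step, step)
    return [np.array(p) for p in product(ticks, repeat=dim)]

eps = 0.2
net = grid_net(2, eps)
rng = np.random.default_rng(3)
pts = rng.uniform(-1.0, 1.0, size=(200, 2))
pts = pts[np.linalg.norm(pts, axis=1) <= 1.0]  # keep points in the unit ball
cover = all(min(np.linalg.norm(x - c) for c in net) <= eps for x in pts)
print(cover)  # True
```

The half-diagonal of each grid cell is $\varepsilon/2$, so the covering holds deterministically; only the cardinality, not the randomness, matters for the union bound.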
l 2 N b ˇi D1 ı a 2 k b ˇ l ˇN T h i X ˇX ˇ Q Q Q Q R ait ı a fQi .ı a ; ı a ; ıb / Ew Qi .ı a ; ı a ; ıb / C w D sup ˇ N ˇi D1 ı2 t D1 d h QQ i .ı kd ; ıQ a ; ıbl / C Ew QQ i .ı ka ; ıQ a ; ıbl / i T X R ait ı ka w tD1 D sup N ı2 d C T X ˇ ˇN ˇX ˇ fQQ i .ı a ; ıQ a ; ıb / ˇ ˇi D1 L i t ıa w ı ka r Ä 5N T ˛1 sup N a ;ı l 2 Nb ık a 2 k b l h ı a ı ka C ıb qN " m0 D N 2 126 ıbl i Á ."i t / ˇ ˇ ˇ ."i t /ˇˇ ˇ h Ew QQ i .ı a ; ıQ a ; ıb / ˇ ˇ ˇ ."i t /gˇˇ ˇ Á tD1 R ait Ä 5N T max w i;t QQ i .ı kd ; ıQ a ; ıbl / ıQ a Á ıQ a i QQ i .ı ka ; ıQ a ; ıbl / ˇP Á Áˇ a PD b PDN ˇ N ˇ " N l k Q Therefore, now it suffices to show kD1 lD1 Pw ˇ iD1 Ai ı a ; ı; ıb ˇ > 2 has a vanishing upper bound that does not depend on .xi ; zi /. Berstein’s inequality is used. To evaluate maximum, Á 1 u C 1 juj, we can write using .u/ D 2 2 Q ıl maxAi ı ka ; ı; b Á i;k;l T i X k l Q Q R ait ı ka E Qi .ı a ; ı a ; ıb /jxi ; zi C w h D maxŒQQ i .ı ka ; ıQ a ; ıbl / i;k;l T h X t D1 T h X tD1 T X D max i;k;l t D1 T X t D1 C T X R ait ı ka w ei t i;k;l Ew Á ."i t / tD1 D max Ew Œ ıQ a wR b ıbl R ait ı ka w ei t wR b ıbl Á ei t Á ei t 1 ˇˇ Œˇei t 2 1 ˇˇ Œˇei t 2 R ait ı ka w R ait ı ka w ıQ a Á ˇ ˇ ˇei t ˇ ˇ wR b ıbl ˇ ˇ ˇ wR b ıbl ˇ ˇ ˇ ˇei t R ait ıQ a w R ait ıQ a w ."i t / tD1 i;t R ait ıQ a w wR b ıb Ái Ái C T X R ait w ıQ a ıa Á tD1 R ait ı a w Ä 3T max wR b ıbl R ait ıQ a w R ait w max ıQ a k ı ka r ÄC qN N 127 ˇ ˇ wR b ıbl ˇ C ˇ ˇ wR b ıbl ˇ C   à 1 R ait ıQ a w 2 à 1 R ait ıQ a w 2 ı ka ı ka Á  Á  ."i t / To evaluate variance, applying Knight’s identity, we have QQ i .ı ka ; ıQ a ; ıbl / C T X R ait ı ka w ıQ a Á ."i t / t D1 D D C C T h X tD1 T X R ait ı ka w ei t Á R ait ıQ a w ei t wR b ıbl Ái C T X R ait ı ka w ıQ a Á ."i t / tD1 R ait ı ka C wR b ıbl .w t D1 T X R ait ıQ a C wR b ıbl .w t D1 T X wR b ıbl R ait ı ka w ıQ a Á ri / ."i t / C tD1 0 T Z w R a ıQ a CwR b ı l ri X it b ."i t / ri / l T Z w R a ık X i t a CwR b ıb ri t D1 0 ŒI ."i t < t/ ŒI ."i t < t/ I."i t < 0/dt 
I."i t < 0/ dt ."i t / t D1 l T Z w R a ık X i t a CwR b ıb ri D ŒI a ıQ CwR ı l r R w a b b i t D1 it ."i t < t/ I."i t < 0/dt Thus, Ai ı ka ; ıQ a ; ıbl D QQ i Á ı ka ; ıQ a ; ıbl Á h E QQ i ı ka ; ıQ a ; ıbl Á i jxi ; zi C T X R ait ı ka w ıQ a Á ."i t / t D1 l T Z w R a ık X i t a CwR b ıb ri D ŒI ."i t < t/ a ıQ CwR ı l r R w a b b i tD1 2 it Z a l T R ık w X i t a CwR b ıb ri Ew 4 ŒI ."i t R a ıQ a CwR b ı l ri t D1 w it b T Á X R ait ı ka ıQ a ."i t / C w tD1 D l T Z w R a ık X i t a CwR b ıb ri ŒI ."i t < t/ a ıQ CwR ı l r R w a b b i tD1 2 it Z a l T R ık w X i t a CwR b ıb ri Ew 4 ŒI ."i t a ıQ CwR ı l r R w b b i t D1 it a I."i t < 0/dt T X R ait ı ka w ıQ a Á ."i t / tD1 < t/ I."i t < 0/dt T X 3 R ait ı ka w ıQ a Á ."i t /5 t D1 I."i t < 0/dt 3 < t/ 2 I."i t < 0/dt 5 C Ew 4 T X t D1 128 3 R ait ı ka w ıQ a Á ."i t /5 And it implies N X Var Ai ı ka ; ıQ a ; ıbl Á jxi ; zi Á i D1 Ä N X i D1 Ä N X i D1 12 3 7 I."i t < 0/dt A 5 20 l T Z Ra k 6@X wi t ı a CwR b ıb ri ŒI ."i t < t/ Ew 4 R a ıQ a CwR b ı l ri tD1 w it b ˇ ˇ a R i t ı ka T max ˇw i;t r ÄC l T Z w Áˇ X R a ık a CwR b ıb ri i t ˇ ŒFi t .t / ıQ a ˇ a l R ıQ a CwR b ı ri t D1 w it b  N T qN X X R ait ı ka C wR b ıbl fi t .0/ w N ri Á2 R ait ıQ a w Fi t .0/dt C wR b ıbl ri Á2 à .1 C o .1// iD1 tD1 r ÄC qN .1 C o .1// N by similar argument as in Lemma C.1.2. Then, by Bernstein’s inequality and Assumption 5, a Db DN N XX ˇ 1 0ˇ ˇN ˇ Á ˇX ˇ " Q ıl ˇ > A Pw @ˇˇ Ai ı ka ; ı; b ˇ ˇi D1 ˇ 2 kD1 lD1 0 1 a Db a Db s ! DN DN N N 2 XX X X N " =4 B C Ä exp @ q exp C q AÄ qN qN qN C C "C kD1 lD1 kD1 lD1 N N s ! Áq Áq p p N N N C N qN exp C Ä C exp C qN log N Ä C C N qN qN s N qN !! !0 Lemma C.1.9 (Asymptotic Equivalence with Bahadur Representation) Assume Assumption 1–5, 7 and 8. Then, we have ıQ a ıO a D op .1/ : Proof. Note that PN i D1 ŒQi p p qN ıL a ; qN ıLb Á Qi .0; 0/ Ä 0 where incides with oracle estimator ıO ab . Then, Lemma C.1.7 implies ıO ab p Á p qN ıL a ; qN ıLb cop D Op qN . 
Now, it suffices to show that, for any positive constants $M_1$ and $M_2$,
$$P\!\left(\inf_{\|\delta_a-\tilde\delta_a\|\ge M_1,\ \|\delta_b\|\le M_2\sqrt{q_N}}\sum_{i=1}^N\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)>0\right)\to1,$$
since we have $\sum_{i=1}^N\tilde Q_i\!\left(\hat\delta_a,\tilde\delta_a,\hat\delta_b\right)\le0$. Let $B=\left\{\delta_{ab}:\left\|\delta_a-\tilde\delta_a\right\|\le M_1,\ \|\delta_b\|\le M_2\sqrt{q_N}\right\}$. By Lemma C.1.8,
$$\sup_{\delta_{ab}\in B}\left|\sum_{i=1}^N\left[\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-E\!\left[\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)\,\middle|\,x_i,z_i\right]\right]+\sum_{i,t}\check w^a_{it}\left(\delta_a-\tilde\delta_a\right)\psi_\tau(\varepsilon_{it})\right|=o_p(1).$$
Then, by Lemma C.1.5,
$$\sup_{\delta_{ab}\in B}\left|\sum_{i=1}^N\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-\frac12\left[\delta_a'K_N\delta_a-\tilde\delta_a'K_N\tilde\delta_a\right](1+o(1))+\sum_{i,t}\check w^a_{it}\left(\delta_a-\tilde\delta_a\right)\psi_\tau(\varepsilon_{it})\right|=o_p(1).\qquad(A.7)$$
And since
$$\tilde\delta_a=\left(\check W_a'B_N\check W_a\right)^{-1}\check W_a'\Psi_\tau(\varepsilon)=K_N^{-1}\check W_a'\Psi_\tau(\varepsilon),\qquad(A.8)$$
we have
$$\sum_{i,t}\check w^a_{it}\left(\delta_a-\tilde\delta_a\right)\psi_\tau(\varepsilon_{it})=\left(\delta_a-\tilde\delta_a\right)'\check W_a'\Psi_\tau(\varepsilon)=\left(\delta_a-\tilde\delta_a\right)'K_N\tilde\delta_a.\qquad(A.9)$$
Combining (A.7) and (A.9), we have
$$\sup_{\delta_{ab}\in B}\left|\sum_{i=1}^N\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-\frac12\left[\delta_a'K_N\delta_a-\tilde\delta_a'K_N\tilde\delta_a\right]+\left(\delta_a-\tilde\delta_a\right)'K_N\tilde\delta_a\right|=o_p(1),$$
which implies
$$\sup_{\delta_{ab}\in B}\left|\sum_{i=1}^N\tilde Q_i\!\left(\delta_a,\tilde\delta_a,\delta_b\right)-\frac12\left(\delta_a-\tilde\delta_a\right)'K_N\left(\delta_a-\tilde\delta_a\right)\right|=o_p(1).$$
By Assumption 4, for any $\left\|\delta_a-\tilde\delta_a\right\|>M$, $\frac12\left(\delta_a-\tilde\delta_a\right)'K_N\left(\delta_a-\tilde\delta_a\right)>CM$ for some positive $C$. Since the partition $\left(\tilde w^b_{it},\tilde w^a_{it}\right)$ is arbitrary, Lemma C.1.9 implies $\left\|\tilde\delta-\hat\delta\right\|=o_p(1)$.

1) Convergence rate of $\hat\beta$: Consider the Bahadur representation $\tilde\delta$ in Lemma C.1.1. Let $\tilde\delta_1$ be the subvector of the first $K_4$ components of $\tilde\delta$. Then, by the partitioned-matrix formula, we can write
$$\tilde\delta_1=\underbrace{\left[\frac1NW'B_NW-\frac1NW'B_N\Pi_A\left(\frac1N\Pi_A'B_N\Pi_A\right)^{-1}\frac1N\Pi_A'B_NW\right]^{-1}}_{(a)}\left[\frac1{\sqrt N}W'\Psi_\tau(\varepsilon)-\frac1NW'B_N\Pi_A\left(\frac1N\Pi_A'B_N\Pi_A\right)^{-1}\frac1{\sqrt N}\Pi_A'\Psi_\tau(\varepsilon)\right].$$
Since part (a) is the upper-left submatrix of $\left(\frac1N\tilde W'B_N\tilde W\right)^{-1}$, its maximum eigenvalue is bounded above by an argument used in Lemma C.1.1.
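The eigenvalue bounds used here, and again in the $\hat g$-rate argument, are instances of the Rayleigh-quotient inequality $v'Av\le\lambda_{\max}(A)\|v\|^2$ for symmetric positive semidefinite $A$. A quick numerical illustration with random stand-in matrices (not the dissertation's data):

```python
import numpy as np

rng = np.random.default_rng(2)
N, q = 500, 8
Pi = rng.normal(size=(N, q))   # stand-in for a sieve regressor matrix
v = rng.normal(size=q)         # stand-in for an estimation error vector
A = Pi.T @ Pi / N              # symmetric PSD Gram matrix
lhs = np.mean((Pi @ v) ** 2)   # (1/N) sum_i (pi_i' v)^2 = v' A v
lam_max = np.linalg.eigvalsh(A).max()
print(lhs <= lam_max * (v @ v) + 1e-12)  # True: Rayleigh-quotient bound
```

Because the inequality is exact linear algebra, it holds for every draw; in the proof it converts a mean-squared fitted error into $\lambda_{\max}\left(\frac1N\Pi_A'\Pi_A\right)\left\|\hat\theta_A-\theta_{oA}\right\|^2$.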
Then, by Lemma C.1.10,
$$\left\|\tilde\delta_1\right\|\le C\left\|\frac1{\sqrt N}W'\left[I_{NT}-B_N\Pi_A\left(\Pi_A'B_N\Pi_A\right)^{-1}\Pi_A'\right]\Psi_\tau(\varepsilon)\right\|\le C\left\|\frac1{\sqrt N}\Phi'\Psi_\tau(\varepsilon)\right\|+o_p(1)=O_p(1).$$
Thus, $\hat\beta-\beta_o=O_p\!\left(N^{-1/2}\right)$.

2) Convergence rate of $\hat g$: By Lemmas C.1.1 and C.1.9,
$$\left\|\hat\theta_A-\theta_{oA}\right\|\le\left\|\begin{pmatrix}\hat\beta-\beta_o\\\hat\theta_A-\theta_{oA}\end{pmatrix}\right\|=\frac1{\sqrt N}\left\|\hat\delta\right\|=O_p\!\left(\sqrt{\frac{q_N}N}\right).$$
And, by Assumption 8, we have
$$\frac1N\sum_{i=1}^N\left[\hat g(x_i,z_i)-g_o(x_i,z_i)\right]^2\le\frac2N\sum_{i=1}^N\left(\pi_{iA}\left(\hat\theta_A-\theta_{oA}\right)\right)^2+\frac2N\sum_{i=1}^N\left(\pi_{iA}\theta_{oA}-g_o(x_i,z_i)\right)^2\le\frac2N\sum_{i=1}^N\left(\pi_{iA}\left(\hat\theta_A-\theta_{oA}\right)\right)^2+O\!\left(\frac{q_N}N\right).$$
Now, it suffices to show $\frac1N\sum_{i=1}^N\left(\pi_{iA}\left(\hat\theta_A-\theta_{oA}\right)\right)^2=O_p\!\left(\frac{q_N}N\right)$. First note that
$$\frac1N\sum_{i=1}^N\left(\pi_{iA}\left(\hat\theta_A-\theta_{oA}\right)\right)^2=\left(\hat\theta_A-\theta_{oA}\right)'\left(\frac1N\Pi_A'\Pi_A\right)\left(\hat\theta_A-\theta_{oA}\right)\le\lambda_{\max}\!\left(\frac1N\Pi_A'\Pi_A\right)\left\|\hat\theta_A-\theta_{oA}\right\|^2=O_p\!\left(\frac{q_N}N\right),$$
since $\lambda_{\max}\!\left(\frac1N\Pi_A'\Pi_A\right)=O_p(1)$. The result follows.

Lemma C.1.10 Assume Assumptions 1–5, 7 and 8. Then (i) $N^{-1/2}\bar W=N^{-1/2}\Phi+o_p(1)$ and (ii) $N^{-1}\bar W'B_N\bar W=K_N+o_p(1)$, where $\bar W=\left[I_{NT}-\Pi_A\left(\Pi_A'B_N\Pi_A\right)^{-1}\Pi_A'B_N\right]W$ and $K_N=\frac1N\Phi'B_N\Phi$.

Proof. (i) Let $P=\Pi_A\left(\Pi_A'B_N\Pi_A\right)^{-1}\Pi_A'B_N$. We can write
$$N^{-1/2}\bar W=N^{-1/2}\left[W-PW\right]=N^{-1/2}\Phi+N^{-1/2}\left(H-PW\right).$$
Note that
$$\left\|N^{-1/2}\left(H-PW\right)\right\|^2\le N^{-1}\,\mathrm{tr}\!\left(\left(H-PW\right)'\left(H-PW\right)\right)\le CN^{-1}T\sum_{i=1}^N\sum_{t=1}^T\sum_{k=1}^{K_1}\left[\hat h_k(x_i,z_i)-h_k(x_i,z_i)\right]^2=o_p(1),$$
where the last equality follows from Assumption 8. (ii) Note
$$N^{-1}\bar W'B_N\bar W=\left(N^{-1/2}\Phi+o_p(1)\right)'B_N\left(N^{-1/2}\Phi+o_p(1)\right)=N^{-1}\Phi'B_N\Phi+o_p(1)=K_N+o_p(1).$$

Lemma C.1.11 Assume Assumptions 1–5, 7 and 8. Then,
$$\Sigma_N^{-1/2}\tilde\delta_1\overset d\to N(0,I).$$
Proof.
We can write
$$\tilde\delta_1=\left(N^{-1}\bar W'B_N\bar W\right)^{-1}N^{-1/2}\bar W'\Psi_\tau(\varepsilon),\qquad N^{-1}\bar W'B_N\bar W=K_N+o_p(1).$$
Note that a similar argument to that used in Lemma C.1.10 yields
$$N^{-1/2}\bar W'\Psi_\tau(\varepsilon)=N^{-1/2}\Phi'\Psi_\tau(\varepsilon)+N^{-1/2}\left(H-PW\right)'\Psi_\tau(\varepsilon)=N^{-1/2}\Phi'\Psi_\tau(\varepsilon)+o_p(1),$$
so that we have
$$\tilde\delta_1=K_N^{-1}N^{-1/2}\Phi'\Psi_\tau(\varepsilon)+o_p(1).$$
Define $D_{Nit}=\Sigma_N^{-1/2}K_N^{-1}N^{-1/2}\phi_{it}'\psi_\tau(\varepsilon_{it})$. Then $\sum_{i=1}^N\sum_{t=1}^TD_{Nit}=\Sigma_N^{-1/2}K_N^{-1}N^{-1/2}\Phi'\Psi_\tau(\varepsilon)$. Similarly as in Lemma C.1.1, note that $E\!\left[D_{Nit}\right]=0$ and $E\!\left[\sum_{t=1}^TD_{Nit}\right]=0$. Also,
$$\sum_{i=1}^NE\!\left[\left(\sum_{t=1}^TD_{Nit}\right)\left(\sum_{t=1}^TD_{Nit}\right)'\right]=E\!\left[\Sigma_N^{-1/2}K_N^{-1}\left(N^{-1}\sum_{i=1}^N\left(\sum_{t=1}^T\phi_{it}'\psi_\tau(\varepsilon_{it})\right)\left(\sum_{t=1}^T\psi_\tau(\varepsilon_{it})\phi_{it}\right)\right)K_N^{-1}\Sigma_N^{-1/2}\right]=\Sigma_N^{-1/2}K_N^{-1}S_NK_N^{-1}\Sigma_N^{-1/2}=I.$$
To check the Lindeberg–Feller condition, fix $\varepsilon>0$:
$$\sum_{i=1}^NE\!\left[\left\|\sum_{t=1}^TD_{Nit}\right\|^2 1\!\left(\left\|\sum_{t=1}^TD_{Nit}\right\|>\varepsilon\right)\right]\le\frac1{\varepsilon^2}\sum_{i=1}^NE\left\|\sum_{t=1}^TD_{Nit}\right\|^4\le\frac M{\varepsilon^2N^2}\sum_{i=1}^NE\!\left[\sum_{t=1}^T\left\|\Sigma_N^{-1/2}K_N^{-1}\phi_{it}'\right\|\right]^4\le\frac C{\varepsilon^2N^2}\sum_{i=1}^N\sum_{t=1}^TE\left\|\phi_{it}\right\|^4=O\!\left(\frac1N\right)=o(1).$$

Proof of Theorem 3.4.8 The penalized objective function in (3.33) can be expressed as a difference of two convex functions $k(\beta,\theta)$ and $l(\beta,\theta)$:
$$\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\!\left(y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta\right)+\sum_{j=1}^{p_N}p_\lambda\!\left(|\theta_j|\right)=k(\beta,\theta)-l(\beta,\theta),$$
where
$$k(\beta,\theta)=\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\!\left(y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta\right)+\lambda\sum_{j=1}^{p_N}|\theta_j|,\qquad l(\beta,\theta)=\sum_{j=1}^{p_N}\left[\lambda|\theta_j|-p_\lambda\!\left(|\theta_j|\right)\right]\equiv\sum_{j=1}^{p_N}L\!\left(\theta_j\right)$$
for some $L(\cdot)$. The function $L$ for SCAD and MCP is defined as
$$L\!\left(\theta_j\right)=\begin{cases}0&0\le|\theta_j|\le\lambda\\[4pt]\dfrac{\left(|\theta_j|-\lambda\right)^2}{2(a-1)}&\lambda\le|\theta_j|\le a\lambda\\[4pt]\lambda|\theta_j|-\dfrac{(a+1)\lambda^2}2&|\theta_j|>a\lambda\end{cases}$$
with $a>2$, and
$$L\!\left(\theta_j\right)=\begin{cases}\dfrac{\theta_j^2}{2a}&0\le|\theta_j|<a\lambda\\[4pt]\lambda|\theta_j|-\dfrac{a\lambda^2}2&|\theta_j|\ge a\lambda\end{cases}$$
with $a>1$, respectively.
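The decomposition writes the folded-concave penalty as $p_\lambda(t)=\lambda t-L(t)$ so that both $k$ and $l$ are convex. A small sketch using the standard SCAD (Fan and Li, 2001) and MCP (Zhang, 2010) formulas — the values of $\lambda$ and $a$ below are illustrative — verifying that the piecewise $L$ above equals $\lambda|\theta|-p_\lambda(|\theta|)$:

```python
import numpy as np

def p_scad(t, lam, a=3.7):
    # SCAD penalty, evaluated at t = |theta| >= 0
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return (a + 1) * lam**2 / 2

def L_scad(t, lam, a=3.7):
    # piecewise form of L(theta) = lam*|theta| - p_lam(|theta|) from the text
    if t <= lam:
        return 0.0
    if t <= a * lam:
        return (t - lam)**2 / (2 * (a - 1))
    return lam * t - (a + 1) * lam**2 / 2

def p_mcp(t, lam, a=3.0):
    # MCP penalty, evaluated at t = |theta| >= 0
    return lam * t - t**2 / (2 * a) if t <= a * lam else a * lam**2 / 2

def L_mcp(t, lam, a=3.0):
    return t**2 / (2 * a) if t < a * lam else lam * t - a * lam**2 / 2

lam = 0.5
for t in np.linspace(0.0, 4.0, 81):
    assert abs(L_scad(t, lam) - (lam * t - p_scad(t, lam))) < 1e-12
    assert abs(L_mcp(t, lam) - (lam * t - p_mcp(t, lam))) < 1e-12
print("decomposition verified")
```

Both $L$ functions are convex and continuously differentiable except at branch points, which is what makes $\partial l$ a regular derivative in the next step.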
The subdifferential of $f$ at $\eta_o$ is defined to be the set
$$\partial f(\eta_o)=\left\{t:f(\eta)\ge f(\eta_o)+(\eta-\eta_o)'t,\ \forall\eta\right\}.$$
Then $\partial l$, the subdifferential of $l$, is merely a regular derivative:
$$\partial l(\beta,\theta)=\left\{\mu\in\mathbb R^{K_4+p_N}:\mu_j=0\text{ for }1\le j\le K_4,\ \mu_j=\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}}\text{ for }K_4+1\le j\le K_4+p_N\right\},$$
where
$$\frac{\partial l(\beta,\theta)}{\partial\theta_j}=\begin{cases}0&0\le|\theta_j|<\lambda\\[4pt]\dfrac{\theta_j-\lambda\,\mathrm{sgn}(\theta_j)}{a-1}&\lambda\le|\theta_j|\le a\lambda\\[4pt]\lambda\,\mathrm{sgn}(\theta_j)&|\theta_j|>a\lambda\end{cases}$$
for SCAD, and
$$\frac{\partial l(\beta,\theta)}{\partial\theta_j}=\begin{cases}\dfrac{\theta_j}a&0\le|\theta_j|<a\lambda\\[4pt]\lambda\,\mathrm{sgn}(\theta_j)&|\theta_j|\ge a\lambda\end{cases}$$
for MCP. Before we derive the subdifferential of $k$, first consider the subgradient of the unpenalized objective function $\frac1N\sum_{i=1}^N\sum_{t=1}^T\rho_\tau\!\left(y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta\right)$. For $1\le j\le K_4+p_N$,
$$s_j(\beta,\theta)=-\frac\tau N\sum_{i,t}\tilde w_{itj}\,1\!\left[y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta>0\right]-\frac{\tau-1}N\sum_{i,t}\tilde w_{itj}\,1\!\left[y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta<0\right]-\frac1N\sum_{i,t}\tilde w_{itj}\,a_{it},$$
where $\tilde w_{itj}$ denotes the $j$th component of $\left(w_{it},\pi(x_i,z_i)\right)$, $a_{it}=0$ if $y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta\ne0$, and $a_{it}\in[\tau-1,\tau]$ otherwise. The subgradient of $k$ coincides with $s(\beta,\theta)$ with an additional term $\lambda l_j$ introduced for $K_4+1\le j\le K_4+p_N$:
$$\partial k(\beta,\theta)=\left\{\left(\kappa_1,\dots,\kappa_{K_4+p_N}\right):\kappa_j=s_j(\beta,\theta)\text{ if }1\le j\le K_4;\ \kappa_j=s_j(\beta,\theta)+\lambda l_j\text{ otherwise}\right\},$$
where $l_j=\mathrm{sgn}\!\left(\theta_{j-K_4}\right)$ if $\theta_{j-K_4}\ne0$ and $l_j\in[-1,1]$ otherwise. Let $\left(\hat\beta,\hat\theta\right)$ be the oracle estimator. Define $\mathcal K$ to be the collection of vectors $\kappa=\left(\kappa_1,\dots,\kappa_{K_4+p_N}\right)$ such that
$$\kappa_j=\begin{cases}0&\text{if }j=1,\dots,K_4\\\lambda\,\mathrm{sgn}\!\left(\hat\theta_{j-K_4}\right)&\text{if }j=K_4+1,\dots,K_4+q_N\\s_j\!\left(\hat\beta,\hat\theta\right)+\lambda l_j&\text{if }j=K_4+q_N+1,\dots,K_4+p_N\end{cases}$$
with $l_j\in[-1,1]$. Then, Lemma C.1.12 and Lemma C.1.13 [Lemma C.1.16] below deliver the result. First note that, by Lemma C.1.13 [Lemma C.1.16], $\lim_{N\to\infty}P\!\left(\mathcal K\subset\partial k\!\left(\hat\beta,\hat\theta\right)\right)=1$, since $s_j\!\left(\hat\beta,\hat\theta\right)=0$ and $l_j=\mathrm{sgn}\!\left(\hat\theta_{j-K_4}\right)$ for $j=K_4+1,\dots,K_4+q_N$. Now, consider a point $(\beta,\theta)\in B\!\left(\left(\hat\beta,\hat\theta\right),\frac\lambda2\right)$. By Lemma C.1.12, it suffices to show that there exists $\kappa\in\mathcal K$ such that $P\!\left(\kappa\in\partial l(\beta,\theta)\right)\to1$; i.e., (A.10) and (A.11) below hold.
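At any point where no residual is exactly zero, the $a_{it}$ terms vanish and $s_j$ reduces to an ordinary derivative of the check-function objective. A small sketch with simulated data and hypothetical parameter values, comparing $s_j$ against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, N, p = 0.25, 60, 2
W = rng.normal(size=(N, p))
y = W @ np.array([1.0, -0.5]) + rng.normal(size=N)
beta = np.array([0.9, -0.4])  # residuals are nonzero a.s. at this point

def objective(b):
    u = y - W @ b
    return np.mean(u * (tau - (u < 0)))  # mean check-function loss

def subgrad(b):
    # s_j = -(1/N) sum_i w_ij (tau - 1[y_i - w_i b < 0]);
    # valid as a derivative when no residual equals zero
    u = y - W @ b
    return -(W * (tau - (u < 0))[:, None]).mean(axis=0)

h = 1e-6
e0 = np.array([h, 0.0])
fd = (objective(beta + e0) - objective(beta - e0)) / (2 * h)
print(abs(fd - subgrad(beta)[0]) < 1e-5)
```

Away from kinks the objective is locally linear in $\beta$, so the finite difference is exact up to floating error; at a kink the subdifferential is the interval traced out by $a_{it}\in[\tau-1,\tau]$, exactly as in the text.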
$$\lim_{N\to\infty}P\!\left(\kappa_j=\frac{\partial l(\beta,\theta)}{\partial\beta_j},\ j=1,\dots,K_4\right)=1\qquad(A.10)$$
$$\lim_{N\to\infty}P\!\left(\kappa_j=\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}},\ j=K_4+1,\dots,K_4+p_N\right)=1\qquad(A.11)$$
Since $\frac{\partial l(\beta,\theta)}{\partial\beta_j}=0$ for $j=1,\dots,K_4$, (A.10) holds by the definition of $\mathcal K$. To show (A.11): For both the SCAD and MCP penalty functions, $\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}}=\lambda\,\mathrm{sgn}\!\left(\theta_{j-K_4}\right)$ for $\left|\theta_{j-K_4}\right|\ge a\lambda$ (the case with $\left|\theta_{j-K_4}\right|=a\lambda$ can be easily checked). By Lemma C.1.13 [Lemma C.1.16], it is implied that
$$\min_{1\le j\le q_N}\left|\theta_j\right|\ge\min_{1\le j\le q_N}\left|\hat\theta_j\right|-\max_{1\le j\le q_N}\left|\hat\theta_j-\theta_j\right|\ge\left(a+\frac12\right)\lambda-\frac\lambda2=a\lambda$$
with probability approaching one. Thus, $\lim_{N\to\infty}P\!\left(\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}}=\lambda\,\mathrm{sgn}\!\left(\theta_{j-K_4}\right)\right)=1$ for $j=K_4+1,\dots,K_4+p_N$ with $\left|\theta_{j-K_4}\right|\ge a\lambda$. For $K_4+1\le j\le K_4+q_N$, it then suffices to show
$$\lim_{N\to\infty}P\!\left(\mathrm{sgn}\!\left(\hat\theta_{j-K_4}\right)=\mathrm{sgn}\!\left(\theta_{j-K_4}\right)\right)=1,$$
since $\kappa_j=\lambda\,\mathrm{sgn}\!\left(\hat\theta_{j-K_4}\right)$ for $K_4+1\le j\le K_4+q_N$. From the fact that $\hat\theta_j-\theta_{oj}=O_p\!\left(N^{-1/2}q_N^{1/2}\right)=o_p(\lambda)$ for $1\le j\le q_N$, where $\theta_{oj}>0$, and that $\left|\hat\theta_j-\theta_j\right|<\frac\lambda2$, it is implied that $\hat\theta_{j-K_4}$ and $\theta_{j-K_4}$ have the same sign for $K_4+1\le j\le K_4+q_N$ with probability tending to one. For $K_4+q_N+1\le j\le K_4+p_N$, we have $\hat\theta_{j-K_4}=0$ by the definition of the oracle estimator. Then,
$$\left|\theta_{j-K_4}\right|\le\left|\theta_{j-K_4}-\hat\theta_{j-K_4}\right|+\left|\hat\theta_{j-K_4}\right|=\left|\theta_{j-K_4}-\hat\theta_{j-K_4}\right|<\frac\lambda2.$$
For $\left|\theta_j\right|<\lambda$, $\frac{\partial l(\beta,\theta)}{\partial\theta_j}=0$ for SCAD and $\frac{\partial l(\beta,\theta)}{\partial\theta_j}=\frac{\theta_j}a$ for MCP, which implies $\left|\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}}\right|\le\frac\lambda2$ for $K_4+q_N+1\le j\le K_4+p_N$, for both penalty functions. Also, by Lemma C.1.13 [Lemma C.1.16], it is implied that $\left|s_j\!\left(\hat\beta,\hat\theta\right)\right|\le\frac\lambda2$ with probability tending to one for $K_4+q_N+1\le j\le K_4+p_N$. Therefore, there exists $l_j\in[-1,1]$ such that
$$\lim_{N\to\infty}P\!\left(s_j\!\left(\hat\beta,\hat\theta\right)+\lambda l_j=\frac{\partial l(\beta,\theta)}{\partial\theta_{j-K_4}},\ j=K_4+q_N+1,\dots,K_4+p_N\right)=1.$$
Take $\kappa_j=s_j\!\left(\hat\beta,\hat\theta\right)+\lambda l_j$ with such $l_j$. Then the result follows.

Lemma C.1.12 (Tao and An, 1997) Consider the function $k(\eta)-l(\eta)$, where both $k$ and $l$ are convex with subdifferentials $\partial k(\eta)$ and $\partial l(\eta)$. Let $\eta^*$ be a point that has a neighborhood $U$ such that $\partial l(\eta)\cap\partial k(\eta^*)\ne\emptyset$ for all $\eta\in U\cap\mathrm{dom}(k)$. Then $\eta^*$ is a local minimizer of $k(\eta)-l(\eta)$.

Lemma C.1.13 Assume Assumptions 1–6 and 9. Suppose $\lambda=o\!\left(N^{-(1-C_4)/2}\right)$, $N^{-1/2}q_N^{1/2}=o(\lambda)$, and $\log(p_N)=o\!\left(N\lambda^2\right)$. Let $\left(\hat\beta,\hat\theta\right)$ be the oracle estimator. Then, there exists $a^*_{it}$, with $a^*_{it}=0$ if $y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\ne0$ and $a^*_{it}\in[\tau-1,\tau]$ otherwise, such that, for $s_j(\beta,\theta)$ with $a_{it}=a^*_{it}$, with probability approaching one,
$$s_j\!\left(\hat\beta,\hat\theta\right)=0,\quad j=1,\dots,K_4+q_N;\qquad(A.12)$$
$$\left|\hat\theta_j\right|\ge\left(a+\frac12\right)\lambda,\quad j=1,\dots,q_N;\qquad(A.13)$$
$$\left|s_j\!\left(\hat\beta,\hat\theta\right)\right|\le c\lambda,\quad\forall c>0,\quad j=K_4+q_N+1,\dots,K_4+p_N.\qquad(A.14)$$
Proof. Define $D=\left\{(i,t):y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A=0\right\}$. To show (A.12), note that, with probability tending to one, $\left(y_{it},\tilde w_{it}\right)$ is in general position, i.e., $|D|=K_4+q_N$, since $y_{it}$ has a continuous density (Section 2.2.2, Koenker, 2005). Thus, there exists $\left\{a^*_{it}\right\}$ with $K_4+q_N$ nonzero elements such that (A.12) holds. (Alternatively, optimality of $\left(\hat\beta,\hat\theta_A\right)$ implies $0_{K_4+p_N}\in\partial\sum_{i,t}\rho_\tau\!\left(y_{it}-w_{it}\beta-\pi_A(x_i,z_i)\theta_A\right)$ at $\left(\hat\beta,\hat\theta\right)$, so that such $a^*_{it}$ exist.) To show (A.13), note that
$$\min_{1\le j\le q_N}\left|\hat\theta_j\right|\ge\min_{1\le j\le q_N}\left|\theta_{oj}\right|-\max_{1\le j\le q_N}\left|\hat\theta_j-\theta_{oj}\right|.$$
By Assumption 9, $\min_{1\le j\le q_N}\left|\theta_{oj}\right|\ge C_5N^{-(1-C_4)/2}$. Then, by Lemma C.1.1 and Lemma C.1.4, $\max_{1\le j\le q_N}\left|\hat\theta_j-\theta_{oj}\right|=O_p\!\left(\sqrt{\frac{q_N}N}\right)=o_p\!\left(N^{-(1-C_4)/2}\right)$. The result follows from $\lambda=o\!\left(N^{-(1-C_4)/2}\right)$. To show (A.14), define $J_3\equiv\left\{j:K_4+q_N+1\le j\le K_4+p_N\right\}$. Then, for $j\in J_3$, by the definition of $s_j\!\left(\hat\beta,\hat\theta\right)$, it is implied that
$$s_j\!\left(\hat\beta,\hat\theta\right)=-\frac1N\sum_{i,t}\tilde w_{itj}\left\{\tau-1\!\left[y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\le0\right]\right\}-\frac1N\sum_{(i,t)\in D}\tilde w_{itj}\,a^*_{it},$$
where $a^*_{it}$ satisfies the given condition. Thus, $\frac1N\sum_{(i,t)\in D}\tilde w_{itj}\,a^*_{it}=O_p\!\left(\frac{q_N}N\right)=o_p(\lambda).$
This holds since $|D|=K_4+q_N$ with probability tending to one. It will be shown that
$$\lim_{N\to\infty}P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left\{1\!\left[y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\le0\right]-\tau\right\}\right|>c\lambda\right)=0.$$
Define
$$I_{it}(\theta)=1\!\left[y_{it}-\tilde w^A_{it}\theta\le0\right],\qquad P_{it}(\theta)=P\!\left(y_{it}-\tilde w^A_{it}\theta\le0\,\middle|\,\tilde w^A_{it}\right),\qquad H_{it}(\theta)=I_{it}(\theta)-I_{it}(\theta_o)-P_{it}(\theta)+P_{it}(\theta_o).$$
Then, note
$$P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left\{1\!\left[\,\cdot\le0\right]-\tau\right\}\right|>c\lambda\right)\le P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(I_{it}\!\left(\hat\theta_A\right)-I_{it}(\theta_{oA})\right)\right|>\frac{c\lambda}2\right)+P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_{oA})-\tau\right)\right|>\frac{c\lambda}2\right)$$
$$\le P\!\left(\max_{j\in J_3}\sup_{\|\theta_A-\theta_{oA}\|\le C\sqrt{q_N/N}}\left|\frac1N\sum_{i,t}\tilde w_{itj}\,H_{it}(\theta_A)\right|>\frac{c\lambda}4\right)+P\!\left(\max_{j\in J_3}\sup_{\|\theta_A-\theta_{oA}\|\le C\sqrt{q_N/N}}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(P_{it}(\theta_A)-P_{it}(\theta_{oA})\right)\right|>\frac{c\lambda}4\right)+o_p(1),$$
where the second inequality follows by Lemma C.1.14. Here, note that
$$\max_{j\in J_3}\sup_{\|\theta_A-\theta_{oA}\|\le C\sqrt{q_N/N}}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(P_{it}(\theta_A)-P_{it}(\theta_{oA})\right)\right|=\max_{j\in J_3}\sup\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(F_{it}\!\left(\tilde w^A_{it}\left(\theta_A-\theta_{oA}\right)\right)-F_{it}(0)\right)\right|$$
$$\le C\sup_{\|\theta_A-\theta_{oA}\|\le C\sqrt{q_N/N}}\frac1N\sum_{i,t}\left|\tilde w^A_{it}\left(\theta_A-\theta_{oA}\right)\right|\le C\sqrt{\lambda_{\max}\!\left(\frac1N\tilde W_A'\tilde W_A\right)}\sup\left\|\theta_A-\theta_{oA}\right\|\le C\sqrt{\frac{q_N}N}=o(\lambda).$$
Thus, it now suffices to show that
$$P\!\left(\max_{j\in J_3}\sup_{\theta_A\in L_N}\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_A)-I_{it}(\theta_{oA})-P_{it}(\theta_A)+P_{it}(\theta_{oA})\right)\right|>N\lambda\right)$$
tends to 0 as $N$ goes to infinity, where $L_N=\left\{\theta_A:\left\|\theta_A-\theta_{oA}\right\|\le C\sqrt{\frac{q_N}N}\right\}$. It is implied by Lemma C.1.15.
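The union bound over $j\in J_3$ works because each per-coordinate tail is exponentially small in $N\lambda^2$ while there are only $p_N=\exp\!\left(o(N\lambda^2)\right)$ coordinates. A minimal Monte Carlo sketch of the Hoeffding step, with hypothetical variables bounded in $[0,1]$ (for the mean of $n$ such variables, $P\!\left(|\bar X-E\bar X|>t\right)\le2e^{-2nt^2}$):

```python
import numpy as np

def hoeffding_bound(n, t):
    # variables in [0, 1]: P(|mean - E[mean]| > t) <= 2 exp(-2 n t^2)
    return 2.0 * np.exp(-2.0 * n * t**2)

rng = np.random.default_rng(1)
n, reps, t = 200, 20000, 0.1
U = rng.uniform(0.0, 1.0, size=(reps, n))   # mean 1/2
tail_freq = np.mean(np.abs(U.mean(axis=1) - 0.5) > t)
print(tail_freq <= hoeffding_bound(n, t))  # True
```

With $t\asymp\lambda$ the per-$j$ bound is $2e^{-CN\lambda^2}$, and summing over $p_N$ coordinates gives $2\exp\!\left(\log p_N-CN\lambda^2\right)\to0$, which is exactly the display in the lemma below.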
Lemma C.1.14 Assume Assumptions 1–6. Suppose $\log(p_N)=o\!\left(N\lambda^2\right)$ and $N\lambda^2\to\infty$. Then
$$P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_{oA})-\tau\right)\right|>\frac{c\lambda}2\right)\to0.$$
Proof. The argument is similar to Wang, Wu, and Li (2012). Note that, for some constant $C$, the random variables $\frac1C\sum_{t=1}^T\tilde w_{itj}I_{it}(\theta_{oA})$ are independent across $i$ and bounded in the interval $[0,1]$. Then, by Hoeffding's inequality, it is implied that
$$P\!\left(\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_{oA})-\tau\right)\right|>\frac{c\lambda}2\right)\le2\exp\!\left(-CN\lambda^2\right),$$
so that
$$P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_{oA})-\tau\right)\right|>\frac{c\lambda}2\right)\le2p_N\exp\!\left(-CN\lambda^2\right)=2\exp\!\left(\log p_N-CN\lambda^2\right)\to0.$$

Lemma C.1.15 Assume Assumptions 1–6 and 9. Suppose $\lambda=o\!\left(N^{-(1-C_4)/2}\right)$, $N^{-1/2}q_N^{1/2}=o(\lambda)$, and $\log(p_N)=o\!\left(N\lambda^2\right)$. Then, for any given positive constant $C$,
$$\lim_{N\to\infty}P\!\left(\max_{j\in J_3}\sup_{\theta_A\in B}\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_A)-I_{it}(\theta_{oA})-P_{it}(\theta_A)+P_{it}(\theta_{oA})\right)\right|>N\lambda\right)=0,$$
where $B=\left\{\theta_A:\left\|\theta_A-\theta_{oA}\right\|\le C\sqrt{\frac{q_N}N}\right\}$.
Proof. $B$ can be covered with a net of balls with radius $C\sqrt{\frac{q_N}{N^5}}$ and with cardinality $N_1\equiv|\mathcal B|\le CN^{2q_N}$. Denote the $N_1$ balls centered at $t_m$ by $b(t_m)$, for $m=1,\dots,N_1$. Then
$$P\!\left(\sup_{\theta_A\in B}\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(\theta_A)-I_{it}(\theta_{oA})-P_{it}(\theta_A)+P_{it}(\theta_{oA})\right)\right|>N\lambda\right)$$
$$\le\sum_{m=1}^{N_1}P\!\left(\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(t_m)-I_{it}(\theta_{oA})-P_{it}(t_m)+P_{it}(\theta_{oA})\right)\right|>\frac{N\lambda}2\right)+\sum_{m=1}^{N_1}P\!\left(\sup_{\|\tilde\theta_A-t_m\|\le C\sqrt{q_N/N^5}}\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(\tilde\theta_A)-I_{it}(t_m)-P_{it}(\tilde\theta_A)+P_{it}(t_m)\right)\right|>\frac{N\lambda}2\right)\equiv I_{Nj1}+I_{Nj2}.$$
To evaluate $I_{Nj1}$, define $v_{ij}=\sum_{t=1}^T\tilde w_{itj}\left(I_{it}(t_m)-I_{it}(\theta_{oA})-P_{it}(t_m)+P_{it}(\theta_{oA})\right)$, which are bounded, independent, mean-zero random variables. First, note that $I_{it}(t_m)-I_{it}(\theta_{oA})$ is nonzero only if ($I_{it}(t_m)=1$ and $I_{it}(\theta_{oA})=0$) or ($I_{it}(t_m)=0$ and $I_{it}(\theta_{oA})=1$), which implies $|e_{it}|<\left|\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right|$ a.s., where $e_{it}=y_{it}-\tilde w^A_{it}\theta_{oA}$.
Then,
$$V\!\left(v_{ij}\right)\le C\sum_{t=1}^TE\!\left[\tilde w^2_{itj}\left(I_{it}(t_m)-I_{it}(\theta_{oA})\right)^2\right]\le C\sum_{t=1}^TP\!\left(|e_{it}|<\left|\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right|\right)\le C\sum_{t=1}^TE\!\left[F_{it}\!\left(\left|\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right|\right)-F_{it}\!\left(-\left|\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right|\right)\right]\le CE\!\left[\sum_{t=1}^T\left|\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right|\right].$$
Therefore,
$$\sum_{i=1}^NV\!\left(v_{ij}\right)\le CNE\!\left[\sqrt{\frac1N\sum_{i,t}\left(\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right)^2}\right]\le CNE\!\left[\sqrt{\lambda_{\max}\!\left(\frac1N\tilde W_A\tilde W_A'\right)}\left\|t_m-\theta_{oA}\right\|\right]\le C\sqrt{Nq_N}.$$
Applying Bernstein's inequality,
$$I_{Nj1}\le N_1\exp\!\left(-\frac{N^2\lambda^2/8}{C\sqrt{Nq_N}+CN\lambda}\right)\le N_1\exp\!\left(-CN\lambda\right)\le\exp\!\left(Cq_N\log N-CN\lambda\right).$$
To evaluate $I_{Nj2}$, note that
$$I_{it}(\theta_A)=1\!\left[y_{it}-\tilde w^A_{it}\theta_A\le0\right]=1\!\left[y_{it}-\tilde w^A_{it}t_m\le\tilde w^A_{it}\left(\theta_A-t_m\right)\right]$$
and that an indicator function is increasing in its threshold. Define $I_{it}(\theta,e)=1\!\left[y_{it}-\tilde w^A_{it}\theta\le e\right]$ and $P_{it}(\theta,e)=P\!\left(y_{it}-\tilde w^A_{it}\theta\le e\,\middle|\,\tilde w^A_{it}\right)$. Then, we have
$$\sup_{\|\tilde\theta_A-t_m\|\le C\sqrt{q_N/N^5}}\left|\sum_{i,t}\tilde w_{itj}\left(I_{it}(\tilde\theta_A)-I_{it}(t_m)-P_{it}(\tilde\theta_A)+P_{it}(t_m)\right)\right|$$
$$\le\sum_{i,t}\left|\tilde w_{itj}\right|\left[I_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)-I_{it}(t_m)-P_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)+P_{it}(t_m)\right]+2\sum_{i,t}\left|\tilde w_{itj}\right|\left[P_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)-P_{it}(t_m)\right].$$
Note that
$$\sum_{i,t}\left|\tilde w_{itj}\right|\left[P_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)-P_{it}(t_m)\right]=\sum_{i,t}\left|\tilde w_{itj}\right|\left[F_{it}\!\left(\tilde w^A_{it}\left(t_m-\theta_{oA}\right)+\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)-F_{it}\!\left(\tilde w^A_{it}\left(t_m-\theta_{oA}\right)\right)\right]\le C\sum_{i,t}\left\|\tilde w^A_{it}\right\|\sqrt{\tfrac{q_N}{N^5}}\le Cq_NN^{-3/2}=o(N\lambda).$$
Hence, for all $N$ sufficiently large, $I_{Nj2}\le\sum_{m=1}^{N_1}P\!\left(\sum_{i=1}^N\alpha_{mi}\ge\frac{N\lambda}4\right)$, where
$$\alpha_{mi}=\sum_{t=1}^T\left|\tilde w_{itj}\right|\left[I_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)-I_{it}(t_m)-P_{it}\!\left(t_m,\left\|\tilde w^A_{it}\right\|C\sqrt{\tfrac{q_N}{N^5}}\right)+P_{it}(t_m)\right].$$
Since the $\alpha_{mi}$ are independent, bounded random variables with mean zero, similarly as in the evaluation of $I_{Nj1}$, we can show that
$$V\!\left(\alpha_{mi}\right)\le C\sum_{t=1}^TE\!\left[\sqrt{\tfrac{q_N}{N^5}}\left\|\tilde w^A_{it}\right\|\right]\le Cq_NN^{-5/2}.$$
Applying Bernstein's inequality,
$$I_{Nj2}\le\sum_{m=1}^{N_1}P\!\left(\sum_{i=1}^N\alpha_{mi}\ge\frac{N\lambda}4\right)\le N_1\exp\!\left(-\frac{N^2\lambda^2/32}{Cq_NN^{-3/2}+CN\lambda}\right)\le N_1\exp\!\left(-CN\lambda\right)\le C\exp\!\left(Cq_N\log N-CN\lambda\right).$$
Therefore, we have
$$\sum_{j\in J_3}\left(I_{Nj1}+I_{Nj2}\right)\le C\exp\!\left(\log p_N+Cq_N\log N-CN\lambda\right)=o(1),$$
since $N^{-1/2}q_N^{1/2}\log N=o(1)$ is implied by the conditions given. Hence the result.

Lemma C.1.16 Assume Assumptions 1–5 and 7–9. Suppose $\lambda=o\!\left(N^{-(1-C_4)/2}\right)$, $N^{-1/2}q_N^{1/2}=o(\lambda)$, and $\log(p_N)=o\!\left(N\lambda^2\right)$. Let $\left(\hat\beta,\hat\theta\right)$ be the oracle estimator. Then, there exists $a^*_{it}$, with $a^*_{it}=0$ if $y_{it}-w_{it}\beta-\pi(x_i,z_i)\theta\ne0$ and $a^*_{it}\in[\tau-1,\tau]$ otherwise, such that, for $s_j(\beta,\theta)$ with $a_{it}=a^*_{it}$, with probability approaching one,
$$s_j\!\left(\hat\beta,\hat\theta\right)=0,\quad j=1,\dots,K_4+q_N;\qquad(A.15)$$
$$\left|\hat\theta_j\right|\ge\left(a+\frac12\right)\lambda,\quad j=1,\dots,q_N;\qquad(A.16)$$
$$\left|s_j\!\left(\hat\beta,\hat\theta\right)\right|\le c\lambda,\quad\forall c>0,\quad j=K_4+q_N+1,\dots,K_4+p_N.\qquad(A.17)$$
Proof. Define $D=\left\{(i,t):y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A=0\right\}$. To show (A.15), note that, with probability tending to one, $\left(y_{it},\tilde w_{it}\right)$ is in general position, i.e., $|D|=K_4+q_N$, since $y_{it}$ has a continuous density (Section 2.2.2, Koenker, 2005). Thus, there exists $\left\{a^*_{it}\right\}$ with $K_4+q_N$ nonzero elements such that (A.15) holds. (Alternatively, optimality of $\left(\hat\beta,\hat\theta_A\right)$ implies $0_{K_4+p_N}\in\partial\sum_{i,t}\rho_\tau\!\left(y_{it}-w_{it}\beta-\pi_A(x_i,z_i)\theta_A\right)$ at $\left(\hat\beta,\hat\theta\right)$, so that such $a^*_{it}$ exist.) To show (A.16), note that
$$\min_{1\le j\le q_N}\left|\hat\theta_j\right|\ge\min_{1\le j\le q_N}\left|\theta_{oj}\right|-\max_{1\le j\le q_N}\left|\hat\theta_j-\theta_{oj}\right|.$$
By Assumption 9, $\min_{1\le j\le q_N}\left|\theta_{oj}\right|\ge C_5N^{-(1-C_4)/2}$. Then, by Lemma C.1.1 and Lemma C.1.9, $\max_{1\le j\le q_N}\left|\hat\theta_j-\theta_{oj}\right|=O_p\!\left(\sqrt{\frac{q_N}N}\right)=o_p\!\left(N^{-(1-C_4)/2}\right)$. The result follows from $\lambda=o\!\left(N^{-(1-C_4)/2}\right)$.
To show (A.17), define $J_3\equiv\left\{j:K_4+q_N+1\le j\le K_4+p_N\right\}$. Then, for $j\in J_3$, by the definition of $s_j\!\left(\hat\beta,\hat\theta\right)$, it is implied that
$$s_j\!\left(\hat\beta,\hat\theta\right)=-\frac1N\sum_{i,t}\tilde w_{itj}\left\{\tau-1\!\left[y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\le0\right]\right\}-\frac1N\sum_{(i,t)\in D}\tilde w_{itj}\,a^*_{it},$$
where $a^*_{it}$ satisfies the given condition, and $\frac1N\sum_{(i,t)\in D}\tilde w_{itj}\,a^*_{it}=O_p\!\left(\frac{q_N}N\right)=o_p(\lambda)$, since $|D|=K_4+q_N$ with probability tending to one. It will be shown that
$$\lim_{N\to\infty}P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left\{1\!\left[y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\le0\right]-\tau\right\}\right|>c\lambda\right)=0.$$
Define
$$I_{it}(\theta)=1\!\left[y_{it}-\tilde w^A_{it}\theta\le0\right],\qquad I^o_{it}=1\!\left[y_{it}-w_{it}\beta_o-g_o(x_i,z_i)\le0\right],$$
$$P_{it}(\theta)=P\!\left(y_{it}-\tilde w^A_{it}\theta\le0\,\middle|\,\tilde w^A_{it}\right),\qquad P^o_{it}=P\!\left(y_{it}-w_{it}\beta_o-g_o(x_i,z_i)\le0\,\middle|\,\tilde w^A_{it}\right).$$
Then, note
$$P\!\left(\max_{j\in J_3}\left|\frac1N\sum_{i,t}\tilde w_{itj}\left\{1\!\left[y_{it}-w_{it}\hat\beta-\pi_A(x_i,z_i)\hat\theta_A\le0\right]-\tau\right\}\right|>c\lambda\right)$$
/ N N ÄC sup Thus, now it suffices to show 0 ˇ ˇ X T ˇ1 N X B ˇ P @ max sup q wQ i tj .Ii t . A / ˇ j 2J3 qN ˇ N i D1 tD1 k A  oA kÄC N Iiot 1 ˇ ˇ ˇ C Pi t . A / C Piot /ˇˇ > c =4A ˇ tends to 0 as N goes to infinity, which is implied by Lemma C.1.18. 146 Á Lemma C.1.17 Assume Assumption 1–5, and 7–9. Suppose log .pN / D o N 2 and N 2 ! 1: Then ˇ ˇ N X T ˇ1 X P @ max ˇˇ wQ i tj Iiot N j 2J3 ˇ i D1 t D1 ˇ 1 ˇ ˇ ˇ > c A ! 0: ˇ 2 ˇ 0 Proof. The argument is similar to Wang, Wu, Li (2012). Note that, for some constant C , the P random variable C1 TtD1 wQ i tj Iiot is independent across i , bounded by the interval Œ0; 1 : Then, by Hoeffding’s inequality, it is implied 0 N X T X 1 P@ wQ itj Iiot N 1 > c A Ä exp 2 i D1 tD1 CN 2 Á so that ˇ ˇ X T ˇ1 N X wQ i tj Iiot P @ max ˇˇ j 2J3 ˇ N i D1 tD1 Á Ä 2pN exp CN 2 Á D 2exp log pN CN 2 ! 0 0 ˇ 1 ˇ ˇ ˇ>c A ˇ 2 ˇ Á 1=2 Lemma C.1.18 Assume Assumption 1–5, and 7–9. Suppose D o N .1 C4 /=2 ; N 1=2 qN D Á o . / ; and log .pN / D o N 2 : Then, for any given positive constant C; 0 ˇ ˇN T ˇX X B ˇ lim P @ max sup q wQ i tj .Ii t . A / ˇ N !1 j 2J3 qN ˇi D1 t D1 k A  oA kÄC N Iiot 1 ˇ ˇ ˇ C Pi t . A / C Piot /ˇˇ > N A D 0 ˇ q q Proof. Let B D  A W k A  oA k Ä C NN : B can be covered with a net of balls with radius qq N with cardinality N Á jBj Ä CN 2qN . Denote the N balls centered at t D .t ; t / C m 1 1 m1 m2 5 N 147 ; N1 : by b .t m / for m D 1; 0 ˇ ˇ ˇN T ˇ ˇX X ˇ B o o ˇ ˇ . / . / w Q .I I P C P / P@ sup q i tj i t i t A A it it ˇ > N ˇ ˇ qN ˇi D1 t D1 k A  oA kÄC N ˇ 0ˇ 1 ˇ ˇN T N1 X ˇ ˇX X Ä P @ˇˇ wQ i tj .Ii t .t m / Iiot Pi t .t m / C Piot /ˇˇ > N =2A ˇ ˇiD1 tD1 mD1 0 ˇP Á PT ˇ N1 N B Q X B Ii t .t m / ˇ i D1 tD1 wQ i tj .Ii t  A ˇ C PB sup r >N Á ˇ @ Q qN Pi t  A C Pi t .t m //ˇ mD1 ÂQ t ÄC A m 1 C A 1 C C =2C A N5 Á INj1 C INj 2 : To evaluate INj1 ; define vij D PT Q i tj .Ii t tD1 w .t m / Iiot Pi t .t m /CPiot /; which are bounded, in- dependent mean zero random variables. 
First, note that Ii t .t m / Iiot is nonzero only if (Ii t .t m / D 1 and Iiot D 0) or (Ii t .t m / D 0 and Iiot D 1), which implies j"i t j < jwi t .t m1 ˇ o / ri j a.s: Then, V vij Ä C ÄC ÄC ÄC ÄC T X V wQ i tj .Ii t .t m / tD1 T X Iiot P tD1 T X Ii t .t m / Iiot 2 Iiot E wQ i2tj Ii t .t m / tD1 T X Pi t .t m / C Piot / 2 Á Á D1 P .j"i t j < jwi t .t m1 ˇ o / ri j/ tD1 T X E .ŒFi t .jwi t .t m1 ˇ o / ri j/ tD1 0 Ä CE @ T X 1 jwi t .t m1 ˇ o / ri jA t D1 148 Fi t . jwi t .t m1 ˇ o / ri j// Therefore, 0v 1 u N X T u1 X .wi t .t m1 ˇ o / ri /2 A V vij Ä CE @ jwi t .t m1 ˇ o / ri jA Ä CNE @t N i D1 i D1 t D1 i D1 tD1 s # " Ã!  p 1 Q Q0  C sup Ä C W W Ä CN E N qN kt k jr j m max i oA A A N 0 N X 1 N X T X Applying Bernstein’s inequality, INj1 Ä N1 exp N 2 2 =8 p C N qN C CN ! Ä N1 exp . CN / Ä exp .C qN log .N / CN / To evaluate INj 2 ; note that h Ii t . A / D 1 yi t i h QA Â Ä 0 D 1 yi t w it A Á QA w Â Ä e and Pi t .Â; e/ D it and that an indicator function is increasing. Define Ii t .Â; e/ D 1 yi t Á A : Then, we have QA Q P yi t w Â Ä ej w it it ÂQ A ˇ ˇN T Á ˇX X ˇ Q sup r w Q .I  i tj i t A ˇ qN ˇiD1 tD1 t m ÄC Ii t .t m / N5 r Ä Â Ã N X T ˇ X ˇ q N A ˇwQ i tj ˇ Ii t t m ; w Q it C Ä N5 D C i D1 t D1 N X T ˇ X r Ã Ä Â ˇ q N A ˇwQ i tj ˇ Ii t t m ; w Q it C N5 i D1 tD1 N X T ˇ X i D1 t D1 r Ä Â Ã ˇ q N A ˇwQ i tj ˇ Pi t t m ; w Q it C N5 ˇ ˇ ˇ Pi t ÂQ A C Pi t .t m //ˇˇ ˇ Ii t .t m / Á  Pi t t m ;  Ii t .t m /  Pi t t m ; 149 i t m/ QA QA w i t . 
A it tm Ä w Pi t t m ; QA w it QA w it QA w it r C qN N5 r C r C à qN N5 qN N5 à C Pi t .t m / à C Pi t .t m / Note that r Ã Ä Â N X T ˇ X ˇ q N A ˇwQ i tj ˇ Pi t t m ; w Q it C N5  Pi t t m ; r QA w it qN N5 C i D1 tD1 N X T ˇ X D i D1 t D1 r Ã Ä Â ˇ q N A ˇwQ i tj ˇ Fi t wi t .t m1 ˇ / ri C w Q it C o N5  QA w it Fi t wi t .t m1 ˇ o / ri N X T X ÄC à QA w it r i D1 tD1 r C qN N5 à qN Ä C qN N 3=2 D o .N / N5 Hence, for all N sufficiently large, INj 2 Ä PN1 mD1 P r Ä Â Ã T ˇ X ˇ qN A ˇ ˇ Q it C wQ i tj Ii t t m ; w ˛mi D N5 PN iD1 ˛mi  Ii t .t m / Pi t t m ; N 4 Á where QA w it r C tD1 qN N5 à C Pi t .t m / Since ˛mi are independent bounded random variables with mean zero, similarly as in the evaluation of INj1 ; we can show that V .˛mi / Ä C T X à Âq A 5 Q it qN =N w Ä C qN N 5=2 E tD1 Applying Bernstein’s inequality, 0 N1 N X X @ P ˛mi mD1 i D1 1 N A Ä N1 exp 4 N 2 2 =32 ! C qN N 3=2 C CN Ä N1 exp . CN / Ä C exp .C qN log N CN / Therefore, we have X INj1 C INj 2 Ä C exp .log pN C C qN log .N / CN / D o .1/ j 2J3 since N 1 1 2q2 N log N D o .1/ is implied by conditions given. Hence the result. 
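The two concentration lemmas above share one mechanism: a union bound over the \(p_N\) candidate columns combined with an exponential tail bound (Hoeffding or Bernstein) for each bounded, mean-zero column average, which makes the maximum of order \(\sqrt{\log p_N / N}\) and hence \(o_p(\lambda)\) whenever \(\log p_N = o(N\lambda^2)\). A minimal numerical sketch of this rate (illustrative only; the Rademacher design and names such as `p_N` and `max_avg` are assumptions of the example, not objects from the text):

```python
import numpy as np

# Hoeffding + union bound: for p_N columns of i.i.d. scores bounded in [-1, 1]
# with mean zero, P(max_j |avg_j| > t) <= 2 p_N exp(-N t^2 / 2), so the maximum
# of the column averages concentrates at rate sqrt(2 log(2 p_N) / N).
rng = np.random.default_rng(0)
N, p_N = 2000, 500
V = rng.choice([-1.0, 1.0], size=(N, p_N))   # bounded, mean-zero scores
max_avg = np.abs(V.mean(axis=0)).max()       # max_j |N^{-1} sum_i v_ij|
bound = np.sqrt(2 * np.log(2 * p_N) / N)     # high-probability envelope
# The event {max_avg <= 2 * bound} fails with probability at most (2 p_N)^(-3).
print(max_avg, 2 * bound)
```

For fixed \(\log p_N / N\), the envelope shrinks at the \(\sqrt{\log p_N / N}\) rate as \(N\) grows, which is exactly why the penalty level \(\lambda\) only needs to dominate \(\sqrt{\log p_N / N}\) for the irrelevant columns in \(J_3\) to be screened out.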
150 C.2 Appendix: Supplementary Tables Figure A.1 Pooled Birth Weights 151 Table A.1 Estimator performance, DGP 1, ˇ2 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 -0.0005 -0.0006 -0.0006 -0.0006 -0.0000 -0.0006 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0217 0.0216 0.0216 0.0216 0.0217 0.0216 0.0111 0.0112 0.0112 0.0112 0.0112 0.0112 0.0216 0.0216 0.0216 0.0216 0.0217 0.0216 0.0111 0.0112 0.0112 0.0112 0.0112 0.0112 0.0010 0.0011 0.0011 0.0011 0.0019 0.0011 0.0012 0.0011 0.0011 0.0011 0.0011 0.0012 0.0356 0.0357 0.0357 0.0357 0.0355 0.0358 0.0185 0.0186 0.0186 0.0186 0.0183 0.0186 0.0356 0.0357 0.0357 0.0357 0.0356 0.0358 0.0186 0.0186 0.0186 0.0186 0.0183 0.0186 -0.0003 -0.0003 -0.0003 -0.0003 -0.0000 -0.0003 -0.0000 -0.0001 -0.0001 -0.0001 -0.0001 -0.0001 0.0215 0.0215 0.0215 0.0215 0.0213 0.0215 0.0111 0.0111 0.0111 0.0111 0.0110 0.0111 0.0215 0.0215 0.0215 0.0215 0.0212 0.0215 0.0111 0.0111 0.0111 0.0111 0.0110 0.0111 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0009 0.0008 0.0008 0.0008 0.0010 0.0008 0.0002 0.0002 0.0002 0.0002 0.0001 0.0002 0.0201 0.0200 0.0200 0.0200 0.0205 0.0200 0.0110 0.0110 0.0110 0.0110 0.0109 0.0110 0.0201 0.0200 0.0200 0.0200 0.0205 0.0200 0.0110 0.0110 0.0110 0.0110 0.0109 0.0110 -0.0005 -0.0008 -0.0008 -0.0008 -0.0002 -0.0007 0.0008 0.0007 0.0007 0.0007 0.0010 0.0007 0.0347 0.0346 0.0346 0.0346 0.0346 0.0346 0.0196 0.0196 0.0196 0.0196 0.0196 0.0196 0.0347 0.0346 0.0346 0.0346 0.0346 0.0346 0.0196 0.0196 0.0196 0.0196 0.0196 0.0196 0.0010 0.0009 0.0009 0.0009 0.0008 0.0009 0.0004 0.0005 0.0005 0.0005 0.0007 0.0005 0.0195 0.0196 0.0196 0.0196 0.0191 0.0196 0.0115 0.0116 0.0116 0.0116 
0.0117 0.0116 0.0195 0.0196 0.0196 0.0196 0.0191 0.0196 0.0115 0.0116 0.0116 0.0116 0.0117 0.0116 152 Table A.2 Estimator performance, DGP 2, ˇ2 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0105 0.0085 0.0082 0.0085 0.0186 0.0119 0.0057 0.0045 0.0045 0.0045 0.0081 0.0068 0.1643 0.1642 0.1638 0.1642 0.1665 0.1648 0.0887 0.0882 0.0883 0.0882 0.0878 0.0895 0.1645 0.1643 0.1639 0.1643 0.1674 0.1652 0.0889 0.0883 0.0884 0.0883 0.0882 0.0897 -0.0078 -0.0021 -0.0075 -0.0021 -0.0041 -0.0039 0.0093 0.0101 0.0091 0.0101 0.0080 0.0078 0.2587 0.2635 0.2587 0.2635 0.2580 0.2579 0.1473 0.1479 0.1474 0.1479 0.1478 0.1477 0.2587 0.2634 0.2587 0.2634 0.2579 0.2578 0.1476 0.1481 0.1476 0.1481 0.1480 0.1479 -0.0078 -0.0063 -0.0065 -0.0063 -0.0119 -0.0080 -0.0054 -0.0049 -0.0048 -0.0049 -0.0067 -0.0045 0.1596 0.1621 0.1622 0.1621 0.1598 0.1610 0.0876 0.0857 0.0856 0.0857 0.0878 0.0876 0.1597 0.1622 0.1622 0.1622 0.1601 0.1612 0.0878 0.0858 0.0857 0.0858 0.0880 0.0876 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0238 0.0269 0.0266 1.0523 0.0240 0.0237 0.0004 0.0011 0.0006 0.0014 0.0004 0.0015 0.1623 0.1639 0.1640 0.4443 0.1593 0.1613 0.0891 0.0886 0.0886 0.0886 0.0878 0.0881 0.1640 0.1660 0.1660 1.1421 0.1610 0.1629 0.0890 0.0885 0.0886 0.0886 0.0878 0.0880 0.0129 0.0243 0.0135 0.0239 0.0055 0.0059 -0.0078 -0.0061 -0.0078 -0.0061 -0.0097 -0.0098 0.2653 0.2830 0.2650 0.2825 0.2659 0.2659 0.1433 0.1445 0.1433 0.1445 0.1432 0.1434 0.2655 0.2839 0.2652 0.2834 0.2658 0.2658 0.1434 0.1446 0.1434 0.1446 0.1435 0.1437 -0.0121 -0.0089 -0.0109 -0.0036 -0.0201 -0.0143 0.0047 0.0049 0.0050 0.0050 0.0035 0.0040 0.1608 0.1644 0.1622 0.1691 0.1646 
0.1623 0.0865 0.0859 0.0859 0.0862 0.0867 0.0863 0.1612 0.1646 0.1625 0.1691 0.1657 0.1628 0.0866 0.0860 0.0860 0.0863 0.0867 0.0863 153 Table A.3 Selection Performance, DGP 3 D 0:1 D 0:5 D 0:9 Method pN qo N IC TV FV True TV FV True TV FV True gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund 136 136 136 136 136 136 136 136 136 136 136 136 6 6 6 6 6 6 6 6 6 6 6 6 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 5.14 4.86 4.64 4.42 5.15 4.76 5.52 5.53 5.15 4.85 5.62 5.52 2.61 1.52 1.36 1.12 4.33 1.42 1.81 1.54 1.05 1.15 3.06 1.53 0.07 0.07 0.07 0.07 0.06 0.07 0.19 0.20 0.20 0.20 0.13 0.20 4.98 4.74 4.71 4.64 5.30 4.81 5.89 5.51 5.41 5.35 5.89 5.83 2.19 1.58 1.49 1.36 4.83 1.82 0.93 0.65 0.63 0.65 4.38 0.74 0.10 0.11 0.11 0.11 0.07 0.11 0.42 0.46 0.46 0.46 0.23 0.45 5.72 5.73 5.70 5.37 5.43 5.72 5.92 5.96 5.92 5.91 5.79 5.96 0.80 0.36 0.31 0.44 3.62 0.33 0.28 0.12 0.09 0.09 3.35 0.12 0.59 0.71 0.71 0.67 0.16 0.71 0.83 0.92 0.92 0.92 0.23 0.92 gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham 102 102 102 102 102 102 102 102 102 102 102 102 18 18 18 18 18 18 18 18 18 18 18 18 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 14.89 14.31 13.49 2.11 15.02 14.11 16.61 16.39 15.66 15.29 16.50 16.35 5.64 4.51 4.54 0.40 9.44 4.44 3.98 2.45 2.43 2.50 7.18 2.36 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.05 0.05 0.05 0.02 0.05 13.46 12.65 12.52 8.89 14.11 12.96 15.59 14.34 14.11 13.43 16.63 14.92 8.86 5.84 5.61 3.89 11.92 6.95 5.12 4.67 4.71 4.70 7.98 4.67 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.71 14.22 13.32 2.76 14.88 14.08 16.89 16.62 15.91 15.40 16.65 16.57 6.45 5.04 4.87 5.65 10.49 4.96 3.31 2.24 2.20 2.44 7.30 2.15 0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.10 0.10 0.10 0.02 0.10 154 Table A.4 Estimator performance, DGP 3, ˇ1 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund 
gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0015 0.0005 0.0004 0.0074 0.0016 0.0005 0.0007 0.0007 0.0008 0.0008 0.0009 0.0007 0.0210 0.0212 0.0214 0.0404 0.0207 0.0214 0.0116 0.0116 0.0120 0.0123 0.0117 0.0116 0.0210 0.0212 0.0214 0.0411 0.0208 0.0214 0.0116 0.0116 0.0120 0.0123 0.0117 0.0116 0.0016 0.0012 0.0009 0.0002 0.0021 0.0013 0.0016 0.0015 0.0013 0.0013 0.0015 0.0016 0.0364 0.0366 0.0363 0.0359 0.0364 0.0362 0.0191 0.0192 0.0192 0.0192 0.0188 0.0191 0.0365 0.0366 0.0363 0.0359 0.0365 0.0362 0.0191 0.0193 0.0193 0.0193 0.0189 0.0192 -0.0001 -0.0001 -0.0002 0.0100 0.0001 -0.0001 -0.0000 0.0000 -0.0001 -0.0001 -0.0001 0.0000 0.0200 0.0203 0.0204 0.0419 0.0202 0.0204 0.0115 0.0116 0.0115 0.0115 0.0115 0.0116 0.0200 0.0203 0.0204 0.0431 0.0202 0.0203 0.0115 0.0116 0.0115 0.0115 0.0115 0.0116 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0012 0.0010 0.0011 0.8448 0.0006 0.0010 0.0001 0.0001 0.0002 0.0010 -0.0000 0.0001 0.0211 0.0215 0.0235 0.5553 0.0209 0.0220 0.0116 0.0118 0.0119 0.0141 0.0116 0.0117 0.0212 0.0216 0.0235 1.0109 0.0209 0.0220 0.0116 0.0118 0.0119 0.0142 0.0116 0.0117 0.0033 0.0004 -0.0003 0.0629 0.0037 0.0017 0.0004 0.0001 0.0002 -0.0004 0.0004 0.0003 0.0332 0.0335 0.0331 0.0530 0.0332 0.0336 0.0190 0.0189 0.0189 0.0186 0.0191 0.0189 0.0333 0.0335 0.0331 0.0823 0.0334 0.0337 0.0190 0.0189 0.0189 0.0186 0.0191 0.0189 -0.0004 -0.0007 -0.0002 0.1379 -0.0005 -0.0008 0.0001 0.0002 0.0002 0.0039 0.0002 0.0002 0.0226 0.0229 0.0247 0.0987 0.0219 0.0234 0.0110 0.0110 0.0114 0.0249 0.0110 0.0111 0.0226 0.0229 0.0247 0.1695 0.0219 0.0234 0.0110 0.0110 0.0114 0.0252 0.0110 0.0111 155 Table A.5 Estimator performance, DGP 3, ˇ2 D 0:1 Method N gMund gMund gMund gMund gMund 
gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0015 0.0011 0.0004 0.0060 0.0012 0.0008 0.0005 0.0005 0.0004 0.0004 0.0005 0.0005 0.0215 0.0219 0.0222 0.0404 0.0211 0.0222 0.0117 0.0117 0.0117 0.0118 0.0117 0.0117 0.0215 0.0219 0.0222 0.0409 0.0211 0.0222 0.0117 0.0117 0.0117 0.0118 0.0117 0.0117 0.0015 0.0008 0.0006 0.0004 0.0015 0.0011 0.0001 -0.0002 -0.0001 -0.0001 0.0002 0.0001 0.0355 0.0356 0.0355 0.0352 0.0359 0.0355 0.0195 0.0195 0.0195 0.0195 0.0196 0.0196 0.0355 0.0356 0.0355 0.0352 0.0359 0.0355 0.0195 0.0195 0.0195 0.0195 0.0196 0.0196 0.0010 0.0010 0.0011 0.0121 0.0011 0.0010 0.0011 0.0011 0.0010 0.0010 0.0011 0.0011 0.0212 0.0211 0.0214 0.0443 0.0209 0.0212 0.0114 0.0115 0.0114 0.0114 0.0115 0.0115 0.0212 0.0211 0.0214 0.0459 0.0209 0.0212 0.0115 0.0115 0.0114 0.0114 0.0116 0.0115 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 -0.0002 0.0005 0.0009 0.8354 -0.0005 0.0008 -0.0006 -0.0004 -0.0005 0.0004 -0.0007 -0.0006 0.0216 0.0223 0.0236 0.5497 0.0213 0.0227 0.0118 0.0119 0.0120 0.0148 0.0120 0.0119 0.0216 0.0223 0.0237 0.9999 0.0213 0.0227 0.0119 0.0119 0.0120 0.0148 0.0120 0.0119 0.0037 0.0011 0.0006 0.0606 0.0044 0.0022 0.0008 0.0007 0.0008 0.0005 0.0008 0.0007 0.0345 0.0329 0.0329 0.0533 0.0348 0.0333 0.0191 0.0190 0.0189 0.0191 0.0191 0.0192 0.0347 0.0329 0.0329 0.0806 0.0350 0.0334 0.0191 0.0190 0.0189 0.0191 0.0191 0.0192 -0.0001 0.0003 0.0004 0.1367 -0.0005 0.0004 0.0002 0.0000 0.0002 0.0040 -0.0000 0.0000 0.0220 0.0227 0.0240 0.0998 0.0217 0.0227 0.0113 0.0113 0.0115 0.0264 0.0115 0.0113 0.0220 0.0227 0.0240 0.1692 0.0217 0.0227 0.0113 0.0113 0.0115 0.0267 0.0115 0.0113 156 Table A.6 Selection Performance, DGP 4 D 0:1 D 0:5 D 0:9 Method pN qo N IC 
TV FV True TV FV True TV FV True gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund 136 136 136 136 136 136 136 136 136 136 136 136 6 6 6 6 6 6 6 6 6 6 6 6 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 3.41 1.65 3.08 1.38 4.20 4.04 4.11 3.17 4.00 2.54 4.70 4.57 2.23 1.01 1.63 0.94 7.48 5.49 2.40 1.34 2.06 1.11 8.67 6.39 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.67 1.17 3.21 1.66 3.62 4.05 3.65 1.97 4.02 3.23 4.52 4.94 2.40 1.05 4.42 1.30 7.22 11.82 2.52 1.43 4.30 1.95 8.66 16.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.71 2.31 3.60 1.58 3.99 3.90 4.38 3.90 4.30 3.66 4.66 4.56 3.36 2.06 2.78 1.93 10.00 7.45 3.13 2.04 2.60 2.02 10.86 7.98 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham 102 102 102 102 102 102 102 102 102 102 102 102 18 18 18 18 18 18 18 18 18 18 18 18 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 6.56 1.39 5.75 0.46 9.18 8.53 9.79 4.52 8.85 3.99 11.88 11.55 4.06 1.13 3.46 0.43 8.27 6.57 5.52 2.81 4.85 2.66 10.11 8.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.83 0.93 6.11 3.37 7.31 8.69 7.84 3.39 9.34 4.61 10.41 11.45 3.82 0.88 5.06 2.77 6.96 10.22 5.37 3.15 6.88 3.66 9.05 12.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.83 3.18 6.84 2.86 9.79 9.44 11.51 7.19 11.15 5.12 12.38 12.17 7.27 4.04 6.49 3.79 11.46 10.01 7.60 6.56 7.29 6.07 12.00 10.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 157 Table A.7 Estimator performance, DGP 4, ˇ1 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 
AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0259 0.0796 0.0322 0.1409 0.0177 0.0177 0.0060 0.0443 0.0086 0.0625 0.0053 0.0069 0.3429 0.3854 0.3489 0.4682 0.3421 0.3431 0.1788 0.1950 0.1805 0.1943 0.1760 0.1783 0.3437 0.3933 0.3503 0.4887 0.3424 0.3434 0.1788 0.1999 0.1806 0.2040 0.1760 0.1784 0.0434 0.1254 0.0324 0.0866 0.0293 0.0170 0.0186 0.0763 0.0100 0.0313 0.0063 0.0086 0.5665 0.6198 0.5584 0.5692 0.5506 0.5498 0.3064 0.3240 0.3030 0.3152 0.3036 0.3025 0.5678 0.6321 0.5591 0.5755 0.5511 0.5498 0.3068 0.3327 0.3030 0.3166 0.3035 0.3025 -0.0069 0.0966 0.0008 0.1341 0.0005 -0.0009 -0.0023 0.0088 -0.0027 0.0336 -0.0030 -0.0031 0.3193 0.3656 0.3270 0.3740 0.3233 0.3214 0.1883 0.1979 0.1877 0.2165 0.1929 0.1940 0.3192 0.3780 0.3268 0.3971 0.3232 0.3213 0.1882 0.1980 0.1876 0.2190 0.1928 0.1940 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICLP BIC BICLP AIC1 AIC2 QBIC QBICLP BIC BICLP AIC1 AIC2 0.1006 0.9157 0.1263 1.2698 0.0437 0.0517 0.0305 0.0775 0.0426 0.0831 0.0098 0.0128 0.3474 0.6470 0.3550 0.5288 0.3407 0.3419 0.1835 0.1901 0.1863 0.1974 0.1815 0.1810 0.3615 1.1210 0.3766 1.3754 0.3434 0.3456 0.1859 0.2052 0.1910 0.2140 0.1817 0.1814 0.1191 1.3293 0.0694 0.3097 0.0429 0.0225 0.0531 0.1170 0.0348 0.0951 0.0276 0.0230 0.5735 0.8436 0.5610 0.7033 0.5478 0.5443 0.3109 0.3379 0.3064 0.3092 0.3046 0.3024 0.5855 1.5742 0.5650 0.7681 0.5492 0.5445 0.3153 0.3575 0.3082 0.3234 0.3057 0.3032 0.0432 0.1974 0.0725 0.2272 0.0062 0.0107 0.0011 0.1160 0.0036 0.1638 -0.0052 -0.0028 0.3505 0.4299 0.3544 0.4578 0.3382 0.3414 0.1881 0.2168 0.1900 0.2254 0.1890 0.1902 0.3530 0.4728 0.3615 0.5109 0.3381 0.3414 0.1880 0.2458 0.1899 0.2785 0.1890 0.1902 158 Table A.8 Estimator performance, DGP 4, ˇ2 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL 
BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0423 0.0882 0.0542 0.1200 0.0342 0.0332 0.0148 0.0477 0.0164 0.0637 0.0159 0.0170 0.3284 0.3400 0.3286 0.3883 0.3250 0.3238 0.1835 0.1956 0.1861 0.1998 0.1819 0.1829 0.3309 0.3511 0.3329 0.4062 0.3267 0.3253 0.1840 0.2012 0.1867 0.2096 0.1825 0.1836 0.0432 0.1255 0.0313 0.0873 0.0288 0.0177 0.0238 0.0883 0.0175 0.0387 0.0181 0.0162 0.5459 0.5864 0.5319 0.5363 0.5270 0.5267 0.3133 0.3228 0.3102 0.3205 0.3070 0.3046 0.5474 0.5993 0.5325 0.5431 0.5275 0.5267 0.3141 0.3345 0.3105 0.3227 0.3074 0.3049 0.0055 0.1222 0.0072 0.1595 -0.0005 -0.0004 0.0051 0.0160 0.0047 0.0433 0.0051 0.0042 0.3433 0.3856 0.3433 0.3912 0.3371 0.3392 0.1823 0.1915 0.1799 0.2108 0.1783 0.1800 0.3432 0.4043 0.3432 0.4223 0.3369 0.3390 0.1823 0.1921 0.1798 0.2151 0.1783 0.1799 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICLP BIC BICLP AIC1 AIC2 QBIC QBICLP BIC BICLP AIC1 AIC2 0.0723 0.8832 0.0982 1.2474 0.0127 0.0284 0.0249 0.0757 0.0387 0.0784 0.0060 0.0056 0.3467 0.6648 0.3581 0.5287 0.3366 0.3383 0.1887 0.1926 0.1891 0.1966 0.1902 0.1915 0.3540 1.1052 0.3712 1.3547 0.3367 0.3393 0.1902 0.2068 0.1930 0.2116 0.1902 0.1915 0.0962 1.2828 0.0447 0.2693 0.0280 0.0091 0.0317 0.0948 0.0171 0.0794 0.0111 0.0068 0.5598 0.8614 0.5466 0.6714 0.5434 0.5350 0.3141 0.3322 0.3109 0.3121 0.3101 0.3080 0.5677 1.5449 0.5481 0.7230 0.5439 0.5348 0.3156 0.3453 0.3113 0.3219 0.3102 0.3079 0.0496 0.1973 0.0728 0.2520 0.0062 0.0071 -0.0009 0.1151 0.0053 0.1628 -0.0059 -0.0054 0.3642 0.4430 0.3712 0.4945 0.3535 0.3532 0.1844 0.2145 0.1856 0.2199 0.1852 0.1850 0.3674 0.4847 0.3781 0.5548 0.3534 0.3531 0.1843 0.2433 0.1855 0.2735 0.1852 0.1850 159 Table A.9 Selection Performance, DGP 5 D 0:1 D 0:5 D 0:9 Method pN qo N IC TV FV True TV FV True TV FV True gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund 136 136 136 136 136 136 136 136 136 136 136 136 18 18 18 18 18 18 18 18 18 18 18 18 300 300 300 300 300 300 1000 1000 1000 1000 
1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 7.54 3.20 6.87 2.56 9.21 8.72 9.65 7.43 9.23 5.93 11.72 11.22 3.67 1.44 3.08 1.22 7.09 5.70 4.82 2.17 4.20 1.49 8.97 7.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.01 2.35 7.26 3.65 8.13 9.23 8.44 4.56 9.45 7.17 10.65 11.85 3.88 1.75 5.30 2.39 6.76 9.66 4.93 2.52 6.24 3.81 8.74 12.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 8.77 4.20 8.00 2.92 10.57 10.11 12.12 7.87 11.72 6.99 13.08 12.87 6.34 4.30 5.70 3.82 10.81 9.03 7.14 5.17 6.67 5.08 11.36 9.64 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham 102 102 102 102 102 102 102 102 102 102 102 102 12 12 12 12 12 12 12 12 12 12 12 12 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 5.96 3.09 5.54 2.28 7.23 6.92 7.51 5.51 7.23 5.24 8.54 8.31 3.42 1.08 2.72 0.83 8.79 6.75 4.17 1.96 3.71 1.55 10.60 8.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.85 2.08 5.67 3.59 6.39 7.27 6.53 4.61 7.36 5.40 8.20 9.03 3.34 1.20 4.96 2.12 7.43 11.71 4.10 2.51 5.65 3.02 8.74 15.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.65 3.81 6.29 3.18 7.47 7.27 8.41 6.97 8.24 6.12 8.95 8.81 5.16 3.47 4.63 3.34 9.64 7.80 5.03 3.99 4.64 3.84 10.53 8.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 160 Table A.10 Estimator performance, DGP 5, ˇ1 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0459 0.1934 0.0575 0.3393 0.0346 0.0371 0.0152 0.0339 0.0155 0.0513 0.0153 0.0151 0.3642 0.4944 0.3677 0.6274 0.3594 0.3593 0.1861 0.1926 0.1856 0.2024 0.1836 0.1832 0.3669 0.5306 
0.3720 0.7130 0.3609 0.3610 0.1866 0.1955 0.1862 0.2087 0.1841 0.1838 -0.0024 0.1810 -0.0092 0.0659 -0.0166 -0.0193 0.0038 0.0632 0.0041 0.0134 0.0045 0.0044 0.5246 0.6825 0.5188 0.5413 0.5249 0.5259 0.3030 0.3224 0.3028 0.3038 0.3034 0.3056 0.5244 0.7057 0.5186 0.5451 0.5249 0.5260 0.3028 0.3284 0.3027 0.3040 0.3033 0.3055 0.0090 0.1306 0.0202 0.1669 0.0017 0.0042 -0.0013 0.0338 -0.0019 0.0634 -0.0054 -0.0040 0.3442 0.4318 0.3543 0.4681 0.3366 0.3374 0.1798 0.2153 0.1792 0.2334 0.1789 0.1786 0.3442 0.4510 0.3547 0.4968 0.3365 0.3372 0.1797 0.2179 0.1791 0.2418 0.1789 0.1786 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0202 0.2321 0.0277 0.3751 0.0009 0.0091 0.0141 0.0255 0.0161 0.0256 0.0137 0.0100 0.3090 0.4315 0.3097 0.4462 0.3242 0.3209 0.1704 0.1691 0.1713 0.1704 0.1785 0.1755 0.3095 0.4898 0.3108 0.5828 0.3240 0.3208 0.1709 0.1710 0.1719 0.1722 0.1790 0.1757 0.0732 0.4136 0.0538 0.1377 0.0529 0.0462 0.0226 0.0413 0.0177 0.0317 0.0210 0.0232 0.5035 0.6022 0.4978 0.5489 0.5055 0.5086 0.2826 0.2899 0.2830 0.2777 0.2877 0.2894 0.5086 0.7303 0.5005 0.5657 0.5080 0.5104 0.2834 0.2927 0.2834 0.2794 0.2883 0.2902 0.0128 0.0861 0.0221 0.1168 -0.0024 0.0023 -0.0037 0.0202 -0.0031 0.0359 -0.0052 -0.0051 0.3246 0.3633 0.3240 0.3952 0.3244 0.3201 0.1704 0.1730 0.1706 0.1766 0.1772 0.1763 0.3247 0.3732 0.3245 0.4119 0.3242 0.3199 0.1703 0.1741 0.1705 0.1801 0.1772 0.1763 161 Table A.11 Estimator performance, DGP 5, ˇ2 D 0:1 Method N gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gMund gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham gCham D 0:5 D 0:9 IC Bias SD RMSE Bias SD RMSE Bias SD RMSE 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0468 0.1815 0.0535 0.3173 0.0305 0.0372 0.0150 0.0308 0.0140 0.0473 0.0144 0.0164 0.3428 0.4963 0.3491 0.6270 0.3319 0.3360 0.1813 0.1889 0.1840 0.2013 0.1819 0.1817 
0.3458 0.5282 0.3530 0.7024 0.3331 0.3379 0.1818 0.1913 0.1844 0.2066 0.1824 0.1823 0.0297 0.1784 0.0139 0.0868 0.0090 0.0029 0.0081 0.0646 0.0065 0.0132 0.0069 0.0072 0.5341 0.6529 0.5313 0.5283 0.5312 0.5309 0.2922 0.3073 0.2914 0.3000 0.2923 0.2953 0.5347 0.6765 0.5312 0.5351 0.5310 0.5306 0.2922 0.3138 0.2913 0.3001 0.2923 0.2952 -0.0059 0.1127 0.0040 0.1521 -0.0101 -0.0105 -0.0015 0.0344 -0.0019 0.0679 -0.0041 -0.0046 0.3508 0.4217 0.3557 0.4538 0.3416 0.3444 0.1866 0.2085 0.1854 0.2284 0.1848 0.1870 0.3506 0.4363 0.3555 0.4784 0.3416 0.3444 0.1866 0.2112 0.1853 0.2382 0.1848 0.1869 300 300 300 300 300 300 1000 1000 1000 1000 1000 1000 QBIC QBICL BIC BICL AIC1 AIC2 QBIC QBICL BIC BICL AIC1 AIC2 0.0467 0.2552 0.0488 0.3876 0.0268 0.0289 0.0133 0.0269 0.0134 0.0294 0.0102 0.0084 0.3172 0.4229 0.3241 0.4361 0.3250 0.3246 0.1727 0.1707 0.1722 0.1744 0.1821 0.1818 0.3204 0.4937 0.3276 0.5833 0.3259 0.3257 0.1732 0.1727 0.1726 0.1768 0.1823 0.1819 0.0185 0.3702 0.0067 0.0823 -0.0023 -0.0070 0.0266 0.0450 0.0230 0.0342 0.0232 0.0226 0.5172 0.6165 0.5211 0.5595 0.5236 0.5261 0.2898 0.2962 0.2964 0.2886 0.3005 0.3061 0.5173 0.7188 0.5209 0.5652 0.5233 0.5259 0.2908 0.2995 0.2972 0.2905 0.3012 0.3068 0.0130 0.0782 0.0153 0.1005 0.0022 0.0048 0.0003 0.0240 -0.0007 0.0420 0.0010 0.0006 0.3223 0.3827 0.3261 0.4033 0.3292 0.3278 0.1708 0.1743 0.1683 0.1802 0.1768 0.1748 0.3224 0.3904 0.3263 0.4154 0.3291 0.3277 0.1708 0.1759 0.1682 0.1849 0.1767 0.1747

Table A.12 Birthweight, pooled quantile regression, all moms (unit: grams; standard errors in parentheses)

Quantile                               0.1              0.25             0.5              0.75             0.9
Smoke                                  -258.82 (6.57)   -250.75 (4.38)   -238.49 (3.79)   -234.27 (4.21)   -227.57 (5.09)
Male                                   93.59 (3.52)     114.26 (2.38)    131.27 (2.10)    145.08 (2.35)    157.47 (3.05)
Age                                    18.11 (3.73)     7.28 (2.41)      2.59 (2.10)      -0.10 (2.38)     -2.81 (2.89)
Age2                                   -0.34 (0.06)     -0.13 (0.04)     -0.04 (0.04)     0.02 (0.04)      0.09 (0.05)
Kessner index = 2                      -157.10 (8.50)   -108.39 (5.39)   -81.71 (4.56)    -66.64 (4.84)    -63.02 (6.22)
Kessner index = 3                      -297.62 (24.05)  -212.68 (15.39)  -149.85 (12.48)  -120.34 (11.17)  -91.31 (17.79)
No prenatal visit                      -118.36 (40.77)  0.77 (24.07)     7.87 (21.29)     30.44 (17.82)    25.41 (29.10)
First prenatal visit in 2nd trimester  139.51 (10.11)   93.27 (6.32)     72.21 (5.40)     60.64 (5.89)     57.45 (7.52)
First prenatal visit in 3rd trimester  282.43 (27.28)   194.33 (17.32)   119.48 (14.34)   89.75 (14.15)    56.94 (20.60)

Table A.13 Birthweight, quantile regression with Classical CRE, all moms (unit: grams; standard errors in parentheses)

Quantile                               0.1              0.25             0.5              0.75             0.9
Smoke                                  -144.27 (10.43)  -145.56 (7.51)   -147.59 (6.62)   -147.18 (7.56)   -147.37 (9.87)
Male                                   98.49 (4.59)     122.15 (3.30)    139.71 (2.91)    153.22 (3.32)    162.48 (4.34)
Age                                    -15.43 (8.69)    -20.22 (6.26)    -22.62 (5.51)    -17.66 (6.30)    -13.34 (8.22)
Age2                                   0.30 (0.12)      0.35 (0.08)      0.41 (0.07)      0.32 (0.08)      0.34 (0.11)
Kessner index = 2                      -139.90 (10.29)  -88.61 (7.06)    -60.36 (6.22)    -57.38 (7.11)    -50.37 (9.28)
Kessner index = 3                      -257.35 (22.65)  -173.21 (16.31)  -126.93 (14.36)  -93.61 (16.42)   -65.79 (21.42)
No prenatal visit                      -133.51 (36.44)  -31.59 (26.23)   3.45 (23.10)     14.55 (26.41)    57.79 (34.46)
First prenatal visit in 2nd trimester  109.02 (11.60)   66.49 (8.35)     43.00 (7.36)     40.34 (8.41)     39.56 (10.97)
First prenatal visit in 3rd trimester  209.63 (27.53)   137.03 (19.82)   84.09 (17.46)    60.87 (19.96)    55.49 (26.04)

BIBLIOGRAPHY

Abrevaya, Jason. 2006. Estimating the effect of smoking on birth outcomes using a matched panel data approach. Journal of Applied Econometrics 21(4). 489–519. doi:10.1002/jae.851.

Abrevaya, Jason & Christian M. Dahl. 2008. The Effects of Birth Inputs on Birthweight. Journal of Business & Economic Statistics 26(4). 379–397. doi:10.1198/073500107000000269.

Amemiya, Takeshi. 1985. Advanced econometrics. Harvard University Press.

Amemiya, Takeshi. 1978. The Estimation of a Simultaneous Equation Generalized Probit Model. Econometrica 46(5). 1193–1205. doi:10.2307/1911443.

Amemiya, Takeshi. 1979. The estimation of a simultaneous-equation Tobit model. International Economic Review 20(1). 169–181.

Belloni, Alexandre, D. Chen, Victor Chernozhukov & Christian Hansen. 2012a. Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain. Econometrica 80(6). 2369–2429. doi:10.3982/ECTA9626.

Belloni, Alexandre, Victor Chernozhukov & Christian Hansen. 2012b. Inference for high-dimensional sparse econometric models. arXiv:1201.0220v1. 1–41. doi:10.1017/CBO9781139060035.008.

Belloni, Alexandre & Victor Chernozhukov. 2011a. High dimensional sparse econometric models: An introduction. In Pierre Alquier, Eric Gautier & Gilles Stoltz (eds.), Inverse problems and high-dimensional estimation (Lecture Notes in Statistics 203), 121–156. Springer-Verlag. doi:10.1007/978-3-642-19989-9.

Belloni, Alexandre & Victor Chernozhukov. 2011b. l1-Penalized Quantile Regression in High-Dimensional Sparse Models. The Annals of Statistics 39(1). 82–130. doi:10.1214/10-AOS827.

Belloni, Alexandre, Victor Chernozhukov & Christian Hansen. 2014a. Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies 81(2). 608–650. doi:10.1093/restud/rdt044.

Belloni, Alexandre, Victor Chernozhukov, Christian Hansen & Damian Kozbur. 2014b. Inference in High Dimensional Panel Models with an Application to Gun Control. arXiv:1411.6507v1. 1–80.

Breusch, Trevor, Hailong Qian, Peter Schmidt & Donald Wyhowski. 1999. Redundancy of moment conditions. Journal of Econometrics 91(1). 89–111. doi:10.1016/S0304-4076(98)00050-5.

Bunea, Florentina, Alexandre Tsybakov & Marten Wegkamp. 2007. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics 1. 169–194. doi:10.1214/07-EJS008.
Canay, Ivan A. 2011. A simple approach to quantile regression for panel data. Econometrics Journal 14(3). 368–386. doi:10.1111/j.1368-423X.2011.00349.x.

Chamberlain, Gary. 1980. Analysis of Covariance with Qualitative Data. Review of Economic Studies 47(146). 225–238.

Chamberlain, Gary. 1984. Panel Data. In Zvi Griliches & Michael D. Intriligator (eds.), Handbook of econometrics, chap. 22, 1247–1318. Amsterdam: North-Holland.

Chernozhukov, Victor, Ivan Fernandez-Val, Jinyong Hahn & Whitney Newey. 2013. Average and Quantile Effects in Nonseparable Panel Models. Econometrica 81(2). 535–580. doi:10.3982/ECTA8405.

Crepon, Bruno, Francis Kramarz & Alain Trognon. 1997. Parameters of Interest, Nuisance Parameters and Orthogonality Conditions: An Application to Autoregressive Error Component Models. Journal of Econometrics 82(1). 135–156. doi:10.1016/S0304-4076(97)00054-7.

Fan, Jianqing & Runze Li. 2001. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association 96(456). 1348–1360.

Fan, Jianqing & Jinchi Lv. 2010. A Selective Overview of Variable Selection in High Dimensional Feature Space. Statistica Sinica 20(1). 101–148.

Fan, Jianqing & Jinchi Lv. 2011. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory 57(8). 5467–5484. doi:10.1109/TIT.2011.2158486.

Fan, Yingying & Cheng Yong Tang. 2013. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B 75(3). 531–552.

Friedberg, Stephen H., Arnold J. Insel & Lawrence E. Spence. 2003. Linear algebra. 4th edn. Prentice Hall.

Gouriéroux, Christian, Alain Monfort & Alain Trognon. 1984. Pseudo maximum likelihood methods: Theory. Econometrica 52(3). 681–700.

Graham, Bryan, Jinyong Hahn & James Powell. 2009a. A Quantile Correlated Random Coefficient Panel Data Model. Manuscript.

Graham, Bryan S., Jinyong Hahn & James L. Powell. 2009b. The incidental parameter problem in a non-differentiable panel data model. Economics Letters 105(2). 181–182. doi:10.1016/j.econlet.2009.07.015.

Harding, Matthew & Carlos Lamarche. 2016. Penalized Quantile Regression with Semiparametric Correlated Effects: An Application with Heterogeneous Preferences. Journal of Applied Econometrics. doi:10.1002/jae.2520.

Hastie, Trevor, Robert Tibshirani & Martin Wainwright. 2015. Statistical Learning with Sparsity. CRC Press.

He, Xuming & Peide Shi. 1994. Convergence rate of B-spline estimators of nonparametric conditional quantile functions. Journal of Nonparametric Statistics 3(3–4). 299–308. doi:10.1080/10485259408832589.

He, Xuming & Peide Shi. 1996. Bivariate Tensor-Product B-Splines in a Partly Linear Model. Journal of Multivariate Analysis 58(2). 162–181. doi:10.1006/jmva.1996.0045.

He, Xuming, Zhong Yi Zhu & Wing Kam Fung. 2002. Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika 89(3). 579–590. doi:10.1093/biomet/89.3.579.

Horowitz, Joel L. & Sokbae Lee. 2005. Nonparametric Estimation of an Additive Quantile Regression Model. Journal of the American Statistical Association 100(472). 1238–1249. doi:10.1198/016214505000000583.

Hurwicz, Leonid. 1950. Generalization of the Concept of Identification. In Statistical inference in dynamic economic models. New York: John Wiley.

Jun, Sung Jae, Yoonseok Lee & Youngki Shin. 2016. Treatment Effects With Unobserved Heterogeneity: A Set Identification Approach. Journal of Business & Economic Statistics 34(2). 302–311. doi:10.1080/07350015.2015.1044008.

Kato, Kengo, Antonio F. Galvao Jr. & Gabriel V. Montes-Rojas. 2012. Asymptotics for panel quantile regression models with individual effects. Journal of Econometrics 170(1). 76–91. doi:10.1016/j.jeconom.2012.02.007.

Kim, Tae-Hwan & Halbert White. 2003. Estimation, Inference, and Specification Testing for Possibly Misspecified Quantile Regression. In Maximum likelihood estimation of misspecified models: Twenty years later (Advances in Econometrics 17), 107–132. Emerald Group Publishing Limited.

Kim, Yongdai & Sunghoon Kwon. 2012. Global optimality of nonconvex penalized estimators. Biometrika 99(2). 315–325. doi:10.1093/biomet/asr084.

Koenker, Roger. 2004. Quantile regression for longitudinal data. Journal of Multivariate Analysis 91(1). 74–89. doi:10.1016/j.jmva.2004.05.006.

Koenker, Roger. 2005. Quantile regression. Cambridge University Press.

Koenker, Roger & Gilbert Bassett Jr. 1982. Robust Tests for Heteroscedasticity Based on Regression Quantiles. Econometrica 50(1). 43–61. doi:10.2307/1912528.

Lamarche, Carlos. 2010. Robust penalized quantile regression estimation for panel data. Journal of Econometrics 157(2). 396–408. doi:10.1016/j.jeconom.2010.03.042.

Lee, Sokbae. 2007. Endogeneity in quantile regression models: A control function approach. Journal of Econometrics 141(2). 1131–1158. doi:10.1016/j.jeconom.2007.01.014.

Li, Qi & Jeffrey S. Racine. 2007. Nonparametric Econometrics: Theory and Practice. Princeton University Press.

Lv, Jinchi & Yingying Fan. 2009. A unified approach to model selection and sparse recovery using regularized least squares. Annals of Statistics 37(6A). 3498–3528. doi:10.1214/09-AOS683.

Matzkin, Rosa. 2007. Nonparametric identification. In Handbook of econometrics, vol. 6B. Elsevier.

Mundlak, Yair. 1978. On the pooling of time series and cross section data. Econometrica 46(1). 69–85.

Newey, Whitney K. 1987. Efficient Estimation of Limited Dependent Variable Models with Endogenous Explanatory Variables. Journal of Econometrics 36(3). 231–250. doi:10.1016/0304-4076(87)90001-7.

Newey, Whitney K. & Daniel McFadden. 1994. Large sample estimation and hypothesis testing. In Handbook of econometrics, vol. 4, chap. 36. doi:10.1016/S1573-4412(05)80005-4.

Noh, Hoh Suk & Byeong U. Park. 2010. Sparse Varying Coefficient Models for Longitudinal Data. Statistica Sinica 20. 1183–1202.

Parente, Paulo & J. M. C. Santos Silva. 2016. Quantile regression with clustered data. Journal of Econometric Methods 5(1). 1–15. doi:10.1515/jem-2014-0011.

Pollard, David. 1985. New Ways to Prove Central Limit Theorems. Econometric Theory 1(3). 295–313. doi:10.1017/S0266466600011233.

Rivers, Douglas & Quang H. Vuong. 1988. Limited information estimators and exogeneity tests for simultaneous probit models. Journal of Econometrics 39(3). 347–366. doi:10.1016/0304-4076(88)90063-2.

Rosen, Adam M. 2012. Set identification via quantile restrictions in short panels. Journal of Econometrics 166(1). 127–137. doi:10.1016/j.jeconom.2011.06.011.

Schumaker, Larry L. 2007. Spline functions: Basic theory. Cambridge University Press.

Sherwood, Ben & Lan Wang. 2016. Partially linear additive quantile regression in ultra-high dimension. Annals of Statistics 44(1). 288–317.

Tao, Pham Dinh & Le Thi Hoai An. 1997. Convex analysis approach to D.C. programming: Theory, algorithms and applications. Acta Mathematica Vietnamica 22(1). 289–355.

Tibshirani, Ryan J. 2013. The lasso problem and uniqueness. Electronic Journal of Statistics 7. 1456–1490.
doi:10.1214/13-EJS815. Turkington, Darrell A. 2013. Generalized vectorization, cross-products, and matrix calculus. Cambridge. Vershynin, R. 2010. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 http://arxiv.org/abs/1011.3027. Wang, Lan, Yichao Wu & Runze Li. 2012. Quantile Regression for Analyzing Heterogeneity in Ultra-high Dimension. Journal of the American Statistical Association 107(497). 214–222. doi:10.1080/01621459.2012.656014. http://www.pubmedcentral.nih.gov/ articlerender.fcgi?artid=3471246{&}tool=pmcentrez{&}rendertype=abstract. Welsh, A.H. 1989. On M-Processes and M-Estimation. The Annals of Statistics 17(1). 337–361. White, Halbert. 1994. Estimation, Inference and Specification Analysis. Cambridge University Press. Wooldridge, Jeffrey M. 2009. Correlated random effects models with unbalanced panels. Manuscript (version July 2009) Michigan State http://citeseerx.ist.psu.edu/viewdoc/ download?doi=10.1.1.472.4787{&}rep=rep1{&}type=pdf. Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. MIT Press 2nd edn. doi:10.1515/humr.2003.021. Wooldridge, Jeffrey M. 2014. Quasi-maximum Likelihood Estimation and Testing for Nonlinear Models with Endogenous Explanatory Variables. Journal of Econometrics 182(1). 226–234. doi: 10.1016/j.jeconom.2014.04.020. http://dx.doi.org/10.1016/j.jeconom.2014.04.020. Wooldridge, Jeffrey M. & Ying Zhu. 2016. L1-Regularized Quasi-Maximum likelihood Estimation and Inference in High-Dimensional Correlated Random Effects Probit. manuscript . Zhang, Cun-Hui. 2010. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics 38(2). 894–942. doi:10.1214/09-AOS729. Zhang, F. 2005. The Schur complement and its applications, vol. 4. Springer Science & Business Media. doi:10.1007/b105056. Zou, Hui. 2006. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101(476). 1418–1429. 
doi:10.1198/016214506000000735. 170