THREE ESSAYS IN COMPLEX SAMPLES
by
Iraj Rahmani

A DISSERTATION
Submitted
to Michigan State University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
ECONOMICS
2012

ABSTRACT
THREE ESSAYS IN COMPLEX SAMPLES
by
Iraj Rahmani
The samples used in econometric studies are not always sets of randomly drawn observations from the populations of interest. In many studies the sampling has a complex design involving clustering and stratification. In stratification, the population is divided into subpopulations, or strata, based on exogenous or endogenous variables, and then a random sample of unit observations or clusters is drawn from each stratum. Clusters are contiguous groups of units existing within a stratum. Stratification and clustering may be applied to reduce the cost of sampling or for operational convenience. In addition, particular interest in a small subpopulation may lead to oversampling, which justifies a non-random sampling scheme.
This dissertation consists of three essays addressing estimation and inference in cross-section and panel data models with non-random samples. In general, ignoring the sampling design can produce inconsistent estimators of the parameters and inconsistent estimators of their standard errors. The first essay considers a multi-stage sampling design with standard stratification and clustering in the initial stages and variable probability sampling in the final stage. The problem is studied in the M-estimator framework. Under a set of regularity conditions, the usual weighted estimators are consistent and asymptotically normal. When stratification in the first stage, the second stage, or both stages is exogenous, the corresponding weights can be dropped and the estimators remain consistent.
The second essay contributes to the literature on non-random sampling by studying efficiency in panel data models when the data come from stratified samples. The goal in this chapter is to obtain more efficient estimators by accounting for correlation within panels in models with a stratified structure. We do not try to find the efficiency bound for this kind of model; our aim is to increase efficiency relative to pooled models that ignore correlation within panels. The paper takes into account correlation within each panel and in each stratum under a GMM-based framework. Theoretical development and a Monte Carlo study show that more efficient estimators can be obtained by accounting for correlation within the panels in each stratum and combining the strata with appropriate weights. As with generalized estimating equations (GEE), we are able to specify a particular form of correlation for the panels in each stratum. Monte Carlo results confirm that the new GMM estimators, called weighted and unweighted GLS, are more efficient than their competitors, OLS and weighted OLS, which simply overlook the correlation within panels. Weighted GLS performs best under endogenous stratification, and unweighted GLS performs best under exogenous stratification. For a given sample size, the efficiency gain depends on the form chosen for the correlation and on how strong it is. We apply the results to study the determinants of inequality in the U.S., and the estimation results show that the efficiency gain relative to POLS or weighted POLS is substantial.
The subject of the third essay is the model selection problem. In complex samples involving stratification and clustering, the assumption that observations are independently and identically distributed no longer holds, and therefore Vuong's (1989) model selection tests are not directly applicable. In order to generalize Vuong's results to estimators other than MLE, we study the problem in the M-estimator framework, which covers many estimators including, but not limited to, linear and nonlinear least squares, MLE, and QMLE. The theoretical results show that, for two nonnested competing models, the asymptotic distribution of the weighted test statistic depends not on the competing estimators but on the observations, and is normal. An interesting finding is that even under exogenous stratification we cannot drop the weights in the test statistics, since for nonnested tests both competing models should be misspecified under the null. We also apply the results in two empirical studies.

Copyright by
Iraj Rahmani
2012

To my late father,
my mother,
and my brothers, Behzad, and Reza, and my sister, Maryam


ACKNOWLEDGEMENTS

It would not have been possible to write this doctoral thesis without the help and support of many
great people around me, to only some of whom it is possible to give particular mention here.
It is difficult to overstate my gratitude to my Ph.D. supervisor, Professor Jeffrey Wooldridge. With his enthusiasm, his inspiration, and his unsurpassed knowledge of econometrics and statistics, he helped me to move in the right direction. I would have been lost without him. Throughout my thesis-writing period, he provided encouragement, sound advice, good teaching and many good ideas.
I would like to thank Professors Peter Schmidt and Tim Vogelsang in the Department of Economics, and Tapabrata Maiti in the Department of Statistics and Probability, for their encouragement and support. It was a great honor to have these great scholars as my committee members. I wish to thank Emma Iglesias, who was a member of my committee, for her help and support.
Also I would like to thank Professor Hassan Mohammadi at Illinois State University and Professor Kambiz H. Kiani at Shahid Beheshti University (the National University of Iran), who encouraged me to continue my education in Iran and the U.S., for their kind assistance with writing letters and their wise advice.
My friendship with other Ph.D. students was very fruitful and I learnt many great lessons from them. I would like to thank all of them, particularly Dr. Do Won Kwak, from whom I learnt many things about Stata programming.
I received a great deal of help from the former and current Graduate Secretaries, Jennifer Carducci and Lori Jean Nichols, and from Margaret Lynch, the Office Manager of the Department of Economics, and I would like to thank them for their kindness and sincere assistance.
I owe thanks as well to Leila Ardestani, for the continuous support and encouragement that I received from her.
I wish to thank my wonderful and very kind brothers and sister for giving me their unconditional love and support. My brothers Behzad and Reza and my sister Maryam have always been fountains of sincere friendship and love. I am so thankful for having them in all stages of my life. Without their help and support, I could not have stayed so many years far from home without worrying about any issue.
Lastly, and most importantly, I wish to thank my father Rahman Rahmani and my wonderful mother Roohangiz Saadati. They were my first teachers, who taught me the most important elements of life: love, friendship, and forgiveness. Losing my father almost at the end of my first year in the Ph.D. program was a great misfortune. The pain is still fresh and his place in my heart will never be filled. He was the pillar of my life, my best friend, and a source of wisdom and great advice. In his absence, my dear mother did her best to keep me on the road, sound and firm. I cannot find suitable words to express my gratitude to her. To my parents, brothers and sister, I dedicate this thesis.


TABLE OF CONTENTS

LIST OF TABLES

CHAPTER 1 ASYMPTOTIC INFERENCE OF M-ESTIMATOR FROM MULTISTAGE SAMPLES WITH VARIABLE PROBABILITY IN FINAL STAGE
1.1 Introduction
1.2 The Population Optimization Problem
1.3 The Sampling Scheme
1.4 Estimation under Multi-stage Sampling
    1.4.1 Consistency
    1.4.2 Asymptotic Normality of the Weighted M-Estimator
    1.4.3 Estimating the Asymptotic Variance
1.5 Estimation under Exogenous stratification
    1.5.1 Consistency of the Unweighted M-Estimator
        1.5.1.1 Consistency of the Unweighted M-Estimator: Case One
        1.5.1.2 Consistency of Unweighted M-estimator: Case Two
    1.5.2 Asymptotic Normality of the Unweighted M-Estimator
1.6 Examples
1.7 Two-Step M-Estimator
1.8 Conclusion

CHAPTER 2 ASYMPTOTIC EFFICIENCY IN THE PANEL DATA MODELS WITH STRATIFIED SAMPLING
2.1 Introduction
2.2 The Moment Conditions
2.3 Sampling Scheme
2.4 Efficient estimation under moment restrictions
    2.4.1 Moment restrictions in the sample
    2.4.2 Efficient estimation
2.5 Examples
2.6 The normal linear model: A Monte Carlo investigation
2.7 Determinants of Family income in the U.S: An Empirical Application
2.8 Conclusion

CHAPTER 3 MODEL SELECTION TESTS IN COMPLEX SAMPLES
3.1 Introduction
3.2 The Nonnested Competing Models
3.3 Basic Framework under Standard Stratified Samples
3.4 Tests Statistics
    3.4.1 The Test Statistic under Standard Stratified Sampling
    3.4.2 The Test Statistic under Variable Probability Sampling
    3.4.3 Tests Statistics under Multi-Stage Sampling
3.5 Model Selection Tests in Panel Data Models
3.6 Tests Statistics and Exogenous Stratification
3.7 Empirical Examples
3.8 Conclusion

APPENDIX A PROOFS

APPENDIX B TABLES

BIBLIOGRAPHY

LIST OF TABLES

B.1 Exogenous Stratification with ρ = 0.0, 1000 replications
B.2 Exogenous Stratification with ρ = 0.1, 1000 replications
B.3 Exogenous Stratification with ρ = 0.5, 1000 replications
B.4 Exogenous Stratification with ρ = 0.9, 1000 replications
B.5 Exogenous Stratification with ρ = 0.0, 1000 replications
B.6 Exogenous Stratification with ρ = 0.1, 1000 replications
B.7 Exogenous Stratification with ρ = 0.5, 1000 replications
B.8 Exogenous Stratification with ρ = 0.9, 1000 replications
B.9 Endogenous Stratification with ρ = 0.0, 1000 replications
B.10 Endogenous Stratification with ρ = 0.1, 1000 replications
B.11 Endogenous Stratification with ρ = 0.5, 1000 replications
B.12 Endogenous Stratification with ρ = 0.9, 1000 replications
B.13 Endogenous Stratification with ρ = 0.0, 1000 replications
B.14 Endogenous Stratification with ρ = 0.1, 1000 replications
B.15 Endogenous Stratification with ρ = 0.5, 1000 replications
B.16 Endogenous Stratification with ρ = 0.9, 1000 replications
B.17 Variables Descriptions
B.18 Summary Statistics
B.19 Determinants of Family Income in the U.S

Chapter 1
ASYMPTOTIC INFERENCE OF M-ESTIMATOR FROM MULTI-STAGE SAMPLES
WITH VARIABLE PROBABILITY IN FINAL STAGE

1.1 Introduction

Economic analyses often assume that the observations come from simple random sampling, that is, that a set of independent and identically distributed (i.i.d.) observations is available. In reality, however, many data sets used in economics and other social sciences come from stratified sampling schemes that produce nonrandom samples. When a data set comes from a nonrandom sampling scheme, the i.i.d. assumption is no longer valid, and inference about the econometric model must be made more carefully.
The goal in this chapter is to examine the asymptotic properties of M-estimators when the observations come from several levels of stratification.
Three well-known stratified sampling schemes are standard stratified sampling (SS sampling), multinomial sampling, and variable probability sampling (VP sampling). In SS sampling, the population is divided into several subpopulations based on factors such as income, race, gender, education, or area of residence. A random sample is then taken within each subpopulation, or stratum, independently. The result is a sample of independent but not identically distributed observations. It should be emphasized that, unlike in simple random sampling, the proportions of observations within strata under SS sampling do not reflect population proportions as they would if the sample were selected randomly from the population.
The multinomial sampling scheme is similar to SS sampling. The difference is that in multinomial sampling a stratum is first chosen randomly, and then an observation is drawn randomly from that stratum. Although this kind of sampling is not common in practice, it is theoretically easier to deal with because it produces i.i.d. observations.
In VP sampling, which is also known as Bernoulli sampling in the literature, an observation is first drawn randomly from the population, and then its stratum is determined by the researcher. Once its stratum is determined, the observation is kept in the sample with a stratum-specific probability that is also set by the researcher. If the observation is not kept, it is returned to the population and its values are not recorded.
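To make the three schemes concrete, the following minimal Python sketch draws an SS sample, a multinomial sample, and a VP sample from a synthetic population. The population, the income cutoff of 40,000, the stratum sizes, and the retention probabilities are all illustrative assumptions and are not taken from any survey discussed in this chapter.

```python
import random

def make_population(n, rng):
    """Synthetic population: each unit has an income and a stratum (0 = low, 1 = high)."""
    pop = []
    for _ in range(n):
        income = rng.lognormvariate(10.0, 0.8)
        pop.append({"income": income, "stratum": 1 if income > 40000 else 0})
    return pop

def ss_sample(pop, n_per_stratum, rng):
    """Standard stratified sampling: a fixed-size random draw inside each stratum."""
    sample = []
    for s, n_s in n_per_stratum.items():
        members = [u for u in pop if u["stratum"] == s]
        sample.extend(rng.choices(members, k=n_s))  # drawn with replacement
    return sample

def multinomial_sample(pop, stratum_probs, n, rng):
    """Multinomial sampling: draw a stratum at random, then a unit from that stratum."""
    by_stratum = {s: [u for u in pop if u["stratum"] == s] for s in stratum_probs}
    strata = list(stratum_probs)
    probs = [stratum_probs[s] for s in strata]
    sample = []
    for _ in range(n):
        s = rng.choices(strata, weights=probs, k=1)[0]
        sample.append(rng.choice(by_stratum[s]))
    return sample

def vp_sample(pop, keep_probs, n_draws, rng):
    """Variable probability sampling: draw a unit, observe its stratum,
    then keep it with the stratum-specific probability."""
    sample = []
    for _ in range(n_draws):
        u = rng.choice(pop)
        if rng.random() < keep_probs[u["stratum"]]:
            sample.append(u)
    return sample

rng = random.Random(0)
pop = make_population(20000, rng)
ss = ss_sample(pop, {0: 300, 1: 300}, rng)
mn = multinomial_sample(pop, {0: 0.5, 1: 0.5}, 600, rng)
vp = vp_sample(pop, {0: 0.2, 1: 1.0}, 2000, rng)
```

In this sketch the SS sample has fixed stratum counts by construction, whereas the VP sample size is random because it depends on how many draws are retained.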
The SS sampling scheme is often used when observations from each stratum are easily identified before sampling. The variable probability sampling scheme is more suitable when the stratum of an observation is known only after sampling. For example, determining a family's income bracket is difficult before sampling, and therefore VP sampling is used.
In general, stratification can be based on the dependent variable or variables, on explanatory variables, or on both. Dividing the population of interest in terms of explanatory variables is called exogenous stratification. Stratification is endogenous if we define subpopulations with respect to dependent variables. Whether stratification is exogenous or endogenous can be determined only after an econometric model is defined. In other words, specifying the model comes first, and the discussion of the appropriate sampling scheme follows.
A review of the literature shows that the subject has been studied by both statisticians and econometricians. In summary, stratification based on exogenous variables does not produce serious problems; one can ignore it and still obtain consistent estimates of the population parameters. In this line of research, DuMouchel and Duncan (1983) confirm this statement in a linear model under SS sampling. Manski and McFadden (1981) show that it holds in maximum likelihood estimation when the data come from multinomial sampling. Wooldridge (1999, 2001) shows that the same result holds for VP and SS sampling in the framework of M-estimators.
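The role of exogeneity can be illustrated by simulation. The sketch below assumes a hypothetical population model y = 1 + 2x + u with standard normal x and u, and VP sampling with retention probabilities 0.1 and 1.0. It compares unweighted and inverse-probability-weighted least squares slopes when strata are defined by x (exogenous) and by y (endogenous). None of these design values come from the cited papers.

```python
import random

def ols_slope(x, y, w=None):
    """Slope from a (weighted) least-squares fit of y on a constant and x."""
    if w is None:
        w = [1.0] * len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx

def vp_draw(n_draws, stratum_of, keep_prob, rng):
    """VP sampling from the assumed population y = 1 + 2x + u.
    Returns the retained x, y and the weights 1 / p_j."""
    xs, ys, ws = [], [], []
    for _ in range(n_draws):
        x = rng.gauss(0.0, 1.0)
        y = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)
        p = keep_prob[stratum_of(x, y)]
        if rng.random() < p:
            xs.append(x)
            ys.append(y)
            ws.append(1.0 / p)
    return xs, ys, ws

rng = random.Random(42)
keep = {0: 0.1, 1: 1.0}

# Exogenous stratification: strata defined by the regressor x.
x, y, w = vp_draw(200000, lambda x, y: 1 if x > 0 else 0, keep, rng)
exo_unweighted = ols_slope(x, y)
exo_weighted = ols_slope(x, y, w)

# Endogenous stratification: strata defined by the outcome y.
x, y, w = vp_draw(200000, lambda x, y: 1 if y > 1 else 0, keep, rng)
endo_unweighted = ols_slope(x, y)
endo_weighted = ols_slope(x, y, w)
```

Under these assumptions the unweighted slope should stay close to 2 when strata depend only on x and should move away from 2 when strata depend on y, while the weighted slope should stay close to 2 in both designs.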
In practice, combinations of these sampling methods are also commonly used. For instance, the Panel Study of Income Dynamics (PSID) involves stratification and clustering. Bhattacharya (2005) describes a multi-stage sampling design in which SS sampling is used at the first level to choose some clusters in each stratum by simple random sampling, and then a few observations are chosen from each sampled cluster, again by simple random sampling. In this scheme clusters are defined as contiguous groups of units existing within a stratum. For example, in rural areas villages can be considered clusters, and in urban areas the clusters are blocks or neighborhoods; in both examples the unit observations are households.
Bhattacharya (2005) derives the asymptotic properties of estimators in a GMM framework when the data come from surveys whose designs involve stratification and clustering. In a setup similar to Bhattacharya's multi-stage sampling, Wooldridge (2008) derives the asymptotic variance of estimators in linear models.
The goal in this chapter, as already mentioned, is to investigate the asymptotic properties of estimators when the data come from multi-stage sampling. The design is closely related to the sampling scheme of Bhattacharya (2005), with one distinction: we add variable probability sampling in the final stage and then develop an M-estimator framework for asymptotic inference with data from surveys that have multiple levels of stratification and a clustering structure. The setup is general enough to cover linear and nonlinear models as well as maximum likelihood. This kind of sampling design is used in many surveys in practice, particularly those that involve phone interviews.
An example of a large-scale survey whose structure is very similar to the sampling scheme considered in this study is the National Survey of Families and Households (NSFH). The NSFH is a complex survey sample that involved five sampling stages. In the first stage of this national multi-stage sampling design, 100 primary sampling units were drawn from a list of all counties in the nation, which had been stratified into two groups. The first stratum consists of 18 self-representing areas, composed of the largest metropolitan areas, that make up 36% of the nation's population, and the second stratum contains the rest of the country. From the first stratum, 36 primary sampling units were drawn with certainty. The second stratum, which makes up 64% of the nation, was divided into 32 strata, and two primary sampling units were drawn from each stratum using probability-proportional-to-size sampling.
In the second stage, an average of 17 block groups or enumeration districts were randomly selected from each primary sampling unit. Within each of these districts, a list of 45 or more households was selected. These households were given a short screening interview to allow oversampling of certain groups of interest, such as African Americans and cohabiting couples. Members of these groups in the cluster were selected with certainty, and others were selected at a lower rate. In the final stage, an adult from each household was randomly chosen as the eligible respondent. In the end, 20 of the 45 or more households in each district or cluster were included in the sample. Substitutions were not allowed. The sample size in this survey was 13,007 primary respondents. The survey contains 1,700 clusters, with an average of 7.6 respondents per cluster, so it has many clusters of small size. For more detail see Johnson and Elliott (1998).
The rest of the paper is organized as follows. The next section presents the population optimization problem and the basic framework. The sampling scheme and the sample objective function are explained in Section 3. Section 4 discusses consistency and asymptotic normality of the weighted M-estimator under multi-stage sampling. We introduce theorems that summarize the conditions needed for the weighted estimators to be consistent and asymptotically normal. Section 4 also studies estimation of the asymptotic variance of M-estimators. Section 5 discusses estimation under exogenous stratification. Under exogenous stratification in our model, where more than one level of stratification exists, three cases can be distinguished; however, we consider only the first two. In Section 5, four theorems are presented that cover consistency and the asymptotic distribution of M-estimators under exogenous stratification. Section 6 presents four examples. Section 7 discusses the two-step M-estimator. Section 8, the last section, reviews the main findings of the paper.

1.2 The Population Optimization Problem

Our goal is to estimate a $P \times 1$ parameter vector $\theta$ that solves the population problem
\[
\min_{\theta \in \Theta} E\left[ q(W, \theta) \right] \tag{1.1}
\]

where $E[\cdot]$ denotes the expectation with respect to the true distribution of $W$, and $\Theta$ is the parameter space, a subset of the Euclidean space $\mathbb{R}^P$. The population objective function is denoted $q(W, \theta)$, a function of $W$ and $\theta$. $W$ is an $M \times 1$ random vector taking values in $\mathcal{W}$, where $\mathcal{W}$ is a subset of $\mathbb{R}^M$. We assume that there exists a unique solution $\theta^\circ \in \Theta$ that minimizes the population problem (1.1).
When $q(\cdot)$ corresponds to a correctly specified model, $\theta^\circ$ is the true parameter value, which uniquely minimizes (1.1). In misspecified cases, where $q(\cdot)$ does not correspond to a correct model, there is no true value of $\theta$. In these cases it is standard to define $\theta^\circ$ as the unique solution to (1.1).
We are usually interested in explaining a $K \times 1$ random vector $Y$ conditional on an $L \times 1$ vector of explanatory variables $X$, for example through $E(Y \mid X)$. Here $K + L = M$ and $(X, Y) = W$. The random vectors $X$ and $Y$ take values in subsets $\mathcal{X}$ and $\mathcal{Y}$, respectively, where $\mathcal{X} \subset \mathbb{R}^L$ and $\mathcal{Y} \subset \mathbb{R}^K$, and $W$ takes values in $\mathcal{W} \subseteq \mathcal{X} \times \mathcal{Y}$. The framework is general enough to cover panel data models with a large cross-section dimension and a small number of time periods $T$.
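As a concrete instance of problem (1.1), consider linear least squares with a scalar outcome. This special case is an illustrative assumption chosen for exposition (it sets $K = 1$ and $P = L$ and assumes $E(XX')$ is nonsingular); it is not one of the examples of this chapter.

```latex
\begin{align*}
q(W, \theta) &= (Y - X'\theta)^2, \qquad W = (X, Y), \\
\theta^\circ &= \arg\min_{\theta \in \Theta} E\left[ (Y - X'\theta)^2 \right]
              = \left[ E(XX') \right]^{-1} E(XY).
\end{align*}
```

If $E(Y \mid X) = X'\theta^\circ$, the model is correctly specified and $\theta^\circ$ is the true parameter. Otherwise $\theta^\circ$ is still well defined as the coefficient vector of the linear projection of $Y$ on $X$, which is the sense in which (1.1) identifies a parameter under misspecification.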

1.3 The Sampling Scheme

The sample design is a combination of standard stratification, clustering, and variable probability sampling. First, in accordance with SS sampling, the population is divided into $S$ first-stage strata that are non-overlapping and exhaustive. In this stage, stratification can be based on one or more variables, such as area of residence or race, that allow us to divide the population easily. Each stratum $s$ contains a mass of $C_s$ clusters. For example, in rural areas these clusters are villages, and in urban areas they are blocks or neighborhoods. In the next step, $N_s$ clusters are drawn randomly with replacement from each stratum $s$. Since this study relies on large-sample approximations, the with-replacement assumption is not important if the number of sampled clusters, $N_s$, is “large”. Each sampled cluster $c$ from stratum $s$ contains a finite population of $M_{sc}$ households, or units of observation. An observation (household) is selected at random from sampled cluster $c$ in stratum $s$. In the next stage, the selected household is classified into one of several non-overlapping and exhaustive strata of interest based on, for example, the level of income. The household is retained in the sample with a probability that is set by the practitioner. As already mentioned, this second-stage sampling is called variable probability sampling. The process is repeated for $K$ unit observations (a small constant) in each sampled cluster $c$ in stratum $s$, and a sample of $K_{sc}$ households is obtained, where $1 \le K_{sc} \le K$.

In practice a ﬁxed and large number of clusters Ns are sampled randomly within each stratum
s, and then within each sampled cluster, a small and ﬁxed number of households are sampled
randomly.
We can summarize the sampling design as follows:
i The population is divided into $S$ non-overlapping and exhaustive first-stage strata based on criteria such as area of residence, race, or age.
ii Stratum $s$ contains $C_s$ clusters.
iii From each stratum $s$, $N_s$ clusters are drawn randomly with replacement.
iv Each sampled cluster $c_s$ from stratum $s$ contains a finite population of $M_{sc}$ units (for example, households).
v A household is selected at random from sampled cluster $c_s$ in stratum $s$.
vi The household is classified into one of the non-overlapping and exhaustive strata of interest (for example, by income level).
vii The household is retained in the sample with a probability that depends on its stratum of interest and is determined by the researcher.
viii The process is repeated for $K$ households in each sampled cluster $c_s$ in stratum $s$, and a sample of $K_{sc}$ households is obtained.
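Steps (iii) to (viii) can be sketched in Python as follows. The function name, the data layout, the toy population of household incomes, and all numerical settings are hypothetical choices made for illustration. The sketch does not enforce a minimum number of retained households per cluster.

```python
import random

def draw_multistage_sample(strata, n_clusters, keep_prob, stratum_of, k_draws, rng):
    """Sample clusters with replacement in each first-stage stratum, then apply
    variable probability sampling to K draws per sampled cluster.

    strata     : dict s -> list of clusters; each cluster is a list of units
    n_clusters : dict s -> N_s, number of clusters drawn from stratum s
    keep_prob  : dict j -> p_j, retention probability for second-stage stratum j
    stratum_of : function mapping a unit to its second-stage stratum j
    k_draws    : K, number of units drawn from each sampled cluster
    """
    sample = []
    for s, clusters in strata.items():
        for c in rng.choices(range(len(clusters)), k=n_clusters[s]):  # step iii
            units = clusters[c]                                       # step iv
            kept = []
            for _ in range(k_draws):                                  # step viii
                unit = rng.choice(units)                              # step v
                j = stratum_of(unit)                                  # step vi
                if rng.random() < keep_prob[j]:                       # step vii
                    kept.append((unit, j))
            sample.append({"stratum": s, "cluster": c,
                           "M_sc": len(units), "kept": kept})
    return sample

rng = random.Random(1)
# Hypothetical population: two first-stage strata whose clusters hold household incomes.
strata = {
    "rural": [[rng.gauss(30, 8) for _ in range(rng.randint(20, 40))] for _ in range(50)],
    "urban": [[rng.gauss(50, 15) for _ in range(rng.randint(30, 60))] for _ in range(80)],
}
sample = draw_multistage_sample(
    strata,
    n_clusters={"rural": 10, "urban": 15},
    keep_prob={"low": 1.0, "high": 0.3},
    stratum_of=lambda income: "low" if income < 35 else "high",
    k_draws=5,
    rng=rng,
)
```

Each record keeps the cluster size `M_sc` and the retained units, which is the information needed later to form cluster-level weights.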
In line with the structure of most surveys in practice, and following Bhattacharya (2005), two assumptions are made to study asymptotic inference in the model. The first assumption is that the number of clusters $N$ goes to infinity while the number of households within each cluster stays fixed and finite. The second assumption is that clusters are independent within a stratum, but household-level variables are correlated within each cluster. Therefore, for a given stratum $s$, clusters are independently but not identically distributed.

Under this sampling scheme, clusters are chosen by simple random sampling within each stratum $s$ independently. In the second step, unit observations (households) are chosen by variable probability sampling. Therefore the sample optimization problem is
\[
\min_{\theta \in \Theta} \frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}\, p_j^{-1}\, r_{jm}\, z_{jm}\, q(W_{scm}, \theta) \tag{1.2}
\]
Here $N = N_1 + N_2 + \dots + N_S$ and $v_{sc} = \dfrac{C_s}{N_s / N} \cdot \dfrac{M_{sc}}{K_{sc}}$. In the sample problem (1.2), $r_{jm}$ is an indicator variable that takes the value one if $W_m$ is in stratum $j$ and zero otherwise. $z_{jm}$ is also an indicator variable; it takes the value one if $W_m$ is kept in the sample and zero if not, so that $P(z_{jm} = 1) = p_j$. In order to study the asymptotic properties of the M-estimator, we also assume that the ratio of sampled clusters in each stratum $s$ to total sampled clusters, $a_s = N_s / N$, is constant, and therefore $\sum_{s=1}^{S} a_s = 1$. We need this assumption in order to limit the range of fluctuations of the weights $v_{sc}$.
If we re-index the clusters as $i = 1, \dots, N$ and define a new indicator variable $y_{is}$ that equals one if cluster $i$ is from stratum $s$ ($i \in s$) and zero otherwise, then the optimization problem is
\[
\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{j=1}^{J} \sum_{m=1}^{K} y_{is}\, v_{is}\, p_j^{-1}\, r_{jm}\, z_{jm}\, q(W_{ism}, \theta) \tag{1.3}
\]
The weighted M-estimator $\hat{\theta}_w$ minimizes (1.2) over the parameter space $\Theta$. $v_{sc}$ and $p_j^{-1}$ are the weights corresponding to the first level (SS sampling) and the second level (VP sampling) of stratification, respectively. The inner summation in (1.2) is over all potential observations that would appear in a random sample. The sample objective function weights each sampled observation unit (a household, for example) by the product of the two weights corresponding to the two levels of stratification, $v_{sc} \cdot p_j^{-1}$. Note that all sampled observations from the same sampled cluster receive the same first-level weight $v_{sc}$.
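The weighted objective can be sketched numerically. In the Python example below, the first-level weight `v` and the retention probability `p` of each retained observation are taken as given numbers, and $q$ is assumed to be the squared error $q(w, \theta) = (w - \theta)^2$ with a scalar $\theta$. These choices, the data, and the helper names are illustrative assumptions rather than part of the text.

```python
def weighted_objective(theta, obs, n_clusters, q):
    """Sample objective in the spirit of (1.2): each retained unit contributes
    v * (1 / p) * q(w, theta). Discarded units (z = 0) are simply absent."""
    return sum(o["v"] / o["p"] * q(o["w"], theta) for o in obs) / n_clusters

def argmin_scalar(f, lo, hi, tol=1e-8):
    """Golden-section search for the minimizer of a unimodal scalar function."""
    g = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2

# Hypothetical retained observations: value w, first-level weight v, retention probability p.
obs = [
    {"w": 10.0, "v": 2.0, "p": 1.0},
    {"w": 20.0, "v": 2.0, "p": 0.5},
    {"w": 40.0, "v": 3.0, "p": 0.25},
]
squared_error = lambda w, theta: (w - theta) ** 2

theta_hat = argmin_scalar(
    lambda t: weighted_objective(t, obs, n_clusters=3, q=squared_error), 0.0, 100.0
)

# For squared error the minimizer has a closed form: the weighted mean with weights v / p.
weights = [o["v"] / o["p"] for o in obs]
closed_form = sum(wt * o["w"] for wt, o in zip(weights, obs)) / sum(weights)
```

For this choice of $q$ the numerical minimizer should agree with the weighted mean, which shows how the product of the two weights enters the estimator.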


1.4 Estimation under Multi-stage Sampling

1.4.1 Consistency

In order to study the consistency of the weighted M-estimator defined by (1.2), it is assumed that the parameter vector $\theta^\circ$ uniquely solves the population problem (1.1),
\[
\min_{\theta \in \Theta} E\left[ q(W, \theta) \right]
\]
Moreover, we need to show that uniform convergence in probability holds. It is assumed that the function $q(\cdot)$ satisfies some regularity conditions. We summarize these requirements for consistency of the weighted M-estimator in the following theorem.
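Theorem 1.4.1. Let $W \in \mathcal{W}$ be a random vector, where $\mathcal{W} \subset \mathbb{R}^M$ and $\Theta \subset \mathbb{R}^P$, and let $q : \mathcal{W} \times \Theta \to \mathbb{R}$ be a real-valued function. If
1. $\Theta$ is a compact set.
2. $v_{sc} \cdot p_j^{-1} > 0$ for all clusters and strata, $s = 1, \dots, S$, $j = 1, \dots, J$, $c = 1, \dots, N_s$.
3. For each $\theta \in \Theta$, $q(W, \theta)$ is Borel measurable on $\mathcal{W}$.
4. For each $w \in \mathcal{W}$, $q(w, \theta)$ is continuous on $\Theta$.
5. $|q(w, \theta)| \le b(w)$ for all $\theta \in \Theta$, where $b$ is a nonnegative function on $\mathcal{W}$ such that $E[b(W)] < \infty$.
6. $\theta^\circ$ uniquely solves the population problem (1.1).
Then the uniform weak law of large numbers holds, and $\hat{\theta}_w \xrightarrow{p} \theta^\circ$ as $N \to \infty$.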
Proof. For each cluster $i$ in stratum $s$ define
\[
g(W_m, \theta) = \sum_{j=1}^{J} \sum_{m=1}^{K} v\, p_j^{-1}\, r_{jm}\, z_{jm}\, q(W_m, \theta) \tag{1.4}
\]

In this function the weight $v$ is a random variable, since the number of observations retained in the final stage, $K_{is}$, is random. In fact $v$ is a function of $z_{jm}$, the indicator variable that shows whether the observation randomly drawn in the final stage is kept in the sample or discarded. Therefore we can treat $v \cdot z_{jm}$ as a two-valued random variable,
\[
v \cdot z_{jm} =
\begin{cases}
v & \text{if } z_{jm} = 1, \\
0 & \text{otherwise,}
\end{cases}
\]
whose probability function is
\[
f(v \cdot z_{jm}) =
\begin{cases}
p_j & \text{if } v \cdot z_{jm} = v, \\
1 - p_j & \text{if } v \cdot z_{jm} = 0.
\end{cases}
\]

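Therefore the expected value of (1.4) is
\[
E\left[ g(W_m, \theta) \right] = \sum_{j=1}^{J} \sum_{m=1}^{K} p_j^{-1} E\left[ v\, r_{jm}\, z_{jm}\, q(W_m, \theta) \right] \tag{1.5}
\]
Since $v \cdot z_{jm}$ is independent of $r_{jm}$, the right-hand side of (1.5) is equal to
\[
\sum_{j=1}^{J} \sum_{m=1}^{K} p_j^{-1} E(v z_{jm})\, E\left[ r_{jm}\, q(W_m, \theta) \right]
= \sum_{j=1}^{J} \sum_{m=1}^{K_{sc}} p_j^{-1} p_j\, v\, E\left[ r_{jm}\, q(W_m, \theta) \right]
\]
which can be simplified to
\[
E\left[ \sum_{m=1}^{K_{sc}} \sum_{j=1}^{J} v\, r_{jm}\, q(W_m, \theta) \right]
= E\left[ \sum_{m=1}^{K_{sc}} v\, q(W_m, \theta) \right]
\]
The last equality holds because $\sum_{j=1}^{J} r_{jm} = 1$. Therefore the expected value of (1.4) is equal to
\[
M_{is} \cdot E\left[ q(W, \theta) \right] \tag{1.6}
\]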

$M_{is}$ is the population number of observations in cluster $i$ of stratum $s$; it is therefore a constant and does not affect estimation and inference. By assumption (6) of Theorem 1.4.1, $\theta^\circ$ uniquely solves the population problem (1.1), and so it is also the unique minimizer of (1.6). Next we need to show that (1.4) satisfies the uniform law of large numbers for each stratum $s$. By assumption (4) of Theorem 1.4.1, $q(\cdot)$ is a continuous function on $\Theta$ for each $w \in \mathcal{W}$, and therefore $g(\cdot)$ defined by (1.4) has the same property. $g(\cdot)$ is also bounded, because
\[
|g(W, \theta)| = \left| \sum_{j=1}^{J} \sum_{m=1}^{K} p_j^{-1}\, r_{jm}\, z_{jm}\, q(W, \theta) \right| \le C \cdot |q(W, \theta)| \le C \cdot b(W)
\]
by assumption (5), where $C = \max(p_1^{-1}, p_2^{-1}, \dots, p_J^{-1})$. This completes the proof.
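The key step of the proof, that weighting by $p_j^{-1}$ offsets retention with probability $p_j$, can be checked by simulation. The Python sketch below uses a hypothetical finite population of exponential values, a cutoff of 10, and retention probabilities 0.2 and 1.0; all of these are illustrative assumptions. It compares the inverse-probability-weighted average over all draws with the naive average of the retained values.

```python
import random

def vp_estimates(values, stratum_of, keep_prob, n_draws, rng):
    """Draw from the population with VP sampling and return two estimates of the
    population mean: the average of z * q / p_j over all draws, and the plain
    average of the retained values."""
    ipw_total, kept = 0.0, []
    for _ in range(n_draws):
        q = rng.choice(values)
        p = keep_prob[stratum_of(q)]
        if rng.random() < p:       # z = 1: the draw is retained
            ipw_total += q / p     # weighted contribution p_j^{-1} * q
            kept.append(q)
    return ipw_total / n_draws, sum(kept) / len(kept)

rng = random.Random(7)
values = [rng.expovariate(0.1) for _ in range(5000)]  # hypothetical population
true_mean = sum(values) / len(values)

ipw, naive = vp_estimates(
    values,
    stratum_of=lambda q: "low" if q < 10 else "high",
    keep_prob={"low": 0.2, "high": 1.0},
    n_draws=200000,
    rng=rng,
)
```

In this design the weighted average should be close to the population mean, whereas the naive average of retained values should overstate it because high values are retained more often.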

1.4.2 Asymptotic Normality of the Weighted M-Estimator

In order to show that the weighted M-estimator is asymptotically normally distributed, the conditions for consistency in Theorem 1.4.1 are not enough, and additional assumptions are needed. Theorem 1.4.2 lists the additional assumptions that imply asymptotic normality of the weighted M-estimator.
Theorem 1.4.2. In addition to the conditions of Theorem 1.4.1, if
7. $\theta^\circ$ is in the interior of $\Theta$, that is, $\theta^\circ \in \mathrm{int}(\Theta)$.
8. $s(W, \theta)$, the score of the objective function, is continuously differentiable on $\mathrm{int}(\Theta)$.
9. Each element of the Hessian matrix $H(W, \theta)$ is bounded in absolute value by a function $b(W)$, where $E[b(W)] < \infty$.
10. $A_w = E\left[ \nabla_\theta^2 q(W, \theta) \right]$ is nonsingular.
11. $E[s(W, \theta^\circ)] = 0$ and each element of $s(W, \theta)$ has finite second moment.
Then
\[
\sqrt{N} \left( \hat{\theta}_w - \theta^\circ \right) \xrightarrow{d} \mathrm{Normal}\left( 0, A_w^{-1} B_w A_w^{-1} \right) \tag{1.7}
\]

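Here $B_w$ is
\[
\begin{aligned}
B_w ={} & \sum_{s=1}^{S} E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} v^2 p_j^{-2}\, r_{jm}\, z_{jm}\, \nabla_\theta q(W, \theta^\circ)\, \nabla_\theta q(W, \theta^\circ)' \right] \\
& + \sum_{s=1}^{S} E\left[ \sum_{j=1}^{J} \sum_{j'=1}^{J} \sum_{m=1}^{K} \sum_{t \neq m}^{K} v^2 p_j^{-1} p_{j'}^{-1}\, r_{jm}\, r_{j't}\, z_{jm}\, z_{j't}\, \nabla_\theta q(W, \theta^\circ)\, \nabla_\theta q(W, \theta^\circ)' \right] \\
& - \sum_{s=1}^{S} E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} v\, p_j^{-1}\, r_{jm}\, z_{jm}\, \nabla_\theta q(W, \theta^\circ) \right] \cdot E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} v\, p_j^{-1}\, r_{jm}\, z_{jm}\, \nabla_\theta q(W, \theta^\circ) \right]'
\end{aligned}
\tag{1.8}
\]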

Proof. The score of the objective function in each stratum $s$ is
\[
s_{cs}(\theta) = \nabla_\theta g_{cs}(W_m, \theta) = \sum_{j=1}^{J} \sum_{m=1}^{K} v \cdot p_j^{-1} r_{jm} z_{jm} \nabla_\theta q(W_{mcs}, \theta) \tag{1.9}
\]

Because the clusters form an independent sequence within each stratum $s$ by assumption, we can apply the central limit theorem to the sampled clusters within each stratum. Therefore
\[
N_s^{-1/2} \sum_{c=1}^{N_s} \left[ s_{cs}(\theta^\circ) - E\left(s_s(\theta^\circ)\right) \right] \xrightarrow{d} \mathrm{Normal}(0, B_s) \tag{1.10}
\]

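In (1.10), $E[s_s(\theta^\circ)] = E\left[\nabla_\theta g(W, \theta^\circ) \mid W \in \mathcal{W}_s\right]$. $B_s$ is the variance of the score function in stratum $s$ and is equal to
\[
\begin{aligned}
B_s ={}& \mathrm{var}[s_{cs}(\theta^\circ)] = \mathrm{var}[\nabla_\theta g_{cs}(W_m, \theta^\circ)] \\
={}& \mathrm{var}\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} v\, p_j^{-1} r_{jm} z_{jm} \nabla_\theta q(W_{mcs}, \theta^\circ) \right] \\
={}& E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} v^2 p_j^{-2} r_{jm} z_{jm} \nabla_\theta q(W_{mcs}, \theta^\circ) \nabla_\theta q(W_{mcs}, \theta^\circ)' \right] \\
&+ E\left[ \sum_{j=1}^{J} \sum_{j'=1}^{J} \sum_{m=1}^{K} \sum_{t \neq m}^{K} v^2 p_j^{-1} p_{j'}^{-1} r_{jm} r_{j't} z_{jm} z_{j't} \nabla_\theta q(W_{mcs}, \theta^\circ) \nabla_\theta q(W_{mcs}, \theta^\circ)' \right] \\
&- E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} v\, p_j^{-1} r_{jm} z_{jm} \nabla_\theta q(W_{mcs}, \theta^\circ) \right] \cdot E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} v\, p_j^{-1} r_{jm} z_{jm} \nabla_\theta q(W_{mcs}, \theta^\circ) \right]'
\end{aligned}
\tag{1.11}
\]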

The variance of the score function consists of three terms. The first term in (1.11) is simply the variance of the score when a simple random sample is in hand. In other words, the first part is the correct variance if i.i.d. observations are available. The second and third terms in (1.11) arise from the sample design. The second term measures the cluster effect and accounts for correlation within clusters. This term is positive in most cases, and it is substantial if the degree of correlation between the observations inside a single cluster is high and/or $K$, the number of observations sampled from each cluster, increases. The third part captures the stratum effect. It is negative and therefore reduces the size of the variance.
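As a rough numerical illustration of the cluster term, the sketch below assumes an equicorrelated cluster: every pair of units in a cluster shares the same correlation `rho`, and each unit has variance `sigma2`. The function names and the equicorrelation structure are assumptions made for illustration only; they are not part of the derivation. The sketch compares the variance of a cluster total with the value obtained by treating the units as independent.

```python
import numpy as np

def cluster_total_variance(K, sigma2, rho):
    """Variance of the sum of K equicorrelated units (illustrative assumption)."""
    # Covariance matrix: sigma2 on the diagonal, rho * sigma2 off the diagonal.
    cov = sigma2 * (rho * np.ones((K, K)) + (1.0 - rho) * np.eye(K))
    ones = np.ones(K)
    return float(ones @ cov @ ones)

def iid_total_variance(K, sigma2):
    """Variance of the sum of K independent units with variance sigma2."""
    return K * sigma2

# The ratio grows with both the cluster size K and the correlation rho.
for K in (2, 5, 20):
    ratio = cluster_total_variance(K, 1.0, 0.3) / iid_total_variance(K, 1.0)
    print(K, ratio)
```

Under this assumed structure the ratio is $1 + (K - 1)\rho$, which makes concrete why the cluster term matters more when within-cluster correlation is high or when many units are sampled per cluster.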
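We also obtain the following important equality by using (1.6) in Theorem (1.3.1):
\[
\sum_{s=1}^{S} E\left[\nabla_\theta g_{cs}(W_m, \theta^\circ)\right] = \sum_{s=1}^{S} E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} v_{cs}\, p_j^{-1} r_{jm} z_{jm} \nabla_\theta q(W_{csm}, \theta^\circ) \right] \equiv 0 \tag{1.12}
\]
Using (1.12), the score of the objective function, multiplied by $\sqrt{N}$, can be written as
\[
N^{-1/2} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \nabla_\theta g_{cs}(W_m, \theta^\circ) = N^{-1/2} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \left\{ \nabla_\theta g_{cs}(W_m, \theta^\circ) - E\left[\nabla_\theta g_{cs}(W_m, \theta^\circ)\right] \right\} \tag{1.13}
\]
Because the sampled clusters are also independent across strata by assumption, (1.13) has an asymptotic normal distribution with mean zero and variance $B_w$. Combined with the usual mean value expansion of the first-order condition, this gives the asymptotic variance $A_w^{-1} B_w A_w^{-1}$ in (1.7).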

1.4.3 Estimating the Asymptotic Variance

Obtaining a consistent estimator of the asymptotic variance of $\sqrt{N}(\hat{\theta}_w - \theta^\circ)$ is fairly straightforward. First, we need a consistent estimator of the Hessian matrix $A_w$. It is the second-order partial derivative of (1.4), summed over all strata:
\[
A_w = \sum_{s=1}^{S} E\left[\nabla^2_\theta g_{cs}(W_m, \theta^\circ)\right] = \sum_{s=1}^{S} E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} v\, p_j^{-1} r_{jm} z_{jm} \nabla^2_\theta q(W, \theta^\circ) \right] \tag{1.14}
\]

By Lemma (4.3) in Newey and McFadden (1994) and under the assumptions of Theorem (1.3.2), a consistent estimator of $A_w$ is
\[
\hat{A}_w = N^{-1} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{is}\, p_j^{-1} y_{is} r_{jm} z_{jm} \nabla^2_\theta q(w_{ism}, \hat{\theta}_w)
\]

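As in Wooldridge (2010), we assume that the elements of $\nabla_\theta q(W, \theta) \nabla_\theta q(W, \theta)'$ are bounded in absolute value by a function with finite expectation in order to obtain a consistent estimator of $B_w$. Then a consistent estimator of $B_w$ is
\[
\begin{aligned}
\hat{B}_w ={}& N^{-1} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{is}^2\, p_j^{-2} y_{is} r_{jm} z_{jm} \nabla_\theta q(\hat{\theta}_w) \nabla_\theta q(\hat{\theta}_w)' \\
&+ N^{-1} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{j=1}^{J} \sum_{j'=1}^{J} \sum_{m=1}^{K} \sum_{t \neq m}^{K} v_{is}^2\, p_j^{-1} p_{j'}^{-1} y_{is} r_{jm} r_{j't} z_{jm} z_{j't} \nabla_\theta q(\hat{\theta}_w) \nabla_\theta q(\hat{\theta}_w)' \\
&- \sum_{s=1}^{S} \frac{1}{N} \left[ \sum_{i=1}^{N} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{is}\, p_j^{-1} y_{is} r_{jm} z_{jm} \nabla_\theta q(\hat{\theta}_w) \right] \cdot \left[ \sum_{i=1}^{N} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{is}\, p_j^{-1} y_{is} r_{jm} z_{jm} \nabla_\theta q(\hat{\theta}_w) \right]'
\end{aligned}
\]
Here $\nabla_\theta q(\hat{\theta}_w) \equiv \nabla_\theta q(w_{ism}, \hat{\theta}_w)$.
Therefore the estimate of the asymptotic variance of $\hat{\theta}_w$ is
\[
\widehat{\mathrm{Avar}}(\hat{\theta}_w) = \hat{A}_w^{-1} \hat{B}_w \hat{A}_w^{-1} / N \tag{1.15}
\]
The diagonal elements of (1.15) are the asymptotic variances of the estimated parameters.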
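The final step in (1.15) is a plain matrix computation once $\hat{A}_w$ and $\hat{B}_w$ are available. The following minimal sketch shows that step only; the function name `sandwich_avar` and the numerical matrices are hypothetical and chosen purely for illustration.

```python
import numpy as np

def sandwich_avar(A_hat, B_hat, N):
    """Return A^{-1} B A^{-1} / N as in (1.15), plus the implied standard errors."""
    A_hat = np.asarray(A_hat, dtype=float)
    B_hat = np.asarray(B_hat, dtype=float)
    A_inv = np.linalg.inv(A_hat)
    avar = A_inv @ B_hat @ A_inv / N
    # Standard errors are the square roots of the diagonal elements.
    std_err = np.sqrt(np.diag(avar))
    return avar, std_err

# Hypothetical two-parameter example with made-up matrices.
A_hat = np.array([[2.0, 0.0], [0.0, 2.0]])
B_hat = np.array([[8.0, 0.0], [0.0, 8.0]])
avar, std_err = sandwich_avar(A_hat, B_hat, N=100)
```

The sketch inverts $\hat{A}_w$ explicitly for readability; for poorly conditioned Hessians a linear solve would be a reasonable alternative.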

1.5 Estimation under Exogenous Stratification

Partition $w$ as $(x, y)$. Dividing the population of interest purely on the basis of $x$, in a model that is meant to explain the distribution of $Y$ given $x$ (for example $E[Y \mid X = x]$), is called exogenous stratification. In multi-stage sampling, exogenous stratification can be applied at any stage of sampling.
In the sampling scheme described in section 2, there are two levels of stratification. Standard stratified sampling is used in the first level and variable probability sampling in the second level. Stratification in each level can be endogenous or exogenous, so three possibilities can be distinguished when at least one level of stratification is exogenous. In case one, both levels of stratification are exogenous. In case two, the first level of stratification is exogenous but the second level is endogenous. In case three, the first level of stratification is endogenous and the second level is exogenous. Since case three is very unlikely to be used in practice, we limit our study to cases one and two.

1.5.1 Consistency of the Unweighted M-Estimator

Assume $W$ is partitioned as $(X, Y)$. Then under exogenous stratification the population problem is
\[
\min_{\theta \in \Theta} E[q(W, \theta) \mid X] \tag{1.16}
\]
Our analysis of the weighted estimator in the previous section applies with or without exogenous stratification. However, weighting the observations is no longer necessary in the exogenous case: an unweighted estimator is also consistent.

1.5.1.1 Consistency of the Unweighted M-Estimator: Case One

As mentioned above, in case one both levels of stratification are exogenous. The unweighted estimator solves the sample problem
\[
\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{j=1}^{J} \sum_{m=1}^{K} y_{is} r_{jm} z_{jm}\, q(w_{ism}, \theta) \tag{1.17}
\]
Objective function (1.17) is the same as (1.3) without the weights $v_{is} \cdot p_j^{-1}$. The following theorem states conditions for consistency of the unweighted estimator.
Theorem 1.5.1. Assume that the first five conditions in Theorem (1.3.1) hold, and add the following two conditions:
6. Stratification in the first and second levels is based on exogenous variables $x$. That is, stratification is a deterministic function of $x$ in both levels.
7. For all $x$, $\theta^\circ$ solves $\min_{\theta \in \Theta} E[q(W, \theta) \mid X]$, and $\theta^\circ$ uniquely minimizes
\[
\sum_{j=1}^{J} \sum_{m=1}^{K} p_j E\left[ r_{jm}\, q(W_m, \theta) \right] = \sum_{j=1}^{J} \sum_{m=1}^{K} p_j E\left\{ r_{jm} E\left[ q(W_m, \theta) \mid X \right] \right\} \tag{1.18}
\]
Then the uniform weak law of large numbers holds and $\hat{\theta}_u \to \theta^\circ$ in probability as $N \to \infty$.
Proof. We need to show that $\theta^\circ$ is the unique solution to
\[
E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} r_{jm} z_{jm}\, q(W_m, \theta) \right] \tag{1.19}
\]
By assumption (6) in Theorem (1.4.1), $r_{jm}$ is a function of $x$. Also, $z_{jm}$ is independent of $w$ and consequently of $x$. Therefore
\[
E\left[ r_{jm} z_{jm}\, q(W_m, \theta) \mid X \right] = r_{jm} E(z_{jm} \mid X) E[q(W_m, \theta) \mid X] = r_{jm}\, p_j E[q(W_m, \theta) \mid X] \tag{1.20}
\]
By assumption (7) in Theorem (1.4.1), $r_{jm}\, p_j E[q(W_m, \theta) \mid X]$ is minimized at $\theta^\circ$, but perhaps not uniquely. By iterated expectations we have $E[q(W_m, \theta)] = E\{E[q(W_m, \theta) \mid X]\}$, and therefore $\theta^\circ$ is a solution to (1.19). The expectation in (1.19) is then the same as (1.18), and by assumption $\theta^\circ$ is the unique solution to (1.18).
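The consistency claim can be checked informally by simulation. The sketch below is only an illustration under assumed settings that do not come from the text: a linear population model $y = 1 + 2x + u$ with standard normal $x$ and $u$, a single level of variable probability sampling with two strata, and a retention probability of 0.1 in one stratum. It compares unweighted and inverse-probability weighted least squares slopes when the strata are defined by $x$ (exogenous) and by $y$ (endogenous). All function and variable names are hypothetical.

```python
import numpy as np

def ols(X, y, w=None):
    """(Weighted) least squares coefficients."""
    if w is None:
        w = np.ones(len(y))
    XtW = X.T * w                      # scale each observation by its weight
    return np.linalg.solve(XtW @ X, XtW @ y)

def simulate(n=200_000, p_low=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = 1.0 + 2.0 * x + rng.standard_normal(n)   # assumed population model
    X = np.column_stack([np.ones(n), x])
    slopes = {}
    for name, undersampled in (("exogenous", x > 0), ("endogenous", y > 1)):
        p = np.where(undersampled, p_low, 1.0)   # retention probability p_j
        keep = rng.random(n) < p                 # z = 1 when the draw is kept
        slopes[name + "_unweighted"] = ols(X[keep], y[keep])[1]
        slopes[name + "_weighted"] = ols(X[keep], y[keep], 1.0 / p[keep])[1]
    return slopes

slopes = simulate()
for key, value in slopes.items():
    print(key, round(value, 3))
```

In this design the unweighted slope should stay close to 2 when strata depend only on $x$, while stratifying on $y$ should pull the unweighted slope away from 2 unless the weights $p_j^{-1}$ are used.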

1.5.1.2 Consistency of the Unweighted M-Estimator: Case Two

When the first level of stratification is exogenous while the second level is endogenous, a natural conjecture is that we can drop the weight associated with the first level of stratification, $v_{is}$, but need to keep the weight associated with the second level of stratification, $p_j^{-1}$, in order to have a consistent estimator. The next theorem confirms this conjecture under specific conditions.

Theorem 1.5.2. Assume that the first five conditions in Theorem (1.3.1) hold, and add the following two conditions:
6. Stratification in the first level is a deterministic function of $x$.
7. $\theta^\circ$ is the unique solution to $E[q(W, \theta) \mid x \in X_s]$ for all $s$.
Then the uniform law of large numbers holds and $\hat{\theta}_u \xrightarrow{p} \theta^\circ$ as $N \to \infty$.
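Proof. The expected value of cluster $i$ in stratum $s$ is
\[
\begin{aligned}
E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} p_j^{-1} r_{jm} z_{jm}\, q(W_m, \theta) \,\Big|\, x \in X_s \right]
&= \sum_{j=1}^{J} \sum_{m=1}^{K} p_j^{-1} E\left[ r_{jm} z_{jm}\, q(W_m, \theta) \mid x \in X_s \right] \\
&= \sum_{j=1}^{J} \sum_{m=1}^{K} p_j^{-1} E\left[ z_{jm} \mid x \in X_s \right] \cdot E\left[ r_{jm}\, q(W_m, \theta) \mid x \in X_s \right] \\
&= \sum_{j=1}^{J} \sum_{m=1}^{K} E\left[ r_{jm}\, q(W_m, \theta) \mid x \in X_s \right] \\
&= E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} r_{jm}\, q(W_m, \theta) \,\Big|\, x \in X_s \right] \\
&= E\left[ \sum_{m=1}^{K_{sc}} q(W_m, \theta) \,\Big|\, x \in X_s \right] \\
&= \sum_{m=1}^{K_{sc}} E\left[ q(W_m, \theta) \mid x \in X_s \right]
\end{aligned}
\tag{1.21}
\]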

By assumption (7) in Theorem (1.4.2), $\theta^\circ$ is the unique solution for $E[q(W, \theta) \mid x \in X_s]$ and so is the unique solution for the last equality in (1.21). We also need to show that the uniform law of large numbers holds for each $s$; the argument is the same as in Theorem (1.3.1).

1.5.2 Asymptotic Normality of the Unweighted M-Estimator

Following the previous section, the asymptotic normality results for the unweighted estimator, when stratification is based on $x$ in both levels or only in the first level, are presented in the following two theorems.
Theorem 1.5.3. In addition to the conditions of Theorem (1.4.1), if
8. $\theta^\circ$ is in the interior of $\Theta$, that is, $\theta^\circ \in \mathrm{int}(\Theta)$.
9. For all $w \in \mathcal{W}$, $\nabla_\theta q(w, \cdot)$, the score of the objective function, is continuously differentiable on $\mathrm{int}(\Theta)$.
10. Each element of the Hessian matrix $H(W, \theta)$ is bounded in absolute value by a function $b(w)$, where $E[b(w)] < \infty$.
11. For all $x$, $E[\nabla_\theta q(W, \theta^\circ) \mid X = x] = 0$, and all elements of $\nabla_\theta q(W, \theta)$ have finite second moments.
12. $A_u = \sum_{s=1}^{S} E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} r_{jm} z_{jm} \nabla^2_\theta q(W_m, \theta^\circ) \mid X = x \right]$ is nonsingular.
Then
\[
\sqrt{N}(\hat{\theta}_u - \theta^\circ) \xrightarrow{d} \mathrm{Normal}\left(0, A_u^{-1} B_u A_u^{-1}\right) \tag{1.22}
\]

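where
\[
\begin{aligned}
B_u ={}& \sum_{s=1}^{S} E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} p_j r_{jm} \nabla_\theta q(W, \theta^\circ) \nabla_\theta q(W, \theta^\circ)' \,\Big|\, X = x \right] \\
&+ \sum_{s=1}^{S} E\left[ \sum_{j=1}^{J} \sum_{j'=1}^{J} \sum_{m=1}^{K} \sum_{t \neq m}^{K} p_j p_{j'} r_{jm} r_{j't} \nabla_\theta q(W, \theta^\circ) \nabla_\theta q(W, \theta^\circ)' \,\Big|\, X = x \right]
\end{aligned}
\tag{1.23}
\]
for all $x$.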
Proof. In this case, stratification in both levels is exogenous and the score of the objective function in each stratum $s$ is $s(W_m, \theta) = \nabla_\theta g(W_m, \theta) = \sum_{j=1}^{J} \sum_{m=1}^{K} r_{jm} z_{jm} \nabla_\theta q(W, \theta)$. Under assumption (11) in Theorem (1.5.1), the expected value of the score is
\[
E[s(W_m, \theta^\circ) \mid x] = E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} r_{jm} z_{jm} \nabla_\theta q(W_m, \theta^\circ) \,\Big|\, X = x \right] = 0 \tag{1.24}
\]

Then, by applying the central limit theorem for independent clusters within each stratum, the asymptotic distribution of the score in stratum $s$ is
\[
N_s^{-1/2} \sum_{i=1}^{N_s} \left[ s_{is}(W_m, \theta^\circ) \mid X = x \right] \xrightarrow{d} \mathrm{Normal}(0, B_s^u) \tag{1.25}
\]

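$B_s^u$ represents the variance of the score function in stratum $s$ under exogenous stratification. It is equal to
\[
\begin{aligned}
B_s^u ={}& \mathrm{var}[s(W_m, \theta^\circ) \mid X = x] = \mathrm{var}[\nabla_\theta g(W_m, \theta^\circ) \mid X = x] \\
={}& \mathrm{var}\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} r_{jm} z_{jm} \nabla_\theta q(W, \theta^\circ) \,\Big|\, X = x \right] \\
={}& E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} r_{jm} z_{jm} \nabla_\theta q(W, \theta^\circ) \nabla_\theta q(W, \theta^\circ)' \,\Big|\, X = x \right] \\
&+ E\left[ \sum_{j=1}^{J} \sum_{j'=1}^{J} \sum_{m=1}^{K} \sum_{t \neq m}^{K} r_{jm} r_{j't} z_{jm} z_{j't} \nabla_\theta q(W, \theta^\circ) \nabla_\theta q(W, \theta^\circ)' \,\Big|\, X = x \right]
\end{aligned}
\tag{1.26}
\]
for all $x$. Independence of the $z$'s from $W$ and from each other, together with independence of clusters between strata, leads us to $\sum_{s=1}^{S} B_s^u$, which is the variance of the score of the objective function, $B_u$ in (1.23). This completes the proof.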
It is interesting to note that under assumption (11) of Theorem (1.5.1), with exogenous stratification the effect of stratification vanishes, as a comparison of (1.23) and (1.8) shows.
The asymptotic results when stratification is exogenous only in the first level are very similar to case one. The next theorem summarizes the main conditions and results.
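Theorem 1.5.4. Assume the same conditions as in Theorem (1.5.1), with the last two replaced by the following:
11. $E[s(W, \theta^\circ) \mid x \in X_s] = 0$; in other words, we assume that the expected score of the objective function under exogenous stratification in the first stage is zero. We also assume that the elements of $s(W, \theta)$ have finite second moments.
12. $A_{\bar{u}} = \sum_{s=1}^{S} E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} p_j^{-1} r_{jm} z_{jm} \nabla^2_\theta q(W, \theta^\circ) \mid x \in X_s \right]$ is nonsingular.
Then
\[
\sqrt{N}(\hat{\theta}_{\bar{u}} - \theta^\circ) \xrightarrow{d} \mathrm{Normal}\left(0, A_{\bar{u}}^{-1} B_{\bar{u}} A_{\bar{u}}^{-1}\right)
\]
where
\[
\begin{aligned}
B_{\bar{u}} ={}& \sum_{s=1}^{S} E\left[ \sum_{j=1}^{J} \sum_{m=1}^{K} p_j^{-2} r_{jm} z_{jm} \nabla_\theta q(W, \theta^\circ) \nabla_\theta q(W, \theta^\circ)' \,\Big|\, X = x \right] \\
&+ \sum_{s=1}^{S} E\left[ \sum_{j=1}^{J} \sum_{j'=1}^{J} \sum_{m=1}^{K} \sum_{t \neq m}^{K} p_j^{-1} p_{j'}^{-1} r_{jm} r_{j't} z_{jm} z_{j't} \nabla_\theta q(W, \theta^\circ) \nabla_\theta q(W, \theta^\circ)' \,\Big|\, X = x \right]
\end{aligned}
\]
for all $x$.
Proof. The proof is similar to that of Theorem (1.5.1). We just need to weight observations by $p_j^{-1}$, which corresponds to VP sampling in the second stage. As in the previous case, the stratification effect due to SS sampling in the first stage is zero.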

1.6 Examples

This section contains some examples that illustrate theoretical results. It also covers some special
cases.
Example 1.
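As the first example consider a simple linear model
\[
y = x\beta + u \tag{1.27}
\]
Here $x$ is a $1 \times K$ vector of exogenous variables and $\beta$ is the $K \times 1$ vector of parameters of interest. Assuming that the exogenous $x$'s are uncorrelated with the error term $u$, $E(x'u) = 0$, the weighted estimator provides consistent estimates of $\beta$.
The sample optimization problem is
\[
\min_{\beta \in \Theta} \frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}\, p_j^{-1} r_{jm} z_{jm} (y_{scm} - x_{scm}\beta)^2
\]
The first-order condition is
\[
\frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}\, p_j^{-1} r_{jm} z_{jm}\, x_{scm}' (y_{scm} - x_{scm}\hat{\beta}) = 0 \tag{1.28}
\]
or
\[
\frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}\, p_j^{-1} r_{jm} z_{jm}\, x_{scm}' \hat{u}_{scm} = 0
\]
where $\hat{u}_{scm} = y_{scm} - x_{scm}\hat{\beta}$. In this linear model and under the multi-stage sampling scheme, consistent estimators of the asymptotic variances of the $\hat{\beta}$'s are obtained by applying Theorem (1.3.2), where consistent estimators of $A_w$ and $B_w$ are
\[
\hat{A}_w = \frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}\, p_j^{-1} r_{jm} z_{jm}\, x_{scm}' x_{scm}
\]
and
\[
\begin{aligned}
\hat{B}_w ={}& \frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}^2\, p_j^{-2} r_{jm} z_{jm}\, \hat{u}_{scm}^2\, x_{scm}' x_{scm} \\
&+ \frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{j'=1}^{J} \sum_{m=1}^{K} \sum_{t \neq m}^{K} v_{sc}^2\, p_j^{-1} p_{j'}^{-1} r_{jm} r_{j't} z_{jm} z_{j't}\, \hat{u}_{scm} \hat{u}_{sct}\, x_{scm}' x_{sct} \\
&- \sum_{s=1}^{S} \frac{1}{N} \left[ \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}\, p_j^{-1} r_{jm} z_{jm}\, \hat{u}_{scm} x_{scm}' \right] \cdot \left[ \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}\, p_j^{-1} r_{jm} z_{jm}\, \hat{u}_{scm} x_{scm}' \right]'
\end{aligned}
\]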
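A compact way to see how the pieces of Example 1 fit together is to compute them on data. The sketch below is a minimal illustration, not the exact estimator printed above: the function name and argument layout are hypothetical, `w` is assumed to hold the combined weight $v_{sc} p_j^{-1}$ for each retained observation, and the stratum correction centers cluster score totals within each stratum (divisor $N_s$), which is an assumption about the normalization. Summing scores within a cluster before taking outer products captures the own-observation and cross-observation terms together.

```python
import numpy as np

def weighted_ols_complex(X, y, w, cluster, stratum):
    """Weighted least squares with a stratum/cluster sandwich variance.

    X: (n, k) regressors, y: (n,) outcomes, w: (n,) weights (assumed to be
    v_sc / p_j for each retained observation), cluster/stratum: (n,) labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    cluster = np.asarray(cluster)
    stratum = np.asarray(stratum)

    XtWX = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (w * y))
    u = y - X @ beta
    scores = (w * u)[:, None] * X            # weighted score of each observation

    k = X.shape[1]
    B_sum = np.zeros((k, k))
    n_clusters = 0
    for s in np.unique(stratum):
        in_s = stratum == s
        ids = np.unique(cluster[in_s])
        # Total score of each cluster in this stratum.
        g = np.array([scores[in_s & (cluster == c)].sum(axis=0) for c in ids])
        n_clusters += len(ids)
        total = g.sum(axis=0)
        # Cluster outer products minus the within-stratum centering term.
        B_sum += g.T @ g - np.outer(total, total) / len(ids)

    N = n_clusters                            # N counts sampled clusters
    A_hat = XtWX / N
    B_hat = B_sum / N
    A_inv = np.linalg.inv(A_hat)
    return beta, A_inv @ B_hat @ A_inv / N
```

With one stratum, one observation per cluster, unit weights, and an intercept in `X`, this reduces to the familiar heteroskedasticity-robust variance, which is a convenient sanity check.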

Example 2.
As the second example, consider binary response models such as logit or probit, of the form
\[
P(y = 1 \mid x) = G(x\beta) \equiv p(x)
\]
where $x$ is $1 \times K$, $\beta$ is $K \times 1$, and we take the first element of $x$ to be unity. We also assume $0 < G(x\beta) < 1$ for all $x$ and $\beta$. The log-likelihood for observation $i$ is
\[
l_i(\beta) = y_i \log[G(x_i\beta)] + (1 - y_i) \log[1 - G(x_i\beta)]
\]
The weighted estimator in this case is simply the weighted maximum likelihood estimator that gives observation $i$ in cluster $c$ in stratum $s$ the corresponding weight $v_{sc} \cdot p_j^{-1}$.
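To make the weighted maximum likelihood idea concrete, the sketch below fits a weighted logit by Newton's method. It is only an illustration under stated assumptions: $G$ is taken to be the logistic function, `w` is assumed to hold the combined weight $v_{sc} p_j^{-1}$ of each retained observation, and the function names are hypothetical. It returns point estimates only and does not compute the variance estimators.

```python
import numpy as np

def weighted_loglik(beta, X, y, w):
    """Weighted Bernoulli log-likelihood with a logistic link."""
    G = 1.0 / (1.0 + np.exp(-X @ beta))
    return float(np.sum(w * (y * np.log(G) + (1.0 - y) * np.log(1.0 - G))))

def weighted_logit(X, y, w, n_iter=50, tol=1e-10):
    """Maximize the weighted log-likelihood by Newton's method."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        G = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (w * (y - G))                       # weighted score
        info = X.T @ ((w * G * (1.0 - G))[:, None] * X)   # minus the Hessian
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

A weight of 2 on an observation has the same effect on the estimate as including that observation twice, which is one way to read the role of the sampling weights here.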
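In this example, consistent estimators of $A_w$ and $B_w$ according to Theorem (1.3.2) are
\[
\hat{A}_w = \frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}\, p_j^{-1} r_{jm} z_{jm}\, g_{scm}^2\, x_{scm}' x_{scm} / \hat{\xi}_{scm} \tag{1.29}
\]
and
\[
\begin{aligned}
\hat{B}_w ={}& \frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}^2\, p_j^{-2} r_{jm} z_{jm}\, g_{scm}\, x_{scm}' x_{scm} \\
&+ \frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{j'=1}^{J} \sum_{m=1}^{K} \sum_{t \neq m}^{K} v_{sc}^2\, p_j^{-1} p_{j'}^{-1} r_{jm} r_{j't} z_{jm} z_{j't}\, g_{scm} g_{sct}\, x_{scm}' x_{sct} \\
&- \sum_{s=1}^{S} \frac{1}{N} \left[ \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}\, p_j^{-1} r_{jm} z_{jm}\, g_{scm} x_{scm}' \right] \cdot \left[ \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}\, p_j^{-1} r_{jm} z_{jm}\, g_{scm} x_{scm}' \right]'
\end{aligned}
\]
Here $g(z) = \dfrac{dG(z)}{dz}$ and $\hat{\xi}_{scm} = \hat{G}_{scm}(1 - \hat{G}_{scm})$.
Example 3.
Example 3 is a special case in which $p_j$ is set equal to 1. In other words, we eliminate the last level of stratification, the VP sampling. In this case our results in section 3 change to: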

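\[
\hat{A}_w = N^{-1} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{m=1}^{K} v_{is} y_{is} \nabla^2_\theta q(w_{ism}, \hat{\theta})
\]
and the estimator of $B_w$ is
\[
\begin{aligned}
\hat{B}_w ={}& N^{-1} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{m=1}^{K} v_{is}^2 y_{is} \nabla_\theta q(w_{ism}, \hat{\theta}) \nabla_\theta q(w_{ism}, \hat{\theta})' \\
&+ N^{-1} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{m=1}^{K} \sum_{t \neq m}^{K} v_{is}^2 y_{is} \nabla_\theta q(w_{ism}, \hat{\theta}) \nabla_\theta q(w_{ism}, \hat{\theta})' \\
&- \sum_{s=1}^{S} \frac{1}{N} \left[ \sum_{i=1}^{N} \sum_{m=1}^{K} v_{is} y_{is} \nabla_\theta q(w_{ism}, \hat{\theta}) \right] \cdot \left[ \sum_{i=1}^{N} \sum_{m=1}^{K} v_{is} y_{is} \nabla_\theta q(w_{ism}, \hat{\theta}) \right]'
\end{aligned}
\]
These results are similar to those of Bhattacharya (2005). Wooldridge (2008) also obtains the same results for the linear model estimated by least squares.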
Example 4.
Consider a case without the first level of stratification and with clusters that contain just one unit of observation. Then our results change to
\[
\hat{A}_w = N^{-1} \sum_{i=1}^{N} \sum_{j=1}^{J} p_j^{-1} r_{ij} z_{ij} \nabla^2_\theta q(w, \hat{\theta})
\]
and
\[
\hat{B}_w = N^{-1} \sum_{i=1}^{N} \sum_{j=1}^{J} p_j^{-2} r_{ij} z_{ij} \nabla_\theta q(w_i, \hat{\theta}) \nabla_\theta q(w_i, \hat{\theta})'
\]
These are the same results as in Wooldridge (1999) for the variable probability sampling case.

1.7 Two-Step M-Estimator

Consider a panel data model for a random draw $i$ from the population
\[
E(Y_i \mid X_i = x_i) = m(x_i, \theta^\circ) \tag{1.30}
\]
where $y_i$ is a $T \times 1$ vector of the dependent variable and $m(x_i, \theta)$ is a $T \times 1$ vector of conditional mean functions. Here we assume that the explanatory variables are strictly exogenous. Stratification is normally done on variables in the first period. A consistent, asymptotically normal estimator is obtained by applying the pooled weighted M-estimator discussed in the previous sections. The estimator of the asymptotic variance of $\hat{\theta}_w$ is obtained from (1.15), where $\nabla_\theta q(w_{ism}, \hat{\theta}_w)$ is a $P \times T$ matrix. Arbitrary serial correlation and heteroskedasticity are allowed in the calculation of the estimator (1.15).
Under assumption (1.30), where the conditional mean is correctly specified, can we do more in the context of stratified samples? This is the question that we will answer in the next chapter. In general, under (1.30) we can use generalized least squares (GLS) methods to obtain more efficient estimators of the parameters appearing in a set of conditional mean functions. To obtain more efficient estimators we usually need $\hat{\theta}_w$ from the first step.
Let $\Omega(x_i, \gamma)$ be a model for the $T \times T$ conditional variance matrix $\mathrm{Var}(Y_i \mid X_i)$. If this model is correctly specified, we can in general obtain a consistent estimator of the true parameters in the variance matrix, $\gamma^\circ$. In most applications, we obtain an estimate $\hat{\gamma}$ from a first step, for example by using residuals from an initial weighted M-estimator, discussed in this paper. Given $\hat{\gamma}$, and assuming that the conditional variance matrix is nonsingular for all $i$, we can estimate $\theta^\circ$ by solving
\[
\min_{\theta} \sum_{i=1}^{N} [y_i - m(x_i, \theta)]' [\Omega(x_i, \hat{\gamma})]^{-1} [y_i - m(x_i, \theta)]. \tag{1.31}
\]
Wooldridge (2010) calls the solution to (1.31) the weighted multivariate nonlinear least squares (WMNLS) estimator.
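The two-step logic can be sketched for the simplest case. The code below assumes a linear conditional mean $m(x_i, \theta) = X_i\theta$, a working variance matrix that is the same $T \times T$ matrix for every $i$ (so it does not depend on $x_i$), and no sampling weights; all three are simplifying assumptions for illustration, and the function names are hypothetical. The first step is pooled least squares, the working variance matrix is estimated from the first-step residuals, and the second step minimizes the quadratic form in (1.31).

```python
import numpy as np

def pooled_ols(X, Y):
    """First step. X: (N, T, P) regressors, Y: (N, T) outcomes."""
    N, T, P = X.shape
    return np.linalg.lstsq(X.reshape(N * T, P), Y.reshape(N * T), rcond=None)[0]

def gls_step(X, Y, omega):
    """Minimize sum_i (y_i - X_i b)' omega^{-1} (y_i - X_i b) for a common omega."""
    omega_inv = np.linalg.inv(omega)
    lhs = np.einsum("itp,ts,isq->pq", X, omega_inv, X)   # sum_i X_i' W X_i
    rhs = np.einsum("itp,ts,is->p", X, omega_inv, Y)     # sum_i X_i' W y_i
    return np.linalg.solve(lhs, rhs)

def two_step(X, Y):
    """Pooled OLS, then GLS with an unrestricted working variance matrix."""
    b_first = pooled_ols(X, Y)
    resid = Y - X @ b_first                  # (N, T) first-step residuals
    omega_hat = resid.T @ resid / X.shape[0]
    return gls_step(X, Y, omega_hat), omega_hat
```

With the identity matrix as working variance, `gls_step` reproduces pooled least squares, which is the sense in which the second step generalizes the first.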
Interestingly, even if the chosen model for the conditional variance $\Omega(x_i, \gamma)$ is misspecified, the WMNLS estimator might be more efficient for $\theta^\circ$ than an estimator that ignores variances and covariances, at least under (1.30). In most cases, a misspecified model of the variance matrix still captures key features of the conditional second moments. This is the key insight of the generalized estimating equation (GEE) literature, which is typically applied to panel data models. In the GEE literature, the conditional variance matrix $\Omega(x_i, \gamma)$ is called the working variance matrix, which is allowed, and in many cases is known, to be misspecified. In the next chapter we investigate in more detail the problem of efficient estimation in panel data models where simple random sampling is not a correct assumption.

1.8 Conclusion

Many data sets in economics and other branches of the social sciences are not i.i.d. observations but come from multi-stage stratified and clustered surveys. These surveys usually produce data that are not random, so statistical inference could be faulty if we overlook the sampling design.
In this chapter I examine statistical inference in multi-stage sampling designs in the framework of M-estimators. The results show that neglecting the sampling scheme causes the variances to be overestimated or underestimated. The weighted M-estimator is consistent and asymptotically normal regardless of whether stratification is exogenous or endogenous. Under exogenous stratification, however, unweighted estimators are also consistent. The results show that the variance of the weighted estimator consists of three parts. Part one measures the variance under the i.i.d. assumption. Parts two and three take into account clustering and stratification effects. Clustering effects are usually positive, while the stratification effect is negative. These two rarely offset each other, and therefore overlooking them is potentially problematic.
An interesting question that arises is the possibility of obtaining a more efficient estimator under stratified samples. Under simple random sampling, and assuming that the conditional mean is correctly specified, we can apply generalized least squares methods to obtain efficiency gains. We follow up this possibility in panel data models with stratified sampling schemes in the next chapter.

Chapter 2
ASYMPTOTIC EFFICIENCY IN THE PANEL DATA MODELS WITH STRATIFIED
SAMPLING

2.1 Introduction

Finding more efficient estimators helps researchers increase the precision of their statistical inferences. Efficiency usually comes at a price: it requires assumptions stronger than those needed for consistency. Here we assume that the explanatory variables are strictly exogenous. In panel data studies, however, this assumption is violated in models with lagged dependent variables and perhaps in models without lagged dependent variables. Fixed effects (FE) and random effects (RE) are two well-known linear methods used in empirical studies that require strict exogeneity of the explanatory variables. In the RE approach, the serial correlation in the composite error is exploited in a generalized least squares (GLS) framework. In the GLS procedure we also need to add assumptions on the conditional variance matrix of the error term.
The issue of efficiency in the context of stratified sampling has already been a subject of interest. Among others, Cosslett (1981a, 1981b, and 1993) and Imbens (1992) examine efficiency for discrete choice models. Imbens and Lancaster (1996) develop a new estimator for this estimation problem and show that it achieves the semi-parametric efficiency bound for this case. Recently, Tripathi (2011) develops efficient empirical likelihood-based inference for moment restriction models when data are collected by stratified sampling schemes. In this chapter, we study efficiency in panel data models with stratified data. The main idea is to utilize information within panels, similar to the GLS method, in order to gain efficiency. It should be emphasized that we only approximate the efficient estimator in the sample and try to obtain more efficient estimates compared with pooled estimators that ignore correlations within panels. In other words, our goal in this chapter is not to find the efficiency bound.
The paper is organized as follows. The next section presents the model and the conditional moment restriction that we assume to hold in the population. Section 3 introduces the sampling scheme, the sample objective function and the relevant probabilities. In section 4 we first show that the conditional moment restriction holds in the sample. Then we discuss efficient estimators by referring to some well-known works in the literature and how one should apply them in the context of stratified samples. In the same section we derive a function that minimizes the asymptotic variance. In section 6 we carry out a Monte Carlo experiment with the normal linear case and look at the results of applying the new estimators under exogenous and endogenous stratification. Section 7 shows an application of the method to PSID data. The last section summarizes the main findings of the paper and ends with some concluding remarks. All proofs and tables are contained in the Appendices.

2.2 The Moment Conditions

Let $W_i$ be an $M \times 1$ random vector taking values in $\mathcal{W} \subset \mathbb{R}^M$, where $\mathbb{R}^M$ is an $M$-dimensional Euclidean space. Some feature of the distribution of $W$ is a function of a $P \times 1$ parameter vector $\theta$ that is an element of the parameter space $\Theta$, where $\Theta \subset \mathbb{R}^P$. Now consider the class of estimators such that a zero conditional moment restriction is satisfied in the population:
\[
E[r(W, \theta^\circ) \mid W_2] = 0 \quad \text{for all } W_2 \in \mathcal{W}_2 \tag{2.1}
\]
Here $r(W, \theta)$ is an $L \times 1$ vector of functions, $\theta^\circ$ satisfies the conditional moment assumptions, and $W_2 \in \mathbb{R}^K$ is a sub-vector of $W \in \mathbb{R}^M$. For instance, $r(W, \theta)$ can be a vector of residuals and $W_2$ a vector of instrumental variables. We need standard regularity conditions such as continuity and differentiability of $r(W, \theta)$ on the interior of $\Theta$.

2.3 Sampling Scheme

The analysis of the asymptotic behavior of an estimator becomes more complicated when the data set comes from a non-random sampling scheme such as stratified sampling. One important source of the complexity is the difference between the population distribution and the sample distribution; under simple random sampling these two distributions are the same.
In multinomial sampling, stratum $\mathcal{W}_j$ is a subset of $\mathcal{W}$ for $j = 1, \cdots, J$. Let $Q_j$ be the probability that a randomly drawn observation lies in $\mathcal{W}_j$, i.e.
\[
Q_j = P(W \in \mathcal{W}_j) \tag{2.2}
\]
Let $S$ be the stratum indicator that shows from which stratum an observation was drawn. In a multinomial scheme, first the stratum indicator $s_i \in \{1, 2, \cdots, J\}$ is chosen randomly with probability $H_j$, that is,
\[
H_j = P(S_i = j) \tag{2.3}
\]
In the second step, observation $W_i$ is randomly drawn from the stratum for which the indicator $s_i = j$. This leads to the sample objective function
\[
\sum_{i=1}^{N} \sum_{j=1}^{J} 1[S_i = j] \frac{Q_j}{H_j}\, r(V_i, \theta) \tag{2.4}
\]
Unlike random sampling, where all observations are equally weighted no matter which subpopulation or stratum they belong to, in a multinomial sampling scheme observations have different weights depending on their stratum. The objective function in (2.4) weights observation $i$ by $Q_j / H_j$ if it comes from stratum $j$. So if all observations are weighted equally, that is if $Q_j = H_j$ for all $j$, there is no gain from stratified sampling over random sampling.
To emphasize the difference between the distribution of observations in the population and in the sample under a stratified sampling scheme, random vectors in the population and in the sample are denoted by $W$ and $V$ respectively.
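The role of the weights $Q_j / H_j$ can be seen in a small simulation. The sketch below is an illustration under assumed settings that are not taken from the text: the population is $W \sim N(0, 1)$, the two strata are $\{W < 0\}$ and $\{W \ge 0\}$ (so $Q = (0.5, 0.5)$), and the sampling probabilities are $H = (0.2, 0.8)$. Taking $r(V, \theta) = V - \theta$, the weighted estimator of the population mean is the weighted sample mean. The function name is hypothetical.

```python
import numpy as np

def draw_multinomial_sample(n, H, seed=0):
    """Draw stratum labels with probabilities H, then V from that stratum.

    Assumed population: W ~ N(0, 1); stratum 0 is {W < 0}, stratum 1 is {W >= 0}.
    """
    rng = np.random.default_rng(seed)
    s = rng.choice(2, size=n, p=H)
    magnitude = np.abs(rng.standard_normal(n))   # half-normal draw
    v = np.where(s == 0, -magnitude, magnitude)  # sign set by the stratum
    return s, v

Q = np.array([0.5, 0.5])   # population stratum shares
H = np.array([0.2, 0.8])   # sampling probabilities
s, v = draw_multinomial_sample(200_000, H)
weights = Q[s] / H[s]

unweighted_mean = v.mean()
weighted_mean = np.sum(weights * v) / np.sum(weights)
print(unweighted_mean, weighted_mean)
```

In this design the unweighted mean reflects the oversampled stratum, while the weighted mean should be close to the population mean of zero.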

2.4 Efficient estimation under moment restrictions

2.4.1 Moment restrictions in the sample

To study efficiency in panel data models when the data set comes from stratified samples and under the conditional expectation assumption (2.1), one first needs to evaluate the conditional expectation of the sample objective function in equation (2.4). To this end, first define, for each observation $i$,
\[
q(S, V, \theta) = \sum_{j=1}^{J} 1[S = j] \frac{Q_j}{H_j}\, r(V, \theta) \tag{2.5}
\]
$q(\cdot)$ is a function of the random variable $S$, an indicator variable representing the stratum of observation $i$, and the random vector $V$. This function also depends on the sampling weight of each observation $i$, $Q_j / H_j$, which is assumed to be known. We want to show that the expected value of $q(\cdot)$ given $V_2$, evaluated at the true parameter value $\theta^\circ$, is zero:
\[
E[q(S, V, \theta^\circ) \mid V_2] = \sum_{j=1}^{J} E\left[ 1[S = j] \frac{Q_j}{H_j}\, r(V, \theta^\circ) \,\Big|\, V_2 \right] = 0 \tag{2.6}
\]

Using the definition of expected value and assuming that $V$ is a continuous random vector, the expected value of (2.5) is
\[
\sum_{j=1}^{J} \int_{v \in \mathcal{W}} 1[s = j] \frac{Q_j}{H_j}\, r(v, \theta^\circ) \cdot g(s, v \mid v_2)\, dv \tag{2.7}
\]
Equation (2.7) shows that we need to find the conditional sampling density of $S$ and $V$ given $V_2$, or $g(s, v \mid v_2)$. Imbens and Lancaster (1996) show that this conditional density function is
\[
g(s, v \mid v_2) = \frac{\dfrac{H_s}{Q_s}\, f(v \mid v_2, \theta)}{\sum_{j=1}^{J} \dfrac{H_j}{Q_j} R(j, v_2, \theta)} \tag{2.8}
\]
Equation (2.8) represents the conditional sampling density of $S$ and $V$ given $V_2$ in terms of the conditional density of $V$ given $V_2$ in the population, the sampling weight $H_s / Q_s$, and $R(s, v_2, \theta)$. Here $R(s, v_2, \theta)$ is defined to be the probability that a randomly drawn observation is in stratum $s$ given $V_2$. It is a known function of $s$, $v_2$, and $\theta$. It is also important to note that, since we assume the strata do not overlap, the conditional sampling density of $S$ and $V$ given $V_2$ is the same as the conditional density of $V$ given $V_2$, i.e. $g(s, v \mid v_2) = g(v \mid v_2)$. Substituting (2.8) into (2.7), we have
\[
\begin{aligned}
&\sum_{j=1}^{J} \int_{v \in \mathcal{W}} 1[s = j] \frac{Q_j}{H_j}\, r(v, \theta) \cdot \frac{\dfrac{H_j}{Q_j}\, f(v \mid v_2, \theta)}{\sum_{j=1}^{J} \dfrac{H_j}{Q_j} R(j, v_2, \theta)}\, dv \\
&\qquad = \sum_{j=1}^{J} \int_{v \in \mathcal{W}} 1[s = j]\, r(v, \theta) \cdot \frac{f(v \mid v_2, \theta)}{\sum_{j=1}^{J} R(j, v_2, \theta)}\, dv
\end{aligned}
\tag{2.9}
\]

Since $\mathcal{W}_1, \mathcal{W}_2, \cdots, \mathcal{W}_J$ are mutually disjoint and the union of these disjoint subpopulations, $\bigcup_{j=1}^{J} \mathcal{W}_j$, covers the whole population, saying that the stratum of observation $i$ is $j$, or $S_i = j$, is equivalent to saying that observation $i$ belongs to subpopulation $j$, or $v_i \in \mathcal{W}_j$. So we can replace $1[s = j]$ with $1[v \in \mathcal{W}_j]$ in expression (2.9), which gives us
\[
= \sum_{j=1}^{J} \int_{v \in \mathcal{W}} 1[v \in \mathcal{W}_j]\, r(v, \theta) \cdot \frac{f(v \mid v_2, \theta)}{\sum_{j=1}^{J} R(j, v_2, \theta)}\, dv \tag{2.10}
\]

In expression (2.10), $\sum_{j=1}^{J} R(j, v_2, \theta)$ is constant and $1[v \in \mathcal{W}_j]$ just defines the limits of integration, and therefore (2.10) can be rewritten as
\[
\begin{aligned}
&= \frac{1}{\sum_{j=1}^{J} R(j, v_2, \theta)} \sum_{j=1}^{J} \int_{v \in \mathcal{W}_j} r(v, \theta) \cdot f(v \mid v_2, \theta)\, dv \\
&= \eta \int_{v \in \mathcal{W}} r(v, \theta) \cdot f(v \mid v_2, \theta)\, dv
\end{aligned}
\tag{2.11}
\]
Here $\eta = \dfrac{1}{\sum_{j=1}^{J} R(j, v_2, \theta)}$ is a constant, and the integral in (2.11) is by definition the conditional expectation of $r(\cdot)$, or
\[
\int_{v \in \mathcal{W}} r(v, \theta) \cdot f(v \mid v_2, \theta)\, dv = E[r(V, \theta) \mid V_2] \tag{2.12}
\]
By assumption (2.1), equation (2.12) evaluated at the true parameter value $\theta^\circ$ is equal to zero. Hence we have shown that, although multinomial sampling changes the distribution of observations in the sample, the zero conditional mean assumption still holds. We summarize this finding in the following lemma.

Lemma 2.4.1. If the zero conditional moment restriction (2.1), evaluated at the true parameter value $\theta^\circ$, holds in the population, then under the multinomial stratified sampling scheme its sample analog (2.6), evaluated at $\theta^\circ$, is also zero.
The result is valid under standard stratified and variable probability sampling schemes too. Imbens and Lancaster (1996) show that these three common types of stratification can be analyzed in a unified manner. They show that efficient inference should be identical for standard stratified sampling and multinomial sampling, regardless of the actual sampling scheme. The variable probability sampling model is just a re-parametrization of the multinomial sampling scheme, and therefore inference should be identical for both models.

2.4.2 Efficient estimation

The result in the previous section opens the door to applying the well-known results developed by Chamberlain (1987) and Newey and McFadden (1994) to find the smallest asymptotic variance under the zero conditional mean assumption (2.1). To find such a solution, let
\[
\Omega(W_2, \theta^\circ) = E\left[ r(W, \theta^\circ)\, r(W, \theta^\circ)' \mid W_2 \right] = \mathrm{Var}[r(W, \theta^\circ) \mid W_2] \tag{2.13}
\]
be the $T \times T$ conditional variance of $r(W, \theta^\circ)$ given $W_2$ in the population, and let
\[
G(W_2, \theta^\circ) = E[\nabla_\theta r(W, \theta^\circ) \mid W_2] \tag{2.14}
\]
be the $T \times P$ conditional mean of the gradient in the population. Then it can be shown that
\[
Z^*(W_2, \theta^\circ) = \Omega(W_2, \theta^\circ)^{-1} G(W_2, \theta^\circ) \tag{2.15}
\]
is the function that minimizes the asymptotic variance. This function is $T \times P$, and the efficient method of moments estimator solves
\[
E\left[ Z^*(W_2, \theta^\circ)'\, r(W, \theta^\circ) \right] = 0 \tag{2.16}
\]

Since stratification changes the distribution of observations in the sample, we first need to evaluate the conditional variance of the sample objective function $q(S, V, \theta)$. In the first appendix, we show that this variance is equal to
\[
E\left[ q(S, V, \theta^\circ)\, q(S, V, \theta^\circ)' \mid V_2 \right] = \sum_{j=1}^{J} \frac{Q_j}{H_j} E\left[ r(V, \theta^\circ)\, r(V, \theta^\circ)' \mid V_2, S = j \right] \tag{2.17}
\]
We can write the right-hand side of equation (2.17) in terms of the conditional variance of $r(V, \theta)$ in each stratum, so (2.17) can be rewritten as
\[
\begin{aligned}
\mathrm{var}[q(S, V, \theta^\circ) \mid V_2] ={}& \sum_{j=1}^{J} \frac{Q_j}{H_j} \mathrm{var}[r(V, \theta^\circ) \mid V_2, S = j] \\
&+ \sum_{j=1}^{J} \frac{Q_j}{H_j} E[r(V, \theta^\circ) \mid V_2, S = j]\, E[r(V, \theta^\circ) \mid V_2, S = j]'
\end{aligned}
\tag{2.18}
\]

Expression (2.18) shows that the sampling conditional variance of $r(V, \theta)$ is equal to the sum of the weighted conditional variances within strata plus the sum of the weighted conditional outer products of the means within strata. To see the effect of stratification, it is useful to compare it with the random sample case, where each observation in the population has the same weight, in other words $Q_j = H_j$ for all $j$, and to assume that the conditional expected value in each stratum equals the conditional expected value in the population, which is zero by assumption. Then equation (2.18) reduces to the sum of conditional variances in strata.
Two cases deserve attention. The first is when the strata are functions of the exogenous variables $V_2$, so that stratification is exogenous. This causes the second term on the right-hand side of (2.18) to be zero, because
\[
E[r(V, \theta^\circ) \mid V_2, S = j] = E[r(V, \theta^\circ) \mid V_2] = 0
\]
and equation (2.18) simplifies to
\[
\mathrm{var}[q(S, V, \theta^\circ) \mid V_2] = \sum_{j=1}^{J} \frac{Q_j}{H_j} \left\{ \mathrm{var}[r(V, \theta^\circ) \mid V_2] \right\} = \Omega(V_2) \sum_{j=1}^{J} \frac{Q_j}{H_j} \tag{2.19}
\]
and since $\sum_{j=1}^{J} Q_j / H_j$ is constant, it does not affect the variance, and therefore the conditional variance of $q(S, V, \theta)$ is equal to the conditional variance of $r(V, \theta)$ in the population.

The second case occurs when, despite changes in the variances between strata, the structure of the
correlation remains constant. As an example, consider cases like AR(p) or MA(q). If the
variance-covariance matrix remains the same despite stratification, then by equation (2.17) the
sample objective function q (S, V, θ ) has the same variance as r (V, θ ) in the population. In the
next section we assume that the variance-covariance matrix does not change with stratification and
then check the simulation results for this case, assuming that the correlation follows an AR(1) process.
We also need to check the score of the objective function. In the first appendix we also show
that the conditional expected value of the sample gradient vector is
\[
E\left[ \nabla_{\theta}\, q(S, V, \theta) \mid V_2 \right] = E\left[ \nabla_{\theta}\, r(V, \theta) \mid V_2 \right].
\tag{2.20}
\]
The right-hand side of (2.20) is the conditional expected value of the population Jacobian matrix,
which is a T × P matrix. It leads us to the optimal instrument matrix, which is also a T × P matrix:
\[
Z^{*}(V_2) \equiv \left\{ \operatorname{var}\left[ q(S, V, \theta^{\circ}) \mid V_2 \right] \right\}^{-1}
E\left[ \nabla_{\theta}\, q(S, V, \theta^{\circ}) \mid V_2 \right]
\tag{2.21}
\]
Therefore the efficient GMM estimator solves the sample moment conditions
\[
\sum_{i=1}^{N} Z^{*}(V_{2i})'\, q(S_i, V_i, \hat{\theta}) = 0
\tag{2.22}
\]

2.5 Examples

In this section some speciﬁc examples are covered that illustrate the theoretical results.
Example 1.
As the first example, consider the linear model
\[
Y_i = X_i \theta + U_i
\tag{2.23}
\]
where Y is a T × 1 vector of dependent variables, X is a T × P matrix of control variables, θ
is a P × 1 vector of parameters and U is a T × 1 vector of error terms. In this example
$r(X_i, Y_i, \theta) = U_i = Y_i - X_i \theta$. We assume strict exogeneity,
\[
E(U \mid X) = 0
\tag{2.24}
\]
and add the assumption that the variance-covariance matrix in the population is a function of the
control variables X:
\[
E\left( UU' \mid X \right) = \Omega(X)
\tag{2.25}
\]

Under the multinomial sampling scheme we have
\[
q(X, Y, S, \theta) = \sum_{j=1}^{J} 1[S = j]\, \frac{Q_j}{H_j}\, U
\tag{2.26}
\]
and by equations (2.12) and (2.17) the conditional expected value and conditional variance of this
function are
\[
E\left[ q(X, Y, S, \theta) \mid X \right] = 0
\tag{2.27}
\]
and
\[
\operatorname{var}\left[ q(X, Y, S, \theta) \mid X \right]
= \sum_{j=1}^{J} \frac{Q_j}{H_j}\, E\left( UU' \mid X, S = j \right)
\tag{2.28}
\]

respectively. From (2.28), it is clear that the variance matrix is a function of X and of the strata.
Also, the conditional expected value of the gradient in this simple linear model is x (up to sign)
according to (2.20). Therefore the optimal choice of instruments according to (2.21) is
\[
\left[ \sum_{j=1}^{J} \frac{Q_j}{H_j}\, E\left( UU' \mid X, S = j \right) \right]^{-1} x
\tag{2.29}
\]
and the GMM solution that produces efficient estimators solves
\[
\sum_{i=1}^{N} x_i' \left[ \sum_{j=1}^{J} \frac{Q_j}{H_j}\, E\left( UU' \mid X, S = j \right) \right]^{-1}
\sum_{j=1}^{J} 1[S = j]\, \frac{Q_j}{H_j}\, U_i = 0
\tag{2.30}
\]

A computational version of (2.30) can be written as
\[
\sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} x_{ij}'
\left[ \sum_{j=1}^{J} \frac{Q_j}{H_j}\, E\left( UU' \mid X, S = j \right) \right]^{-1} u_{ij} = 0
\tag{2.31}
\]
Estimates of $\theta^{\circ}$ are obtained by solving equation (2.31). These parameter estimates are
\[
\hat{\theta}_{wGMM} =
\left[ \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} x_{ij}'\, \hat{\Omega}^{-1}(x, j)\, x_{ij} \right]^{-1}
\left[ \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} x_{ij}'\, \hat{\Omega}^{-1}(x, j)\, y_{ij} \right]
\tag{2.32}
\]
which looks like a GLS estimator. In (2.32), $N_j$ is the sample size in stratum j and
$\hat{\Omega}^{-1}(x, j)$ is the inverse of an estimate of the variance matrix
$\operatorname{var}[q(X, Y, S, \theta) \mid X = x] = \sum_{j=1}^{J} \frac{Q_j}{H_j} E(UU' \mid X = x, S = j)$.
To get a clear idea about equation (2.32), assume as an example that the variance matrix (2.28) is a
function of gender. In this case we need to obtain $\hat{\Omega}^{-1}(\text{female}_i, j)$ and
$\hat{\Omega}^{-1}(\text{male}_i, j)$ for each stratum j, j ∈ {1, 2, . . . , J}. When the conditional
variance matrix is a function of a continuous explanatory variable $x_{ij}$, one possible solution is
to divide its range into intervals and then estimate the variance matrix for each interval in each
stratum j separately. Of course, if we know the functional form of the relationship between the
variance matrix and the explanatory variable $x_i$ in each stratum j, we can improve the efficiency
of our model by incorporating this knowledge into the estimation process. For example, if we are
interested in the relationship between saving and income in different states, and a theory provides
a specific form for the variance matrix that relates changes in the second moments of saving to
changes in income and other explanatory variables, then we are in a situation like weighted least
squares, which provides more efficient estimators than OLS. Note that in weighted least squares the
reason for weighting observations is to solve the heteroskedasticity problem, while in models with
a stratified or complex sampling design we need weights even in homoskedastic cases.
We can summarize the above procedure for finding a GMM estimator in panel data models with a
stratified structure in a few simple steps:
1. Obtain a consistent estimate of θ.
2. Obtain the residuals $\hat{u}_{ijt}$.
3. Estimate $\Omega_j(X) = E[UU' \mid X, S = j]$ for each stratum j. Call the estimates $\hat{\Omega}_j(X)$.
4. Form $\hat{\Omega}(X, S)$ by adding the weighted $\hat{\Omega}_j(X)$.
5. Substitute $\hat{\Omega}(X, S)$ into equation (2.32) to obtain $\hat{\theta}_{wGMM}$, which we
hope is more efficient than a pooled estimator.

Obtaining a consistent estimator of θ should not be difficult. Any computer package that allows
users to estimate models with survey panel data can be used for the first step.
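The five steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch
under simplifying assumptions that are not part of the text: the variance matrix in each stratum is
taken not to depend on X, the first-step estimator is weighted pooled OLS, and the function names
`weighted_gls` and `weighted_fgls` and the array layout are hypothetical.

```python
import numpy as np

def weighted_gls(X, Y, w, omega_inv):
    """Solve the weighted GLS normal equations in the form of (2.32).

    X: (N, T, P) regressors, Y: (N, T) outcomes,
    w: (N,) weight Q_j / H_j of each panel's stratum,
    omega_inv: (T, T) inverse variance matrix shared by all panels (an assumption).
    """
    XtO = np.einsum("itp,ts->ips", X, omega_inv)   # x_i' Omega^{-1}, shape (N, P, T)
    A = np.einsum("i,ips,isq->pq", w, XtO, X)      # sum_i w_i x_i' Omega^{-1} x_i
    b = np.einsum("i,ips,is->p", w, XtO, Y)        # sum_i w_i x_i' Omega^{-1} y_i
    return np.linalg.solve(A, b)

def weighted_fgls(X, Y, strata, Q, H):
    """Steps 1 to 5 with one T x T matrix per stratum (no dependence on X assumed).

    strata: (N,) integer labels 0..J-1; Q, H: (J,) population and sample shares.
    """
    Q, H = np.asarray(Q, float), np.asarray(H, float)
    w = (Q / H)[strata]
    T = X.shape[1]
    # Step 1: consistent first-step estimate (weighted pooled OLS, Omega = I).
    theta0 = weighted_gls(X, Y, w, np.eye(T))
    # Step 2: residuals, shape (N, T).
    U = Y - X @ theta0
    omega = np.zeros((T, T))
    for j in range(len(Q)):
        Uj = U[strata == j]
        omega_j = Uj.T @ Uj / len(Uj)        # Step 3: estimate of E[UU' | S = j]
        omega += (Q[j] / H[j]) * omega_j     # Step 4: weighted sum over strata
    # Step 5: plug the estimated matrix into the weighted GLS formula.
    return weighted_gls(X, Y, w, np.linalg.inv(omega))
```

When the variance matrix is allowed to depend on X (for example on gender, as discussed above),
step 3 would be repeated within each cell of X, which this sketch does not attempt.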
Example 2.
The second example considers a nonlinear model. Assume
\[
E\left[ Y_t \mid X_t \right] = m(X_t, \beta^{\circ}), \qquad t = 1, \cdots, T
\tag{2.33}
\]
Here $\{(X_t, Y_t) : t = 1, 2, \cdots, T\}$ are the time series observations for a random draw from
the cross section population, and assumption (2.33) simply means that the parametric model for
$E[Y_t \mid X_t]$ has been correctly specified. For example, if Y is a count variable, a Poisson QMLE
can be used. In this case, and in general for Y ≥ 0 and unbounded from above, the most common
conditional mean function is the exponential
\[
m(X_t, \beta) = \exp(X_t \beta)
\tag{2.34}
\]
where $X_t$ is 1 × K and contains unity as its first element, and β is K × 1. If we impose the
Generalized Linear Models (GLM) assumption, then
\[
\operatorname{var}(Y_t \mid X_t) = \sigma_{\circ}^{2}\, m(X_t, \beta^{\circ}) = \sigma_{\circ}^{2} \exp(X_t \beta^{\circ}),
\qquad t = 1, 2, \cdots, T
\tag{2.35}
\]

In this model $r(X, Y, \beta) = Y - \exp(X\beta) = U$, and U is a T × 1 vector with elements
$U_t = Y_t - \exp(X_t \beta)$ for t = 1, 2, · · · , T. By multinomial stratified sampling and
according to (4.1), the sample objective function is
\[
q(X, Y, S, \beta) = \sum_{j=1}^{J} 1[S = j]\, \frac{Q_j}{H_j}\, U
\tag{2.36}
\]
Its conditional expected value is zero, as shown in the general case, and its conditional variance is
\[
\operatorname{var}\left[ q(X, Y, S, \beta) \mid X \right]
= \sum_{j=1}^{J} \frac{Q_j}{H_j}\, E\left( UU' \mid X, S = j \right)
\tag{2.37}
\]
by (2.17). The conditional expected value of the gradient of the sample objective function is
\[
E\left[ \nabla_{\beta}\, q(X, Y, S, \beta) \mid X \right] =
\begin{pmatrix}
-X_1 \exp(X_1 \beta) \\
-X_2 \exp(X_2 \beta) \\
\vdots \\
-X_T \exp(X_T \beta)
\end{pmatrix}_{T \times K}
= R(X)
\tag{2.38}
\]
Then the optimal choice of instruments is given by
\[
\left[ \sum_{j=1}^{J} \frac{Q_j}{H_j}\, E\left( UU' \mid X, S = j \right) \right]^{-1} R(X)
\tag{2.39}
\]
and finally the GMM estimators are obtained by solving
\[
\sum_{i=1}^{N} R(X)' \left[ \sum_{j=1}^{J} \frac{Q_j}{H_j}\, E\left( UU' \mid X, S = j \right) \right]^{-1}
\sum_{j=1}^{J} 1[S = j]\, \frac{Q_j}{H_j}\, U_i = 0
\tag{2.40}
\]

Here, one approach is to model $E(UU' \mid X, S = j)$ as in the first example, hoping to obtain more
efficient estimators. Alternatively, we can choose a hypothesized structure for the within-panel
correlation, as in the generalized estimating equations (GEE) literature. The main idea of the GEE
approach in panel data models is that, under the strict exogeneity assumption (2.1), even a
misspecified model for the conditional variance (2.17) that nevertheless captures key features of the
conditional second moments might lead to a more efficient estimator of $\theta^{\circ}$ than an
estimator that ignores variances and covariances. The identity matrix is the simplest form of
within-panel correlation; it assumes independence, in other words no correlation within panels. The
exchangeable correlation matrix is a simple extension of this structure. This matrix looks like
\[
\Lambda(\alpha) =
\begin{pmatrix}
1 & \alpha & \cdots & \alpha \\
\alpha & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \alpha \\
\alpha & \cdots & \alpha & 1
\end{pmatrix}
\tag{2.41}
\]
Here the parameter α is a scalar that represents the common correlation among observations within
the panels. As an example, consider a health study in which the panels are clinics and the
observations within the panels are patients.

If the observations within the panels have a natural order, it is more reasonable to assume an
autoregressive structure for the within-panel correlation. In the health study case, for instance,
the panels may represent patients who are measured over time. In this case an autoregressive process
can be a good model for the dependence of a patient's health conditions over time. In section 5 we
consider the autoregressive structure implied by the AR(1) process for the correlation matrix to
study a linear model. There are several ways in which we might hypothesize the within-panel
correlation. For more options and examples see Hardin and Hilbe (2003).
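The two working correlation structures discussed here are easy to write down explicitly. The
following is a minimal sketch; the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def exchangeable_corr(T, alpha):
    """Exchangeable working correlation as in (2.41): ones on the diagonal, alpha elsewhere."""
    return (1.0 - alpha) * np.eye(T) + alpha * np.ones((T, T))

def ar1_corr(T, rho):
    """AR(1) working correlation: entry (s, t) equals rho ** |s - t|."""
    idx = np.arange(T)
    return rho ** np.abs(idx[:, None] - idx[None, :])
```

Setting `alpha` or `rho` to zero returns the identity matrix, which corresponds to the independence
structure mentioned above.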
By assuming the correlation matrix (2.41) and adding the GLM assumption (2.35), the variance of the
sample objective function reduces to
\[
\operatorname{var}\left[ q(X, Y, S, \beta) \mid X \right]
= m(X)^{1/2}\, \Lambda(\alpha)\, m(X)^{1/2} \sum_{j=1}^{J} \frac{Q_j}{H_j}\, \sigma_j
\tag{2.42}
\]
where m (X) is
\[
m(X) =
\begin{pmatrix}
m(X_1, \beta) & 0 & \cdots & 0 \\
0 & m(X_2, \beta) & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & m(X_T, \beta)
\end{pmatrix}
\tag{2.43}
\]
and by dropping $\sum_{j=1}^{J} \frac{Q_j}{H_j} \sigma_j$ in (2.42), equation (2.40) changes to
(2.44) in the sample:
\[
\sum_{i=1}^{N} R(x)'\, m(x)^{-1/2}\, \Lambda(\alpha)^{-1}\, m(x)^{-1/2}
\sum_{j=1}^{J} 1[S = j]\, \frac{Q_j}{H_j}\, u_i = 0
\tag{2.44}
\]
Equation (2.44) can be represented as
\[
\sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} R(x)'\, m(x)^{-1/2}\, \Lambda(\alpha)^{-1}\, m(x)^{-1/2}\, u_{ij} = 0
\tag{2.45}
\]

2.6 The normal linear model: A Monte Carlo investigation

It would be insightful to have a Monte Carlo analysis of a number of examples of stratified sampling
in the normal linear model. We consider the following simple model:
\[
Y_i = X_i \theta + U_i, \qquad
U \mid X = x \sim N\left( 0, \sigma^{2} \Omega \right), \qquad
x_{it} \sim N(0, 1) \quad \text{for } i = 1, \cdots, N
\tag{2.46}
\]

In this simple two-variable linear regression model, $Y_i$ is a T × 1 vector of the dependent
variable and $X_i$ is a T × 2 matrix of explanatory variables whose first column is a constant term.
$U_i$ is a T × 1 vector of error terms. The vector of parameters θ has two elements: the intercept θ◦
and the slope θ1. We set zero and one as the true values of the intercept and slope in the
population, respectively. We also assume that the only control variable in the model, $X_{it}$, has
the standard normal distribution and that the error term U has the normal distribution with mean
zero and variance $\sigma^{2}\Omega$, where Ω has a first-order autoregressive AR(1) structure with
parameter ρ, and $\sigma^{2} = 1$. Under these assumptions the variance-covariance matrix is
\[
E\left( U_i U_i' \mid X_i \right) = E\left( U_i U_i' \right) = \sigma^{2} \Omega =
\begin{pmatrix}
1 & \rho & \cdots & \rho^{T-1} \\
\rho & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \rho \\
\rho^{T-1} & \cdots & \rho & 1
\end{pmatrix}
\tag{2.47}
\]

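Three strata are considered:
\[
W_1 = \mathcal{X} \times (-\infty, -0.25), \qquad
W_2 = \mathcal{X} \times (-0.25, 1.5), \qquad
W_3 = \mathcal{X} \times (1.5, \infty)
\]
These strata are endogenous. We also consider three exogenous strata:
\[
W_1 = (-\infty, -0.25) \times \mathcal{Y}, \qquad
W_2 = (-0.25, 1.5) \times \mathcal{Y}, \qquad
W_3 = (1.5, \infty) \times \mathcal{Y}
\]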

In all cases the strata are defined by dividing the population into subpopulations in the first
period, t = 1. W = (X, Y) is the population space, where $W \subset \mathbb{R}^{2}$. In this example
the population weights $Q_1$, $Q_2$ and $Q_3$ are known. These weights are obtained from the normal
distribution with mean zero and variance two in the endogenous case and from the standard normal
distribution in the exogenous case. The sampling probabilities $H_s$ are equal to $N_s / N$ for
s ∈ {1, 2, 3}. Here $N_s$ is the number of observations from stratum s and N is the total number of
observations in the sample.
We estimate the parameters and their asymptotic variances with the estimators developed in this
paper, which we call weighted GLS and unweighted GLS, and compare them with OLS and
weighted pooled OLS.
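As a small worked example of how the known population weights follow from the design, the sketch
below evaluates the normal probabilities of the three intervals with cut points −0.25 and 1.5. The
variance of two in the endogenous case reflects that the first-period outcome is the sum of a
standard normal regressor and a unit-variance error. The function names and the equal sample
allocation used for the sampling shares are assumptions made for illustration.

```python
from statistics import NormalDist

CUTS = (-0.25, 1.5)  # stratum boundaries used in the simulation design

def stratum_shares(sd):
    """Population shares Q_1, Q_2, Q_3 when the stratifying variable is N(0, sd^2)."""
    dist = NormalDist(mu=0.0, sigma=sd)
    lo, hi = (dist.cdf(c) for c in CUTS)
    return (lo, hi - lo, 1.0 - hi)

def sampling_shares(counts):
    """Sample shares H_s = N_s / N from the per-stratum sample sizes."""
    total = sum(counts)
    return tuple(n / total for n in counts)

Q_exog = stratum_shares(1.0)                # stratification on x_i1 ~ N(0, 1)
Q_endog = stratum_shares(2.0 ** 0.5)        # stratification on y_i1 ~ N(0, 2)
H_equal = sampling_shares((500, 500, 500))  # hypothetical equal allocation
```

Under these assumptions the third stratum holds roughly 7 percent of the population in the
exogenous case and roughly 14 percent in the endogenous case, so an equal allocation oversamples it
in both designs.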

In this exercise the sample objective function is (2.26) from the first example and its variance is
equal to
\[
\operatorname{var}\left[ q(X, Y, S, \theta) \mid X \right]
= \sum_{j=1}^{J} \frac{Q_j}{H_j}\, E\left( UU' \mid X, S = j \right)
= \frac{Q_1}{H_1} \sigma_1^{2} \Omega + \frac{Q_2}{H_2} \sigma_2^{2} \Omega + \frac{Q_3}{H_3} \sigma_3^{2} \Omega
= \left( \sum_{j=1}^{3} \frac{Q_j}{H_j}\, \sigma_j^{2} \right) \Omega
\tag{2.48}
\]
The term $\sum_{j=1}^{3} \frac{Q_j}{H_j} \sigma_j^{2}$ in (2.48) is a constant and has no effect on
estimating the parameters, so we drop it for simplicity. Therefore, with these simplifications,
var [q (X, Y, S, θ ) |X] = Ω is the variance matrix in the population, which is not a function of the
control variables X. Of course, by changing the assumptions, we can obtain different estimates of
the variance. In this example, we consider the simplest case by imposing strong assumptions that
make the estimation easy to execute.
Weighted GLS estimates of θ◦ and θ1 are obtained by solving equation (2.30) in the first
example. These parameter estimates are
\[
\hat{\theta}_{wGLS} =
\left[ \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} x_{ij}'\, \Omega^{-1} x_{ij} \right]^{-1}
\left[ \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} x_{ij}'\, \Omega^{-1} y_{ij} \right]
\tag{2.49}
\]
which looks like a GLS estimator. In (2.49), $N_j$ is the sample size in stratum j.
To judge how much a practitioner gains by using this estimator, we use simulation to compare the
estimators developed in this paper with two other estimators. The comparison is between the weighted
GLS estimator in equation (2.49); the unweighted GLS estimator, which is exactly the same as (2.49)
but drops the weights $Q_j / H_j$ for all j; the weighted pooled OLS estimator, which ignores the
correlation over time for each cross section observation i,
\[
\hat{\theta}_{wOLS} =
\left[ \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} x_{ij}' x_{ij} \right]^{-1}
\left[ \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} x_{ij}' y_{ij} \right]
\tag{2.50}
\]
and the usual OLS estimator assuming homoskedasticity. We also estimate feasible versions of the
weighted and unweighted GLS estimators and call them weighted and unweighted FGLS estimators,
respectively. So in total six estimators are evaluated in the simulation.
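To make the comparison concrete, the following sketch draws one stratified sample under the
endogenous design and computes four of the six estimators with the true AR(1) matrix. It is not the
simulation code behind the reported tables; the population size, the equal number of draws per
stratum, the seed and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, rho = 2, 0.5            # panel length and AR(1) parameter (illustrative values)
M, n_per = 60_000, 1_000   # hypothetical population size and draws per stratum

# Population: y_it = 0 + 1 * x_it + u_it with AR(1)-correlated errors, as in (2.46)-(2.47).
t = np.arange(T)
omega = rho ** np.abs(t[:, None] - t[None, :])
x = rng.standard_normal((M, T))
u = rng.standard_normal((M, T)) @ np.linalg.cholesky(omega).T
y = x + u

# Endogenous strata: first-period outcome with cut points -0.25 and 1.5.
strata = np.digitize(y[:, 0], [-0.25, 1.5])
Q = np.bincount(strata, minlength=3) / M            # population shares Q_j

# Standard stratified sampling: the same number of panels from every stratum.
idx = np.concatenate([
    rng.choice(np.flatnonzero(strata == j), size=n_per, replace=False)
    for j in range(3)
])
s = strata[idx]
H = np.bincount(s, minlength=3) / idx.size          # sample shares H_j = N_j / N
w = (Q / H)[s]                                      # weight Q_j / H_j per sampled panel

X = np.stack([np.ones((idx.size, T)), x[idx]], axis=2)   # (N, T, 2): constant and x
Y = y[idx]

def weighted_gls(X, Y, w, omega_inv):
    """Weighted GLS as in (2.49); omega_inv = I gives weighted pooled OLS as in (2.50)."""
    XtO = np.einsum("itp,ts->ips", X, omega_inv)
    A = np.einsum("i,ips,isq->pq", w, XtO, X)
    b = np.einsum("i,ips,is->p", w, XtO, Y)
    return np.linalg.solve(A, b)

ones = np.ones(idx.size)
omega_inv = np.linalg.inv(omega)
theta_ols = weighted_gls(X, Y, ones, np.eye(T))     # OLS, ignores weights and correlation
theta_wpols = weighted_gls(X, Y, w, np.eye(T))      # weighted pooled OLS
theta_uwgls = weighted_gls(X, Y, ones, omega_inv)   # unweighted GLS
theta_wgls = weighted_gls(X, Y, w, omega_inv)       # weighted GLS
```

Repeating the draw many times and recording the mean and spread of each estimate would give a small
Monte Carlo study of the kind described in this section.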

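We also look at the variance of these estimators to evaluate their efficiency. An appropriate
estimator of the asymptotic variance of $\hat{\theta}_{wGLS}$ is
\[
\widehat{\operatorname{Avar}}\left( \hat{\theta}_{wGLS} \right) = \hat{A}_w^{-1} \hat{B}_w \hat{A}_w^{-1}
\tag{2.51}
\]
where $\hat{A}_w = \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} x_{ij}'\, \hat{\Omega}^{-1} x_{ij}$
and $\hat{B}_w$ is more complicated:
\[
\hat{B}_w =
\sum_{j=1}^{J} \frac{Q_j^{2}}{H_j^{2}} \sum_{i=1}^{N_j} x_{ij}'\, \hat{\Omega}^{-1} \hat{u}_{ij} \hat{u}_{ij}'\, \hat{\Omega}^{-1} x_{ij}
- \sum_{j=1}^{J} \frac{Q_j^{2}}{H_j^{2}}
\left[ \frac{1}{N_j} \sum_{i=1}^{N_j} x_{ij}'\, \hat{\Omega}^{-1} \hat{u}_{ij} \right]
\left[ \frac{1}{N_j} \sum_{i=1}^{N_j} x_{ij}'\, \hat{\Omega}^{-1} \hat{u}_{ij} \right]'
\tag{2.52}
\]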

If the weights are dropped from (2.51), we obtain the estimator of the asymptotic variance of
unweighted GLS, $\widehat{\operatorname{Avar}}(\hat{\theta}_{uwGLS})$. If the variance-covariance
matrix Ω is dropped from (2.51), we obtain the estimator of the asymptotic variance of weighted
pooled OLS.
As mentioned already, in this exercise we consider a variance-covariance matrix with an AR(1)
structure in the population. As equation (2.48) shows, if we assume that ρ does not change across
strata, then there is only one parameter to estimate, namely ρ. In practice it is also possible
to estimate a different ρ for each stratum, but for simplicity, in this example, we assume that ρ
in each stratum is equal to the value of ρ in the population. Therefore, it is a constant parameter
and not a function of the strata. However, there is a point that we should keep in mind. We call our
estimators GLS (weighted or unweighted) whenever we use the value of ρ in the population, since
equation (2.52) is very similar to well-known GLS estimators. This naming may cause some confusion,
since we do not know the true value of ρ in each stratum; we just “assume” that it is equal to the
value of ρ in the population of interest. This should make clear why, in Tables B.1 to B.16, the
estimated parameters from GLS and FGLS are very close in most cases.
We consider four values of the correlation parameter ρ, ranging from no correlation (ρ = 0) to a
high degree of correlation (ρ = 0.9), with ρ = 0.1 and ρ = 0.5 in between. This helps us see the
effect of the magnitude of the correlation on the efficiency gain in this simple exercise.

Tables B.1 to B.4 and B.5 to B.8 in the second appendix summarize the results for the cases T = 2
and T = 5, respectively, when stratification is exogenous. These tables report the means and their
standard errors, and the mean squared errors, of the intercept and slope for the six estimators.
When ρ = 0, POLS and unweighted GLS (uwGLS) are almost identical, as expected. Under exogenous
stratification it has been shown that ignoring stratification does not cause any problem; see for
example Manski and McFadden (1981), DuMouchel and Duncan (1983) and Wooldridge (1999, 2001). This is
clearly seen in the tables. In both the T = 2 and T = 5 cases, OLS that ignores stratification,
unweighted GLS and its feasible counterpart are superior to weighted POLS, weighted GLS and their
feasible versions; they are consistent and more efficient.
The interesting point, which is the main issue of this paper, appears when the correlation
increases. As ρ increases, unweighted GLS and its feasible version, which take the correlation into
account in the estimation process, show their efficiency advantage over OLS and weighted POLS, which
simply ignore the correlation. This holds especially for the estimate of the slope; the estimate of
the intercept does not change very much except at a high degree of correlation.
As an illustration, consider exogenous stratification with T = 2 and ρ = 0.9 in Table B.4. In this
case the mean of the slope is almost the same for POLS and unweighted FGLS, but the mean squared
error of the latter is .0151 compared to .0355 for the former, a substantial reduction of about 57
percent. For T = 5 the reduction is even larger, about 62 percent, as presented in Table B.8.
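The percentage for T = 2 can be checked directly from the two mean squared errors quoted from
Table B.4:

```latex
\[
\frac{0.0355 - 0.0151}{0.0355} = \frac{0.0204}{0.0355} \approx 0.575
\]
```

This is the reduction of roughly 57 percent stated above.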
When ρ decreases to 0.5, the improvement in efficiency is still considerable. In this case, with
T = 2 in Table B.3, the mean of the standard deviation of the slope decreases from .0355 for OLS to
.0302 for unweighted FGLS. The mean of the standard deviation diminishes by about 18 percent when T
is 5 (Table B.7). Meanwhile, the mean of the standard deviation of the intercept does not change at
all when T = 2, but it shows a modest improvement as T increases to 5.
The simulation results show that, under exogenous stratification in a panel data model, unweighted
GLS and its feasible counterpart, which use the structure of the variance-covariance matrix in
estimation, perform better than OLS and weighted POLS, which simply ignore correlations within
each cross-section observation. Another interesting observation is that even weighted GLS under
exogenous stratification improves, in terms of smaller bias and smaller variances, as the degree of
correlation increases. It is clearly superior to weighted POLS, and its estimate of the slope has a
much smaller variance than that of POLS when ρ exceeds .5 (Tables B.3, B.4, B.7 and B.8).
The results also show that, as the correlation increases, the difference between the standard
deviation of the mean (presented in parentheses) and the mean of the standard deviation of the
intercept increases for the OLS estimator. This is a sign that the variance estimator of OLS is
inconsistent and that the inconsistency rises along with ρ. However, the inconsistency for the
variance of the slope is much smaller and does not vary significantly with the correlation between
observations.
The main challenge arises when stratification is based on the endogenous variable. In this case
unweighted estimators are generally inconsistent. Tables B.9 to B.12 and B.13 to B.16 summarize the
results when stratification is based on the endogenous variable for the cases T = 2 and T = 5,
respectively. As expected, OLS, unweighted GLS and its feasible counterpart all produce inconsistent
estimates of both the slope and the intercept. The interesting point is that this inconsistency
shrinks for the slope but grows for the intercept as the correlation parameter increases, for both
estimators. Moreover, OLS gives inconsistent estimates of the variances too, although this
inconsistency is reduced as ρ increases. This can be seen by comparing the standard deviation of the
mean, presented in parentheses, with the mean of the standard deviations.
The results presented in Tables B.9 to B.16 show that, under endogenous stratification, the
weighted estimators (weighted POLS, weighted GLS and weighted FGLS) are consistent and are almost
the same at low values of the correlation parameter, ρ = 0 and ρ = 0.1. The differences between
these estimators become more pronounced as the correlation parameter ρ grows. For example, in Table
B.11, when T = 2 and ρ = 0.5, weighted POLS and weighted GLS are both consistent estimators of the
slope, but the latter has a standard deviation of the mean equal to .0365 compared to .0414 for the
former. For T = 5 this difference is even more considerable (see Table B.15).

The superiority of weighted GLS or its feasible equivalent over the other estimators is
unambiguous if we increase ρ to 0.9. In this case, when T = 2, the mean of the standard deviation of
the slope is .0186, which is less than half of the value for its closest competitor, weighted POLS,
at about .0408 (Table B.12). The difference between weighted GLS and the rest of the estimators is
even more dramatic when T = 5. In Table B.16, we can compare the efficiency of weighted GLS and
weighted POLS. Here the mean of the standard deviation of the slope for weighted GLS is just 36
percent of the value for weighted POLS (.0098 versus .0272). Of course, this large advantage of
weighted GLS over weighted POLS is substantial only for the estimate of the slope, not the
intercept.
In another set of Monte Carlo experiments, we relaxed the assumption that the correlation
matrix is the same for all strata and estimated the matrix for each stratum. The results are even
better: the estimators have smaller variances, although in most cases the variances of the old and
the new estimators are very close. We also repeated the experiment with the covariance matrix
structure changed to that of the random effects model. The results show that the weighted and
unweighted GLS estimators are the efficient estimators under endogenous and exogenous
stratification, respectively. To keep the appendices shorter we do not report the related tables
and results.
Overall, the simulations point to some tentative conclusions. First, finding more efficient
estimators in panel data models with a stratified sampling structure is possible under appropriate
assumptions. Depending on whether the stratification is based on an exogenous or an endogenous
variable, the GMM estimators developed in this paper, i.e. unweighted or weighted GLS, outperform
OLS or weighted POLS, which do not consider the correlations over time within each panel i.
Second, this superiority is positively related to the level of correlation of a cross section
observation through time. At low levels of correlation there is no big advantage to using GMM over
OLS or weighted POLS. This changes as the correlation parameter gets larger.
Also, for the same sample size, the efficiency gain depends on the structure of the correlation, or
on the kind of structure that is chosen in the case of GEE models. The simulation results show that
the structure of the correlation matrix affects the amount of reduction in the variances of the
estimators.

2.7 Determinants of Family Income in the U.S.: An Empirical Application

In this section we analyze the determinants of family income and the sources of variation in family
income across households. We estimate a simple linear model that treats total family income as a
function of family characteristics such as the education, age, gender and marital status of the head
of the family, and so on. The model is estimated with different methods: pooled OLS, weighted POLS,
feasible GLS and the weighted feasible GLS methods developed in this paper, in order to compare the
efficiency gain, if any.
The source of the data set used in this exercise is the 2003-2009 Panel Study of Income Dynamics
(PSID). The PSID is a complex longitudinal panel survey that has collected data from the same
families and their descendants in the United States since 1968. Data have been collected on a
wide range of economic, social, demographic, psychological and health factors over the life course
and across generations. The sample size has grown from roughly 4,800 families in 1968, to about
7,400 by 2005, and to more than 8,690 families and 24,385 individuals as of 2009 (Heeringa, et
al., 2011). As of 2009, the PSID has information on over 70,000 individuals collected over the
past four decades.
The core sample of individuals and their families in the PSID is rooted in two distinct samples.
The Survey Research Center designed a nationally representative sample known as the SRC
sample. The second sample, known as the Survey of Economic Opportunity or SEO sample, was drawn
mainly from lower income families. An oversample of low-income families was included to provide an
adequate sample size for investigating poverty-related issues. Roughly 18,000 individuals
living in 4,800 households were members of the original 1968 sample. In 1997, the PSID Immigrant
Supplement added 511 immigrant families to the core sample to obtain a more complete picture
of the population and to enhance representativeness.
Individuals in the PSID fall into two categories: sample and non-sample persons. By definition a
sample person is someone who is either a resident of a PSID original sample family in 1968, or an
offspring born to or adopted by a sample individual who is actively engaged in the study at the time.
The definition of sample persons was slightly relaxed in 1994 to allow a child born to or adopted
by a sample person who was not participating in the study to be considered a sample person.
According to Heeringa, et al. (2011a), of the 24,385 individuals distributed across 8,690 families
in 2009, 17,471 are PSID sample persons and 6,914 are non-sample spouses and family members.
Longitudinal weights are calculated at the beginning of a four-year (two-wave) cycle. The last
cycle began in 2007, and therefore the 2009 weights are just “carry-over” weights. Weights need to
adjust for attrition and also for changes in family size that happen because of marriage, divorce,
death, and other additions of new members. The longitudinal family weight in the PSID is the average
of the positive individual weights for sample persons and the zero weights for non-sample persons in
the family. For example, if a PSID sample person with an individual longitudinal weight of 100 has a
spouse who is a PSID non-sample person with an assigned weight of 0, then the family weight
for this two-person family is 50. For more detail on the construction of the PSID longitudinal
family and individual weights see Heeringa, et al. (2011a, b).
To study the relationship between income and family characteristics, a simple linear model is
considered in which the dependent variable is total family income, or tfinc, which is the sum
of the taxable income of the family head, his wife and other members of the family last year plus
the social security income of the head, his wife and other members of the family unit. This variable
can take negative values, which indicate net losses from business or farm activities. The
model is represented as
\[
tfinc_{it} = X_{it} \beta + v_{it}
\tag{2.53}
\]

where X is a vector of family characteristics. Here $v_{it} \equiv c_i + u_{it}$, t = 1, . . . , T,
are the composite errors, $c_i$, i = 1, . . . , N, is the unobserved heterogeneity, and $u_{it}$ are
the idiosyncratic errors. The parameters of interest are represented by the vector β. The vector X
includes the total family wealth (twealth), the head's age, age squared (age2) and age cubed (age3),
the health condition of the head, the marital status, education level and employment status of the
head, family size (fsize), persons less than 6 years of age in the family (aychild6), the education
levels of the head's father and mother, the race and gender of the head, and the number of persons
less than 18 years of age (nchild) in the family, as well as year dummy variables and an intercept.
The variable twealth is constructed as the sum of seven asset types, net of debt value, plus the
value of home equity. We also added to the model interaction terms between the education level of
the head (edu_hs) and his age, and between edu_hs and the head's employment status (unemployed).
Tables B.17 and B.18 provide variable descriptions and summary statistics, respectively.
The panel in this empirical study consists of 4 waves (7 years), starting in 2003 and ending in 2009.
The 2003 longitudinal family weights are used. After dropping all observations with missing
values and strata with just one panel, the final data set is a balanced panel containing 15,672
observations, or 3,918 panels, distributed across 33 strata.
To estimate the family income equation (2.53), seven methods are used. These seven methods are
pooled OLS and weighted pooled OLS, which ignore the serial correlation problem, and feasible
versions of generalized least squares (GLS) that consider two forms of serial correlation. The
first form is a first-order autoregression AR(1), and in the second form the random effects
structure is estimated for the unconditional variance matrix of the error term $v_{it}$. The
remaining three methods are the weighted FGLS estimators discussed in this paper. Besides AR(1) and
random effects, we estimate an unrestricted variance matrix of the error term, i.e.
$\hat{\Omega} = N^{-1} \sum_{i=1}^{N} \hat{v}_i \hat{v}_i'$, where $\hat{v}_i$ is a 4 × 1 vector of
the pooled OLS residuals. We call these three methods wFGLS_ar1, wFGLS_re and wFGLS_un,
respectively. We hope to obtain an efficiency gain by using the latter estimators to estimate the
total family income equation (2.53).
The estimation results are presented in Table B.19. Robust standard errors are listed in
parentheses. For wFGLS_ar1, wFGLS_re and wFGLS_un the standard errors are calculated using equations
(2.51) and (2.52). In Table B.19, $\hat{\lambda}$ is a consistent estimate of λ, where
\[
\lambda = 1 - \left\{ \frac{1}{1 + T \left( \sigma_c^{2} / \sigma_u^{2} \right)} \right\}^{1/2}
\tag{2.54}
\]
and $\sigma_c^{2}$ and $\sigma_u^{2}$ are the variance of $c_i$ and the variance of $u_{it}$,
respectively. If $\hat{\lambda}$ is close to unity, the random effects (RE) and fixed effects (FE)
estimates tend to be close.
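Formula (2.54) is simple enough to evaluate by hand, but a short sketch shows how λ moves with the
ratio of the two variance components. The function name `re_lambda` is hypothetical.

```python
def re_lambda(T, var_c, var_u):
    """Quasi-demeaning parameter lambda from (2.54).

    T: number of time periods per panel,
    var_c: variance of the unobserved heterogeneity c_i,
    var_u: variance of the idiosyncratic error u_it (must be positive).
    """
    return 1.0 - (1.0 / (1.0 + T * var_c / var_u)) ** 0.5
```

With no unobserved heterogeneity (`var_c = 0`) the function returns zero, and it approaches one as
the variance of the heterogeneity grows relative to the idiosyncratic variance or as T increases.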
We estimate just one variance matrix for all strata, as in the Monte Carlo study presented in the
last section. The results show that almost all coefficients have the expected signs regardless
of the method of estimation. However, depending on which method is used, their magnitudes differ
widely in many cases. For example, the coefficient on edu_hs estimated by weighted pooled OLS is
-31.082; in terms of absolute value, the same coefficient drops to just -2.636 when the model is
estimated by FGLS_re and rises to -9.913 in the FGLS_ar1 case (columns 2, 3 and 5 in Table B.19).
These substantial changes in the size of most coefficients are mainly due to weighting. A simple
comparison between the unweighted FGLS methods in columns 3 and 5 and their weighted counterparts in
columns 4 and 6 of Table B.19 shows the substantial effect of weighting on the size of the
coefficients. For instance, consider again the coefficient on edu_hs. It is about 10 times larger
if the family income equation is estimated by wFGLS_re rather than FGLS_re. The same coefficient is
almost 3 times larger if wFGLS_ar1 is used instead of FGLS_ar1. As another example, consider the
coefficient on health. The size of the coefficient falls by almost 50% when the model is estimated
by FGLS instead of wFGLS.
The large effect of weighting on estimation should not be viewed as unusual. Since the PSID
purposely oversamples low-income families and income is the dependent variable in our model, OLS
using the stratified sample does not consistently estimate the parameters of the total family income
equation, because the stratification is endogenous. This is also true for the unweighted FGLS
estimators, FGLS_ar1 and FGLS_re. The pooled OLS standard errors are smaller than the weighted
pooled OLS ones, as expected. In chapter one we showed that, by ignoring stratification, pooled OLS
tends to underestimate standard errors. The standard errors are even smaller for the two unweighted
FGLS estimators, as expected. Therefore, despite the smaller standard errors of the unweighted
estimators, the main competition is between weighted pooled OLS on the one hand and the weighted
FGLS methods on the other, which is reflected in columns 2, 4, 6 and 7 of Table B.19.
The main idea in this chapter was to increase efficiency in panel data models with stratified data
by considering serial correlation in each panel. Under correct conditional mean specification, even
a wrong working correlation matrix that captures key features of the conditional second moments
might lead to a more efficient estimator. A comparison between the weighted pooled OLS and the
weighted FGLS estimators in Table B.19 shows that the standard errors are indeed smaller in almost
all cases for the latter estimators. The only exceptions are the coefficients on the father's and
mother's education levels, fedu_hs and medu_hs. The reductions in standard errors are considerable.
For example, the standard error on twealth falls by about 33% when wFGLS is used to estimate the
family income equation. The standard errors of the remaining coefficients drop by between 4%
(coefficients on age and nchild) and about 35% (coefficients on unemployed and unem.edu_hs). The
three consistent estimators, wFGLS_ar1, wFGLS_re and wFGLS_un, are very stable in estimating almost
all coefficients, but the efficiency gain seems to be higher for wFGLS_re and wFGLS_un than for
wFGLS_ar1.

2.8

Conclusion

This paper investigates efficiency in panel data models when the data set comes from a stratified sampling scheme. We start from a set of conditional moments in the population and then, building on Chamberlain (1987) and Newey and McFadden (1994), propose a GMM estimator that takes into account the dependency structure within the panels. The result is an efficient GMM estimator that is computationally simple to implement. By estimating a covariance matrix for each stratum, or even the same covariance matrix for all strata, we are able to improve efficiency.
Monte Carlo simulation results show that the new estimators, which we call weighted and unweighted GLS (and FGLS), in general do better than ordinary least squares or weighted and unweighted pooled OLS, which simply overlook the dependency in the data. Under endogenous stratification, weighted GLS is the most efficient estimator of all, and under exogenous stratification, dropping the weights and using unweighted GLS produces the best performance, as we expect. Of course, the gains from the new estimators are smaller when the correlation structure in the panel is weaker. The Monte Carlo experiments also show that the structure of the correlation matrix affects the efficiency gain.
The simulation results also suggest that as T increases, the importance and effects of endogenous stratification are reduced. A plausible explanation is that as T increases, the weight of the first period diminishes, which brings the sample closer to a simple random sample. Another interesting finding is that as the degree of correlation $\rho$ increases, the inconsistency declines, which can be attributed to a decrease in the degrees of freedom with which observations move.
We apply the method to estimate a simple linear model using PSID data. Although the PSID has a very complex structure including multi-stage stratification, even a very simple form for the working variance matrix allows the new estimators to decrease the standard errors on most coefficients in the model.


Chapter 3
MODEL SELECTION TESTS IN COMPLEX SAMPLES

3.1

Introduction

Using the Kullback-Leibler Information Criterion to measure the closeness of a model to the truth, Vuong (1989) developed a classical approach to model selection. He proposes simple likelihood ratio based statistics for testing the null hypothesis that the competing models are equally close to the true data generating process against the alternative hypothesis that one model is closer. In his approach both, one, or neither of the two competing models may be misspecified. He assumes that observations are independent and identically distributed (i.i.d.). All of his tests are based on the likelihood ratio principle, and consequently he derives the asymptotic distribution of the likelihood ratio statistics for nested, overlapping and non-nested models.
While Vuong's tests are based on the i.i.d. assumption, most large surveys in practice, such as the Current Population Survey (CPS), the Panel Study of Income Dynamics (PSID) and the National Survey of Families and Households (NSFH), rely on stratified and clustered samples, so simple random sampling, and therefore i.i.d. observations, is not the right assumption. In other words, a non-random sampling scheme like Standard Stratified (SS) sampling or Variable Probability (VP) sampling, or a complex survey design like that of the CPS, does not produce a set of independent, identically distributed random variables. Clearly, the i.i.d. assumption is one of the limitations of Vuong's model selection tests in the case of complex samples. This assumption is also restrictive for time series data. Rivers and Vuong (2002), along with Findley (1990, 1991) and Findley and Wei (1993), relax it for time series cases like ARMA models and some dynamic regression models.
Vuong's model selection tests also cannot be used to differentiate between two econometric models defined by moment conditions or, more generally, between two competing models that are incompletely specified. This second limitation arises because the tests are based on the likelihood function. They require that the competing models belong to some parametric family of distributions, and therefore the models must be completely parametrized.
While maximum likelihood is a widespread method of estimation in econometric studies, researchers commonly use other methods as well. Techniques like least absolute deviations, nonlinear least squares, the generalized method of moments (GMM), or other extremum estimators are used for different reasons. Vuong's tests do not cover these estimators, which is their third limitation.
This paper contributes to the subject by extending Vuong's model selection tests to competing models estimated with stratified multistage cluster samples. Many data sets used in microeconometric research are collected by surveys like the CPS or PSID, which have a complex multi-stage sampling structure and violate the i.i.d. assumption needed in Vuong's tests. In addition, in order to generalize Vuong's results to cases other than MLE, we study the problem in the M-estimator framework. Many econometric estimators are M-estimators, including but not limited to linear and nonlinear regression and conditional maximum likelihood, which covers discrete response models.
The paper is organized as follows. In section 3.2, we define two nonnested competing models. In section 3.3, we consider the basic framework under standard stratified sampling. We start with standard stratified sampling because it is widely used in practice to divide the population of interest into subpopulations or strata, and it gives us a base from which to extend the results to more complex sampling designs. Section 3.4 introduces test statistics under SS and VP sampling and under a multi-stage sampling scheme. Also in section 3.4, I show that the test statistics are asymptotically normal. In section 3.5, we extend the model selection test to panel data models with a standard stratification design. An interesting question is whether we need to weight the test statistics when stratification is exogenous. I discuss this point in section 3.6. Section 3.7 shows applications of the tests in two empirical examples. Section 3.8 summarizes the results and concludes.


3.2

The Nonnested Competing Models

Consider the population minimization problem
$$\min_{\theta \in \Theta} E[q(W, \theta)] \tag{3.1}$$
where the scalar $q(\cdot)$ denotes an objective function depending on $W$ and $\theta$, and $W$ is an $M \times 1$ random vector taking values in $\mathcal{W} \subset \mathbb{R}^M$. The data generating process depends on $\theta$, a $P \times 1$ parameter vector that belongs to the parameter space $\Theta \subset \mathbb{R}^P$. We assume that there is a unique value $\theta^\circ$ that minimizes the population problem (3.1) on $\Theta$; it is called the true parameter value that generates the data.
In many applications, the vector $W$ is partitioned into $W = (X, Y)$, where $X$ and $Y$ are respectively $K$ and $L$ dimensional vectors with $L + K = M$. We are often interested in some aspect of the conditional distribution of $Y$ given $X$, such as $E(Y|X)$.
Now, as in Vuong (1989), consider two competing objective functions $q_1(W, \theta_1)$ and $q_2(W, \theta_2)$. These two functions are nonnested in the sense that neither can be represented as a special case of the other. It is important to have a clear idea of what nonnested means. Vuong considers two sets of conditional models $F_\theta = \{f(y|x, \theta);\ \theta \in \Theta\}$ and $G_\gamma = \{g(y|x, \gamma);\ \gamma \in \Gamma\}$ and defines the two models to be nonnested if and only if
$$F_\theta \cap G_\gamma = \emptyset \tag{3.2}$$
This definition is more suitable for MLE cases, where we have full distributional assumptions about the endogenous variables given the exogenous variables for the two competing models. For more general cases we follow Wooldridge (2010) and use the definition
$$P[q_1(W, \theta_1^*) \neq q_2(W, \theta_2^*)] > 0 \tag{3.3}$$
It means that the two functions $q_1(\cdot, \theta_1^*)$ and $q_2(\cdot, \theta_2^*)$, evaluated at the pseudo-true values $\theta_1^*$ and $\theta_2^*$, must differ for a nontrivial set of outcomes of $W_i$ if they are nonnested. This definition rules out nested models as well as other forms of degeneracy.

As a first example, assume we have a random variable $Y$ and would like to model $E(Y|X)$ as a function of the explanatory variables $X$, a $K \times 1$ vector. We specify two competing models: a linear one, $q_{i1}(\theta_1) = (Y_i - X_i\theta_1)^2$, and a nonlinear one, $q_{i2}(\theta_2) = (Y_i - \exp(X_i\theta_2))^2$. These models are nonnested if the mean of $Y_i$ given $X_i$ depends on the nonconstant elements of $X_i$. If instead the mean function is independent of $X_i$, in other words $E(Y_i|X_i) = E(Y_i)$, then both models reduce to the same constant mean. In this case the two models are nested and the limiting standard normal distribution of the Vuong-type statistic breaks down.

3.3

Basic Framework under Standard Stratiﬁed Samples

The population problem is $\min_{\theta \in \Theta} E[q(W, \theta)]$ and we assume $\theta^\circ$ uniquely solves it. Let $q_1(W, \theta_1^*)$ and $q_2(W, \theta_2^*)$ be the two competing models, both of which may be misspecified. The null hypothesis is
$$H_0 : E[q_{i1}(W_i, \theta_1^*)] = E[q_{i2}(W_i, \theta_2^*)] \tag{3.4}$$
Depending on the method used to estimate the two competing models, the alternative hypothesis is
$$H_A^{q_1} : E[q_{i1}(W_i, \theta_1^*)] > E[q_{i2}(W_i, \theta_2^*)] \tag{3.5}$$
or
$$H_A^{q_2} : E[q_{i1}(W_i, \theta_1^*)] < E[q_{i2}(W_i, \theta_2^*)] \tag{3.6}$$
For example, if the competing estimators are QMLEs, then the alternative $H_A^{q_1}$ means that $q_1(\cdot)$ is better than $q_2(\cdot)$ because its value of the likelihood function is larger.
To test the null (3.4) against the alternative (3.5) or (3.6) in the context of complex samples, suppose the estimators $\hat\theta_1$ and $\hat\theta_2$ solve the sample objective function under a complex design that involves stratification and clustering. In this section we first consider the sample objective function under a standard stratified sampling scheme and then consider other types of sampling design. In standard stratified sampling the population of interest is divided into $J$ nonempty, mutually exclusive, and exhaustive strata, and a random sample of size $N_j$ is drawn from stratum $j$, where $j = 1, \dots, J$. Then for each $j$ we have a random sample $\{W_{ij} : i = 1, 2, \dots, N_j\}$. See Wooldridge (2001). The sample objective function is therefore
$$\sum_{j=1}^{J} Q_j \left[ \frac{1}{N_j} \sum_{i=1}^{N_j} q(W_{ij}, \theta) \right] \tag{3.7}$$
Equation (3.7) can be rewritten as
$$\frac{1}{N} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q(W_{ij}, \theta) \tag{3.8}$$
where $Q_j$ is the population frequency of stratum $j$, in other words the probability that a randomly drawn observation from the population falls into stratum $j$, and $H_j \equiv N_j/N$ is the fraction of observations in stratum $j$. As (3.8) shows, in standard stratified sampling observation $i$ in stratum $j$ is weighted by $Q_j/H_j$.
We also assume that $\hat\theta_1$ and $\hat\theta_2$ converge to $\theta_1^*$ and $\theta_2^*$ respectively. These limits are referred to as pseudo-true values; they are not necessarily equal to the true value $\theta^\circ$, and therefore both models may be misspecified.
In order to construct a Vuong-type test, we need the following lemma. It shows that, assuming $\sqrt{N}$-consistency of $\hat\theta_g$ for $\theta_g^*$, $g = 1, 2$, we can find a test statistic whose asymptotic distribution is not affected by the two estimators $\hat\theta_1$ and $\hat\theta_2$.
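Lemma 3.3.1. If $\hat\theta_1$ and $\hat\theta_2$ are $\sqrt{N}$-consistent estimators for $\theta_1^*$ and $\theta_2^*$, then
$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g) = \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^*) + o_p(1) \tag{3.9}$$
for $g = 1, 2$.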
Proof. Assuming that $q_g(\cdot)$ is differentiable with respect to $\theta_g$, a Taylor expansion of $\sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g)$ around $\theta_g^*$, with both sides divided by $N_j$, gives
$$\frac{1}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g) \approx \frac{1}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^*) + \frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*)(\hat\theta_g - \theta_g^*) \tag{3.10}$$

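Multiplying by $Q_j$ and summing over $j$, (3.10) can be written as
$$\sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g) \approx \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^*) + \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*)\,(\hat\theta_g - \theta_g^*) \tag{3.11}$$
Finally, multiplying both sides by $\sqrt{N}$ and using $N_j = N H_j$, we have
$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g) \approx \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^*) + \left[ \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*) \right] \cdot \sqrt{N}(\hat\theta_g - \theta_g^*) \tag{3.12}$$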

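In the second term on the right-hand side of (3.12),
$$\operatorname{plim} \sum_{j=1}^{J} Q_j \left[ \frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*) \right] = \operatorname{plim} \frac{1}{N} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*) = E[\nabla_\theta q_g(W_{ij}, \theta_g^*)] = 0 \tag{3.13}$$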

See Wooldridge (2010). Therefore
$$\sum_{j=1}^{J} Q_j \left[ \frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*) \right] = o_p(1) \tag{3.14}$$
and since by assumption $\hat\theta_g$ is $\sqrt{N}$-consistent, $\sqrt{N}(\hat\theta_g - \theta_g^*) = O_p(1)$. Therefore the product in the second term of (3.12) is $o_p(1)$, and (3.12) can be written as
$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g) \approx \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^*) + o_p(1) \tag{3.15}$$
This completes the proof.

Note that the right-hand side of equation (3.9) in Lemma 3.3.1 is just a function of the random vector $W_{ij}$. We are now ready to set up test statistics similar to Vuong's tests, with an asymptotic normal distribution under the null hypothesis that the two nonnested competing models fit equally well.

3.4

Tests Statistics

3.4.1

The Test Statistic under Standard Stratiﬁed Sampling

In this section we construct test statistics that allow us to discriminate between two competing models. Let $q_{ij1}(W_{ij}, \theta_1) - q_{ij2}(W_{ij}, \theta_2) \equiv r_{ij}(W_{ij}, \theta_1, \theta_2)$. Then by Lemma 3.3.1 we have
$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) = \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \theta_1^*, \theta_2^*) + o_p(1) \tag{3.16}$$
The following theorem shows that, under some conditions, (3.16) has an asymptotic normal distribution.
Theorem 3.4.1. For $g \in \{1, 2\}$ assume that
1. $\{W_{ij} : i = 1, 2, \dots, N_j,\ j = 1, \dots, J\}$ follows the standard stratified sampling scheme.
2. $N_j \to \infty$ for each $j$.
3. $\Theta_g$ is a compact subset of $\mathbb{R}^P$.
4. The objective function $E[q_g(\cdot, \theta_g)]$ has a unique solution on $\Theta_g$ at $\theta_g^*$.
5. $\theta_g^*$ is an interior point of $\Theta_g$.
6. For each $w \in \mathcal{W}$, $q_g(w, \cdot)$ is continuous on $\Theta_g$.
7. $q_g(w, \cdot)$ is twice continuously differentiable on $\Theta_g$.
8. $E[\nabla_\theta q_g(W, \theta_g^*)\, q_g(W, \theta_g^*)] < \infty$ and $E[\nabla_\theta q_g(W, \theta_g^*)] = 0$.
9. For all $\theta_g$, $|\partial^2 q_g(w, \theta_g)/\partial\theta_{gk}\partial\theta_{gm}| \le b(w)$ for all $k$ and $m$, where $E[b(W)] < \infty$.
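Then
$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) - \sqrt{N} \cdot E[r(W, \theta_1^*, \theta_2^*)] \xrightarrow{d} N(0, \eta^2) \tag{3.17}$$
where
$$\eta^2 = \sum_{j=1}^{J} \frac{Q_j^2}{H_j} \operatorname{var}[r(W, \theta_1^*, \theta_2^*) \,|\, W \in \mathcal{W}_j] \tag{3.18}$$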

Proof. The proof is essentially the same as that of Theorem 3.3 in Vuong (1989) and Theorems 3.1 and 3.2 in Wooldridge (2001). The first assumption marks the departure from the i.i.d. assumption in Vuong's model. For the asymptotic analysis, we need the second assumption to be sure that the number of observations $N_j$ in each stratum $j$ goes to infinity. The regularity assumptions 2 to 6 are similar to those of Vuong (1989), and we need assumptions 8 and 9 because we extend the likelihood function to a more general function $q(\cdot)$. These regularity assumptions also ensure that $\hat\theta_g$ is consistent and asymptotically normal. See Wooldridge (2010).
Now we have the test statistic needed to choose between two competing models. The null hypothesis is
$$H_0 : E[q_{i1}(W_i, \theta_1^*)] = E[q_{i2}(W_i, \theta_2^*)] \tag{3.19}$$
against
$$H_A : E[q_{i1}(W_i, \theta_1^*)] > E[q_{i2}(W_i, \theta_2^*)] \tag{3.20}$$
Under (3.19), $E[r(W, \theta_1^*, \theta_2^*)] = 0$ and (3.17) can be written as
$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) \xrightarrow{d} N(0, \eta^2) \tag{3.21}$$

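A consistent estimator of $\eta^2$ is
$$\hat\eta^2 \equiv \sum_{j=1}^{J} \frac{Q_j^2}{H_j} \frac{1}{N_j} \sum_{i=1}^{N_j} \left[ r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) - \bar r_j(W_{ij}, \hat\theta_1, \hat\theta_2) \right]^2 = \frac{1}{N} \sum_{j=1}^{J} \frac{Q_j^2}{H_j^2} \sum_{i=1}^{N_j} \left[ r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) - \bar r_j(W_{ij}, \hat\theta_1, \hat\theta_2) \right]^2 \tag{3.22}$$
Here $\bar r_j(W_{ij}, \hat\theta_1, \hat\theta_2) = \frac{1}{N_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2)$. Therefore the Vuong-type model selection statistic is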

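$$\frac{\dfrac{1}{\sqrt{N}} \sum_{j=1}^{J} \dfrac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2)}{\left\{ \sum_{j=1}^{J} \dfrac{Q_j^2}{H_j} \dfrac{1}{N_j} \sum_{i=1}^{N_j} \left[ r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) - \bar r_j(W_{ij}, \hat\theta_1, \hat\theta_2) \right]^2 \right\}^{1/2}} \xrightarrow{d} N(0, 1) \tag{3.23}$$
or
$$\frac{\dfrac{1}{\sqrt{N}} \sum_{j=1}^{J} \dfrac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2)}{\left\{ \dfrac{1}{N} \sum_{j=1}^{J} \dfrac{Q_j^2}{H_j^2} \sum_{i=1}^{N_j} \left[ r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) - \bar r_j(W_{ij}, \hat\theta_1, \hat\theta_2) \right]^2 \right\}^{1/2}} \xrightarrow{d} N(0, 1) \tag{3.24}$$

3.4.2

The Test Statistic under Variable Probability Sampling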

Variable probability sampling is convenient when observations in the strata are difficult to identify prior to sampling, or when collecting information on the variable determining stratification is cheap relative to the cost of collecting the remaining information. In variable probability sampling, or VP sampling for short, an observation is first drawn at random from the population. If the observation falls into stratum $j$, it is kept with probability $p_j$. For example, if we need to define stratification in terms of individual incomes, we might draw a person at random from the population, determine his income class, and then keep him in the sample with a probability that depends on his income class and is set by the researcher.
In variable probability samples, under the null hypothesis (3.19) that the two competing models fit equally well, the test statistic is
$$\frac{\dfrac{1}{\sqrt{N}} \sum_{i=1}^{N} \sum_{j=1}^{J} p_j^{-1}\, r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2)}{\left\{ \dfrac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{J} p_j^{-2}\, r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2)^2 \right\}^{1/2}} \xrightarrow{d} N(0, 1) \tag{3.25}$$
which is very similar to what we obtained for standard stratified samples under the null in the last section. We just need to replace the weights $Q_j/H_j$ in (3.24) with $p_j^{-1}$. Here we need the sampling probabilities $p_1, p_2, \dots, p_J$ to be strictly positive. The remaining assumptions needed for this result are the same as in Theorem 3.4.1. For more details see Wooldridge (1999).
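The statistics in (3.24) and (3.25) can be sketched numerically as follows. This is a minimal illustration, not code from the dissertation: the function names `vuong_ss` and `vuong_vp`, the integer stratum coding 0, ..., J-1, and the input layout (one value of $r_{ij}$ per sampled observation, already evaluated at $\hat\theta_1, \hat\theta_2$) are assumptions made for the example. Note that, as written, (3.25) uses uncentered squares in the denominator, and the sketch follows that.

```python
import numpy as np


def vuong_ss(r, stratum, Q):
    """Vuong-type statistic under standard stratified sampling, as in (3.24).

    r       : differences r_ij = q1 - q2 at the estimates, one per observation
    stratum : integer stratum label (0..J-1) of each observation (assumed coding)
    Q       : population shares Q_j, one per stratum
    """
    r = np.asarray(r, dtype=float)
    stratum = np.asarray(stratum)
    N = r.size
    num = 0.0
    var = 0.0
    for j, Qj in enumerate(np.asarray(Q, dtype=float)):
        rj = r[stratum == j]
        Hj = rj.size / N                      # sample share H_j = N_j / N
        num += (Qj / Hj) * rj.sum()           # weighted sum in the numerator
        var += (Qj / Hj) ** 2 * ((rj - rj.mean()) ** 2).sum()
    return (num / np.sqrt(N)) / np.sqrt(var / N)


def vuong_vp(r, stratum, p):
    """Vuong-type statistic under variable probability sampling, as in (3.25)."""
    r = np.asarray(r, dtype=float)
    w = 1.0 / np.asarray(p, dtype=float)[np.asarray(stratum)]  # weights 1/p_j
    N = r.size
    num = (w * r).sum() / np.sqrt(N)
    den = np.sqrt(((w * r) ** 2).sum() / N)   # uncentered, following (3.25)
    return num / den
```

Under the null, either value would be compared with standard normal critical values; a large positive or negative value points to one of the two models.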

3.4.3

Tests Statistics under Multi-Stage Sampling

Clustering and stratification are the main features of survey data. For example, the National Survey of Families and Households (NSFH) is a complex survey sample. It has a multistage design that involves clustering, stratification and variable probability sampling. Clusters are groups of families, households or individuals located in relatively close proximity. For example, in a school the students in each class form a cluster. In rural areas villages are clusters, and in urban areas neighborhoods are.
The sampling design considered here is closely related to Bhattacharya (2005). In the first stage, the population of interest is divided into $S$ subpopulations or strata, which are exhaustive and mutually exclusive. Within stratum $s$, there are $C_s$ clusters. In the next step $N_s$ clusters are drawn randomly. Since the asymptotic analysis is based on the number of clusters going to infinity, we assume that in each stratum a large number of clusters is sampled. Units (for example households) within each cluster may be arbitrarily correlated. Each sampled cluster $c$ in stratum $s$ contains a finite population of $M_{sc}$ units (for example households). Finally, for each sampled cluster $c$ in stratum $s$, $K_{sc}$ households are sampled randomly with replacement. The sample objective function is

$$\frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, q_g(W_{scm}, \theta_g) \tag{3.26}$$
for $g = 1, 2$. Here $N = N_1 + N_2 + \dots + N_S$ is the total number of clusters sampled and $v_{sc} = \dfrac{C_s M_{sc}}{N_s K_{sc}}$ is the weight associated with observations $m = 1, \dots, K_{sc}$ within cluster $c$ within stratum $s$. We assume that $N_s/N$ converges to $a_s$, where $a_s$ is fixed and $0 < a_s < 1$. Under this assumption, the weights $v_{sc}$ are taken to be constant.

By the same reasoning as in section 3.3, we can show that the asymptotic distribution of the following statistic is not affected by the estimators $\hat\theta_1$ and $\hat\theta_2$:
$$\frac{1}{\sqrt{N}} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc} \cdot r_{scm}(W_{scm}, \hat\theta_1, \hat\theta_2) \tag{3.27}$$
Here $r_{scm} = q_{scm1}(W_{scm}, \hat\theta_1) - q_{scm2}(W_{scm}, \hat\theta_2)$ is the difference between the two objective functions for unit $m$ in cluster $c$ in stratum $s$. We can also show that, under the null hypothesis that both competing models fit equally well, (3.27) has an asymptotic normal distribution:
$$\frac{1}{\sqrt{N}} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc} \cdot r_{scm}(W_{scm}, \hat\theta_1, \hat\theta_2) \xrightarrow{d} N(0, \xi^2) \tag{3.28}$$
Because of correlation within clusters, the variance $\xi^2$ of (3.27) is more complicated than $\eta^2$ in section 3.4.1. A consistent estimator of $\xi^2$ is
$$\hat\xi^2 = \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}^2\, r_{scm}^2(\hat\theta_1, \hat\theta_2) + \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} \sum_{m' \neq m} v_{sc}^2\, r_{scm}(\hat\theta_1, \hat\theta_2)\, r_{scm'}(\hat\theta_1, \hat\theta_2) - \sum_{s=1}^{S} \frac{1}{N_s} \left[ \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r_{scm}(\hat\theta_1, \hat\theta_2) \right]^2 \tag{3.29}$$

The first term in (3.29) is a correct estimate of the variance under simple random sampling. Under non-random sampling this is no longer true, and we need to add the other two terms, which estimate the clustering and stratification effects respectively. In most cases, the correlation between units (for example families) within a cluster is positive, and therefore the second term enters with a positive sign. On the other hand, because of stratification, more homogeneous observations are sampled within each stratum, which decreases the variance, and hence the third term enters with a negative sign. Therefore, ignoring the clustering effect (the second term) leads to underestimating the true variance, while overlooking the stratification effect leads to overestimating it.
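The three terms of (3.29) can be computed separately to see how much clustering and stratification each contribute. The sketch below is illustrative only: the function name `xi2_terms` and the flat-array input layout (one row per sampled unit, with its stratum label, cluster label and weight $v_{sc}$) are assumptions, and cluster labels are assumed to be unique within a stratum.

```python
import numpy as np


def xi2_terms(r, v, stratum, cluster):
    """Return (term1, term2, term3, xi2_hat) following (3.29).

    r       : unit-level differences r_scm at the estimates
    v       : weight v_sc attached to each unit (constant within a cluster)
    stratum : stratum label of each unit
    cluster : cluster label of each unit (assumed unique within a stratum)
    """
    r = np.asarray(r, dtype=float)
    v = np.asarray(v, dtype=float)
    stratum = np.asarray(stratum)
    cluster = np.asarray(cluster)
    z = v * r                                   # weighted unit contributions
    term1 = (z ** 2).sum()                      # simple-random-sampling part
    term2 = 0.0                                 # within-cluster cross products
    term3 = 0.0                                 # stratification correction
    for s in np.unique(stratum):
        in_s = stratum == s
        ids = np.unique(cluster[in_s])
        totals = np.array([z[in_s & (cluster == c)].sum() for c in ids])
        own_sq = np.array([(z[in_s & (cluster == c)] ** 2).sum() for c in ids])
        # sum over m != m' of z_m z_m' equals (cluster total)^2 - sum of z_m^2
        term2 += (totals ** 2 - own_sq).sum()
        term3 += totals.sum() ** 2 / ids.size   # N_s = number of sampled clusters
    return term1, term2, term3, term1 + term2 - term3
```

Because the weight is constant within a cluster, the first two terms together equal the sum of squared cluster totals, which is why the estimator only needs cluster-level totals in practice.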
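Extending Bhattacharya's (2005) model to more complex sampling designs, in chapter one we investigate a sampling design with variable probability sampling in the final stage. The framework resembles complex surveys like the NSFH and other routine phone surveys in practice. In this case the variance estimator needed for choosing between two competing models is very similar to (3.29):
$$\begin{aligned}
\hat\xi^2 ={}& N^{-1} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{is}^2\, p_j^{-2}\, y_{is}\, \tau_{jm} \tau_{j't}\, z_{jm} z_{j't}\, r_{isjm}^2(\hat\theta_1, \hat\theta_2) \\
&+ N^{-1} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{j=1}^{J} \sum_{j'=1}^{J} \sum_{m=1}^{K} \sum_{t \neq m}^{K} v_{is}^2\, p_j^{-1} p_{j'}^{-1}\, y_{is}\, \tau_{jm} \tau_{j't}\, z_{jm} z_{j't}\, r_{isjm}(\hat\theta_1, \hat\theta_2)\, r_{isj't}(\hat\theta_1, \hat\theta_2) \\
&- \sum_{s=1}^{S} \left[ \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{is}\, p_j^{-1}\, y_{is}\, \tau_{jm} z_{jm}\, r_{isjm}(\hat\theta_1, \hat\theta_2) \right] \times \left[ \sum_{i=1}^{N} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{is}\, p_j^{-1}\, y_{is}\, \tau_{jm} z_{jm}\, r_{isjm}(\hat\theta_1, \hat\theta_2) \right]
\end{aligned} \tag{3.30}$$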

In (3.30), $v_{is}$ are weights exactly as in (3.29), corresponding to the first level of stratification, and $p_j$ are the sampling probabilities that define the weights of the variable probability sampling stage. The indicator variable $\tau_{jm}$ takes the value one if observation $W$ in the second level of stratification (variable probability sampling) is in stratum $j$, and zero otherwise. The indicator variable $z_{jm}$ also corresponds to the second level of stratification. It takes the value one if $W$ is kept in the sample and zero otherwise, and therefore $P(z = 1) = p$. Finally, $y_{is}$ is an indicator variable equal to one if cluster $i$ is in stratum $s$.

3.5

Model Selection Tests in Panel Data Models

Model selection tests in panel data models with complex sampling designs are similar to the tests in the cross section case. When $D(y_{i1}, \dots, y_{iT} | x_{i1}, \dots, x_{iT})$ is fully specified, Vuong's approach is directly applicable using MLE. In less restrictive cases, when we do not have a complete density (as with partial or pooled MLEs or other M-estimators), we need to account for the time series dependence properly. Assume that, for each $t$, $q_{t1}(W_t, \theta_1)$ and $q_{t2}(W_t, \theta_2)$ are competing models of the conditional density in each time period. Here the same null hypothesis, (3.19), still means that the models fit equally well, but only in the weakest sense. The convergence result in equation (3.17) still holds under the null. Under assumption (3.3), the models are nonnested and the variance $\eta^2$ is positive. In estimating $\eta^2$, the serial dependence in $\{q_{it1}(\theta_1^*) - q_{it2}(\theta_2^*)\}$ adds an extra term that must be included in the calculations. Let $\hat r_{ijt} = q_{ijt1}(\hat\theta_1) - q_{ijt2}(\hat\theta_2)$ denote the difference in estimated objective functions for each $t$ and stratum $j$, and let $\bar r_{jt} = N_j^{-1} \sum_{i=1}^{N_j} \hat r_{ijt}$. Then, under standard stratification, a consistent estimate of $\eta^2$ is
$$\hat\eta^2 = \frac{1}{N} \sum_{j=1}^{J} \frac{Q_j^2}{H_j^2} \sum_{i=1}^{N_j} 1_T' D_{ij} D_{ij}' 1_T \tag{3.31}$$
where $1_T$ is the $T \times 1$ vector of ones and $D_{ij}$ is a $T \times 1$ vector defined as
$$D_{ij} = \begin{pmatrix} \hat r_{ij1} - \bar r_{j1} \\ \hat r_{ij2} - \bar r_{j2} \\ \vdots \\ \hat r_{ijT} - \bar r_{jT} \end{pmatrix} \tag{3.32}$$

Therefore the model selection test statistic in a panel data model with a standard stratification design is
$$\frac{\dfrac{1}{\sqrt{N}} \sum_{j=1}^{J} \dfrac{Q_j}{H_j} \sum_{i=1}^{N_j} \sum_{t=1}^{T} \hat r_{ijt}}{\left\{ \dfrac{1}{N} \sum_{j=1}^{J} \dfrac{Q_j^2}{H_j^2} \sum_{i=1}^{N_j} 1_T' U_{ij} 1_T \right\}^{1/2}} \tag{3.33}$$
Here $U_{ij}$ is an upper triangular matrix obtained from $D_{ij} D_{ij}'$ by setting the entries below its diagonal to zero¹. The test statistic (3.33) has an asymptotic standard normal distribution. Note that in the variance estimator (3.31) the mean difference $\bar r_{jt}$ varies across $t$ and $j$ but is the same across $i$. If we replace hypothesis (3.19) with the stronger one, $E[q_{it1}(\theta_1^*)] = E[q_{it2}(\theta_2^*)]$ for $t = 1, \dots, T$, then we can replace $\bar r_{jt}$ with $\bar r_j$, the average of $\hat r_{ijt}$ across $i$ and $t$. In that case the mean difference $\bar r_j$ is a function of the stratum only.
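A small numerical sketch of the panel statistic (3.33) follows. It is an illustration under assumed inputs rather than the dissertation's implementation: the function name `vuong_panel_ss`, the `upper` switch, and the layout of `r_hat` as an N-by-T array of $\hat r_{ijt}$ are hypothetical. Setting `upper=False` uses the full quadratic form $1_T' D_{ij} D_{ij}' 1_T$ of (3.31) in the denominator instead of the upper triangular $U_{ij}$.

```python
import numpy as np


def vuong_panel_ss(r_hat, stratum, Q, upper=True):
    """Panel Vuong-type statistic under standard stratification.

    r_hat   : (N, T) array of differences r_hat_ijt at the estimates
    stratum : integer stratum label (0..J-1) of each cross-section unit
    Q       : population shares Q_j
    upper   : True  -> denominator uses 1' U_ij 1 as in (3.33)
              False -> denominator uses 1' D_ij D_ij' 1 as in (3.31)
    """
    r_hat = np.asarray(r_hat, dtype=float)
    stratum = np.asarray(stratum)
    N = r_hat.shape[0]
    num = 0.0
    var = 0.0
    for j, Qj in enumerate(np.asarray(Q, dtype=float)):
        rj = r_hat[stratum == j]
        Hj = rj.shape[0] / N
        D = rj - rj.mean(axis=0)                # row i holds D_ij' (centered by r_bar_jt)
        num += (Qj / Hj) * rj.sum()             # sums over both i and t
        for d in D:
            outer = np.outer(d, d)              # D_ij D_ij'
            quad = np.triu(outer).sum() if upper else outer.sum()
            var += (Qj / Hj) ** 2 * quad
    return (num / np.sqrt(N)) / np.sqrt(var / N)
```

With `upper=True` the quadratic form equals half of the squared row sum plus half of the sum of squares of the row, so the denominator stays nonnegative.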

3.6

Tests Statistics and Exogenous Stratiﬁcation

It is known that when the population of interest is divided into subpopulations or strata on the basis of exogenous variables, unweighted estimators are consistent and even more efficient than weighted ones, so dropping the weights does not cause any real problems. However, model selection tests are a different matter.

¹ Since $D_{ij} D_{ij}'$ is a symmetric matrix, $U_{ij}$ could equally be a lower triangular matrix.

Usually, we are interested in cases where a model for some feature of the distribution of $Y$ given $X$ is correctly specified. In a correctly specified model, $\theta^\circ$ solves
$$\min_{\theta \in \Theta} E[q(W, \theta) | X = x] \tag{3.34}$$
for all $x \in \mathcal{X}$. For example, assume we are performing nonlinear least squares on a correctly specified parametric model of $E(Y|X)$, so that $W = (Y, X)$. Our objective function is
$$q(W, \theta) = [Y - m(X, \theta)]^2 / 2 \tag{3.35}$$
and $\theta^\circ$ is the true parameter vector such that
$$E(Y | X = x, \theta^\circ) = m(x, \theta^\circ) \tag{3.36}$$
for all $x$. Then $\theta^\circ$ solves $\min_{\theta \in \Theta} E[q(Y, \theta) | x] = \min_{\theta \in \Theta} E\{[Y - m(X, \theta)]^2 / 2 \,|\, x\}$ for all $x$. It follows that $\theta^\circ$ minimizes $E[q(Y, \theta) | x \in \mathcal{X}_j]$ for each $j$.
However, when the underlying model is misspecified in the sense that $\theta^\circ$, the solution to (3.1), does not solve (3.34) for each $x$, the unweighted estimator is not consistent for $\theta^\circ$ while the weighted estimator is.
In model selection tests, where the goal is to choose between two nonnested competing models, the null hypothesis (3.19) will only hold if both models are misspecified. If one model were correctly specified, the equality in (3.19) would change to a strict inequality in favor of the correctly specified model, assuming the objective functions are not the same. For example, suppose that in the example above there are two competing models for $E(Y|x)$, $m_1(x, \theta_1)$ and $m_2(x, \theta_2)$, that are both misspecified. In this case $\theta_g^*$, $g = 1, 2$, does not solve
$$\min_{\theta_g \in \Theta_g} E[q_g(W, \theta_g) | X = x] \tag{3.37}$$
for all $x \in \mathcal{X}$, and therefore the unweighted estimator is inconsistent for $\theta_g^*$. The weighted estimator, on the other hand, is consistent for $\theta_g^*$. Since the model selection test needs consistent estimators of $\theta_g^*$ for $g = 1, 2$, we need to weight observations appropriately even in the case of exogenous stratification.
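The point that weights cannot be dropped under misspecification, even when stratification is exogenous, can be illustrated with a small simulation. Everything in it is an assumption chosen for the example and does not come from the dissertation: the population has $X \sim U(0,1)$ and $E(Y|X) = X^2$, the misspecified working model is linear in $X$, the two strata are $X < 0.5$ and $X \ge 0.5$ with $Q = (0.5, 0.5)$, and the sample shares are $H = (0.2, 0.8)$. In this population the pseudo-true linear slope is $\operatorname{Cov}(X, X^2)/\operatorname{Var}(X) = 1$.

```python
import numpy as np


def simulate(n=200_000, seed=0):
    """Fit a (misspecified) linear model by unweighted and weighted least squares.

    Returns (b_unweighted, b_weighted), each an (intercept, slope) array.
    All design choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    Q = np.array([0.5, 0.5])                    # population stratum shares
    H = np.array([0.2, 0.8])                    # sample stratum shares
    n_j = np.rint(H * n).astype(int)
    # Exogenous stratification: strata are defined by X only.
    x = np.concatenate([rng.uniform(0.0, 0.5, n_j[0]),
                        rng.uniform(0.5, 1.0, n_j[1])])
    s = np.repeat([0, 1], n_j)
    y = x ** 2 + 0.1 * rng.standard_normal(x.size)
    X = np.column_stack([np.ones_like(x), x])
    b_unw = np.linalg.lstsq(X, y, rcond=None)[0]
    # Weighted least squares with weights Q_j / H_j, via sqrt-weight scaling.
    sw = np.sqrt((Q / H)[s])
    b_w = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return b_unw, b_w
```

In this design the weighted slope settles near the population pseudo-true value of 1, while the unweighted slope settles near 1.10, the linear projection coefficient under the oversampled distribution of $X$. A Vuong-type statistic built on the unweighted fit would therefore be evaluated at the wrong pseudo-true values.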

3.7

Empirical Examples

For illustration, the data set nhanes2 provided by Stata is used to contrast two competing models. The nhanes2 data set has a complex sampling scheme including clustering and stratification. We are interested in modeling the risk of heart attack as a function of variables like age, sex, race, weight, and height². The dependent variable, heartatk, is binary: it is equal to one if the respondent has experienced a heart attack, and zero otherwise. The two competing models are a probit and a Bernoulli model with complementary log-log link, both estimated by GLM. Ignoring the sampling design, the quasi-log likelihoods evaluated at the relevant estimates for the probit and Bernoulli models are -555.028 and -556.665 respectively, obtained from 4238 observations. The statistic $\hat\eta^2$ turns out to be 6.433, and therefore the unweighted Vuong test statistic is equal to 0.645. On the other hand, if we take the sampling scheme into account, the quasi-log likelihoods are -559.393 and -561.889 respectively, and our weighted estimate of $\eta^2$ is 11.599. Therefore the weighted Vuong statistic in (3.24) is 0.733. Both statistics point in the direction of the probit model, and although the weighted Vuong statistic is larger, the difference between the models is not statistically significant using a standard normal test at the 5% level.
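The reported numbers can be checked with a few lines of arithmetic. The sketch assumes that the reported $\hat\eta^2$ is on the scale of the summed quasi-log likelihood difference, so that the statistic is that difference divided by $\sqrt{\hat\eta^2}$; this scaling is inferred from the reported values and is not stated explicitly in the text. The helper name `vuong_from_totals` is hypothetical.

```python
import math


def vuong_from_totals(ll_1, ll_2, eta2):
    """Difference in quasi-log likelihoods divided by sqrt(eta2).

    Assumes eta2 is reported on the scale of the summed difference (an inference).
    """
    return (ll_1 - ll_2) / math.sqrt(eta2)


# Probit (model 1) versus complementary log-log (model 2), values from the text.
unweighted = vuong_from_totals(-555.028, -556.665, 6.433)
weighted = vuong_from_totals(-559.393, -561.889, 11.599)

# One-sided 5% critical value of the standard normal distribution.
CRITICAL_5PCT = 1.645
reject_unweighted = unweighted > CRITICAL_5PCT
reject_weighted = weighted > CRITICAL_5PCT
```

Both values round to the statistics reported above (0.645 and 0.733) and lie well below the critical value, which matches the conclusion that neither test rejects the null.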
As a second example, consider the determinants of family income in the United States discussed in chapter 2, section 2.7, using the panel data set obtained from the PSID. We are interested in choosing between two competing models, wFGLS-ar1 (model 1) and wFGLS-re (model 2), where the first model treats the dependency within panels as AR(1) while the second uses a random effects structure. The null hypothesis is
$$H_\circ : E[U_1^2] = E[U_2^2] \tag{3.38}$$
against the alternative
$$H_A : E[U_1^2] > E[U_2^2] \tag{3.39}$$

² The complete set of covariates considered in this example is: houssize, age, agesq, sex, height, weight, iron, diabets, sizeplace, vitaminc, zinc, copper, female, black, race, orace, region1, region2, region3, rural, highbp, highlead, and healthstat. For more information about the data set see http://www.stata-press.com/data/r10/svy.html.

The proper test statistic in this case is (3.33). Its value is about 0.93, which points in the direction of the second model (wFGLS-re), but we cannot reject the null hypothesis in favor of the alternative at the 5% significance level.

3.8

Conclusion

In many applied econometric studies researchers are forced to choose between competing models that seem to fit the data equally well. Model selection tests are suitable tools for identifying the "better" model or models. However, the Vuong (1989) model selection tests are not readily applicable when data sets come from a complex sampling design. In this paper Vuong-type tests are proposed for cases where the data are not a set of i.i.d. observations because of stratification and clustering. The results show that the test statistics are asymptotically normal and have to be weighted. An interesting finding is that even under exogenous stratification we cannot drop the weights, since for nonnested models the null hypothesis implies that the two competing models are misspecified. The tests are applicable to panel data models with complex sample designs, but we need to account for time series dependence properly. One advantage of the model selection tests is that they are easy to compute in empirical studies.


APPENDICES


Appendix A
PROOFS

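In the first appendix we show that the conditional variance of the sample objective function is (2.17).
Proof. The starting point is the definition of the variance.
$$\begin{aligned}
\operatorname{var}\left[ \sum_{j=1}^{J} 1[S = j] \frac{Q_j}{H_j} r(V, \theta^\circ) \,\Big|\, V_2 \right]
&= E\left[ \sum_{j=1}^{J} 1[S = j] \frac{Q_j^2}{H_j^2} r(V, \theta^\circ)\, r(V, \theta^\circ)' \,\Big|\, V_2 \right] \\
&= \sum_{j=1}^{J} \int_{v \in \mathcal{W}} 1[S = j] \frac{Q_j^2}{H_j^2} r(v, \theta^\circ)\, r(v, \theta^\circ)'\, g(s, v | v_2)\, dv
\end{aligned} \tag{A.1}$$
Since the strata do not overlap, we can substitute $1[v \in \mathcal{W}_j]$ for $1[S = j]$. Therefore
$$\begin{aligned}
\operatorname{var}\left[ \sum_{j=1}^{J} 1[S = j] \frac{Q_j}{H_j} r(V, \theta^\circ) \,\Big|\, V_2 \right]
&= \sum_{j=1}^{J} \int_{v \in \mathcal{W}} 1[v \in \mathcal{W}_j] \frac{Q_j^2}{H_j^2} r(v, \theta^\circ)\, r(v, \theta^\circ)'\, g(s, v | v_2)\, dv \\
&= \sum_{j=1}^{J} \int_{v \in \mathcal{W}} 1[v \in \mathcal{W}_j] \frac{Q_j^2}{H_j^2} r(v, \theta^\circ)\, r(v, \theta^\circ)'\, \frac{\frac{H_j}{Q_j} f(v | v_2, \theta)}{\sum_{k=1}^{J} \frac{H_k}{Q_k} R(k, v_2, \theta)}\, dv \\
&= \frac{1}{\sum_{k=1}^{J} \frac{H_k}{Q_k} R(k, v_2, \theta)} \sum_{j=1}^{J} \int_{v \in \mathcal{W}_j} \frac{Q_j}{H_j} r(v, \theta^\circ)\, r(v, \theta^\circ)'\, f(v | v_2, \theta)\, dv \\
&= \eta \sum_{j=1}^{J} \frac{Q_j}{H_j} \int_{v \in \mathcal{W}_j} r(v, \theta^\circ)\, r(v, \theta^\circ)'\, f(v | v_2, \theta)\, dv \\
&= \eta \cdot \sum_{j=1}^{J} \frac{Q_j}{H_j} E\left[ r(V, \theta^\circ)\, r(V, \theta^\circ)' \,|\, V_2, S = j \right]
\end{aligned} \tag{A.2}$$
By dropping the constant term $\eta$ in (A.2), we obtain equation (2.17). This completes the proof.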

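To show that $R^{\circ}(V_2)$, the conditional expectation of the gradient in the sample, is the same as the gradient of
the objective function in the population, we start from the definition:
$$
\begin{aligned}
R^{\circ}(V_2)
&= \sum_{j=1}^{J} E\left[ 1[S=j]\,\frac{Q_j}{H_j}\, \nabla_{\theta}\, r(V,\theta^{\circ}) \,\Big|\, V_2 \right] \\
&= \sum_{j=1}^{J} \int_{v\in W} 1[s=j]\,\frac{Q_j}{H_j}\, \nabla_{\theta}\, r(v,\theta^{\circ})\, g(s,v|v_2)\, dv \\
&= \frac{1}{\sum_{j=1}^{J} \frac{H_j}{Q_j}\, R(j,v_2,\theta)}\, E\left[ \nabla_{\theta}\, r(V,\theta^{\circ}) \,\big|\, V_2 \right] \\
&= \eta \cdot E\left[ \nabla_{\theta}\, r(V,\theta^{\circ}) \,\big|\, V_2 \right]
\end{aligned}
\tag{A.3}
$$
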

Appendix B
TABLES1

Table B.1: Exogenous Stratification with ρ = 0.0, 1000 replications
T = 2

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          .9999     .9991     .9999     .9991     .9999     .9991
            (.0353)   (.0431)   (.0353)   (.0431)   (.0354)   (.0432)
s_β̂1        .0355     .0443     .0353     .0443     .0353     .0443
β̂◦          .0014     .0008     .0014     .0008     .0014     .0008
            (.0426)   (.0471)   (.0426)   (.0471)   (.0426)   (.0471)
s_β̂◦        .0422     .0473     .0420     .0473     .0420     .0473

ρ̂ = .0016 (.0610) in feasible cases.

Table B.2: Exogenous Stratification with ρ = 0.1, 1000 replications
T = 2

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          .9991     1.0006    .9991     1.0004    .9989     1.0002
            (.0354)   (.0446)   (.0352)   (.0444)   (.0354)   (.0445)
s_β̂1        .0355     .0446     .0351     .0443     .0350     .0443
β̂◦          -.0006    .0011     -.0005    .0011     -.0005    .0011
            (.0449)   (.0492)   (.0449)   (.0492)   (.0449)   (.0492)
s_β̂◦        .0422     .0495     .0440     .0495     .0440     .0495

ρ̂ = .1021 (.0597) in feasible cases.

1 In Tables B.1 to B.16 presented in this appendix, rows 2 and 4 are the average values of the estimates of
β1 and β◦, respectively, obtained from 1000 simulated samples, and the values in parentheses are their standard
deviations. Rows 3 and 5 are the average values of the estimated standard deviations of the estimators,
calculated by the formula discussed in Chapter 2.
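
The three kinds of entries described in this note can be computed from the replication output along the following lines. This is a minimal sketch, not the code used for the tables: the function name and the dictionary keys are hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption.

```python
import numpy as np


def summarize_replications(estimates, std_errors):
    """Summarize Monte Carlo output for one coefficient and one estimator.

    estimates  : coefficient estimate from each simulated sample
    std_errors : estimated standard deviation reported in each sample
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    return {
        # Average of the estimates across replications.
        "mean_estimate": float(estimates.mean()),
        # Standard deviation of the estimates (shown in parentheses in the tables);
        # ddof=1 is an assumption about the divisor.
        "sd_estimate": float(estimates.std(ddof=1)),
        # Average of the estimated standard deviations.
        "mean_std_error": float(std_errors.mean()),
    }
```

Comparing "sd_estimate" with "mean_std_error" indicates how well the variance formula tracks the actual sampling variability of an estimator.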

Table B.3: Exogenous Stratification with ρ = 0.5, 1000 replications
T = 2

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          .9998     .9991     .9997     .9993     .9997     .9993
            (.0349)   (.0431)   (.0304)   (.0377)   (.0305)   (.0377)
s_β̂1        .0355     .0443     .0302     .0383     .0302     .0383
β̂◦          .0014     .0007     .0014     .0007     .0014     .0007
            (.0513)   (.0575)   (.0513)   (.0575)   (.0513)   (.0575)
s_β̂◦        .0422     .0578     .0509     .0578     .0509     .0578

ρ̂ = .5008 (.0527) in feasible cases.

Table B.4: Exogenous Stratification with ρ = 0.9, 1000 replications
T = 2

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          .9999     .9995     .9997     .9997     .9997     .9997
            (.0351)   (.0442)   (.0153)   (.0193)   (.0154)   (.0193)
s_β̂1        .0355     .0442     .0151     .0192     .0151     .0193
β̂◦          .0010     .0003     .0010     .0003     .0010     .0003
            (.0569)   (.0641)   (.0565)   (.0641)   (.0565)   (.0641)
s_β̂◦        .0421     .0650     .0568     .0650     .0568     .0650

ρ̂ = .8997 (.0267) in feasible cases.

Table B.5: Exogenous Stratification with ρ = 0.0, 1000 replications
T = 5

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          1.0001    1.0003    1.0001    1.0003    1.0002    1.0003
            (.0232)   (.0257)   (.0232)   (.0257)   (.0232)   (.0257)
s_β̂1        .0244     .0271     .0242     .0262     .0242     .0262
β̂◦          .0010     .0016     .0010     .0016     .0009     .0016
            (.0263)   (.0277)   (.0257)   (.0277)   (.0257)   (.0277)
s_β̂◦        .0263     .0284     .0262     .0277     .0262     .0277

ρ̂ = -.0003 (.0303) in feasible cases.


Table B.6: Exogenous Stratification with ρ = 0.1, 1000 replications
T = 5

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          1.0001    1.0003    1.0002    1.0003    1.0002    1.0003
            (.0232)   (.0257)   (.0230)   (.0256)   (.0231)   (.0256)
s_β̂1        .0244     .0271     .0240     .0260     .0240     .0260
β̂◦          .0010     .0017     .0010     .0017     .0010     .0017
            (.0278)   (.0300)   (.0278)   (.0300)   (.0278)   (.0300)
s_β̂◦        .0263     .0307     .0283     .0300     .0283     .0300

ρ̂ = .0995 (.0301) in feasible cases.

Table B.7: Exogenous Stratification with ρ = 0.5, 1000 replications
T = 5

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          1.0000    1.0001    1.0002    1.0000    1.0004    1.0003
            (.0235)   (.0260)   (.0193)   (.0213)   (.0195)   (.0211)
s_β̂1        .0244     .0271     .0197     .0213     .0197     .0212
β̂◦          .0014     .0023     .0011     .0019     -.0009    -.0002
            (.0382)   (.0411)   (.0377)   (.0407)   (.0371)   (.0403)
s_β̂◦        .0263     .0423     .0383     .0406     .0382     .0405

ρ̂ = .4995 (.0258) in feasible cases.

Table B.8: Exogenous Stratification with ρ = 0.9, 1000 replications
T = 5

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          .9999     .9999     1.0000    .9999     1.0000    .9999
            (.0234)   (.0263)   (.0089)   (.0097)   (.0089)   (.0097)
s_β̂1        .0244     .0271     .0088     .0095     .0088     .0095
β̂◦          .0008     .0015     .0004     .0010     .0004     .0010
            (.0532)   (.0577)   (.0524)   (.0569)   (.0524)   (.0569)
s_β̂◦        .0263     .0586     .0531     .0564     .0531     .0564

ρ̂ = .8991 (.0126) in feasible cases.


Table B.9: Endogenous Stratification with ρ = 0.0, 1000 replications
T = 2

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          1.0710    .9996     1.0710    .9996     1.0709    .9997
            (.0381)   (.0412)   (.0381)   (.0412)   (.0381)   (.0411)
s_β̂1        .0413     .0417     .0376     .0417     .0375     .0416
β̂◦          .1119     .0005     .1119     .0005     .1119     .0005
            (.0373)   (.0394)   (.0373)   (.0394)   (.0373)   (.0394)
s_β̂◦        .0430     .0394     .0375     .0394     .0375     .0394

ρ̂ = -.0012 (.0562) in feasible cases.

Table B.10: Endogenous Stratification with ρ = 0.1, 1000 replications
T = 2

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          1.0695    .9997     1.0701    .9996     1.0701    .9997
            (.0382)   (.0412)   (.0379)   (.0411)   (.0379)   (.0410)
s_β̂1        .0412     .0417     .0374     .0415     .0373     .0414
β̂◦          .1242     .0005     .1241     .0005     .1241     .0005
            (.0386)   (.0407)   (.0386)   (.0407)   (.0386)   (.0407)
s_β̂◦        .0429     .0409     .0388     .0409     .0388     .0409

ρ̂ = .0989 (.0559) in feasible cases.

Table B.11: Endogenous Stratification with ρ = 0.5, 1000 replications
T = 2

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          1.0643    1.0001    1.0527    .9992     1.0529    .9993
            (.0389)   (.0414)   (.0334)   (.0365)   (.0335)   (.0365)
s_β̂1        .0413     .0414     .0329     .0364     .0329     .0363
β̂◦          .1732     .0008     .1746     .0007     .1746     .0007
            (.0428)   (.0448)   (.0426)   (.0448)   (.0426)   (.0448)
s_β̂◦        .0431     .0453     .0428     .0453     .0428     .0453

ρ̂ = .4996 (.0487) in feasible cases.


Table B.12: Endogenous Stratification with ρ = 0.9, 1000 replications
T = 2

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          1.0591    1.0007    1.0133    .9994     1.0133    .9994
            (.0404)   (.0419)   (.0174)   (.0193)   (.0175)   (.0193)
s_β̂1        .0419     .0408     .0170     .0186     .0170     .0186
β̂◦          .2223     .0010     .2278     .0008     .2278     .0008
            (.0459)   (.0478)   (.0451)   (.0478)   (.0451)   (.0478)
s_β̂◦        .0437     .0482     .0449     .0482     .0449     .0482

ρ̂ = .9006 (.0251) in feasible cases.

Table B.13: Endogenous Stratification with ρ = 0.0, 1000 replications
T = 5

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          1.0324    1.0001    1.0324    1.0001    1.0324    1.0001
            (.0240)   (.0261)   (.0240)   (.0261)   (.0240)   (.0261)
s_β̂1        .0260     .0277     .0240     .0269     .0248     .0269
β̂◦          .0469     -.0001    .0469     -.0001    .0469     -.0001
            (.0245)   (.0263)   (.0245)   (.0263)   (.0246)   (.0263)
s_β̂◦        .0265     .0272     .0249     .0266     .0249     .0266

ρ̂ = -.0003 (.0280) in feasible cases.

Table B.14: Endogenous Stratification with ρ = 0.1, 1000 replications
T = 5

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          1.0321    1.0000    1.0317    1.0001    1.0317    1.0001
            (.0241)   (.0263)   (.0240)   (.0260)   (.0240)   (.0260)
s_β̂1        .0260     .0277     .0246     .0267     .0246     .0267
β̂◦          .0523     -.0001    .0550     -.0001    .0550     -.0001
            (.0265)   (.0248)   (.0263)   (.0282)   (.0264)   (.0282)
s_β̂◦        .0265     .0294     .0268     .0285     .0268     .0285

ρ̂ = .0996 (.0279) in feasible cases.


Table B.15: Endogenous Stratification with ρ = 0.5, 1000 replications
T = 5

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          1.0300    .9997     1.0203    .9998     1.0203    .9998
            (.0241)   (.0264)   (.0202)   (.0217)   (.0202)   (.0217)
s_β̂1        .0261     .0276     .0202     .0219     .0202     .0218
β̂◦          .0921     -.0002    .1019     -.0005    .1019     -.0004
            (.0357)   (.0383)   (.0341)   (.0367)   (.0341)   (.0367)
s_β̂◦        .0266     .0394     .0347     .0370     .0347     .0370

ρ̂ = .4991 (.0241) in feasible cases.

Table B.16: Endogenous Stratification with ρ = 0.9, 1000 replications
T = 5

Average     POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
β̂1          1.0252    .9997     1.0036    .9997     1.0036    .9997
            (.0245)   (.0259)   (.0093)   (.0099)   (.0093)   (.0099)
s_β̂1        .0267     .0272     .0091     .0098     .0091     .0098
β̂◦          .1955     -.0011    .1980     -.0013    .1980     -.0013
            (.0442)   (.0476)   (.0426)   (.0460)   (.0426)   (.0460)
s_β̂◦        .0271     .0486     .0432     .0463     .0432     .0463

ρ̂ = .8993 (.0121) in feasible cases.


Table B.17: Variable Descriptions

age          the actual age of Head
aychild6     1 if age of youngest person in the family is 6 or less
black        1 if Head is black
female       1 if Head is female
fsize        the actual number of persons in the family
fweight3     2003 core/immigrant family weight
edu_hs       1 if the highest level of Head’s education is completed high school
fedu_hs      1 if the highest level of Head’s father’s education is completed high school
medu_hs      1 if the highest level of Head’s mother’s education is completed high school
health       1 if health condition of Head is good, very good or excellent
married      1 if Head is married
nchild       the actual number of persons currently in the family under 18 years of age
unemployed   1 if Head is unemployed
tfinc        total family money income last year
twealth      sum of values of seven asset types, net of debt value plus value of home equity

Table B.18: Summary Statistics

               (1)       (2)       (3)       (4)       (5)
VARIABLES      N         mean      sd        min       max

age            15,672    47.45     14.97     16        99
aychild6       15,672    0.197     0.398     0         1
black          15,672    0.282     0.450     0         1
female         15,672    0.250     0.433     0         1
fsize          15,672    2.638     1.414     1         10
fweight3       15,672    23.38     16.74     0         114.3
edu_hs         15,672    0.464     0.499     0         1
fedu_hs        15,672    0.692     0.462     0         1
medu_hs        15,672    0.702     0.457     0         1
health         15,672    0.854     0.353     0         1
married        15,672    0.566     0.496     0         1
nchild         15,672    0.787     1.128     0         8
unemployed     15,672    0.0501    0.218     0         1
tfinc          15,672    74.69     111.4     -99.26    6,317
twealth        15,672    310.0     1,201     -2,700    50,475

Table B.19: Determinants of Family Income in the U.S.

               (1)          (2)          (3)          (4)          (5)          (6)          (7)
VARIABLES      POLS         wPOLS        FGLS_re      wFGLS_re     FGLS_ar1     wFGLS_ar1    wFGLS_un

twealth        0.036**      0.033**      0.011**      0.021**      0.013**      0.022**      0.019**
               (0.006)      (0.008)      (0.004)      (0.006)      (0.004)      (0.006)      (0.006)
edu_hs         -19.194**    -31.082**    -2.639       -25.548**    -9.913*      -27.669**    -26.751**
               (3.471)      (5.807)      (4.494)      (5.083)      (4.352)      (5.196)      (5.153)
age.edu_hs     -0.061       0.150        -0.501**     -0.005       -0.341**     0.036        0.008
               (0.081)      (0.119)      (0.100)      (0.101)      (0.096)      (0.103)      (0.102)
age            10.774**     12.950**     7.383**      11.821**     8.311**      11.980**     11.832**
               (0.748)      (1.286)      (0.932)      (1.233)      (0.901)      (1.240)      (1.234)
age2           -0.179**     -0.222**     -0.097**     -0.193**     -0.119**     -0.197**     -0.193**
               (0.016)      (0.026)      (0.017)      (0.024)      (0.017)      (0.024)      (0.024)
age3/1000      1.000**      1.000**      0.340**      1.000**      0.480**      1.000**      1.000**
               (0.097)      (0.156)      (0.100)      (0.137)      (0.100)      (0.139)      (0.137)
health         12.772**     13.861**     6.457**      10.803**     5.527**      10.619**     9.880**
               (1.219)      (2.039)      (1.068)      (1.509)      (1.058)      (1.503)      (1.421)
married        25.914**     26.845**     24.413**     29.093**     25.608**     29.349**     29.960**
               (1.788)      (3.022)      (2.168)      (2.870)      (2.129)      (2.865)      (2.861)
fsize          10.017**     12.830**     6.747**      10.396**     5.910**      9.983**      9.484**
               (1.019)      (1.530)      (1.010)      (1.241)      (1.049)      (1.280)      (1.178)
aychild6       -3.365       -5.642       -1.089       -3.201       -2.880       -4.425       -3.923
               (2.227)      (4.793)      (1.913)      (2.774)      (2.100)      (3.395)      (3.156)
unemployed     -22.596**    -27.979**    -7.613**     -16.387**    -6.692**     -15.962**    -14.189**
               (3.427)      (6.441)      (2.669)      (4.880)      (2.419)      (4.863)      (4.755)
unem.edu_hs    18.206**     24.091**     4.781        13.763*      4.742        14.353**     12.432*
               (3.786)      (7.295)      (3.042)      (5.552)      (2.737)      (5.517)      (5.338)
fedu_hs        -12.540**    -12.616**    -14.012**    -13.203**    -13.829**    -12.917**    -12.931**
               (1.793)      (3.344)      (3.157)      (3.503)      (3.204)      (3.474)      (3.498)
medu_hs        -7.334**     -8.722*      -9.418**     -9.820**     -9.070**     -9.663**     -9.671**
               (1.872)      (3.478)      (3.181)      (3.544)      (3.372)      (3.584)      (3.610)
black          -11.494**    -9.478**     -16.158**    -11.550**    -16.635**    -11.657**    -12.137**
               (0.985)      (2.063)      (1.538)      (1.879)      (1.460)      (1.898)      (1.912)
female         -7.389**     -6.304*      -12.720**    -8.815**     -12.680**    -8.666**     -9.010**
               (1.287)      (2.872)      (1.727)      (2.418)      (1.698)      (2.432)      (2.356)
nchild         -9.119**     -10.604**    -7.618**     -9.830**     -5.289**     -7.964**     -7.774**
               (1.263)      (1.991)      (1.172)      (1.645)      (1.464)      (2.056)      (1.917)
year05         2.015        2.537        2.994**      3.007*       2.913**      2.965*       3.058*
               (1.903)      (1.446)      (1.049)      (1.315)      (1.085)      (1.328)      (1.321)
year07         5.409**      6.344**      7.656**      7.615**      7.488**      7.520**      7.773**
               (2.031)      (1.910)      (1.190)      (1.597)      (1.302)      (1.631)      (1.618)
year09         11.822**     12.816**     12.772**     13.240**     12.721**     13.155**     13.280**
               (2.387)      (1.930)      (1.616)      (1.836)      (1.614)      (1.831)      (1.826)
Constant       -146.552**   -185.597**   -86.376**    -163.858**   -96.402**    -165.917**   -162.316**
               (10.626)     (18.300)     (15.406)     (18.229)     (14.094)     (18.201)     (18.251)


Table B.19 (continued)

               (1)          (2)          (3)          (4)          (5)          (6)          (7)
VARIABLES      POLS         wPOLS        FGLS_re      wFGLS_re     FGLS_ar1     wFGLS_ar1    wFGLS_un

Observations   15,672       15,672       15,672       15,672       15,672       15,672       15,672
R-squared      0.286        0.282
Number of fid               3,918        3,918        3,918        3,918        3,918        3,918
λ̂ / ρ̂                                    0.63         0.40         0.69         0.45

Robust standard errors in parentheses
** p<0.01, * p<0.05


BIBLIOGRAPHY


[1] Bhattacharya, D. (2005): “Asymptotic Inference from Multi-Stage Samples” Journal of Econometrics, 126, 145-171.
[2] Cameron, A.C., Trivedi, P.K. (2005): “Microeconometrics: Methods and Applications” Cambridge University Press, New York, NY.
[3] Chamberlain, G. (1987): “Asymptotic Efficiency in Estimation with Conditional Moment Restrictions” Journal of Econometrics, 34, 305-334.
[4] Cosslett, S.R. (1981a): “Efficient Estimation of Discrete Choice Models” In: Manski, C.F., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA.
[5] Cosslett, S.R. (1981b): “Maximum Likelihood Estimators for Choice-Based Samples” Econometrica, 49, 1289-1316.
[6] Cosslett, S.R. (1993): “Estimation from Endogenously Stratified Samples” In: Maddala, G.S., Rao, C.R., Vinod, H.D. (Eds.), Handbook of Statistics, vol. 11, 1-43.
[7] Findley, D.F. (1990): “Making Difficult Model Comparisons” mimeo, U.S. Bureau of the Census.
[8] Findley, D.F. (1991): “Convergence of finite multistep predictors from incorrect models and its role in model selection” Note di Matematica, XI, 145-55.
[9] Findley, D.F., Wei, C.Z. (1993): “Moment bound for deriving time series CLT’s and model selection procedures” Statistica Sinica, 3, 453-80.
[10] Hardin, J.W., Hilbe, J.M. (2003): “Generalized Estimating Equations” Chapman & Hall/CRC.
[11] Johnson, D.R., Elliott, L.A. (1998): “Sampling Design Effects: Do They Affect the Analyses of Data from the National Survey of Families and Households?” Journal of Marriage and Family, 60, 993-1001.
[12] Heeringa, S.G., Berglund, P.A., Khan, A. (2011): “Construction and Evaluation of the 2009 Longitudinal Individual and Family Weights” Panel Study of Income Dynamics Technical Report. Survey Research Center, University of Michigan, Ann Arbor.
[13] Heeringa, S.G., Berglund, P.A., Khan, A., Lee, S., Gouskova, E. (2011): “PSID Cross-sectional Individual Weights, 1997-2009” Panel Study of Income Dynamics Technical Report. Survey Research Center, University of Michigan, Ann Arbor.
[14] Hausman, J.A., Wise, D.A. (1981): “Stratification on an endogenous variable and estimation: The Gary income maintenance experiment” In: Manski, C.F., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA, 365-391.
[15] Imbens, G.W. (1992): “An Efficient Method of Moments Estimator for Discrete Choice Models with Choice-Based Sampling” Econometrica, 60, 1187-1214.
[16] Imbens, G.W., Lancaster, T. (1996): “Efficient Estimation and Stratified Sampling” Journal of Econometrics, 74, 289-318.
[17] Manski, C.F., Lerman, S. (1977): “The Estimation of Choice Probabilities from Choice-Based Samples” Econometrica, 45, 1977-1988.
[18] Manski, C.F., McFadden, D. (1981): “Alternative Estimators and Sample Designs for Discrete Choice Analysis” In: Manski, C.F., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA, 2-50.
[19] Newey, W.K., McFadden, D. (1994): “Large Sample Estimation and Hypothesis Testing” In: Engle, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. IV, Amsterdam: North Holland, 2111-2245.
[20] Panel Study of Income Dynamics, public use dataset. Produced and distributed by the Institute for Social Research, Survey Research Center, University of Michigan, Ann Arbor, MI (2012).
[21] Rivers, D., Vuong, Q. (2002): “Model Selection Tests for Nonlinear Dynamic Models” The Econometrics Journal, 5, 1-39.
[22] Tripathi, G. (2011): “Moment-Based Inference with Stratified Data” Econometric Theory, 27, 47-73.
[23] Vuong, Q. (1989): “Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses” Econometrica, 57, 307-333.
[24] Wooldridge, J.M. (1999): “Asymptotic Properties of Weighted M-Estimators for Variable Probability Samples” Econometrica, 67 (6), 1385-1406.
[25] Wooldridge, J.M. (2001): “Asymptotic Properties of Weighted M-Estimators for Standard Stratified Samples” Econometric Theory, 17, 451-470.
[26] Wooldridge, J.M. (2008): “Cluster and stratified sampling” Imbens / Wooldridge BEA/FTC Lectures, Lecture notes 7 & 8.
[27] Wooldridge, J.M. (2010): “Econometric Analysis of Cross-Section and Panel Data” (2nd ed.) MIT Press, Cambridge, MA.