THREE ESSAYS IN COMPLEX SAMPLES
by
Iraj Rahmani

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
ECONOMICS
2012

ABSTRACT
THREE ESSAYS IN COMPLEX SAMPLES
by
Iraj Rahmani

The samples used in econometric studies are not always sets of randomly drawn observations from the populations of interest. In many studies sampling has a complex design involving clustering and stratification. In stratification, the population is divided into subpopulations, or strata, based on exogenous or endogenous variables, and then a random sample of unit observations or clusters is drawn from each stratum. Clusters are contiguous groups of units existing within a stratum. Reducing the cost of sampling and operational convenience are common reasons for applying stratification and clustering. In addition, particular interest in a small subpopulation may call for oversampling, which justifies a non-random sampling scheme. This dissertation consists of three essays addressing estimation and inference in cross section and panel data models with non-random samples. In general, ignoring the sampling design can produce inconsistent estimators of the parameters and inconsistent estimators of their standard errors. In the first essay, a multi-stage sampling design is considered that includes standard stratification and clustering in the first stages and variable probability sampling in the final stage. The problem is studied within the M-estimator framework. Under a set of regularity conditions, the usual weighted estimators are consistent and asymptotically normal. When stratification in the first stage, the second stage, or both stages is exogenous, the corresponding weights can be dropped and the estimators remain consistent. The second essay contributes to the subject of non-random sampling by studying efficiency in panel data models when the data come from stratified samples.
The goal in this chapter is to obtain more efficient estimators by accounting for correlation within panels in models with a stratified structure. We do not try to find the efficiency bound for this class of models. Our aim is to increase efficiency compared with pooled models that ignore correlation within panels. The paper takes into account correlation within each panel and in each stratum under a GMM-based framework. The theoretical development and a Monte Carlo study show that, by accounting for correlation within the panels in each stratum and combining the strata with appropriate weights, more efficient estimators can be found. As with generalized estimating equations (GEE), we are able to specify a particular form of correlation for the panels in each stratum. Monte Carlo results confirm that the new GMM estimators, called weighted and unweighted GLS, are more efficient than their competitors, OLS and weighted OLS, which simply overlook the correlation within the panels. Under endogenous stratification weighted GLS performs best, and under exogenous stratification unweighted GLS performs best. For a given sample size, the efficiency gain depends on the form chosen for the correlation and on how strong or weak the correlation is. We apply the results to study the determinants of inequality in the U.S., and the estimation results show that the efficiency gain compared with POLS or weighted POLS is substantial. The subject of the third essay is the model selection problem. In complex samples involving stratification and clustering, the assumption that observations are independently and identically distributed no longer holds, and therefore Vuong’s (1989) model selection tests are not directly applicable. In order to generalize Vuong’s results to estimators other than MLE, we study the problem under the M-estimator framework, which contains many estimators including, but not limited to, linear and non-linear least squares, MLE, and QMLE.
The theoretical results show that, for two nonnested competing models, the asymptotic properties of the weighted test statistics depend on the observations rather than on the competing estimators, and the statistics are asymptotically normal. An interesting finding is that even under exogenous stratification we cannot drop the weights in the test statistics, since for nonnested tests both competing models should be misspecified under the null. We also apply the results in two empirical studies.

Copyright by
Iraj Rahmani
2012

To my late father, my mother, and my brothers, Behzad and Reza, and my sister, Maryam

ACKNOWLEDGEMENTS

It would not have been possible to write this doctoral thesis without the help and support of many great people around me, only some of whom it is possible to mention here. It is difficult to overstate my gratitude to my Ph.D. supervisor, Professor Jeffrey Wooldridge. With his enthusiasm, his inspiration, and his unsurpassed knowledge of econometrics and statistics, he helped me to move in the right direction. I would have been lost without him. Throughout my thesis-writing period, he provided encouragement, sound advice, good teaching, and many good ideas. I would like to thank Professors Peter Schmidt and Tim Vogelsang in the Department of Economics, and Tapabrata Maiti in the Department of Statistics and Probability, for their encouragement and support. It was a great honor to have these great scholars as my committee members. I wish to thank Emma Iglesias, who was a member of my committee, for her help and support. I would also like to thank Professor Hassan Mohammadi at Illinois State University and Professor Kambiz H. Kiani at Shahid Beheshti University (the National University of Iran), who encouraged me to continue my education in Iran and the U.S., for their kind assistance with writing letters and for their wise advice. My friendship with other Ph.D. students was very fruitful and I learned many great lessons from them.
I would like to thank all of them, particularly Dr. Do Won Kwak, from whom I learned a great deal about Stata programming. I received a great deal of help from the former and current Graduate Secretaries, Jennifer Carducci and Lori Jean Nichols, and also from Margaret Lynch, the Office Manager of the Department of Economics, and I would like to thank them for their kindness and sincere assistance. I owe thanks as well to Leila Ardestani for the continuous support and encouragement that I received from her. I wish to thank my wonderful and very kind brothers and sister for giving me their unconditional love and support. My brothers Behzad and Reza and my sister Maryam have always been fountains of sincere friendship and love. I am so thankful for having them in all stages of my life. Without their help and support, I could not have stayed so many years far from home without worrying about any issue. Lastly, and most importantly, I wish to thank my father Rahman Rahmani and my wonderful mother Roohangiz Saadati. They were my first teachers, who taught me the most important elements of life: love, friendship, and forgiveness. It was a great misfortune to lose my father almost at the end of my first year in the Ph.D. program. The pain is still fresh and his place in my heart will never be filled. He was the pillar of my life, my best friend, and a source of wisdom and great advice. In his absence, my dear mother did her best to keep me on the road, sound and firm. I cannot find suitable words to express my gratitude to her. To my parents and brothers and sister, I dedicate this thesis.

TABLE OF CONTENTS

LIST OF TABLES

CHAPTER 1 ASYMPTOTIC INFERENCE OF M-ESTIMATOR FROM MULTISTAGE SAMPLES WITH VARIABLE PROBABILITY IN FINAL STAGE
  1.1 Introduction
  1.2 The Population Optimization Problem
  1.3 The Sampling Scheme
  1.4 Estimation under Multi-stage Sampling
    1.4.1 Consistency
    1.4.2 Asymptotic Normality of the Weighted M-Estimator
    1.4.3 Estimating the Asymptotic Variance
  1.5 Estimation under Exogenous stratification
    1.5.1 Consistency of the Unweighted M-Estimator
      1.5.1.1 Consistency of the Unweighted M-Estimator: Case One
      1.5.1.2 Consistency of Unweighted M-estimator: Case Two
    1.5.2 Asymptotic Normality of the Unweighted M-Estimator
  1.6 Examples
  1.7 Two-Step M-Estimator
  1.8 Conclusion

CHAPTER 2 ASYMPTOTIC EFFICIENCY IN THE PANEL DATA MODELS WITH STRATIFIED SAMPLING
  2.1 Introduction
  2.2 The Moment Conditions
  2.3 Sampling Scheme
  2.4 Efficient estimation under moment restrictions
    2.4.1 Moment restrictions in the sample
    2.4.2 Efficient estimation
  2.5 Examples
  2.6 The normal linear model: A Monte Carlo investigation
  2.7 Determinants of Family income in the U.S: An Empirical Application
  2.8 Conclusion

CHAPTER 3 MODEL SELECTION TESTS IN COMPLEX SAMPLES
  3.1 Introduction
  3.2 The Nonnested Competing Models
  3.3 Basic Framework under Standard Stratified Samples
  3.4 Tests Statistics
    3.4.1 The Test Statistic under Standard Stratified Sampling
    3.4.2 The Test Statistic under Variable Probability Sampling
    3.4.3 Tests Statistics under Multi-Stage Sampling
  3.5 Model Selection Tests in Panel Data Models
  3.6 Tests Statistics and Exogenous Stratification
  3.7 Empirical Examples
  3.8 Conclusion

APPENDIX A PROOFS
APPENDIX B TABLES
BIBLIOGRAPHY

LIST OF TABLES

B.1 Exogenous Stratification with ρ = 0.0, 1000 replications
B.2 Exogenous Stratification with ρ = 0.1, 1000 replications
B.3 Exogenous Stratification with ρ = 0.5, 1000 replications
B.4 Exogenous Stratification with ρ = 0.9, 1000 replications
B.5 Exogenous Stratification with ρ = 0.0, 1000 replications
B.6 Exogenous Stratification with ρ = 0.1, 1000 replications
B.7 Exogenous Stratification with ρ = 0.5, 1000 replications
B.8 Exogenous Stratification with ρ = 0.9, 1000 replications
B.9 Endogenous Stratification with ρ = 0.0, 1000 replications
B.10 Endogenous Stratification with ρ = 0.1, 1000 replications
B.11 Endogenous Stratification with ρ = 0.5, 1000 replications
B.12 Endogenous Stratification with ρ = 0.9, 1000 replications
B.13 Endogenous Stratification with ρ = 0.0, 1000 replications
B.14 Endogenous Stratification with ρ = 0.1, 1000 replications
B.15 Endogenous Stratification with ρ = 0.5, 1000 replications
B.16 Endogenous Stratification with ρ = 0.9, 1000 replications
B.17 Variables Descriptions
B.18 Summary Statistics
B.19 Determinants of Family Income in the U.S

Chapter 1

ASYMPTOTIC INFERENCE OF M-ESTIMATOR FROM MULTI-STAGE SAMPLES WITH VARIABLE PROBABILITY IN FINAL STAGE

1.1 Introduction

Economic analyses often assume that the observations come from simple random sampling, that is, that a set of independent and identically distributed (i.i.d.) observations is available. In reality, however, many data sets used in economics and other branches of the social sciences come from stratified sampling schemes that produce nonrandom samples.
When a data set comes from a nonrandom sampling scheme, the i.i.d. assumption is no longer valid, and inference about the econometric model requires more care. The goal of this chapter is to examine the asymptotic properties of M-estimators when the observations come from several levels of stratification. Three well-known stratified sampling schemes are standard stratified sampling (SS sampling), multinomial sampling, and variable probability sampling (VP sampling). In SS sampling, the population is divided into several subpopulations based on factors such as income, race, gender, education, or area of residence. Then a random sample is taken within each subpopulation, or stratum, independently. The result is a sample of independent but not identically distributed observations. It should be emphasized that, unlike in simple random sampling, in SS sampling the proportions of observations within strata do not reflect population proportions as they would if the sample were selected randomly from the population. The multinomial sampling scheme is similar to SS sampling. The difference is that in multinomial sampling a stratum is first chosen randomly and then an observation is drawn randomly from that stratum. Although this kind of sampling is not common in practice, it is theoretically easier to deal with because it produces i.i.d. observations. In VP sampling, which is also known as Bernoulli sampling in the literature, an observation is first drawn randomly from the population, and then its stratum is determined by the researcher. After its stratum is determined, the observation is kept in the sample with a specific probability that is also set by the researcher. If the observation is not kept, it is returned to the population and its values are not recorded. The SS sampling scheme is often used when observations from each stratum are easily identified before sampling.
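The difference between SS sampling and VP sampling can be illustrated with a small simulation. The following is only a sketch under assumed values: the lognormal outcome, the cutoff of 2.0 that defines the two strata, the sample sizes, and the retention probabilities are all hypothetical and chosen for illustration. It shows that under both schemes the share of a stratum in the sample differs from its share in the population.

```python
import random

def make_population(n, rng):
    """Hypothetical population of (stratum, y) pairs; stratum 1 holds the large outcomes."""
    pop = []
    for _ in range(n):
        y = rng.lognormvariate(0.0, 1.0)   # illustrative income-like outcome
        stratum = 1 if y > 2.0 else 0      # strata defined by the outcome (assumed cutoff)
        pop.append((stratum, y))
    return pop

def ss_sample(pop, n_per_stratum, rng):
    """SS sampling: a fixed number of random draws (with replacement) inside each stratum."""
    sample = []
    for s, n_s in n_per_stratum.items():
        members = [u for u in pop if u[0] == s]
        sample.extend(rng.choice(members) for _ in range(n_s))
    return sample

def vp_sample(pop, keep_prob, n_draws, rng):
    """VP sampling: draw a unit, observe its stratum j, keep it with probability p_j."""
    sample = []
    for _ in range(n_draws):
        s, y = rng.choice(pop)
        if rng.random() < keep_prob[s]:    # discarded units are simply not recorded
            sample.append((s, y))
    return sample

def share(units, s):
    """Fraction of units that belong to stratum s."""
    return sum(1 for u in units if u[0] == s) / len(units)

rng = random.Random(0)
pop = make_population(20000, rng)
ss = ss_sample(pop, {0: 500, 1: 500}, rng)
vp = vp_sample(pop, {0: 0.2, 1: 1.0}, 20000, rng)

print("population share of stratum 1:", round(share(pop, 1), 3))
print("SS sample share of stratum 1: ", round(share(ss, 1), 3))
print("VP sample share of stratum 1: ", round(share(vp, 1), 3))
```

In this sketch the SS design fixes the stratum shares in advance, while in the VP design the realized shares and the realized sample size are both random.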
The variable probability sampling scheme is more suitable when the stratum of an observation is known only after sampling. For example, determining a family’s income bracket is difficult before sampling, and therefore VP sampling is used. In general, stratification can be based on the dependent variable or variables, on explanatory variables, or on both. Dividing the population of interest in terms of explanatory variables is called exogenous stratification. Stratification is endogenous if we define subpopulations with respect to dependent variables. Whether stratification is exogenous or endogenous is determined only after an econometric model is defined. In other words, the model must be specified first, and only then can the sampling scheme be classified. A review of the literature shows that the subject has been studied by both statisticians and econometricians. In summary, stratification based on exogenous variables does not produce serious problems; one can ignore it and still obtain consistent estimates of the population parameters. In this line of research, DuMouchel and Duncan (1983) confirm the above statement in a linear model for SS sampling. Manski and McFadden (1981) show that it is true in maximum likelihood estimation when the data set comes from multinomial sampling. Wooldridge (1999, 2001) shows that the same result holds for VP and SS sampling in the framework of M-estimators. In practice, combinations of these sampling methods are also commonly used. For instance, the Panel Study of Income Dynamics (PSID) involves stratification and clustering. Bhattacharya (2005) describes a multi-stage sampling design in which SS sampling is used in the first level to choose some clusters in each stratum by simple random sampling, and then a few observations are chosen from each sampled cluster, again by simple random sampling. In this scheme clusters are defined as contiguous groups of units existing within a stratum.
For example, in rural areas villages can be considered as clusters, and in urban areas clusters are blocks or neighborhoods; in both examples the unit observations are households. Bhattacharya (2005) derives, in a GMM framework, the asymptotic properties of estimators when the data set comes from surveys whose designs involve stratification and clustering. In a setup similar to Bhattacharya’s multi-stage sampling, Wooldridge (2008) derives the asymptotic variance of estimators in linear models. The goal in this chapter, as mentioned already, is to investigate the asymptotic properties of estimators when the data set comes from multi-stage sampling. The design is closely related to the Bhattacharya (2005) sampling scheme, with one distinction. We add variable probability sampling in the final stage and then develop an M-estimator framework for asymptotic inference with data from surveys that have multiple levels of stratification and a clustering structure. The setup is general enough to contain linear and non-linear models as well as maximum likelihood models. This kind of sampling design is used in many surveys in practice, particularly those that involve phone interviews. An example of a large-scale survey with a structure very similar to the sampling scheme considered in this study is the National Survey of Families and Households (NSFH). The NSFH is a complex survey sample that involved five sampling stages. In the first stage of this national multi-stage sampling design, 100 primary sampling units were drawn from a list of all counties in the nation that had been stratified into two groups. The first stratum consists of 18 self-representing areas composed of the largest metropolitan areas, which make up 36% of the nation’s population, and the second stratum contains the rest of the country. From the first stratum, 36 primary sampling units were drawn with certainty.
The second stratum, which makes up 64% of the nation, was divided into 32 strata, and two primary sampling units were drawn from each stratum using probability proportional to size sampling. In the second stage, an average of 17 block groups or enumeration districts was randomly selected from each primary sampling unit. Within each of these districts, a list of 45 or more households was selected. These households were given a short screening interview to allow oversampling of certain groups of interest, such as African Americans and cohabiting couples. Members of these groups in the cluster were selected with certainty, and others were selected at a lower rate. In the final stage, an adult from each household was randomly chosen as the eligible respondent. In the end, out of the 45 or more households in each district or cluster, 20 were included in the sample. Substitutions were not allowed. In this study, the sample size was 13007 primary respondents. The survey contains 1700 clusters, with an average of 7.6 respondents per cluster, so there are many clusters of small size. For more detail see Johnson and Elliott (1998). The rest of the paper is organized as follows. The next section presents the population optimization problem and the basic framework. The sampling scheme and the sample objective function are explained in section 3. In section 4, consistency and asymptotic normality of the weighted M-estimator under multi-stage sampling are discussed. We introduce theorems that summarize the conditions needed for the weighted estimators to be consistent and asymptotically normal. Section 4 also covers estimation of the asymptotic variances of M-estimators. In section 5, estimation under exogenous stratification is discussed. Under exogenous stratification in our model, where more than one level of stratification exists, three cases are distinguishable. However, we only consider the first two cases.
In section 5, four theorems are presented to cover the consistency and asymptotic distribution of M-estimators under exogenous stratification. In section 6, four examples are presented. In section 7, the two-step M-estimator is discussed. Section 8, the last section, reviews the main findings of the paper.

1.2 The Population Optimization Problem

Our goal is to estimate a $P \times 1$ parameter vector $\theta$ that solves the population problem
\[
\min_{\theta \in \Theta} E\left[q(W, \theta)\right] \tag{1.1}
\]
where $E[\cdot]$ denotes the expectation with respect to the true distribution of $W$, and $\Theta$ is the parameter space, a subset of the Euclidean space $\mathbb{R}^P$. The population objective function is denoted by $q(W, \theta)$, a function of $W$ and $\theta$. $W$ is an $M \times 1$ random vector taking values in $\mathcal{W}$, where $\mathcal{W}$ is a subset of $\mathbb{R}^M$. We assume that there exists a unique solution $\theta^{\circ} \in \Theta$ that minimizes the population problem (1.1). When $q(\cdot)$ is a correctly specified model, $\theta^{\circ}$ is the true parameter that uniquely minimizes (1.1). In misspecified cases, where $q(\cdot)$ is not a correct model, there is no true value of $\theta$. In these cases, it is standard to assume that $\theta^{\circ}$ is the unique solution to (1.1). We are usually interested in explaining a $K \times 1$ random vector $Y$ conditional on an $L \times 1$ vector of explanatory variables $X$, for example through $E(Y \mid X)$. Here $K + L = M$ and $W = (X, Y)$. The random vectors $X$ and $Y$ take values in subsets $\mathcal{X}$ and $\mathcal{Y}$ respectively, where $\mathcal{X} \subset \mathbb{R}^L$, $\mathcal{Y} \subset \mathbb{R}^K$, and $\mathcal{W} = \mathcal{X} \times \mathcal{Y}$. The framework is general enough to cover panel data models with a large cross-section dimension and a small number of time periods $T$.

1.3 The Sampling Scheme

The sample design is a combination of standard stratification, clustering, and variable probability sampling. First, in accordance with SS sampling, the population is divided into $S$ first-stage strata that are non-overlapping and exhaustive.
In this stage, stratification can be based on one or more variables, such as area of residence or race, that allow us to divide the population easily. Each stratum $s$ contains a mass of $C_s$ clusters. For example, in rural areas these clusters are villages, and in urban areas they are blocks or neighborhoods. In the next step, $N_s$ clusters are drawn randomly, with replacement, from each stratum $s$. Since this study relies on large-sample approximations, the with-replacement assumption is not important if the number of sampled clusters, $N_s$, is “large”. Each sampled cluster $c$ from stratum $s$ contains a finite population of $M_{sc}$ households or units of observation. An observation (household) is selected at random from sampled cluster $c$ in stratum $s$. In the next stage, the selected household is classified according to non-overlapping and exhaustive strata of interest based on, for example, the level of income. The household is retained in the sample with a probability that is set by the practitioner. As mentioned already, this second-stage sampling is called variable probability sampling. The process is repeated for $K$ (a constant and small number) unit observations in each sampled cluster $c$ in stratum $s$, and a sample of $K_{sc}$ households is obtained, where $1 \le K_{sc} \le K$. In practice, a fixed and large number of clusters $N_s$ is sampled randomly within each stratum $s$, and then, within each sampled cluster, a small and fixed number of households is sampled randomly. We can summarize the sampling design as follows:

i. The population is divided into $S$ non-overlapping and exhaustive first-stage strata based on criteria such as area of residence, race, or age.
ii. In stratum $s$, $C_s$ clusters exist.
iii. For each stratum $s$, randomly draw $N_s$ clusters with replacement.
iv. Each sampled cluster $c$ from stratum $s$ contains a finite population of $M_{sc}$ units (for example, households).
v. A household is selected at random from sampled cluster $c$ in stratum $s$.
vi. The household is classified according to non-overlapping and exhaustive strata of interest (for example, income level).
vii. The household is retained in the sample with a probability that depends on its stratum of interest and is determined by the researcher.
viii. The process is repeated for $K$ households in each sampled cluster $c$ in stratum $s$, and a sample of $K_{sc}$ households is obtained.

Considering the structure of most surveys in practice, and following Bhattacharya (2005), two assumptions are made to study asymptotic inference in the model. The first assumption is that the number of clusters $N$ goes to infinity while the number of households within each cluster stays fixed and finite. The second assumption is that the clusters are independent within a stratum but household-level variables are correlated within each cluster. Therefore, for a given stratum $s$, clusters are independently but not identically distributed. Under this sampling scheme, clusters are chosen by simple random sampling within each stratum $s$ independently. In the second step, unit observations (households) are chosen by variable probability sampling. Therefore the sample optimization problem is
\[
\min_{\theta \in \Theta} \frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{sc}\, p_j^{-1} r_{jm} z_{jm}\, q(W_{scm}, \theta) \tag{1.2}
\]
Here $N = N_1 + N_2 + \dots + N_S$ and
\[
v_{sc} = \frac{C_s}{N_s / N} \cdot \frac{M_{sc}}{K_{sc}} .
\]
In the sample problem (1.2), $r_{jm}$ is an indicator variable that takes the value one if $W$ is in stratum $j$ and zero otherwise. $z_{jm}$ is also an indicator variable; it takes the value one if $W$ is kept in the sample and zero if not, and therefore $P(z_{jm} = 1) = p_j$. In order to study the asymptotic properties of the M-estimator, we also assume that the ratio of sampled clusters in each stratum $s$ to total sampled clusters, $a_s = N_s / N$, is constant, and therefore $\sum_{s=1}^{S} a_s = 1$. We need this assumption in order to limit the range of fluctuations of the weights $v_{sc}$.
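The sampling steps i to viii and the weighted objective (1.2) can be sketched in a short simulation. Everything numerical here is a hypothetical choice made for illustration: two first-stage strata, normally distributed outcomes with a cluster effect, an income-type cutoff of 2.5 defining the second-stage strata, and the retention probabilities. The objective is taken to be $q(W, \theta) = (y - \theta)^2$, so the minimizer of the weighted sample problem is a weighted mean. As a simplifying assumption of this sketch, the cluster weight divides by the fixed number of draws $K$ per cluster rather than by the retained count $K_{sc}$.

```python
import random

def build_population(rng):
    """Hypothetical population: first-stage strata -> clusters -> unit outcomes y."""
    pop = {}
    for s, (n_clusters, mu) in enumerate([(300, 1.0), (700, 3.0)]):
        clusters = []
        for _ in range(n_clusters):
            effect = rng.gauss(0.0, 0.5)       # shared component: within-cluster correlation
            size = rng.randint(20, 60)         # M_sc units in the cluster
            clusters.append([mu + effect + rng.gauss(0.0, 1.0) for _ in range(size)])
        pop[s] = clusters
    return pop

def draw_sample(pop, n_s, k, keep_prob, cutoff, rng):
    """Steps i-viii: sample N_s clusters per stratum, then VP-sample K draws per cluster."""
    n_total = sum(n_s.values())
    rows = []                                   # (y, weight) for every retained unit
    for s, clusters in pop.items():
        c_s = len(clusters)
        for _ in range(n_s[s]):                 # clusters drawn with replacement
            cluster = rng.choice(clusters)
            # Assumed cluster weight: (C_s / (N_s / N)) * (M_sc / K), with K the number of draws.
            v = (c_s / (n_s[s] / n_total)) * (len(cluster) / k)
            for _ in range(k):
                y = rng.choice(cluster)
                j = 1 if y > cutoff else 0      # second-stage stratum (r_jm = 1)
                if rng.random() < keep_prob[j]: # z_jm = 1: unit is retained
                    rows.append((y, v / keep_prob[j]))
    return rows

def weighted_m_estimate(rows):
    """For q = (y - theta)^2 the weighted objective is minimized by the weighted mean."""
    return sum(w * y for y, w in rows) / sum(w for _, w in rows)

rng = random.Random(1)
pop = build_population(rng)
rows = draw_sample(pop, {0: 400, 1: 400}, 8, {0: 0.3, 1: 1.0}, 2.5, rng)

theta_w = weighted_m_estimate(rows)
theta_unw = sum(y for y, _ in rows) / len(rows)
units = [y for clusters in pop.values() for c in clusters for y in c]
pop_mean = sum(units) / len(units)

print("population mean:     ", round(pop_mean, 3))
print("weighted estimate:   ", round(theta_w, 3))
print("unweighted estimate: ", round(theta_unw, 3))
```

Because the second-stage strata in this sketch are defined by the outcome itself, the stratification is endogenous and the unweighted mean is pulled away from the population mean, while the weighted estimate stays close to it.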
If we re-index clusters as $i = 1, \dots, N$ and define a new indicator variable $y_{is}$ such that $y_{is}$ equals one if cluster $i$ is from stratum $s$ ($i \in s$) and zero otherwise, then the optimization problem is
\[
\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{j=1}^{J} \sum_{m=1}^{K} y_{is} v_{is}\, p_j^{-1} r_{jm} z_{jm}\, q(W_{ism}, \theta) \tag{1.3}
\]
The weighted M-estimator $\hat{\theta}_w$ minimizes (1.2) over the parameter space $\Theta$. $v_{sc}$ and $p_j^{-1}$ are the weights corresponding to the first level of stratification (SS sampling) and the second level of stratification (VP sampling), respectively. The inner summation in (1.2) is over all potential observations that would appear in a random sample. The sample objective function weights each sampled observation unit (a household, for example) by the product of the two weights corresponding to the two levels of stratification, i.e. $v_{sc} \cdot p_j^{-1}$. Note that all sampled observations from the same sampled cluster get the same weight $v_{sc}$.

1.4 Estimation under Multi-stage Sampling

1.4.1 Consistency

In order to study the consistency of the weighted M-estimator defined by equation (1.2), it is assumed that the parameter vector $\theta^{\circ}$ uniquely solves the population problem (1.1),
\[
\min_{\theta \in \Theta} E\left[q(W, \theta)\right].
\]
Moreover, we need to show that uniform convergence in probability holds. It is assumed that the function $q(\cdot)$ satisfies some regularity conditions. We summarize these requirements for consistency of the weighted M-estimator in the following theorem.

Theorem 1.4.1. Let $W \in \mathcal{W}$ be a random vector, where $\mathcal{W} \subset \mathbb{R}^M$ and $\Theta \subset \mathbb{R}^P$, and let $q : \mathcal{W} \times \Theta \to \mathbb{R}$ be a real-valued function. If
1. $\Theta$ is a compact set.
2. $v_{sc} \cdot p_j^{-1} > 0$ for all clusters and strata $s = 1, \dots, S$, $j = 1, \dots, J$, $c = 1, \dots, N$.
3. For each $\theta \in \Theta$, $q(W, \theta)$ is Borel measurable on $\mathcal{W}$.
4. For each $w \in \mathcal{W}$, $q(w, \theta)$ is continuous on $\Theta$.
5. $|q(w, \theta)| \le b(w)$, where $b$ is an arbitrary nonnegative function on $\mathcal{W}$ such that $E[b(w)] < \infty$.
6. $\theta^{\circ}$ uniquely solves the population problem.
Then the uniform weak law of large numbers holds, and $\hat{\theta}_w \xrightarrow{P} \theta^{\circ}$ as $N \to \infty$.

Proof.
For each cluster $i$ in stratum $s$ define
\[
g(W_m, \theta) = \sum_{j=1}^{J} \sum_{m=1}^{K} v\, p_j^{-1} r_{jm} z_{jm}\, q(W_m, \theta) \tag{1.4}
\]
In this function the weight $v$ is a random variable, since the number of observations retained in the final stage, $K_{is}$, is random. In fact $v$ is a function of $z_{jm}$, the indicator variable that shows whether the randomly drawn observation from the final stage is kept in the sample or discarded. Therefore we can treat $v \cdot z_{jm}$ as a two-valued variable,
\[
v \cdot z_{jm} =
\begin{cases}
v & \text{if } z_{jm} = 1, \\
0 & \text{otherwise,}
\end{cases}
\]
with probability distribution function
\[
f(v \cdot z_{jm}) =
\begin{cases}
p_j & \text{if } v \cdot z_{jm} = v, \\
1 - p_j & \text{if } v \cdot z_{jm} = 0.
\end{cases}
\]
Therefore the expected value of (1.4) is
\[
E\left[g(W_m, \theta)\right] = \sum_{j=1}^{J} \sum_{m=1}^{K} p_j^{-1} E\left[v\, r_{jm} z_{jm}\, q(W_m, \theta)\right] \tag{1.5}
\]
Since $v \cdot z_{jm}$ is independent of $r_{jm}$, the right-hand side of (1.5) is equal to
\[
\sum_{j=1}^{J} \sum_{m=1}^{K} p_j^{-1} E(v z_{jm})\, E\left[r_{jm}\, q(W_m, \theta)\right]
= \sum_{j=1}^{J} \sum_{m=1}^{K_{sc}} p_j^{-1} p_j\, v\, E\left[r_{jm}\, q(W_m, \theta)\right],
\]
which can be simplified to
\[
E\left[\sum_{m=1}^{K_{sc}} \sum_{j=1}^{J} v\, r_{jm}\, q(W_m, \theta)\right]
= E\left[\sum_{m=1}^{K_{sc}} v\, q(W_m, \theta)\right].
\]
The last equality holds because $\sum_{j=1}^{J} r_{jm} = 1$. Therefore the expected value of (1.4) is equal to
\[
M_{is} \cdot E\left[q(W, \theta)\right] \tag{1.6}
\]
$M_{is}$ is the population number of observations in cluster $i$ of stratum $s$; hence it is constant and does not affect estimation and inference. By assumption (6) of Theorem 1.4.1, $\theta^{\circ}$ solves the population problem (1.1) uniquely, and so it is also the unique minimizer of (1.6). Next we need to show that (1.4) satisfies the uniform law of large numbers for each stratum $s$. By assumption (4) of Theorem 1.4.1, $q(\cdot)$ is a continuous function on $\Theta$ for each $w \in \mathcal{W}$, and therefore $g(\cdot)$ defined by (1.4) has the same property. $g(\cdot)$ is also bounded, because
\[
|g(W, \theta)| = \Big| \sum_{j=1}^{J} \sum_{m=1}^{K} p_j^{-1} r_{jm} z_{jm}\, q(W, \theta) \Big| \le C \cdot |q(W, \theta)| \le C \cdot b(W)
\]
by assumption (5), where $C = \max(p_1^{-1}, p_2^{-1}, \dots, p_J^{-1})$. This completes the proof.
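The step in the proof where $p_j^{-1}$ cancels against $E(z_{jm}) = p_j$ can be checked numerically. The following is a minimal sketch rather than part of the proof: the standard normal $W$, the two strata split at zero, the retention probabilities, and the evaluation point $\theta = 0.5$ are all hypothetical choices. It compares the average of $q$ over all draws with the inverse-probability-weighted average over the retained draws only.

```python
import random

def ipw_check(n_draws, keep_prob, rng):
    """Average of q over all draws versus the p_j^{-1}-weighted average over retained draws."""
    full, weighted = 0.0, 0.0
    for _ in range(n_draws):
        w = rng.gauss(0.0, 1.0)
        q = (w - 0.5) ** 2                 # hypothetical q(W, theta) evaluated at theta = 0.5
        j = 1 if w > 0 else 0              # stratum of the draw (r_jm = 1 for this j)
        z = rng.random() < keep_prob[j]    # retention indicator with P(z = 1) = p_j
        full += q
        if z:
            weighted += q / keep_prob[j]   # discarded draws contribute nothing
    return full / n_draws, weighted / n_draws

full, weighted = ipw_check(200000, {0: 0.25, 1: 0.8}, random.Random(2))
print("mean of q over all draws:      ", round(full, 3))
print("weighted mean over kept draws: ", round(weighted, 3))
```

The two averages agree up to simulation noise, which is the sample counterpart of moving from (1.5) to the unweighted expectation.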
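1.4.2 Asymptotic Normality of the Weighted M-Estimator

In order to show that the weighted M-estimator is asymptotically normally distributed, the conditions for consistency in Theorem 1.4.1 are not enough and additional assumptions are needed. Theorem 1.4.2 lists the additional assumptions that imply asymptotic normality of the weighted M-estimator.

Theorem 1.4.2. In addition to the conditions of Theorem 1.4.1, if
7. $\theta^{\circ}$ is in the interior of $\Theta$, that is, $\theta^{\circ} \in \operatorname{int}(\Theta)$.
8. The score of the objective function, $s(W, \theta)$, is continuously differentiable on $\operatorname{int}(\Theta)$.
9. Each element of the Hessian matrix $H(W, \theta)$ is bounded in absolute value by a function $b(W)$, where $E[b(W)] < \infty$.
10. $A_w = E\left[\nabla^2_{\theta} q(W, \theta)\right]$ is nonsingular.
11. $E[s(W, \theta^{\circ})] = 0$ and each element of $s(W, \theta)$ has finite second moment.
Then
\[
\sqrt{N}\left(\hat{\theta}_w - \theta^{\circ}\right) \xrightarrow{d} \text{Normal}\left(0,\; A_w^{-1} B_w A_w^{-1}\right) \tag{1.7}
\]
Here $B_w$ is
\[
\begin{aligned}
B_w ={}& \sum_{s=1}^{S} E\Big[\sum_{j=1}^{J} \sum_{m=1}^{K} v^2 p_j^{-2} r_{jm} z_{jm}\, \nabla_{\theta} q(W, \theta^{\circ}) \nabla_{\theta} q(W, \theta^{\circ})'\Big] \\
&+ \sum_{s=1}^{S} E\Big[\sum_{j=1}^{J} \sum_{j'=1}^{J} \sum_{m=1}^{K} \sum_{t \neq m}^{K} v^2 p_j^{-1} p_{j'}^{-1} r_{jm} r_{j't} z_{jm} z_{j't}\, \nabla_{\theta} q(W, \theta^{\circ}) \nabla_{\theta} q(W, \theta^{\circ})'\Big] \\
&- \sum_{s=1}^{S} E\Big[\sum_{j=1}^{J} \sum_{m=1}^{K} v\, p_j^{-1} r_{jm} z_{jm}\, \nabla_{\theta} q(W, \theta^{\circ})\Big] \cdot E\Big[\sum_{j=1}^{J} \sum_{m=1}^{K} v\, p_j^{-1} r_{jm} z_{jm}\, \nabla_{\theta} q(W, \theta^{\circ})\Big]'
\end{aligned}
\tag{1.8}
\]

Proof. The score of the objective function for cluster $c$ in stratum $s$ is
\[
s_{cs}(\theta) = \nabla_{\theta} g_{cs}(W_m, \theta) = \sum_{j=1}^{J} \sum_{m=1}^{K} v\, p_j^{-1} r_{jm} z_{jm}\, \nabla_{\theta} q(W_{mcs}, \theta) \tag{1.9}
\]
Because the clusters form an independent sequence in each stratum $s$ by assumption, we can apply the central limit theorem to the sampled clusters within each stratum. Therefore
\[
N_s^{-1/2} \sum_{c=1}^{N_s} \left[s_{cs}(\theta^{\circ}) - E\left(s_s(\theta^{\circ})\right)\right] \xrightarrow{d} \text{Normal}(0, B_s) \tag{1.10}
\]
In (1.10), $E[s_s(\theta^{\circ})] = E\left[\nabla_{\theta} g(W, \theta^{\circ}) \mid W \in \mathcal{W}_s\right]$.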
$B_s$ is the variance of the score function in stratum $s$ and is equal to

\[ \begin{aligned} B_s &= \operatorname{var}\left[s_{cs}(\theta^\circ)\right] = \operatorname{var}\left[\nabla_\theta g_{cs}(W_m,\theta^\circ)\right] = \operatorname{var}\left[\sum_{j=1}^{J}\sum_{m=1}^{K} v\,p_j^{-1} r_{jm} z_{jm}\nabla_\theta q(W_{mcs},\theta^\circ)\right] \\ &= E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} v^2 p_j^{-2}\, r_{jm} z_{jm}\, \nabla_\theta q(W_{mcs},\theta^\circ)\nabla_\theta q(W_{mcs},\theta^\circ)^{\prime}\right] \\ &\quad + E\left[\sum_{j=1}^{J}\sum_{j'=1}^{J}\sum_{m=1}^{K}\sum_{t\neq m}^{K} v^2 p_j^{-1}p_{j'}^{-1}\, r_{jm}r_{j't}\, z_{jm}z_{j't}\, \nabla_\theta q(W_{mcs},\theta^\circ)\nabla_\theta q(W_{tcs},\theta^\circ)^{\prime}\right] \\ &\quad - E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} v\,p_j^{-1} r_{jm} z_{jm}\nabla_\theta q(W_{mcs},\theta^\circ)\right]\cdot E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} v\,p_j^{-1} r_{jm} z_{jm}\nabla_\theta q(W_{mcs},\theta^\circ)\right]^{\prime}. \end{aligned} \tag{1.11} \]

The variance of the score function consists of three terms. The first term in (1.11) is simply the variance of the score when a simple random sample is in hand; in other words, it is the correct variance if i.i.d. observations are available. The second and third terms in (1.11) arise from the sample design. The second term measures the cluster effect and accounts for correlation within clusters. This term is positive in most cases, and it is substantial if the correlation between observations inside a single cluster is high and/or $K$, the number of observations sampled from each cluster, is large. The third term captures the stratum effect. It is negative and therefore reduces the variance.

We also obtain the following important equality by using (1.6) in Theorem 1.4.1:

\[ \sum_{s=1}^{S} E\left[\nabla_\theta g_{cs}(W_m,\theta^\circ)\right] = \sum_{s=1}^{S} E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} v_{cs}\, p_j^{-1} r_{jm} z_{jm}\nabla_\theta q(W_{csm},\theta^\circ)\right] \equiv 0. \tag{1.12} \]

Using (1.12), the score of the objective function multiplied by $\sqrt{N}$ can be written as

\[ N^{-1/2}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\nabla_\theta g_{cs}(W_m,\theta^\circ) = N^{-1/2}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\left\{\nabla_\theta g_{cs}(W_m,\theta^\circ)-E\left[\nabla_\theta g_{cs}(W_m,\theta^\circ)\right]\right\}. \tag{1.13} \]

Because the sampled clusters are also independent across strata by assumption, (1.13) is asymptotically normal with mean zero and variance $B_w$. A standard mean value expansion of the first-order condition around $\theta^\circ$ then gives the asymptotic variance $A_w^{-1}B_wA_w^{-1}$ in (1.7).

1.4.3 Estimating the Asymptotic Variance

Obtaining a consistent estimator of the asymptotic variance of $\sqrt{N}(\hat\theta_w-\theta^\circ)$ is fairly straightforward. First, we need a consistent estimator of the Hessian matrix $A_w$.
It is the expected second-order partial derivative of (1.4), summed over all strata:

\[ A_w = \sum_{s=1}^{S} E\left[\nabla^2_\theta g_{cs}(W_m,\theta^\circ)\right] = \sum_{s=1}^{S} E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} v\,p_j^{-1} r_{jm} z_{jm}\nabla^2_\theta q(W,\theta^\circ)\right]. \tag{1.14} \]

By Lemma 4.3 in Newey and McFadden (1994) and under the assumptions of Theorem 1.4.2, a consistent estimator of $A_w$ is

\[ \hat A_w = N^{-1}\sum_{i=1}^{N}\sum_{s=1}^{S}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{is}\,p_j^{-1}\, y_{is}\, r_{jm} z_{jm}\, \nabla^2_\theta q(w_{ism},\hat\theta_w). \]

As in Wooldridge (2010), we assume that the elements of $\nabla_\theta q(W,\theta)\nabla_\theta q(W,\theta)^{\prime}$ are bounded in absolute value by a function with finite expectation, in order to have a consistent estimator of $B_w$. Then a consistent estimator of $B_w$ is

\[ \begin{aligned} \hat B_w ={}& N^{-1}\sum_{i=1}^{N}\sum_{s=1}^{S}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{is}^2\,p_j^{-2}\, y_{is}\, r_{jm} z_{jm}\, \nabla_\theta q(w_{ism},\hat\theta_w)\nabla_\theta q(w_{ism},\hat\theta_w)^{\prime} \\ &+ N^{-1}\sum_{i=1}^{N}\sum_{s=1}^{S}\sum_{j=1}^{J}\sum_{j'=1}^{J}\sum_{m=1}^{K}\sum_{t\neq m}^{K} v_{is}^2\,p_j^{-1}p_{j'}^{-1}\, y_{is}\, r_{jm}r_{j't}\, z_{jm}z_{j't}\, \nabla_\theta q(w_{ism},\hat\theta_w)\nabla_\theta q(w_{ist},\hat\theta_w)^{\prime} \\ &- \sum_{s=1}^{S}\frac{1}{N}\left[\sum_{i=1}^{N}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{is}\,p_j^{-1} y_{is} r_{jm} z_{jm}\nabla_\theta q(w_{ism},\hat\theta_w)\right]\cdot\left[\sum_{i=1}^{N}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{is}\,p_j^{-1} y_{is} r_{jm} z_{jm}\nabla_\theta q(w_{ism},\hat\theta_w)\right]^{\prime}. \end{aligned} \]

Therefore the estimate of the asymptotic variance of $\hat\theta_w$ is

\[ \widehat{\operatorname{Avar}}(\hat\theta_w) = \hat A_w^{-1}\hat B_w\hat A_w^{-1}/N. \tag{1.15} \]

The diagonal elements of (1.15) are the estimated asymptotic variances of the parameter estimates.

1.5 Estimation under Exogenous Stratification

Partition $w$ as $(x,y)$. Dividing the population of interest purely on the basis of $x$, in a model that is built to explain the distribution of $Y$ given $x$ (for example $E[Y|X=x]$), is called exogenous stratification. In multi-stage sampling, exogenous stratification can be applied at any stage of sampling. In the sampling scheme described in section 2 there are two levels of stratification: standard stratified sampling is used at the first level and variable probability sampling at the second level. Stratification at each level can be endogenous or exogenous, and therefore three possibilities can be distinguished when at least one level of stratification is exogenous. In case one, both levels of stratification are exogenous.
In case two, stratification is exogenous at the first level but endogenous at the second level. In case three, stratification is endogenous at the first level and exogenous at the second level. Since case three is very unlikely to be used in practice, we limit our study to cases one and two.

1.5.1 Consistency of the Unweighted M-Estimator

Assume $W$ is partitioned as $(X,Y)$. Under exogenous stratification the population problem is

\[ \min_{\theta\in\Theta} E[q(W,\theta)\,|\,X]. \tag{1.16} \]

Our analysis of the weighted estimator in the previous section applies with or without exogenous stratification. However, weighting the observations is no longer necessary in the exogenous case, and an unweighted estimator is also consistent.

1.5.1.1 Consistency of the Unweighted M-Estimator: Case One

As mentioned above, in case one both levels of stratification are exogenous. The unweighted estimator solves the sample problem

\[ \min_{\theta\in\Theta}\frac{1}{N}\sum_{i=1}^{N}\sum_{s=1}^{S}\sum_{j=1}^{J}\sum_{m=1}^{K} y_{is}\, r_{jm} z_{jm}\, q(w_{ism},\theta). \tag{1.17} \]

Objective function (1.17) is the same as (1.3) without the weights $v_{is}\cdot p_j^{-1}$. The following theorem states conditions for consistency of the unweighted estimator.

Theorem 1.5.1. Assume that the first five conditions of Theorem 1.4.1 hold, and add the following two conditions:

6. Stratification at the first and second levels is based on the exogenous variables $x$; that is, stratification is a deterministic function of $x$ at both levels.
7. For all $x$, $\theta^\circ$ solves $\min_{\theta\in\Theta}E[q(W,\theta)|X]$, and $\theta^\circ$ uniquely minimizes

\[ \sum_{j=1}^{J}\sum_{m=1}^{K} p_j\, E\left[r_{jm}\, q(W_m,\theta)\right] = \sum_{j=1}^{J}\sum_{m=1}^{K} p_j\, E\left\{r_{jm}\, E\left[q(W_m,\theta)\,|\,X\right]\right\}. \tag{1.18} \]

Then the uniform weak law of large numbers holds and $\hat\theta_u\to\theta^\circ$ in probability as $N\to\infty$.

Proof. We need to show that $\theta^\circ$ is the unique minimizer of

\[ E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} r_{jm} z_{jm}\, q(W_m,\theta)\right]. \tag{1.19} \]

By assumption (6) of Theorem 1.5.1, $r_{jm}$ is a function of $x$. Also, $z_{jm}$ is independent of $w$ and consequently of $x$.
Therefore

\[ E\left[r_{jm} z_{jm}\, q(W_m,\theta)\,|\,X\right] = r_{jm}\, E(z_{jm}|X)\, E\left[q(W_m,\theta)\,|\,X\right] = r_{jm}\, p_j\, E\left[q(W_m,\theta)\,|\,X\right]. \tag{1.20} \]

By assumption (7) of Theorem 1.5.1, $r_{jm}\,p_j\,E[q(W_m,\theta)|X]$ is minimized at $\theta^\circ$, but perhaps not uniquely. By iterated expectations, $E[q(W_m,\theta)] = E\{E[q(W_m,\theta)|X]\}$, and therefore $\theta^\circ$ is a solution to (1.19). The expectation in (1.19) is then the same as (1.18), and by assumption $\theta^\circ$ is the unique solution to (1.18).

1.5.1.2 Consistency of the Unweighted M-Estimator: Case Two

When the first level of stratification is exogenous while the second level is endogenous, a natural conjecture is that we can drop the weight associated with the first level of stratification, $v_{is}$, but need to keep the weight associated with the second level, $p_j^{-1}$, in order to have a consistent estimator. The next theorem confirms this conjecture under specific conditions.

Theorem 1.5.2. Assume that the first five conditions of Theorem 1.4.1 hold, and add the following two conditions:

6. Stratification at the first level is a deterministic function of $x$.
7. $\theta^\circ$ is the unique solution to $E[q(W,\theta)\,|\,x\in\mathcal{X}_s]$ for all $s$.

Then the uniform law of large numbers holds and $\hat\theta_u\xrightarrow{p}\theta^\circ$ as $N\to\infty$.

Proof. The expected value of the objective function for cluster $i$ in stratum $s$ is

\[ \begin{aligned} \sum_{j=1}^{J}\sum_{m=1}^{K} E\left[p_j^{-1} r_{jm} z_{jm}\, q(W_m,\theta)\,|\,x\in\mathcal{X}_s\right] &= \sum_{j=1}^{J}\sum_{m=1}^{K} p_j^{-1}\, E\left[r_{jm} z_{jm}\, q(W_m,\theta)\,|\,x\in\mathcal{X}_s\right] \\ &= \sum_{j=1}^{J}\sum_{m=1}^{K} p_j^{-1}\, E\left[z_{jm}\,|\,x\in\mathcal{X}_s\right]\cdot E\left[r_{jm}\, q(W_m,\theta)\,|\,x\in\mathcal{X}_s\right] \\ &= \sum_{j=1}^{J}\sum_{m=1}^{K} E\left[r_{jm}\, q(W_m,\theta)\,|\,x\in\mathcal{X}_s\right] \\ &= E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} r_{jm}\, q(W_m,\theta)\,\Big|\,x\in\mathcal{X}_s\right] \\ &= E\left[\sum_{m=1}^{K_{sc}} q(W_m,\theta)\,\Big|\,x\in\mathcal{X}_s\right] = \sum_{m=1}^{K_{sc}} E\left[q(W_m,\theta)\,|\,x\in\mathcal{X}_s\right]. \end{aligned} \tag{1.21} \]

By assumption (7) of Theorem 1.5.2, $\theta^\circ$ is the unique solution for $E[q(W,\theta)\,|\,x\in\mathcal{X}_s]$ and so is the unique solution for the last expression in (1.21). We also need to show that the uniform law of large numbers holds for each $s$; the argument is the same as in Theorem 1.4.1.
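The claim that weights can be dropped under exogenous stratification can be checked in a small simulation. The sketch below again assumes a simplified single-stage variable probability design without clusters; the linear model, the strata (defined by the sign of $x$) and the retention probabilities are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws = 200_000
x = rng.normal(size=n_draws)
y = 1.0 + 2.0 * x + rng.normal(size=n_draws)

# Exogenous stratification: the stratum is a deterministic function of x only.
stratum = (x > 0.0).astype(int)
keep_prob = np.array([0.2, 1.0])[stratum]     # assumed retention probabilities p_j
kept = rng.random(n_draws) < keep_prob

X = np.column_stack([np.ones(kept.sum()), x[kept]])

# Unweighted least squares on the retained sample.
beta_unweighted, *_ = np.linalg.lstsq(X, y[kept], rcond=None)

# Weighted least squares with weights p_j^{-1}, via the square-root-weight transform.
sw = np.sqrt(1.0 / keep_prob[kept])
beta_weighted, *_ = np.linalg.lstsq(X * sw[:, None], y[kept] * sw, rcond=None)

print("unweighted:", beta_unweighted)
print("weighted  :", beta_weighted)
```

In this assumed design both estimators should be close to the true coefficients (1, 2): selection depends on $x$ alone, so the conditional mean of $y$ given $x$ is unchanged in the retained sample.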
1.5.2 Asymptotic Normality of the Unweighted M-Estimator

Following the previous section, asymptotic normality results for the unweighted estimator, when stratification is based on $x$ at both levels or only at the first level, are stated in the following two theorems.

Theorem 1.5.3. In addition to the conditions of Theorem 1.5.1, assume that

8. $\theta^\circ$ is in the interior of $\Theta$, that is, $\theta^\circ\in\operatorname{int}(\Theta)$.
9. For all $w\in\mathcal{W}$, $\nabla_\theta q(w,\cdot)$, the score of the objective function, is continuously differentiable on $\operatorname{int}(\Theta)$.
10. Each element of the Hessian matrix $H(W,\theta)$ is bounded in absolute value by a function $b(w)$, where $E[b(w)]<\infty$.
11. For all $x$, $E[\nabla_\theta q(W,\theta^\circ)\,|\,X=x]=0$, and all elements of $\nabla_\theta q(W,\theta)$ have finite second moments.
12. $A_u = \sum_{s=1}^{S} E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} r_{jm} z_{jm}\nabla^2_\theta q(W_m,\theta^\circ)\,|\,X=x\right]$ is nonsingular.

Then

\[ \sqrt{N}(\hat\theta_u-\theta^\circ) \xrightarrow{d} \text{Normal}\left(0,\; A_u^{-1}B_uA_u^{-1}\right), \tag{1.22} \]

where

\[ \begin{aligned} B_u ={}& \sum_{s=1}^{S} E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} p_j\, r_{jm}\, \nabla_\theta q(W_m,\theta^\circ)\nabla_\theta q(W_m,\theta^\circ)^{\prime}\,\Big|\,X=x\right] \\ &+ \sum_{s=1}^{S} E\left[\sum_{j=1}^{J}\sum_{j'=1}^{J}\sum_{m=1}^{K}\sum_{t\neq m}^{K} p_j p_{j'}\, r_{jm}r_{j't}\, \nabla_\theta q(W_m,\theta^\circ)\nabla_\theta q(W_t,\theta^\circ)^{\prime}\,\Big|\,X=x\right] \end{aligned} \tag{1.23} \]

for all $x$.

Proof. In this case stratification at both levels is exogenous, and the score of the objective function in each stratum $s$ is $s(W_m,\theta)=\nabla_\theta g(W_m,\theta)=\sum_{j=1}^{J}\sum_{m=1}^{K} r_{jm} z_{jm}\nabla_\theta q(W,\theta)$. Under assumption (11) of Theorem 1.5.3, the expected value of the score is

\[ E\left[s(W_m,\theta^\circ)\,|\,x\right] = E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} r_{jm} z_{jm}\nabla_\theta q(W_m,\theta^\circ)\,\Big|\,X=x\right] = 0. \tag{1.24} \]

Then, applying the central limit theorem to the independent clusters within each stratum, the asymptotic distribution of the score in stratum $s$ is

\[ N_s^{-1/2}\sum_{i=1}^{N_s}\left[s_{is}(W_m,\theta^\circ)\,|\,X=x\right] \xrightarrow{d} \text{Normal}\left(0,\; B_s^{u}\right). \tag{1.25} \]

$B_s^{u}$ represents the variance of the score function in stratum $s$ under exogenous stratification.
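It is equal to

\[ \begin{aligned} B_s^{u} &= \operatorname{var}\left[s(W_m,\theta^\circ)\,|\,X=x\right] = \operatorname{var}\left[\nabla_\theta g(W_m,\theta^\circ)\,|\,X=x\right] = \operatorname{var}\left[\sum_{j=1}^{J}\sum_{m=1}^{K} r_{jm} z_{jm}\nabla_\theta q(W,\theta^\circ)\,\Big|\,X=x\right] \\ &= E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} r_{jm} z_{jm}\nabla_\theta q(W_m,\theta^\circ)\nabla_\theta q(W_m,\theta^\circ)^{\prime}\,\Big|\,X=x\right] \\ &\quad + E\left[\sum_{j=1}^{J}\sum_{j'=1}^{J}\sum_{m=1}^{K}\sum_{t\neq m}^{K} r_{jm}r_{j't}\, z_{jm}z_{j't}\, \nabla_\theta q(W_m,\theta^\circ)\nabla_\theta q(W_t,\theta^\circ)^{\prime}\,\Big|\,X=x\right] \end{aligned} \tag{1.26} \]

for all $x$. Independence of the $z$'s from $W$ and from each other, together with independence of clusters across strata, leads to $\sum_{s=1}^{S}B_s^{u}$, which is the variance of the score of the objective function, $B_u$ in (1.23). This completes the proof.

It is interesting to note that, under assumption (11) of Theorem 1.5.3, the effect of stratification vanishes under exogenous stratification, as a comparison of (1.23) and (1.8) shows.

The asymptotic results when stratification is exogenous only at the first level are very similar to case one. The next theorem summarizes the main conditions and results.

Theorem 1.5.4. Assume the same conditions as in Theorem 1.5.3, with the last two replaced by the following:

11. $E[s(W,\theta^\circ)\,|\,x\in\mathcal{X}_s]=0$; in other words, the expected score of the objective function under exogenous stratification at the first stage is zero. We also assume that the elements of $s(W,\theta)$ have finite second moments.
12. $A_{\bar u}=\sum_{s=1}^{S} E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} p_j^{-1} r_{jm} z_{jm}\nabla^2_\theta q(W,\theta^\circ)\,|\,x\in\mathcal{X}_s\right]$ is nonsingular.

Then

\[ \sqrt{N}(\hat\theta_{\bar u}-\theta^\circ) \xrightarrow{d} \text{Normal}\left(0,\; A_{\bar u}^{-1}B_{\bar u}A_{\bar u}^{-1}\right), \]

where

\[ \begin{aligned} B_{\bar u} ={}& \sum_{s=1}^{S} E\left[\sum_{j=1}^{J}\sum_{m=1}^{K} p_j^{-2}\, r_{jm} z_{jm}\, \nabla_\theta q(W_m,\theta^\circ)\nabla_\theta q(W_m,\theta^\circ)^{\prime}\,\Big|\,X=x\right] \\ &+ \sum_{s=1}^{S} E\left[\sum_{j=1}^{J}\sum_{j'=1}^{J}\sum_{m=1}^{K}\sum_{t\neq m}^{K} p_j^{-1}p_{j'}^{-1}\, r_{jm}r_{j't}\, z_{jm}z_{j't}\, \nabla_\theta q(W_m,\theta^\circ)\nabla_\theta q(W_t,\theta^\circ)^{\prime}\,\Big|\,X=x\right] \end{aligned} \]

for all $x$.

Proof. The proof is similar to that of Theorem 1.5.3. We only need to weight the observations by $p_j^{-1}$, which corresponds to VP sampling at the second stage. As in the previous case, the stratification effect due to SS sampling at the first stage is zero.

1.6 Examples

This section contains some examples that illustrate the theoretical results. It also covers some special cases.

Example 1.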
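As the first example, consider a simple linear model

\[ y = x\beta + u. \tag{1.27} \]

Here $x$ is a $1\times K$ vector of exogenous variables and $\beta$ is the $K\times 1$ vector of parameters of interest. Assuming that the exogenous $x$'s and the error term $u$ are uncorrelated, $E(x^{\prime}u)=0$, the weighted estimator provides consistent estimates of $\beta$. The sample optimization problem is

\[ \min_{\beta\in\Theta}\frac{1}{N}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{sc}\,p_j^{-1}\, r_{jm} z_{jm}\,(y_{scm}-x_{scm}\beta)^2. \]

The first-order condition is

\[ \frac{1}{N}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{sc}\,p_j^{-1}\, r_{jm} z_{jm}\, x_{scm}^{\prime}(y_{scm}-x_{scm}\hat\beta) = 0 \tag{1.28} \]

or

\[ \frac{1}{N}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{sc}\,p_j^{-1}\, r_{jm} z_{jm}\, x_{scm}^{\prime}\hat u_{scm} = 0, \]

where $\hat u_{scm}=y_{scm}-x_{scm}\hat\beta$. In this linear model and under the multi-stage sampling scheme, consistent estimators of the asymptotic variances of the elements of $\hat\beta$ are obtained by applying Theorem 1.4.2, where the consistent estimators of $A_w$ and $B_w$ are

\[ \hat A_w = \frac{1}{N}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{sc}\,p_j^{-1}\, r_{jm} z_{jm}\, x_{scm}^{\prime}x_{scm} \]

and

\[ \begin{aligned} \hat B_w ={}& \frac{1}{N}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{sc}^2\,p_j^{-2}\, r_{jm} z_{jm}\, \hat u_{scm}^2\, x_{scm}^{\prime}x_{scm} \\ &+ \frac{1}{N}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{j'=1}^{J}\sum_{m=1}^{K}\sum_{t\neq m}^{K} v_{sc}^2\,p_j^{-1}p_{j'}^{-1}\, r_{jm}r_{j't}\, z_{jm}z_{j't}\, \hat u_{scm}\hat u_{sct}\, x_{scm}^{\prime}x_{sct} \\ &- \sum_{s=1}^{S}\frac{1}{N}\left[\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{sc}\,p_j^{-1} r_{jm} z_{jm}\hat u_{scm}x_{scm}^{\prime}\right]\cdot\left[\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{sc}\,p_j^{-1} r_{jm} z_{jm}\hat u_{scm}x_{scm}^{\prime}\right]^{\prime}. \end{aligned} \]

Example 2. As the second example, consider binary response models such as logit or probit, of the form

\[ P(y=1\,|\,x) = G(x\beta)\equiv p(x), \]

where $x$ is $1\times K$, $\beta$ is $K\times 1$, and we take the first element of $x$ to be unity. We also assume $0<G(x\beta)<1$ for all $x$ and $\beta$. The log-likelihood for observation $i$ is

\[ l_i(\beta) = y_i\log\left[G(x_i\beta)\right] + (1-y_i)\log\left[1-G(x_i\beta)\right]. \]

The weighted estimator in this case is simply the weighted maximum likelihood estimator, which gives an observation in cluster $c$ of stratum $s$ the corresponding weight $v_{sc}\cdot p_j^{-1}$.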
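In this example, the consistent estimators of $A_w$ and $B_w$ according to Theorem 1.4.2 are

\[ \hat A_w = \frac{1}{N}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{sc}\,p_j^{-1}\, r_{jm} z_{jm}\, g_{scm}^2\, x_{scm}^{\prime}x_{scm}/\hat\xi_{scm} \tag{1.29} \]

and

\[ \begin{aligned} \hat B_w ={}& \frac{1}{N}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{sc}^2\,p_j^{-2}\, r_{jm} z_{jm}\, g_{scm}\, x_{scm}^{\prime}x_{scm} \\ &+ \frac{1}{N}\sum_{s=1}^{S}\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{j'=1}^{J}\sum_{m=1}^{K}\sum_{t\neq m}^{K} v_{sc}^2\,p_j^{-1}p_{j'}^{-1}\, r_{jm}r_{j't}\, z_{jm}z_{j't}\, g_{scm}g_{sct}\, x_{scm}^{\prime}x_{sct} \\ &- \sum_{s=1}^{S}\frac{1}{N}\left[\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{sc}\,p_j^{-1} r_{jm} z_{jm}\, g_{scm}x_{scm}^{\prime}\right]\cdot\left[\sum_{c=1}^{N_s}\sum_{j=1}^{J}\sum_{m=1}^{K} v_{sc}\,p_j^{-1} r_{jm} z_{jm}\, g_{scm}x_{scm}^{\prime}\right]^{\prime}. \end{aligned} \]

Here $g(z)=\dfrac{dG(z)}{dz}$ and $\hat\xi_{scm}=\hat G_{scm}(1-\hat G_{scm})$.

Example 3. Example 3 is the special case in which $p_j$ is set equal to 1. In other words, we eliminate the last level of stratification, the VP sampling stage. In this case the estimators of Section 1.4.3 become

\[ \hat A_w = N^{-1}\sum_{i=1}^{N}\sum_{s=1}^{S}\sum_{m=1}^{K} v_{is}\, y_{is}\, \nabla^2_\theta q(w_{ism},\hat\theta), \]

and the estimator of $B_w$ is

\[ \begin{aligned} \hat B_w ={}& N^{-1}\sum_{i=1}^{N}\sum_{s=1}^{S}\sum_{m=1}^{K} v_{is}^2\, y_{is}\, \nabla_\theta q(w_{ism},\hat\theta)\nabla_\theta q(w_{ism},\hat\theta)^{\prime} \\ &+ N^{-1}\sum_{i=1}^{N}\sum_{s=1}^{S}\sum_{m=1}^{K}\sum_{t\neq m}^{K} v_{is}^2\, y_{is}\, \nabla_\theta q(w_{ism},\hat\theta)\nabla_\theta q(w_{ist},\hat\theta)^{\prime} \\ &- \sum_{s=1}^{S}\frac{1}{N}\left[\sum_{i=1}^{N}\sum_{m=1}^{K} v_{is}\, y_{is}\nabla_\theta q(w_{ism},\hat\theta)\right]\cdot\left[\sum_{i=1}^{N}\sum_{m=1}^{K} v_{is}\, y_{is}\nabla_\theta q(w_{ism},\hat\theta)\right]^{\prime}. \end{aligned} \]

These results are similar to those of Bhattacharya (2005). Wooldridge (2008) obtains the same results for a linear model estimated by least squares.

Example 4. Consider a case without the first level of stratification, in which each cluster contains just one unit of observation. Then our results become

\[ \hat A_w = N^{-1}\sum_{i=1}^{N}\sum_{j=1}^{J} p_j^{-1}\, r_{ij} z_{ij}\, \nabla^2_\theta q(w_i,\hat\theta) \]

and

\[ \hat B_w = N^{-1}\sum_{i=1}^{N}\sum_{j=1}^{J} p_j^{-2}\, r_{ij} z_{ij}\, \nabla_\theta q(w_i,\hat\theta)\nabla_\theta q(w_i,\hat\theta)^{\prime}. \]

These are the same results as in Wooldridge (1999) for the variable probability sampling case.

1.7 Two-Step M-Estimator

Consider a panel data model for a random draw $i$ from the population,

\[ E(Y_i\,|\,X_i=x_i) = m(x_i,\theta^\circ), \tag{1.30} \]

where $y_i$ is a $T\times 1$ vector of observations on the dependent variable and $m(x_i,\theta)$ is a $T\times 1$ vector of conditional mean functions. Here we assume that the explanatory variables are strictly exogenous. Stratification is normally based on variables from the first period.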
A consistent, asymptotically normal estimator is obtained by applying the pooled weighted M-estimator discussed in the previous sections. The estimator of the asymptotic variance of $\hat\theta_w$ is obtained from (1.15), where $\nabla_\theta q(w_{ism},\hat\theta_w)$ is a $P\times T$ matrix. Arbitrary serial correlation and heteroskedasticity are allowed in the calculation of the estimator (1.15).

Under assumption (1.30), where the conditional mean is correctly specified, can we do more in the context of stratified samples? This is the question that we answer in the next chapter. In general, under (1.30) we can use generalized least squares (GLS) methods to obtain more efficient estimators of the parameters appearing in a set of conditional mean functions. To obtain more efficient estimators we usually need $\hat\theta_w$ from the first step.

Let $\Omega(x_i,\gamma)$ be a model for the $T\times T$ conditional variance matrix $\operatorname{Var}(Y_i\,|\,X_i)$. If this model is correctly specified, we can in general obtain a consistent estimator of the true parameters of the variance matrix, $\gamma^\circ$. In most applications we obtain an estimate of $\gamma$ from a first step, for example by using residuals from an initial weighted M-estimator of the kind discussed in this chapter. Given $\hat\gamma$, and assuming that the conditional variance matrix is nonsingular for all $i$, we can estimate $\theta^\circ$ by solving

\[ \min_{\theta}\sum_{i=1}^{N}\left[y_i-m(x_i,\theta)\right]^{\prime}\left[\Omega(x_i,\hat\gamma)\right]^{-1}\left[y_i-m(x_i,\theta)\right]. \tag{1.31} \]

Wooldridge (2010) calls the solution to (1.31) the weighted multivariate nonlinear least squares (WMNLS) estimator. Interestingly, even if the chosen model for the conditional variance $\Omega(x_i,\gamma)$ is misspecified, the WMNLS estimator might be more efficient for $\theta^\circ$ than an estimator that ignores variances and covariances, at least under (1.30). In most cases, a misspecified model of the variance matrix still captures key features of the conditional second moments. This is the key insight of the generalized estimating equations (GEE) literature, which is typically applied to panel data models.
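A minimal sketch of the two-step idea behind (1.31) is given below for the linear case $m(x_i,\theta)=x_i\theta$, where the minimizer has a closed form. It assumes simple random sampling (no sampling weights) and a working AR(1) correlation matrix that does not depend on $x_i$; the simulated design, the parameter values and the function names are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def simulate_panel(n_units, n_periods, theta, rho, rng):
    """Linear panel y_it = theta0 + theta1 * x_it + u_it with stationary AR(1) errors."""
    x = rng.normal(size=(n_units, n_periods))
    X = np.stack([np.ones_like(x), x], axis=2)           # shape (N, T, P)
    u = np.empty((n_units, n_periods))
    u[:, 0] = rng.normal(size=n_units)
    for t in range(1, n_periods):
        u[:, t] = rho * u[:, t - 1] + np.sqrt(1 - rho**2) * rng.normal(size=n_units)
    y = X @ theta + u                                     # shape (N, T)
    return X, y

def gls(X, y, omega):
    """Minimize sum_i (y_i - X_i theta)' omega^{-1} (y_i - X_i theta) in closed form."""
    omega_inv = np.linalg.inv(omega)
    A = np.einsum("itp,ts,isq->pq", X, omega_inv, X)
    b = np.einsum("itp,ts,is->p", X, omega_inv, y)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(2)
theta_true = np.array([1.0, 2.0])
X, y = simulate_panel(2000, 4, theta_true, rho=0.6, rng=rng)
T = y.shape[1]

# Step 1: pooled estimator (identity working matrix) and its residuals.
theta_pooled = gls(X, y, np.eye(T))
resid = y - X @ theta_pooled

# Step 2: estimate the working AR(1) parameter, build Omega, re-estimate theta.
rho_hat = np.sum(resid[:, 1:] * resid[:, :-1]) / np.sum(resid[:, :-1] ** 2)
lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
omega_hat = rho_hat ** lags
theta_gls = gls(X, y, omega_hat)

print("rho_hat     :", rho_hat)
print("theta_pooled:", theta_pooled)
print("theta_gls   :", theta_gls)
```

Both estimators are consistent in this design; the point of the second step is that the working matrix uses the within-unit correlation, which is where any efficiency gain would come from.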
In the GEE literature, the conditional variance matrix $\Omega(x_i,\gamma)$ is called the working variance matrix; it is allowed to be misspecified, and in many cases it is known to be. In the next chapter we investigate in more detail the problem of efficient estimation in panel data models where simple random sampling is not a correct assumption.

1.8 Conclusion

Many data sets in economics and other branches of the social sciences are not i.i.d. observations but come from multi-stage stratified and clustered surveys. These surveys usually produce data that are not random samples, and statistical inference can be faulty if we overlook the sampling design. In this chapter I examine statistical inference in multi-stage sampling designs within the framework of M-estimators. The results show that neglecting the sampling scheme leads to overestimating or underestimating the variances. Applying the weighted M-estimator gives consistent and asymptotically normally distributed estimators regardless of whether stratification is exogenous or endogenous. Under exogenous stratification, however, unweighted estimators are also consistent.

The results show that the variance of the weighted estimator consists of three parts. Part one measures the variance under the i.i.d. assumption. Parts two and three take into account clustering and stratification effects. Clustering effects are usually positive, while the stratification effect is negative. These two rarely offset each other, and therefore overlooking them is potentially problematic.

An interesting question that arises is whether a more efficient estimator is possible under stratified sampling. Under simple random sampling, and assuming that the conditional mean is correctly specified, we can apply generalized least squares methods to obtain an efficiency gain. We follow up this possibility in panel data models with stratified sampling schemes in the next chapter.
Chapter 2

ASYMPTOTIC EFFICIENCY IN THE PANEL DATA MODELS WITH STRATIFIED SAMPLING

2.1 Introduction

Finding more efficient estimators helps researchers increase the precision of their statistical inference. Efficiency usually comes at a price: it requires stronger assumptions than those needed for consistency. Here we assume that the explanatory variables are strictly exogenous. In panel data studies, however, this assumption is violated in models with lagged dependent variables, and perhaps in models without lagged dependent variables as well. Fixed effects (FE) and random effects (RE) are two well-known linear methods used in empirical studies that require strict exogeneity of the explanatory variables. In the RE approach, the serial correlation in the composite error is exploited in a generalized least squares (GLS) framework. In the GLS procedure we also need to add assumptions on the conditional variance matrix of the error term.

Efficiency in the context of stratified sampling has already been a subject of interest. Among others, Cosslett (1981a, 1981b, and 1993) and Imbens (1992) examine efficiency for discrete choice models. Imbens and Lancaster (1996) develop a new estimator for this estimation problem and show that it achieves the semi-parametric efficiency bound for this case. Recently, Tripathi (2011) develops efficient empirical likelihood-based inference for moment restriction models when data are collected by stratified sampling schemes.

In this chapter we study efficiency in panel data models with stratified data. The main idea is to use the information within panels, as in the GLS method, in order to gain efficiency. It should be emphasized that we only approximate the efficient estimator in the sample and try to obtain estimates that are more efficient than pooled estimators that ignore correlations within panels. In other words, our goal in this chapter is not to find the efficiency bound.

The paper is organized as follows.
The next section presents the model and the conditional moment restriction that we assume to hold in the population. Section 3 introduces the sampling scheme, the sample objective function and the relevant probabilities. In section 4 we first show that the conditional moment restriction also holds in the sample. Then we discuss efficient estimators, referring to some well-known works in the literature, and how they should be applied in the context of stratified samples. In the same section we derive a function that minimizes the asymptotic variance. In section 6 we carry out a Monte Carlo experiment with the normal linear case and look at the results of applying the new estimators under exogenous and endogenous stratification. Section 7 shows an application of the method to PSID data. The last section summarizes the main findings of the paper and ends with some concluding remarks. All proofs and tables are contained in the Appendices.

2.2 The Moment Conditions

Let $W_i$ be an $M\times 1$ random vector taking values in $\mathcal{W}\subset\mathbb{R}^M$, where $\mathbb{R}^M$ is the $M$-dimensional Euclidean space. Some feature of the distribution of $W$ is a function of a $P\times 1$ parameter vector $\theta$ that is an element of the parameter space $\Theta\subset\mathbb{R}^P$. Now consider the class of estimators for which a zero conditional moment restriction is satisfied in the population:

\[ E\left[r(W,\theta^\circ)\,|\,W_2\right] = 0 \quad\text{for all } W_2\in\mathcal{W}_2. \tag{2.1} \]

Here $r(W,\theta)$ is an $L\times 1$ vector of functions, $\theta^\circ$ satisfies the conditional moment assumption, and $W_2\in\mathbb{R}^K$ is a sub-vector of $W\in\mathbb{R}^M$. For instance, $r(W,\theta)$ can be a vector of residuals and $W_2$ a vector of instrumental variables. We need standard regularity conditions such as continuity and differentiability of $r(W,\theta)$ on the interior of $\Theta$.

2.3 Sampling Scheme

The analysis of the asymptotic behavior of an estimator becomes more complicated when the data set comes from a non-random sampling scheme such as stratified sampling.
One important source of the complexity is the difference between the population distribution on the one hand and the sample distribution on the other. In simple random sampling these two distributions are the same.

In multinomial sampling, stratum $\mathcal{W}_j$ is a subset of $\mathcal{W}$ for $j=1,\cdots,J$. Let $Q_j$ be the probability that a randomly drawn observation lies in $\mathcal{W}_j$, i.e.

\[ Q_j = P\left(W\in\mathcal{W}_j\right), \tag{2.2} \]

and let $S$ be the stratum indicator that shows from which stratum an observation was drawn. In a multinomial scheme, the stratum indicator $s_i\in\{1,2,\cdots,J\}$ is first chosen randomly with probability $H_j$, that is,

\[ H_j = P\left(S_i=j\right). \tag{2.3} \]

In the second step, observation $W_i$ is randomly drawn from the stratum for which the indicator is $s_i=j$. This leads to the sample objective function

\[ \sum_{i=1}^{N}\sum_{j=1}^{J} 1[S_i=j]\,\frac{Q_j}{H_j}\, r(V_i,\theta). \tag{2.4} \]

Unlike random sampling, where all observations are weighted equally no matter which subpopulation or stratum they belong to, in a multinomial sampling scheme observations receive different weights depending on their stratum. The objective function in (2.4) weights observation $i$ by $Q_j/H_j$ if it comes from stratum $j$. If all observations are weighted equally, that is if $Q_j=H_j$ for all $j$, then there is no gain from stratified sampling over random sampling.

To emphasize the difference between the distribution of observations in the population and in the sample under a stratified sampling scheme, random vectors in the population and in the sample are denoted by $W$ and $V$ respectively.

2.4 Efficient Estimation under Moment Restrictions

2.4.1 Moment Restrictions in the Sample

To study efficiency in panel data models when the data come from stratified samples, and under the conditional expectation assumption (2.1), one first needs to evaluate the conditional expectation of the sample objective function in equation (2.4).
To this end, for each observation $i$ first define

\[ q(S,V,\theta) = \sum_{j=1}^{J} 1[S=j]\,\frac{Q_j}{H_j}\, r(V,\theta). \tag{2.5} \]

$q(\cdot)$ is a function of the random variable $S$, an indicator of the stratum of observation $i$, and of the random vector $V$. This function also depends on the sampling weight of each observation $i$, $Q_j/H_j$, which is assumed to be known. We want to show that the expected value of $q(\cdot)$ given $V_2$, evaluated at the true parameter value $\theta^\circ$, is zero:

\[ E\left[q(S,V,\theta^\circ)\,|\,V_2\right] = \sum_{j=1}^{J} E\left[1[S=j]\,\frac{Q_j}{H_j}\, r(V,\theta^\circ)\,\Big|\,V_2\right] = 0. \tag{2.6} \]

Using the definition of expected value and assuming that $V$ is a continuous random vector, the expected value of (2.5) is

\[ \sum_{j=1}^{J}\int_{v\in\mathcal{W}} 1[s=j]\,\frac{Q_j}{H_j}\, r(v,\theta^\circ)\cdot g(s,v\,|\,v_2)\,dv. \tag{2.7} \]

Equation (2.7) shows that we need the conditional sampling density of $S$ and $V$ given $V_2$, $g(s,v|v_2)$. Imbens and Lancaster (1996) show that this conditional density is

\[ g(s,v\,|\,v_2) = \frac{\dfrac{H_s}{Q_s}\, f(v\,|\,v_2,\theta)}{\displaystyle\sum_{j=1}^{J}\frac{H_j}{Q_j}\, R(j,v_2,\theta)}. \tag{2.8} \]

Equation (2.8) represents the conditional sampling density of $S$ and $V$ given $V_2$ in terms of the conditional density of $V$ given $V_2$ in the population, the sampling weight $H_s/Q_s$, and $R(s,v_2,\theta)$. Here $R(s,v_2,\theta)$ is defined as the probability that a randomly drawn observation is in stratum $s$ given $V_2$. It is a known function of $s$, $v_2$, and $\theta$. It is also important to note that, since we assume the strata do not overlap, the conditional sampling density of $S$ and $V$ given $V_2$ is the same as the conditional sampling density of $V$ given $V_2$, i.e. $g(s,v|v_2)=g(v|v_2)$.
By substituting (2.8) in (2.7) we have

\[ \sum_{j=1}^{J}\int_{v\in\mathcal{W}} 1[s=j]\,\frac{Q_j}{H_j}\, r(v,\theta)\cdot\frac{\dfrac{H_j}{Q_j}\, f(v\,|\,v_2,\theta)}{\displaystyle\sum_{k=1}^{J}\frac{H_k}{Q_k}R(k,v_2,\theta)}\,dv = \sum_{j=1}^{J}\int_{v\in\mathcal{W}} 1[s=j]\,\frac{r(v,\theta)\cdot f(v\,|\,v_2,\theta)}{\displaystyle\sum_{k=1}^{J}\frac{H_k}{Q_k}R(k,v_2,\theta)}\,dv. \tag{2.9} \]

Since $\mathcal{W}_1,\mathcal{W}_2,\cdots,\mathcal{W}_J$ are mutually disjoint and their union $\bigcup_{j=1}^{J}\mathcal{W}_j$ covers the whole population, saying that the stratum of observation $i$ is $j$, or $S_i=j$, is equivalent to saying that observation $i$ belongs to subpopulation $j$, or $v_i\in\mathcal{W}_j$. So we can replace $1[s=j]$ with $1[v\in\mathcal{W}_j]$ in expression (2.9), which gives

\[ \sum_{j=1}^{J}\int_{v\in\mathcal{W}} 1\left[v\in\mathcal{W}_j\right]\frac{r(v,\theta)\cdot f(v\,|\,v_2,\theta)}{\displaystyle\sum_{k=1}^{J}\frac{H_k}{Q_k}R(k,v_2,\theta)}\,dv. \tag{2.10} \]

In expression (2.10) the denominator does not depend on $v$, and $1[v\in\mathcal{W}_j]$ just defines the limits of integration. Therefore (2.10) can be rewritten as

\[ \frac{1}{\displaystyle\sum_{k=1}^{J}\frac{H_k}{Q_k}R(k,v_2,\theta)}\sum_{j=1}^{J}\int_{v\in\mathcal{W}_j} r(v,\theta)\cdot f(v\,|\,v_2,\theta)\,dv = \eta\int_{v\in\mathcal{W}} r(v,\theta)\cdot f(v\,|\,v_2,\theta)\,dv. \tag{2.11} \]

Here $\eta=\left[\sum_{k=1}^{J}\frac{H_k}{Q_k}R(k,v_2,\theta)\right]^{-1}$ is constant given $v_2$, and the integral in (2.11) is by definition the conditional expectation of $r(\cdot)$:

\[ \int_{v\in\mathcal{W}} r(v,\theta)\cdot f(v\,|\,v_2,\theta)\,dv = E\left[r(V,\theta)\,|\,V_2\right]. \tag{2.12} \]

By assumption (2.1), (2.12) evaluated at the true parameter value $\theta^\circ$ is equal to zero. Hence, although multinomial sampling changes the distribution of observations in the sample, the zero conditional mean assumption still holds. We summarize this finding in the following lemma.

Lemma 2.4.1. If the zero conditional moment restriction (2.1), evaluated at the true parameter value $\theta^\circ$, holds in the population, then under the multinomial stratified sampling scheme its sample analog (2.6), evaluated at $\theta^\circ$, is also zero.

The result is valid under standard stratified and variable probability sampling schemes too. Imbens and Lancaster (1996) show that these three common types of stratification can be analyzed in a unified manner.
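An unconditional implication of Lemma 2.4.1 is easy to check by simulation: under multinomial sampling, the average of $(Q_S/H_S)\,r(V,\theta^\circ)$ should be close to zero even when the strata depend on the outcome, whereas the unweighted average of $r(V,\theta^\circ)$ generally is not. The Python sketch below assumes a hypothetical linear population, two strata defined by $y$, and sampling shares $H_j$ chosen for illustration; none of these values come from the text.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population: y = 1 + 2x + u, with strata defined by y (endogenous).
pop_size = 400_000
x = rng.normal(size=pop_size)
u = rng.normal(size=pop_size)
y = 1.0 + 2.0 * x + u
in_upper = y > 1.0
strata_idx = [np.flatnonzero(~in_upper), np.flatnonzero(in_upper)]

Q = np.array([np.mean(~in_upper), np.mean(in_upper)])   # population shares Q_j
H = np.array([0.2, 0.8])                                # assumed sampling shares H_j

# Multinomial sampling: draw the stratum label S_i with probabilities H_j,
# then draw one unit at random (with replacement) from the chosen stratum.
n = 100_000
S = rng.choice(2, size=n, p=H)
draw = np.empty(n, dtype=int)
for j in range(2):
    mask = S == j
    draw[mask] = rng.choice(strata_idx[j], size=mask.sum())

r = y[draw] - 1.0 - 2.0 * x[draw]       # residual evaluated at the true parameter
weights = Q[S] / H[S]                   # sampling weights Q_j / H_j

weighted_moment = np.mean(weights * r)
unweighted_moment = np.mean(r)

print("weighted moment  :", weighted_moment)
print("unweighted moment:", unweighted_moment)
```

This only checks the moment with a constant instrument; the lemma itself is a statement conditional on $V_2$.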
They show that, regardless of the actual sampling scheme, efficient inference should be identical for standard stratified sampling and multinomial sampling. The variable probability sampling model is just a re-parametrization of the multinomial sampling scheme, and therefore inference should be identical for both models.

2.4.2 Efficient Estimation

The result of the previous section opens the door to applying the well-known results developed by Chamberlain (1987) and Newey and McFadden (1994) to find the smallest asymptotic variance under the zero conditional mean assumption (2.1). To find such a solution, let

\[ \Omega(W_2,\theta^\circ) = E\left[r(W,\theta^\circ)\,r(W,\theta^\circ)^{\prime}\,|\,W_2\right] = \operatorname{Var}\left[r(W,\theta^\circ)\,|\,W_2\right] \tag{2.13} \]

be the $T\times T$ conditional variance of $r(W,\theta^\circ)$ given $W_2$ in the population, and let

\[ G(W_2,\theta^\circ) = E\left[\nabla_\theta r(W,\theta^\circ)\,|\,W_2\right] \tag{2.14} \]

be the $T\times P$ conditional mean of the gradient in the population. Then it can be shown that

\[ Z^{*}(W_2,\theta^\circ) = \Omega(W_2,\theta^\circ)^{-1}\,G(W_2,\theta^\circ) \tag{2.15} \]

is the function that minimizes the asymptotic variance. This function is $T\times P$, and the efficient method of moments estimator solves

\[ E\left[Z^{*}(W_2,\theta^\circ)^{\prime}\, r(W,\theta^\circ)\right] = 0. \tag{2.16} \]

Since stratification changes the distribution of observations in the sample, we first need to evaluate the conditional variance of the sample objective function $q(S,V,\theta)$. In the first appendix we show that

\[ E\left[q(S,V,\theta^\circ)\,q(S,V,\theta^\circ)^{\prime}\,|\,V_2\right] = \sum_{j=1}^{J}\frac{Q_j}{H_j}\, E\left[r(V,\theta^\circ)\,r(V,\theta^\circ)^{\prime}\,|\,V_2,S=j\right]. \tag{2.17} \]

We can write the right-hand side of equation (2.17) in terms of the conditional variance of $r(V,\theta)$ in each stratum, so (2.17) can be rewritten as

\[ \begin{aligned} \operatorname{var}\left[q(S,V,\theta^\circ)\,|\,V_2\right] ={}& \sum_{j=1}^{J}\frac{Q_j}{H_j}\operatorname{var}\left[r(V,\theta^\circ)\,|\,V_2,S=j\right] \\ &+ \sum_{j=1}^{J}\frac{Q_j}{H_j}\, E\left[r(V,\theta^\circ)\,|\,V_2,S=j\right]E\left[r(V,\theta^\circ)\,|\,V_2,S=j\right]^{\prime}. \end{aligned} \tag{2.18} \]

Expression (2.18) shows that the sampling conditional variance of $q(S,V,\theta)$ is equal to the sum of the weighted conditional variances in the strata plus the sum of the weighted outer products of the conditional means in the strata.
To see the effect of stratification, it is useful to compare this with the random sampling case, where each observation in the population has the same weight, in other words $Q_j=H_j$ for all $j$, and to assume that the conditional expected value in each stratum is equal to the conditional expected value in the population, which is zero by assumption. Then equation (2.18) reduces to the sum of the conditional variances in the strata.

Two interesting cases need attention. The first case is when the strata are functions of the exogenous variables $V_2$, so that stratification is exogenous. The second term on the right-hand side of (2.18) is then zero, because $E[r(V,\theta^\circ)\,|\,V_2,S=j]=E[r(V,\theta^\circ)\,|\,V_2]=0$, and equation (2.18) simplifies to

\[ \operatorname{var}\left[q(S,V,\theta^\circ)\,|\,V_2\right] = \sum_{j=1}^{J}\frac{Q_j}{H_j}\left\{\operatorname{var}\left[r(V,\theta^\circ)\,|\,V_2\right]\right\} = \Omega(V_2)\sum_{j=1}^{J}\frac{Q_j}{H_j}, \tag{2.19} \]

and since $\sum_{j=1}^{J}Q_j/H_j$ is a constant, it does not affect the variance; therefore the conditional variance of $q(S,V,\theta)$ is equal to the conditional variance of $r(V,\theta)$ in the population.

The second case occurs when, despite changes in the variances between strata, the structure of the correlation remains constant. Examples are correlation structures such as AR(p) or MA(q). If the variance-covariance matrix remains the same despite stratification, then by equation (2.17) the sample objective function $q(S,V,\theta)$ has the same variance as $r(V,\theta)$ in the population. In the next section we assume that the variance-covariance matrix does not change with stratification and then check the simulation results for this case, assuming that the correlation follows an AR(1) process.

We also need to check the score of the objective function. In the first appendix we also show that the conditional expected value of the sample gradient vector is

\[ E\left[\nabla_\theta q(S,V,\theta)\,|\,V_2\right] = E\left[\nabla_\theta r(V,\theta)\,|\,V_2\right]. \tag{2.20} \]

The right-hand side of (2.20) is the conditional expected value of the population Jacobian matrix.
This leads to the optimal instrument matrix, which is the $T\times P$ matrix

\[ Z^{*}(V_2) \equiv \left\{\operatorname{var}\left[q(S,V,\theta^\circ)\,|\,V_2\right]\right\}^{-1} E\left[\nabla_\theta q(S,V,\theta^\circ)\,|\,V_2\right]. \tag{2.21} \]

Therefore the efficient method of moments (GMM) estimator solves the sample moment conditions

\[ \sum_{i=1}^{N} Z^{*}(V_{2i})^{\prime}\, q(S_i,V_i,\hat\theta) = 0. \tag{2.22} \]

2.5 Examples

In this section some specific examples are covered that illustrate the theoretical results.

Example 1. As the first example, consider the linear model

\[ Y_i = X_i\theta + U_i, \tag{2.23} \]

where $Y$ is a $T\times 1$ vector of dependent variables, $X$ is a $T\times P$ matrix of control variables, $\theta$ is a $P\times 1$ vector of parameters and $U$ is a $T\times 1$ vector of error terms. In this example $r(X_i,Y_i,\theta)=U_i=Y_i-X_i\theta$. We assume strict exogeneity,

\[ E(U\,|\,X) = 0, \tag{2.24} \]

and add the assumption that the variance-covariance matrix in the population is a function of the control variables $X$:

\[ E\left(UU^{\prime}\,|\,X\right) = \Omega(X). \tag{2.25} \]

Under the multinomial sampling scheme we have

\[ q(X,Y,S,\theta) = \sum_{j=1}^{J} 1[S=j]\,\frac{Q_j}{H_j}\,U, \tag{2.26} \]

and by equations (2.12) and (2.17) the conditional expected value and conditional variance of this function are

\[ E\left[q(X,Y,S,\theta)\,|\,X\right] = 0 \tag{2.27} \]

and

\[ \operatorname{var}\left[q(X,Y,S,\theta)\,|\,X\right] = \sum_{j=1}^{J}\frac{Q_j}{H_j}\,E\left[UU^{\prime}\,|\,X,S=j\right] \tag{2.28} \]

respectively. From (2.28) it is clear that the variance matrix is a function of $X$ and of the strata. Also, the conditional expected value of the gradient vector in this simple linear model is $x$ according to (2.20). Therefore the optimal choice of instrument according to (2.21) is

\[ \left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\,E\left(UU^{\prime}\,|\,X,S=j\right)\right]^{-1}x, \tag{2.29} \]

and the GMM solution that produces efficient estimators solves

\[ \sum_{i=1}^{N} x_i^{\prime}\left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\,E\left(UU^{\prime}\,|\,X,S=j\right)\right]^{-1}\sum_{j=1}^{J}1[S=j]\,\frac{Q_j}{H_j}\,U_i = 0. \tag{2.30} \]

A computational version of (2.30) can be written as

\[ \sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j} x_{ij}^{\prime}\left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\,E\left(UU^{\prime}\,|\,X,S=j\right)\right]^{-1}u_{ij} = 0. \tag{2.31} \]

Estimates of $\theta^\circ$ are obtained by solving equation (2.31).
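Because $u_{ij}=y_{ij}-x_{ij}\theta$ is linear in $\theta$, equation (2.31) is a linear system and can be solved in closed form. As a short worked step, write $\Omega(x)$ for the weighted variance matrix in (2.28); this shorthand is introduced here only for compactness. Substituting the residual into (2.31) and collecting the terms in $\theta$ gives:

```latex
\[
\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j}
   x_{ij}^{\prime}\,\Omega(x)^{-1}\left(y_{ij}-x_{ij}\hat\theta\right)=0
\quad\Longleftrightarrow\quad
\left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j}
   x_{ij}^{\prime}\,\Omega(x)^{-1}x_{ij}\right]\hat\theta
=\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j}
   x_{ij}^{\prime}\,\Omega(x)^{-1}y_{ij}.
\]
```

Provided the $P\times P$ matrix in brackets is nonsingular, the system has a unique solution once $\Omega(x)$ is replaced by an estimate.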
These parameter estimates are

$$\hat\theta_{wGMM} = \left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j} x_{ij}'\,\hat\Omega^{-1}(x,j)\,x_{ij}\right]^{-1}\left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j} x_{ij}'\,\hat\Omega^{-1}(x,j)\,y_{ij}\right] \tag{2.32}$$

which looks like a GLS estimator. In (2.32), $N_j$ is the sample size in stratum $j$ and $\hat\Omega^{-1}(x,j)$ is the inverse of an estimate of the variance matrix $\operatorname{var}[q(X,Y,S,\theta) \mid X=x] = \sum_{j=1}^{J}\frac{Q_j}{H_j}E[UU' \mid X=x, S=j]$. To get a clear idea of equation (2.32), assume for example that the variance matrix (2.28) is a function of gender. In this case we need to obtain $\hat\Omega^{-1}(female_i, j)$ and $\hat\Omega^{-1}(male_i, j)$ for each stratum $j \in \{1, 2, \ldots, J\}$. When the conditional variance matrix is a function of a continuous explanatory variable $x_{ij}$, one possible solution is to divide its range into intervals and then estimate the variance matrix for each interval in each stratum $j$ separately. Of course, if we know the functional form of the relationship between the variance matrix and the explanatory variable $x_i$ in each stratum $j$, we can improve efficiency by incorporating this knowledge into the estimation process. For example, if we are interested in the relationship between saving and income in different states, and a theory provides a specific form for the variance matrix that relates changes in the second moments of saving to changes in income and other explanatory variables, then we are in a situation like weighted least squares, which provides more efficient estimators than OLS. Note that in weighted least squares the reason for weighting observations is to solve the heteroskedasticity problem, while in models with stratified or complex sampling designs we need weights even in homoskedastic cases.

We can summarize the above procedure for finding a GMM estimator in panel data models with a stratified structure in a few simple steps:

1. Obtain a consistent estimate of $\theta$.
2. Obtain the residuals $\hat u_{ijt}$.
3. Estimate $\Omega_j(X) = E[UU' \mid X, S=j]$ for each stratum $j$. Call the estimates $\hat\Omega_j(X)$.
4. Form $\hat\Omega(X,S)$ by adding the weighted $\hat\Omega_j(X)$.
5. Substitute $\hat\Omega(X,S)$ into equation (2.32) to obtain $\hat\theta_{wGMM}$, which we hope is more efficient than a pooled estimator.

Obtaining a consistent estimator of $\theta$ should not be difficult. Any software package that can estimate panel data models from survey data can be used for the first step.

Example 2. The second example considers a nonlinear model. Assume

$$E[Y_t \mid X_t] = m(X_t, \beta^{\circ}), \quad t = 1, \cdots, T \tag{2.33}$$

Here $\{(X_t, Y_t) : t = 1, 2, \cdots, T\}$ is the time series of observations for a random draw from the cross-section population, and assumption (2.33) simply means that the parametric model for $E[Y_t \mid X_t]$ has been correctly specified. For example, if $Y$ is a count variable, a Poisson QMLE can be used. In this case, and in general for $Y \geq 0$ and unbounded from above, the most common conditional mean function is the exponential

$$m(X_t, \beta) = \exp(X_t\beta) \tag{2.34}$$

where $X_t$ is $1 \times K$ and contains unity as its first element, and $\beta$ is $K \times 1$. If we impose the Generalized Linear Models (GLM) assumption, then

$$\operatorname{var}(Y_t \mid X_t) = \sigma_{\circ}^{2}\, m(X_t, \beta^{\circ}) = \sigma_{\circ}^{2}\exp(X_t\beta^{\circ}), \quad t = 1, 2, \cdots, T \tag{2.35}$$

In this model $r(X,Y,\beta) = Y - \exp(X\beta) = U$, and $U$ is a $T \times 1$ vector with elements $U_t = Y_t - \exp(X_t\beta)$ for $t = 1, 2, \cdots, T$. Under multinomial stratified sampling and according to (4.1), the sample objective function is

$$q(X,Y,S,\beta) = \sum_{j=1}^{J} 1[S=j]\,\frac{Q_j}{H_j}\,U \tag{2.36}$$

Its conditional expected value is zero, as shown in the general case, and its conditional variance is

$$\operatorname{var}[q(X,Y,S,\beta) \mid X] = \sum_{j=1}^{J}\frac{Q_j}{H_j}\,E[UU' \mid X, S=j] \tag{2.37}$$

by (2.17). The conditional expected value of the gradient of the sample objective function is

$$E[\nabla_{\beta}\, q(X,Y,S,\beta) \mid X] = \begin{pmatrix} -X_1\exp(X_1\beta) \\ -X_2\exp(X_2\beta) \\ \vdots \\ -X_T\exp(X_T\beta) \end{pmatrix}_{T \times K} = R(X) \tag{2.38}$$

Then the optimal choice of instruments is given by

$$\left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\,E[UU' \mid X, S=j]\right]^{-1} R(X) \tag{2.39}$$

and finally the GMM estimators are obtained by solving

$$\sum_{i=1}^{N} R(X_i)'\left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\,E[UU' \mid X, S=j]\right]^{-1}\sum_{j=1}^{J} 1[S_i=j]\,\frac{Q_j}{H_j}\,U_i = 0 \tag{2.40}$$

Here, one approach is to model $E[UU' \mid X, S=j]$ as in the first example, hoping to obtain more efficient estimators. Alternatively, we can choose a hypothesized structure for the within-panel correlation, as in the generalized estimating equations (GEE) literature. The main idea of the GEE approach in panel data models is that, under the strict exogeneity assumption (2.1), even a misspecified model for the conditional variance (2.17) that nevertheless captures key features of the conditional second moments might lead to a more efficient estimator of $\theta^{\circ}$ than an estimator that ignores variances and covariances. The identity matrix is the simplest form of within-panel correlation; it assumes independence, in other words no correlation within panels. The exchangeable correlation matrix is a simple extension of this structure:

$$\Lambda(\alpha) = \begin{pmatrix} 1 & \alpha & \cdots & \alpha \\ \alpha & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \alpha \\ \alpha & \cdots & \alpha & 1 \end{pmatrix} \tag{2.41}$$

Here the parameter $\alpha$ is a scalar that gives the common correlation among observations within the panels. As an example, consider a health study in which the panels are clinics and the observations within the panels are patients.

If the observations within the panels have a natural order, it is more reasonable to assume an autoregressive structure for the within-panel correlation. In the health study, for instance, the panels may represent patients who are measured over time. In this case an autoregressive process can be a good model for the dependence of a patient's health condition over time. In section 5 we consider the autoregressive structure implied by the AR(1) process for the correlation matrix to study a linear model.
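The two working correlation structures discussed above can be written down in a few lines. The following is a minimal Python sketch, assuming NumPy is available; the function names `exchangeable_corr` and `ar1_corr` and the example values of $T$, $\alpha$ and $\rho$ are illustrative choices and do not come from the text.

```python
import numpy as np

def exchangeable_corr(T, alpha):
    """T x T exchangeable matrix as in (2.41): ones on the diagonal, alpha elsewhere."""
    return (1.0 - alpha) * np.eye(T) + alpha * np.ones((T, T))

def ar1_corr(T, rho):
    """T x T AR(1) correlation matrix whose (s, t) element is rho**|s - t|."""
    idx = np.arange(T)
    return rho ** np.abs(idx[:, None] - idx[None, :])

# Illustrative values only (not taken from the dissertation).
Lambda = exchangeable_corr(3, 0.3)
R_ar1 = ar1_corr(3, 0.5)
print(Lambda)
print(R_ar1)
```

Setting `alpha` or `rho` to zero returns the identity matrix, which corresponds to the independence working assumption.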
There are several ways in which we might hypothesize the within-panel correlation. For more options and examples see Hardin and Hilbe (2003).

Assuming the correlation matrix (2.41) and adding the GLM assumption (2.35), the variance of the sample objective function reduces to

$$\operatorname{var}[q(X,Y,S,\beta) \mid X] = m(X)^{1/2}\,\Lambda(\alpha)\,m(X)^{1/2}\sum_{j=1}^{J}\frac{Q_j}{H_j}\,\sigma_j \tag{2.42}$$

where $m(X)$ is

$$m(X) = \begin{pmatrix} m(X_1,\beta) & 0 & \cdots & 0 \\ 0 & m(X_2,\beta) & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & m(X_T,\beta) \end{pmatrix} \tag{2.43}$$

and, by dropping $\sum_{j=1}^{J}\frac{Q_j}{H_j}\sigma_j$ in (2.42), equation (2.40) changes to (2.44) in the sample:

$$\sum_{i=1}^{N} R(x)'\,m(x)^{-1/2}\,\Lambda(\alpha)^{-1}\,m(x)^{-1/2}\sum_{j=1}^{J} 1[S=j]\,\frac{Q_j}{H_j}\,u_i = 0 \tag{2.44}$$

Equation (2.44) can be represented as

$$\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j} R(x)'\,m(x)^{-1/2}\,\Lambda(\alpha)^{-1}\,m(x)^{-1/2}\,u_{ij} = 0 \tag{2.45}$$

2.6 The normal linear model: A Monte Carlo investigation

It is insightful to carry out a Monte Carlo analysis of a number of examples of stratified sampling in the normal linear model. We consider the following simple model:

$$Y_i = X_i\theta + U_i, \quad U \mid X = x \sim N(0, \sigma^2\Omega), \quad x_{it} \sim N(0,1), \quad i = 1, \cdots, N \tag{2.46}$$

In this simple two-variable linear regression model, $Y_i$ is a $T \times 1$ vector of the dependent variable, $X_i$ is a $T \times 2$ matrix of explanatory variables whose first column is a constant term, and $U_i$ is a $T \times 1$ vector of error terms. The parameter vector $\theta$ has two elements: the intercept $\theta_{\circ}$ and the slope $\theta_1$. We set zero and one as the true population values of the intercept and slope, respectively. We also assume that the only control variable in the model, $X_{it}$, has the standard normal distribution and that the error term $U$ is normally distributed with mean zero and variance $\sigma^2\Omega$, where $\Omega$ has a first-order autoregressive AR(1) structure with parameter $\rho$, and $\sigma^2 = 1$. Under these assumptions the variance-covariance matrix is

$$E[U_iU_i' \mid X_i] = E[U_iU_i'] = \sigma^2\Omega = \begin{pmatrix} 1 & \rho & \cdots & \rho^{T-1} \\ \rho & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \rho \\ \rho^{T-1} & \cdots & \rho & 1 \end{pmatrix} \tag{2.47}$$

Three endogenous strata are considered:

$$\mathcal{W}_1 = \mathcal{X} \times (-\infty, -0.25), \quad \mathcal{W}_2 = \mathcal{X} \times (-0.25, 1.5), \quad \mathcal{W}_3 = \mathcal{X} \times (1.5, \infty)$$

We also consider three exogenous strata:

$$\mathcal{W}_1 = (-\infty, -0.25) \times \mathcal{Y}, \quad \mathcal{W}_2 = (-0.25, 1.5) \times \mathcal{Y}, \quad \mathcal{W}_3 = (1.5, \infty) \times \mathcal{Y}$$

In all cases the strata are defined by dividing the population into subpopulations in the first period, $t = 1$. Here $\mathcal{W} = (\mathcal{X}, \mathcal{Y})$ is the population space, with $\mathcal{W} \subset \mathbb{R}^2$. In this example the population weights $Q_1$, $Q_2$, and $Q_3$ are known. They are obtained from the normal distribution with mean zero and variance two in the endogenous case, and from the standard normal distribution in the exogenous case. The sampling probabilities $H_s$ are equal to $N_s/N$ for $s \in \{1, 2, 3\}$, where $N_s$ is the number of observations from stratum $s$ and $N$ is the total number of observations in the sample. We estimate the parameters and their asymptotic variances with the estimators developed in this paper, which we call weighted GLS and unweighted GLS, and compare them with OLS and weighted pooled OLS.

In this exercise the sample objective function is (2.26) from the first example, and its variance is equal to

$$\operatorname{var}[q(X,Y,S,\theta) \mid X] = \sum_{j=1}^{J}\frac{Q_j}{H_j}\,E[UU' \mid X, S=j] = \frac{Q_1}{H_1}\sigma_1^2\Omega + \frac{Q_2}{H_2}\sigma_2^2\Omega + \frac{Q_3}{H_3}\sigma_3^2\Omega = \sum_{j=1}^{3}\frac{Q_j}{H_j}\,\sigma_j^2\,\Omega \tag{2.48}$$

The term $\sum_{j=1}^{3}\frac{Q_j}{H_j}\sigma_j^2$ in (2.48) is a constant and has no effect on the parameter estimates, so we drop it for simplicity. With these simplifications, $\operatorname{var}[q(X,Y,S,\theta) \mid X] = \Omega$ is the variance matrix in the population, which is not a function of the control variables $X$. Of course, by changing the assumptions we can obtain different estimates of the variance. In this example we consider the simplest case, imposing strong assumptions to make the estimation easy to carry out. The weighted GLS estimates of $\theta_{\circ}$ and $\theta_1$ are obtained by solving equation (2.30) from the first example.
These parameter estimates are

$$\hat\theta_{wGLS} = \left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j} x_{ij}'\,\Omega^{-1}x_{ij}\right]^{-1}\left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j} x_{ij}'\,\Omega^{-1}y_{ij}\right] \tag{2.49}$$

which looks like a GLS estimator. In (2.49), $N_j$ is the sample size in stratum $j$. To judge how much a practitioner gains by using this estimator, we use simulation to compare the estimators developed in this paper with two other estimators. The comparison is between the weighted GLS estimator (2.49); the unweighted GLS estimator, which is exactly the same as (2.49) but drops the weights $Q_j/H_j$ for all $j$; weighted pooled OLS, which ignores the correlation over time for each cross-section observation $i$,

$$\hat\theta_{wOLS} = \left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j} x_{ij}'\,x_{ij}\right]^{-1}\left[\sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j} x_{ij}'\,y_{ij}\right] \tag{2.50}$$

and the usual OLS estimator assuming homoskedasticity. We also estimate feasible versions of the weighted and unweighted GLS estimators and call them the weighted and unweighted FGLS estimators, respectively. In total, six estimators are evaluated in the simulation.

We also look at the variances of these estimators to evaluate their efficiency. An appropriate estimator of the asymptotic variance of $\hat\theta_{wGLS}$ is

$$\widehat{\operatorname{Avar}}(\hat\theta_{wGLS}) = \hat A_w^{-1}\,\hat B_w\,\hat A_w^{-1} \tag{2.51}$$

where $\hat A_w = \sum_{j=1}^{J}\frac{Q_j}{H_j}\sum_{i=1}^{N_j} x_{ij}'\,\Omega^{-1}x_{ij}$ and $\hat B_w$ is more complicated:

$$\hat B_w = \sum_{j=1}^{J}\frac{Q_j^2}{H_j^2}\sum_{i=1}^{N_j} x_{ij}'\,\Omega^{-1}\hat u_{ij}\hat u_{ij}'\,\Omega^{-1}x_{ij} - \sum_{j=1}^{J}\frac{Q_j^2}{H_j^2}\left[\frac{1}{N_j}\sum_{i=1}^{N_j} x_{ij}'\,\Omega^{-1}\hat u_{ij}\right]\left[\frac{1}{N_j}\sum_{i=1}^{N_j} x_{ij}'\,\Omega^{-1}\hat u_{ij}\right]' \tag{2.52}$$

If the weights are dropped from (2.51), we obtain the estimator of the asymptotic variance of unweighted GLS, $\widehat{\operatorname{Avar}}(\hat\theta_{uwGLS})$. If the variance-covariance matrix $\Omega$ is dropped from (2.51), we obtain the estimator of the asymptotic variance of weighted pooled OLS.

As already mentioned, in this exercise we consider a variance-covariance matrix with an AR(1) structure in the population. As equation (2.48) shows, if we assume that $\rho$ does not change across strata, then there is only one parameter that we need to estimate, namely $\rho$.
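The point estimators (2.49) and (2.50), together with a feasible version that first estimates $\rho$, can be sketched in Python as follows. This is only a sketch under stated assumptions: the data layout (arrays of shape `(N, T, P)` and `(N, T)`), the function names, the toy data, and in particular the simple moment estimator of $\rho$ based on lagged pooled residuals are illustrative choices, since the text does not state how $\rho$ is estimated. The variance estimator (2.51)-(2.52) is not implemented here.

```python
import numpy as np

def ar1_corr(T, rho):
    """T x T AR(1) correlation matrix with (s, t) element rho**|s - t|, as in (2.47)."""
    idx = np.arange(T)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def weighted_gls(X, y, stratum, Q, H, Omega):
    """Weighted GLS as in (2.49); passing Omega = I gives weighted pooled OLS (2.50).

    X: (N, T, P) regressors, y: (N, T) outcomes, stratum: (N,) labels in 0..J-1,
    Q, H: (J,) population and sample stratum shares, Omega: (T, T) matrix.
    """
    w = (Q / H)[stratum]                         # weight Q_j / H_j for each panel
    Oinv = np.linalg.inv(Omega)
    XtOi = np.einsum("itp,ts->ips", X, Oinv)     # x_i' Omega^{-1}, shape (N, P, T)
    A = np.einsum("i,ips,isq->pq", w, XtOi, X)   # weighted sum of x' Omega^{-1} x
    b = np.einsum("i,ips,is->p", w, XtOi, y)     # weighted sum of x' Omega^{-1} y
    return np.linalg.solve(A, b)

def weighted_fgls_ar1(X, y, stratum, Q, H):
    """Feasible version: weighted pooled OLS, residuals, estimated rho, then (2.49)."""
    T = y.shape[1]
    theta0 = weighted_gls(X, y, stratum, Q, H, np.eye(T))
    u = y - X @ theta0
    # Assumed estimator of rho: regression of residuals on their first lag.
    rho_hat = np.sum(u[:, 1:] * u[:, :-1]) / np.sum(u[:, :-1] ** 2)
    theta = weighted_gls(X, y, stratum, Q, H, ar1_corr(T, rho_hat))
    return theta, rho_hat

# Toy data (hypothetical): intercept 0, slope 1, AR(1) errors with rho = 0.5.
rng = np.random.default_rng(0)
N, T, rho = 2000, 3, 0.5
x = rng.standard_normal((N, T))
X = np.stack([np.ones((N, T)), x], axis=2)
e = rng.standard_normal((N, T))
u = np.empty((N, T))
u[:, 0] = e[:, 0]
for t in range(1, T):
    u[:, t] = rho * u[:, t - 1] + np.sqrt(1.0 - rho**2) * e[:, t]
y = X @ np.array([0.0, 1.0]) + u
stratum = np.arange(N) % 3            # arbitrary labels, unrelated to the data
Q = np.array([0.4, 0.5, 0.1])         # hypothetical population shares
H = np.bincount(stratum) / N          # sample shares N_j / N
theta_hat, rho_hat = weighted_fgls_ar1(X, y, stratum, Q, H)
print(theta_hat, rho_hat)
```

Dropping the weights corresponds to passing `Q = H`, which gives the unweighted GLS and pooled OLS estimators used as comparisons.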
In practice it is also possible to estimate a different $\rho$ for each stratum, but for simplicity, in this example, we assume that $\rho$ in each stratum is equal to the value of $\rho$ in the population. It is therefore a constant parameter and not a function of the strata. There is, however, a point to keep in mind. We call our estimators GLS (weighted or unweighted) whenever we use the population value of $\rho$, since equation (2.49) is very similar to the well-known GLS estimator. This naming may cause some confusion, because we do not know the true value of $\rho$ in each stratum; we simply assume that it equals the value of $\rho$ in the population of interest. It should now be clear why, in Tables B.1 to B.16, the parameter estimates from GLS and FGLS are very close in most cases.

We consider four values of the correlation parameter $\rho$, ranging from no correlation, $\rho = 0$, to a high degree of correlation, $\rho = 0.9$, with $\rho = 0.1$ and $\rho = 0.5$ in between. This lets us see the effect of the magnitude of the correlation on the efficiency gain in this simple exercise.

Tables B.1 to B.4 and B.5 to B.8 in the second appendix summarize the results for $T = 2$ and $T = 5$, respectively, when stratification is exogenous. These tables report the means, their standard errors, and the mean squared errors of the intercept and slope for the six estimators. When $\rho = 0$, POLS and unweighted GLS (uwGLS) are almost identical, as expected. It has been shown that under exogenous stratification, ignoring stratification does not cause any problem; see, for example, Manski and McFadden (1981), DuMouchel and Duncan (1983), and Wooldridge (1999, 2001). This is clearly seen in the tables. For both $T = 2$ and $T = 5$, OLS, which ignores stratification, and unweighted GLS and its feasible counterpart are superior to weighted POLS and weighted GLS and their feasible versions: they are consistent and more efficient.
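The simulation design described in this section can be sketched as follows. This is a minimal Python sketch under several assumptions that are not in the text: the stratum sample sizes, the size of the simulated population pool, the random seed, and all function names are hypothetical, and the sampling step simply keeps the first $N_j$ pooled draws that fall in stratum $j$. The population shares $Q_j$ follow from the cut points $-0.25$ and $1.5$ and from the first-period distributions stated above, $N(0,1)$ for $x_{i1}$ and $N(0,2)$ for $y_{i1}$.

```python
import math
import numpy as np

CUTS = (-0.25, 1.5)   # stratum boundaries used in the text

def norm_cdf(z, sd=1.0):
    """CDF of a N(0, sd^2) variable evaluated at z."""
    return 0.5 * (1.0 + math.erf(z / (sd * math.sqrt(2.0))))

def stratum_shares(endogenous):
    """Population shares Q_1, Q_2, Q_3 of the three first-period strata."""
    sd = math.sqrt(2.0) if endogenous else 1.0   # y_1 ~ N(0, 2), x_1 ~ N(0, 1)
    p1, p2 = norm_cdf(CUTS[0], sd), norm_cdf(CUTS[1], sd)
    return np.array([p1, p2 - p1, 1.0 - p2])

def draw_population(n, T, rho, rng):
    """Draw n panels from (2.46) with intercept 0, slope 1 and AR(1) errors."""
    x = rng.standard_normal((n, T))
    e = rng.standard_normal((n, T))
    u = np.empty((n, T))
    u[:, 0] = e[:, 0]
    for t in range(1, T):
        u[:, t] = rho * u[:, t - 1] + math.sqrt(1.0 - rho**2) * e[:, t]
    return x, x + u

def stratified_sample(n_per_stratum, T, rho, endogenous, rng, pool=200_000):
    """Keep the first N_j pooled draws in each stratum (hypothetical scheme)."""
    x, y = draw_population(pool, T, rho, rng)
    v = y[:, 0] if endogenous else x[:, 0]       # strata are defined at t = 1
    labels = np.digitize(v, CUTS)                # 0, 1, 2 for W1, W2, W3
    keep = np.concatenate(
        [np.flatnonzero(labels == j)[:n_j] for j, n_j in enumerate(n_per_stratum)]
    )
    H = np.asarray(n_per_stratum) / np.sum(n_per_stratum)   # H_j = N_j / N
    return x[keep], y[keep], labels[keep], stratum_shares(endogenous), H

rng = np.random.default_rng(1)
x_s, y_s, s, Q, H = stratified_sample((100, 100, 100), T=2, rho=0.5,
                                      endogenous=True, rng=rng)
print(Q, H)
```

With equal stratum sample sizes the third stratum is heavily oversampled relative to its population share, which is what makes the weights $Q_j/H_j$ matter under endogenous stratification.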
The interesting point, which is the main issue of this paper, appears when the correlation increases. As $\rho$ increases, unweighted GLS and its feasible version, which take the correlation into account in the estimation process, show their efficiency over OLS and weighted POLS, which simply ignore the correlation. This holds especially for the slope estimate; the intercept estimate changes little except at a high degree of correlation. As an illustration, consider exogenous stratification with $T = 2$ and $\rho = 0.9$ in Table B.4. In this case the mean of the slope is almost the same for POLS and unweighted FGLS, but the mean squared error of the latter is .0151, compared with .0355 for the former. This is a reduction of about 57 percent, which is substantial. For $T = 5$ the reduction is even larger, about 62 percent, as presented in Table B.8. When $\rho$ decreases to 0.5, the improvement in efficiency is still considerable. In this case, with $T = 2$ in Table B.3, the mean of the standard deviation of the slope decreases from .0355 for OLS to .0302 for unweighted FGLS. The mean of the standard deviation falls by about 18 percent when $T$ is 5 (Table B.7). Meanwhile, the mean of the standard deviation of the intercept does not change at all when $T = 2$, but it shows a slight improvement as $T$ increases to 5.

The simulation results show that, under exogenous stratification in a panel data model, unweighted GLS and its feasible counterpart, which take the structure of the variance-covariance matrix into account in estimation, are better than OLS and weighted POLS, which simply ignore the correlations within each cross-section observation. Another interesting observation is that even weighted GLS, in the case of exogenous stratification, improves in terms of lower bias and smaller variance as the degree of correlation increases.
It is clearly superior to weighted POLS, and its slope estimate has a much smaller variance than that of POLS when $\rho$ exceeds .5 (Tables B.3, B.4, B.7 and B.8). The results also show that, as the correlation increases, the difference between the standard deviation of the mean (presented in parentheses) and the mean of the standard deviation of the intercept increases for the OLS estimator. This is a sign that the variance of the OLS estimator is inconsistent and that the inconsistency rises along with $\rho$. However, the inconsistency in the variance of the slope is much smaller and does not vary significantly as the correlation between observations changes.

The main challenge arises when stratification is based on the endogenous variable. In this case unweighted estimators are generally inconsistent. Tables B.9 to B.12 and B.13 to B.16 summarize the results when stratification is based on the endogenous variable for $T = 2$ and $T = 5$, respectively. As expected, OLS, unweighted GLS, and its feasible counterpart all produce inconsistent estimates of both the slope and the intercept. Interestingly, as the correlation parameter increases, this inconsistency shrinks for the slope but grows for the intercept, for both estimators. Moreover, OLS gives inconsistent estimates of the variances too, although this inconsistency is reduced as $\rho$ increases. This can be seen by comparing the standard deviation of the mean, presented in parentheses, with the mean of the standard deviations.

The results presented in Tables B.9 to B.16 show that under endogenous stratification the weighted estimators (weighted POLS, weighted GLS and weighted FGLS) are consistent and are almost the same at low values of the correlation parameter, $\rho = 0$ and $\rho = 0.1$. The differences between these estimators become more marked as the correlation parameter $\rho$ grows.
For example, in Table B.11, when $T = 2$ and $\rho = 0.5$, weighted POLS and weighted GLS are both consistent estimators of the slope, but the latter has a standard deviation of the mean equal to .0365, compared with .0414 for the former. For $T = 5$ this difference is even larger (see Table B.15).

The superiority of weighted GLS, or its feasible equivalent, over the other estimators is unambiguous if we increase $\rho$ to 0.9. In this case, when $T = 2$, the mean of the standard deviation of the slope is .0186, which is less than half of the value for its closest competitor, weighted POLS, which is about .0408 (Table B.12). The difference between weighted GLS and the rest of the estimators is even more dramatic when $T = 5$. In Table B.16 we can compare the efficiency of weighted GLS and weighted POLS. Here the mean of the standard deviation of the slope for weighted GLS is just 36 percent of the value for weighted POLS (.0098 versus .0272). Of course, this large advantage of weighted GLS over weighted POLS is substantial only for the estimate of the slope, not the intercept.

In another set of Monte Carlo experiments, we relaxed the assumption that the correlation matrix is the same for all strata and estimated the matrix for each stratum. The results are even better: the estimators have smaller variances, although in most cases the variances of the old and the new estimators are very close. We also repeated the experiment after changing the covariance matrix structure to that of the random effects model. The results show that the weighted and unweighted GLS estimators are the efficient estimators under endogenous and exogenous stratification, respectively. To keep the appendices short we do not report the related tables and results.

Overall, the simulations point to some tentative conclusions. First, it is possible to find more efficient estimators in panel data models with a stratified sampling structure under appropriate assumptions.
Depending on whether the stratification is based on an exogenous or an endogenous variable, the GMM estimators developed in this paper, that is, unweighted or weighted GLS, outperform OLS or weighted POLS, which do not consider the correlations over time within each panel $i$. Second, this superiority is positively related to the level of correlation of a cross-section observation over time. At low levels of correlation there is no large advantage in using GMM over OLS or weighted POLS. This changes as the correlation parameter gets larger. Also, for the same sample size, the efficiency gain depends on the structure of the correlation, or on what kind of structure is chosen in the case of GEE models. The simulation results show that the structure of the correlation matrix affects the amount of reduction in the variances of the estimators.

2.7 Determinants of Family Income in the U.S.: An Empirical Application

In this section we analyze the determinants of family income and the sources of variation in family income across households. We estimate a simple linear model in which total family income is a function of family characteristics such as the education, age, gender, and marital status of the head of the family. The model is estimated by different methods: pooled OLS, weighted POLS, feasible GLS, and the weighted feasible GLS methods developed in this paper, in order to compare the efficiency gain, if any.

The data set used in this exercise comes from the 2003-2009 Panel Study of Income Dynamics (PSID). The PSID is a complex longitudinal panel survey that has collected data from the same families and their descendants in the United States since 1968. Data have been collected on a wide range of economic, social, demographic, psychological and health factors over the life course and across generations.
The sample size has grown from roughly 4,800 families in 1968 to about 7,400 by 2005, and to more than 8,690 families and 24,385 individuals as of 2009 (Heeringa, et al., 2011). As of 2009, the PSID has information on over 70,000 individuals collected over the past four decades.

The core sample of individuals and their families in the PSID is rooted in two distinct samples. The Survey Research Center designed a nationally representative sample known as the SRC sample. The second sample, known as the Survey of Economic Opportunity or SEO sample, was drawn mainly from lower income families. This oversample of low-income families was included to provide an adequate sample size for investigating poverty-related issues. Roughly 18,000 individuals living in 4,800 households were members of the original 1968 sample. In 1997, the PSID Immigrant Supplement added 511 immigrant families to the core sample to obtain a more complete picture of the population and to enhance representativeness.

Individuals in the PSID fall into two categories: sample and non-sample persons. By definition, a sample person is someone who was either a resident of a PSID original sample family in 1968, or an offspring born to or adopted by a sample individual who is actively engaged in the study at the time. The definition of sample persons was slightly relaxed in 1994 to allow a child born to or adopted by a sample person who was not participating in the study to be considered a sample person. According to Heeringa, et al. (2011a), of the 24,385 individuals distributed across 8,690 families in 2009, 17,471 are PSID sample persons and 6,914 are non-sample spouses and family members.

Longitudinal weights are calculated at the beginning of a four-year (two-wave) cycle. The last cycle began in 2007, and therefore the 2009 weights are just "carry-over" weights. Weights need to adjust for attrition and also for changes in family size that occur because of marriage, divorce, death, and other additions of new members.
The longitudinal family weight in the PSID is the average of the positive individual weights of the sample persons and the zero weights of the non-sample persons in the family. For example, if a PSID sample person with an individual longitudinal weight of 100 has a spouse who is a PSID non-sample person with an assigned weight of 0, then the family weight for this two-person family is 50. For more detail on the construction of the PSID longitudinal family and individual weights see Heeringa, et al. (2011a, b).

To study the relationship between income and family characteristics, we consider a simple linear model in which the dependent variable is total family income, tfinc. This is the sum of the taxable income of the family head, his wife, and other members of the family in the previous year, plus the social security income of the head, his wife, and other members of the family unit. This variable can take negative values, which indicate net losses resulting from business or farm activities. The model is

$$tfinc_{it} = X_{it}\beta + v_{it} \tag{2.53}$$

where $X$ is a vector of family characteristics. Here $v_{it} \equiv c_i + u_{it}$, $t = 1, \ldots, T$, are the composite errors, $c_i$, $i = 1, \ldots, N$, are the unobserved heterogeneity, and $u_{it}$ are the idiosyncratic errors. The parameters of interest are represented by the vector $\beta$. The vector $X$ includes total family wealth (twealth); the head's age, age squared (age2) and age cubed (age3); the health condition, marital status, education level, and employment status of the head; family size (fsize); the number of persons less than 6 years of age in the family (aychild6); the education levels of the head's father and mother; the race and gender of the head; and the number of persons less than 18 years of age in the family (nchild), as well as year dummy variables and an intercept. The variable twealth is constructed as the sum of seven asset types, net of debt value, plus the value of home equity.
We also added to the model interaction terms between the education level of the head (edu_hs) and his age, and between edu_hs and the head's employment status (unemployed). Tables B.17 and B.18 provide variable descriptions and summary statistics, respectively.

The panel in this empirical study consists of 4 waves (7 years), starting in 2003 and ending in 2009. The 2003 longitudinal family weights are used. After dropping all observations with missing values and strata with just one panel, the final data set is a balanced panel of 15,672 observations, or 3,918 panels, distributed among 33 strata.

To estimate the family income equation (2.53), seven methods are used. These are pooled OLS and weighted pooled OLS, which ignore the serial correlation problem, and feasible versions of generalized least squares (FGLS) that consider two forms of serial correlation. The first form is a first-order autoregression, AR(1), and in the second form the random effects structure is estimated for the unconditional variance matrix of the error term $v_{it}$. The remaining three methods are the weighted FGLS estimators discussed in this paper. Besides AR(1) and random effects, we estimate an unrestricted variance matrix of the error term, $\hat\Omega = N^{-1}\sum_{i=1}^{N}\hat v_i\hat v_i'$, where $\hat v_i$ is the $4 \times 1$ vector of pooled OLS residuals. We call these three methods wFGLS_ar1, wFGLS_re and wFGLS_un, respectively. We hope to obtain an efficiency gain by using these latter estimators to estimate the total family income equation (2.53).

The estimation results are presented in Table B.19. Robust standard errors are listed in parentheses. For wFGLS_ar1, wFGLS_re and wFGLS_un, the standard errors are calculated using equations (2.51) and (2.52). In Table B.19, $\hat\lambda$ is a consistent estimate of $\lambda$, where

$$\lambda = 1 - \left\{1/\left[1 + T(\sigma_c^2/\sigma_u^2)\right]\right\}^{1/2} \tag{2.54}$$

and $\sigma_c^2$ and $\sigma_u^2$ are the variance of $c_i$ and the variance of $u_{it}$, respectively. If $\hat\lambda$ is close to unity, the random effects (RE) and fixed effects (FE) estimates tend to be close.
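As a small numerical illustration of (2.54) and of the unrestricted variance matrix $\hat\Omega = N^{-1}\sum_i \hat v_i\hat v_i'$, the following Python sketch may help. The function names and all input values are hypothetical and are not PSID estimates; the residual array stands in for the pooled OLS residuals.

```python
import math
import numpy as np

def re_lambda(sigma2_c, sigma2_u, T):
    """Quasi-demeaning parameter lambda from (2.54)."""
    return 1.0 - math.sqrt(1.0 / (1.0 + T * sigma2_c / sigma2_u))

def unrestricted_omega(V):
    """N^{-1} * sum_i v_i v_i' for an (N, T) array of residual vectors."""
    V = np.asarray(V, dtype=float)
    return V.T @ V / V.shape[0]

# Hypothetical variance components with T = 4 waves.
lam = re_lambda(sigma2_c=3.0, sigma2_u=1.0, T=4)

# Two hypothetical residual vectors with T = 2, only to show the computation.
Omega_hat = unrestricted_omega([[1.0, 0.0], [0.0, 2.0]])
print(lam)
print(Omega_hat)
```

When the variance of $c_i$ is zero the function returns $\lambda = 0$, which corresponds to pooled OLS, and $\lambda$ approaches one as $T\sigma_c^2/\sigma_u^2$ grows.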
We estimate just one variance matrix for all strata, as in the Monte Carlo study presented in the last section. The results show that almost all coefficients have the expected signs regardless of the estimation method. However, their magnitudes differ widely across methods in many cases. For example, the coefficient on edu_hs estimated by weighted pooled OLS is -31.082, while in absolute value the same coefficient drops to just -2.636 when the model is estimated by FGLS_re and rises to -9.913 under FGLS_ar1 (columns 2, 3 and 5 in Table B.19). These substantial changes in the size of most coefficients are mainly due to weighting. A simple comparison of the unweighted FGLS methods in columns 3 and 5 with their weighted counterparts in columns 4 and 6 of Table B.19 shows the substantial effect of weighting on the size of the coefficients. For instance, consider again the coefficient on edu_hs. It is about 10 times larger if the family income equation is estimated by wFGLS_re rather than FGLS_re. The same coefficient is almost 3 times larger if wFGLS_ar1 is used instead of FGLS_ar1. As another example, consider the coefficient on health. The size of this coefficient falls by almost 50% when the model is estimated by FGLS instead of wFGLS.

The large effect of weighting on the estimates should not be viewed as unusual. Since the PSID purposely oversamples low-income families and income is the dependent variable in our model, OLS applied to the stratified sample does not consistently estimate the parameters of the total family income equation, because the stratification is endogenous. The same is true of the unweighted FGLS estimators, FGLS_ar1 and FGLS_re.

The pooled OLS standard errors are smaller than the weighted pooled OLS ones, as expected. In chapter one we showed that, by ignoring stratification, pooled OLS tends to underestimate standard errors.
The standard errors are even smaller for the other two unweighted FGLS estimators, as expected. Therefore, despite the smaller standard errors of the unweighted estimators, the main competition is between weighted pooled OLS on the one hand and the weighted FGLS methods on the other, as reflected in columns 2, 4, 6 and 7 of Table B.19.

The main idea of this chapter is to increase efficiency in panel data models with stratified data by considering serial correlation within each panel. Under correct specification of the conditional mean, even a wrong working correlation matrix that captures key features of the conditional second moments might lead to a more efficient estimator. A comparison of the weighted pooled OLS and the weighted FGLS estimators in Table B.19 shows that the standard errors are indeed smaller for the latter estimators in almost all cases. The only exceptions are the coefficients on the father's and mother's education levels, fedu_hs and medu_hs. The reductions in standard errors are considerable. For example, the standard error on twealth falls by about 33% when wFGLS is used to estimate the family income equation. The standard errors of the remaining coefficients drop by between 4% (coefficients on age and nchild) and about 35% (coefficients on unemployed and unem.edu_hs). The three consistent estimators, wFGLS_ar1, wFGLS_re, and wFGLS_un, are very stable in estimating almost all coefficients, but the efficiency gain appears to be higher for wFGLS_re and wFGLS_un than for wFGLS_ar1.

2.8 Conclusion

This paper investigates efficiency in panel data models where the data set comes from stratified sampling schemes. We start from conditional moments in the population and then, building on the work of Chamberlain (1987) and Newey and McFadden (1994), propose a GMM estimator that takes into account the dependence structure within the panels. The result is an efficient GMM estimator that is computationally simple to implement.
By estimating a covariance matrix for each stratum, or even the same covariance matrix for all strata, we are able to improve efficiency. Monte Carlo simulation results show that the new estimators, which we call weighted and unweighted GLS (and FGLS), in general do better than ordinary least squares, or weighted and unweighted pooled OLS, which simply overlook the dependence in the data. Under endogenous stratification, weighted GLS is the most efficient estimator, and under exogenous stratification, dropping the weights and using unweighted GLS produces the best performance, as expected. Of course, the gains from the new estimators are smaller when the correlation structure in the panel is weaker. The Monte Carlo experiments show that the structure of the correlation matrix affects the efficiency gain.

The simulation results also suggest that as $T$ increases, the importance and effects of endogenous stratification are reduced. A plausible explanation is that, as $T$ increases, the weight of the first period diminishes, which brings the sample closer to simple random sampling. Another interesting finding is that as the degree of correlation $\rho$ increases, the inconsistency declines, which can be attributed to a decrease in the degree of freedom of movement of the observations.

We apply the method to estimate a simple linear model using PSID data. Although the PSID has a very complex structure, including multi-stage stratification, the new estimators, with a very simple form for the working variance matrix, decrease the standard errors of most coefficients in the model.

Chapter 3

MODEL SELECTION TESTS IN COMPLEX SAMPLES

3.1 Introduction

Using the Kullback-Leibler Information Criterion to measure the closeness of a model to the truth, Vuong (1989) developed a classical approach to model selection.
He proposes simple likelihood ratio based statistics for testing the null hypothesis that the competing models are equally close to the true data generating process against the alternative hypothesis that one model is closer. In his approach, both, one, or neither of the two competing models may be misspecified. He assumes that observations are independent and identically distributed (i.i.d.). All of his tests are based on the likelihood ratio principle, and he derives the asymptotic distribution of the likelihood ratio statistic for nested, overlapping, and nonnested models.

While Vuong's tests rest on the i.i.d. assumption, most large surveys, such as the Current Population Survey (CPS), the Panel Study of Income Dynamics (PSID), and the National Survey of Families and Households (NSFH), use stratified and clustered samples, so simple random sampling, and therefore i.i.d. observations, is not the right assumption. In other words, a non-random sampling scheme such as Standard Stratified (SS) sampling or Variable Probability (VP) sampling, or a complex survey design like that of the CPS, does not produce a set of independent, identically distributed random variables. The i.i.d. assumption is therefore one of the limitations of Vuong's model selection tests in complex samples. This assumption is restrictive for time series data too. Rivers and Vuong (2002), along with Findley (1990, 1991) and Findley and Wei (1993), relax it for time series cases such as ARMA models and some dynamic regression models.

Vuong's model selection tests also cannot be used to discriminate between two econometric models defined by moment conditions or, more generally, between two competing models that are incompletely specified. This second limitation arises because the tests are based on the likelihood function.
These tests require that the competing models belong to some parametric family of distributions, and therefore they must be completely parameterized. While maximum likelihood is a widespread method of estimation in econometric studies, researchers use other common estimation methods for various reasons, including least absolute deviations, nonlinear least squares, the generalized method of moments (GMM), and other extremum estimators. This is the third limitation of Vuong's tests.

This paper contributes to the subject by extending Vuong's model selection tests to competing models estimated from stratified multistage cluster samples. Many data sets used in microeconometric research are collected through surveys such as the CPS or PSID, which have a complex multi-stage sampling structure and violate the i.i.d. assumption needed for Vuong's tests. In order to generalize Vuong's results beyond MLE, we study the problem in the M-estimator framework. Many econometric estimators are M-estimators, including but not limited to linear and nonlinear regression and conditional maximum likelihood, which covers discrete response models.

The paper is organized as follows. Section 3.2 defines two nonnested competing models. Section 3.3 presents the basic framework under standard stratified sampling. We start with standard stratified sampling because it is widely used in practice to divide the population of interest into subpopulations or strata, and because it gives us a base from which to extend the results to more complex sampling designs. Section 3.4 introduces test statistics under SS sampling, VP sampling, and multi-stage sampling, and shows that the test statistics are asymptotically normal. Section 3.5 extends the model selection test to panel data models with a standard stratification design. An interesting question is whether we need to weight the test statistics when stratification is exogenous.
We discuss this point in Section 3.6. Section 3.7 applies the tests in two empirical examples. Section 3.8 summarizes the results and concludes.

3.2 The Nonnested Competing Models

Consider the population minimization problem

$$\min_{\theta \in \Theta} E[q(W, \theta)] \tag{3.1}$$

where the scalar $q(\cdot)$ denotes an objective function depending on $W$ and $\theta$, and $W$ is an $M \times 1$ random vector taking values in $\mathcal{W} \subset \mathbb{R}^M$. The data generating process depends on $\theta$, a $P \times 1$ parameter vector belonging to the parameter space $\Theta \subset \mathbb{R}^P$. We assume that problem (3.1) has a unique minimizer on $\Theta$, denoted $\theta^\circ$ and called the true parameter value that generates the data. In many applications, the vector $W$ is partitioned as $W = (X, Y)$, where $X$ and $Y$ are $K$- and $L$-dimensional vectors, respectively, with $L + K = M$. We are often interested in some aspect of the conditional distribution of $Y$ given $X$, such as $E(Y|X)$.

Now, as in Vuong (1989), consider two competing objective functions $q_1(W, \theta_1)$ and $q_2(W, \theta_2)$. These two functions are nonnested in the sense that neither can be represented as a special case of the other. It is important to have a clear idea of what nonnested means. Vuong considers two sets of conditional models, $F_\theta = \{f(y|x, \theta); \theta \in \Theta\}$ and $G_\gamma = \{g(y|x, \gamma); \gamma \in \Gamma\}$, and defines the two models as nonnested if and only if

$$F_\theta \cap G_\gamma = \emptyset \tag{3.2}$$

This definition is best suited to MLE, where we have full distributional assumptions about the endogenous variables given the exogenous variables for the two competing models. For more general cases, following Wooldridge (2010), we use the definition

$$P[q_1(W, \theta_1^*) \neq q_2(W, \theta_2^*)] > 0 \tag{3.3}$$

This means that the two functions $q_1(\cdot, \theta_1^*)$ and $q_2(\cdot, \theta_2^*)$, evaluated at the pseudo-true values $\theta_1^*$ and $\theta_2^*$, must differ on a nontrivial set of outcomes of $W_i$ if they are nonnested.
By this definition, nested models are ruled out, as are other forms of degeneracy.

As a first example, suppose we have a random variable $Y$ and would like to model $E(Y|X)$ as a function of the explanatory variables $X$, a $K \times 1$ vector. We specify two competing models: a linear one, $q_{i1}(\theta_1) = (Y_i - X_i\theta_1)^2$, and a nonlinear one, $q_{i2}(\theta_2) = (Y_i - \exp(X_i\theta_2))^2$. These models are nonnested if the mean of $Y_i$ given $X_i$ depends on the nonconstant elements of $X_i$. If instead the mean function does not depend on $X_i$, that is, $E(Y_i|X_i) = E(Y_i)$, then both models reduce to the same constant mean. In this case the two models are nested and the limiting standard normal distribution of the Vuong-type statistic breaks down.

3.3 Basic Framework under Standard Stratified Samples

The population problem is $\min_{\theta \in \Theta} E[q(W, \theta)]$, and we assume $\theta^\circ$ uniquely solves it. Let $q_1(W, \theta_1^*)$ and $q_2(W, \theta_2^*)$ be the two competing models, both of which may be misspecified. The null hypothesis is

$$H_0 : E[q_{i1}(W_i, \theta_1^*)] = E[q_{i2}(W_i, \theta_2^*)] \tag{3.4}$$

Depending on the method used to estimate the two competing models, the alternative hypothesis is

$$H_A^{q_1} : E[q_{i1}(W_i, \theta_1^*)] > E[q_{i2}(W_i, \theta_2^*)] \tag{3.5}$$

or

$$H_A^{q_2} : E[q_{i1}(W_i, \theta_1^*)] < E[q_{i2}(W_i, \theta_2^*)] \tag{3.6}$$

For example, if the competing estimators are QMLEs, then the alternative $H_A^{q_1}$ means that $q_1(\cdot)$ is better than $q_2(\cdot)$ because its likelihood value is larger.

To test the null (3.4) against alternative (3.5) or (3.6) in the context of complex samples, suppose the estimators $\hat\theta_1$ and $\hat\theta_2$ solve the sample minimization problem under a complex design that involves stratification and clustering. In this section we first consider the sample objective function under standard stratified sampling and then turn to other sampling designs.
In standard stratified sampling, the population of interest is divided into $J$ nonempty, mutually exclusive, and exhaustive strata, and a random sample of size $N_j$ is drawn from stratum $j$, where $j = 1, \dots, J$. For each $j$ we then have a random sample $\{W_{ij} : i = 1, 2, \dots, N_j\}$; see Wooldridge (2001). The sample objective function is therefore

$$\sum_{j=1}^{J} Q_j \left[ \frac{1}{N_j} \sum_{i=1}^{N_j} q(W_{ij}, \theta) \right] \tag{3.7}$$

Equation (3.7) can be rewritten as

$$\frac{1}{N} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q(W_{ij}, \theta) \tag{3.8}$$

where $Q_j$ is the population frequency of stratum $j$, that is, the probability that a randomly drawn observation from the population falls into stratum $j$, and $H_j \equiv N_j / N$ is the fraction of observations in stratum $j$. As (3.8) shows, in standard stratified sampling observation $i$ in stratum $j$ is weighted by $Q_j / H_j$.

We also assume that $\hat\theta_1$ and $\hat\theta_2$ converge to $\theta_1^*$ and $\theta_2^*$, respectively. These are referred to as pseudo-true values; they are not necessarily equal to the true value $\theta^\circ$, so both models may be misspecified.

To construct a Vuong-type test we need the following lemma, which shows that if $\hat\theta_g$ is $\sqrt{N}$-consistent for $\theta_g^*$, $g = 1, 2$, we can find a test statistic whose asymptotic distribution is not affected by the two estimators $\hat\theta_1$ and $\hat\theta_2$.

Lemma 3.3.1. If $\hat\theta_1$ and $\hat\theta_2$ are $\sqrt{N}$-consistent estimators of $\theta_1^*$ and $\theta_2^*$, then

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g) = \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^*) + o_p(1) \tag{3.9}$$

for $g = 1, 2$.

Proof. Assuming that $q_g(\cdot)$ is differentiable with respect to $\theta_g$, a Taylor expansion of $\sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g)$ around $\theta_g^*$, divided on both sides by $N_j$, gives

$$\frac{1}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g) \approx \frac{1}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^*) + \frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*)(\hat\theta_g - \theta_g^*) \tag{3.10}$$

Multiplying by $Q_j$ and summing over $j$, (3.10) becomes

$$\sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g) \approx \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^*) + \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*)(\hat\theta_g - \theta_g^*) \tag{3.11}$$

Finally, multiplying both sides by $\sqrt{N}$ and using $Q_j / N_j = Q_j / (N H_j)$, we have

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g) \approx \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^*) + \left[ \sum_{j=1}^{J} \frac{Q_j}{N_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*) \right] \cdot \sqrt{N}(\hat\theta_g - \theta_g^*) \tag{3.12}$$

For the second term on the right hand side of (3.12),

$$\operatorname{plim} \sum_{j=1}^{J} Q_j \left[ \frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*) \right] = \operatorname{plim} \frac{1}{N} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*) = E[\nabla_\theta q_g(W, \theta_g^*)] = 0 \tag{3.13}$$

See Wooldridge (2010). Therefore

$$\sum_{j=1}^{J} Q_j \left[ \frac{1}{N_j} \sum_{i=1}^{N_j} \nabla_\theta q_g(W_{ij}, \theta_g^*) \right] = o_p(1) \tag{3.14}$$

and since by assumption $\hat\theta_g$ is $\sqrt{N}$-consistent, $\sqrt{N}(\hat\theta_g - \theta_g^*) = O_p(1)$. The product in the second term of (3.12) is therefore $o_p(1)$, and (3.12) can be written as

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \hat\theta_g) = \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} q_g(W_{ij}, \theta_g^*) + o_p(1) \tag{3.15}$$

This completes the proof.

Note that the right hand side of equation (3.9) in Lemma 3.3.1 is a function of the random vectors $W_{ij}$ only. We are now ready to set up test statistics similar to Vuong's tests, with an asymptotic normal distribution under the null hypothesis that the two nonnested competing models fit equally well.

3.4 Test Statistics

3.4.1 The Test Statistic under Standard Stratified Sampling

In this section we construct test statistics that allow us to discriminate between two competing models. Let $q_{ij1}(W_{ij}, \theta_1) - q_{ij2}(W_{ij}, \theta_2) \equiv r_{ij}(W_{ij}, \theta_1, \theta_2)$.
Then by Lemma 3.3.1 we have

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) = \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \theta_1^*, \theta_2^*) + o_p(1) \tag{3.16}$$

The following theorem shows that, under some conditions, (3.16) has an asymptotic normal distribution.

Theorem 3.4.1. For $g \in \{1, 2\}$ assume that

1. $\{W_{ij} : i = 1, 2, \dots, N_j,\ j = 1, \dots, J\}$ follows the standard stratified sampling scheme.
2. $N_j \to \infty$ for each $j$.
3. $\Theta_g$ is a compact subset of $\mathbb{R}^P$.
4. The objective function $E[q_g(\cdot, \theta_g)]$ has a unique minimizer on $\Theta_g$ at $\theta_g^*$.
5. $\theta_g^*$ is an interior point of $\Theta_g$.
6. For each $w \in \mathcal{W}$, $q_g(w, \cdot)$ is continuous on $\Theta_g$.
7. $q_g(w, \cdot)$ is twice continuously differentiable on $\Theta_g$.
8. $E[\nabla_\theta q_g(W, \theta_g^*)\, q_g(W, \theta_g^*)] < \infty$ and $E[\nabla_\theta q_g(W, \theta_g^*)] = 0$.
9. For all $\theta_g$, $|\partial^2 q_g(w, \theta_g) / \partial\theta_{gk}\, \partial\theta_{gm}| \le b(w)$ for all $k$ and $m$, where $E[b(w)] < \infty$.

Then

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) - \sqrt{N} \cdot E[r(W, \theta_1^*, \theta_2^*)] \xrightarrow{d} N(0, \eta^2) \tag{3.17}$$

where

$$\eta^2 = \sum_{j=1}^{J} \frac{Q_j^2}{H_j} \operatorname{var}[r(W, \theta_1^*, \theta_2^*) \mid W \in \mathcal{W}_j] \tag{3.18}$$

Proof. The proof is essentially the same as that of Theorem 3.3 in Vuong (1989) and Theorems 3.1 and 3.2 in Wooldridge (2001).

The first assumption marks the departure from the i.i.d. assumption in Vuong's setup. For the asymptotic analysis, we need the second assumption to ensure that the number of observations $N_j$ in each stratum $j$ goes to infinity. The regularity assumptions 2 to 6 are similar to those of Vuong (1989), and we need assumptions 8 and 9 because we extend the likelihood function to the more general objective function $q(\cdot)$. These same regularity assumptions also ensure that $\hat\theta_g$ is consistent and asymptotically normal; see Wooldridge (2010). We now have the test statistic needed to choose between two competing models.
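As a minimal sketch of how the quantities in Theorem 3.4.1 might be computed in practice, the Python function below takes the per-observation objective values of the two fitted models, the stratum label of each observation, and the population shares $Q_j$, and returns the weighted difference from (3.17) under the null divided by the square root of the sample analogue of (3.18). The function name, the argument names, and the dictionary format for the population shares are illustrative assumptions and are not part of the original text.

```python
import math
from collections import defaultdict


def stratified_vuong_stat(q1, q2, strata, pop_shares):
    """Studentized weighted loss difference under standard stratified sampling.

    q1, q2     : per-observation objective values at the fitted parameters
    strata     : stratum label of each observation
    pop_shares : dict mapping stratum label -> population share Q_j
    All names are hypothetical; this is an illustration, not the author's code.
    """
    n = len(q1)
    groups = defaultdict(list)
    for a, b, s in zip(q1, q2, strata):
        groups[s].append(a - b)              # r_ij = q_ij1 - q_ij2

    numer = 0.0   # sum_j (Q_j / H_j) * sum_i r_ij, scaled by 1/sqrt(N) below
    eta2 = 0.0    # sum_j (Q_j^2 / H_j) * within-stratum variance of r_ij
    for s, r in groups.items():
        n_j = len(r)
        h_j = n_j / n                        # sample share H_j = N_j / N
        q_j = pop_shares[s]
        mean_j = sum(r) / n_j
        var_j = sum((x - mean_j) ** 2 for x in r) / n_j
        numer += (q_j / h_j) * sum(r)
        eta2 += (q_j ** 2 / h_j) * var_j
    numer /= math.sqrt(n)
    return numer / math.sqrt(eta2)
```

Under the null the returned value would be compared with standard normal critical values. How the sign is read depends on whether $q$ is a loss to be minimized or a log-likelihood to be maximized.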
The null hypothesis is

$$H_0 : E[q_{i1}(W_i, \theta_1^*)] = E[q_{i2}(W_i, \theta_2^*)] \tag{3.19}$$

against

$$H_A : E[q_{i1}(W_i, \theta_1^*)] > E[q_{i2}(W_i, \theta_2^*)] \tag{3.20}$$

Under (3.19), $E[r(W, \theta_1^*, \theta_2^*)] = 0$ and (3.17) can be written as

$$\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) \xrightarrow{d} N(0, \eta^2) \tag{3.21}$$

A consistent estimator of $\eta^2$ is

$$\hat\eta^2 \equiv \sum_{j=1}^{J} \frac{Q_j^2}{H_j} \frac{1}{N_j} \sum_{i=1}^{N_j} \left[ r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) - \bar r_j(W_{ij}, \hat\theta_1, \hat\theta_2) \right]^2 = \frac{1}{N} \sum_{j=1}^{J} \frac{Q_j^2}{H_j^2} \sum_{i=1}^{N_j} \left[ r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) - \bar r_j(W_{ij}, \hat\theta_1, \hat\theta_2) \right]^2 \tag{3.22}$$

Here $\bar r_j(W_{ij}, \hat\theta_1, \hat\theta_2) = \frac{1}{N_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2)$. The Vuong-type model selection statistic is therefore

$$\frac{\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2)}{\left\{ \sum_{j=1}^{J} \frac{Q_j^2}{H_j} \frac{1}{N_j} \sum_{i=1}^{N_j} \left[ r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) - \bar r_j(W_{ij}, \hat\theta_1, \hat\theta_2) \right]^2 \right\}^{1/2}} \xrightarrow{d} N(0, 1) \tag{3.23}$$

or

$$\frac{\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2)}{\left\{ \frac{1}{N} \sum_{j=1}^{J} \frac{Q_j^2}{H_j^2} \sum_{i=1}^{N_j} \left[ r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2) - \bar r_j(W_{ij}, \hat\theta_1, \hat\theta_2) \right]^2 \right\}^{1/2}} \xrightarrow{d} N(0, 1) \tag{3.24}$$

3.4.2 The Test Statistic under Variable Probability Sampling

Variable probability sampling is convenient when observations in the strata are difficult to identify prior to sampling, or when collecting information on the variable that determines stratification is cheap relative to the cost of collecting the remaining information. In variable probability (VP) sampling, an observation is first drawn at random from the population. If the observation falls into stratum $j$, it is kept with probability $p_j$. For example, if stratification is defined in terms of individual income, we might randomly draw a person from the population, determine his income class, and then keep him in the sample with a probability that depends on his income class and is set by the researcher. In variable probability samples, under the null hypothesis that the two competing models fit equally well, i.e.
(3.19), the test statistic is

$$\frac{\frac{1}{\sqrt{N}} \sum_{i=1}^{N} \sum_{j=1}^{J} p_j^{-1}\, r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2)}{\left\{ \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{J} p_j^{-2}\, r_{ij}(W_{ij}, \hat\theta_1, \hat\theta_2)^2 \right\}^{1/2}} \xrightarrow{d} N(0, 1) \tag{3.25}$$

which is very similar to the statistic obtained for standard stratified samples under the null in the previous section. We only need to replace the weights $Q_j / H_j$ in (3.24) with the inverse sampling probabilities $p_j^{-1}$. The sampling probabilities $p_1, p_2, \dots, p_J$ must all be strictly positive. The remaining assumptions needed for this result are the same as in Theorem 3.4.1. For more details see Wooldridge (1999).

3.4.3 Test Statistics under Multi-Stage Sampling

Clustering and stratification are the main features of survey data. For example, the National Survey of Families and Households (NSFH) is a complex survey sample with a multistage design that involves clustering, stratification, and variable probability sampling. Clusters are groups of families, households, or individuals located in close proximity to one another. For example, in a school, the students in each class form a cluster. In rural areas villages are clusters, and in urban areas neighborhoods are.

The sampling design considered here is closely related to that of Bhattacharya (2005). In the first stage, the population of interest is divided into $S$ subpopulations or strata, which are exhaustive and mutually exclusive. Within stratum $s$ there are $C_s$ clusters. In the next step, $N_s$ clusters are drawn at random. Because the asymptotic analysis is based on the number of clusters going to infinity, we assume that a large number of clusters is sampled in each stratum. Units (for example, households) within a cluster may be arbitrarily correlated. Each sampled cluster $c$ in stratum $s$ contains a finite population of $M_{sc}$ units. Finally, for each sampled cluster $c$ in stratum $s$, $K_{sc}$ households are sampled at random with replacement. The sample objective function is

$$\frac{1}{N} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, q_g(W_{scm}, \theta_g) \tag{3.26}$$

for $g = 1, 2$. Here $N = N_1 + N_2 + \dots + N_S$ is the total number of clusters sampled and

$$v_{sc} = \frac{C_s M_{sc}}{(N_s / N)\, K_{sc}}$$

is the weight associated with observations $m = 1, \dots, K_{sc}$ within cluster $c$ in stratum $s$. We assume that $N_s / N$ converges to $a_s$, where $a_s$ is fixed and $0 < a_s < 1$. Under this assumption the weights $v_{sc}$ can be treated as constant.

By the same reasoning as in Section 3.3, we can show that the asymptotic distribution of the following statistic is not affected by the estimators $\hat\theta_1$ and $\hat\theta_2$:

$$\frac{1}{\sqrt{N}} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc} \cdot r_{scm}(W_{scm}, \hat\theta_1, \hat\theta_2) \tag{3.27}$$

Here $r_{scm} = q_{scm1}(W_{scm}, \hat\theta_1) - q_{scm2}(W_{scm}, \hat\theta_2)$ is the difference between the two objective functions for unit $m$ in cluster $c$ in stratum $s$. We can also show that, under the null hypothesis that both competing models fit equally well, (3.27) has an asymptotic normal distribution:

$$\frac{1}{\sqrt{N}} \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc} \cdot r_{scm}(W_{scm}, \hat\theta_1, \hat\theta_2) \xrightarrow{d} N(0, \xi^2) \tag{3.28}$$

Because of correlation within clusters, the variance of (3.27), $\xi^2$, is more complicated than $\eta^2$ in Section 3.4.1. A consistent estimator of $\xi^2$ is

$$\hat\xi^2 = \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}^2\, r_{scm}^2(\hat\theta_1, \hat\theta_2) + \sum_{s=1}^{S} \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} \sum_{m' \neq m} v_{sc}^2\, r_{scm}(\hat\theta_1, \hat\theta_2)\, r_{scm'}(\hat\theta_1, \hat\theta_2) - \sum_{s=1}^{S} \frac{1}{N_s} \left[ \sum_{c=1}^{N_s} \sum_{m=1}^{K_{sc}} v_{sc}\, r_{scm}(\hat\theta_1, \hat\theta_2) \right]^2 \tag{3.29}$$

The first term in (3.29) is a correct estimate of the variance under simple random sampling. Under non-random sampling this is no longer true, and we need to add the other two terms, which estimate the clustering and stratification effects, respectively. In most cases, the correlation between units (for example, families) within a cluster is positive, and therefore the second term enters with a positive sign. On the other hand, stratification means that more homogeneous observations are sampled within each stratum, which decreases the variance, so the third term enters with a negative sign. Therefore, ignoring the clustering effect (the second term) underestimates the true variance, while overlooking the stratification effect overestimates it.
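The structure of (3.29) can be sketched in a few lines of Python. The first two terms of (3.29) together equal $\sum_s \sum_c T_{sc}^2$, where $T_{sc} = \sum_m v_{sc} r_{scm}$ is the weighted cluster total, so the estimator reduces to a within-stratum sum of squared deviations of cluster totals. The sketch below assumes that the numerator (3.27) and the variance estimate are put on the same scale, so that the common factor of $N$ cancels and unnormalized sums can be used. The function and argument names are hypothetical, and cluster labels are assumed to be unique within each stratum.

```python
import math
from collections import defaultdict


def multistage_vuong_stat(r, weights, strata, clusters):
    """Studentized weighted loss difference with clustering and stratification.

    r        : per-unit differences r_scm between the two objective functions
    weights  : per-unit weights v_sc (identical for units in the same cluster)
    strata   : stratum label of each unit
    clusters : cluster label of each unit (unique within a stratum)
    All names are hypothetical; this illustrates the structure of (3.29) only.
    """
    totals = defaultdict(float)              # T_sc = sum_m v_sc * r_scm
    for r_i, v_i, s, c in zip(r, weights, strata, clusters):
        totals[(s, c)] += v_i * r_i

    by_stratum = defaultdict(list)
    for (s, _c), t in totals.items():
        by_stratum[s].append(t)

    numer = sum(totals.values())
    xi2 = 0.0
    for t_list in by_stratum.values():
        n_s = len(t_list)                    # number of sampled clusters N_s
        # terms one and two of (3.29) collapse to sum_c T_sc^2;
        # term three subtracts (1 / N_s) * (sum_c T_sc)^2
        xi2 += sum(t * t for t in t_list) - sum(t_list) ** 2 / n_s
    return numer / math.sqrt(xi2)
```

Working with cluster totals rather than individual units is what lets arbitrary within-cluster correlation enter the variance estimate without being modeled.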
Extending Bhattacharya's (2005) model to more complex sampling designs, in chapter one we investigate a sampling design with variable probability sampling in the final stage. The framework resembles complex surveys like the NSFH and other routine phone surveys used in practice. In this case the appropriate variance estimator for choosing between two competing models is very similar to (3.29):

$$
\begin{aligned}
\hat\xi^2 ={}& N^{-1} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{is}^2\, p_j^{-2}\, y_{is}\, \tau_{jm}\, z_{jm}\, r_{isjm}^2(\hat\theta_1, \hat\theta_2) \\
&+ N^{-1} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{j=1}^{J} \sum_{j'=1}^{J} \sum_{m=1}^{K} \sum_{t \neq m}^{K} v_{is}^2\, p_j^{-1} p_{j'}^{-1}\, y_{is}\, \tau_{jm} \tau_{j't}\, z_{jm} z_{j't}\, r_{isjm}(\hat\theta_1, \hat\theta_2)\, r_{isj't}(\hat\theta_1, \hat\theta_2) \\
&- \sum_{s=1}^{S} \frac{1}{N} \left[ \sum_{i=1}^{N} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{is}\, p_j^{-1}\, y_{is}\, \tau_{jm}\, z_{jm}\, r_{isjm}(\hat\theta_1, \hat\theta_2) \right] \times \left[ \sum_{i=1}^{N} \sum_{j=1}^{J} \sum_{m=1}^{K} v_{is}\, p_j^{-1}\, y_{is}\, \tau_{jm}\, z_{jm}\, r_{isjm}(\hat\theta_1, \hat\theta_2) \right]
\end{aligned}
\tag{3.30}
$$

In (3.30), $v_{is}$ are weights exactly as in (3.29), corresponding to the first level of stratification, and $p_j$ are the weights corresponding to variable probability sampling. The indicator variable $\tau_{jm}$ takes the value one if observation $W$ falls in stratum $j$ at the second level of stratification (variable probability sampling) and zero otherwise. The indicator variable $z_{jm}$ also corresponds to the second level of stratification: it takes the value one if $W$ is kept in the sample and zero otherwise, so that $P(z_{jm} = 1) = p_j$. Finally, $y_{is}$ is an indicator variable equal to one if cluster $i$ is in stratum $s$.

3.5 Model Selection Tests in Panel Data Models

Model selection tests in panel data models with complex sampling designs are similar to the tests for cross section cases. When $D(y_{i1}, \dots, y_{iT} \mid x_{i1}, \dots, x_{iT})$ is fully specified, Vuong's approach is directly applicable using MLE. In less restrictive cases where we do not have a complete density, such as partial or pooled MLE or other M-estimators, we need to account for the time series dependence properly. Assume that, for each $t$, $q_{t1}(W_t, \theta_1)$ and $q_{t2}(W_t, \theta_2)$ are competing models of the conditional density in each time period.
Here the same null hypothesis, (3.19), still means that the models fit equally well, but only in the weakest sense. The convergence result in equation (3.17) still holds under the null. Under assumption (3.3), the models are nonnested and the variance $\eta^2$ is positive. In estimating $\eta^2$, the serial dependence in $\{q_{it1}(\theta_1^*) - q_{it2}(\theta_2^*)\}$ introduces an extra term that must be included in the calculation. Let $\hat r_{ijt} = q_{ijt1}(\hat\theta_1) - q_{ijt2}(\hat\theta_2)$ denote the difference in the estimated functions for each $t$ and stratum $j$, and let $\bar r_{jt} = N_j^{-1} \sum_{i=1}^{N_j} \hat r_{ijt}$. Then, under standard stratification, a consistent estimate of $\eta^2$ is

$$\hat\eta^2 = \frac{1}{N} \sum_{j=1}^{J} \frac{Q_j^2}{H_j^2} \sum_{i=1}^{N_j} 1_T' D_{ij} D_{ij}' 1_T \tag{3.31}$$

where $1_T$ is the $T \times 1$ vector of ones and $D_{ij}$ is the $T \times 1$ vector defined as

$$D_{ij} = \begin{pmatrix} \hat r_{ij1} - \bar r_{j1} \\ \hat r_{ij2} - \bar r_{j2} \\ \vdots \\ \hat r_{ijT} - \bar r_{jT} \end{pmatrix} \tag{3.32}$$

The model selection test statistic in a panel data model with a standard stratification design is therefore

$$\frac{\frac{1}{\sqrt{N}} \sum_{j=1}^{J} \frac{Q_j}{H_j} \sum_{i=1}^{N_j} \sum_{t=1}^{T} \hat r_{ijt}}{\left\{ \frac{1}{N} \sum_{j=1}^{J} \frac{Q_j^2}{H_j^2} \sum_{i=1}^{N_j} 1_T' U_{ij} 1_T \right\}^{1/2}} \tag{3.33}$$

Here $U_{ij}$ is an upper triangular matrix obtained from $D_{ij} D_{ij}'$ by setting the entries below its diagonal to zero.¹ The test statistic (3.33) has a standard normal distribution asymptotically. Note that in the variance estimator (3.31) the mean difference $\bar r_{jt}$ varies across $t$ and $j$ but is the same across $i$. If we replace hypothesis (3.19) with the stronger one, $E[q_{it1}(\theta_1^*)] = E[q_{it2}(\theta_2^*)]$ for $t = 1, \dots, T$, then we can replace $\bar r_{jt}$ with the average of $\hat r_{ijt}$ across $i$ and $t$, $\bar r_j$. The mean difference $\bar r_j$ then depends only on the stratum.

3.6 Test Statistics and Exogenous Stratification

It is known that when the population of interest is divided into subpopulations or strata on the basis of exogenous variables, unweighted estimators are consistent and even more efficient than weighted ones, so dropping the weights does not cause any real problems. However, model selection tests are a different matter.
¹ Since $D_{ij} D_{ij}'$ is a symmetric matrix, $U_{ij}$ could instead be a lower triangular matrix.

Usually we are interested in cases where a model for some feature of the distribution of $Y$ given $X$ is correctly specified. In a correctly specified model, $\theta^\circ$ solves

$$\min_{\theta \in \Theta} E[q(W, \theta) \mid X = x] \tag{3.34}$$

for all $x \in \mathcal{X}$. For example, suppose we apply nonlinear least squares to a correctly specified parametric model of $E(Y|X)$, so that $W = (Y, X)$. The objective function is

$$q(W, \theta) = [Y - m(X, \theta)]^2 / 2 \tag{3.35}$$

and $\theta^\circ$ is the true parameter vector such that

$$E(Y \mid X = x) = m(x, \theta^\circ) \tag{3.36}$$

for all $x$. Then $\theta^\circ$ solves $\min_{\theta \in \Theta} E[q(W, \theta) \mid x] = E\{[Y - m(X, \theta)]^2 / 2 \mid x\}$ for all $x$, which means that $\theta^\circ$ minimizes $E[q(W, \theta) \mid x \in \mathcal{X}_j]$ for each $j$. However, when the underlying model is misspecified in the sense that $\theta^\circ$, the solution to (3.1), does not solve (3.34) for each $x$, the unweighted estimator is not consistent for $\theta^\circ$, while the weighted estimator is.

In model selection tests, where the goal is to choose between two nonnested competing models, the null hypothesis (3.19) can hold only if both models are misspecified. If one model were correctly specified, the equality in (3.19) would become a strict inequality in favor of the correctly specified model, assuming the objective functions are not the same. For example, suppose that in the example above there are two competing models for $E(Y|x)$, $m_1(x, \theta_1)$ and $m_2(x, \theta_2)$, both misspecified. In this case $\theta_g^*$, $g = 1, 2$, does not solve

$$\min_{\theta_g \in \Theta_g} E[q_g(W, \theta_g) \mid X = x] \tag{3.37}$$

for all $x \in \mathcal{X}$, and therefore the unweighted estimator is inconsistent for $\theta_g^*$. The weighted estimator, on the other hand, is consistent for $\theta_g^*$. Since the model selection test requires consistent estimators of $\theta_g^*$ for $g = 1, 2$, we need to weight the observations appropriately even under exogenous stratification.
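The argument of this section can be illustrated with a small deterministic example that is not part of the original study. Assume a hypothetical population in which $x$ takes the values 0, 1, and 2 with equal frequency and $E(Y|x) = x^2$, and fit the misspecified model $m(x, \theta) = \theta x$ by least squares. Stratification is on $x$ alone, hence exogenous, and the stratum with $x = 2$ is oversampled. The data, the stratum definitions, and the helper name `wls_slope` are all illustrative assumptions.

```python
def wls_slope(x, y, w):
    """Weighted least squares slope for the no-intercept model y = theta * x."""
    num = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    den = sum(wi * xi * xi for xi, wi in zip(x, w))
    return num / den


# Hypothetical population: x in {0, 1, 2}, equally likely, with y = x^2.
# The pseudo-true value minimizes E[(y - theta * x)^2] over the population.
theta_pop = wls_slope([0.0, 1.0, 2.0], [0.0, 1.0, 4.0], [1.0, 1.0, 1.0])

# Stratified sample: stratum A = {x <= 1}, stratum B = {x = 2},
# two observations from each, so stratum B is oversampled.
x = [0.0, 1.0, 2.0, 2.0]
y = [xi ** 2 for xi in x]
strata = ["A", "A", "B", "B"]
Q = {"A": 2 / 3, "B": 1 / 3}     # population shares
H = {"A": 0.5, "B": 0.5}         # sample shares
weights = [Q[s] / H[s] for s in strata]

theta_unw = wls_slope(x, y, [1.0] * len(x))   # ignores the design
theta_w = wls_slope(x, y, weights)            # uses weights Q_j / H_j
```

In this sketch the weighted slope reproduces the population pseudo-true value of 1.8, while the unweighted slope equals 17/9, even though stratification depends only on $x$. The sample here stands in for a large-sample limit; it is not a simulation of sampling variability.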
3.7 Empirical Examples

For illustration, the data set nhanes2 provided by Stata is used to contrast two competing models. The nhanes2 data come from a complex sampling scheme that includes clustering and stratification. We are interested in modeling the risk of heart attack as a function of variables such as age, sex, race, weight, and height.² The dependent variable, heartatk, is binary: it equals one if the individual has experienced a heart attack and zero otherwise. The two competing models are a probit model and a Bernoulli model with a complementary log-log link, estimated by GLM. Ignoring the sampling design, the quasi-log-likelihoods evaluated at the relevant estimates are -555.028 for the probit model and -556.665 for the Bernoulli model, obtained from 4238 observations. The statistic $\hat\eta^2$ turns out to be 6.433, and therefore the unweighted Vuong test statistic is 0.645 (the log-likelihood difference of 1.637 divided by $\sqrt{6.433}$). If instead we account for the sampling scheme, the quasi-log-likelihoods are -559.393 and -561.889, respectively, and our weighted estimate of $\eta^2$ is 11.599. The weighted Vuong statistic in (3.24) is therefore 0.733. Both tests favor the probit model, and although the weighted Vuong statistic is larger, the difference between the two models is not statistically significant using a standard normal test at the 5% level.

As a second example, consider the determinants of family income in the United States discussed in Chapter 2, Section 2.7, using the panel data set obtained from the PSID. We are interested in choosing between two competing models, wFGLS-ar1 (model 1) and wFGLS-re (model 2), where the first models the dependence within panels as AR(1) and the second uses a random effects structure.
The null hypothesis is

$$H_0 : E[U_1^2] = E[U_2^2] \tag{3.38}$$

against the alternative

$$H_A : E[U_1^2] > E[U_2^2] \tag{3.39}$$

The proper test statistic in this case is (3.33), and its value is about 0.93. Although this favors the second model (wFGLS-re), we cannot reject the null hypothesis in favor of the alternative at the 5% significance level.

² The complete set of covariates considered in this example is: houssize, age, agesq, sex, height, weight, iron, diabets, sizeplace, vitaminc, zinc, copper, female, black, race, orace, region1, region2, region3, rural, highbp, highlead, and healthstat. For more information about the data set see http://www.stata-press.com/data/r10/svy.html.

3.8 Conclusion

In many applied econometric studies, researchers are forced to choose between competing models that seem to fit the data equally well. Model selection tests are suitable tools for identifying the "better" model or models. However, Vuong's (1989) model selection tests are not readily applicable when the data come from a complex sampling design. In this paper, Vuong-type tests are proposed for cases where the data are not a set of i.i.d. observations because of stratification and clustering. The results show that the test statistics are asymptotically normal and have to be weighted. An interesting finding is that even under exogenous stratification we cannot drop the weights, since for nonnested models the null hypothesis implies that both competing models are misspecified. The tests are applicable to panel data models with complex sampling designs, but we need to account for time series dependence properly. One advantage of the model selection tests is that they are easy to compute in empirical studies.

APPENDICES

Appendix A

PROOFS

In this first appendix we show that the conditional variance of the sample objective function is (2.17).

Proof. The starting point is the definition of variance.
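$$
\begin{aligned}
\operatorname{var}\left[ \sum_{j=1}^{J} 1[S = j] \frac{Q_j}{H_j}\, r(V, \theta^\circ) \,\Big|\, V_2 \right]
&= E\left[ \sum_{j=1}^{J} 1[S = j] \frac{Q_j^2}{H_j^2}\, r(V, \theta^\circ)\, r(V, \theta^\circ)' \,\Big|\, V_2 \right] \\
&= \sum_{j=1}^{J} \int_{v \in \mathcal{W}} 1[S = j] \frac{Q_j^2}{H_j^2}\, r(v, \theta^\circ)\, r(v, \theta^\circ)'\, g(s, v \mid v_2)\, dv
\end{aligned}
\tag{A.1}
$$

Since the strata do not overlap, we can replace $1[S = j]$ with $1[v \in \mathcal{W}_j]$. Therefore

$$
\begin{aligned}
\operatorname{var}\left[ \sum_{j=1}^{J} 1[S = j] \frac{Q_j}{H_j}\, r(V, \theta^\circ) \,\Big|\, V_2 \right]
&= \sum_{j=1}^{J} \int_{v \in \mathcal{W}} 1[v \in \mathcal{W}_j] \frac{Q_j^2}{H_j^2}\, r(v, \theta^\circ)\, r(v, \theta^\circ)'\, g(s, v \mid v_2)\, dv \\
&= \sum_{j=1}^{J} \int_{v \in \mathcal{W}} 1[v \in \mathcal{W}_j] \frac{Q_j^2}{H_j^2}\, r(v, \theta^\circ)\, r(v, \theta^\circ)'\, \frac{\frac{H_j}{Q_j} f(v \mid v_2, \theta)}{\sum_{j=1}^{J} \frac{H_j}{Q_j} R(j, v_2, \theta)}\, dv \\
&= \frac{1}{\sum_{j=1}^{J} \frac{H_j}{Q_j} R(j, v_2, \theta)} \sum_{j=1}^{J} \frac{Q_j}{H_j} \int_{v \in \mathcal{W}_j} r(v, \theta^\circ)\, r(v, \theta^\circ)'\, f(v \mid v_2, \theta)\, dv \\
&= \eta \sum_{j=1}^{J} \frac{Q_j}{H_j} \int_{v \in \mathcal{W}_j} r(v, \theta^\circ)\, r(v, \theta^\circ)'\, f(v \mid v_2, \theta)\, dv \\
&= \eta \cdot \sum_{j=1}^{J} \frac{Q_j}{H_j}\, E\left[ r(V, \theta^\circ)\, r(V, \theta^\circ)' \mid V_2, S = j \right]
\end{aligned}
\tag{A.2}
$$

Dropping the constant term $\eta$ in (A.2), we obtain equation (2.17). This completes the proof.

To show that $R^\circ(V_2)$, the conditional expectation of the gradient in the sample, is the same as the gradient of the objective function in the population, we start from the definition:

$$
\begin{aligned}
R^\circ(V_2) &= \sum_{j=1}^{J} E\left[ 1[S = j] \frac{Q_j}{H_j} \nabla_\theta r(V, \theta^\circ) \,\Big|\, V_2 \right] \\
&= \sum_{j=1}^{J} \int_{v \in \mathcal{W}} 1[s = j] \frac{Q_j}{H_j} \nabla_\theta r(v, \theta^\circ)\, g(s, v \mid v_2)\, dv \\
&= \frac{1}{\sum_{j=1}^{J} \frac{H_j}{Q_j} R(j, v_2, \theta)}\, E\left[ \nabla_\theta r(V, \theta^\circ) \mid V_2 \right] \\
&= \eta \cdot E\left[ \nabla_\theta r(V, \theta^\circ) \mid V_2 \right]
\end{aligned}
\tag{A.3}
$$

Appendix B

TABLES¹

Table B.1: Exogenous Stratification with ρ = 0.0, 1000 replications

T = 2                       POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
Average $\hat\beta_1$       .9999     .9991     .9999     .9991     .9999     .9991
  (std. dev.)               (.0353)   (.0431)   (.0353)   (.0431)   (.0354)   (.0432)
$s_{\hat\beta_1}$           .0355     .0443     .0353     .0443     .0353     .0443
Average $\hat\beta_\circ$   .0014     .0008     .0014     .0008     .0014     .0008
  (std. dev.)               (.0426)   (.0471)   (.0426)   (.0471)   (.0426)   (.0471)
$s_{\hat\beta_\circ}$       .0422     .0473     .0420     .0473     .0420     .0473

$\hat\rho$ = .0016 (.0610) in feasible cases.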
Table B.2: Exogenous Stratification with ρ = 0.1, 1000 replications

T = 2                       POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
Average $\hat\beta_1$       .9991     1.0006    .9991     1.0004    .9989     1.0002
  (std. dev.)               (.0354)   (.0446)   (.0352)   (.0444)   (.0354)   (.0445)
$s_{\hat\beta_1}$           .0355     .0446     .0351     .0443     .0350     .0443
Average $\hat\beta_\circ$   -.0006    .0011     -.0005    .0011     -.0005    .0011
  (std. dev.)               (.0449)   (.0492)   (.0449)   (.0492)   (.0449)   (.0492)
$s_{\hat\beta_\circ}$       .0422     .0495     .0440     .0495     .0440     .0495

$\hat\rho$ = .1021 (.0597) in feasible cases.

¹ In Tables B.1 to B.16 presented in this appendix, the rows labeled "Average" report the average values of the estimated $\beta_\circ$ and $\beta_1$ obtained from 1000 simulated samples, and the values in parentheses are their standard deviations. The rows labeled $s_{\hat\beta_1}$ and $s_{\hat\beta_\circ}$ report the average values of the estimated standard deviations of the estimators, calculated by the formula discussed in chapter 2.

Table B.3: Exogenous Stratification with ρ = 0.5, 1000 replications

T = 2                       POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
Average $\hat\beta_1$       .9998     .9991     .9997     .9993     .9997     .9993
  (std. dev.)               (.0349)   (.0431)   (.0304)   (.0377)   (.0305)   (.0377)
$s_{\hat\beta_1}$           .0355     .0443     .0302     .0383     .0302     .0383
Average $\hat\beta_\circ$   .0014     .0007     .0014     .0007     .0014     .0007
  (std. dev.)               (.0513)   (.0575)   (.0513)   (.0575)   (.0513)   (.0575)
$s_{\hat\beta_\circ}$       .0422     .0578     .0509     .0578     .0509     .0578

$\hat\rho$ = .5008 (.0527) in feasible cases.

Table B.4: Exogenous Stratification with ρ = 0.9, 1000 replications

T = 2                       POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
Average $\hat\beta_1$       .9999     .9995     .9997     .9997     .9997     .9997
  (std. dev.)               (.0351)   (.0442)   (.0153)   (.0193)   (.0154)   (.0193)
$s_{\hat\beta_1}$           .0355     .0442     .0151     .0192     .0151     .0193
Average $\hat\beta_\circ$   .0010     .0003     .0010     .0003     .0010     .0003
  (std. dev.)               (.0569)   (.0641)   (.0565)   (.0641)   (.0565)   (.0641)
$s_{\hat\beta_\circ}$       .0421     .0650     .0568     .0650     .0568     .0650

$\hat\rho$ = .8997 (.0267) in feasible cases.
Table B.5: Exogenous Stratification with ρ = 0.0, 1000 replications

T = 5                       POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
Average $\hat\beta_1$       1.0001    1.0003    1.0001    1.0003    1.0002    1.0003
  (std. dev.)               (.0232)   (.0257)   (.0232)   (.0257)   (.0232)   (.0257)
$s_{\hat\beta_1}$           .0244     .0271     .0242     .0262     .0242     .0262
Average $\hat\beta_\circ$   .0010     .0016     .0010     .0016     .0009     .0016
  (std. dev.)               (.0263)   (.0277)   (.0257)   (.0277)   (.0257)   (.0277)
$s_{\hat\beta_\circ}$       .0263     .0284     .0262     .0277     .0262     .0277

$\hat\rho$ = -.0003 (.0303) in feasible cases.

Table B.6: Exogenous Stratification with ρ = 0.1, 1000 replications

T = 5                       POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
Average $\hat\beta_1$       1.0001    1.0003    1.0002    1.0003    1.0002    1.0003
  (std. dev.)               (.0232)   (.0257)   (.0230)   (.0256)   (.0231)   (.0256)
$s_{\hat\beta_1}$           .0244     .0271     .0240     .0260     .0240     .0260
Average $\hat\beta_\circ$   .0010     .0017     .0010     .0017     .0010     .0017
  (std. dev.)               (.0278)   (.0300)   (.0278)   (.0300)   (.0278)   (.0300)
$s_{\hat\beta_\circ}$       .0263     .0307     .0283     .0300     .0283     .0300

$\hat\rho$ = .0995 (.0301) in feasible cases.

Table B.7: Exogenous Stratification with ρ = 0.5, 1000 replications

T = 5                       POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
Average $\hat\beta_1$       1.0000    1.0001    1.0002    1.0000    1.0004    1.0003
  (std. dev.)               (.0235)   (.0260)   (.0193)   (.0213)   (.0195)   (.0211)
$s_{\hat\beta_1}$           .0244     .0271     .0197     .0213     .0197     .0212
Average $\hat\beta_\circ$   .0014     .0023     .0011     .0019     -.0009    -.0002
  (std. dev.)               (.0382)   (.0411)   (.0377)   (.0407)   (.0371)   (.0403)
$s_{\hat\beta_\circ}$       .0263     .0423     .0383     .0406     .0382     .0405

$\hat\rho$ = .4995 (.0258) in feasible cases.

Table B.8: Exogenous Stratification with ρ = 0.9, 1000 replications

T = 5                       POLS      wPOLS     uwGLS     wGLS      uwFGLS    wFGLS
Average $\hat\beta_1$       .9999     .9999     1.0000    .9999     1.0000    .9999
  (std. dev.)               (.0234)   (.0263)   (.0089)   (.0097)   (.0089)   (.0097)
$s_{\hat\beta_1}$           .0244     .0271     .0088     .0095     .0088     .0095
Average $\hat\beta_\circ$   .0008     .0015     .0004     .0010     .0004     .0010
  (std. dev.)               (.0532)   (.0577)   (.0524)   (.0569)   (.0524)   (.0569)
$s_{\hat\beta_\circ}$       .0263     .0586     .0531     .0564     .0531     .0564

$\hat\rho$ = .8991 (.0126) in feasible cases.
Table B.9: Endogenous Stratification with ρ = 0.0, 1000 replications, T = 2

Average   POLS     wPOLS    uwGLS    wGLS     uwFGLS   wFGLS
β̂1       1.0710   .9996    1.0710   .9996    1.0709   .9997
          (.0381)  (.0412)  (.0381)  (.0412)  (.0381)  (.0411)
s_β̂1     .0413    .0417    .0376    .0417    .0375    .0416
β̂◦       .1119    .0005    .1119    .0005    .1119    .0005
          (.0373)  (.0394)  (.0373)  (.0394)  (.0373)  (.0394)
s_β̂◦     .0430    .0394    .0375    .0394    .0375    .0394

ρ̂ = −.0012 (.0562) in feasible cases.

Table B.10: Endogenous Stratification with ρ = 0.1, 1000 replications, T = 2

Average   POLS     wPOLS    uwGLS    wGLS     uwFGLS   wFGLS
β̂1       1.0695   .9997    1.0701   .9996    1.0701   .9997
          (.0382)  (.0412)  (.0379)  (.0411)  (.0379)  (.0410)
s_β̂1     .0412    .0417    .0374    .0415    .0373    .0414
β̂◦       .1242    .0005    .1241    .0005    .1241    .0005
          (.0386)  (.0407)  (.0386)  (.0407)  (.0386)  (.0407)
s_β̂◦     .0429    .0409    .0388    .0409    .0388    .0409

ρ̂ = .0989 (.0559) in feasible cases.

Table B.11: Endogenous Stratification with ρ = 0.5, 1000 replications, T = 2

Average   POLS     wPOLS    uwGLS    wGLS     uwFGLS   wFGLS
β̂1       1.0643   1.0001   1.0527   .9992    1.0529   .9993
          (.0389)  (.0414)  (.0334)  (.0365)  (.0335)  (.0365)
s_β̂1     .0413    .0414    .0329    .0364    .0329    .0363
β̂◦       .1732    .0008    .1746    .0007    .1746    .0007
          (.0428)  (.0448)  (.0426)  (.0448)  (.0426)  (.0448)
s_β̂◦     .0431    .0453    .0428    .0453    .0428    .0453

ρ̂ = .4996 (.0487) in feasible cases.

Table B.12: Endogenous Stratification with ρ = 0.9, 1000 replications, T = 2

Average   POLS     wPOLS    uwGLS    wGLS     uwFGLS   wFGLS
β̂1       1.0591   1.0007   1.0133   .9994    1.0133   .9994
          (.0404)  (.0419)  (.0174)  (.0193)  (.0175)  (.0193)
s_β̂1     .0419    .0408    .0170    .0186    .0170    .0186
β̂◦       .2223    .0010    .2278    .0008    .2278    .0008
          (.0459)  (.0478)  (.0451)  (.0478)  (.0451)  (.0478)
s_β̂◦     .0437    .0482    .0449    .0482    .0449    .0482

ρ̂ = .9006 (.0251) in feasible cases.
Table B.13: Endogenous Stratification with ρ = 0.0, 1000 replications, T = 5

Average   POLS     wPOLS    uwGLS    wGLS     uwFGLS   wFGLS
β̂1       1.0324   1.0001   1.0324   1.0001   1.0324   1.0001
          (.0240)  (.0261)  (.0240)  (.0261)  (.0240)  (.0261)
s_β̂1     .0260    .0277    .0240    .0269    .0248    .0269
β̂◦       .0469    -.0001   .0469    -.0001   .0469    -.0001
          (.0245)  (.0263)  (.0245)  (.0263)  (.0246)  (.0263)
s_β̂◦     .0265    .0272    .0249    .0266    .0249    .0266

ρ̂ = −.0003 (.0280) in feasible cases.

Table B.14: Endogenous Stratification with ρ = 0.1, 1000 replications, T = 5

Average   POLS     wPOLS    uwGLS    wGLS     uwFGLS   wFGLS
β̂1       1.0321   1.0000   1.0317   1.0001   1.0317   1.0001
          (.0241)  (.0263)  (.0240)  (.0260)  (.0240)  (.0260)
s_β̂1     .0260    .0277    .0246    .0267    .0246    .0267
β̂◦       .0523    -.0001   .0550    -.0001   .0550    -.0001
          (.0265)  (.0248)  (.0263)  (.0282)  (.0264)  (.0282)
s_β̂◦     .0265    .0294    .0268    .0285    .0268    .0285

ρ̂ = .0996 (.0279) in feasible cases.

Table B.15: Endogenous Stratification with ρ = 0.5, 1000 replications, T = 5

Average   POLS     wPOLS    uwGLS    wGLS     uwFGLS   wFGLS
β̂1       1.0300   .9997    1.0203   .9998    1.0203   .9998
          (.0241)  (.0264)  (.0202)  (.0217)  (.0202)  (.0217)
s_β̂1     .0261    .0276    .0202    .0219    .0202    .0218
β̂◦       .0921    -.0002   .1019    -.0005   .1019    -.0004
          (.0357)  (.0383)  (.0341)  (.0367)  (.0341)  (.0367)
s_β̂◦     .0266    .0394    .0347    .0370    .0347    .0370

ρ̂ = .4991 (.0241) in feasible cases.

Table B.16: Endogenous Stratification with ρ = 0.9, 1000 replications, T = 5

Average   POLS     wPOLS    uwGLS    wGLS     uwFGLS   wFGLS
β̂1       1.0252   .9997    1.0036   .9997    1.0036   .9997
          (.0245)  (.0259)  (.0093)  (.0099)  (.0093)  (.0099)
s_β̂1     .0267    .0272    .0091    .0098    .0091    .0098
β̂◦       .1955    -.0011   .1980    -.0013   .1980    -.0013
          (.0442)  (.0476)  (.0426)  (.0460)  (.0426)  (.0460)
s_β̂◦     .0271    .0486    .0432    .0463    .0432    .0463

ρ̂ = .8993 (.0121) in feasible cases.
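The footnote to Tables B.1 to B.16 describes three summaries per coefficient: the average estimate across simulated samples, the standard deviation of those estimates, and the average of the estimated standard errors. A minimal Python sketch of that bookkeeping follows. It is illustrative only and rests on assumptions that are not in the dissertation: a simple random sample, a bivariate linear model, and the conventional homoskedastic OLS standard error rather than the stratified panel design or the variance formula of chapter 2. The names ols_slope and summarize are hypothetical.

```python
import random
import statistics


def ols_slope(x, y):
    """Return (slope estimate, conventional OLS standard error) for y = b0 + b1*x + u."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares, then the homoskedastic variance of the slope.
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = (rss / (n - 2) / sxx) ** 0.5
    return b1, se_b1


def summarize(replications=200, n=500, beta0=0.0, beta1=1.0, seed=12345):
    """Simulate repeated samples and report the three table-style summaries."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    estimates, std_errors = [], []
    for _ in range(replications):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [beta0 + beta1 * xi + rng.gauss(0.0, 1.0) for xi in x]
        b1, se = ols_slope(x, y)
        estimates.append(b1)
        std_errors.append(se)
    return {
        "average": statistics.mean(estimates),   # analogue of the row of averages
        "mc_sd": statistics.stdev(estimates),    # analogue of the values in parentheses
        "avg_se": statistics.mean(std_errors),   # analogue of the s rows
    }


summary = summarize()
print(summary)
```

When the standard-error formula suits the design, mc_sd and avg_se should be close to each other; that is the comparison the tables invite between the parenthesized values and the s rows.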
Table B.17: Variables Descriptions

age          the actual age of Head
aychild6     1 if age of youngest person in the family is 6 or less
black        1 if Head is black
female       1 if Head is female
fsize        the actual number of persons in the family
fweight3     2003 core/immigrant family weight
edu_hs       1 if the highest level of Head’s education is completed high school
fedu_hs      1 if the highest level of Head’s father’s education is completed high school
medu_hs      1 if the highest level of Head’s mother’s education is completed high school
health       1 if health condition of Head is good, very good or excellent
married      1 if Head is married
nchild       the actual number of persons currently in the family under 18 years of age
unemployed   1 if Head is unemployed
tfinc        total family money income last year
twealth      sum of values of seven asset types, net of debt value plus value of home equity

Table B.18: Summary Statistics

             (1)       (2)       (3)       (4)       (5)
VARIABLES    N         mean      sd        min       max
age          15,672    47.45     14.97     16        99
aychild6     15,672    0.197     0.398     0         1
black        15,672    0.282     0.450     0         1
female       15,672    0.250     0.433     0         1
fsize        15,672    2.638     1.414     1         10
fweight3     15,672    23.38     16.74     0         114.3
edu_hs       15,672    0.464     0.499     0         1
fedu_hs      15,672    0.692     0.462     0         1
medu_hs      15,672    0.702     0.457     0         1
health       15,672    0.854     0.353     0         1
married      15,672    0.566     0.496     0         1
nchild       15,672    0.787     1.128     0         8
unemployed   15,672    0.0501    0.218     0         1
tfinc        15,672    74.69     111.4     -99.26    6,317
twealth      15,672    310.0     1,201     -2,700    50,475

Table B.19: Determinants of Family Income in the U.S.

              (1)         (2)         (3)         (4)         (5)         (6)         (7)
VARIABLES     POLS        wPOLS       FGLS_re     wFGLS_re    FGLS_ar1    wFGLS_ar1   wFGLS_un
twealth       0.036**     0.033**     0.011**     0.021**     0.013**     0.022**     0.019**
              (0.006)     (0.008)     (0.004)     (0.006)     (0.004)     (0.006)     (0.006)
edu_hs        -19.194**   -31.082**   -2.639      -25.548**   -9.913*     -27.669**   -26.751**
              (3.471)     (5.807)     (4.494)     (5.083)     (4.352)     (5.196)     (5.153)
age.edu_hs    -0.061      0.150       -0.501**    -0.005      -0.341**    0.036       0.008
              (0.081)     (0.119)     (0.100)     (0.101)     (0.096)     (0.103)     (0.102)
age           10.774**    12.950**    7.383**     11.821**    8.311**     11.980**    11.832**
              (0.748)     (1.286)     (0.932)     (1.233)     (0.901)     (1.240)     (1.234)
age2          -0.179**    -0.222**    -0.097**    -0.193**    -0.119**    -0.197**    -0.193**
              (0.016)     (0.026)     (0.017)     (0.024)     (0.017)     (0.024)     (0.024)
age3/1000     1.000**     1.000**     0.340**     1.000**     0.480**     1.000**     1.000**
              (0.097)     (0.156)     (0.100)     (0.137)     (0.100)     (0.139)     (0.137)
health        12.772**    13.861**    6.457**     10.803**    5.527**     10.619**    9.880**
              (1.219)     (2.039)     (1.068)     (1.509)     (1.058)     (1.503)     (1.421)
married       25.914**    26.845**    24.413**    29.093**    25.608**    29.349**    29.960**
              (1.788)     (3.022)     (2.168)     (2.870)     (2.129)     (2.865)     (2.861)
fsize         10.017**    12.830**    6.747**     10.396**    5.910**     9.983**     9.484**
              (1.019)     (1.530)     (1.010)     (1.241)     (1.049)     (1.280)     (1.178)
aychild6      -3.365      -5.642      -1.089      -3.201      -2.880      -4.425      -3.923
              (2.227)     (4.793)     (1.913)     (2.774)     (2.100)     (3.395)     (3.156)
unemployed    -22.596**   -27.979**   -7.613**    -16.387**   -6.692**    -15.962**   -14.189**
              (3.427)     (6.441)     (2.669)     (4.880)     (2.419)     (4.863)     (4.755)
unem.edu_hs   18.206**    24.091**    4.781       13.763*     4.742       14.353**    12.432*
              (3.786)     (7.295)     (3.042)     (5.552)     (2.737)     (5.517)     (5.338)
fedu_hs       -12.540**   -12.616**   -14.012**   -13.203**   -13.829**   -12.917**   -12.931**
              (1.793)     (3.344)     (3.157)     (3.503)     (3.204)     (3.474)     (3.498)
medu_hs       -7.334**    -8.722*     -9.418**    -9.820**    -9.070**    -9.663**    -9.671**
              (1.872)     (3.478)     (3.181)     (3.544)     (3.372)     (3.584)     (3.610)
black         -11.494**   -9.478**    -16.158**   -11.550**   -16.635**   -11.657**   -12.137**
              (0.985)     (2.063)     (1.538)     (1.879)     (1.460)     (1.898)     (1.912)
female        -7.389**    -6.304*     -12.720**   -8.815**    -12.680**   -8.666**    -9.010**
              (1.287)     (2.872)     (1.727)     (2.418)     (1.698)     (2.432)     (2.356)
nchild        -9.119**    -10.604**   -7.618**    -9.830**    -5.289**    -7.964**    -7.774**
              (1.263)     (1.991)     (1.172)     (1.645)     (1.464)     (2.056)     (1.917)
year05        2.015       2.537       2.994**     3.007*      2.913**     2.965*      3.058*
              (1.903)     (1.446)     (1.049)     (1.315)     (1.085)     (1.328)     (1.321)
year07        5.409**     6.344**     7.656**     7.615**     7.488**     7.520**     7.773**
              (2.031)     (1.910)     (1.190)     (1.597)     (1.302)     (1.631)     (1.618)
year09        11.822**    12.816**    12.772**    13.240**    12.721**    13.155**    13.280**
              (2.387)     (1.930)     (1.616)     (1.836)     (1.614)     (1.831)     (1.826)
Constant      -146.552**  -185.597**  -86.376**   -163.858**  -96.402**   -165.917**  -162.316**
              (10.626)    (18.300)    (15.406)    (18.229)    (14.094)    (18.201)    (18.251)

Observations: 15,672 in each column. Number of fid: 3,918.
R-squared: 0.286 (POLS), 0.282 (wPOLS).
ρ̂ / λ̂: 0.63 (FGLS_re), 0.40 (wFGLS_re), 0.69 (FGLS_ar1), 0.45 (wFGLS_ar1).
Robust standard errors in parentheses. ** p<0.01, * p<0.05

BIBLIOGRAPHY

[1] Bhattacharya, D. (2005): “Asymptotic Inference from Multi-Stage Samples” Journal of Econometrics, 126, 145-171.
[2] Cameron, A.C., Trivedi, P.K. (2005): “Microeconometrics: Methods and Applications” Cambridge University Press, New York, NY.
[3] Chamberlain, G. (1987): “Asymptotic Efficiency in Estimation with Conditional Moment Restrictions” Journal of Econometrics, 34, 305-334.
[4] Cosslett, S.R. (1981a): “Efficient Estimation of Discrete Choice Models” In: Manski, C.F., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA.
[5] Cosslett, S.R. (1981b): “Maximum Likelihood Estimators for Choice-Based Samples” Econometrica, 49, 1289-1316.
[6] Cosslett, S.R. (1993): “Estimation from Endogenously Stratified Samples” In: Maddala, G.S., Rao, C.R., Vinod, H.D. (Eds.), Handbook of Statistics, vol. 11, 1-43.
[7] Findley, D. F. (1990): “Making Difficult Model Comparisons” mimeo, U.S. Bureau of the Census.
[8] Findley, D. F.
(1991): “Convergence of finite multistep predictors from incorrect models and its role in model selection” Note di Matematica XI, 145-55.
[9] Findley, D. F., Wei, C.Z. (1993): “Moment bound for deriving time series CLT’s and model selection procedures” Statistica Sinica 3, 453-80.
[10] Hardin, J.W., Hilbe, J.M. (2003): “Generalized Estimating Equations” Chapman & Hall/CRC.
[11] Johnson, D.R., Elliott, L.A. (1998): “Sampling Design Effects: Do They Affect the Analyses of Data from the National Survey of Families and Households?” Journal of Marriage and Family, 60, 993-1001.
[12] Heeringa, S.G., Berglund, P.A., Khan, A. (2011): “Construction and Evaluation of the 2009 Longitudinal Individual and Family Weights” Panel Study of Income Dynamics Technical Report. Survey Research Center, University of Michigan, Ann Arbor.
[13] Heeringa, S.G., Berglund, P.A., Khan, A., Lee, S., Gouskova, E. (2011): “PSID Cross-sectional Individual Weights, 1997-2009” Panel Study of Income Dynamics Technical Report. Survey Research Center, University of Michigan, Ann Arbor.
[14] Hausman, J.A., Wise, D.A. (1981): “Stratification on an endogenous variable and estimation: The Gary income maintenance experiment” In: Manski, C.F., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA, 365-391.
[15] Imbens, G. W. (1992): “An Efficient Method of Moments Estimator for Discrete Choice Models with Choice-Based Sampling” Econometrica, 60, 1187-1214.
[16] Imbens, G. W., Lancaster, T. (1996): “Efficient Estimation and Stratified Sampling” Journal of Econometrics, 74, 289-318.
[17] Manski, C.F., Lerman, S. (1977): “The Estimation of Choice Probabilities from Choice-Based Samples” Econometrica, 45, 1977-1988.
[18] Manski, C.F., McFadden, D. (1981): “Alternative Estimators and Sample Designs for Discrete Choice Analysis” In: Manski, C.F., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA, 2-50.
[19] Newey, W.K., McFadden, D. (1994): “Large Sample Estimation and Hypothesis Testing” In: Engle, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. IV, Amsterdam: North Holland, 2111-2245.
[20] Panel Study of Income Dynamics, public use dataset. Produced and distributed by the Institute for Social Research, Survey Research Center, University of Michigan, Ann Arbor, MI (2012).
[21] Rivers, D., Vuong, Q. (2002): “Model Selection Tests for Nonlinear Dynamic Models” The Econometrics Journal, 5, 1-39.
[22] Tripathi, G. (2011): “Moment-Based Inference with Stratified Data” Econometric Theory, 27, 47-73.
[23] Vuong, Q. (1989): “Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses” Econometrica, 57, 307-333.
[24] Wooldridge, J.M. (1999): “Asymptotic Properties of Weighted M-Estimators for Variable Probability Samples” Econometrica, 67 (6), 1385-1406.
[25] Wooldridge, J.M. (2001): “Asymptotic Properties of Weighted M-Estimators for Standard Stratified Samples” Econometric Theory, 17, 451-470.
[26] Wooldridge, J.M. (2008): “Cluster and stratified sampling” Imbens / Wooldridge BEA/FTC Lectures, Lecture notes 7 & 8.
[27] Wooldridge, J.M. (2010): “Econometric Analysis of Cross-Section and Panel Data” (2nd ed.) MIT Press, Cambridge, MA.