This is to certify that the dissertation entitled

Three Essays on Generalized Method of Moments

presented by Artem B. Prokhorov has been accepted towards fulfillment of the requirements for the Ph.D. degree in Economics.

Major Professor's Signature
Date

THREE ESSAYS ON GENERALIZED METHOD OF MOMENTS

By

Artem B. Prokhorov

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Economics

2006

ABSTRACT

THREE ESSAYS ON GENERALIZED METHOD OF MOMENTS

By Artem B. Prokhorov

Generalized Method of Moments (GMM) is a powerful estimation method based on orthogonality conditions known to hold in the population of interest. GMM is sufficiently general to incorporate most of the extremum and minimum distance estimators in econometrics, including (Q)MLE, M-estimators, and weighted and nonlinear LS. By taking advantage of GMM's universality, my thesis seeks to contribute to three areas of (micro)econometric research: modelling processes with missing observations (e.g., attrition and self-selection in panel data sets, counterfactual outcomes for treatment and control groups), modelling likelihood using copulas (e.g., PROBIT, LOGIT, selectivity models), and modelling covariance structures (e.g., LISREL, fixed effects, factor analysis).
The first essay, "GMM Redundancy Results for General Missing Data Problem," considers alternative GMM estimators of a parameter vector that enters into one set of moment equations along with another vector that also enters into an additional set of moment conditions and may be known. Alternative estimators are ranked in terms of relative efficiency, and conditions for no efficiency gains are derived. The results are applied to a general missing data problem. Conditions for the counterintuitive result of the missing data literature that estimating selection probabilities is better than knowing them arise naturally in the general problem. Efficiency gains from using both weighted and unweighted moment equations under exogenous sampling are considered.

The second essay, "Robustness, Redundancy, and Validity of Copulas in Likelihood Models," considers likelihood-based estimation of multivariate models in which only marginal distributions are correctly specified. The unknown joint distribution is modelled with a copula function, which may be misspecified. In a GMM framework, we study robustness and efficiency of the resulting estimators, propose improvements to existing estimators, and discuss tests of copula validity. It is shown that radially symmetric copulas are robust against misspecification in problems about sample means if the true joint density is also radially symmetric. Efficiency results suggest that knowledge of the true copula is redundant if and only if the covariance matrix for the relevant moment conditions is singular. A simple simulation supports the theoretical result about robustness of the Frank, Farlie-Gumbel-Morgenstern and Ali-Mikhail-Haq copula families.

The third essay, "Modelling Covariance Structures: First and Second Order Asymptotics," considers estimation of covariance structure models by quasi maximum likelihood (QMLE), generalized method of moments (GMM) and empirical likelihood (EL).
A general condition is derived under which the GMM (and EL) estimators do not dominate the normal QMLE in terms of first-order efficiency. The condition is formulated in terms of the fourth order moments of the true distribution. The second-order asymptotic bias of the QMLE is derived, and a formal proof is presented of the intuitive result that, under normality, this bias is the same as that of EL.

To the memory of my father

ACKNOWLEDGEMENTS

This dissertation raises more questions than it answers. But the questions it raises and answers would not have been answered (perhaps not even raised) if it had not been for the interaction with my Doktorvater, University Distinguished Professor Peter Schmidt. Professor Schmidt's wisdom and creativity, his ability to put things in perspective and prioritize, his talent for succinct statements and lively examples, his patience, approachability and close interaction with his students, his generous support of their endeavors and his understanding of their concerns — all this makes him a great mentor and this dissertation worth reading. I have found immensely enriching both my TA-ing for Professor Schmidt and attending the fabulous TA appreciation dinners Professors Peter Schmidt and Christine Amsler sponsor each semester for their TAs.

It is impossible to overestimate the support I have received from the other members of my dissertation committee. Professor Jeffrey Wooldridge has been prompt and thorough in reading drafted parts of the dissertation as they appeared and providing feedback. He also generously supported my travel to Australia to present the results of the first essay. Professor Richard Baillie gave me valuable insights into the world of financial econometrics during my RA-ship for him. Professor Hira Koul has helped add statistical rigor to the dissertation and to my graduate training in econometrics.
MSU graduate travel grants enabled me to attend the following meetings, where parts of this dissertation were presented: the 2006 AEA/ASSA meetings in Boston, the 33rd Annual Australian Conference of Economists in Sydney (October 2004), the 5th Villa Mondragone Workshop on Economic Theory and Econometrics in Rome (July 2005) and the 2004 Empirical Research Summer School on Experimental Economics and Econometrics in Mannheim. The final stages of the research were supported by a Dissertation Completion Fellowship from the Graduate School of MSU.

Emma Iglesias of MSU, Ivana Komunjer of UCSD and Rustam Ibragimov of Harvard provided helpful discussions of some of the results. So did the participants of the above-mentioned conferences and of the econometrics seminars at Michigan State, Bates White LLC, Concordia, Florida State, New South Wales, Massey, Emory University and Central Michigan University. Finally and most importantly, Irina Agafonova is the person who made this all worthwhile. I am very grateful to these people.

Table of Contents

List of Tables
List of Figures

1 GMM Redundancy Results for General Missing Data Problem
  1.1 Introduction
  1.2 Efficiency and redundancy results for the general estimation problem
    1.2.1 Preliminaries
    1.2.2 The general estimation problem
    1.2.3 Efficiency and redundancy results
  1.3 Application to missing data problem
    1.3.1 The population problem
    1.3.2 Motivation and definitions
    1.3.3 Relative efficiency results under ignorable selection
    1.3.4 Relative efficiency results under exogenous selection
  1.4 Concluding remarks
  Bibliography
  Appendix
2 Robustness, Redundancy, and Validity of Copulas in Likelihood Models
  2.1 Introduction
  2.2 Preliminaries
  2.3 The GMM representation
  2.4 Robustness of copula terms
    2.4.1 A theoretical result
    2.4.2 An illustrative simulation
  2.5 Redundancy of copula terms
    2.5.1 Redundancy with correct copula
    2.5.2 Redundancy with misspecified copula
    2.5.3 Examples
  2.6 Validity of copula terms
    2.6.1 Theoretical results
  2.7 Concluding remarks
  Bibliography
  Appendix A
  Appendix B
  Appendix C

3 Modelling Covariance Structures: First and Second Order Asymptotics
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 Setup and assumptions
    3.2.2 An example
    3.2.3 Estimators
      3.2.3.1 Normal (Q)MLE
      3.2.3.2 GMM
      3.2.3.3 EL
  3.3 First order analysis
    3.3.1 The first order conditions
    3.3.2 Relative efficiency to the first order
  3.4 Second order analysis
    3.4.1 Stochastic expansions to the second order
    3.4.2 Second order bias of QMLE
    3.4.3 Comparison to GMM and EL
  3.5 Concluding remarks
  Bibliography
  Appendix

List of Tables

2.1 The true values of Kendall's τ and ρ used in the simulation
2.2 Relative robustness measures for selected copulas, their standard errors, and the estimated Pearson correlation coefficient for three sample sizes

List of Figures

2.1 δ̄(ρ) for no-parameter copulas: (a) Independence copula; (b) Logistic copula.
2.2 δ̄(μ, ρ) and δ̄_ρ(μ, ρ) for one-parameter copulas: (1) Farlie-Gumbel-Morgenstern.
2.3 δ̄(μ, ρ) and δ̄_ρ(μ, ρ) for one-parameter copulas: (2) Joe.
2.4 δ̄(μ, ρ) and δ̄_ρ(μ, ρ) for one-parameter copulas: (3) Ali-Mikhail-Haq.
2.5 δ̄(μ, ρ) and δ̄_ρ(μ, ρ) for one-parameter copulas: (4) Clayton.
2.6 δ̄(μ, ρ) and δ̄_ρ(μ, ρ) for one-parameter copulas: (5) Gumbel.
2.7 δ̄(μ, ρ) and δ̄_ρ(μ, ρ) for one-parameter copulas: (6) Normal.
2.8 δ̄(μ, ρ) and δ̄_ρ(μ, ρ) for one-parameter copulas: (7) Frank.

Essay 1

GMM Redundancy Results for General Missing Data Problem

1.1 Introduction

There are many models that can be formulated as two sets of moment conditions with two parameter vectors, one of which enters only one of these sets and the other enters both. For example, Newey (1984) shows that multi-step estimators that employ estimates of an additional parameter vector in estimation of the primary parameter vector of interest can be represented in such a generalized method of moments (GMM) framework with exact identification of the parameters. Generated regressors models of Pagan (1984), latent variable models of Zellner (1970) and Goldberger (1972) and many others are two-step cases of this formulation. However, the primary focus of and the motivation for this essay are the missing data (or selectivity) models. Selectivity models deal with samples in which some observations are omitted (we call such samples "selected").
The missing data problem arises when using selected samples in an estimation procedure results in a biased estimator. For example, if we were to conduct a survey of young mothers to study the effect of a mother's smoking on the weight of the newborn, the survey would typically have missing data due to non-response. It is likely that non-response is associated with heavy smoking and poor birth weight. If the missing data were ignored, the effect of smoking would be underestimated. In such cases it is common to construct a probabilistic model for the missing data generating process (we call this model a "selection model") and then to appropriately adjust the primary model of interest for the effect of selection into the sample.

This paper is motivated by a puzzle in the selectivity literature. Consider the setting of a GMM problem in which we have a set of moment conditions, with some parameters θ₁ (the "parameters of interest"), and these moment conditions hold in the unselected sample. However, we also have a selection mechanism such that the moment conditions do not hold in the selected sample. Under certain assumptions given below (typically referred to as "ignorability" or "selection on observables"), weighting the original moment conditions by the inverse of the probability of selection yields a modified set of moment conditions that do hold in the selected sample. We will follow Wooldridge (2002b, 2005) in calling the estimator based on these weighted moment conditions the "inverse probability weighting" (IPW) estimator.

Unless the probability of selection is known for each selected observation, implementation of the IPW estimator will require a model that permits the estimation of the probability of selection. Let θ₂ be the parameters (the "selection parameters") in the moment conditions derived from this model. Typically these moment conditions will be based on the score function from the likelihood function for the selection process.
A two-step IPW procedure can be considered, in which the first step is the estimation of θ₂ from the selection model, and the second step is the estimation of θ₁ by GMM on the weighted moment conditions, where the weighting is done using the estimated probabilities of selection. In this setting, the puzzle is that it is better to estimate the selection probabilities than to use the true selection probabilities, even if the latter are known. In other words, in terms of the augmented model described above, we get a better estimator of θ₁ when we use the estimated θ₂ in the second step than if we used the true θ₂. This phenomenon has been discussed by Wooldridge (1999, 2001, 2002b, 2005), and it has also been noted in a number of previous works, including Rosenbaum (1987); Imbens (1992); Robins and Rotnitzky (1995); Crepon et al. (1997), and Hirano et al. (2003). This is puzzling because knowledge of θ₂, if properly exploited, cannot be harmful.

To resolve this puzzle, we follow Newey and McFadden (1994) in setting up an augmented set of moment conditions, where the first subset are the weighted original moment conditions, which now contain both θ₁ and θ₂, and the second subset are the moment conditions from the selection model, which contain only θ₂. We show that the second set of moment conditions is useful (non-redundant), even when θ₂ is known. This is true because the second set of moment conditions is correlated with the first set in the selected sample (even though it is not in the full sample). So the inefficiency of the estimator based on known θ₂ and the first set of moment conditions only is due to its failure to exploit the information in the second set of moment conditions; whereas, when θ₂ is not known, there is no choice but to include the second set of moment conditions.
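The two-step IPW procedure just described can be sketched numerically. The sketch below is only an illustration, not the estimator of the essay: it estimates a population mean E[Y] under ignorable selection, fitting a logit selection model by Newton-Raphson in the first step and solving the inverse-probability-weighted moment condition in the second. The variable names and the data-generating process are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20_000

# Population model: E[Y] = 1; selection depends on Z, which is correlated
# with Y, so the observed-sample mean is biased (selection on observables).
z = rng.standard_normal(N)
y = 1.0 + z + rng.standard_normal(N)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + z)))        # P(S=1|z), a logit model
s = (rng.uniform(size=N) < p_true).astype(float)

# Step 1: estimate the selection parameters theta2 = (a, b) by logit MLE,
# i.e. Newton-Raphson on the score of the Bernoulli likelihood.
X = np.column_stack([np.ones(N), z])
theta2 = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ theta2))
    score = X.T @ (s - p)
    hess = -(X * (p * (1 - p))[:, None]).T @ X
    theta2 -= np.linalg.solve(hess, score)
p_hat = 1.0 / (1.0 + np.exp(-X @ theta2))

# Step 2: solve the weighted moment condition E[(S / p(Z)) (Y - mu)] = 0
# using only the selected observations.
mu_ipw = np.sum(s * y / p_hat) / np.sum(s / p_hat)
mu_naive = y[s == 1].mean()       # ignores selection: biased upward here

print(mu_naive, mu_ipw)
```

With this design the naive selected-sample mean overshoots E[Y] = 1 by roughly E[Z | S = 1] > 0, while the IPW estimate recovers the population mean.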
This raises the question of whether, when θ₂ is known, we can improve on the two-step estimator (which uses estimated θ₂ in the second step) by using a GMM estimator based on both sets of moment conditions, but where only θ₁ is estimated. After all, this GMM estimator cannot be worse than the two-step estimator of θ₁. The answer to this question is a bit complicated. In the case that the original GMM problem (the one that contains the parameter of interest) is overidentified, the two-step estimator is dominated by a one-step estimator that estimates θ₁ and θ₂ jointly in the augmented GMM model. However, we show that, in the augmented GMM model, knowledge of θ₂ is redundant (does not improve the precision of estimation of θ₁). So, while it can never hurt to know more, if that knowledge is used properly, in this case it does not help either.

The result just quoted is given in Section 1.3 of the paper. In Section 1.2, we set the stage by giving a number of results on efficiency and redundancy of estimation in a general GMM setting, when one set of moment conditions depends on θ₁ and θ₂, while a second set of moment conditions depends only on θ₂. Some of these results are original and interesting in their own right. We consider "m-redundancy", which is redundancy of moment conditions in the sense of Breusch et al. (1999), and we also consider "p-redundancy", which is redundancy of the knowledge of some of the parameters for estimation of the other parameters. One of our results gives an interesting connection between these two concepts: the first set of moment conditions with θ₁ known is m-redundant for estimation of θ₂ if and only if knowledge of θ₂ is p-redundant for estimation of θ₁. This is in fact the key result in establishing our subsequent results for the selectivity model.
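The puzzle itself is easy to reproduce in a small Monte Carlo experiment. The design below is invented for illustration (it is not from the dissertation): for a simple mean with logit selection, the IPW estimator that uses estimated selection probabilities has smaller sampling variance than the one that plugs in the true probabilities.

```python
import numpy as np

rng = np.random.default_rng(2)

def one_rep(n=2000):
    # Design: E[Y] = 1; selection probability P(S=1|z) is a logit in z.
    z = rng.standard_normal(n)
    y = 1.0 + z + rng.standard_normal(n)
    p_true = 1.0 / (1.0 + np.exp(-(0.5 + z)))
    s = (rng.uniform(size=n) < p_true).astype(float)

    # First step: logit MLE for the selection probabilities (Newton-Raphson).
    X = np.column_stack([np.ones(n), z])
    b = np.zeros(2)
    for _ in range(25):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (s - p))
    p_hat = 1.0 / (1.0 + np.exp(-X @ b))

    # IPW estimates of E[Y]: estimated vs. true selection probabilities.
    mu_est = np.sum(s * y / p_hat) / np.sum(s / p_hat)
    mu_known = np.sum(s * y / p_true) / np.sum(s / p_true)
    return mu_est, mu_known

draws = np.array([one_rep() for _ in range(400)])
var_est, var_known = draws.var(axis=0)
print(var_est, var_known)   # estimating the probabilities gives the smaller variance
```

Both estimators are centered at E[Y] = 1, but the one using estimated probabilities is noticeably less variable, which is exactly the phenomenon the augmented-moment analysis below explains.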
In Section 1.3 we also consider the selectivity model under a stronger "exogeneity of selection" assumption, under which both the unweighted moment conditions and the weighted moment conditions hold in the selected population. Wooldridge (2001) has shown that in this circumstance it is better to use the unweighted moment conditions than the weighted moment conditions. However, this does not rule out the possibility that it would be better to use both. We show that in this circumstance the weighted moment conditions are m-redundant for estimation of θ₁, so that using both sets is no better than using just the unweighted moment conditions. Thus when we do not have to weight for reasons of consistency, we also do not have to weight for reasons of efficiency.

GMM is sufficiently general to accommodate most of the extremum and minimum distance estimators in econometrics (see, e.g., Newey and McFadden, 1994, p. 2118). The arguments we present can be applied, for example, to (Q)MLE, M-estimation, WLS, and NLS. They also extend to the asymptotic equivalents of GMM such as empirical likelihood and exponential tilting estimators. Our results relate to the treatment effect estimation literature (e.g., Rosenbaum and Rubin, 1983; Hirano et al., 2003; Heckman et al., 1998; Hahn, 1998), to the stratified-sampling literature (e.g., Manski and Lerman, 1977; Manski and McFadden, 1981; Cosslett, 1981a,b; Imbens, 1992; Tripathi, 2003) and to other similarly-structured problems (e.g., Hellerstein and Imbens, 1999; Nevo, 2002, 2003; Imbens, 1992; Crepon et al., 1997).

1.2 Efficiency and redundancy results for the general estimation problem

1.2.1 Preliminaries

Consider a family of distributions {P_θ, θ ∈ Θ = Θ₁ × Θ₂ ⊂ ℝ^{p₁} × ℝ^{p₂}, Θ compact}, a random vector W* ∈ 𝒲* ⊂ ℝ^{dim(W*)} from P_θ₀, θ₀ ∈ Θ, and a real valued, measurable function h : 𝒲* × Θ → ℝ^m such that

  E_θ₀[h(W*, θ)] = 0, if and only if θ = θ₀.   (1.1)

The expectation is with respect to the distribution of W* indexed by θ₀.
In the sequel we suppress the subscript. Let ‖·‖ denote the Euclidean norm, let N(θ, δ) ⊂ Θ denote an open (p₁+p₂)-ball of radius δ with center at θ, let ∇_θ h(·, θ) denote the m × (p₁+p₂) Jacobian of h(·, θ) with respect to θ, and let "w.p.1" stand for "with probability one".

Assumption 1.2.1 Assume that the moment function in (1.1) satisfies the following conditions: (ii) h(W*, θ) is continuous at each θ ∈ Θ, w.p.1; (iii) h(W*, θ) is (once) continuously differentiable on N(θ₀, δ), for some δ > 0, w.p.1; (iv) E{sup_{θ∈Θ} ‖h(W*, θ)‖²} < ∞; (v) E{sup_{θ∈N(θ₀,δ)} ‖∇_θ h(W*, θ)‖} < ∞, for some δ > 0; (vi) E[∇_θ h(W*, θ)|_{θ=θ₀}] is of full column rank.

For simplicity, we assume here that W*_i, i = 1, …, N, are i.i.d. draws from P_θ₀. The generalized method of moments (GMM) estimator of θ₀ is the solution to the following minimization problem

  min_{θ∈Θ} h̄(θ)′ W h̄(θ),   (1.2)

where

  h̄(θ) = (1/N) Σ_{i=1}^{N} h(W*_i, θ)

is the sample analogue of the population moment condition, which is zero at θ₀, and W is a positive semi-definite weighting matrix (see, e.g., Hansen, 1982). In the GMM framework, the choice of the weighting matrix may depend on θ₀. In such cases, a preliminary consistent estimate of θ₀ is used to construct an estimate of W used in the above definition of the GMM estimator. We will comment on this point again later.

Theorem 1.2.1 (see, e.g., Newey and McFadden, 1994, Theorems 2.6 and 3.4) Under Assumption 1.2.1, the GMM estimator of θ₀ is consistent and asymptotically normal (CAN).

Proofs: See the Appendix for proofs of all theorems and corollaries. □

1.2.2 The general estimation problem

Let θ = (θ₁′, θ₂′)′ and

  h(W*; θ) = ( h₁(W*; θ₁, θ₂)′, h₂(W*; θ₂)′ )′,

where θ₁ ∈ Θ₁, θ₂ ∈ Θ₂, and h₁(·) and h₂(·) are m₁- and m₂-vectors of known functions (m = m₁ + m₂). Then if we suppress W* we can write (1.1) as

  (A) E[h₁(θ₀₁, θ₀₂)] = 0,
  (B) E[h₂(θ₀₂)] = 0.   (1.3)

We consider the general case of overidentification, i.e., m₁ ≥ p₁ and m₂ ≥ p₂.
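The minimization in (1.2), with an identity weighting matrix in a first step and an estimated optimal weight in a second step, can be sketched as follows. This is a generic illustration under an assumed data-generating process (an overidentified problem: two parameters, the mean and variance of a normal sample, with three moment conditions); it is not an example from the dissertation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
w = rng.normal(loc=2.0, scale=1.5, size=5_000)   # i.i.d. draws W_i

def h(theta, w):
    """Moment functions h(W_i, theta), one row per observation.

    For theta = (mu, sigma2) and normal data: E[W - mu] = 0,
    E[(W - mu)^2 - sigma2] = 0, and E[(W - mu)^3] = 0 (symmetry).
    Three conditions, two parameters: overidentified."""
    mu, sigma2 = theta
    e = w - mu
    return np.column_stack([e, e**2 - sigma2, e**3])

def gmm_objective(theta, w, W_mat):
    hbar = h(theta, w).mean(axis=0)              # sample moment h-bar(theta)
    return hbar @ W_mat @ hbar

# Step 1: identity weighting gives a preliminary consistent estimate.
step1 = minimize(gmm_objective, x0=[0.0, 1.0], args=(w, np.eye(3)),
                 method="Nelder-Mead")

# Step 2: re-estimate with the estimated optimal weight, C-hat inverse,
# where C-hat is the sample covariance of the moments at the step-1 estimate.
C_hat = np.cov(h(step1.x, w), rowvar=False)
step2 = minimize(gmm_objective, x0=step1.x, args=(w, np.linalg.inv(C_hat)),
                 method="Nelder-Mead")
mu_hat, sigma2_hat = step2.x
print(mu_hat, sigma2_hat)
```

The second step is exactly the "preliminary consistent estimate of θ₀ used to construct an estimate of W" described above.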
The optimal weighting matrix for GMM will be the inverse of the following covariance matrix or its components:

  C = V[h(θ₀)] = [ C₁₁  C₁₂ ]
                 [ C₂₁  C₂₂ ] ,   (1.4)

where the variance is with respect to P_θ₀ as before. Note that C is a function of θ₀ and is generally unknown. In defining alternative GMM estimators and deriving their asymptotic variance matrices, we will behave as if we knew θ₀ and thus knew C. In practice, if we wish to use C in the weighting matrix of the GMM estimator, we would typically first obtain an estimate of C based on a preliminary consistent estimate of θ₀. Such a preliminary estimate of θ₀ can be the GMM estimator that uses the identity matrix for weighting. We assume that C is finite and nonsingular so its inverse exists. Let

  C⁻¹ = [ C¹¹  C¹² ]
        [ C²¹  C²² ] .

Define the (m₁+m₂) × (p₁+p₂) matrix of expected derivatives

  D = E[∂h(θ)/∂θ′]|_{θ=θ₀} = [ D₁₁  D₁₂ ]
                              [ 0    D₂₂ ] .   (1.5)

We assume that D₁₁ and D₂₂ are of full column rank so that h₂ alone identifies θ₂ and h₁ identifies θ₁ given θ₂. Similar to C, D depends on θ₀. In deriving the GMM asymptotic variance matrices, we will treat D as known. Consistent estimates of D (and C) can be obtained using consistent estimates of θ₀ in practice.

We now define four different GMM estimators that differ in which moment conditions are used and/or whether θ₂ is treated as known. For each of these estimators we treat C as known. We will comment on this point once again in the next subsection.

Definition 1.2.1 Call the estimator of θ that minimizes (1.2) with the optimal weighting matrix W = C⁻¹ the ONE-STEP estimator.

This is the usual GMM estimator that uses both moment conditions (1.3A) and (1.3B) jointly to estimate θ₀₁ and θ₀₂.
Definition 1.2.2 Call the estimator of θ obtained in the following two-step procedure the TWO-STEP estimator: (i) the estimator θ̂₂ is obtained by minimizing (1.2), where h(θ) = h₂(θ₂) and W = C₂₂⁻¹; (ii) the estimator of θ₁ is obtained by minimizing (1.2), where h(θ) = h₁(θ₁, θ̂₂), W = C₁₁⁻¹, and θ₂ = θ̂₂ is treated as known.

This estimator uses the orthogonality condition (1.3B) first to obtain a consistent estimator of the unknown parameter subvector θ₀₂ and then uses the moment condition (1.3A) to obtain the estimator of θ₀₁. Estimators considered in Wooldridge (2003), Newey (1984), Newey and McFadden (1994, pp. 2176-2184) and many others are TWO-STEP estimators with m₁ = p₁, m₂ = p₂.

Definition 1.2.3 Call the estimator of θ₁ obtained by minimizing (1.2), where h(θ) = h₁(θ₁, θ₂), W = C₁₁⁻¹, and θ₂ is treated as known, the KNOW-θ₂ estimator.

Here, equation (B) in (1.3) is ignored. However, the results of Section 1.3 of the paper all derive from understanding that (B) is potentially informative even though θ₀₂ is known, because it imposes additional restrictions on the population.

Definition 1.2.4 Call the estimator of θ₁ obtained by minimizing (1.2), where h(·) contains both h₁(·) and h₂(·), W = C⁻¹, and θ₂ is treated as known, the KNOW-θ₂-JOINT estimator.

This is the augmented GMM estimator of θ₀₁ of the form considered in Qian and Schmidt (1999). Here, the information in (1.3B) is kept even though θ₀₂ is assumed known.

Under Assumption 1.2.1 all four estimators are CAN.

Theorem 1.2.2 Let V_ONE-STEP, V_TWO-STEP, V_KNOW-θ₂, and V_KNOW-θ₂-JOINT denote the asymptotic variances of the ONE-STEP, TWO-STEP, KNOW-θ₂, and KNOW-θ₂-JOINT estimators, respectively. Then,

  V_ONE-STEP = (D′C⁻¹D)⁻¹,   (1.6)
  V_TWO-STEP = BCB′,   (1.7)
  V_KNOW-θ₂ = (D₁₁′C₁₁⁻¹D₁₁)⁻¹,   (1.8)
  V_KNOW-θ₂-JOINT = (D̄₁′C⁻¹D̄₁)⁻¹,   (1.9)

where B is defined in equation (1.31) of the Appendix and D̄₁ = (D₁₁′, 0)′ denotes the first block column of D.
In the above expressions, we use the standard notation that "the asymptotic variance of θ̂ is V" means "√N(θ̂ − θ₀) converges in distribution to N(0, V)".

1.2.3 Efficiency and redundancy results

We can now state several asymptotic relative efficiency results (noting that a known parameter is always more efficient than its estimator).

Theorem 1.2.3 For the estimators defined in Definitions 1.2.1-1.2.4 with asymptotic variances given in (1.6)-(1.9), respectively, the following statements hold:

1. KNOW-θ₂-JOINT is no less asymptotically efficient than KNOW-θ₂.

2. KNOW-θ₂-JOINT is no less asymptotically efficient than ONE-STEP.

3. ONE-STEP is no less asymptotically efficient than TWO-STEP.

4. If C₁₂ = 0 then KNOW-θ₂-JOINT and KNOW-θ₂ are equally asymptotically efficient [M-redundancy].

5. If D₁₂ = 0 then TWO-STEP and KNOW-θ₂ are equally asymptotically efficient for θ₁.

6. If C₁₂ = 0 and D₁₂ = 0 then ONE-STEP, TWO-STEP, KNOW-θ₂-JOINT and KNOW-θ₂ are all equally asymptotically efficient for θ₁; ONE-STEP and TWO-STEP are equally asymptotically efficient for θ₂, too [M/P-redundancy].

7. If m₁ = p₁ then the ONE-STEP and TWO-STEP estimates of θ₂ are equal.

8. If m₁ = p₁ and m₂ = p₂ then the ONE-STEP and TWO-STEP estimates are equal (for both θ₁ and θ₂).

9. If m₁ = p₁ and C₁₂ = 0 then the ONE-STEP and TWO-STEP estimates are equally efficient (for both θ₁ and θ₂).

10. If D₁₂ = C₁₂C₂₂⁻¹D₂₂ then KNOW-θ₂-JOINT and ONE-STEP are equally asymptotically efficient for θ₁ [P-redundancy].

11. If D₁₂ = C₁₂C₂₂⁻¹D₂₂ then ONE-STEP, TWO-STEP and KNOW-θ₂-JOINT are no less asymptotically efficient for θ₁ than KNOW-θ₂.

As noted above, we have defined our estimators as depending on known C. In practice, C is replaced by an initial consistent estimate. This has no effect on the asymptotic variance of the estimates and so it does not affect our efficiency comparisons.
For Statements 7 and 8, which do not involve asymptotic arguments, we would need to require that the same initial consistent estimate is used.

Statements 1-3 state the obvious fact that KNOW-θ₂-JOINT dominates KNOW-θ₂, ONE-STEP and TWO-STEP. The known value of θ₀₂ is at least as efficient as any estimate of θ₀₂, and the KNOW-θ₂-JOINT estimate of θ₀₁ is the efficient GMM estimate of θ₀₁ based on the full set of available moment conditions.

Statement 4 is essentially the result of Qian and Schmidt (1999). With θ₀₂ known, the second set of moment conditions contains no unknown parameters, and Qian and Schmidt show that using these conditions in addition to the first set of moment conditions improves efficiency except in the special case that C₁₂ = 0. We call this type of redundancy the knowledge-of-moment redundancy (m-redundancy). Also, if we combine Statements 2, 3 and 4, we have the corollary that if C₁₂ = 0, KNOW-θ₂ is at least as efficient as ONE-STEP and TWO-STEP.

Statement 5 is essentially the result of Newey and McFadden (1994, pp. 2179-2180) for the condition under which first stage estimation of a nuisance parameter (θ₀₂) does not affect the asymptotic variance of the second stage estimate of the parameter of interest (θ₀₁). See also Wooldridge (2002a, pp. 353-356). However, our version treats the overidentified case as well.

Statement 6 combines the conditions of Statements 4 and 5. Therefore the equal efficiency of TWO-STEP, KNOW-θ₂ and KNOW-θ₂-JOINT follows from those statements. The fact that ONE-STEP is also equally efficient is an additional result. This statement provides conditions for redundancy of both the knowledge of θ₀₂ and of the extra moment conditions in (B) for estimating θ₀₁ (m/p-redundancy). One case when the conditions hold is when θ₀₂ does not enter (A) and the two moment conditions are uncorrelated. This statement can also be viewed as a special case of Theorem 7 of Breusch et al.
(1999) that deals with partial redundancy of moment conditions.

Statement 7 is the GMM separability result of Ahn and Schmidt (1995), which says that the GMM estimate of θ₂ is unaffected if an equal number of parameters and moment conditions is added, because the additional conditions only determine θ₁ in terms of θ₂. Further, it can be shown (see the Appendix of Ahn and Schmidt, 1995) that if D₁₁ is nonsingular (which is true since D₁₁ is of full column rank) the ONE-STEP estimator of θ₀₁ is expressed in terms of the ONE-STEP estimator of θ₀₂ using the equation h̄₁(θ₁, θ₂) = C₁₂C₂₂⁻¹h̄₂(θ₂). Thus, ONE-STEP for θ₀₁ is derived from the same equation as TWO-STEP for θ₀₁ as long as h̄₂(θ₂) = 0 (which holds under exact identification of θ₂) or C₁₂ is zero asymptotically. The former condition implies equivalence of the estimators (Statement 8); the latter implies their equal efficiency asymptotically (Statement 9).

Statements 10 and 11 are novel and interesting. They discuss implications of the condition that D₁₂ = C₁₂C₂₂⁻¹D₂₂. This is the condition for redundancy of h₁, given h₂, for estimation of θ₀₂ when θ₀₁ is known (see Breusch et al., 1999, p. 94), which is an m-redundancy result. Under this condition, Statement 10 says that KNOW-θ₂-JOINT and ONE-STEP are equally efficient. This means that knowledge of θ₀₂ does not help efficiency of estimation of θ₀₁ (from the set of all moment conditions) under this condition, which is a p-redundancy result. This link between m-redundancy and p-redundancy (the first set of moment conditions with θ₀₁ known is m-redundant for estimation of θ₀₂ if and only if knowledge of θ₀₂ is p-redundant for estimation of θ₀₁) is quite interesting and (so far as we know) original. Under the same condition, Statement 11 says that KNOW-θ₂ is dominated by the other three estimators. This is because knowledge of θ₀₂ is not useful, and the KNOW-θ₂ estimator fails to use the second set of moment conditions, which is useful unless C₁₂ = 0.
Note, however, that although the TWO-STEP estimator dominates the KNOW-θ₂ estimator under this condition, the TWO-STEP estimator is still not as efficient as the ONE-STEP or KNOW-θ₂-JOINT estimators unless m₁ = p₁ (the first equation is exactly identified for θ₁, given θ₂). This condition is also important because it implies that conservative inference can be made using the asymptotic standard errors obtained from exactly identified estimations that neglect the first step (Statement 11).

The condition of Statements 10 and 11 will often hold when h₂(θ₂) is the score of a log-likelihood function that depends on θ₂ but not θ₁. In this case the estimate of θ₀₂ based on h₂ will be efficient, and another moment condition based on h₁(θ₁, θ₂) with θ₀₁ known should be m-redundant. More precisely, the generalized information equality (GIME) implies that the expectation of the derivative of h₁ (with respect to θ₂), evaluated at θ₀, equals minus its covariance with the score, so that D₁₂ = −C₁₂, and the usual information equality implies that D₂₂ = −C₂₂, so that D₁₂ = C₁₂C₂₂⁻¹D₂₂ holds. Indeed this is exactly what occurs in the selectivity model of the next section.

Example 1.2.1 A sufficient condition for Statements 6, 10, and 11 to hold is that h₁(θ₁, θ₂) = ∇_θ₁ ln f(w*|θ₁, θ₂) and h₂(θ₂) = ∇_θ₂ ln f(w*|θ₁, θ₂), where f(w*|θ₁, θ₂) is the density of W*. Then, the asymptotic variance matrix of the estimator of θ₀₂ can be equivalently written as C₂₂⁻¹ and as C²². This implies that the information matrix for θ₁ and θ₂ is block diagonal, i.e. D₁₂ = −C₁₂ = 0. Thus by Theorem 1.2.3 we can claim more than Statements 10 and 11 in this case: it does not make any difference for the efficiency of the estimate of θ₁ whether θ₂ is estimated or known, and in fact all four estimators are equally efficient (Statement 6). □

We now apply these results to the missing data problem.
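Before moving to the application, the variance formulas (1.6), (1.8) and (1.9) and Statements 1, 2, 4 and 10 of Theorem 1.2.3 can be checked by direct linear algebra on randomly generated C and D matrices. This is an illustrative check, not part of the dissertation; the TWO-STEP variance is omitted because the matrix B of equation (1.31) is defined in the appendix, which is not excerpted here.

```python
import numpy as np

rng = np.random.default_rng(3)
p1, p2, m1, m2 = 2, 2, 3, 2          # overidentified first set: m1 > p1

def variances(C, D11, D12, D22):
    """Theta1 asymptotic variances from (1.6), (1.8) and (1.9)."""
    k1 = D11.shape[0]
    D = np.block([[D11, D12], [np.zeros((D22.shape[0], D11.shape[1])), D22]])
    Ci = np.linalg.inv(C)
    V_one = np.linalg.inv(D.T @ Ci @ D)[:D11.shape[1], :D11.shape[1]]   # (1.6)
    V_know = np.linalg.inv(D11.T @ np.linalg.inv(C[:k1, :k1]) @ D11)    # (1.8)
    Dbar1 = np.vstack([D11, np.zeros((D22.shape[0], D11.shape[1]))])
    V_joint = np.linalg.inv(Dbar1.T @ Ci @ Dbar1)                       # (1.9)
    return V_one, V_know, V_joint

def psd(A, tol=1e-8):
    return np.all(np.linalg.eigvalsh((A + A.T) / 2) >= -tol)

# Generic case: KNOW-theta2-JOINT weakly dominates (Statements 1 and 2).
R = rng.standard_normal((m1 + m2, m1 + m2))
C = R @ R.T + np.eye(m1 + m2)
D11, D12, D22 = (rng.standard_normal(s) for s in [(m1, p1), (m1, p2), (m2, p2)])
V_one, V_know, V_joint = variances(C, D11, D12, D22)
assert psd(V_know - V_joint) and psd(V_one - V_joint)

# Statement 4: if C12 = 0, KNOW-theta2-JOINT and KNOW-theta2 coincide.
C0 = C.copy(); C0[:m1, m1:] = 0; C0[m1:, :m1] = 0
_, V_know0, V_joint0 = variances(C0, D11, D12, D22)
assert np.allclose(V_know0, V_joint0)

# Statement 10: if D12 = C12 C22^{-1} D22, knowing theta2 is p-redundant:
# ONE-STEP and KNOW-theta2-JOINT have the same theta1 variance.
D12_r = C[:m1, m1:] @ np.linalg.solve(C[m1:, m1:], D22)
V_one_r, _, V_joint_r = variances(C, D11, D12_r, D22)
assert np.allclose(V_one_r, V_joint_r)
print("Statements 1, 2, 4 and 10 verified on this example")
```

The Statement 10 check makes the m-redundancy/p-redundancy link concrete: imposing the m-redundancy condition on D₁₂ makes the off-diagonal block of D′C⁻¹D vanish, so knowing θ₂ buys nothing for θ₁.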
1.3 Application to the missing data problem

1.3.1 The population problem

Consider again a random vector $W^* \in \mathcal{W}^*$ from the distribution $P_{\theta_0}$, $\theta_0 \in \Theta = \Theta_1 \times \Theta_2 \subset \mathbb{R}^{p_1} \times \mathbb{R}^{p_2}$, $\Theta$ compact. Let $W^*$ contain a random vector $W \in \mathcal{W} \subset \mathbb{R}^{\dim(W)}$. Consider a real-valued measurable function $g : \mathcal{W} \times \Theta_1 \to \mathbb{R}^{m_1}$ ($m_1 \ge p_1$) such that

$$E[g(W, \theta_1)] = 0 \quad \text{if and only if} \quad \theta_1 = \theta_{01}. \qquad (1.10)$$

As before, the expectation is with respect to $P_{\theta_0}$. Assume that the moment function in (1.10) satisfies Assumption 1.2.1.

We are interested in estimating $\theta_{01}$. The parameter $\theta_{01}$ usually describes some feature of the distribution of $W$, such as the conditional mean, the conditional variance, the conditional quantiles, etc. The vector $W$ is often partitioned into $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, and $E(Y|x)$ is often the feature of interest (see Example 1.3.1).

Example 1.3.1 Consider M-estimation of the parameter $\theta_{01}$ in a general nonlinear least squares model for $E(Y|x) = m(x, \theta_{01})$. This is one of the examples considered in Wooldridge (2003). We assume that the model is correctly specified. Let the identifying moment functions be the first-order conditions for optimization of $q(x, y; \theta_1) = (y - m(x, \theta_1))^2$. Then $W = (X, Y)$, $m_1 = p_1$, and $g(W, \theta_1) = -(Y - m(X, \theta_1))[\nabla_{\theta_1} m(X, \theta_1)]'$. Note that a stronger condition than (1.10) holds in this case, namely $E[g(W, \theta_{01})|x] = 0$. $\square$

Example 1.3.2 Consider maximum likelihood estimation of a LOGIT model, where $Y$ is a binary outcome variable, $X$ is a vector of regressors, and the conditional probability $p(y|x, \theta_{01})$ is modelled as $G(x'\theta_{01})^y (1 - G(x'\theta_{01}))^{1-y}$, where $G(\cdot)$ is the logistic cdf. The likelihood equations can be used to construct the GMM estimator based on the expectation of the score function, $E[\nabla_{\theta_1} \ln f(X, Y; \theta_1)|_{\theta_1 = \theta_{01}}] = 0$, where $f(x, y; \theta_{01})$ is the joint density of $X$ and $Y$. If the distribution of $X$ does not depend on $\theta_1$, then $f(x, y; \theta_1) = p(y|x, \theta_1) f(x)$, where $f(x)$ is the unknown pdf of $X$.
Then the identifying moment condition can be rewritten as $E\{\nabla_{\theta_1}[\ln p(Y|x;\theta_1) + \ln f(X)]|_{\theta_1=\theta_{01}}\} = E[\nabla_{\theta_1} \ln p(Y|x;\theta_1)|_{\theta_1=\theta_{01}}]$, and ML estimation is equivalent to conditional ML estimation. For this example, $W = (X, Y)$, $m_1 = p_1$, and

$$g(W, \theta_1) = \frac{X'(Y - G(X'\theta_1))}{G(X'\theta_1)(1 - G(X'\theta_1))}\, g(X'\theta_1),$$

where $g(\cdot)$ is the logistic pdf. $\square$

Example 1.3.3 Consider estimation of the population averages $\mu_0$ and $\mu_1$ under control and treatment. Suppose a random sample is available of each unit's outcome under both control and treatment. Let $Y(0)$ denote the outcome under control and $Y(1)$ the outcome under treatment. The identifying moment restriction for each group is $E(Y(t) - \mu_{0t}) = 0$, $t = 0, 1$. So for this example $W = Y(t)$, $m_1 = p_1 = 1$, and $g(W, \theta) = Y(t) - \mu_t$, $t = 0, 1$. We can also consider the average treatment effect $\tau = \mu_1 - \mu_0$. $\square$

The above model (1.10) holds in the entire (unselected) population. Now we consider the selected population, defined by a random variable $S \in \{0, 1\}$ such that $W$ is observed if and only if $S = 1$. We assume that the probability of selection depends on some additional variables $Z$, where $Z \in \mathcal{Z} \subset \mathbb{R}^{\dim(Z)}$ is always observed. Some or all of $Z$ may be in $W$; that is, some of $W$ may always be observed, but all of $W$ is observed only when $S = 1$. Define

$$P(z, \theta_{02}) = P(S = 1|z), \qquad (1.11)$$

where $P(z, \theta_2)$ is a parametric model for the probability of selection and is known up to the parameter vector $\theta_2 \in \Theta_2 \subset \mathbb{R}^{p_2}$. Again, in many problems, the joint density of $\{S, Z\}$ can be written as the product $P(s|z, \theta_2) r(z)$, where $r(z)$ is the pdf of $Z$. Assume $\{S, Z\}$ is a subvector of $W^*$ from $P_{\theta_0}$. Suppose there exists a real-valued measurable function $u : \{0,1\} \times \mathcal{Z} \times \Theta_2 \to \mathbb{R}^{m_2}$ ($m_2 \ge p_2$) such that

$$E[u(S, Z; \theta_2)] = 0 \quad \text{if and only if} \quad \theta_2 = \theta_{02}. \qquad (1.12)$$

(The expectation is with respect to $P_{\theta_0}$.) Assume that (1.12) satisfies Assumption 1.2.1. We call moment condition (1.12) the "selection moment condition". Examples 1.3.4-1.3.6 in the next section show how (1.12) can be obtained from (1.11).
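The moment function of Example 1.3.2 can be checked numerically. The following sketch is illustrative and not part of the text: the simulation design (sample size, true parameter) and the helper names are assumptions. It verifies that the sample analogue of $E[g(W, \theta_1)]$ is close to zero at $\theta_{01}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def G(v):
    """Logistic cdf."""
    return 1.0 / (1.0 + np.exp(-v))

def g_logit(y, x, theta1):
    """Moment function of Example 1.3.2. For the logistic cdf the pdf equals
    G(1 - G), so the ratio in g(W, theta1) cancels and the score reduces to
    X * (Y - G(X'theta1))."""
    return x * (y - G(x @ theta1))[:, None]   # one row of moments per observation

# Simulate from the LOGIT model at a known theta01 and evaluate the moments there.
theta01 = np.array([0.5, -1.0])
n = 200_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = (rng.uniform(size=n) < G(X @ theta01)).astype(float)

moments = g_logit(Y, X, theta01).mean(axis=0)
print(moments)   # both components should be near zero
```

At any $\theta_1 \ne \theta_{01}$ the same sample moments would drift away from zero, which is what identifies the parameter.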
The GMM estimator based on (1.10), but with missing data, in effect makes the empirical moments $N^{-1}\sum_{i=1}^N S_i\, g(W_i, \theta_1)$ close to zero. These empirical moments are the random sample analogues of the population moments of the form

$$E[S\, g(W, \theta_1)] = 0. \qquad (1.13)$$

We call these moment conditions the "unweighted selected population moments" to emphasize that they hold in the selected rather than the target population and to distinguish them from the weighted selected population moments that we will define shortly. The selectivity problem is that the unweighted selected population moment conditions (1.13) may not hold at $\theta_{01}$; more precisely, the value $\theta_{01}$ that solves (1.10) may not solve (1.13). We also consider the "weighted selected population moments" that weight the moment function in (1.13) by the inverse of the selection probability (see, e.g., Horvitz and Thompson, 1952):

$$E\left[\frac{S}{P(Z, \theta_{02})}\, g(W, \theta_1)\right] = 0. \qquad (1.14)$$

The weighted selected population moments also may not hold. Indeed, it is intuitively clear that whether (1.13) or (1.14) holds must depend on what is assumed about the relationship between the selection mechanism and $W$.

1.3.2 Motivation and definitions

We follow Wooldridge (2002b, 2005) in making the following "ignorability" (or "selection on observables") assumption.

Assumption 1.3.1 (ignorability of selection) $P(S = 1|w, z) = P(S = 1|z) = P(z, \theta_{02})$.

Assumption 1.3.1 says that, conditional on $Z$, $S$ and $W$ are independent. This is commonly written as $S \perp W \mid Z$. In some cases, ignorability is true by construction. An example would be the case in which $Z$ is an indicator of stratum and selection is random within stratum. In other cases it is a substantial behavioral assumption. As Wooldridge notes, this assumption does not imply that the unweighted selected population moment conditions (1.13) hold at $\theta_{01}$.
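Numerically, the failure of (1.13) and the validity of (1.14) under ignorable selection look as follows. This is an illustrative sketch, not from the text; the design (a scalar mean problem with a logistic selection probability) is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# Target population: estimate theta01 = E(Y); moment g(W, theta1) = Y - theta1.
Z = rng.normal(size=n)
Y = Z + rng.normal(size=n)       # Y depends on Z, so selection on Z matters
theta01 = 0.0                    # true population mean of Y

# Ignorable selection: P(S=1|z) depends only on Z (Assumption 1.3.1).
P = 1.0 / (1.0 + np.exp(-Z))     # logistic selection probability
S = (rng.uniform(size=n) < P).astype(float)

g = Y - theta01
unweighted = np.mean(S * g)      # sample analogue of (1.13)
weighted = np.mean(S / P * g)    # Horvitz-Thompson analogue of (1.14)

print(unweighted, weighted)      # first is biased away from 0, second is near 0
```

The unweighted moment fails because $E[S\,g] = E\{P(Z)\,E[g|Z]\}$ is nonzero when $E[g|Z] \ne 0$, exactly as derived in (1.15) below; the inverse-probability weight removes the $P(Z)$ factor.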
This can be seen as follows:

$$E[S\, g(W, \theta_{01})] = E\{E[S\, g(W, \theta_{01})\,|\,Z]\}, \quad \text{using the LIE},$$
$$= E\{E(S|Z)\, E[g(W, \theta_{01})\,|\,Z]\}, \quad \text{using ignorability}, \qquad (1.15)$$
$$= E\{P(Z, \theta_{02})\, E[g(W, \theta_{01})\,|\,Z]\}$$

(where LIE means law of iterated expectations), and our assumptions do not in general imply that $E[g(W, \theta_{01})|z] = 0$. However, the weighted selected moment conditions (1.14) do hold at $\theta_0$, since

$$E\left[\frac{S}{P(Z,\theta_{02})}\, g(W, \theta_{01})\right] = E\left\{E\left[\frac{S}{P(Z,\theta_{02})}\, g(W, \theta_{01})\,\Big|\,Z\right]\right\} = E\left\{\frac{E(S|Z)}{P(Z,\theta_{02})}\, E[g(W, \theta_{01})|Z]\right\} = E\{E[g(W, \theta_{01})|Z]\} = E[g(W, \theta_{01})] = 0. \qquad (1.16)$$

The simplest assumption under which the unweighted moment condition (1.13) holds in the selected sample is the following.

Assumption 1.3.2 $P(S = 1|w) = P(S = 1)$. That is, $S$ is independent of $W$.

This assumption is easy to understand and clearly implies that (1.13) holds, since $S$ is independent of $g(W, \theta_1)$. This condition is sometimes referred to as "missing completely at random" (see, e.g., Little and Rubin, 2002), but we will not use this terminology further, since there seems to be some inconsistency in the literature in the use of these words. It should be noted that this assumption is neither stronger nor weaker than the assumption of ignorability (Assumption 1.3.1). That is, "$S$ independent of $W$" does not imply, and is not implied by, "$S$ independent of $W$ conditional on $Z$". It is perhaps intuitive that the first condition is stronger than the second, but in fact that intuition is not correct.¹

¹The intuition referred to here is based on the fact that, for general $Y$, $X_1$, $X_2$, $E(Y|x_1, x_2) = 0$ does imply that $E(Y|x_1) = 0$ by the law of iterated expectations. But there is no comparable law for conditional independence.

The simplest assumption under which both the unweighted and the weighted moment conditions hold is the following.

Assumption 1.3.3 (independence of selection) $(S, Z)$ is independent of $W$.

This assumption is also easy to understand, but it would appear to be too strong to apply in practical cases.
We now consider an exogeneity condition that is weaker than Assumption 1.3.3 and which does imply that both the weighted and unweighted moment conditions hold.

Assumption 1.3.4 (exogeneity of selection) (i) Assumption 1.3.1 (ignorability of selection) holds. (ii) $E[g(W, \theta_{01})|z] = 0$.

This is essentially the same definition of exogeneity as in Wooldridge (2005). Under Assumption 1.3.4, selection is both ignorable and exogenous with respect to the primary problem of interest. For example, if $W = (Y, X)$ and $Z \subseteq X$, then having $X$ in the conditioning set in the original problem is sufficient for the assumption to hold. If selection is based on covariates other than $X$, i.e. $X \subseteq Z$, then $g(Y, X; \theta_1)$ has to be uncorrelated with any function of $X^* \equiv Z \setminus X$ given $X$.

We now show that under Assumption 1.3.4, both the weighted and unweighted moment conditions hold. We first state without proof the following basic result.

Lemma 1.3.1 Suppose Assumption 1.3.1 holds. Then $f(w|z, s) = f(w|z)$. (Here $f(\cdot)$ is generic notation for a probability density.)

Then it is easy to see that the following result is true.

Theorem 1.3.1 Suppose Assumption 1.3.4 (exogeneity) holds. Then

$$E[g(W, \theta_{01})|z, s] = 0. \qquad (1.17)$$

This is a much simpler and stronger result than Wooldridge obtained. It immediately implies that any function of $Z$ and $S$ is uncorrelated with $g(W, \theta_{01})$, and therefore that the unweighted moment condition (1.13) and the weighted moment condition (1.14) both hold in the selected sample. In fact, this is true whether or not the weights are correct (in the sense that they do in fact represent $P(S = 1|z)$). All that is required is that the weights be a function of $Z$ and $S$.
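Theorem 1.3.1's implication that only the functional form of the weights matters, not their correctness, is easy to illustrate. The sketch below is not from the text; the design, including the deliberately misspecified weight function, is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

theta01 = 2.0
Z = rng.normal(size=n)
eps = rng.normal(size=n)         # independent of Z, so E[g|z] = 0 (exogeneity)
Y = theta01 + eps
g = Y - theta01

P = 1.0 / (1.0 + np.exp(-Z))     # ignorable selection on Z
S = (rng.uniform(size=n) < P).astype(float)

# Under Assumption 1.3.4, ANY weight that is a function of Z (and S) keeps the
# selected moment valid; correct inverse probabilities are not required.
true_w     = np.mean(S / P * g)                # correct inverse-probability weight
wrong_w    = np.mean(S / (0.2 + 0.6 * P) * g)  # deliberately misspecified weight
unweighted = np.mean(S * g)                    # unit weight, also a function of Z, S

print(true_w, wrong_w, unweighted)             # all three are near 0
```

This is the robustness-to-misspecification point made again in the concluding remarks: under exogenous selection, weighting is a matter of efficiency, not consistency.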
We conclude that under ignorable selection,

$$E\begin{bmatrix} \dfrac{S\, g(W, \theta_{01})}{P(Z, \theta_{02})} \\[1ex] u(S, Z; \theta_{02}) \end{bmatrix} = 0 \qquad (1.18)$$

and, under independent or exogenous selection,

$$E\begin{bmatrix} S\, g(W, \theta_{01}) \\[0.5ex] u(S, Z; \theta_{02}) \end{bmatrix} = 0, \qquad (1.19)$$

$$E\begin{bmatrix} \dfrac{S\, g(W, \theta_{01})}{P(Z, \theta_{02})} \\[1ex] u(S, Z; \theta_{02}) \end{bmatrix} = 0, \qquad (1.20)$$

$$E\begin{bmatrix} S\, g(W, \theta_{01}) \\[0.5ex] \dfrac{S\, g(W, \theta_{01})}{P(Z, \theta_{02})} \\[1ex] u(S, Z; \theta_{02}) \end{bmatrix} = 0. \qquad (1.21)$$

Example 1.3.4 Suppose that sampling in Example 1.3.1 is nonrandom and the selection mechanism can be modelled as a PROBIT. Then $P(z, \theta_2) = \Phi(z'\theta_2)$, where $\Phi(\cdot)$ is the standard normal cdf. Then the selection moment conditions for this problem contain the likelihood equations for the log-likelihood $l(\theta_2|s, z) \equiv s \ln \Phi(z'\theta_2) + (1 - s)\ln(1 - \Phi(z'\theta_2))$. Thus, $m_2 = p_2$ and

$$u(S, Z; \theta_2) = \frac{Z'(S - \Phi(Z'\theta_2))}{\Phi(Z'\theta_2)(1 - \Phi(Z'\theta_2))}\, \phi(Z'\theta_2),$$

where $\phi(\cdot)$ is the standard normal pdf. Note that we not only have $E[u(S, Z; \theta_{02})] = 0$ but also $E[u(S, Z; \theta_{02})|z] = 0$. Under the ignorability of selection assumption, we can use the moment condition $E\left[\frac{S}{\Phi(Z'\theta_{02})}\, g(X, Y; \theta_{01})\right] = 0$, where $g(\cdot)$ is defined in Example 1.3.1. $\square$

Example 1.3.5 (Variable Probability Sampling) Suppose that sampling in Example 1.3.2 is stratified. Let the sample space $\mathcal{W}$ be partitioned into $J$ nonempty and disjoint strata $\mathcal{W}_1, \mathcal{W}_2, \ldots, \mathcal{W}_J$. If an observation lies in stratum $\mathcal{W}_j$, it is retained with probability $P_{0j}$, which is usually known. So the selection predictor $Z$ can be defined by the vector $(Z_1, \ldots, Z_J)'$ with $Z_j = \mathbb{1}\{W \in \mathcal{W}_j\}$, $j = 1, 2, \ldots, J$, where $\mathbb{1}\{\cdot\}$ is the indicator function, and the probability model is $P(z, \theta_{02}) = \sum_{j=1}^J P_{0j} z_j$, where $\theta_{02} = (P_{01}, \ldots, P_{0J})'$. The ignorability assumption is satisfied by design. We have $P(s|z, \theta_2) = \prod_{j=1}^J [P_j^s (1 - P_j)^{1-s}]^{z_j}$. Hence, the selection moment function for this problem contains the likelihood equations of the log-likelihood function $l(\theta_2|s, z) = \sum_{j=1}^J z_j [s \ln P_j + (1 - s)\ln(1 - P_j)]$. Thus, $m_2 = p_2 = J$ and

$$u(S, Z; \theta_2) = \left( \frac{Z_1(S - P_1)}{P_1(1 - P_1)}, \ldots, \frac{Z_J(S - P_J)}{P_J(1 - P_J)} \right)'.$$

The weighted selected population moment condition contains $\frac{S}{\sum_{j=1}^J P_j Z_j}\, g(X, Y; \theta_1)$, where $g(\cdot)$ is defined in Example 1.3.2.
If, in addition, stratification is based on exogenous variables, i.e. the exogeneity of selection assumption holds, the unweighted moment conditions (1.13) can also be used. $\square$

Example 1.3.6 (Average Treatment Effect) Suppose that the sample in Example 1.3.3 is not entirely observed. Instead, we observe $Y(0)$ only for the units that are in the control group and $Y(1)$ only for those that are in the treatment group. Understandably, the counterfactual data are missing. If $Z$ are treatment predictors, the selection model for the treatment group is $P(S = 1|z) = P(z; \theta_{02})$ and for the control group $P(S = 0|z) = 1 - P(z; \theta_{02})$, where $P(z; \theta_{02})$ is the probability of receiving treatment. The ignorability of selection assumption implies in this case that $P(S = 1|y(0), z) = P(S = 1|z)$ and $P(S = 0|y(1), z) = P(S = 0|z)$. The selection moment condition for this example is the same as in Example 1.3.4. The weighted population moment condition will contain $\frac{S}{P(Z;\theta_{02})}(Y(1) - \mu_{01})$ for the treatment group and $\frac{1 - S}{1 - P(Z;\theta_{02})}(Y(0) - \mu_{00})$ for the control group. The average treatment effect can be identified using $E\left[\frac{S}{P(Z;\theta_{02})} Y(1) - \frac{1 - S}{1 - P(Z;\theta_{02})} Y(0) - \tau_0\right] = 0$. $\square$

1.3.3 Relative efficiency results under ignorable selection

First consider estimation based on (1.18), under ignorable selection. Following the notation of Section 1.2, we write the weighted selected population moment condition as $E[h_1(W^*, \theta_{01}, \theta_{02})] = 0$, where $W^*$ contains $W$, $S$ and $Z$, and where

$$h_1(W^*, \theta_{01}, \theta_{02}) = \frac{S}{P(Z, \theta_{02})}\, g(W, \theta_{01}). \qquad (1.22)$$

Wooldridge (2005) discusses estimation based on (1.22) for the exactly identified case. He compares the estimator of $\theta_{01}$ when $\theta_{02}$ is known to the estimator of $\theta_{01}$ when $\theta_{02}$ is replaced by some consistent estimate $\hat\theta_2$. In order to analyze this or other related issues, we have to say something about how $\theta_{02}$ is estimated. In general terms, it is estimated by GMM based on a moment condition $E[h_2(S, Z; \theta_{02})] = 0$, which puts the analysis into the framework of Section 1.2.
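In the exactly identified scalar-mean case, the two-step procedure (first-stage ML for $\theta_{02}$, then the weighted moment (1.22)) can be sketched as follows. This is illustrative only: a logit selection model stands in for the probit of Example 1.3.4 purely to keep the sketch dependency-free, and the data-generating design is assumed.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Design: theta01 = E(Y) = 0, selection on Z through a logit model.
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
theta02 = np.array([0.2, 1.0])
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
P0 = sigmoid(Z @ theta02)
S = (rng.uniform(size=n) < P0).astype(float)
Y = Z[:, 1] + rng.normal(size=n)          # E(Y) = 0 = theta01

# Step 1: ML for theta02. For the logit, the score is h2 = Z'(S - P(Z, theta2));
# solve h2 = 0 by Newton-Raphson.
t2 = np.zeros(2)
for _ in range(25):
    p = sigmoid(Z @ t2)
    score = Z.T @ (S - p)
    hess = -(Z * (p * (1 - p))[:, None]).T @ Z
    t2 = t2 - np.linalg.solve(hess, score)

# Step 2: plug the estimated probabilities into the weighted moment (1.22).
# With g(W, theta1) = Y - theta1, solving the sample moment exactly gives a
# ratio of weighted sums.
w = S / sigmoid(Z @ t2)
theta1_hat = np.sum(w * Y) / np.sum(w)
print(theta1_hat)                         # close to the true value 0
```

The same skeleton covers the ONE-STEP and KNOW-$\theta_2$ variants: stack $h_1$ and $h_2$ and solve jointly, or replace the fitted probabilities with the true $P_0$.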
However, following Wooldridge, we make the specific assumption that $\theta_{02}$ is estimated by MLE based on the model $P(S = 1|z) = P(z, \theta_{02})$. That is, $h_2(S, Z; \theta_{02})$ is the score function corresponding to the likelihood for this model. Specifically,

$$h_2(S, Z; \theta_{02}) = u(S, Z; \theta_{02}) = \nabla_{\theta_2} P(z, \theta_2)\big|_{\theta_2=\theta_{02}}' \, \frac{S - P(z, \theta_{02})}{P(z, \theta_{02})[1 - P(z, \theta_{02})]}. \qquad (1.23)$$

Under these assumptions, we have the puzzle referred to in the Introduction; namely, the TWO-STEP estimator of $\theta_{01}$ that uses $\hat\theta_2$ in (1.22) is better than the KNOW-$\theta_2$ estimator that uses the true value of $\theta_{02}$ in (1.22). We will verify that this result holds also in the case that (1.22) is overidentified, and also provide our explanation of the puzzle, using the results of Section 1.2. To apply these results we need to do some calculations involving the following:

$$C_{12} = E[h_1(W^*, \theta_0)\, h_2(S, Z; \theta_{02})'], \qquad C_{22} = E[h_2(S, Z; \theta_{02})\, h_2(S, Z; \theta_{02})'],$$
$$D_{12} = E[\nabla_{\theta_2} h_1(W^*, \theta_1, \theta_2)]\big|_{\theta=\theta_0}, \qquad D_{22} = E[\nabla_{\theta_2} h_2(S, Z; \theta_2)]\big|_{\theta_2=\theta_{02}}. \qquad (1.24)$$

Theorem 1.3.2 Under the ignorability of selection assumption,

(a) $C_{12} = E\left[\dfrac{g(W, \theta_{01})}{P(Z, \theta_{02})}\, \nabla_{\theta_2} P(Z, \theta_2)\big|_{\theta_2=\theta_{02}}\right]$, which is (in general) not equal to zero;

(b) $D_{12} = -C_{12}$, $D_{22} = -C_{22}$, and so $D_{12} = C_{12}C_{22}^{-1}D_{22}$.

To understand Theorem 1.3.2, note first that in the unselected population, $\tilde C_{12} \equiv E[g(W, \theta_{01}) \cdot h_2(S, Z; \theta_{02})'] = 0$. That is, the original moment condition $g(W, \theta_{01})$ is uncorrelated with the score function $h_2(S, Z; \theta_{02})$ by the generalized information matrix equality. However, in the selected sample, $C_{12} \ne 0$. That is, $h_1(W^*, \theta_{01}, \theta_{02})$ and $h_2(S, Z; \theta_{02})$ are correlated. This correlation makes $h_2(S, Z; \theta_{02})$ relevant for estimation of $\theta_{01}$ even if $\theta_{02}$ is known, and the inefficiency of the KNOW-$\theta_2$ estimator is due to its failure to capture the information in the moment condition based on $h_2(S, Z; \theta_{02})$. Although we do not pursue this point, it would appear that the inefficiency of the KNOW-$\theta_2$ estimator (at least relative to the KNOW-$\theta_2$-JOINT estimator) would hold even if $h_2(S, Z; \theta_2)$ were not a score function.
It depends only on $C_{12} \ne 0$, not on the particular form of $C_{12}$.

Part (b) of Theorem 1.3.2 gives a number of information equalities which do depend on $h_2(S, Z; \theta_2)$ being a score function. They establish that $D_{12} = C_{12}C_{22}^{-1}D_{22}$, which is the condition for Statements 10 and 11 of Theorem 1.2.3. Statement 11 of Theorem 1.2.3 says that the KNOW-$\theta_2$ estimator is inefficient relative to the ONE-STEP, TWO-STEP and KNOW-$\theta_2$-JOINT estimators. This extends the previously cited (but, we hope, no longer puzzling!) result, namely that KNOW-$\theta_2$ is inefficient relative to TWO-STEP, to a larger set of other estimators, and also to the case that the GMM problem for the parameters of interest is overidentified. Statement 10 of Theorem 1.2.3 says further that $\theta_{02}$ is p-redundant, so that the ONE-STEP and KNOW-$\theta_2$-JOINT estimators are equally efficient. So long as one includes the score function $h_2(S, Z; \theta_{02})$ in the estimation problem, it does not matter (in terms of efficiency of estimation of $\theta_{01}$) whether $\theta_{02}$ is known or not.

Another note is that, although the TWO-STEP estimator is better than the KNOW-$\theta_2$ estimator, it is not necessarily efficient. In the exactly identified case, it is efficient because it equals the ONE-STEP estimator (Statement 7 of Theorem 1.2.3), but in the overidentified case it is generally less efficient than the KNOW-$\theta_2$-JOINT and ONE-STEP estimators.

Example 1.3.7 Continuing Example 1.3.4 under ignorable selection with the ML estimate of $\theta_{02}$, $E[S \cdot u(S, z; \theta_{02})'|z]$ can be written as

$$E\left[S\, \frac{S - P(z, \theta_{02})}{P(z, \theta_{02})(1 - P(z, \theta_{02}))}\,\Big|\,z\right] \nabla_{\theta_2} P(z, \theta_2)\big|_{\theta_2=\theta_{02}} = \frac{E(S^2|z) - E(S|z)\, P(z, \theta_{02})}{P(z, \theta_{02})(1 - P(z, \theta_{02}))}\, \nabla_{\theta_2} P(z, \theta_2)\big|_{\theta_2=\theta_{02}} = \nabla_{\theta_2} P(z, \theta_2)\big|_{\theta_2=\theta_{02}},$$

where the last equality follows because $E(S^2|z) = E(S|z)$ and $E(S|z) = P(z, \theta_{02})$. This is non-zero. Also, $E[g(W; \theta_1)|z] \ne 0$ unless there is also exogeneity. Thus $C_{12}$, which can be expressed by the law of iterated expectations as

$$E\left\{\frac{1}{P(z, \theta_{02})}\, E[g(W, \theta_{01})|z]\, E[S\, u(S, z; \theta_{02})'|z]\right\},$$

is generally non-zero.
In fact, $C_{12} = E\left[\frac{g(W, \theta_{01})}{P(Z, \theta_{02})}\, \nabla_{\theta_2} P(Z, \theta_2)\big|_{\theta_2=\theta_{02}}\right]$. We cannot therefore claim m-redundancy under ignorability of selection: using orthogonality conditions from the selection process helps in estimating $\theta_{01}$ even if the weighting probabilities are known. However, we can claim p-redundancy by Theorem 1.3.2: using known selection probabilities with the additional moment conditions for selection is as efficient as estimating the probabilities in a one-step or two-step procedure. Each of the three alternatives is equally preferred to only using the original problem with known probabilities. $\square$

Example 1.3.8 Continuing Example 1.3.5 under ignorability with the ML estimates of $P_{0j}$, $E[S \cdot u(S, z; \theta_{02})'|z]$ contains elements of the form $\frac{z_j [E(S^2|z) - P_{0j} E(S|z)]}{P_{0j}(1 - P_{0j})}$, $j = 1, \ldots, J$. Since $E(S^2|z) = E(S|z) = \sum_{j=1}^J P_{0j} z_j$, the elements simplify to $z_j$, $j = 1, \ldots, J$. Thus $C_{12}$, which can be expressed by the law of iterated expectations as $E\left\{\frac{1}{P(z, \theta_{02})}\, E[g(W, \theta_{01})|z]\, E[S\, u(S, z; \theta_{02})'|z]\right\}$, can be simplified to

$$E\left[\frac{g(W, \theta_{01})\mathbb{1}\{W \in \mathcal{W}_1\}}{P_{01}}, \ldots, \frac{g(W, \theta_{01})\mathbb{1}\{W \in \mathcal{W}_J\}}{P_{0J}}\right].$$

This is nonzero, unless there is also the exogeneity or independence assumption. Similarly to Example 1.3.7, under ignorable selection, using the selection moment conditions increases the precision of estimating $\theta_{01}$. Also, if knowledge of the selection probabilities is available, it provides the same precision for $\theta_1$ as the one-step or two-step procedures, as long as all $m_1 + m_2$ moment conditions are used. $\square$

Example 1.3.9 Continuing Example 1.3.6, with ML estimates of the treatment probabilities $P(z; \theta_2)$ from a PROBIT, the correlation matrix between the moment conditions that identify $\mu_{01}$ and the likelihood equations from the PROBIT for $\theta_{02}$ is

$$E\left[\frac{S(Y(1) - \mu_{01})}{P(Z; \theta_{02})} \times \frac{(S - P(Z; \theta_{02}))\, \nabla_{\theta_2} P(Z; \theta_2)\big|_{\theta_2=\theta_{02}}}{P(Z; \theta_{02})(1 - P(Z; \theta_{02}))}\right].$$
This, under ignorability, can be rewritten as

$$E\left[\frac{(Y(1) - \mu_{01})\, \nabla_{\theta_2} P(Z; \theta_2)\big|_{\theta_2=\theta_{02}}}{P(Z; \theta_{02})}\right],$$

which is non-zero unless $Y(1) \perp Z$, and equal to minus the expected derivative of the weighted moment equation for $\mu_{01}$ with respect to $\theta_2$. A similar argument is valid for estimating $\mu_{00}$ and, consequently, $\tau_0$. Hence, in average treatment effect estimation, m-redundancy cannot be claimed: knowledge about the treatment assignment process should be included in the estimation. There is p-redundancy, however: it does not matter asymptotically whether the parameters of the assignment process are known or estimated, as long as all available moments are used. $\square$

1.3.4 Relative efficiency results under exogenous selection

Consider now estimation based on (1.19)-(1.21), under exogenous selection. Wooldridge (2002b, Theorem 5.2) shows, under the exogenous selection assumption, that the IPW M-estimator that uses known selection probabilities is as efficient as a two-step estimator that employs initial ML estimates of the selection probabilities. The results of Section 1.2 allow us to restate this result for other estimators and for the cases of overidentification in the primary problem of interest. Using definitions (1.22)-(1.24), it is easy to verify that, for the GMM estimator based on (1.20), the following is true.

Theorem 1.3.3 Under the exogeneity of selection assumption: (a) $C_{12} = 0$; (b) $D_{12} = 0$.

So, by Theorem 1.2.3 (Statement 6), we have m-redundancy of the selection moment condition and p-redundancy of $\theta_{02}$. The ONE-STEP, TWO-STEP, KNOW-$\theta_2$ and KNOW-$\theta_2$-JOINT estimators of $\theta_{01}$ are equally efficient asymptotically.

Wooldridge (2005, Theorem 4.3) shows, under exogeneity and the further assumption that the original moment conditions satisfy the conditional information matrix equality, that the estimator based on the unweighted moment conditions is more efficient than the estimator based on the weighted moment conditions.
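The IPW construction of Examples 1.3.6 and 1.3.9 can be sketched end to end. Again this is illustrative only: a logit assignment model replaces the probit for self-containment, and the simulation design is assumed.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300_000

# Potential outcomes with true average treatment effect tau0 = 1.
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
Y0 = Z[:, 1] + rng.normal(size=n)          # mu00 = 0
Y1 = Z[:, 1] + 1.0 + rng.normal(size=n)    # mu01 = 1, so tau0 = mu01 - mu00 = 1

# Treatment assignment depends on Z only (ignorability of selection).
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
P0 = sigmoid(Z @ np.array([0.0, 0.8]))
S = (rng.uniform(size=n) < P0).astype(float)

# First stage: estimate the assignment model by ML (Newton on the logit score).
t2 = np.zeros(2)
for _ in range(25):
    p = sigmoid(Z @ t2)
    t2 -= np.linalg.solve(-(Z * (p * (1 - p))[:, None]).T @ Z, Z.T @ (S - p))
p_hat = sigmoid(Z @ t2)

# IPW moments of Example 1.3.6: S Y(1)/P identifies mu01, (1-S) Y(0)/(1-P) mu00.
mu1_hat = np.mean(S * Y1 / p_hat)
mu0_hat = np.mean((1 - S) * Y0 / (1 - p_hat))
tau_hat = mu1_hat - mu0_hat
print(tau_hat)                              # close to tau0 = 1
```

Here $Y(1)$ and $Y(0)$ are correlated with $Z$, so the unweighted group means would be biased; the inverse-probability weights restore the population moments.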
This is fine as far as it goes, but it does not rule out the possibility that using both could be more efficient than using either. Our next result does rule out this possibility.

Theorem 1.3.4 Suppose Assumption 1.3.4 holds. Then the optimal moment conditions in the selected population are the same as in the unselected population.

To see why this result is true, note that the optimal moment conditions in the unselected population are the following:

$$E[D(Z)'C(Z)^{-1} g(W, \theta_{01})] = 0, \qquad (1.25)$$

where $D(z) = E[\nabla_{\theta_1} g(W, \theta_1)|z]\big|_{\theta_1=\theta_{01}}$ and $C(z) = E[g(W, \theta_{01})\, g(W, \theta_{01})'|z]$. The optimal moment conditions in the selected population are:

$$E[S\, D(Z, S=1)'C(Z, S=1)^{-1} g(W, \theta_{01})] = 0, \qquad (1.26)$$

where $D(z, S=1) = E\{\nabla_{\theta_1} g(W, \theta_1)\big|_{\theta_1=\theta_{01}} \mid z, S=1\}$ and $C(z, S=1) = E\{g(W, \theta_{01})\, g(W, \theta_{01})' \mid z, S=1\}$. But $D(z, S=1) = D(z)$ by the ignorability assumption, and similarly $C(z, S=1) = C(z)$.

An implication of this result is that the weighted moment conditions are m-redundant for the estimation of $\theta_{01}$. More precisely, assuming that weighting was not part of the efficient estimation problem in the unselected population, it also plays no role in the efficient problem in the selected population. Thus in this circumstance we do not have to weight for reasons of consistency, and we also do not have to weight for reasons of efficiency.

Theorem 1.3.4 is a useful result, but it falls short of being the final word on efficiency. The question is whether the moment conditions in equation (1.17) capture all of the information in the exogeneity assumption. The first part of the exogeneity assumption is that $E[g(W, \theta_{01})|z] = 0$, and the efficient GMM estimator under this conditional moment restriction (with full observability) is well understood. The second part of the exogeneity assumption is the ignorability condition, and Theorem 1.3.1 shows that this makes the original conditional moment restriction valid in the selected sample as well.
More precisely, we then have $E[g(W, \theta_{01})|z, s] = 0$, and Theorem 1.3.4 gives the form of the efficient estimator under this conditional moment restriction. However, what is not clear is whether all of the information in the ignorability condition is captured by the extension of the original moment conditions to the selected population. We defined $P(z, \theta_{02}) = E(S|z)$, so that $E[S - P(z, \theta_{02})|z] = 0$. However, under ignorability, we have the stronger condition that $E[S - P(z, \theta_{02})|z, w] = 0$. The score function for estimation of $\theta_{02}$, as given in (1.23) above, will not be useful for estimation of $\theta_{01}$, because it is a function of $Z$ and $S$ only, and we have already used the optimal functions of $Z$ and $S$ in (1.26) above. The question is whether the fact that $E[S - P(z, \theta_{02})|w] = 0$ adds anything.

This question is complicated by the fact that $W$ is only observed when $S = 1$. If no part of $W$ (other than $Z$, if $Z$ is a subset of $W$) is always observed, we do not see any way to make use of the condition that $E[S - P(z, \theta_{02})|w] = 0$. However, now suppose that some subset of $W$ is always observed. Let $W_0$ be the part of $W$ which (i) is always observed, and (ii) is not part of $Z$. Then we can consider moment conditions of the form

$$E[k(W_0)(S - P(z, \theta_{02}))] = 0. \qquad (1.27)$$

These moment conditions are not useful for estimation of $\theta_{02}$, but they may be useful for estimation of $\theta_{01}$, if they are correlated with the original moment conditions. It is easy to see that they are not correlated with the unselected original moment conditions:

$$E[g(W, \theta_{01})\, k(W_0)'(S - P(Z, \theta_{02}))] = E\{E[S - P(Z, \theta_{02})|Z]\, E[g(W, \theta_{01})\, k(W_0)'|Z]\} = 0.$$

However, they are correlated with the selected original moment conditions:

$$E[S\, g(W, \theta_{01})\, k(W_0)'(S - P(Z, \theta_{02}))] = E\{E[S(S - P(Z, \theta_{02}))|Z]\, E[g(W, \theta_{01})\, k(W_0)'|Z]\} = E\{P(Z, \theta_{02})(1 - P(Z, \theta_{02}))\, E[g(W, \theta_{01})\, k(W_0)'|Z]\} \ne 0.$$

Thus the moment conditions in (1.27) may possibly be useful in estimation of $\theta_{01}$. We leave further exploration of this point for future work.
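A quick simulation illustrates the last two displays: the moment $k(W_0)[S - P(Z, \theta_{02})]$ is uncorrelated with the unselected original moments but correlated with the selected ones. The design below (scalar $Z$, $W_0$, $Y$, with $k$ taken to be the identity) is assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000

# W = (Z, W0, Y): Z drives selection; W0 is always observed and not part of Z.
Z = rng.normal(size=n)
W0 = rng.normal(size=n)
Y = Z + W0 + rng.normal(size=n)        # so E[g(W) k(W0)' | z] != 0 for g = Y - E(Y)

g = Y - 0.0                            # original moment at the truth, E g = 0
P = 1.0 / (1.0 + np.exp(-Z))           # P(z, theta02) = E(S|z)
S = (rng.uniform(size=n) < P).astype(float)

k = W0                                 # illustrative choice of k(W0) in (1.27)
extra = k * (S - P)                    # moment function E k(W0)[S - P(Z)] = 0

cov_unselected = np.mean(g * extra)    # ~ 0, by E[S - P | z, w] = 0
cov_selected = np.mean(S * g * extra)  # E{P(1-P) E[g k | Z]}, nonzero here
print(cov_unselected, cov_selected)
```

The nonzero selected-sample covariance is exactly what makes (1.27) a candidate source of extra efficiency for $\theta_{01}$.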
1.4 Concluding remarks

We summarize relative efficiency results for four alternative GMM estimators of a parameter vector that enters into one set of moment conditions along with another vector that also enters into an additional set of conditions and may be known. We provide formal statements and proofs of the efficiency claims and spell out conditions under which some knowledge may be redundant. If the two sets of moment conditions are uncorrelated and the expected derivative of the first set with respect to the additional parameter vector is zero, both the additional moment conditions and the knowledge of the additional parameters are redundant. These are the strongest sufficient conditions we consider. The weaker condition of moment uncorrelatedness is sufficient for redundancy of the extra moment conditions when the additional parameters are known, and for equal efficiency of the multi-step and one-step estimators under exact identification of the original set of moment conditions. The condition of zero expected derivative of the original set of moments with respect to the additional parameter vector turns out to be sufficient for no influence of the first-step estimation on the second-step standard errors in very general settings. We provide a sufficient condition for equal relative efficiency of the estimator that treats the additional parameters as known using the full set of moment conditions and the estimator that involves estimating both parameter vectors.

We apply these results to a general missing data problem after showing that the weighted and unweighted GMM estimators on the selected sample preserve the desired asymptotic properties under reasonable assumptions. We explain the counterintuitive result that estimating selection probabilities dominates using known probabilities if this knowledge is available. It turns out that this is an outcome of ignoring the moment conditions that characterize the selection process.
Interestingly, however, a proper use of such knowledge along with known selection probabilities turns out to be as good as estimating the probabilities using the same moment conditions. Redundancy of the parameter knowledge applies. We show that this redundancy result is driven by two factors: the ignorability assumption on selection and the use of the score function in estimation of the selection probabilities. The ignorability condition says that the first-stage score function for the conditional likelihood $f(s|z)$ is in fact the score function for the conditional likelihood $f(s|z, w)$, and thus the GCIME can be applied, producing the sufficient condition for parameter knowledge redundancy.

When selection is based on variables that are exogenous with respect to a correctly specified feature of the conditional distribution, any function of the exogenous variables can be used as a weight in the weighted GMM estimation. This implies two interesting results. First, weighted GMM estimation on the selected sample is robust to selection model misspecification. Second, using both weighted and unweighted moment conditions dominates using only one of them, unless the original moment function incorporates the optimal weights in the first place. No efficiency improvements are possible in that case.

Besides the examples we give, the following specific missing data problems can be studied in the framework of Section 1.2: using auxiliary data to estimate probabilities of selection (see Hellerstein and Imbens, 1999; Nevo, 2002, 2003), weighting by nonparametric estimates of propensity scores in estimation of average treatment effects (see Hirano et al., 2003), estimating weights for choice-based samples in pseudo-MLE settings (see Manski and Lerman, 1977; Manski and McFadden, 1981; Cosslett, 1981a,b; Imbens, 1992), and EL and GMM estimation for stratified samples with possibly known sampling or population frequencies (see Tripathi, 2003).

Bibliography

AHN, S. AND P.
SCHMIDT (1995): "A separability result for GMM estimation, with applications to GLS prediction and conditional moment tests," Econometric Reviews, 14, 19-34.

BREUSCH, T., H. QIAN, P. SCHMIDT, AND D. WYHOWSKI (1999): "Redundancy of moment conditions," Journal of Econometrics, 91, 89-111.

COSSLETT, S. R. (1981a): "Efficient estimation of discrete-choice models," in Structural Analysis of Discrete Data with Econometric Applications, ed. by C. F. Manski and D. L. McFadden, Cambridge: The MIT Press, 51-111.

(1981b): "Maximum likelihood estimator for choice-based samples," Econometrica, 49, 1289-1316.

CREPON, B., F. KRAMARZ, AND A. TROGNON (1997): "Parameters of interest, nuisance parameters and orthogonality conditions: An application to autoregressive error component models," Journal of Econometrics, 82, 135-156.

GOLDBERGER, A. (1972): "Maximum likelihood estimation of regressions containing unobservable independent variables," International Economic Review, 13, 1-15.

HAHN, J. (1998): "On the role of the propensity score in efficient semiparametric estimation of average treatment effects," Econometrica, 66, 315-331.

HANSEN, L. (1982): "Large sample properties of generalized method of moments estimators," Econometrica, 50, 1029-1054.

HECKMAN, J. J., H. ICHIMURA, AND P. TODD (1998): "Matching as an econometric evaluation estimator," The Review of Economic Studies, 65, 261-294.

HELLERSTEIN, J. K. AND G. W. IMBENS (1999): "Imposing moment restrictions from auxiliary data by weighting," The Review of Economics and Statistics, 81, 1-14.

HIRANO, K., G. IMBENS, AND G. RIDDER (2003): "Efficient estimation of average treatment effects using the estimated propensity score," Econometrica, 71, 1161-1189.

HORVITZ, D. AND D. THOMPSON (1952): "A generalization of sampling without replacement from a finite universe," Journal of the American Statistical Association, 47, 663-685.

IMBENS, G. W.
(1992): "An efficient method of moments estimator for discrete choice models with choice-based sampling," Econometrica, 60, 1187-1214.

LITTLE, R. J. A. AND D. B. RUBIN (2002): Statistical Analysis with Missing Data, Wiley Series in Probability and Statistics, Wiley-Interscience, 2nd ed.

MANSKI, C. F. AND S. R. LERMAN (1977): "The estimation of choice probabilities from choice based samples," Econometrica, 45, 1977-1988.

MANSKI, C. F. AND D. L. MCFADDEN (1981): "Alternative estimators and sample designs for discrete choice analysis," in Structural Analysis of Discrete Data with Econometric Applications, ed. by C. F. Manski and D. L. McFadden, The MIT Press, 2-50.

NEVO, A. (2002): "Sample selection and information-theoretic alternatives to GMM," Journal of Econometrics, 107, 149-157.

(2003): "Using weights to adjust for sample selection when auxiliary information is available," Journal of Business and Economic Statistics, 21, 43-53.

NEWEY, W. (1984): "A method of moments interpretation of sequential estimators," Economics Letters, 14, 201-206.

NEWEY, W. AND D. MCFADDEN (1994): "Large sample estimation and hypothesis testing," in Handbook of Econometrics, ed. by R. Engle and D. McFadden, vol. IV, 2113-2241.

PAGAN, A. (1984): "Econometric issues in the analysis of regressions with generated regressors," International Economic Review, 25, 221-247.

QIAN, H. AND P. SCHMIDT (1999): "Improved instrumental variables and generalized method of moments estimators," Journal of Econometrics, 91, 145-169.

ROBINS, J. M. AND A. ROTNITZKY (1995): "Semiparametric efficiency in multivariate regression models with missing data," Journal of the American Statistical Association, 90, 122-129.

ROSENBAUM, P. R. (1987): "Model-based direct adjustment," Journal of the American Statistical Association, 82, 387-394.

ROSENBAUM, P. R. AND D. B. RUBIN (1983): "The central role of the propensity score in observational studies for causal effects," Biometrika, 70, 41-55.

TRIPATHI, G.
(2003): "GMM and empirical likelihood with stratified data," Working Paper, University of Wisconsin.

WOOLDRIDGE, J. (1999): "Asymptotic properties of weighted M-estimators for variable probability samples," Econometrica, 67, 1385–1406.

(2001): "Asymptotic properties of weighted M-estimators for standard stratified samples," Econometric Theory, 17, 451–470.

(2002a): Econometric Analysis of Cross Section and Panel Data, Cambridge, Mass.: MIT Press.

(2002b): "Inverse probability weighted M-estimators for sample selection, attrition and stratification," Portuguese Economic Journal, 1, 117–139.

(2003): "Inverse probability weighted estimation for general missing data problems," Working Paper, Michigan State University, www.msu.edu/~ec/faculty/wooldridge/current%20research/wght2r6.pdf.

(2005): "Inverse probability weighted estimation for general missing data problems," Working Paper, Michigan State University, www.msu.edu/~ec/faculty/wooldridge/current%20research/wght2r7.pdf.

ZELLNER, A. (1970): "Estimation of regression relationships containing unobservable independent variables," International Economic Review, 11, 441–454.

Appendix: Proofs

PROOF OF THEOREM 1.2.1: Proofs are given, e.g., in Theorems 2.6 and 3.4 of Newey and McFadden (1994). Also, see Hansen (1982). Condition (i) is the identification assumption. Conditions (ii) and (iv) are needed for consistency, and conditions (iii)-(v) are needed for asymptotic normality: conditions (iv) and (v) ensure that the objective function in (1.2) and its first derivative, respectively, converge uniformly to their population analogues, and condition (vi) provides for invertibility of a part of the mean-value expansion. Some of the conditions can be relaxed at the expense of complicating the proofs. □

PROOF OF THEOREM 1.2.2: Equations (1.6), (1.8), and (1.9) follow from the standard asymptotic variance derivation for GMM estimation using the optimal weighting matrix (see, e.g., p.
2148 of Newey and McFadden, 1994; Hansen, 1982, Theorems 3.1 and 3.2). Equation (1.7) is obtained similarly, but we separately expand the first order conditions corresponding to (A) and (B).

The TWO-STEP estimator of θ02 minimizes h2(θ2)'C22^{-1}h2(θ2). The first order conditions that the estimator solves are D22'C22^{-1}h2(θ̂2) = 0. Expanding around θ02 gives

θ̂2 − θ02 = −(D22'C22^{-1}D22)^{-1} D22'C22^{-1} h̄2(θ02) + op(N^{-1/2}).   (1.28)

The TWO-STEP estimator of θ01 minimizes h1(θ1, θ̂2)'C11^{-1}h1(θ1, θ̂2). The first order conditions that the estimator solves are D11'C11^{-1}h1(θ̂1, θ̂2) = 0. Expanding around θ01 and using (1.28) gives

θ̂1 − θ01 = −(D11'C11^{-1}D11)^{-1} D11'C11^{-1} h̄1(θ01, θ02)
          + (D11'C11^{-1}D11)^{-1} D11'C11^{-1} D12 (D22'C22^{-1}D22)^{-1} D22'C22^{-1} h̄2(θ02) + op(N^{-1/2}).   (1.29)

On multiplying by √N and combining (1.28)-(1.29), we get

V_TWO-STEP = B C B',   (1.30)

where C is defined in (1.4) and

B = [ B11  B12
       0   B22 ]   (1.31)

with

B11 = −(D11'C11^{-1}D11)^{-1} D11'C11^{-1},
B12 = (D11'C11^{-1}D11)^{-1} D11'C11^{-1} D12 (D22'C22^{-1}D22)^{-1} D22'C22^{-1},   (1.32)
B22 = −(D22'C22^{-1}D22)^{-1} D22'C22^{-1}.  □

PROOF OF THEOREM 1.2.3: Statements 1 and 4 are proved on p. 148 of Qian and Schmidt (1999), where it is shown that there is no gain in efficiency if and only if D11'C11^{-1}C12 = 0. When the original problem is exactly identified (m1 = p1) and D11 is non-singular (by assumption), this holds if and only if C12 = 0. If the original problem is overidentified (m1 > p1), then the condition C12 = 0 is sufficient for no gain in efficiency.

To prove Statement 3, first note that BD = −I, where I is the (p1 + p2)-dimensional identity matrix. Then,

V_TWO-STEP − V_ONE-STEP = BCB' − (D'C^{-1}D)^{-1}   (1.33)
  = BCB' − BD(D'C^{-1}D)^{-1}D'B'
  = BC^{1/2} [I − C^{-1/2}D(D'C^{-1}D)^{-1}D'C^{-1/2}] C^{1/2}B'.

The matrix in brackets is the projection orthogonal to C^{-1/2}D, which is positive semidefinite.

V_ONE-STEP for θ1 is of the form (D11'C^{11}D11 − M12 M22^{-1} M21)^{-1}, where superscripts on C denote the corresponding blocks of C^{-1}, M12 = M21' = D11'C^{11}D12 + D11'C^{12}D22, and M22 is the lower right p2-block of D'C^{-1}D, which is positive semidefinite.
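The identity BD = −I invoked for Statement 3 follows mechanically from the block formulas (1.31)-(1.32); a small numerical sketch confirms it (the dimensions and the randomly drawn C and D blocks here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, m1, m2 = 2, 3, 4, 5  # illustrative block dimensions (over-identified)

# Random full-rank derivative blocks and positive definite weighting blocks
D11 = rng.normal(size=(m1, p1))
D12 = rng.normal(size=(m1, p2))
D22 = rng.normal(size=(m2, p2))
A1 = rng.normal(size=(m1, m1)); C11 = A1 @ A1.T + m1 * np.eye(m1)
A2 = rng.normal(size=(m2, m2)); C22 = A2 @ A2.T + m2 * np.eye(m2)

inv = np.linalg.inv
H1 = inv(D11.T @ inv(C11) @ D11) @ D11.T @ inv(C11)  # (D11'C11^-1 D11)^-1 D11'C11^-1
H2 = inv(D22.T @ inv(C22) @ D22) @ D22.T @ inv(C22)

# Blocks of B as in (1.31)-(1.32)
B11, B12, B22 = -H1, H1 @ D12 @ H2, -H2
B = np.block([[B11, B12], [np.zeros((p2, m1)), B22]])
D = np.block([[D11, D12], [np.zeros((m2, p1)), D22]])

# BD = -I, the key step in the proof of Statement 3
assert np.allclose(B @ D, -np.eye(p1 + p2))
```

The check works for any conformable blocks because H1 D11 = I and H2 D22 = I, so the off-diagonal contributions −H1 D12 and H1 D12 H2 D22 cancel exactly.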
Hence, the inverse of V_ONE-STEP for θ1 minus the inverse of V_KNOW-θ2-JOINT is negative semidefinite. Thus, V_ONE-STEP for θ1 is no smaller than V_KNOW-θ2-JOINT in the positive definite sense, which proves Statement 2. Further, KNOW-θ2-JOINT and ONE-STEP are equally efficient if M12 = 0, and M12 = D11'[C^{11}D12 + C^{12}D22]. This fact, along with the fact that (C^{11})^{-1}C^{12} = −C12 C22^{-1}, implies that if D12 = C12 C22^{-1} D22 then M12 = 0, which proves Statement 10.

Statement 11 can be proved in two parts. First, since M12 = 0, the inverse of V_ONE-STEP for θ1 is simply D11'C^{11}D11, which is generally greater than V_KNOW-θ2^{-1} = D11'C11^{-1}D11 in the positive definite sense, since C^{11} − C11^{-1} is positive semidefinite. This, along with Statement 10, implies that ONE-STEP and KNOW-θ2-JOINT are no less efficient for θ1 than KNOW-θ2. Second, to prove that TWO-STEP is no less efficient for θ1 than KNOW-θ2, note that, by (1.30)-(1.32), V_TWO-STEP for θ1 is equal to B11C11B11' + B12C21B11' + B11C12B12' + B12C22B12'. Also note that B11C11B11' = (D11'C11^{-1}D11)^{-1} and that, under D12 = C12C22^{-1}D22, the symmetric positive semidefinite matrices −B12C21B11' and −B11C12B12' are equal to B12C22B12'. V_TWO-STEP for θ1 therefore reduces to V_KNOW-θ2 minus a positive semidefinite matrix, which completes the second part of the proof.

Statements 7–9 follow from Theorem 1 of Ahn and Schmidt (1995) and the subsequent discussion (pp. 21–22). Statement 5 holds since if D12 = 0 then (1.7) reduces to (D11'C11^{-1}D11)^{-1}, which is equal to (1.8). Statement 6 follows from Statements 4 and 10 and a trivial comparison of the variances in (1.7) and (1.6) under the given conditions. □

PROOF OF THEOREM 1.3.1: Follows trivially from Lemma 1.3.1 and part (ii) of Assumption 1.3.4. □

PROOF OF THEOREM 1.3.2: (a) First, note that, by ignorability and (1.23), E[S·h2(S, z; θ02)'|z] can be written as

E[ S(S − P(z, θ02)) / (P(z, θ02)(1 − P(z, θ02))) | z ] ∇θ2 P(z, θ2)|θ2=θ02 = ∇θ2 P(z, θ2)|θ2=θ02,

since E(S²|z) = E(S|z) and E(S|z) = P(z, θ02). This is nonzero in general. Second, E[g(W; θ01)|z] ≠ 0 in general. Finally,

C12 = E h1(W*; θ01, θ02) h2(S, z; θ02)'
    = E{ E[g(W; θ01)|z] E[S·h2(S, z; θ02)'|z] / P(z, θ02) },  by ignorability   (1.34)
    = E[ E(g(W; θ01)|z) / P(z, θ02) · ∇θ2 P(z, θ2)|θ2=θ02 ],  by the LIE,

which is generally non-zero.

(b) Follows by the (generalized) information matrix equality, where h2(·) is the score, D22 is the expected Hessian, C22 is the expected outer product of the score, D12 is the expected derivative of h1 with respect to θ2 evaluated at θ02, and C12 is the covariance of h1 with the score. One may also write

D12 = E{ ∇θ2 [S/P(z, θ2)]|θ2=θ02 · g(W; θ01)' },  by (1.22)
    = −E[ S g(W; θ01) / P(z, θ02)² · ∇θ2 P(z, θ2)|θ2=θ02 ]
    = −E[ E(S|z) E(g(W; θ01)|z) / P(z, θ02)² · ∇θ2 P(z, θ2)|θ2=θ02 ],  by the LIE
    = −E[ E(g(W; θ01)|z) / P(z, θ02) · ∇θ2 P(z, θ2)|θ2=θ02 ],  since E(S|z) = P(z, θ02)
    = −C12,  by (1.34).  □

PROOF OF THEOREM 1.3.3: (a) By the LIE and exogeneity,

C12 = E[ S/P(z, θ02) · g(Y, X; θ01) u(S, z; θ02)' ]
    = E E[ S/P(z, θ02) · g(Y, X; θ01) u(S, z; θ02)' | z ]
    = E{ E[g(Y, X; θ01)|z] E[ S/P(z, θ02) · u(S, z; θ02)' | z ] } = 0.

(b) By the LIE and exogeneity,

D12 = E[ ∇θ2 (S/P(z, θ2))|θ2=θ02 · g(Y, X; θ01) ]
    = E{ E[g(Y, X; θ01)|z] E[ ∇θ2 (S/P(z, θ2))|θ2=θ02 | z ] } = 0.  □

Essay 2

Robustness, Redundancy, and Validity of Copulas in Likelihood Models

2.1 Introduction

In multivariate economic models, one is often ready to assume marginal distributions but is reluctant to impose a joint distribution. For example, in a panel setting, economists often use a specific likelihood for each cross section separately (e.g., PROBIT or LOGIT) but avoid modelling the joint distribution of the cross-sections over time. Similarly, in selectivity models, it is often desirable to allow for unrestricted dependence between the disturbances in the primary and the selection models, each of which has a well-defined likelihood.
The usual way to handle the indeterminacy of the joint distribution is to assume independence of the marginal distributions and employ quasi-MLE, or to assume joint normality and employ pseudo-MLE (e.g., White, 1982; Gourieroux et al., 1984). In certain cases these approaches result in consistent estimation, while a "sandwich" covariance matrix may be used for valid inference.

However, these approaches suffer from major weaknesses. First, there are important cases when using a pseudo-likelihood does not result in consistent estimates. Greene (2002, Section 17.9) and Wooldridge (2002, Chapter 13) discuss such cases. Second, as we show below, there are estimators that dominate the traditional QMLE under non-independence.

The copula approach used here allows one to replace normality or independence with an alternative assumption about the joint distribution. Clearly, such a replacement is only warranted if the new distribution possesses some useful properties, such as ease of computation, robustness to misspecification, and improved efficiency. Arguably, copulas (or at least some of their families) have such properties in certain econometric models. The copula approach also incorporates multivariate normality and independence as special cases.

The copula approach is relatively new to econometrics. A note by Lee (1983) appears to be the earliest application of this approach in econometrics. Copulas have recently received a lot of attention in the finance literature. They are used to model dependence in financial time series (e.g., Patton, 2001; Breymann et al., 2003) and in risk management applications (e.g., Embrechts et al., 2002, 2003). Bouyé et al. (2000) provide an extensive discussion of the prospects for copulas in finance. Use of copulas in other subfields of econometrics still appears rather limited. Smith (2003) incorporates a copula in selectivity models and provides applications to labor supply and duration of hospitalization; Cameron et al. (2004) use a copula to develop a bivariate count data model with an application to the number of doctor visits.

We start by presenting some basics on copulas. This is done in Section 2.2. Section 2.3 introduces the GMM representation of the likelihood-based models used in the sequel. We show that imposing a joint distribution amounts to adding moment conditions. Imposing moment conditions makes consistency of the resulting estimator conditional on the moment validity. Moreover, there are infinitely many alternative multivariate distributions that can be used. Section 2.4 shows that estimation of means remains robust against copula misspecification as long as the copula used and the true joint density share a symmetry property. A simple simulation employs the most commonly used copula families to study their robustness properties. It is well known that additional moment conditions cannot reduce asymptotic efficiency if properly used. However, sometimes the additional moments do not help even if properly used, i.e., they are redundant in the sense of Breusch et al. (1999). In Section 2.5 we develop conditions for such redundancy. Section 2.6 proposes tests of copula validity that can help in deciding on the copula. Section 2.7 concludes.

2.2 Preliminaries

Definition 2.2.1 (Nelsen, 1999, p. 40) An M-dimensional copula is a function C : [0,1]^M → [0,1] that has the following properties:

i. C(u1, ..., u_{m−1}, 0, u_{m+1}, ..., uM) = 0, m = 1, ..., M.

ii. C(1, ..., 1, um, 1, ..., 1) = um, m = 1, ..., M.

iii. C is M-increasing: for every M-box B = [a1, b1] × [a2, b2] × ··· × [aM, bM] whose 2^M vertices (c1, ..., cM) are in [0,1]^M, the C-volume of B, defined by

V_C(B) = Σ_{i1=1}^{2} ··· Σ_{iM=1}^{2} (−1)^{i1+···+iM} C(c_{1 i1}, ..., c_{M iM}),

where c_{j1} = aj and c_{j2} = bj for all j ∈ {1, ..., M}, satisfies V_C(B) ≥ 0.

Property (iii) implies for M = 2 that C(a1, a2) − C(a1, b2) − C(b1, a2) + C(b1, b2) ≥ 0 for any vectors (a1, a2), (b1, b2) ∈ [0,1]² such that am ≤ bm, m = 1, 2, i.e., C(a, b) is non-decreasing in (a, b).
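The three defining properties are easy to check numerically for a concrete family. The sketch below is a hypothetical illustration (not part of the essay's own simulations) using the Farlie-Gumbel-Morgenstern family, which appears later in the essay; its standard closed form C(u, v; ρ) = uv[1 + ρ(1 − u)(1 − v)] is assumed here:

```python
import numpy as np

def fgm(u, v, rho=0.5):
    # Farlie-Gumbel-Morgenstern copula; a valid copula for rho in [-1, 1]
    return u * v * (1.0 + rho * (1.0 - u) * (1.0 - v))

rng = np.random.default_rng(1)
u = rng.uniform(size=1000)

# Property (i): grounded -- C is zero when any argument is zero
assert np.allclose(fgm(u, 0.0), 0.0) and np.allclose(fgm(0.0, u), 0.0)

# Property (ii): uniform margins -- C(u, 1) = u and C(1, v) = v
assert np.allclose(fgm(u, 1.0), u) and np.allclose(fgm(1.0, u), u)

# Property (iii): 2-increasing -- the C-volume of every box is non-negative
a1, b1 = np.sort(rng.uniform(size=(2, 100000)), axis=0)  # a1 <= b1 elementwise
a2, b2 = np.sort(rng.uniform(size=(2, 100000)), axis=0)
vol = fgm(b1, b2) - fgm(b1, a2) - fgm(a1, b2) + fgm(a1, a2)
assert (vol >= -1e-12).all()
```

For FGM the 2-increasing property holds because the copula density 1 + ρ(1 − 2u)(1 − 2v) is bounded below by 1 − |ρ| ≥ 0.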
It follows from the definition that an M-dimensional copula C is an M-dimensional cdf whose M univariate marginals are uniform on [0,1]. One may also note that for any M-dimensional copula C, M ≥ 3, each m-marginal of C, 2 ≤ m < M, is an m-dimensional copula.

The following well-known theorem establishes the existence of such a function for any joint distribution function of random variables. We restate it without proof.

Theorem 2.2.1 (Sklar, 1959, pp. 229–230) Let H be an M-dimensional distribution function with marginals F1, ..., FM. Then there exists an M-dimensional copula C such that for all xm ∈ R, m = 1, ..., M,

H(x1, ..., xM) = C(F1(x1), ..., FM(xM)).   (2.1)

If F1, ..., FM are continuous, then C is unique. Conversely, if C is an M-dimensional copula and F1, ..., FM are distribution functions, then the function H in (2.1) is an M-dimensional distribution function with marginals F1, ..., FM.

Thus, a copula is a multivariate distribution function that connects two or more marginal distributions to exactly form the joint distribution. A copula thus completely parameterizes the entire dependence structure between two or more random variables. It is important to note that a given joint distribution function H defines a unique set of marginal distribution functions Fm, m = 1, ..., M, whereas given marginal distributions do not determine a unique joint distribution (and the implied copula).

To connect copulas to likelihood-based models, let h and c be the densities of the distribution functions H and C, respectively, and let fm be the densities of the marginal distribution functions Fm, m = 1, ..., M. Then,

h(x1, ..., xM) = ∂^M H(x1, ..., xM)/∂x1···∂xM = ∂^M C(F1(x1), ..., FM(xM))/∂x1···∂xM = c(F1(x1), ..., FM(xM)) Π_{m=1}^{M} fm(xm).   (2.2)

In the bivariate case with marginal densities f1(·; θ) and f2(·; θ) and copula density c(·, ·; ρ), the moment conditions in (2.3) comprise the marginal score conditions (A) and (B) and the copula score conditions

(C) E ∂/∂θ ln c(F1(X1; θ0), F2(X2; θ0); ρ0) = 0,
(D) E ∂/∂ρ ln c(F1(X1; θ0), F2(X2; θ0); ρ0) = 0.

By the WLLN, the sample analogue δ̄(θ0, ρ0) = (1/T) Σ_{t=1}^{T} ∂/∂(θ, ρ) ln k_t →p 0, since (C) and (D) hold in the population. Moreover, by the WLLN, for any misspecified copula for which (2.4) holds, δ̄(θ0, ρ0^k) →p 0. However, for non-robust copulas, the probability limit may be non-zero.
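Sklar's theorem in density form, equation (2.2), can be illustrated for the Gaussian copula with standard Normal margins, where the right-hand side must reproduce the bivariate Normal density exactly. A hypothetical numerical check (the grid and ρ = 0.3 are arbitrary choices):

```python
import numpy as np
from scipy import stats

rho = 0.3
biv = stats.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])

def gaussian_copula_density(u, v):
    # c(u, v; rho) = phi_2(q1, q2; rho) / (phi(q1) * phi(q2)), q_i = Phi^{-1}(u_i)
    q1, q2 = stats.norm.ppf(u), stats.norm.ppf(v)
    return biv.pdf(np.column_stack([q1, q2])) / (stats.norm.pdf(q1) * stats.norm.pdf(q2))

x1 = np.linspace(-2.0, 2.0, 41)
x2 = 0.7 * np.ones_like(x1)              # an arbitrary slice of the plane
h = biv.pdf(np.column_stack([x1, x2]))   # the true joint density
factorized = (gaussian_copula_density(stats.norm.cdf(x1), stats.norm.cdf(x2))
              * stats.norm.pdf(x1) * stats.norm.pdf(x2))
assert np.allclose(h, factorized)        # h = c(F1, F2) * f1 * f2, as in (2.2)
```

The identity holds up to floating-point error at every grid point, since the Gaussian copula is exactly the copula implied by the bivariate Normal with Normal margins.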
In order to be able to compare copulas, we define a common measure of dependence. There are very many such measures (see Nelsen, 1999, Section 5). We pick one that has a simple copula representation.

Definition 2.4.3 For any two continuous random variables U and V whose copula is K, Kendall's τ measure of concordance is given by

τ = 4 ∫∫_{[0,1]²} K(u, v; ρ) dK(u, v; ρ) − 1.   (2.13)

It follows from (2.13) that

τ = 4 ∫∫_{[0,1]²} K(u, v; ρ) k(u, v; ρ) du dv − 1 = 4 E K(U, V; ρ) − 1.   (2.14)

For two random variables, Kendall's τ can be viewed as the probability that "large" ("small") values of one are associated with "large" ("small") values of the other (the probability of concordance) minus the probability that "large" ("small") values of one are associated with "small" ("large") values of the other (the probability of discordance).

Importantly, various copulas cover unequal ranges of dependence as measured by Kendall's τ (see Appendix A). We therefore control for τ in all one-parameter copulas. In the simulation, we use the fact (see, e.g., Kendall, 1949) that for the Normal copula with Normal margins, Pearson's correlation coefficient r is related to τ by

r = sin(πτ/2).   (2.15)

This allows us to derive the value of Kendall's τ that corresponds to the true value of Pearson's correlation coefficient r employed in simulating the joint Normal distribution.

We employ the following procedure:

Step 1. Generate T realizations of (X1, X2)' ~ N( (μ, μ)', [1 r; r 1] ) by generating Z ~ N( (0, 0)', I2 ) and using the Cholesky decomposition.

Step 2. For each realization t, calculate
  u_it(μ) = Φ(X_it − μ), i = 1, 2, where Φ(·) is the Standard Normal c.d.f.;
  k_t(μ, ρ) ≡ k(u_1t(μ), u_2t(μ); ρ);
  δ^μ_t(μ, ρ) ≡ ∂/∂μ ln k_t(μ, ρ) and δ^ρ_t(μ, ρ) ≡ ∂/∂ρ ln k_t(μ, ρ).

Step 3. Calculate the sample averages

δ̄(μ, ρ) ≡ (1/T) Σ_{t=1}^{T} δ_t(μ, ρ).

Step 4. Plot the resultant functions δ̄^μ(μ, ρ) and δ̄^ρ(μ, ρ) over a relevant range of μ and ρ.

Step 5. Evaluate the sample means δ̄^μ and δ̄^ρ and the sample standard errors se(δ̄^μ) = s^μ/√T and se(δ̄^ρ) = s^ρ/√T at the true parameter values μ = μ0 and ρ = ρ0^k, where

s² = Σ_{t=1}^{T} ( δ_t(μ0, ρ0^k) − δ̄(μ0, ρ0^k) )² / (T − 1).

Table 2.1: The true values for Kendall's τ and ρ used in the simulation

  Copula                            ρ0         τ0
  Independence                      —          0
  Logistic                          —          1/3
  Farlie-Gumbel-Morgenstern (FGM)   0.872880   0.193973
  Joe                               1.426845   0.193973
  Ali-Mikhail-Haq (AMH)             0.697058   0.193973
  Clayton                           0.481321   0.193973
  Gumbel                            1.240654   0.193973
  Frank                             1.801160   0.193973
  Normal with Normal margins        0.3        0.193973

The true parameter values in Step 1 are μ0 = 0 and r0 = 0.3. We use (2.15) to calculate the true τ and then we use (2.14) to derive the value of ρ corresponding to the true value of τ for each copula. We consider the independence, Logistic, Farlie-Gumbel-Morgenstern, Joe, Ali-Mikhail-Haq, Clayton, Gumbel, Frank and Normal copulas. For some of these copulas it is possible to obtain an analytical solution for ρ in terms of τ using (2.14) (see Appendix A); otherwise, we use numerical methods to approximate the true value of ρ with the desired accuracy. Note that the independence, Farlie-Gumbel-Morgenstern, Frank and Normal families are radially symmetric. Table 2.1 contains the true values of τ and ρ for the considered families of copulas. We choose r0 = 0.3 because the corresponding τ0 lies within the coverage of all the one-parameter copula families we consider. Note that the two parameter-free copulas, independence and Logistic, imply dependence measures that are different from the true one.

Figures 2.1 through 2.8 of Appendix C contain the plots of δ̄^μ(μ, ρ) and δ̄^ρ(μ, ρ) obtained in Step 4. The sample size used for the plots is 200. According to Figure 2.1, the independence copula is robust: the copula term is identically zero even though the marginal terms are not independent. The copula term for the Logistic copula is zero for a value of μ around 0.33.
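The mapping between τ and the dependence parameters in Table 2.1 can be reproduced in a few lines. This sketch (the sample size and seed are arbitrary; scipy's `kendalltau` is used for the sample estimate) checks relation (2.15) for the Normal copula: (2/π) arcsin(0.3) gives the τ0 = 0.193973 reported in Table 2.1, and the sample Kendall's τ of simulated joint Normal data agrees:

```python
import numpy as np
from scipy import stats

r = 0.3                                   # Pearson correlation used in Step 1
tau_theory = 2.0 / np.pi * np.arcsin(r)   # inverse of (2.15): tau = (2/pi) arcsin(r)
print(round(tau_theory, 6))               # 0.193973, the tau_0 of Table 2.1

rng = np.random.default_rng(42)
n = 20000
z = rng.multivariate_normal([0.0, 0.0], [[1.0, r], [r, 1.0]], size=n)
tau_hat, _ = stats.kendalltau(z[:, 0], z[:, 1])
assert abs(tau_hat - tau_theory) < 0.02   # sample tau matches the theoretical value
```

The same approach, with (2.14) evaluated by quadrature or simulation, recovers the ρ0 column of Table 2.1 for families without a closed-form τ(ρ).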
Figures 2.2–2.8 illustrate how the one-parameter copulas compare in terms of robustness. Note that all the surfaces appear to intersect the zero plane at around the true values of the parameters, which suggests general robustness. As we show below, however, one cannot accept the hypothesis of zero δ for all copula families. The benchmark for comparisons is the Normal copula (Figure 2.7). Interestingly, the sample analogue of the Normal copula moment (C) is close to zero at the true value of μ for any value of ρ, and at ρ = 0 for any value of μ (panel 6a). The FGM, AMH and Frank families display a similar feature (panels 1a, 3a and 7a). Clearly, when ρ = 0, these four families of copulas reduce to the independence copula, which is known to be robust. When ρ ≠ 0, δ̄^μ is still close to zero at the true μ0. This observation suggests robustness of the FGM, AMH and Frank families. With these copulas, one can use the copula moment (C) with any assumed ρ and obtain a consistent estimate of μ. The other families do not exhibit this advantage.

Of course, the FGM and Frank families of copulas are RS. The observed robustness of these families is clearly a consequence of the theoretical result in the previous section. However, the AMH family is not RS. Why is the AMH copula robust? To answer this question, write the AMH copula as an infinite sum of a geometric sequence:

C(u, v) = uv / [1 − ρ(1 − u)(1 − v)] = uv Σ_{k=0}^{∞} [ρ(1 − u)(1 − v)]^k.   (2.16)

The FGM copula is then the first-order approximation to the AMH family, which explains the similar robustness.

To test the features illustrated in the figures, in Step 5 we calculate δ̄^μ and δ̄^ρ at the true parameter values μ = μ0 = 0 and ρ = ρ0 and evaluate standard errors for these averages. Table 2.2 shows these values along with the estimated Pearson's correlation coefficient r̂ as the sample size grows from 200 to 30,000. The ratio of the sample average to the standard error in parentheses is a test statistic.
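Two of the claims made here for the FGM family can be checked directly: that it is the first-order truncation of the AMH expansion (2.16), and that the sample analogue of the copula moment (C) is centered at zero at the true μ even under a misspecified dependence parameter. A minimal sketch (a finite-difference derivative stands in for the analytic score δ^μ; the sample size, seed, and tolerances are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# --- 1. FGM is the first-order approximation to AMH, as in (2.16) ---
def amh(u, v, rho):
    return u * v / (1.0 - rho * (1.0 - u) * (1.0 - v))

def fgm_cdf(u, v, rho):
    return u * v * (1.0 + rho * (1.0 - u) * (1.0 - v))

u, v = rng.uniform(size=(2, 1000))
rho_small = 0.1  # for small rho the first-order term dominates
assert np.max(np.abs(amh(u, v, rho_small) - fgm_cdf(u, v, rho_small))) < rho_small**2

# --- 2. Sample analogue of copula moment (C) for the FGM copula ---
# FGM log-density: ln k = ln(1 + rho*(1-2u1)(1-2u2)); data are joint Normal,
# rho^k = 0.87288 is the FGM value from Table 2.1 (deliberately misspecified
# relative to the Normal dependence r = 0.3).
T, mu0, r = 200000, 0.0, 0.3
x = rng.multivariate_normal([mu0, mu0], [[1.0, r], [r, 1.0]], size=T)

def mean_log_copula(m, rho):
    u1, u2 = stats.norm.cdf(x[:, 0] - m), stats.norm.cdf(x[:, 1] - m)
    return np.mean(np.log1p(rho * (1 - 2*u1) * (1 - 2*u2)))

eps = 1e-4  # two-sided numerical derivative with respect to mu
delta_mu = (mean_log_copula(mu0 + eps, 0.87288)
            - mean_log_copula(mu0 - eps, 0.87288)) / (2 * eps)
assert abs(delta_mu) < 0.01  # the RS copula score is centered at the true mu
```

The second check illustrates the robustness result: the FGM copula moment has mean zero at μ0 for any assumed ρ because both the copula and the true joint density are radially symmetric.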
Under H0 : δ = 0, it is asymptotically standard Normal. The table entries for the Logistic copula are significantly different from zero. This copula is not RS, and it implies a different measure of dependence (τ = 1/3). This suggests running the same simulation with a common τ = 1/3 for all copulas. However, this value falls outside the coverage range for several one-parameter copula families (see Appendix A), making a general comparison infeasible. As expected, the entries for the Normal copula are insignificantly different from zero for all sample sizes. For the two RS copula families, FGM and Frank, one cannot reject the null either. The AMH family is fairly robust, too. For the Joe, Clayton and Gumbel families, the sample averages are significantly different from zero for at least one sample size, which confirms the observation that these non-RS copulas are not robust in this setting.

[Table 2.2: Robustness statistics δ̄^μ and δ̄^ρ for the selected copulas, with their standard errors, and the estimated Pearson correlation coefficient r̂, for sample sizes from T = 200 to T = 30,000. The body of the table is illegible in the source; only the caption and layout are recoverable.]

Among the one-parameter copula families, several entries in the table stand out.
First, the Frank family sample averages are at least as close to zero as the Normal benchmark for all sample sizes. Second, the FGM family sample averages are closer to zero for T = 200 than the Normal family average; for the other sample sizes, the sample averages of the two families are comparable. Third, the AMH family also performs well in the sense that its sample averages are insignificantly different from zero. In particular, δ̄^μ for this family is not significantly different from zero for all sample sizes. Finally, the Clayton family averages are close to zero for the smaller sample sizes but not for the larger ones.

In the previous section, it was noted that (D') does not generally have to hold in the population for RS copulas. An interesting observation from Table 2.2 is that the sample analogues of (D') are insignificantly different from zero for RS copulas and significantly different from zero for the others. This does not follow from Theorem 2.4.1.

2.5 Redundancy of copula terms

We now turn to the question of redundancy of copula moments. We assume that we either have the true copula moments (2.3C–D) or robust misspecified copula moments (2.4C'–D') that hold at the true value of θ. We would like to study conditions under which using valid copula moments (either the true or the misspecified ones) does not result in efficiency gains in the estimation of θ.

2.5.1 Redundancy with correct copula

We first prove a lemma that reveals the structure of the variance and derivative matrices of the moment functions in (2.3). Recall that correct specification of the copula is assumed in (2.3).

Lemma 2.5.1 Denote the covariance matrix of the moment functions in (2.3) by C and their expected derivative matrix with respect to (θ, ρ) by D. Then,

C = [  A    G   −G    0
       G'   B   −G'   0
      −G'  −G    J    E
       0    0    E'   F ]   (2.17)

and

D = [ −A            0
      −B            0
      G + G' − J   −E
      −E'          −F ],   (2.18)

where A, B, E, F, G, J are matrix-functions of (θ, ρ) defined in Appendix B.

Several important observations immediately follow from the lemma. First, (A) and (B) are uncorrelated with (C) if and only if (A) and (B) are uncorrelated with each other (G = 0). Second, the optimal GMM based on (2.3) is identical to the ML estimation in (2.5), as claimed in Section 2.3. To see this explicitly, note that the optimal GMM on (2.3) does not change if (2.3) is pre-multiplied by a matrix W such that W = D'C^{-1}, if C is nonsingular. But, by Lemma 2.5.1,

W = D'C^{-1} = −[ I  I  I  0
                  0  0  0  I ],

where I denotes the identity matrix of the relevant dimension. Clearly, this reproduces the MLE first order conditions (2.5). Not surprisingly, estimators that use the same first order conditions yield the same asymptotic variance matrices. In particular, for non-singular C, the asymptotic variance matrix of the optimal GMM estimator of (θ, ρ) based on (2.3) can be written as

V_GMM = (D'C^{-1}D)^{-1}.   (2.19)

(We use the standard notation according to which "V is the asymptotic variance of an estimator θ̂" means that "√N(θ̂ − θ0) converges in distribution to N(0, V)." It is implicit that D and C in the asymptotic variance formulas are evaluated at the true values θ0 and ρ0.) By Lemma 2.5.1, this is identical to the asymptotic variance matrix of the MLE estimator of (θ, ρ):

V_MLE = −( [ I I I 0; 0 0 0 I ] D )^{-1} = ( [ I I I 0; 0 0 0 I ] C [ I I I 0; 0 0 0 I ]' )^{-1}.   (2.20)

In contrast to V_GMM, V_MLE is defined even if C is singular. In fact, the last representation in (2.20) involves the outer-product-of-the-score form of the information matrix, while the one before the last involves the expected-Hessian form of the information matrix. Both are non-singular under regularity conditions.

By a similar argument, it follows from Lemma 2.5.1 that the marginal moments (2.7) are not equivalent to the QMLE first order conditions (2.6). To see this explicitly, partition C and D as follows:

C = [ C11  C12
      C21  C22 ],   D = [ D11   0
                          D21  D22 ],   (2.21)

where C11, C12, C21, C22, D11, D21, D22 correspond to the blocks separated by the dotted lines in (2.17)-(2.18).
The optimal GMM based on (2.7) does not change if the moment conditions (2.7) are pre-multiplied by a matrix W11 such that W11 = D11'C11^{-1}, if C11 is nonsingular. Now, using Lemma 2.5.1,

W11 = D11'C11^{-1} = −[ I  I ] − [ −G'  −G ] C11^{-1}.

The last term is what distinguishes the optimal GMM based on the stacked marginal moments (2.7) from the summation (2.6) employed by QMLE. Call the GMM estimator based on (2.7) the Improved QML estimator (IQMLE). Schmidt (2004) shows that correlation between the marginal scores used in the optimal weighting matrix results in efficiency gains over summation and that there are interesting cases when the two estimation methods are equally efficient. A trivial such case is when there is no correlation between the marginal scores, i.e., G = 0. We provide a formal statement and a proof of this relative efficiency result in the following theorem. The logic of the proof will be used again when we compare PMLE and IPMLE.

Theorem 2.5.1 (Schmidt, 2004) Let V_IQMLE and V_QMLE denote the asymptotic variance matrices of the IQMLE and QMLE of θ0, respectively. Then, V_QMLE − V_IQMLE is positive semi-definite.

Proof. Define A = [I I]. Then, (2.6) can be rewritten as (2.7) pre-multiplied by A. Correspondingly, the variance matrix of the moment functions in (2.6) can be expressed as A C11 A', where C11 is the variance matrix of the moment functions in (2.7), defined in (2.21). Similarly, the expected derivative matrix for the moment conditions in (2.6) can be expressed in terms of the relevant matrix for (2.7) as A D11. Then,

V_QMLE = [(A D11)'(A C11 A')^{-1}(A D11)]^{-1},   (2.22)

while

V_IQMLE = [D11'C11^{-1}D11]^{-1}.   (2.23)

But V_QMLE − V_IQMLE is positive semi-definite (PSD) if and only if V_IQMLE^{-1} − V_QMLE^{-1} = D11'C11^{-1}D11 − D11'A'(A C11 A')^{-1}A D11 is PSD. The last expression can be rewritten as

D11'C11^{-1/2} [ I − C11^{1/2}A'(A C11^{1/2}C11^{1/2}A')^{-1}A C11^{1/2} ] C11^{-1/2}D11.

This is PSD because the matrix in brackets is the PSD projection matrix orthogonal to C11^{1/2}A'. □
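The projection argument in the proof can be confirmed numerically: for an arbitrary positive definite C11 and derivative matrix D11, the difference between (2.22) and (2.23) has no negative eigenvalues. A sketch with randomly generated matrices (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 3
# C11: covariance of the stacked marginal scores (2p x 2p, positive definite)
M = rng.normal(size=(2*p, 2*p))
C11 = M @ M.T + 2*p * np.eye(2*p)
D11 = rng.normal(size=(2*p, p))           # expected derivative of the stacked scores
A = np.hstack([np.eye(p), np.eye(p)])     # the summation matrix used by QMLE

inv = np.linalg.inv
V_iqmle = inv(D11.T @ inv(C11) @ D11)                 # optimal weighting, as in (2.23)
AD, ACA = A @ D11, A @ C11 @ A.T
V_qmle = inv(AD.T @ inv(ACA) @ AD)                    # summed scores, as in (2.22)

# V_QMLE - V_IQMLE is positive semi-definite (Theorem 2.5.1)
assert np.all(np.linalg.eigvalsh(V_qmle - V_iqmle) >= -1e-10)
```

Rerunning with different seeds never produces a negative eigenvalue, consistent with the fact that the bracketed matrix in the proof is a projection and hence PSD regardless of the particular C11 and D11.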
Conditions under which the copula moments do not help in terms of efficiency for θ can be derived by comparing V_IQMLE with the upper left p × p block of V_MLE. When C is non-singular, the comparison can equivalently be made with the upper left p × p block of V_GMM.

Breusch et al. (1999) (henceforth, BQSW) developed a very useful toolbox for analyzing redundancy of a set of moment conditions given another set of moment conditions. However, their analysis assumes a nonsingular C. For this reason, we do not employ their results here but compare V_IQMLE with the relevant block of V_MLE directly.

Theorem 2.5.2 V_MLE for θ and V_IQMLE are equal if and only if

J − C21 C11^{-1} C12 − E F^{-1} E' = 0,   (2.24)

where C21 = C12' = [−G' −G].

The cumbersome expression in (2.24) has a simple interpretation in terms of singularity of C. It states that the linear projection of moment condition (C) on moment conditions (A), (B) and (D) is uncorrelated with moment condition (C). More specifically, (2.24) can be rewritten as

J − Ω21 Ω11^{-1} Ω21' = 0,

where

Ω21 = [ −G'  −G  E ],   Ω11 = [ A   G   0
                                G'  B   0
                                0   0   F ],

and the arguments of the moment functions have been suppressed for brevity. In other words, (C) has to be a linear combination of (A), (B) and (D) for the copula information to be redundant in terms of the asymptotic efficiency of estimation of θ. Thus C has to be singular.

Since V_MLE = V_GMM for non-singular C, and V_IQMLE is equal to V_MLE for θ if and only if C is singular, equality of V_IQMLE and V_GMM for θ is impossible unless (C) is a linear combination of (A), (B) and (D).

Corollary 2.5.1 If (C) is a linear combination of (A) and (B) with ρ known, then

1. E = 0;
2. J − C21 C11^{-1} C12 = 0;
3. IQMLE is efficient.

We therefore have two cases when the copula knowledge in (C) and (D) is redundant given the knowledge of the marginals in (A) and (B). One case is when the copula moment (C) is a linear combination of (A) and (B).
The other case is when (C) is not a linear combination of (A) and (B) but is a linear combination of (A), (B) and (D). In both cases, C is singular. Examples at the end of this section illustrate how one can apply the redundancy results in practice.

2.5.2 Redundancy with misspecified copula

Now suppose incorrect but zero-mean copula terms (2.4C') and (2.4D') are used in estimation. When is such knowledge redundant in terms of efficient estimation of θ?

Lemma 2.5.2 Denote the covariance matrix of the moment functions in (2.3) that employ the copula moments (2.4C') and (2.4D') instead of (2.3C) and (2.3D), respectively, by C^k, and their expected derivative matrix with respect to (θ, ρ^k) by D^k. Then,

C^k = [  A    G   −K   −P
         G'   B   −L   −Q
        −K'  −L'   N    V
        −P'  −Q'   V'   W ]

and

D^k = [ −A            0
        −B            0
        K' + L − M   −S
        −S'          −T ],

where A, B, G are as in Lemma 2.5.1 and K, L, M, N, P, Q, S, T, V, W are matrix-functions of (θ, ρ^k) defined in Appendix B.

Lemma 2.5.2 can be used to make the following observation. The optimal GMM estimator using (2.3A-B)-(2.4C'-D') is not identical to the PML estimator. This is in contrast with Lemma 2.5.1, in which MLE coincided with GMM using (2.3A-D) because we had knowledge of the correct copula. More specifically, the optimal GMM estimator based on (2.3A-B)-(2.4C'-D') is unchanged if (2.3A-B)-(2.4C'-D') are pre-multiplied by the matrix W^k = D^k'(C^k)^{-1}, if C^k is non-singular. Using Lemma 2.5.2, it can be shown that

D^k'(C^k)^{-1} = −[ I  I  I  0
                    0  0  0  I ] + Z(C^k)^{-1},

where Z contains G' − K', G − L, N − M', P', Q, V' − S', W − T'. Clearly, Lemma 2.5.2 becomes Lemma 2.5.1 if k = c. In this case, Z = 0, W^k = W, the optimal weighting retrieves (2.5), and PMLE is equivalent to MLE. For k ≠ c, correlation patterns impossible in Lemma 2.5.1 now provide potential efficiency gains over PMLE. We call the GMM estimator using (2.3A-B)-(2.4C'-D') the Improved PML estimator (IPMLE).
Theorem 2.5.3 Let VIPMLE and VPMLE denote the asymptotic variance matri- ces of the IPMLE and PMLE of (00, pg), respectively. Then, VpMLE — VIPMLE is positive semi-definite. Proof. Define II II II 0 0001 Then, (2.8) can be rewritten as (2.3) pre—multiplied by A. Correspondingly, the variance matrix of the moment functions in (2.8) can be expressed as ACll‘lA’. Similarly, the expected derivative matrix for the moment conditions in (2.8) can be expressed as ADl‘. Then, VPMLE = [(AD“)’X1X2-p2(—1(U)."l(v);p) p 6 (—-1,1) 2 r = E arcsinp E (—1, 1) Note: * denotes Archimedean copulas, i.e. copulas generated as C(U, v) = 0 Vt 6 (0,1) is called the generator function. It can be shown (see, e.g., Nelsen, 1999, p.130) that for Archimedean copulas, Kendall’s 1 r=1+4/ 90(t)dt. 0 t 92 Appendix B: Proofs PROOF OF THEOREM 2.4.1: We show that 132593111 lC(F10t1 + X1). F2012 + X2);p”) = 0. where u = 011.112)’, holds for any RS K. “) By the chain rule, 5% 1n k(F1(u1 + x1), F2012 + 1:2); p contains terms of the form 1 X (918071011 + $1). F2012 + $2);pk) [C(F10t1 +$1),F2(u2 +332);pk) 33011 +151) >< film + Ii). (2.31) i = 1, 2. Due ’00 MS 0f (X1.X2) and R5 of K, film + 152') = film - Ii) and k(F1(/11 + $1). F2(/12+$2)) = k(1-F1(#1+I1).1-F2(u2+$2)) = k(51011-561). F2(u2-$2))- So the first term in (2.31) is the same whether evaluated at (1:1, :52) or (-x1, —:r2). Similarly, the last term is the same whether evaluated at 2:,- or —:I:,-. Furthermore, 0k(F1(#1+31).F2(u2+x2);pk) = 8k(1—F1(p1+$1),1_F2(p2+x2);pk) 6172' (Hi+$i) 6(1—Fz‘ (Hi-xi» k = _ak(F1(#1-$1)»F2(u2-$2);p ) MAM-xi) ‘ Thus, 363111 k(F1(u1+:1:1),F2(u2+:r2);pk)= -38;1n k(F1(#l-$1)a 155012-232); Pk)- Denote 9051.332) 5 33; 1nk(F1(u1 + $1), F2(#2 + $2);p“) 41011 + $140 + trap)- From the above, it follows with RS that g(—a:1,—a:2) = —g(.r1,:c2). 93 We thus have EgaglnHFN/ti+X1).F2(#2+X2);pk) L00 L00 9($1,$2)d$1d$2 Loo L00 9( $1. $2)d$1d$2 L900 fooo g(z1,x2)dx1dx2 fooo f0009( (371,332)d$1d$2 fooo f. 
We thus have

$$\begin{aligned} \mathbb{E}\,\frac{\partial}{\partial\mu_i}\ln k\big(F_1(\mu_1+X_1), F_2(\mu_2+X_2); \rho^k\big) &= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} g(x_1, x_2)\,dx_1\,dx_2 \\ &= \int_0^{\infty}\!\int_0^{\infty} \big[g(x_1, x_2) + g(-x_1, -x_2)\big]\,dx_1\,dx_2 \\ &\quad + \int_0^{\infty}\!\int_{-\infty}^{0} \big[g(x_1, x_2) + g(-x_1, -x_2)\big]\,dx_2\,dx_1 \;=\; 0, \end{aligned}$$

where the second equality splits the plane into its four quadrants and substitutes $(x_1, x_2) \mapsto (-x_1, -x_2)$ in two of them, and the last equality uses $g(-x_1, -x_2) = -g(x_1, x_2)$. $\square$

PROOF OF LEMMA 2.5.1: By the information matrix equality (IME),

$$A \equiv \mathbb{E}\Big\{\frac{\partial}{\partial\theta}\ln f_1(X_1;\theta)\,\frac{\partial}{\partial\theta'}\ln f_1(X_1;\theta)\Big\} = -\mathbb{E}\,\frac{\partial^2}{\partial\theta\,\partial\theta'}\ln f_1(X_1;\theta). \tag{2.32}$$

Similarly for $B$, $F$. By the generalized IME (GIME),

$$E \equiv \mathbb{E}\Big\{\frac{\partial}{\partial\theta}\ln c\big(F_1(X_1;\theta), F_2(X_2;\theta); \rho\big)\,\frac{\partial}{\partial\rho'}\ln c\big(F_1(X_1;\theta), F_2(X_2;\theta); \rho\big)\Big\} = -\mathbb{E}\,\frac{\partial^2}{\partial\theta\,\partial\rho'}\ln c\big(F_1(X_1;\theta), F_2(X_2;\theta); \rho\big) \tag{2.33}$$

and, for $i = 1, 2$,

$$\mathbb{E}\Big\{\frac{\partial}{\partial\theta}\ln f_i(X_i;\theta)\,\frac{\partial}{\partial\theta'}\big[\ln f_1(X_1;\theta) + \ln f_2(X_2;\theta) + \ln c\big(F_1(X_1;\theta), F_2(X_2;\theta); \rho\big)\big]\Big\} = -\mathbb{E}\,\frac{\partial^2}{\partial\theta\,\partial\theta'}\ln f_i(X_i;\theta),$$

which, along with (2.32), implies that

$$G \equiv \mathbb{E}\Big\{\frac{\partial}{\partial\theta}\ln f_1(X_1;\theta)\,\frac{\partial}{\partial\theta'}\ln f_2(X_2;\theta)\Big\} = -\mathbb{E}\Big\{\frac{\partial}{\partial\theta}\ln f_1(X_1;\theta)\,\frac{\partial}{\partial\theta'}\ln c\big(F_1(X_1;\theta), F_2(X_2;\theta); \rho\big)\Big\}.$$

The implied covariance matrix of $Z = (Y', X')'$ in the LISREL model is

$$\Sigma = \begin{bmatrix} \Lambda_y B^{-1}(\Gamma\Phi\Gamma' + \Psi)B^{-1\prime}\Lambda_y' + \Theta_\varepsilon^2 & \Lambda_y B^{-1}\Gamma\Phi\Lambda_x' \\ \Lambda_x\Phi\Gamma'B^{-1\prime}\Lambda_y' & \Lambda_x\Phi\Lambda_x' + \Theta_\delta^2 \end{bmatrix},$$

where $\Phi = \mathbb{E}(\xi\xi')$ and $\Psi = \mathbb{E}(\zeta\zeta')$. If we let $\theta$ denote the vector of all distinct parameters in $\Lambda_y$, $\Lambda_x$, $B$, $\Gamma$, $\Phi$, $\Psi$, $\Theta_\varepsilon^2$, $\Theta_\delta^2$ and let $Z = (Y', X')'$, we obtain the setup of Section 3.2.1.

By imposing appropriate restrictions, the LISREL model reduces to many well-known models (see, e.g., Aigner et al., 1984). For example, equation (3.9) reduces to a FA model if one imposes sufficient restrictions to retain only the upper-left block in the form $\Gamma\Phi\Gamma' + \Psi$. From (3.6)-(3.8), SEM can be obtained by restricting $B = I$, $\Theta_\delta^2 = \Theta_\varepsilon^2 = 0$. To obtain a model for the conditional expectation of $Y|X$, one can restrict $\Lambda_x$ to $I$, $\Theta_\delta^2$ to $0$, and $\Phi$ to the sample covariance matrix of $X$. See Jöreskog (1970) for other special cases. A well-known special case of LISREL, the multiple-indicator multiple-cause (MIMIC) model, is obtained from (3.6)-(3.8) by setting $\Lambda_x = I$, $B = I$ and $\Theta_\delta^2 = 0$ (see, e.g., Jöreskog and Goldberger, 1975).

3.2.3 Estimators

3.2.3.1 Normal (Q)MLE

The normal QML estimator is

$$\hat\theta_{QMLE} = \arg\max_{\theta\in\Theta} \sum_{i=1}^N \ln f(z_i;\theta), \tag{3.10}$$

where

$$f(z;\theta) = \frac{1}{(2\pi)^{q/2}|\Sigma|^{1/2}}\exp\Big(-\frac12 z'\Sigma^{-1}z\Big).$$

It is easy to see that the problem in (3.10) can be equivalently written as

$$\hat\theta_{QMLE} = \arg\min_{\theta\in\Theta} F_{MLE}(\theta),$$

where

$$F_{MLE}(\theta) = \log|\Sigma| + \operatorname{tr}(S\Sigma^{-1}). \tag{3.11}$$

Thus QMLE amounts to finding the value of $\theta$ that minimizes the distance (3.11) between the sample covariance matrix $S$ and the covariance matrix $\Sigma$ implied by the model.
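The discrepancy interpretation of (3.11) can be sketched numerically. The toy structure $\Sigma(\theta) = \theta I_2$ used below is hypothetical, chosen because the minimizer of $F_{MLE}$ then has the closed form $\theta^* = \operatorname{tr}(S)/2$ (set the derivative $2/\theta - \operatorname{tr}(S)/\theta^2$ to zero).

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 2)) * 1.5   # i.i.d. data; true variance is 2.25
S = Z.T @ Z / Z.shape[0]                  # sample covariance (mean known zero)

def F_mle(theta, S):
    """F(theta) = log|Sigma(theta)| + tr(S Sigma(theta)^{-1}), Sigma = theta*I."""
    Sigma = theta * np.eye(2)
    return np.log(np.linalg.det(Sigma)) + np.trace(S @ np.linalg.inv(Sigma))

# Crude grid minimization of the discrepancy.
grid = np.linspace(0.5, 5.0, 2001)
theta_hat = grid[np.argmin([F_mle(t, S) for t in grid])]

assert abs(theta_hat - np.trace(S) / 2) < 0.01  # matches the analytic minimizer
```

In practice one would minimize (3.11) with a derivative-based optimizer rather than a grid, but the fitted value agreeing with $\operatorname{tr}(S)/2$ confirms the distance-minimization reading of the QMLE.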
It is a standard result (see, e.g., Chamberlain, 1984, p. 1289) that, under Assumption 3.2.1, the normal QMLE of $\theta_0$ is consistent and asymptotically normal.

3.2.3.2 GMM

The optimal GMM estimator of $\theta_0$ is based on the distinct elements of (3.1), i.e. on the moment conditions

$$\mathbb{E}[m(Z_i;\theta_0)] = 0, \tag{3.12}$$

where $m(Z_i;\theta) = \operatorname{vech}(S_i) - \operatorname{vech}(\Sigma)$ and $\operatorname{vech}$ denotes vertical vectorization of the lower triangle of a matrix. Thus $m$ is a $\frac12 q(q+1)$-vector. The optimal GMM estimator of $\theta_0$ is obtained as the solution to the following problem:

$$\hat\theta_{GMM} = \arg\min_{\theta\in\Theta} F_{GMM}(\theta),$$

where

$$F_{GMM}(\theta) = m_N(\theta)'\,\hat W\,m_N(\theta), \tag{3.13}$$

$$m_N(\theta) = \frac1N\sum_{i=1}^N m(Z_i;\theta) = \operatorname{vech}(S) - \operatorname{vech}(\Sigma),$$

and $\hat W$ is the appropriate (optimal) weighting matrix. The optimal weighting matrix is

$$W_0 = \big\{\mathbb{E}[m(Z_i;\theta_0)m(Z_i;\theta_0)']\big\}^{-1}. \tag{3.14}$$

But in (3.13), one would typically use the following consistent estimator of $W_0$ based on a preliminary consistent estimate $\tilde\theta$ of $\theta_0$:

$$\hat W = \Big[\frac1N\sum_{i=1}^N m(z_i;\tilde\theta)\,m(z_i;\tilde\theta)'\Big]^{-1}.$$

Note that there is a connection between $W_0$ in (3.14) and $\Lambda$ in (3.4). To show the connection we need to define matrices that transform $\operatorname{vech}$ into $\operatorname{vec}$ and vice versa. Magnus and Neudecker (1988, p. 49) show that, for a symmetric $k \times k$ matrix $A$, there exists a unique $k^2 \times \frac12 k(k+1)$ duplication matrix $H_k$ such that $H_k \operatorname{vech}(A) = \operatorname{vec}(A)$. Thus $H_k$ transforms $\operatorname{vech}$ into $\operatorname{vec}$, while the Moore-Penrose inverse of $H_k$, $H_k^+ = (H_k'H_k)^{-1}H_k'$, transforms $\operatorname{vec}$ into $\operatorname{vech}$. The matrices $H_k$ and $H_k^+$ have the following properties:

(i) $H_k^+ H_k = I_{\frac12 k(k+1)}$;
(ii) $K_{k^2} H_k = H_k$, where $K_{k^2}$ is the commutation matrix defined above;
(iii) $H_k H_k^+ = \frac12(I_{k^2} + K_{k^2})$;
(iv) $(I_{k^2} + K_{k^2})H_k = 2H_k$ and $H_k^+(I_{k^2} + K_{k^2}) = 2H_k^+$.

Thus, omitting the dimensionality subscript, we can write

$$\Lambda = V[\operatorname{vec}(S_i)] = V[H\operatorname{vech}(S_i)] = H\,V[\operatorname{vech}(S_i)]\,H'.$$

But $V[\operatorname{vech}(S_i)] = \mathbb{E}[m(Z_i;\theta_0)m(Z_i;\theta_0)']$. We can therefore write the optimal weighting matrix in (3.14) as $[H^+\Lambda_0 H^{+\prime}]^{-1}$.
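The duplication- and commutation-matrix identities above are easy to verify numerically. The sketch below constructs $H_k$ and $K_{k^2}$ from their definitions for $k = 3$ (column-major $\operatorname{vec}$, lower-triangle $\operatorname{vech}$) and checks properties (i)-(iv); it is a generic illustration, not code from the thesis.

```python
import numpy as np

k = 3
# Duplication matrix H: H @ vech(A) = vec(A) for symmetric A.
H = np.zeros((k * k, k * (k + 1) // 2))
col = 0
for j in range(k):
    for i in range(j, k):          # vech stacks the lower triangle by column
        H[j * k + i, col] = 1.0    # position of A[i, j] in column-major vec(A)
        H[i * k + j, col] = 1.0    # position of A[j, i] in column-major vec(A)
        col += 1
# Commutation matrix K: K @ vec(A) = vec(A').
K = np.zeros((k * k, k * k))
for i in range(k):
    for j in range(k):
        K[i * k + j, j * k + i] = 1.0

Hplus = np.linalg.inv(H.T @ H) @ H.T      # Moore-Penrose inverse of H
I2 = np.eye(k * k)

A = np.arange(1.0, k * k + 1).reshape(k, k)
A = A + A.T                                # arbitrary symmetric matrix
vech_A = np.concatenate([A[j:, j] for j in range(k)])

assert np.allclose(H @ vech_A, A.flatten(order="F"))     # H vech(A) = vec(A)
assert np.allclose(Hplus @ H, np.eye(k * (k + 1) // 2))  # property (i)
assert np.allclose(K @ H, H)                             # property (ii)
assert np.allclose(H @ Hplus, 0.5 * (I2 + K))            # property (iii)
assert np.allclose((I2 + K) @ H, 2 * H)                  # property (iv)
assert np.allclose(Hplus @ (I2 + K), 2 * Hplus)          # property (iv)
```

Property (iii) is the statement that $HH^+$ is the orthogonal projector onto the subspace of vectorized symmetric matrices, which is why it equals $\frac12(I + K)$.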
It is easy to verify that, under Assumption 3.2.1 and with $\hat W \stackrel{p}{\to} W_0$, the standard conditions for consistency and asymptotic normality of the GMM estimator of $\theta_0$ hold (see, e.g., Newey and McFadden, 1994, Theorems 2.6 and 3.4).

3.2.3.3 EL

The EL estimator of $\theta_0$ is obtained as follows:

$$\hat\theta_{EL} = \arg\max_{\theta\in\Theta} \sum_{i=1}^N \ln\pi_i$$

subject to

$$\sum_{i=1}^N \pi_i\,m(z_i;\theta) = 0 \qquad\text{and}\qquad \sum_{i=1}^N \pi_i = 1.$$

It can also be shown that Assumption 3.2.1 is sufficiently strong to satisfy the conditions for consistency and asymptotic normality of $\hat\theta_{EL}$ (see, e.g., Kitamura, 1997; Owen, 2001).

3.3 First order analysis

3.3.1 The first order conditions

Let $G(\theta)$ denote the Jacobian matrix of the moment functions in (3.12). Then

$$G \equiv G(\theta) = \frac{\partial m(z_i;\theta)}{\partial\theta'} = -\frac{\partial \operatorname{vech}(\Sigma)}{\partial\theta'}.$$

The following lemmas are used in the derivation of the main results of the paper. They are well known and thus given without proof (see, e.g., Chamberlain, 1984; Hansen, 1982; Qin and Lawless, 1994, for some relevant proofs).

Lemma 3.3.1 Under Assumption 3.2.1, the first order condition for $\hat\theta_{QMLE}$ is

$$G'H'(\Sigma\otimes\Sigma)^{-1}H\big[\operatorname{vech}(S) - \operatorname{vech}(\Sigma)\big] = 0. \tag{3.15}$$

Lemma 3.3.2 Under Assumption 3.2.1, the first order condition for $\hat\theta_{GMM}$ is

$$G'\hat W\big[\operatorname{vech}(S) - \operatorname{vech}(\Sigma)\big] = 0. \tag{3.16}$$

Lemma 3.3.3 Under Assumption 3.2.1, the first order condition for $\hat\theta_{EL}$ is

$$G'\Big[\sum_{i=1}^N \pi_i m_i m_i'\Big]^{-1}\big[\operatorname{vech}(S) - \operatorname{vech}(\Sigma)\big] = 0, \tag{3.17}$$

where $m_i = m(Z_i;\theta)$.

In Section 3.4, we will use an alternative way of writing the first order conditions that circumvents the need to operate with the inverse. Define $\lambda = -[\Sigma(\theta)\otimes\Sigma(\theta)]^{-1}H m_N(\theta)$. Then the QMLE first order condition can be written as

$$s_N(\beta) = -\begin{bmatrix} G(\theta)'H'\lambda \\ H m_N(\theta) + [\Sigma(\theta)\otimes\Sigma(\theta)]\lambda \end{bmatrix} = 0,$$

and we now have a $p+q^2$-vector of parameters $\beta = (\theta', \lambda')'$. A similar representation of the GMM and EL first order conditions was used, for example, by Newey and Smith (2004).

It is clear from (3.15)-(3.17) that the only thing that distinguishes the three estimators is the way in which the empirical moments $m_N(\theta)$ are weighted.
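For a fixed $\theta$, the inner EL maximization over the $\pi_i$ has the well-known closed form $\pi_i = 1/[N(1 + \lambda' m_i)]$, where the multiplier $\lambda$ solves $\sum_i m_i/(1 + \lambda' m_i) = 0$. The sketch below solves this for a scalar moment $m_i = z_i - \theta$ (a hypothetical mean restriction, not the covariance-structure moments of this chapter), using bisection on the dual equation so that every $1 + \lambda m_i$ stays positive.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(200) + 0.3
theta = 0.0                 # trial parameter value at which to form the weights
m = z - theta               # scalar moment function m_i = z_i - theta

# Solve sum(m_i / (1 + lam * m_i)) = 0 by bisection; the bracket keeps
# 1 + lam * m_i > 0 for every observation (assumes m takes both signs).
lo, hi = -0.99 / m.max(), -0.99 / m.min()
for _ in range(200):
    lam = 0.5 * (lo + hi)
    if np.sum(m / (1 + lam * m)) > 0:
        lo = lam
    else:
        hi = lam

pi = 1.0 / (len(m) * (1 + lam * m))   # implied EL probabilities

assert abs(pi.sum() - 1) < 1e-6       # weights sum to one
assert abs((pi * m).sum()) < 1e-6     # weighted moment restriction holds
assert (pi > 0).all()
```

The outer maximization over $\theta$ then profiles this inner problem, which is the computational burden alluded to in the concluding remarks.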
One way to compare the first order variances of GMM and QMLE is to note that $\hat\theta_{QMLE}$ solves a GMM problem that employs the suboptimal weighting matrix $H'(\Sigma\otimes\Sigma)^{-1}H$ and is therefore weakly inferior to $\hat\theta_{GMM}$ in terms of first order asymptotic efficiency. However, that argument cannot be used to derive the equal efficiency condition.

3.3.2 Relative efficiency to the first order

Theorem 3.3.1 Suppose Assumption 3.2.1 holds. Let $V$ denote the first order asymptotic variance matrix of the relevant estimator, i.e. $V = \operatorname{Avar}[N^{1/2}(\hat\theta - \theta_0)]$. Then,

$$V_{QMLE} = \big[G_0'H'(\Sigma_0\otimes\Sigma_0)^{-1}HG_0\big]^{-1}\,G_0'H'(\Sigma_0\otimes\Sigma_0)^{-1}\Lambda_0(\Sigma_0\otimes\Sigma_0)^{-1}HG_0\,\big[G_0'H'(\Sigma_0\otimes\Sigma_0)^{-1}HG_0\big]^{-1}, \tag{3.18}$$

$$V_{GMM} = V_{EL} = \big[G_0'(H^+\Lambda_0 H^{+\prime})^{-1}G_0\big]^{-1}.$$

Theorem 3.3.2 Suppose Assumption 3.2.1 holds. Then $V_{QMLE} = V_{GMM} = V_{EL}$ if the following equivalent conditions hold:

(i) $\Lambda_0(\Sigma_0\otimes\Sigma_0)^{-1}HG_0 = (I + K)HG_0$;
(ii) $H^+\Lambda_0(\Sigma_0\otimes\Sigma_0)^{-1}HG_0 = 2G_0$.

Proof. Since $S_i$ is symmetric, $K\Lambda_0 = \Lambda_0 K = \Lambda_0$. If (i) holds, pre-multiplying by $H^+$ and using the properties of $H^+$ gives $H^+\Lambda_0(\Sigma_0\otimes\Sigma_0)^{-1}HG_0 = H^+(I+K)HG_0 = 2H^+HG_0 = 2G_0$, which is (ii); conversely, pre-multiplying (ii) by $H$ and using $HH^+ = \frac12(I+K)$ together with $\frac12(I+K)\Lambda_0 = \Lambda_0$ gives (i). Under (i), $(\Sigma_0\otimes\Sigma_0)^{-1}\Lambda_0(\Sigma_0\otimes\Sigma_0)^{-1}HG_0 = 2(\Sigma_0\otimes\Sigma_0)^{-1}HG_0$, so (3.18) collapses to $V_{QMLE} = 2[G_0'H'(\Sigma_0\otimes\Sigma_0)^{-1}HG_0]^{-1}$. Under (ii),

$$(H^+\Lambda_0 H^{+\prime})\,H'(\Sigma_0\otimes\Sigma_0)^{-1}HG_0 = H^+\Lambda_0(\Sigma_0\otimes\Sigma_0)^{-1}\tfrac12(I+K)HG_0 = H^+\Lambda_0(\Sigma_0\otimes\Sigma_0)^{-1}HG_0 = 2G_0,$$

so that $V_{GMM} = [G_0'(H^+\Lambda_0H^{+\prime})^{-1}G_0]^{-1} = 2[G_0'H'(\Sigma_0\otimes\Sigma_0)^{-1}HG_0]^{-1} = V_{QMLE}$. This proves both (i) and (ii). $\square$

Theorem 3.3.2 is novel in that it states the first order efficiency properties of QMLE, GMM and EL explicitly in terms of the fourth moments $\Lambda$ of the distribution. It is clear from the theorem that GMM and EL dominate QMLE because they make efficient use of the second moment information without imposing restrictions on the fourth moments. Ahn and Schmidt (1995, Appendix 2) showed that the GMM estimator of covariance structures reaches the semiparametric efficiency bound of Newey (1990). Theorem 3.3.2 provides an explicit expression for the gain attained by GMM over QMLE.

Not surprisingly, the conditions of Theorem 3.3.2 hold for the multivariate normal distribution. Using (3.5), one can write

$$H^+\Lambda_0(\Sigma_0\otimes\Sigma_0)^{-1}HG_0 = H^+(I + K)HG_0 = 2H^+HG_0 = 2G_0,$$

so condition (ii) trivially holds. However, there may conceivably exist other distributions that satisfy the equal first order asymptotic efficiency conditions of Theorem 3.3.2. We leave further exploration of this point for future work.

3.4 Second order analysis

3.4.1 Stochastic expansions to the second order

Higher order stochastic expansions are based on the Taylor approximation of the first order conditions at the true value.
The expansions have the following form:

$$\sqrt N(\hat\beta - \beta_0) = \mu + \frac{\tau}{\sqrt N} + O_p(N^{-1}), \tag{3.20}$$

where $\mu$ and $\tau$ are $O_p(1)$ random vectors. It is well known that the first order bias can be obtained by taking the expectation of the first term. Since QMLE, GMM, and EL are $\sqrt N$-consistent, their first order bias is zero. Similarly, the first order variances can be obtained as the expectation of the outer product of the first term. The second order bias is based on the expectation of the first two terms in (3.20). Alternatively, the second order bias can be obtained using the Edgeworth approximation to the distribution as in Rothenberg (1984) and McCullagh (1987).

General expressions for $\mu$ and $\tau$ for extremum and minimum distance estimators with many examples can be found in Rilstone et al. (1996), Bao and Ullah (2003), Ullah (2004) and Kim (2005). Specialized expressions for GMM and (generalized) EL can be found in Newey et al. (2003) and Newey and Smith (2004). Derivation of higher order stochastic expansions involves higher order derivatives of the objective functions. Rilstone et al. (1996) use a recursive definition of derivatives which is useful in general settings. In our derivation we follow Newey and Smith (2004) in using the usual definition because we do not go to orders higher than two and because we wish to compare the QMLE bias to the GMM and EL bias expressions they derive.

Define

$$s_i(\beta) = -\begin{bmatrix} G'H'\lambda \\ Hm_i + (\Sigma\otimes\Sigma)\lambda \end{bmatrix}, \qquad M_j = \mathbb{E}\,\frac{\partial^2 s_i(\beta_0)}{\partial\beta'\,\partial\beta_j}, \quad j = 1,\dots,p+q^2, \quad\text{where } \beta_0 = (\theta_0', 0')',$$

$$R = \big[G'H'(\Sigma\otimes\Sigma)^{-1}HG\big]^{-1}, \qquad Q = RG'H'(\Sigma\otimes\Sigma)^{-1}, \qquad P = (\Sigma\otimes\Sigma)^{-1} - (\Sigma\otimes\Sigma)^{-1}HGQ.$$

Theorem 3.4.1 Under Assumption 3.2.1, the estimator $\hat\beta_{QMLE}$ satisfies (3.20) with

$$\mu = \begin{bmatrix} Q_0 \\ P_0 \end{bmatrix} H\,\frac{1}{\sqrt N}\sum_{i=1}^N \big[\operatorname{vech}(S_i) - \operatorname{vech}(\Sigma_0)\big], \tag{3.21}$$

$$\tau = \frac12\begin{bmatrix} -R_0 & Q_0 \\ Q_0' & P_0 \end{bmatrix}\sum_{j=1}^{p+q^2} \mu_j M_j \mu,$$

where $\mu_j$ is the $j$-th element of $\mu$.

Proof. See Appendix B.
$\square$

Note that $\mathbb{E}\mu = 0$, and the first order variance of $\hat\beta_{QMLE}$ based on (3.21) can be written as

$$\mathbb{E}\mu\mu' = \begin{bmatrix} Q_0 \\ P_0 \end{bmatrix} H\,\mathbb{E}\big[m(Z_i;\theta_0)m(Z_i;\theta_0)'\big]\,H'\begin{bmatrix} Q_0 \\ P_0 \end{bmatrix}' = \begin{bmatrix} Q_0\Lambda_0 Q_0' & Q_0\Lambda_0 P_0' \\ P_0\Lambda_0 Q_0' & P_0\Lambda_0 P_0' \end{bmatrix}, \tag{3.22}$$

where the upper left $p \times p$ block of (3.22) represents the first order asymptotic variance of $\hat\theta_{QMLE}$ in (3.18).

Interestingly, the matrix in (3.22) is not in general block diagonal, unlike its EL and GMM analogues (see, e.g., Qin and Lawless, 1994, Theorem 1). However, in the case of multivariate normality, the blocks of (3.22) can be simplified as follows:

$$Q\Lambda Q' = 2R, \qquad Q\Lambda P' = 0, \qquad P\Lambda P' = (I + K)P. \tag{3.23}$$

Thus $\hat\theta_{QMLE}$ and $\hat\lambda_{QMLE}$ are in this case asymptotically uncorrelated.

3.4.2 Second order bias of QMLE

Let $B$ denote the second order bias of the relevant estimator. Using (3.20), the bias can be written in terms of the expected value of $\tau$ as $B = \mathbb{E}\tau/N$. Thus, an explicit form of the QMLE bias contains $\mathbb{E}\mu_j M_j\mu$, $j = 1,\dots,p+q^2$. But $M_j$ can be written as

$$M_j = \mathbb{E}\,\frac{\partial^2}{\partial\beta'\,\partial\beta_j}\,s_i(\beta)\bigg|_{\theta=\theta_0,\,\lambda=0} = \begin{cases} -\begin{bmatrix} 0 & G_j'H' \\ HG_j & \Omega_j \end{bmatrix}, & j = 1,\dots,p, \\[2ex] -\begin{bmatrix} G_{0j} & 0 \\ \Omega_{0j} & 0 \end{bmatrix}, & j = 1,\dots,q^2, \end{cases}$$

where $G_j = \frac{\partial G}{\partial\theta_j}$, $G_{0j} = \frac{\partial}{\partial\theta'}[G_0'H']_{\cdot j}$, $\Omega_j = \frac{\partial}{\partial\theta_j}(\Sigma_0\otimes\Sigma_0)$ and $\Omega_{0j} = \frac{\partial}{\partial\theta'}[\Sigma_0\otimes\Sigma_0]_{\cdot j}$. Therefore $M_j$ is non-random and we can write

$$\mathbb{E}\,\mu_j M_j \mu = \begin{cases} -\begin{bmatrix} 0 & G_j'H' \\ HG_j & \Omega_j \end{bmatrix}\mathbb{E}\mu\mu'\,e_j, & j = 1,\dots,p, \\[2ex] -\begin{bmatrix} G_{0j} & 0 \\ \Omega_{0j} & 0 \end{bmatrix}\mathbb{E}\mu\mu'\,e_{p+j}, & j = 1,\dots,q^2, \end{cases} \tag{3.24}$$

where $e_k$ is a $p+q^2$-vector of zeros with the $k$-th element equal to 1. Substituting (3.22) into (3.24) and simplifying yields the result of the following theorem.

Theorem 3.4.2 Under Assumption 3.2.1, the second order bias of $\hat\beta_{QMLE}$ can be written as follows:

$$B_{QMLE} = -\frac{1}{2N}\begin{bmatrix} -R_0 & Q_0 \\ Q_0' & P_0 \end{bmatrix}\left(\sum_{j=1}^{p}\begin{bmatrix} 0 & G_j'H' \\ HG_j & \Omega_j \end{bmatrix}\begin{bmatrix} Q_0\Lambda_0 Q_0' \\ P_0\Lambda_0 Q_0' \end{bmatrix}e_j + \sum_{j=1}^{q^2}\begin{bmatrix} G_{0j} \\ \Omega_{0j} \end{bmatrix}Q_0\Lambda_0 P_0'\,e_j\right), \tag{3.25}$$

where $e_k$ is the zero vector of relevant dimension in which the $k$-th element is 1.

McCullagh (1987) and Linton (1997) give expressions for the second order bias of QMLE in terms of cumulants; we use the higher-moment representation to enable comparison with the second order biases derived in Newey and Smith (2004).
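The normal-case simplifications in (3.23) can be checked numerically. The sketch below uses a hypothetical two-parameter structure $\Sigma(\theta) = \operatorname{diag}(\theta_1, \theta_2)$ with $q = 2$ and the normal fourth-moment matrix $\Lambda_0 = (I + K)(\Sigma_0\otimes\Sigma_0)$; only the identities themselves come from the text.

```python
import numpy as np

q = 2
Sigma = np.diag([2.0, 0.5])
Omega = np.kron(Sigma, Sigma)              # Sigma_0 x Sigma_0

# Duplication matrix H (vech -> vec) and commutation matrix K for q = 2.
H = np.zeros((q * q, q * (q + 1) // 2))
c = 0
for j in range(q):
    for i in range(j, q):
        H[j * q + i, c] = H[i * q + j, c] = 1.0
        c += 1
K = np.zeros((q * q, q * q))
for i in range(q):
    for j in range(q):
        K[i * q + j, j * q + i] = 1.0

# G = -d vech(Sigma)/d theta' for Sigma(theta) = diag(theta_1, theta_2),
# with vech ordering (s11, s21, s22).
G = -np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])

Oinv = np.linalg.inv(Omega)
R = np.linalg.inv(G.T @ H.T @ Oinv @ H @ G)
Q = R @ G.T @ H.T @ Oinv
P = Oinv - Oinv @ H @ G @ Q
Lam = (np.eye(q * q) + K) @ Omega          # fourth moments under normality

assert np.allclose(Q @ Lam @ Q.T, 2 * R)                     # Q Lam Q' = 2R
assert np.allclose(Q @ Lam @ P.T, 0)                         # Q Lam P' = 0
assert np.allclose(P @ Lam @ P.T, (np.eye(q * q) + K) @ P)   # P Lam P' = (I+K)P
```

The zero off-diagonal block confirms that, under normality, $\hat\theta_{QMLE}$ and $\hat\lambda_{QMLE}$ are asymptotically uncorrelated.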
Based on (3.23), the following simplification applies in the multivariate normal case:

$$B_{QMLE} = -\frac{1}{2N}\begin{bmatrix} -R_0 & Q_0 \\ Q_0' & P_0 \end{bmatrix}\sum_{j=1}^{p}\begin{bmatrix} 0 & G_j'H' \\ HG_j & \Omega_j \end{bmatrix}\begin{bmatrix} 2R_0 \\ 0 \end{bmatrix}e_j = -\frac{1}{2N}\begin{bmatrix} -R_0 & Q_0 \\ Q_0' & P_0 \end{bmatrix}\begin{bmatrix} 0 \\ 2\sum_{j=1}^{p}HG_jR_0e_j \end{bmatrix} = -\frac1N\begin{bmatrix} Q_0H\sum_{j=1}^{p}G_jR_0e_j \\ P_0H\sum_{j=1}^{p}G_jR_0e_j \end{bmatrix}. \tag{3.26}$$

3.4.3 Comparison to GMM and EL

Newey and Smith's (2004, Theorems 4.1 and 4.6) second order biases of the GMM and EL estimators of $\theta_0$ are, in our notation,

$$B_{EL}(\theta) = -\frac{1}{2N}\sum_{j=1}^{p} Q_{EL}\,G_j\,R_{EL}\,e_j, \tag{3.27}$$

$$B_{GMM}(\theta) = B_{EL} + \frac1N Q_{EL}\,\zeta_0,$$

where

$$Q_{EL} = R_{EL}G'\big[\mathbb{E}m_im_i'\big]^{-1}, \qquad R_{EL} = \big(G'[\mathbb{E}m_im_i']^{-1}G\big)^{-1}, \qquad \zeta_0 \equiv \mathbb{E}\big[m_im_i'Pm_i\big].$$

It is not clear how these compare to $B_{QMLE}(\theta)$ in general. However, when $Z$ is multivariate normal, it is easy to show that the upper block of $B_{QMLE}$ is equal to (3.27) since, under normality,

$$R_{EL} = \big\{G'[2H^+(\Sigma\otimes\Sigma)H^{+\prime}]^{-1}G\big\}^{-1} = 2\big[G'H'(\Sigma\otimes\Sigma)^{-1}HG\big]^{-1} = 2R,$$

$$Q_{EL} = R_{EL}G'\big[2H^+(\Sigma\otimes\Sigma)H^{+\prime}\big]^{-1} = RG'H'(\Sigma\otimes\Sigma)^{-1}H = QH.$$

3.5 Concluding remarks

The paper examined the estimation methods available for covariance structure models in terms of their first and second order asymptotic properties. The results suggest the following strategy for estimating models of covariance structure.

First, if we have large samples, so that the first order asymptotic results can be applied, we should prefer GMM or EL to quasi-MLE. Due to the increased computational difficulty of EL, the GMM estimator would be preferable. If efficiency is not an issue and we are ready to sacrifice efficiency for a simpler and yet consistent estimation technique, we may prefer the traditional normal QMLE approach.

Second, if we have small samples, EL would be the preferred method of estimation. If the data are normal, normal QMLE will have the same second order bias as EL. The bias can be estimated using (3.27) and the bias-adjusted estimator can be constructed. If the data are not normal and we still use the QMLE, construction of the bias-adjusted estimator may be more complicated but is still possible using (3.25).
Interesting related questions are how different the alternative estimates are in applications and whether the equal efficiency and equal bias results can be shown for other distributions.

Bibliography

AHN, S. C. AND P. SCHMIDT (1995): "Efficient estimation of models for dynamic panel data," Journal of Econometrics, 68, 5–27.

AIGNER, D. J., C. HSIAO, A. KAPTEYN, AND T. WANSBEEK (1984): "Latent variable models in econometrics," in Handbook of Econometrics, ed. by Z. Griliches and M. D. Intriligator, vol. II, 1323–1393.

BAO, Y. AND A. ULLAH (2003): "The Second-Order Bias and Mean Squared Error of Estimators in Time Series Models," Working Paper, University of California, Riverside, http://www.economics.ucr.edu/papers/papers03/03-08.pdf.

CHAMBERLAIN, G. (1982): "Multivariate regression models for panel data," Journal of Econometrics, 18, 5–46.

——— (1984): "Panel data," in Handbook of Econometrics, ed. by Z. Griliches and M. D. Intriligator, vol. II, 1248–1313.

HANSEN, L. (1982): "Large sample properties of generalized method of moments estimators," Econometrica, 50, 1029–1054.

JÖRESKOG, K. G. (1970): "A general method for analysis of covariance structures," Biometrika, 57, 239–251.

JÖRESKOG, K. G. AND A. S. GOLDBERGER (1975): "Estimation of a model with multiple indicators and multiple causes of a single latent variable," Journal of the American Statistical Association, 70, 631–639.

JÖRESKOG, K. G. AND D. SÖRBOM (1977): "Statistical models and methods for analysis of longitudinal data," in Latent Variables in Socio-Economic Models, ed. by D. J. Aigner and A. S. Goldberger, Amsterdam: North-Holland Publishing Company, Contributions to Economic Analysis, 285–325.

——— (1996): LISREL 8 User's Reference Guide, SSI Scientific Software.

KIM, K.-I. (2005): "Higher order bias correcting moment equation for M-estimation," UCLA Working Paper.

KITAMURA, Y. (1997): "Empirical likelihood methods with weakly dependent processes," The Annals of Statistics, 25, 2084–2102.

LINTON, O.
(1997): "An asymptotic expansion in the GARCH(1,1) model," Econometric Theory, 13, 558–581.

MAGNUS, J. R. AND H. NEUDECKER (1988): Matrix Differential Calculus with Applications in Statistics and Econometrics, Wiley Series in Probability and Statistics, Chichester: John Wiley and Sons.

MCCULLAGH, P. (1987): Tensor Methods in Statistics, Monographs on Statistics and Applied Probability, London: Chapman and Hall.

NEWEY, W. (1990): "Semiparametric efficiency bounds," Journal of Applied Econometrics, 5, 99–135.

NEWEY, W. AND D. MCFADDEN (1994): "Large sample estimation and hypothesis testing," in Handbook of Econometrics, ed. by R. Engle and D. McFadden, vol. IV, 2113–2241.

NEWEY, W. K., J. S. RAMALHO, AND R. J. SMITH (2003): "Asymptotic bias for GMM and GEL estimators with estimated nuisance parameters," CEMMAP working paper CWP05/03.

NEWEY, W. K. AND R. J. SMITH (2004): "Higher order properties of GMM and Generalized Empirical Likelihood estimators," Econometrica, 72, 219–255.

OWEN, A. B. (2001): Empirical Likelihood, Monographs on Statistics and Applied Probability 92, Boca Raton, FL: Chapman and Hall.

QIN, J. AND J. LAWLESS (1994): "Empirical likelihood and general estimating equations," The Annals of Statistics, 22, 300–325.

RILSTONE, P., V. SRIVASTAVA, AND A. ULLAH (1996): "The second-order bias and mean squared error of nonlinear estimators," Journal of Econometrics, 75, 369–395.

ROTHENBERG, T. (1984): "Approximating the distributions of econometric estimators and test statistics," in Handbook of Econometrics, ed. by Z. Griliches and M. D. Intriligator, vol. II, 881–935.

ULLAH, A. (2004): Finite Sample Econometrics, Advanced Texts in Econometrics, Oxford: Oxford University Press.

Appendix: Proofs

PROOF OF THEOREM 3.4.1: Let

$$\hat M(\beta) = \frac1N\sum_{i=1}^N \frac{\partial s_i(\beta)}{\partial\beta'}, \qquad M(\beta) = \mathbb{E}\,\frac{\partial s_i(\beta)}{\partial\beta'}, \qquad \hat M_j(\beta) = \frac1N\sum_{i=1}^N \frac{\partial^2 s_i(\beta)}{\partial\beta'\,\partial\beta_j},$$

and let $\bar\beta$ be between $\hat\beta$ and $\beta_0$. By the second-order Taylor expansion of (3.15) around $\beta_0$, we have

$$0 = s_N(\hat\beta) = s_N(\beta_0) + \hat M(\beta_0)(\hat\beta - \beta_0) + \frac12\sum_{j=1}^{p+q^2}(\hat\beta_j - \beta_{0j})\,\hat M_j(\bar\beta)(\hat\beta - \beta_0)$$
$$\begin{aligned} &= s_N(\beta_0) + M(\beta_0)(\hat\beta - \beta_0) + \big[\hat M(\beta_0) - M(\beta_0)\big](\hat\beta - \beta_0) \\ &\quad + \frac12\sum_{j=1}^{p+q^2}(\hat\beta_j - \beta_{0j})\,M_j(\beta_0)(\hat\beta - \beta_0) + \frac12\sum_{j=1}^{p+q^2}(\hat\beta_j - \beta_{0j})\big[\hat M_j(\bar\beta) - M_j(\beta_0)\big](\hat\beta - \beta_0). \end{aligned}$$

Note that $\hat M(\beta_0) = M(\beta_0)$, so that the third term in the last equation is zero. Also note that the last term is $O_p(N^{-3/2})$. Assume that $M(\beta_0)$ is not singular. Then,

$$\hat\beta - \beta_0 = -M(\beta_0)^{-1}\Big[s_N(\beta_0) + \frac12\sum_{j=1}^{p+q^2}(\hat\beta_j - \beta_{0j})\,M_j(\beta_0)(\hat\beta - \beta_0)\Big] + O_p(N^{-3/2}). \tag{3.28}$$

But

$$M(\beta_0) = -\begin{bmatrix} 0 & G_0'H' \\ HG_0 & \Sigma_0\otimes\Sigma_0 \end{bmatrix}, \qquad s_N(\beta_0) = -\begin{bmatrix} 0 \\ H m_N(\theta_0) \end{bmatrix},$$

and the second term is $O_p(N^{-1})$. We thus have

$$\hat\beta - \beta_0 = \frac{1}{\sqrt N}\begin{bmatrix} Q_0 \\ P_0 \end{bmatrix} H\,\frac{1}{\sqrt N}\sum_{i=1}^N\big[\operatorname{vech}(S_i) - \operatorname{vech}(\Sigma_0)\big] + O_p(N^{-1}) = \frac{1}{\sqrt N}\mu + O_p(N^{-1}). \tag{3.29}$$

Substituting (3.29) into (3.28), multiplying by $\sqrt N$ and collecting terms of the same order yields the result. $\square$

DEPARTMENT OF ECONOMICS, MICHIGAN STATE UNIVERSITY, EAST LANSING, MI 48824

Email address: prohoronmsu.edu