ESSAYS ON PSEUDO PANEL DATA AND TREATMENT EFFECTS

By

Fei Jia

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics – Doctor of Philosophy

2017

ABSTRACT

ESSAYS ON PSEUDO PANEL DATA AND TREATMENT EFFECTS

By Fei Jia

This dissertation is composed of three chapters that study two estimation methods suited to identifying causal relationships in the presence of (pseudo) panel data. The first and second chapters are devoted to minimum distance estimation for pseudo panel models, whereas the third chapter is concerned with the estimation of controlled direct effects in causal mediation analyses using panel data.

The first chapter focuses on finite sample properties of minimum distance estimators in pseudo panel models. Previous research shows theoretically that minimum distance asymptotic theory is a natural fit for pseudo panel models when cohort sizes are large. However, little is known about how minimum distance estimation performs with realistic sample sizes. In a carefully designed simulation study that mimics the sampling scheme of repeated cross sections, we compare the optimal minimum distance estimator to the fixed effects estimator, which is identical to the minimum distance estimator that uses the identity weighting matrix. The results show that both estimators perform well in realistic finite sample setups. The results also confirm that the optimal minimum distance estimator is generally more efficient than the fixed effects estimator. In particular, we find that cohort-wise heteroskedasticity and varying cohort sizes are the two typical scenarios that call for the use of optimal weighting. For the fixed effects estimator, we find that the minimum distance inference is more suitable than the naive inference, which incorrectly ignores the estimation errors in the pseudo panel of variable cohort means.

The second chapter extends the basic pseudo panel models of the first chapter by adding extra instrumental variables. The additional instruments, if non-redundant, can improve estimation efficiency. To state the efficiency gain result in a general form, we derive it within a non-separable minimum distance framework developed in the chapter. Along with the efficiency gain result, consistency, asymptotic normality, and optimal weighting theorems are also established. The efficiency gain result echoes the property of the generalized method of moments that more moment conditions do not hurt. After developing the results in the non-separable minimum distance framework, we apply them to the extended pseudo panel models. We show that the minimum distance estimators in the extended pseudo panels are generalized least squares estimators, and that the optimal weighting matrix is block diagonal. Because of the latter fact, the use of optimal weighting becomes more important than in basic pseudo panels. Simulation evidence confirms the theoretical findings in realistic finite sample setups. For an empirical illustration, we apply the method to estimate returns to education using data from the Current Population Survey in the U.S.

The third chapter, coauthored with Zhehui Luo and Alla Sikorskii, proposes a flexible plug-in estimator for controlled direct effects in mediation analyses using the potential outcome framework. A controlled direct effect is the direct treatment effect on an outcome when the indirect treatment effect through a mediator is shut off by holding the mediator fixed.
The flexible plug-in estimator for controlled direct effects is a parametric g-formula with an additional partially linear assumption on the outcome equation. Compared to simulation-based methods in the literature, this estimator avoids estimation of conditional densities and numerical evaluation of expectations. We compare the flexible plug-in estimator to the sequential g-formula estimator, and show both theoretically and via simulation that they are numerically equivalent under certain settings. We also discuss a sensitivity analysis to check the robustness of the flexible plug-in estimator to a particular violation of the sequential ignorability assumption. We illustrate the use of the flexible plug-in estimator in a secondary analysis of a random sample of low birthweight and normal birthweight infants to estimate the controlled direct effect of low birth weight on reading scores at age 17 when a behavior problem index is used as the mediator.

Copyright by FEI JIA 2017

ACKNOWLEDGMENTS

The Ph.D. journey at Michigan State University has been a life-changing experience for me. Along this journey, I have been fortunate to meet so many amazing people who helped me gain little advantages every day. I thank Jeffrey Wooldridge, who served as the chair of my dissertation committee. His advising throughout my studies at Michigan State University shaped two thirds of this dissertation and my current research. I thank Zhehui Luo, who supervised my research assistantship in the Department of Epidemiology and Biostatistics at Michigan State University. The interdisciplinary collaboration with her and Alla Sikorskii made the third chapter of this dissertation possible. I also thank my other committee members, Peter Schmidt and Timothy Vogelsang, for their help at various stages of my research. Their feedback had an important positive impact on the quality of my work. Finally, I thank Lori Jean Nichols and Margaret Lynch of the administrative staff of the Department of Economics. Their continuous support over five years helped a lot in completing this degree. The work in the third chapter was supported by National Institute of Mental Health grant RC4MH092737 (Luo), and the data came from grant R01MH44586 (Breslau).

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1  FINITE SAMPLE PROPERTIES OF THE MINIMUM DISTANCE ESTIMATOR FOR PSEUDO PANEL DATA
  1.1 Introduction
  1.2 Framework
    1.2.1 The population models
    1.2.2 Discussion on exogeneity
    1.2.3 Minimum distance estimation
      1.2.3.1 Limiting distribution of cohort-time cell means
      1.2.3.2 Minimum distance approach for pseudo panels
      1.2.3.3 Closed-form MD estimators for pseudo panels
      1.2.3.4 Discussion on the difference between MD FE and naive FE inference
  1.3 Simulation and results
    1.3.1 Benchmark
    1.3.2 Deterministic aggregate time effects
    1.3.3 Covariate distributions
    1.3.4 Cohort-wise heteroskedasticity in the idiosyncratic error
    1.3.5 Cohort-time cell size
  1.4 Conclusion
  BIBLIOGRAPHY
CHAPTER 2  EXPLORING ADDITIONAL MOMENT CONDITIONS IN NON-SEPARABLE MINIMUM DISTANCE ESTIMATION WITH AN APPLICATION TO PSEUDO PANELS
  2.1 Introduction
  2.2 The NMD framework
    2.2.1 Consistency
    2.2.2 Asymptotic normality
    2.2.3 Optimal weighting matrix
    2.2.4 Estimation
    2.2.5 More conditions do not hurt
  2.3 Pseudo panels with additional IVs
    2.3.1 Population model and structural equations
    2.3.2 Useful notations
    2.3.3 The partial derivatives L and B and the inverse optimal weighting matrix M
    2.3.4 Estimation
      2.3.4.1 Asymptotics of $\hat\pi$
      2.3.4.2 Estimation of L
      2.3.4.3 The general estimator $\hat\theta$ and the FE estimator $\check\theta$
      2.3.4.4 Estimation of B, M and $\hat\theta^{opt}$
      2.3.4.5 Estimation of the asymptotic variances of $\hat\theta$, $\check\theta$ and $\hat\theta^{opt}$
    2.3.5 The GLS perspective
    2.3.6 Naive variance estimators for $\check\theta$
  2.4 Simulation
    2.4.1 Simulation design
    2.4.2 Simulation results for the small pseudo panel
    2.4.3 Simulation results for the middle sized pseudo panel
  2.5 Concluding remarks
  APPENDICES
    APPENDIX A  PROOFS AND ALGEBRA
    APPENDIX B  ADDITIONAL TABLES
  BIBLIOGRAPHY
CHAPTER 3  A FLEXIBLE PLUG-IN G-FORMULA FOR CONTROLLED DIRECT EFFECTS IN MEDIATION ANALYSIS
  3.1 Introduction
  3.2 Framework
  3.3 Existing Methods
    3.3.1 The g-Formula
    3.3.2 The Sequential g-formula estimator
  3.4 The Flexible Plug-in g-formula estimator
    3.4.1 The Partial Linearity Assumption and the Plug-in g-formula estimator
    3.4.2 Estimation Procedure for the Flexible Plug-in g-formula estimator of CDE
    3.4.3 plim of Parametric g-Formula is the Flexible Plug-in g-formula estimator
    3.4.4 Flexible Plug-in g-formula estimator Is Numerically Equivalent to Sequential g-formula estimator
    3.4.5 Simulation
    3.4.6 Comparison of Flexible plug-in g-formula estimator with Sequential g-formula estimator
  3.5 Sensitivity Analysis
  3.6 An Application
  3.7 Conclusion
  APPENDICES
    APPENDIX A  PROOFS AND ALGEBRA ON G-FORMULA
    APPENDIX B  PROOFS AND ALGEBRA ON SEQUENTIAL G-ESTIMATOR
    APPENDIX C  NUMERICAL EQUIVALENCE
    APPENDIX D  SENSITIVITY ANALYSIS
  BIBLIOGRAPHY

LIST OF TABLES

Table 1.1  Results for benchmark. G = 6, T = 4, $n_{gt} \approx 40$, sampling rate = .2%; R denotes the number of replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.
Table 1.2  Results for benchmark. G = 6, T = 4, $n_{gt} \approx 200$, sampling rate = .2%; R denotes the number of replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.
Table 1.3  Results for benchmark. G = 6, T = 4, $n_{gt} \approx 1000$, sampling rate = .2%; R denotes the number of replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.
Table 1.4  Results for different aggregate time effect processes. G = 6, T = 4, $n_{gt} \approx 200$, sampling rate = .2%; 10,000 replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.
Table 1.5  Results for distribution of $x_2$. G = 6, T = 4, $n_{gt} \approx 200$, sampling rate = .2%; 10,000 replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.
Table 1.6  Results for cohort-wise heteroskedasticity in the error term. G = 6, T = 4, $n_{gt} \approx 200$, sampling rate = .2%; 10,000 replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.
Table 1.7  Cohort-time cell sizes for the two sampling schemes
Table 1.8  Results for varying cohort size. G = 6, T = 4; $n_{gt}$ follows the three specifications given in section 1.3.5 and is generated by varying the sampling rate; 10,000 replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.
Table 2.1  Variance-covariance and correlation matrix of $(\mu^x_{gt,2}, \mu^x_{gt,3}, \mu^x_{gt,4})$; correlation coefficients in parentheses. $\mu^x_{gt,3} = \sin(gt)$, $\mu^x_{gt,4} = (1 + \exp[1.5 \sin(gt/2)])^{-1}$.
Table 2.2  Finite sample properties of various estimators of $\beta_2$ and its standard error, G = 6, T = 4. Case 3. $x_{2it} \sim N(gt/6, 1) + z_i$.
Table 2.3  Finite sample properties of various estimators of $\beta_2$ and its standard error, G = 6, T = 4. Case 4. $x_{2it} \sim N(gt/6, 1) + z_i + f_i$.
Table 2.4  Finite sample properties of various estimators of $\beta_2$ and its standard error, G = 30, T = 20. Case 3. $x_{2it} \sim N(gt/150, 1) + z_i$.
Table 2.5  Finite sample properties of various estimators of $\beta_2$ and its standard error, G = 30, T = 20. Case 4. $x_{2it} \sim N(gt/150, 1) + z_i + f_i$.
Table B.1  Small panel with G = 6, T = 4. Case 1.a: $x_{2it} \sim N(gt/6, 1)$, $n_{gt}$ = 200, sampling rate = 1%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.2  Small panel with G = 6, T = 4. Case 1.b: $x_{2it} \sim N(gt/6, 1)$, $n_{gt}$ = 1000, sampling rate = 1%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.3  Small panel with G = 6, T = 4. Case 1.1: $x_{2it} \sim N(gt/6, 1)$, $n_{gt}$ = 200, sampling rate = 0.2%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.4  Small panel with G = 6, T = 4. Case 1.1: $x_{2it} \sim N(gt/6, 1)$, $n_{gt}$ = 200, sampling rate = 0.2%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.5  Small panel with G = 6, T = 4. Case 2.a: $x_{2it} \sim N(gt/6, 1) + f_i$, $n_{gt}$ = 200, sampling rate = 1%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.6  Small panel with G = 6, T = 4. Case 2.b: $x_{2it} \sim N(gt/6, 1) + f_i$, $n_{gt}$ = 1000, sampling rate = 1%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.7  Small panel with G = 6, T = 4. Case 2.1: $x_{2it} \sim N(gt/6, 1) + f_i$, $n_{gt}$ = 200, sampling rate = 0.2%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.8  Small panel with G = 6, T = 4. Case 2.1: $x_{2it} \sim N(gt/6, 1) + f_i$, $n_{gt}$ = 200, sampling rate = 0.2%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.9  Small panel with G = 6, T = 4. Case 3.a: $x_{2it} \sim N(gt/6, 1) + z_i$, $n_{gt}$ = 200, sampling rate = 1%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.10  Small panel with G = 6, T = 4. Case 3.b: $x_{2it} \sim N(gt/6, 1) + z_i$, $n_{gt}$ = 1000, sampling rate = 1%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.11  Small panel with G = 6, T = 4. Case 3.1: $x_{2it} \sim N(gt/6, 1) + z_i$, $n_{gt}$ = 200, sampling rate = 0.2%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.12  Small panel with G = 6, T = 4. Case 3.1: $x_{2it} \sim N(gt/6, 1) + z_i$, $n_{gt}$ = 200, sampling rate = 0.2%.
$se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.13  Small panel with G = 6, T = 4. Case 4.a: $x_{2it} \sim N(gt/6, 1) + z_i + f_i$, $n_{gt}$ = 200, sampling rate = 1%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.14  Small panel with G = 6, T = 4. Case 4.b: $x_{2it} \sim N(gt/6, 1) + z_i + f_i$, $n_{gt}$ = 1000, sampling rate = 1%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.15  Small panel with G = 6, T = 4. Case 4.1: $x_{2it} \sim N(gt/6, 1) + z_i + f_i$, $n_{gt}$ = 200, sampling rate = 0.2%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.16  Small panel with G = 6, T = 4. Case 4.1: $x_{2it} \sim N(gt/6, 1) + z_i + f_i$, $n_{gt}$ = 200, sampling rate = 0.2%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.17  Small panel with G = 6, T = 4. Case 5.a: $x_{2it} \sim N(gt/2, 1) + z_i + f_i$, $n_{gt}$ = 200, sampling rate = 1%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.18  Small panel with G = 6, T = 4. Case 5.b: $x_{2it} \sim N(gt/2, 1) + z_i + f_i$, $n_{gt}$ = 1000, sampling rate = 1%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.19  Small panel with G = 6, T = 4. Case 5.1: $x_{2it} \sim N(gt/2, 1) + z_i + f_i$, $n_{gt}$ = 200, sampling rate = 0.2%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table B.20  Small panel with G = 6, T = 4. Case 5.2: $x_{2it} \sim N(gt/2, 1) + z_i + f_i$, $n_{gt}$ = 1000, sampling rate = 0.2%. $se_n$, $se_r$ and $se_c$ are the non-robust, robust and cluster-robust standard errors, respectively.
Table 3.1  Comparison of the plug-in estimator with the sequential g-estimator under different specifications for the outcome conditional mean and different structural nested mean models.
Table 3.2  Simulation results: flexible plug-in g-formula vs. sequential g-estimator

LIST OF FIGURES

Figure 3.1  A directed acyclic graph for a longitudinal study with three time points. (A, M) are the intervention nodes, (L0, L1, Y) the non-intervention nodes, and U0 the unobservables.
Figure 3.2  Subgraphs for G, where an upper bar means arrows pointing to a node are removed and an under bar means arrows emitting from a node are removed.
Figure 3.3  Controlled direct effect of LBW on reading with bad behavior as mediator, Models 1 to 5

CHAPTER 1
FINITE SAMPLE PROPERTIES OF THE MINIMUM DISTANCE ESTIMATOR FOR PSEUDO PANEL DATA

1.1 Introduction

Repeated cross-sectional data is available when a series of different random samples can be obtained from the population over time. The Current Population Survey in the U.S., conducted monthly, is an example of this type of data set. By combining cross sections at consecutive points in time, repeated cross-sectional data provides replicability over time in the absence of genuine panel data.
Although we still cannot track each individual over time, we are able to estimate certain panel data models, especially those with fixed individual-specific effects and those with individual dynamics, under appropriate conditions. The literature that makes possible the estimation of these panel data models with only repeated cross sections dates back to the seminal work by Deaton (1985). Deaton's idea is to divide individuals into cohorts according to certain predetermined characteristics, such as year of birth, and then use the cohort means of all relevant variables to construct a panel at the cohort level. Since the variable cohort means are estimated rather than directly observed, such a constructed panel is often called a pseudo panel. Common panel data approaches such as first differencing and fixed effects (FE) estimation are readily applicable because of this panel structure.

In this chapter, our focus is on the pseudo panel FE estimator. Despite the fact that the cohort means are error-ridden estimates, the pseudo panel FE coefficient estimator is generally consistent. The corresponding standard error estimators (the naive standard errors hereafter), however, are potentially problematic because they ignore the estimation errors in the cohort means, whether or not they are made robust to heteroskedasticity and/or serial correlation. To make the standard errors right, Imbens and Wooldridge (2007) propose a minimum distance (MD) approach for pseudo panel models. With asymptotics relying on large cohort sizes, this approach is a natural fit for many microeconomic analyses, since for microeconomic data the cohort-wise number of observations is often large, and the number of cohorts and the number of time periods are often small. The MD approach properly accounts for the fact that the cohort means are estimated. More importantly, it provides an asymptotically efficient way to utilize all the moment conditions through its weighting procedure. In fact, Imbens and Wooldridge (2007) show that the pseudo panel fixed effects estimator is exactly the MD estimator that puts equal weights on the moment conditions via an identity weighting matrix.

The superiority of the MD approach for pseudo panels relies on large sample theory, but its finite sample properties have not been fully studied. It is possible that the naive FE standard errors, especially those made robust to heteroskedasticity and/or serial correlation, can still achieve acceptable accuracy under certain circumstances. Moreover, although the result on optimal weighting in Imbens and Wooldridge (2007) implies that departures of the optimal weighting matrix from identity call for optimal weighting, it is unclear what the typical causes of those departures are.

In this chapter, we investigate the finite sample properties of the MD approach for pseudo panels through a carefully designed simulation study. In particular, attention is paid to the comparison of the optimal MD estimator and the MD estimator with the identity weighting matrix. We identify two stylized causes, namely varying cohort sizes and cohort-wise error heteroskedasticity, of departures of the optimal weighting matrix from identity. In the presence of these two features, optimal weighting evidently outperforms identity weighting. As for the naive FE inference, we find that it is always inferior to the MD FE inference. Therefore, we should never throw away individual-level data in empirical studies, for it contains useful information that the sample cohort means do not have.
The MD approach is certainly not the only approach to pseudo panels. Deaton (1985), for example, treats the estimated cohort means as a measurement error problem, and proposes a measurement-error corrected ordinary least squares (OLS) estimator. Collado (1997) extends the analysis to dynamic models, and develops a measurement-error corrected GMM estimator based on the instrumental variables (IV) method in Arellano and Bond (1991). Another strand of research goes beyond pseudo panels and works at the individual level. Moffitt (1993) considers both dynamic and binary choice models, and proposes an IV estimator that constructs instruments from functions of cohort and/or time. In particular, Moffitt points out that aggregating to the cohort level is equivalent to using a full set of cohort, time, and cohort-time dummies as IV. Girma (2000) quasi-differences pairs of individuals in the same cohort to circumvent the problem of missing individual trajectories, and proposes a particular GMM IV method that uses past and present values of the dependent and explanatory variables within the same group. Verbeek and Vella (2005) propose an alternative, computationally attractive IV estimator. A more thorough review that also covers important empirical applications can be found in Verbeek (2008).

The rest of the chapter is organized as follows. Section 1.2 sets up the notation and framework. Section 1.3 reports and discusses the results from the simulation study. Section 1.4 concludes.

1.2 Framework

Deaton (1985) shows the importance of distinguishing between the population model and the sampling scheme. This distinction, as pointed out by Imbens and Wooldridge (2007), "is critical for understanding the nature of the identification problem, and in deciding the appropriate asymptotic analysis". Therefore we follow this convention in this chapter. The exposition in this section borrows heavily from Imbens and Wooldridge (2007).

1.2.1 The population models

Consider the population model
\[
y_{it} = x_{it}\beta + \eta_t + f_i + u_{it}, \qquad t = 1, \dots, T, \tag{1.1}
\]
in which $y_{it}$ is the dependent random variable, $x_{it}$ is a $1 \times K$ vector of random covariates with the first entry a constant term, $f_i$ is the unobserved time-invariant effect, and $u_{it}$ is the unobserved idiosyncratic error. $\beta$ is the parameter of practical interest. The $\eta_t$'s are time-varying intercepts and are also treated as parameters to estimate, since we are considering applications with small $T$. An alternative representation is to include time dummies in $x_{it}$, in which case the $\eta_t$'s are absorbed into $\beta$.

The index $i$ refers to the same individual over time in the population model. Writing the subscript $i$ explicitly helps to indicate whether a quantity changes only across $t$, only across $i$, or across both, which will become useful later. The model (1.1) imposes the same data generating structure for all $T$ time periods, which assumes a population that is stationary over time. Later we will see that, by a stationary population, we essentially mean that the population cohort means of $f_i$ do not change over time.

Following Deaton (1985), we assume the population can be divided into $G$ predetermined groups. The group designation must be determined before the samples are drawn, and must be independent of time. Birth year, for example, is one of the most commonly used characteristics to define the group designation. Let $g_i$ be the random variable indicating the group membership of a random draw $i$; $g_i$ takes values in $\{1, 2, \dots, G\}$.
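As a concrete illustration of a predetermined, time-invariant group designation, the short Python sketch below maps birth years into five-year cohorts. It is our own toy example, not part of the chapter's design, and the band boundaries are hypothetical.

```python
def cohort_of(birth_year: int) -> int:
    """Map a birth year to one of G = 6 five-year cohorts (hypothetical bands).

    The designation depends only on the birth year, so it is fixed before any
    sample is drawn and cannot change over time, the two requirements for a
    valid group designation in a pseudo panel.
    """
    if not 1940 <= birth_year <= 1969:
        raise ValueError("birth year outside the cohort definition")
    return (birth_year - 1940) // 5 + 1

assert cohort_of(1942) == 1 and cohort_of(1969) == 6
```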
Taking the expectation of (1.1) conditional on group membership, we have
\[
E(y_{it}|g_i = g) = E(x_{it}|g_i = g)\beta + \eta_t + E(f_i|g_i = g) + E(u_{it}|g_i = g), \qquad t = 1, \dots, T, \; g = 1, \dots, G. \tag{1.2}
\]
Define the population cohort means as
\[
\mu^y_{gt} = E(y_{it}|g_i = g), \quad \mu^x_{gt} = E(x_{it}|g_i = g), \quad \alpha_g = E(f_i|g_i = g), \quad \delta_{gt} = E(u_{it}|g_i = g) \tag{1.3}
\]
for $g = 1, \dots, G$ and $t = 1, \dots, T$. Note that all four quantities above are deterministic population cohort means. Then we can rewrite (1.2) as
\[
\mu^y_{gt} = \mu^x_{gt}\beta + \eta_t + \alpha_g + \delta_{gt}, \qquad g = 1, \dots, G, \; t = 1, \dots, T. \tag{1.4}
\]
(1.2) and (1.4) are different notations for the population model at the cohort level. The parameter $\delta_{gt}$ can be considered the effect of the cohort-time cell $(g, t)$ net of the cohort effect $\alpha_g$ and the time effect $\eta_t$.

Even if $\mu^y_{gt}$ and $\mu^x_{gt}$ are known, the system of linear equations in (1.4) is not identified if we let $\delta_{gt}$ vary freely. Therefore, we need certain restrictions on $\delta_{gt}$. In a standard panel data model, a weak exogeneity assumption we usually make is the contemporaneous exogeneity of $x_{it}$ given $f_i$:
\[
E(u_{it}|x_{it}, f_i) = 0, \qquad t = 1, \dots, T. \tag{1.5}
\]
This condition, however, is not required here. A weaker condition that is relevant in the context of (1.1) is
\[
E(u_{it}|f_i) = 0, \qquad t = 1, \dots, T. \tag{1.6}
\]
Note that by iterated expectations, (1.5) implies (1.6). This gives pseudo panels a certain flexibility regarding the exogeneity of $x_{it}$, which will be discussed in more detail later.

Because $f_i$ summarizes all time-invariant unobservables, Imbens and Wooldridge (2007) argue that (1.6) should be true not only for the lump sum $f_i$ but also for any time-invariant factor, including $g_i$. In other words, $f_i$ should represent any random variable that does not depend on time. While this thought experiment makes sense, rigorously speaking it does impose stronger conditions than (1.6).¹ Nevertheless, we keep this treatment in this chapter. In particular, replacing $f_i$ with the group indicator $g_i$, we obtain
\[
E(u_{it}|g_i) = 0, \qquad t = 1, \dots, T. \tag{1.7}
\]
Note that $E(u_{it}|g_i)$ is still a random variable. Since $g_i$ takes only finitely many values, an alternative way to write (1.7) is
\[
\delta_{gt} = E(u_{it}|g_i = g) = 0, \qquad g = 1, \dots, G, \; t = 1, \dots, T. \tag{1.8}
\]
Substituting (1.8) into (1.4), we get
\[
\mu^y_{gt} = \mu^x_{gt}\beta + \eta_t + \alpha_g, \qquad g = 1, \dots, G, \; t = 1, \dots, T. \tag{1.9}
\]
Let $\theta = (\beta', \eta', \alpha')'$ be the $(K + T + G) \times 1$ column vector of parameters, with $\eta = (\eta_1, \dots, \eta_T)'$ and $\alpha = (\alpha_1, \dots, \alpha_G)'$. There are, however, only $K + T + G - 2$ parameters to estimate. Since $x_{it}$ includes a constant term, only $G - 1$ parameters in $\alpha_g$ and $T - 1$ in $\eta_t$ are separately identifiable. We impose the normalization $\alpha_1 = 0$ and $\eta_1 = 0$, which is slightly different from the normalization $\sum_{g=1}^G \alpha_g = 0$ and $\eta_1 = 0$ in Imbens and Wooldridge (2007). With this treatment, $\alpha_g$, $g = 2, \dots, G$, and $\eta_t$, $t = 2, \dots, T$, represent the net effects relative to the first cohort and the first time period. If $\mu^y_{gt}$ and $\mu^x_{gt}$ are known, $GT \geq K + T + G - 2$, and the equations in (1.9) are linearly independent, then (1.9) contains enough (maybe over-identifying) restrictions to solve for $\theta$.

As pointed out in Imbens and Wooldridge (2007), what (1.7) really imposes is that the cohort-level equations contain only the sets of cohort and time effects but not the cohort-time interaction effects. If $\delta_{gt}$ is nonzero for any cohort-time cell $(g, t)$, then there is a misspecification in the population model (1.1).

¹The sigma algebra generated by $g_i$ is not necessarily a subset of the sigma algebra generated by $f_i$.
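Before turning to the extreme case of a full set of cohort-time effects, the counting and rank conditions above are easy to check numerically. The sketch below (our own illustration, with arbitrary stand-in cohort means) builds the $GT \times (K + T + G - 2)$ design matrix of the system (1.9) under the normalization $\alpha_1 = \eta_1 = 0$ and verifies that it has full column rank, so that $\theta$ is (over-)identified.

```python
import numpy as np

G, T, K = 6, 4, 2  # cohorts, periods, covariates incl. the constant (toy sizes)
rng = np.random.default_rng(0)

rows = []
for g in range(G):
    for t in range(T):
        mu_x = rng.normal(size=K)   # stand-in for the cohort means mu^x_gt
        mu_x[0] = 1.0               # first covariate is the constant term
        d = np.eye(T)[t, 1:]        # time dummies, eta_1 normalized to 0
        c = np.eye(G)[g, 1:]        # cohort dummies, alpha_1 normalized to 0
        rows.append(np.concatenate([mu_x, d, c]))

X = np.vstack(rows)                 # one row per cohort-time cell: GT x (K+T+G-2)
print(X.shape)                                  # (24, 10)
print(np.linalg.matrix_rank(X), K + T + G - 2)  # full column rank: 10 10
print(G * T - (K + T + G - 2))                  # 14 over-identifying restrictions
```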
In the extreme case where the true model contains a full set of cohort-time net effects, nothing is identified, since the identification of any parameter comes from the variation of its associated variable over cohort and/or time. Perhaps another representation helps one understand this better. Write the population model with a full set of cohort-time effects as
\[
y_{it} = x_{it}\beta + \eta_t + f_i + \delta_{g_i,t} + u_{it}, \qquad t = 1, \dots, T,
\]
where $\delta_{g_i,t} = E(u_{it}|g_i)$, the cohort-time effect of cell $(g_i, t)$, is properly treated as a random variable. Then (1.7) is exactly $\delta_{g_i,t} = 0$, i.e. the population model does not contain a full set of cohort-time effects.

Details about some common estimation strategies given (1.9), such as OLS, FE and FD, can be found in Imbens and Wooldridge (2007). They are straightforward once the cohort means are treated as known.

1.2.2 Discussion on exogeneity

We argue that a subtle flexibility is gained thanks to the fact that (1.6) is weaker than (1.5). Specifically, the weaker condition (1.6) allows the deviation of $x_{it}$ from its cohort mean to be non-exogenous with respect to the deviation of $u_{it}$ from its cohort mean. Put differently, within a given cohort-time cell, $x_{it}$ and $u_{it}$ are allowed to be correlated. But at the cohort level, the cohort mean of $x_{it}$ must be exogenous with respect to the cohort mean of $u_{it}$, if we treat the variation in their cohort means over cohort and time as the source of randomness. In sum, endogeneity at the individual level is allowed, but exogeneity at the cohort level is still required.

The first implication of this is that the allowed dependence between $x_{it}$ and $u_{it}$ is not arbitrary. $x_{it}$ can still contain lagged dependent variables, most commonly $y_{i,t-1}$, or explanatory variables that are contemporaneously endogenous, but the dependence cannot be fundamental, in the sense that it must not survive at the cohort level. In our setup, this is guaranteed by two restrictions: (i) the specification of the individual-level population model (1.1) is correct, and (ii) the zero-cohort-mean condition on $u_{it}$ in (1.7) holds. Together they translate into the exclusion of a full set of cohort-time effects.

Another implication is that, if $G$ is large enough so that we can rely on large-$G$ asymptotics,² we do not need the zero-cohort-mean condition on $u_{it}$ imposed in (1.7) for consistent estimation of $\beta$. The condition can be relaxed to some form of exogeneity at the cohort level. Let $g = g_i$ to simplify notation, and denote the cohort-level random explanatory variables and error by $\mu^x_{gt}$ and $\delta_{gt}$, which treats the cohort dimension as random but still leaves the time dimension fixed. Then one form of such an exogeneity assumption can be expressed as
\[
E(\delta_{gt}|\mu^x_{gt}) = 0, \qquad t = 1, \dots, T. \tag{1.10}
\]
Apparently, condition (1.7) implies (1.10). The analysis in Deaton (1985) goes a bit further and treats the time dimension as random as well, and thus relies on large-$GT$ asymptotics, but the idea is essentially the same. Nevertheless, this treatment "seems unnatural for the way pseudo panels are constructed, and the thought experiment about how one might sample more and more groups is convoluted", as pointed out by Imbens and Wooldridge (2007). Therefore, if we do not have large $G$ but only large cohort-time cell sizes, $N_{gt}$, MD estimation is the way to go, and we need to impose the stronger zero-cohort-mean condition (1.7).

²Alternatively, we can assume the conditional distribution of $\delta_{gt}$ given $\mu^x_{gt}$ is normal and use maximum likelihood estimation.
The treatment of $\mu^x_{gt}$ and $\delta_{gt}$ above also breaks the barrier between the view that constructs cohort-level equations from the individual level, as represented by Deaton (1985) and Imbens and Wooldridge (2007), and the view that starts the analysis directly at the cohort level. Both views make sense and are unified under this treatment. But when starting the analysis from the cohort level, we need to make sure that the assumptions are consistent with the process of construction from the individual level. In particular, attention should be paid to proper asymptotics.

1.2.3 Minimum distance estimation

Given a repeated cross-sectional data set with large cohort sizes, a small number of cohorts and a small number of time periods, the MD estimator is a natural fit. Because of the large cohort sizes, the cohort means $\mu^y_{gt}$ and $\mu^x_{gt}$ in (1.9) can be estimated fairly precisely by their sample analogs in each cohort-time cell. The system of equations (1.9) is the link between the reduced-form parameters $\{(\mu^y_{gt}, \mu^x_{gt}),\ g = 1, \dots, G,\ t = 1, \dots, T\}$ and the structural parameter $\theta$. The MD approach is essentially a delta method, recovering structural estimates from reduced-form estimates. In the next several subsections, we derive the limiting distribution of the sample cohort means, present the minimization problem of MD estimators, and give a closed-form expression for the general MD estimator for pseudo panels. In particular, the optimal MD estimator and the FE estimator, viewed as the MD estimator with identity weighting, are discussed in detail.

1.2.3.1 Limiting distribution of cohort-time cell means

Specifically, assume we have a random sample on $(x_{it}, y_{it})$ of size $n_t$ for each $t$, denoted collectively by $\{(x_{it}, y_{it}),\ i = 1, \dots, n_t\}$. The index $i$ may refer to different individuals in different time periods. This notation works fine as long as we keep in mind that in each time period we have a new random sample.

For each random draw $i$, let $r_{it} = (r_{it,1}, r_{it,2}, \dots, r_{it,G})$ be a vector of group indicators such that $r_{it,g} = 1_{\{g_i = g\}}$, where $1_A$ is the indicator function that takes values in $\{0, 1\}$ and equals 1 only if $A$ is true. In this way we properly treat the group membership of the random draw $i$ as a random vector $r_{it}$. With $r_{it}$, the sample average of the response variable in cohort-time cell $(g, t)$ can be written as
\[
\hat\mu^y_{gt} = n_{gt}^{-1} \sum_{i=1}^{n_t} r_{it,g}\, y_{it} = (n_{gt}/n_t)^{-1}\, n_t^{-1} \sum_{i=1}^{n_t} r_{it,g}\, y_{it}, \tag{1.11}
\]
where $n_{gt} = \sum_{i=1}^{n_t} r_{it,g}$ is properly treated as a random variable.

$\hat\mu^y_{gt}$ is generally consistent for $\mu^y_{gt}$. Specifically, let $\rho_g = P(r_{it,g} = 1)$, the fraction of the population in cohort $g$. We treat $\rho_g$ as time invariant because we assume the population is stationary. Then
\[
\hat\rho_{gt} = (n_{gt}/n_t) \overset{p}{\longrightarrow} \rho_g, \tag{1.12}
\]
and thus we have
\[
\hat\mu^y_{gt} = \hat\rho_{gt}^{-1}\, n_t^{-1} \sum_{i=1}^{n_t} r_{it,g}\, y_{it} \overset{p}{\longrightarrow} \rho_g^{-1} E(r_{it,g}\, y_{it}) = \mu^y_{gt}.
\]
The last equality holds because $E(r_{it,g}\, y_{it}) = P(r_{it,g} = 1)\, E(y_{it}|r_{it,g} = 1) = \rho_g \mu^y_{gt}$. The same argument also holds for the other cohort means.

Let $s_{it} = (y_{it}, x_{it})$, and define $\hat\mu^s_{gt} = (\hat\mu^y_{gt}, \hat\mu^x_{gt})$ as in (1.11). Then the asymptotic distribution of $\hat\mu^s_{gt}$ is
\[
\sqrt{n_t}\, (\hat\mu^s_{gt} - \mu^s_{gt}) \longrightarrow \mathrm{Normal}(0,\ \rho_g^{-1}\, \Omega^s_{gt}),
\]
where $\Omega^s_{gt} = \mathrm{Var}(s_{it}|g)$ is the $(K + 1) \times (K + 1)$ variance-covariance matrix for the cohort-time cell $(g, t)$.
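As a computational companion to (1.11) and to the variance estimator (1.14) below, the following Python sketch computes the sample cell means, cell sizes, and within-cell variance-covariance matrices for one cross section. The function and its argument names are ours, not the chapter's.

```python
import numpy as np

def cell_statistics(y, X, cohort):
    """Sample cohort statistics for one cross section (a sketch, our notation).

    y: (n,) outcomes; X: (n, K) covariates; cohort: (n,) labels in {1, ..., G}.
    Returns, for each cohort g, the cell size n_gt, the sample cell mean of
    s_it = (y_it, x_it), and the within-cell variance-covariance matrix of
    s_it, i.e. the sample analogs in (1.11) and (1.14).
    """
    S = np.column_stack([y, X])          # s_it = (y_it, x_it), one row per draw
    stats = {}
    for g in np.unique(cohort):
        Sg = S[cohort == g]
        n_gt = Sg.shape[0]               # realized (random) cell size
        mu = Sg.mean(axis=0)             # sample cell mean of s
        dev = Sg - mu
        Omega = dev.T @ dev / n_gt       # (K+1) x (K+1) within-cell variance matrix
        stats[int(g)] = (n_gt, mu, Omega)
    return stats
```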
When later we stack the means across groups and time periods, it is useful to have the result
\[
\sqrt{n}\, (\hat\mu^s_{gt} - \mu^s_{gt}) \longrightarrow \mathrm{Normal}(0,\ (\rho_g \kappa_t)^{-1}\, \Omega^s_{gt}), \tag{1.13}
\]
where $n = \sum_{t=1}^T n_t$ and $\kappa_t = \lim_{n \to \infty} (n_t/n)$ is essentially the fraction of all observations accounted for by cross section $t$. $\rho_g \kappa_t$ is consistently estimated by $n_{gt}/n$. A consistent estimator for $\Omega^s_{gt}$ is
\[
\hat\Omega^s_{gt} = n_{gt}^{-1} \sum_{i=1}^{n_t} r_{it,g}\, (s_{it} - \hat\mu^s_{gt})'(s_{it} - \hat\mu^s_{gt}), \tag{1.14}
\]
which is the sample variance-covariance matrix of $s$ within cell $(g, t)$.

Let $\pi = (\mu^s_{11}, \mu^s_{12}, \dots, \mu^s_{1T}, \mu^s_{21}, \dots, \mu^s_{GT})'$, the column vector of all cell means. $\pi$ is a $GT(K+1) \times 1$ vector since each $\mu^s_{gt}$ is $1 \times (K+1)$. Define $\hat\pi$ by replacing $\mu^s_{gt}$ with $\hat\mu^s_{gt}$. Now, the $\hat\mu^s_{gt}$ are independent across $g$ because we have random sampling for each $t$. When $x_{it}$ does not contain lags or leads, the $\hat\mu^s_{gt}$ are independent across $t$, too. Then, by stacking (1.13) for all $(g, t)$, we have
\[
\sqrt{n}\, (\hat\pi - \pi) \longrightarrow \mathrm{Normal}(0, \Omega), \tag{1.15}
\]
where $\Omega$ is the $GT(K+1) \times GT(K+1)$ block diagonal matrix whose $gt$-th block is $(\rho_g \kappa_t)^{-1} \Omega^s_{gt}$. Note that $\Omega$ incorporates both the different cell variance-covariance matrices and the different frequencies of observations. As we will see in the simulation study, this is exactly why the optimal MD estimator outperforms other MD estimators when there are cohort-wise heteroskedasticity and varying cohort sizes.

1.2.3.2 Minimum distance approach for pseudo panels

Classical MD estimation is useful for obtaining structural estimates from reduced-form estimates when a known relationship exists between the structural and reduced-form parameters (see, e.g., Wooldridge (2010)). In the pseudo panel setup, the group means in $\pi$ are the reduced-form parameters, $\theta$ contains the structural parameters, and the cohort-level equations embody the known relationship between $\pi$ and $\theta$. To facilitate the discussion, we rearrange terms in (1.9) by putting everything on the left-hand side of the equality sign. Write the resulting expression as
\[
h(\pi, \theta) = 0, \tag{1.16}
\]
where $h(\cdot, \cdot)$ is a $GT \times 1$ vector-valued function (recall $\theta = (\beta', \eta', \alpha')'$). The $gt$-th row of $h(\pi, \theta)$ is $-\mu^y_{gt} + \mu^x_{gt}\beta + \eta_t + \alpha_g$, or equivalently,
\[
\pi_{gt}'(-1, \beta')' + \eta_t + \alpha_g, \tag{1.17}
\]
where $\pi_{gt}$ is the $gt$-th $(K+1) \times 1$ block of $\pi$. The parameters $\pi$ and $\theta$ do not appear in a separable way directly in $h(\pi, \theta)$, but it can be shown that this is a separable case.

The classical MD estimator is a solution to the minimization problem
\[
\min_{\theta \in \Theta}\ h(\hat\pi, \theta)'\, W\, h(\hat\pi, \theta), \tag{1.18}
\]
where $\Theta$ is the parameter space of $\theta$ and $W$ is a $GT \times GT$ weighting matrix. $W$ is needed when the restrictions in (1.16) over-identify $\theta$ ($GT > K + G + T - 2$). We focus on the over-identified case because it is usually the case in practice. Chamberlain (Harvard lecture notes) shows that the optimal weighting matrix is the inverse of
\[
M = \nabla_\pi h(\pi, \theta)\, \Omega\, \nabla_\pi h(\pi, \theta)', \tag{1.19}
\]
where $\nabla_\pi h(\pi, \theta)$ is the $GT \times GT(K+1)$ Jacobian of $h(\pi, \theta)$ with respect to $\pi$. Using the Kronecker product (notation $\otimes$) and (1.17), we have
\[
\nabla_\pi h(\pi, \theta) = I_{GT} \otimes (-1, \beta'),
\]
where $I_{GT}$ is the $GT \times GT$ identity matrix. This last result is exciting because, with $\Omega$ block diagonal, it implies that (1.19) is a $GT \times GT$ diagonal matrix with $gt$-th diagonal entry
\[
(\rho_g \kappa_t)^{-1}\, (-1, \beta')\, \Omega^s_{gt}\, (-1, \beta')'. \tag{1.20}
\]
But recall that $\Omega^s_{gt} = \mathrm{Var}(s_{it}|g)$, so we have
\[
\tau^2_{gt} \equiv (-1, \beta')\, \Omega^s_{gt}\, (-1, \beta')' = \mathrm{Var}(y_{it} - x_{it}\beta\,|\,g),
\]
and therefore a consistent estimator of $\tau^2_{gt}$ is
\[
\check\tau^2_{gt} = n_{gt}^{-1} \sum_{i=1}^{n_t} r_{it,g}\, (y_{it} - x_{it}\check\beta - \check\eta_t - \check\alpha_g)^2,
\]
which is the sample residual variance within cell $(g, t)$. Here $\check\theta$ is the initial estimator of $\theta$ obtained by putting $W = I_{GT}$, $\check\theta = \check\theta(\hat\pi)$. Note that $\check\theta$ is exactly the least squares dummy variables (LSDV) estimator of $\theta$ on the pseudo panel. Since $(\rho_g \kappa_t)^{-1}$ is consistently estimated by $(n_{gt}/n)^{-1}$, (1.20) can be consistently estimated by
\[
(n_{gt}/n)^{-1}\, \check\tau^2_{gt}. \tag{1.21}
\]
Denote by $\check M^{-1}$ the estimated optimal weighting matrix; the $gt$-th diagonal entry of $\check M^{-1}$ is $(n_{gt}/n)/\check\tau^2_{gt}$. Note that $\check M = M(\check\theta) = M(\check\theta(\hat\pi))$ is a function of the reduced-form estimate $\hat\pi$. Then the minimization problem of the optimal MD estimator is
\[
\min_{\theta \in \Theta}\ h(\hat\pi, \theta)'\, \check M^{-1}\, h(\hat\pi, \theta). \tag{1.22}
\]
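The two-step logic behind (1.21) and (1.22) is compact enough to sketch in code: fit the identity-weighted (LSDV) estimator first, form the within-cell residual variances, and assemble the diagonal of $\check M^{-1}$. This is our own illustration with hypothetical argument names, not code from the chapter.

```python
import numpy as np

def optimal_weight_diagonal(resid, cohort, n_gt, n):
    """Diagonal of the estimated optimal weighting matrix for one period:
    entry (n_gt/n) / tau2_gt, as in (1.21). (A sketch.)

    resid: (n_t,) individual-level residuals y - x*beta - eta_t - alpha_g
           from the initial identity-weighted (LSDV) fit;
    cohort: (n_t,) cohort labels; n_gt: dict of cell sizes; n: total size.
    """
    w = {}
    for g in np.unique(cohort):
        tau2 = np.mean(resid[cohort == g] ** 2)  # within-cell residual variance
        w[int(g)] = (n_gt[int(g)] / n) / tau2
    return w
```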
1.2.3.3 Closed-form MD estimators for pseudo panels

In the pseudo panel setup, $h(\pi, \theta)$ is linear in each argument, and the MD estimator of $\theta$ is available in closed form. We derive this expression in this section.

Let $\mu_{gt} = (\mu^x_{gt}, d_t, c_g)$ be the $1 \times (K + G + T - 2)$ row vector of regressors, where $d_t$ is a $1 \times (T - 1)$ vector of time dummies and $c_g$ is a $1 \times (G - 1)$ vector of group dummies. Stacking these rows gives
\[
\mu^x_g = \begin{pmatrix} \mu_{g1} \\ \mu_{g2} \\ \vdots \\ \mu_{gT} \end{pmatrix}, \quad T \times (K + G + T - 2), \qquad
\mu^x = \begin{pmatrix} \mu^x_1 \\ \mu^x_2 \\ \vdots \\ \mu^x_G \end{pmatrix}, \quad GT \times (K + G + T - 2).
\]
Then $\nabla_\theta h(\pi, \theta) = \mu^x$, and the first-order condition for (1.22) is
\[
\hat\mu^{x\prime}\, \check M^{-1}\, (\hat\mu^y - \hat\mu^x \theta) = 0,
\]
where $\hat\mu^x$ is defined like $\mu^x$ with rows $\hat\mu_{gt} = (\hat\mu^x_{gt}, d_t, c_g)$, and $\hat\mu^y$ stacks the $\hat\mu^y_{gt}$. Therefore, the optimal MD estimator is
\[
\hat\theta = (\hat\mu^{x\prime} \check M^{-1} \hat\mu^x)^{-1}\, \hat\mu^{x\prime} \check M^{-1} \hat\mu^y, \tag{1.23}
\]
which looks like a weighted least squares estimator. Following Chamberlain, the estimated asymptotic variance of $\hat\theta$ is simply
\[
\widehat{\mathrm{Avar}}(\hat\theta) = (\hat\mu^{x\prime} \check M^{-1} \hat\mu^x)^{-1}/n. \tag{1.24}
\]
Because $\check M^{-1}$ is the diagonal matrix with entries $(n_{gt}/n)/\check\tau^2_{gt}$, it is easy to weight each cell $(g, t)$ by $\sqrt{n_{gt}/n}/\check\tau_{gt}$ and then compute both $\hat\theta$ and its asymptotic standard errors via a weighted regression. In STATA, this can be done by specifying the aweight $(n_{gt}/n)/\check\tau^2_{gt}$.

The FE estimator applied to the pseudo panel of cohort means turns out to be the MD estimator with the identity weighting matrix. To see that, we simply replace $\check M^{-1}$ with the identity matrix $I_{GT}$ in (1.23):
\[
\check\theta = (\hat\mu^{x\prime} \hat\mu^x)^{-1}\, \hat\mu^{x\prime} \hat\mu^y. \tag{1.25}
\]
Strictly speaking, (1.25) is the LSDV estimator on the pseudo panel. But since it gives the same estimates for $\beta$, we also call it the FE estimator. The MD asymptotic variance estimator for $\check\theta$ is
\[
\widehat{\mathrm{Avar}}(\check\theta) = (\hat\mu^{x\prime} \hat\mu^x)^{-1}\, (\hat\mu^{x\prime} \check M \hat\mu^x)\, (\hat\mu^{x\prime} \hat\mu^x)^{-1}/n. \tag{1.26}
\]
Apparently, this formula is different from the naive FE asymptotic variance estimators to be discussed in the next section, whether or not they are made robust to heteroskedasticity and/or serial correlation.
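A compact sketch of (1.23) through (1.26), assuming the pseudo panel has already been assembled: `mu_y` and `mu_x` stack the estimated cell means and regressors (including the time and cohort dummies), and the weights come from a first-stage identity-weighted fit. The names and interface below are ours.

```python
import numpy as np

def md_estimators(mu_y, mu_x, tau2, n_gt, n):
    """Closed-form MD estimators on the pseudo panel, as in (1.23)-(1.26).

    mu_y: (GT,) stacked cell means of y; mu_x: (GT, p) stacked regressors
    (cell means of x plus time and cohort dummies); tau2: (GT,) within-cell
    residual variances from an initial identity-weighted fit; n_gt: (GT,)
    cell sizes; n: total number of individual observations. (A sketch.)
    """
    w = (n_gt / n) / tau2                                 # diagonal of M^{-1}
    XtW = mu_x.T * w                                      # mu_x' M^{-1}
    theta_opt = np.linalg.solve(XtW @ mu_x, XtW @ mu_y)   # optimal MD, (1.23)
    avar_opt = np.linalg.inv(XtW @ mu_x) / n              # (1.24)

    XtX_inv = np.linalg.inv(mu_x.T @ mu_x)
    theta_fe = XtX_inv @ (mu_x.T @ mu_y)                  # identity weighting, (1.25)
    middle = (mu_x.T * (1.0 / w)) @ mu_x                  # mu_x' M mu_x
    avar_fe = XtX_inv @ middle @ XtX_inv / n              # MD FE variance, (1.26)
    return theta_opt, avar_opt, theta_fe, avar_fe
```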
1.2.3.4 Discussion on the difference between MD FE and naive FE inference

Unlike the optimal MD asymptotic variance, the MD asymptotic variance for $\check\theta$ cannot be estimated directly from a weighted regression. In fact, since the corresponding weighting matrix for FE is the identity matrix, the correct weight for each cell is simply no weight (equal weight). Without any weighting, a linear regression gives us the naive asymptotic variance estimator³
\[
\widehat{\mathrm{Avar}}_n(\check\theta) = (\hat\mu^{x\prime} \hat\mu^x)^{-1}\, \check\sigma^2, \tag{1.27}
\]
where
\[
\check\sigma^2 = (GT - 1)^{-1} \sum_{g,t} \left( \hat\mu^y_{gt} - \hat\mu^x_{gt}\check\beta - \check\eta_t - \check\alpha_g \right)^2.
\]
Clearly, (1.26) and (1.27) coincide if $\check M/n$ equals $\check\sigma^2 I_{GT}$, which is generally not the case.

Making it robust to heteroskedasticity (White (1980)), we get the naive heteroskedasticity-robust asymptotic variance estimator
\[
\widehat{\mathrm{Avar}}_r(\check\theta) = \left( \hat\mu^{x\prime} \hat\mu^x \right)^{-1} \hat\mu^{x\prime}\, \mathrm{diag}(\hat\mu^{\check u})^2\, \hat\mu^x \left( \hat\mu^{x\prime} \hat\mu^x \right)^{-1},
\]
where $\mathrm{diag}(\hat\mu^{\check u})$ is the diagonal matrix created by putting the vector $\hat\mu^{\check u}$ on the principal diagonal. $\hat\mu^{\check u}$ is the column vector that stacks all cohort-level residuals over $g$ and $t$, and its $[(g-1)T + t]$-th entry is
\[
\hat\mu^{\check u}_{gt} = n_{gt}^{-1} \sum_{i=1}^{n_t} r_{it,g}\, \check u_{it}.
\]
That is, $\hat\mu^{\check u}_{gt}$ is the sample cohort mean of the individual-level residuals within cell $(g, t)$. The individual-level residual, $\check u_{it}$, is defined as
\[
\check u_{it} = y_{it} - x_{it}\check\beta - (\check\eta_t + \check\alpha_g).
\]
Note that $(\hat\mu^{\check u}_{gt})^2$ is different from $\check\tau^2_{gt}$. The former is the square of the residual cohort mean for cell $(g, t)$, which contains only cohort-level information, whereas the latter is the sample variance of the residuals within cell $(g, t)$, which contains individual-level information.

Further making it robust to heteroskedasticity and serial correlation (see, e.g., Wooldridge (2010)), we get the naive cluster-robust asymptotic variance estimator
\[
\widehat{\mathrm{Avar}}_c(\check\theta) = \left( \hat\mu^{x\prime} \hat\mu^x \right)^{-1} \hat\mu^{x\prime}\, \mathrm{diag}_G(\hat\mu^{\check u})\, \mathrm{diag}_G(\hat\mu^{\check u})'\, \hat\mu^x \left( \hat\mu^{x\prime} \hat\mu^x \right)^{-1},
\]
where $\mathrm{diag}_G(\hat\mu^{\check u})$ is the block diagonal matrix with $g$-th diagonal block $\hat\mu^{\check u}_g$ for $g = 1, \cdots, G$. The subscript $G$ in the notation $\mathrm{diag}_G$ indicates that the block diagonal matrix has $G$ blocks on the diagonal. $\hat\mu^{\check u}_g$ is a $T \times 1$ vector with $t$-th entry $\hat\mu^{\check u}_{gt}$. Alternatively, the middle term can be written as
\[
\mathrm{diag}_G(\hat\mu^{\check u})\, \mathrm{diag}_G(\hat\mu^{\check u})' = \mathrm{diag}\left( \hat\mu^{\check u}_1 \hat\mu^{\check u\prime}_1,\ \hat\mu^{\check u}_2 \hat\mu^{\check u\prime}_2,\ \cdots,\ \hat\mu^{\check u}_G \hat\mu^{\check u\prime}_G \right).
\]
Unlike the diagonal matrix $\check M/n$, this is a block diagonal matrix with $\hat\mu^{\check u}_g \hat\mu^{\check u\prime}_g$ as the $g$-th diagonal block.

To summarize, the three naive FE asymptotic variance estimators can be obtained by replacing $\check V_0 = \check M/n$ in (1.26) with $\check V_n = \check\sigma^2 I_{GT}$, $\check V_r = \mathrm{diag}(\hat\mu^{\check u})^2$ and $\check V_c = \mathrm{diag}_G(\hat\mu^{\check u})\, \mathrm{diag}_G(\hat\mu^{\check u})'$, respectively. But the naive FE inference is fundamentally different from the MD inference. The former relies only on cohort-level information, whereas the latter extracts information from the individual level. The robustness of $\widehat{\mathrm{Avar}}_r(\check\theta)$ and $\widehat{\mathrm{Avar}}_c(\check\theta)$ is also with respect to cohort-level heteroskedasticity and/or serial correlation only, i.e. heteroskedasticity and/or serial correlation in $\mu^u_{gt}$, which requires at least large-$G$ asymptotics. As illustrated by the simulation study in the next section, the naive FE inference is far less efficient since it discards all individual-level information.

³Proper degrees-of-freedom adjustments can also be proposed, which we do not discuss for simplicity. This remark also applies to the two robust naive variance estimators.
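For comparison with the MD FE variance in (1.26), the sketch below assembles the three naive FE variance estimators from cohort-level quantities alone, which is precisely why they discard the individual-level information. The function and argument names are hypothetical.

```python
import numpy as np

def naive_fe_avars(mu_x, u_bar, G, T):
    """The three naive FE variance estimators of section 1.2.3.4 (a sketch).

    mu_x: (G*T, p) pseudo panel regressors; u_bar: (G*T,) cohort-level
    residual means, stacked with t running fastest so that entry
    (g-1)*T + t - 1 belongs to cell (g, t). All names are ours.
    """
    A = np.linalg.inv(mu_x.T @ mu_x)
    sigma2 = (u_bar @ u_bar) / (G * T - 1)   # residual variance with GT-1 adjustment
    avar_n = A * sigma2                      # non-robust, (1.27)

    Vr = np.diag(u_bar ** 2)                 # heteroskedasticity-robust middle matrix
    avar_r = A @ mu_x.T @ Vr @ mu_x @ A

    Vc = np.zeros((G * T, G * T))            # cluster-robust middle matrix
    for g in range(G):                       # one T x T outer-product block per cohort
        block = np.outer(u_bar[g*T:(g+1)*T], u_bar[g*T:(g+1)*T])
        Vc[g*T:(g+1)*T, g*T:(g+1)*T] = block
    avar_c = A @ mu_x.T @ Vc @ mu_x @ A
    return avar_n, avar_r, avar_c
```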
1.3 Simulation and results

We now present the Monte Carlo simulation study that investigates the finite sample properties of the MD approach for pseudo panels. The simulation study focuses on two questions. First, what are the typical scenarios in which the optimal MD estimator outperforms the FE estimator? From (1.21), we know that if there is cohort-wise heteroskedasticity and/or variation in the cell sizes, the optimal MD estimator is expected to outperform the FE estimator. In general, if any pattern in the population model makes the optimal weighting matrix evidently different from the identity matrix, the optimal MD estimator is supposed to perform better. We check whether this is the case in the simulation study. Second, can the naive FE inference still provide satisfactory accuracy, and if it can, in which typical scenarios? As discussed in the last section, the naive FE asymptotic variances and the MD FE asymptotic variance are alike in their formulas. On the other hand, the naive FE inference is fundamentally different from the MD FE inference in that the former discards all information at the individual level. The simulation study helps to reconcile these two seemingly conflicting facts.

As Imbens and Wooldridge (2007) point out, the simulation design requires care in at least two places. First, data for each cross section should be drawn from the population independently across time, and the group identifier should also be randomly drawn. This is accomplished by a two-step procedure. In the first step, we draw the population using (1.1). The population cohort sizes are fixed and, depending on the design, may or may not vary with cohort and/or time.⁴ In the second step, we mimic the sampling scheme of repeated cross sections by drawing independent random samples over time. In each period, we draw a tiny portion of the population as the cross-sectional sample for that period. Second, the underlying model should have full time effects to be realistic. If, as in Verbeek and Vella (2005), we omit the aggregate time effects while letting the explanatory variables have means that differ by cohort-time cell, the variation in $\mu^x_{gt}$ will be relatively rich, and we may thus set up too optimistic a situation for the estimators.

We consider five scenarios. The first is a benchmark scenario in which all things are balanced across cohort-time cells. In the remaining four scenarios, we manipulate four different features of the population model, namely the time effects, the covariate distributions, cohort-wise heteroskedasticity, and varying cohort sizes, one at a time. In this way, it is easy to isolate the cause of any difference in performance. We begin with the benchmark scenario.

⁴Ideally, we would like a population with infinitely many observations so that it is infinitely close to the population distribution defined by (1.1). In reality this is impossible, so we draw a large number of individuals to approximate the population distribution.

1.3.1 Benchmark

In the benchmark scenario, we generate the outcome $y_{it}$ as a linear function of the covariates $(x_{1it} = 1, x_{2it}, x_{3it}, x_{4it})$, the time effect $\eta_t$, the individual effect $f_i$ and the idiosyncratic error $u_{it}$, as in
\[
y_{it} = \beta_1 + \beta_2 x_{2it} + \beta_3 x_{3it} + \beta_4 x_{4it} + \eta_t + f_i + u_{it}, \qquad i = 1, \cdots, N_t, \; t = 1, \cdots, T. \tag{1.28}
\]
The parameter values used are $\beta = (\beta_1, \beta_2, \beta_3, \beta_4)' = (1, 1, 1, 1)'$. The time effects are generated by $\eta_t = t - 1$, and the cohort effects are generated by $\alpha_g = g - 1$. Individual fixed effects are generated by adding a random normal disturbance to the cohort effects, i.e. $f_i \sim N(\alpha_g, 1)$. The distribution of the idiosyncratic error is given by $u_{it} \sim N(0, 10)$. To fix ideas, it might be helpful to think of $x_{2it}$, $x_{3it}$ and $x_{4it}$ as education, experience and marital status, respectively. The outcome $y_{it}$ is the log hourly wage, and there is an individual effect $f_i$ representing some unobserved ability.

The three explanatory variables are generated as follows:
\[
x_{2it} \sim N(gt/6,\ 1), \qquad x_{3it} \sim N(\sin(gt),\ 1), \qquad x_{4it} \sim \mathrm{Bernoulli}\!\left( \frac{1}{1 + \exp[1.5 \sin(gt/2)]} \right).
\]
That is, $x_{2it}$ is a continuous variable with population cohort mean $gt/6$ and within-cell variance 1; $x_{3it}$ is a continuous variable with population cohort mean $\sin(gt)$ and within-cell variance 1; and $x_{4it}$ is a binary variable equal to 1 with probability $1/(1 + \exp[1.5 \sin(gt/2)])$. The key is to let the three variable cohort means have distinct variation over $g$ and $t$.
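The benchmark design is easy to reproduce in a few lines. The following Python sketch generates a single replication: it builds the period-$t$ population according to (1.28) and then draws a .2% random sample as that period's cross section. It is our own illustration; in particular, we read $N(0, 10)$ as a variance of 10, the variable names are ours, and a full implementation would draw the population once and fix it across replications.

```python
import numpy as np

rng = np.random.default_rng(42)
G, T, N_gt, rate = 6, 4, 20_000, 0.002          # population cell size 2e4, .2% sampling
beta = np.array([1.0, 1.0, 1.0, 1.0])

g = np.repeat(np.arange(1, G + 1), N_gt)        # fixed, time-invariant cohort labels
f = rng.normal(g - 1.0, 1.0)                    # f_i ~ N(alpha_g, 1), alpha_g = g - 1

cross_sections = []
for t in range(1, T + 1):
    x2 = rng.normal(g * t / 6.0, 1.0)
    x3 = rng.normal(np.sin(g * t), 1.0)
    p4 = 1.0 / (1.0 + np.exp(1.5 * np.sin(g * t / 2.0)))
    x4 = rng.binomial(1, p4).astype(float)
    u = rng.normal(0.0, np.sqrt(10.0), size=g.size)   # variance-10 idiosyncratic error
    y = beta[0] + beta[1]*x2 + beta[2]*x3 + beta[3]*x4 + (t - 1.0) + f + u
    keep = rng.random(g.size) < rate            # independent .2% draw each period
    cross_sections.append((t, g[keep], np.column_stack([x2, x3, x4])[keep], y[keep]))
```

Each element of `cross_sections` then plays the role of one repeated cross section, from which the cell means of section 1.2.3.1 are computed.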
We apply the optimal MD estimator and the MD estimator with identity weighting. The latter is numerically equivalent to the FE estimator on the pseudo panel of cohort means. For each estimator, we compute the MD coefficient and standard error estimates. For the MD estimator with identity weighting, we also compute the three naive FE standard errors discussed in the last section.

We consider a small panel with $G = 6$ cohorts and $T = 4$ time periods. The population cell sizes $N_{gt}$ are $2 \times 10^4$, $10^5$ and $5 \times 10^5$, respectively, in the three cases considered. After the population panel is generated, we fix it over simulation replications. To mimic the sampling scheme of repeated cross-sectional surveys, we draw .2% of the population in each period. The resulting sample cell sizes $n_{gt}$ are approximately 40, 200 and 1000, respectively. For each case, we consider three different numbers of replications. The results are reported in Table 1.1, Table 1.2 and Table 1.3.

Table 1.1 Results for benchmark. G = 6, T = 4, n_gt ≈ 40, sampling rate = .2%; R denotes the number of replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.

                       MD Identity                              MD Optimal
         β̌        se(β̌)    se_n(β̌)  se_r(β̌)  se_c(β̌)      β̂        se(β̂)
R = 1000
x2       0.977     0.353     0.345     0.313     0.322       0.979     0.347
        (0.345)   (0.041)   (0.080)   (0.097)   (0.142)     (0.343)   (0.041)
x3       0.999     0.199     0.194     0.185     0.201       1.001     0.196
        (0.191)   (0.017)   (0.042)   (0.047)   (0.080)     (0.195)   (0.016)
x4       1.001     0.634     0.619     0.584     0.657       0.998     0.622
        (0.631)   (0.063)   (0.139)   (0.163)   (0.249)     (0.633)   (0.061)
c2       0.980     0.433     0.424     0.400     0.198       0.977     0.425
        (0.420)   (0.032)   (0.090)   (0.122)   (0.092)     (0.423)   (0.031)
d2       1.025     0.412     0.404     0.387     0.469       1.023     0.405
        (0.417)   (0.036)   (0.088)   (0.108)   (0.174)     (0.420)   (0.035)
cons     0.973     0.380     0.373     0.347     0.290       0.978     0.373
        (0.364)   (0.034)   (0.079)   (0.116)   (0.108)     (0.365)   (0.032)
R = 5000
x2       0.981     0.351     0.339     0.308     0.320       0.982     0.345
        (0.345)   (0.041)   (0.079)   (0.095)   (0.143)     (0.347)   (0.040)
x3       1.003     0.198     0.192     0.182     0.194       1.003     0.195
        (0.193)   (0.017)   (0.042)   (0.046)   (0.075)     (0.195)   (0.017)
x4       1.009     0.632     0.611     0.574     0.645       1.006     0.621
        (0.640)   (0.062)   (0.136)   (0.159)   (0.243)     (0.644)   (0.060)
c2       0.984     0.432     0.419     0.393     0.194       0.985     0.425
        (0.420)   (0.032)   (0.090)   (0.118)   (0.090)     (0.424)   (0.031)
d2       1.020     0.410     0.398     0.380     0.459       1.021     0.403
        (0.414)   (0.036)   (0.087)   (0.105)   (0.169)     (0.416)   (0.035)
cons     0.975     0.379     0.368     0.340     0.286       0.975     0.372
        (0.372)   (0.033)   (0.079)   (0.113)   (0.109)     (0.374)   (0.032)
R = 10000
x2       0.984     0.350     0.338     0.306     0.320       0.984     0.344
        (0.344)   (0.041)   (0.079)   (0.095)   (0.141)     (0.346)   (0.040)
x3       1.003     0.198     0.191     0.182     0.193       1.004     0.195
        (0.194)   (0.017)   (0.042)   (0.047)   (0.075)     (0.195)   (0.017)
x4       1.007     0.632     0.610     0.574     0.646       1.005     0.620
        (0.634)   (0.061)   (0.137)   (0.159)   (0.246)     (0.637)   (0.059)
c2       0.984     0.432     0.417     0.392     0.193       0.985     0.424
        (0.422)   (0.031)   (0.090)   (0.119)   (0.089)     (0.426)   (0.030)
d2       1.017     0.410     0.397     0.380     0.458       1.018     0.402
        (0.411)   (0.035)   (0.087)   (0.105)   (0.169)     (0.414)   (0.034)
cons     0.979     0.378     0.367     0.339     0.286       0.978     0.372
        (0.373)   (0.032)   (0.079)   (0.113)   (0.108)     (0.375)   (0.031)

Table 1.2 Results for benchmark. G = 6, T = 4, n_gt ≈ 200, sampling rate = .2%; R denotes the number of replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.
                       MD Identity                              MD Optimal
         β̌        se(β̌)    se_n(β̌)  se_r(β̌)  se_c(β̌)      β̂        se(β̂)
R = 1000
x2       0.990     0.161     0.157     0.140     0.143       0.990     0.160
        (0.165)   (0.009)   (0.034)   (0.042)   (0.062)     (0.165)   (0.009)
x3       1.008     0.089     0.087     0.082     0.087       1.008     0.089
        (0.088)   (0.003)   (0.018)   (0.021)   (0.033)     (0.087)   (0.003)
x4       1.010     0.286     0.279     0.262     0.299       1.011     0.285
        (0.289)   (0.012)   (0.059)   (0.072)   (0.110)     (0.288)   (0.012)
c2       0.996     0.191     0.186     0.175     0.088       0.995     0.190
        (0.194)   (0.006)   (0.039)   (0.052)   (0.036)     (0.194)   (0.006)
d2       1.000     0.184     0.180     0.171     0.206       1.000     0.183
        (0.181)   (0.007)   (0.038)   (0.047)   (0.075)     (0.182)   (0.007)
cons     0.982     0.168     0.164     0.151     0.128       0.982     0.167
        (0.165)   (0.007)   (0.035)   (0.049)   (0.047)     (0.165)   (0.006)
R = 5000
x2       0.991     0.160     0.157     0.140     0.142       0.992     0.160
        (0.161)   (0.009)   (0.033)   (0.042)   (0.063)     (0.161)   (0.009)
x3       1.005     0.089     0.087     0.082     0.087       1.005     0.089
        (0.089)   (0.003)   (0.018)   (0.020)   (0.032)     (0.089)   (0.003)
x4       1.010     0.286     0.280     0.263     0.298       1.011     0.285
        (0.287)   (0.012)   (0.059)   (0.071)   (0.109)     (0.287)   (0.012)
c2       0.991     0.191     0.187     0.174     0.088       0.991     0.190
        (0.192)   (0.006)   (0.039)   (0.051)   (0.036)     (0.192)   (0.006)
d2       1.006     0.184     0.180     0.171     0.207       1.006     0.184
        (0.181)   (0.007)   (0.038)   (0.046)   (0.074)     (0.181)   (0.007)
cons     0.983     0.168     0.164     0.150     0.127       0.983     0.167
        (0.168)   (0.006)   (0.034)   (0.049)   (0.046)     (0.169)   (0.006)
R = 10000
x2       0.995     0.161     0.157     0.140     0.143       0.995     0.160
        (0.161)   (0.009)   (0.034)   (0.042)   (0.063)     (0.161)   (0.009)
x3       1.004     0.089     0.087     0.082     0.088       1.004     0.089
        (0.089)   (0.003)   (0.018)   (0.020)   (0.032)     (0.089)   (0.003)
x4       1.012     0.286     0.279     0.263     0.298       1.013     0.285
        (0.286)   (0.012)   (0.059)   (0.071)   (0.110)     (0.286)   (0.012)
c2       0.990     0.191     0.186     0.174     0.088       0.990     0.190
        (0.191)   (0.006)   (0.039)   (0.052)   (0.036)     (0.191)   (0.006)
d2       1.002     0.184     0.180     0.171     0.206       1.002     0.183
        (0.183)   (0.007)   (0.038)   (0.046)   (0.074)     (0.183)   (0.007)
cons     0.985     0.168     0.164     0.150     0.127       0.985     0.167
        (0.168)   (0.007)   (0.035)   (0.050)   (0.047)     (0.168)   (0.007)

Table 1.3 Results for benchmark. G = 6, T = 4, n_gt ≈ 1000, sampling rate = .2%; R denotes the number of replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.
There are several observations worth discussing. First of all, the optimal MD estimator has no advantage over the MD estimator with identity weighting in the benchmark. This is because under the current specification the optimal weighting matrix is an identity matrix, so the MD estimator with identity weighting is the optimal MD estimator. Second, even in the case where the sample cohort size is about 40, the two MD estimators perform well. The Monte Carlo averages of the coefficient estimates are fairly close to the true parameter values. For each covariate, the Monte Carlo averages of the standard error estimates are also fairly close to the Monte Carlo standard deviations of the coefficient estimates. Since the results are fairly stable across the three different numbers of replications, we will report the results for R = 10,000 only in later discussions. Third, the three naive FE standard errors are much more volatile than the MD FE standard error. This observation is consistent with the fact that the naive FE inference relies only on cohort-level information and discards all individual-level information. Moreover, the downward small-sample bias appears larger in the naive FE standard errors than in the MD FE standard errors. This is mainly due to the small-G setting. It is well known that the α̌g's are inconsistent under fixed G, which contaminates the residual estimates and in turn the naive FE standard errors. Another reason is that the degrees-of-freedom adjustment used does not take into account the fact that the cohort means are estimated. Although the MD FE standard errors also seem biased downward, the size of the bias is always smaller across the three tables.
Clearly, in the benchmark scenarios the MD FE inference is superior to the naive FE inference. Fourth, the cluster-robust naive FE standard errors are severely biased downwards for the cohort effect α2. This observation remains valid for all the scenarios considered in this chapter. The explanation is again the fact that the estimates for the fixed effects obtained via LSDV are essentially based on only T observations, so the cluster-robust naive FE standard errors are inconsistent for fixed T. Lastly, the performance of all estimators improves universally as the cohort size increases. For the naive FE standard errors, the reason is that the sample cohort means of the residuals approach zero as the cohort size increases. To keep the discussion concise, we will report the results for ngt ≈ 200 in later discussions.

1.3.2 Deterministic aggregate time effects

The aggregate time effects, i.e. the time intercepts ηt, are treated as parameters in the population. Therefore, to generate the aggregate time effects properly, only deterministic functions of time need to be considered. If randomness were instead imposed on ηt, the random disturbance would become part of the idiosyncratic error, which is a separate scenario considered in section 1.3.4. In the benchmark scenario, the aggregate time effects are ηt = t − 1, which is linear in t. In this section, we consider two additional deterministic functions of time, quadratic and natural log:

ηt = (t − 1)^2,    ηt = ln(t).

The variation in the quadratic function is greater than that in the natural log function. The results are reported in Table 1.4. The patterns of the results are similar to those in the last section. In fact, the two panels in Table 1.4 are exactly the same as the third panel in Table 1.2 except for the coefficient estimates on the time dummy d2 in the lower panel where ηt = ln(t). Note that the true coefficient on d2 is ln(2) ≈ .693 in that case. These results suggest that the aggregate time effect process has little effect on the performance of the estimators. It only changes the true parameter values of the time effects. Of course, the fact that the models are correctly specified also plays a role. Correct specification implies that both estimators are consistent. As a result, changes in the deterministic process of the aggregate time effect have little effect on the estimated residuals and thus do not matter for inference or the estimation of other coefficients.

Table 1.4 Results for different aggregate time effect processes. G = 6, T = 4, ngt ≈ 200, sampling rate = .2%; 10,000 replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.
[Table 1.4 body: for the two processes ηt = (t − 1)^2 and ηt = ln(t) and rows x2, x3, x4, c2, d2, cons, the columns report β̌ and se(β̌) for MD identity weighting, the naive standard errors se_n(β̌), se_r(β̌) and se_c(β̌), and β̂ and se(β̂) for optimal MD.]

1.3.3 Covariate distributions

To understand how the distributions of the covariates affect estimation, we manipulate the distribution of the covariates in this section. In particular, attention is paid to the covariate x2. In addition to the distribution x2it ∼ N(gt/6, 1) considered in the benchmark, we look at the following two distributions:

x2it ∼ N((gt)^2/6, 1),    x2it ∼ N(ln(gt)/6, 1).

The quadratic product of g and t embodies a greater variation than the product alone, and the product in turn embodies a greater variation than its natural log transformation. All the other variables are generated as in the benchmark. The results are summarized in Table 1.5. The pattern is similar to that in the last section. The only difference is that the greater variation in the cohort mean of x2it in the first panel makes the estimation of its coefficient easier, whereas in the second panel the weaker variation renders estimation harder. Changes in the distribution of x2it have little effect on the two MD estimators. The explanation is the same as that for the aggregate time effects. Since both estimators are consistent, the variation in the distribution of x2it does not enter the residuals. The optimal weighting matrix is still an identity matrix, so the performance of the two MD estimators is similar.

Table 1.5 Results for distribution of x2. G = 6, T = 4, ngt ≈ 200, sampling rate = .2%; 10,000 replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.

[Table 1.5 body: for the two distributions of x2 and rows x2, x3, x4, c2, d2, cons, same column layout as Table 1.4.]
1.3.4 Cohort-wise heteroskedasticity in the idiosyncratic error

In the benchmark, the idiosyncratic error uit is homoskedastic. In this section, we investigate how cohort-wise heteroskedasticity in uit would affect estimation. We present the results for two cases in which the variance of uit depends on (g, t). Specifically, we consider

uit ∼ N(0, 10 + (gt)^2),    uit ∼ N(0, 10 + gt).

The degree of heteroskedasticity is greater in the first case. All the other variables are generated in the same way as in the benchmark. We note here that introducing variation in the distribution of fi = αg + ε_i^f is at most another way to introduce heteroskedasticity. First of all, it is not interesting to vary the deterministic process of the cohort effects αg, because they are parameters to estimate. Secondly, it is not interesting to vary the mean of the distribution of ε_i^f, because that would only affect the process of the αg's. Lastly, letting the variance of ε_i^f depend on g is the same as introducing cohort-wise heteroskedasticity in uit. It does not make sense to let the variance of ε_i^f depend on t because fi is time invariant.

The results in Table 1.6 show that cohort-wise heteroskedasticity has two major effects. First, the optimal MD estimator outperforms the MD estimator with identity weighting, especially in the top panel where uit ∼ N(0, 10 + (gt)^2). This is because cohort-wise heteroskedasticity makes the optimal weighting matrix non-identity. Secondly, the strict increase in the variance of uit in either case raises the standard errors of both MD estimators compared to the benchmark. This rise is universal.

Table 1.6 Results for cohort-wise heteroskedasticity in error term. G = 6, T = 4, ngt ≈ 200, sampling rate = .2%; 10,000 replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.

[Table 1.6 body: for the two error distributions and rows x2, x3, x4, c2, d2, cons, same column layout as Table 1.4.]

1.3.5 Cohort-time cell size

In this section, we let the cohort-time cell size vary by cohort and time. Specifically, we manipulate the sampling rate so that the sample size for cohort g at time t follows approximately one of the following two processes:

1. ngt ≈ (200 + 180 × 1.5) − 180|g − 3.5| = 470 − 180|g − 3.5|, g = 1, . . . , G
2. ngt ≈ 319 − 68|g − (3.5 + (t − 3))| = 319 − 68|g − t − 0.5|, g = 1, . . . , G
Table 1.7 Cohort-time cell sizes for the two sampling schemes

ngt: 1
 g\t        1     2     3     4
 1         20    20    20    20
 2        200   200   200   200
 3        380   380   380   380
 4        380   380   380   380
 5        200   200   200   200
 6         20    20    20    20
 col. sum 1200  1200  1200  1200    total 4800

ngt: 2
 g\t        1     2     3     4
 1        285   217   149    81
 2        285   285   217   149
 3        217   285   285   217
 4        149   217   285   285
 5         81   149   217   285
 6         13    81   149   217
 col. sum 1030  1234  1302  1234    total 4800

In the first case, the cohort size starts from 20 at cohort 1, increases linearly with step 180 up to 380 at cohorts 3 and 4, and then decreases with the same step down to 20 at cohort 6. The idea is to let the cohorts in the middle have more observations. The overall sample size is about 4800. The second case has approximately the same overall sample size, but the middle peak cohort shifts over time. The highest sample cell size is 285, and the step is 68. The two schemes are shown in Table 1.7. Note that the changes in both schemes are quite radical; the sketch following this discussion reproduces the cell sizes. The results are summarized in Table 1.8. The impact of varying cell size is similar to that of cohort-wise heteroskedasticity. The optimal MD estimator significantly outperforms the MD estimator with identity weighting in both cases. This is again due to the non-identity weighting matrix caused by the varying cell size. The naive FE inference cannot provide satisfactory standard error estimates. Depending on the covariate and the type of robustness, it can either overestimate or underestimate, and the bias is overall large.
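As a quick check on the two designs, the following sketch (ours) evaluates the two cell-size formulas and reproduces the entries of Table 1.7:

    import numpy as np

    G, T = 6, 4
    g = np.arange(1, G + 1)[:, None]          # cohort index as a column
    t = np.arange(1, T + 1)[None, :]          # time index as a row

    # Scheme 1: cell size varies over cohorts only, peaking at cohorts 3 and 4.
    n1 = np.repeat(470 - 180 * np.abs(g - 3.5), T, axis=1)
    # Scheme 2: the peak shifts up by one cohort each period.
    n2 = 319 - 68 * np.abs(g - t - 0.5)

    print(n1.astype(int))                     # columns of 20, 200, 380, 380, 200, 20
    print(n2.astype(int))                     # matches the ngt: 2 panel of Table 1.7
    print(n2.sum(axis=0).astype(int))         # column sums 1030, 1234, 1302, 1234
    print(int(n1.sum()), int(n2.sum()))       # both totals equal 4800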
Table 1.8 Results for varying cohort size. G = 6, T = 4; ngt follows the two specifications given in section 1.3.5 and is generated by varying the sampling rate; 10,000 replications; Monte Carlo averages on top, Monte Carlo standard deviations in parentheses.

[Table 1.8 body: for the two cell-size schemes ngt: 1 and ngt: 2 and rows x2, x3, x4, c2, d2, cons, same column layout as Table 1.4.]

1.4 Conclusion

Building upon the theoretical analysis in Imbens and Wooldridge (2007), we study the finite sample properties of the MD estimator for pseudo panels in this chapter. In particular, we focus on the comparison of the optimal MD estimator and the MD estimator with identity weighting matrix. The latter is of interest because it coincides with the FE estimator applied to the pseudo panel of cohort means. We find that in cases where there is significant heteroskedasticity by cohort-time cells, or in cases where the cohort-time cell size varies, the optimal MD estimator significantly outperforms the MD estimator with identity weighting in that the former's standard errors are smaller. This finding is consistent with the large cohort size asymptotics under the MD estimation framework, as the optimal MD estimator achieves the smallest asymptotic variance.

We also compare the MD FE inference to the naive FE inference. We find that in cases where the optimal weighting matrix is close to an identity matrix, the naive FE standard errors are barely satisfactory. But when the optimal weighting matrix is far from identity, the naive FE standard errors are clearly unacceptable. In any case, the MD FE inference is more reliable than the naive FE inference. This finding is consistent with the fact that the naive FE inference relies on a large number of cohorts and discards all individual-level information. In a setup with small numbers of cohorts and time periods, the naive FE inference cannot work well.

The simulation setup in this analysis considers sample cohort sizes in the hundreds, and the results are already promising provided that the variation in the covariate cohort means is rich enough. In practice, sample cohort sizes of repeated cross sections can easily exceed thousands. Therefore, the results in this chapter should bring confidence to the application of the MD approach to pseudo panels. In future studies, we could extend the analysis to dynamic models with lagged dependent or explanatory variables. As a robustness check, we should allow correlation between the covariates and the individual fixed effects. Moreover, given the weak exogeneity condition (1.7), we could also allow covariates that are endogenous at the individual level but not at the cohort level. Results from these extensions can provide more practical implications.

BIBLIOGRAPHY

Arellano, Manuel, and Stephen Bond.
1991. "Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations." The Review of Economic Studies, 58(2): 277–297.

Collado, M. Dolores. 1997. "Estimating dynamic models from time series of independent cross-sections." Journal of Econometrics, 82(1): 37–62.

Deaton, Angus. 1985. "Panel data from time series of cross-sections." Journal of Econometrics, 30(1): 109–126.

Girma, Sourafel. 2000. "A quasi-differencing approach to dynamic modelling from a time series of independent cross-sections." Journal of Econometrics, 98(2): 365–383.

Imbens, Guido, and Jeffrey M. Wooldridge. 2007. What's New in Econometrics? NBER.

Moffitt, Robert. 1993. "Identification and estimation of dynamic models with a time series of repeated cross-sections." Journal of Econometrics, 59(1): 99–123.

Verbeek, Marno. 2008. "Pseudo-panels and repeated cross-sections." In The Econometrics of Panel Data, 369–383. Springer.

Verbeek, Marno, and Francis Vella. 2005. "Estimating dynamic models from repeated cross-sections." Journal of Econometrics, 127(1): 83–102.

White, Halbert. 1980. "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity." Econometrica, 48(4): 817–838.

Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.

CHAPTER 2

EXPLORING ADDITIONAL MOMENT CONDITIONS IN NON-SEPARABLE MINIMUM DISTANCE ESTIMATION WITH AN APPLICATION TO PSEUDO PANELS

2.1 Introduction

Minimum distance (MD) estimation is a useful approach to recover structural estimates from reduced form estimates when there exists a known relationship between the structural and reduced form parameters. The known relationship often takes the form of structural equations, moment conditions,¹ or restrictions, terms that are used interchangeably hereafter. When applying MD, researchers may encounter situations in which they need to introduce additional moment conditions into the estimation. This could happen, for example, when new instrumental variables (IVs) become available as the research proceeds. An important question to ask in such a situation is whether we can always improve asymptotic efficiency by using all the moment conditions rather than just part of them. In this chapter, we provide an affirmative answer to this question. We show that in MD estimation it never hurts to have more moment conditions. In particular, when the additional moment conditions are non-redundant, adding them to the estimation strictly improves efficiency. This efficiency gain result echoes the similar property for generalized method of moments (GMM) in Breusch et al. (1999).

The motivation for deriving this efficiency gain result comes from the need in pseudo panel models to incorporate external IVs. A pseudo panel model can estimate an underlying unobserved effect panel data model with only repeated cross sections. The idea, which dates back to Deaton (1985), is to divide the population into a number of groups by certain predetermined group membership such as age cohorts. Then the group averages of the variables can be used to construct a panel at the group level. Since the group averages are error-ridden estimates, Deaton suggests treating the estimation as a measurement error problem. In this chapter, we adopt the MD perspective proposed by Imbens and Wooldridge (2007).

¹ More precisely, the moment conditions in MD are conditional moment conditions.
Within the MD framework, the group averages of the variables are the reduced form estimates, and the group averages of the panel data model are the structural equations linking the reduced form to the structural parameters. When new IVs become available, the additional set of structural equations induced by the IVs can be easily added to the estimation. Clearly, the aim of having more structural equations is to improve estimation efficiency. However, no existing theory in MD estimation tells us whether the efficiency gain can be achieved. We therefore derive such a result in this chapter to fill the gap.

We derive the result within a so-called non-separable minimum distance (NMD) framework developed in this chapter. The framework is a special case of the "high level" MD framework in Newey and McFadden (1994).² The key difference between NMD and the high level MD framework is that NMD models the reduced form parameters explicitly. This feature makes the NMD framework convenient to use when our exact purpose is to recover structural estimates from reduced form estimates. The qualifier "non-separable" highlights NMD's capability to deal with structural equations that are non-separable in the structural and reduced form parameters. Note, however, that the separable framework, i.e. the classical minimum distance (CMD) framework, is still covered as a special case.

We establish consistency and asymptotic normality within the NMD framework. We also derive the optimal weighting matrix for the over-identified case in which the number of structural equations is greater than the number of structural parameters. The optimal weighting matrix turns out to be the inverse of the asymptotic variance of the rescaled structural equations, which gives an intuitive explanation of the weighting procedure. That is, the optimal weighting matrix readjusts the relative importance of the conditions according to their own volatility as well as their correlation with each other. Building on these basic results, we then give the main efficiency result discussed at the beginning of the chapter.

After the general results are established in the NMD framework, we apply them back to the case of pseudo panels with external IVs. We show that a pseudo panel NMD estimator with an arbitrary weighting matrix is a generalized least squares (GLS) estimator. The inverse of the optimal weighting matrix corresponds to the usual unconditional variance-covariance matrix in GLS estimation. As a result of the added structural equations, the optimal weighting matrix becomes block diagonal. This result generalizes the finding in Imbens and Wooldridge (2007) that the optimal weighting matrix is diagonal in the case without external IVs.

The inclusion of extra IVs in pseudo panel models also highlights a typical case where the optimal weighting matrix should be used over the naive identity matrix. In the first chapter, we have shown that varying cohort sizes and cohort-wise heteroskedasticity in the idiosyncratic errors are two typical causes of a non-identity yet diagonal optimal weighting matrix. When IVs are added, the optimal weighting matrix is usually block diagonal since within-cohort dependence between structural equations generally exists. As a result, it is more likely to achieve an efficiency gain by using the optimal weighting matrix.

² In effect, the MD framework in Newey and McFadden (1994) is so general that both generalized method of moments (GMM) and classical minimum distance (CMD) are its special cases.
A related question is whether we can estimate pseudo panel models naively by applying fixed effects to the sample cohort means and then making the inference robust to heteroskedasticity and/or serial correlation. In this chapter, we show that the naive fixed effects coefficient estimator is still valid because it coincides with the NMD estimator using the identity weighting matrix. But the naive inference, whether made robust or not, is invalid because it differs from the correct NMD inference. The fundamental reason for the difference is that the naive inference only uses the cohort averages and ignores any individual level information. In terms of asymptotic theory, the naive inference requires the number of cohorts to tend to infinity with the number of time periods fixed (see, e.g., Arellano (1987); Wooldridge (2010); Hansen (2007a)), or both to tend to infinity (Kezdi (2003); Hansen (2007b)). In pseudo panel models, however, we often have large cohort sizes but fixed numbers of cohorts and time periods. This fact makes the MD framework a natural fit for pseudo panel models.

As mentioned in Verbeek (2008), repeated cross sections have several advantages over panel data sets. Because it is usually easier and less costly to collect random samples than panel data, the sample sizes of repeated cross sections are often much larger than those of common panel data sets. Moreover, repeated cross sections are naturally immune to attrition, which is a common issue for panel data. Therefore, the availability of the NMD approach to pseudo panels potentially opens many new research opportunities in cases where unobserved individual fixed effects are a concern.

The rest of the chapter is organized as follows. In section 2, we lay out the NMD framework. The consistency and asymptotic normality of the NMD estimator, the optimal weighting matrix, and the property that more moment conditions do not hurt are discussed. In section 3, we apply the NMD framework to pseudo panel models with additional instruments. There are also two special subsections in which we discuss the GLS perspective and the naive variance estimators. Section 4 contains a simulation study of the pseudo panel NMD estimators. The last section concludes.

2.2 The NMD framework

Minimum distance is essentially a delta method: it recovers structural estimates from reduced form estimates when there exists a known set of structural equations that links the structural and reduced form parameters. Formally, let Π × Θ be a subset of R^P × R^K, the product space for the reduced form parameter π and the structural parameter θ. Let h : Π × Θ → R^J be a vector-valued function satisfying

h(π_0, θ_0) = 0    (2.1)

for some true parameter value (π_0, θ_0) ∈ Π × Θ. Hereafter, h is referred to as the structural function, and eq. (2.1) is the set of structural equations. Suppose there is an estimator π̂ →p π_0. Then an NMD estimator θ̂ of θ_0 is defined as

θ̂ = argmin_{θ∈Θ} h(π̂, θ)′ Ŵ h(π̂, θ),    (2.2)

where Ŵ is a J-dimensional positive semi-definite matrix and Ŵ →p W.

2.2.1 Consistency

A consistency result for NMD is summarized in the following theorem (similar to Theorem 2.6 in Newey and McFadden (1994)):

Theorem 1. Suppose that π̂ →p π_0, Ŵ →p W, and
(i) (Identification) W is positive semi-definite and Wh(π_0, θ) = 0 only if θ = θ_0;
(ii) (Boundedness) θ_0 ∈ Θ, which is compact;
(iii) (Continuity) h_j(π, θ) is continuous on Π and on Θ, for j = 1, · · · , J;
(iv) (Uniform convergence) sup_{θ∈Θ} |h_j(π̂, θ) − h_j(π_0, θ)| →p 0, for j = 1, · · · , J.
Then θ̂ →p θ_0.

Proof. See Appendix.
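To make the definition concrete, the following toy sketch (ours; the data-generating process and all names are illustrative, not the pseudo panel application developed later) computes an NMD estimate by numerically minimizing the quadratic form in eq. (2.2) for a simple h that is non-separable in π and θ:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n, theta0 = 5_000, 2.0

    # Two noisy reduced forms that both identify theta0 through a ratio.
    x1, x2 = rng.exponential(1.0, n), rng.exponential(2.0, n)
    y1 = theta0 * x1 + rng.normal(0, 1, n)
    y2 = theta0 * x2 + rng.normal(0, 1, n)
    pi_hat = np.array([x1.mean(), y1.mean(), x2.mean(), y2.mean()])

    def h(pi, theta):
        """Structural equations with products of pi and theta (J = 2, K = 1)."""
        return np.array([pi[0] * theta - pi[1], pi[2] * theta - pi[3]])

    def Q(th, W):
        v = h(pi_hat, th[0])
        return v @ W @ v

    fit = minimize(Q, x0=[1.0], args=(np.eye(2),))   # identity weighting
    print(fit.x)                                     # close to theta0 = 2.0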
2.2.2 Asymptotic normality

Theorem 3.2 in Newey and McFadden (1994) requires √n h(π̂, θ_0) →d N(0, Ω), which demands effort to verify when h(π, θ) takes a general functional form. If, in addition, continuous differentiability of h(π, θ) with respect to π is assumed, a Taylor expansion of h(π̂, θ_0) around π_0 can be used to verify that √n h(π̂, θ_0) →d N(0, Ω) holds. This verification step, however, can be saved by establishing the following theorem.

Theorem 2. Suppose that θ̂ satisfies (2.2), θ̂ →p θ_0, Ŵ →p W where W is positive semi-definite, and
(i) π_0 ∈ interior(Π) and θ_0 ∈ interior(Θ);
(ii) h(π, θ) is continuously differentiable with respect to θ in a neighborhood N(θ_0) of θ_0, and h(π, θ) is continuously differentiable with respect to π in a neighborhood N(π_0) of π_0;
(iii) √n(π̂ − π_0) →d N(0, Ω);
(iv) for L(π, θ) ≡ ∇_θ h(π, θ), sup_{θ∈N(θ_0)} ‖L(π̂, θ) − L(π_0, θ)‖ →p 0, and for B(π, θ) ≡ ∇_π h(π, θ), sup_{θ∈N(θ_0)} ‖B(π̂, θ) − B(π_0, θ)‖ →p 0;
(v) L′WL is nonsingular, where L ≡ L(π_0, θ_0).
Let B ≡ B(π_0, θ_0). Then

√n(θ̂ − θ_0) →d N(0, (L′WL)⁻¹ L′WBΩB′WL (L′WL)⁻¹).    (2.3)

Proof. See Appendix.

With the added smoothness assumption with respect to π, the theorem above provides a more constructive and straightforward version of the "high level" theorem in Newey and McFadden (1994). To obtain the asymptotic variance of the MD estimator, all we need is to find the two partial derivatives of h and plug them into eq. (2.3).

2.2.3 Optimal weighting matrix

The asymptotic variance in (2.3) depends on the probability limit W of the weighting matrix Ŵ. When W = M⁻¹ where

M = BΩB′,    (2.4)

the asymptotic variance simplifies to

Avar[√n(θ̂ − θ_0)] = (L′(BΩB′)⁻¹L)⁻¹.    (2.5)

As shown in the following theorem, the inverse of BΩB′ is the optimal weighting matrix since (2.5) is the "smallest" asymptotic variance that can be obtained by optimizing over all possible nonsingular weighting matrices.

Theorem 3. Suppose M = BΩB′ is nonsingular. Then an NMD estimator with Ŵ →p W = M⁻¹ is asymptotically efficient in the class of NMD estimators based on the same set of structural equations.

Proof. See Appendix.

The intuition for using an optimal weighting matrix is straightforward. Asymptotically, it is not about over-identification. Rather, it is because the sample analogs of the conditions in h(π_0, θ_0) = 0, i.e. √n h(π̂, θ_0), are asymptotically random. More accurate conditions exhibit less volatility, and the conditions are potentially correlated. To use all the conditions optimally, more weight should be given to less volatile conditions, and the correlation between conditions should also be accounted for. The best characterization of the relative volatility of all conditions is the asymptotic variance-covariance matrix of the rescaled conditions. It turns out that M is exactly that variance-covariance matrix. Specifically, the first part of condition (ii) in Theorem 2 and a Taylor expansion imply that

√n h(π̂, θ_0) = B · √n(π̂ − π_0) + o_p(1) →d N(0, BΩB′).    (2.6)

The optimal weighting operation is essentially a standardization that assigns larger loadings to less volatile conditions and untangles the correlation between conditions. It standardizes the asymptotic variance-covariance matrix to an identity matrix.
Admittedly, the fact that the inverse of the optimal weighting matrix is the asymptotic variance is a known result which can be found in, e.g., Newey and McFadden (1994), and the idea of standardization by volatility can also be found in the generalized method of moments (GMM) and generalized least squares (GLS) literatures. However, this intuitive explanation is often overlooked when it comes to MD estimation. In the application to pseudo panels, the intuition is even clearer, for M is exactly the variance-covariance matrix of the individual level residuals.

2.2.4 Estimation

Given a consistent estimator π̂ for π_0, the NMD estimator using the identity weighting matrix, i.e.

θ̌ = argmin_{θ∈Θ} h(π̂, θ)′ h(π̂, θ),

can be used as an initial estimator for θ_0. Consistency of θ̌ follows from Theorem 1. By continuity of the partial derivatives, the plug-in estimator B̌ = ∇_π h(π̂, θ̌) →p B. Then, given a consistent estimator Ω̂ for Ω, M̂ ≡ B̌Ω̂B̌′ is a consistent estimator for M, and an asymptotically efficient estimator for θ_0 can be obtained by

θ̂_opt = argmin_{θ∈Θ} h(π̂, θ)′ M̂⁻¹ h(π̂, θ).

The corresponding consistent estimator for the asymptotic variance-covariance matrix is given by

Avar(θ̂_opt) = (L̂′(B̂Ω̂B̂′)⁻¹L̂)⁻¹/n,

where B̂ ≡ ∇_π h(π̂, θ̂_opt) and L̂ ≡ ∇_θ h(π̂, θ̂_opt). The estimator defined above iterates only once. Multiple iterations are also allowed; they are, however, asymptotically equivalent.
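Continuing the same toy example (ours, for illustration only), the two-step procedure looks as follows; for this simple h the derivative matrices B and L are available in closed form:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(2)
    n, theta0 = 5_000, 2.0
    x1, x2 = rng.exponential(1.0, n), rng.exponential(2.0, n)
    data = np.column_stack([x1, theta0 * x1 + rng.normal(0, 1, n),
                            x2, theta0 * x2 + rng.normal(0, 1, n)])
    pi_hat = data.mean(axis=0)
    Omega_hat = np.cov(data, rowvar=False)           # estimates Avar of sqrt(n)(pi_hat - pi0)

    h = lambda pi, th: np.array([pi[0] * th - pi[1], pi[2] * th - pi[3]])
    B = lambda th: np.array([[th, -1.0, 0.0, 0.0],   # gradient of h with respect to pi
                             [0.0, 0.0, th, -1.0]])
    obj = lambda th, W: h(pi_hat, th[0]) @ W @ h(pi_hat, th[0])

    # Step 1: identity weighting gives the initial estimator theta-check.
    th_check = minimize(obj, x0=[1.0], args=(np.eye(2),)).x[0]

    # Step 2: plug in M_hat = B Omega B' and re-estimate with W = M_hat^{-1}.
    M_hat = B(th_check) @ Omega_hat @ B(th_check).T
    th_opt = minimize(obj, x0=[th_check], args=(np.linalg.inv(M_hat),)).x[0]

    # Asymptotic standard error: L = gradient of h with respect to theta.
    L = np.array([[pi_hat[0]], [pi_hat[2]]])
    se = np.sqrt(1.0 / (L.T @ np.linalg.inv(M_hat) @ L)[0, 0] / n)
    print(th_opt, se)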
2.2.5 More conditions do not hurt

Partition the restrictions in (2.1) into two parts:

h(π_0, θ_0) = (h_1(π_0, θ_0)′, h_2(π_0, θ_0)′)′ = 0,

where h_1 is J_1 × 1, h_2 is J_2 × 1, and J_1 + J_2 = J. Let

θ̃ = argmin_{θ∈Θ} h_1(π̂, θ)′ M_{1,1}⁻¹ h_1(π̂, θ)    (2.7)

with M_{1,1} = B_1ΩB_1′ for L_1 ≡ L_1(π_0, θ_0) ≡ ∇_θ h_1(π_0, θ_0) and B_1 ≡ B_1(π_0, θ_0) ≡ ∇_π h_1(π_0, θ_0). Then

√n(θ̃ − θ_0) →d N(0, [L_1′(B_1ΩB_1′)⁻¹L_1]⁻¹).

On the other hand, if all restrictions are used, we have

√n(θ̂_opt − θ_0) →d N(0, [L′(BΩB′)⁻¹L]⁻¹).

The following theorem shows that asymptotically θ̂_opt is at least as efficient as θ̃. The theorem as well as the proof is similar to its GMM counterpart in Breusch et al. (1999).

Theorem 4. Let L_i ≡ L_i(π_0, θ_0) ≡ ∇_θ h_i(π_0, θ_0) and B_i ≡ B_i(π_0, θ_0) ≡ ∇_π h_i(π_0, θ_0) for i = 1, 2. Let M_{i,j} = B_iΩB_j′ for i = 1, 2 and j = 1, 2. Assume BΩB′ and B_1ΩB_1′ are both nonsingular. Let F = M_{2,2} − M_{2,1}M_{1,1}⁻¹M_{1,2}. Then

L′(BΩB′)⁻¹L − L_1′(B_1ΩB_1′)⁻¹L_1 = (M_{2,1}M_{1,1}⁻¹L_1 − L_2)′ F⁻¹ (M_{2,1}M_{1,1}⁻¹L_1 − L_2)

and thus is positive semi-definite.

Proof. See Appendix.

The condition h_2(π_0, θ_0) = 0 is redundant if L_2 = M_{2,1}M_{1,1}⁻¹L_1, i.e.

L_2 = B_2ΩB_1′(B_1ΩB_1′)⁻¹L_1.    (2.8)

We can think of Φ = (B_1ΩB_1′)⁻¹B_1ΩB_2′ as the coefficient matrix from the GLS of B_2 on B_1 with weight Ω. Then h_2(π_0, θ_0) = 0 is redundant if L_2 is a linear transformation of L_1 with the transformation matrix Φ′. Eq. (2.8) is similar to condition (C) of Theorem 1 in Breusch et al. (1999). We can also derive a condition similar to condition (B) in that theorem to obtain a more intuitive explanation of the redundancy condition. Specifically, define

√n r_2(π̂, θ_0) ≡ √n h_2(π̂, θ_0) − M_{2,1}M_{1,1}⁻¹ √n h_1(π̂, θ_0).    (2.9)

By eq. (2.6), M_{1,1} is the asymptotic variance of √n h_1(π̂, θ_0), and M_{2,1} is the asymptotic covariance of √n h_2(π̂, θ_0) and √n h_1(π̂, θ_0). Therefore, asymptotically, M_{2,1}M_{1,1}⁻¹√n h_1(π̂, θ_0) is the linear projection of √n h_2(π̂, θ_0) on √n h_1(π̂, θ_0), and √n r_2(π̂, θ_0) is the residual in this linear projection. It follows that a redundancy condition that is equivalent to but more intuitive than eq. (2.8) is

∇_θ √n r_2(π̂, θ_0) ≡ ∇_θ [√n h_2(π̂, θ_0) − M_{2,1}M_{1,1}⁻¹ √n h_1(π̂, θ_0)] = 0.

That is, the condition for h_2(π̂, θ_0) = 0 to be redundant is that r_2(π̂, θ_0) is marginally uninformative for θ_0.

2.3 Pseudo panels with additional IVs

In the case of pseudo panels with additional IVs, the restrictions defined by (2.1) are not additively separable in π and θ as in the CMD case. Therefore it serves as a good example to illustrate the NMD framework. Moreover, the first-order condition takes the form of the normal equation of a GLS estimation. Therefore, the optimal NMD estimator in this case turns out to be a GLS estimator using the inverse of the optimal weighting matrix as the unconditional variance-covariance matrix.

The adoption of the MD perspective in pseudo panel models provides a new way to deal with errors in variables. In the seminal work of Deaton (1985), this issue is treated as a measurement error problem. By specifying the measurement error structure, Deaton proposes a measurement-error corrected estimator. Collado (1997) follows the measurement error thinking and extends Deaton's method to a more general measurement-error corrected GMM estimator. In the MD framework, group averages are treated as estimates for the reduced form parameters. Since the group sizes are usually large for repeated cross sections, the MD framework is a natural fit for pseudo panel models.

In the following subsections we go through the derivation of the particular contents of h, π, θ, L, B and M in the pseudo panel case, discuss estimation, and summarize the asymptotics.

2.3.1 Population model and structural equations

Formally, for outcome y_it, covariate x_it (1 × K), coefficient β (K × 1), time effect η_t, fixed effect f_i, and idiosyncratic error u_it, consider the following model for a generic individual in the population:

y_it = x_itβ + η_t + f_i + u_it,  t = 1, . . . , T.    (2.10)

f_i and u_it are unobserved. The first entry of x_it is unity for notational convenience. Essentially, we are thinking of the population as a genuine panel data set from which different samples are drawn each period. The same treatment is also adopted in Verbeek and Vella (2005). Let z_it = (1, z_2it, · · · , z_Pit) be a 1 × P row vector of instrumental variables satisfying

E(z_itu_it|f_i) = 0,    (2.11)
Cov(z_pit, f_i|g_i) = 0, p = 1, 2, · · · , P,    (2.12)

where, for convenience, z_1it is also set to 1. In a standard panel, the conditional exogeneity of x_it given f_i is usually assumed:

E(u_it|x_it, f_i) = 0, t = 1, . . . , T.    (2.13)

This condition is not required here. A weaker condition that suffices is

E(u_it|f_i) = 0, t = 1, . . . , T.    (2.14)

Note that by iterated expectations, (2.13) implies (2.14). Because f_i aggregates all time-constant unobservables, we should think of (2.13) and (2.14) as being true not only for the lump sum f_i but also for any time-constant factors. In particular, replacing f_i with the group indicator g_i (i.e. applying iterated expectations) leads to

E(u_it|g_i) = 0, t = 1, . . . , T.    (2.15)

Let E(·|g) be shorthand notation for E(·|g_i = g), and let α_g = E(f_i|g) be the group fixed effect for group g. By (2.12) and the fact that E(z_pit·f_i|g) = Cov(z_pit, f_i|g) + E(z_pit|g)·E(f_i|g) for p = 1, · · · , P, the structural model follows as

E(z_ity_it|g) = E(z_itx_it|g)β + E(z_it|g)η_t + E(z_it|g)α_g,  t = 1, . . . , T; g = 1, · · · , G.    (2.16)

Thanks to z_1it = 1, the first row in eq. (2.16) represents the cohort level equations without instruments, which is the basic case studied in Imbens and Wooldridge (2007).
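As a concrete illustration of the population model, the sketch below (ours; every distributional choice is a hypothetical stand-in, not the chapter's specification) generates data satisfying (2.10) with a cohort-dependent fixed effect and an instrument that satisfies (2.11) and (2.12) by construction:

    import numpy as np

    rng = np.random.default_rng(3)
    G, T, N = 6, 4, 10_000                    # cohorts, periods, individuals per cohort
    beta = np.array([1.0, 1.0])               # (beta_1, coefficient on x2), assumed
    eta = np.arange(T)                        # eta_t = t - 1, so eta_1 = 0

    g = np.repeat(np.arange(1, G + 1), N)     # cohort membership g_i
    alpha = 0.2 * g                           # cohort effect alpha_g = E(f_i | g), assumed
    f = alpha + rng.normal(size=g.size)       # f_i = alpha_g + eps_i^f

    panel = []
    for t in range(1, T + 1):
        z2 = rng.normal(g / 6.0, 1.0)                    # instrument: independent of f
        x2 = 0.5 * z2 + rng.normal(g * t / 6.0, 1.0)     # given g, so (2.12) holds
        u = rng.normal(size=g.size)                      # E(u_it | f_i) = 0, eq. (2.14)
        x = np.column_stack([np.ones(g.size), x2])
        y = x @ beta + eta[t - 1] + f + u                # eq. (2.10)
        panel.append((g, z2, x, y))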
The exogeneity condition (2.15) might appear non-substantial at first glance, because it seems we can always make E(u_it|g_i) = 0 hold by subtracting E(u_it|g_i) from u_it and redefining the deviation as u_it. But this subtraction operation is equivalent to the inclusion of a full set of cohort-time effects in the structural model (2.16). Perhaps the following equivalent representation of (2.15) makes the explanation clearer:

δ_gt = E(u_it|g_i = g) = 0, g = 1, . . . , G, t = 1, . . . , T.    (2.17)

If eq. (2.17) (or equivalently (2.15)) is not imposed, the GT parameters δ_gt for g = 1, . . . , G, t = 1, . . . , T will enter the structural model (2.16) as a full set of cohort-time effects. Including the full set of cohort-time effects is equivalent to not imposing (2.17) (or (2.15)). Therefore, the key assumption disguised by eq. (2.15), together with the specification in (2.10), is that the structural model (2.16) requires only the set of group and time effects (η_t and α_g) but not the full set of cohort-time effects (δ_gt). If any such cohort-time effect is required, then, as pointed out in Imbens and Wooldridge (2007), one way to think about the misspecification is that some δ_gt = E(u_it|g_i = g) is not zero. Note that the structural model with the full set of cohort-time effects is always correctly specified, but it is not interesting because the variation in the covariate cohort means is absorbed by δ_gt. As a result, such a model is only identified up to the GT cohort-time effects.

A technical point here is that, due to the setup x_1it = z_1it = 1, there are only (G − 1) parameters in α_g and (T − 1) in η_t to estimate. Imbens and Wooldridge (2007) make the normalization Σ_{g=1}^G α_g = 0 and η_1 = 0. This chapter however proceeds with α_1 = 0 and η_1 = 0. The purpose of this slightly different normalization is to cope with the estimation convention that the dummies for the first cohort and the first time period are always dropped. As a result of the dropout, the sum (β_1 + α_1 + η_1) is identified, but β_1, α_1 and η_1 are not separately identifiable. The remaining estimated group and time effects are the relative effects (α_g − α_1) for g = 2, · · · , G and (η_t − η_1) for t = 2, · · · , T. Setting α_1 = η_1 = 0 then conveniently simplifies (β_1 + α_1 + η_1), (α_g − α_1) and (η_t − η_1) to β_1, α_g and η_t.

2.3.2 Useful notations

Some notations are useful later. Let µ^x_{gt} = E(x_it|g) denote the population mean of a generic variable x_it conditional on g_i = g. For a vector (e.g. x_it) or a matrix (e.g. z_it′x_it) variable, bold symbols like µ^x_{gt} or µ^{zx}_{gt} will be used. In this notation, eq. (2.16) can be written as

0 = −µ^{zy}_{gt} + µ^{zx}_{gt}β + µ^z_{gt}(η_t + α_g),  t = 1, . . . , T; g = 1, · · · , G.    (2.18)

Also, for a generic variable x_it and j = (g − 1)T + t, let µ^x denote the column "vector" with µ^x_{gt} as the j-th row block. Depending on the dimension of x_it, µ^x can be either a column vector or a matrix. Let v_it = (y_it, x_it) and s_it = z_it ⊗ v_it, with ⊗ denoting the Kronecker product; s_it is a long row vector. Assume the variance-covariance matrix of s_it exists and is denoted by

Ω^s_{gt} = Var(s_it|g).    (2.19)

An explicit formula for Ω^s_{gt} is given in the Appendix. Then the reduced form parameter π in this pseudo panel case can be expressed as π = µ^s = (µ^s_{11}′, µ^s_{12}′, · · · , µ^s_{GT}′)′. The structural parameter is θ = (β′, η′, α′)′ with α = (α_1, . . . , α_G)′ and η = (η_1, . . . , η_T)′.
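In code, the reduced form estimates are nothing more than cell-by-cell sample means of s_it; a minimal sketch (ours, with a hypothetical one-period cross section and P = 2 instruments) builds one period's block of π̂:

    import numpy as np

    def cell_means(g, z, x, y, G):
        """Stack cohort means of s_it = z_it (Kronecker) (y_it, x_it) for one period."""
        v = np.column_stack([y, x])                             # v_it = (y_it, x_it)
        s = np.einsum('ip,ik->ipk', z, v).reshape(len(y), -1)   # row-wise Kronecker product
        return [s[g == gg].mean(axis=0) for gg in range(1, G + 1)]

    rng = np.random.default_rng(4)
    n, G = 1_200, 6
    g = rng.integers(1, G + 1, n)
    x = np.column_stack([np.ones(n), rng.normal(g / 6.0, 1.0)])
    z = np.column_stack([np.ones(n), rng.normal(g / 3.0, 1.0)])
    y = x @ np.array([1.0, 1.0]) + 0.2 * g + rng.normal(size=n)

    pi_hat_t = np.concatenate(cell_means(g, z, x, y, G))        # this period's block of pi-hat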
Moreover, the right hand side of eq. (2.18) says that the j-th row block of the h function is

h_j(π, θ) = −µ^{zy}_{gt} + µ^{zx}_{gt}β + µ^z_{gt}(η_t + α_g)    (2.20)

with j = (g − 1)T + t. Note that each h_j(π, θ) is P × 1. Let x̃_it = (x_it, d, c) with d the vector of time dummies and c the vector of group dummies. Then a second useful expression for h_j is

h_j(π, θ) = −µ^{zy}_{gt} + µ^{zx̃}_{gt}θ.    (2.21)

Later we will see that the two expressions (2.20) and (2.21) are convenient for calculating the partial derivatives of h.

2.3.3 The partial derivatives L and B and the inverse optimal weighting matrix M

By eq. (2.21), it is trivial that

L = ∇_θ h(π, θ) = µ^{zx̃},    (2.22)

where, as defined in the last section, µ^{zx̃} is the matrix with µ^{zx̃}_{gt} as the j-th row block for j = (g − 1)T + t. On the other hand, recall eq. (2.20), and note that z_1it = 1 implies µ^{z_1}_{gt} = 1. Define β̃_{gt} by replacing the first entry of β, i.e. β_1, with (β_1 + η_t + α_g), and let b_{gt}(θ) = I_P ⊗ (−1, β̃_{gt}) with I_P the P-dimensional identity matrix. Some algebra (see Appendix) then shows that

∇_{π_{g̃t̃}} h_j(π, θ) = b_{gt}(θ) if g̃ = g and t̃ = t, and 0 otherwise.

Define the block diagonal matrix b(θ) by putting b_{gt}(θ) on its gt-th diagonal block. Then

B = ∇_π h(π, θ) = b(θ).    (2.23)

With the general formula in eq. (2.4) and the particular contents in eq. (2.19) and (2.23), the inverse of the optimal weighting matrix, M, is given by

M = b(θ)Ω^s b(θ)′.    (2.24)

In the Appendix, we show that an expansion of the right hand side of eq. (2.24) leads to

M = diag_GT[(ρ_1κ_1)⁻¹b_{11}(θ)Ω^s_{11}b_{11}(θ)′, (ρ_1κ_2)⁻¹b_{12}(θ)Ω^s_{12}b_{12}(θ)′, · · · , (ρ_Gκ_T)⁻¹b_{GT}(θ)Ω^s_{GT}b_{GT}(θ)′].    (2.25)

That is, M is a block diagonal matrix with (ρ_gκ_t)⁻¹b_{gt}(θ)Ω^s_{gt}b_{gt}(θ)′ on the gt-th diagonal block. We also show in the Appendix that b_{gt}(θ)Ω^s_{gt}b_{gt}(θ)′ is actually the variance-covariance matrix of the composite errors within cell (g, t):

b_{gt}(θ)Ω^s_{gt}b_{gt}(θ)′ = Ξ_{gt} ≡ Var[z_ity_it − z_itx_itβ − z_it(η_t + α_g)|g].    (2.26)

Therefore another useful expression for M is

M = diag_GT[(ρ_1κ_1)⁻¹Ξ_{11}, (ρ_1κ_2)⁻¹Ξ_{12}, · · · , (ρ_Gκ_T)⁻¹Ξ_{GT}].    (2.27)

2.3.4 Estimation

Assume we have T repeated cross-sectional random samples denoted by {(y_it, x_it, z_it, g_it), i = 1, · · · , n_t; t = 1, · · · , T}, where n_t is the number of observations for cross section t. Note that in each time period we have a new random sample, so in general the same index i refers to different individuals in different time periods, and thus g_it carries a subscript t.

2.3.4.1 Asymptotics of π̂

Let r_it = (r_it,1, r_it,2, . . . , r_it,G) be a vector of group indicators with r_it,g = 1{g_it = g}, where 1{·} is the indicator function equal to one if the event in {·} is true. In this way the group membership of the random draw i at time t is properly treated as a random variable. It follows that the number of observations in cell (g, t) is also a random variable, given by n_gt = Σ_{i=1}^{n_t} r_it,g. Let µ̂^x_{gt} denote the sample average within cell (g, t) for a generic variable x_it. Let ρ_g = P(r_it,g = 1) be the fraction of the population in cohort g and assume ρ̂_gt = n_gt/n_t →p ρ_g. Let κ_t = lim_{n→∞} n_t/n be the fraction of all observations accounted for by cross section t. By (essentially) the central limit theorem, for g = 1, · · · , G and t = 1, · · · , T,

√n(µ̂^s_{gt} − µ^s_{gt}) →d Normal(0, (ρ_gκ_t)⁻¹Ω^s_{gt}).

Furthermore, let π̂ = (µ̂^s_{11}′, µ̂^s_{12}′, · · · , µ̂^s_{GT}′)′ and π_0 = (µ^s_{11}′, µ^s_{12}′, · · · , µ^s_{GT}′)′.
Then the results above can be stacked in

√n(π̂ − π_0) →d N(0, Ω^s),

where Ω^s is a block diagonal matrix with (ρ_gκ_t)⁻¹Ω^s_{gt} on the gt-th diagonal block.

2.3.4.2 Estimation of L

Eq. (2.22) suggests that a straightforward estimator for L is

L̂ = µ̂^{zx̃}.    (2.28)

µ̂^{zx̃} is the sample analog of µ^{zx̃}. Recall that x̃_it = (x_it, d, c) and that x_it contains a constant term. Then µ̂^{zx̃} is the matrix of the sample cohort means of the explanatory variables, the instruments, and their interactions. Its dimension is GPT × (K + G + T − 2).

2.3.4.3 The general estimator θ̂ and the FE estimator θ̌

There exists an analytical solution to eq. (2.2) in the current setup, which turns out to be a GLS estimator. Specifically, given eq. (2.21) and (2.28), the first-order condition to eq. (2.2) in the current setup can be written as³

(µ̂^{zx̃})′Ŵ(µ̂^{zx̃}θ − µ̂^{zy}) = 0.

Assume (µ̂^{zx̃})′Ŵµ̂^{zx̃} is nonsingular. Then the general pseudo panel NMD estimator with a weighting matrix Ŵ is given by

θ̂ = [(µ̂^{zx̃})′Ŵµ̂^{zx̃}]⁻¹(µ̂^{zx̃})′Ŵµ̂^{zy}.    (2.29)

Clearly, (2.29) is of the form of a GLS estimator where Ŵ serves as the inverse of the "unconditional variance-covariance matrix of the error term", and the cohort means µ̂^{zx̃} and µ̂^{zy} are the matrix of right-hand-side variables and the left-hand-side variable, respectively. In particular, replacing Ŵ with the identity matrix gives the fixed effects estimator

θ̌ = [(µ̂^{zx̃})′µ̂^{zx̃}]⁻¹(µ̂^{zx̃})′µ̂^{zy}.    (2.30)

The standard case without instruments in Imbens and Wooldridge (2007) corresponds to the case P = 1, i.e. deleting the letter z in eq. (2.30).

2.3.4.4 Estimation of B, M and θ̂_opt

With θ̌ as an initial estimator, an estimator for B follows from eq. (2.23) by substituting θ with θ̌, which leads to

B̂ = b(θ̌).

An obvious estimator for the variance-covariance matrix of s defined in (2.19) is

Ω̂^s_{gt} = n_gt⁻¹ Σ_{i=1}^{n_t} r_it,g (s_it − µ̂^s_{gt})′(s_it − µ̂^s_{gt}).

Then an estimator for Ω^s, Ω̂^s, can be defined as the block diagonal matrix with the gt-th diagonal block (n_gt/n)⁻¹Ω̂^s_{gt}, i.e.,

Ω̂^s = diag_GT[(n_11/n)⁻¹Ω̂^s_{11}, (n_12/n)⁻¹Ω̂^s_{12}, · · · , (n_GT/n)⁻¹Ω̂^s_{GT}].

Given B̂ and Ω̂^s, the following estimator for the inverse of the optimal weighting matrix follows from eq. (2.24):

M̂ = b(θ̌)Ω̂^s b(θ̌)′.    (2.31)

Eq. (2.31), however, may involve big matrices in calculation when the number of covariates and/or instruments is large (the dimension of s increases quickly with multiple instruments). Fortunately, eq. (2.27) provides an alternative but numerically equivalent way to estimate M: all we need is an estimator for Ξ_{gt}. By eq. (2.26), Ξ_{gt} can be conveniently estimated by Ξ̂_{gt}, the sample variance-covariance matrix of the residuals in cell (g, t), defined as follows. First, use the fixed effects estimator θ̌ to obtain the individual residual

ǔ_it = y_it − x_itβ̌ − (η̌_t + α̌_g).    (2.32)

The cohort residual is then defined as

µ̂^{z_pǔ}_{gt} = n_gt⁻¹ Σ_{i=1}^{n_t} r_it,g z_pit ǔ_it.    (2.33)

For p, q = 1, · · · , P, let τ̂_pq (dropping the subscripts g, t from τ for simplicity) denote the entry in row p, column q of Ξ̂_{gt}. Then τ̂_pq is given by

τ̂_pq = n_gt⁻¹ Σ_{i=1}^{n_t} r_it,g (z_pit ǔ_it − µ̂^{z_pǔ}_{gt})(z_qit ǔ_it − µ̂^{z_qǔ}_{gt}).

Finally, Ξ̂_{gt} is defined as the matrix with pq-th entry τ̂_pq:

Ξ̂_{gt} = (τ̂_pq).    (2.34)

³ Note that (µ̂^{zx̃})′ is the transpose of µ̂^{zx̃} and is not the same as µ̂^{x̃z}.
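The residual-based computation of Ξ̂_{gt} is mechanical; a minimal sketch (ours, with hypothetical arrays z_cell and u_cell holding one cell's instruments and FE residuals) is:

    import numpy as np

    def Xi_hat(z_cell, u_cell):
        """Within-cell variance-covariance matrix of z_pit * u-check_it, eq. (2.34)."""
        zu = z_cell * u_cell[:, None]      # column p holds z_pit * u-check_it
        zu = zu - zu.mean(axis=0)          # demean by the cohort residual, eq. (2.33)
        return zu.T @ zu / len(u_cell)

    # Hypothetical cell with n_gt = 200 observations and P = 2 instruments.
    rng = np.random.default_rng(5)
    n_gt, n = 200, 4_800
    z_cell = np.column_stack([np.ones(n_gt), rng.normal(size=n_gt)])
    u_cell = rng.normal(size=n_gt)

    M_block = Xi_hat(z_cell, u_cell) / (n_gt / n)   # the cell's diagonal block of M-hat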
Given Ξ̂_{gt}, the second method to estimate M is via

M̂ = diag_GT[(n_11/n)⁻¹Ξ̂_{11}, (n_12/n)⁻¹Ξ̂_{12}, · · · , (n_GT/n)⁻¹Ξ̂_{GT}],    (2.35)

which is the block diagonal matrix with the gt-th diagonal block (n_gt/n)⁻¹Ξ̂_{gt}. The numerical equivalence of the two estimators for M is summarized in the following theorem.

Theorem 5. The two ways of computing M̂ defined in eq. (2.31) and (2.35) are numerically equivalent.

Proof. See Appendix.

When p = q = 1, τ̂_11 = n_gt⁻¹ Σ_{i=1}^{n_t} r_it,g (ǔ_it − µ̂^ǔ_{gt})², which is of the same form as the τ̂² defined in Imbens and Wooldridge (2007),⁴ and M̂ becomes a diagonal matrix that coincides with the matrix Ĉ in Imbens and Wooldridge (2007).

With M̂ in hand, replacing Ŵ with M̂⁻¹ in eq. (2.29) yields the optimal pseudo panel NMD estimator

θ̂_opt = [(µ̂^{zx̃})′M̂⁻¹µ̂^{zx̃}]⁻¹(µ̂^{zx̃})′M̂⁻¹µ̂^{zy}.    (2.36)

The above formula in its appearance is similar to a GLS estimator on the cohort level data. But the weighting matrix is not the usual one used by a feasible GLS, because M̂ is computed from individual level data. For more detail about the connection, and the difference, between θ̂_opt and GLS, see the next section.

2.3.4.5 Estimation of the asymptotic variances of θ̂, θ̌ and θ̂_opt

With all the pieces worked out, and by Theorem 2, the asymptotic variance estimator for θ̂ is

Avar(θ̂) = [(µ̂^{zx̃})′Ŵµ̂^{zx̃}]⁻¹(µ̂^{zx̃})′ŴM̂Ŵµ̂^{zx̃}[(µ̂^{zx̃})′Ŵµ̂^{zx̃}]⁻¹/n.

For θ̌, it is

Avar(θ̌) = [(µ̂^{zx̃})′µ̂^{zx̃}]⁻¹(µ̂^{zx̃})′M̂µ̂^{zx̃}[(µ̂^{zx̃})′µ̂^{zx̃}]⁻¹/n.    (2.37)

Finally, for θ̂_opt, it is

Avar(θ̂_opt) = [(µ̂^{zx̃})′M̂⁻¹µ̂^{zx̃}]⁻¹/n.

With the presence of additional IVs, dependence between restrictions is introduced since each cohort repeats itself several times in the restrictions. The optimal weighting matrix is more likely to be non-diagonal (it is block diagonal with blocks (n_gt/n)⁻¹Ξ̂_{gt}). In fact, some more algebra (see Appendix) shows that another expression for Ξ_{gt} is

Ξ_{gt} = E[(ε^f_i + u_it)² z_it′z_it|g],    (2.38)

where

ε^f_i ≡ f_i − α_g    (2.39)

is the deviation of the individual effect from its cohort mean. Without further assumptions regarding the correlation between the quadratic terms (ε^f_i + u_it)² and z_it′z_it, and the correlation among the IVs in z_it, Ξ_{gt} is generally non-diagonal. As a result, the use of the optimal weighting matrix becomes more important in the presence of additional IVs.

2.3.5 The GLS perspective

To better understand θ̂_opt and its relation to GLS, define the individual composite error as

e_it ≡ y_it − x_itβ − (η_t + α_g) = ε^f_i + u_it.

The residual ǔ_it given in (2.32) is obviously a consistent estimator for e_it. With e_it, the vector of individual composite errors in cohort g is z_ite_it, and an alternative expression for Ξ_{gt} in (2.26) is

Ξ_{gt} = Var[z_ite_it|g].    (2.40)

For a given g, define the cohort composite error as

µ̂^{z_pe}_{gt} = n_gt⁻¹ Σ_{i=1}^{n_t} r_it,g z_pit e_it.    (2.41)

µ̂^{ze}_{gt} is similarly defined and represents the vector of cohort composite errors in cell (g, t). The variance-covariance matrix of µ̂^{ze}_{gt} conditional on g is given by

Var[µ̂^{ze}_{gt}|g] = n_gt⁻¹Ξ_{gt}.    (2.42)

From the MD perspective, n_gt is large, and n_gt⁻¹Ξ_{gt} → 0 as n_gt → ∞. It thus does not make sense to model and estimate the "cohort composite errors" because they degenerate to 0 asymptotically.

⁴ The formula in Imbens and Wooldridge (2007) needs the correction of demeaning, because in Stata the command for calculating the sample variance automatically demeans the residuals.
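In code, eqs. (2.29), (2.30) and (2.36) are one weighted least squares formula applied to the stacked cohort means; a sketch (ours; mu_zx, mu_zy and the block diagonal M_hat are assumed to have been built from cell-level pieces as above) is:

    import numpy as np

    def pseudo_panel_nmd(mu_zx, mu_zy, W):
        """General pseudo panel NMD estimator, eq. (2.29): GLS on cohort means."""
        A = mu_zx.T @ W @ mu_zx
        return np.linalg.solve(A, mu_zx.T @ W @ mu_zy), A

    def fe_and_optimal(mu_zx, mu_zy, M_hat, n):
        J = mu_zx.shape[0]                                       # J = G*T*P restrictions
        theta_fe, _ = pseudo_panel_nmd(mu_zx, mu_zy, np.eye(J))  # eq. (2.30)
        theta_opt, A = pseudo_panel_nmd(mu_zx, mu_zy,
                                        np.linalg.inv(M_hat))    # eq. (2.36)
        avar_opt = np.linalg.inv(A) / n                          # Avar(theta-hat-opt)
        return theta_fe, theta_opt, avar_opt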
The usual feasible GLS on the pseudo panel of cohort means ignores individual level data and relies on much more stringent assumptions on the unconditional variance-covariance structure of the cohort composite errors. In particular, the underlying asymptotics rely on large G. The GLS estimator in eq. (2.36) is apparently not the usual feasible GLS. Rather, it is a GLS imposing the following block diagonal variance-covariance structure on all the cohort composite errors:

diag_GT[n_11⁻¹Ξ_{11}, n_12⁻¹Ξ_{12}, · · · , n_GT⁻¹Ξ_{GT}].    (2.43)

Eq. (2.43) contains GTP(P + 1)/2 parameters, and thus is never feasible if only the GTP cohort means are observed. But if the individual level data are available, eq. (2.43) can be well estimated by

diag_GT[n_11⁻¹Ξ̂_{11}, n_12⁻¹Ξ̂_{12}, · · · , n_GT⁻¹Ξ̂_{GT}] = n⁻¹M̂.    (2.44)

Using n⁻¹M̂ in a GLS formula leads to eq. (2.36); the n⁻¹ cancels off. From the GLS perspective, the weighting by (n⁻¹M̂)⁻¹ standardizes the sample cohort composite errors so that they become close to uncorrelated and homoskedastic.

It is worth noting that eq. (2.43) is not the unconditional variance-covariance matrix of µ̂^{ze}. Each diagonal block is eq. (2.42), the conditional variance-covariance matrix of µ̂^{ze}_{gt} given g.

From the MD perspective, there is no asymptotic variance-covariance matrix for the sample cohort composite errors, because Var[µ̂^{ze}_{gt}] → 0 as n_gt → ∞; so does n⁻¹M̂. Rather, what matters is the following set of relocated and rescaled estimated structural equations:

0 = −√n(µ̂^{zy}_{gt} − µ^{zy}_{gt}) + √n(µ̂^{zx}_{gt} − µ^{zx}_{gt})β + √n(µ̂^z_{gt} − µ^z_{gt})(η_t + α_g)    (2.45)
  = −√n µ̂^{ze}_{gt},    (2.46)

for t = 1, . . . , T; g = 1, · · · , G. The above equation is obtained by manipulating eq. (2.18). Asymptotically, eq. (2.46) converges to GTP random restrictions. The asymptotic variance of the stacked √n µ̂^{ze}_{gt} is exactly M. Therefore, to use all the random conditions efficiently, the random restrictions need to be weighted by the square root of M⁻¹. In estimation, M⁻¹ is replaced by M̂⁻¹, but their function is asymptotically equivalent. Given fixed G, T and P, and n_gt → ∞, the use of M̂⁻¹ is fully legitimate since M̂⁻¹ does not converge to 0. The weight M̂⁻¹ adjusts the relative importance of each sample restriction according to its level of accuracy. The level of accuracy of the gt-th sample restriction is measured by (n_gt/n)⁻¹Ξ̂_{gt}.

2.3.6 Naive variance estimators for θ̌

Because θ̌ is the fixed effects estimator on the pseudo panel of the sample cohort means, it is also convenient to compute the usual asymptotic variance estimators for a fixed effects estimator. These naive estimators, however, are generally incorrect because they only make use of the sample cohort means. Before listing the formulae for the naive variance estimators, we summarize several features of repeated random cross sections that may ruin their validity. We cite the reasons that apply to the breakdown of each estimator in the later discussion.

1. µ̂^{z_pǔ}_{gt} and µ̂^{z_qǔ}_{gt} are generally correlated (dependence over p for fixed g and t);
2. the variance of µ̂^{z_pǔ}_{gt}, as well as the covariance of µ̂^{z_pǔ}_{gt} and µ̂^{z_qǔ}_{gt}, depends on z_it (heteroskedasticity);
3. µ̂^{z_pǔ}_{gt} depends on g because either z_it or even u_it itself depends on g (non-identical distribution over g).

Among the three items, the last one is the most crucial because all the naive variance estimators discussed below rely on large G. We consider three naive asymptotic variance estimators.
Their formulae in a standard model can be found in Wooldridge (2010) as well as other textbooks.

The first is the non-robust variance estimator, whose consistency relies on a scalar (proportional to an identity matrix) variance-covariance structure of the cohort composite errors. To obtain this formula, recall the definition of $\hat{\mu}^{z_p\check{u}}_{gt}$ in eq. (2.33). Define the mean squared error for the pseudo panel as

$$\check{\sigma}^2 = (GTP - K - G - T + 2)^{-1}\sum_{g,t,p}\big(\hat{\mu}^{z_p\check{u}}_{gt}\big)^2.$$

Then the naive non-robust variance estimator can be written as

$$\widehat{\mathrm{Avar}}_{\mathtt{n}}(\check{\theta}) = \check{\sigma}^2\Big[\sum_{g,t,p}(\hat{\mu}^{z_p x}_{gt})'\hat{\mu}^{z_p x}_{gt}\Big]^{-1}.$$

The subscript $\mathtt{n}$ in typewriter font stands for "non-robust". Its validity hinges on i.i.d. sampling over $(g,t,p)$ and homoskedasticity of $\hat{\mu}^{z_p\check{u}}_{gt}$, neither of which holds in a pseudo panel of sample cohort means, due to all three reasons listed.

The second is the naive heteroskedasticity-robust variance estimator, whose formula is

$$\widehat{\mathrm{Avar}}_{\mathtt{r}}(\check{\theta}) = \Big[\sum_{g,t,p}(\hat{\mu}^{z_p x}_{gt})'\hat{\mu}^{z_p x}_{gt}\Big]^{-1}\Big[\sum_{g,t,p}\big(\hat{\mu}^{z_p\check{u}}_{gt}\big)^2(\hat{\mu}^{z_p x}_{gt})'\hat{\mu}^{z_p x}_{gt}\Big]\Big[\sum_{g,t,p}(\hat{\mu}^{z_p x}_{gt})'\hat{\mu}^{z_p x}_{gt}\Big]^{-1}.$$

The subscript $\mathtt{r}$ stands for "robust". The estimator is robust to heteroskedasticity in the cohort composite error $\hat{\mu}^{z_p e}_{gt}$, but its validity still relies on i.i.d. sampling over $(g,t,p)$, which does not hold due to reasons 1 and 3 mentioned above.

The third is the naive cluster-robust variance estimator, and its formula is

$$\widehat{\mathrm{Avar}}_{\mathtt{c}}(\check{\theta}) = \Big[\sum_{g,t,p}(\hat{\mu}^{z_p x}_{gt})'\hat{\mu}^{z_p x}_{gt}\Big]^{-1}\Big[\sum_{g,t,r,p,q}\hat{\mu}^{z_p\check{u}}_{gt}\hat{\mu}^{z_q\check{u}}_{gr}(\hat{\mu}^{z_p x}_{gt})'\hat{\mu}^{z_q x}_{gr}\Big]\Big[\sum_{g,t,p}(\hat{\mu}^{z_p x}_{gt})'\hat{\mu}^{z_p x}_{gt}\Big]^{-1}.$$

The middle term can also be written as

$$\sum_{g,t,r,p,q}\hat{\mu}^{z_p\check{u}}_{gt}\hat{\mu}^{z_q\check{u}}_{gr}(\hat{\mu}^{z_p x}_{gt})'\hat{\mu}^{z_q x}_{gr} = \sum_{g,t,p}\big(\hat{\mu}^{z_p\check{u}}_{gt}\big)^2(\hat{\mu}^{z_p x}_{gt})'\hat{\mu}^{z_p x}_{gt} + \sum_{g;\,(t,p)\neq(r,q)}\hat{\mu}^{z_p\check{u}}_{gt}\hat{\mu}^{z_q\check{u}}_{gr}(\hat{\mu}^{z_p x}_{gt})'\hat{\mu}^{z_q x}_{gr},$$

where the first sum is exactly the middle term in the naive heteroskedasticity-robust variance estimator. The naive cluster-robust variance estimator is robust to arbitrary heteroskedasticity and serial correlation in the cohort composite errors $\hat{\mu}^{ze}_{g}$. But its validity relies on i.i.d. sampling over $g$, which may not hold due to reason 3 listed above.

Some other equivalent representations of the three naive variance estimators are informative of their link to $\widehat{\mathrm{Avar}}(\check{\theta})$. Write the three naive estimators as

$$\widehat{\mathrm{Avar}}_{\mathtt{n}}(\check{\theta}) = \big[(\hat{\mu}^{zx})'\hat{\mu}^{zx}\big]^{-1}(\hat{\mu}^{zx})'(\check{\sigma}^2\mathbf{I})\hat{\mu}^{zx}\big[(\hat{\mu}^{zx})'\hat{\mu}^{zx}\big]^{-1},$$
$$\widehat{\mathrm{Avar}}_{\mathtt{r}}(\check{\theta}) = \big[(\hat{\mu}^{zx})'\hat{\mu}^{zx}\big]^{-1}(\hat{\mu}^{zx})'\mathrm{diag}(\hat{\mu}^{z\check{u}})^2\hat{\mu}^{zx}\big[(\hat{\mu}^{zx})'\hat{\mu}^{zx}\big]^{-1},$$
$$\widehat{\mathrm{Avar}}_{\mathtt{c}}(\check{\theta}) = \big[(\hat{\mu}^{zx})'\hat{\mu}^{zx}\big]^{-1}(\hat{\mu}^{zx})'\mathrm{diag}_G(\hat{\mu}^{z\check{u}}_g)\,\mathrm{diag}_G(\hat{\mu}^{z\check{u}}_g)'\hat{\mu}^{zx}\big[(\hat{\mu}^{zx})'\hat{\mu}^{zx}\big]^{-1},$$

where $\mathrm{diag}(\hat{\mu}^{z\check{u}})$ is the square, diagonal matrix created by putting the vector $\hat{\mu}^{z\check{u}}$ on the principal diagonal, and $\mathrm{diag}_G(\hat{\mu}^{z\check{u}}_g)$ is the block diagonal matrix with $g$-th diagonal block $\hat{\mu}^{z\check{u}}_g$ for $g = 1, \cdots, G$. Then clearly, the three naive variance estimators can be obtained by replacing $\hat{\mathbf{M}}/n$ in eq. (2.37) with $\check{\sigma}^2\mathbf{I}$, $\mathrm{diag}(\hat{\mu}^{z\check{u}})^2$, or $\mathrm{diag}_G(\hat{\mu}^{z\check{u}}_g)\,\mathrm{diag}_G(\hat{\mu}^{z\check{u}}_g)'$, respectively.

Yet another set of equivalent representations provides some insight on the large $G$ perspective of the naive estimators. Specifically, write

$$\widehat{\mathrm{Avar}}_{\mathtt{n}}(\check{\theta}) = \Big[\sum_g(\hat{\mu}^{zx}_g)'\hat{\mu}^{zx}_g\Big]^{-1}\Big[\sum_g(\hat{\mu}^{zx}_g)'(\check{\sigma}^2\mathbf{I}_{TP})\hat{\mu}^{zx}_g\Big]\Big[\sum_g(\hat{\mu}^{zx}_g)'\hat{\mu}^{zx}_g\Big]^{-1},$$
$$\widehat{\mathrm{Avar}}_{\mathtt{r}}(\check{\theta}) = \Big[\sum_g(\hat{\mu}^{zx}_g)'\hat{\mu}^{zx}_g\Big]^{-1}\Big[\sum_g(\hat{\mu}^{zx}_g)'\mathrm{diag}(\hat{\mu}^{z\check{u}}_g)^2\hat{\mu}^{zx}_g\Big]\Big[\sum_g(\hat{\mu}^{zx}_g)'\hat{\mu}^{zx}_g\Big]^{-1},$$
$$\widehat{\mathrm{Avar}}_{\mathtt{c}}(\check{\theta}) = \Big[\sum_g(\hat{\mu}^{zx}_g)'\hat{\mu}^{zx}_g\Big]^{-1}\Big[\sum_g(\hat{\mu}^{zx}_g)'\hat{\mu}^{z\check{u}}_g(\hat{\mu}^{z\check{u}}_g)'\hat{\mu}^{zx}_g\Big]\Big[\sum_g(\hat{\mu}^{zx}_g)'\hat{\mu}^{zx}_g\Big]^{-1}, \qquad (2.47)$$

where $\mathrm{diag}(\hat{\mu}^{z\check{u}}_g)$ is the square, diagonal matrix created by putting the vector $\hat{\mu}^{z\check{u}}_g$ on the principal diagonal. In essence, the three naive variance estimators differ in how they estimate (treating $g$ as random)

$$E\big[(\hat{\mu}^{zx}_g)'\hat{\mu}^{ze}_g(\hat{\mu}^{ze}_g)'\hat{\mu}^{zx}_g\big],$$

i.e., the middle term of the sandwich form. But they all need i.i.d. sampling over $g$, which is not satisfied in the MD framework due to reason 3 listed above. The estimation errors in the cohort means are also ignored.

The last point we want to make concerns the relationship between $\widehat{\mathrm{Avar}}_{\mathtt{c}}(\check{\theta})$ and $\widehat{\mathrm{Avar}}(\check{\theta})$. First, rewrite

$$\widehat{\mathrm{Avar}}(\check{\theta}) = \Big[\sum_g(\hat{\mu}^{zx}_g)'\hat{\mu}^{zx}_g\Big]^{-1}\Big[\sum_g(\hat{\mu}^{zx}_g)'(n^{-1}\hat{\mathbf{M}}_g)\hat{\mu}^{zx}_g\Big]\Big[\sum_g(\hat{\mu}^{zx}_g)'\hat{\mu}^{zx}_g\Big]^{-1}. \qquad (2.48)$$

It is then clear that $\hat{\mu}^{z\check{u}}_g(\hat{\mu}^{z\check{u}}_g)'$ and $n^{-1}\hat{\mathbf{M}}_g$ are the only difference between $\widehat{\mathrm{Avar}}_{\mathtt{c}}(\check{\theta})$ and $\widehat{\mathrm{Avar}}(\check{\theta})$. Notice that $\hat{\mu}^{z\check{u}}_g(\hat{\mu}^{z\check{u}}_g)' \approx \mathrm{diag}\big(\hat{\mu}^{z\check{u}}_{g1}(\hat{\mu}^{z\check{u}}_{g1})', \cdots, \hat{\mu}^{z\check{u}}_{gT}(\hat{\mu}^{z\check{u}}_{gT})'\big)$ and that $n^{-1}\hat{\mathbf{M}}_g = \mathrm{diag}\big(n_{g1}^{-1}\hat{\Xi}_{g1}, \cdots, n_{gT}^{-1}\hat{\Xi}_{gT}\big)$. Moreover, notice that $\hat{\mu}^{z\check{u}}_{gt}(\hat{\mu}^{z\check{u}}_{gt})' = \big(\hat{\mu}^{z_p\check{u}}_{gt}\hat{\mu}^{z_q\check{u}}_{gt}\big)_{p,q}$ and that $n_{gt}^{-1}\hat{\Xi}_{gt} = n_{gt}^{-1}(\hat{\tau}_{pq})_{p,q}$. Therefore, the comparison boils down to the difference between

$$\hat{\mu}^{z_p\check{u}}_{gt}\hat{\mu}^{z_q\check{u}}_{gt} = \Big(n_{gt}^{-1}\sum_{i=1}^{n_t} r_{it,g}\, z_{pit}\check{u}_{it}\Big)\Big(n_{gt}^{-1}\sum_{i=1}^{n_t} r_{it,g}\, z_{qit}\check{u}_{it}\Big)^{[5]}$$

and

$$n_{gt}^{-1}\hat{\tau}_{pq} = n_{gt}^{-2}\sum_{i=1}^{n_t} r_{it,g}\big(z_{pit}\check{u}_{it} - \hat{\mu}^{z_p\check{u}}_{gt}\big)\big(z_{qit}\check{u}_{it} - \hat{\mu}^{z_q\check{u}}_{gt}\big).$$

That is, $\widehat{\mathrm{Avar}}_{\mathtt{c}}(\check{\theta})$ uses $\hat{\mu}^{z_p\check{u}}_{gt}\hat{\mu}^{z_q\check{u}}_{gt}$ to approximate the covariance between the cohort composite errors $\hat{\mu}^{z_p e}_{gt}$ and $\hat{\mu}^{z_q e}_{gt}$; this uses only cohort-level information and is not an estimator of $\mathrm{Cov}(\hat{\mu}^{z_p e}_{gt}, \hat{\mu}^{z_q e}_{gt} \mid g)$, because $\hat{\mu}^{z_p\check{u}}_{gt}\hat{\mu}^{z_q\check{u}}_{gt}$ is observed only once for given $g$, $t$, $p$. On the other hand, $\widehat{\mathrm{Avar}}(\check{\theta})$ uses $\hat{\tau}_{pq}$ to estimate the covariance between the individual composite errors $z_p e$ and $z_q e$; this uses individual-level information and is indeed an estimator of $\mathrm{Cov}(z_p e, z_q e \mid g)$, because $\hat{\tau}_{pq}$ averages over $n_{gt}$ observations. The additional $n_{gt}^{-1}$ then transforms it into a legitimate estimator of $\mathrm{Cov}(\hat{\mu}^{z_p e}_{gt}, \hat{\mu}^{z_q e}_{gt} \mid g)$. Apparently, $n_{gt}^{-1}\hat{\tau}_{pq}$ is a better estimator of $\mathrm{Cov}(\hat{\mu}^{z_p e}_{gt}, \hat{\mu}^{z_q e}_{gt} \mid g)$ than $\hat{\mu}^{z_p\check{u}}_{gt}\hat{\mu}^{z_q\check{u}}_{gt}$.

[5] Note that we do not need the formula above to calculate $\hat{\mu}^{z_p\check{u}}_{gt}\hat{\mu}^{z_q\check{u}}_{gt}$; it can be obtained from the cohort-level residuals. The formula is to provide insight into its relationship to $n_{gt}^{-1}\hat{\tau}_{pq}$.

What conclusions do we draw from this comparison? First of all, $\widehat{\mathrm{Avar}}_{\mathtt{c}}(\check{\theta})$ can only make sense if we have a random sample over $g$, because eq. (2.47) averages over $g$. Second, a relatively large number of groups is also needed for the large $G$ asymptotics to work. In the just-identified case, there is no $\widehat{\mathrm{Avar}}_{\mathtt{c}}(\check{\theta})$, because the residuals are all 0. Third, $\widehat{\mathrm{Avar}}_{\mathtt{c}}(\check{\theta})$ also needs fixed $n_{gt}$; otherwise the cohort composite error $\hat{\mu}^{ze}_{gt}$ degenerates to 0. This, however, is not too much of a problem, because in a sample $n_{gt}$ is always finite.

2.4 Simulation

This section contains a simulation study of the optimal NMD estimator and the NMD estimator with identity weighting matrix (i.e., the FE estimator) in the pseudo panel case with instruments. The major purposes of this simulation study are (i) to illustrate that the formulae derived in the last section work when the model is correctly specified, and (ii) to show that valid instruments improve estimation efficiency. We also look at naive ways of computing the standard errors that only make use of the cohort level data.
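Before turning to the design, a tiny numeric sketch makes the preceding $\hat{\tau}$ comparison concrete; the values and names are illustrative, with one $(g,t)$ cell and a within-cell covariance of 0.5 between the two moment components.

    import numpy as np

    rng = np.random.default_rng(2)
    n_gt = 1000
    zu_p = rng.standard_normal(n_gt)               # z_p,it * u_check_it
    zu_q = 0.5 * zu_p + rng.standard_normal(n_gt)  # correlated companion

    mu_p, mu_q = zu_p.mean(), zu_q.mean()
    # Cohort-level product: a single noisy draw, observed once per cell.
    naive = mu_p * mu_q
    # Individual-level estimate n_gt^{-1} tau_pq: averages n_gt terms.
    md = ((zu_p - mu_p) * (zu_q - mu_q)).mean() / n_gt
    print(naive, md)   # md concentrates near 0.5/n_gt; naive is far noisier

Both quantities target the covariance of the cohort composite errors, but only the second exploits the $n_{gt}$ individual observations, which is why it is far less variable.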
Their performance is compared to the NMD standard errors, and explanations for the differences are provided.

2.4.1 Simulation design

Throughout the simulation study, the outcome $y_{it}$ is generated as a linear function of the covariates $(x_{1it} = 1, x_{2it}, x_{3it}, x_{4it})$, the time effect $\eta_t$, the individual effect $f_i$, and the idiosyncratic error $u_{it}$:

$$y_{it} = \beta_1 + \beta_2 x_{2it} + \beta_3 x_{3it} + \beta_4 x_{4it} + \eta_t + f_i + u_{it}, \quad i = 1, \cdots, N_t;\ t = 1, \cdots, T. \qquad (2.49)$$

The parameter values used are $\beta = (\beta_1, \beta_2, \beta_3, \beta_4) = (1, 1, 1, 1)$. The time effects are generated by $\eta_t = t - 1$, and the cohort effects by $\alpha_g = g - 1$. Individual fixed effects are generated by adding a random normal disturbance to the cohort effects, i.e., $f_i \sim N(\alpha_g, 1)$.

To fix ideas, it might be helpful to think of $x_{2it}$, $x_{3it}$ and $x_{4it}$ as education, experience and marital status, respectively. The outcome $y_{it}$ is the log hourly wage, and the individual effect $f_i$ represents some unobserved ability. We focus on estimating the coefficient of $x_{2it}$, whose distribution is given later. The distributions of the two auxiliary variables $x_{3it}$ and $x_{4it}$ are given by

$$x_{3it} \sim N(\sin(gt), 1), \qquad x_{4it} \sim \mathrm{Bernoulli}\Big(\frac{1}{1 + \exp[1.5\sin(gt/2)]}\Big).$$

That is, $x_{3it}$ is a continuous variable with population cohort mean $\sin(gt)$ and within-cell variance 1, and $x_{4it}$ is a binary variable equal to 1 with probability $1/(1 + \exp[1.5\sin(gt/2)])$. Since the individual-level disturbances to $x_{3it}$ and $x_{4it}$ are independently generated, they are always valid IVs. A time-invariant external instrument is generated as $z_i \sim N(0, 1)$.

We investigate a small pseudo panel ($G = 6$, $T = 4$) and a middle sized one ($G = 30$, $T = 20$). In the small pseudo panel, the idiosyncratic error $u_{it}$ follows $N(0, 10)$, and the following five cases for $x_{2it}$ are considered:

1. $x_{2it} \sim N(gt/6, 1)$;
2. $x_{2it} \sim N(gt/6, 1) + f_i$;
3. $x_{2it} \sim N(gt/6, 1) + z_i$;
4. $x_{2it} \sim N(gt/6, 1) + z_i + f_i$;
5. $x_{2it} \sim N(gt/2, 1) + z_i + f_i$.

The standard deviation of $\mu^{x_2}_{gt}$ over $(g, t)$ is about 1. Note that $x_{2it}$ is a valid IV in cases 1 through 3, but not in cases 4 and 5.

In the middle sized pseudo panel, $u_{it}$ follows $N(0, 100)$, which has a bigger variance than in the small pseudo panel. The five cases considered for $x_{2it}$ are:

1. $x_{2it} \sim N(gt/150, 1)$;
2. $x_{2it} \sim N(gt/150, 1) + f_i$;
3. $x_{2it} \sim N(gt/150, 1) + z_i$;
4. $x_{2it} \sim N(gt/150, 1) + z_i + f_i$;
5. $x_{2it} \sim N(gt/50, 1) + z_i + f_i$.

The standard deviation of $\mu^{x_2}_{gt}$ over $(g, t)$ is about 0.9 (see Table 2.1). The variance-covariance as well as correlation matrices of $(\mu^{x_2}_{gt}, \mu^{x_3}_{gt}, \mu^{x_4}_{gt})$ are given in Table 2.1. Case 4 in each setup is the case of major interest. Cases 1 through 3 are used to isolate the effect of adding $f_i$ or $z_i$ as part of $x_{2it}$. Case 5 checks the effect of a larger variation in the cohort mean of $x_{2it}$.

Table 2.1 Variance-covariance and correlation matrix of $(\mu^{x_2}_{gt}, \mu^{x_3}_{gt}, \mu^{x_4}_{gt})$; correlation coefficients in parentheses. $\mu^{x_3}_{gt} = \sin(gt)$, $\mu^{x_4}_{gt} = (1 + \exp[1.5\sin(gt/2)])^{-1}$.
G = 6, T = 4; mu_gt^x2 = gt/6:
              mu^x2             mu^x3             mu^x4
  mu^x2    1.078 (1)
  mu^x3   -0.142 (-0.195)    0.488 (1)
  mu^x4    0.107 (0.444)     0.018 (0.109)     0.054 (1)

G = 30, T = 20; mu_gt^x2 = gt/150:
              mu^x2             mu^x3             mu^x4
  mu^x2    0.834 (1)
  mu^x3   -0.013 (-0.020)    0.507 (1)
  mu^x4    0.009 (0.040)    -0.001 (-0.006)    0.056 (1)

G = 6, T = 4; mu_gt^x2 = gt/2:
              mu^x2             mu^x3             mu^x4
  mu^x2    9.701 (1)
  mu^x3   -0.425 (-0.195)    0.488 (1)
  mu^x4    0.322 (0.444)     0.018 (0.109)     0.054 (1)

G = 30, T = 20; mu_gt^x2 = gt/50:
              mu^x2             mu^x3             mu^x4
  mu^x2    7.508 (1)
  mu^x3   -0.039 (-0.020)    0.507 (1)
  mu^x4    0.026 (0.040)    -0.001 (-0.006)    0.056 (1)

In the ideal situation, we would like to draw a population of infinite size, so that the cohort level population equations hold exactly. Taking case 4 as an example, that would mean the following set of equations holds exactly:

$$E(y_{it}|g) = \beta_1 + \beta_2\frac{gt}{150} + \beta_3\sin(gt) + \beta_4\frac{1}{1 + \exp[1.5\sin(gt/2)]} + (t - 1) + (g - 1), \qquad (2.50)$$
$$g = 1, \cdots, G;\ t = 1, \cdots, T.$$

But drawing an infinite number of observations is obviously infeasible. Therefore, we choose a relatively large number as the population size. The true distribution of the resulting population is of course its empirical distribution, but we can think of it as an approximation to the population defined by eq. (2.49). Eq. (2.50) then also holds only approximately, but the difference should be negligible for the purposes of this simulation study.

In the current setup, the population cohort sizes are set equally to $N_{gt} = 10^5$ for all $g$ and $t$. That means a population panel of $N = 2.4 \times 10^6$ individual-time points for $G = 6$ and $T = 4$, and $N = 6 \times 10^7$ for $G = 30$ and $T = 20$. After the population is generated, we fix it over simulations. In each replication, we draw repeated random cross sections from this fixed population. To gauge how the sample size affects the estimates, we consider two different sampling rates, 0.2% and 1%, which result in sample cohort sizes of $n_{gt} = 200$ and $1000$, respectively.[6]

The simulation design above is careful in the two places emphasized by Imbens and Wooldridge (2007). First, data for each cross section are drawn from the population independently across time, and because of the random sampling in each period, the group identifier is also randomly drawn. Second, eq. (2.49) has full time effects, which is more realistic than Verbeek and Vella (2005), which omits the aggregate time effects; the variation in $\mu^x_{gt}$ here is net of the time effects.

For $\theta = (\beta, \eta, \alpha)$ in eq. (2.49), we consider the NMD estimator with identity weighting matrix ($\check{\theta}$) and its standard error (s.e.), and the optimal NMD estimator ($\hat{\theta}$) and its s.e. Because $\check{\theta}$ is the fixed effects estimator on the pseudo panel of sample cohort means, three naive s.e. estimators, namely the non-robust s.e., the heteroskedasticity-robust s.e. and the cluster-robust s.e., are also computed. They are the usual s.e. estimators routinely computed for the fixed effects estimator in a true panel, but are naive in a pseudo panel, because they treat the sample cohort means as observations carrying no errors and completely ignore the individual-level data.

Besides the basic NMD using no IV, each of $z$, $x_2$, $x_3$ and $x_4$ is used one at a time as the additional IV; a sketch of the data generation and sampling scheme is given below.

[6] A relatively higher sampling rate might introduce too much overlap among the repeated cross-sectional samples. Therefore, we also consider the setup $N_t = 1.5 \times 10^7$ with sampling rate 0.2%. The result shows that there is no essential difference from the setup $N_t = 3 \times 10^3$ with sampling rate 1%.
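A minimal sketch of this data-generating process (case 4, small panel) might look as follows, assuming scaled-down cell sizes; the names (pop, sample, rate) are illustrative, not the chapter's actual simulation code.

    import numpy as np

    rng = np.random.default_rng(1)
    G, T, N_gt = 6, 4, 10_000      # scaled-down population cell size
    rate = 0.01                    # sampling rate per cross section

    pop = {}
    for g in range(1, G + 1):
        f = (g - 1) + rng.standard_normal(N_gt)  # f_i ~ N(alpha_g, 1)
        z = rng.standard_normal(N_gt)            # time-invariant IV z_i
        for t in range(1, T + 1):
            x2 = rng.normal(g * t / 6, 1.0, N_gt) + z + f   # case 4
            x3 = rng.normal(np.sin(g * t), 1.0, N_gt)
            p4 = 1 / (1 + np.exp(1.5 * np.sin(g * t / 2)))
            x4 = rng.binomial(1, p4, N_gt)
            u = rng.normal(0.0, np.sqrt(10.0), N_gt)        # u ~ N(0, 10)
            y = 1 + x2 + x3 + x4 + (t - 1) + f + u          # eq. (2.49)
            pop[(g, t)] = np.column_stack([y, x2, x3, x4, z])

    # One replication: an independent random subsample within every (g, t)
    # cell, mimicking repeated cross sections from the fixed population.
    sample = {cell: data[rng.random(N_gt) < rate] for cell, data in pop.items()}

Each replication would redraw only the sampling indicators while keeping the population fixed, and then feed the cell-level means into the estimators of Section 2.3.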
The NMD estimator using all four variables as the additional IVs is also estimated.

2.4.2 Simulation results for the small pseudo panel

At the center of the simulation study is case 4; cases 1 through 3 are its simplified versions that pin down the effect of the correlation of $x_{2it}$ with $z_i$ and $f_i$, and case 5 is a variation of case 4 that increases the variation in the cohort mean of $x_2$. We therefore focus on analyzing case 4 in this section.

The Monte Carlo simulation results for case 4 from 1000 replications for the coefficient and s.e. estimators of $x_2$ are presented in Table 2.3. Two sample cohort sizes, $n_{gt} = 200$ and $1000$, are considered. For each considered quantity, the Monte Carlo average and standard deviation over the 1000 replications are reported, with the standard deviation in parentheses. The estimators with no IV, $z$ as IV, and $x_2$ as IV are picked because they provide most of the insights. The results on the same quantities in case 3 are reported in Table 2.2. Detailed results are in Tables B.1 through B.20 in Appendix B.

Several observations stand out from Table 2.3. First of all, the NMD coefficient estimators work well in all cases except when the invalid IV $x_2$ is used. Both βˇ2 and βˆ2 are close to the true value in columns 1, 2, 4 and 5. As the sample cohort size $n_{gt}$ gets bigger, the slight biases in βˇ2 and βˆ2 get even smaller, and their Monte Carlo standard deviations also shrink. Second, the NMD s.e. estimators also work well, even when $x_2$ is used as the IV. The Monte Carlo averages of se(βˇ2) and se(βˆ2) are close to the standard deviations of βˇ2 and βˆ2 throughout all columns, and having a bigger cohort size, as expected, reduces se(βˇ2) and se(βˆ2) universally. Third, using a valid and relevant IV improves efficiency, but the validity of the IV is crucial. Compared to using no IV (columns 1 and 4), using $z$ as IV (columns 2 and 5) leads to reduced Monte Carlo averages of se(βˇ2) and se(βˆ2) and smaller finite sample bias in βˇ2 and βˆ2. The use of the invalid IV $x_2$ (columns 3 and 6), however, introduces persistent biases in βˇ2 and βˆ2 that do not vanish as the cohort size gets larger. Note that $x_2$ is not a valid IV because it is correlated with $f_i$, which violates the condition in (2.12).

A comparison with the results in Table 2.2 confirms that the correlation between $x_{2it}$ and $f_i$ invalidates $x_{2it}$ as an IV. In the absence of that correlation, $x_{2it}$ is exogenous and becomes a valid IV for itself. As a result, no obvious bias is observed in βˇ2 and βˆ2 when $x_2$ is used as the IV in Table 2.2. In effect, $x_2$ is then a better IV than $z$, since Table 2.2 shows that se(βˇ2) and se(βˆ2) become smaller on average when the IV is changed from $z$ to $x_2$.

Table 2.2 Finite sample properties of various estimators of β2 and its standard error, G = 6, T = 4. Case 3: x2it ~ N(gt/6, 1) + zi. Monte Carlo standard deviations in parentheses.

                      ngt = 200                     ngt = 1000
               none       z        x2        none       z        x2
  βˇ2         .9909     .9980    .9965      .9937     .9982    .9964
             (.1623)   (.0430)  (.0421)    (.0714)   (.0191)  (.0191)
  se(βˇ2)     .1590     .0463    .0436      .0720     .0206    .0192
             (.0117)   (.0012)  (.0020)    (.0023)   (.0002)  (.0004)
  sen(βˇ2)    .1552     .0457    .0488      .0698     .0205    .0218
             (.0349)   (.0054)  (.0092)    (.0144)   (.0025)  (.0039)
  ser(βˇ2)    .1393     .0508    .0490      .0627     .0229    .0217
             (.0425)   (.0073)  (.0094)    (.0181)   (.0034)  (.0040)
  sec(βˇ2)    .1423     .0496    .0431      .0639     .0223    .0189
             (.0603)   (.0159)  (.0154)    (.0283)   (.0073)  (.0069)
  βˆ2         .9911     .9979    .9981      .9936     .9982    .9987
             (.1623)   (.0437)  (.0324)    (.0713)   (.0191)  (.0143)
  se(βˆ2)     .1585     .0452    .0327      .0720     .0205    .0148
             (.0117)   (.0012)  (.0007)    (.0023)   (.0002)  (.0001)
This makes sense because no IV is more relevant to a variable than the variable itself. When the IV is changed from $z$ to $x_2$ in Table 2.2, a larger reduction is observed in se(βˆ2) than in se(βˇ2). This observation highlights a typical situation in which to use the optimal weighting matrix: when the IV brings in within-cell heteroskedasticity and correlation. Specifically, when $z$ is the IV, we show in the Appendix that

$$\Xi_{gt} = \sigma^2_e\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}, \qquad (2.51)$$

where $\sigma^2_e = E[(\varepsilon^f_i + u_{it})^2 \mid g]$. This implies that the optimal weighting matrix is proportional to an identity matrix, which explains why the averages of se(βˆ2) and se(βˇ2) are close to each other and to the standard deviations of βˇ2 and βˆ2 in Table 2.2 when $z$ is the IV.

Table 2.3 Finite sample properties of various estimators of β2 and its standard error, G = 6, T = 4. Case 4: x2it ~ N(gt/6, 1) + zi + fi. Monte Carlo standard deviations in parentheses.

                      ngt = 200                     ngt = 1000
               none       z        x2        none       z        x2
  βˇ2        1.0153    1.0048    1.2218     .9989     .9996    1.2166
             (.1599)   (.0431)  (.0947)    (.0716)   (.0191)  (.0423)
  se(βˇ2)     .1575     .0462    .0951      .0719     .0206    .0421
             (.0143)   (.0014)  (.0071)    (.0029)   (.0003)  (.0014)
  sen(βˇ2)    .1537     .0455    .0728      .0697     .0205    .0405
             (.0356)   (.0054)  (.0188)    (.0145)   (.0025)  (.0082)
  ser(βˇ2)    .1388     .0506    .0752      .0627     .0229    .0458
             (.0427)   (.0074)  (.0191)    (.0181)   (.0034)  (.0106)
  sec(βˇ2)    .1440     .0494    .0947      .0642     .0223    .0711
             (.0631)   (.0159)  (.0420)    (.0284)   (.0073)  (.0254)
  βˆ2        1.0155    1.0048    1.3194     .9989     .9996    1.3220
             (.1598)   (.0437)  (.0266)    (.0715)   (.0191)  (.0120)
  se(βˆ2)     .1569     .0451    .0266      .0719     .0205    .0120
             (.0142)   (.0014)  (.0006)    (.0029)   (.0003)  (.0001)

On the other hand, when $x_2$ is used as the IV, we show that

$$\Xi_{gt} = \sigma^2_e\begin{pmatrix}1 & \frac{gt}{150}\\ \frac{gt}{150} & 2 + \big(\frac{gt}{150}\big)^2\end{pmatrix}, \qquad (2.52)$$

which has within-cell heteroskedasticity and correlation. The resulting optimal weighting matrix is distinct from an identity matrix.

The results on the naive s.e. estimators, sen(βˇ2), ser(βˇ2) and sec(βˇ2), are also consistent with the theory. We leave that discussion to the next subsection, because the pattern is more obvious when $G$ is greater.

Table 2.4 Finite sample properties of various estimators of β2 and its standard error, G = 30, T = 20. Case 3: x2it ~ N(gt/150, 1) + zi. Monte Carlo standard deviations in parentheses.

                      ngt = 200                     ngt = 1000
               none       z        x2        none       z        x2
  βˇ2        1.0134    1.0022    1.0012     .9980     .9996    .9996
             (.0839)   (.0277)  (.0248)    (.0399)   (.0121)  (.0106)
  se(βˇ2)     .0842     .0277    .0239      .0387     .0123    .0105
             (.0011)   (.0001)  (.0002)    (.0002)   (.0000)  (.0000)
  sen(βˇ2)    .0842     .0274    .0286      .0388     .0123    .0129
             (.0028)   (.0006)  (.0009)    (.0012)   (.0003)  (.0004)
  ser(βˇ2)    .0838     .0281    .0269      .0387     .0125    .0119
             (.0045)   (.0008)  (.0009)    (.0021)   (.0003)  (.0004)
  sec(βˇ2)    .0841     .0282    .0241      .0385     .0124    .0105
             (.0143)   (.0037)  (.0035)    (.0067)   (.0016)  (.0016)
  βˆ2        1.0134    1.0020    1.0016     .9979     .9995    .9993
             (.0843)   (.0279)  (.0209)    (.0400)   (.0121)  (.0087)
  se(βˆ2)     .0837     .0269    .0196      .0387     .0123    .0089
             (.0011)   (.0001)  (.0001)    (.0002)   (.0000)  (.0000)

2.4.3 Simulation results for the middle sized pseudo panel

The results in Tables 2.5 and 2.4 for the middle sized pseudo panel basically tell the same story as the small pseudo panel. We focus on two points that stand out; these two points also exist, though less clearly, in the small pseudo panel.

First, the results on the naive s.e. estimators, sen(βˇ2), ser(βˇ2) and sec(βˇ2), are consistent with the theory. This is best seen in the last column of Table 2.4. Moving down the list sen(βˇ2), ser(βˇ2) and sec(βˇ2), the bias in the Monte Carlo averages gradually declines.
The Monte Carlo average of sec(βˇ2), rounded to four decimal places, is even identical to that of se(βˇ2). The reason is that, when $x_2$ is used as IV, $\mathrm{Var}[\hat{\mu}^{ze}_g \mid g] = \mathrm{diag}(n_{g1}^{-1}\Xi_{g1}, \cdots, n_{gT}^{-1}\Xi_{gT})$ is truly block diagonal, with non-diagonal blocks, by eq. (2.52). Among the three naive variance estimators, only the cluster-robust version correctly accounts for this variance-covariance structure of the cohort composite error. The heteroskedasticity-robust version only captures the heteroskedasticity but not the within-cluster correlation. The non-robust version accounts for neither.

Table 2.5 Finite sample properties of various estimators of β2 and its standard error, G = 30, T = 20. Case 4: x2it ~ N(gt/150, 1) + zi + fi. Monte Carlo standard deviations in parentheses.

                      ngt = 200                     ngt = 1000
               none       z        x2        none       z        x2
  βˇ2        1.0496    1.0106    1.1137    1.0061    1.0013    1.0285
             (.0830)   (.0276)  (.1376)    (.0395)   (.0120)  (.0642)
  se(βˇ2)     .0826     .0276    .1339      .0386     .0123    .0633
             (.0012)   (.0002)  (.0026)    (.0003)   (.0000)  (.0006)
  sen(βˇ2)    .0826     .0273    .0783      .0387     .0123    .0382
             (.0028)   (.0006)  (.0037)    (.0012)   (.0003)  (.0016)
  ser(βˇ2)    .0822     .0280    .1249      .0385     .0125    .0594
             (.0045)   (.0008)  (.0111)    (.0020)   (.0003)  (.0051)
  sec(βˇ2)    .0824     .0281    .1222      .0384     .0124    .0585
             (.0139)   (.0037)  (.0302)    (.0066)   (.0016)  (.0151)
  βˆ2        1.0493    1.0104    1.3186    1.0060    1.0013    1.3199
             (.0835)   (.0279)  (.0175)    (.0395)   (.0120)  (.0071)
  se(βˆ2)     .0821     .0268    .0162      .0385     .0122    .0073
             (.0012)   (.0002)  (.0001)    (.0003)   (.0000)  (.0000)

As a comparison, in the 4th and 5th columns of Table 2.4, the Monte Carlo averages of sen(βˇ2), ser(βˇ2) and sec(βˇ2) are all close to that of se(βˇ2). This is because $\mathrm{Var}[\hat{\mu}^{ze}_g \mid g]$ is proportional to an identity matrix when no IV or $z$ is used. As a result, all three versions of the naive variance estimators are correct in their modeling of $\mathrm{Var}[\hat{\mu}^{ze}_g \mid g]$. Moreover, from sen(βˇ2) through ser(βˇ2) to sec(βˇ2), the s.e. estimators become less and less efficient, as indicated by greater and greater Monte Carlo standard deviations. This is also consistent with their well-known relative efficiency property. Of course, se(βˇ2) is much more efficient than any of the naive estimators, for it makes use of the extra information in the individual-level data.

Secondly, the first column in Table 2.5 shows noticeable bias in βˇ2 and βˆ2. It is finite sample bias, because as $n_{gt}$ gets larger the bias shrinks quickly. A comparison with the first column in Table 2.4 confirms that the correlation between $x_{2it}$ and $f_i$ contributes a large part of the bias.

2.5 Concluding remarks

This chapter develops a general NMD framework that imposes (partial) differentiability on the structural equations. The differentiability conditions are stronger than in the MD framework of Newey and McFadden (1994), but the resulting framework is more convenient to work with in applications. Consistency and asymptotic normality are established, as well as the optimal weighting matrix, expressed as a function of the partial derivatives of the structural equations. A theorem that echoes the GMM property that more moment conditions do not hurt is given. The general framework is then applied to the special case of pseudo panels. Simulation results are consistent with the theory.

The property that having more moment conditions can improve efficiency was first noticed in the exercise of adding external instruments to pseudo panel MD estimation, which is an extension of the work on pseudo panels by Imbens and Wooldridge (2007).
We would like to establish this property in a more general setup; this motivates the NMD framework at the beginning of this chapter. Having both the general framework and the pseudo panel case as an example in the same chapter helps the understanding of the general concepts. In particular, we find that the inverse of the optimal weighting matrix is exactly the variance-covariance matrix of the relocated and rescaled structural equations in the pseudo panel application, which provides straightforward intuition for why the optimal weighting matrix works. Essentially, the optimal weighting matrix down-weights the structural equations that are volatile and gives more weight to those that are less volatile, and correlation between structural equations is also accounted for.

The NMD estimation in pseudo panels correctly relies on large $n_{gt}$ but fixed $G$, $T$ asymptotics. Naive methods like FE on the cohort means turn out to be the NMD estimator using the identity weighting matrix. But the naive s.e. estimators, including the usual s.e., the heteroskedasticity-robust s.e. and the cluster-robust s.e., rely on at least large $G$ asymptotics. Even when $G$ is moderate or large, depending on how complicated the IVs are, the usual s.e. and the one only robust to heteroskedasticity may not capture the correct variance-covariance structure. The cluster-robust s.e., though, has the potential to work for large $G$, because it is fully robust. But since it ignores the individual level data completely, it is always less efficient than the NMD s.e. estimator using the identity weighting matrix. The optimal NMD is always the most efficient among these candidates. We conclude that when there are extra IVs to explore, it is important to use optimal weighting.

The comments in Imbens and Wooldridge (2007) regarding flexible specifications provide several ideas we would like to investigate in future research. First, we intend to extend the application to dynamic models, i.e., to add lagged dependent variables to the list of explanatory variables. This issue has been studied by Moffitt (1993); Collado (1997); Girma (2000); Verbeek and Vella (2005); McKenzie (2004), among others. In dynamic models, the advantage of having the general NMD framework stands out. There is no need to tailor the framework in any way, since we can still define the vector of reduced form parameters as before. Because cohort means of the dependent variable do not appear redundantly in the reduced-form parameters, their asymptotics are well defined. The cohort-level equations are also of the same form as before; the only difference is that the equations for the first several periods need to be dropped because of the lags. Second, we intend to add a unit-specific trend in the unobserved heterogeneity, as in the random growth model of Heckman and Hotz (1989). Third, an even more flexible extension is to let the factor loadings on the unobserved heterogeneity be time-varying. These extensions should be easily handled by the NMD framework. Lastly, we are also interested in an empirical application of the method. Currently, we are working on applying the pseudo panel method to estimate returns to education using data from the U.S. Current Population Survey.

APPENDICES

APPENDIX A PROOFS AND ALGEBRA

A.1 Proof of consistency

Proof. Prove by verifying (i)-(iv) of Theorem 2.1 in Newey and McFadden (1994).

A.2 Proof of asymptotic normality

A sketch of the idea first.
By the first part of (ii), a mean value expansion of each component of $h(\hat\pi, \hat\theta)$ around $\theta_0$ leads to $h(\hat\pi, \hat\theta) = h(\hat\pi, \theta_0) + L(\hat\pi, \bar\theta)(\hat\theta - \theta_0)$, where $\bar\theta$ is a vector of mean values. A similar expansion, $h(\hat\pi, \theta_0) = B(\bar\pi, \theta_0)(\hat\pi - \pi_0)$, follows by the second part of (ii) and $h(\pi_0, \theta_0) = 0$. Substituting the two expansions into the first-order condition and solving gives

$$\sqrt{n}(\hat\theta - \theta_0) = -\big[L(\hat\pi,\hat\theta)'\hat{W}L(\hat\pi,\bar\theta)\big]^{-1} L(\hat\pi,\hat\theta)'\hat{W}B(\bar\pi,\theta_0)\cdot\sqrt{n}(\hat\pi - \pi_0).$$

By the first part of (iv), $\hat\theta \to_p \theta_0$, and by continuity of $L(\pi,\theta)$ on $N(\theta_0)$ in (ii), we have that, with probability approaching one,
$$\|L(\hat\pi,\hat\theta) - L\| \le \|L(\hat\pi,\hat\theta) - L(\pi_0,\hat\theta)\| + \|L(\pi_0,\hat\theta) - L\| \le \sup_{\theta\in N(\theta_0)}\|L(\hat\pi,\theta) - L(\pi_0,\theta)\| + \|L(\pi_0,\hat\theta) - L\| \to_p 0.$$
Similarly, the convergence $L(\hat\pi,\bar\theta) - L \to_p 0$ follows by $\bar\theta \to_p \theta_0$, and $B(\bar\pi,\theta_0) - B \to_p 0$ by $\bar\pi \to_p \pi_0$ and continuity of $B(\pi,\theta)$ on $N(\pi_0)$. Condition (iv) guarantees that $(L'WL)^{-1}$ exists, and thus, with probability approaching one, $[L(\hat\pi,\hat\theta)'\hat{W}L(\hat\pi,\bar\theta)]^{-1}$ exists. The conclusion then follows by (iii), Slutsky's theorem and the asymptotic equivalence theorem. A full proof with technical details is given below.

Proof. By (i), without loss of generality $N(\theta_0)$ ($N(\pi_0)$) can be assumed to be a convex, open set contained in $\Theta$ ($\Pi$). Then $N(\theta_0)$ ($N(\pi_0)$) is also connected, since $\Theta \subseteq \mathbb{R}^P$ ($\Pi \subseteq \mathbb{R}^K$).

Let $1_A$ denote the indicator function for an event $A$. Let $A_1 = \{\hat\theta \in N(\theta_0)\}$ and $A_2 = \{\hat\pi \in N(\pi_0)\}$. Note that $\hat\theta \to_p \theta_0$ ($\hat\pi \to_p \pi_0$) implies $1_{A_1} \to_p 1$ ($1_{A_2} \to_p 1$).

(1) By the first part of condition (ii) and the first order condition for a minimum, $1_{A_1}\cdot L(\hat\pi,\hat\theta)'\hat{W}h(\hat\pi,\hat\theta) = 0$. The multiplication by $1_{A_1}$ is needed because, by (ii), $L(\pi,\theta)$ only exists on $N(\theta_0)$; $\hat\theta$ is defined pointwise by $\hat\theta = \mathrm{argmin}_{\theta\in\Theta}\, h(\hat\pi,\theta)'\hat{W}h(\hat\pi,\theta)$, not by the first order condition $L(\hat\pi,\hat\theta)'\hat{W}h(\hat\pi,\hat\theta) = 0$. For some realizations of $\hat\pi$, the corresponding $\hat\theta$ may not lie in $N(\theta_0)$.

(2) Since $N(\theta_0)$ is connected, by condition (ii) and the mean value expansion theorem, $1_{A_1}\cdot h_j(\hat\pi,\hat\theta) = 1_{A_1}\cdot\big[h_j(\hat\pi,\theta_0) + L_j(\hat\pi,\bar\theta_j)(\hat\theta - \theta_0)\big]$ for $j = 1, \cdots, J$, where $\bar\theta_j$ is a random variable equal to the mean value if $1_{A_1} = 1$ and equal to $\theta_0$ otherwise. Again, this complication is needed because $\hat\theta$ is not necessarily in $N(\theta_0)$. Clearly, $\bar\theta_j \to_p \theta_0$ as $\hat\theta \to_p \theta_0$. Collect all the $J$ mean values in the matrix $\bar\theta$, and let $L(\hat\pi,\bar\theta)$ be the matrix with $j$-th row $L_j(\hat\pi,\bar\theta_j)$. Then those expansions can be written collectively as $1_{A_1}\cdot h(\hat\pi,\hat\theta) = 1_{A_1}\cdot h(\hat\pi,\theta_0) + 1_{A_1}\cdot L(\hat\pi,\bar\theta)(\hat\theta - \theta_0)$. Substituting into $0 = 1_{A_1}\cdot L(\hat\pi,\hat\theta)'\hat{W}h(\hat\pi,\hat\theta)$ leads to
$$0 = 1_{A_1}\cdot L(\hat\pi,\hat\theta)'\hat{W}h(\hat\pi,\theta_0) + 1_{A_1}\cdot L(\hat\pi,\hat\theta)'\hat{W}L(\hat\pi,\bar\theta)(\hat\theta - \theta_0).$$

(3) By a similar reasoning and the fact $h(\pi_0,\theta_0) = 0$, write $1_{A_2}\cdot h(\hat\pi,\theta_0) = 1_{A_2}\cdot B(\bar\pi,\theta_0)(\hat\pi - \pi_0)$, where $\bar\pi_j$, the $j$-th column of the matrix $\bar\pi$, equals a mean value if $1_{A_2} = 1$ and equals $\pi_0$ otherwise, and $B(\bar\pi,\theta_0)$ is the matrix with $j$-th row $B_j(\bar\pi_j,\theta_0)$. $\bar\pi_j \to_p \pi_0$ as $\hat\pi \to_p \pi_0$. Also, $1_{A_2} \to_p 1$, and $B(\bar\pi,\theta_0) \to_p B$ by the second part of condition (iv). Substituting again gives
$$0 = 1_{A_1\cap A_2}\cdot L(\hat\pi,\hat\theta)'\hat{W}B(\bar\pi,\theta_0)(\hat\pi - \pi_0) + 1_{A_1\cap A_2}\cdot L(\hat\pi,\hat\theta)'\hat{W}L(\hat\pi,\bar\theta)(\hat\theta - \theta_0).$$

(4) Let $A_3 = \{L(\hat\pi,\hat\theta)'\hat{W}L(\hat\pi,\bar\theta) \text{ is nonsingular}\}$. Let $\bar{V}$ be a random variable equal to $L(\hat\pi,\hat\theta)'\hat{W}L(\hat\pi,\bar\theta)$ if $1_{A_3} = 1$ and equal to the $P$-dimensional identity matrix otherwise.
By the first part of condition (iv), $L(\hat\pi,\hat\theta) \to_p L$ and $L(\hat\pi,\bar\theta) \to_p L$. Then by condition (v) and $\hat{W} \to_p W$, $1_{A_3} \to_p 1$ and $\bar{V} \to_p L'WL$. Substituting one more time gives
$$0 = 1_{A_1\cap A_2\cap A_3}\cdot L(\hat\pi,\hat\theta)'\hat{W}B(\bar\pi,\theta_0)(\hat\pi - \pi_0) + 1_{A_1\cap A_2\cap A_3}\cdot\bar{V}(\hat\theta - \theta_0).$$

Now, let $A_0 = A_1\cap A_2\cap A_3$. Note that $1_{A_0} = 1_{A_1}\cdot 1_{A_2}\cdot 1_{A_3} \to_p 1$. Multiplying by $\sqrt{n}$ and solving gives
$$\sqrt{n}(\hat\theta - \theta_0) = -1_{A_0}\cdot\bar{V}^{-1}L(\hat\pi,\hat\theta)'\hat{W}B(\bar\pi,\theta_0)\cdot\sqrt{n}(\hat\pi - \pi_0) - (1_{A_0} - 1)\sqrt{n}(\hat\theta - \theta_0).$$
The conclusion follows by Slutsky's theorem and the asymptotic equivalence theorem.

A.3 Proof of optimal weighting matrix

Proof. For an arbitrary $W$,
$$\big(L'WL\big)^{-1}L'W(B\Omega B')WL\big(L'WL\big)^{-1} - \big[L'(B\Omega B')^{-1}L\big]^{-1} = D(B\Omega B')D'$$
is positive semi-definite, where
$$D = \big(L'WL\big)^{-1}L'W - \big[L'(B\Omega B')^{-1}L\big]^{-1}L'(B\Omega B')^{-1}.$$

A.4 Proof that extra conditions do not hurt

Proof. Notice that
$$B = \nabla_\pi h(\pi_0,\theta_0) = \begin{pmatrix}\nabla_\pi h_1(\pi_0,\theta_0)\\ \nabla_\pi h_2(\pi_0,\theta_0)\end{pmatrix} = \begin{pmatrix}B_1\\ B_2\end{pmatrix}.$$
Then
$$\mathbf{M} = B\Omega B' = \begin{pmatrix}B_1\\ B_2\end{pmatrix}\Omega\begin{pmatrix}B_1' & B_2'\end{pmatrix} = \begin{pmatrix}B_1\Omega B_1' & B_1\Omega B_2'\\ B_2\Omega B_1' & B_2\Omega B_2'\end{pmatrix} = \begin{pmatrix}\mathbf{M}_{1,1} & \mathbf{M}_{1,2}\\ \mathbf{M}_{2,1} & \mathbf{M}_{2,2}\end{pmatrix}.$$
Also notice that
$$L = \nabla_\theta h(\pi_0,\theta_0) = \begin{pmatrix}\nabla_\theta h_1(\pi_0,\theta_0)\\ \nabla_\theta h_2(\pi_0,\theta_0)\end{pmatrix} = \begin{pmatrix}L_1\\ L_2\end{pmatrix}.$$
Now, define $F = \mathbf{M}_{2,2} - \mathbf{M}_{2,1}\mathbf{M}_{1,1}^{-1}\mathbf{M}_{1,2}$; $F$ is the Schur complement of $\mathbf{M}_{1,1}$ in $\mathbf{M}$. Then, by the partitioned inverse formula,
$$\mathbf{M}^{-1} = \begin{pmatrix}\mathbf{M}_{1,1}^{-1} + \mathbf{M}_{1,1}^{-1}\mathbf{M}_{1,2}F^{-1}\mathbf{M}_{2,1}\mathbf{M}_{1,1}^{-1} & -\mathbf{M}_{1,1}^{-1}\mathbf{M}_{1,2}F^{-1}\\ -F^{-1}\mathbf{M}_{2,1}\mathbf{M}_{1,1}^{-1} & F^{-1}\end{pmatrix},$$
so that
$$L'\mathbf{M}^{-1}L = L_1'\mathbf{M}_{1,1}^{-1}L_1 + L_1'\mathbf{M}_{1,1}^{-1}\mathbf{M}_{1,2}F^{-1}\mathbf{M}_{2,1}\mathbf{M}_{1,1}^{-1}L_1 - L_2'F^{-1}\mathbf{M}_{2,1}\mathbf{M}_{1,1}^{-1}L_1 - L_1'\mathbf{M}_{1,1}^{-1}\mathbf{M}_{1,2}F^{-1}L_2 + L_2'F^{-1}L_2.$$
Therefore,
$$L'\mathbf{M}^{-1}L - L_1'\mathbf{M}_{1,1}^{-1}L_1 = \big(\mathbf{M}_{2,1}\mathbf{M}_{1,1}^{-1}L_1 - L_2\big)'F^{-1}\big(\mathbf{M}_{2,1}\mathbf{M}_{1,1}^{-1}L_1 - L_2\big),$$
where the fact $\mathbf{M}_{1,2} = \mathbf{M}_{2,1}'$ is used in the last equality. Clearly, the last expression is in quadratic form and thus is positive semi-definite. The condition for redundancy of $h_2(\pi_0,\theta_0) = 0$ is $\mathbf{M}_{2,1}\mathbf{M}_{1,1}^{-1}L_1 - L_2 = 0$, or $B_2\Omega B_1'(B_1\Omega B_1')^{-1}L_1 - L_2 = 0$.

A.5 Useful expressions for $s_{it}$ and $\Omega^s_{gt}$

For $v_{it} = (y_{it}, x_{it})$ and $s_{it} = z_{it}\otimes v_{it}$, an explicit expression for $s_{it}$ is
$$s_{it} = (v_{it},\ z_{2it}v_{it},\ \cdots,\ z_{Pit}v_{it}).$$
Hence,
$$\Omega^s_{gt} = \begin{pmatrix}\mathrm{Var}(v_{it}|g) & \mathrm{Cov}(v_{it}, z_{2it}v_{it}|g) & \cdots & \mathrm{Cov}(v_{it}, z_{Pit}v_{it}|g)\\ \mathrm{Cov}(z_{2it}v_{it}, v_{it}|g) & \mathrm{Var}(z_{2it}v_{it}|g) & \cdots & \mathrm{Cov}(z_{2it}v_{it}, z_{Pit}v_{it}|g)\\ \vdots & \vdots & \ddots & \vdots\\ \mathrm{Cov}(z_{Pit}v_{it}, v_{it}|g) & \mathrm{Cov}(z_{Pit}v_{it}, z_{2it}v_{it}|g) & \cdots & \mathrm{Var}(z_{Pit}v_{it}|g)\end{pmatrix}.$$

For $\hat{\mu}^{z_pv}_{gt} = n_{gt}^{-1}\sum_{i=1}^{n_t} r_{it,g}\, z_{pit}v_{it}$, an estimator of $\mathrm{Cov}(z_{pit}v_{it}, z_{qit}v_{it}|g)$ is
$$\hat{\Gamma}_{pq,gt} = n_{gt}^{-1}\sum_{i=1}^{n_t} r_{it,g}\big(z_{pit}v_{it} - \hat{\mu}^{z_pv}_{gt}\big)'\big(z_{qit}v_{it} - \hat{\mu}^{z_qv}_{gt}\big).$$

It is also informative to write $s_{it}$ as
$$s_{it} = (y_{it}, x_{it}, z_{2it}y_{it}, z_{2it}x_{it}, \cdots, z_{Pit}y_{it}, z_{Pit}x_{it}).$$
Because $x_{it}$ includes unity, $s_{it}$ contains all of $y_{it}$, $x_{it}$, $z_{it}$ and the interactions of $z_{it}$ with $(y_{it}, x_{it})$.

A.6 Derivation for $\mathbf{M}$

$$\mathbf{M} = b(\theta)\Omega^s b(\theta)' = \mathrm{diag}_{GT}\big[b_{11}(\theta), b_{12}(\theta), \cdots, b_{GT}(\theta)\big]\,\mathrm{diag}_{GT}\Big[\frac{\Omega^s_{11}}{\rho_1\kappa_1}, \frac{\Omega^s_{12}}{\rho_1\kappa_2}, \cdots, \frac{\Omega^s_{GT}}{\rho_G\kappa_T}\Big]\,\mathrm{diag}_{GT}\big[b_{11}(\theta)', b_{12}(\theta)', \cdots, b_{GT}(\theta)'\big]$$
$$= \mathrm{diag}_{GT}\Big[\frac{b_{11}(\theta)\Omega^s_{11}b_{11}(\theta)'}{\rho_1\kappa_1},\ \frac{b_{12}(\theta)\Omega^s_{12}b_{12}(\theta)'}{\rho_1\kappa_2},\ \cdots,\ \frac{b_{GT}(\theta)\Omega^s_{GT}b_{GT}(\theta)'}{\rho_G\kappa_T}\Big].$$

A.7 Equivalence of the two ways of computing $\hat{\mathbf{M}}$

Proof. Expanding $\hat{\mathbf{M}} = b(\check\theta)\hat\Omega^s b(\check\theta)'$ shows that
$$b(\check\theta)\hat\Omega^s b(\check\theta)' = \mathrm{diag}_{GT}\Big[\frac{b_{11}(\check\theta)\hat\Omega^s_{11}b_{11}(\check\theta)'}{n_{11}/n},\ \frac{b_{12}(\check\theta)\hat\Omega^s_{12}b_{12}(\check\theta)'}{n_{12}/n},\ \cdots,\ \frac{b_{GT}(\check\theta)\hat\Omega^s_{GT}b_{GT}(\check\theta)'}{n_{GT}/n}\Big].$$

For each $(g,t)$,
$$b_{gt}(\check\theta)\hat\Omega^s_{gt}b_{gt}(\check\theta)' = \big[\mathbf{I}_P\otimes(-1, \check\beta_{gt}')\big]\big[\hat\Gamma_{pq,gt}\big]_{p,q=1}^{P}\big[\mathbf{I}_P\otimes(-1, \check\beta_{gt}')\big]'.$$

For each $(p, q; g, t)$, recall $v_{it} = (y_{it}, x_{it})$ and notice that
$$v_{it}\,(-1, \check\beta_{gt}')' = -\big(y_{it} - x_{it}\check\beta - (\check\eta_t + \check\alpha_g)\big) = -\check{u}_{it}$$
and that
$$\hat{\mu}^{z_pv}_{gt}\,(-1, \check\beta_{gt}')' = n_{gt}^{-1}\sum_{i=1}^{n_t} r_{it,g}\, z_{pit}v_{it}\,(-1, \check\beta_{gt}')' = -n_{gt}^{-1}\sum_{i=1}^{n_t} r_{it,g}\, z_{pit}\check{u}_{it} = -\hat{\mu}^{z_p\check{u}}_{gt}.$$
Hence,
$$(-1, \check\beta_{gt}')\,\hat\Gamma_{pq,gt}\,(-1, \check\beta_{gt}')' = n_{gt}^{-1}\sum_{i=1}^{n_t} r_{it,g}\big(z_{pit}\check{u}_{it} - \hat{\mu}^{z_p\check{u}}_{gt}\big)\big(z_{qit}\check{u}_{it} - \hat{\mu}^{z_q\check{u}}_{gt}\big) = \hat{\tau}_{pq}.$$
Hence the two ways are numerically equivalent.

A.8 Further algebra on $\Xi_{gt}$

In general,
$$\Xi_{gt} = \mathrm{Var}\big[z_{it}'y_{it} - z_{it}'x_{it}\beta - z_{it}'(\eta_t + \alpha_g)\mid g\big] = \mathrm{Var}\big[z_{it}'(\varepsilon^f_i + u_{it})\mid g\big]$$
$$= E\big[(\varepsilon^f_i + u_{it})^2 z_{it}'z_{it}\mid g\big] - E\big[z_{it}'(\varepsilon^f_i + u_{it})\mid g\big]E\big[(\varepsilon^f_i + u_{it})z_{it}\mid g\big]$$
$$= E\big[(\varepsilon^f_i + u_{it})^2 z_{it}'z_{it}\mid g\big]$$
$$= E\big[(\varepsilon^f_i + u_{it})^2\mid g\big]\,E\big[z_{it}'z_{it}\mid g\big] \quad\text{if independent}$$
$$= \sigma^2_{\varepsilon^f + u}(g)\cdot\Omega^z \quad\text{if independent of } g_i \text{ and mean zero.}$$

For $z_{it} = (1, x_{2it})$ in case 3,
$$\Xi_{gt} = E\big[(\varepsilon^f_i + u_{it})^2\mid g\big]E\big[(1, x_{2it})'(1, x_{2it})\mid g\big] = \sigma^2_{\varepsilon^f + u}\begin{pmatrix}1 & gt/150\\ gt/150 & 2 + (gt/150)^2\end{pmatrix}.$$

APPENDIX B ADDITIONAL TABLES

Table B.1 Small panel with G = 6, T = 4. Case 1.a: x2it ~ N(gt/6, 1), ngt = 200, sampling rate = 1%. sen, ser and sec are the non-robust, robust and cluster-robust standard errors, respectively. MD Identity: βˇ with se(βˇ), sen(βˇ), ser(βˇ), sec(βˇ); MD Optimal: βˆ with se(βˆ). Columns: coefficients on x2, x3, x4; Monte Carlo standard deviations in parentheses.
IV: none
  βˇ       1.0021 (0.1534)  0.9973 (0.0901)  1.0127 (0.2818)
  se(βˇ)   0.1590 (0.0122)  0.0887 (0.0040)  0.2848 (0.0137)
  sen(βˇ)  0.1555 (0.0326)  0.0866 (0.0176)  0.2780 (0.0570)
  ser(βˇ)  0.1397 (0.0424)  0.0826 (0.0195)  0.2624 (0.0659)
  sec(βˇ)  0.1413 (0.0628)  0.0878 (0.0315)  0.2943 (0.1070)
  βˆ       1.0020 (0.1531)  0.9976 (0.0904)  1.0129 (0.2834)
  se(βˆ)   0.1585 (0.0121)  0.0884 (0.0040)  0.2838 (0.0137)
IV: z
  βˇ       1.0530 (0.1447)  0.9880 (0.0884)  1.0230 (0.2764)
  se(βˇ)   0.1504 (0.0104)  0.0870 (0.0037)  0.2788 (0.0129)
  sen(βˇ)  0.1486 (0.0206)  0.0858 (0.0105)  0.2749 (0.0349)
  ser(βˇ)  0.1225 (0.0304)  0.0690 (0.0148)  0.2187 (0.0496)
  sec(βˇ)  0.1240 (0.0514)  0.0736 (0.0263)  0.2455 (0.0870)
  βˆ       1.0532 (0.1452)  0.9875 (0.0892)  1.0221 (0.2774)
  se(βˆ)   0.1486 (0.0103)  0.0860 (0.0036)  0.2755 (0.0127)
IV: x2
  βˇ       1.2350 (0.1365)  1.0028 (0.1091)  0.9988 (0.3368)
  se(βˇ)   0.1360 (0.0085)  0.1100 (0.0048)  0.3415 (0.0184)
  sen(βˇ)  0.0931 (0.0216)  0.0528 (0.0118)  0.1670 (0.0384)
  ser(βˇ)  0.1025 (0.0257)  0.0709 (0.0235)  0.2146 (0.0699)
  sec(βˇ)  0.1344 (0.0535)  0.0708 (0.0304)  0.2608 (0.1138)
  βˆ       1.4768 (0.0329)  0.9063 (0.0833)  1.1332 (0.2681)
  se(βˆ)   0.0326 (0.0009)  0.0802 (0.0027)  0.2694 (0.0116)
IV: x3
  βˇ       1.0329 (0.1561)  0.9879 (0.0458)  1.0083 (0.2885)
  se(βˇ)   0.1634 (0.0117)  0.0463 (0.0016)  0.2932 (0.0134)
  sen(βˇ)  0.1218 (0.0190)  0.0353 (0.0049)  0.2332 (0.0333)
  ser(βˇ)  0.1115 (0.0258)  0.0367 (0.0064)  0.2191 (0.0421)
  sec(βˇ)  0.1253 (0.0532)  0.0406 (0.0138)  0.2567 (0.0918)
  βˆ       1.0504 (0.1404)  0.9885 (0.0419)  1.0254 (0.2607)
  se(βˆ)   0.1424 (0.0098)  0.0415 (0.0012)  0.2617 (0.0112)
IV: x4
  βˇ       1.0061 (0.1546)  0.9975 (0.0862)  1.0097 (0.1550)
  se(βˇ)   0.1588 (0.0119)  0.0863 (0.0036)  0.1585 (0.0043)
  sen(βˇ)  0.1007 (0.0175)  0.0542 (0.0088)  0.1271 (0.0206)
  ser(βˇ)  0.1011 (0.0269)  0.0564 (0.0120)  0.1303 (0.0268)
  sec(βˇ)  0.1173 (0.0509)  0.0693 (0.0252)  0.1408 (0.0491)
  βˆ       1.0505 (0.1463)  0.9892 (0.0811)  1.0143 (0.1008)
  se(βˆ)   0.1479 (0.0105)  0.0812 (0.0032)  0.1011 (0.0023)
IV: all
  βˇ       1.2449 (0.1268)  0.9943 (0.0871)  0.9839 (0.3119)
  se(βˇ)   0.1267 (0.0079)  0.0881 (0.0034)  0.3163 (0.0155)
  sen(βˇ)  0.0574 (0.0111)  0.0301 (0.0055)  0.1043 (0.0198)
  ser(βˇ)  0.0867 (0.0218)  0.0514 (0.0157)  0.1798 (0.0578)
  sec(βˇ)  0.1108 (0.0446)  0.0582 (0.0226)  0.2246 (0.0964)
  βˆ       1.4718 (0.0335)  0.9687 (0.0406)  1.0243 (0.0983)
  se(βˆ)   0.0319 (0.0008)  0.0392 (0.0009)  0.0972 (0.0016)

Table B.2 Small panel with G = 6, T = 4. Case 1.b: x2it ~ N(gt/6, 1), ngt = 1000, sampling rate = 1%. sen, ser and sec are the non-robust, robust and cluster-robust standard errors, respectively. Layout as in Table B.1.
IV: none
  βˇ       0.9940 (0.0715)  1.0037 (0.0391)  1.0067 (0.1274)
  se(βˇ)   0.0722 (0.0018)  0.0398 (0.0007)  0.1282 (0.0024)
  sen(βˇ)  0.0700 (0.0144)  0.0385 (0.0079)  0.1244 (0.0256)
  ser(βˇ)  0.0628 (0.0180)  0.0367 (0.0085)  0.1174 (0.0308)
  sec(βˇ)  0.0639 (0.0286)  0.0390 (0.0146)  0.1333 (0.0486)
  βˆ       0.9939 (0.0714)  1.0037 (0.0391)  1.0069 (0.1276)
  se(βˆ)   0.0721 (0.0018)  0.0397 (0.0007)  0.1281 (0.0024)
IV: z
  βˇ       0.9940 (0.0709)  1.0038 (0.0389)  1.0064 (0.1274)
  se(βˇ)   0.0718 (0.0018)  0.0397 (0.0007)  0.1278 (0.0024)
  sen(βˇ)  0.0714 (0.0088)  0.0395 (0.0049)  0.1272 (0.0156)
  ser(βˇ)  0.0517 (0.0142)  0.0301 (0.0069)  0.0963 (0.0247)
  sec(βˇ)  0.0531 (0.0233)  0.0322 (0.0120)  0.1100 (0.0399)
  βˆ       0.9940 (0.0710)  1.0037 (0.0390)  1.0067 (0.1278)
  se(βˆ)   0.0716 (0.0018)  0.0396 (0.0007)  0.1275 (0.0024)
IV: x2
  βˇ       0.9979 (0.0372)  1.0036 (0.0459)  1.0219 (0.1538)
  se(βˇ)   0.0360 (0.0011)  0.0464 (0.0009)  0.1536 (0.0033)
  sen(βˇ)  0.0336 (0.0064)  0.0270 (0.0052)  0.0751 (0.0145)
  ser(βˇ)  0.0312 (0.0057)  0.0286 (0.0071)  0.0839 (0.0273)
  sec(βˇ)  0.0306 (0.0120)  0.0373 (0.0140)  0.1172 (0.0518)
  βˆ       0.9987 (0.0208)  1.0026 (0.0357)  1.0086 (0.1241)
  se(βˆ)   0.0205 (0.0002)  0.0371 (0.0005)  0.1259 (0.0023)
IV: x3
  βˇ       0.9998 (0.0778)  1.0025 (0.0206)  1.0097 (0.1327)
  se(βˇ)   0.0768 (0.0020)  0.0207 (0.0003)  0.1337 (0.0025)
  sen(βˇ)  0.0565 (0.0078)  0.0158 (0.0022)  0.1059 (0.0147)
  ser(βˇ)  0.0512 (0.0110)  0.0165 (0.0028)  0.0992 (0.0195)
  sec(βˇ)  0.0573 (0.0248)  0.0181 (0.0065)  0.1159 (0.0408)
  βˆ       0.9946 (0.0658)  1.0031 (0.0186)  1.0070 (0.1218)
  se(βˆ)   0.0680 (0.0016)  0.0188 (0.0002)  0.1202 (0.0020)
IV: x4
  βˇ       0.9959 (0.0711)  1.0040 (0.0381)  1.0028 (0.0686)
  se(βˇ)   0.0728 (0.0018)  0.0387 (0.0006)  0.0705 (0.0007)
  sen(βˇ)  0.0457 (0.0075)  0.0242 (0.0040)  0.0567 (0.0093)
  ser(βˇ)  0.0458 (0.0114)  0.0251 (0.0053)  0.0580 (0.0121)
  sec(βˇ)  0.0534 (0.0238)  0.0313 (0.0117)  0.0635 (0.0218)
  βˆ       0.9936 (0.0695)  1.0045 (0.0370)  1.0011 (0.0442)
  se(βˆ)   0.0708 (0.0017)  0.0370 (0.0006)  0.0454 (0.0003)
IV: all
  βˇ       1.0002 (0.0390)  1.0028 (0.0248)  1.0154 (0.1157)
  se(βˇ)   0.0377 (0.0010)  0.0250 (0.0003)  0.1166 (0.0018)
  sen(βˇ)  0.0256 (0.0030)  0.0132 (0.0015)  0.0515 (0.0060)
  ser(βˇ)  0.0271 (0.0045)  0.0150 (0.0024)  0.0599 (0.0167)
  sec(βˇ)  0.0300 (0.0112)  0.0201 (0.0073)  0.0872 (0.0357)
  βˆ       0.9986 (0.0208)  1.0031 (0.0179)  1.0014 (0.0440)
  se(βˆ)   0.0203 (0.0002)  0.0182 (0.0002)  0.0448 (0.0003)

Table B.3 Small panel with G = 6, T = 4. Case 1.1: x2it ~ N(gt/6, 1), ngt = 200, sampling rate = 0.2%. sen, ser and sec are the non-robust, robust and cluster-robust standard errors, respectively. Layout as in Table B.1.
IV: none
  βˇ       0.9898 (0.1652)  1.0083 (0.0875)  1.0103 (0.2886)
  se(βˇ)   0.1606 (0.0089)  0.0889 (0.0034)  0.2863 (0.0124)
  sen(βˇ)  0.1568 (0.0342)  0.0868 (0.0185)  0.2794 (0.0593)
  ser(βˇ)  0.1402 (0.0423)  0.0823 (0.0208)  0.2619 (0.0716)
  sec(βˇ)  0.1430 (0.0615)  0.0871 (0.0326)  0.2986 (0.1097)
  βˆ       0.9900 (0.1652)  1.0081 (0.0874)  1.0113 (0.2882)
  se(βˆ)   0.1601 (0.0089)  0.0886 (0.0034)  0.2853 (0.0123)
IV: z
  βˇ       0.9896 (0.1599)  1.0091 (0.0864)  1.0095 (0.2832)
  se(βˇ)   0.1565 (0.0084)  0.0879 (0.0033)  0.2817 (0.0119)
  sen(βˇ)  0.1546 (0.0198)  0.0868 (0.0105)  0.2783 (0.0341)
  ser(βˇ)  0.1186 (0.0308)  0.0682 (0.0160)  0.2179 (0.0541)
  sec(βˇ)  0.1203 (0.0477)  0.0716 (0.0268)  0.2447 (0.0908)
  βˆ       0.9905 (0.1599)  1.0084 (0.0863)  1.0111 (0.2818)
  se(βˆ)   0.1547 (0.0083)  0.0869 (0.0033)  0.2785 (0.0117)
IV: x2
  βˇ       0.9966 (0.0829)  1.0086 (0.1007)  1.0183 (0.3441)
  se(βˇ)   0.0809 (0.0052)  0.1035 (0.0043)  0.3417 (0.0167)
  sen(βˇ)  0.0752 (0.0153)  0.0606 (0.0122)  0.1684 (0.0341)
  ser(βˇ)  0.0701 (0.0131)  0.0642 (0.0165)  0.1871 (0.0628)
  sec(βˇ)  0.0687 (0.0255)  0.0817 (0.0322)  0.2611 (0.1184)
  βˆ       0.9978 (0.0476)  1.0065 (0.0808)  1.0130 (0.2811)
  se(βˆ)   0.0453 (0.0010)  0.0819 (0.0027)  0.2750 (0.0113)
IV: x3
  βˇ       1.0019 (0.1685)  1.0038 (0.0468)  1.0200 (0.2924)
  se(βˇ)   0.1681 (0.0089)  0.0464 (0.0013)  0.2956 (0.0123)
  sen(βˇ)  0.1241 (0.0180)  0.0353 (0.0048)  0.2340 (0.0327)
  ser(βˇ)  0.1127 (0.0253)  0.0367 (0.0060)  0.2198 (0.0441)
  sec(βˇ)  0.1275 (0.0542)  0.0399 (0.0140)  0.2580 (0.0882)
  βˆ       0.9931 (0.1513)  1.0035 (0.0433)  1.0173 (0.2629)
  se(βˆ)   0.1473 (0.0073)  0.0416 (0.0010)  0.2633 (0.0101)
IV: x4
  βˇ       0.9909 (0.1641)  1.0105 (0.0840)  1.0028 (0.1582)
  se(βˇ)   0.1610 (0.0087)  0.0865 (0.0030)  0.1582 (0.0036)
  sen(βˇ)  0.1020 (0.0177)  0.0544 (0.0091)  0.1276 (0.0210)
  ser(βˇ)  0.1020 (0.0263)  0.0563 (0.0126)  0.1301 (0.0278)
  sec(βˇ)  0.1189 (0.0507)  0.0693 (0.0266)  0.1415 (0.0498)
  βˆ       0.9899 (0.1565)  1.0093 (0.0800)  1.0005 (0.0996)
  se(βˆ)   0.1529 (0.0079)  0.0814 (0.0027)  0.1008 (0.0016)
IV: all
  βˇ       0.9993 (0.0849)  1.0059 (0.0547)  1.0151 (0.2575)
  se(βˇ)   0.0839 (0.0045)  0.0559 (0.0017)  0.2579 (0.0091)
  sen(βˇ)  0.0566 (0.0070)  0.0296 (0.0035)  0.1144 (0.0140)
  ser(βˇ)  0.0602 (0.0104)  0.0335 (0.0056)  0.1336 (0.0385)
  sec(βˇ)  0.0668 (0.0250)  0.0443 (0.0167)  0.1932 (0.0819)
  βˆ       0.9979 (0.0479)  1.0036 (0.0421)  1.0013 (0.1007)
  se(βˆ)   0.0443 (0.0010)  0.0398 (0.0008)  0.0984 (0.0015)

Table B.4 Small panel with G = 6, T = 4. Case 1.1: x2it ~ N(gt/6, 1), ngt = 200, sampling rate = 0.2%. sen, ser and sec are the non-robust, robust and cluster-robust standard errors, respectively. Layout as in Table B.1.
IV: none
  βˇ       1.0027 (0.0722)  0.9971 (0.0401)  1.0010 (0.1304)
  se(βˇ)   0.0722 (0.0017)  0.0398 (0.0007)  0.1279 (0.0024)
  sen(βˇ)  0.0697 (0.0145)  0.0384 (0.0079)  0.1235 (0.0256)
  ser(βˇ)  0.0618 (0.0181)  0.0366 (0.0086)  0.1167 (0.0306)
  sec(βˇ)  0.0622 (0.0281)  0.0390 (0.0138)  0.1325 (0.0490)
  βˆ       1.0027 (0.0722)  0.9972 (0.0401)  1.0012 (0.1304)
  se(βˆ)   0.0721 (0.0017)  0.0398 (0.0007)  0.1278 (0.0024)
IV: z
  βˇ       1.0028 (0.0715)  0.9971 (0.0400)  1.0009 (0.1304)
  se(βˇ)   0.0718 (0.0017)  0.0397 (0.0007)  0.1275 (0.0024)
  sen(βˇ)  0.0709 (0.0087)  0.0392 (0.0048)  0.1259 (0.0152)
  ser(βˇ)  0.0510 (0.0143)  0.0300 (0.0069)  0.0957 (0.0245)
  sec(βˇ)  0.0516 (0.0229)  0.0323 (0.0114)  0.1092 (0.0403)
  βˆ       1.0029 (0.0715)  0.9973 (0.0401)  1.0009 (0.1302)
  se(βˆ)   0.0716 (0.0017)  0.0396 (0.0007)  0.1272 (0.0024)
IV: x2
  βˇ       1.0030 (0.0353)  0.9962 (0.0461)  1.0016 (0.1567)
  se(βˇ)   0.0360 (0.0011)  0.0465 (0.0009)  0.1535 (0.0034)
  sen(βˇ)  0.0332 (0.0066)  0.0267 (0.0053)  0.0742 (0.0148)
  ser(βˇ)  0.0308 (0.0058)  0.0284 (0.0072)  0.0832 (0.0290)
  sec(βˇ)  0.0301 (0.0115)  0.0359 (0.0137)  0.1166 (0.0554)
  βˆ       0.9992 (0.0204)  0.9980 (0.0374)  0.9999 (0.1278)
  se(βˆ)   0.0205 (0.0002)  0.0372 (0.0005)  0.1256 (0.0023)
IV: x3
  βˇ       1.0005 (0.0754)  0.9989 (0.0211)  1.0006 (0.1349)
  se(βˇ)   0.0769 (0.0018)  0.0207 (0.0003)  0.1335 (0.0025)
  sen(βˇ)  0.0565 (0.0075)  0.0158 (0.0021)  0.1057 (0.0139)
  ser(βˇ)  0.0511 (0.0112)  0.0165 (0.0026)  0.0993 (0.0182)
  sec(βˇ)  0.0574 (0.0255)  0.0180 (0.0060)  0.1159 (0.0404)
  βˆ       1.0006 (0.0674)  0.9998 (0.0190)  0.9981 (0.1211)
  se(βˆ)   0.0680 (0.0015)  0.0188 (0.0002)  0.1198 (0.0021)
IV: x4
  βˇ       1.0030 (0.0731)  0.9965 (0.0391)  1.0027 (0.0713)
  se(βˇ)   0.0729 (0.0017)  0.0388 (0.0006)  0.0705 (0.0007)
  sen(βˇ)  0.0456 (0.0074)  0.0242 (0.0039)  0.0564 (0.0091)
  ser(βˇ)  0.0452 (0.0114)  0.0251 (0.0053)  0.0579 (0.0120)
  sec(βˇ)  0.0518 (0.0234)  0.0310 (0.0117)  0.0628 (0.0222)
  βˆ       1.0031 (0.0707)  0.9970 (0.0369)  1.0032 (0.0461)
  se(βˆ)   0.0708 (0.0017)  0.0370 (0.0006)  0.0454 (0.0003)
IV: all
  βˇ       1.0022 (0.0369)  0.9980 (0.0252)  1.0025 (0.1185)
  se(βˇ)   0.0378 (0.0010)  0.0250 (0.0004)  0.1164 (0.0019)
  sen(βˇ)  0.0255 (0.0029)  0.0132 (0.0015)  0.0512 (0.0059)
  ser(βˇ)  0.0268 (0.0046)  0.0149 (0.0024)  0.0595 (0.0175)
  sec(βˇ)  0.0299 (0.0113)  0.0197 (0.0069)  0.0863 (0.0377)
  βˆ       0.9991 (0.0203)  0.9998 (0.0183)  1.0026 (0.0460)
  se(βˆ)   0.0203 (0.0002)  0.0182 (0.0002)  0.0449 (0.0003)

Table B.5 Small panel with G = 6, T = 4. Case 2.a: x2it ~ N(gt/6, 1) + fi, ngt = 200, sampling rate = 1%. sen, ser and sec are the non-robust, robust and cluster-robust standard errors, respectively. Layout as in Table B.1.
IV: none
  βˇ       1.0021 (0.1534)  0.9973 (0.0901)  1.0127 (0.2818)
  se(βˇ)   0.1590 (0.0122)  0.0887 (0.0040)  0.2848 (0.0137)
  sen(βˇ)  0.1555 (0.0326)  0.0866 (0.0176)  0.2780 (0.0570)
  ser(βˇ)  0.1397 (0.0424)  0.0826 (0.0195)  0.2624 (0.0659)
  sec(βˇ)  0.1413 (0.0628)  0.0878 (0.0315)  0.2943 (0.1070)
  βˆ       1.0020 (0.1531)  0.9976 (0.0904)  1.0129 (0.2834)
  se(βˆ)   0.1585 (0.0121)  0.0884 (0.0040)  0.2838 (0.0137)
IV: z
  βˇ       1.0530 (0.1447)  0.9880 (0.0884)  1.0230 (0.2764)
  se(βˇ)   0.1504 (0.0104)  0.0870 (0.0037)  0.2788 (0.0129)
  sen(βˇ)  0.1486 (0.0206)  0.0858 (0.0105)  0.2749 (0.0349)
  ser(βˇ)  0.1225 (0.0304)  0.0690 (0.0148)  0.2187 (0.0496)
  sec(βˇ)  0.1240 (0.0514)  0.0736 (0.0263)  0.2455 (0.0870)
  βˆ       1.0532 (0.1452)  0.9875 (0.0892)  1.0221 (0.2774)
  se(βˆ)   0.1486 (0.0103)  0.0860 (0.0036)  0.2755 (0.0127)
IV: x2
  βˇ       1.2350 (0.1365)  1.0028 (0.1091)  0.9988 (0.3368)
  se(βˇ)   0.1360 (0.0085)  0.1100 (0.0048)  0.3415 (0.0184)
  sen(βˇ)  0.0931 (0.0216)  0.0528 (0.0118)  0.1670 (0.0384)
  ser(βˇ)  0.1025 (0.0257)  0.0709 (0.0235)  0.2146 (0.0699)
  sec(βˇ)  0.1344 (0.0535)  0.0708 (0.0304)  0.2608 (0.1138)
  βˆ       1.4768 (0.0329)  0.9063 (0.0833)  1.1332 (0.2681)
  se(βˆ)   0.0326 (0.0009)  0.0802 (0.0027)  0.2694 (0.0116)
IV: x3
  βˇ       1.0329 (0.1561)  0.9879 (0.0458)  1.0083 (0.2885)
  se(βˇ)   0.1634 (0.0117)  0.0463 (0.0016)  0.2932 (0.0134)
  sen(βˇ)  0.1218 (0.0190)  0.0353 (0.0049)  0.2332 (0.0333)
  ser(βˇ)  0.1115 (0.0258)  0.0367 (0.0064)  0.2191 (0.0421)
  sec(βˇ)  0.1253 (0.0532)  0.0406 (0.0138)  0.2567 (0.0918)
  βˆ       1.0504 (0.1404)  0.9885 (0.0419)  1.0254 (0.2607)
  se(βˆ)   0.1424 (0.0098)  0.0415 (0.0012)  0.2617 (0.0112)
IV: x4
  βˇ       1.0061 (0.1546)  0.9975 (0.0862)  1.0097 (0.1550)
  se(βˇ)   0.1588 (0.0119)  0.0863 (0.0036)  0.1585 (0.0043)
  sen(βˇ)  0.1007 (0.0175)  0.0542 (0.0088)  0.1271 (0.0206)
  ser(βˇ)  0.1011 (0.0269)  0.0564 (0.0120)  0.1303 (0.0268)
  sec(βˇ)  0.1173 (0.0509)  0.0693 (0.0252)  0.1408 (0.0491)
  βˆ       1.0505 (0.1463)  0.9892 (0.0811)  1.0143 (0.1008)
  se(βˆ)   0.1479 (0.0105)  0.0812 (0.0032)  0.1011 (0.0023)
IV: all
  βˇ       1.2449 (0.1268)  0.9943 (0.0871)  0.9839 (0.3119)
  se(βˇ)   0.1267 (0.0079)  0.0881 (0.0034)  0.3163 (0.0155)
  sen(βˇ)  0.0574 (0.0111)  0.0301 (0.0055)  0.1043 (0.0198)
  ser(βˇ)  0.0867 (0.0218)  0.0514 (0.0157)  0.1798 (0.0578)
  sec(βˇ)  0.1108 (0.0446)  0.0582 (0.0226)  0.2246 (0.0964)
  βˆ       1.4718 (0.0335)  0.9687 (0.0406)  1.0243 (0.0983)
  se(βˆ)   0.0319 (0.0008)  0.0392 (0.0009)  0.0972 (0.0016)

Table B.6 Small panel with G = 6, T = 4. Case 2.b: x2it ~ N(gt/6, 1) + fi, ngt = 1000, sampling rate = 1%. sen, ser and sec are the non-robust, robust and cluster-robust standard errors, respectively. Layout as in Table B.1.
IV: none
  βˇ       0.9993 (0.0717)  1.0027 (0.0391)  1.0082 (0.1274)
  se(βˇ)   0.0721 (0.0025)  0.0398 (0.0008)  0.1283 (0.0026)
  sen(βˇ)  0.0698 (0.0144)  0.0385 (0.0079)  0.1244 (0.0256)
  ser(βˇ)  0.0628 (0.0181)  0.0367 (0.0085)  0.1174 (0.0309)
  sec(βˇ)  0.0642 (0.0287)  0.0390 (0.0146)  0.1333 (0.0487)
  βˆ       0.9992 (0.0716)  1.0027 (0.0392)  1.0084 (0.1277)
  se(βˆ)   0.0720 (0.0025)  0.0397 (0.0008)  0.1282 (0.0026)
IV: z
  βˇ       1.0107 (0.0706)  1.0004 (0.0389)  1.0113 (0.1272)
  se(βˇ)   0.0712 (0.0025)  0.0396 (0.0008)  0.1277 (0.0026)
  sen(βˇ)  0.0709 (0.0088)  0.0394 (0.0048)  0.1271 (0.0156)
  ser(βˇ)  0.0526 (0.0140)  0.0301 (0.0068)  0.0966 (0.0247)
  sec(βˇ)  0.0545 (0.0235)  0.0323 (0.0120)  0.1102 (0.0401)
  βˆ       1.0107 (0.0706)  1.0004 (0.0390)  1.0115 (0.1276)
  se(βˆ)   0.0711 (0.0024)  0.0395 (0.0008)  0.1274 (0.0025)
IV: x2
  βˇ       1.2316 (0.0608)  1.0107 (0.0480)  1.0146 (0.1537)
  se(βˇ)   0.0609 (0.0017)  0.0491 (0.0009)  0.1532 (0.0034)
  sen(βˇ)  0.0555 (0.0095)  0.0312 (0.0052)  0.0992 (0.0166)
  ser(βˇ)  0.0664 (0.0135)  0.0354 (0.0098)  0.1064 (0.0313)
  sec(βˇ)  0.1070 (0.0301)  0.0398 (0.0134)  0.1261 (0.0536)
  βˆ       1.4756 (0.0147)  0.9080 (0.0352)  1.1474 (0.1225)
  se(βˆ)   0.0147 (0.0002)  0.0365 (0.0005)  0.1238 (0.0023)
IV: x3
  βˇ       1.0101 (0.0776)  1.0016 (0.0206)  1.0115 (0.1325)
  se(βˇ)   0.0763 (0.0026)  0.0207 (0.0003)  0.1337 (0.0026)
  sen(βˇ)  0.0562 (0.0078)  0.0158 (0.0022)  0.1058 (0.0147)
  ser(βˇ)  0.0512 (0.0110)  0.0165 (0.0028)  0.0991 (0.0195)
  sec(βˇ)  0.0571 (0.0247)  0.0181 (0.0065)  0.1160 (0.0408)
  βˆ       1.0094 (0.0658)  1.0023 (0.0185)  1.0087 (0.1217)
  se(βˆ)   0.0675 (0.0022)  0.0188 (0.0002)  0.1201 (0.0022)
IV: x4
  βˇ       1.0027 (0.0712)  1.0030 (0.0382)  1.0039 (0.0687)
  se(βˇ)   0.0726 (0.0025)  0.0387 (0.0007)  0.0705 (0.0008)
  sen(βˇ)  0.0456 (0.0075)  0.0242 (0.0040)  0.0567 (0.0092)
  ser(βˇ)  0.0458 (0.0114)  0.0251 (0.0053)  0.0580 (0.0121)
  sec(βˇ)  0.0537 (0.0238)  0.0313 (0.0117)  0.0635 (0.0218)
  βˆ       1.0094 (0.0690)  1.0018 (0.0370)  1.0017 (0.0442)
  se(βˆ)   0.0703 (0.0024)  0.0369 (0.0007)  0.0454 (0.0004)
IV: all
  βˇ       1.2392 (0.0575)  1.0034 (0.0387)  0.9986 (0.1424)
  se(βˇ)   0.0574 (0.0016)  0.0394 (0.0006)  0.1423 (0.0029)
  sen(βˇ)  0.0337 (0.0050)  0.0175 (0.0025)  0.0609 (0.0088)
  ser(βˇ)  0.0573 (0.0117)  0.0263 (0.0065)  0.0899 (0.0260)
  sec(βˇ)  0.0867 (0.0257)  0.0337 (0.0100)  0.1093 (0.0448)
  βˆ       1.4726 (0.0146)  0.9803 (0.0176)  1.0102 (0.0429)
  se(βˆ)   0.0146 (0.0002)  0.0179 (0.0002)  0.0442 (0.0003)

Table B.7 Small panel with G = 6, T = 4. Case 2.1: x2it ~ N(gt/6, 1) + fi, ngt = 200, sampling rate = 0.2%. sen, ser and sec are the non-robust, robust and cluster-robust standard errors, respectively. Layout as in Table B.1.
IV: none
  βˇ       1.0147 (0.1624)  1.0034 (0.0871)  1.0175 (0.2885)
  se(βˇ)   0.1591 (0.0126)  0.0889 (0.0041)  0.2864 (0.0133)
  sen(βˇ)  0.1553 (0.0351)  0.0868 (0.0187)  0.2795 (0.0597)
  ser(βˇ)  0.1399 (0.0428)  0.0824 (0.0209)  0.2623 (0.0720)
  sec(βˇ)  0.1450 (0.0643)  0.0875 (0.0327)  0.2987 (0.1101)
  βˆ       1.0148 (0.1624)  1.0033 (0.0871)  1.0184 (0.2881)
  se(βˆ)   0.1586 (0.0125)  0.0886 (0.0041)  0.2854 (0.0132)
IV: z
  βˇ       1.0622 (0.1535)  0.9950 (0.0855)  1.0297 (0.2812)
  se(βˇ)   0.1505 (0.0110)  0.0872 (0.0037)  0.2804 (0.0124)
  sen(βˇ)  0.1490 (0.0202)  0.0863 (0.0106)  0.2775 (0.0341)
  ser(βˇ)  0.1216 (0.0303)  0.0687 (0.0159)  0.2190 (0.0543)
  sec(βˇ)  0.1253 (0.0516)  0.0725 (0.0269)  0.2455 (0.0905)
  βˆ       1.0638 (0.1538)  0.9942 (0.0855)  1.0314 (0.2796)
  se(βˆ)   0.1487 (0.0109)  0.0862 (0.0037)  0.2772 (0.0122)
IV: x2
  βˇ       1.2409 (0.1366)  1.0163 (0.1082)  1.0187 (0.3458)
  se(βˇ)   0.1363 (0.0084)  0.1100 (0.0046)  0.3425 (0.0175)
  sen(βˇ)  0.0920 (0.0235)  0.0525 (0.0126)  0.1662 (0.0403)
  ser(βˇ)  0.1010 (0.0265)  0.0702 (0.0238)  0.2122 (0.0745)
  sec(βˇ)  0.1331 (0.0548)  0.0707 (0.0311)  0.2587 (0.1200)
  βˆ       1.4705 (0.0328)  0.9150 (0.0793)  1.1427 (0.2719)
  se(βˆ)   0.0324 (0.0009)  0.0804 (0.0026)  0.2708 (0.0113)
IV: x3
  βˇ       1.0486 (0.1633)  0.9996 (0.0465)  1.0287 (0.2912)
  se(βˇ)   0.1630 (0.0119)  0.0463 (0.0016)  0.2948 (0.0129)
  sen(βˇ)  0.1211 (0.0189)  0.0352 (0.0049)  0.2336 (0.0328)
  ser(βˇ)  0.1114 (0.0259)  0.0366 (0.0061)  0.2195 (0.0442)
  sec(βˇ)  0.1277 (0.0546)  0.0399 (0.0140)  0.2580 (0.0887)
  βˆ       1.0592 (0.1457)  1.0002 (0.0430)  1.0250 (0.2612)
  se(βˆ)   0.1422 (0.0099)  0.0415 (0.0012)  0.2626 (0.0107)
IV: x4
  βˇ       1.0223 (0.1602)  1.0058 (0.0836)  1.0077 (0.1577)
  se(βˇ)   0.1584 (0.0123)  0.0864 (0.0036)  0.1581 (0.0043)
  sen(βˇ)  0.1008 (0.0184)  0.0545 (0.0092)  0.1276 (0.0212)
  ser(βˇ)  0.1012 (0.0268)  0.0564 (0.0127)  0.1302 (0.0279)
  sec(βˇ)  0.1200 (0.0525)  0.0696 (0.0268)  0.1417 (0.0498)
  βˆ       1.0598 (0.1503)  0.9974 (0.0793)  1.0033 (0.0991)
  se(βˆ)   0.1478 (0.0108)  0.0812 (0.0032)  0.1008 (0.0022)
IV: all
  βˇ       1.2502 (0.1276)  1.0080 (0.0864)  1.0039 (0.3193)
  se(βˇ)   0.1271 (0.0078)  0.0881 (0.0032)  0.3170 (0.0145)
  sen(βˇ)  0.0568 (0.0120)  0.0299 (0.0059)  0.1038 (0.0206)
  ser(βˇ)  0.0854 (0.0223)  0.0510 (0.0159)  0.1782 (0.0619)
  sec(βˇ)  0.1108 (0.0458)  0.0579 (0.0234)  0.2232 (0.1021)
  βˆ       1.4657 (0.0335)  0.9815 (0.0415)  1.0100 (0.0996)
  se(βˆ)   0.0318 (0.0008)  0.0392 (0.0009)  0.0970 (0.0016)

Table B.8 Small panel with G = 6, T = 4. Case 2.1: x2it ~ N(gt/6, 1) + fi, ngt = 200, sampling rate = 0.2%. sen, ser and sec are the non-robust, robust and cluster-robust standard errors, respectively. Layout as in Table B.1.
IV: none
  βˇ       1.0080 (0.0721)  0.9961 (0.0401)  1.0026 (0.1304)
  se(βˇ)   0.0719 (0.0024)  0.0398 (0.0008)  0.1279 (0.0027)
  sen(βˇ)  0.0694 (0.0145)  0.0384 (0.0079)  0.1235 (0.0255)
  ser(βˇ)  0.0615 (0.0179)  0.0366 (0.0086)  0.1166 (0.0305)
  sec(βˇ)  0.0619 (0.0280)  0.0390 (0.0138)  0.1322 (0.0489)
  βˆ       1.0080 (0.0721)  0.9961 (0.0401)  1.0028 (0.1303)
  se(βˆ)   0.0719 (0.0024)  0.0398 (0.0008)  0.1278 (0.0027)
IV: z
  βˇ       1.0191 (0.0714)  0.9939 (0.0400)  1.0057 (0.1302)
  se(βˇ)   0.0711 (0.0024)  0.0396 (0.0008)  0.1273 (0.0026)
  sen(βˇ)  0.0702 (0.0086)  0.0391 (0.0047)  0.1258 (0.0151)
  ser(βˇ)  0.0514 (0.0138)  0.0300 (0.0068)  0.0957 (0.0244)
  sec(βˇ)  0.0520 (0.0227)  0.0323 (0.0115)  0.1090 (0.0403)
  βˆ       1.0193 (0.0715)  0.9940 (0.0400)  1.0057 (0.1300)
  se(βˆ)   0.0709 (0.0023)  0.0395 (0.0008)  0.1270 (0.0026)
IV: x2
  βˇ       1.2376 (0.0602)  1.0016 (0.0485)  0.9972 (0.1544)
  se(βˇ)   0.0610 (0.0018)  0.0492 (0.0010)  0.1535 (0.0035)
  sen(βˇ)  0.0545 (0.0092)  0.0308 (0.0051)  0.0979 (0.0163)
  ser(βˇ)  0.0645 (0.0132)  0.0350 (0.0098)  0.1051 (0.0310)
  sec(βˇ)  0.1023 (0.0293)  0.0389 (0.0128)  0.1255 (0.0525)
  βˆ       1.4775 (0.0147)  0.9031 (0.0364)  1.1404 (0.1251)
  se(βˆ)   0.0146 (0.0002)  0.0365 (0.0006)  0.1235 (0.0023)
IV: x3
  βˇ       1.0108 (0.0749)  0.9980 (0.0211)  1.0025 (0.1349)
  se(βˇ)   0.0763 (0.0025)  0.0207 (0.0003)  0.1334 (0.0027)
  sen(βˇ)  0.0562 (0.0076)  0.0158 (0.0021)  0.1057 (0.0139)
  ser(βˇ)  0.0509 (0.0111)  0.0165 (0.0026)  0.0993 (0.0182)
  sec(βˇ)  0.0571 (0.0253)  0.0180 (0.0060)  0.1158 (0.0403)
  βˆ       1.0152 (0.0672)  0.9991 (0.0189)  0.9998 (0.1210)
  se(βˆ)   0.0674 (0.0021)  0.0188 (0.0003)  0.1198 (0.0022)
IV: x4
  βˇ       1.0096 (0.0730)  0.9955 (0.0391)  1.0038 (0.0713)
  se(βˇ)   0.0725 (0.0024)  0.0388 (0.0008)  0.0704 (0.0009)
  sen(βˇ)  0.0454 (0.0074)  0.0241 (0.0039)  0.0564 (0.0091)
  ser(βˇ)  0.0450 (0.0114)  0.0251 (0.0053)  0.0578 (0.0120)
  sec(βˇ)  0.0515 (0.0233)  0.0310 (0.0116)  0.0627 (0.0221)
  βˆ       1.0191 (0.0703)  0.9943 (0.0368)  1.0038 (0.0460)
  se(βˆ)   0.0701 (0.0023)  0.0370 (0.0007)  0.0454 (0.0005)
IV: all
  βˇ       1.2440 (0.0568)  0.9957 (0.0391)  0.9832 (0.1436)
  se(βˇ)   0.0575 (0.0017)  0.0394 (0.0007)  0.1426 (0.0030)
  sen(βˇ)  0.0332 (0.0048)  0.0172 (0.0025)  0.0601 (0.0086)
  ser(βˇ)  0.0555 (0.0114)  0.0260 (0.0065)  0.0886 (0.0260)
  sec(βˇ)  0.0824 (0.0251)  0.0329 (0.0093)  0.1086 (0.0443)
  βˆ       1.4744 (0.0147)  0.9777 (0.0180)  1.0106 (0.0448)
  se(βˆ)   0.0145 (0.0002)  0.0179 (0.0002)  0.0442 (0.0003)

Table B.9 Small panel with G = 6, T = 4. Case 3.a: x2it ~ N(gt/6, 1) + zi, ngt = 200, sampling rate = 1%. sen, ser and sec are the non-robust, robust and cluster-robust standard errors, respectively. Layout as in Table B.1.
MD Identity IV: none x2 x3 x4 IV: z x2 x3 x4 IV: x2 x2 x3 x4 IV: x3 x2 x3 x4 IV: x4 x2 x3 x4 IV: all x2 x3 x4 MD Optimal βˇ ˇ se(β) ˇ sen (β) ˇ ser (β) ˇ sec (β) βˆ ˆ se(β) 0.9764 (0.1533) 1.0024 (0.0902) 1.0051 (0.2823) 0.1586 (0.0116) 0.0887 (0.0038) 0.2850 (0.0131) 0.1548 (0.0325) 0.0865 (0.0175) 0.2779 (0.0569) 0.1389 (0.0425) 0.0825 (0.0195) 0.2626 (0.0663) 0.1403 (0.0621) 0.0874 (0.0315) 0.2944 (0.1072) 0.9765 (0.1528) 1.0027 (0.0903) 1.0055 (0.2839) 0.1581 (0.0116) 0.0884 (0.0038) 0.2840 (0.0131) 1.0018 (0.0460) 0.9980 (0.0841) 1.0087 (0.2738) 0.0463 (0.0012) 0.0826 (0.0027) 0.2766 (0.0118) 0.0456 (0.0055) 0.0813 (0.0098) 0.2723 (0.0341) 0.0508 (0.0077) 0.0667 (0.0150) 0.2202 (0.0495) 0.0499 (0.0154) 0.0713 (0.0257) 0.2462 (0.0869) 1.0020 (0.0462) 0.9976 (0.0850) 1.0080 (0.2751) 0.0452 (0.0011) 0.0816 (0.0026) 0.2734 (0.0116) 1.0042 (0.0431) 0.9967 (0.1076) 0.9932 (0.3591) 0.0440 (0.0020) 0.1092 (0.0046) 0.3584 (0.0192) 0.0487 (0.0088) 0.0679 (0.0126) 0.1885 (0.0365) 0.0489 (0.0092) 0.0707 (0.0175) 0.2042 (0.0643) 0.0422 (0.0154) 0.0881 (0.0335) 0.2764 (0.1203) 1.0041 (0.0331) 0.9978 (0.0846) 1.0090 (0.2779) 0.0329 (0.0007) 0.0815 (0.0026) 0.2734 (0.0117) 0.9840 (0.1582) 0.9923 (0.0460) 0.9989 (0.2890) 0.1633 (0.0111) 0.0464 (0.0014) 0.2941 (0.0129) 0.1215 (0.0189) 0.0353 (0.0049) 0.2333 (0.0332) 0.1104 (0.0257) 0.0367 (0.0064) 0.2192 (0.0421) 0.1236 (0.0529) 0.0403 (0.0137) 0.2552 (0.0917) 0.9850 (0.1420) 0.9919 (0.0421) 1.0189 (0.2617) 0.1423 (0.0091) 0.0416 (0.0011) 0.2625 (0.0107) 0.9740 (0.1543) 1.0025 (0.0862) 1.0045 (0.1553) 0.1582 (0.0113) 0.0863 (0.0034) 0.1586 (0.0036) 0.1002 (0.0173) 0.0541 (0.0087) 0.1270 (0.0205) 0.1004 (0.0270) 0.0563 (0.0119) 0.1303 (0.0269) 0.1162 (0.0502) 0.0689 (0.0252) 0.1406 (0.0490) 0.9828 (0.1441) 1.0010 (0.0810) 1.0111 (0.1014) 0.1474 (0.0096) 0.0812 (0.0030) 0.1011 (0.0017) 1.0052 (0.0399) 0.9925 (0.0559) 0.9959 (0.2635) 0.0404 (0.0013) 0.0565 (0.0017) 0.2653 (0.0101) 0.0296 (0.0037) 0.0312 (0.0039) 0.1211 (0.0157) 0.0335 (0.0051) 0.0348 (0.0058) 0.1423 (0.0383) 0.0379 (0.0126) 0.0458 (0.0160) 0.2010 (0.0807) 1.0043 (0.0333) 0.9914 (0.0414) 1.0133 (0.1012) 0.0322 (0.0007) 0.0397 (0.0008) 0.0985 (0.0015) 92 Table B.10 Small panel with G = 6, T = 4. Case 3.b: x2it ∼ N (gt/6, 1) + zi , ngt = 1000, sampling rate= 1%. sen , ser and sec are the non-roust, robust and cluster-robust standard errors, receptively. 
MD Identity IV: none x2 x3 x4 IV: z x2 x3 x4 IV: x2 x2 x3 x4 IV: x3 x2 x3 x4 IV: x4 x2 x3 x4 IV: all x2 x3 x4 MD Optimal βˇ ˇ se(β) ˇ sen (β) ˇ ser (β) ˇ sec (β) βˆ ˆ se(β) 0.9937 (0.0714) 1.0038 (0.0391) 1.0066 (0.1273) 0.0720 (0.0023) 0.0398 (0.0007) 0.1283 (0.0025) 0.0698 (0.0144) 0.0385 (0.0079) 0.1244 (0.0256) 0.0627 (0.0181) 0.0366 (0.0085) 0.1174 (0.0308) 0.0639 (0.0283) 0.0391 (0.0146) 0.1332 (0.0487) 0.9936 (0.0713) 1.0038 (0.0391) 1.0068 (0.1276) 0.0720 (0.0023) 0.0397 (0.0007) 0.1282 (0.0025) 0.9982 (0.0191) 1.0029 (0.0356) 1.0077 (0.1244) 0.0206 (0.0002) 0.0372 (0.0005) 0.1261 (0.0023) 0.0205 (0.0025) 0.0371 (0.0045) 0.1258 (0.0154) 0.0229 (0.0034) 0.0295 (0.0067) 0.0982 (0.0243) 0.0223 (0.0073) 0.0317 (0.0121) 0.1118 (0.0403) 0.9982 (0.0191) 1.0029 (0.0357) 1.0080 (0.1249) 0.0205 (0.0002) 0.0371 (0.0005) 0.1259 (0.0023) 0.9964 (0.0191) 1.0049 (0.0489) 1.0179 (0.1651) 0.0192 (0.0004) 0.0489 (0.0009) 0.1624 (0.0036) 0.0218 (0.0039) 0.0307 (0.0055) 0.0859 (0.0154) 0.0217 (0.0040) 0.0320 (0.0077) 0.0921 (0.0290) 0.0189 (0.0069) 0.0406 (0.0153) 0.1246 (0.0553) 0.9987 (0.0143) 1.0027 (0.0355) 1.0084 (0.1239) 0.0148 (0.0001) 0.0370 (0.0005) 0.1258 (0.0023) 0.9995 (0.0773) 1.0026 (0.0206) 1.0097 (0.1327) 0.0763 (0.0025) 0.0207 (0.0003) 0.1338 (0.0025) 0.0562 (0.0078) 0.0158 (0.0022) 0.1059 (0.0147) 0.0510 (0.0109) 0.0165 (0.0028) 0.0992 (0.0195) 0.0570 (0.0245) 0.0180 (0.0065) 0.1159 (0.0408) 0.9945 (0.0654) 1.0031 (0.0186) 1.0070 (0.1218) 0.0675 (0.0021) 0.0188 (0.0002) 0.1202 (0.0020) 0.9956 (0.0710) 1.0041 (0.0382) 1.0028 (0.0686) 0.0726 (0.0023) 0.0387 (0.0007) 0.0705 (0.0007) 0.0456 (0.0075) 0.0242 (0.0040) 0.0567 (0.0093) 0.0457 (0.0114) 0.0251 (0.0054) 0.0580 (0.0121) 0.0533 (0.0236) 0.0314 (0.0117) 0.0634 (0.0219) 0.9933 (0.0692) 1.0046 (0.0371) 1.0011 (0.0442) 0.0703 (0.0022) 0.0369 (0.0006) 0.0454 (0.0003) 0.9979 (0.0171) 1.0037 (0.0252) 1.0120 (0.1207) 0.0178 (0.0003) 0.0252 (0.0004) 0.1203 (0.0019) 0.0133 (0.0016) 0.0141 (0.0017) 0.0553 (0.0067) 0.0150 (0.0022) 0.0158 (0.0026) 0.0646 (0.0176) 0.0168 (0.0058) 0.0208 (0.0076) 0.0910 (0.0377) 0.9986 (0.0144) 1.0031 (0.0179) 1.0014 (0.0440) 0.0147 (0.0001) 0.0182 (0.0002) 0.0448 (0.0003) 93 Table B.11 Small panel with G = 6, T = 4. Case 3.1: x2it ∼ N (gt/6, 1) + zi , ngt = 200, sampling rate= 0.2%. sen , ser and sec are the non-roust, robust and cluster-robust standard errors, receptively. 
MD Identity IV: none x2 x3 x4 IV: z x2 x3 x4 IV: x2 x2 x3 x4 IV: x3 x2 x3 x4 IV: x4 x2 x3 x4 IV: all x2 x3 x4 MD Optimal βˇ ˇ se(β) ˇ sen (β) ˇ ser (β) ˇ sec (β) βˆ ˆ se(β) 0.9909 (0.1623) 1.0079 (0.0870) 1.0113 (0.2880) 0.1590 (0.0117) 0.0890 (0.0037) 0.2868 (0.0127) 0.1552 (0.0349) 0.0868 (0.0185) 0.2798 (0.0594) 0.1393 (0.0425) 0.0825 (0.0210) 0.2625 (0.0714) 0.1423 (0.0603) 0.0873 (0.0329) 0.2991 (0.1095) 0.9911 (0.1623) 1.0077 (0.0870) 1.0123 (0.2877) 0.1585 (0.0117) 0.0887 (0.0037) 0.2858 (0.0127) 0.9980 (0.0430) 1.0070 (0.0804) 1.0131 (0.2794) 0.0463 (0.0012) 0.0827 (0.0027) 0.2780 (0.0115) 0.0457 (0.0054) 0.0820 (0.0097) 0.2757 (0.0336) 0.0508 (0.0073) 0.0670 (0.0159) 0.2210 (0.0535) 0.0496 (0.0159) 0.0704 (0.0269) 0.2490 (0.0888) 0.9979 (0.0437) 1.0065 (0.0803) 1.0138 (0.2780) 0.0452 (0.0012) 0.0818 (0.0027) 0.2749 (0.0113) 0.9965 (0.0421) 1.0109 (0.1051) 1.0141 (0.3642) 0.0436 (0.0020) 0.1090 (0.0048) 0.3598 (0.0184) 0.0488 (0.0092) 0.0688 (0.0129) 0.1914 (0.0358) 0.0490 (0.0094) 0.0712 (0.0177) 0.2046 (0.0669) 0.0431 (0.0154) 0.0886 (0.0343) 0.2767 (0.1251) 0.9981 (0.0324) 1.0064 (0.0801) 1.0140 (0.2797) 0.0327 (0.0007) 0.0816 (0.0026) 0.2749 (0.0113) 1.0028 (0.1629) 1.0036 (0.0467) 1.0203 (0.2919) 0.1633 (0.0111) 0.0465 (0.0014) 0.2961 (0.0125) 0.1213 (0.0187) 0.0353 (0.0049) 0.2343 (0.0328) 0.1105 (0.0252) 0.0367 (0.0061) 0.2202 (0.0442) 0.1241 (0.0517) 0.0400 (0.0140) 0.2588 (0.0881) 0.9952 (0.1465) 1.0034 (0.0433) 1.0180 (0.2629) 0.1426 (0.0092) 0.0417 (0.0010) 0.2638 (0.0103) 0.9917 (0.1605) 1.0102 (0.0836) 1.0034 (0.1576) 0.1583 (0.0112) 0.0865 (0.0033) 0.1584 (0.0037) 0.1006 (0.0182) 0.0545 (0.0091) 0.1277 (0.0210) 0.1008 (0.0264) 0.0564 (0.0128) 0.1303 (0.0278) 0.1172 (0.0498) 0.0695 (0.0268) 0.1417 (0.0498) 0.9920 (0.1498) 1.0088 (0.0795) 1.0010 (0.0992) 0.1477 (0.0098) 0.0813 (0.0029) 0.1010 (0.0017) 0.9980 (0.0376) 1.0071 (0.0550) 1.0122 (0.2666) 0.0402 (0.0013) 0.0564 (0.0018) 0.2656 (0.0100) 0.0296 (0.0037) 0.0314 (0.0039) 0.1223 (0.0153) 0.0335 (0.0050) 0.0350 (0.0059) 0.1432 (0.0408) 0.0379 (0.0125) 0.0458 (0.0171) 0.2011 (0.0858) 0.9981 (0.0330) 1.0036 (0.0421) 1.0013 (0.1007) 0.0321 (0.0007) 0.0397 (0.0008) 0.0984 (0.0015) 94 Table B.12 Small panel with G = 6, T = 4. Case 3.1: x2it ∼ N (gt/6, 1) + zi , ngt = 200, sampling rate= 0.2%. sen , ser and sec are the non-roust, robust and cluster-robust standard errors, receptively. 
MD Identity IV: none x2 x3 x4 IV: z x2 x3 x4 IV: x2 x2 x3 x4 IV: x3 x2 x3 x4 IV: x4 x2 x3 x4 IV: all x2 x3 x4 MD Optimal βˇ ˇ se(β) ˇ sen (β) ˇ ser (β) ˇ sec (β) βˆ ˆ se(β) 1.0027 (0.0721) 0.9971 (0.0401) 1.0010 (0.1305) 0.0720 (0.0023) 0.0398 (0.0008) 0.1280 (0.0025) 0.0695 (0.0146) 0.0384 (0.0080) 0.1235 (0.0256) 0.0618 (0.0182) 0.0366 (0.0086) 0.1166 (0.0305) 0.0623 (0.0281) 0.0390 (0.0138) 0.1324 (0.0489) 1.0027 (0.0722) 0.9972 (0.0401) 1.0012 (0.1304) 0.0720 (0.0023) 0.0398 (0.0008) 0.1279 (0.0025) 1.0012 (0.0196) 0.9975 (0.0372) 1.0004 (0.1286) 0.0206 (0.0002) 0.0373 (0.0005) 0.1258 (0.0023) 0.0203 (0.0024) 0.0369 (0.0045) 0.1245 (0.0150) 0.0227 (0.0034) 0.0295 (0.0069) 0.0977 (0.0240) 0.0222 (0.0073) 0.0320 (0.0115) 0.1110 (0.0398) 1.0012 (0.0197) 0.9976 (0.0372) 1.0005 (0.1283) 0.0205 (0.0002) 0.0372 (0.0005) 0.1256 (0.0023) 1.0005 (0.0186) 0.9963 (0.0476) 1.0004 (0.1665) 0.0192 (0.0004) 0.0490 (0.0010) 0.1621 (0.0038) 0.0216 (0.0041) 0.0305 (0.0057) 0.0849 (0.0161) 0.0216 (0.0040) 0.0317 (0.0081) 0.0912 (0.0305) 0.0189 (0.0069) 0.0389 (0.0149) 0.1240 (0.0578) 1.0000 (0.0145) 0.9978 (0.0372) 1.0000 (0.1274) 0.0148 (0.0001) 0.0371 (0.0005) 0.1255 (0.0023) 1.0004 (0.0751) 0.9989 (0.0211) 1.0005 (0.1350) 0.0765 (0.0024) 0.0207 (0.0003) 0.1335 (0.0025) 0.0563 (0.0076) 0.0158 (0.0021) 0.1057 (0.0139) 0.0511 (0.0113) 0.0165 (0.0026) 0.0993 (0.0182) 0.0573 (0.0254) 0.0180 (0.0060) 0.1158 (0.0403) 1.0003 (0.0673) 0.9998 (0.0190) 0.9980 (0.1212) 0.0675 (0.0019) 0.0188 (0.0002) 0.1199 (0.0021) 1.0028 (0.0730) 0.9965 (0.0391) 1.0027 (0.0713) 0.0727 (0.0023) 0.0388 (0.0007) 0.0705 (0.0007) 0.0455 (0.0075) 0.0242 (0.0039) 0.0564 (0.0091) 0.0451 (0.0115) 0.0251 (0.0053) 0.0579 (0.0120) 0.0518 (0.0233) 0.0310 (0.0116) 0.0628 (0.0221) 1.0026 (0.0699) 0.9971 (0.0369) 1.0032 (0.0461) 0.0703 (0.0022) 0.0370 (0.0006) 0.0454 (0.0003) 1.0004 (0.0172) 0.9983 (0.0251) 1.0008 (0.1228) 0.0178 (0.0003) 0.0252 (0.0004) 0.1200 (0.0020) 0.0132 (0.0016) 0.0140 (0.0017) 0.0549 (0.0066) 0.0149 (0.0022) 0.0156 (0.0025) 0.0640 (0.0181) 0.0170 (0.0057) 0.0203 (0.0072) 0.0902 (0.0386) 1.0000 (0.0145) 0.9998 (0.0183) 1.0026 (0.0460) 0.0147 (0.0001) 0.0182 (0.0002) 0.0449 (0.0003) 95 Table B.13 Small panel with G = 6, T = 4. Case 4.a: x2it ∼ N (gt/6, 1) + zi + fi , ngt = 200, sampling rate= 1%. sen , ser and sec are the non-roust, robust and cluster-robust standard errors, receptively. 
MD Identity IV: none x2 x3 x4 IV: z x2 x3 x4 IV: x2 x2 x3 x4 IV: x3 x2 x3 x4 IV: x4 x2 x3 x4 IV: all x2 x3 x4 MD Optimal βˇ ˇ se(β) ˇ sen (β) ˇ ser (β) ˇ sec (β) βˆ ˆ se(β) 1.0020 (0.1521) 0.9972 (0.0901) 1.0131 (0.2825) 0.1570 (0.0139) 0.0888 (0.0043) 0.2854 (0.0140) 0.1533 (0.0329) 0.0866 (0.0176) 0.2783 (0.0571) 0.1380 (0.0425) 0.0825 (0.0196) 0.2628 (0.0662) 0.1399 (0.0620) 0.0875 (0.0314) 0.2947 (0.1072) 1.0020 (0.1516) 0.9975 (0.0903) 1.0133 (0.2839) 0.1565 (0.0138) 0.0884 (0.0043) 0.2844 (0.0139) 1.0084 (0.0456) 0.9967 (0.0841) 1.0104 (0.2735) 0.0460 (0.0014) 0.0825 (0.0027) 0.2764 (0.0119) 0.0454 (0.0055) 0.0813 (0.0098) 0.2722 (0.0342) 0.0505 (0.0078) 0.0667 (0.0150) 0.2201 (0.0495) 0.0495 (0.0154) 0.0712 (0.0257) 0.2462 (0.0869) 1.0086 (0.0459) 0.9963 (0.0850) 1.0097 (0.2747) 0.0450 (0.0013) 0.0816 (0.0027) 0.2732 (0.0117) 1.2218 (0.0963) 1.0029 (0.1114) 0.9978 (0.3442) 0.0954 (0.0072) 0.1121 (0.0048) 0.3470 (0.0185) 0.0741 (0.0173) 0.0529 (0.0122) 0.1676 (0.0396) 0.0775 (0.0191) 0.0726 (0.0237) 0.2207 (0.0705) 0.0979 (0.0426) 0.0728 (0.0311) 0.2681 (0.1154) 1.3252 (0.0276) 0.9357 (0.0841) 1.0932 (0.2715) 0.0267 (0.0006) 0.0803 (0.0026) 0.2699 (0.0115) 1.0304 (0.1521) 0.9880 (0.0459) 1.0083 (0.2888) 0.1585 (0.0128) 0.0463 (0.0016) 0.2938 (0.0135) 0.1187 (0.0190) 0.0352 (0.0049) 0.2335 (0.0333) 0.1090 (0.0251) 0.0367 (0.0064) 0.2194 (0.0422) 0.1223 (0.0512) 0.0405 (0.0137) 0.2572 (0.0922) 1.0463 (0.1369) 0.9887 (0.0420) 1.0252 (0.2608) 0.1377 (0.0107) 0.0415 (0.0012) 0.2622 (0.0112) 1.0055 (0.1523) 0.9975 (0.0861) 1.0098 (0.1552) 0.1557 (0.0133) 0.0863 (0.0038) 0.1587 (0.0043) 0.0990 (0.0177) 0.0542 (0.0088) 0.1272 (0.0206) 0.0995 (0.0271) 0.0564 (0.0120) 0.1304 (0.0268) 0.1154 (0.0500) 0.0691 (0.0252) 0.1410 (0.0490) 1.0488 (0.1402) 0.9894 (0.0805) 1.0140 (0.1008) 0.1426 (0.0110) 0.0811 (0.0033) 0.1013 (0.0023) 1.1804 (0.0720) 0.9944 (0.0890) 0.9829 (0.3180) 0.0713 (0.0056) 0.0897 (0.0033) 0.3207 (0.0151) 0.0412 (0.0074) 0.0309 (0.0054) 0.1070 (0.0196) 0.0544 (0.0116) 0.0520 (0.0160) 0.1838 (0.0584) 0.0698 (0.0279) 0.0576 (0.0234) 0.2289 (0.0978) 1.3193 (0.0285) 0.9766 (0.0416) 1.0171 (0.0999) 0.0263 (0.0006) 0.0393 (0.0009) 0.0975 (0.0015) 96 Table B.14 Small panel with G = 6, T = 4. Case 4.b: x2it ∼ N (gt/6, 1) + zi + fi , ngt = 1000, sampling rate= 1%. sen , ser and sec are the non-roust, robust and cluster-robust standard errors, receptively. 
MD Identity IV: none x2 x3 x4 IV: z x2 x3 x4 IV: x2 x2 x3 x4 IV: x3 x2 x3 x4 IV: x4 x2 x3 x4 IV: all x2 x3 x4 MD Optimal βˇ ˇ se(β) ˇ sen (β) ˇ ser (β) ˇ sec (β) βˆ ˆ se(β) 0.9989 (0.0716) 1.0027 (0.0391) 1.0081 (0.1274) 0.0719 (0.0029) 0.0398 (0.0009) 0.1283 (0.0027) 0.0697 (0.0145) 0.0385 (0.0079) 0.1244 (0.0256) 0.0627 (0.0181) 0.0366 (0.0085) 0.1174 (0.0309) 0.0642 (0.0284) 0.0391 (0.0146) 0.1332 (0.0488) 0.9989 (0.0715) 1.0028 (0.0392) 1.0083 (0.1277) 0.0719 (0.0029) 0.0397 (0.0009) 0.1282 (0.0027) 0.9996 (0.0191) 1.0026 (0.0356) 1.0082 (0.1244) 0.0206 (0.0003) 0.0372 (0.0005) 0.1261 (0.0023) 0.0205 (0.0025) 0.0371 (0.0045) 0.1258 (0.0154) 0.0229 (0.0034) 0.0295 (0.0067) 0.0982 (0.0243) 0.0223 (0.0073) 0.0317 (0.0121) 0.1118 (0.0403) 0.9996 (0.0191) 1.0026 (0.0357) 1.0085 (0.1249) 0.0205 (0.0003) 0.0371 (0.0005) 0.1259 (0.0023) 1.2166 (0.0423) 1.0113 (0.0491) 1.0149 (0.1568) 0.0421 (0.0014) 0.0500 (0.0009) 0.1559 (0.0035) 0.0405 (0.0082) 0.0286 (0.0057) 0.0910 (0.0182) 0.0458 (0.0106) 0.0357 (0.0104) 0.1077 (0.0325) 0.0711 (0.0254) 0.0402 (0.0145) 0.1274 (0.0546) 1.3220 (0.0120) 0.9385 (0.0353) 1.1027 (0.1224) 0.0120 (0.0001) 0.0365 (0.0005) 0.1241 (0.0023) 1.0097 (0.0771) 1.0017 (0.0206) 1.0115 (0.1325) 0.0758 (0.0030) 0.0207 (0.0003) 0.1337 (0.0026) 0.0559 (0.0079) 0.0158 (0.0022) 0.1058 (0.0147) 0.0509 (0.0110) 0.0165 (0.0028) 0.0991 (0.0195) 0.0569 (0.0243) 0.0181 (0.0065) 0.1159 (0.0408) 1.0091 (0.0653) 1.0024 (0.0186) 1.0087 (0.1216) 0.0671 (0.0025) 0.0188 (0.0003) 0.1201 (0.0022) 1.0024 (0.0710) 1.0031 (0.0382) 1.0038 (0.0687) 0.0724 (0.0029) 0.0387 (0.0008) 0.0705 (0.0008) 0.0455 (0.0076) 0.0242 (0.0040) 0.0567 (0.0093) 0.0457 (0.0114) 0.0251 (0.0053) 0.0580 (0.0121) 0.0536 (0.0236) 0.0314 (0.0117) 0.0635 (0.0219) 1.0090 (0.0689) 1.0019 (0.0371) 1.0017 (0.0442) 0.0698 (0.0027) 0.0369 (0.0007) 0.0454 (0.0004) 1.1734 (0.0317) 1.0038 (0.0395) 0.9986 (0.1451) 0.0314 (0.0011) 0.0401 (0.0006) 0.1446 (0.0028) 0.0235 (0.0031) 0.0174 (0.0023) 0.0608 (0.0081) 0.0345 (0.0060) 0.0252 (0.0070) 0.0877 (0.0268) 0.0533 (0.0158) 0.0304 (0.0111) 0.1063 (0.0455) 1.3182 (0.0121) 0.9883 (0.0182) 1.0073 (0.0444) 0.0120 (0.0001) 0.0180 (0.0002) 0.0444 (0.0003) 97 Table B.15 Small panel with G = 6, T = 4. Case 4.1: x2it ∼ N (gt/6, 1) + zi + fi , ngt = 200, sampling rate= 0.2%. sen , ser and sec are the non-roust, robust and cluster-robust standard errors, receptively. 
MD Identity IV: none x2 x3 x4 IV: z x2 x3 x4 IV: x2 x2 x3 x4 IV: x3 x2 x3 x4 IV: x4 x2 x3 x4 IV: all x2 x3 x4 MD Optimal βˇ ˇ se(β) ˇ sen (β) ˇ ser (β) ˇ sec (β) βˆ ˆ se(β) 1.0153 (0.1599) 1.0031 (0.0867) 1.0184 (0.2879) 0.1575 (0.0143) 0.0890 (0.0043) 0.2869 (0.0135) 0.1537 (0.0356) 0.0868 (0.0187) 0.2799 (0.0597) 0.1388 (0.0427) 0.0825 (0.0211) 0.2628 (0.0718) 0.1440 (0.0631) 0.0876 (0.0331) 0.2990 (0.1099) 1.0155 (0.1598) 1.0029 (0.0866) 1.0194 (0.2877) 0.1569 (0.0142) 0.0887 (0.0043) 0.2859 (0.0135) 1.0048 (0.0431) 1.0057 (0.0804) 1.0149 (0.2792) 0.0462 (0.0014) 0.0827 (0.0027) 0.2779 (0.0115) 0.0455 (0.0054) 0.0820 (0.0097) 0.2756 (0.0336) 0.0506 (0.0074) 0.0670 (0.0159) 0.2210 (0.0535) 0.0494 (0.0159) 0.0704 (0.0270) 0.2488 (0.0889) 1.0048 (0.0437) 1.0052 (0.0803) 1.0157 (0.2778) 0.0451 (0.0014) 0.0818 (0.0027) 0.2748 (0.0113) 1.2218 (0.0947) 1.0175 (0.1094) 1.0190 (0.3531) 0.0951 (0.0071) 0.1119 (0.0048) 0.3481 (0.0175) 0.0728 (0.0188) 0.0522 (0.0129) 0.1655 (0.0410) 0.0752 (0.0191) 0.0715 (0.0243) 0.2170 (0.0755) 0.0947 (0.0420) 0.0722 (0.0320) 0.2634 (0.1217) 1.3194 (0.0266) 0.9439 (0.0790) 1.1004 (0.2737) 0.0266 (0.0006) 0.0806 (0.0026) 0.2714 (0.0112) 1.0478 (0.1584) 0.9997 (0.0465) 1.0287 (0.2908) 0.1585 (0.0127) 0.0463 (0.0016) 0.2953 (0.0130) 0.1185 (0.0191) 0.0352 (0.0049) 0.2339 (0.0328) 0.1092 (0.0256) 0.0366 (0.0061) 0.2199 (0.0441) 0.1239 (0.0524) 0.0400 (0.0139) 0.2586 (0.0884) 1.0572 (0.1415) 1.0003 (0.0431) 1.0252 (0.2615) 0.1379 (0.0106) 0.0415 (0.0012) 0.2631 (0.0108) 1.0224 (0.1568) 1.0056 (0.0832) 1.0083 (0.1572) 0.1557 (0.0136) 0.0864 (0.0038) 0.1583 (0.0044) 0.0994 (0.0187) 0.0545 (0.0092) 0.1278 (0.0212) 0.1000 (0.0266) 0.0565 (0.0128) 0.1303 (0.0279) 0.1179 (0.0514) 0.0698 (0.0270) 0.1418 (0.0499) 1.0573 (0.1437) 0.9976 (0.0789) 1.0035 (0.0988) 0.1430 (0.0116) 0.0811 (0.0033) 0.1010 (0.0022) 1.1794 (0.0707) 1.0090 (0.0875) 1.0036 (0.3253) 0.0710 (0.0057) 0.0896 (0.0032) 0.3215 (0.0143) 0.0407 (0.0078) 0.0306 (0.0057) 0.1062 (0.0198) 0.0531 (0.0114) 0.0514 (0.0163) 0.1814 (0.0624) 0.0677 (0.0273) 0.0573 (0.0243) 0.2260 (0.1035) 1.3142 (0.0275) 0.9894 (0.0424) 1.0073 (0.1018) 0.0262 (0.0006) 0.0393 (0.0009) 0.0974 (0.0015) 98 Table B.16 Small panel with G = 6, T = 4. Case 4.1: x2it ∼ N (gt/6, 1) + zi + fi , ngt = 200, sampling rate= 0.2%. sen , ser and sec are the non-roust, robust and cluster-robust standard errors, receptively. 
MD Identity IV: none x2 x3 x4 IV: z x2 x3 x4 IV: x2 x2 x3 x4 IV: x3 x2 x3 x4 IV: x4 x2 x3 x4 IV: all x2 x3 x4 MD Optimal βˇ ˇ se(β) ˇ sen (β) ˇ ser (β) ˇ sec (β) βˆ ˆ se(β) 1.0079 (0.0721) 0.9961 (0.0401) 1.0026 (0.1305) 0.0718 (0.0028) 0.0398 (0.0009) 0.1279 (0.0027) 0.0693 (0.0146) 0.0384 (0.0079) 0.1235 (0.0255) 0.0615 (0.0181) 0.0365 (0.0086) 0.1165 (0.0305) 0.0620 (0.0280) 0.0390 (0.0138) 0.1321 (0.0488) 1.0079 (0.0721) 0.9961 (0.0400) 1.0027 (0.1304) 0.0717 (0.0028) 0.0398 (0.0009) 0.1279 (0.0027) 1.0025 (0.0196) 0.9972 (0.0372) 1.0008 (0.1285) 0.0205 (0.0003) 0.0373 (0.0006) 0.1258 (0.0023) 0.0203 (0.0024) 0.0369 (0.0045) 0.1244 (0.0150) 0.0226 (0.0034) 0.0295 (0.0069) 0.0977 (0.0240) 0.0221 (0.0073) 0.0320 (0.0115) 0.1109 (0.0398) 1.0026 (0.0196) 0.9973 (0.0372) 1.0009 (0.1283) 0.0204 (0.0003) 0.0372 (0.0006) 0.1255 (0.0023) 1.2219 (0.0420) 1.0015 (0.0490) 0.9986 (0.1587) 0.0422 (0.0015) 0.0501 (0.0010) 0.1561 (0.0036) 0.0397 (0.0079) 0.0281 (0.0056) 0.0895 (0.0179) 0.0445 (0.0105) 0.0352 (0.0104) 0.1057 (0.0324) 0.0677 (0.0250) 0.0389 (0.0138) 0.1275 (0.0540) 1.3238 (0.0121) 0.9336 (0.0367) 1.0948 (0.1257) 0.0120 (0.0001) 0.0366 (0.0005) 0.1238 (0.0023) 1.0106 (0.0747) 0.9980 (0.0211) 1.0024 (0.1349) 0.0759 (0.0029) 0.0207 (0.0003) 0.1335 (0.0027) 0.0560 (0.0076) 0.0158 (0.0021) 0.1057 (0.0139) 0.0509 (0.0111) 0.0165 (0.0026) 0.0992 (0.0181) 0.0570 (0.0251) 0.0180 (0.0060) 0.1158 (0.0402) 1.0148 (0.0670) 0.9991 (0.0189) 0.9997 (0.1211) 0.0669 (0.0024) 0.0188 (0.0003) 0.1198 (0.0023) 1.0094 (0.0729) 0.9955 (0.0391) 1.0038 (0.0713) 0.0723 (0.0028) 0.0388 (0.0008) 0.0705 (0.0009) 0.0453 (0.0075) 0.0241 (0.0039) 0.0564 (0.0091) 0.0449 (0.0115) 0.0251 (0.0053) 0.0578 (0.0119) 0.0515 (0.0233) 0.0310 (0.0116) 0.0627 (0.0220) 1.0185 (0.0696) 0.9944 (0.0368) 1.0038 (0.0461) 0.0697 (0.0026) 0.0370 (0.0007) 0.0454 (0.0005) 1.1771 (0.0312) 0.9954 (0.0395) 0.9840 (0.1473) 0.0316 (0.0012) 0.0402 (0.0007) 0.1447 (0.0029) 0.0232 (0.0030) 0.0172 (0.0022) 0.0601 (0.0078) 0.0336 (0.0058) 0.0248 (0.0071) 0.0861 (0.0270) 0.0513 (0.0157) 0.0294 (0.0103) 0.1059 (0.0456) 1.3198 (0.0123) 0.9849 (0.0185) 1.0069 (0.0458) 0.0120 (0.0001) 0.0180 (0.0002) 0.0444 (0.0003) 99 Table B.17 Small panel with G = 6, T = 4. Case 5.a: x2it ∼ N (gt/2, 1) + zi + fi , ngt = 200, sampling rate= 1%. sen , ser and sec are the non-roust, robust and cluster-robust standard errors, receptively. 
MD Identity IV: none x2 x3 x4 IV: z x2 x3 x4 IV: x2 x2 x3 x4 IV: x3 x2 x3 x4 IV: x4 x2 x3 x4 IV: all x2 x3 x4 MD Optimal βˇ ˇ se(β) ˇ sen (β) ˇ ser (β) ˇ sec (β) βˆ ˆ se(β) 0.9951 (0.0524) 1.0007 (0.0906) 1.0074 (0.2817) 0.0539 (0.0021) 0.0887 (0.0033) 0.2843 (0.0129) 0.0527 (0.0106) 0.0866 (0.0175) 0.2777 (0.0569) 0.0470 (0.0142) 0.0825 (0.0194) 0.2625 (0.0661) 0.0472 (0.0211) 0.0875 (0.0316) 0.2945 (0.1070) 0.9951 (0.0522) 1.0010 (0.0908) 1.0078 (0.2834) 0.0537 (0.0021) 0.0884 (0.0032) 0.2834 (0.0128) 1.0029 (0.0360) 0.9966 (0.0863) 1.0107 (0.2754) 0.0359 (0.0010) 0.0847 (0.0028) 0.2777 (0.0120) 0.0354 (0.0043) 0.0833 (0.0101) 0.2733 (0.0345) 0.0358 (0.0053) 0.0687 (0.0143) 0.2200 (0.0494) 0.0352 (0.0117) 0.0727 (0.0256) 0.2460 (0.0865) 1.0030 (0.0361) 0.9962 (0.0871) 1.0099 (0.2765) 0.0352 (0.0010) 0.0836 (0.0028) 0.2744 (0.0118) 1.0240 (0.0782) 0.9991 (0.1104) 0.9839 (0.3425) 0.0787 (0.0044) 0.1118 (0.0048) 0.3461 (0.0188) 0.0380 (0.0098) 0.0487 (0.0126) 0.1403 (0.0369) 0.0456 (0.0120) 0.0692 (0.0237) 0.2007 (0.0720) 0.0469 (0.0193) 0.0668 (0.0319) 0.2496 (0.1127) 1.2598 (0.0252) 0.8485 (0.0867) 1.2129 (0.2806) 0.0247 (0.0007) 0.0824 (0.0028) 0.2737 (0.0119) 1.0004 (0.0552) 0.9908 (0.0461) 1.0021 (0.2886) 0.0572 (0.0022) 0.0463 (0.0013) 0.2932 (0.0129) 0.0422 (0.0059) 0.0353 (0.0049) 0.2331 (0.0332) 0.0381 (0.0087) 0.0368 (0.0063) 0.2190 (0.0420) 0.0426 (0.0187) 0.0405 (0.0136) 0.2554 (0.0911) 1.0026 (0.0496) 0.9907 (0.0422) 1.0212 (0.2612) 0.0502 (0.0018) 0.0415 (0.0010) 0.2617 (0.0107) 0.9950 (0.0534) 1.0009 (0.0867) 1.0061 (0.1549) 0.0544 (0.0022) 0.0863 (0.0029) 0.1583 (0.0036) 0.0343 (0.0056) 0.0541 (0.0087) 0.1269 (0.0205) 0.0343 (0.0091) 0.0563 (0.0119) 0.1303 (0.0269) 0.0395 (0.0173) 0.0689 (0.0253) 0.1407 (0.0491) 1.0018 (0.0515) 0.9971 (0.0818) 1.0121 (0.1011) 0.0522 (0.0020) 0.0815 (0.0027) 0.1008 (0.0017) 1.0226 (0.0740) 0.9952 (0.1000) 0.9774 (0.3333) 0.0745 (0.0040) 0.1014 (0.0041) 0.3369 (0.0176) 0.0227 (0.0052) 0.0284 (0.0065) 0.0857 (0.0199) 0.0388 (0.0101) 0.0565 (0.0189) 0.1775 (0.0634) 0.0409 (0.0166) 0.0575 (0.0265) 0.2239 (0.1009) 1.2477 (0.0255) 0.9575 (0.0421) 1.0247 (0.1011) 0.0240 (0.0006) 0.0399 (0.0009) 0.0986 (0.0016) 100 Table B.18 Small panel with G = 6, T = 4. Case 5.b: x2it ∼ N (gt/2, 1) + zi + fi , ngt = 1000, sampling rate= 1%. sen , ser and sec are the non-roust, robust and cluster-robust standard errors, receptively. 
MD Identity IV: none x2 x3 x4 IV: z x2 x3 x4 IV: x2 x2 x3 x4 IV: x3 x2 x3 x4 IV: x4 x2 x3 x4 IV: all x2 x3 x4 MD Optimal βˇ ˇ se(β) ˇ sen (β) ˇ ser (β) ˇ sec (β) βˆ ˆ se(β) 0.9985 (0.0239) 1.0034 (0.0391) 1.0072 (0.1273) 0.0241 (0.0004) 0.0397 (0.0006) 0.1282 (0.0024) 0.0233 (0.0048) 0.0385 (0.0079) 0.1244 (0.0256) 0.0209 (0.0060) 0.0366 (0.0085) 0.1174 (0.0308) 0.0213 (0.0095) 0.0390 (0.0146) 0.1333 (0.0486) 0.9985 (0.0239) 1.0034 (0.0391) 1.0074 (0.1276) 0.0241 (0.0004) 0.0397 (0.0006) 0.1281 (0.0024) 0.9991 (0.0150) 1.0031 (0.0368) 1.0075 (0.1251) 0.0160 (0.0002) 0.0382 (0.0006) 0.1268 (0.0023) 0.0160 (0.0019) 0.0380 (0.0047) 0.1264 (0.0156) 0.0161 (0.0022) 0.0304 (0.0065) 0.0981 (0.0243) 0.0157 (0.0052) 0.0324 (0.0120) 0.1116 (0.0404) 0.9992 (0.0151) 1.0030 (0.0369) 1.0078 (0.1256) 0.0160 (0.0002) 0.0381 (0.0006) 0.1265 (0.0023) 1.0251 (0.0355) 1.0089 (0.0489) 1.0059 (0.1551) 0.0351 (0.0008) 0.0498 (0.0009) 0.1552 (0.0035) 0.0193 (0.0041) 0.0247 (0.0052) 0.0715 (0.0150) 0.0217 (0.0052) 0.0312 (0.0104) 0.0902 (0.0330) 0.0229 (0.0093) 0.0307 (0.0149) 0.1118 (0.0517) 1.2599 (0.0109) 0.8485 (0.0368) 1.2361 (0.1265) 0.0111 (0.0001) 0.0374 (0.0006) 0.1259 (0.0023) 1.0010 (0.0260) 1.0023 (0.0206) 1.0104 (0.1325) 0.0257 (0.0005) 0.0207 (0.0002) 0.1337 (0.0025) 0.0189 (0.0026) 0.0158 (0.0022) 0.1059 (0.0147) 0.0171 (0.0037) 0.0165 (0.0028) 0.0991 (0.0195) 0.0191 (0.0083) 0.0180 (0.0065) 0.1159 (0.0408) 0.9998 (0.0221) 1.0028 (0.0186) 1.0076 (0.1217) 0.0227 (0.0004) 0.0188 (0.0002) 0.1201 (0.0020) 0.9993 (0.0238) 1.0037 (0.0381) 1.0032 (0.0686) 0.0243 (0.0004) 0.0387 (0.0006) 0.0705 (0.0007) 0.0153 (0.0025) 0.0242 (0.0040) 0.0567 (0.0093) 0.0153 (0.0038) 0.0251 (0.0053) 0.0579 (0.0121) 0.0178 (0.0079) 0.0313 (0.0117) 0.0635 (0.0218) 0.9995 (0.0233) 1.0036 (0.0370) 1.0013 (0.0442) 0.0237 (0.0004) 0.0369 (0.0005) 0.0454 (0.0003) 1.0231 (0.0337) 1.0052 (0.0444) 0.9991 (0.1511) 0.0333 (0.0008) 0.0452 (0.0008) 0.1512 (0.0033) 0.0117 (0.0021) 0.0147 (0.0026) 0.0445 (0.0080) 0.0188 (0.0043) 0.0257 (0.0082) 0.0801 (0.0290) 0.0197 (0.0081) 0.0267 (0.0123) 0.1005 (0.0463) 1.2498 (0.0109) 0.9681 (0.0183) 1.0142 (0.0447) 0.0109 (0.0001) 0.0182 (0.0002) 0.0448 (0.0003) 101 Table B.19 Small panel with G = 6, T = 4. Case 5.1: x2it ∼ N (gt/2, 1) + zi + fi , ngt = 200, sampling rate= 0.2%. sen , ser and sec are the non-roust, robust and cluster-robust standard errors, receptively. 
MD Identity IV: none x2 x3 x4 IV: z x2 x3 x4 IV: x2 x2 x3 x4 IV: x3 x2 x3 x4 IV: x4 x2 x3 x4 IV: all x2 x3 x4 MD Optimal βˇ ˇ se(β) ˇ sen (β) ˇ ser (β) ˇ sec (β) βˆ ˆ se(β) 0.9993 (0.0550) 1.0066 (0.0874) 1.0131 (0.2886) 0.0539 (0.0021) 0.0888 (0.0032) 0.2859 (0.0124) 0.0527 (0.0113) 0.0868 (0.0185) 0.2794 (0.0595) 0.0471 (0.0141) 0.0823 (0.0208) 0.2623 (0.0718) 0.0479 (0.0206) 0.0871 (0.0326) 0.2988 (0.1102) 0.9994 (0.0550) 1.0064 (0.0874) 1.0141 (0.2882) 0.0537 (0.0021) 0.0885 (0.0032) 0.2849 (0.0124) 1.0017 (0.0342) 1.0058 (0.0826) 1.0151 (0.2798) 0.0360 (0.0010) 0.0848 (0.0028) 0.2793 (0.0117) 0.0355 (0.0042) 0.0841 (0.0100) 0.2767 (0.0339) 0.0359 (0.0051) 0.0688 (0.0154) 0.2209 (0.0534) 0.0353 (0.0115) 0.0715 (0.0270) 0.2483 (0.0897) 1.0017 (0.0347) 1.0052 (0.0828) 1.0159 (0.2784) 0.0353 (0.0010) 0.0838 (0.0028) 0.2761 (0.0115) 1.0280 (0.0797) 1.0159 (0.1086) 1.0101 (0.3527) 0.0785 (0.0042) 0.1117 (0.0048) 0.3470 (0.0178) 0.0376 (0.0107) 0.0483 (0.0134) 0.1393 (0.0385) 0.0449 (0.0125) 0.0684 (0.0242) 0.1977 (0.0766) 0.0466 (0.0206) 0.0662 (0.0324) 0.2463 (0.1195) 1.2559 (0.0244) 0.8574 (0.0821) 1.2250 (0.2813) 0.0245 (0.0006) 0.0825 (0.0027) 0.2751 (0.0116) 1.0062 (0.0569) 1.0022 (0.0468) 1.0234 (0.2921) 0.0571 (0.0022) 0.0464 (0.0013) 0.2950 (0.0123) 0.0420 (0.0059) 0.0353 (0.0049) 0.2338 (0.0328) 0.0381 (0.0086) 0.0367 (0.0061) 0.2197 (0.0443) 0.0432 (0.0184) 0.0399 (0.0139) 0.2577 (0.0886) 1.0058 (0.0514) 1.0023 (0.0432) 1.0203 (0.2627) 0.0502 (0.0018) 0.0416 (0.0010) 0.2628 (0.0102) 1.0004 (0.0550) 1.0088 (0.0839) 1.0047 (0.1579) 0.0543 (0.0021) 0.0864 (0.0028) 0.1580 (0.0037) 0.0344 (0.0058) 0.0544 (0.0091) 0.1276 (0.0211) 0.0344 (0.0087) 0.0563 (0.0127) 0.1303 (0.0279) 0.0400 (0.0171) 0.0694 (0.0267) 0.1418 (0.0502) 1.0051 (0.0536) 1.0049 (0.0800) 1.0017 (0.0993) 0.0522 (0.0020) 0.0815 (0.0026) 0.1007 (0.0017) 1.0262 (0.0755) 1.0115 (0.0984) 1.0035 (0.3431) 0.0743 (0.0038) 0.1012 (0.0040) 0.3378 (0.0165) 0.0225 (0.0056) 0.0283 (0.0069) 0.0852 (0.0207) 0.0382 (0.0106) 0.0559 (0.0194) 0.1750 (0.0674) 0.0409 (0.0176) 0.0572 (0.0271) 0.2212 (0.1072) 1.2446 (0.0250) 0.9700 (0.0428) 1.0149 (0.1026) 0.0239 (0.0006) 0.0399 (0.0009) 0.0985 (0.0017) 102 Table B.20 Small panel with G = 6, T = 4. Case 5.2: x2it ∼ N (gt/2, 1) + zi + fi , ngt = 1000, sampling rate= 0.2%. sen , ser and sec are the non-roust, robust and cluster-robust standard errors, receptively. 
BIBLIOGRAPHY

Arellano, Manuel. 1987. “Practitioners' Corner: Computing Robust Standard Errors for Within-groups Estimators.” Oxford Bulletin of Economics and Statistics, 49(4): 431–434.

Breusch, Trevor, Hailong Qian, Peter Schmidt, and Donald Wyhowski. 1999. “Redundancy of moment conditions.” Journal of Econometrics, 91(1): 89–111.

Collado, M. Dolores. 1997. “Estimating dynamic models from time series of independent cross-sections.” Journal of Econometrics, 82(1): 37–62.

Deaton, Angus. 1985. “Panel data from time series of cross-sections.” Journal of Econometrics, 30(1): 109–126.

Girma, Sourafel. 2000. “A quasi-differencing approach to dynamic modelling from a time series of independent cross-sections.” Journal of Econometrics, 98(2): 365–383.

Hansen, Christian B. 2007a. “Asymptotic properties of a robust variance matrix estimator for panel data when T is large.” Journal of Econometrics, 141(2): 597–620.

Hansen, Christian B. 2007b. “Generalized least squares inference in panel and multilevel models with serial correlation and fixed effects.” Journal of Econometrics, 140(2): 670–694.

Heckman, James J., and V. Joseph Hotz. 1989. “Choosing among alternative nonexperimental methods for estimating the impact of social programs: The case of manpower training.” Journal of the American Statistical Association, 84(408): 862–874.
Imbens, Guido, and Jeffrey M. Wooldridge. 2007. What's New in Econometrics? NBER.

Kezdi, Gabor. 2003. “Robust standard error estimation in fixed-effects panel models.” Available at SSRN 596988.

McKenzie, David J. 2004. “Asymptotic theory for heterogeneous dynamic pseudo-panels.” Journal of Econometrics, 120(2): 235–262.

Moffitt, Robert. 1993. “Identification and estimation of dynamic models with a time series of repeated cross-sections.” Journal of Econometrics, 59(1): 99–123.

Newey, Whitney K., and Daniel McFadden. 1994. “Large sample estimation and hypothesis testing.” Handbook of Econometrics, 4: 2111–2245.

Verbeek, Marno. 2008. “Pseudo-panels and repeated cross-sections.” In The Econometrics of Panel Data, 369–383. Springer.

Verbeek, Marno, and Francis Vella. 2005. “Estimating dynamic models from repeated cross-sections.” Journal of Econometrics, 127(1): 83–102.

Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Boston, MA: MIT Press.

CHAPTER 3

A FLEXIBLE PLUG-IN G-FORMULA FOR CONTROLLED DIRECT EFFECTS IN MEDIATION ANALYSIS

3.1 Introduction

In the literature of epidemiology and biostatistics, the term g-methods (e.g., Westreich et al., 2012) is often used to refer collectively to the g-formula (Robins, 1986), g-estimation of structural nested models (Robins, 1998), and inverse probability weighting of marginal structural models (Horvitz and Thompson, 1952; Robins, 1989; Hernan and Robins, 2015), all of which are useful approaches for estimating the effects of time-varying treatments in the presence of time-varying confounders. The g-formula in its original form is non-parametric and is the foundation for the other two. While the non-parametric g-formula is flexible in its model specification, it is also quite demanding on data. Therefore, we often introduce semi-parametric or parametric modeling. Although parametric models are almost always misspecified, the parametric g-formula often yields satisfactory estimates as long as the specified models are reasonably flexible. However, most applications of the parametric g-formula (e.g., Westreich et al., 2012; Taubman et al., 2009; Young et al., 2011; Danaei et al., 2013; Lajous et al., 2013; Garcia-Aymerich et al., 2014) still use Monte Carlo integration to calculate the treatment effects of interest, because closed-form expressions are either non-existent or tedious to derive.

The application of these g-methods to mediation analysis is straightforward, as mediation analysis is conceptually equivalent to a sequential treatment over two periods. Since Robins and Greenland (1992) conceptualized the natural and controlled effects in mediation analysis using the potential outcome (counterfactual) framework, several mediation analysis methods have been developed from g-methods. These methods include, among others, the parametric g-formula in Daniel, De Stavola and Cousens (2011) and Valeri and Vanderweele (2013), and the parametric version of the sequential g-estimation of structural nested mean models (SNMMs) studied by Vansteelandt (2009). As with the g-methods themselves, these mediation methods also rely on Monte Carlo integration to calculate the treatment effects of interest, which can be computationally demanding. Moreover, since there are no closed-form expressions, we almost always need to bootstrap the standard errors, which raises the computational burden rapidly. When the estimation itself is time-consuming, the problem is amplified even further.
Examples include, but are not limited to, maximum likelihood estimation when it converges slowly, and most semi-parametric or non-parametric techniques that require cross validation for tuning-parameter selection.

In view of this limitation, in this chapter we propose a so-called flexible plug-in g-formula for controlled direct effects (CDEs) in mediation analysis. The key assumption needed is that the conditional expectation of the outcome is linear in the time-varying confounders. This partial linearity allows us to replace the confounders with their fitted values, which results in a plug-in estimator for the CDE. At the same time, it relaxes the fully linear assumptions that are commonly used in empirical studies, which gives us more flexibility in choosing the functional form of the outcome conditional mean. As a result, we have a better chance of being close to the true underlying model. Besides the partial linearity assumption, another condition necessary for the consistency of the flexible plug-in g-formula is the sequential ignorability assumption (Robins, 1986). To check the robustness of the estimator to a particular violation of the sequential ignorability assumption, we present a sensitivity analysis that is similar in spirit to that proposed by Imai, Keele and Tingley (2010). The proposed estimator is evaluated in a small simulation study and its use is illustrated in a longitudinal cohort study.

The rest of this chapter is organized as follows. We first set up the counterfactual mediation analysis framework in the second section. In the third section, we review the general g-formula as well as the sequential g-estimation. In the fourth section, we present the flexible plug-in g-formula in detail; in particular, it is compared to the sequential g-estimation in some commonly used linear specifications. The next two sections outline the sensitivity analysis and an empirical application, respectively. The last section concludes.

3.2 Framework

A causal mediation analysis is typically guided by a directed acyclic graph (DAG) (Pearl, 2009). We use the DAG G in Figure 3.1 for illustration, but our method can be applied in similar models that satisfy the assumptions below. Throughout this paper, a DAG is viewed as a graphical representation of an underlying non-parametric structural equation model with independent errors (NPSEM-IE) (Pearl, 2009). A DAG and the associated NPSEM-IE are related as usual: “there is an equation for each variable in the model, specifying that variable as a function of its parents in the graph” (Richardson and Robins, 2013). The counterfactual outcomes, defined by intervening on certain variables in the NPSEM-IE, are used in constructing the CDEs of interest.

Specifically, assume we have a longitudinal study in which each respondent was interviewed three times, at k = 0, 1, 2, with a one-time treatment A at k = 0. Each interview generates data Lk. The purpose is to learn to what extent the treatment effect of A on Y is mediated by a mediator M. (A, M) are the intervention nodes, and {Lk : k = 0, 1, 2} are observed non-intervention nodes or confounders. L0 contains all baseline information, and L1 contains all post-treatment non-intervention nodes that occur before M and confound the mediator-outcome relationship. As noted in Pearl (2014), we cannot identify natural mediation effects non-parametrically because of the existence of L1. Hence, we focus on the CDEs in this paper.
The model allows for a type of harmless unobserved variables, collectively denoted by U0, that are parents of the observed non-intervention nodes.¹ Based on the back-door criterion (Pearl, 2009), the observed non-intervention nodes block the confounding effects of U0. Each node (including the unobservable U0) has an exogenous error (not shown in the graph) attached solely to itself. For example, εY is the exogenous error for Y, as εU0 is for U0. All errors are unobserved and are assumed to be jointly independent in the NPSEM-IE. The independence assumption will be relaxed in our sensitivity analysis, in which a more general NPSEM allowing correlations between the ε's will be used.

¹ Although the NPSEM-IE model makes many more counterfactual independence assumptions than, and is a strict submodel of, the finest fully randomized causally interpretable structured tree graph (FFRCISTG) model of Robins (1986) and Robins and Richardson (2010), the generality of the latter does not play an essential role in our paper. The adoption of the NPSEM-IE, however, makes the description of the model straightforward and easy to follow, since it is built on traditional structural equation models and imposes simple distributional assumptions on the errors (exogenous variables).

Figure 3.1 A directed acyclic graph for a longitudinal study with three time points. (A, M) are the intervention nodes, (L0, L1, Y) the non-intervention nodes, and U0 the unobservables. [Graph G omitted.]

Let Y^{am} be the counterfactual outcome when A and M are intervened on and fixed at a and m. In the context of NPSEM-IEs, such an intervention corresponds to deleting the equations for A and M from the system and substituting A = a and M = m throughout. This operation is called the do operation in Pearl (2009). In this paper, we consider a binary A, so that a ∈ {0, 1}; the extension to a multi-valued A is trivial. For a fixed m, we are interested in the controlled direct effect CDE(m), defined as E(Y^{1m} − Y^{0m}), or equivalently E(Y(1, m) − Y(0, m)).

3.3 Existing Methods

In the context of DAG G, we discuss two existing methods for estimating the CDE. The first is the g-formula for the marginal distribution of Y^{am}, introduced in Robins (1986) and revisited in Richardson and Robins (2013), which is the basis for the parametric g-formula and in particular for the flexible plug-in estimator. The second is the sequential g-estimation of Vansteelandt (2009), which is, as we will show, numerically equivalent to the flexible plug-in estimator in certain linear cases.

3.3.1 The g-Formula

In Richardson and Robins (2013), the term g-formula refers to the unextended g-formula of Robins (1986), the extended g-formula of Robins, Hernán and Siebert (2004), or the g-formula for a sequence of treatments and a single response. Among the three, the last is of interest in the context of DAG G and the estimation of CDEs.

Let P(Y^{am} = y) be the distribution of the potential response Y under the sequence of interventions (A = a, M = m), from which we can construct the CDE. Assume the consistency rule (Robins, 1994) holds; that is, a potential response under a hypothetical condition that happened to take place is precisely the observed response. In addition, the following form of the sequential ignorability condition from Robins (2000) is imposed:

Y^{am} ⊥⊥ A | L0,
Y^{am} ⊥⊥ M | L0, A, L1,   for all a, m,   (3.1)

where ⊥⊥ represents distributional independence.
Note that condition (3.1) summarizes the set of independence conditions for all possible values of a and m, which are needed to identify CDEs for all possible values of M. A logically equivalent algorithm for finding the right conditioning sets, matching the sequential ignorability conditions in (3.1), is the sequential back-door criterion developed in Pearl and Robins (1995), which states

(Y ⊥⊥ A | L0) in G_{\underline{A}\overline{M}},
(Y ⊥⊥ M | L0, A, L1) in G_{\underline{M}},   (3.2)

where G_{\underline{M}} denotes the subgraph obtained by removing from G all arrows emerging from M (Figure 3.2, B), and G_{\underline{A}\overline{M}} denotes the removal of both the incoming arrows to M and the outgoing arrows from A (Figure 3.2, A).

Figure 3.2 Subgraphs of G, where an upper bar means arrows pointing to a node are removed and an under bar means arrows emitting from a node are removed: (A) G_{\underline{A}\overline{M}}; (B) G_{\underline{M}}. [Graphs omitted.]

Under either (3.1) or (3.2), the g-formula for the expected counterfactual outcome is

E(Y^{am}) = ∫∫∫ y f_{Y|L0,A,L1,M}(y|l0, a, l1, m) f_{L1|L0,A}(l1|l0, a) f_{L0}(l0) dy dl1 dl0   (3.3)
          = ∫∫ E(Y|l0, a, l1, m) f_{L1|L0,A}(l1|l0, a) f_{L0}(l0) dl1 dl0,   (3.4)

where, e.g., f_{Y|L0,A,L1,M}(y|l0, a, l1, m) is shorthand for the conditional density of Y at y given (L0 = l0, A = a, L1 = l1, M = m). See Appendix A for detail. Then the CDE is

E(Y^{1m} − Y^{0m}) = ∫∫ E(Y|l0, 1, l1, m) f_{L1|L0,A}(l1|l0, 1) f_{L0}(l0) dl1 dl0
                  − ∫∫ E(Y|l0, 0, l1, m) f_{L1|L0,A}(l1|l0, 0) f_{L0}(l0) dl1 dl0.   (3.5)

Equation (3.3) is non-parametric in the sense that no parametric assumptions are made yet for f_{Y|L0,A,L1,M}(y|l0, a, l1, m), f_{L1|L0,A}(l1|l0, a) and f_{L0}(l0). Equation (3.4) adds the assumption that the conditional mean E(Y|l0, a, l1, m) exists and is finite.

Two straightforward strategies for estimating E(Y^{am}) follow from equations (3.3) and (3.4). The first strategy exploits equation (3.3), with either non-parametric or parametric specifications of the distributions. This strategy generally involves Monte Carlo simulation and numerical integration for calculating CDEs (Daniel, De Stavola and Cousens, 2011; Imai, Keele and Tingley, 2010; Hicks and Tingley, 2011). The second strategy exploits equation (3.4), which avoids the estimation of the density function f_{Y|L0,A,L1,M}(y|l0, a, l1, m); instead, we only estimate the conditional mean E(Y|l0, a, l1, m). We can keep the non-parametric feature or choose suitable parametric models to reduce the computational burden. Many applications have shown that a properly chosen parametric method can provide good approximations (Westreich et al., 2012; Taubman et al., 2009; Young et al., 2011; Danaei et al., 2013; Lajous et al., 2013; Garcia-Aymerich et al., 2014). The estimator proposed in this chapter falls in the second estimation strategy and imposes a particular parametric assumption on E(Y|l0, a, l1, m) that simplifies equation (3.4) even further (details below); a small code sketch of the generic second strategy follows.
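As a point of reference, here is a minimal sketch of the second strategy when the inner integral over L1 is still evaluated by simulation. The fully linear outcome model, the homoskedastic normal model for L1 given (L0, A), and all variable and function names are illustrative assumptions of ours, not part of the original g-formula literature.

    import numpy as np

    def ols(X, y):
        """OLS coefficients of y on X (X must include an intercept column)."""
        return np.linalg.lstsq(X, y, rcond=None)[0]

    def gformula_mc(y, l0, a, l1, m, a_fix, m_fix, n_sim=1000, rng=None):
        """Monte Carlo evaluation of E(Y^{a_fix, m_fix}) via equation (3.4):
        average a fitted conditional mean of Y over simulated draws of
        L1 | L0, A = a_fix and over the empirical distribution of L0."""
        if rng is None:
            rng = np.random.default_rng(0)
        n = len(y)
        const = np.ones(n)
        # Illustrative fully linear model for E(Y | L0, A, L1, M).
        b = ols(np.column_stack([const, l0, a, l1, m]), y)
        # Illustrative linear-normal model for L1 given (L0, A).
        c = ols(np.column_stack([const, l0, a]), l1)
        sd = (l1 - (c[0] + c[1] * l0 + c[2] * a)).std()
        total = 0.0
        for _ in range(n_sim):
            l1_draw = c[0] + c[1] * l0 + c[2] * a_fix + sd * rng.standard_normal(n)
            total += np.mean(b[0] + b[1] * l0 + b[2] * a_fix
                             + b[3] * l1_draw + b[4] * m_fix)
        return total / n_sim

Because the fitted conditional mean above is linear in l1, the simulation step is actually redundant: one could replace the draws by their conditional mean. This observation is what the flexible plug-in estimator below exploits in general.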
3.3.2 The Sequential g-formula estimator

The sequential g-estimation for CDEs is a two-step estimator based on an SNMM (Vansteelandt, 2009). The idea is to first partial out the effect of the mediator on the outcome and then regress the adjusted outcome on the treatment, the confounders, and possibly their interactions to identify the direct effect. The sequential g-formula estimator assumes an additively separable functional form for the conditional mean of Y:

E(Y|l0, a, l1, m) = qA(l0, a, l1; γ) + qM(l0, a, l1, m; γ),   (3.6)

where qA(·) and qM(·) are arbitrary known functions with finite dimensional parameter γ, satisfying qM(l0, a, l1, m = 0; γ) = 0. For example, we can assume qA = γ0 + γA a + γL0 l0 + γL1 l1 and qM = γM m. In addition, assume an SNMM of the form E(Y^{am} − Y^{0m} | l0) = ϕA a, where ϕA is the CDE. Then the sequential g-estimation procedure is:

1. regress Y on (1, L0, A, L1, M), obtain the ordinary least squares (OLS) estimator γ̂M of γM, and generate Ŷ−M ≡ Y − γ̂M M; and

2. regress Ŷ−M on (1, L0, A), and denote by ϕ̂A the OLS estimator of the coefficient on A.

It can be shown that ϕ̂A is a consistent estimator for ϕA.² See Appendices C and D for the validity of the estimation procedure; a minimal code sketch of the two steps follows.

² Under the sequential ignorability conditions and the additive separability (3.6), Vansteelandt (2009) gave the key identification result E[Y − qM(L0, A, L1, M; γ) | L0 = l0, A = a] = E(Y^{a0} | L0 = l0). We show that, in addition to the assumptions in Vansteelandt (2009), a necessary and sufficient condition for the validity of the second-step regression is f(l0, 0) ≡ E(L1 | L0 = l0, A = 0) = π0 + πL0 l0. See Appendix B for detail.
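As a concrete illustration, the following is a minimal sketch of the two-step procedure for this simple SNMM (Model 1 below). The variable and function names are ours, and plain OLS is used throughout; this is not a general implementation, which must be tailored to the chosen qA(·) and qM(·).

    import numpy as np

    def ols(X, y):
        """OLS coefficients of y on X (X must include an intercept column)."""
        return np.linalg.lstsq(X, y, rcond=None)[0]

    def sequential_g(y, l0, a, l1, m):
        """Two-step sequential g-estimation of the CDE under
        E(Y^{am} - Y^{0m} | l0) = phi_A * a (Model 1)."""
        n = len(y)
        const = np.ones(n)
        # Step 1: regress Y on (1, L0, A, L1, M); keep the coefficient on M.
        gamma = ols(np.column_stack([const, l0, a, l1, m]), y)
        y_adj = y - gamma[4] * m   # partial out the mediator's effect
        # Step 2: regress the adjusted outcome on (1, L0, A);
        # the coefficient on A estimates the CDE, phi_A.
        phi = ols(np.column_stack([const, l0, a]), y_adj)
        return phi[2]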
Table 3.1 lists five typical examples of SNMMs compatible with DAG G that can be estimated using the sequential g-formula estimator. The simple example above is Model 1 in Table 3.1. The first three models were discussed in Vansteelandt (2009); we add the latter two models with more flexible specifications. One difficulty in applying the sequential g-formula estimator is finding the proper qA(·) and qM(·) functions for a given SNMM, as can be seen in Table 3.1. The derivation of the standard errors of the estimator is also nontrivial.

The sequential g-estimation is not always as simple as it looks in Model 1. For example, in a model with up to two-way interactions (Model 5 in Table 3.1), we need to estimate ϕAM, ϕAL0 and ϕAML0 in

E(Y^{1m} − Y^{0m}) = ϕA + ϕAM m + ϕAL0 E(L0) + ϕAML0 m E(L0).

It turns out that in this case the specification E(L1 | L0 = l0, A = 0) = π0 + πL0 l0 alone is not enough to identify ϕAM or ϕAML0. What is needed is the stronger assumption

E(L1 | L0 = l0, A = a) = π0 + πL0 l0 + πA a + πAL0 a × l0,

which implies (by setting γAML0 = γAML1 = 0 in Appendix B)

ϕAM = γAM + γML1 πA,   ϕAML0 = γML1 πAL0.

Clearly, these equations show that ϕAM and ϕAML0 cannot be obtained directly from the first-step regression, since an additional regression for E(L1 | L0 = l0, A = a) is needed to estimate π.

Table 3.1 Comparison of the plug-in estimator with the sequential g-estimator under different specifications for the outcome conditional mean and different structural nested mean models. Notation: qA ≡ qA(l0, a, l1; γ) and qM ≡ qM(l0, a, l1, m; γ), with E(Y|l0, a, l1, m) = qA + qM for the sequential g-formula estimator; E(Y|l0, a, l1, m) = h0(a, m)l0 + h1(a, m)l1 + h(a, m) for the flexible plug-in g-formula estimator; and f(l0, a) ≡ E(L1 | L0 = l0, A = a).

Model 1. SNMM: E(Y^{am} − Y^{0m}|l0) = ϕA a.
  Sequential: qA = γ0 + γA a + γL0 l0 + γL1 l1; qM = γM m; f(l0, 0) = π0 + πL0 l0.
  Plug-in: h0 = γL0; h1 = γL1; h = γ0 + γA a + γM m; f(l0, a) = π0 + πL0 l0 + πA a.

Model 2. SNMM: ϕA a + ϕAM a·m.
  Sequential: qA = γ0 + γA a + γL0 l0 + γL1 l1; qM = γM m + γAM a·m; f(l0, 0) = π0 + πL0 l0.
  Plug-in: h0 = γL0; h1 = γL1; h = γ0 + γA a + γM m + γAM a·m; f(l0, a) = π0 + πL0 l0 + πA a.

Model 3. SNMM: ϕA a + ϕAL0 a·l0.
  Sequential: qA = γ0 + γA a + γL0 l0 + γL1 l1 + γAL0 a·l0; qM = γM m; f(l0, 0) = π0 + πL0 l0.
  Plug-in: h0 = γL0 + γAL0 a; h1 = γL1; h = γ0 + γA a + γM m; f(l0, a) = π0 + πL0 l0 + πA a + πAL0 a·l0.

Model 4. SNMM: ϕA a + ϕAM a·m + ϕAL0 a·l0.
  Sequential: qA = γ0 + γA a + γL0 l0 + γL1 l1 + γAL0 a·l0; qM = γM m + γAM a·m; f(l0, 0) = π0 + πL0 l0.
  Plug-in: h0 = γL0 + γAL0 a; h1 = γL1; h = γ0 + γA a + γM m + γAM a·m; f(l0, a) = π0 + πL0 l0 + πA a + πAL0 a·l0.

Model 5. SNMM: ϕA a + ϕAM a·m + ϕAL0 a·l0 + ϕAML0 a·m·l0.
  Sequential: qA = γ0 + γA a + γL0 l0 + γL1 l1 + γAL0 a·l0 + γAL1 a·l1; qM = γM m + γAM a·m + γML0 m·l0 + γML1 m·l1; f(l0, a) = π0 + πL0 l0 + πA a + πAL0 a·l0.
  Plug-in: h0 = γL0 + γAL0 a + γML0 m; h1 = γL1 + γAL1 a + γML1 m; h = γ0 + γA a + γM m + γAM a·m; f(l0, a) = π0 + πL0 l0 + πA a + πAL0 a·l0.

3.4 The Flexible Plug-in g-formula estimator

3.4.1 The Partial Linearity Assumption and the Plug-in g-formula estimator

Although the idea of using a linear outcome conditional mean is not particularly new (Robins, 2000; Van der Wal et al., 2009), to the best of our knowledge, the flexible plug-in g-formula estimator proposed here is the first to make full use of it. This parametric g-formula has a closed-form expression for the CDE and thus does not require numerical integration. Specifically, let the conditional expectation of Y given (L0, A, L1, M) be linear in L0 and L1, namely

E(Y|l0, a, l1, m) = h0(a, m; γ) l0 + h1(a, m; γ) l1 + h(a, m; γ),   (3.7)

where the hk's are arbitrary functions of (a, m), known up to the parameters γ, for k ∈ {0, 1, ∅}. This is of course a strong parametric assumption, since it ignores any interaction among the confounders. We should think of equation (3.7) as a first order Taylor approximation to an arbitrary function of (L0, L1); the plug-in g-formula estimator can be extended to include higher order terms in the confounders.

Equations (3.4) and (3.7) lead to the proposed flexible plug-in g-formula estimator for E(Y^{am}):

E(Y^{am}) = h0(a, m; γ) E(L0) + h1(a, m; γ) E[E(L1 | L0, A = a)] + h(a, m; γ).   (3.8)

See Appendix A for proof. The last column of Table 3.1 shows that, by varying the specifications of the hk's, equation (3.7) can provide the same specification for E(Y|l0, a, l1, m) as equation (3.6) in all five models there. The separability in qA and qM for the sequential g-formula estimator and the linearity in the confounders of the estimating equation for the plug-in g-formula estimator do not nest within each other; neither estimator is strictly more flexible than the other. Note that the unknown parameters γ in (3.7) are not the structural parameters that would appear in the structural equation Y = fY(L0, A, L1, M, εY) in the NPSEM-IE. Equation (3.7) is an estimating equation only. The parameters in this equation are not of direct interest in general, but they eventually identify E(Y^{am}).
Note also that as long as equation (3.7) holds, the presence of U0 does not affect the identification of the CDE. However, we do need a model for E(L1 | L0, A). This model need not be linear, and can potentially be semi- or non-parametric if the dimension of the conditioning set is low. For example, if L1 is binary, a logistic model can be used. In the application below, however, we use the following linear model for simplicity:

E(L1 | L0, A) = π0 + πA A + πL0 L0 + πAL0 A × L0,   (3.9)

and thus

E[E(L1 | L0, A = a)] = π0 + πA a + πL0 E(L0) + πAL0 a × E(L0).   (3.10)

3.4.2 Estimation Procedure for the Flexible Plug-in g-formula estimator of CDE

Given the discussion above, the CDE can be estimated as follows:

1. estimate γ in equation (3.7) using a proper method, e.g., a quasi-maximum likelihood estimator, which is consistent given a correctly specified conditional mean (Wooldridge, 2010);

2. estimate E(L0) using a proper method, e.g., the sample mean, and obtain Ê(L0);

3. estimate E(L1 | L0, A) using a proper method, e.g., by regressing L1 on (1, A, L0, A × L0), and obtain Ê[E(L1 | L0, A = a)]; and

4. plug Ê(L0) and Ê[E(L1 | L0, A = a)] into (3.8) to obtain Ê(Y^{am}).

Then the flexible plug-in estimate of the CDE evaluated at m is obtained as Ê(Y^{am}) − Ê(Y^{0m}). Bootstrap can be used to obtain the standard errors for the estimated CDEs. A minimal code sketch of this procedure is given below.
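The following is a minimal sketch of steps 1–4 for the Model 5 specification of the hk's in Table 3.1, using OLS for both estimating equations; the function and variable names are ours and purely illustrative.

    import numpy as np

    def ols(X, y):
        """OLS coefficients of y on X (X must include an intercept column)."""
        return np.linalg.lstsq(X, y, rcond=None)[0]

    def plugin_cde(y, l0, a, l1, m, m_fix):
        """Flexible plug-in g-formula estimate of CDE(m_fix) under the
        Model 5 specification: h0 = g[0] + g[1]a + g[2]m,
        h1 = g[3] + g[4]a + g[5]m, h = g[6] + g[7]a + g[8]m + g[9]am."""
        n = len(y)
        const = np.ones(n)
        # Step 1: estimate gamma in the estimating equation (3.7).
        X = np.column_stack([l0, a * l0, m * l0, l1, a * l1, m * l1,
                             const, a, m, a * m])
        g = ols(X, y)
        # Step 2: estimate E(L0) by the sample mean.
        el0 = l0.mean()
        # Step 3: estimate E(L1 | L0, A) as in equation (3.9) and
        # average over L0 as in equation (3.10).
        pi = ols(np.column_stack([const, a, l0, a * l0]), l1)
        def eel1(a_fix):
            return pi[0] + pi[1] * a_fix + pi[2] * el0 + pi[3] * a_fix * el0
        # Step 4: plug into equation (3.8) and difference over a.
        def ey(a_fix):
            h0 = g[0] + g[1] * a_fix + g[2] * m_fix
            h1 = g[3] + g[4] * a_fix + g[5] * m_fix
            h = g[6] + g[7] * a_fix + g[8] * m_fix + g[9] * a_fix * m_fix
            return h0 * el0 + h1 * eel1(a_fix) + h
        return ey(1.0) - ey(0.0)

Bootstrapping the whole function over resampled data then yields standard errors for the estimated CDEs.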
3.4.3 The plim of the Parametric g-Formula Is the Flexible Plug-in g-formula estimator

By the law of large numbers, the flexible plug-in g-formula estimator of the CDE is the plim of the corresponding parametric g-formula of section 3.3.1. This is simply because the former is an analytical solution for the integral that the latter evaluates via Monte Carlo simulation. We verify this claim using Model 5 in Table 3.1 through a simulation study with the aid of the Stata command gformula developed in Daniel, De Stavola and Cousens (2011). In the simulation study, we let the number of Monte Carlo simulations used by the gformula command increase towards infinity. The results show that the estimates obtained using the gformula command indeed come closer and closer to those obtained using the flexible plug-in g-formula as the number of simulations increases. One reason for using Model 5 is that its specification is complex enough to make a noticeable difference in computation time between the two methods. Obviously, if we are only interested in a point estimate of the controlled direct effect, the flexible plug-in g-formula does not gain us much. However, since we also need to obtain the standard errors for the estimators, and bootstrap is often inevitable in g-methods, the flexible plug-in g-formula can save a considerable amount of computation time.

3.4.4 The Flexible Plug-in g-formula estimator Is Numerically Equivalent to the Sequential g-formula estimator

We show that, in each of the five models in Table 3.1, the flexible plug-in g-formula estimator and the sequential g-formula estimator are numerically identical; see Appendix C for proof. It is worth emphasizing the following two conditions, which are met by each of the five models in Table 3.1. First, the SNMM used by the sequential g-formula estimator must be compatible with the specification of E(Y|l0, a, l1, m). Note that a given SNMM and a compatible specification of E(Y|l0, a, l1, m) imply a specification of E(L1 | L0, A = 0) (or of E(L1 | L0, A) in Model 5), as shown in Table 3.1. Second, the specification of E(L1 | A, L0) used by the flexible plug-in estimator must be properly chosen. By “properly chosen”, we mean the specification of E(L1 | A, L0) must be of the same level of flexibility as the second-step regression of the sequential g-formula estimator; see the remark in Appendix C for more detail. Later, when we discuss the issue of “one single parameter”, the two estimators are not identical except in the no-interaction case: the sequential g-formula estimator forces E(Y^{am} − Y^{0m}) to be ϕa when it is actually not, a typical case in which the incompatibility issue arises.

3.4.5 Simulation

We use a simple simulation study to evaluate the equivalence. Assume the data generating process (DGP) is as follows:

U0 = εU0,
L0 = U0 + εL0,
A = 1[expit(L0) ≥ εA],
L1 = U0 + L0 + A + εL1,
M = 50 × expit(L0 + A + L1 + εM),
Y = L0 × log(1 + A + M²) + L1 × (A + M) + U0 + εY,

where all ε's except εA are standard normal, εA is uniform on (0, 1), and expit(x) = exp(x)/(exp(x) + 1) is the inverse of the logit transformation. All ε's are independent of each other. For simplicity, all coefficients are set to unity. The resulting true CDE is CDE(m) = m + 1 for any fixed value m. Under this DGP, the following estimating equation, which is linear in L0 and L1, holds:

E(Y | L0, A, L1, M) = [log(1 + A + M²)] L0 + (1/3 + A + M) L1 + (−A/3),   (3.11)

where the three coefficients are h0(a, m), h1(a, m) and h(a, m), respectively. A minimal simulation of this DGP is sketched below.
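For concreteness, one sample from this DGP can be generated as follows (variable names are ours; expit is implemented inline):

    import numpy as np

    def simulate(n, rng):
        """Draw one sample of size n from the DGP of section 3.4.5."""
        expit = lambda x: np.exp(x) / (np.exp(x) + 1.0)
        eps = rng.standard_normal                       # standard normal errors
        u0 = eps(n)                                     # U0 = eps_U0
        l0 = u0 + eps(n)                                # L0 = U0 + eps_L0
        a = (expit(l0) >= rng.uniform(size=n)).astype(float)  # A = 1[expit(L0) >= eps_A]
        l1 = u0 + l0 + a + eps(n)                       # L1 = U0 + L0 + A + eps_L1
        m = 50.0 * expit(l0 + a + l1 + eps(n))          # M = 50 expit(L0 + A + L1 + eps_M)
        y = l0 * np.log(1.0 + a + m**2) + l1 * (a + m) + u0 + eps(n)  # Y
        return y, l0, a, l1, m

    # One of the 1000 simulation samples of 500 observations:
    y, l0, a, l1, m = simulate(500, np.random.default_rng(0))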
Under this DGP, the following estimating equation, which is linear in L0 and L1, holds:

$$E(Y \mid L_0, A, L_1, M) = \underbrace{\log(1+A+M^2)}_{h_0(a,m)}\, L_0 + \underbrace{\Big(\tfrac{1}{3}+A+M\Big)}_{h_1(a,m)}\, L_1 \; \underbrace{-\; \tfrac{1}{3}A}_{h(a,m)}. \qquad (3.11)$$

For illustration purposes, we only show the estimation results for the two estimators in Model 4 and Model 5 of Table 3.1. The following two specifications for E(L1|A, L0), which differ in their flexibility, are considered for both Model 4 and Model 5:

E(L1 | A, L0) = π0 + πA A + πL0 L0,   (3.12)
E(L1 | A, L0) = π0 + πA A + πL0 L0 + πAL0 A × L0.   (3.13)

The specification for E(L1|A, L0) affects the estimates of the flexible plug-in estimator in both Model 4 and Model 5. As for the sequential g-formula estimator, the specification for E(L1|A, L0) has no effect in Model 4, since no estimate involves the estimation of π, which is in turn a result of the specification that L1 does not interact with M; but it does have an effect in Model 5, since both ϕAM and ϕAML0 require the estimation of π. When the same specification of E(Y|L0, A, L1, M) and the same proper specification of E(L1|L0, A) are used, the flexible plug-in g-formula estimator and the sequential g-formula estimator must give exactly the same estimates on each occasion.

The simulation consists of 1000 runs with 500 observations in each sample. Although m ranges from 1 to 50, to save space we report in Table 3.2 the results for CDE(1), CDE(25) and CDE(50) only. For each estimator, the average and standard deviation (in parentheses) over the 1000 simulations are reported. We also calculate the difference between the two estimates in each simulation run and report the average and standard deviation of the difference over the 1000 simulations.

Table 3.2 Simulation results: flexible plug-in g-formula (FPG) vs. sequential g-formula estimator (SG)

A: Simulation Results for Model 4

Under E(L1|A, L0) = π0 + πA A + πL0 L0:
m = 1 (true CDE 2): FPG 57.0846 (8.5109), SG 57.0781 (8.5106), Difference −6.481e−3 (.1090)
m = 25 (true CDE 26): FPG 40.1442 (5.1294), SG 40.1377 (5.1285), Difference −6.481e−3 (.1090)
m = 50 (true CDE 51): FPG 22.4979 (7.1983), SG 22.4914 (7.1973), Difference −6.481e−3 (.1090)

Under E(L1|A, L0) = π0 + πA A + πL0 L0 + πAL0 A × L0:
m = 1 (true CDE 2): FPG 57.0781 (8.5106), SG 57.0781 (8.5106), Difference 0 (0)
m = 25 (true CDE 26): FPG 40.1377 (5.1285), SG 40.1377 (5.1285), Difference 0 (0)
m = 50 (true CDE 51): FPG 22.4914 (7.1973), SG 22.4914 (7.1973), Difference 0 (0)

B: Simulation Results for Model 5

Under E(L1|A, L0) = π0 + πA A + πL0 L0:
m = 1 (true CDE 2): FPG .0426 (.4303), SG .04219 (.4303), Difference −4.951e−4 (.0236)
m = 25 (true CDE 26): FPG 24.9052 (2.3424), SG 24.9047 (2.3425), Difference −4.951e−4 (.0236)
m = 50 (true CDE 51): FPG 50.8037 (4.6871), SG 50.8032 (4.6871), Difference −4.951e−4 (.0236)

Under E(L1|A, L0) = π0 + πA A + πL0 L0 + πAL0 A × L0:
m = 1 (true CDE 2): FPG .0419 (.4303), SG .0419 (.4303), Difference 0 (0)
m = 25 (true CDE 26): FPG 24.8994 (2.3411), SG 24.8994 (2.3411), Difference 0 (0)
m = 50 (true CDE 51): FPG 50.7926 (4.6847), SG 50.7926 (4.6847), Difference 0 (0)

There are two important observations from the simulation results in Table 3.2. First, the flexibility of the specification of E(Y|L0, A, L1, M) matters. Models 4 and 5 both misspecify the true estimating equation, but compared to Model 5, Model 4 is more restrictive, leading to larger biases in both estimators. Note that in terms of qA(·) and qM(·), Model 4 is typical in causal mediation analysis, partly because researchers usually think the two-way interactions A × L1 and M × L1 are unnecessary. On the other hand, if the flexible plug-in g-formula estimator is used, the complete set of two-way interactions in Model 5 becomes typical. Second, when E(L1|A, L0) = π0 + πA A + πL0 L0, the two estimates are close but not the same. Because the CDE is linear in m for both estimators, the difference does not depend on m. When the proper specification E(L1|A, L0) = π0 + πA A + πL0 L0 + πAL0 A × L0 is used, the two estimates become identical. This supports the numerical equivalence claim made in the last section.

3.4.6 Comparison of the Flexible Plug-in g-formula Estimator with the Sequential g-formula Estimator

In cases where the two estimators are identical, the flexible plug-in g-formula estimator inherits everything the sequential g-formula estimator has. However, the difference in estimation procedures grants the former several advantages over the latter in applications. We discuss several points of importance in this respect.

First, when applying the sequential g-formula estimator, one needs to make sure the specifications for qA(·) and qM(·) are compatible with the chosen SNMM. For example, E(Y^{am} − Y^{0m}) = ϕa is not compatible with the qM(·) in Model 2. As qA(·) and qM(·) become more complex, it becomes more difficult to find the corresponding SNMM. The flexible plug-in g-formula estimator avoids this issue, because it starts by assuming the specifications of E(Y|l0, a, l1, m) and E(L1|a, l0), and the model for the CDEs follows naturally. The resulting SNMM can even be nonlinear, depending on the specifications of E(Y|l0, a, l1, m) and E(L1|a, l0).

Second, unless the SNMM is forced to be E(Y^{am} − Y^{0m}) = ϕa, the sequential g-formula estimator does not always depend on "one single parameter" for the CDE (Vansteelandt, 2009). For example, if there is a strong belief that there is a treatment-mediator interaction in qM(·), as in Model 2, then E(Y^{am} − Y^{0m}) = ϕA a + ϕAM a × m is a function of two parameters. The aforementioned compatibility issue will arise if, in order to force the CDE to depend on "one single parameter", the SNMM is assumed to be E(Y^{am} − Y^{0m}) = ϕa. In an incompatible case it is difficult to interpret what effect the single parameter ϕ captures. Without further investigation, all we can say is that it is some average of the CDE evaluated at different values of m, and it is unknown whether it is practically relevant. As a result, a test of the existence of the CDE based on this average is less useful than a test evaluated at different values of m.

Third, there are interesting specifications of E(Y|l0, a, l1, m) that the sequential g-estimation does not allow.
The feasibility of the sequential g-formula estimator hinges on the additive separability between qA(·) and qM(·), and clearly not all specifications of E(Y|l0, a, l1, m) satisfy this restriction. Practically interesting examples include cases where h = log(γ0 + γA a + γM m^2) or h = exp(γ0 + γA a + γM m^2), i.e., where we use the link-function idea of generalized linear models to enrich the specifications of the hk functions. To be fair, however, we also note that there are specifications that the flexible plug-in g-formula estimator cannot handle. For example, if the conditional expectation of Y is nonlinear in L0 and L1, equation (3.7) will not hold, but equation (3.6) may still be satisfied provided that the nonlinearity does not interfere with the additive separability requirement. Extensions of equation (3.7) that are nonlinear in L0 and L1 are possible but complicated, in which case the original parametric g-formula might be a better choice. In sum, the sequential g-formula estimator has the potential of allowing nonlinearity in the confounders but generally not in the treatment and mediator, and for the flexible plug-in g-formula estimator the converse is true. In this sense the two estimators complement each other.

Finally, the sequential g-estimation procedure changes in a nontrivial way as the specifications for qA(·) and qM(·) change, unless one does not care about compatibility and always uses E(Y^{am} − Y^{0m}) = ϕa. The two steps of the procedure must be derived and tailored individually, and they become more complex as we move from Model 1 to Model 5 (see Appendix B). In particular, in Model 5 (or whenever there are M L1 and/or AM L1 interactions), ϕAM in the SNMM cannot be estimated simply by the first-step regression anymore, and the derivation of the formula is not trivial. Moreover, there is an additional parameter ϕAML0 to be estimated. These complications make the sequential g-formula estimator inconvenient to use when one would like to build more flexibility into the model. On the other hand, the estimation procedure for the flexible plug-in g-formula works uniformly across different settings, and there is no derivation by hand because the work is done by the computer.

3.5 Sensitivity Analysis

The untestable sequential ignorability conditions in (3.1) are crucial for any g-formula driven estimator. The second part of the assumption is particularly vulnerable in mediation analyses, since sequential randomization is not always achieved in practice. In this section, we provide a sensitivity analysis for one type of violation of the sequential ignorability conditions. For illustration purposes, this section only shows the sensitivity analysis for Model 1. Similar procedures can be derived for the other models in Table 3.1.

Recall that the exogenous parents ε omitted from the DAG G are assumed to be jointly independent in the NPSEM-IE. Suppose now that εM and εY are correlated; then the original g-formula, and consequently the flexible plug-in g-formula estimator, no longer works. To identify the CDE in the analysis, we use the following conditions to perform a sensitivity analysis.

1. The unobservable U0 does not enter the structural equation of Y, i.e., the arrow from U0 to Y in Figure 1 is deleted.

2. The structural equation for Y is linear in its coefficients. If we use fY(L0, A, L1, M) to denote the structural equation for Y, then in Model 1 the linearity assumption means fY(·) = γ0 + γL0 L0 + γA A + γL1 L1 + γM M + εY, which is stronger than a linear estimating equation.
3. The structural equation for M is additively separable in εM, i.e., M = fM(L0, A, L1) + εM for some function fM.

Given these assumptions, the counterfactual Y^{am} is

Y^{am} = γ0 + γL0 L0 + γA a + γL1 L1 + γM m + εY,

and the CDE is E(Y^{1m} − Y^{0m}) = γA. To consistently estimate γA, let Z = (1, L0, A, L1, M) and γZ = (γ0, γL0, γA, γL1, γM)′, so that we can write Y = ZγZ + εY. Define γZ^{OLS} = [E(Z′Z)]^{−1} E(Z′Y). Because εM and εY are correlated, in general γZ^{OLS} is not equal to γZ. But under conditions 1-3, γZ can be recovered through a bias correction term:

$$\gamma_Z = \gamma_Z^{OLS} - \left[E(Z'Z)\right]^{-1} \begin{pmatrix} \mathbf{0} \\ \sigma_{\varepsilon_M \varepsilon_Y} \end{pmatrix},$$

where σεMεY is the covariance between εM and εY, and 0 is a 4 × 1 vector of zeros. (See Appendix D for the derivation.)

To see how sensitive the flexible plug-in g-formula estimator is to this particular violation of the sequential ignorability conditions, we let σεMεY vary within some range and estimate the CDE accordingly. As a rule of thumb, the covariance between M and Y can provide a reference for the choice of the range for σεMεY, since εM (εY) represents only part of the variation in M (Y) if one believes that the chosen model is sound.

Compared to the sensitivity analysis in Imai, Keele and Tingley (2010), the sensitivity analysis in this section relaxes the structural assumption on the mediator. One major reason this relaxation can be made is that the CDE is the parameter of interest in this paper, rather than the natural direct effect, which is not nonparametrically identified in a model with post-treatment confounders. A sketch of the sample-analogue bias correction is given below.
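The following is a minimal sketch of the bias correction, assuming NumPy; names are illustrative, and the sweep range mirrors the Cov(εM, εY) = ±1000 bounds used in Figure 3.3:

```python
import numpy as np

def bias_corrected_gamma(L0, A, L1, M, Y, sigma_eMeY):
    """Sample analogue of gamma_Z = gamma_Z_OLS - [E(Z'Z)]^{-1}(0', sigma)'.

    Z = (1, L0, A, L1, M); under conditions 1-3 the CDE is the
    coefficient on A in the corrected vector.
    """
    n = len(Y)
    Z = np.column_stack([np.ones(n), L0, A, L1, M])
    ZtZ_inv = np.linalg.inv(Z.T @ Z / n)            # estimate of [E(Z'Z)]^{-1}
    gamma_ols = np.linalg.lstsq(Z, Y, rcond=None)[0]
    correction = np.zeros(5)
    correction[-1] = sigma_eMeY                     # covariance enters the M slot only
    return gamma_ols - ZtZ_inv @ correction

# Sensitivity analysis: sweep the assumed covariance over a plausible range.
# for s in np.linspace(-1000, 1000, 21):
#     cde = bias_corrected_gamma(L0, A, L1, M, Y, s)[2]   # coefficient on A
```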
3.6 An Application

In a longitudinal study by Breslau, Johnson and Lucia (2001) and Breslau, Paneth and Lucia (2004), a random sample of low birthweight (LBW, < 2500 grams) and normal birthweight (NBW, ≥ 2500 grams) infants was selected from two socioeconomically disparate populations in southeast Michigan and followed over 17 years. The goal is to study the long-term impact of LBW on academic achievement. The first assessment occurs when the children are 6 years old, the second when they are 11, and the last when they are 17. A test from the Woodcock-Johnson Psychoeducational Battery-Revised (WJ-R) by Woodcock, Johnson and Mather (1990) is used to measure academic achievement in reading at ages 11 and 17. The WJ-R tests are age standardized with a mean of 100 and a standard deviation of 15.

An earlier paper by Breslau, Johnson and Lucia (2001) found that the reading score for LBW children at age 11 is 3.6 points lower than that of NBW children. However, the difference became trivial and insignificant after adjusting for their IQ, visual-motor integration (VMI) function from Beery (1989), and phonologic awareness (PA) from Rosner and Simon (1971) at age 6. Their conclusion is thus that the deficit in reading score of LBW children at age 11 relative to NBW children is accounted for (mediated) mostly by the deficit in their cognitive skills at age 6. In the follow-up study, Breslau, Paneth and Lucia (2004) reached a similar conclusion for reading score at age 17.

In this application, we estimate the CDE of LBW on reading scores at age 17 when a behavior problem index is used as the mediator. The behavior problem index is constructed by summing up 8 binary indicators for different behavior problems at age 17: ever smoked a cigarette, ever smoked cigarettes daily, ever used alcohol, ever used marijuana, ever used cocaine, ever used crack, ever used any hallucinogen, and ever used inhalants. The index ranges from 0 to 8. Based on Breslau, Johnson and Lucia (2001), Breslau, Paneth and Lucia (2004), and Luo et al. (2014), we use the subject's gender and residence at birth and the mother's IQ, education and marital status as the baseline confounders in L0. For the post-treatment confounders in L1, the subject's IQ, VMI and PA at age 6 are used. After deleting observations that contain missing values, we obtain a sample of 704 complete cases out of the total of 713 subjects who were assessed at age 17.

The five models in Table 3.1 are applied, and the results are presented in Figure 3.3. The results for Model 1 show a negative constant CDE estimate, and the effect is not statistically significant according to the normal-based bootstrap confidence intervals. When the interaction A × M is added, as in Model 2, the CDE estimates show a downward trend as the behavior problem becomes more severe. Specifically, the CDE estimate decreases from −.75 to −7.62 as the mediating behavior problem index changes from 0 to 8. The effect is significant when the number of behavior problems is greater than 1. Model 3 includes the M × L0 interaction, but the result is almost identical to that of Model 1 because the resulting SNMM is not a function of m. Model 4 has both the A × M and M × L0 interactions, and its result is similar to that of Model 2 but with slightly wider confidence intervals. Finally, Model 5 includes the M × L1 interactions on top of Model 4 but leads to similar results.

The downward trends in Models 2, 4 and 5 come mainly from the negative effect of the A × M interaction, although this interaction is not statistically significant. Ignoring other channels, this negative interaction effect essentially indicates that even if the immediate effect of the behavior problem on reading is shut down by controlling the behavior problem, a more severe behavior problem would still exacerbate the negative effect of LBW on reading by altering the mechanism through which LBW exerts its effect.

[Figure 3.3: Controlled direct effect of LBW on reading with bad behavior as mediator, Models 1 to 5. Each panel plots the CDE of LBW on reading against the behavior problem index (0 to 8), together with 95% normal-based confidence intervals and the sensitivity bounds obtained by setting Cov(εM, εY) = ±1000. A = 1 if LBW; sample size 704.]

3.7 Conclusion

In this chapter, we formalize the idea of using partially linear conditional mean models of the outcome and propose a flexible plug-in g-formula estimator for controlled direct effects in causal mediation analysis. Partial linearity of the outcome conditional expectation is of interest because, under this linearity assumption, we can replace the confounders in the conditional mean model for the outcome by properly fitted values of the confounders, which results in a plug-in estimator for the controlled direct effects.
The flexible plug-in g-formula is closed-form and thus can save computation time by avoiding the Monte Carlo integration on which the traditional parametric g-formula usually relies to evaluate integrals. We also show that under certain conditions the flexible plug-in g-formula estimator is numerically equivalent to the sequential g-formula estimator in the literature. Although the sequential g-formula estimator is supposed to be a parametric version of the g-estimation of structural nested mean models, this equivalence result indicates that it can also be viewed as a particular parametric g-formula. Indeed, since the g-estimation of structural nested models is a semiparametric version of the original g-formula, when stronger parametric assumptions are imposed we should expect it to come close to the parametric g-formula. The equivalence result therefore provides new insight into the connections between the parametric g-formula and the g-estimation of structural nested mean models.

The interest in the flexible plug-in g-formula estimator is manifold. First, in the linear case, the flexible plug-in g-formula estimator provides a closed-form expression without introducing assumptions beyond those commonly made in empirical studies. In view of the fact that linear regression is often the first choice for modeling continuous outcomes in the parametric g-formula, the flexible plug-in g-formula actually imposes no stronger parametric assumptions than some of the parametric g-formulae that already exist in the literature. Second, the flexible plug-in g-formula estimator connects to the sequential g-estimation but may be more straightforward to use. The two estimators also complement each other in giving practitioners choices of reasonable specifications in their context. For example, if there is reason to believe the functional form for Y should be nonlinear in A and M, we can use the flexible plug-in estimator; on the other hand, if the nonlinearity lies in the confounding factors, we can use the sequential g-formula estimator. In any case, if the results from both methods are similar, the estimates are more likely to be robust.

APPENDICES

APPENDIX A

PROOFS AND ALGEBRA ON THE G-FORMULA

A.1 A direct proof of the g-formula for f_{Y^{am}}(y)

The proof of the g-formula has been given in a series of papers by Robins and his colleagues. Here we repeat it for the continuous outcome case for easy reference. We adopt the convention of using upper case letters for random variables and lower case letters for realizations. Under sequential ignorability and consistency,

$$f_{Y^{am}}(y) = \iint f_{Y|L_0,A,L_1,M}(y|l_0,a,l_1,m)\, f_{L_1|L_0,A}(l_1|l_0,a)\, f_{L_0}(l_0)\, dl_1\, dl_0,$$

where, e.g., f_{Y|L0,A,L1,M}(y|l0, a, l1, m) is shorthand notation for the conditional density function of Y given (L0, A, L1, M).

Proof.
It can be shown that

$$\begin{aligned}
f_{Y^{am}}(y) &= \int f_{Y^{am}|L_0}(y|l_0)\, f_{L_0}(l_0)\, dl_0 \\
&= \int f_{Y^{am}|L_0,A}(y|l_0,a)\, f_{L_0}(l_0)\, dl_0 \\
&= \iint f_{Y^{am}|L_0,A,L_1}(y|l_0,a,l_1)\, f_{L_1|L_0,A}(l_1|l_0,a)\, f_{L_0}(l_0)\, dl_1\, dl_0 \\
&= \iint f_{Y^{am}|L_0,A,L_1,M}(y|l_0,a,l_1,m)\, f_{L_1|L_0,A}(l_1|l_0,a)\, f_{L_0}(l_0)\, dl_1\, dl_0 \\
&= \iint f_{Y|L_0,A,L_1,M}(y|l_0,a,l_1,m)\, f_{L_1|L_0,A}(l_1|l_0,a)\, f_{L_0}(l_0)\, dl_1\, dl_0,
\end{aligned}$$

where the first and the third equalities use the law of total probability, the second uses Y^{am} ⊥ A | L0, the fourth uses Y^{am} ⊥ M | (L0, A = a, L1), and the last uses the consistency axiom.

Remark 6. Based on the g-formula for f_{Y^{am}}(y), it follows that the g-formula for E(Y^{am}) is

$$\begin{aligned}
E(Y^{am}) &= \int y \iint f_{Y|L_0,A,L_1,M}(y|l_0,a,l_1,m)\, f_{L_1|L_0,A}(l_1|l_0,a)\, f_{L_0}(l_0)\, dl_1\, dl_0\, dy \\
&= \iiint y\, f_{Y|L_0,A,L_1,M}(y|l_0,a,l_1,m)\, f_{L_1|L_0,A}(l_1|l_0,a)\, f_{L_0}(l_0)\, dy\, dl_1\, dl_0 \\
&= \iint \left[\int y\, f_{Y|L_0,A,L_1,M}(y|l_0,a,l_1,m)\, dy\right] f_{L_1|L_0,A}(l_1|l_0,a)\, f_{L_0}(l_0)\, dl_1\, dl_0 \\
&= \iint E(Y|l_0,a,l_1,m)\, f_{L_1|L_0,A}(l_1|l_0,a)\, f_{L_0}(l_0)\, dl_1\, dl_0.
\end{aligned}$$

A.2 The flexible plug-in g-formula for E(Y^{am})

If sequential ignorability and consistency hold, and the outcome conditional mean is given by E(Y|l0, a, l1, m) = h0(a, m) l0 + h1(a, m) l1 + h(a, m), then

E(Y^{am}) = h0(a, m) E(L0) + h1(a, m) E[E(L1|L0, A = a)] + h(a, m).

Proof. Under the assumptions, we get

$$\begin{aligned}
E(Y^{am}) &= \iiint y\, f(y|l_0,a,l_1,m)\, f(l_1|l_0,a)\, f(l_0)\, dy\, dl_1\, dl_0 \\
&= \iint \left[\int y\, f(y|l_0,a,l_1,m)\, dy\right] f(l_1|l_0,a)\, f(l_0)\, dl_1\, dl_0 \\
&= \iint E(Y|l_0,a,l_1,m)\, f(l_1|l_0,a)\, f(l_0)\, dl_1\, dl_0 \\
&= \int \left[\int \big(h_0(a,m)\, l_0 + h_1(a,m)\, l_1 + h(a,m)\big)\, f(l_1|l_0,a)\, dl_1\right] f(l_0)\, dl_0 \\
&= \int \big(h_0(a,m)\, l_0 + h_1(a,m)\, E(L_1|l_0,a) + h(a,m)\big)\, f(l_0)\, dl_0 \\
&= h_0(a,m)\, E(L_0) + h_1(a,m) \int E(L_1|l_0,a)\, f(l_0)\, dl_0 + h(a,m) \\
&= h_0(a,m)\, E(L_0) + h_1(a,m)\, E[E(L_1|L_0, A = a)] + h(a,m).
\end{aligned}$$

Remark 7. Note that E[E(L1|L0, A = a)] ≠ E(L1|A = a) in general. For example, let L1 = exp(L0) + A. Then E(L1|L0, A = a) = exp(L0) + a, and the LHS is E[exp(L0) + a] = E[exp(L0)] + a, whereas the RHS is E[exp(L0) + A | A = a] = E[exp(L0) | A = a] + a. If A ⊥ L0, the equality holds. In general, let the structural model for L1 be L1 = f(L0, A, εL1). Then the LHS is E{E[f(L0, a, εL1) | L0]} and the RHS is E[f(L0, a, εL1) | A = a]. If we assume εL1 is independent of (L0, A), the LHS becomes E[f(L0, a, εL1)]. If in addition A ⊥ L0, the RHS also becomes E[f(L0, a, εL1)] and the equality holds. A small numerical check of this remark is sketched below.
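A quick Monte Carlo check of Remark 7, using the exp(L0) + A example with A dependent on L0 (a sketch; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
L0 = rng.standard_normal(n)
A = (L0 + rng.standard_normal(n) > 0).astype(int)   # A depends on L0
L1 = np.exp(L0) + A

a = 1
lhs = np.exp(L0).mean() + a          # E[E(L1|L0, A=a)] = E[exp(L0)] + a
rhs = np.exp(L0[A == a]).mean() + a  # E(L1|A=a) = E[exp(L0)|A=a] + a
print(lhs, rhs)  # they differ: conditioning on A=1 selects larger L0
```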
APPENDIX B

PROOFS AND ALGEBRA ON THE SEQUENTIAL G-ESTIMATOR

B.1 Validity of the second step of the sequential g-estimator

The proof of the validity of the sequential g-estimator in the Appendix of Vansteelandt (2009) needs to be extended because L0 is included in our analysis. Specifically, we want to show that, in the presence of L0, and under additive separability, qM(l0, a, l1, 0; γ) = 0, and sequential ignorability,

E[Y − qM(L0, A, L1, M; γ) | L0, A] = E(Y^{A0} | L0),

i.e., for any (l0, a),

E[Y − qM(L0, A, L1, M; γ) | L0 = l0, A = a] = E(Y^{a0} | L0 = l0).   (B.1)

Proof. Under the assumptions, we have

E[Y − qM(l0, a, l1, m; γ) | L0 = l0, A = a, L1 = l1, M = m] = qA(l0, a, l1; γ).

The equality holds for any m. Therefore

E[Y − qM(l0, a, l1, M; γ) | L0 = l0, A = a, L1 = l1, M] = qA(l0, a, l1; γ).

Taking expectations of both sides conditional on (L0 = l0, A = a, L1 = l1) and using iterated expectations, we get

E[Y − qM(l0, a, l1, M; γ) | L0 = l0, A = a, L1 = l1] = qA(l0, a, l1; γ).

Now, sequential ignorability implies

E(Y^{a0} | L0 = l0, A = a, L1 = l1)
= E(Y^{a0} | L0 = l0, A = a, L1 = l1, M = 0)
= E(Y | L0 = l0, A = a, L1 = l1, M = 0)
= E[qA(L0, A, L1; γ) + qM(L0, A, L1, M; γ) | L0 = l0, A = a, L1 = l1, M = 0]
= E[qA(l0, a, l1; γ) + qM(l0, a, l1, 0; γ) | L0 = l0, A = a, L1 = l1, M = 0]
= qA(l0, a, l1; γ).

The equality holds for any l1. Therefore

E[Y − qM(l0, a, L1, M; γ) | L0 = l0, A = a, L1] = E(Y^{a0} | L0 = l0, A = a, L1).

Taking expectations of both sides conditional on (L0 = l0, A = a), we get

E[Y − qM(l0, a, L1, M; γ) | L0 = l0, A = a] = E(Y^{a0} | L0 = l0, A = a).

Lastly, noticing that Y^{a0} ⊥ A | L0, we have

E[Y − qM(l0, a, L1, M; γ) | L0 = l0, A = a] = E(Y^{a0} | L0 = l0).

We provide an alternative proof below.

Proof. (alternative proof) First, we show that

E[Y − qM(L0, A, L1, M; γ) | L0, A, L1, M] = E(Y^{A0} | L0, A, L1).

Specifically,

E[Y − qM(L0, A, L1, M; γ) | L0, A, L1, M]
= qA(L0, A, L1; γ)
= E[qA(L0, A, L1; γ) | L0, A, L1, M = 0]
= E[qA(L0, A, L1; γ) + qM(L0, A, L1, 0; γ) | L0, A, L1, M = 0]
= E[qA(L0, A, L1; γ) + qM(L0, A, L1, M; γ) | L0, A, L1, M = 0]
= E[Y | L0, A, L1, M = 0]
= E(Y^{A0} | L0, A, L1, M = 0)
= E(Y^{A0} | L0, A, L1),

where the first equality holds by the definition of qA(L0, A, L1; γ), the second holds since qA is a function of (L0, A, L1), the third holds because qM(L0, A, L1, 0; γ) = 0, the fourth and fifth are simply rewriting, the sixth holds by consistency, and the last by Y^{am} ⊥ M | L0, A, L1. Then, taking expectations of both sides conditional on (L0, A), we get

E[Y − qM(L0, A, L1, M; γ) | L0, A] = E(Y^{A0} | L0, A).

Lastly, by Y^{am} ⊥ A | L0, we have

E[Y − qM(L0, A, L1, M; γ) | L0, A] = E(Y^{A0} | L0).

B.2 Estimation procedures and interpretations for the sequential g-estimator in Model 1 and in a general setup with up to three-way interactions in E(Y|L0, A, L1, M)

We discuss Model 1 and Model 5. Model 1 is the simplest specification, so it is used as an example to illustrate the idea. Model 5 is the most general specification, so the discussion of Model 5 shows that the conclusion applies to all five models.

B.2.1 Model 1 (no interaction)

Model assumptions

1. The NPSEM-IE associated with DAG G. (Thus consistency and sequential ignorability hold.)

2. Structural nested mean model:

E(Y^{am} − Y^{0m} | l0) = ϕA a.   (B.2)

3. Conditional mean of the outcome:

E(Y | L0, A, L1, M) = γ0 + γL0 L0 + γA A + γL1 L1 + γM M,   (B.3)

so that

qA(L0, A, L1; γ) = γ0 + γL0 L0 + γA A + γL1 L1,   (B.4)
qM(L0, A, L1, M; γ) = γM M.

4. Conditional mean of the post-treatment confounder at A = 0:

f(l0, 0) ≡ E(L1 | L0 = l0, A = 0) = π0 + πL0 l0.   (B.5)

Estimation procedure

1. Regress Y on (1, L0, A, L1, M) and obtain the OLS estimator γ̂M of γM. Generate Ŷ−M ≡ Y − γ̂M M.

2. Regress Ŷ−M on (1, L0, A). Denote by ϕ̂A the OLS estimator of the coefficient on A. Then ϕ̂A is a consistent estimator of ϕA. Hence, ĈDE(m) = ϕ̂A. A minimal sketch of these two steps is given below.
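For concreteness, the two steps for Model 1 could be coded as follows (a sketch, assuming NumPy; the bootstrap loop for standard errors is omitted):

```python
import numpy as np

def sequential_g_model1(L0, A, L1, M, Y):
    """Two-step sequential g-estimator of phi_A in Model 1 (Appendix B.2.1)."""
    n = len(Y)
    one = np.ones(n)
    # Step 1: regress Y on (1, L0, A, L1, M); keep the coefficient on M.
    Z = np.column_stack([one, L0, A, L1, M])
    gamma = np.linalg.lstsq(Z, Y, rcond=None)[0]
    Y_minus_M = Y - gamma[4] * M        # remove the mediator effect
    # Step 2: regress Y_minus_M on (1, L0, A); the coefficient on A is phi_A.
    X1 = np.column_stack([one, L0, A])
    phi = np.linalg.lstsq(X1, Y_minus_M, rcond=None)[0]
    return phi[2]                        # CDE(m) = phi_A for every m
```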
Validity of the second-step regression (i.e., the consistency of ϕ̂A for ϕA)

Proof. (i) First, we show that

E(Y^{00} | L0 = l0) = γ0 + γL0 l0 + γL1 f(l0, 0).   (B.6)

Under sequential ignorability, consistency, and the specification in (B.3), we have

E(Y^{a0} | L0 = l0, A = a, L1 = l1)
= E(Y^{a0} | L0 = l0, A = a, L1 = l1, M = 0)
= E(Y | L0 = l0, A = a, L1 = l1, M = 0)
= γ0 + γA a + γL0 l0 + γL1 l1.

The equality holds for any (l0, a, l1), and therefore we can write

E(Y^{A0} | L0, A, L1) = γ0 + γA A + γL0 L0 + γL1 L1.

Taking expectations of both sides conditional on (L0, A), we get

E(Y^{A0} | L0, A) = γ0 + γA A + γL0 L0 + γL1 E(L1 | L0, A) = γ0 + γA A + γL0 L0 + γL1 f(L0, A),

where f(l0, a) ≡ E(L1 | L0 = l0, A = a). That is, for any (l0, a),

E(Y^{a0} | L0 = l0, A = a) = γ0 + γA a + γL0 l0 + γL1 f(l0, a).

Then the first part of sequential ignorability implies

E(Y^{a0} | L0 = l0) = γ0 + γA a + γL0 l0 + γL1 f(l0, a).   (B.7)

Setting a = 0, we get E(Y^{00} | L0 = l0) = γ0 + γL0 l0 + γL1 f(l0, 0).

(ii) Second, we show that

E[Y − qM(L0, A, L1, M; γ) | L0 = l0, A = a] = ϕA a + γ0 + γL0 l0 + γL1 f(l0, 0).   (B.8)

Setting m = 0 in (B.2) gives E(Y^{a0} − Y^{00} | l0) = ϕA a. Then (B.1) and (B.6) imply

E[Y − qM(L0, A, L1, M; γ) | L0 = l0, A = a]
= E(Y^{a0} | L0 = l0)
= E(Y^{a0} − Y^{00} | L0 = l0) + E(Y^{00} | L0 = l0)
= ϕA a + γ0 + γL0 l0 + γL1 f(l0, 0).

(iii) Lastly, given (B.5), we have

E[Y − qM(L0, A, L1, M; γ) | L0 = l0, A = a] = ϕA a + γ̃0 + γ̃L0 l0,   (B.9)

where γ̃0 = γ0 + γL1 π0 and γ̃L0 = γL0 + γL1 πL0. Therefore, the OLS estimator ϕ̂A in the second-step regression is consistent for ϕA.

Necessity and sufficiency of (B.5), given all the other model assumptions and that L0 is not binary

Proof. The sufficiency of (B.5), given all the other model assumptions, has been shown by (B.9). To show its necessity, first rewrite (B.8) as

Y−M = ϕA A + γ̈0 + γ̈L0 L0 + v,   (B.10)

where

Y−M ≡ Y − qM(L0, A, L1, M; γ) = Y − γM M,
γ̈0 ≡ γ0 + γL1 E[f(L0, 0) − L0],
γ̈L0 ≡ γL0 + γL1,
f̄(L0, 0) ≡ [f(L0, 0) − L0] − E[f(L0, 0) − L0],
ξ ≡ Y−M − E(Y−M | A, L0),
v ≡ γL1 f̄(L0, 0) + ξ.

Since L0 is not binary, equation (B.10) indicates that f(l0, 0) must be linear in l0 for ϕ̂A to be consistent for ϕA. We prove this by contradiction. If f(l0, 0) is nonlinear in l0, then in general

Cov(L0, v) = E(L0 v) = γL1 E[L0 f̄(L0, 0)] ≠ 0,

unless some particular conditions hold by fluke. But then a regression of Y−M on (1, A, L0) would yield inconsistent estimators for all coefficients.

B.2.2 General model with a three-way interaction in E(Y|L0, A, L1, M)

Model assumptions

1. The NPSEM-IE associated with DAG G. (Thus consistency and sequential ignorability hold.)

2. Structural nested mean model:

E(Y^{am} − Y^{0m} | l0) = ϕA a + ϕAM am + ϕAL0 al0 + ϕAML0 aml0.   (B.11)

3. Conditional mean of the outcome:

E(Y | L0, A, L1, M) = γ0 + γA A + γM M + γAM AM + γL0 L0 + γAL0 AL0 + γML0 ML0 + γAML0 AML0 + γL1 L1 + γAL1 AL1 + γML1 ML1 + γAML1 AML1,   (B.12)

so that

qA(L0, A, L1; γ) = γ0 + γA A + γL0 L0 + γAL0 AL0 + γL1 L1 + γAL1 AL1,
qM(L0, A, L1, M; γ) = γM M + γAM AM + γML0 ML0 + γAML0 AML0 + γML1 ML1 + γAML1 AML1.

4. Conditional mean of the post-treatment confounder:

f(l0, a) ≡ E(L1 | l0, a) = π0 + πL0 l0 + πA a + πAL0 al0.   (B.13)

Estimation procedure

1. Regress Y on (1, A, M, AM, L0, AL0, ML0, AML0, L1, AL1, ML1, AML1) and obtain the OLS estimator γ̂ of γ. Generate

Ŷ−M ≡ Y − γ̂M M − γ̂AM AM − γ̂ML0 ML0 − γ̂AML0 AML0 − γ̂ML1 ML1 − γ̂AML1 AML1.

2. Regress Ŷ−M on (1, L0, A, AL0) and obtain the OLS estimators ϕ̂A and ϕ̂AL0 of the coefficients on A and AL0, respectively.

3. Regress L1 on (1, L0, A, AL0) and obtain the OLS estimator π̂ of π.

Then the sequential g-formula estimator of the controlled direct effect at M = m is

Ê(Y^{1m} − Y^{0m}) = ϕ̂A + ϕ̂AM m + ϕ̂AL0 L̄0 + ϕ̂AML0 m L̄0,

where

ϕ̂AM = γ̂AM + γ̂ML1 π̂A + γ̂AML1 (π̂0 + π̂A),
ϕ̂AML0 = γ̂AML0 + γ̂ML1 π̂AL0 + γ̂AML1 (π̂L0 + π̂AL0).

A sketch of the three-step procedure is given below.
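The three steps mirror the Model 1 sketch; what is new is the combination of γ̂ and π̂ in the third step. A minimal sketch under the regressor orderings above (NumPy assumed; names are illustrative):

```python
import numpy as np

def sequential_g_general(L0, A, L1, M, Y, m):
    """Three-step sequential g-estimator of CDE(m) in Appendix B.2.2."""
    n = len(Y)
    one = np.ones(n)
    # Step 1: full outcome regression with all interactions up to AML1.
    Z = np.column_stack([one, A, M, A*M, L0, A*L0, M*L0, A*M*L0,
                         L1, A*L1, M*L1, A*M*L1])
    g = np.linalg.lstsq(Z, Y, rcond=None)[0]
    Y_mm = Y - (g[2]*M + g[3]*A*M + g[6]*M*L0 + g[7]*A*M*L0
                + g[10]*M*L1 + g[11]*A*M*L1)
    # Step 2: regress Y_mm on (1, L0, A, AL0) for phi_A and phi_AL0.
    X1 = np.column_stack([one, L0, A, A*L0])
    b = np.linalg.lstsq(X1, Y_mm, rcond=None)[0]
    phi_A, phi_AL0 = b[2], b[3]
    # Step 3: regress L1 on (1, L0, A, AL0) and combine with gamma-hat.
    p = np.linalg.lstsq(X1, L1, rcond=None)[0]       # (pi0, piL0, piA, piAL0)
    phi_AM = g[3] + g[10]*p[2] + g[11]*(p[0] + p[2])
    phi_AML0 = g[7] + g[10]*p[3] + g[11]*(p[1] + p[3])
    L0bar = L0.mean()
    return phi_A + phi_AM*m + phi_AL0*L0bar + phi_AML0*m*L0bar
```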
Validity of the second step (i.e., the consistency of ϕ̂A and ϕ̂AL0 for ϕA and ϕAL0)

Proof. (i) First, we show that

E(Y^{00} | L0 = l0) = γ0 + γL0 l0 + γL1 f(l0, 0),   (B.14)

which is exactly the same as (B.6). Under sequential ignorability, consistency, and the specification in (B.12), and using an argument similar to that for (B.7), we have

E(Y^{a0} | L0 = l0, A = a, L1 = l1)
= E(Y^{a0} | L0 = l0, A = a, L1 = l1, M = 0)
= E(Y | L0 = l0, A = a, L1 = l1, M = 0)
= γ0 + γA a + γL0 l0 + γAL0 al0 + γL1 l1 + γAL1 al1.

The equality holds for any (l0, a, l1), and therefore we can write

E(Y^{A0} | L0, A, L1) = γ0 + γA A + γL0 L0 + γAL0 AL0 + γL1 L1 + γAL1 AL1.   (B.15)

Taking expectations of both sides conditional on (L0, A), we get

E(Y^{A0} | L0, A)
= γ0 + γA A + γL0 L0 + γAL0 AL0 + γL1 E(L1 | L0, A) + γAL1 A E(L1 | L0, A)
= γ0 + γA A + γL0 L0 + γAL0 AL0 + γL1 f(L0, A) + γAL1 A f(L0, A),

i.e., for any (l0, a),

E(Y^{a0} | L0 = l0, A = a) = γ0 + γA a + γL0 l0 + γAL0 al0 + γL1 f(l0, a) + γAL1 a f(l0, a).

Then the first part of sequential ignorability implies

E(Y^{a0} | L0 = l0) = γ0 + γA a + γL0 l0 + γAL0 al0 + γL1 f(l0, a) + γAL1 a f(l0, a).

Setting a = 0, we get E(Y^{00} | L0 = l0) = γ0 + γL0 l0 + γL1 f(l0, 0).

(ii) Second, we show that

E[Y − qM(L0, A, L1, M; γ) | L0 = l0, A = a] = ϕA a + ϕAL0 al0 + γ0 + γL0 l0 + γL1 f(l0, 0).   (B.16)

Setting m = 0 in (B.11) gives E(Y^{a0} − Y^{00} | l0) = ϕA a + ϕAL0 al0. Then (B.1) and (B.14) imply

E[Y − qM(L0, A, L1, M; γ) | L0 = l0, A = a]
= E(Y^{a0} | L0 = l0)
= E(Y^{a0} − Y^{00} | L0 = l0) + E(Y^{00} | L0 = l0)
= ϕA a + ϕAL0 al0 + γ0 + γL0 l0 + γL1 f(l0, 0).

(iii) Lastly, given (B.13), we have

E[Y − qM(L0, A, L1, M; γ) | L0 = l0, A = a] = ϕA a + ϕAL0 al0 + γ̃0 + γ̃L0 l0,   (B.17)

where γ̃0 = γ0 + γL1 π0 and γ̃L0 = γL0 + γL1 πL0. Therefore, in the second-step regression the OLS estimators ϕ̂A and ϕ̂AL0 are consistent for ϕA and ϕAL0, respectively.

Necessity and sufficiency of (B.13) for the validity of the second step, given all the other model assumptions and that L0 is not binary: the same argument as in B.2.1 can be used.

Validity of the third step (i.e., the consistency of ϕ̂AM and ϕ̂AML0 for ϕAM and ϕAML0)

Proof. Given equation (B.12) and Y^{am} ⊥ M | L0, A, L1, we have

E(Y^{am} | L0, a, L1) = γ0 + γA a + γM m + γAM am + γL0 L0 + γAL0 aL0 + γML0 mL0 + γAML0 amL0 + γL1 L1 + γAL1 aL1 + γML1 mL1 + γAML1 amL1.

Taking expectations conditional on (L0, a), and noticing Y^{am} ⊥ A | L0, we have

E(Y^{am} | L0) = γ0 + γA a + γM m + γAM am + γL0 L0 + γAL0 aL0 + γML0 mL0 + γAML0 amL0 + γL1 E(L1 | L0, a) + γAL1 a E(L1 | L0, a) + γML1 m E(L1 | L0, a) + γAML1 am E(L1 | L0, a).

Hence,

E(Y^{am} − Y^{0m} | L0) = γA a + γAM am + γAL0 aL0 + γAML0 amL0 + γL1 [E(L1 | L0, a) − E(L1 | L0, a = 0)] + γAL1 a E(L1 | L0, a) + γML1 m [E(L1 | L0, a) − E(L1 | L0, a = 0)] + γAML1 am E(L1 | L0, a).   (B.18)

Assuming E(L1 | L0, A) = π0 + πL0 L0 + πA A + πAL0 AL0, and using a^2 = a for binary a, we have

E(Y^{am} − Y^{0m} | L0)
= γA a + γAM am + γAL0 aL0 + γAML0 amL0 + γL1 (πA a + πAL0 aL0) + γAL1 a (π0 + πL0 L0 + πA a + πAL0 aL0) + γML1 m (πA a + πAL0 aL0) + γAML1 am (π0 + πL0 L0 + πA a + πAL0 aL0)
= [γA + γL1 πA + γAL1 (π0 + πA)] a + [γAM + γML1 πA + γAML1 (π0 + πA)] am + [γAL0 + γL1 πAL0 + γAL1 (πL0 + πAL0)] aL0 + [γAML0 + γML1 πAL0 + γAML1 (πL0 + πAL0)] amL0.
Comparing the last equation with (B.11), we see that

ϕA = γA + γL1 πA + γAL1 (π0 + πA),   (B.19)
ϕAM = γAM + γML1 πA + γAML1 (π0 + πA),   (B.20)
ϕAL0 = γAL0 + γL1 πAL0 + γAL1 (πL0 + πAL0),   (B.21)
ϕAML0 = γAML0 + γML1 πAL0 + γAML1 (πL0 + πAL0).   (B.22)

Therefore, we can estimate ϕAM and ϕAML0 consistently by

ϕ̂AM = γ̂AM + γ̂ML1 π̂A + γ̂AML1 (π̂0 + π̂A),
ϕ̂AML0 = γ̂AML0 + γ̂ML1 π̂AL0 + γ̂AML1 (π̂L0 + π̂AL0),

where γ̂ is from the first-step regression and π̂ is from the third-step regression.

APPENDIX C

NUMERICAL EQUIVALENCE

We prove the numerical equivalence for Model 5, which is the most flexible of the five models considered. The proofs for the other four models can be obtained in a similar fashion. To present the results rigorously, we first need to express the two estimators using the same set of notation. Let X1 = (1, A, L0, AL0), X2 = (L1, AL1), X = (X1, X2), W = (M, AM, ML0, ML1), Z = (X, W), and e = Y − E(Y|Z). Let γ = (γ0, γA, γL0, γAL0, γL1, γAL1, γM, γAM, γML0, γML1)′. Assume the sample size is n, and stack all observations in the matrices denoted by the corresponding bold letters. Under this notation, Model 5 says that Y = Zγ + e. Denote the OLS estimator of γ by γ̂. Then

Y = Zγ̂ + ê, where Z′ê = 0.

Both the flexible plug-in estimator and the sequential g-formula estimator use the above linear regression in the first step. In addition, the sequential g-formula estimator generates the fitted outcome

Ŷ−M = Y − γ̂M M − γ̂AM AM − γ̂ML0 ML0 − γ̂ML1 ML1 = Y − Wγ̂W,

which is free of the effect of M. Let β1 = (β0, βA, βL0, βAL0)′ and β2 = (βL1, βAL1)′, combined in β = (β1′, β2′)′. Let π = (π0, πA, πL0, πAL0)′. Define ε = Ŷ−M − E(Ŷ−M|X), u = Ŷ−M − E(Ŷ−M|X1), and v = L1 − E(L1|X1). Then

Ŷ−M = Xβ + ε,
Ŷ−M = X1β1 + u,
L1 = X1π + v.

Denote the linear projection coefficients in the above three equations by β̂, β̃1 and π̂, respectively. Then

Ŷ−M = Xβ̂ + ε̂,  X′ε̂ = 0,   (C.1)
Ŷ−M = X1β̃1 + ũ,  X1′ũ = 0,   (C.2)
L1 = X1π̂ + v̂,  X1′v̂ = 0.   (C.3)

Note that (C.2) represents the second-step regression of the sequential g-estimation. Finally, let ϕ = (ϕA, ϕAL0, ϕAM, ϕAML0)′, and let ϕ̂ denote the sequential g-formula estimator and ϕ̌ the flexible plug-in estimator. The two estimators are the same in estimating ϕAM and ϕAML0:

ϕ̂AM = ϕ̌AM = γ̂AM + γ̂ML1 π̂A,   (C.4)
ϕ̂AML0 = ϕ̌AML0 = γ̂ML1 π̂AL0,   (C.5)

but they differ in how they estimate ϕA and ϕAL0:

ϕ̂A = β̃A,   (C.6)
ϕ̂AL0 = β̃AL0,   (C.7)
ϕ̌A = γ̂A + γ̂L1 π̂A + γ̂AL1 (π̂0 + π̂A),   (C.8)
ϕ̌AL0 = γ̂AL0 + γ̂L1 π̂AL0 + γ̂AL1 (π̂L0 + π̂AL0).   (C.9)

Theorem 8. (Numerical equivalence in Model 5) The flexible plug-in g-formula and the sequential g-estimation are numerically equivalent in Model 5. That is,

ϕ̂A = ϕ̌A,  ϕ̂AL0 = ϕ̌AL0.

Proof. Regarding the linear projection coefficients, the first fact is that

β̂ = γ̂X,   (C.10)

where γ̂X is the sub-vector of γ̂ associated with X. To see why, let γ̂W be the sub-vector of γ̂ associated with W. Then the regression of Ŷ−M on X can equivalently be cast as the restricted regression of Y on Z = (X, W) with the constraint γW = γ̂W, and the latter restricted regression is known to yield γ̂X.

The second fact is that the orthogonality condition following each of the three equations (C.1), (C.2) and (C.3) is definitional for the corresponding linear projection coefficient vector. Note that the rank condition always holds in practice except by fluke.
Therefore, given a data set, β̂, β̃1 and π̂ are uniquely defined by their respective orthogonality conditions.

Now we are ready to prove the equivalence. Plugging equation (C.3) into (C.1), we have

Ŷ−M = Xβ̂ + ε̂
= β̂0 + β̂A A + β̂L0 L0 + β̂AL0 AL0 + β̂L1 L1 + β̂AL1 AL1 + ε̂
= β̂0 + β̂A A + β̂L0 L0 + β̂AL0 AL0 + β̂L1 (π̂0 + π̂A A + π̂L0 L0 + π̂AL0 AL0 + v̂) + β̂AL1 A (π̂0 + π̂A A + π̂L0 L0 + π̂AL0 AL0 + v̂) + ε̂
= (β̂0 + β̂L1 π̂0) + [β̂A + β̂L1 π̂A + β̂AL1 (π̂0 + π̂A)] A + (β̂L0 + β̂L1 π̂L0) L0 + [β̂AL0 + β̂L1 π̂AL0 + β̂AL1 (π̂L0 + π̂AL0)] AL0 + (β̂L1 + β̂AL1 A) v̂ + ε̂
= β̂0* + β̂A* A + β̂L0* L0 + β̂AL0* AL0 + ε̂*
= X1β̂1* + ε̂*,

where we define

β̂0* = β̂0 + β̂L1 π̂0,
β̂A* = β̂A + β̂L1 π̂A + β̂AL1 (π̂0 + π̂A),   (C.11)
β̂L0* = β̂L0 + β̂L1 π̂L0,
β̂AL0* = β̂AL0 + β̂L1 π̂AL0 + β̂AL1 (π̂L0 + π̂AL0),   (C.12)
β̂1* = (β̂0*, β̂A*, β̂L0*, β̂AL0*)′,
ε̂* = (β̂L1 + β̂AL1 A) v̂ + ε̂.

But (C.1) and (C.3), together with the fact that A is binary (so that A^2 = A and X1′(Av̂) = 0), imply that

X1′ε̂* = β̂L1 X1′v̂ + β̂AL1 X1′(Av̂) + X1′ε̂ = 0.   (C.13)

Hence, by definition and uniqueness, β̃1 = β̂1*. In particular, we have

β̃A = β̂A*,  β̃AL0 = β̂AL0*.

It follows immediately from (C.6), (C.7), (C.10), (C.11) and (C.12) that

ϕ̂A = ϕ̌A,  ϕ̂AL0 = ϕ̌AL0.

Remark 1: The proof that X1′ε̂* = 0 shows why the regressors in the model for L1 must be the same as those in the second-step regression.

Remark 2: If the statistical model for E(L1|A, L0) is of the same level of flexibility as the second-step regression of the SG estimator, we say the model is properly chosen. For example, in Model 1, the second step of the SG estimator regresses Ŷ−M,i on (1, Ai, L0i), and the proper model for E(L1|A, L0) is E(L1|A, L0) = πC + πA A + πL0 L0. In Model 5, the second step of the SG estimator regresses Ŷ−M,i on (1, Ai, L0i, Ai L0i), and the proper model for E(L1|A, L0) should be E(L1|A, L0) = πC + πA A + πL0 L0 + πAL0 AL0. If the second-step regression of the SG estimation contains AL0 but the linear model for E(L1|A, L0) excludes AL0, then the numerical equivalence no longer holds. The intuition is that when πAL0 is forced to be zero, the estimates of π0, πA and πL0 are altered; they can be viewed as restricted estimates. The math is explained in (C.13): X1′v̂ is no longer zero.

APPENDIX D

SENSITIVITY ANALYSIS

In this appendix we provide the derivation details for the sensitivity analysis in Section 3.5 of the main text. Specifically, we want to show that

$$\gamma_Z = \gamma_Z^{OLS} - \left[E(Z'Z)\right]^{-1} \begin{pmatrix} \mathbf{0} \\ \sigma_{\varepsilon_M \varepsilon_Y} \end{pmatrix}.$$

Proof. Recall that

Y = ZγZ + εY,   (D.1)

where Z = (1, L0, A, L1, M) and σεMεY = Cov(εM, εY) ≠ 0. Premultiplying both sides of equation (D.1) by Z′ and taking expectations, we have

E(Z′Y) = E(Z′Z)γZ + E(Z′εY).

Assuming E(Z′Z) has full rank,

γZ = [E(Z′Z)]^{−1} E(Z′Y) − [E(Z′Z)]^{−1} E(Z′εY).

Now notice that the three additional assumptions in Section 3.5 imply

E[(1, L0, A, L1)′ εY] = 0,   (D.2)

where 0 is a 4 × 1 zero vector, while E(M εY) = E{[fM(L0, A, L1) + εM] εY} = σεMεY, because εY is independent of (L0, A, L1) and has mean zero. Meanwhile, recall that γZ^{OLS} = [E(Z′Z)]^{−1} E(Z′Y). It follows immediately that

$$\gamma_Z = \gamma_Z^{OLS} - \left[E(Z'Z)\right]^{-1} \begin{pmatrix} \mathbf{0} \\ \sigma_{\varepsilon_M \varepsilon_Y} \end{pmatrix}.$$

BIBLIOGRAPHY

Beery, Keith E. 1989. Developmental Test of Visual-Motor Integration: Administration and Scoring Manual. Cleveland, OH: Modern Curriculum Press.

Breslau, Naomi, Eric O. Johnson, and Victoria C. Lucia. 2001. "Academic achievement of low birthweight children at age 11: the role of cognitive abilities at school entry." Journal of Abnormal Child Psychology, 29(4): 273–279.
Breslau, Naomi, Nigel S. Paneth, and Victoria C. Lucia. 2004. "The lingering academic deficits of low birth weight children." Pediatrics, 114(4): 1035–1040.

Danaei, Goodarz, An Pan, Frank B. Hu, and Miguel A. Hernán. 2013. "Hypothetical midlife interventions in women and risk of type 2 diabetes." Epidemiology, 24(1): 122–128.

Daniel, Rhian M., Bianca L. De Stavola, and Simon N. Cousens. 2011. "gformula: Estimating causal effects in the presence of time-varying confounding or mediation using the g-computation formula." Stata Journal, 11(4): 479.

Garcia-Aymerich, Judith, Raphaëlle Varraso, Goodarz Danaei, Carlos A. Camargo, and Miguel A. Hernán. 2014. "Incidence of adult-onset asthma after hypothetical interventions on body mass index and physical activity: An application of the parametric g-formula." American Journal of Epidemiology, 179(1): 20–26.

Hernán, Miguel A., and James M. Robins. 2015. Causal Inference. Boca Raton: Chapman & Hall/CRC.

Hicks, Raymond, and Dustin Tingley. 2011. "Causal mediation analysis." Stata Journal, 11(4): 605.

Horvitz, Daniel G., and Donovan J. Thompson. 1952. "A generalization of sampling without replacement from a finite universe." Journal of the American Statistical Association, 47(260): 663–685.

Imai, Kosuke, Luke Keele, and Dustin Tingley. 2010. "A general approach to causal mediation analysis." Psychological Methods, 15(4): 309.

Lajous, Martin, Walter C. Willett, James Robins, Jessica G. Young, Eric Rimm, Dariush Mozaffarian, and Miguel A. Hernán. 2013. "Changes in fish consumption in midlife and the risk of coronary heart disease in men and women." American Journal of Epidemiology, 178(3): 382–391.

Luo, Zhehui, Joshua Breslau, Joseph C. Gardiner, Qiaoling Chen, and Naomi Breslau. 2014. "Assessing interchangeability at cluster levels with multiple-informant data." Statistics in Medicine, 33(3): 361–375.

Pearl, Judea. 2009. Causality: Models, Reasoning and Inference. Cambridge University Press.

Pearl, Judea. 2014. "Interpretation and identification of causal mediation." Psychological Methods, 19(4): 459.

Pearl, Judea, and James Robins. 1995. "Probabilistic evaluation of sequential plans from causal models with hidden variables." 444–453. Morgan Kaufmann Publishers Inc.

Richardson, Thomas S., and James M. Robins. 2013. "Single World Intervention Graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality." Center for Statistics and the Social Sciences, University of Washington, Working Paper 128.

Robins, James M. 1986. "A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect." Mathematical Modelling, 7(9-12): 1393–1512.

Robins, James M. 1989. "The control of confounding by intermediate variables." Statistics in Medicine, 8(6): 679–701.

Robins, James M. 1994. "Correcting for non-compliance in randomized trials using structural nested mean models." Communications in Statistics - Theory and Methods, 23(8): 2379–2412.

Robins, James M. 1998. "Structural nested failure time models." Encyclopedia of Biostatistics.

Robins, James M. 2000. "Marginal structural models versus structural nested models as tools for causal inference." In Statistical Models in Epidemiology, the Environment, and Clinical Trials, 95–133. Springer.

Robins, James M., and Sander Greenland. 1992. "Identifiability and exchangeability for direct and indirect effects." Epidemiology, 143–155.

Robins, James M., and Thomas Richardson. 2010.
"Alternative graphical causal models and the identification of direct effects." Causality and Psychopathology: Finding the Determinants of Disorders and Their Cures, 103–158.

Robins, James M., Miguel A. Hernán, and Uwe Siebert. 2004. "Effects of multiple interventions." Comparative Quantification of Health Risks: Global and Regional Burden of Disease Attributable to Selected Major Risk Factors, 1: 2191–2230.

Rosner, Jerome, and Dorothea P. Simon. 1971. "The Auditory Analysis Test: An initial report." Journal of Learning Disabilities, 4(7): 384–392.

Taubman, Sarah L., James M. Robins, Murray A. Mittleman, and Miguel A. Hernán. 2009. "Intervening on risk factors for coronary heart disease: an application of the parametric g-formula." International Journal of Epidemiology, 38(6): 1599–1611.

Valeri, Linda, and Tyler J. VanderWeele. 2013. "Mediation analysis allowing for exposure-mediator interactions and causal interpretation: theoretical assumptions and implementation with SAS and SPSS macros." Psychological Methods, 18(2): 137–150.

Van der Wal, W. M., M. Prins, B. Lumbreras, and R. B. Geskus. 2009. "A simple G-computation algorithm to quantify the causal effect of a secondary illness on the progression of a chronic disease." Statistics in Medicine, 28(18): 2325–2337.

Vansteelandt, Stijn. 2009. "Estimating direct effects in cohort and case-control studies." Epidemiology, 20(6): 851–860.

Westreich, Daniel, Stephen R. Cole, Jessica G. Young, Frank Palella, Phyllis C. Tien, Lawrence Kingsley, Stephen J. Gange, and Miguel A. Hernán. 2012. "The parametric g-formula to estimate the effect of highly active antiretroviral therapy on incident AIDS or death." Statistics in Medicine, 31(18): 2000–2009.

Woodcock, Richard W., M. Bonner Johnson, and Nancy Mather. 1990. Woodcock-Johnson Psycho-Educational Battery-Revised. DLM Teaching Resources.

Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, MA: MIT Press.

Young, Jessica G., Lauren E. Cain, James M. Robins, Eilis J. O'Reilly, and Miguel A. Hernán. 2011. "Comparative effectiveness of dynamic treatment regimes: an application of the parametric g-formula." Statistics in Biosciences, 3(1): 119–143.