ESSAYS ON NONLINEAR PANEL MODELS WITH UNOBSERVED HETEROGENEITY

By

Robert Martin

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics – Doctor of Philosophy

2017

ABSTRACT

ESSAYS ON NONLINEAR PANEL MODELS WITH UNOBSERVED HETEROGENEITY

By

Robert Martin

This dissertation concerns nonlinear panel data estimation relevant to the fields of econometrics and applied microeconomics. Panel data is attractive for estimating causal effects when unobserved heterogeneity in cross-sectional units is correlated with explanatory variables. For instance, the well-known linear fixed effects and first difference estimators use within-group variation to achieve consistent estimation. However, nonlinear models often better represent limited dependent variables like binary outcomes or counts, and extending traditional panel techniques to these settings can be problematic. For instance, treating the heterogeneity as parameters to be estimated usually leads to what is known as the incidental parameters problem. Furthermore, heterogeneous slopes in a conditional mean function can also confound estimation, and fewer remedies exist for them than for additive effects. I aim to address these issues in my research with an emphasis on practical applicability.

Chapter 1: Finite sample properties of bias-corrected fixed effects estimators for panel binary response models

Maximum likelihood estimation (MLE) of nonlinear unobserved effects panel models is known to be generally inconsistent when the heterogeneity is treated as parameters to be estimated. Several authors have proposed corrections justified by large-T expansions of the inconsistency under conditions like dynamic completeness. Using Monte Carlo (MC) techniques, I find that failure of dynamic completeness can increase bias in slope and average partial effects (APE) estimates in shorter panels, but has little impact on APE for longer panels. I also compare bias corrections to correlated random effects (CRE) and conditional MLE using MC and welfare data from the Survey of Income and Program Participation (SIPP).

Chapter 2: Exponential panel models with coefficient heterogeneity

If heterogeneous slopes are ignored in exponential panel models, fixed effects Poisson may not estimate any quantity of interest. Existing estimation methods often involve treating only a small subset of the slopes as “random effects” and integrating them out of the likelihood, increasing computational difficulty. I propose a test to detect slope heterogeneity that, unlike the traditional approach, does not amount to testing the information matrix equality. Additionally, I present a correlated random coefficients approach to identification which allows for estimation of the coefficient means and average partial effects. I evaluate these proposed methods using a Monte Carlo experiment and apply them to the patent-R&D relationship for U.S. manufacturing firms.

Chapter 3: Estimation of average marginal effects in multiplicative unobserved effects panel models

This chapter concerns estimation of average marginal effects in static multiplicative unobserved effects panel models for nonnegative dependent variables. While fixed effects Poisson (FEP) consistently estimates the parameters of the conditional mean function, marginal effects generally depend on the unobserved heterogeneity. They would therefore seem inestimable without either additional assumptions or some form of bias correction.
I show, however, that Average Partial Effect (APE) and Average Treatment Effect (ATE) estimators that use estimated individual effects are consistent and asymptotically normal. This is in contrast with cases like fixed effects logit, where similar marginal effects estimators suffer from the incidental parameters problem.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank the chair of my dissertation committee, Jeff Wooldridge, for all of his advice, encouragement, and helpful critiques. I would also like to thank Peter Schmidt, Kyooil Kim, and Nicole Mason for serving on my committee and providing valuable feedback and assistance. I also appreciate the comments of seminar participants at Michigan State University, the 2016 MEA Conference, and the 2016 Annual Meeting of the Midwest Econometrics Group. I am especially grateful for the financial support I received from the Graduate School and the Department of Economics at Michigan State University, including the David Kelley Fellowship, Summer Research Fellowship, and Dissertation Completion Fellowship. I also appreciate the support and advice that Lori Jean Nichols, Steven Haider, Todd Elder, and Steve Woodbury all gave me as I navigated the graduate program and job market. Finally, I cannot thank my family enough for their support and encouragement. I am especially grateful to my wife, Kara, for moving with me to East Lansing and then to Washington DC, as well as for all the countless ways she has supported my endeavors over the years.

TABLE OF CONTENTS

LIST OF TABLES

CHAPTER 1  FINITE SAMPLE PROPERTIES OF BIAS-CORRECTED FIXED EFFECTS ESTIMATORS FOR PANEL BINARY RESPONSE MODELS
  1.1 Introduction
  1.2 The panel binary response model with incidental parameters
    1.2.1 Bias correction techniques
  1.3 Monte Carlo experiment
    1.3.1 Evaluating the dynamic completeness assumption
    1.3.2 Comparing bias correction and CRE under more general forms of heterogeneity
    1.3.3 Conditional logit and the importance of correcting APE estimates
  1.4 Results
    1.4.1 Evaluating the dynamic completeness assumption
      1.4.1.1 Comparison with uncorrected MLE
    1.4.2 Comparing bias correction and CRE under more general forms of heterogeneity
    1.4.3 Conditional logit and the importance of correcting APE estimates
    1.4.4 Empirical example: Welfare participation
  1.5 Conclusion

CHAPTER 2  EXPONENTIAL PANEL MODELS WITH COEFFICIENT HETEROGENEITY
  2.1 Introduction
  2.2 Literature Review
  2.3 Theory
    2.3.1 The fixed effects Poisson model with coefficient heterogeneity
    2.3.2 Testing under full distributional assumptions
    2.3.3 Testing under weaker assumptions
    2.3.4 A correlated random coefficients approach to testing and estimation
    2.3.5 Adding second moment assumptions
    2.3.6 Estimating average partial effects
      2.3.6.1 Approaches under the CRE assumption for ci
      2.3.6.2 Estimation when the slopes are independent of covariates
  2.4 Monte Carlo
    2.4.1 Comparing estimation methods
    2.4.2 Testing when coefficients are not normal
  2.5 Empirical application: the Patent-R&D relationship
  2.6 Conclusion

CHAPTER 3  ESTIMATION OF AVERAGE MARGINAL EFFECTS IN MULTIPLICATIVE UNOBSERVED EFFECTS PANEL MODELS
  3.1 Introduction and Review
  3.2 Theory
    3.2.1 Exponential Models
    3.2.2 A note about dropped observations
  3.3 Monte Carlo
    3.3.1 Design
    3.3.2 Results
  3.4 Conclusion

APPENDICES
  APPENDIX A  Analytical bias correction expressions from Chapter 1
  APPENDIX B  Simulation results for bias corrections on a larger cross-section
  APPENDIX C  Derivations of test statistics from Chapter 2
  APPENDIX D  Simulation results from Chapter 3

REFERENCES

LIST OF TABLES

Table 1.1: Probit Estimates of β (β0 = 1)
Table 1.2: Probit Estimates of γ (γ0 = 1)
Table 1.3: Probit Estimates of µ̂x/µx (true value = 1)
Table 1.4: Probit Estimates of µ̂d/µd (true value = 1)
Table 1.5: Probit Estimates of µ̂x/µx Under Different Heterogeneity (true value = 1)
Table 1.6: Corrected and Uncorrected Logit Estimates of µ̂x/µx (true value = 1)
Table 1.7: Welfare Participation: Slope Estimates
Table 1.8: Welfare Participation: Average Partial Estimates
Table 2.1: Finite Sample Properties of Slope Estimators: β1 = 1, β2 = −1
Table 2.2: Finite Sample Properties of APE Estimators: β1 = 1, β2 = −1
Table 2.3: Testing when bi is not normal
Table 2.4: Distribution of Net Sales in 2000
Table 2.5: R&D Expenditures in 2000
Table 2.6: Summary of Key Variables in 2000
Table 2.7: Results for traditional estimators
Table 2.8: Results for CRC FEP estimators
Table 2.9: CRCFEP 3 estimated elasticities
Table B.1: Probit Slope Estimates when N = 500, T = 6
Table B.2: Probit APE Estimates when N = 500, T = 6
Table D.1: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 500
Table D.2: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 500
Table D.3: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 500
Table D.4: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 500
Table D.5: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 1000
Table D.6: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 1000
Table D.7: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 1000
Table D.8: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 1000
Table D.9: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 2000
Table D.10: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 2000
Table D.11: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 2000
Table D.12: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 2000

CHAPTER 1

FINITE SAMPLE PROPERTIES OF BIAS-CORRECTED FIXED EFFECTS ESTIMATORS FOR PANEL BINARY RESPONSE MODELS

1.1 Introduction

Nonlinear models are popular in economics in many settings. For instance, binary response models are common for analyzing outcomes like labor force participation, employment, or union membership. At the same time, panel data can be attractive when controlling for unobserved heterogeneity is necessary to identify causal effects. However, it is well known that maximum likelihood estimation (MLE) that treats the heterogeneity as parameters to estimate is inconsistent. For example, in the case of cross-section heterogeneity, the problem arises in the typical large-N, fixed-T microeconometric setting because only a handful of observations contribute to the estimation of each individual’s fixed effect (Lancaster, 2000). This is known as the incidental parameters problem, first described by Neyman and Scott in 1948. In the statistics and econometrics literature, there have been many approaches to estimation in the presence of incidental parameters. In some special cases, it is possible to re-parameterize the model or find a conditioning variable that removes the incidental parameters from the likelihood function (Lancaster, 2000). A leading example of this is the conditional logit model, where the conditioning variable is the number of successes observed for each cross-sectional unit (Chamberlain, 1980).
However, while conditional maximum likelihood in a case like this consistently estimates the slope parameters of the index of the logit function, conditioning usually does not identify partial effects, which depend on the heterogeneity (Wooldridge, 2010). Other approaches involve restricting the relationship between the heterogeneity and explanatory variables in some way. For instance, if we are willing to assume independence between the heterogeneity and the explanatory variables, then we can use a random effects approach. In many cases, however, correlation between heterogeneity and covariates is of concern. The correlated random effects (CRE) approach of Chamberlain (1980, 1982) or Mundlak (1978) restricts the conditional distribution of the heterogeneity to have a mean that is a linear function of the explanatory variables, but the restriction at least buys the researcher identification of APE and scaled slope parameters (Wooldridge, 2010). Assumptions restricting the nature of the heterogeneity are a potential drawback. For instance, Rabe-Hesketh and Skrondal (2013) explore a special case in the dynamic probit setting where misspecification of the heterogeneity causes significant bias. In general, however, we do not know how robust CRE is when the distributional assumption fails or when the researcher chooses the wrong conditional mean function. If one prefers to leave the nature of the heterogeneity completely unrestricted, a linear probability model (LPM) estimated by fixed effects ordinary least squares is thought to do a reasonable job of approximating average partial effects, and it even estimates them consistently under certain assumptions regarding the explanatory variables (Stoker, 1986). Nevertheless, often the index slope parameters are of interest, or the researcher wants to estimate partial effects at different values of the explanatory variables. In these cases it is tempting to use a nonlinear “fixed effects” estimator, whereby the heterogeneity terms are estimated as parameters alongside the index slopes in an MLE procedure, but this is problematic. Particularly when the number of time periods is small, fixed effects estimators often perform worse than simply ignoring the heterogeneity entirely (Greene, 2004). In the case of cross-sectional heterogeneity only, several studies have noted that inconsistency diminishes as the number of time periods increases, and that estimates of slope parameters are consistent with both N and T growing to infinity. However, the asymptotic distribution of fixed effects estimators is not centered around the true parameter values, so confidence intervals can still be misleading (Hahn and Newey, 2004). I study bias corrections for models with cross-sectional heterogeneity that subtract the leading term of a large-T expansion of the bias from the uncorrected fixed effects MLE. Analytical bias corrections estimate this term from expressions specific to the parametric model. Jackknife corrections estimate it non-parametrically by generating variation in the uncorrected MLE by dropping some time periods. These techniques reduce the bias from O_p(T^{-1}) to O_p(T^{-2}), but they can require significant restrictions on the underlying distribution of the data (Hahn and Newey, 2004). Both approaches assume at least that the explanatory variables are stationary and weakly dependent.
The analytical and jackknife corrections developed by Hahn and Newey (2004) also require the dependent variables to be serially independent conditional on the heterogeneity and the explanatory variables. The analytical correction of Fernandez-Val (2009) and the split-panel jackknife of Dhaene and Jochmans (2015) relax conditional independence to accommodate models with lagged dependent variables, but still require dynamic completeness. Either conditional independence or dynamic completeness rules out serially correlated error terms, which is potentially a serious problem for static models. Serial correlation is certainly a concern in linear models, as demonstrated by widespread use of clustered standard errors and postestimation testing. Extending that concern to nonlinear models is particularly prudent given that in cases like the probit or logit, serial correlation causes inconsistency in the estimators themselves, not just their standard errors. Without unobserved heterogeneity, APE are still identified in probit or logit models with serial correlation, so the problem is easily handled by using pooled MLE with cluster-robust standard errors (Wooldridge, 2010). To my knowledge, however, no researchers have simulated bias-corrected estimators in the presence of serial correlation. This chapter aims to answer three questions. First, how robust are bias corrections when the latent errors have serial correlation? Second, how do the bias corrections compare to the CRE approach when the heterogeneity does not satisfy the CRE conditional distribution assumption? Finally, the incidental parameters problem causes bias not only in slope estimates but in APE estimates as well; how severe is the bias in APE estimates when the slopes are estimated consistently with a procedure like conditional logit? The first goal is to inform practitioners who wish to account for unobserved heterogeneity while being agnostic about serial dependence. Using Monte Carlo techniques, I evaluate the impact of serially correlated errors on the analytical bias corrections of Hahn and Newey (2004) and Fernandez-Val (2009). I also evaluate the drop-one-period jackknife of Hahn and Newey (2004) and the split-panel jackknife of Dhaene and Jochmans (2015). I generate the error terms in the latent variable model as first-order autoregressive processes, but simulate estimators that use clustered standard errors to allow for general (weak) serial dependence. Since slope parameters are only identified up to scale in this setting, I focus primarily on estimation of APE, which are still identified (Wooldridge, 2010). While simulation evidence from the aforementioned studies shows that bias-corrected estimators often have much more desirable finite sample properties than the uncorrected fixed effects MLE (at least for slope parameters), less work has been done to evaluate sensitivity of these properties to relaxation of the assumptions underlying the corrections. Dhaene and Jochmans (2015) examine departures from stationarity in dynamic models, particularly of the initial observations, and propose a Wald test for evaluating the validity of the split-panel approach overall. Alexander and Breunig (2014) simulate the performance of several bias corrections for the fixed effects probit estimator while varying parameters like the variance of the heterogeneity and the correlation between heterogeneity and explanatory variables, but do not consider any departures from stationarity or conditional independence.
In addition to using clustered standard errors, many researchers will find it attractive to make a CRE assumption to avoid the issue of incidental parameters. In fact, in studying the issue of serial correlation, many of my simulation results show that the CRE estimator of APE tends to have better finite sample properties than the uncorrected or corrected fixed effects methods. This result is not surprising given the data generating process I employ. Therefore, my second contribution is to consider the relative performance of the CRE approach versus the fixed effects approach when the CRE conditional distribution assumption does not hold. Finally, if researchers are willing to assume the dependent variables are conditionally independent, then a logit specification can be attractive because conditional maximum likelihood estimation (conditioning on the individual’s sum of the dependent variables) allows for consistent estimation of slope parameters with only N → ∞. However, partial effects are not identified because they depend on the heterogeneity terms that have been conditioned out of the likelihood function. Nevertheless, it is tempting to implement the following procedure: 1) Estimate slope parameters by conditional MLE. 2) Estimate the heterogeneity parameters using logit MLE, while restricting the slopes to be equal to the estimates from stage 1), and then estimate partial effects. For instance, an empirical example in Greene (2012, Chapter 17) on German health care utilization follows this procedure in estimating partial effects evaluated at the average of the explanatory variables (PEA). This procedure is likely to suffer from the incidental parameters problem because, although the slope parameter estimates are consistent, the heterogeneity estimates still do not converge to anything with fixed T (and it is unclear if the sample average of the estimated heterogeneity converges to anything interesting as N gets large). Fernandez-Val (2009) uses this procedure to estimate a model of female labor force participation, but corrects the APE estimates for the incidental parameters problem in the second stage. Therefore, this chapter’s third contribution is to include Monte Carlo evidence that uncorrected APE estimates derived in this manner from conditional logit estimation can have significant bias. Strictly speaking, any conclusions drawn from these simulations are valid only for the data generating processes I employ. However, the results presented are still useful in alerting empirical researchers to potential benefits and pitfalls when implementing one of the discussed estimation methods. The rest of the chapter is organized as follows. Section 2 reviews the incidental parameters problem in the panel binary response model, as well as the bias correction techniques considered here. Section 3 describes the Monte Carlo experiment. Section 4 presents and discusses results, including the application to the SIPP data. Section 5 concludes. Additional tables, as well as descriptions of the analytical bias correction formulas, are collected in the Appendices.

1.2 The panel binary response model with incidental parameters

I consider the following panel binary response model with unobserved heterogeneity:

yit = 1[αi + xitθ0 + rit > 0], for i = 1, . . . , N and t = 1, . . . , T,   (1.1)

where yit is a scalar outcome variable, xit is a vector of explanatory variables, αi is an individual fixed effect, and rit is an error term.
In the probit (logit) case, rit is distributed standard normal (standard logistic), and 1[·] is the indicator function. The log-likelihood function for individual i in period t is

ℓit(θ, αi) = yit log[G(αi + xitθ)] + (1 − yit) log[1 − G(αi + xitθ)],   (1.2)

where G is either the standard normal CDF or standard logistic CDF. Following the notation of Hahn and Newey (2004) and Fernandez-Val (2009), the maximum likelihood estimator of θ0 maximizes the profile log-likelihood, concentrating out the alphas:

θ̂ = argmax_θ ∑_{i=1}^{N} ∑_{t=1}^{T} ℓit(θ, α̂i(θ))/NT,   (1.3)

where

α̂i(θ) = argmax_α ∑_{t=1}^{T} ℓit(θ, α)/T.   (1.4)

The incidental parameters problem arises because with T fixed, as N → ∞, θ̂ →p θT, where

θT = argmax_θ EN[∑_{t=1}^{T} ℓit(θ, αi(θ))/T],   (1.5)

and EN[m(Zit, αi)] ≡ lim_{N→∞} ∑_{i=1}^{N} m(Zit, αi)/N. For finite T, θT ≠ θ0 because αi(θ) ≠ αi, even when evaluated at the true θ0. Hahn and Newey (2004) show that for smooth likelihoods like the probit and logit,

θT = θ0 + B/T + O(T^{-2}),   (1.6)

where B = I^{-1}b. In this expression, b represents a higher order expansion of the bias in α̂i(θ) as T gets large, while I is the information matrix of the profile log-likelihood. Both terms together capture the effect of estimation error in α̂i(θ) on θ̂. While it is true that θ̂ is consistent for θ0 if both N and T → ∞, the limiting distribution of √(NT)(θ̂ − θ0) is centered around B√κ, where N/T → κ. Therefore, confidence intervals for coefficient estimates will likely have poor coverage (Hahn and Newey, 2004).

1.2.1 Bias correction techniques

Arellano and Hahn (2007) provide a thorough review of different approaches to mitigating bias from the incidental parameters problem. The techniques that I consider in this chapter involve estimating B and using it to construct an estimator with a bias of lower order. Analytical bias corrections use expressions for B (denoted for an arbitrary θ as B(θ)) derived from a large-T expansion of the scores of the profile log-likelihood around the true αi. I focus mainly on the “one-step” estimator B̂(θ̂), which is evaluated at the uncorrected MLE. The bias corrected estimator is then formed as

θ̂bc = θ̂ − B̂(θ̂)/T.   (1.7)

Previous simulations have shown that the one-step estimator performs reasonably well compared to an iterated procedure or related analytical corrections that solve modified scores (Hahn and Newey, 2004). I examine the methods of Hahn and Newey (2004) and Fernandez-Val (2009) for estimating B(θ). Full expressions for the analytical bias corrections can be found in Appendix A. Jackknife corrections estimate B nonparametrically by using variation in θ̂ when estimated over the full panel and shorter sub-panels. This approach is advantageous because it does not require an explicit characterization of B, though it does require more computation. Hahn and Newey (2004) proposed a technique where the MLE is estimated over the T subpanels formed by dropping one period. Their corrected estimator is formed as

θ̂hnjk = T θ̂ − [(T − 1)/T] ∑_{s=1}^{T} θ̂s,   (1.8)

where θ̂s is the uncorrected MLE estimated over the periods {1, . . . , s − 1, s + 1, . . . , T}. Dhaene and Jochmans (2015) show that splitting the panel into equal, or almost-equal, length sub-panels minimizes the impact of imprecise estimation of B on the remaining bias and allows for dynamic models. To illustrate how the estimator is formed, suppose T is even for simplicity. Let θ̂S1 and θ̂S2 be the uncorrected MLE estimated over the periods
{1, 2, . . . , T/2} and {T/2 + 1, . . . , T}, respectively. Then the jackknife corrected estimator is formed as

θ̂djjk = 2θ̂ − (1/2)(θ̂S1 + θ̂S2).   (1.9)

Researchers are often interested in estimating functions of the data and parameters, like the partial effect of the kth element of xit on the probability that yit equals one:

mk(θ, αi, xit) = θk g(αi + xitθ),   (1.10)

where g(·) is the derivative of G(·). Much past simulation and theoretical work has suggested that uncorrected MLE on static binary response models has a “small bias” property for estimates of APE. This means that the bias in APE estimates tends to be smaller than that of slope parameters, and in the probit case with no heterogeneity, it is exactly zero (Fernandez-Val, 2009). This suggests that biases in θ̂k and ∑_{i=1}^{N} ∑_{t=1}^{T} g(α̂i + xitθ̂) move in opposite directions. Since APE and other functions of the data generally depend directly on the α’s, correcting the slope parameters only (or using a consistent procedure like conditional logit) is insufficient to handle the incidental parameters problem, as it removes only one source of the bias. In fact, α̂i(θ), even if evaluated at θ0, does not converge to its true value with T fixed, or converges at a slower rate when T is allowed to grow (Fernandez-Val, 2009). APE estimates built from consistent estimates of θ but no correction for imprecise estimation of the α’s may have much larger biases than APE estimates derived from the uncorrected MLE, as Section 4 explores. The analytical and jackknife corrections for APE are implemented in a similar fashion to their counterparts for slope estimates. In the analytical case (see Appendix A), a bias term is estimated and subtracted, while for the jackknife, APE are estimated for the full panel and the subpanels separately and then combined just like the slope estimates. Under dynamic completeness for the Fernandez-Val case and conditional independence for the Hahn and Newey case, analytical bias-corrected estimators have been shown to be consistent and asymptotically normal as long as T grows faster than N^{1/3}, and a similar property has been conjectured for the Hahn and Newey jackknife correction (Hahn and Newey, 2004). This makes them reasonable procedures to implement when N is fairly large relative to T, as is typical in microeconometrics. The split-panel jackknife of Dhaene and Jochmans is only consistent with T and N growing at the same rate, but they find evidence that it reduces bias with as few as six time periods. The analytical and jackknife corrections analyzed here allow explanatory variables to be only sequentially exogenous, but require the assumption of dynamic completeness, meaning that no additional lags of x or y affect the current yit after xit has been included. Dynamic completeness is written formally as

f(yit | αi, xit, yi,t−1, xi,t−1, . . . , yi1, xi1) = f(yit | αi, xit).   (1.11)

Either conditional independence or Assumption (1.11) implies that the scores of the log-likelihood are serially uncorrelated, ruling out any serial dependence in the per-period shocks. For the many researchers interested in estimating static models, however, this assumption is less than ideal. Empirical researchers routinely encounter static models with neglected serial correlation in the linear case, and take care to conduct inference using clustered standard errors. Consequently, we would rather not assume that a static model has fully captured the dynamics in the nonlinear case either.
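Before turning to the Monte Carlo design, the sketch below makes the mechanics of the jackknife combinations in equations (1.8) and (1.9) concrete. It is an illustration rather than the code used for the simulations: `fit_fe_mle` is a hypothetical stand-in that is assumed to return the uncorrected fixed effects slope estimates for whichever time periods it is given, and the panel is assumed balanced with an even T for the split-panel case.

```python
import numpy as np

def fit_fe_mle(y, X, periods):
    """Hypothetical stand-in: return the uncorrected fixed effects MLE of the
    slope parameters (as a 1-D array) using only the time periods in `periods`.
    y: (N, T) binary outcomes, X: (N, T, K) covariates."""
    raise NotImplementedError("plug in a probit/logit FE-MLE routine here")

def hahn_newey_jackknife(y, X):
    """Drop-one-period jackknife, eq. (1.8):
    T * theta_hat - ((T - 1)/T) * sum_s theta_hat_s."""
    N, T = y.shape
    theta_full = fit_fe_mle(y, X, periods=list(range(T)))
    theta_drop = [fit_fe_mle(y, X, periods=[t for t in range(T) if t != s])
                  for s in range(T)]
    return T * theta_full - (T - 1) / T * np.sum(theta_drop, axis=0)

def split_panel_jackknife(y, X):
    """Half-panel jackknife, eq. (1.9):
    2 * theta_hat - (1/2) * (theta_hat_S1 + theta_hat_S2), T even."""
    N, T = y.shape
    theta_full = fit_fe_mle(y, X, periods=list(range(T)))
    theta_s1 = fit_fe_mle(y, X, periods=list(range(T // 2)))
    theta_s2 = fit_fe_mle(y, X, periods=list(range(T // 2, T)))
    return 2 * theta_full - 0.5 * (theta_s1 + theta_s2)
```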
One attractive point about the CRE approach with clustered standard errors is that for binary response models with unobserved heterogeneity, arbitrary serial correlation does not cause inconsistency in APE estimates (Wooldridge, 2010). Any complete comparison of bias corrections, therefore, should evaluate their robustness to this common problem.

1.3 Monte Carlo experiment

The data generating process I specify is similar to Greene (2004) and Fernandez-Val and Weidner (2016). The outcome is generated as

yit = 1[αi + β0 xit + γ0 dit + rit > 0]   (1.12)
xit = αi + .5xi,t−1 + vit, t > 1   (1.13)
xi1 = αi + vi1, vit ∼ N(0, 1/2)   (1.14)
dit = 1[xit + hit > 0], hit ∼ N(0, 1/2)   (1.15)
αi ∼ N(0, 1/16)   (1.16)

I set β0 = γ0 = 1. In this model, dit represents a policy or treatment variable of interest, while xit is a continuous control variable that is correlated both with dit and with its own past values. Both xit and dit are generated to be strictly exogenous, though the Fernandez-Val and Dhaene and Jochmans corrections only require sequential exogeneity. Correlation between xit and αi is roughly 0.5, while correlation between dit and αi is roughly 0.3. Correlation between xit and dit is about 0.6. Let µw be the population APE of w on the probability that y equals one, for w ∈ {x, d}. In general, this quantity varies by T, so for comparison, I report the estimated APE divided by their true value. For β̂, γ̂, and α̂ (the uncorrected MLE),

µ̂w/µw = [ (1/(NT)) ∑_{i=1}^{N} ∑_{t=1}^{T} mw(β̂, γ̂, α̂i, zit) ] / E[ (1/T) ∑_{t=1}^{T} mw(β0, γ0, αi, zit) ],   (1.17)

where zit = (xit, dit) and

mw(β, γ, αi, zit) = β g(αi + βxit + γdit)             for w = x,
mw(β, γ, αi, zit) = G(αi + βxit + γ) − G(αi + βxit)   for w = d,   (1.18)

where for the probit (logit) simulations, G(·) and g(·) are the CDF and PDF, respectively, for the standard normal (logistic) distribution. The expectation in the denominator is simulated with a single draw from a panel of 1,000,000 individuals. Note that the sum in the numerator is divided by the entire sample size, NT. For an individual j whose value of yjt does not change over the length of the panel, the uncorrected MLE of the heterogeneity, α̂j, is unbounded, so the individual is dropped from the estimation of the structural parameters. The estimate mw(β̂, γ̂, α̂j, zjt) for such an observation is set to zero (Alexander and Breunig, 2014). I will discuss practical issues this can cause when the panels are short and the data are highly persistent. Details on the analytical corrections can be found in Appendix A. The jackknife-corrected APE estimators are constructed analogously to the slope estimators in equations (1.8) and (1.9).

1.3.1 Evaluating the dynamic completeness assumption

I relax dynamic completeness in the panel probit case by introducing serial correlation into the error term rit from the latent variable model. I use the following procedure:

rit = ψt,ρ uit   (1.19)
uit = ρ ui,t−1 + eit, t > 1   (1.20)
ui1 = ei1/ψt,ρ, eit ∼ i.i.d. N(0, 1)   (1.21)
ψt,ρ ≡ √(1 − ρ²) if ρ < 1, and 1/√t if ρ = 1   (1.22)

Division of ei1 by ψt,ρ ensures that each element of {uit}_{t=1}^{T} has the same variance, which otherwise would not hold because of the finite length of the series (Vamoş, Şoltuz, and Crăciun, 2007). Multiplication of uit by ψt,ρ gives rit unit variance. I maintain unit variance of the error terms to remove the coefficient scaling that would otherwise occur in probit MLE. This allows us to better compare slope estimates across estimators and values of ρ.
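For concreteness, the sketch below generates one panel from the design in (1.12)-(1.16), with the latent errors built and rescaled as in (1.19)-(1.22). It is a minimal reproduction under my reading of the design, with my own function and variable names, and is not the code behind the reported simulations.

```python
import numpy as np

def generate_panel(N, T, rho, beta0=1.0, gamma0=1.0, seed=0):
    """Simulate one panel from the probit design (1.12)-(1.16) with AR(1)
    latent errors scaled to unit variance as in (1.19)-(1.22)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(0.0, np.sqrt(1.0 / 16.0), size=N)            # (1.16)

    # Strictly exogenous covariates: x is AR(1) in its own past plus alpha;
    # d is a binary "treatment" correlated with x.                   (1.13)-(1.15)
    x = np.empty((N, T))
    v = rng.normal(0.0, np.sqrt(0.5), size=(N, T))
    x[:, 0] = alpha + v[:, 0]
    for t in range(1, T):
        x[:, t] = alpha + 0.5 * x[:, t - 1] + v[:, t]
    d = (x + rng.normal(0.0, np.sqrt(0.5), size=(N, T)) > 0).astype(float)

    # AR(1) latent errors, normalized so each r_it has unit variance.
    psi = np.sqrt(1.0 - rho ** 2) if rho < 1 else None               # (1.22), rho < 1 case
    e = rng.normal(0.0, 1.0, size=(N, T))
    u = np.empty((N, T))
    u[:, 0] = e[:, 0] / psi if rho < 1 else e[:, 0]                  # (1.21)
    for t in range(1, T):
        u[:, t] = rho * u[:, t - 1] + e[:, t]                        # (1.20)
    if rho < 1:
        r = psi * u                                                  # (1.19)
    else:
        r = u / np.sqrt(np.arange(1, T + 1))                         # unit-root case: psi_t = 1/sqrt(t)

    y = (alpha[:, None] + beta0 * x + gamma0 * d + r > 0).astype(float)   # (1.12)
    return y, x, d, alpha
```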
In the logit case, I use a Gaussian copula based on these series of normal errors. I present results from simulations that set ρ equal to 0, 0.4, and 0.8 to represent cases of dynamic completeness, moderate serial correlation, and high serial correlation. While the copula is not guaranteed to maintain the exact serial correlation for the logit case, the autocorrelations were within two decimal places of the specified ρ. Consistent with the literature, I considered panel lengths of 6, 8, 12, and 20, and I set N = 100 in all cases for ease of computation. Previous work by Fernandez-Val (2009) and Alexander and Breunig (2014) has found that larger N does not affect the relative performance of the different estimators in terms of bias, but does increase their overall precision. I find evidence consistent with these findings, reported in Appendix B for the N = 500, T = 6 case. One important finding is that when estimators have finite sample bias, coverage of confidence intervals generally decreases with sample size as standard errors shrink. I also estimate the probit slope coefficients and APE using the pooled MLE version of Mundlak’s (1978) correlated random effects (CRE), and the APE using an LPM for comparison. Standard errors for each estimator are clustered by individual to account for serial dependence in the scores. For each pair of ρ and T, I run 1000 replications.

1.3.2 Comparing bias correction and CRE under more general forms of heterogeneity

A correlated random effects approach of Mundlak (1978) applied to the panel probit model with two strictly exogenous explanatory variables assumes that

D(ci | xi, di) = Normal(ψ + ξ1 x̄i + ξ2 d̄i, σa²),   (1.23)

which implies

D(yit | xi, di) = Probit(βa xit + γa dit + ψa + ξ1,a x̄i + ξ2,a d̄i),   (1.24)

where x̄i and d̄i denote time averages, and the “a” subscript indicates the coefficients are scaled by 1/√(1 + σa²). Therefore, pooled probit of yit on xit, dit, x̄i, and d̄i identifies β and γ up to scale. Since the APE depend on the scaled coefficients, they can be estimated consistently with no problem (Wooldridge, 2010). Tables 1.1-1.4 in Section 4 show that CRE used on probit data generated with the above process (or similarly for the logit case) performs well because the heterogeneity enters the equation for xit additively; therefore, the αi can be written as a linear function of the time averages of xit. Consequently, a natural question is how much better the fixed effects approaches perform when the CRE assumption fails. I explore this question with the panel probit model through the following modifications:

yit = 1[αj,i + β0 xit + γ0 dit + rit > 0]   (1.25)
xit = .5xi,t−1 + vit, t > 1   (1.26)
xi1 = vi1, vit ∼ N(0, 1/2)   (1.27)

where αj,i is one of:

α1,i = −1 + (1/√T) ∑_{t=1}^{T} xit² + ai   (1.28)
α2,i = (1/√T) ∑_{t=1}^{T} (xit + xit² + xit³) + ai   (1.29)
α3,i ∼ N(0, exp[(0.125/√T) ∑_{t=1}^{T} (xit + xit² + xit³)])   (1.30)

where in the first two cases ai ∼ N(0, 1/4). Table 1.5 compares the uncorrected fixed effects MLE, MLE with Fernandez-Val’s analytical bias correction, and two estimators based on CRE. One adds x̄i and d̄i to the probit index; a more flexible version (CRE2) also includes squares of x̄i and d̄i and interactions between the explanatory variables and the time averages. I consider panels with T = 6 and T = 12, for the dynamically complete case.
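As an illustration of how the pooled Mundlak CRE estimator implied by (1.23)-(1.24) can be computed, the sketch below runs a pooled probit of yit on (xit, dit, x̄i, d̄i) and averages the implied partial effect of x over all observations. It is a minimal sketch using statsmodels, with my own function names; it is not the simulation code used here, and it omits the cluster-robust standard errors used for inference.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def cre_probit_ape_x(y, x, d):
    """Pooled Mundlak CRE probit: regress y_it on (1, x_it, d_it, xbar_i, dbar_i)
    and return the estimated APE of x (average of beta_a * phi(index))."""
    N, T = y.shape
    xbar = np.repeat(x.mean(axis=1), T)          # time averages, expanded to N*T rows
    dbar = np.repeat(d.mean(axis=1), T)
    Z = sm.add_constant(np.column_stack([x.ravel(), d.ravel(), xbar, dbar]))
    res = sm.Probit(y.ravel(), Z).fit(disp=0)    # scaled coefficients (beta_a, gamma_a, ...)
    index = Z @ res.params
    beta_a = res.params[1]                       # coefficient on x_it
    return np.mean(beta_a * norm.pdf(index))     # APE of the continuous variable

# Example (uses the generate_panel sketch above):
# y, x, d, _ = generate_panel(N=100, T=6, rho=0.0)
# print(cre_probit_ape_x(y, x, d))
```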
1.3.3 Conditional logit and the importance of correcting APE estimates To evaluate the finite sample properties of APE estimates derived from conditional logit slope estimates, I generate a panel of logit dependent variables using the process described in (12). I only consider the ρ = 0 case, as conditional logit is not valid when dynamic completeness fails. I estimate APE using the uncorrected logit MLE, and two conditional logit procedures which estimate the heterogeneity with a restricted MLE as described on the introduction. One procedure does not correct for the incidental parameters problem while the other uses Fernandez-Val’s 2009 correction for the logit case. 1.4 Results For brevity, I mainly report the bias corrections for the probit case. The logit case is qualitatively similar, though the effect of serial correlation on the Fernandez-Val correction is much less severe. I also report only the T = 6 and T = 12 results as they seem to be representative of the short panel 13 and long panel cases, respectively. Each of the tables lists the mean and standard deviation of the estimator, the coverage probability of a 95% confidence interval, and the ratio of the estimated (cluster-robust) standard error to the standard deviation. They show quite an interesting range of performance for both the uncorrected MLE and the different bias reduction techniques. Results for the 1.4.1 Evalauating the dynamic completeness assumption Tables 1.1 and 1.2 show the performance of the probit slope estimators for different levels of serial correlation. In line with evidence from the literature, the uncorrected MLE can be severely biased for the index slopes in the presence of incidental parameters. 14 Table 1.1: Probit Estimates of β (β0 = 1) T=6 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE T=12 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE Mean ρ =0 SD cv:.95 SE SD Mean ρ = 0.4 SD cv:.95 SE SD 1.36 0.96 1.18 0.85 0.87 1.01 0.24 0.14 0.21 0.34 0.16 0.14 0.70 0.97 0.87 0.64 0.82 0.95 1.14 1.00 1.03 0.94 0.96 0.99 0.12 0.10 0.10 0.12 0.09 0.09 0.79 0.95 0.94 0.82 0.93 0.95 0.96 1.15 0.92 0.46 0.95 0.99 1.56 1.03 1.36 0.73 0.99 1.01 0.30 0.14 0.26 0.50 0.20 0.15 0.48 0.97 0.66 0.49 0.90 0.94 0.99 1.03 1.00 0.78 1.03 1.01 1.22 1.05 1.10 0.90 1.02 0.99 0.13 0.11 0.12 0.16 0.10 0.10 0.61 0.94 0.87 0.69 0.95 0.94 15 Mean ρ = 0.8 SD cv:.95 SE SD 0.90 1.17 0.83 0.35 0.82 0.98 2.49 0.63 2.24 0.80 1.43 1.02 0.55 0.59 0.52 1.03 0.45 0.15 0.05 0.58 0.07 0.39 0.49 0.93 0.83 0.29 0.72 0.28 0.50 0.95 0.99 1.02 0.98 0.62 1.01 0.98 1.61 1.33 1.45 0.75 1.30 1.00 0.19 0.14 0.16 0.32 0.14 0.10 0.05 0.32 0.16 0.39 0.38 0.95 0.96 0.98 0.92 0.31 0.95 1.01 Table 1.2: Probit Estimates of γ (γ0 = 1) T=6 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE T=12 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE Mean ρ =0 SD cv:.95 SE SD Mean ρ = 0.4 SD cv:.95 SE SD 1.31 0.95 1.14 0.78 0.87 0.98 0.26 0.16 0.22 0.73 0.17 0.17 0.79 0.98 0.91 0.76 0.93 0.95 1.15 1.01 1.04 0.95 0.97 1.00 0.14 0.12 0.12 0.13 0.11 0.11 0.82 0.97 0.96 0.91 0.96 0.95 0.98 1.26 1.00 0.30 1.18 0.99 1.52 1.02 1.33 0.24 0.95 0.99 0.30 0.16 0.27 1.53 0.32 0.16 0.59 0.99 0.78 0.54 0.96 0.96 1.00 1.10 1.06 0.93 1.13 1.00 1.23 1.07 1.11 0.91 1.03 1.00 0.15 0.12 0.13 0.16 0.12 0.11 0.65 0.95 0.89 0.83 0.97 0.95 16 Mean ρ = 0.8 SD cv:.95 SE SD 0.95 1.32 0.93 0.18 0.65 1.01 2.49 0.49 2.25 -1.68 0.85 1.00 0.77 0.82 0.76 2.58 1.79 0.15 0.10 0.62 0.13 0.17 0.59 0.95 0.65 0.29 0.54 0.21 0.17 0.99 0.99 1.07 1.03 0.82 1.09 0.98 1.61 1.33 1.44 0.64 1.30 1.00 0.20 0.15 0.18 0.70 0.15 0.11 0.11 0.44 0.25 0.48 0.52 0.94 0.97 
1.05 0.96 0.21 1.06 0.97
In the dynamically complete case (ρ = 0), bias diminishes as T grows, but there is still room for improvement even when T = 12. For instance, the uncorrected MLE for γ has a bias of 31% when T = 6, but only 15% when T = 12. As predicted by theory, coverage of the 95% confidence interval is still somewhat low at 0.82 when T = 12, meaning that for a 5% significance level, one would expect to reject a true null hypothesis 18% of the time. As found in previously published simulations, the correction techniques reduce bias and generally increase coverage. In particular, Fernandez-Val’s analytical correction performs better than the others in all panels, both in terms of bias and variance, particularly for the short panels. The split-panel jackknife of Dhaene and Jochmans tends to have higher variance than the others. If one is concerned primarily with estimating APE, however, the incidental parameters problem clearly has much less bite, as shown by Tables 1.3 and 1.4. For the dynamically complete case, bias in the uncorrected MLE for µx is less than 1% for either panel length, while the bias in that of µd is 4% or less. This supports the “small bias” property for APE estimators found by many previous studies of static models (Fernandez-Val, 2009). The bias-corrected estimators perform well for the longer panels, but even in the dynamically complete case, many of them have higher bias than the uncorrected MLE for the short panels. Among the different bias correction techniques, both corrections from Hahn and Newey (2004) tend to have the smallest bias, while the split-panel jackknife does worse. Additionally, while theory suggests that both corrections reduce bias without any change in variance, it appears that the jackknife corrections may increase variance, especially in shorter panels.
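For reference, the APE ratios reported in Tables 1.3 and 1.4 are built from the sample average in the numerator of (1.17), with individuals whose outcome never changes contributing zero, as described in Section 1.3. A minimal sketch of that calculation for the continuous variable, assuming estimates (β̂, γ̂, α̂i) are already available (the function and argument names are mine), is:

```python
import numpy as np
from scipy.stats import norm

def sample_ape_x(beta_hat, gamma_hat, alpha_hat, x, d):
    """Numerator of (1.17) for w = x: average of beta_hat * phi(index) over all
    N*T observations.  alpha_hat entries are np.nan for individuals dropped
    because y_it never changes; their contributions are set to zero."""
    N, T = x.shape
    idx = alpha_hat[:, None] + beta_hat * x + gamma_hat * d
    m = beta_hat * norm.pdf(idx)
    m[np.isnan(alpha_hat), :] = 0.0        # dropped individuals contribute zero
    return m.sum() / (N * T)               # divide by the full sample size NT
```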
17 Table 1.3: Probit Estimates of µx /µx (true value = 1) T=6 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE LPM T=12 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE LPM Mean ρ =0 SD cv:.95 SE SD Mean ρ = 0.4 SD cv:.95 SE SD 1.00 0.96 1.05 1.10 1.04 1.01 0.95 0.14 0.13 0.15 0.19 0.15 0.13 0.13 0.94 0.93 0.89 0.78 0.89 0.96 0.94 1.00 0.99 1.00 1.00 1.00 1.00 0.93 0.09 0.09 0.09 0.11 0.09 0.09 0.09 0.94 0.93 0.93 0.89 0.93 0.96 0.88 0.96 0.96 0.88 0.69 0.83 1.03 1.03 0.99 0.94 1.06 1.15 1.07 1.01 0.95 0.14 0.13 0.16 0.22 0.17 0.13 0.13 0.93 0.90 0.88 0.71 0.84 0.95 0.94 0.96 0.94 0.93 0.81 0.93 1.01 1.03 0.99 0.99 1.00 1.01 1.00 1.00 0.93 0.09 0.09 0.10 0.12 0.09 0.09 0.09 0.93 0.93 0.93 0.84 0.93 0.94 0.86 18 Mean ρ = 0.8 SD cv:.95 SE SD 0.92 0.94 0.81 0.66 0.74 1.02 1.02 0.94 0.57 1.05 1.26 1.15 1.01 0.94 0.14 0.44 0.16 0.22 0.19 0.13 0.13 0.86 0.37 0.82 0.58 0.63 0.95 0.92 0.86 0.28 0.72 0.70 0.58 0.99 1.00 0.93 0.91 0.90 0.72 0.90 0.97 1.00 0.99 0.98 1.01 1.05 1.00 1.00 0.93 0.09 0.09 0.10 0.13 0.09 0.09 0.09 0.90 0.89 0.90 0.78 0.90 0.95 0.87 0.89 0.86 0.85 0.67 0.84 0.97 0.99 Table 1.4: Probit Estimates of µd /µd (true value = 1) T=6 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE LPM T=12 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE LPM Mean ρ =0 SD cv:.95 SE SD Mean ρ = 0.4 SD cv:.95 SE SD 0.96 0.93 1.00 1.09 1.04 0.99 1.28 0.19 0.18 0.19 0.25 0.22 0.19 0.19 0.93 0.93 0.93 0.81 0.89 0.95 0.68 1.01 1.00 1.01 1.03 1.01 1.01 1.33 0.13 0.13 0.13 0.15 0.13 0.13 0.13 0.94 0.94 0.94 0.91 0.94 0.95 0.29 0.95 1.01 0.92 0.72 0.83 1.00 0.99 0.97 0.91 1.01 1.11 1.05 0.99 1.28 0.19 0.17 0.19 0.26 0.21 0.18 0.18 0.92 0.92 0.92 0.80 0.88 0.95 0.67 0.99 0.99 0.98 0.88 0.96 1.00 1.00 1.01 1.00 1.01 1.03 1.02 1.01 1.33 0.13 0.13 0.13 0.15 0.13 0.13 0.13 0.94 0.95 0.94 0.88 0.93 0.94 0.27 19 Mean ρ = 0.8 SD cv:.95 SE SD 0.92 1.01 0.89 0.67 0.79 1.00 1.00 0.96 0.42 1.00 1.05 1.05 0.99 1.29 0.18 0.56 0.18 0.27 0.20 0.17 0.17 0.88 0.37 0.89 0.75 0.83 0.95 0.62 0.84 0.31 0.83 0.60 0.70 1.00 1.00 0.97 0.97 0.96 0.82 0.94 0.98 0.99 1.00 0.99 1.01 1.03 1.01 1.00 1.33 0.12 0.12 0.12 0.16 0.12 0.13 0.13 0.94 0.93 0.93 0.87 0.92 0.94 0.27 0.95 0.95 0.94 0.78 0.91 0.98 0.97 The simulation results for models where dynamic completeness fails reveal many interesting implications for the uncorrected and corrected fixed effects probit estimators. To begin with, higher levels of serial dependence in the error terms and yit exacerbate a practical difficulty in performing MLE while treating heterogeneity as parameters to be estimated. The problem relates to the fact that the partial effect is not well-defined for an individual j whose value of y jt is constant. In this case, the dummy variable for observation j perfectly predicts the outcome, so the estimate of αi is technically unbounded (Fernandez-Val, 2009). These observations are therefore dropped from the estimation sample. These individual’s contributions to the sample APE are equal to zero. The true α’s in these cases tend to be larger in magnitude, and while this means m(β0 , γ0 , α, zit ) will be smaller by the properties of the standard normal PDF and CDF, it should still be strictly positive. This explains the tendency of MLE to under-predict APE (Alexander and Breunig, 2014). Additionally, there may be distributional differences between the subpopulation that has a changing response and the population in general that could cause additional bias. The probability of observing an individual with a constant yit increases significantly in the shorter panels as serial dependence in the errors increases. 
To illustrate, for T = 6 and ρ = 0, across the 1000 replications, 21% of the individuals were dropped on average, while for T = 6 and ρ = 0.8, 32% were dropped on average. For comparison, with T = 12, this dropping rate was only 7.5% for ρ = 0 and 14.5% for ρ = 0.8. Splitting the panel for Dhaene and Jochmans’s jackknife makes this much worse, especially when the panel is only six periods long to begin with. Practically speaking, losing more observations makes it more likely that the numerical maximization algorithm will not converge (at least when N is relatively small). The worst case in this study was the split-panel jackknife in the T = 6, ρ = 0.8 case, in which 32% of replications had a failure to converge. Similar rates of non-convergence occurred for (unreported) runs of the uncorrected MLE and analytical corrections with high ρ and only three or four time periods. The results show that, as expected, the failure of dynamic completeness significantly increases the bias and decreases the precision of all of the fixed effects slope estimators. By design of the data generating process, this bias is separate from the scaling that would occur from the latent model errors having non-unit variance as a result of their autoregressive structure. In the worst cases, the means of the split-panel jackknife estimates of γ for the shorter panels even have the wrong sign when ρ = 0.8. For small panels, the standard errors of the corrected estimators also do a poor job estimating the true standard deviations. The increased bias is not surprising given that in the presence of unobserved heterogeneity, a conditional independence assumption for {yi1, yi2, . . . , yiT} is required to identify unscaled slope parameters in the panel probit model (Wooldridge, 2010). Fernandez-Val’s analytical correction and Hahn and Newey’s jackknife continue to mitigate the bias and perform relatively well when ρ = 0.4. They still provide an improvement over the uncorrected MLE when ρ = 0.8, but remain severely biased. The performance of the fixed effects estimators in estimating APE is much more relevant when dynamic completeness fails. In the case of the short panel (T = 6), the effect of higher serial correlation in the errors on the performance of the fixed effects estimators is quite mixed. Comparisons between estimators in Tables 1.3 and 1.4 suggest that the analytical correction proposed by Hahn and Newey seems fairly robust to serial correlation, with biases in APE estimates of 6% or less. Bias in the Fernandez-Val correction only increases slightly at low-to-moderate levels of serial correlation, but the combination of high autocorrelation and short panel length causes a substantial downward bias of 40% to 60%. With the longer panels, however, the effect of ρ on the bias of this and the other corrections is much smaller, 2% or less for the T = 12 case. The effect of ρ on the jackknife APE corrections is different for each explanatory variable. For instance, the bias in the split-panel jackknife APE estimates for x increases with higher ρ in the T = 6 case, but those for d appear to be less affected. Hahn and Newey’s jackknife shows a very similar pattern, but with much smaller variance of the estimators. Furthermore, the results for the split-panel jackknife illustrate that slope and APE estimates do not necessarily agree in sign. This is another drawback to using this procedure on short panels, since splitting the panel increases variance substantially.
Perhaps larger N would mitigate this problem. 21 1.4.1.1 Comparison with uncorrected MLE As in the dynamically complete case, it is important to note that the uncorrected APE estimators often have lower bias than either the analytical or jackknife corrected estimators, especially for the short panels. For longer panels, the uncorrected MLE, analytical corrections, drop-one-period jackknife, and CRE behave very similarly, while the split-panel jackknife has higher variance. For comparison, the CRE and LPM are not really affected by either failure of dynamic completeness or the length of the panel. The structure of the data is such that one would expect CRE to do well. As a side note, I found that a generalized estimating equations approach with either an exchangeable or AR(1) covariance matrix was not much more efficient than pooled MLE for the CRE model. In contrast to the CRE, the best linear approximation performs fairly well for the continuous variable (bias of 5 − 7%) but does not perform very well for the discrete variable (bias of 28 − 34%). 1.4.2 Comparing bias correction and CRE under more general forms of heterogeneity Table 1.5 compares probit APE estimates for the continuous variable x using the uncorrected MLE, Fernandez-Val correction, and two Correlated Random Effects estimators, described in Section 3. I consider panels with T = 6 and T = 12, in the case of serially independent errors. The estimators are compared across three different forms of heterogeneity which do not satisfy the conditional distribution assumptions for either CRE estimator. The uncorrected MLE and Fernandez-Val correction, in contrast, place no restriction on the nature of the heterogeneity. 22 Table 1.5: Probit Estimates of µx /µx Under Different Heterogeneity (true value = 1) T=6 MLE A-FV09 A-HN04 CRE CRE2 T=12 MLE A-FV09 A-HN04 CRE CRE2 Mean α1 SD cv:.95 SE SD Mean α2 SD cv:.95 SE SD 1.00 0.96 1.04 0.76 0.78 0.16 0.15 0.16 0.15 0.15 0.92 0.92 0.88 0.59 0.65 0.99 0.98 1.00 0.68 0.70 0.12 0.12 0.12 0.11 0.11 0.91 0.91 0.91 0.22 0.26 0.90 0.91 0.83 0.96 0.95 0.98 0.93 1.03 0.75 0.76 0.23 0.21 0.24 0.20 0.20 0.91 0.89 0.87 0.74 0.75 0.89 0.88 0.87 1.01 1.00 0.98 0.97 0.99 0.79 0.80 0.18 0.18 0.18 0.17 0.17 0.85 0.84 0.84 0.72 0.73 23 Mean α3 SD cv:.95 SE SD 0.86 0.87 0.78 1.03 1.01 0.99 0.95 1.03 0.95 0.96 0.15 0.14 0.16 0.15 0.15 0.92 0.90 0.88 0.92 0.93 0.90 0.90 0.82 0.99 0.98 0.76 0.75 0.74 0.98 0.97 1.00 0.99 1.01 0.95 0.96 0.10 0.10 0.10 0.10 0.10 0.92 0.92 0.91 0.92 0.93 0.91 0.90 0.89 1.02 1.01 Since the pooled-MLE version of CRE only identifies slope parameters up to scale, I only report on the APE. The tables show that the bias in the CRE estimators is higher in all three specifications. For instance, in the second specification (α = α2 ), CRE underestimates the APE of x by about 25% when T = 6, while the biases in the uncorrected MLE and the Fernandez-Val correction are only 2% and 9%, respectively. The results for the APE of d were comparatively similar, though the CRE estimators tended to have a positive bias. This illustrates the importance of the functional form assumption when specifying a CRE model, and suggests an advantage in the FE approaches as they place no restrictions on the αi . 1.4.3 Conditional logit and the importance of correcting APE estimates Table 1.6 explores a possible approach to handling unobserved cross-sectional heterogeneity in logit models where the response variables are conditionally independent. 
Using conditional logit to consistently estimate slope parameters does not allow for estimating average partial effects unless the researcher can somehow recover estimates of the αi . One way is to estimate them by MLE, restricting the slope parameters to their conditional logit estimates, but this causes bias in APE estimates. The table shows the APE estimates (for the continuous variable x) from the uncorrected pooled logit MLE, conditional logit without correcting the APE estimates (denoted CLOG), and conditional logit where the APE have been corrected using Fernandez-Val’s formula (CLOGC). Simulations for the APE of d showed a very similar pattern. 24 Table 1.6: Corrected and Uncorrected Logit Estimates of µx /µx (true value = 1) T=6 MLE CLOGIT CLOGIT-C T=12 MLE CLOGIT CLOGIT-C Mean ρ =0 SD cv:.95 SE SD Mean ρ = 0.4 SD cv:.95 SE SD 1.01 0.87 1.00 0.19 0.16 0.18 0.95 0.93 0.94 1.00 0.94 1.00 0.12 0.11 0.12 0.95 0.94 0.95 1.01 1.14 1.00 1.01 0.87 0.99 0.18 0.16 0.18 0.94 0.91 0.93 1.01 1.06 1.00 1.00 0.94 1.00 0.12 0.11 0.12 0.95 0.94 0.95 25 Mean ρ = 0.8 SD cv:.95 SE SD 0.99 1.10 0.97 0.99 0.87 0.98 0.18 0.15 0.17 0.92 0.82 0.90 0.88 0.93 0.83 1.01 1.06 1.00 1.00 0.94 0.99 0.12 0.11 0.12 0.94 0.91 0.93 0.95 0.99 0.94 The table illustrates a couple of interesting points. First, the uncorrected conditional logit APE estimates have biases that are 5-13 percentage points higher than the corrected versions. This shows that inconsistent estimation of the αi is a significant problem even when a consistent procedure is used to estimate the slope coefficients. Moreover, these suggest that the “small bias” property in the uncorrected MLE APE estimates observed earlier is the result of two competing biases. In the case of this chapter’s data generating process, an upward bias in the slope estimate is being offset by a scale factor that is biased toward zero. Using a procedure like conditional logit (or any bias correction) while failing to correct APE estimates removes only one source of the problem and may increase the bias compared to doing no correction at all. 1.4.4 Empirical example: Welfare participation As an additional demonstration of the relative performance of these fixed effects estimators, I apply them to a dataset on participation in Aid to Families with Dependent Children (AFDC), a U.S. welfare program. The data are by way of Chay and Hyslop (2014), who use the 1990 Survey of Income and Program Participation (SIPP). The panel consists of AFDC participation, age, race, marital status, number of children, and poverty level for 1, 934 women who either received benefits or had income below a certain threshold at some point during the sample period. As welfare participation is a binary response that is thought to be highly persistent over time, Chay and Hyslop differentiate between unobserved heterogeneity, and structural state dependence as sources of persistence, finding significant evidence for the latter using dynamic estimators under varying assumptions about the nature of the heterogeneity and initial conditions (Chay and Hyslop, 2014). Although their findings suggest that a dynamic model may be more appropriate, these data still provide an interesting and relevant setting for evaluating the bias-corrected fixed effects estimators in the static case. Table 1.7 lists slope parameter estimates for two key determinants of participation, marital status and number of children. Note that in addition to several control variables, these specifications include time period dummies. 
While technically, they are also incidental parameters under large-T bias corrections, it is customary to include them in this type of analysis. In (unre- 26 Table 1.7: Welfare Participation: Slope Estimates Full Sample CRE (1) Marriage -0.986 (0.011) Kids 0.162 (0.001) MLE (2) -1.908 (0.208) 0.481 (0.104) Sample with changing participation A-FV09 A-HN04 J-DJ15 J-HN04 (3) (4) (5) (6) -1.579 -1.730 -1.822 -1.565 (0.178) (0.189) (0.229) (0.176) 0.409 0.437 0.447 0.380 (0.098) (0.100) (0.121) (0.096) CRE (7) -1.462 (0.022) 0.358 (0.006) N=1934 N*=494 T=8 T=8 Controls include education, poverty level, a quadratic in age, a race dummy, and time period dummies. Standard errors were clustered by individual ported) simulations with true time effects, I found that the additional bias caused by their inclusion to be smaller and that it did not change the relative performance of the different FE estimators. Table 1.8 lists estimated APE. Unlike the simulations, these tables include CRE and LPM estimates over the estimation subsample of the fixed effects estimators. This application highlights the problems that may arise when many individuals have responses that do not change. In this case, only 494, or roughly 25% of women in the sample had participation that changed over the 32 months of the survey. In the worst simulation case (T = 6, ρ = 0.8) 68% of the sample still had responses that changed. Practically speaking, not only does this increase variance of the estimators, but it potentially exacerbates any bias stemming from sample selection (which did not appear to be much of a problem in the simulations). The bias-corrected slope estimates in both cases are smaller in magnitude than the uncorrected MLE, and are similar in magnitude to CRE estimates over the subsample of changing responses, though quite different from the CRE estimates over the whole sample. Probit slope estimates from the 1998 and 2014 versions of Chay and Hyslop range from -0.934 to -0.658 for the marriage variable, and 0.11 to 0.152 for the kids variable. Both are much smaller in magnitude than the nonlinear fixed effects estimates suggesting that persistence, state dependence and/or sample selection are playing a significant role. 27 Table 1.8: Welfare Participation: Average Partial Estimates Full Sample CRE LPM (1) (2) Marriage -0.260 -0.271 (0.001) (0.001) Kids 0.047 0.052 (0.000) (0.000) MLE (3) -0.112 (0.007) 0.034 (0.007) Sample with changing participation A-FV09 A-HN04 J-DJ15 J-HN04 CRE* (4) (5) (6) (7) (8) -0.110 -0.115 -0.162 -0.129 -0.112 (0.008) (0.007) (0.008) (0.008) (0.000) 0.033 0.035 0.049 0.034 0.034 (0.007) (0.007) (0.007) (0.007) (0.000) N=1934 N*=494 T=8 T=8 *Sum of partial effects divided by full sample size for comparison with FE estimators 28 LPM* (9) -0.126 (0.000) 0.033 (0.000) The 1998 version of Chay and Hyslop contains several estimates of LPMs, including the static model estimated with fixed effects (column 2 of Table 2), which are compared to the bias-correct APE estimates in Table 1.8. The Chay and Hyslop estimates (that account for heterogeneity) range from -0.271 to -0.143 for marriage and from 0.029 to 0.068 for kids. The bias corrected estimates range from -0.162 to -0.110 for marriage and from 0.033 to 0.050 for kids, which seem more in line than the slope estimates, echoing previous research and the simulation evidence in this chapter for the “small bias” property. 
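Before concluding, it may help to record how the split-panel jackknife entries (the J-DJ15 columns in Tables 1.7 and 1.8) are formed: the corrected estimate is twice the full-panel estimate minus the average of the estimates from the two half-panels. The following is a minimal Python sketch, assuming a user-supplied routine estimate(y, X) (for example, probit MLE with individual dummies, or the implied APE) and a balanced panel with an even number of periods; all names here are hypothetical and the sketch is not the exact implementation used for the tables.

    import numpy as np

    def split_panel_jackknife(estimate, y, X):
        """Split-panel (half-panel) jackknife combination:
        2 * full-panel estimate minus the average of the two half-panel estimates.
        `estimate(y, X)` is a user-supplied routine returning a parameter or APE vector;
        y is (N, T) and X is (N, T, K), split on the time dimension (T assumed even)."""
        T = y.shape[1]
        theta_full = estimate(y, X)
        theta_first = estimate(y[:, : T // 2], X[:, : T // 2])
        theta_second = estimate(y[:, T // 2 :], X[:, T // 2 :])
        return 2.0 * theta_full - 0.5 * (theta_first + theta_second)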
1.5 Conclusion

The simulation evidence in this chapter suggests that these bias corrections continue to estimate APE fairly well when the level of serial correlation is low to moderate, but strong serial correlation may cause bias when the panel is short. As such, dynamic completeness may be a substantive requirement unless the researcher has access to many time periods of data. Estimation in shorter panels may also present sample selection or computational challenges. While it may seem unfair to evaluate a technique based on large-T asymptotic approximations using panels with only six time periods, others have suggested these techniques have desirable properties in large-N, small-T settings. Moreover, the results of this chapter suggest that if a researcher is primarily concerned with estimating APE in a static model, then the bias correction techniques considered here may offer little benefit relative to the uncorrected MLE while adding the cost of a more complicated estimation procedure. It should be noted, however, that the “small bias” property of APE does not hold in dynamic models, where correction techniques have been found to decrease bias substantially. Additionally, I find that the fixed effects approach (with or without a bias correction) may offer advantages over CRE when the heterogeneity does not satisfy the CRE assumption. I also find evidence that highlights the importance of correcting for inconsistent estimation of the heterogeneity terms when a consistent procedure is used to estimate the slopes.

There are many important avenues for future research. First and foremost, an interesting question is how well the analytical bias correction of Hahn and Kuersteiner (2011) performs in this setting. It accommodates serial correlation in theory, but requires “moderately large T.” Furthermore, in practical applications like a policy or program analysis, it is important to control for time effects, which I did not include in this set of simulations. The reason is that under the large-T asymptotics that justify these corrections, time dummies are also incidental parameters. I did run a set of simulations over the same values of ρ and T where time effects were estimated but were not part of the true data generating process for yit. I found that the same relative patterns held across estimators as in this chapter, but the additional incidental parameters caused slightly higher bias in slope parameters and virtually no increase in bias for APE, except for the short panels, where bias increased slightly. Fernandez-Val and Weidner (2016) allow for both time and cross-sectional heterogeneity in analytical and jackknife corrections. However, their results depend on N/T converging to a constant in the limit. Therefore, unlike the wide and short panels included in this chapter, their approach is intended for settings where N and T are of similar magnitude.

CHAPTER 2

EXPONENTIAL PANEL MODELS WITH COEFFICIENT HETEROGENEITY

2.1 Introduction

The fixed effects Poisson (FEP) estimator, also known as multinomial QCMLE, is an attractive choice for modeling nonnegative responses whose conditional means contain an unobserved individual effect that may be correlated with the explanatory variables. Unlike other conditional-ML estimators, notably the FE logit, FEP does not require assuming a full distribution or conditional independence (Wooldridge, 1999).
This chapter considers the exponential conditional mean, which is logically consistent for nonnegative dependent variables and has the feature that coefficients on the regressors can be interpreted as semi-elasticities. The focus of this chapter is an extension to the unobserved effects exponential model that allows for additional heterogeneity in the form of random coefficients. While there is some literature considering Poisson variables in this setting, less is known about how to proceed for other nonnegative or non-count variables, or even about the consequences of ignoring the heterogeneity. In the linear unobserved effects model with strictly exogenous regressors and random coefficients, for instance, it is straightforward to show that fixed effects OLS is consistent for the means of the coefficients so long as they are mean-independent of the time-demeaned regressors. This is not necessarily true for nonlinear models, as this chapter shows for the exponential case. Moreover, it is unknown whether other quantities of interest, like average partial effects (APE), can be consistently estimated while ignoring coefficient heterogeneity. Furthermore, much of the literature assumes all sources of heterogeneity are independent of covariates, which can cause inconsistent estimation of coefficient means as well as type II errors in tests for random coefficients.

These potential complications motivate testing for neglected heterogeneity. An LM test in the style of Chesher (1984), however, is likely to reject when the Poisson distribution is misspecified or when conditional independence fails. Therefore, I extend this methodology specifically to the FEP setting, deriving a simple variable addition test that is more broadly applicable. Furthermore, I propose a method for parametrically identifying the means of random coefficients that leads to estimators that are computationally simple relative to existing approaches to random coefficients in this model. One novel contribution of this chapter is to treat the random coefficients and the traditional multiplicative effect separately, as the latter can be handled without restricting its dependence on the explanatory variables. (The multiplicative effect can also be expressed as a random intercept inside the exponential conditional mean function.) I also provide estimators of average partial effects. In an application to the patent-R&D relationship among U.S. manufacturing firms, I find evidence of heterogeneous elasticities and lagged effects, though the results are not robust to changes in the estimation sample.

The rest of this chapter is organized as follows: Section 2 gives an overview of the existing literature; Section 3 reviews the FEP model and the classical test for the fixed effects Poisson case before proposing this chapter's theoretical contributions; Section 4 contains a Monte Carlo experiment for the proposed methods; Section 5 describes the empirical application; and Section 6 consists of a brief conclusion and directions for future research.

2.2 Literature Review

Applying Andersen's (1970) conditional ML methodology, Hausman, Hall, and Griliches (1984) developed the FEP estimator for count data that allows arbitrary dependence between the unobserved effect and the regressors. They implemented their techniques to analyze the patent-R&D relationship in the U.S. manufacturing industry. Wooldridge (1999) showed that correct specification of the conditional mean and strict exogeneity of the regressors (conditional on the unobserved effect) were sufficient for consistency of FEP, broadening its application as a quasi-CMLE.
Cameron and Trivedi (2013) considered the panel unobserved effects Poisson model with random coefficients in a “random effects” setting where all of the heterogeneity was assumed to be normally distributed and independent of the regressors. They concluded that “unlike for the linear model, the conditional mean for the random slopes model differs from that for the pooled and random effects models, making model comparison and interpretation more difficult.”

Lagrange multiplier (LM) statistics are attractive in testing for coefficient heterogeneity because they use parameter estimates from a restricted model, which can be simpler to estimate. In this case, the restricted model is FEP, for which built-in procedures exist in Stata and other programs. Moreover, LM tests are valid for null values on the boundary of the parameter space, unlike Wald tests, which is important because parameters (i.e., variances) associated with random coefficients should be nonnegative (Wooldridge, 2010). Random coefficients are an example of the kind of neglected heterogeneity for which Chesher (1984) derived a test in the ML setting. Chesher, as well as Lee and Chesher (1986), developed methodology for deriving test statistics in this and other settings where the scores are identically zero under the parameter restriction. Greene and MacKenzie (2015) applied this methodology to random effects probit MLE. Hahn, Newey, and Smith (2014) extend Chesher's approach to moment condition estimators like Generalized Method of Moments (GMM). Hahn, Moon, and Snider (2015) allow for dependence between the heterogeneity and covariates when testing in the likelihood setting, though they also find that tests that treat the heterogeneity and regressors as mean and second-moment independent still have power under alternatives where this is not true. A common feature of tests for neglected heterogeneity in the likelihood setting is that they have the interpretation of being tests either for information matrix (IM) equality or for overdispersion, making them less attractive in settings where researchers do not want to fully specify a distribution. I derive a test for slope heterogeneity in exponential models that does not have this drawback.

A Poisson-normal mixture model like the one described by Cameron and Trivedi is one of the “Generalized linear latent and mixed models” studied by Rabe-Hesketh and Skrondal (2004). The likelihood function consists of a multi-dimensional integral that must be numerically approximated, limiting its application to models where only a small number of coefficients are believed to be random. The authors used adaptive Gaussian quadrature to estimate a model of seizure counts for 236 subjects in a (randomly assigned) epilepsy treatment trial, where both the intercept and the coefficient on a variable for time of visit were allowed to vary by individual. While a random effects approach makes sense for the experimental setting, treating the heterogeneity as independent of covariates can cause inconsistent estimation in many economic applications. Wang, Cockburn, and Puterman (1998) do allow dependence between the heterogeneity and explanatory variables in the panel Poisson setting, assuming a parametric form for the dependence as well as a particular distribution for the heterogeneity.
With the patent-R&D relationship in mind, they propose a mixed-Poisson regression approach which assumes that the coefficients follow a discrete distribution with finite support, modeling the probability mass at each support point as multinomial logit. Their method involves using economic intuition or selection criteria to choose the number of support points. Moreover, they suggest using a continuous model for the coefficients if model selection criteria indicate four or more points of support. My paper complements their work by proposing such a model. One benefit of my approach is that, as in FEP, I can allow an unrestricted relationship between the explanatory variables and the multiplicative effect, as well as analyze non-counts.

2.3 Theory

2.3.1 The fixed effects Poisson model with coefficient heterogeneity

The standard fixed effects Poisson model with an exponential mean function assumes:

E(y_it | x_i, c_i) = E(y_it | x_it, c_i) = c_i exp(x_it β_0)   (2.1)

for i = 1, ..., N; t = 1, ..., T. In this expression, x_it is a 1 × K vector of time-varying explanatory variables, c_i is unobserved heterogeneity, and β_0 is a K × 1 unknown vector of coefficients. (Wooldridge (1999) considered conditional mean functions of the form c_i m(x_it, β_0), of which m(x_it, β_0) = exp(x_it β_0) is a special case.) Equation (2.1) implicitly assumes that x_it is strictly exogenous. Hausman, Hall, and Griliches (1984) showed that if, conditional on x_i = {x_i1, ..., x_iT} and c_i, the y_it are independently distributed as Poisson with mean given by (2.1), then conditioning on n_i ≡ Σ_{t=1}^T y_it results in the multinomial distribution for {y_i1, ..., y_iT}. The multinomial log-likelihood is

ℓ_i(β) = Σ_{t=1}^T y_it log[p_t(x_i, β)],   (2.2)

where

p_t(x_i, β) ≡ exp(x_it β) / Σ_{r=1}^T exp(x_ir β).   (2.3)

The feature that c_i enters the conditional mean function multiplicatively means it cancels out of p_t(x_i, β) and therefore ℓ_i(β), so the dependence between c_i and x_i may remain unrestricted. This structure also has the consequence that coefficients on time-constant regressors are not identified, because those terms also cancel. The model is particularly attractive because, as shown by Wooldridge (1999), β_0 maximizes the expected value of (2.2) as long as (2.1) is true. Therefore, under additional regularity conditions, FEP consistently estimates β_0 with N growing and T fixed. Notably, consistency does not require a distributional assumption for the responses and allows them to be arbitrarily serially correlated (Wooldridge, 1999).

Condition (2.1) generally fails, however, if the coefficients in the conditional mean function vary by individual i, as in the following:

E(y_it | x_i, c_i, b_i) = E(y_it | x_it, c_i, b_i) = c_i exp(x_it b_i),   (2.4)

where now b_i is a K × 1 vector of unobserved random variables such that E(b_i) = β_0. Defining d_i ≡ b_i − β_0, the conditional mean in (2.4) is equivalent to c_i exp(x_it β_0 + x_it d_i), meaning one interpretation of the heterogeneity is unobserved interactions in the index of the mean function. There is a more practical, economic interpretation as well. Assuming element j is not functionally related to any other elements of x_it,

∂ log[E(y_it | x_i, c_i, b_i)] / ∂ x_itj = b_ij,   (2.5)

so model (2.4) implies semi-elasticities of the conditional mean of y_it that vary by individual. If x_itj is the log of another variable, as in some applications, then the b_ij are individually-varying elasticities.
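As a point of reference for what follows, the FEP objective in (2.2)-(2.3) is simple to code directly. The following is a minimal Python sketch; the array names and shapes are hypothetical and it is meant only to make the structure of the multinomial quasi-log-likelihood concrete.

    import numpy as np
    from scipy.optimize import minimize

    def fep_negative_loglik(beta, X, Y):
        """Negative of the multinomial quasi-log-likelihood in (2.2)-(2.3), summed over i.
        X: (N, T, K) array of regressors, Y: (N, T) array of nonnegative responses."""
        xb = X @ beta                                   # (N, T) linear indices x_it * beta
        xb = xb - xb.max(axis=1, keepdims=True)         # harmless shift; p_t is invariant to it
        log_p = xb - np.log(np.exp(xb).sum(axis=1, keepdims=True))
        return -np.sum(Y * log_p)                       # units with sum_t y_it = 0 contribute zero

    # beta_hat = minimize(fep_negative_loglik, np.zeros(X.shape[2]), args=(X, Y), method="BFGS").x

Maximizing this objective gives the FEP estimates whether or not the y_it are counts, which is exactly the quasi-CMLE interpretation discussed above.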
An immediate consequence of the coefficient heterogeneity in (2.4) is that it likely causes specification error if we want to use FEP assuming (2.1). To see this, suppose for concreteness that d_i is continuous, and write its PDF conditional on x_i and c_i as f(·; ψ_0), where ψ_0 is an unknown parameter that is nonzero only if the coefficients are random. It follows under (2.4) and the Law of Iterated Expectations (LIE) that

E(y_it | x_i, c_i) = c_i exp[x_it β_0 + g_t(x_i, x_it, c_i; ψ_0)],   (2.6)

where

g_t(x_i, x_it, c_i; ψ_0) = log{E[exp(x_it d_i) | x_i, c_i]} = log ∫_{R^K} exp(x_it d_i) f(d_i | x_i, c_i) dd_i,   (2.7)

assuming the expectation exists. The exponential function now contains an unknown term that is generally nonzero and varies over time. (If g_t(x_i, x_it, c_i; ψ_0) were time-constant, it would also cancel from p_t(x_i, β, ψ) and FEP would remain consistent, but there is no reason to think this should be the case with time-varying x_it.) Depending on what we are willing to assume about the dependence between b_i and x_i, we may not be able to distinguish between coefficients that are random and a more flexible functional form. The consequence of ignoring the coefficient heterogeneity is that (2.1) is no longer correct, and so FEP of y_it on x_it can no longer be shown to be generally consistent for β_0. This is true even under ideal conditions like independence between b_i and {x_i, c_i}. In fact, simulation evidence from Section 4 suggests substantial bias and inconsistency for FEP in this case. This contrasts with the linear unobserved effects model with random coefficients, in which fixed effects OLS is consistent for the means of the coefficients so long as the coefficients are mean independent of the time-demeaned regressors (Wooldridge, 2010). In that case, the random coefficients cause a certain form of system heteroskedasticity in the idiosyncratic errors that is handled completely with robust inference.

2.3.2 Testing under full distributional assumptions

If the y_it are count data and researchers are willing to take full distributional assumptions seriously, the approach of Chesher (1984) provides a simple LM test. The slopes are not allowed to depend on the covariates or c_i under the alternative, which avoids having to specify a particular joint distribution for b_i and x_i. However, lack of power may be an issue under alternatives where b_i depends on x_i. Findings of Hahn, Moon, and Snider (2015) suggest that this is less of a concern in nonlinear models. The following statements formalize the assumptions:

y_it | (x_i, c_i, b_i) ∼ Poisson[c_i exp(x_it b_i)], i = 1, ..., N; t = 1, ..., T,   (2.8)

{y_i1, ..., y_iT} are independent conditional on {x_i, c_i, b_i},   (2.9)

b_i = β_0 + Λ_0 u_i, where u_i | (x_i, c_i) ∼ F(0, I_K),   (2.10)

where I_K is the K × K identity matrix. From Chesher (1984), assumption (2.10) does not assume a particular distribution for b_i, but specifies that they follow a “location-scale generalization of the class of spherical distributions” described by Kelker (1970). Denote the PDF of u_i as f(·). It follows that

y_i | (n_i, x_i, c_i, b_i) ∼ Multinomial(n_i, p_1(x_i, b_i), ..., p_T(x_i, b_i)),   (2.11)

where

p_t(x_i, b_i) ≡ exp(x_it b_i) / Σ_{r=1}^T exp(x_ir b_i).   (2.12)

Therefore, the log-likelihood for an observation i, integrating out the random part of the slopes, is

ℓ_i(β, Λ) = log ∫_{R^K} [n_i! / (Π_{t=1}^T y_it!)] Π_{t=1}^T [p_t(x_i, b_i)]^{y_it} f(u_i) du_i,   (2.13)

where the integral is of K dimensions.
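To see the computational burden in (2.13) concretely, consider a simulated-likelihood approximation of the integral for a single unit. The sketch below uses hypothetical array names and draws from the assumed spherical density f; in practice adaptive quadrature (as in Rabe-Hesketh and Skrondal, 2004) or a large number of draws per unit would be required.

    import numpy as np

    def simulated_loglik_i(beta, Lam, Xi, yi, draws):
        """Monte Carlo approximation of the integral in (2.13) for a single unit i.
        Xi: (T, K) regressors, yi: (T,) responses, draws: (R, K) simulated u_i from the
        assumed density f; Lam is the K x K scale matrix Lambda."""
        B = beta + draws @ Lam.T                        # R draws of b_i = beta + Lambda u_i, stored by row
        xb = Xi @ B.T                                   # (T, R) array of indices x_it b_i
        log_p = xb - np.log(np.exp(xb).sum(axis=0, keepdims=True))
        log_kernel = (yi[:, None] * log_p).sum(axis=0)  # multinomial kernel, up to the combinatorial constant
        return np.log(np.mean(np.exp(log_kernel)))      # a log-sum-exp step would be used in practice

Each evaluation of the sample log-likelihood repeats this calculation for every unit inside a numerical optimizer over (β, Λ).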
An LM test of H_0: Λ_0 = 0 is attractive because in this case b_i = β_0, and so the restricted model can be estimated using FEP. It also turns out that the restricted score does not depend on the unknown PDF f(·). However, the parameterization of this model causes a complication in deriving the restricted scores, as described by Chesher (1984) and Lee and Chesher (1986) for a more general class of models: the score of the unrestricted model evaluated at the parameter restriction is identically zero (see Appendix C for the derivation). Chesher (1984) proposed re-parameterizing the scale assumption and restricting the correlation among the heterogeneity allowed under the alternative (Chesher's own solution would be to assume Λ_0 = λ_0 I_K):

Λ_0 = diag(λ_{1,0}, ..., λ_{K,0}).   (2.14)

Allowing no covariance between coefficients may affect power under alternatives in which this does not hold, but at the same time, information about the covariances is only relevant if there is evidence that the variances are nonzero. (Strictly speaking, the relevant alternative should be that at least one λ_{j,0} > 0, but for simplicity the two-sided alternative is treated here, as in Chesher (1984).) Under (2.14), the restricted score has the 0/0 form, but the limits follow from L'Hopital's rule. The algebraic details are collected in Appendix C. Collecting the λ_j in the K × 1 vector λ, the restricted score is

s(β, 0) ≡ lim_{λ↓0} ∇_θ ℓ(β, λ) = Σ_{i=1}^N [ Σ_{t=1}^T y_it ∇_β p_t(x_i, β) / p_t(x_i, β) ; (1/2) a_1(x_i, β) ; ... ; (1/2) a_K(x_i, β) ],   (2.15)

where the semicolons separate the stacked blocks of the score vector and a_j(x_i, β) is the (j, j)th element of

A(x_i, β) ≡ Σ_{t=1}^T ∇²_β M_it(β) + [Σ_{t=1}^T ∇_β M_it(β)][Σ_{t=1}^T ∇_β M_it(β)]′.   (2.16)

In this last expression, M_it is the multinomial log-likelihood for observation i in period t. The outer-product-of-the-score version of the LM statistic is then N times the uncentered R-squared from the regression of 1 on ŝ_i, where for each observation i, ŝ_i is the corresponding summand on the right-hand side of (2.15) evaluated at the FEP estimate of β. The advantage of this approach is its relative simplicity. The unrestricted model may even be computationally infeasible to estimate, but a test of the null hypothesis of constant coefficients is relatively easy to implement. The downside of this approach concerns robustness to failure of (2.8) or (2.9). Chesher (1984) notes that statistics derived using this approach resemble White's (1982) information matrix test for general model misspecification, as E[A(x_i, β)] = 0 if the conditional multinomial distribution is correct. This means coefficient heterogeneity cannot be distinguished from failures of the model's other assumptions, such as the Poisson distribution or conditional independence.
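Since the N·R² form of the statistic is used both here and again for the score derived in the next subsection, a minimal sketch may be useful. It assumes the researcher has already stacked the restricted score contributions ŝ_i from (2.15), evaluated at the FEP estimates, into the rows of an array; the function and array names are hypothetical.

    import numpy as np

    def opg_lm_statistic(score_contributions):
        """Outer-product-of-the-score LM statistic: N times the uncentered R-squared
        from regressing a column of ones on the rows of the restricted score
        contributions (the summands in (2.15), evaluated at the FEP estimates)."""
        S = np.asarray(score_contributions, dtype=float)   # (N, number of score components)
        N = S.shape[0]
        ones = np.ones(N)
        coef, *_ = np.linalg.lstsq(S, ones, rcond=None)    # OLS of 1 on the scores, no intercept
        ssr = np.sum((ones - S @ coef) ** 2)
        return N * (1.0 - ssr / N)                         # N times the uncentered R-squared

Under the null, the statistic would be compared with a chi-squared critical value with degrees of freedom equal to the number of restrictions (here K).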
2.3.3 Testing under weaker assumptions

In the previous section, I showed the classical test applicable to conditionally independent Poisson dependent variables. While the statistic is simple to calculate, the test is likely to reject in cases where the Poisson or conditional independence assumption fails, regardless of the presence of random coefficients. This is similar to the case of a linear model, where the presence of random slopes (that are assumed to be independent of covariates) is indistinguishable from a certain form of system heteroskedasticity.

In this section, I extend Chesher's approach to testing for neglected heterogeneity to the FEP setting, where only the conditional mean of y_it is assumed to be correctly specified. I show that an LM test of exclusion restrictions on squared regressors is valid when the coefficients are allowed to belong to a location-scale family under the alternative. As before, assume:

E(y_it | x_i, c_i, b_i) = E(y_it | x_it, c_i, b_i) = c_i exp(x_it b_i)   (2.17)

and

b_i = β_0 + Λ_0 u_i, where u_i | (x_i, c_i) ∼ F(0, I_K),   (2.18)

where again the CDF F(·) and the corresponding PDF f(·) are left unspecified. Similar to before, these conditions imply:

E(y_it | x_i, c_i) = c_i exp[x_it β_0 + m_t(x_i, Λ_0)],   (2.19)

where

m_t(x_i, Λ_0) = log{E[exp(x_it Λ_0 u_i) | x_i, c_i]} = log ∫_{R^K} exp(x_it Λ_0 u_i) f(u_i) du_i.   (2.20)

It is easy to see that m_t(x_i, 0) = 0. In the multivariate normal case, m_t(x_i, Λ_0) = (1/2) x_it Ω_0 x_it′, where Ω_0 = Λ_0 Λ_0′. Rejecting H_0: Λ_0 = 0 provides evidence against the null of constant coefficients. I follow Chesher's derivation of the LM statistic as before, but unlike other methods, I only integrate u_i out of the conditional mean function, not the entire likelihood or score. The unrestricted quasi-log-likelihood is

ℓ_i(β, Λ) = Σ_{t=1}^T y_it log[p_t(x_i, β, Λ)],   (2.21)

where

p_t(x_i, β, Λ) ≡ exp(x_it β + m_t(x_i, Λ)) / Σ_{r=1}^T exp(x_ir β + m_r(x_i, Λ)).   (2.22)

The first K elements of the unrestricted score evaluated at Λ = 0 are just the usual FEP scores. The gradient with respect to Λ evaluated at Λ = 0, however, presents a similar problem as before. I make the same re-parameterization as before, shown in equation (2.14), restricting the coefficients to be uncorrelated with each other under the alternative. The restricted scores have a 0/0 form and are evaluated using L'Hopital's rule. The details are collected in Appendix C. The score evaluated at the parameter restriction is

s_i(β, 0) = [ Σ_{t=1}^T y_it ∇_β p_t(x_i, β, 0) / p_t(x_i, β, 0) ; (1/2) Σ_{t=1}^T y_it ( x_it1² − Σ_{r=1}^T exp(x_ir β) x_ir1² / Σ_{r=1}^T exp(x_ir β) ) ; ... ; (1/2) Σ_{t=1}^T y_it ( x_itK² − Σ_{r=1}^T exp(x_ir β) x_irK² / Σ_{r=1}^T exp(x_ir β) ) ].   (2.23)

The last K elements are proportional to the restricted FEP scores for testing the exclusion of squared regressors from the model with constant slopes. Therefore, in the exponential case, we cannot distinguish random coefficients from the presence of quadratics in E(y_it | x_it, c_i). As an empirical matter, however, this test takes no stand on the (conditional) distribution, overdispersion, or serial correlation of y_it, so it may offer some advantages over the approach in Section 3.2. For example, if a researcher rejects the null using the test based on (2.15) but fails to reject based on (2.23), then he or she can proceed in estimating the model based on (2.1) with some peace of mind.

2.3.4 A correlated random coefficients approach to testing and estimation

When one wishes to allow more than one or two slopes to be random, “random effects” type estimation based on integrating out the heterogeneity is computationally difficult and may not be robust to misspecification of the response variable's distribution. A straightforward alternative, which is applicable not only to counts but also to other nonnegative responses, is to make a parametric, distributional assumption for b_i that allows us to derive E[exp(x_it d_i) | x_i, c_i].
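To make that last step concrete: if d_i | (x_i, c_i) ∼ Normal(0, Ω_0), then the moment generating function of the multivariate normal gives

E[exp(x_it d_i) | x_i, c_i] = exp((1/2) x_it Ω_0 x_it′),

so the unknown term g_t(·) in (2.6)-(2.7) collapses to a quadratic form in x_it that can be absorbed into an augmented set of regressors. This is the same normal/lognormal calculation noted after (2.20), and it is the device exploited below.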
Here, I assume correlated random coefficients (CRC) and (conditional) multivariate normality:

b_i = α_0 + Γ_0 x̄_i′ + d_i, d_i | (x_i, c_i) ∼ Normal(0, Ω_0),   (2.24)

where x̄_i = T^{-1} Σ_{t=1}^T x_it, α_0 is an unknown K × 1 vector, and Γ_0 is an unknown K × K matrix. This assumption states that the dependence between x_i and the mean of b_i is captured entirely through the time averages of x_it, and is the application of Mundlak (1978) to the current setup. Alternatively, one could allow the mean of b_i to depend on x_i in the style of Chamberlain (1980). If Γ_0 = 0, then (2.24) amounts to a stronger version of (2.10), in which case α_0 = β_0. Note that (2.24) only requires multivariate normality of the coefficients conditional on x_i; their unconditional distribution need not be normal, though logically speaking it should be continuous and have unbounded support. Condition (2.24) also implies b_i and c_i are independent conditional on x_i. This is less restrictive for testing purposes because b_i is constant under the null, but it could affect power under alternatives where the two are dependent. The two sources of heterogeneity are still allowed, through x_i, to be correlated unconditionally. As in FEP, the relationship between x_i and c_i is left completely unrestricted. Under (2.4) and (2.24), it follows from properties of the lognormal distribution and the LIE that

E(y_it | x_i, c_i) = E(y_it | x_it, x̄_i, c_i)
= c_i exp[ x_it α_0 + x_it Γ_0 x̄_i′ + (1/2) x_it Ω_0 x_it′ ]
= c_i exp[ x_it α_0 + (x̄_i ⊗ x_it) vec(Γ_0) + (1/2) ( Σ_{j=1}^K ω_j x_itj² + 2 Σ_{j=1}^{K−1} Σ_{h>j} ρ_jh x_itj x_ith ) ]
≡ c_i exp[ x_it α_0 + (x̄_i ⊗ x_it) γ_0 + (1/2) x̌_it ω_0 ],   (2.25)

where γ_0 = vec(Γ_0), x̌_it = (x_it1², ..., x_itK², x_it1 x_it2, x_it1 x_it3, ..., x_it,K−1 x_itK), ω_0 ≡ (ω_1, ..., ω_K, 2ρ_12, 2ρ_13, ..., 2ρ_{K−1,K})′, ω_j = Var(b_j), and ρ_jh = Cov(b_j, b_h). Equation (2.25), along with regularity conditions, implies that FEP of y_it on x_it, interactions between x_it and x̄_i, and squares and interactions of x_it will consistently estimate α_0, γ_0, and ω_0, without assuming a distribution for y_it and while allowing arbitrary serial correlation (Wooldridge, 1999). Following estimation of (2.25), the unconditional means of the b_i are easy to estimate using the following, where μ_x̄ = E(x̄_i):

β_0 ≡ E(b_i) = α_0 + Γ_0 μ_x̄′.   (2.26)

I believe that using the lognormal distribution in the FEP setting is novel and that it offers the crucial advantage of still allowing one source of heterogeneity to be correlated with x_i. (A similar result appears in Cameron and Trivedi (2013) for the case where b_i | (x_i, c_i) ∼ Normal(β_0, Ω_0) and c_i | (x_i, b_i) ∼ lognormal(0, σ_c²), as a way of illustrating how random coefficients change E(y_it | x_i).) This procedure is easy to implement, as the FEP estimator is available in software packages like Stata, though practitioners should be careful to calculate cluster-robust standard errors to account for serial correlation and misspecification of the multinomial distribution. Another important note is that if one believes time-constant variables z_i belong in the model and that they also have random coefficients correlated with the coefficients on the x_it, then the augmented FEP regression should also include interactions between z_i and x_it, as these are not absorbed by c_i when conditioning on n_i. One drawback to this approach is that for a binary element k of x_it, FEP only identifies α_k + (1/2)ω_k.
Similarly, some elements of α 0 and Ω 0 are not separately identified when x it contains both levels and higher order terms. This model nests the traditional case of constant coefficients, which occurs when γ 0 = 0 and ω 0 = 0 ). Rejection of the null that γ 0 = 0 is perhaps most convincing evidence of that slopes vary by individual. Therefore, the primary contribution of this approach to random coefficients is to suggest the inclusion of interactions between time-varying regressors and time averages to see if more flexibility is necessary. If there is no evidence that slopes are correlated with the x¯ i , then one should carefully consider how to interpret inference on ω 0 . Statistically significant estimates may just indicate that squares and cross-products of x it belong in the FEP regression. Clearly if the cross-products are significant while the squares are not, or if the coefficients on squared terms are negative and significant, then the random coefficient framework does not make sense, though the results may still have yielded useful insight into the what functions of the explanatory variables should be included in the analysis. 2.3.5 Adding second moment assumptions While under our assumptions, FEP is consistent under correct specification of the conditional mean (2.25), it may be possible to achieve greater efficiency by adding assumptions about the conditional second moment of y i . Another reason may be to identify the coefficients on binary variables. 43 I assume a variance function that is proportional to the conditional mean. Var [yit |xxi , ci , b i ] = σ0 ci exp(xxit b i ) (2.27) Additionally, the following CRE assumption implies conditional mean and variance functions that do not depend on ci . log(ci )|xxi , b i ∼ Normal(ψ1 + x¯ i ξ 1 , σa2 ) (2.28) Under assumptions 2.4, 2.24, 2.27, and 2.28, it follows from the properties of the lognormal distribution, the LIE, and the Law of Total Variance that 1 E(yit |xxi ) = E(yit |xxit , x¯ i ) = exp h(xxit , x¯ i , θ 0 ) + v(xxit , τ 0 ) 2 (2.29) and Var(yit |xxi ) =Var(yit |xxit , x¯ i ) 1 =σ0 exp h(xxit , x¯ i , θ 0 ) + v(xxit , τ 0 ) 2 + exp [2h(xxit , x¯ i , θ 0 ) + v(xxit , τ 0 )] {exp [v(xxit , τ 0 )] − 1} , (2.30) ω 0 , σa2 ) , h(xxit , x¯ i , θ 0 ) ≡ ψ1 + x¯i ξ 1 + x it α 0 + (¯x i ⊗ x it )γγ 0 , and where θ ≡ (ψ1 , ξ 1 , α , γ ) , τ = (ω v(xxit , τ 0 ) ≡ xˇ it ω 0 + σa2 . Estimation of θ 0 and τ 0 can then proceed using pooled normal QMLE, specifying the mean and variance functions as above. As the normal distribution is a member of the quadratic exponential family, this procedure is consistent without the normal distribution being true (Gourieroux, Monfort, and Trognon, 1984) Once again, inference should be made cluster-robust to account for serial correlation and the true distribution being non-normal. Estimation of β 0 can then proceed as before, and coefficients on binary or quadratic variables are now identified off of the nonlinearity in (2.30). Normal QMLE in this case is straightforward to program in software like Stata using built-in maximum likelihood functions, and it had good finite sample properties in simulations run for this 44 chapter. Some researchers may wish to specify a conditional covariance structure for yi as a way to get more efficiency. If so, one option is to assume Cov [yit , yir |xxi , ci , b i ] = 0,t = r. 
(2.31) Equation (2.31) does not allow serial correlation when conditioning on x i , ci , b i , but the presence of the time-constant heterogeneity ensures that the responses will be serially correlated when conditioning on x i only. Under 2.4, 2.24, 2.27, 2.31, and 2.28, Cov(yit , yir |xxi ) = 1 exp h(xxit , x¯ i , θ 0 ) + h(xxir , x¯ i , θ 0 ) + (v(xxit , τ 0 ) + v(xxir , τ 0 )) 2 exp(xxit Ω 0 x ir + σa2 ) − 1 . (2.32) 2.3.6 Estimating average partial effects Even though the coefficients in (2.4) have direct interpretations as semi-elasticities, it may still be desirable to estimate partial effects and APEs, perhaps to compare estimates between competing nonlinear models. Moreover, this sections shows that the average partial effects for a binary variable depend only on αk + 21 ωk , meaning that even though we cannot separately identify αk and ωk without second moment assumptions, we can still estimate average partial effects. Let x = {xx1 , x 2 , . . . , x T }, c, and b = {b1 , b2 . . . , bK } denote fixed values of the variables. The partial effect of a continuous xt j on the conditional mean of yt is defined as8 φ j (xxt , c, b ) ≡ ∂ E(yt |xxt , c, b ) = ci exp(xxt b )b j . ∂ xt j (2.33) For a binary xtk , the partial effect is defined as the discrete difference in the conditional mean of yt at each level of the binary variable. In the expressions to follow, the subscript k signifies that xtk , x¯k , or their associated coefficients have been omitted from the vector. 8I implicitly assume that xt j is not functionally linked with any other element in xt . 45 φk (xxt , c, b ) ≡E(yt |xxtk , xtk = 1, c, b ) − E(yt |xxtk , xtk = 0, c, b ) =c exp(xxtk b k + bk ) − c exp(xxtk b k ) (2.34) Of course, estimating features of the distributions of φ j and φk is infeasible as we do not observe c or b . Therefore, this section focuses mainly on APEs where the heterogeneity has been averaged out. δh (xxt ) ≡ Evi [φh (xxt , ci , b i )] , (2.35) where v ≡ (c, b ) and h ∈ { j, k}. 2.3.6.1 Approaches under the CRE assumption for ci To proceed, it is necessary to maintain the assumptions of correlated random coefficients (2.24). As ci is unobserved, I also maintain (2.28). Later, I will discuss a possible “estimator” of ci . For now, there are two choices as to how to proceed in estimating δ j and δk . The first is to estimate an Average Structural Function (ASF), as proposed by Blundell and Powell (2003), where essentially x¯ proxies for v and is averaged out before taking derivatives and differences. The second is to use derivatives and differences of (2.29) directly (Wooldridge, 2010). The ASF is defined as: ASF(xxt ) ≡ Ev i [ci exp(xxt b i )] , (2.36) where again, xt is a fixed argument. Under (2.24), (2.27), and (2.28) the L.I.E. implies 1 ASF(xxt ) = Ex¯ exp h(xxt , x¯ i , θ 0 ) + v(xxt , τ 0 ) 2 (2.37) Passing the derivative through the expectation, the APE for continuous xt j is: 1 δ j (xxt ) =Ex¯ exp h(xxt , x¯ i , θ 0 ) + v(xxt , τ 0 ) 2 For a binary xtk , the APE is: 46 K α j + x¯ i γ j + ω j xt j + ∑ ρ jhxth h= j (2.38) δk (xxt ) =Ex¯ E yt |xxtk , xtk = 1, x¯ i − E yt |xxtk , xtk = 0, x¯ i 1 1 =Ex¯ exp h(xxtk , 1, x¯ i , θ 0 ) + v(xxtk , 1, τ 0 ) − exp h(xxtk , 0, x¯ i , θ 0 ) + v(xxtk , 0, τ 0 ) 2 2 , (2.39) where h(xxtk , 1, x¯ i , θ 0 ) =ψ1 + x¯i ξ 1 + xtk α k + xtk Γ k x¯ k + xtk x¯ik γ kk + αk + x¯ i γ k , h(xxtk , 0, x¯ i , θ 0 ) =ψ1 + x¯i ξ 1 + xtk α k + xtk Γ k x¯ k + xtk x¯ik γ kk , v(xxtk , 1, τ 0 ) =ˇx k ω k + σa2 + ωk + 2 K ∑ ρkhxth, h=k and v(xxtk , 0, τ 0 ) =ˇx k ω k + σa2 . 
(2.40) The direct approach consists of taking derivatives and differences of 2.29 directly. Note that since these expressions do not first average out x¯ , the entire history of x is now a fixed argument. For a continuous variable xt j the APE is: δ j (xx) = ∂ E(yt |xx) ∂ xt j 1 = exp h(xxt , x¯ , θ 0 ) + v(xxt , τ 0 ) 2 ξ j /T + α j + x¯ γ j + K 1 xt γ j + ω j xt j + ∑ ρ jh xth , T h= j (2.41) where γ j is the jth row and γ j is the jth column of Γ 0 . Define z(xxt , x¯ , θ , τ ) = h(xxt , x¯ , θ ) + 12 v(xxt , τ ). Then we have for a binary xtk , δk (xx) =E yt |xxk , {xsk }Ts=t , xtk = 1 − E yt |xxk , {xsk }Ts=t , xtk = 0 K 1 (1) (1) (1) = exp z(xxtk , x¯ k , θ k , τ k ) + ξ j x¯tk + αk + x¯ k γ kk + γkk x¯tk + xtk x¯tk γ kk + ωk + ∑ ρkh xth 2 h=k (0) (0) − exp z(xxtk , x¯ k , θ k , τ k ) + ξ j x¯tk + xtk x¯tk γ kk , 47 (2.42) (0) (1) where γkk is the kth diagonal element of Γ 0 , x¯tk ≡ T1 1 + ∑Ts=t xsk , and x¯tk ≡ T1 ∑Ts=t xsk . Whichever approach is chosen, one can then estimate δ j (xxt ) or δk (xxt ) by inserting the estimated parameters, replacing expectations over the distribution of x¯ with averages over i, and plugging in interesting values of x . Many researchers will average over the distribution of x to get a single number. Asymptotic variances can be computed either via the delta method or using the panel bootstrap. 2.3.6.2 Estimation when the slopes are independent of covariates The traditional case where b i is independent of x i (conditional on ci ) is one where the ASF is identified without placing any restriction on ci or Var(yit |xxi ). The following summarizes the necessary condition. bi = β 0 + d i, d i |(xxi , ci ) ∼ Normal(00, Ω 0 ). (2.43) The results of Section 3.4 continue to hold, but the time averages no longer enter E(yit |xxit , ci ) (that is, Γ 0 = 0 ). The LIE implies that for a fixed xt , 1 ASF(xxt ) = E(ci ) exp xt β 0 + xt Ω 0 x it 2 (2.44) Passing the derivative through the expectation, the APE of a continuous variable xt j is given by: 1 δ j (xxt ) =E(ci ) exp xt β 0 + xt Ω 0 x it 2 For a binary variable xtk , the APE is: 48 K β j + +ω j xt j + ∑ ρ jhxth h= j (2.45) δk (xxt ) = K 1 1 1 E(ci ) exp xtk β k + xˇ k ω k + αk + ωk + ∑ ρkh xth − E(ci ) exp xtk β k + xˇ k ω k 2 2 2 . h=k (2.46) An estimator for E(ci ) is conveniently available. Poisson QMLE using (2.25) and treating the ci as (strictly positive) parameters is algebraically equivalent to multinomial QCMLE. 9 ) In our current β , ω ) , the QMLE for ci is: application, for a given θ ≡ (β ci (θθ ) = ni , T exp(x xit β + xˇ it ω ) ∑t=1 (2.47) T y . Define c = c (θ ), where θ is the FEP estimate of (β β 0 , ω 0 ) . The where again, ni = ∑t=1 it i i properties of ci are not well-known in either the constant or heterogeneous slope case. Though there is no incidental parameters problem for θ in the FEP case, ci (θθ ) = ci , even when evaluated at θ 0 . Viewing ci as a parameter, there is no reason to think ci is unbiased and it cannot be consistent with T fixed. However, the ASF in this case is proportional to E(ci ). Strict exogeneity of x it and (2.24) imply that T E(ni |ci , x i ) = ci ∑ exp t=1 1 x it β 0 + xˇ it ω 0 2 (2.48) It follows from the L.I.E. that   E(ci ) = E  ni T exp ∑t=1 x it β 0 + 12 xˇ it ω 0  (2.49) meaning N −1 ∑N i=1 ci consistently estimates E(ci ). Many researchers are primarily interested in a single APE estimate (averaged across the sample of observables). 
In this case, it may be attractive to treat ĉ_i as the unobservable c_i and average across the distributions of c_i and x_i at the same time. We would generally expect APE estimators for nonlinear FE models derived in such a way to suffer from the incidental parameters problem, even if the slopes are estimated consistently (see, for example, Fernandez-Val, 2009). Given that N^{-1} Σ_{i=1}^N ĉ_i is consistent for E(c_i), however, it may be that estimators including functions of ĉ_i that are averaged across i have desirable properties. This appears to be true at least for the data generating process considered in this chapter. Simulation results in Section 4 indicate very small finite sample bias of overall APE estimators computed using ĉ_i in this way. (On the equivalence noted above between Poisson QMLE with unit-specific intercepts and multinomial QCMLE, see Wooldridge (2010) or Cameron and Trivedi (2013).)

2.4 Monte Carlo

2.4.1 Comparing estimation methods

To illustrate the impact of ignoring random coefficients in the FEP setting, I simulate the performance of the different estimators in both the ideal case of constant coefficients and in the case where the coefficients vary by individual. I employed the following data generating process:

y_it | (x_i, w_i, c_i, b_i1, b_i2) ∼ Poisson[c_i exp(b_i1 x_it + b_i2 w_it)],   (2.50)

log(c_i) ∼ Normal(0, 1/16),   (2.51)

x_it = log(c_i) + 0.5 x_i,t−1 + v_it for t > 1, with x_i1 = log(c_i) + v_i1 and v_it ∼ N(0, 1/2),   (2.52)

w_it = 1[x_it + h_it > 0], h_it ∼ N(0, 1/2),   (2.53)

(b_i1, b_i2)′ ∼ Normal( (β_1, β_2)′, [ ω_1², ρ ; ρ, ω_2² ] ).   (2.54)

For the above draws, i = 1, ..., 1000 and t = 1, ..., 10. The case where ω_1², ω_2², and ρ all equal zero corresponds to the constant coefficient case. For these simulations, the b_ij are generated to be independent of {x_i, w_i}, and this assumption is maintained in estimation. The b_ij are also generated to be independent of each other (ρ = 0), but this is not assumed in estimation.

In the following tables, FEP refers to the estimator that ignores the random coefficients. FEP2 refers to the estimator that adds the square of x and an interaction between x and w. Since this model's assumptions do not separately identify β_2 and ω_2², the estimated coefficient on w is compared to β_2 + (1/2)ω_2². NQML refers to the normal QML estimator that also assumes (2.27) and (2.28); APE estimates from NQML also plugged in ĉ_i. I set ω_1 = ω_2 = ω but do not assume equal variances in estimation. In each case, I used one thousand replications.
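For reference, the data generating process in (2.50)-(2.54) is easy to reproduce. The sketch below fixes ω at one illustrative value (the tables vary it from 0 to 0.5) and draws b_i1 and b_i2 independently (ρ = 0), as in the design described above; the estimation comparisons themselves are not coded here.

    import numpy as np

    rng = np.random.default_rng(0)
    N, T = 1000, 10
    omega = 0.25                                       # one illustrative value of omega

    log_c = rng.normal(0.0, 0.25, size=N)              # Var[log(c_i)] = 1/16
    b1 = rng.normal(1.0, omega, size=N)                # b_i1, mean beta_1 = 1
    b2 = rng.normal(-1.0, omega, size=N)               # b_i2, mean beta_2 = -1, drawn independently of b_i1

    x = np.empty((N, T))
    x[:, 0] = log_c + rng.normal(0.0, np.sqrt(0.5), size=N)
    for t in range(1, T):
        x[:, t] = log_c + 0.5 * x[:, t - 1] + rng.normal(0.0, np.sqrt(0.5), size=N)
    w = (x + rng.normal(0.0, np.sqrt(0.5), size=(N, T)) > 0).astype(float)

    y = rng.poisson(np.exp(log_c)[:, None] * np.exp(b1[:, None] * x + b2[:, None] * w))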
51 Table 2.1: Finite Sample Properties of Slope Estimators: β1 = 1, β2 = −1 ω 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 FEP Mean SD 1.00 0.02 1.00 0.02 1.01 0.02 1.02 0.02 1.03 0.02 1.05 0.03 1.07 0.03 1.10 0.04 1.14 0.06 1.18 0.07 1.23 0.09 β1 FEP2 NQML Mean SD Mean SD 1.00 0.03 1.00 0.02 1.00 0.03 1.00 0.02 1.00 0.03 1.00 0.02 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.04 1.00 0.04 1.00 0.04 1.00 0.04 1.00 0.04 0.99 0.05 1.00 0.04 0.99 0.05 52 β2 FEP NQML Mean SD Mean SD -1.00 0.03 -1.00 0.04 -1.00 0.03 -1.00 0.04 -1.00 0.03 -1.00 0.04 -0.99 0.03 -1.00 0.04 -0.99 0.03 -1.00 0.04 -0.98 0.03 -1.00 0.04 -0.98 0.04 -1.00 0.04 -0.97 0.04 -0.99 0.05 -0.96 0.05 -0.99 0.05 -0.96 0.07 -0.99 0.05 -0.95 0.08 -0.98 0.06 β2 + 12 ω22 FEP2 Mean SD Truth -1.00 0.04 -1.00 -1.00 0.04 -1.00 -0.99 0.04 -1.00 -0.99 0.04 -0.99 -0.98 0.04 -0.98 -0.97 0.04 -0.97 -0.96 0.04 -0.96 -0.94 0.04 -0.94 -0.93 0.05 -0.92 -0.91 0.05 -0.90 -0.89 0.05 -0.88 Table 2.2: Finite Sample Properties of APE Estimators: β1 = 1, β2 = −1 ω 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 Truth 0.88 0.88 0.90 0.91 0.95 0.98 1.04 1.11 1.20 1.32 1.49 Est. APE of x FEP FEP2 NQML Mean SD Mean SD Mean SD 0.88 0.03 0.88 0.03 0.88 0.03 0.88 0.03 0.88 0.03 0.88 0.03 0.89 0.03 0.89 0.03 0.89 0.03 0.92 0.04 0.92 0.04 0.92 0.04 0.95 0.04 0.95 0.04 0.95 0.04 0.99 0.05 0.99 0.05 0.99 0.05 1.04 0.07 1.04 0.06 1.03 0.06 1.11 0.09 1.11 0.08 1.10 0.09 1.21 0.15 1.21 0.13 1.20 0.14 1.32 0.22 1.33 0.21 1.31 0.23 1.49 0.47 1.50 0.49 1.48 0.48 53 Truth -1.12 -1.12 -1.13 -1.14 -1.15 -1.16 -1.19 -1.22 -1.26 -1.30 -1.36 Est. APE of w FEP FEP2 NQML Mean SD Mean SD Mean SD -1.12 0.06 -1.12 0.08 -1.12 0.07 -1.12 0.06 -1.12 0.08 -1.12 0.07 -1.13 0.06 -1.13 0.09 -1.13 0.08 -1.15 0.07 -1.14 0.10 -1.14 0.08 -1.17 0.07 -1.15 0.10 -1.15 0.08 -1.19 0.09 -1.17 0.12 -1.16 0.10 -1.23 0.11 -1.19 0.16 -1.18 0.11 -1.27 0.14 -1.22 0.20 -1.20 0.13 -1.35 0.22 -1.26 0.29 -1.23 0.16 -1.43 0.34 -1.30 0.47 -1.26 0.24 -1.57 0.86 -1.35 0.60 -1.29 0.26 It appears from Table 2.1 that the standard deviation of the coefficients is positively related to the finite sample bias (in magnitude) in FEP slope estimates. This is not surprising given that (2.1) fails for ω > 0. This is despite the fact that the coefficients are independent of the covariates and each other, a case in which random coefficients would not cause a problem in linear models. In contrast, the augmented FEP and the NQML estimators show much smaller bias at all levels of ω, with the exception of the FEP2 coefficient on w, which, as expected, appears to show small bias for β2 + 21 ω 2 . The APEs are estimated using expressions similar to (2.45) and (2.46) using the FEP2 and NQML parameter estimates. The difference is I treat ci as ci and average over {xxit , ci } only once. I followed an analogous procedure for the FEP case. Table 2.2 suggests that this approach to estimating APEs has small bias for the FEP2 and NQML case, despite using estimates of incidental parameters. For FEP, bias in the APE of the binary variable increases as ω increases. Surprisingly, this is not the case for the continuous variable. Even though the simulation suggests a large bias in the FEP estimate of β1 . This warrants further investigation as it suggests there many be circumstances in which researchers can ignore random coefficients if all they care about is APEs of continuous variables, though it could also be an artifact of this data generating process. 
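For completeness, the APE computations just described (treating ĉ_i from (2.47) as c_i and averaging over the sample once) are simple to reproduce. The sketch below uses hypothetical array names and a generic estimated exponential index exp(ĝ_it), so it applies to the FEP2 and NQML parameterizations alike; for the FEP2 specification of this section, for example, the derivative of the estimated index with respect to x is α̂_1 + ω̂_1 x_it + ρ̂ w_it, recalling that the coefficient on x² estimates ω̂_1/2.

    import numpy as np

    def c_hat(y, mu):
        """Multiplicative-effect 'estimates' as in (2.47): c_hat_i = n_i / sum_t mu_it,
        where mu_it = exp(g_it_hat) is the fitted exponential index. y, mu: (N, T)."""
        return y.sum(axis=1) / mu.sum(axis=1)

    def ape_continuous(ci_hat, mu, dg_dxj):
        """Sample-average APE of a continuous regressor, treating c_hat_i as c_i:
        the mean over (i, t) of c_hat_i * mu_it * (d g_it_hat / d x_itj)."""
        return np.mean(ci_hat[:, None] * mu * dg_dxj)

    def ape_binary(ci_hat, mu1, mu0):
        """Sample-average APE of a binary regressor: the mean over (i, t) of
        c_hat_i * (mu_it evaluated at 1 minus mu_it evaluated at 0)."""
        return np.mean(ci_hat[:, None] * (mu1 - mu0))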
2.4.2 Testing when coefficients are not normal

Section 3 shows that, for slope heterogeneity in a location-scale family of spherical distributions (where the heterogeneity terms are independent of each other), an LM test for coefficient heterogeneity is equivalent to testing the coefficients on squares of the covariates, which suggests that the heterogeneity need not be normal for the approach of this chapter to work well. To explore this, I generate the responses using random coefficients with different distributions:

b_ij,2 = 1 + ω(u_j2 − 0.5)/√(1/12), u_j2 ∼ Uniform(0, 1),   (2.55)

b_ij,3 = 1 + ω(u_j3 − 4)/√8, u_j3 ∼ χ²_4,   (2.56)

b_ij,4 = 1 + ω u_j4/√(5/3), u_j4 ∼ t_5,   (2.57)

b_ij,5 = 1 + ω(u_j5 − 1), u_j5 ∼ Exponential(1),   (2.58)

b_ij,6 ∼ Gamma(1/ω², ω²).   (2.59)

These draws are made separately for j = 1, 2, and for simplicity, Cov(b_i1h, b_i2h) = 0 for each h. Each coefficient's data generating process ensures that it has a mean of 1 and a variance of ω². Each of the first five coefficients falls into a location-scale family, as each consists of a standardized random variable multiplied by ω to produce a variance of ω² and shifted to have a mean of one. The gamma coefficients, in contrast, are not drawn from a location-scale family, but are directly specified to have a mean of 1 and variance of ω². Given the issue of identifying parameters associated with binary regressors in the FEP2 setting, I generate the responses to depend on continuous regressors only, where each x_itj is generated as in (2.52):

y_it | (x_i1, x_i2, c_i, b_i1h, b_i2h) ∼ Poisson[c_i exp(b_i1h x_it1 + b_i2h x_it2)].   (2.60)

After generating the data, β_1, β_2, ω_1², ω_2², and ρ were estimated using FEP of y_t on x_t1, x_t2, x_t1², x_t2², and x_t1 x_t2. A Wald test was then performed on x_t1², x_t2², and x_t1 x_t2. The results of Section 3.3 suggest that this test should perform well for the first five coefficient types, and I conjecture that it performs well for the Gamma coefficients as well.

When testing for random slopes, it is important to use a FE procedure if one is concerned that the multiplicative effect c_i is correlated with the explanatory variables. Otherwise, the omitted variable problem is likely to cause the test to be over-sized. In fact, in a simulation where random effects Poisson was used on the same set of covariates, a Wald test rejected the null of constant slopes in 88% of replications when the true slopes were nonrandom.

Table 2.3: Testing when b_i is not normal
Empirical rejection probability (nominal level 0.05)
ω      Normal   Uniform*  Chi2*    t5*      Exp.*    Gamma
0.00   0.069    0.069     0.069    0.069    0.069    0.069
0.05   0.108    0.115     0.112    0.108    0.121    0.132
0.10   0.186    0.212     0.159    0.196    0.160    0.178
0.15   0.308    0.359     0.287    0.302    0.303    0.334
0.20   0.468    0.531     0.439    0.408    0.404    0.472
0.25   0.640    0.691     0.543    0.579    0.553    0.625
0.30   0.785    0.796     0.689    0.693    0.652    0.741
0.35   0.881    0.887     0.796    0.804    0.757    0.817
0.40   0.914    0.948     0.860    0.852    0.814    0.868
0.45   0.931    0.965     0.897    0.897    0.876    0.892
0.50   0.970    0.979     0.904    0.919    0.876    0.923

Table 2.3 shows that, as expected, rejection probabilities increase with ω when the coefficients are normal, and are quite high when ω is large. (I have not yet varied the cross-section size; I would expect these rejection probabilities to increase with it.) What is interesting is that there does not seem to be much change in either size or finite sample power when the coefficients are not normal, even when the coefficients are not drawn from a location-scale family.
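The variable-addition test just described can be run with standard software by exploiting the equivalence, noted in Section 3, between Poisson QMLE with unit-specific intercepts and multinomial QCMLE. The following is a minimal Python/statsmodels sketch under the assumption that the simulated data sit in a long-format dataframe df with columns y, x1, x2, and a cross-sectional identifier id (all names hypothetical); it is one way to implement the test, not the exact code used for Table 2.3.

    import statsmodels.formula.api as smf

    # Units with all-zero outcomes carry no information and drop out of FEP anyway.
    df = df.groupby("id").filter(lambda g: g["y"].sum() > 0)
    df["x1_sq"], df["x2_sq"], df["x1_x2"] = df["x1"] ** 2, df["x2"] ** 2, df["x1"] * df["x2"]

    # Poisson QMLE with individual dummies gives the FEP slope estimates;
    # cluster-robust inference guards against serial correlation and non-Poisson variance.
    res = smf.poisson("y ~ x1 + x2 + x1_sq + x2_sq + x1_x2 + C(id)", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["id"]}, disp=0)
    print(res.wald_test("x1_sq = 0, x2_sq = 0, x1_x2 = 0"))

With a large cross-section the dummy-variable fit can be slow; the multinomial form in (2.2), which avoids estimating the c_i directly, is an alternative.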
2.5 Empirical application: the Patent-R&D relationship There is a long history of economic inquiry into the relationship between a firm’s research and development (R&D) expenditures and the number of patents for which it applies in a given year. Patent applications are viewed in the literature as an indicator of additions to the knowledge stock of a firm (Pakes and Griliches, 1980). Pakes and Griliches (1980) were among the first to focus on firm effects as a source of potential endogeneity in analyzing U.S. manufacturing firms. Hausman, Hall, and Griliches (1984) and Hall, Griliches, and Housman (1986) also look to firm effects to account for significant over-dispersion in the distribution of patent counts. In addition to FEP, 12 I have not yet varied the cross-section size. I would expect these rejection probabilities to increase. 56 Negative Binomial models are also common as a way to introduce more dispersion. Nonlinear count models are not only attractive for logical reasons, but also because datasets can contain a nontrivial proportion of observations with zero patents. These observations must be eliminated or transformed in some ad hoc manner before estimating a linear log-log model(Hall, Griliches, and Hausman, 1986). Such observations seem to be more common in more recent datasets as well. While only 8% of observations were zero in Hall, Hausman, and Griliches 1968-1975 panel of 121 firms, 16.5% were zero in Gurmu and Perez-Sebastian’s 1982-1992 panel of 391 firms (Gurmu and Pérez-Sebastián, 2008). A common finding in the literature is that distributed lag models that do not account for any firm heterogeneity tend to have a U-shaped lag profile, and that after accounting for firm heterogeneity, only contemporaneous R & D expenditure tends to be significant (Hall, Griliches, and Hausman, 1986). In a cross-sectional analysis of the pharmaceutical industry, Wang, Cockburn, and Puterman (1998) use a Poisson model and allow for heterogeneity in both the multiplicative effect and coefficients. While the mixing distribution is allowed to depend on the regressors, they assume that the vector of heterogeneity has finite support, which in their analysis consisted of three or fewer points. This framework may be less palatable in studies with broader industry coverage. The population of interest for this chapter is publicly-traded U.S. manufacturing firms in existence from 1996 to 2003. The patent data come from the United States Patent and Trademark Office by way of the National Bureau of Economic Research’s Patent Data Project (PDP) and includes data through 2006. As patents are not recorded in the USPTO database until they are granted, the panel is truncated in 2003 to diminish the effect of the time-lag between application and granting.13 Financial information on publicly-traded firms comes from the Compustat database, accessed through Wharton Research Data Services (WRDS) in September 2016. Hall, Jaffe, and Trajtenberg (2001) and Bessen (2009) thoroughly describe the patent data as well as matching information for the Compustat database. Matching patents to firms is not a trivial given 13 The average lag over applications made in 1990-92 was 1.76 years, with 96.1% of patents granted in three years or less. 57 nonstandard naming in USPTO records, among other issues. I mainly follow Bound, et. al (1982) and Hall, Griliches, and Hausman (1986) in assembling the panel dataset. The initial sample from the Compustat database consists of 3,126 firms in the U.S. 
manufacturing industry that were in existence in the year 2000. Following the literature, I require that data exist for patents and R&D expenditures for each year from 1996 to 2003, and that R&D expenditures be strictly positive since I take logs. I also eliminate firms that show large jumps in either gross capital or employment in a year. In the end, my sample consists of 848 firms over the period 1996-2003. I describe the selectivity of my sample in Tables 2.4 and 2.5. The tables show that although the sample covers only about a quarter of U.S. manufacturing firms in 2000, it covers nearly 70% of R&D expenditures. Coverage is generally poorer for smaller firms and higher for larger firms both in terms of net sales and R&D. Sample coverage is comparable to Hall, Griliches, and Hausman (1986) in terms of net sales, though they achieve 90% coverage of total R&D. Table 2.4: Distribution of Net Sales in 2000 Number in 2000 cross-section Net Sales All Pos. R&D Less than $1M 332 207 $1M-10M 439 335 $10M-100M 900 672 $100M-1B 986 588 $1B-10B 402 271 More than $10B 67 52 Total 3,126 2,125 Number in Sample 49 115 242 244 157 41 848 Coverage All Pos. R&D 0.15 0.24 0.26 0.34 0.27 0.36 0.25 0.41 0.39 0.58 0.61 0.79 0.27 0.40 Table 2.5: R& D Expenditures in 2000 Firm R&D (2000 USD) Less than $1M $1M-10M $10M-100M $100M-1B $1B-10B Total 2000 Cross-section Sample 170.15 55.32 3695.48 1492.38 21621.47 8765.10 38160.81 25075.92 67084.16 54007.14 130732.08 89395.85 58 Coverage 0.33 0.40 0.41 0.66 0.81 0.68 Table 2.6 shows summary statistics for the key variables over the sample of 848.14 Consistent with the literature, this shows the distribution of patents to be right-skewed and over-dispersed with a thick right tail. Also noteworthy is that compared to previous studies, my sample contains a much higher proportion of zeros than previous studies. Compared to either Hall, Griliches, and Hausman (1986) or Gurmu and Perez-Sebastian (2008), the median number of patents is lower, and the maximum number of patents is higher in this sample. 14 Note that firms with zero patents in all years drop from the multinomial log-likelihood. 59 Table 2.6: Summary of Key Variables in 2000 Variable Net Sales (Millions of USD) R&D (Millions of USD) Patents Fraction with zero patents Fraction in scientific sector Mean 2506.28 105.42 30.47 0.35 0.55 St.Dev. Min 1st Q. 12980.46 0.00 15.77 490.95 0.01 2.22 141.85 0.00 0.00 0.48 – – 0.50 – – Med. 118.73 7.53 2.00 – – 3rd Q. 877.54 31.71 7.00 – – All dollars amounts are real 2000 USD. The scientific sector is defined to include the drug, computer, electronic component, and scientific instrument industries. 60 Max 206083.00 6800.00 1811.00 – – I apply the exponential model introduced in Section 3 to patent counts where the regressors of interest are the logs of current R&D and up to three lags. I include year dummies, but assume their coefficients are constant. τ E [patentsit | log(Ri1 ), . . . , log(RiT ), δt , ci , b i ] = ci exp ∑ bi,s log(Ri,t−s) + δt , (2.61) s=0 where Rit is real R&D expenditures by firm i in year t. The CRC assumption is: α + γ log(R)i , Ω ), b i |(log(Ri,t−0 ), . . . , log(Ri,t−τ ), δt , ci ) ∼ Normal(α (2.62) T log(R ) is a scalar. Section 3 implies that FEP of patents on current where log(R)i = T −1 ∑t=1 it and lagged log(R) terms, interactions between log(R) and the log(R) terms, and squares and crossproducts of the log(R) terms will be consistent under these assumptions. 
Table 2.7: Results for traditional estimators VARIABLES log(R0 ) (1) PQML 1 (2) PQML 2 (3) FEOLS 1 (4) FEOLS 2 0.819*** (0.0441) 0.423** (0.191) 0.234*** (0.0637) 0.0845 (0.108) 0.0826 (0.203) 0.113*** (0.0198) 0.0476** 0.318*** (0.0205) (0.0682) 0.00784 (0.0192) 0.00777 (0.0180) -0.00789 (0.0204) -0.442*** (0.0301) 1.268*** (0.0765) 0.055 0.318*** (0.034) (0.0682) 4,240 5,968 848 746 0.137 log(R−1 ) log(R−2 ) log(R−3 ) Dum. for zero pat. Constant Sum of log(R) coeff. Observations Number of firms R-squared -0.211 (0.211) 0.819*** (0.0441) 6,784 848 -0.228 (0.214) 0.824*** (0.045) 4,240 848 -0.543*** (0.0261) 1.091*** (0.0440) 0.113*** (0.0198) 6,784 848 0.157 (5) FEP 1 (6) FEP 2 0.161*** (0.0560) 0.0158 (0.0378) -0.0250 (0.0710) -0.00236 (0.0546) 0.1495 (0.1096) 3,510 702 Clustered standard errors in parentheses. Year dummies included in all specifications. *** p<0.01, ** p<0.05, * p<0.1 Table 2.7 presents results from the six different specifications that assume constant coefficients. For all but columns (3) and (4), the dependent variable is the number of patents. Columns (1) and 61 (2) contains Poisson QMLE estimates where firm heterogeneity is ignored. Column (3) contains estimates from FE OLS where the dependent, variable is the log of patents. For this column only, zero patent counts are changed to 1, with a dummy variable added following Hall, Griliches, and Hausman (1986). Columns (5) and (6) contain FEP estimates. Consistent with the literature, these estimates imply that correlation between patents and current R&D is strongest relative to lag effects, and that the total elasticity of patents with respect to R&D that is less than unity. I also find the estimated elasticities fall once I account for firm effects. For the Poisson specification, the total elasticity falls from 0.82 to 0.32 in the one-lag model and from 0.82 to 0.15 in the three-lag model. The three-lag FEP specification implies an elasticity with respect to current R&D that is only about half of those estimated in previous studies, and this estimate is sensitive to the time dimension of the panel and lag-length chosen. If I mimic Gurmu and Perez-Sebastian (2008) and estimate a four-lag FEP model over 1982-1992, I get very similar results to theirs. It is possible that the nature of the patent-R&D relationship changed in the intervening decade, but it may also be that the exponential model is incorrect, our specification neglects some dynamics or endogeneity, or that sample selection has had a different effect on the more current data. Additionally, Section 3 and Section 5 imply that neglected slope heterogeneity could also be a source of bias in this model. Table 2.8 gives results from the CRC estimator proposed in this chapter, varying the lag length and assumptions about Ω . In columns (1) and (3), I impose that the b i are deterministic linear functions of log(R)i , while in column (4), I impose that Ω is diagonal. Given (2.61) and (2.62), these data do provide some evidence of slope heterogeneity. In the one-lag models, none of the additional terms are statistically significant. The evidence is mixed in the three-lag models. In column (3), the estimates of γ are jointly marginally significant (p = 0.08), with the interaction involving the second lag of log(R) negative and significant at the 5% level. In column (4), while all terms involving log(R) are jointly significant, the interactions and squares are not. 
In column (5), the interactions, squares, and cross-products are jointly marginally signifcant (p = 0.08). The terms associated with Ω are jointly insignificant, however, as are the interactions 62 Table 2.8: Results for CRC FEP estimators VARIABLES log(R0 ) (1) CRCFEP 1 (2) CRCFEP 2 (3) CRCFEP 3 (4) CRCFEP 4 (5) CRCFEP 5 0.538*** (0.144) 0.548*** (0.151) -0.0394 (0.0285) 0.165 (0.183) 0.115 (0.141) 0.0736 (0.0892) 0.444** (0.173) -0.0384 (0.149) 0.00850 (0.0248) -0.0103 (0.0167) -0.0844** (0.0368) 0.00672 (0.0284) 0.152 (0.133) 0.0604 (0.0951) 0.423*** (0.148) -0.00633 (0.142) -0.182 (0.224) -0.118 (0.195) -0.556** (0.258) -0.0775 (0.159) 0.0915 (0.108) 0.0569 (0.0978) 0.234** (0.118) 0.0404 (0.0735) 0.160 (0.141) 0.111 (0.0887) 0.360*** (0.121) 0.0205 (0.125) -0.215 (0.251) 0.0177 (0.294) -0.167 (0.313) -0.236 (0.262) 0.0921 (0.118) 0.102 (0.108) 0.309** (0.147) 0.117 (0.0854) -0.0986 (0.141) -0.0120 (0.177) 0.144 (0.176) -0.255 (0.183) 0.123 (0.129) -0.266** (0.118) log(R−1 ) log(R−2 ) log(R−3 ) log(R0 ) × log(R0 ) log(R−1 ) × log(R0 ) log(R−2 ) × log(R0 ) log(R−4 ) × log(R0 ) [log(R0 )]2 -0.102 (0.0892) [log(R−1 )]2 [log(R−2 )]2 [log(R−3 )]2 log(R0 ) × log(R−1 ) log(R0 ) × log(R−2 ) log(R0 ) × log(R−3 ) log(R−1 ) × log(R−2 ) log(R−1 ) × log(R−3 ) log(R−2 ) × log(R−3 ) Clustered standard errors in parentheses. Year dummies included in all specifications. *** p<0.01, ** p<0.05, * p<0.1 63 Table 2.9: CRCFEP 3 estimated elasticities Parameter β0 β−1 β−2 β−3 β0 + β−1 + β−2 + β−3 Estimate S.E. P-value 95% C.I. 0.134 0.051 0.257 -0.023 0.417 0.093 0.057 0.098 0.092 0.127 0.149 0.379 0.009 0.800 0.001 -0.048 0.315 -0.062 0.163 0.064 0.449 -0.205 0.158 0.169 0.666 β−τ = ατ+1 + γτ+1 log(R). Clustered S.E.’s ignore sampling error of log(R) with the time average. Therefore, while there is marginal evidence of heterogeneity, I cannot parse it into its components. Focusing on model (3), therefore, the results are quite interesting, at least at face value. The estimator for the average elasticity with respect to Rt−s is given by β−s = αs+1 + γs+1 log(R), (2.63) T where log(R) = (NT )−1 ∑N i=1 ∑t=1 log(Rit ). I give these estimates in Table 2.9. This implied lag profile for the average elasticity is different from that previously observed in the literature, where typically the contemporaneous elasticity accounts for most of the total and the lags are much smaller in magnitude and often statistically insignificant. Model (3) estimates imply, however, that the highest estimated average elasticity is with respect to the second lag of log(R), at 0.26 with a standard error of 0.098. Meanwhile, the contemporaneous and other lags are insignificantly different from zero. At face value, this seems to imply a delay in the benefit to R&D expenditures. Furthermore, the negative estimated coefficient on log(R−2 ) × log(R0 ) implies that the firms with larger R&D expenditures overall experience lower marginal returns. The correlation between log(R0 ) and the estimate of the multiplicative firm effect is 0.39, indicating that firms with a higher base rate of patenting tend to have lower marginal returns to R&D dollars, which echoes the findings of Wang, et. al. (1998) with regards to the pharmaceutical industry. Unfortunately, however, the results do not appear to be robust to changes in the estimation sample. If I construct a panel over 1994-2001, for instance, neither the lag-structure result or the finding of heterogeneous 64 slopes hold. 
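To make the calculation behind (2.63) and Table 2.9 explicit, the following is a small Python/NumPy sketch that forms the average elasticities and their standard errors from estimates of α and γ and their joint clustered covariance matrix. The argument names are placeholders rather than objects produced by any particular routine, and, as noted under Table 2.9, the sampling error in the grand mean of log(R) is ignored.

import numpy as np

def average_elasticities(alpha_hat, gamma_hat, V_hat, logR_grand_mean):
    # alpha_hat, gamma_hat: length-L coefficient vectors on the log(R) terms and on
    # their interactions with the firm time average of log(R); V_hat: their (2L x 2L)
    # clustered covariance matrix, ordered (alpha, gamma); logR_grand_mean: the grand
    # mean (NT)^{-1} * sum of log(R_it).
    beta = alpha_hat + gamma_hat * logR_grand_mean        # equation (2.63)
    L = len(alpha_hat)
    # each beta_{-s} is a linear combination of (alpha, gamma), treating the grand
    # mean as fixed, so a standard linear-combination variance applies
    G = np.hstack([np.eye(L), logR_grand_mean * np.eye(L)])
    se = np.sqrt(np.diag(G @ V_hat @ G.T))
    return beta, se

The total elasticity in the last row of Table 2.9 is the sum of these terms, with its variance given by the corresponding linear combination of the rows of G.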
Regarding the lack of robustness across sample periods, it may be that there is still a sample selection problem caused by not observing any patent applications made through 2003 if they were not granted before 2006.

2.6 Conclusion

FEP analysis of count or other nonnegative response variables cannot generally be justified in the presence of heterogeneous slopes and may not lead to estimation of any quantity of interest. Given this, I extend Chesher's (1984) testing framework to the FEP setting and show that an LM test for neglected heterogeneity amounts to adding squares of regressors to the set of covariates. This procedure is more widely applicable than classical tests. Simulation evidence also suggests that this approach is robust when the coefficients are neither normal nor belong to a location-scale family. Identification via a correlated random coefficients assumption leads to FEP on a more flexible mean function as an estimation method. Under a proportional variance assumption and a CRE assumption for the scalar multiplicative effect, normal QMLE is another technique, one which may have advantages in cases of binary or time-constant regressors. Each of these options feasibly allows for higher-dimensional random coefficients than estimators based on likelihoods with integrals, while also allowing for dependence between the heterogeneity and the regressors. Application of these methods to the U.S. manufacturing industry suggests that firms may have heterogeneous elasticities of patenting with respect to R&D, and that, in contrast to previous results, there may be a delay in the effect of R&D expenditures on patenting. These results, however, do not hold when estimating over different years of data. One immediate avenue for future research is to extend this type of correlated random coefficients model to cases where the regressors are not strictly exogenous, either because of feedback, contemporaneous endogeneity, or sample selection, as a way to explore the robustness of these findings.

CHAPTER 3
ESTIMATION OF AVERAGE MARGINAL EFFECTS IN MULTIPLICATIVE UNOBSERVED EFFECTS PANEL MODELS

3.1 Introduction and Review

Nonlinear models often make logical sense for representing limited dependent variables like discrete choices and counts. Challenges can arise, however, in micro-econometric panel settings when one wishes to control for unobserved individual heterogeneity and has relatively few time periods of data. For static multiplicative effects models with strictly exogenous covariates, fixed effects Poisson (FEP) consistently estimates the parameters of a correctly-specified conditional mean function (Wooldridge, 1999). Researchers may also want to estimate quantities like Average Partial Effects (APE) and Average Treatment Effects (ATE), but as these depend on the unobserved heterogeneity, it is not immediately clear how to proceed.

I study an approach that estimates APE and ATE by combining FEP parameter estimates with estimates of the individual heterogeneity. The latter come from unconditional Poisson QMLE treating the heterogeneity as parameters to be estimated, a procedure that yields estimates of the conditional mean function parameters that are algebraically equivalent to FEP.1 While easy to implement, such APE and ATE estimates potentially suffer from the incidental parameters problem (IPP) since the individual effect estimates are based on only T observations (Lancaster, 2000). However, I show that in multiplicative models, such APE and ATE estimators are consistent and asymptotically normal with only the cross-sectional dimension growing.
The consistency result may not be surprising, but it is not implied by consistency of FEP for the slope coefficients, and similar results do not hold for other nonlinear models. For instance, the IPP still biases APE estimates in fixed effects binary response models even if one knows the true values of the slope parameters or can estimate them consistently (Fernandez-Val, 2009).

1 This result was derived independently by Lancaster (2002), and a version of it appears in Blundell et al. (2002).

To my knowledge, estimating APE and ATE using estimated incidental parameters has not been studied in multiplicative models specifically. Many authors have studied consistent slope parameter and marginal effect estimation using estimated incidental parameters in either general nonlinear models or in other specific settings. One solution is to employ bias corrections that are justified by large-T asymptotics. See, for example, Hahn and Newey (2004) for general nonlinear models estimated with unconditional MLE, or Fernandez-Val (2009) for the unobserved effects probit model. Although allowed to be much smaller than the number of individuals, the number of time periods needs to be sufficiently large for the asymptotic approximation of the bias to perform well. For static probit and logit models, Fernandez-Val (2009), Greene (2004), and others have noted a "small bias" property for APE and ATE estimates from unconditional MLE. The multiplicative case, however, is special in that the average marginal effects estimators are actually consistent with only the cross-section size growing, a rare result outside of the linear model. This means they should perform well even with only two time periods.

Empirical researchers, of course, also have the option to focus on quantities that do not depend on the unobserved heterogeneity. For instance, the exponential conditional mean function with a linear index gives the slope coefficients interpretations as semi-elasticities, and proportional treatment effects are also identified (M. Lee and Kobayashi, 2001). Another possibility is to make additional assumptions. For example, one could use a correlated random effects (CRE) approach by assuming a parametric form for the mean of the heterogeneity conditional on the explanatory variables. This is applicable in many nonlinear settings to estimate slope parameters as well as average partial effects (Wooldridge, 2010). Using estimated heterogeneity, however, avoids additional restrictions and allows the researcher to estimate average marginal effects in levels, which may be more meaningful than slope parameters and allows comparisons across models.

The rest of this chapter is organized as follows: Section 2 describes the multiplicative model and derives the asymptotic properties of the APE and ATE estimators that use estimated heterogeneity. I also discuss some interesting implications of using these estimators in exponential models. Section 3 evaluates the proposed estimators via Monte Carlo, and Section 4 concludes. Simulation tables are collected in Appendix D.

3.2 Theory

The multiplicative unobserved effects panel model assumes that for i = 1, . . . , N and t = 1, . . . , T,

E(y_{it} | x_i, c_i) = E(y_{it} | x_{it}, c_i) = c_i m(x_{it}, β_0),   (3.1)

where m(x_{it}, β_0) is a known, positive, continuous, differentiable function of a 1 × K vector of explanatory variables x_{it} and an unknown K × 1 parameter vector β_0. The term c_i is unobserved heterogeneity that is assumed to be strictly positive.
Equation (3.1) implicitly assumes that x_{it} is strictly exogenous, conditional on c_i. I assume that the vector {y_{i1}, . . . , y_{iT}, x_{i1}, . . . , x_{iT}, c_i} is independent and identically distributed across i, and that T is fixed. A common choice in the empirical literature is m(x_{it}, β) = exp(x_{it} β), but other forms are possible, and the responses need not even be counts. For example, under the restriction that 0 < c_i < 1, y_{it} could be binary or fractional, in which case m(x_{it}, β) might be the logistic or normal cumulative distribution function. Another option for nonnegative responses is a panel version of Wooldridge's (1992) alternative to the Box-Cox transformation. In this case, with β = (θ, λ), the specification would be:

m(x_{it}, β) = [1 + λ x_{it} θ]^{1/λ} if λ ≠ 0, and m(x_{it}, β) = exp(x_{it} θ) if λ = 0.   (3.2)

The parameters are perhaps less interesting in these examples than in the exponential case, motivating the estimation of marginal effects. While most of the derivations in this section are for a generic m(x_{it}, β), I include a discussion of the exponential case at the end of this section.

Hausman, Hall, and Griliches (1984) showed that if, conditional on x_i = {x_{i1}, . . . , x_{iT}} and c_i, the y_{it} are independently distributed as Poisson with mean given by (3.1), then conditioning on n_i ≡ Σ_{t=1}^{T} y_{it} results in a multinomial distribution for {y_{i1}, . . . , y_{iT}}. The resulting fixed effects Poisson (FEP) estimator is given by:

β̂ = argmax_β Σ_{i=1}^{N} ℓ_i(β),   (3.3)

ℓ_i(β) = Σ_{t=1}^{T} y_{it} log[ m(x_{it}, β) / Σ_{r=1}^{T} m(x_{ir}, β) ].   (3.4)

Wooldridge (1999) showed that β̂ is consistent for β_0 under (3.1) only, making it a quasi conditional maximum likelihood estimator (QCMLE). Standard asymptotic theory for M-estimators yields that, under regularity conditions,

√N (β̂ − β_0) →d Normal(0, A_0^{-1} B_0 A_0^{-1}),   (3.5)

where A_0 = −E[∇²_β ℓ_i(β_0)], B_0 = Var[s_i(β_0)], and s_i(β_0) = ∇_β ℓ_i(β_0)′. The sandwich form of the asymptotic variance estimator should be used to account for the fact that, without the stronger assumptions of Hausman et al., ℓ_i(β) is not the true log-likelihood for individual i.

Researchers are often interested in estimating marginal effects, as the β_j may not have a meaningful interpretation outside of the exponential case. I define the APE of a continuous variable x_j as:

δ_{j,0} = E[ T^{-1} Σ_{t=1}^{T} ∂E(y_{it} | x_{it}, c_i)/∂x_{itj} ] = E[ c_i T^{-1} Σ_{t=1}^{T} ∂m(x_{it}, β_0)/∂x_{itj} ] ≡ E[ c_i T^{-1} Σ_{t=1}^{T} M_j(x_{it}, β_0) ],   (3.6)

where M_j(x_{it}, β) = ∂m(x_{it}, β)/∂x_{itj}. I define the ATE for a binary x_k as:

δ_{k,0} = E[ E(y_{it} | x_{it(−k)}, x_{itk} = 1, c_i) − E(y_{it} | x_{it(−k)}, x_{itk} = 0, c_i) ] = E[ c_i T^{-1} Σ_{t=1}^{T} ( m(x_{it(−k)}, 1, β_0) − m(x_{it(−k)}, 0, β_0) ) ],   (3.7)

where the subscript (−k) indicates that element k has been omitted, and where m(x_{it(−k)}, 1, β) and m(x_{it(−k)}, 0, β) correspond to a 1 or 0 being inserted for x_{itk} in m(x_{it}, β).

Both of these quantities depend on c_i, and so an additional assumption (e.g., correlated random effects) would seem necessary to proceed. However, unconditional QMLE that treats the c_i as additional parameters offers estimates of β_0 algebraically equivalent to FEP, as well as a closed-form estimate of c_i. The formula is:

ĉ(w_i, β̂) = Σ_{t=1}^{T} y_{it} / Σ_{t=1}^{T} m(x_{it}, β̂) ≡ ĉ_i,   (3.8)

where w_i ≡ {y_{i1}, . . . , y_{iT}, x_{i1}, . . . , x_{iT}}.
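To make the objects just defined concrete, the following is a minimal Python (NumPy/SciPy) sketch of FEP estimation with an exponential mean, the closed-form heterogeneity estimates in (3.8), and plug-in average effects of the kind formalized in (3.10) and (3.11) below. The array layout (an N × T outcome matrix and an N × T × K regressor array) and all function names are illustrative assumptions, not the code used for the simulations in this chapter.

import numpy as np
from scipy.optimize import minimize

def fep_negloglik(beta, y, X):
    # Negative FEP quasi-log-likelihood (3.3)-(3.4) with m(x, b) = exp(x b).
    # y: (N, T) nonnegative outcomes; X: (N, T, K) strictly exogenous regressors.
    m = np.exp(X @ beta)                      # m(x_it, beta), shape (N, T)
    p = m / m.sum(axis=1, keepdims=True)      # within-i shares: c_i cancels here
    return -np.sum(y * np.log(p))

def fep_fit(y, X):
    K = X.shape[2]
    res = minimize(fep_negloglik, np.zeros(K), args=(y, X), method="BFGS")
    beta_hat = res.x
    m_hat = np.exp(X @ beta_hat)
    c_hat = y.sum(axis=1) / m_hat.sum(axis=1)  # closed form (3.8); zero if all y_it = 0
    return beta_hat, c_hat

def ape_continuous(beta_hat, c_hat, X, j):
    # sample analog of (3.6): average of c_hat_i * dm(x_it, beta_hat)/dx_j over i and t
    m = np.exp(X @ beta_hat)
    return np.mean(c_hat[:, None] * m * beta_hat[j])

def ate_binary(beta_hat, c_hat, X, k):
    # sample analog of (3.7): average of c_hat_i * [m(x_it(-k), 1, b) - m(x_it(-k), 0, b)]
    X1, X0 = X.copy(), X.copy()
    X1[:, :, k], X0[:, :, k] = 1.0, 0.0
    diff = np.exp(X1 @ beta_hat) - np.exp(X0 @ beta_hat)
    return np.mean(c_hat[:, None] * diff)

Note that the averages divide by the full cross-section size, including individuals whose outcomes are zero in every period (for whom ĉ_i = 0), in line with the discussion of dropped observations later in this section.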
The analysis to follow hinges on studying the properties of the random function ĉ(w_i, ·), which I write for a generic β as:

ĉ(w_i, β) ≡ Σ_{t=1}^{T} y_{it} / Σ_{t=1}^{T} m(x_{it}, β).   (3.9)

There is a practical reason to estimate β_0 using FEP instead of unconditional QMLE (i.e., including N individual dummies in the exponential model). As pointed out by Cameron and Trivedi (2013), the econometrician may encounter computational or software limitations for large values of N. It is easier to simply calculate ĉ_i following FEP estimation. The APE and ATE estimators I investigate are:

δ̂_j = (NT)^{-1} Σ_{i=1}^{N} Σ_{t=1}^{T} ĉ_i M_j(x_{it}, β̂),   (3.10)

δ̂_k = (NT)^{-1} Σ_{i=1}^{N} Σ_{t=1}^{T} ĉ_i [ m(x_{it(−k)}, 1, β̂) − m(x_{it(−k)}, 0, β̂) ].   (3.11)

Clearly ĉ(w_i, β) ≠ c_i, even if evaluated at β_0, and with only N growing, ĉ_i cannot be consistent for c_i (under the view that c_i is one of N individual-specific parameters).2 One should not generally expect marginal effects calculated from estimated incidental parameters to be consistent in nonlinear models, even if the slope parameter estimates are consistent. However, some sample averages involving ĉ_i are consistent in the FEP case due to the form of ĉ(w_i, β) and the fact that c_i and m(x_{it}, β_0) are multiplicatively separable.

2 Cameron and Trivedi (2013) assert ĉ_i →p c_i as T → ∞, which is true if {y_{it}} and {m(x_{it}, β_0)} are ergodic for the mean.

Theorem 1 Suppose λ̂ ≡ N^{-1} Σ_{i=1}^{N} ĉ(w_i, β̂) h(x_i, β̂) is an estimator of λ_0 ≡ E[c_i h(x_i, β_0)]. Assume that (3.1) holds and that each element of the P × 1 random vector g(w_i, β) ≡ ĉ(w_i, β) h(x_i, β) satisfies the regularity conditions on q(w_i, β) from Theorem 12.2 of Wooldridge (2010). Then λ̂ →p λ_0.

Proof. Since β̂ →p β_0, N^{-1} Σ_{i=1}^{N} ĉ(w_i, β̂) h(x_i, β̂) →p E[ĉ(w_i, β_0) h(x_i, β_0)] by Lemma 12.1 in Wooldridge (2010). Furthermore, by the L.I.E.,

E[ĉ(w_i, β_0) h(x_i, β_0)] = E{ E[ĉ(w_i, β_0) h(x_i, β_0) | x_i, c_i] }
  = E[ ( Σ_{t=1}^{T} E(y_{it} | x_i, c_i) / Σ_{t=1}^{T} m(x_{it}, β_0) ) h(x_i, β_0) ]
  = E[ ( c_i Σ_{t=1}^{T} m(x_{it}, β_0) / Σ_{t=1}^{T} m(x_{it}, β_0) ) h(x_i, β_0) ]
  = E[ c_i h(x_i, β_0) ].   (3.12)

Consistency of N^{-1} Σ_{i=1}^{N} ĉ_i for E(c_i) follows from setting h(x_i, β) = 1, while consistency of δ̂_j and δ̂_k follows from setting either

h(x_i, β) = T^{-1} Σ_{t=1}^{T} M_j(x_{it}, β)   (3.13)

or

h(x_i, β) = T^{-1} Σ_{t=1}^{T} [ m(x_{it(−k)}, 1, β) − m(x_{it(−k)}, 0, β) ].   (3.14)

Theorem (1) shows that, unlike with other nonlinear fixed effects estimators, no bias correction is necessary to estimate the APE and ATE in this setting. One might expect, a priori, that δ̂_j and δ̂_k would perform well anyway as T grows and ĉ_i better approximates c_i. Nevertheless, Theorem (1) holds for an arbitrary T, so δ̂_j and δ̂_k should perform well even in panels with only two time periods (the minimum needed for FEP). The APE and ATE I consider are just two of many possible quantities of interest. Researchers might also want to know the average marginal effect for a specific time period, or for a specific subpopulation defined by the observables (e.g., the Average Treatment Effect on the Treated). One might also want to estimate the partial effect evaluated at the averages of the heterogeneity and covariates. As long as c_i multiplies the relevant function of the data, one need not worry about the difference between it and ĉ_i when averaging over the cross-section. As a caution, one cannot use ĉ_i to learn about other features of the distribution of c_i except in more restrictive cases.
For instance, Var(ci ) is identified only under additional assumptions about Var(yyi |xxi , ci ). A simple example is when the Poisson variance assumption, Var(yit |xxi , ci ) = E(yit |xxi , ci ), and zero conditional covariance, Cov(yit , yir |xxi , ci ) = 0,t = r, both hold. In this case, T m(x wi , β 0 )] − E ci / ∑t=1 xit , β 0 ) . one can show that Var(ci ) = Var [c(w The asymptotic variance of λ can be derived similarly to the delta method, but making sure to account for the randomness in w i . 3 Theorem 2 Under the assumptions in Theorem (1), √ d N(λ − λ 0 ) → N(00, D 0 ), where wi , β 0 ) − λ 0 − G 0 A −1 β 0) , D 0 = Var g (w 0 s i (β wi , β 0 ) = E c(w wi , β 0 )∇β h (xxi , β 0 ) + h (xxi , β 0 )∇β c(w wi , β 0 ) , G 0 = E ∇β g (w wi , β ) = −c(w wi , β ) ∇β c(w T ∇ m(x xit , β ) ∑t=1 β T m(x xit , β ) ∑t=1 , ∇β h (xxi , β ) is the P × K Jacobian of h (xxi , β ), and ∇β m(xxit , β ) is the 1 × K gradient of m(xxit , β ). ¨ i as the P × K Jacobian of g (w wi , β ) evaluated at different mean values between β Proof. Define G 3 The derivation here is essentially the same as the solution to Wooldridge (2010), Problem 12.17. 72 and β 0 . By a mean value expansion of each element of N N √ wi , β ) around β 0 , N λ = N −1/2 ∑N i=1 g (w N ¨i wi , β ) =N −1/2 ∑ g (w wi , β 0 ) + N −1 ∑ G N −1/2 ∑ g (w i=1 i=1 N √ N β −β0 (3.15) i=1 √ wi , β 0 ) + G 0 N β − β 0 + o p (1) =N −1/2 ∑ g (w i=1 N N i=1 i=1 (3.16) β 0 ) + o p (1). wi , β 0 ) − N −1/2 ∑ G 0 A −1 =N −1/2 ∑ g (w 0 s i (β (3.17) ¨ p The second equality follows because consistency of β implies N −1 ∑N i=1 G i → G 0 and because √ √ −1 β ) + N β − β 0 = O p (1). The third follows because N β − β 0 = −N −1/2 ∑N 0 i=1 A 0 s i (β o p (1). Therefore, N √ wi , β 0 ) − λ 0 − G 0 A −1 β 0 ) + o p (1) N λ − λ 0 = N −1/2 ∑ g (w 0 s i (β (3.18) i=1 By the Asymptotic Equivalence Lemma, the limiting distribution of √ N λ − λ 0 is the same as wi , β 0 ) − λ 0 − G 0 A −1 β 0 ) , which is easily shown to be the scaled sample avN −1/2 ∑N i=1 g (w 0 s i (β erage of a mean-zero random vector. Therefore, by the Central Limit Theorem for i.i.d. sequences, the result follows. Applying Theorem (2) for the APE of a continuous covariate x j : √ d N δ j − δ j,0 → N(0, D j,0 ), D j,0 = Var (3.19) T T −1 β 0) ∑ c(wwi, β 0)M j (xxit , β 0) − δ j,0 − G j,0 A−1 0 si (β , (3.20) t=1 T wi , β 0 )(T −1 ) ∑ G j,0 = E c(w ∇β M j (xxit , β 0 ) − M j (xxit , β 0 ) t=1 73 T ∇ m(x xit , β 0 ) ∑t=1 β T m(x xit , β 0 ) ∑t=1 (3.21) For the ATE of the binary covariate xk : √ N δk − δk, d 0 Dk, = Var T −1 0 T ∑ c(wwi, β 0) t=1 → N(0, Dk, ), (3.22) 0 β ) , s (β m(xxit(−k) , 1, β 0 ) − m(xxit(−k) , 0, β 0 − δk, − Gk, A −1 0 0 0 i 0 (3.23) T wi , β 0 )(T −1 ) ∑ Gk, = E c(w 0 ∇β mit (1) − ∇β mit (0) − (mit (1) − mit (0)) t=1 T ∇ m ∑t=1 β it T m ∑t=1 it , (3.24) where mit = m(xxit , β 0 ), mit (1) = m(xxit(−k) , 1, β 0 ), and mit (0) = m(xxit(−k) , 0, β 0 ). These asymptotic variances can be consistently estimated from the above expressions by plugging in β for β 0 and forming the sample analogs to the expectation and variance operators. 3.2.1 Exponential Models Since it is a common specification in empirical research, I include a few observations about the exponential conditional mean case. The form of the quasi log-likelihood means that one can estimate coefficients on time-varying x it only. Nevertheless, δ j,0 and δk,0 are still identified when the conditional mean function is exponential and includes time-constant observables. 
To see this, suppose the following: E(yit |xxit , z i , vi ) = vi exp(xxit β 0 + z i γ 0 ), (3.25) where now I use vi to denote the unobserved heterogeneity. Define ci = vi exp(zzi γ 0 ). Then clearly E(yit |xxit , ci ) = ci exp(xxit β 0 ). (3.26) The heterogeneity has absorbed the time-constant observables. Theorems (1) and (2) still hold, but the function ci now serves as a stand-in for the total contribution from all time-constant variables— observed and unobserved. Analogous to the linear case, γ 0 is not identified, nor are the average partial effects of the z i , but given consistent estimates of β 0 , one can still consistently estimate the average partial effects of the time-varying regressors. 74 One alternative estimand studied by Lee and Kobayashi (2001) is the proportional treatment effect, which for a binary treatment and the simple index in (3.25) is: 4 ξk ≡ E(yit , x it(−k) , xitk = 1, z i , vi ) E(yit , xit(−k) , xitk = 0, zi , vi ) − 1 = exp(βk ) − 1 (3.27) Of course, ξk may interesting in its own right, but my analysis shows that estimating the ATE in levels using (3.11) is another option, even when time-constant regressors belong in the model. Furthermore, APE of a continuous variable simplifies in the exponential conditional mean case. δ j,0 =E T −1 =E T −1 = T −1 T ∑ ci exp(xxit β ) β j,0 (3.28) ∑ E(yit |xxit , ci) β j,0 (3.29) t=1 T t=1 T ∑ E(yit ) t=1 β j,0 , (3.30) where the last equality is by the L.I.E. Here, the population scale factor is analogous to the crosssection case and doesn’t depend on the heterogeneity. Moreover, an estimator that treats ci as the unknown ci is equivalent to the sample analog of (3.30). N δ j = (NT )−1 ∑ T N ∑ ci exp(xxit β ) i=1 t=1 β j = (NT )−1 ∑ T ∑ yit βj (3.31) i=1 t=1 Consistency of δ j for δ j is immediate given a consistent estimator of β j,0 . Since δ j does not depend on ci , one could even estimate β 0 without assuming strict exogeneity of x it , using the GMM approach of either Chamberlain (1992) or Wooldridge (1997) based on sequential moment restrictions. The asymptotic variance is simpler as well: Avar √ β 0 )), N(δ j − δ j,0 ) = Var(y¯i β j − δ j − µyT r j A−1 0 s i (β 4 Lee (3.32) and Kobayashi’s model includes multi-valued treatment as well as interactions between the treatment and covariates, so the proportional treatment effect depends on x it and z i , but only involves coefficients on time-varying regressors and interactions. 75 T y ), and r is a 1 × K-vector with jth element equal to 1 and all other where µyT ≡ E(T −1 ∑t=1 it j elements equal to 0. The expression is similar if GMM is used to estimate β 0 . 3.2.2 A note about dropped observations If the dependent variable for an observation l is zero in each time period, then observation l contributes nothing to the quasi log-likelihood, as can be seen in equation (3.4). Clearly, the terms in wl , β ) = 0. δ j and δk corresponding to observation l’s contribution are then equal to zero, since c(w Nevertheless, if interested in an APE or ATE with respect to the entire population of interest, the sample size N in the formulas for δ j and δk should correspond to the number of individuals in the entire the cross-section, not the number of individuals in the estimation sample (that is, with ni > 0). Otherwise, the estimates will be conditional on this particular subsection of the population and be inflated by a factor of N/N p , where N p = ∑N i=1 1 [ni > 0]. 3.3 3.3.1 Monte Carlo Design I employ the following data generating process. For i = 1, . . . , N and t = 1, . . 
. , T : yit |(xxi , d i , ci ) ∼ Poisson [ci exp(β1 xit + β2 dit )] , (3.33) log(ci ) ∼ Normal(0, σ 2 ) (3.34) xit = log(ci ) + ρxi,t−1 + vit , t > 1 (3.35) xi1 = log(ci )/(1 − ρ) + vi1 / 1 − ρ 2 , vit ∼ N(0, 1/2), (3.36) ρ = 0.3 − 0.5σ (3.37) dit = 1 [xit + log(ci ) + hit > 0] , hit ∼ N(0, 1/2) (3.38) 76 I study panels of dimensions N ∈ {500, 1000, 2000} and T ∈ {2, 4, 10}. The conditional marginal distribution of yt is Poisson with an exponential mean function. I set β1 = 0.5 and β2 = −0.5. I vary the degree of heterogeneity, with σ ∈ {0, 0.25.0.5, 0.75, 1}. The continuous covariate xt and the binary covariate dt are both correlated with the heterogeneity, and the strength of the correlation increases with σ . The scaling of xi1 is intended to keep Var(xt ) constant across the different T .5 That the autoregressive parameter in the equation for xt depends on σ is an attempt to keep the autocovariance structure of xt more consistent as σ increases. I estimate β1 and β2 using FEP, and employ the APE and ATE estimators proposed in equations (3.10) and (3.11). In the tables to follow, FEP estimates are denoted with a “ ”. For reference, I also estimate the slopes, APE, and ATE using pooled Poisson QMLE, which ignores ci entirely. These estimates are denoted with a “ ”. Both FEP and Poisson QMLE are consistent when σ = 0, but only FEP is consistent when σ > 0. Reporting the results for Poisson QMLE is intended to give the reader a sense of how large a problem neglected heterogeneity causes under this particular DGP. For each estimator and parameter combination, I report the mean and standard deviation of the empirical distribution, the estimated bias, the ratio of the mean standard error to the empirical standard deviation, and the probability of rejecting a true null hypothesis at the 5 percent significance level. I use cluster robust asymptotic standard errors with the slope estimates, though they are technically not necessary with this DGP. For the APE and ATE estimates, I use the “unconditional” asymptotic standard errors derived in this chapter for the FEP case, as well as the analogous versions for Poisson QMLE. For each parameter combination, I draw 2000 replications. 3.3.2 Results Full tables of simulation results can be found in Appendix D. I focus attention on the APE and ATE estimates, though the slope estimates are included for reference. As expected, across all values of N and T , there is virtually no finite sample bias in the Poisson QMLE and the FEP estimates in the absence of heterogeneity (σ = 0). In the presence of heterogeneity, however, Poisson QMLE slopes 5 See Vamos, Soltuz, and Craciun 2007. 77 and APEs are biased. As heterogeneity increases, bias increases, and the probability of rejecting a true null hypothesis quickly approaches one. Therefore, this DGP succeeds in simulating settings where controlling for individual effects is important. Finite sample bias in δ1 is less than 0.01 for all values of N, T , and σ , which is not surprising given that in the exponential case, the APE scale factor does not even depend on ci . Some ATE estimates at higher levels of σ are slightly biased away from zero when the panel is short and the sample is smaller. For instance, when N = 500 and T = 2, finite sample bias is between 2 and 2.5 percent of the true value when σ ≥ 0.5. However, the magnitudes of these biases decrease to 1 − 1.5 percent when N = 1000 and 0 − 1 percent when N = 2000. 
In the T = 4 and T = 10 cases, the finite sample bias is less than 1 percent and quite small in the larger cross-sections. The finite sample standard deviations behave in predictable ways, decreasing as either N or T increases. The variability in δ2 seems to be greater than that of δ1 , and the spread between them increases with σ , which might be related to the fact that δ1 does not actually use ci in the exponential case. The standard errors derived in this chapter perform reasonably well, particularly with the largest cross-section, where at worst, their empirical mean underestimates the empirical standard deviation by about 4 percent. This occurs for the standard error of δ1 in the T = 4, σ = 1 case, where as a point of comparison, the mean standard error for β1 also underestimates the finite sample standard deviation of β1 by a similar amount. For the most part, the results suggest the approximations get better as N increases, though simulating more replications may be necessary to reduce sampling error. When σ is high, the apparent underestimation by the standard errors leads to slight over-rejection by about one or two percentage points, but larger N also mitigates this problem. Overall, these simulations support this chapter’s theoretical findings. The asymptotic properties derived in Section 2 for the APE and ATE estimators that use estimated incidental parameters seem to approximate their finite sample behavior very well. 78 3.4 Conclusion It is already well-known that in static multiplicative panel models under strict exogeneity, estimating the heterogeneity still leads to consistent estimation of the parameters of a correctly-specified conditional mean function. This chapter adds the result that APE and ATE estimators that use es√ timated heterogeneity are also consistent and N-asymptotically normal with T fixed. In fact, the results hold for estimating the mean of a wider class of random quantities where the heterogeneity is multiplicatively separable from functions of the data. I derive asymptotic standard errors for these estimators that perform well in simulations for a leading case in empirical research. One area for future research would be to use higher order expansions to derive standard errors that better approximate the standard deviation of the sampling distribution. 79 APPENDICES 80 APPENDIX A ANALYTICAL BIAS CORRECTION EXPRESSIONS FROM CHAPTER 1 From Hahn and Newey (2004), and Fernandez-Val (2009), the one-step bias corrected estimator is formed as θbc = θ − B(θ )/T, (A.1) where B(θ ) = I (θ )−1 b(θ ). Here θ denotes a generic coefficient vector, and θ is the uncorrected MLE. A.1 Hahn and Newey’s bias correction for M-estimators With strictly exogenous regressors x it : N I (θ ) = − (NT )−1 ∑ T T T ∑ [uitθ (θ ) − uitα (θ )] ∑ vitθ (θ ) / ∑ vitα (θ ) (A.2) uitα (θ ) βi (θ ) + ψit (θ ) + uitαα σi2 (θ )/2 , (A.3) i=1 t=1 N b(θ ) = (NT )−1 ∑ T ∑ t=1 t=1 i=1 t=1 where −1 T T βi (θ ) = − ∑ visα (θ ) s=1 ∑ visα (θ )ψit (θ ) + visαα (θ )σi2 (θ )/2 , (A.4) s=1 T σi2 (θ ) = T −1 ∑ ψit (θ )2. (A.5) s=1 In these expressions, uit (θ ) and vit (θ ) are derivatives of the log-likelihood with respect to θ and T αi , respectively, evaluated at αi = αi (θ ) = arg max ∑t=1 it (θ , αi )/T . Partial derivatives of uit (θ ) α and vit (θ ) are denoted by the θ and α subscripts. The terms ψit (θ ), σi2 (θ ), βi (θ ) are estimators for the influence function, asymptotic variance, and higher order asymptotic bias, respectively, of αi (θ ) as T grows. 
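For readers who wish to apply (A.1) directly, the following is a minimal Python/NumPy sketch of the one-step correction once the quantities in (A.2)-(A.5) have been computed; the inputs are assumed to already be evaluated at the uncorrected MLE, and the function name is an illustrative placeholder.

import numpy as np

def one_step_bias_correction(theta_hat, I_hat, b_hat, T):
    # Equation (A.1): theta_bc = theta_hat - B(theta_hat)/T, with B(theta) = I(theta)^{-1} b(theta).
    # theta_hat: uncorrected MLE (length-K vector); I_hat: K x K matrix as in (A.2);
    # b_hat: length-K vector as in (A.3); T: number of time periods.
    B = np.linalg.solve(I_hat, b_hat)
    return theta_hat - B / T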
81 A.2 Fernandez-Val’s bias correction based on conditional expectations Fernandez-Val (2009) simplifies the Hahn and Newey (2004) corrections by taking expectations conditional on {xxi , αi } and using the Law of Iterated Expectations. For static probit models with strictly exogenous regressors, I (θ ) = N −1 N ∑ T −1 i=1 b(θ ) = N −1 T ∑ Git (θ )xxit xit − T −1 ∑ −T −1 i=1 T ∑ Git (θ )xxit ηi (θ ) + t=1 T −1 T T −1 ∑ Git (θ )xxit t=1 t=1 N T σi2 ∑ Git (θ )xxit (A.6) t=1 T ∑ Git (θ )λi(θ )xxit σi2 /2 , t=1 where [φ (αi (θ ) + x it θ )]2 Git (θ ) = , σ2 = T Φ(αi (θ ) + x it θ )[1 − Φ(αi (θ ) + x it θ )] i ηi (θ ) = (1/2) T −1 T ∑ λit (θ )Git (θ ) T ∑ Git (θ ) −1 , (A.7) t=1 σi4 , (A.8) t=1 T λit (θ ) = αi (θ ) + x it θ , and αi (θ ) = arg max ∑ it (θ , αi )/T α (A.9) t=1 A.3 Average Partial Effects As in equation (14) of Section II, we define the function m(β , γ, α, x it ) as the partial effect of wit on the probability that yit = 1 for w ∈ {x, d}. Using one of the analytical bias-corrected slope estimators, θbc and αbc = αi (θbc ), the bias-corrected estimator for the average partial effect is µw,bc = (NT )−1 N T ∑ ∑ mw(βbc, γbc, αbc, x it ) − ∆/T. i=1 t=1 82 (A.10) Using Hahn and Newey’s method: N ∆ = (NT )−1 ∑ T ∑ mα βi + (1/2)mαα σ 2 , (A.11) i=1 t=1 where βit = βit (θbc ) σi2 = σi2 (θbc ), and mα and mαα denote partial derivatives with respect to α, evaluated at θbc and αbc . Using Fernandez-Val’s method:  T T N  T −1 ∑ mα ηi + (1/2) T −1 ∑ mαα ∆ = N −1 ∑  t=1 t=1 i=1 where λit = λit (θbc ), ηi = ηi (θbc ) and Git = Git (θbc ). 83 T −1 T ∑ Git t=1  −1   (A.12) APPENDIX B SIMULATION RESULTS FOR BIAS CORRECTIONS ON A LARGER CROSS-SECTION Table B.1: Probit Slope Estimates when N = 500, T = 6 ρ = 0.0 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE ρ = 0.4 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE ρ = 0.8 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE β (true value = 1) SE Mean SD cv: .95 SD γ (true value = 1) SE Mean SD cv: .95 SD 1.33 0.95 1.15 0.92 0.90 0.99 0.10 0.06 0.09 0.13 0.07 0.06 0.08 0.92 0.57 0.61 0.68 0.94 0.99 1.13 0.94 0.53 0.98 0.98 1.32 0.97 1.15 0.94 0.92 0.99 0.10 0.07 0.09 0.13 0.07 0.07 0.15 0.99 0.71 0.79 0.91 0.96 1.06 1.34 1.09 0.72 1.26 1.06 1.51 1.02 1.32 0.85 1.02 0.99 0.12 0.06 0.11 0.19 0.08 0.06 0.00 0.98 0.09 0.41 0.92 0.94 0.99 1.17 0.92 0.36 0.90 1.01 1.51 1.05 1.32 0.87 1.03 0.99 0.12 0.07 0.11 0.18 0.08 0.07 0.01 0.99 0.16 0.58 0.97 0.95 1.06 1.41 1.04 0.50 1.14 1.05 2.36 0.79 2.12 0.94 1.60 0.99 0.20 0.21 0.19 0.47 0.17 0.06 0.00 0.41 0.00 0.26 0.00 0.94 1.00 0.32 0.87 0.17 0.62 1.00 2.37 0.76 2.14 0.71 1.59 0.99 0.22 0.24 0.21 1.09 0.18 0.07 0.00 0.50 0.00 0.32 0.01 0.94 0.99 0.40 0.86 0.10 0.66 1.00 84 Table B.2: Probit APE Estimates when N = 500, T = 6 ρ = 0.0 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE LPM ρ = 0.4 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE LPM ρ = 0.8 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE LPM µx /µx (true value = 1) SE Mean SD cv: .95 SD µd /µd (true value = 1) SE Mean SD cv: .95 SD 0.99 0.95 1.03 1.09 1.04 1.00 0.93 0.06 0.06 0.07 0.09 0.07 0.06 0.06 0.93 0.85 0.87 0.59 0.85 0.95 0.78 0.94 0.94 0.86 0.66 0.81 0.99 0.99 0.98 0.94 1.01 1.11 1.05 1.00 1.29 0.08 0.08 0.08 0.10 0.09 0.08 0.08 0.95 0.91 0.94 0.68 0.87 0.96 0.05 1.02 1.08 1.00 0.77 0.90 1.05 1.07 0.97 0.93 1.04 1.13 1.05 1.00 0.93 0.06 0.06 0.07 0.09 0.07 0.06 0.06 0.91 0.72 0.86 0.41 0.78 0.95 0.80 0.94 0.93 0.84 0.59 0.77 1.01 1.01 0.98 0.93 1.02 1.13 1.06 1.00 1.29 0.08 0.07 0.08 0.11 0.09 0.08 0.08 0.94 0.87 0.94 0.56 0.84 0.96 0.04 1.01 1.08 0.98 0.71 0.87 1.04 1.05 0.92 0.72 1.04 1.23 1.13 1.00 0.93 0.06 
0.15 0.07 0.10 0.08 0.06 0.06 0.66 0.02 0.82 0.10 0.32 0.95 0.77 0.91 0.33 0.77 0.53 0.62 0.99 1.00 0.96 0.64 1.00 1.14 1.09 1.00 1.29 0.07 0.18 0.07 0.11 0.08 0.08 0.08 0.89 0.01 0.93 0.47 0.69 0.95 0.02 0.94 0.39 0.93 0.62 0.76 1.00 1.02 85 APPENDIX C DERIVATIONS OF TEST STATISTICS FROM CHAPTER 2 C.1 Derivations from Section 2.3.2 From section 3.2, the score of (2.13) evaluated at Λ = 0 is identically zero. Assuming we can pass the derivative through the integral, we can work out the following: β , Λ) = ∇Λ i (β h RK it y T p (x T xi , , b i ) f (uui ) duui ∏t=1 t x i , b i ) it ∑t=1 yit u i ⊗ qt (x f (yyi |xxi , u i , ci , ni ) f (uui ) duui RK ni ! xi , b i ) = ∇b pt (xxi , b i )/pt (xxi , b i ). Evaluating T y ! , qt (x i ∏t=1 it do not depend on u i out of the integrals, we have: where hit = terms that β , Λ) ∇Λ i (β = Λ =00 y T p (x hit ∏t=1 t x i , β ) it T y ∑t=1 it y T p (x hit ∏t=1 t x i , β ) it at Λ = 0, and pulling the u ⊗ qt (xxi , β ) f (uui ) duui RK i RK (C.1) f (uui ) duui (C.2) T = ∑ yit E u i ⊗ qt (xxi , b i ) (C.3) t=1 = 0. The second equality uses that RK f (uui ) duui = 1, while the third follows from independence of x it and u i , as well as E(uui ) = 0 . Following the re-parameterization shown in (2.14), stacking the λ j into K ×1 vector λ , defining β , λ ) , and following similar steps as before, we have: let θ ≡ (β     T β ,λ ) ∂ i (β 1 x u u = y q (x , β ) u f (u ) du ij i i ∑ it t j i  2 λ t=1  ∂λj λ =0 RK j (C.4) λ j =0 where qt j () is the jth element of qt (), The above has 0/0 form since E(uui ) = 0 . Using L’Hopital’s rule, the limit, of β ,λ λ) ∂ i (β ∂λj as each element of λ approaches zero from above 86 is: 1 2 λj h [ p (xx , b )yit ] RK it ∏t t i i ∑t yit rt j (xxi , b i ) + ∑t yit qt j (xxi , b i ) 2 u2i j f (uui ) duui , 2 2 1 λj h [ p (xx , b )yit ] RK it ∏t t i i where rt j () is the ( j, j)th element of ∇b qt (xxi , b i ). The i (C.5) f (uui ) duui 1 2 λj terms cancel, as do the hit the product terms when we evaluate at λ = 0 (bbi = β 0 ). Then using RK f (uui ) duui = 1 and RK u2i j f (uui ) duui = E(u2i j ) = 1, we get the last K elements of (2.15). C.2 Derivations from Section 2.3.3 As before, the restricted score of (2.21 is identically zero. T β , Λ ) = ∑ yit ∇Λ i (β t=1 T ∇Λ pt (xxi , β , Λ ) pt (xxi , β , Λ ) ∑T exp(xxir β + mr (xxi , Λ )) ∇λ mt (xxi , Λ ) − ∇λ mr (xxi , Λ ) = ∑ yit r=1 , T exp(x x x β + m (x , Λ )) ∑ r ir i t=1 r=1 ∇λ mt (xxi , Λ ) = RK (C.6) exp (xxit Λ ui ) (uui ⊗ xit ) f (uui ) duui . exp(xxit Λ 0 u i ) f (uui ) duui RK (C.7) The complication arises because ∇λ mt (xxi , Λ ) Λ =00 = RK (uui ⊗ xit ) f (uui ) duui = 0, f (uui ) duui RK (C.8) which implies β , Λ) ∇Λ i (β = 0. Λ =00 (C.9) After the re-parameterization, for each of the λ j , we have: −1 ∇λ mt (xxi , Λ ) = j RK exp(xxit Λ 0 u i ) f (uui ) duui RK exp (xxit Λ ui ) xit j ui j f (uui ) duui 2 λj When evaluated at λ = 0 , the second factor of (C.10) has the form 0/0. 87 . (C.10) Using L’Hopital’s rule, as each λ j approaches zero from above, we have:  1  2 λj   xit Λ u i ) xit j ui j f (uui ) duui K exp (x  = lim  lim  R  λ ↓0 λ ↓0 2 λj exp (xxit Λ u i ) xit2 j u2i j f (uui ) duui RK 2( 1 ) 2    λj xit2 j RK u2i j f (uui ) duui = 2 1 2 = xit j 2 β , 0 ), we get (2.23). 
Plugging these limits in into the expression for ∇Λ i (β 88  (C.11) APPENDIX D SIMULATION RESULTS FROM CHAPTER 3 Table D.1: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 500 β1 σ T T T σ T T T σ T T T σ T T T σ T T T Mean = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.58 0.58 =4 = 10 0.58 = 0.50 =2 0.74 0.74 =4 = 10 0.74 = 0.75 =2 0.92 =4 0.92 = 10 0.91 = 1.00 =2 1.07 1.08 =4 = 10 1.08 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.06 0.04 0.03 0.98 0.99 0.99 0.06 0.06 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.09 0.06 0.04 0.99 1.00 0.98 0.05 0.05 0.05 0.08 0.08 0.08 0.06 0.04 0.03 0.99 1.00 1.01 0.29 0.52 0.87 -0.38 -0.39 -0.38 0.12 0.11 0.12 0.09 0.06 0.04 0.99 0.99 1.00 0.28 0.46 0.84 0.24 0.24 0.24 0.06 0.04 0.03 0.97 0.95 0.94 0.99 1.00 1.00 -0.19 -0.19 -0.19 0.31 0.31 0.31 0.10 0.07 0.05 0.99 0.95 1.00 0.90 0.99 1.00 0.42 0.42 0.41 0.07 0.06 0.05 0.86 0.84 0.82 1.00 1.00 1.00 -0.04 -0.04 -0.04 0.46 0.46 0.46 0.11 0.09 0.07 0.93 0.93 0.89 0.97 0.99 1.00 0.57 0.58 0.58 0.09 0.09 0.08 0.77 0.70 0.71 1.00 1.00 1.00 0.09 0.08 0.08 0.59 0.58 0.58 0.15 0.13 0.11 0.86 0.78 0.76 0.96 0.96 0.98 89 Table D.2: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 500 β1 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.50 0.50 =4 = 10 0.50 = 1.00 =2 0.50 0.50 =4 = 10 0.50 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.10 0.05 0.03 0.99 0.99 0.99 0.05 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.13 0.07 0.04 0.99 1.00 0.98 0.05 0.05 0.05 0.00 0.00 0.00 0.09 0.05 0.03 0.98 1.01 1.00 0.06 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.13 0.07 0.04 0.99 0.99 1.00 0.05 0.05 0.05 0.00 0.00 0.00 0.08 0.04 0.02 0.99 1.00 1.00 0.05 0.05 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.14 0.08 0.04 1.00 0.98 1.02 0.05 0.06 0.05 0.00 0.00 0.00 0.06 0.03 0.02 1.01 0.99 0.99 0.05 0.06 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.14 0.08 0.05 0.99 0.99 0.99 0.06 0.05 0.05 0.00 0.00 0.00 0.05 0.03 0.02 0.97 0.97 0.97 0.06 0.06 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.15 0.08 0.05 0.97 1.01 1.01 0.06 0.05 0.05 90 Table D.3: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 500 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.74 =4 0.74 = 10 0.74 = 0.75 =2 1.19 1.19 =4 = 10 1.19 = 1.00 =2 2.00 2.03 =4 = 10 2.02 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.05 0.04 0.02 0.99 0.99 0.99 0.05 0.05 0.05 -0.42 -0.42 -0.42 0.00 0.00 0.00 0.08 0.05 0.03 1.00 1.00 0.99 0.05 0.05 0.05 0.07 0.07 0.07 0.05 0.04 0.02 0.99 1.00 1.02 0.24 0.43 0.79 -0.34 -0.34 -0.34 0.12 0.11 0.12 0.08 0.06 0.04 1.00 0.99 1.00 0.30 0.49 0.85 0.24 0.24 0.24 0.07 0.06 0.05 0.97 0.95 0.97 0.94 1.00 1.00 -0.20 -0.20 -0.20 0.36 0.36 0.36 0.10 0.08 0.05 0.99 0.95 1.00 0.91 0.99 1.00 0.54 0.54 0.54 0.15 0.14 0.13 0.91 0.88 0.90 1.00 1.00 1.00 -0.06 -0.06 -0.06 0.70 0.71 0.71 0.16 0.12 0.10 0.92 0.92 0.88 0.96 0.98 0.99 1.07 1.10 1.09 0.36 0.39 0.35 0.83 0.76 0.82 1.00 0.99 1.00 0.13 0.12 0.14 1.28 1.26 1.28 0.28 0.27 0.22 0.84 0.72 0.73 0.95 0.96 0.97 91 Table D.4: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 500 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.43 =4 0.43 = 10 0.43 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.65 0.65 =4 = 10 0.65 = 1.00 =2 0.93 0.93 =4 = 10 
0.93 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.08 0.04 0.02 0.99 0.99 0.99 0.05 0.05 0.05 -0.43 -0.42 -0.42 0.00 0.00 0.00 0.11 0.06 0.04 0.99 1.00 0.99 0.05 0.05 0.05 0.00 0.00 0.00 0.08 0.04 0.02 0.98 1.00 1.01 0.05 0.05 0.05 -0.46 -0.46 -0.45 0.00 0.00 0.00 0.13 0.07 0.04 0.99 0.99 1.01 0.05 0.05 0.04 0.00 0.00 0.00 0.08 0.05 0.03 1.00 0.99 1.02 0.05 0.06 0.05 -0.57 -0.56 -0.56 -0.01 0.00 0.00 0.18 0.10 0.06 0.98 0.98 1.03 0.05 0.06 0.04 0.00 0.00 0.00 0.09 0.06 0.05 1.00 0.98 0.98 0.05 0.06 0.06 -0.79 -0.77 -0.77 -0.02 -0.01 -0.01 0.28 0.16 0.10 0.98 0.99 0.98 0.05 0.06 0.05 0.00 0.00 0.00 0.14 0.12 0.10 0.94 0.90 0.95 0.07 0.08 0.08 -1.17 -1.15 -1.15 -0.03 -0.01 0.00 0.46 0.28 0.19 0.95 0.98 0.97 0.06 0.06 0.06 92 Table D.5: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 1000 β1 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.58 =4 0.58 = 10 0.58 = 0.50 =2 0.74 =4 0.74 = 10 0.74 = 0.75 =2 0.92 0.92 =4 = 10 0.92 = 1.00 =2 1.09 1.09 =4 = 10 1.09 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.04 0.03 0.02 1.00 1.00 0.98 0.05 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.06 0.04 0.03 0.99 0.99 0.99 0.05 0.05 0.05 0.08 0.08 0.08 0.04 0.03 0.02 1.01 1.02 1.03 0.52 0.80 0.99 -0.38 -0.38 -0.38 0.12 0.12 0.12 0.06 0.04 0.03 1.01 0.99 1.03 0.46 0.78 0.99 0.24 0.24 0.24 0.04 0.03 0.02 0.98 0.98 0.97 1.00 1.00 1.00 -0.19 -0.19 -0.19 0.31 0.31 0.31 0.07 0.05 0.03 0.96 1.00 1.00 0.99 1.00 1.00 0.42 0.42 0.42 0.05 0.05 0.04 0.92 0.86 0.83 1.00 1.00 1.00 -0.04 -0.05 -0.04 0.46 0.45 0.46 0.08 0.07 0.05 0.96 0.90 0.88 0.99 0.99 0.99 0.59 0.59 0.59 0.08 0.07 0.07 0.78 0.73 0.75 1.00 1.00 1.00 0.07 0.07 0.07 0.57 0.57 0.57 0.11 0.10 0.09 0.85 0.78 0.77 0.98 0.98 0.99 93 Table D.6: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 1000 β1 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.50 0.50 =4 = 10 0.50 = 1.00 =2 0.50 0.50 =4 = 10 0.50 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.07 0.04 0.02 0.99 1.00 0.98 0.05 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.09 0.05 0.03 1.01 0.98 0.99 0.05 0.05 0.05 0.00 0.00 0.00 0.06 0.03 0.02 0.98 1.00 1.04 0.05 0.05 0.04 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.09 0.05 0.03 1.00 0.99 1.01 0.05 0.06 0.04 0.00 0.00 0.00 0.05 0.03 0.02 1.00 0.98 1.02 0.05 0.06 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.10 0.06 0.03 1.01 1.00 0.99 0.05 0.06 0.05 0.00 0.00 0.00 0.04 0.03 0.01 1.02 0.99 1.01 0.04 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.10 0.06 0.03 0.99 1.00 1.00 0.05 0.05 0.05 0.00 0.00 0.00 0.03 0.02 0.01 0.99 0.97 1.02 0.05 0.06 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.11 0.06 0.03 0.99 1.01 1.00 0.05 0.05 0.05 94 Table D.7: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 1000 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.74 =4 0.74 = 10 0.74 = 0.75 =2 1.19 1.19 =4 = 10 1.19 = 1.00 =2 2.04 2.03 =4 = 10 2.04 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.04 0.03 0.02 0.99 0.99 0.98 0.06 0.05 0.05 -0.42 -0.42 -0.42 0.00 0.00 0.00 0.05 0.04 0.02 0.99 0.99 1.00 0.05 0.05 0.05 0.07 0.07 0.07 0.04 0.03 0.02 1.00 1.00 1.01 0.45 0.71 0.98 -0.34 -0.34 -0.34 0.11 0.12 0.11 0.06 0.04 0.03 1.01 0.99 1.03 0.48 0.79 0.99 0.24 0.24 0.24 0.05 0.04 0.04 0.98 0.98 0.98 1.00 
1.00 1.00 -0.20 -0.20 -0.20 0.36 0.36 0.36 0.08 0.05 0.04 0.96 1.00 1.00 0.99 1.00 1.00 0.54 0.54 0.54 0.11 0.10 0.10 0.95 0.91 0.90 1.00 1.00 1.00 -0.06 -0.06 -0.06 0.70 0.70 0.70 0.11 0.09 0.07 0.96 0.89 0.87 0.99 0.99 0.99 1.10 1.10 1.10 0.28 0.29 0.26 0.84 0.80 0.85 1.00 1.00 1.00 0.12 0.12 0.12 1.26 1.26 1.27 0.22 0.21 0.17 0.83 0.73 0.75 0.97 0.97 0.98 95 Table D.8: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 1000 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.43 =4 0.43 = 10 0.43 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.65 0.65 =4 = 10 0.65 = 1.00 =2 0.93 0.93 =4 = 10 0.93 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.06 0.03 0.02 0.99 0.99 0.98 0.05 0.06 0.05 -0.42 -0.42 -0.42 0.00 0.00 0.00 0.08 0.05 0.03 1.01 0.98 0.99 0.05 0.05 0.05 0.00 0.00 0.00 0.06 0.03 0.02 0.98 1.00 1.02 0.05 0.05 0.05 -0.46 -0.45 -0.46 0.00 0.00 0.00 0.09 0.05 0.03 1.00 0.99 1.02 0.05 0.06 0.04 0.00 0.00 0.00 0.06 0.03 0.02 1.00 0.99 1.01 0.05 0.05 0.05 -0.57 -0.56 -0.56 -0.01 0.00 0.00 0.13 0.07 0.04 1.01 1.01 1.00 0.04 0.05 0.05 0.00 0.00 0.00 0.06 0.04 0.03 1.02 0.98 1.00 0.04 0.06 0.05 -0.78 -0.77 -0.77 -0.01 0.00 0.00 0.19 0.11 0.07 0.99 1.00 1.01 0.05 0.05 0.05 0.00 0.00 0.00 0.10 0.08 0.07 0.95 0.93 0.96 0.06 0.07 0.07 -1.16 -1.15 -1.15 -0.01 0.00 0.00 0.31 0.19 0.13 0.98 0.99 0.99 0.05 0.05 0.05 96 Table D.9: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 2000 β1 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.58 =4 0.58 = 10 0.58 = 0.50 =2 0.74 =4 0.74 = 10 0.74 = 0.75 =2 0.92 0.92 =4 = 10 0.92 = 1.00 =2 1.09 1.09 =4 = 10 1.09 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.03 0.02 0.01 1.00 1.01 0.97 0.05 0.05 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.04 0.03 0.02 0.99 1.00 0.99 0.05 0.05 0.05 0.08 0.08 0.08 0.03 0.02 0.01 1.02 1.01 0.98 0.81 0.98 1.00 -0.38 -0.38 -0.38 0.12 0.12 0.12 0.04 0.03 0.02 1.02 1.00 0.97 0.75 0.96 1.00 0.24 0.24 0.24 0.03 0.02 0.02 0.98 1.00 0.97 1.00 1.00 1.00 -0.19 -0.19 -0.19 0.31 0.31 0.31 0.05 0.03 0.02 1.00 1.04 0.99 1.00 1.00 1.00 0.42 0.42 0.42 0.04 0.04 0.03 0.92 0.88 0.89 1.00 1.00 1.00 -0.04 -0.05 -0.05 0.46 0.45 0.45 0.06 0.05 0.04 0.97 0.95 0.92 1.00 1.00 1.00 0.59 0.59 0.59 0.06 0.06 0.05 0.83 0.79 0.80 1.00 1.00 1.00 0.07 0.06 0.07 0.57 0.56 0.57 0.09 0.08 0.07 0.84 0.82 0.80 0.99 0.99 0.99 97 Table D.10: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 2000 β1 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.50 0.50 =4 = 10 0.50 = 1.00 =2 0.50 0.50 =4 = 10 0.50 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.05 0.03 0.01 0.99 1.00 0.97 0.06 0.05 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.06 0.04 0.02 0.96 1.00 0.99 0.06 0.05 0.05 0.00 0.00 0.00 0.04 0.02 0.01 1.01 1.02 0.98 0.05 0.05 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.07 0.04 0.02 0.99 1.00 0.98 0.06 0.05 0.05 0.00 0.00 0.00 0.04 0.02 0.01 0.99 1.02 0.99 0.05 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.07 0.04 0.02 0.99 1.01 0.98 0.05 0.05 0.05 0.00 0.00 0.00 0.03 0.02 0.01 0.95 0.99 0.98 0.06 0.06 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.07 0.04 0.02 0.97 1.03 0.98 0.05 0.04 0.05 0.00 0.00 0.00 0.02 0.01 0.01 0.98 0.96 1.01 0.05 0.06 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.08 0.04 0.02 0.98 0.99 1.01 0.05 0.05 0.05 98 Table D.11: Finite Sample 
Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 2000 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.74 =4 0.74 = 10 0.74 = 0.75 =2 1.19 1.19 =4 = 10 1.19 = 1.00 =2 2.03 2.04 =4 = 10 2.04 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.02 0.02 0.01 0.99 1.01 0.96 0.05 0.05 0.05 -0.42 -0.42 -0.42 0.00 0.00 0.00 0.04 0.03 0.02 0.99 1.00 0.98 0.05 0.05 0.05 0.07 0.07 0.07 0.03 0.02 0.01 1.02 1.00 0.99 0.75 0.95 1.00 -0.34 -0.34 -0.34 0.11 0.11 0.11 0.04 0.03 0.02 1.02 1.00 0.97 0.77 0.97 1.00 0.24 0.24 0.24 0.04 0.03 0.03 0.98 0.99 0.99 1.00 1.00 1.00 -0.20 -0.20 -0.20 0.36 0.36 0.36 0.05 0.04 0.03 1.00 1.04 0.99 1.00 1.00 1.00 0.54 0.55 0.55 0.08 0.08 0.07 0.95 0.93 0.95 1.00 1.00 1.00 -0.06 -0.06 -0.06 0.71 0.70 0.70 0.08 0.06 0.05 0.96 0.95 0.92 1.00 1.00 1.00 1.10 1.11 1.11 0.19 0.20 0.19 0.91 0.86 0.90 1.00 1.00 1.00 0.11 0.11 0.11 1.26 1.25 1.26 0.16 0.15 0.13 0.84 0.80 0.77 0.99 0.99 0.99 99 Table D.12: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 2000 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.43 =4 0.43 = 10 0.43 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.65 0.65 =4 = 10 0.65 = 1.00 =2 0.93 0.93 =4 = 10 0.93 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.04 0.02 0.01 0.99 0.99 0.97 0.05 0.05 0.06 -0.42 -0.42 -0.42 0.00 0.00 0.00 0.06 0.03 0.02 0.96 1.00 0.98 0.06 0.05 0.05 0.00 0.00 0.00 0.04 0.02 0.01 1.01 1.01 0.98 0.05 0.04 0.05 -0.46 -0.45 -0.46 0.00 0.00 0.00 0.07 0.04 0.02 0.98 1.00 0.99 0.05 0.05 0.05 0.00 0.00 0.00 0.04 0.02 0.02 0.98 1.03 0.99 0.05 0.04 0.05 -0.56 -0.56 -0.56 0.00 0.00 0.00 0.09 0.05 0.03 0.99 1.01 0.97 0.05 0.05 0.05 0.00 0.00 0.00 0.05 0.03 0.02 0.96 0.98 0.99 0.06 0.06 0.06 -0.77 -0.77 -0.77 0.00 0.00 0.00 0.14 0.08 0.05 0.97 1.03 0.99 0.06 0.05 0.05 0.00 0.00 0.00 0.07 0.06 0.05 0.99 0.96 1.00 0.06 0.06 0.05 -1.16 -1.15 -1.15 -0.01 0.00 0.00 0.22 0.14 0.09 0.98 0.99 0.99 0.05 0.06 0.05 100 REFERENCES 101 REFERENCES Alexander, B. and R. Breunig (2014). “A Monte Carlo study of bias corrections for panel probit models”. In: Journal of Statistical Computation and Simulation 86.1, pp. 74–90. DOI: 10.1080/ 00949655.2014.994516. Andersen, E.B. (1970). “Asymptotic Properties of Conditional Maximum-Likelihood Estimators”. In: Journal of the Royal Statistical Society. Series B (Methodological) 32.2, pp. 283–301. ISSN: 00359246. DOI: 10.2307/2984535. URL: http://www.jstor.org/stable/2984535. Arellano, M. and J. Hahn (2007). Understanding Bias in Nonlinear Panel Models: Some Recent Developments. In Advances in Economics and Econometrics, Blundell R, Newey W, Persson T (eds). Cambridge: Cambridge University Press. Bessen, J. (2009). “Matching patent data to compustat firms”. In: NBER working paper. Blundell, R. and J.L. Powell (2003). “Endogeneity in Nonparametric and Semiparametric Regression Models”. In: Advances in Economics and Econometrics: Theory and Applications: Eighth World Congress Vol II, pp. 312–357. DOI: 10.1017/ccol0521818737.010. Bound, J. et al. (1982). “Who does R&D and who patents?” In: Cameron, A.C. and P.K. Trivedi (2013). Regression analysis of count data. 2nd ed. Cambridge University Press. Chamberlain, G. (1980). “Analysis of Covariance with Qualitative Data”. In: Review of Economic Studies 47, pp. 225–238. DOI: 10.2307/2297110. — (1982). “Multivariate Regression Models For Panel Data”. 
In: Journal of Econometrics 18, pp. 5–46. DOI: 10.1016/0304-4076(82)90094-x. — (1992). “Comment: Sequential moment restrictions in panel data”. In: Journal of Business & Economic Statistics 10.1, pp. 20–26. Chay, K.Y. and D.R. Hyslop (2014). “Identification and Estimation of Dynamic Binary Response Panel Data Models: Empirical Evidence Using Alternative Approaches”. In: Safety Nets and Benefit Dependence (Research in Labor Economics), pp. 1–39. DOI: 10.1108/s0147- 9121_ 2014_0000039001. Chesher, A. (1984). “Testing for Neglected Heterogeneity”. In: Econometrica 52.4, p. 865. 10.2307/1911188. 102 DOI : Dhaene, G. and K. Jochmans (2015). “Split-panel Jackknife Estimation of Fixed-effect Models”. In: Review of Economic Studies 82.3, pp. 991–1030. DOI: 10.1093/restud/rdv007. Fernandez-Val, I. (2009). “Fixed Effects Estimation of Structural Parameters and Marginal Effects in Panel Probit Models”. In: Journal of Econometrics 150, pp. 71–85. DOI: 10.1016/j.jeconom. 2009.02.007. Fernández-Val, Iván and Martin Weidner (2016). “Individual and time effects in nonlinear panel models with large N, T”. In: Journal of Econometrics 192.1, pp. 291–312. Gourieroux, C., A. Monfort, and A. Trognon (1984). “Pseudo Maximum Likelihood Methods: Theory”. In: Econometrica 52.3, p. 681. DOI: 10.2307/1913471. Greene, W.H. (2004). “The Behavior of the Fixed Effects Estimator in Nonlinear Models”. In: The Econometrics Journal 7, pp. 98–119. DOI: 10.1111/j.1368-423x.2004.00123.x. — (2012). Econometric analysis. Prentice Hall. Greene, W.H. and C. Mckenzie (2015). “An LM test based on generalized residuals for random effects in a nonlinear model”. In: Economics Letters 127, pp. 47–50. DOI: 10.1016/j.econlet. 2014.12.031. Gurmu, Shiferaw and Fidel Pérez-Sebastián (2008). “Patents, R&D and lag effects: evidence from flexible methods for count panel data on manufacturing firms”. In: Empirical Economics 35.3, pp. 507–526. Hahn, J. and G. Kuersteiner (2011). “Bias reduction for dynamic nonlinear panel models with fixed effects”. In: Econometric Theory 27.06, pp. 1152–1191. Hahn, J., H.R. Moon, and C. Snider (2015). “LM test of neglected correlated random effects and its application”. In: Journal of Business & Economic Statistics forthcoming. Hahn, J. and W. Newey (2004). “Jackknife and Analytical Bias Reduction for Nonlinear Panel Models”. In: Econometrica 72, pp. 1295–1319. DOI: 10.1111/j.1468-0262.2004.00533.x. Hahn, J., W.K. Newey, and R.J. Smith (2014). “Neglected heterogeneity in moment condition models”. In: Journal of Econometrics 178, pp. 86–100. Hall, B., Z. Griliches, and J. Hausman (1986). “Patents and R and D: Is There a Lag?” In: International Economic Review, pp. 265–283. Hall, B., A. Jaffe, and M. Trajtenberg (2001). The NBER patent citation data file: Lessons, insights and methodological tools. Tech. rep. National Bureau of Economic Research. 103 Hausman, J., B. Hall, and Z. Griliches (1984). “Econometric Models for Count Data with an Application to the Patents-R&D Relationship”. In: Econometrica 52.4, p. 909. DOI: 10 . 2307 / 1911191. Lancaster, T. (2000). “The Incidental Parameters Problem since 1948”. In: Journal of Econometrics 95, pp. 391–413. DOI: 10.1016/s0304-4076(99)00044-5. — (2002). “Orthogonal parameters and panel data”. In: The Review of Economic Studies 69.3, pp. 647–666. Lee, L. and A. Chesher (1986). “Specification testing when score test statistics are identically zero”. In: Journal of Econometrics 31.2, pp. 121–149. DOI: 10.1016/0304-4076(86)90045-x. Lee, M. and S. Kobayashi (2001). 
“Proportional treatment effects for count response panel data: effects of binary exercise on health care demand”. In: Health Economics 10.5, pp. 411–428. Mundlak, Y. (1978). “On the pooling of Time Series and Cross Section Data”. In: Econometrica 46, pp. 69–85. Neyman, J. and E. Scott (1948). “Consistent Estimates Based on Partially Consistent Observations”. In: Econometrica 16, pp. 1–32. Pakes, A. and Z. Griliches (1980). “Patents and R and D at the firm level: A first look”. In: Rabe-Hesketh, S. and A. Skrondal (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. English. GB: CRC Press. — (2013). “Avoiding biased versions of Wooldridge’s simple solution to the initial conditions problem”. In: Economics Letters 120.2, pp. 346–349. DOI: 10.1016/j.econlet.2013.05.009. Stoker, T. (1986). “Consistent Estimation of Scaled Coefficients”. In: Econometrica 54.6, pp. 1461–1481. DOI: 10.2307/1914309. Vamo¸s, C., S. ¸ Soltuz, ¸ and M. Cr˘aciun (2007). “Order 1 autoregressive process of finite length”. In: Rev. Anal. Numér. Théor. Approx. 36.2, pp. 199–214. Wang, P., I.M. Cockburn, and M.L. Puterman (1998). “Analysis of patent data—a mixed-Poissonregression-model approach”. In: Journal of Business & Economic Statistics 16.1, pp. 27–41. White, H. (1982). “Maximum likelihood estimation of misspecified models”. In: Econometrica: Journal of the Econometric Society, pp. 1–25. Wooldridge, J.M. (1992). “Some alternatives to the Box-Cox regression model”. In: International Economic Review, pp. 935–955. 104 Wooldridge, J.M. (1997). “Multiplicative panel data models without the strict exogeneity assumption”. In: Econometric Theory 13.05, pp. 667–678. — (1999). “Distribution-free estimation of some nonlinear panel data models”. In: Journal of Econometrics 90.1, pp. 77–97. DOI: 10.1016/s0304-4076(98)00033-5. — (2010). Econometric analysis of cross section and panel data. MIT Press. 105