ESSAYS ON NONLINEAR PANEL MODELS WITH UNOBSERVED HETEROGENEITY

By

Robert Martin

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics – Doctor of Philosophy

2017

ABSTRACT

ESSAYS ON NONLINEAR PANEL MODELS WITH UNOBSERVED HETEROGENEITY

By

Robert Martin

This dissertation concerns nonlinear panel data estimation relevant to the fields of econometrics and applied microeconomics. Panel data is attractive for estimating causal effects when unobserved heterogeneity in cross-sectional units is correlated with explanatory variables. For instance, the well-known linear fixed effects and first difference estimators use within-group variation to achieve consistent estimation. However, nonlinear models often better represent limited dependent variables like binary outcomes or counts, and extending traditional panel techniques to these settings can be problematic. For instance, treating the heterogeneity as parameters to be estimated usually leads to what is known as the incidental parameters problem. Furthermore, heterogeneous slopes in a conditional mean function can also confound estimation, and fewer remedies exist for them than for additive effects. I aim to address these issues in my research with an emphasis on practical applicability.

Chapter 1: Finite sample properties of bias-corrected fixed effects estimators for panel binary response models

Maximum likelihood estimation (MLE) of nonlinear unobserved effects panel models is known to be generally inconsistent when the heterogeneity is treated as parameters to be estimated. Several authors have proposed corrections justified by large-T expansions of the inconsistency under conditions like dynamic completeness. Using Monte Carlo (MC) techniques, I find that failure of dynamic completeness can increase bias in slope and average partial effects (APE) estimates in shorter panels, but has little impact on APE for longer panels. I also compare bias corrections to correlated random effects (CRE) and conditional MLE using MC and welfare data from the Survey of Income and Program Participation (SIPP).

Chapter 2: Exponential panel models with coefficient heterogeneity

If heterogeneous slopes are ignored in exponential panel models, fixed effects Poisson may not estimate any quantity of interest. Existing estimation methods often involve treating only a small subset of the slopes as “random effects” and integrating them out of the likelihood, increasing computational difficulty. I propose a test to detect slope heterogeneity that, unlike the traditional approach, does not amount to testing the information matrix equality. Additionally, I present a correlated random coefficients approach to identification which allows for estimation of the coefficient means and average partial effects. I evaluate these proposed methods using a Monte Carlo experiment and apply them to the patent-R&D relationship for U.S. manufacturing firms.

Chapter 3: Estimation of average marginal effects in multiplicative unobserved effects panel models

This chapter concerns estimation of average marginal effects in static multiplicative unobserved effects panel models for nonnegative dependent variables. While fixed effects Poisson (FEP) consistently estimates the parameters of the conditional mean function, marginal effects generally depend on the unobserved heterogeneity. They would therefore seem inestimable without either additional assumptions or some form of bias correction.
I show, however, that Average Partial Effect (APE) and Average Treatment Effect (ATE) estimators that use estimated individual effects are consistent and asymptotically normal. This is in contrast with cases like fixed effects logit, where similar marginal effects estimators suffer from the incidental parameters problem.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank the chair of my dissertation committee, Jeff Wooldridge, for all of his advice, encouragement, and helpful critiques. I would also like to thank Peter Schmidt, Kyooil Kim, and Nicole Mason for serving on my committee and providing valuable feedback and assistance. I also appreciate the comments of seminar participants at Michigan State University, the 2016 MEA Conference, and the 2016 Annual Meeting of the Midwest Econometrics Group. I am especially grateful for the financial support I received from the Graduate School and the Department of Economics at Michigan State University, including the David Kelley Fellowship, Summer Research Fellowship, and Dissertation Completion Fellowship. I also appreciate the support and advice that Lori Jean Nichols, Steven Haider, Todd Elder, and Steve Woodbury all gave me as I navigated the graduate program and job market. Finally, I cannot thank my family enough for their support and encouragement. I am especially grateful to my wife, Kara, for moving with me to East Lansing and then to Washington DC, as well as for all the countless ways she has supported my endeavors over the years.

TABLE OF CONTENTS

LIST OF TABLES

CHAPTER 1  FINITE SAMPLE PROPERTIES OF BIAS-CORRECTED FIXED EFFECTS ESTIMATORS FOR PANEL BINARY RESPONSE MODELS
  1.1 Introduction
  1.2 The panel binary response model with incidental parameters
    1.2.1 Bias correction techniques
  1.3 Monte Carlo experiment
    1.3.1 Evaluating the dynamic completeness assumption
    1.3.2 Comparing bias correction and CRE under more general forms of heterogeneity
    1.3.3 Conditional logit and the importance of correcting APE estimates
  1.4 Results
    1.4.1 Evaluating the dynamic completeness assumption
      1.4.1.1 Comparison with uncorrected MLE
    1.4.2 Comparing bias correction and CRE under more general forms of heterogeneity
    1.4.3 Conditional logit and the importance of correcting APE estimates
    1.4.4 Empirical example: Welfare participation
  1.5 Conclusion

CHAPTER 2  EXPONENTIAL PANEL MODELS WITH COEFFICIENT HETEROGENEITY
  2.1 Introduction
  2.2 Literature Review
  2.3 Theory
    2.3.1 The fixed effects Poisson model with coefficient heterogeneity
    2.3.2 Testing under full distributional assumptions
    2.3.3 Testing under weaker assumptions
    2.3.4 A correlated random coefficients approach to testing and estimation
    2.3.5 Adding second moment assumptions
    2.3.6 Estimating average partial effects
      2.3.6.1 Approaches under the CRE assumption for ci
      2.3.6.2 Estimation when the slopes are independent of covariates
  2.4 Monte Carlo
    2.4.1 Comparing estimation methods
    2.4.2 Testing when coefficients are not normal
  2.5 Empirical application: the Patent-R&D relationship
  2.6 Conclusion

CHAPTER 3  ESTIMATION OF AVERAGE MARGINAL EFFECTS IN MULTIPLICATIVE UNOBSERVED EFFECTS PANEL MODELS
  3.1 Introduction and Review
  3.2 Theory
    3.2.1 Exponential Models
    3.2.2 A note about dropped observations
  3.3 Monte Carlo
    3.3.1 Design
    3.3.2 Results
  3.4 Conclusion

APPENDICES
  APPENDIX A  Analytical bias correction expressions from Chapter 1
  APPENDIX B  Simulation results for bias corrections on a larger cross-section
  APPENDIX C  Derivations of test statistics from Chapter 2
  APPENDIX D  Simulation results from Chapter 3

REFERENCES

LIST OF TABLES

Table 1.1: Probit Estimates of β (β0 = 1)
Table 1.2: Probit Estimates of γ (γ0 = 1)
Table 1.3: Probit Estimates of µ̂x/µx (true value = 1)
Table 1.4: Probit Estimates of µ̂d/µd (true value = 1)
Table 1.5: Probit Estimates of µ̂x/µx Under Different Heterogeneity (true value = 1)
Table 1.6: Corrected and Uncorrected Logit Estimates of µ̂x/µx (true value = 1)
Table 1.7: Welfare Participation: Slope Estimates
Table 1.8: Welfare Participation: Average Partial Estimates
Table 2.1: Finite Sample Properties of Slope Estimators: β1 = 1, β2 = −1
Table 2.2: Finite Sample Properties of APE Estimators: β1 = 1, β2 = −1
Table 2.3: Testing when bi is not normal
Table 2.4: Distribution of Net Sales in 2000
Table 2.5: R&D Expenditures in 2000
Table 2.6: Summary of Key Variables in 2000
Table 2.7: Results for traditional estimators
Table 2.8: Results for CRC FEP estimators
Table 2.9: CRCFEP 3 estimated elasticities
Table B.1: Probit Slope Estimates when N = 500, T = 6
Table B.2: Probit APE Estimates when N = 500, T = 6
Table D.1: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 500
Table D.2: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 500
Table D.3: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 500
Table D.4: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 500
Table D.5: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 1000
Table D.6: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 1000
Table D.7: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 1000
Table D.8: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 1000
Table D.9: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 2000
Table D.10: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 2000
Table D.11: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 2000
Table D.12: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 2000

CHAPTER 1

FINITE SAMPLE PROPERTIES OF BIAS-CORRECTED FIXED EFFECTS ESTIMATORS FOR PANEL BINARY RESPONSE MODELS

1.1 Introduction

Nonlinear models are popular in economics in many settings. For instance, binary response models are common for analyzing outcomes like labor force participation, employment, or union membership. At the same time, panel data can be attractive when controlling for unobserved heterogeneity is necessary to identify causal effects. However, it is well known that maximum likelihood estimation (MLE) that treats the heterogeneity as parameters to estimate is inconsistent. For example, in the case of cross-section heterogeneity, the problem arises in the typical large-N, fixed-T microeconometric setting because only a handful of observations contribute to the estimation of each individual’s fixed effect (Lancaster, 2000). This is known as the incidental parameters problem, first described by Neyman and Scott in 1948. In the statistics and econometrics literature, there have been many approaches to estimation in the presence of incidental parameters. In some special cases, it is possible to re-parameterize the model or find a conditioning variable that removes the incidental parameters from the likelihood function (Lancaster, 2000). A leading example of this is the conditional logit model, where the conditioning variable is the number of successes observed for each cross-sectional unit (Chamberlain, 1980).
However, while conditional maximum likelihood in a case like this consistently estimates the slope parameters of the index of the logit function, conditioning usually does not identify partial effects, which depend on the heterogeneity (Wooldridge, 2010). Other approaches involve restricting the relationship between the heterogeneity and explanatory variables in some way. For instance, if we are willing to assume independence between the heterogeneity and the explanatory variables, then we can use a random effects approach. In many cases, however, correlation between heterogeneity and covariates is of concern. The correlated random effects (CRE) approach of Chamberlain (1980, 1982) or Mundlak (1978) restricts the conditional distribution of the heterogeneity to have a mean that is a linear function of the explanatory variables, but the restriction at least buys the researcher identification of APE and scaled slope parameters (Wooldridge, 2010). Assumptions restricting the nature of the heterogeneity are a potential drawback. For instance, Rabe-Hesketh and Skrondal (2013) explore a special case in the dynamic probit setting where misspecification of the heterogeneity causes significant bias. In general, however, we do not know how robust CRE is when the distributional assumption fails or when the researcher chooses the wrong conditional mean function. If one prefers to leave the nature of the heterogeneity completely unrestricted, a linear probability model (LPM) estimated by fixed effects ordinary least squares is thought to do a reasonable job of approximating average partial effects, and it even estimates them consistently under certain assumptions regarding the explanatory variables (Stoker, 1986). Nevertheless, often the index slope parameters are of interest, or the researcher wants to estimate partial effects at different values of the explanatory variables. In these cases it is tempting to use a nonlinear “fixed effects” estimator, whereby the heterogeneity terms are estimated as parameters alongside the index slopes in an MLE procedure, but this is problematic. Particularly when the number of time periods is small, fixed effects estimators often perform worse than simply ignoring the heterogeneity entirely (Greene, 2004). In the case of cross-sectional heterogeneity only, several studies have noted that inconsistency diminishes as the number of time periods increases, and that estimates of slope parameters are consistent with both N and T growing to infinity. However, the asymptotic distribution of fixed effects estimators is not centered around the true parameter values, so confidence intervals can still be misleading (Hahn and Newey, 2004). I study bias corrections for models with cross-sectional heterogeneity that subtract the leading term of a large-T expansion of the bias from the uncorrected fixed effects MLE. Analytical bias corrections estimate this term from expressions specific to the parametric model. Jackknife corrections estimate it non-parametrically by generating variation in the uncorrected MLE by dropping some time periods. These techniques reduce the bias from O_p(T^{-1}) to O_p(T^{-2}), but they can require significant restrictions on the underlying distribution of the data (Hahn and Newey, 2004). Both approaches assume at least that the explanatory variables are stationary and weakly dependent.
The analytical and jackknife corrections developed by Hahn and Newey (2004) also require the dependent variables to be serially independent conditional on the heterogeneity and the explanatory variables. The analytical correction of Fernandez-Val (2009) and the split-panel jackknife of Dhaene and Jochmans (2015) relax conditional independence to accommodate models with lagged dependent variables, but still require dynamic completeness. Either conditional independence or dynamic completeness rules out serially correlated error terms, which is potentially a serious problem for static models. Serial correlation is certainly a concern in linear models, as demonstrated by widespread use of clustered standard errors and postestimation testing. Extending that concern to nonlinear models is particularly prudent given that in cases like the probit or logit, serial correlation causes inconsistency in the estimators themselves, not just their standard errors. Without unobserved heterogeneity, APE are still identified in probit or logit models with serial correlation, so the problem is easily handled by using pooled MLE with cluster-robust standard errors (Wooldridge, 2010). To my knowledge, however, no researchers have simulated bias-corrected estimators in the presence of serial correlation. This chapter aims to answer three questions. First, how robust are bias corrections when the latent errors have serial correlation? Second, how do the bias corrections compare to the CRE approach when the heterogeneity does not satisfy the CRE conditional distribution assumption? Finally, the incidental parameters problem causes bias not only in slope estimates but in APE estimates as well; how severe is the bias in APE estimates when the slopes are estimated consistently with a procedure like conditional logit? The first goal is to inform practitioners who wish to account for unobserved heterogeneity while being agnostic about serial dependence. Using Monte Carlo techniques, I evaluate the impact of serially correlated errors on the analytical bias corrections of Hahn and Newey (2004) and Fernandez-Val (2009). I also evaluate the drop-one-period jackknife of Hahn and Newey (2004) and the split-panel jackknife of Dhaene and Jochmans (2015). I generate the error terms in the latent variable model as first-order autoregressive processes, but simulate estimators that use clustered standard errors to allow for general (weak) serial dependence. Since slope parameters are only identified up to scale in this setting, I focus primarily on estimation of APE, which are still identified (Wooldridge, 2010). While simulation evidence from the aforementioned studies shows that bias-corrected estimators often have much more desirable finite sample properties than the uncorrected fixed effects MLE (at least for slope parameters), less work has been done to evaluate sensitivity of these properties to relaxation of the assumptions underlying the corrections. Dhaene and Jochmans (2015) examine departures from stationarity in dynamic models, particularly of the initial observations, and propose a Wald test for evaluating the validity of the split-panel approach overall. Alexander and Breunig (2014) simulate the performance of several bias corrections for the fixed effects probit estimator while varying parameters like the variance of the heterogeneity and the correlation between heterogeneity and explanatory variables, but do not consider any departures from stationarity or conditional independence.
In addition to using clustered standard errors, many researchers will find it attractive to make a CRE assumption to avoid the issue of incidental parameters. In fact, in studying the issue of serial correlation, many of my simulation results show that the CRE estimator of APE tends to have better finite sample properties than the uncorrected or corrected fixed effects methods. This result is not surprising given the data generating process I employ. Therefore, my second contribution is to consider the relative performance of the CRE approach versus the fixed effects approach when the CRE conditional distribution assumption does not hold. Finally, if researchers are willing to assume the dependent variables are conditionally independent, then a logit specification can be attractive because conditional maximum likelihood estimation (conditioning on the individual’s sum of the dependent variables) allows for consistent estimation of slope parameters with only N → ∞. However, partial effects are not identified because they depend on the heterogeneity terms that have been conditioned out of the likelihood function. Nevertheless, it is tempting to implement the following procedure: 1) Estimate slope parameters by conditional MLE. 2) Estimate the heterogeneity parameters using logit MLE, while restricting the slopes to be equal to the estimates from stage 1), and then estimate partial effects. For instance, an empirical example in Greene (2012, Chapter 17) on German health care utilization follows this procedure in estimating partial effects evaluated at the average of the explanatory variables (PEA). This procedure is likely to suffer from the incidental parameters problem because, although the slope parameter estimates are consistent, the heterogeneity estimates still do not converge to anything with fixed T (and it is unclear if the sample average of the estimated heterogeneity converges to anything interesting as N gets large). Fernandez-Val (2009) uses this procedure to estimate a model of female labor force participation, but corrects the APE estimates for the incidental parameters problem in the second stage. Therefore, this chapter’s third contribution is to include Monte Carlo evidence that uncorrected APE estimates derived in this manner from conditional logit estimation can have significant bias. Strictly speaking, any conclusions drawn from these simulations are valid only for the data generating processes I employ. However, the results presented are still useful in alerting empirical researchers to potential benefits and pitfalls when implementing one of the discussed estimation methods. The rest of the chapter is organized as follows. Section 2 reviews the incidental parameters problem in the panel binary response model, as well as the bias correction techniques considered here. Section 3 describes the Monte Carlo experiment. Section 4 presents and discusses results, including the application to the SIPP data. Section 5 concludes. Additional tables, as well as descriptions of the analytical bias correction formulas, are collected in the Appendices.

1.2 The panel binary response model with incidental parameters

I consider the following panel binary response model with unobserved heterogeneity:

yit = 1[αi + xitθ0 + rit > 0], for i = 1, . . . , N and t = 1, . . . , T,   (1.1)

where yit is a scalar outcome variable, xit is a vector of explanatory variables, αi is an individual fixed effect, and rit is an error term.
In the probit (logit) case, rit is distributed standard normal (standard logistic), and 1[·] is the indicator function. The log-likelihood function for individual i in period t is

ℓit(θ, αi) = yit log[G(αi + xitθ)] + (1 − yit) log[1 − G(αi + xitθ)],   (1.2)

where G is either the standard normal CDF or standard logistic CDF. Following the notation of Hahn and Newey (2004) and Fernandez-Val (2009), the maximum likelihood estimator of θ0 maximizes the profile log-likelihood, concentrating out the alphas:

θ̂ = argmax_θ ∑_{i=1}^{N} ∑_{t=1}^{T} ℓit(θ, α̂i(θ))/NT,   (1.3)

where

α̂i(θ) = argmax_α ∑_{t=1}^{T} ℓit(θ, α)/T.   (1.4)

The incidental parameters problem arises because with T fixed, as N → ∞, θ̂ →p θT, where

θT = argmax_θ EN[∑_{t=1}^{T} ℓit(θ, αi(θ))/T],   (1.5)

and EN[m(Zit, αi)] ≡ lim_{N→∞} ∑_{i=1}^{N} m(Zit, αi)/N. For finite T, θT ≠ θ0 because αi(θ) ≠ αi, even when evaluated at the true θ0. Hahn and Newey (2004) show that for smooth likelihoods like the probit and logit,

θT = θ0 + B/T + O(T^{-2}),   (1.6)

where B = I^{-1}b. In this expression, b represents a higher order expansion of the bias in α̂i(θ) as T gets large, while I is the information matrix of the profile log-likelihood. Both terms together capture the effect of estimation error in α̂i(θ) on θ̂. While it is true that θ̂ is consistent for θ0 if both N and T → ∞, the limiting distribution of √(NT)(θ̂ − θ0) is centered around B√κ, where N/T → κ. Therefore, confidence intervals for coefficient estimates will likely have poor coverage (Hahn and Newey, 2004).

1.2.1 Bias correction techniques

Arellano and Hahn (2007) provide a thorough review of different approaches to mitigating bias from the incidental parameters problem. The techniques that I consider in this chapter involve estimating B and using it to construct an estimator with a bias of lower order. Analytical bias corrections use expressions for B (denoted for an arbitrary θ as B(θ)) derived from a large-T expansion of the scores of the profile log-likelihood around the true αi. I focus mainly on the “one-step” estimator B̂(θ̂), which is evaluated at the uncorrected MLE. The bias corrected estimator is then formed as

θ̂bc = θ̂ − B̂(θ̂)/T.   (1.7)

Previous simulations have shown that the one-step estimator performs reasonably well compared to an iterated procedure or related analytical corrections that solve modified scores (Hahn and Newey, 2004). I examine the methods of Hahn and Newey (2004) and Fernandez-Val (2009) for estimating B(θ). Full expressions for the analytical bias corrections can be found in Appendix A. Jackknife corrections estimate B nonparametrically by using variation in θ̂ when estimated over the full panel and shorter sub-panels. This approach is advantageous because it does not require an explicit characterization of B, though it does require more computation. Hahn and Newey (2004) proposed a technique where the MLE is estimated over the T subpanels formed by dropping one period. Their corrected estimator is formed as

θ̂hnjk = T θ̂ − [(T − 1)/T] ∑_{s=1}^{T} θ̂s,   (1.8)

where θ̂s is the uncorrected MLE estimated over the periods {1, . . . , s − 1, s + 1, . . . , T}. Dhaene and Jochmans (2015) show that splitting the panel into equal, or almost-equal, length sub-panels minimizes the impact of imprecise estimation of B on the remaining bias and allows for dynamic models. To illustrate how the estimator is formed, suppose T is even for simplicity. Let θ̂S1 and θ̂S2 be the uncorrected MLE estimated over the periods
{1, 2, . . . , T/2} and {T/2 + 1, . . . , T}, respectively. Then the jackknife corrected estimator is formed as

θ̂djjk = 2θ̂ − (1/2)(θ̂S1 + θ̂S2).   (1.9)

Researchers are often interested in estimating functions of the data and parameters, like the partial effect of the kth element of xit on the probability that yit equals one:

mk(θ, αi, xit) = θk g(αi + xitθ),   (1.10)

where g(·) is the derivative of G(·). Much past simulation and theoretical work has suggested that uncorrected MLE on static binary response models has a “small bias” property for estimates of APE. This means that the bias in APE estimates tends to be smaller than that of slope parameters, and in the probit case with no heterogeneity, it is exactly zero (Fernandez-Val, 2009). This suggests that biases in θ̂k and ∑_{i=1}^{N} ∑_{t=1}^{T} g(α̂i + xitθ̂) move in opposite directions. Since APE and other functions of the data generally depend directly on the α’s, correcting the slope parameters only (or using a consistent procedure like conditional logit) is insufficient to handle the incidental parameters problem, as it removes only one source of the bias. In fact, α̂i(θ), even if evaluated at θ0, does not converge to its true value with T fixed, or converges at a slower rate when T is allowed to grow (Fernandez-Val, 2009). APE estimates built from consistent estimates of θ but no correction for imprecise estimation of the α’s may have much larger biases than APE estimates derived from the uncorrected MLE, as Section 4 explores. The analytical and jackknife corrections for APE are implemented in a similar fashion to their counterparts for slope estimates. In the analytical case (see Appendix A), a bias term is estimated and subtracted, while for the jackknife, APE are estimated for the full panel and the subpanels separately and then combined just like the slope estimates. Under dynamic completeness for the Fernandez-Val case and conditional independence for the Hahn and Newey case, analytical bias-corrected estimators have been shown to be consistent and asymptotically normal as long as T grows faster than N^{1/3}, and a similar property has been conjectured for the Hahn and Newey jackknife correction (Hahn and Newey, 2004). This makes them reasonable procedures to implement when N is fairly large relative to T, as is typical in microeconometrics. The split-panel jackknife of Dhaene and Jochmans is only consistent with T and N growing at the same rate, but they find evidence that it reduces bias with as few as six time periods. The analytical and jackknife corrections analyzed here allow explanatory variables to be only sequentially exogenous, but require the assumption of dynamic completeness, meaning that no additional lags of x or y affect the current yit after xit has been included. Dynamic completeness is written formally as

f(yit | αi, xit, yi,t−1, xi,t−1, . . . , yi1, xi1) = f(yit | αi, xit).   (1.11)

Either conditional independence or Assumption (1.11) implies that the scores of the log-likelihood are serially uncorrelated, ruling out any serial dependence in the per-period shocks. For the many researchers interested in estimating static models, however, this assumption is less than ideal. Empirical researchers routinely encounter static models with neglected serial correlation in the linear case, and take care to conduct inference using clustered standard errors. Consequently, we would rather not assume that a static model has fully captured the dynamics in the nonlinear case either.
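Before turning to the Monte Carlo design, the sketch below makes the mechanics of the jackknife combinations in equations (1.8) and (1.9) concrete. It is an illustration rather than the code used for the simulations: `fit_fe_mle` is a hypothetical stand-in that is assumed to return the uncorrected fixed effects slope estimates for whichever time periods it is given, and the panel is assumed balanced with an even T for the split-panel case.

```python
import numpy as np

def fit_fe_mle(y, X, periods):
    """Hypothetical stand-in: return the uncorrected fixed effects MLE of the
    slope parameters (as a 1-D array) using only the time periods in `periods`.
    y: (N, T) binary outcomes, X: (N, T, K) covariates."""
    raise NotImplementedError("plug in a probit/logit FE-MLE routine here")

def hahn_newey_jackknife(y, X):
    """Drop-one-period jackknife, eq. (1.8):
    T * theta_hat - ((T - 1)/T) * sum_s theta_hat_s."""
    N, T = y.shape
    theta_full = fit_fe_mle(y, X, periods=list(range(T)))
    theta_drop = [fit_fe_mle(y, X, periods=[t for t in range(T) if t != s])
                  for s in range(T)]
    return T * theta_full - (T - 1) / T * np.sum(theta_drop, axis=0)

def split_panel_jackknife(y, X):
    """Half-panel jackknife, eq. (1.9):
    2 * theta_hat - (1/2) * (theta_hat_S1 + theta_hat_S2), T even."""
    N, T = y.shape
    theta_full = fit_fe_mle(y, X, periods=list(range(T)))
    theta_s1 = fit_fe_mle(y, X, periods=list(range(T // 2)))
    theta_s2 = fit_fe_mle(y, X, periods=list(range(T // 2, T)))
    return 2 * theta_full - 0.5 * (theta_s1 + theta_s2)
```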
One attractive point about the CRE approach with clustered standard errors is that for binary response models with unobserved heterogeneity, arbitrary serial correlation does not cause inconsistency in APE estimates (Wooldridge, 2010). Any complete comparison of bias corrections, therefore, should evaluate their robustness to this common problem.

1.3 Monte Carlo experiment

The data generating process I specify is similar to Greene (2004) and Fernandez-Val and Weidner (2016). The outcome is generated as

yit = 1[αi + β0 xit + γ0 dit + rit > 0]   (1.12)
xit = αi + .5xi,t−1 + vit, t > 1   (1.13)
xi1 = αi + vi1, vit ∼ N(0, 1/2)   (1.14)
dit = 1[xit + hit > 0], hit ∼ N(0, 1/2)   (1.15)
αi ∼ N(0, 1/16)   (1.16)

I set β0 = γ0 = 1. In this model, dit represents a policy or treatment variable of interest, while xit is a continuous control variable that is correlated both with dit and with its own past values. Both xit and dit are generated to be strictly exogenous, though the Fernandez-Val and Dhaene and Jochmans corrections only require sequential exogeneity. Correlation between xit and αi is roughly 0.5, while correlation between dit and αi is roughly 0.3. Correlation between xit and dit is about 0.6. Let µw be the population APE of w on the probability that y equals one, for w ∈ {x, d}. In general, this quantity varies by T, so for comparison, I report the estimated APE divided by their true value. For β̂, γ̂, and α̂ (the uncorrected MLE),

µ̂w/µw = [ (1/(NT)) ∑_{i=1}^{N} ∑_{t=1}^{T} mw(β̂, γ̂, α̂i, zit) ] / E[ (1/T) ∑_{t=1}^{T} mw(β0, γ0, αi, zit) ],   (1.17)

where zit = (xit, dit) and

mw(β, γ, αi, zit) = β g(αi + βxit + γdit)             for w = x,
mw(β, γ, αi, zit) = G(αi + βxit + γ) − G(αi + βxit)   for w = d,   (1.18)

where for the probit (logit) simulations, G(·) and g(·) are the CDF and PDF, respectively, for the standard normal (logistic) distribution. The expectation in the denominator is simulated with a single draw from a panel of 1,000,000 individuals. Note that the sum in the numerator is divided by the entire sample size, NT. For an individual j whose value of yjt does not change over the length of the panel, the uncorrected MLE of the heterogeneity, α̂j, is unbounded, so the individual is dropped from the estimation of the structural parameters. The estimate mw(β̂, γ̂, α̂j, zjt) for such an observation is set to zero (Alexander and Breunig, 2014). I will discuss practical issues this can cause when the panels are short and the data are highly persistent. Details on the analytical corrections can be found in Appendix A. The jackknife-corrected APE estimators are constructed analogously to the slope estimators in equations (1.8) and (1.9).

1.3.1 Evaluating the dynamic completeness assumption

I relax dynamic completeness in the panel probit case by introducing serial correlation into the error term rit from the latent variable model. I use the following procedure:

rit = ψt,ρ uit   (1.19)
uit = ρ ui,t−1 + eit, t > 1   (1.20)
ui1 = ei1/ψt,ρ, eit ∼ i.i.d. N(0, 1)   (1.21)
ψt,ρ ≡ √(1 − ρ²) if ρ < 1, and 1/√t if ρ = 1   (1.22)

Division of ei1 by ψt,ρ ensures that each element of {uit}_{t=1}^{T} has the same variance, which otherwise would not hold because of the finite length of the series (Vamoş, Şoltuz, and Crăciun, 2007). Multiplication of uit by ψt,ρ gives rit unit variance. I maintain unit variance of the error terms to remove the coefficient scaling that would otherwise occur in probit MLE. This allows us to better compare slope estimates across estimators and values of ρ.
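For concreteness, the sketch below generates one panel from the design in (1.12)-(1.16), with the latent errors built and rescaled as in (1.19)-(1.22). It is a minimal reproduction under my reading of the design, with my own function and variable names, and is not the code behind the reported simulations.

```python
import numpy as np

def generate_panel(N, T, rho, beta0=1.0, gamma0=1.0, seed=0):
    """Simulate one panel from the probit design (1.12)-(1.16) with AR(1)
    latent errors scaled to unit variance as in (1.19)-(1.22)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(0.0, np.sqrt(1.0 / 16.0), size=N)            # (1.16)

    # Strictly exogenous covariates: x is AR(1) in its own past plus alpha;
    # d is a binary "treatment" correlated with x.                   (1.13)-(1.15)
    x = np.empty((N, T))
    v = rng.normal(0.0, np.sqrt(0.5), size=(N, T))
    x[:, 0] = alpha + v[:, 0]
    for t in range(1, T):
        x[:, t] = alpha + 0.5 * x[:, t - 1] + v[:, t]
    d = (x + rng.normal(0.0, np.sqrt(0.5), size=(N, T)) > 0).astype(float)

    # AR(1) latent errors, normalized so each r_it has unit variance.
    psi = np.sqrt(1.0 - rho ** 2) if rho < 1 else None               # (1.22), rho < 1 case
    e = rng.normal(0.0, 1.0, size=(N, T))
    u = np.empty((N, T))
    u[:, 0] = e[:, 0] / psi if rho < 1 else e[:, 0]                  # (1.21)
    for t in range(1, T):
        u[:, t] = rho * u[:, t - 1] + e[:, t]                        # (1.20)
    if rho < 1:
        r = psi * u                                                  # (1.19)
    else:
        r = u / np.sqrt(np.arange(1, T + 1))                         # unit-root case: psi_t = 1/sqrt(t)

    y = (alpha[:, None] + beta0 * x + gamma0 * d + r > 0).astype(float)   # (1.12)
    return y, x, d, alpha
```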
In the logit case, I use a Gaussian copula based on these series of normal errors. I present results from simulations that set ρ equal to 0, 0.4, and 0.8 to represent cases of dynamic completeness, moderate serial correlation, and high serial correlation. While the copula is not guaranteed to maintain the exact serial correlation for the logit case, the autocorrelations were within two decimal places of the specified ρ. Consistent with the literature, I considered panel lengths of 6, 8, 12, and 20, and I set N = 100 in all cases for ease of computation. Previous work by Fernandez-Val (2009) and Alexander and Breunig (2014) has found that larger N does not affect the relative performance of the different estimators in terms of bias, but does increase their overall precision. I find evidence consistent with these findings, reported in Appendix B for the N = 500, T = 6 case. One important finding is that when estimators have finite sample bias, coverage of confidence intervals generally decreases with sample size as standard errors shrink. I also estimate the probit slope coefficients and APE using the pooled MLE version of Mundlak’s (1978) correlated random effects (CRE), and the APE using an LPM for comparison. Standard errors for each estimator are clustered by individual to account for serial dependence in the scores. For each pair of ρ and T, I run 1000 replications.

1.3.2 Comparing bias correction and CRE under more general forms of heterogeneity

A correlated random effects approach of Mundlak (1978) applied to the panel probit model with two strictly exogenous explanatory variables assumes that

D(ci | xi, di) = Normal(ψ + ξ1 x̄i + ξ2 d̄i, σa²),   (1.23)

which implies

D(yit | xi, di) = Probit(βa xit + γa dit + ψa + ξ1,a x̄i + ξ2,a d̄i),   (1.24)

where x̄i and d̄i denote time averages, and the “a” subscript indicates the coefficients are scaled by 1/√(1 + σa²). Therefore, pooled probit of yit on xit, dit, x̄i, and d̄i identifies β and γ up to scale. Since the APE depend on the scaled coefficients, they can be estimated consistently with no problem (Wooldridge, 2010). Tables 1.1-1.4 in Section 4 show that CRE used on probit data generated with the above process (or similarly for the logit case) performs well because the heterogeneity enters the equation for xit additively; therefore, the αi can be written as a linear function of the time averages of xit. Consequently, a natural question is how much better the fixed effects approaches perform when the CRE assumption fails. I explore this question with the panel probit model through the following modifications:

yit = 1[αj,i + β0 xit + γ0 dit + rit > 0]   (1.25)
xit = .5xi,t−1 + vit, t > 1   (1.26)
xi1 = vi1, vit ∼ N(0, 1/2)   (1.27)

where αj,i is one of:

α1,i = −1 + (1/√T) ∑_{t=1}^{T} xit² + ai   (1.28)
α2,i = (1/√T) ∑_{t=1}^{T} (xit + xit² + xit³) + ai   (1.29)
α3,i ∼ N(0, exp[(0.125/√T) ∑_{t=1}^{T} (xit + xit² + xit³)])   (1.30)

where in the first two cases ai ∼ N(0, 1/4). Table 1.5 compares the uncorrected fixed effects MLE, MLE with Fernandez-Val’s analytical bias correction, and two estimators based on CRE. One adds x̄i and d̄i to the probit index; a more flexible version (CRE2) also includes squares of x̄i and d̄i and interactions between the explanatory variables and the time averages. I consider panels with T = 6 and T = 12, for the dynamically complete case.
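As an illustration of how the pooled Mundlak CRE estimator implied by (1.23)-(1.24) can be computed, the sketch below runs a pooled probit of yit on (xit, dit, x̄i, d̄i) and averages the implied partial effect of x over all observations. It is a minimal sketch using statsmodels, with my own function names; it is not the simulation code used here, and it omits the cluster-robust standard errors used for inference.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def cre_probit_ape_x(y, x, d):
    """Pooled Mundlak CRE probit: regress y_it on (1, x_it, d_it, xbar_i, dbar_i)
    and return the estimated APE of x (average of beta_a * phi(index))."""
    N, T = y.shape
    xbar = np.repeat(x.mean(axis=1), T)          # time averages, expanded to N*T rows
    dbar = np.repeat(d.mean(axis=1), T)
    Z = sm.add_constant(np.column_stack([x.ravel(), d.ravel(), xbar, dbar]))
    res = sm.Probit(y.ravel(), Z).fit(disp=0)    # scaled coefficients (beta_a, gamma_a, ...)
    index = Z @ res.params
    beta_a = res.params[1]                       # coefficient on x_it
    return np.mean(beta_a * norm.pdf(index))     # APE of the continuous variable

# Example (uses the generate_panel sketch above):
# y, x, d, _ = generate_panel(N=100, T=6, rho=0.0)
# print(cre_probit_ape_x(y, x, d))
```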
1.3.3 Conditional logit and the importance of correcting APE estimates To evaluate the finite sample properties of APE estimates derived from conditional logit slope estimates, I generate a panel of logit dependent variables using the process described in (12). I only consider the ρ = 0 case, as conditional logit is not valid when dynamic completeness fails. I estimate APE using the uncorrected logit MLE, and two conditional logit procedures which estimate the heterogeneity with a restricted MLE as described on the introduction. One procedure does not correct for the incidental parameters problem while the other uses Fernandez-Val’s 2009 correction for the logit case. 1.4 Results For brevity, I mainly report the bias corrections for the probit case. The logit case is qualitatively similar, though the effect of serial correlation on the Fernandez-Val correction is much less severe. I also report only the T = 6 and T = 12 results as they seem to be representative of the short panel 13 and long panel cases, respectively. Each of the tables lists the mean and standard deviation of the estimator, the coverage probability of a 95% confidence interval, and the ratio of the estimated (cluster-robust) standard error to the standard deviation. They show quite an interesting range of performance for both the uncorrected MLE and the different bias reduction techniques. Results for the 1.4.1 Evalauating the dynamic completeness assumption Tables 1.1 and 1.2 show the performance of the probit slope estimators for different levels of serial correlation. In line with evidence from the literature, the uncorrected MLE can be severely biased for the index slopes in the presence of incidental parameters. 14 Table 1.1: Probit Estimates of β (β0 = 1) T=6 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE T=12 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE Mean ρ =0 SD cv:.95 SE SD Mean ρ = 0.4 SD cv:.95 SE SD 1.36 0.96 1.18 0.85 0.87 1.01 0.24 0.14 0.21 0.34 0.16 0.14 0.70 0.97 0.87 0.64 0.82 0.95 1.14 1.00 1.03 0.94 0.96 0.99 0.12 0.10 0.10 0.12 0.09 0.09 0.79 0.95 0.94 0.82 0.93 0.95 0.96 1.15 0.92 0.46 0.95 0.99 1.56 1.03 1.36 0.73 0.99 1.01 0.30 0.14 0.26 0.50 0.20 0.15 0.48 0.97 0.66 0.49 0.90 0.94 0.99 1.03 1.00 0.78 1.03 1.01 1.22 1.05 1.10 0.90 1.02 0.99 0.13 0.11 0.12 0.16 0.10 0.10 0.61 0.94 0.87 0.69 0.95 0.94 15 Mean ρ = 0.8 SD cv:.95 SE SD 0.90 1.17 0.83 0.35 0.82 0.98 2.49 0.63 2.24 0.80 1.43 1.02 0.55 0.59 0.52 1.03 0.45 0.15 0.05 0.58 0.07 0.39 0.49 0.93 0.83 0.29 0.72 0.28 0.50 0.95 0.99 1.02 0.98 0.62 1.01 0.98 1.61 1.33 1.45 0.75 1.30 1.00 0.19 0.14 0.16 0.32 0.14 0.10 0.05 0.32 0.16 0.39 0.38 0.95 0.96 0.98 0.92 0.31 0.95 1.01 Table 1.2: Probit Estimates of γ (γ0 = 1) T=6 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE T=12 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE Mean ρ =0 SD cv:.95 SE SD Mean ρ = 0.4 SD cv:.95 SE SD 1.31 0.95 1.14 0.78 0.87 0.98 0.26 0.16 0.22 0.73 0.17 0.17 0.79 0.98 0.91 0.76 0.93 0.95 1.15 1.01 1.04 0.95 0.97 1.00 0.14 0.12 0.12 0.13 0.11 0.11 0.82 0.97 0.96 0.91 0.96 0.95 0.98 1.26 1.00 0.30 1.18 0.99 1.52 1.02 1.33 0.24 0.95 0.99 0.30 0.16 0.27 1.53 0.32 0.16 0.59 0.99 0.78 0.54 0.96 0.96 1.00 1.10 1.06 0.93 1.13 1.00 1.23 1.07 1.11 0.91 1.03 1.00 0.15 0.12 0.13 0.16 0.12 0.11 0.65 0.95 0.89 0.83 0.97 0.95 16 Mean ρ = 0.8 SD cv:.95 SE SD 0.95 1.32 0.93 0.18 0.65 1.01 2.49 0.49 2.25 -1.68 0.85 1.00 0.77 0.82 0.76 2.58 1.79 0.15 0.10 0.62 0.13 0.17 0.59 0.95 0.65 0.29 0.54 0.21 0.17 0.99 0.99 1.07 1.03 0.82 1.09 0.98 1.61 1.33 1.44 0.64 1.30 1.00 0.20 0.15 0.18 0.70 0.15 0.11 0.11 0.44 0.25 0.48 0.52 0.94 0.97 
1.05 0.96 0.21 1.06 0.97
In the dynamically complete case (ρ = 0), bias diminishes as T grows, but there is still room for improvement even when T = 12. For instance, the uncorrected MLE for γ has a bias of 31% when T = 6, but only 15% when T = 12. As predicted by theory, coverage of the 95% confidence interval is still somewhat low at 0.82 when T = 12, meaning that for a 5% significance level, one would expect to reject a true null hypothesis 18% of the time. As found in previously published simulations, the correction techniques reduce bias and generally increase coverage. In particular, Fernandez-Val’s analytical correction performs better than the others in all panels, both in terms of bias and variance, particularly for the short panels. The split-panel jackknife of Dhaene and Jochmans tends to have higher variance than the others. If one is concerned primarily with estimating APE, however, the incidental parameters problem clearly has much less bite, as shown by Tables 1.3 and 1.4. For the dynamically complete case, bias in the uncorrected MLE for µx is less than 1% for either panel length, while the bias in that of µd is 4% or less. This supports the “small bias” property for APE estimators found by many previous studies of static models (Fernandez-Val, 2009). The bias-corrected estimators perform well for the longer panels, but even in the dynamically complete case, many of them have higher bias than the uncorrected MLE for the short panels. Among the different bias correction techniques, both corrections from Hahn and Newey (2004) tend to have the smallest bias, while the split-panel jackknife does worse. Additionally, while theory suggests that both corrections reduce bias without any change in variance, it appears that the jackknife corrections may increase variance, especially in shorter panels.
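For reference, the APE ratios reported in Tables 1.3 and 1.4 are built from the sample average in the numerator of (1.17), with individuals whose outcome never changes contributing zero, as described in Section 1.3. A minimal sketch of that calculation for the continuous variable, assuming estimates (β̂, γ̂, α̂i) are already available (the function and argument names are mine), is:

```python
import numpy as np
from scipy.stats import norm

def sample_ape_x(beta_hat, gamma_hat, alpha_hat, x, d):
    """Numerator of (1.17) for w = x: average of beta_hat * phi(index) over all
    N*T observations.  alpha_hat entries are np.nan for individuals dropped
    because y_it never changes; their contributions are set to zero."""
    N, T = x.shape
    idx = alpha_hat[:, None] + beta_hat * x + gamma_hat * d
    m = beta_hat * norm.pdf(idx)
    m[np.isnan(alpha_hat), :] = 0.0        # dropped individuals contribute zero
    return m.sum() / (N * T)               # divide by the full sample size NT
```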
17 Table 1.3: Probit Estimates of µx /µx (true value = 1) T=6 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE LPM T=12 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE LPM Mean ρ =0 SD cv:.95 SE SD Mean ρ = 0.4 SD cv:.95 SE SD 1.00 0.96 1.05 1.10 1.04 1.01 0.95 0.14 0.13 0.15 0.19 0.15 0.13 0.13 0.94 0.93 0.89 0.78 0.89 0.96 0.94 1.00 0.99 1.00 1.00 1.00 1.00 0.93 0.09 0.09 0.09 0.11 0.09 0.09 0.09 0.94 0.93 0.93 0.89 0.93 0.96 0.88 0.96 0.96 0.88 0.69 0.83 1.03 1.03 0.99 0.94 1.06 1.15 1.07 1.01 0.95 0.14 0.13 0.16 0.22 0.17 0.13 0.13 0.93 0.90 0.88 0.71 0.84 0.95 0.94 0.96 0.94 0.93 0.81 0.93 1.01 1.03 0.99 0.99 1.00 1.01 1.00 1.00 0.93 0.09 0.09 0.10 0.12 0.09 0.09 0.09 0.93 0.93 0.93 0.84 0.93 0.94 0.86 18 Mean ρ = 0.8 SD cv:.95 SE SD 0.92 0.94 0.81 0.66 0.74 1.02 1.02 0.94 0.57 1.05 1.26 1.15 1.01 0.94 0.14 0.44 0.16 0.22 0.19 0.13 0.13 0.86 0.37 0.82 0.58 0.63 0.95 0.92 0.86 0.28 0.72 0.70 0.58 0.99 1.00 0.93 0.91 0.90 0.72 0.90 0.97 1.00 0.99 0.98 1.01 1.05 1.00 1.00 0.93 0.09 0.09 0.10 0.13 0.09 0.09 0.09 0.90 0.89 0.90 0.78 0.90 0.95 0.87 0.89 0.86 0.85 0.67 0.84 0.97 0.99 Table 1.4: Probit Estimates of µd /µd (true value = 1) T=6 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE LPM T=12 MLE A-FV09 A-HN04 J-DJ15 J-HN04 CRE LPM Mean ρ =0 SD cv:.95 SE SD Mean ρ = 0.4 SD cv:.95 SE SD 0.96 0.93 1.00 1.09 1.04 0.99 1.28 0.19 0.18 0.19 0.25 0.22 0.19 0.19 0.93 0.93 0.93 0.81 0.89 0.95 0.68 1.01 1.00 1.01 1.03 1.01 1.01 1.33 0.13 0.13 0.13 0.15 0.13 0.13 0.13 0.94 0.94 0.94 0.91 0.94 0.95 0.29 0.95 1.01 0.92 0.72 0.83 1.00 0.99 0.97 0.91 1.01 1.11 1.05 0.99 1.28 0.19 0.17 0.19 0.26 0.21 0.18 0.18 0.92 0.92 0.92 0.80 0.88 0.95 0.67 0.99 0.99 0.98 0.88 0.96 1.00 1.00 1.01 1.00 1.01 1.03 1.02 1.01 1.33 0.13 0.13 0.13 0.15 0.13 0.13 0.13 0.94 0.95 0.94 0.88 0.93 0.94 0.27 19 Mean ρ = 0.8 SD cv:.95 SE SD 0.92 1.01 0.89 0.67 0.79 1.00 1.00 0.96 0.42 1.00 1.05 1.05 0.99 1.29 0.18 0.56 0.18 0.27 0.20 0.17 0.17 0.88 0.37 0.89 0.75 0.83 0.95 0.62 0.84 0.31 0.83 0.60 0.70 1.00 1.00 0.97 0.97 0.96 0.82 0.94 0.98 0.99 1.00 0.99 1.01 1.03 1.01 1.00 1.33 0.12 0.12 0.12 0.16 0.12 0.13 0.13 0.94 0.93 0.93 0.87 0.92 0.94 0.27 0.95 0.95 0.94 0.78 0.91 0.98 0.97 The simulation results for models where dynamic completeness fails reveal many interesting implications for the uncorrected and corrected fixed effects probit estimators. To begin with, higher levels of serial dependence in the error terms and yit exacerbate a practical difficulty in performing MLE while treating heterogeneity as parameters to be estimated. The problem relates to the fact that the partial effect is not well-defined for an individual j whose value of y jt is constant. In this case, the dummy variable for observation j perfectly predicts the outcome, so the estimate of αi is technically unbounded (Fernandez-Val, 2009). These observations are therefore dropped from the estimation sample. These individual’s contributions to the sample APE are equal to zero. The true α’s in these cases tend to be larger in magnitude, and while this means m(β0 , γ0 , α, zit ) will be smaller by the properties of the standard normal PDF and CDF, it should still be strictly positive. This explains the tendency of MLE to under-predict APE (Alexander and Breunig, 2014). Additionally, there may be distributional differences between the subpopulation that has a changing response and the population in general that could cause additional bias. The probability of observing an individual with a constant yit increases significantly in the shorter panels as serial dependence in the errors increases. 
To illustrate, for T = 6 and ρ = 0, across the 1000 replications, 21% of the individuals were dropped on average, while for T = 6 and ρ = 0.8, 32% were dropped on average. For comparison, with T = 12, this dropping rate was only 7.5% for ρ = 0 and 14.5% for ρ = 0.8. Splitting the panel for Dhaene and Jochmans’s jackknife makes this much worse, especially when the panel is only six periods long to begin with. Practically speaking, losing more observations makes it more likely that the numerical maximization algorithm will not converge (at least when N is relatively small). The worst case in this study was the split-panel jackknife in the T = 6, ρ = 0.8 case, in which 32% of replications had a failure to converge. Similar rates of non-convergence occurred for (unreported) runs of the uncorrected MLE and analytical corrections with high ρ and only three or four time periods. The results show that, as expected, the failure of dynamic completeness significantly increases the bias and decreases the precision of all of the fixed effects slope estimators. By design of the data generating process, this bias is separate from the scaling that would occur from the latent model errors having non-unit variance as a result of their autoregressive structure. In the worst cases, the means of the split-panel jackknife estimates of γ for the shorter panels even have the wrong sign when ρ = 0.8. For small panels, the standard errors of the corrected estimators also do a poor job estimating the true standard deviations. The increased bias is not surprising given that in the presence of unobserved heterogeneity, a conditional independence assumption for {yi1, yi2, . . . , yiT} is required to identify unscaled slope parameters in the panel probit model (Wooldridge, 2010). Fernandez-Val’s analytical correction and Hahn and Newey’s jackknife continue to mitigate the bias and perform relatively well when ρ = 0.4. They still provide an improvement over the uncorrected MLE when ρ = 0.8, but remain severely biased. The performance of the fixed effects estimators in estimating APE is much more relevant when dynamic completeness fails. In the case of the short panel (T = 6), the effect of higher serial correlation in the errors on the performance of the fixed effects estimators is quite mixed. Comparisons between estimators in Tables 1.3 and 1.4 suggest that the analytical correction proposed by Hahn and Newey seems fairly robust to serial correlation, with biases in APE estimates of 6% or less. Bias in the Fernandez-Val correction only increases slightly at low-to-moderate levels of serial correlation, but the combination of high autocorrelation and short panel length causes a substantial downward bias of 40% to 60%. With the longer panels, however, the effect of ρ on the bias of this and the other corrections is much smaller, 2% or less for the T = 12 case. The effect of ρ on the jackknife APE corrections is different for each explanatory variable. For instance, the bias in the split-panel jackknife APE estimates for x increases with higher ρ in the T = 6 case, but those for d appear to be less affected. Hahn and Newey’s jackknife shows a very similar pattern, but with much smaller variance of the estimators. Furthermore, the results for the split-panel jackknife illustrate that slope and APE estimates do not necessarily agree in sign. This is another drawback to using this procedure on short panels, since splitting the panel increases variance substantially.
Perhaps larger N would mitigate this problem. 21 1.4.1.1 Comparison with uncorrected MLE As in the dynamically complete case, it is important to note that the uncorrected APE estimators often have lower bias than either the analytical or jackknife corrected estimators, especially for the short panels. For longer panels, the uncorrected MLE, analytical corrections, drop-one-period jackknife, and CRE behave very similarly, while the split-panel jackknife has higher variance. For comparison, the CRE and LPM are not really affected by either failure of dynamic completeness or the length of the panel. The structure of the data is such that one would expect CRE to do well. As a side note, I found that a generalized estimating equations approach with either an exchangeable or AR(1) covariance matrix was not much more efficient than pooled MLE for the CRE model. In contrast to the CRE, the best linear approximation performs fairly well for the continuous variable (bias of 5 − 7%) but does not perform very well for the discrete variable (bias of 28 − 34%). 1.4.2 Comparing bias correction and CRE under more general forms of heterogeneity Table 1.5 compares probit APE estimates for the continuous variable x using the uncorrected MLE, Fernandez-Val correction, and two Correlated Random Effects estimators, described in Section 3. I consider panels with T = 6 and T = 12, in the case of serially independent errors. The estimators are compared across three different forms of heterogeneity which do not satisfy the conditional distribution assumptions for either CRE estimator. The uncorrected MLE and Fernandez-Val correction, in contrast, place no restriction on the nature of the heterogeneity. 22 Table 1.5: Probit Estimates of µx /µx Under Different Heterogeneity (true value = 1) T=6 MLE A-FV09 A-HN04 CRE CRE2 T=12 MLE A-FV09 A-HN04 CRE CRE2 Mean α1 SD cv:.95 SE SD Mean α2 SD cv:.95 SE SD 1.00 0.96 1.04 0.76 0.78 0.16 0.15 0.16 0.15 0.15 0.92 0.92 0.88 0.59 0.65 0.99 0.98 1.00 0.68 0.70 0.12 0.12 0.12 0.11 0.11 0.91 0.91 0.91 0.22 0.26 0.90 0.91 0.83 0.96 0.95 0.98 0.93 1.03 0.75 0.76 0.23 0.21 0.24 0.20 0.20 0.91 0.89 0.87 0.74 0.75 0.89 0.88 0.87 1.01 1.00 0.98 0.97 0.99 0.79 0.80 0.18 0.18 0.18 0.17 0.17 0.85 0.84 0.84 0.72 0.73 23 Mean α3 SD cv:.95 SE SD 0.86 0.87 0.78 1.03 1.01 0.99 0.95 1.03 0.95 0.96 0.15 0.14 0.16 0.15 0.15 0.92 0.90 0.88 0.92 0.93 0.90 0.90 0.82 0.99 0.98 0.76 0.75 0.74 0.98 0.97 1.00 0.99 1.01 0.95 0.96 0.10 0.10 0.10 0.10 0.10 0.92 0.92 0.91 0.92 0.93 0.91 0.90 0.89 1.02 1.01 Since the pooled-MLE version of CRE only identifies slope parameters up to scale, I only report on the APE. The tables show that the bias in the CRE estimators is higher in all three specifications. For instance, in the second specification (α = α2 ), CRE underestimates the APE of x by about 25% when T = 6, while the biases in the uncorrected MLE and the Fernandez-Val correction are only 2% and 9%, respectively. The results for the APE of d were comparatively similar, though the CRE estimators tended to have a positive bias. This illustrates the importance of the functional form assumption when specifying a CRE model, and suggests an advantage in the FE approaches as they place no restrictions on the αi . 1.4.3 Conditional logit and the importance of correcting APE estimates Table 1.6 explores a possible approach to handling unobserved cross-sectional heterogeneity in logit models where the response variables are conditionally independent. 
Using conditional logit to consistently estimate slope parameters does not allow for estimating average partial effects unless the researcher can somehow recover estimates of the αi . One way is to estimate them by MLE, restricting the slope parameters to their conditional logit estimates, but this causes bias in APE estimates. The table shows the APE estimates (for the continuous variable x) from the uncorrected pooled logit MLE, conditional logit without correcting the APE estimates (denoted CLOG), and conditional logit where the APE have been corrected using Fernandez-Val’s formula (CLOGC). Simulations for the APE of d showed a very similar pattern. 24 Table 1.6: Corrected and Uncorrected Logit Estimates of µx /µx (true value = 1) T=6 MLE CLOGIT CLOGIT-C T=12 MLE CLOGIT CLOGIT-C Mean ρ =0 SD cv:.95 SE SD Mean ρ = 0.4 SD cv:.95 SE SD 1.01 0.87 1.00 0.19 0.16 0.18 0.95 0.93 0.94 1.00 0.94 1.00 0.12 0.11 0.12 0.95 0.94 0.95 1.01 1.14 1.00 1.01 0.87 0.99 0.18 0.16 0.18 0.94 0.91 0.93 1.01 1.06 1.00 1.00 0.94 1.00 0.12 0.11 0.12 0.95 0.94 0.95 25 Mean ρ = 0.8 SD cv:.95 SE SD 0.99 1.10 0.97 0.99 0.87 0.98 0.18 0.15 0.17 0.92 0.82 0.90 0.88 0.93 0.83 1.01 1.06 1.00 1.00 0.94 0.99 0.12 0.11 0.12 0.94 0.91 0.93 0.95 0.99 0.94 The table illustrates a couple of interesting points. First, the uncorrected conditional logit APE estimates have biases that are 5-13 percentage points higher than the corrected versions. This shows that inconsistent estimation of the αi is a significant problem even when a consistent procedure is used to estimate the slope coefficients. Moreover, these suggest that the “small bias” property in the uncorrected MLE APE estimates observed earlier is the result of two competing biases. In the case of this chapter’s data generating process, an upward bias in the slope estimate is being offset by a scale factor that is biased toward zero. Using a procedure like conditional logit (or any bias correction) while failing to correct APE estimates removes only one source of the problem and may increase the bias compared to doing no correction at all. 1.4.4 Empirical example: Welfare participation As an additional demonstration of the relative performance of these fixed effects estimators, I apply them to a dataset on participation in Aid to Families with Dependent Children (AFDC), a U.S. welfare program. The data are by way of Chay and Hyslop (2014), who use the 1990 Survey of Income and Program Participation (SIPP). The panel consists of AFDC participation, age, race, marital status, number of children, and poverty level for 1, 934 women who either received benefits or had income below a certain threshold at some point during the sample period. As welfare participation is a binary response that is thought to be highly persistent over time, Chay and Hyslop differentiate between unobserved heterogeneity, and structural state dependence as sources of persistence, finding significant evidence for the latter using dynamic estimators under varying assumptions about the nature of the heterogeneity and initial conditions (Chay and Hyslop, 2014). Although their findings suggest that a dynamic model may be more appropriate, these data still provide an interesting and relevant setting for evaluating the bias-corrected fixed effects estimators in the static case. Table 1.7 lists slope parameter estimates for two key determinants of participation, marital status and number of children. Note that in addition to several control variables, these specifications include time period dummies. 
While technically, they are also incidental parameters under large-T bias corrections, it is customary to include them in this type of analysis. In (unre- 26 Table 1.7: Welfare Participation: Slope Estimates Full Sample CRE (1) Marriage -0.986 (0.011) Kids 0.162 (0.001) MLE (2) -1.908 (0.208) 0.481 (0.104) Sample with changing participation A-FV09 A-HN04 J-DJ15 J-HN04 (3) (4) (5) (6) -1.579 -1.730 -1.822 -1.565 (0.178) (0.189) (0.229) (0.176) 0.409 0.437 0.447 0.380 (0.098) (0.100) (0.121) (0.096) CRE (7) -1.462 (0.022) 0.358 (0.006) N=1934 N*=494 T=8 T=8 Controls include education, poverty level, a quadratic in age, a race dummy, and time period dummies. Standard errors were clustered by individual ported) simulations with true time effects, I found that the additional bias caused by their inclusion to be smaller and that it did not change the relative performance of the different FE estimators. Table 1.8 lists estimated APE. Unlike the simulations, these tables include CRE and LPM estimates over the estimation subsample of the fixed effects estimators. This application highlights the problems that may arise when many individuals have responses that do not change. In this case, only 494, or roughly 25% of women in the sample had participation that changed over the 32 months of the survey. In the worst simulation case (T = 6, ρ = 0.8) 68% of the sample still had responses that changed. Practically speaking, not only does this increase variance of the estimators, but it potentially exacerbates any bias stemming from sample selection (which did not appear to be much of a problem in the simulations). The bias-corrected slope estimates in both cases are smaller in magnitude than the uncorrected MLE, and are similar in magnitude to CRE estimates over the subsample of changing responses, though quite different from the CRE estimates over the whole sample. Probit slope estimates from the 1998 and 2014 versions of Chay and Hyslop range from -0.934 to -0.658 for the marriage variable, and 0.11 to 0.152 for the kids variable. Both are much smaller in magnitude than the nonlinear fixed effects estimates suggesting that persistence, state dependence and/or sample selection are playing a significant role. 27 Table 1.8: Welfare Participation: Average Partial Estimates Full Sample CRE LPM (1) (2) Marriage -0.260 -0.271 (0.001) (0.001) Kids 0.047 0.052 (0.000) (0.000) MLE (3) -0.112 (0.007) 0.034 (0.007) Sample with changing participation A-FV09 A-HN04 J-DJ15 J-HN04 CRE* (4) (5) (6) (7) (8) -0.110 -0.115 -0.162 -0.129 -0.112 (0.008) (0.007) (0.008) (0.008) (0.000) 0.033 0.035 0.049 0.034 0.034 (0.007) (0.007) (0.007) (0.007) (0.000) N=1934 N*=494 T=8 T=8 *Sum of partial effects divided by full sample size for comparison with FE estimators 28 LPM* (9) -0.126 (0.000) 0.033 (0.000) The 1998 version of Chay and Hyslop contains several estimates of LPMs, including the static model estimated with fixed effects (column 2 of Table 2), which are compared to the bias-correct APE estimates in Table 1.8. The Chay and Hyslop estimates (that account for heterogeneity) range from -0.271 to -0.143 for marriage and from 0.029 to 0.068 for kids. The bias corrected estimates range from -0.162 to -0.110 for marriage and from 0.033 to 0.050 for kids, which seem more in line than the slope estimates, echoing previous research and the simulation evidence in this chapter for the “small bias” property. 
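Before concluding, it may help to record how the split-panel jackknife entries (the J-DJ15 columns in Tables 1.7 and 1.8) are formed: the corrected estimate is twice the full-panel estimate minus the average of the estimates from the two half-panels. The following is a minimal Python sketch, assuming a user-supplied routine estimate(y, X) (for example, probit MLE with individual dummies, or the implied APE) and a balanced panel with an even number of periods; all names here are hypothetical and the sketch is not the exact implementation used for the tables.

    import numpy as np

    def split_panel_jackknife(estimate, y, X):
        """Split-panel (half-panel) jackknife combination:
        2 * full-panel estimate minus the average of the two half-panel estimates.
        `estimate(y, X)` is a user-supplied routine returning a parameter or APE vector;
        y is (N, T) and X is (N, T, K), split on the time dimension (T assumed even)."""
        T = y.shape[1]
        theta_full = estimate(y, X)
        theta_first = estimate(y[:, : T // 2], X[:, : T // 2])
        theta_second = estimate(y[:, T // 2 :], X[:, T // 2 :])
        return 2.0 * theta_full - 0.5 * (theta_first + theta_second)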
1.5 Conclusion

The simulation evidence in this chapter suggests that these bias corrections continue to estimate APE fairly well when the level of serial correlation is low to moderate, but strong serial correlation may cause bias when the panel is short. As such, dynamic completeness may be a substantive requirement unless the researcher has access to many time periods of data. Estimation in shorter panels may also present sample selection or computational challenges. While it may seem unfair to evaluate a technique based on large-T asymptotic approximations using panels with only six time periods, others have suggested these techniques have desirable properties in large-N, small-T settings. Moreover, the results of this chapter suggest that if a researcher is primarily concerned with estimating APE in a static model, then the bias correction techniques considered here may offer little benefit relative to the uncorrected MLE while adding the cost of a more complicated estimation procedure. It should be noted, however, that the “small bias” property of APE does not hold in dynamic models, where correction techniques have been found to decrease bias substantially. Additionally, I find that the fixed effects approach (with or without a bias correction) may offer advantages over CRE when the heterogeneity does not satisfy the CRE assumption. I also find evidence that highlights the importance of correcting for inconsistent estimation of the heterogeneity terms when a consistent procedure is used to estimate the slopes.

There are many important avenues for future research. First and foremost, an interesting question is how well the analytical bias correction of Hahn and Kuersteiner (2011) performs in this setting. It accommodates serial correlation in theory, but requires “moderately large T.” Furthermore, in practical applications like a policy or program analysis, it is important to control for time effects, which I did not include in this set of simulations. The reason is that under the large-T asymptotics that justify these corrections, time dummies are also incidental parameters. I did run a set of simulations over the same values of ρ and T where time effects were estimated but were not part of the true data generating process for yit. I found that the same relative patterns held across estimators as in this chapter, but the additional incidental parameters caused slightly higher bias in slope parameters and virtually no increase in bias for APE, except for the short panels, where bias increased slightly. Fernandez-Val and Weidner (2016) allow for both time and cross-sectional heterogeneity in analytical and jackknife corrections. However, their results depend on N/T converging to a constant in the limit. Therefore, unlike the wide and short panels included in this chapter, their approach is intended for settings where N and T are of similar magnitude.

CHAPTER 2

EXPONENTIAL PANEL MODELS WITH COEFFICIENT HETEROGENEITY

2.1 Introduction

The fixed effects Poisson (FEP) estimator, also known as multinomial QCMLE, is an attractive choice for modeling nonnegative responses whose conditional means contain an unobserved individual effect that may be correlated with the explanatory variables. Unlike other conditional-ML estimators, notably the FE logit, FEP does not require assuming a full distribution or conditional independence (Wooldridge, 1999).
This chapter considers the exponential conditional mean, which is logically consistent for nonnegative dependent variables and has the feature that coefficients on the regressors can be interpreted as semi-elasticities. The focus of this chapter is an extension to the unobserved effects exponential model that allows for additional heterogeneity in the form of random coefficients. While there is some literature considering Poisson variables in this setting, less is known about how to proceed for other nonnegative or non-count variables, or even about the consequences of ignoring the heterogeneity. In the linear unobserved effects model with strictly exogenous regressors and random coefficients, for instance, it is straightforward to show that fixed effects OLS is consistent for the means of the coefficients so long as they are mean-independent of the time-demeaned regressors. This is not necessarily true for nonlinear models, as this chapter shows for the exponential case. Moreover, it is unknown whether other quantities of interest, like average partial effects (APE), can be consistently estimated while ignoring coefficient heterogeneity. Furthermore, much of the literature assumes all sources of heterogeneity are independent of covariates, which can cause inconsistent estimation of coefficient means as well as type II errors in tests for random coefficients.

These potential complications motivate testing for neglected heterogeneity. An LM test in the style of Chesher (1984), however, is likely to reject when the Poisson distribution is misspecified or when conditional independence fails. Therefore, I extend this methodology specifically to the FEP setting, deriving a simple variable addition test that is more broadly applicable. Furthermore, I propose a method for parametrically identifying the means of random coefficients that leads to estimators that are computationally simple relative to existing approaches to random coefficients in this model. One novel contribution of this chapter is to treat the random coefficients and the traditional multiplicative effect separately, as the latter can be handled without restricting its dependence on the explanatory variables. (The multiplicative effect can also be expressed as a random intercept inside the exponential conditional mean function.) I also provide estimators of average partial effects. In an application to the patent-R&D relationship among U.S. manufacturing firms, I find evidence of heterogeneous elasticities and lagged effects, though the results are not robust to changes in the estimation sample.

The rest of this chapter is organized as follows: Section 2 gives an overview of the existing literature; Section 3 reviews the FEP model and the classical test for the fixed effects Poisson case before proposing this chapter's theoretical contributions; Section 4 contains a Monte Carlo experiment for the proposed methods; Section 5 describes the empirical application; and Section 6 consists of a brief conclusion and directions for future research.

2.2 Literature Review

Applying Andersen's (1970) conditional ML methodology, Hausman, Hall, and Griliches (1984) developed the FEP estimator for count data that allows arbitrary dependence between the unobserved effect and the regressors. They implemented their techniques to analyze the patent-R&D relationship in the U.S. manufacturing industry. Wooldridge (1999) showed that correct specification of the conditional mean and strict exogeneity of the regressors (conditional on the unobserved effect) were sufficient for consistency of FEP, broadening its application as a quasi-CMLE.
Cameron and Trivedi (2013) considered the panel unobserved effects Poisson model with random coefficients in a “random effects” setting where all of the heterogeneity was assumed to be normally distributed and independent of the regressors. They concluded that “unlike for the linear model, the conditional mean for the random slopes model differs from that for the pooled and random effects models, making model comparison and interpretation more difficult.”

Lagrange multiplier (LM) statistics are attractive in testing for coefficient heterogeneity because they use parameter estimates from a restricted model, which can be simpler to estimate. In this case, the restricted model is FEP, for which built-in procedures exist in Stata and other programs. Moreover, LM tests are valid for null values on the boundary of the parameter space, unlike Wald tests, which is important because parameters (i.e., variances) associated with random coefficients should be nonnegative (Wooldridge, 2010). Random coefficients are an example of the kind of neglected heterogeneity for which Chesher (1984) derived a test in the ML setting. Chesher, as well as Lee and Chesher (1986), developed methodology for deriving test statistics in this and other settings where the scores are identically zero under the parameter restriction. Greene and MacKenzie (2015) applied this methodology to random effects probit MLE. Hahn, Newey, and Smith (2014) extend Chesher's approach to moment condition estimators like Generalized Method of Moments (GMM). Hahn, Moon, and Snider (2015) allow for dependence between the heterogeneity and covariates when testing in the likelihood setting, though they also find that tests that treat the heterogeneity and regressors as mean and second-moment independent still have power under alternatives where this is not true. A common feature of tests for neglected heterogeneity in the likelihood setting is that they have the interpretation of being tests either for information matrix (IM) equality or for overdispersion, making them less attractive in settings where researchers do not want to fully specify a distribution. I derive a test for slope heterogeneity in exponential models that does not have this drawback.

A Poisson-normal mixture model like the one described by Cameron and Trivedi is one of the “Generalized linear latent and mixed models” studied by Rabe-Hesketh and Skrondal (2004). The likelihood function consists of a multi-dimensional integral that must be numerically approximated, limiting its application to models where only a small number of coefficients are believed to be random. The authors used adaptive Gaussian quadrature to estimate a model of seizure counts for 236 subjects in a (randomly assigned) epilepsy treatment trial, where both the intercept and the coefficient on a variable for time of visit were allowed to vary by individual. While a random effects approach makes sense for the experimental setting, treating the heterogeneity as independent of covariates can cause inconsistent estimation in many economic applications. Wang, Cockburn, and Puterman (1998) do allow dependence between the heterogeneity and explanatory variables in the panel Poisson setting, assuming a parametric form for the dependence as well as a particular distribution for the heterogeneity.
With the patent-R&D relationship in mind, they propose a mixed-Poisson regression approach which assumes that the coefficients follow a discrete distribution with finite support, modeling the probability mass at each support point as multinomial logit. Their method involves using economic intuition or selection criteria to choose the number of support points. Moreover, they suggest using a continuous model for the coefficients if model selection criteria indicate four or more points of support. My paper complements their work by proposing such a model. One benefit of my approach is that, as in FEP, I can allow an unrestricted relationship between the explanatory variables and the multiplicative effect, as well as analyze non-counts.

2.3 Theory

2.3.1 The fixed effects Poisson model with coefficient heterogeneity

The standard fixed effects Poisson model with an exponential mean function assumes:

E(y_it | x_i, c_i) = E(y_it | x_it, c_i) = c_i exp(x_it β_0)   (2.1)

for i = 1, ..., N; t = 1, ..., T. In this expression, x_it is a 1 × K vector of time-varying explanatory variables, c_i is unobserved heterogeneity, and β_0 is a K × 1 unknown vector of coefficients. (Wooldridge (1999) considered conditional mean functions of the form c_i m(x_it, β_0), of which m(x_it, β_0) = exp(x_it β_0) is a special case.) Equation (2.1) implicitly assumes that x_it is strictly exogenous. Hausman, Hall, and Griliches (1984) showed that if, conditional on x_i = {x_i1, ..., x_iT} and c_i, the y_it are independently distributed as Poisson with mean given by (2.1), then conditioning on n_i ≡ Σ_{t=1}^T y_it results in the multinomial distribution for {y_i1, ..., y_iT}. The multinomial log-likelihood is

ℓ_i(β) = Σ_{t=1}^T y_it log[p_t(x_i, β)],   (2.2)

where

p_t(x_i, β) ≡ exp(x_it β) / Σ_{r=1}^T exp(x_ir β).   (2.3)

The feature that c_i enters the conditional mean function multiplicatively means it cancels out of p_t(x_i, β) and therefore ℓ_i(β), so the dependence between c_i and x_i may remain unrestricted. This structure also has the consequence that coefficients on time-constant regressors are not identified, because those terms also cancel. The model is particularly attractive because, as shown by Wooldridge (1999), β_0 maximizes the expected value of (2.2) as long as (2.1) is true. Therefore, under additional regularity conditions, FEP consistently estimates β_0 with N growing and T fixed. Notably, consistency does not require a distributional assumption for the responses and allows them to be arbitrarily serially correlated (Wooldridge, 1999).

Condition (2.1) generally fails, however, if the coefficients in the conditional mean function vary by individual i, as in the following:

E(y_it | x_i, c_i, b_i) = E(y_it | x_it, c_i, b_i) = c_i exp(x_it b_i),   (2.4)

where now b_i is a K × 1 vector of unobserved random variables such that E(b_i) = β_0. Defining d_i ≡ b_i − β_0, the conditional mean in (2.4) is equivalent to c_i exp(x_it β_0 + x_it d_i), meaning one interpretation of the heterogeneity is unobserved interactions in the index of the mean function. There is a more practical, economic interpretation as well. Assuming element j is not functionally related to any other elements of x_it,

∂ log[E(y_it | x_i, c_i, b_i)] / ∂ x_itj = b_ij,   (2.5)

so model (2.4) implies semi-elasticities of the conditional mean of y_it that vary by individual. If x_itj is the log of another variable, as in some applications, then the b_ij are individually-varying elasticities.
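As a point of reference for what follows, the FEP objective in (2.2)-(2.3) is simple to code directly. The following is a minimal Python sketch; the array names and shapes are hypothetical and it is meant only to make the structure of the multinomial quasi-log-likelihood concrete.

    import numpy as np
    from scipy.optimize import minimize

    def fep_negative_loglik(beta, X, Y):
        """Negative of the multinomial quasi-log-likelihood in (2.2)-(2.3), summed over i.
        X: (N, T, K) array of regressors, Y: (N, T) array of nonnegative responses."""
        xb = X @ beta                                   # (N, T) linear indices x_it * beta
        xb = xb - xb.max(axis=1, keepdims=True)         # harmless shift; p_t is invariant to it
        log_p = xb - np.log(np.exp(xb).sum(axis=1, keepdims=True))
        return -np.sum(Y * log_p)                       # units with sum_t y_it = 0 contribute zero

    # beta_hat = minimize(fep_negative_loglik, np.zeros(X.shape[2]), args=(X, Y), method="BFGS").x

Maximizing this objective gives the FEP estimates whether or not the y_it are counts, which is exactly the quasi-CMLE interpretation discussed above.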
An immediate consequence of the coefficient heterogeneity in (2.4) is that it likely causes specification error if we want to use FEP assuming (2.1). To see this, suppose for concreteness that d_i is continuous, and write its PDF conditional on x_i and c_i as f(·; ψ_0), where ψ_0 is an unknown parameter that is nonzero only if the coefficients are random. It follows under (2.4) and the Law of Iterated Expectations (LIE) that

E(y_it | x_i, c_i) = c_i exp[x_it β_0 + g_t(x_i, x_it, c_i; ψ_0)],   (2.6)

where

g_t(x_i, x_it, c_i; ψ_0) = log{E[exp(x_it d_i) | x_i, c_i]} = log ∫_{R^K} exp(x_it d_i) f(d_i | x_i, c_i) dd_i,   (2.7)

assuming the expectation exists. The exponential function now contains an unknown term that is generally nonzero and varies over time. (If g_t(x_i, x_it, c_i; ψ_0) were time-constant, it would also cancel from p_t(x_i, β, ψ) and FEP would remain consistent, but there is no reason to think this should be the case with time-varying x_it.) Depending on what we are willing to assume about the dependence between b_i and x_i, we may not be able to distinguish between coefficients that are random and a more flexible functional form. The consequence of ignoring the coefficient heterogeneity is that (2.1) is no longer correct, and so FEP of y_it on x_it can no longer be shown to be generally consistent for β_0. This is true even under ideal conditions like independence between b_i and {x_i, c_i}. In fact, simulation evidence from Section 4 suggests substantial bias and inconsistency for FEP in this case. This contrasts with the linear unobserved effects model with random coefficients, in which fixed effects OLS is consistent for the means of the coefficients so long as the coefficients are mean independent of the time-demeaned regressors (Wooldridge, 2010). In that case, the random coefficients cause a certain form of system heteroskedasticity in the idiosyncratic errors that is handled completely with robust inference.

2.3.2 Testing under full distributional assumptions

If the y_it are count data and researchers are willing to take full distributional assumptions seriously, the approach of Chesher (1984) provides a simple LM test. The slopes are not allowed to depend on the covariates or c_i under the alternative, which avoids having to specify a particular joint distribution for b_i and x_i. However, lack of power may be an issue under alternatives where b_i depends on x_i. Findings of Hahn, Moon, and Snider (2015) suggest that this is less of a concern in nonlinear models. The following statements formalize the assumptions:

y_it | (x_i, c_i, b_i) ∼ Poisson[c_i exp(x_it b_i)], i = 1, ..., N; t = 1, ..., T,   (2.8)

{y_i1, ..., y_iT} are independent conditional on {x_i, c_i, b_i},   (2.9)

b_i = β_0 + Λ_0 u_i, where u_i | (x_i, c_i) ∼ F(0, I_K),   (2.10)

where I_K is the K × K identity matrix. From Chesher (1984), assumption (2.10) does not assume a particular distribution for b_i, but specifies that they follow a “location-scale generalization of the class of spherical distributions” described by Kelker (1970). Denote the PDF of u_i as f(·). It follows that

y_i | (n_i, x_i, c_i, b_i) ∼ Multinomial(n_i, p_1(x_i, b_i), ..., p_T(x_i, b_i)),   (2.11)

where

p_t(x_i, b_i) ≡ exp(x_it b_i) / Σ_{r=1}^T exp(x_ir b_i).   (2.12)

Therefore, the log-likelihood for an observation i, integrating out the random part of the slopes, is

ℓ_i(β, Λ) = log ∫_{R^K} [n_i! / (Π_{t=1}^T y_it!)] Π_{t=1}^T [p_t(x_i, b_i)]^{y_it} f(u_i) du_i,   (2.13)

where the integral is of K dimensions.
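To see the computational burden in (2.13) concretely, consider a simulated-likelihood approximation of the integral for a single unit. The sketch below uses hypothetical array names and draws from the assumed spherical density f; in practice adaptive quadrature (as in Rabe-Hesketh and Skrondal, 2004) or a large number of draws per unit would be required.

    import numpy as np

    def simulated_loglik_i(beta, Lam, Xi, yi, draws):
        """Monte Carlo approximation of the integral in (2.13) for a single unit i.
        Xi: (T, K) regressors, yi: (T,) responses, draws: (R, K) simulated u_i from the
        assumed density f; Lam is the K x K scale matrix Lambda."""
        B = beta + draws @ Lam.T                        # R draws of b_i = beta + Lambda u_i, stored by row
        xb = Xi @ B.T                                   # (T, R) array of indices x_it b_i
        log_p = xb - np.log(np.exp(xb).sum(axis=0, keepdims=True))
        log_kernel = (yi[:, None] * log_p).sum(axis=0)  # multinomial kernel, up to the combinatorial constant
        return np.log(np.mean(np.exp(log_kernel)))      # a log-sum-exp step would be used in practice

Each evaluation of the sample log-likelihood repeats this calculation for every unit inside a numerical optimizer over (β, Λ).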
An LM test of H_0: Λ_0 = 0 is attractive because in this case b_i = β_0, and so the restricted model can be estimated using FEP. It also turns out that the restricted score does not depend on the unknown PDF f(·). However, the parameterization of this model causes a complication in deriving the restricted scores, as described by Chesher (1984) and Lee and Chesher (1986) for a more general class of models: the score of the unrestricted model evaluated at the parameter restriction is identically zero (see Appendix C for the derivation). Chesher (1984) proposed re-parameterizing the scale assumption and restricting the correlation among the heterogeneity allowed under the alternative (Chesher's own solution would be to assume Λ_0 = λ_0 I_K):

Λ_0 = diag(λ_{1,0}, ..., λ_{K,0}).   (2.14)

Allowing no covariance between coefficients may affect power under alternatives in which this does not hold, but at the same time, information about the covariances is only relevant if there is evidence that the variances are nonzero. (Strictly speaking, the relevant alternative should be that at least one λ_{j,0} > 0, but for simplicity the two-sided alternative is treated here, as in Chesher (1984).) Under (2.14), the restricted score has the 0/0 form, but the limits follow from L'Hopital's rule. The algebraic details are collected in Appendix C. Collecting the λ_j in the K × 1 vector λ, the restricted score is

s(β, 0) ≡ lim_{λ↓0} ∇_θ ℓ(β, λ) = Σ_{i=1}^N [ Σ_{t=1}^T y_it ∇_β p_t(x_i, β) / p_t(x_i, β) ; (1/2) a_1(x_i, β) ; ... ; (1/2) a_K(x_i, β) ],   (2.15)

where the semicolons separate the stacked blocks of the score vector and a_j(x_i, β) is the (j, j)th element of

A(x_i, β) ≡ Σ_{t=1}^T ∇²_β M_it(β) + [Σ_{t=1}^T ∇_β M_it(β)][Σ_{t=1}^T ∇_β M_it(β)]′.   (2.16)

In this last expression, M_it is the multinomial log-likelihood for observation i in period t. The outer-product-of-the-score version of the LM statistic is then N times the uncentered R-squared from the regression of 1 on ŝ_i, where for each observation i, ŝ_i is the corresponding summand on the right-hand side of (2.15) evaluated at the FEP estimate of β. The advantage of this approach is its relative simplicity. The unrestricted model may even be computationally infeasible to estimate, but a test of the null hypothesis of constant coefficients is relatively easy to implement. The downside of this approach concerns robustness to failure of (2.8) or (2.9). Chesher (1984) notes that statistics derived using this approach resemble White's (1982) information matrix test for general model misspecification, as E[A(x_i, β)] = 0 if the conditional multinomial distribution is correct. This means coefficient heterogeneity cannot be distinguished from failures of the model's other assumptions, such as the Poisson distribution or conditional independence.
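Since the N·R² form of the statistic is used both here and again for the score derived in the next subsection, a minimal sketch may be useful. It assumes the researcher has already stacked the restricted score contributions ŝ_i from (2.15), evaluated at the FEP estimates, into the rows of an array; the function and array names are hypothetical.

    import numpy as np

    def opg_lm_statistic(score_contributions):
        """Outer-product-of-the-score LM statistic: N times the uncentered R-squared
        from regressing a column of ones on the rows of the restricted score
        contributions (the summands in (2.15), evaluated at the FEP estimates)."""
        S = np.asarray(score_contributions, dtype=float)   # (N, number of score components)
        N = S.shape[0]
        ones = np.ones(N)
        coef, *_ = np.linalg.lstsq(S, ones, rcond=None)    # OLS of 1 on the scores, no intercept
        ssr = np.sum((ones - S @ coef) ** 2)
        return N * (1.0 - ssr / N)                         # N times the uncentered R-squared

Under the null, the statistic would be compared with a chi-squared critical value with degrees of freedom equal to the number of restrictions (here K).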
2.3.3 Testing under weaker assumptions

In the previous section, I showed the classical test applicable to conditionally independent Poisson dependent variables. While the statistic is simple to calculate, the test is likely to reject in cases where the Poisson or conditional independence assumption fails, regardless of the presence of random coefficients. This is similar to the case of a linear model, where the presence of random slopes (that are assumed to be independent of covariates) is indistinguishable from a certain form of system heteroskedasticity.

In this section, I extend Chesher's approach to testing for neglected heterogeneity to the FEP setting, where only the conditional mean of y_it is assumed to be correctly specified. I show that an LM test of exclusion restrictions on squared regressors is valid when the coefficients are allowed to belong to a location-scale family under the alternative. As before, assume:

E(y_it | x_i, c_i, b_i) = E(y_it | x_it, c_i, b_i) = c_i exp(x_it b_i)   (2.17)

and

b_i = β_0 + Λ_0 u_i, where u_i | (x_i, c_i) ∼ F(0, I_K),   (2.18)

where again the CDF F(·) and the corresponding PDF f(·) are left unspecified. Similar to before, these conditions imply:

E(y_it | x_i, c_i) = c_i exp[x_it β_0 + m_t(x_i, Λ_0)],   (2.19)

where

m_t(x_i, Λ_0) = log{E[exp(x_it Λ_0 u_i) | x_i, c_i]} = log ∫_{R^K} exp(x_it Λ_0 u_i) f(u_i) du_i.   (2.20)

It is easy to see that m_t(x_i, 0) = 0. In the multivariate normal case, m_t(x_i, Λ_0) = (1/2) x_it Ω_0 x_it′, where Ω_0 = Λ_0 Λ_0′. Rejecting H_0: Λ_0 = 0 provides evidence against the null of constant coefficients. I follow Chesher's derivation of the LM statistic as before, but unlike other methods, I only integrate u_i out of the conditional mean function, not the entire likelihood or score. The unrestricted quasi-log-likelihood is

ℓ_i(β, Λ) = Σ_{t=1}^T y_it log[p_t(x_i, β, Λ)],   (2.21)

where

p_t(x_i, β, Λ) ≡ exp(x_it β + m_t(x_i, Λ)) / Σ_{r=1}^T exp(x_ir β + m_r(x_i, Λ)).   (2.22)

The first K elements of the unrestricted score evaluated at Λ = 0 are just the usual FEP scores. The gradient with respect to Λ evaluated at Λ = 0, however, presents a similar problem as before. I make the same re-parameterization as before, shown in equation (2.14), restricting the coefficients to be uncorrelated with each other under the alternative. The restricted scores have a 0/0 form and are evaluated using L'Hopital's rule. The details are collected in Appendix C. The score evaluated at the parameter restriction is

s_i(β, 0) = [ Σ_{t=1}^T y_it ∇_β p_t(x_i, β, 0) / p_t(x_i, β, 0) ; (1/2) Σ_{t=1}^T y_it ( x_it1² − Σ_{r=1}^T exp(x_ir β) x_ir1² / Σ_{r=1}^T exp(x_ir β) ) ; ... ; (1/2) Σ_{t=1}^T y_it ( x_itK² − Σ_{r=1}^T exp(x_ir β) x_irK² / Σ_{r=1}^T exp(x_ir β) ) ].   (2.23)

The last K elements are proportional to the restricted FEP scores for testing the exclusion of squared regressors from the model with constant slopes. Therefore, in the exponential case, we cannot distinguish random coefficients from the presence of quadratics in E(y_it | x_it, c_i). As an empirical matter, however, this test takes no stand on the (conditional) distribution, overdispersion, or serial correlation of y_it, so it may offer some advantages over the approach in Section 3.2. For example, if a researcher rejects the null using the test based on (2.15) but fails to reject based on (2.23), then he or she can proceed in estimating the model based on (2.1) with some peace of mind.

2.3.4 A correlated random coefficients approach to testing and estimation

When one wishes to allow more than one or two slopes to be random, “random effects” type estimation based on integrating out the heterogeneity is computationally difficult and may not be robust to misspecification of the response variable's distribution. A straightforward alternative, which is applicable not only to counts but also to other nonnegative responses, is to make a parametric, distributional assumption for b_i that allows us to derive E[exp(x_it d_i) | x_i, c_i].
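To make that last step concrete: if d_i | (x_i, c_i) ∼ Normal(0, Ω_0), then the moment generating function of the multivariate normal gives

E[exp(x_it d_i) | x_i, c_i] = exp((1/2) x_it Ω_0 x_it′),

so the unknown term g_t(·) in (2.6)-(2.7) collapses to a quadratic form in x_it that can be absorbed into an augmented set of regressors. This is the same normal/lognormal calculation noted after (2.20), and it is the device exploited below.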
Here, I assume correlated random coefficients (CRC) and (conditional) multivariate normality:

b_i = α_0 + Γ_0 x̄_i′ + d_i, d_i | (x_i, c_i) ∼ Normal(0, Ω_0),   (2.24)

where x̄_i = T^{-1} Σ_{t=1}^T x_it, α_0 is an unknown K × 1 vector, and Γ_0 is an unknown K × K matrix. This assumption states that the dependence between x_i and the mean of b_i is captured entirely through the time averages of x_it, and is the application of Mundlak (1978) to the current setup. Alternatively, one could allow the mean of b_i to depend on x_i in the style of Chamberlain (1980). If Γ_0 = 0, then (2.24) amounts to a stronger version of (2.10), in which case α_0 = β_0. Note that (2.24) only requires multivariate normality of the coefficients conditional on x_i; their unconditional distribution need not be normal, though logically speaking it should be continuous and have unbounded support. Condition (2.24) also implies b_i and c_i are independent conditional on x_i. This is less restrictive for testing purposes because b_i is constant under the null, but it could affect power under alternatives where the two are dependent. The two sources of heterogeneity are still allowed, through x_i, to be correlated unconditionally. As in FEP, the relationship between x_i and c_i is left completely unrestricted. Under (2.4) and (2.24), it follows from properties of the lognormal distribution and the LIE that

E(y_it | x_i, c_i) = E(y_it | x_it, x̄_i, c_i)
= c_i exp[ x_it α_0 + x_it Γ_0 x̄_i′ + (1/2) x_it Ω_0 x_it′ ]
= c_i exp[ x_it α_0 + (x̄_i ⊗ x_it) vec(Γ_0) + (1/2) ( Σ_{j=1}^K ω_j x_itj² + 2 Σ_{j=1}^{K−1} Σ_{h>j} ρ_jh x_itj x_ith ) ]
≡ c_i exp[ x_it α_0 + (x̄_i ⊗ x_it) γ_0 + (1/2) x̌_it ω_0 ],   (2.25)

where γ_0 = vec(Γ_0), x̌_it = (x_it1², ..., x_itK², x_it1 x_it2, x_it1 x_it3, ..., x_it,K−1 x_itK), ω_0 ≡ (ω_1, ..., ω_K, 2ρ_12, 2ρ_13, ..., 2ρ_{K−1,K})′, ω_j = Var(b_j), and ρ_jh = Cov(b_j, b_h). Equation (2.25), along with regularity conditions, implies that FEP of y_it on x_it, interactions between x_it and x̄_i, and squares and interactions of x_it will consistently estimate α_0, γ_0, and ω_0, without assuming a distribution for y_it and while allowing arbitrary serial correlation (Wooldridge, 1999). Following estimation of (2.25), the unconditional means of the b_i are easy to estimate using the following, where μ_x̄ = E(x̄_i):

β_0 ≡ E(b_i) = α_0 + Γ_0 μ_x̄′.   (2.26)

I believe that using the lognormal distribution in the FEP setting is novel and that it offers the crucial advantage of still allowing one source of heterogeneity to be correlated with x_i. (A similar result appears in Cameron and Trivedi (2013) for the case where b_i | (x_i, c_i) ∼ Normal(β_0, Ω_0) and c_i | (x_i, b_i) ∼ lognormal(0, σ_c²), as a way of illustrating how random coefficients change E(y_it | x_i).) This procedure is easy to implement, as the FEP estimator is available in software packages like Stata, though practitioners should be careful to calculate cluster-robust standard errors to account for serial correlation and misspecification of the multinomial distribution. Another important note is that if one believes time-constant variables z_i belong in the model and that they also have random coefficients correlated with the coefficients on the x_it, then the augmented FEP regression should also include interactions between z_i and x_it, as these are not absorbed by c_i when conditioning on n_i. One drawback to this approach is that for a binary element k of x_it, FEP only identifies α_k + (1/2)ω_k.
Similarly, some elements of α 0 and Ω 0 are not separately identified when x it contains both levels and higher order terms. This model nests the traditional case of constant coefficients, which occurs when γ 0 = 0 and ω 0 = 0 ). Rejection of the null that γ 0 = 0 is perhaps most convincing evidence of that slopes vary by individual. Therefore, the primary contribution of this approach to random coefficients is to suggest the inclusion of interactions between time-varying regressors and time averages to see if more flexibility is necessary. If there is no evidence that slopes are correlated with the x¯ i , then one should carefully consider how to interpret inference on ω 0 . Statistically significant estimates may just indicate that squares and cross-products of x it belong in the FEP regression. Clearly if the cross-products are significant while the squares are not, or if the coefficients on squared terms are negative and significant, then the random coefficient framework does not make sense, though the results may still have yielded useful insight into the what functions of the explanatory variables should be included in the analysis. 2.3.5 Adding second moment assumptions While under our assumptions, FEP is consistent under correct specification of the conditional mean (2.25), it may be possible to achieve greater efficiency by adding assumptions about the conditional second moment of y i . Another reason may be to identify the coefficients on binary variables. 43 I assume a variance function that is proportional to the conditional mean. Var [yit |xxi , ci , b i ] = σ0 ci exp(xxit b i ) (2.27) Additionally, the following CRE assumption implies conditional mean and variance functions that do not depend on ci . log(ci )|xxi , b i ∼ Normal(ψ1 + x¯ i ξ 1 , σa2 ) (2.28) Under assumptions 2.4, 2.24, 2.27, and 2.28, it follows from the properties of the lognormal distribution, the LIE, and the Law of Total Variance that 1 E(yit |xxi ) = E(yit |xxit , x¯ i ) = exp h(xxit , x¯ i , θ 0 ) + v(xxit , τ 0 ) 2 (2.29) and Var(yit |xxi ) =Var(yit |xxit , x¯ i ) 1 =σ0 exp h(xxit , x¯ i , θ 0 ) + v(xxit , τ 0 ) 2 + exp [2h(xxit , x¯ i , θ 0 ) + v(xxit , τ 0 )] {exp [v(xxit , τ 0 )] − 1} , (2.30) ω 0 , σa2 ) , h(xxit , x¯ i , θ 0 ) ≡ ψ1 + x¯i ξ 1 + x it α 0 + (¯x i ⊗ x it )γγ 0 , and where θ ≡ (ψ1 , ξ 1 , α , γ ) , τ = (ω v(xxit , τ 0 ) ≡ xˇ it ω 0 + σa2 . Estimation of θ 0 and τ 0 can then proceed using pooled normal QMLE, specifying the mean and variance functions as above. As the normal distribution is a member of the quadratic exponential family, this procedure is consistent without the normal distribution being true (Gourieroux, Monfort, and Trognon, 1984) Once again, inference should be made cluster-robust to account for serial correlation and the true distribution being non-normal. Estimation of β 0 can then proceed as before, and coefficients on binary or quadratic variables are now identified off of the nonlinearity in (2.30). Normal QMLE in this case is straightforward to program in software like Stata using built-in maximum likelihood functions, and it had good finite sample properties in simulations run for this 44 chapter. Some researchers may wish to specify a conditional covariance structure for yi as a way to get more efficiency. If so, one option is to assume Cov [yit , yir |xxi , ci , b i ] = 0,t = r. 
(2.31) Equation (2.31) does not allow serial correlation when conditioning on x i , ci , b i , but the presence of the time-constant heterogeneity ensures that the responses will be serially correlated when conditioning on x i only. Under 2.4, 2.24, 2.27, 2.31, and 2.28, Cov(yit , yir |xxi ) = 1 exp h(xxit , x¯ i , θ 0 ) + h(xxir , x¯ i , θ 0 ) + (v(xxit , τ 0 ) + v(xxir , τ 0 )) 2 exp(xxit Ω 0 x ir + σa2 ) − 1 . (2.32) 2.3.6 Estimating average partial effects Even though the coefficients in (2.4) have direct interpretations as semi-elasticities, it may still be desirable to estimate partial effects and APEs, perhaps to compare estimates between competing nonlinear models. Moreover, this sections shows that the average partial effects for a binary variable depend only on αk + 21 ωk , meaning that even though we cannot separately identify αk and ωk without second moment assumptions, we can still estimate average partial effects. Let x = {xx1 , x 2 , . . . , x T }, c, and b = {b1 , b2 . . . , bK } denote fixed values of the variables. The partial effect of a continuous xt j on the conditional mean of yt is defined as8 φ j (xxt , c, b ) ≡ ∂ E(yt |xxt , c, b ) = ci exp(xxt b )b j . ∂ xt j (2.33) For a binary xtk , the partial effect is defined as the discrete difference in the conditional mean of yt at each level of the binary variable. In the expressions to follow, the subscript k signifies that xtk , x¯k , or their associated coefficients have been omitted from the vector. 8I implicitly assume that xt j is not functionally linked with any other element in xt . 45 φk (xxt , c, b ) ≡E(yt |xxtk , xtk = 1, c, b ) − E(yt |xxtk , xtk = 0, c, b ) =c exp(xxtk b k + bk ) − c exp(xxtk b k ) (2.34) Of course, estimating features of the distributions of φ j and φk is infeasible as we do not observe c or b . Therefore, this section focuses mainly on APEs where the heterogeneity has been averaged out. δh (xxt ) ≡ Evi [φh (xxt , ci , b i )] , (2.35) where v ≡ (c, b ) and h ∈ { j, k}. 2.3.6.1 Approaches under the CRE assumption for ci To proceed, it is necessary to maintain the assumptions of correlated random coefficients (2.24). As ci is unobserved, I also maintain (2.28). Later, I will discuss a possible “estimator” of ci . For now, there are two choices as to how to proceed in estimating δ j and δk . The first is to estimate an Average Structural Function (ASF), as proposed by Blundell and Powell (2003), where essentially x¯ proxies for v and is averaged out before taking derivatives and differences. The second is to use derivatives and differences of (2.29) directly (Wooldridge, 2010). The ASF is defined as: ASF(xxt ) ≡ Ev i [ci exp(xxt b i )] , (2.36) where again, xt is a fixed argument. Under (2.24), (2.27), and (2.28) the L.I.E. implies 1 ASF(xxt ) = Ex¯ exp h(xxt , x¯ i , θ 0 ) + v(xxt , τ 0 ) 2 (2.37) Passing the derivative through the expectation, the APE for continuous xt j is: 1 δ j (xxt ) =Ex¯ exp h(xxt , x¯ i , θ 0 ) + v(xxt , τ 0 ) 2 For a binary xtk , the APE is: 46 K α j + x¯ i γ j + ω j xt j + ∑ ρ jhxth h= j (2.38) δk (xxt ) =Ex¯ E yt |xxtk , xtk = 1, x¯ i − E yt |xxtk , xtk = 0, x¯ i 1 1 =Ex¯ exp h(xxtk , 1, x¯ i , θ 0 ) + v(xxtk , 1, τ 0 ) − exp h(xxtk , 0, x¯ i , θ 0 ) + v(xxtk , 0, τ 0 ) 2 2 , (2.39) where h(xxtk , 1, x¯ i , θ 0 ) =ψ1 + x¯i ξ 1 + xtk α k + xtk Γ k x¯ k + xtk x¯ik γ kk + αk + x¯ i γ k , h(xxtk , 0, x¯ i , θ 0 ) =ψ1 + x¯i ξ 1 + xtk α k + xtk Γ k x¯ k + xtk x¯ik γ kk , v(xxtk , 1, τ 0 ) =ˇx k ω k + σa2 + ωk + 2 K ∑ ρkhxth, h=k and v(xxtk , 0, τ 0 ) =ˇx k ω k + σa2 . 
(2.40) The direct approach consists of taking derivatives and differences of 2.29 directly. Note that since these expressions do not first average out x¯ , the entire history of x is now a fixed argument. For a continuous variable xt j the APE is: δ j (xx) = ∂ E(yt |xx) ∂ xt j 1 = exp h(xxt , x¯ , θ 0 ) + v(xxt , τ 0 ) 2 ξ j /T + α j + x¯ γ j + K 1 xt γ j + ω j xt j + ∑ ρ jh xth , T h= j (2.41) where γ j is the jth row and γ j is the jth column of Γ 0 . Define z(xxt , x¯ , θ , τ ) = h(xxt , x¯ , θ ) + 12 v(xxt , τ ). Then we have for a binary xtk , δk (xx) =E yt |xxk , {xsk }Ts=t , xtk = 1 − E yt |xxk , {xsk }Ts=t , xtk = 0 K 1 (1) (1) (1) = exp z(xxtk , x¯ k , θ k , τ k ) + ξ j x¯tk + αk + x¯ k γ kk + γkk x¯tk + xtk x¯tk γ kk + ωk + ∑ ρkh xth 2 h=k (0) (0) − exp z(xxtk , x¯ k , θ k , τ k ) + ξ j x¯tk + xtk x¯tk γ kk , 47 (2.42) (0) (1) where γkk is the kth diagonal element of Γ 0 , x¯tk ≡ T1 1 + ∑Ts=t xsk , and x¯tk ≡ T1 ∑Ts=t xsk . Whichever approach is chosen, one can then estimate δ j (xxt ) or δk (xxt ) by inserting the estimated parameters, replacing expectations over the distribution of x¯ with averages over i, and plugging in interesting values of x . Many researchers will average over the distribution of x to get a single number. Asymptotic variances can be computed either via the delta method or using the panel bootstrap. 2.3.6.2 Estimation when the slopes are independent of covariates The traditional case where b i is independent of x i (conditional on ci ) is one where the ASF is identified without placing any restriction on ci or Var(yit |xxi ). The following summarizes the necessary condition. bi = β 0 + d i, d i |(xxi , ci ) ∼ Normal(00, Ω 0 ). (2.43) The results of Section 3.4 continue to hold, but the time averages no longer enter E(yit |xxit , ci ) (that is, Γ 0 = 0 ). The LIE implies that for a fixed xt , 1 ASF(xxt ) = E(ci ) exp xt β 0 + xt Ω 0 x it 2 (2.44) Passing the derivative through the expectation, the APE of a continuous variable xt j is given by: 1 δ j (xxt ) =E(ci ) exp xt β 0 + xt Ω 0 x it 2 For a binary variable xtk , the APE is: 48 K β j + +ω j xt j + ∑ ρ jhxth h= j (2.45) δk (xxt ) = K 1 1 1 E(ci ) exp xtk β k + xˇ k ω k + αk + ωk + ∑ ρkh xth − E(ci ) exp xtk β k + xˇ k ω k 2 2 2 . h=k (2.46) An estimator for E(ci ) is conveniently available. Poisson QMLE using (2.25) and treating the ci as (strictly positive) parameters is algebraically equivalent to multinomial QCMLE. 9 ) In our current β , ω ) , the QMLE for ci is: application, for a given θ ≡ (β ci (θθ ) = ni , T exp(x xit β + xˇ it ω ) ∑t=1 (2.47) T y . Define c = c (θ ), where θ is the FEP estimate of (β β 0 , ω 0 ) . The where again, ni = ∑t=1 it i i properties of ci are not well-known in either the constant or heterogeneous slope case. Though there is no incidental parameters problem for θ in the FEP case, ci (θθ ) = ci , even when evaluated at θ 0 . Viewing ci as a parameter, there is no reason to think ci is unbiased and it cannot be consistent with T fixed. However, the ASF in this case is proportional to E(ci ). Strict exogeneity of x it and (2.24) imply that T E(ni |ci , x i ) = ci ∑ exp t=1 1 x it β 0 + xˇ it ω 0 2 (2.48) It follows from the L.I.E. that   E(ci ) = E  ni T exp ∑t=1 x it β 0 + 12 xˇ it ω 0  (2.49) meaning N −1 ∑N i=1 ci consistently estimates E(ci ). Many researchers are primarily interested in a single APE estimate (averaged across the sample of observables). 
In this case, it may be attractive to treat ĉ_i as the unobservable c_i and average across the distributions of c_i and x_i at the same time. We would generally expect APE estimators for nonlinear FE models derived in such a way to suffer from the incidental parameters problem, even if the slopes are estimated consistently (see, for example, Fernandez-Val, 2009). Given that N^{-1} Σ_{i=1}^N ĉ_i is consistent for E(c_i), however, it may be that estimators including functions of ĉ_i that are averaged across i have desirable properties. This appears to be true at least for the data generating process considered in this chapter. Simulation results in Section 4 indicate very small finite sample bias of overall APE estimators computed using ĉ_i in this way. (On the equivalence noted above between Poisson QMLE with unit-specific intercepts and multinomial QCMLE, see Wooldridge (2010) or Cameron and Trivedi (2013).)

2.4 Monte Carlo

2.4.1 Comparing estimation methods

To illustrate the impact of ignoring random coefficients in the FEP setting, I simulate the performance of the different estimators in both the ideal case of constant coefficients and in the case where the coefficients vary by individual. I employed the following data generating process:

y_it | (x_i, w_i, c_i, b_i1, b_i2) ∼ Poisson[c_i exp(b_i1 x_it + b_i2 w_it)],   (2.50)

log(c_i) ∼ Normal(0, 1/16),   (2.51)

x_it = log(c_i) + 0.5 x_i,t−1 + v_it for t > 1, with x_i1 = log(c_i) + v_i1 and v_it ∼ N(0, 1/2),   (2.52)

w_it = 1[x_it + h_it > 0], h_it ∼ N(0, 1/2),   (2.53)

(b_i1, b_i2)′ ∼ Normal( (β_1, β_2)′, [ ω_1², ρ ; ρ, ω_2² ] ).   (2.54)

For the above draws, i = 1, ..., 1000 and t = 1, ..., 10. The case where ω_1², ω_2², and ρ all equal zero corresponds to the constant coefficient case. For these simulations, the b_ij are generated to be independent of {x_i, w_i}, and this assumption is maintained in estimation. The b_ij are also generated to be independent of each other (ρ = 0), but this is not assumed in estimation.

In the following tables, FEP refers to the estimator that ignores the random coefficients. FEP2 refers to the estimator that adds the square of x and an interaction between x and w. Since this model's assumptions do not separately identify β_2 and ω_2², the estimated coefficient on w is compared to β_2 + (1/2)ω_2². NQML refers to the normal QML estimator that also assumes (2.27) and (2.28); APE estimates from NQML also plugged in ĉ_i. I set ω_1 = ω_2 = ω but do not assume equal variances in estimation. In each case, I used one thousand replications.
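For reference, the data generating process in (2.50)-(2.54) is easy to reproduce. The sketch below fixes ω at one illustrative value (the tables vary it from 0 to 0.5) and draws b_i1 and b_i2 independently (ρ = 0), as in the design described above; the estimation comparisons themselves are not coded here.

    import numpy as np

    rng = np.random.default_rng(0)
    N, T = 1000, 10
    omega = 0.25                                       # one illustrative value of omega

    log_c = rng.normal(0.0, 0.25, size=N)              # Var[log(c_i)] = 1/16
    b1 = rng.normal(1.0, omega, size=N)                # b_i1, mean beta_1 = 1
    b2 = rng.normal(-1.0, omega, size=N)               # b_i2, mean beta_2 = -1, drawn independently of b_i1

    x = np.empty((N, T))
    x[:, 0] = log_c + rng.normal(0.0, np.sqrt(0.5), size=N)
    for t in range(1, T):
        x[:, t] = log_c + 0.5 * x[:, t - 1] + rng.normal(0.0, np.sqrt(0.5), size=N)
    w = (x + rng.normal(0.0, np.sqrt(0.5), size=(N, T)) > 0).astype(float)

    y = rng.poisson(np.exp(log_c)[:, None] * np.exp(b1[:, None] * x + b2[:, None] * w))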
51 Table 2.1: Finite Sample Properties of Slope Estimators: β1 = 1, β2 = −1 ω 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 FEP Mean SD 1.00 0.02 1.00 0.02 1.01 0.02 1.02 0.02 1.03 0.02 1.05 0.03 1.07 0.03 1.10 0.04 1.14 0.06 1.18 0.07 1.23 0.09 β1 FEP2 NQML Mean SD Mean SD 1.00 0.03 1.00 0.02 1.00 0.03 1.00 0.02 1.00 0.03 1.00 0.02 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.03 1.00 0.04 1.00 0.04 1.00 0.04 1.00 0.04 1.00 0.04 0.99 0.05 1.00 0.04 0.99 0.05 52 β2 FEP NQML Mean SD Mean SD -1.00 0.03 -1.00 0.04 -1.00 0.03 -1.00 0.04 -1.00 0.03 -1.00 0.04 -0.99 0.03 -1.00 0.04 -0.99 0.03 -1.00 0.04 -0.98 0.03 -1.00 0.04 -0.98 0.04 -1.00 0.04 -0.97 0.04 -0.99 0.05 -0.96 0.05 -0.99 0.05 -0.96 0.07 -0.99 0.05 -0.95 0.08 -0.98 0.06 β2 + 12 ω22 FEP2 Mean SD Truth -1.00 0.04 -1.00 -1.00 0.04 -1.00 -0.99 0.04 -1.00 -0.99 0.04 -0.99 -0.98 0.04 -0.98 -0.97 0.04 -0.97 -0.96 0.04 -0.96 -0.94 0.04 -0.94 -0.93 0.05 -0.92 -0.91 0.05 -0.90 -0.89 0.05 -0.88 Table 2.2: Finite Sample Properties of APE Estimators: β1 = 1, β2 = −1 ω 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 Truth 0.88 0.88 0.90 0.91 0.95 0.98 1.04 1.11 1.20 1.32 1.49 Est. APE of x FEP FEP2 NQML Mean SD Mean SD Mean SD 0.88 0.03 0.88 0.03 0.88 0.03 0.88 0.03 0.88 0.03 0.88 0.03 0.89 0.03 0.89 0.03 0.89 0.03 0.92 0.04 0.92 0.04 0.92 0.04 0.95 0.04 0.95 0.04 0.95 0.04 0.99 0.05 0.99 0.05 0.99 0.05 1.04 0.07 1.04 0.06 1.03 0.06 1.11 0.09 1.11 0.08 1.10 0.09 1.21 0.15 1.21 0.13 1.20 0.14 1.32 0.22 1.33 0.21 1.31 0.23 1.49 0.47 1.50 0.49 1.48 0.48 53 Truth -1.12 -1.12 -1.13 -1.14 -1.15 -1.16 -1.19 -1.22 -1.26 -1.30 -1.36 Est. APE of w FEP FEP2 NQML Mean SD Mean SD Mean SD -1.12 0.06 -1.12 0.08 -1.12 0.07 -1.12 0.06 -1.12 0.08 -1.12 0.07 -1.13 0.06 -1.13 0.09 -1.13 0.08 -1.15 0.07 -1.14 0.10 -1.14 0.08 -1.17 0.07 -1.15 0.10 -1.15 0.08 -1.19 0.09 -1.17 0.12 -1.16 0.10 -1.23 0.11 -1.19 0.16 -1.18 0.11 -1.27 0.14 -1.22 0.20 -1.20 0.13 -1.35 0.22 -1.26 0.29 -1.23 0.16 -1.43 0.34 -1.30 0.47 -1.26 0.24 -1.57 0.86 -1.35 0.60 -1.29 0.26 It appears from Table 2.1 that the standard deviation of the coefficients is positively related to the finite sample bias (in magnitude) in FEP slope estimates. This is not surprising given that (2.1) fails for ω > 0. This is despite the fact that the coefficients are independent of the covariates and each other, a case in which random coefficients would not cause a problem in linear models. In contrast, the augmented FEP and the NQML estimators show much smaller bias at all levels of ω, with the exception of the FEP2 coefficient on w, which, as expected, appears to show small bias for β2 + 21 ω 2 . The APEs are estimated using expressions similar to (2.45) and (2.46) using the FEP2 and NQML parameter estimates. The difference is I treat ci as ci and average over {xxit , ci } only once. I followed an analogous procedure for the FEP case. Table 2.2 suggests that this approach to estimating APEs has small bias for the FEP2 and NQML case, despite using estimates of incidental parameters. For FEP, bias in the APE of the binary variable increases as ω increases. Surprisingly, this is not the case for the continuous variable. Even though the simulation suggests a large bias in the FEP estimate of β1 . This warrants further investigation as it suggests there many be circumstances in which researchers can ignore random coefficients if all they care about is APEs of continuous variables, though it could also be an artifact of this data generating process. 
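For completeness, the APE computations just described (treating ĉ_i from (2.47) as c_i and averaging over the sample once) are simple to reproduce. The sketch below uses hypothetical array names and a generic estimated exponential index exp(ĝ_it), so it applies to the FEP2 and NQML parameterizations alike; for the FEP2 specification of this section, for example, the derivative of the estimated index with respect to x is α̂_1 + ω̂_1 x_it + ρ̂ w_it, recalling that the coefficient on x² estimates ω̂_1/2.

    import numpy as np

    def c_hat(y, mu):
        """Multiplicative-effect 'estimates' as in (2.47): c_hat_i = n_i / sum_t mu_it,
        where mu_it = exp(g_it_hat) is the fitted exponential index. y, mu: (N, T)."""
        return y.sum(axis=1) / mu.sum(axis=1)

    def ape_continuous(ci_hat, mu, dg_dxj):
        """Sample-average APE of a continuous regressor, treating c_hat_i as c_i:
        the mean over (i, t) of c_hat_i * mu_it * (d g_it_hat / d x_itj)."""
        return np.mean(ci_hat[:, None] * mu * dg_dxj)

    def ape_binary(ci_hat, mu1, mu0):
        """Sample-average APE of a binary regressor: the mean over (i, t) of
        c_hat_i * (mu_it evaluated at 1 minus mu_it evaluated at 0)."""
        return np.mean(ci_hat[:, None] * (mu1 - mu0))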
2.4.2 Testing when coefficients are not normal

Section 3 shows that, for slope heterogeneity in a location-scale family of spherical distributions (where the heterogeneity terms are independent of each other), an LM test for coefficient heterogeneity is equivalent to testing the coefficients on squares of the covariates, which suggests that the heterogeneity need not be normal for the approach of this chapter to work well. To explore this, I generate the responses using random coefficients with different distributions:

b_ij,2 = 1 + ω(u_j2 − 0.5)/√(1/12), u_j2 ∼ Uniform(0, 1),   (2.55)

b_ij,3 = 1 + ω(u_j3 − 4)/√8, u_j3 ∼ χ²_4,   (2.56)

b_ij,4 = 1 + ω u_j4/√(5/3), u_j4 ∼ t_5,   (2.57)

b_ij,5 = 1 + ω(u_j5 − 1), u_j5 ∼ Exponential(1),   (2.58)

b_ij,6 ∼ Gamma(1/ω², ω²).   (2.59)

These draws are made separately for j = 1, 2, and for simplicity, Cov(b_i1h, b_i2h) = 0 for each h. Each coefficient's data generating process ensures that it has a mean of 1 and a variance of ω². Each of the first five coefficients falls into a location-scale family, as each consists of a standardized random variable multiplied by ω to produce a variance of ω² and shifted to have a mean of one. The gamma coefficients, in contrast, are not drawn from a location-scale family, but are directly specified to have a mean of 1 and variance of ω². Given the issue of identifying parameters associated with binary regressors in the FEP2 setting, I generate the responses to depend on continuous regressors only, where each x_itj is generated as in (2.52):

y_it | (x_i1, x_i2, c_i, b_i1h, b_i2h) ∼ Poisson[c_i exp(b_i1h x_it1 + b_i2h x_it2)].   (2.60)

After generating the data, β_1, β_2, ω_1², ω_2², and ρ were estimated using FEP of y_t on x_t1, x_t2, x_t1², x_t2², and x_t1 x_t2. A Wald test was then performed on x_t1², x_t2², and x_t1 x_t2. The results of Section 3.3 suggest that this test should perform well for the first five coefficient types, and I conjecture that it performs well for the Gamma coefficients as well.

When testing for random slopes, it is important to use a FE procedure if one is concerned that the multiplicative effect c_i is correlated with the explanatory variables. Otherwise, the omitted variable problem is likely to cause the test to be over-sized. In fact, in a simulation where random effects Poisson was used on the same set of covariates, a Wald test rejected the null of constant slopes in 88% of replications when the true slopes were nonrandom.

Table 2.3: Testing when b_i is not normal
Empirical rejection probability (nominal level 0.05)
ω      Normal   Uniform*  Chi2*    t5*      Exp.*    Gamma
0.00   0.069    0.069     0.069    0.069    0.069    0.069
0.05   0.108    0.115     0.112    0.108    0.121    0.132
0.10   0.186    0.212     0.159    0.196    0.160    0.178
0.15   0.308    0.359     0.287    0.302    0.303    0.334
0.20   0.468    0.531     0.439    0.408    0.404    0.472
0.25   0.640    0.691     0.543    0.579    0.553    0.625
0.30   0.785    0.796     0.689    0.693    0.652    0.741
0.35   0.881    0.887     0.796    0.804    0.757    0.817
0.40   0.914    0.948     0.860    0.852    0.814    0.868
0.45   0.931    0.965     0.897    0.897    0.876    0.892
0.50   0.970    0.979     0.904    0.919    0.876    0.923

Table 2.3 shows that, as expected, rejection probabilities increase with ω when the coefficients are normal, and are quite high when ω is large. (I have not yet varied the cross-section size; I would expect these rejection probabilities to increase with it.) What is interesting is that there does not seem to be much change in either size or finite sample power when the coefficients are not normal, even when the coefficients are not drawn from a location-scale family.
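The variable-addition test just described can be run with standard software by exploiting the equivalence, noted in Section 3, between Poisson QMLE with unit-specific intercepts and multinomial QCMLE. The following is a minimal Python/statsmodels sketch under the assumption that the simulated data sit in a long-format dataframe df with columns y, x1, x2, and a cross-sectional identifier id (all names hypothetical); it is one way to implement the test, not the exact code used for Table 2.3.

    import statsmodels.formula.api as smf

    # Units with all-zero outcomes carry no information and drop out of FEP anyway.
    df = df.groupby("id").filter(lambda g: g["y"].sum() > 0)
    df["x1_sq"], df["x2_sq"], df["x1_x2"] = df["x1"] ** 2, df["x2"] ** 2, df["x1"] * df["x2"]

    # Poisson QMLE with individual dummies gives the FEP slope estimates;
    # cluster-robust inference guards against serial correlation and non-Poisson variance.
    res = smf.poisson("y ~ x1 + x2 + x1_sq + x2_sq + x1_x2 + C(id)", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["id"]}, disp=0)
    print(res.wald_test("x1_sq = 0, x2_sq = 0, x1_x2 = 0"))

With a large cross-section the dummy-variable fit can be slow; the multinomial form in (2.2), which avoids estimating the c_i directly, is an alternative.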
2.5 Empirical application: the Patent-R&D relationship There is a long history of economic inquiry into the relationship between a firm’s research and development (R&D) expenditures and the number of patents for which it applies in a given year. Patent applications are viewed in the literature as an indicator of additions to the knowledge stock of a firm (Pakes and Griliches, 1980). Pakes and Griliches (1980) were among the first to focus on firm effects as a source of potential endogeneity in analyzing U.S. manufacturing firms. Hausman, Hall, and Griliches (1984) and Hall, Griliches, and Housman (1986) also look to firm effects to account for significant over-dispersion in the distribution of patent counts. In addition to FEP, 12 I have not yet varied the cross-section size. I would expect these rejection probabilities to increase. 56 Negative Binomial models are also common as a way to introduce more dispersion. Nonlinear count models are not only attractive for logical reasons, but also because datasets can contain a nontrivial proportion of observations with zero patents. These observations must be eliminated or transformed in some ad hoc manner before estimating a linear log-log model(Hall, Griliches, and Hausman, 1986). Such observations seem to be more common in more recent datasets as well. While only 8% of observations were zero in Hall, Hausman, and Griliches 1968-1975 panel of 121 firms, 16.5% were zero in Gurmu and Perez-Sebastian’s 1982-1992 panel of 391 firms (Gurmu and Pérez-Sebastián, 2008). A common finding in the literature is that distributed lag models that do not account for any firm heterogeneity tend to have a U-shaped lag profile, and that after accounting for firm heterogeneity, only contemporaneous R & D expenditure tends to be significant (Hall, Griliches, and Hausman, 1986). In a cross-sectional analysis of the pharmaceutical industry, Wang, Cockburn, and Puterman (1998) use a Poisson model and allow for heterogeneity in both the multiplicative effect and coefficients. While the mixing distribution is allowed to depend on the regressors, they assume that the vector of heterogeneity has finite support, which in their analysis consisted of three or fewer points. This framework may be less palatable in studies with broader industry coverage. The population of interest for this chapter is publicly-traded U.S. manufacturing firms in existence from 1996 to 2003. The patent data come from the United States Patent and Trademark Office by way of the National Bureau of Economic Research’s Patent Data Project (PDP) and includes data through 2006. As patents are not recorded in the USPTO database until they are granted, the panel is truncated in 2003 to diminish the effect of the time-lag between application and granting.13 Financial information on publicly-traded firms comes from the Compustat database, accessed through Wharton Research Data Services (WRDS) in September 2016. Hall, Jaffe, and Trajtenberg (2001) and Bessen (2009) thoroughly describe the patent data as well as matching information for the Compustat database. Matching patents to firms is not a trivial given 13 The average lag over applications made in 1990-92 was 1.76 years, with 96.1% of patents granted in three years or less. 57 nonstandard naming in USPTO records, among other issues. I mainly follow Bound, et. al (1982) and Hall, Griliches, and Hausman (1986) in assembling the panel dataset. The initial sample from the Compustat database consists of 3,126 firms in the U.S. 
manufacturing industry that were in existence in the year 2000. Following the literature, I require that data exist for patents and R&D expenditures for each year from 1996 to 2003, and that R&D expenditures be strictly positive since I take logs. I also eliminate firms that show large jumps in either gross capital or employment in a year. In the end, my sample consists of 848 firms over the period 1996-2003. I describe the selectivity of my sample in Tables 2.4 and 2.5. The tables show that although the sample covers only about a quarter of U.S. manufacturing firms in 2000, it covers nearly 70% of R&D expenditures. Coverage is generally poorer for smaller firms and higher for larger firms both in terms of net sales and R&D. Sample coverage is comparable to Hall, Griliches, and Hausman (1986) in terms of net sales, though they achieve 90% coverage of total R&D. Table 2.4: Distribution of Net Sales in 2000 Number in 2000 cross-section Net Sales All Pos. R&D Less than $1M 332 207 $1M-10M 439 335 $10M-100M 900 672 $100M-1B 986 588 $1B-10B 402 271 More than $10B 67 52 Total 3,126 2,125 Number in Sample 49 115 242 244 157 41 848 Coverage All Pos. R&D 0.15 0.24 0.26 0.34 0.27 0.36 0.25 0.41 0.39 0.58 0.61 0.79 0.27 0.40 Table 2.5: R& D Expenditures in 2000 Firm R&D (2000 USD) Less than $1M $1M-10M $10M-100M $100M-1B $1B-10B Total 2000 Cross-section Sample 170.15 55.32 3695.48 1492.38 21621.47 8765.10 38160.81 25075.92 67084.16 54007.14 130732.08 89395.85 58 Coverage 0.33 0.40 0.41 0.66 0.81 0.68 Table 2.6 shows summary statistics for the key variables over the sample of 848.14 Consistent with the literature, this shows the distribution of patents to be right-skewed and over-dispersed with a thick right tail. Also noteworthy is that compared to previous studies, my sample contains a much higher proportion of zeros than previous studies. Compared to either Hall, Griliches, and Hausman (1986) or Gurmu and Perez-Sebastian (2008), the median number of patents is lower, and the maximum number of patents is higher in this sample. 14 Note that firms with zero patents in all years drop from the multinomial log-likelihood. 59 Table 2.6: Summary of Key Variables in 2000 Variable Net Sales (Millions of USD) R&D (Millions of USD) Patents Fraction with zero patents Fraction in scientific sector Mean 2506.28 105.42 30.47 0.35 0.55 St.Dev. Min 1st Q. 12980.46 0.00 15.77 490.95 0.01 2.22 141.85 0.00 0.00 0.48 – – 0.50 – – Med. 118.73 7.53 2.00 – – 3rd Q. 877.54 31.71 7.00 – – All dollars amounts are real 2000 USD. The scientific sector is defined to include the drug, computer, electronic component, and scientific instrument industries. 60 Max 206083.00 6800.00 1811.00 – – I apply the exponential model introduced in Section 3 to patent counts where the regressors of interest are the logs of current R&D and up to three lags. I include year dummies, but assume their coefficients are constant. τ E [patentsit | log(Ri1 ), . . . , log(RiT ), δt , ci , b i ] = ci exp ∑ bi,s log(Ri,t−s) + δt , (2.61) s=0 where Rit is real R&D expenditures by firm i in year t. The CRC assumption is: α + γ log(R)i , Ω ), b i |(log(Ri,t−0 ), . . . , log(Ri,t−τ ), δt , ci ) ∼ Normal(α (2.62) T log(R ) is a scalar. Section 3 implies that FEP of patents on current where log(R)i = T −1 ∑t=1 it and lagged log(R) terms, interactions between log(R) and the log(R) terms, and squares and crossproducts of the log(R) terms will be consistent under these assumptions. 
Table 2.7: Results for traditional estimators VARIABLES log(R0 ) (1) PQML 1 (2) PQML 2 (3) FEOLS 1 (4) FEOLS 2 0.819*** (0.0441) 0.423** (0.191) 0.234*** (0.0637) 0.0845 (0.108) 0.0826 (0.203) 0.113*** (0.0198) 0.0476** 0.318*** (0.0205) (0.0682) 0.00784 (0.0192) 0.00777 (0.0180) -0.00789 (0.0204) -0.442*** (0.0301) 1.268*** (0.0765) 0.055 0.318*** (0.034) (0.0682) 4,240 5,968 848 746 0.137 log(R−1 ) log(R−2 ) log(R−3 ) Dum. for zero pat. Constant Sum of log(R) coeff. Observations Number of firms R-squared -0.211 (0.211) 0.819*** (0.0441) 6,784 848 -0.228 (0.214) 0.824*** (0.045) 4,240 848 -0.543*** (0.0261) 1.091*** (0.0440) 0.113*** (0.0198) 6,784 848 0.157 (5) FEP 1 (6) FEP 2 0.161*** (0.0560) 0.0158 (0.0378) -0.0250 (0.0710) -0.00236 (0.0546) 0.1495 (0.1096) 3,510 702 Clustered standard errors in parentheses. Year dummies included in all specifications. *** p<0.01, ** p<0.05, * p<0.1 Table 2.7 presents results from the six different specifications that assume constant coefficients. For all but columns (3) and (4), the dependent variable is the number of patents. Columns (1) and 61 (2) contains Poisson QMLE estimates where firm heterogeneity is ignored. Column (3) contains estimates from FE OLS where the dependent, variable is the log of patents. For this column only, zero patent counts are changed to 1, with a dummy variable added following Hall, Griliches, and Hausman (1986). Columns (5) and (6) contain FEP estimates. Consistent with the literature, these estimates imply that correlation between patents and current R&D is strongest relative to lag effects, and that the total elasticity of patents with respect to R&D that is less than unity. I also find the estimated elasticities fall once I account for firm effects. For the Poisson specification, the total elasticity falls from 0.82 to 0.32 in the one-lag model and from 0.82 to 0.15 in the three-lag model. The three-lag FEP specification implies an elasticity with respect to current R&D that is only about half of those estimated in previous studies, and this estimate is sensitive to the time dimension of the panel and lag-length chosen. If I mimic Gurmu and Perez-Sebastian (2008) and estimate a four-lag FEP model over 1982-1992, I get very similar results to theirs. It is possible that the nature of the patent-R&D relationship changed in the intervening decade, but it may also be that the exponential model is incorrect, our specification neglects some dynamics or endogeneity, or that sample selection has had a different effect on the more current data. Additionally, Section 3 and Section 5 imply that neglected slope heterogeneity could also be a source of bias in this model. Table 2.8 gives results from the CRC estimator proposed in this chapter, varying the lag length and assumptions about Ω . In columns (1) and (3), I impose that the b i are deterministic linear functions of log(R)i , while in column (4), I impose that Ω is diagonal. Given (2.61) and (2.62), these data do provide some evidence of slope heterogeneity. In the one-lag models, none of the additional terms are statistically significant. The evidence is mixed in the three-lag models. In column (3), the estimates of γ are jointly marginally significant (p = 0.08), with the interaction involving the second lag of log(R) negative and significant at the 5% level. In column (4), while all terms involving log(R) are jointly significant, the interactions and squares are not. 
In column (5), the interactions, squares, and cross-products are jointly marginally signifcant (p = 0.08). The terms associated with Ω are jointly insignificant, however, as are the interactions 62 Table 2.8: Results for CRC FEP estimators VARIABLES log(R0 ) (1) CRCFEP 1 (2) CRCFEP 2 (3) CRCFEP 3 (4) CRCFEP 4 (5) CRCFEP 5 0.538*** (0.144) 0.548*** (0.151) -0.0394 (0.0285) 0.165 (0.183) 0.115 (0.141) 0.0736 (0.0892) 0.444** (0.173) -0.0384 (0.149) 0.00850 (0.0248) -0.0103 (0.0167) -0.0844** (0.0368) 0.00672 (0.0284) 0.152 (0.133) 0.0604 (0.0951) 0.423*** (0.148) -0.00633 (0.142) -0.182 (0.224) -0.118 (0.195) -0.556** (0.258) -0.0775 (0.159) 0.0915 (0.108) 0.0569 (0.0978) 0.234** (0.118) 0.0404 (0.0735) 0.160 (0.141) 0.111 (0.0887) 0.360*** (0.121) 0.0205 (0.125) -0.215 (0.251) 0.0177 (0.294) -0.167 (0.313) -0.236 (0.262) 0.0921 (0.118) 0.102 (0.108) 0.309** (0.147) 0.117 (0.0854) -0.0986 (0.141) -0.0120 (0.177) 0.144 (0.176) -0.255 (0.183) 0.123 (0.129) -0.266** (0.118) log(R−1 ) log(R−2 ) log(R−3 ) log(R0 ) × log(R0 ) log(R−1 ) × log(R0 ) log(R−2 ) × log(R0 ) log(R−4 ) × log(R0 ) [log(R0 )]2 -0.102 (0.0892) [log(R−1 )]2 [log(R−2 )]2 [log(R−3 )]2 log(R0 ) × log(R−1 ) log(R0 ) × log(R−2 ) log(R0 ) × log(R−3 ) log(R−1 ) × log(R−2 ) log(R−1 ) × log(R−3 ) log(R−2 ) × log(R−3 ) Clustered standard errors in parentheses. Year dummies included in all specifications. *** p<0.01, ** p<0.05, * p<0.1 63 Table 2.9: CRCFEP 3 estimated elasticities Parameter β0 β−1 β−2 β−3 β0 + β−1 + β−2 + β−3 Estimate S.E. P-value 95% C.I. 0.134 0.051 0.257 -0.023 0.417 0.093 0.057 0.098 0.092 0.127 0.149 0.379 0.009 0.800 0.001 -0.048 0.315 -0.062 0.163 0.064 0.449 -0.205 0.158 0.169 0.666 β−τ = ατ+1 + γτ+1 log(R). Clustered S.E.’s ignore sampling error of log(R) with the time average. Therefore, while there is marginal evidence of heterogeneity, I cannot parse it into its components. Focusing on model (3), therefore, the results are quite interesting, at least at face value. The estimator for the average elasticity with respect to Rt−s is given by β−s = αs+1 + γs+1 log(R), (2.63) T where log(R) = (NT )−1 ∑N i=1 ∑t=1 log(Rit ). I give these estimates in Table 2.9. This implied lag profile for the average elasticity is different from that previously observed in the literature, where typically the contemporaneous elasticity accounts for most of the total and the lags are much smaller in magnitude and often statistically insignificant. Model (3) estimates imply, however, that the highest estimated average elasticity is with respect to the second lag of log(R), at 0.26 with a standard error of 0.098. Meanwhile, the contemporaneous and other lags are insignificantly different from zero. At face value, this seems to imply a delay in the benefit to R&D expenditures. Furthermore, the negative estimated coefficient on log(R−2 ) × log(R0 ) implies that the firms with larger R&D expenditures overall experience lower marginal returns. The correlation between log(R0 ) and the estimate of the multiplicative firm effect is 0.39, indicating that firms with a higher base rate of patenting tend to have lower marginal returns to R&D dollars, which echoes the findings of Wang, et. al. (1998) with regards to the pharmaceutical industry. Unfortunately, however, the results do not appear to be robust to changes in the estimation sample. If I construct a panel over 1994-2001, for instance, neither the lag-structure result or the finding of heterogeneous 64 slopes hold. 
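To make the calculation behind (2.63) and Table 2.9 explicit, the following is a small Python/NumPy sketch that forms the average elasticities and their standard errors from estimates of α and γ and their joint clustered covariance matrix. The argument names are placeholders rather than objects produced by any particular routine, and, as noted under Table 2.9, the sampling error in the grand mean of log(R) is ignored.

import numpy as np

def average_elasticities(alpha_hat, gamma_hat, V_hat, logR_grand_mean):
    # alpha_hat, gamma_hat: length-L coefficient vectors on the log(R) terms and on
    # their interactions with the firm time average of log(R); V_hat: their (2L x 2L)
    # clustered covariance matrix, ordered (alpha, gamma); logR_grand_mean: the grand
    # mean (NT)^{-1} * sum of log(R_it).
    beta = alpha_hat + gamma_hat * logR_grand_mean        # equation (2.63)
    L = len(alpha_hat)
    # each beta_{-s} is a linear combination of (alpha, gamma), treating the grand
    # mean as fixed, so a standard linear-combination variance applies
    G = np.hstack([np.eye(L), logR_grand_mean * np.eye(L)])
    se = np.sqrt(np.diag(G @ V_hat @ G.T))
    return beta, se

The total elasticity in the last row of Table 2.9 is the sum of these terms, with its variance given by the corresponding linear combination of the rows of G.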
Regarding the lack of robustness across sample periods, it may be that there is still a sample selection problem caused by not observing any patent applications made through 2003 if they were not granted before 2006.

2.6 Conclusion

FEP analysis of count or other nonnegative response variables cannot generally be justified in the presence of heterogeneous slopes and may not lead to estimation of any quantity of interest. Given this, I extend Chesher's (1984) testing framework to the FEP setting and show that an LM test for neglected heterogeneity amounts to adding squares of regressors to the set of covariates. This procedure is more widely applicable than classical tests. Simulation evidence also suggests that this approach is robust when the coefficients are neither normal nor belong to a location-scale family. Identification via a correlated random coefficients assumption leads to FEP on a more flexible mean function as an estimation method. Under a proportional variance assumption and a CRE assumption for the scalar multiplicative effect, normal QMLE is another technique, one which may have advantages in cases of binary or time-constant regressors. Each of these options feasibly allows for higher-dimensional random coefficients than estimators based on likelihoods with integrals, while also allowing for dependence between the heterogeneity and the regressors. Application of these methods to the U.S. manufacturing industry suggests that firms may have heterogeneous elasticities of patenting with respect to R&D, and that, in contrast to previous results, there may be a delay in the effect of R&D expenditures on patenting. These results, however, do not hold when estimating over different years of data. One immediate avenue for future research is to extend this type of correlated random coefficients model to cases where the regressors are not strictly exogenous, either because of feedback, contemporaneous endogeneity, or sample selection, as a way to explore the robustness of these findings.

CHAPTER 3
ESTIMATION OF AVERAGE MARGINAL EFFECTS IN MULTIPLICATIVE UNOBSERVED EFFECTS PANEL MODELS

3.1 Introduction and Review

Nonlinear models often make logical sense for representing limited dependent variables like discrete choices and counts. Challenges can arise, however, in micro-econometric panel settings when one wishes to control for unobserved individual heterogeneity and has relatively few time periods of data. For static multiplicative effects models with strictly exogenous covariates, fixed effects Poisson (FEP) consistently estimates the parameters of a correctly-specified conditional mean function (Wooldridge, 1999). Researchers may also want to estimate quantities like Average Partial Effects (APE) and Average Treatment Effects (ATE), but as these depend on the unobserved heterogeneity, it is not immediately clear how to proceed.

I study an approach that estimates APE and ATE by combining FEP parameter estimates with estimates of the individual heterogeneity. The latter come from unconditional Poisson QMLE treating the heterogeneity as parameters to be estimated, a procedure that yields estimates of the conditional mean function parameters that are algebraically equivalent to FEP.1 While easy to implement, such APE and ATE estimates potentially suffer from the incidental parameters problem (IPP) since the individual effect estimates are based on only T observations (Lancaster, 2000). However, I show that in multiplicative models, such APE and ATE estimators are consistent and asymptotically normal with only the cross-sectional dimension growing.
The consistency result may not be surprising, but it is not implied by consistency of FEP for the slope coefficients, and similar results do not hold for other nonlinear models. For instance, the IPP still biases APE estimates in fixed effects binary response models even if one knows the true values of the slope parameters or can estimate them consistently (Fernandez-Val, 2009).

1 This result was derived independently by Lancaster (2002), and a version of it appears in Blundell et al. (2002).

To my knowledge, estimating APE and ATE using estimated incidental parameters has not been studied in multiplicative models specifically. Many authors have studied consistent slope parameter and marginal effect estimation using estimated incidental parameters in either general nonlinear models or in other specific settings. One solution is to employ bias corrections that are justified by large-T asymptotics. See, for example, Hahn and Newey (2004) for general nonlinear models estimated with unconditional MLE, or Fernandez-Val (2009) for the unobserved effects probit model. Although allowed to be much smaller than the number of individuals, the number of time periods needs to be sufficiently large for the asymptotic approximation of the bias to perform well. For static probit and logit models, Fernandez-Val (2009), Greene (2004), and others have noted a "small bias" property for APE and ATE estimates from unconditional MLE. The multiplicative case, however, is special in that the average marginal effects estimators are actually consistent with only the cross-section size growing, a rare result outside of the linear model. This means they should perform well even with only two time periods.

Empirical researchers, of course, also have the option to focus on quantities that do not depend on the unobserved heterogeneity. For instance, the exponential conditional mean function with a linear index gives the slope coefficients interpretations as semi-elasticities, and proportional treatment effects are also identified (M. Lee and Kobayashi, 2001). Another possibility is to make additional assumptions. For example, one could use a correlated random effects (CRE) approach by assuming a parametric form for the mean of the heterogeneity conditional on the explanatory variables. This is applicable in many nonlinear settings to estimate slope parameters as well as average partial effects (Wooldridge, 2010). Using estimated heterogeneity, however, avoids additional restrictions and allows the researcher to estimate average marginal effects in levels, which may be more meaningful than slope parameters and allows comparisons across models.

The rest of this chapter is organized as follows: Section 2 describes the multiplicative model and derives the asymptotic properties of the APE and ATE estimators that use estimated heterogeneity. I also discuss some interesting implications of using these estimators in exponential models. Section 3 evaluates the proposed estimators via Monte Carlo, and Section 4 concludes. Simulation tables are collected in Appendix D.

3.2 Theory

The multiplicative unobserved effects panel model assumes that for i = 1, . . . , N and t = 1, . . . , T,

E(y_{it} | x_i, c_i) = E(y_{it} | x_{it}, c_i) = c_i m(x_{it}, β_0),   (3.1)

where m(x_{it}, β_0) is a known, positive, continuous, differentiable function of a 1 × K vector of explanatory variables x_{it} and an unknown K × 1 parameter vector β_0. The term c_i is unobserved heterogeneity that is assumed to be strictly positive.
Equation (3.1) implicitly assumes that x_{it} is strictly exogenous, conditional on c_i. I assume that the vector {y_{i1}, . . . , y_{iT}, x_{i1}, . . . , x_{iT}, c_i} is independent and identically distributed across i, and that T is fixed. A common choice in the empirical literature is m(x_{it}, β) = exp(x_{it} β), but other forms are possible, and the responses need not even be counts. For example, under the restriction that 0 < c_i < 1, y_{it} could be binary or fractional, in which case m(x_{it}, β) might be the logistic or normal cumulative distribution function. Another option for nonnegative responses is a panel version of Wooldridge's (1992) alternative to the Box-Cox transformation. In this case, with β = (θ, λ), the specification would be:

m(x_{it}, β) = [1 + λ x_{it} θ]^{1/λ} if λ ≠ 0, and m(x_{it}, β) = exp(x_{it} θ) if λ = 0.   (3.2)

The parameters are perhaps less interesting in these examples than in the exponential case, motivating the estimation of marginal effects. While most of the derivations in this section are for a generic m(x_{it}, β), I include a discussion of the exponential case at the end of this section.

Hausman, Hall, and Griliches (1984) showed that if, conditional on x_i = {x_{i1}, . . . , x_{iT}} and c_i, the y_{it} are independently distributed as Poisson with mean given by (3.1), then conditioning on n_i ≡ Σ_{t=1}^{T} y_{it} results in a multinomial distribution for {y_{i1}, . . . , y_{iT}}. The resulting fixed effects Poisson (FEP) estimator is given by:

β̂ = argmax_β Σ_{i=1}^{N} ℓ_i(β),   (3.3)

ℓ_i(β) = Σ_{t=1}^{T} y_{it} log[ m(x_{it}, β) / Σ_{r=1}^{T} m(x_{ir}, β) ].   (3.4)

Wooldridge (1999) showed that β̂ is consistent for β_0 under (3.1) only, making it a quasi conditional maximum likelihood estimator (QCMLE). Standard asymptotic theory for M-estimators yields that, under regularity conditions,

√N (β̂ − β_0) →d Normal(0, A_0^{-1} B_0 A_0^{-1}),   (3.5)

where A_0 = −E[∇²_β ℓ_i(β_0)], B_0 = Var[s_i(β_0)], and s_i(β_0) = ∇_β ℓ_i(β_0)′. The sandwich form of the asymptotic variance estimator should be used to account for the fact that, without the stronger assumptions of Hausman et al., ℓ_i(β) is not the true log-likelihood for individual i.

Researchers are often interested in estimating marginal effects, as the β_j may not have a meaningful interpretation outside of the exponential case. I define the APE of a continuous variable x_j as:

δ_{j,0} = E[ T^{-1} Σ_{t=1}^{T} ∂E(y_{it} | x_{it}, c_i)/∂x_{itj} ] = E[ c_i T^{-1} Σ_{t=1}^{T} ∂m(x_{it}, β_0)/∂x_{itj} ] ≡ E[ c_i T^{-1} Σ_{t=1}^{T} M_j(x_{it}, β_0) ],   (3.6)

where M_j(x_{it}, β) = ∂m(x_{it}, β)/∂x_{itj}. I define the ATE for a binary x_k as:

δ_{k,0} = E[ E(y_{it} | x_{it(−k)}, x_{itk} = 1, c_i) − E(y_{it} | x_{it(−k)}, x_{itk} = 0, c_i) ] = E[ c_i T^{-1} Σ_{t=1}^{T} ( m(x_{it(−k)}, 1, β_0) − m(x_{it(−k)}, 0, β_0) ) ],   (3.7)

where the subscript (−k) indicates that element k has been omitted, and where m(x_{it(−k)}, 1, β) and m(x_{it(−k)}, 0, β) correspond to a 1 or 0 being inserted for x_{itk} in m(x_{it}, β).

Both of these quantities depend on c_i, and so an additional assumption (e.g., correlated random effects) would seem necessary to proceed. However, unconditional QMLE that treats the c_i as additional parameters offers estimates of β_0 algebraically equivalent to FEP, as well as a closed-form estimate of c_i. The formula is:

ĉ(w_i, β̂) = Σ_{t=1}^{T} y_{it} / Σ_{t=1}^{T} m(x_{it}, β̂) ≡ ĉ_i,   (3.8)

where w_i ≡ {y_{i1}, . . . , y_{iT}, x_{i1}, . . . , x_{iT}}.
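To make the objects just defined concrete, the following is a minimal Python (NumPy/SciPy) sketch of FEP estimation with an exponential mean, the closed-form heterogeneity estimates in (3.8), and plug-in average effects of the kind formalized in (3.10) and (3.11) below. The array layout (an N × T outcome matrix and an N × T × K regressor array) and all function names are illustrative assumptions, not the code used for the simulations in this chapter.

import numpy as np
from scipy.optimize import minimize

def fep_negloglik(beta, y, X):
    # Negative FEP quasi-log-likelihood (3.3)-(3.4) with m(x, b) = exp(x b).
    # y: (N, T) nonnegative outcomes; X: (N, T, K) strictly exogenous regressors.
    m = np.exp(X @ beta)                      # m(x_it, beta), shape (N, T)
    p = m / m.sum(axis=1, keepdims=True)      # within-i shares: c_i cancels here
    return -np.sum(y * np.log(p))

def fep_fit(y, X):
    K = X.shape[2]
    res = minimize(fep_negloglik, np.zeros(K), args=(y, X), method="BFGS")
    beta_hat = res.x
    m_hat = np.exp(X @ beta_hat)
    c_hat = y.sum(axis=1) / m_hat.sum(axis=1)  # closed form (3.8); zero if all y_it = 0
    return beta_hat, c_hat

def ape_continuous(beta_hat, c_hat, X, j):
    # sample analog of (3.6): average of c_hat_i * dm(x_it, beta_hat)/dx_j over i and t
    m = np.exp(X @ beta_hat)
    return np.mean(c_hat[:, None] * m * beta_hat[j])

def ate_binary(beta_hat, c_hat, X, k):
    # sample analog of (3.7): average of c_hat_i * [m(x_it(-k), 1, b) - m(x_it(-k), 0, b)]
    X1, X0 = X.copy(), X.copy()
    X1[:, :, k], X0[:, :, k] = 1.0, 0.0
    diff = np.exp(X1 @ beta_hat) - np.exp(X0 @ beta_hat)
    return np.mean(c_hat[:, None] * diff)

Note that the averages divide by the full cross-section size, including individuals whose outcomes are zero in every period (for whom ĉ_i = 0), in line with the discussion of dropped observations later in this section.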
The analysis to follow hinges on studying the properties of the random function ĉ(w_i, ·), which I write for a generic β as:

ĉ(w_i, β) ≡ Σ_{t=1}^{T} y_{it} / Σ_{t=1}^{T} m(x_{it}, β).   (3.9)

There is a practical reason to estimate β_0 using FEP instead of unconditional QMLE (i.e., including N individual dummies in the exponential model). As pointed out by Cameron and Trivedi (2013), the econometrician may encounter computational or software limitations for large values of N. It is easier to simply calculate ĉ_i following FEP estimation. The APE and ATE estimators I investigate are:

δ̂_j = (NT)^{-1} Σ_{i=1}^{N} Σ_{t=1}^{T} ĉ_i M_j(x_{it}, β̂),   (3.10)

δ̂_k = (NT)^{-1} Σ_{i=1}^{N} Σ_{t=1}^{T} ĉ_i [ m(x_{it(−k)}, 1, β̂) − m(x_{it(−k)}, 0, β̂) ].   (3.11)

Clearly ĉ(w_i, β) ≠ c_i, even if evaluated at β_0, and with only N growing, ĉ_i cannot be consistent for c_i (under the view that c_i is one of N individual-specific parameters).2 One should not generally expect marginal effects calculated from estimated incidental parameters to be consistent in nonlinear models, even if the slope parameter estimates are consistent. However, some sample averages involving ĉ_i are consistent in the FEP case due to the form of ĉ(w_i, β) and the fact that c_i and m(x_{it}, β_0) are multiplicatively separable.

2 Cameron and Trivedi (2013) assert ĉ_i →p c_i as T → ∞, which is true if {y_{it}} and {m(x_{it}, β_0)} are ergodic for the mean.

Theorem 1 Suppose λ̂ ≡ N^{-1} Σ_{i=1}^{N} ĉ(w_i, β̂) h(x_i, β̂) is an estimator of λ_0 ≡ E[c_i h(x_i, β_0)]. Assume that (3.1) holds and that each element of the P × 1 random vector g(w_i, β) ≡ ĉ(w_i, β) h(x_i, β) satisfies the regularity conditions on q(w_i, β) from Theorem 12.2 of Wooldridge (2010). Then λ̂ →p λ_0.

Proof. Since β̂ →p β_0, N^{-1} Σ_{i=1}^{N} ĉ(w_i, β̂) h(x_i, β̂) →p E[ĉ(w_i, β_0) h(x_i, β_0)] by Lemma 12.1 in Wooldridge (2010). Furthermore, by the L.I.E.,

E[ĉ(w_i, β_0) h(x_i, β_0)] = E{ E[ĉ(w_i, β_0) h(x_i, β_0) | x_i, c_i] }
  = E[ ( Σ_{t=1}^{T} E(y_{it} | x_i, c_i) / Σ_{t=1}^{T} m(x_{it}, β_0) ) h(x_i, β_0) ]
  = E[ ( c_i Σ_{t=1}^{T} m(x_{it}, β_0) / Σ_{t=1}^{T} m(x_{it}, β_0) ) h(x_i, β_0) ]
  = E[ c_i h(x_i, β_0) ].   (3.12)

Consistency of N^{-1} Σ_{i=1}^{N} ĉ_i for E(c_i) follows from setting h(x_i, β) = 1, while consistency of δ̂_j and δ̂_k follows from setting either

h(x_i, β) = T^{-1} Σ_{t=1}^{T} M_j(x_{it}, β)   (3.13)

or

h(x_i, β) = T^{-1} Σ_{t=1}^{T} [ m(x_{it(−k)}, 1, β) − m(x_{it(−k)}, 0, β) ].   (3.14)

Theorem (1) shows that, unlike with other nonlinear fixed effects estimators, no bias correction is necessary to estimate the APE and ATE in this setting. One might expect, a priori, that δ̂_j and δ̂_k would perform well anyway as T grows and ĉ_i better approximates c_i. Nevertheless, Theorem (1) holds for an arbitrary T, so δ̂_j and δ̂_k should perform well even in panels with only two time periods (the minimum needed for FEP). The APE and ATE I consider are just two of many possible quantities of interest. Researchers might also want to know the average marginal effect for a specific time period, or for a specific subpopulation defined by the observables (e.g., the Average Treatment Effect on the Treated). One might also want to estimate the partial effect evaluated at the averages of the heterogeneity and covariates. As long as c_i multiplies the relevant function of the data, one need not worry about the difference between it and ĉ_i when averaging over the cross-section. As a caution, one cannot use ĉ_i to learn about other features of the distribution of c_i except in more restrictive cases.
For instance, Var(ci ) is identified only under additional assumptions about Var(yyi |xxi , ci ). A simple example is when the Poisson variance assumption, Var(yit |xxi , ci ) = E(yit |xxi , ci ), and zero conditional covariance, Cov(yit , yir |xxi , ci ) = 0,t = r, both hold. In this case, T m(x wi , β 0 )] − E ci / ∑t=1 xit , β 0 ) . one can show that Var(ci ) = Var [c(w The asymptotic variance of λ can be derived similarly to the delta method, but making sure to account for the randomness in w i . 3 Theorem 2 Under the assumptions in Theorem (1), √ d N(λ − λ 0 ) → N(00, D 0 ), where wi , β 0 ) − λ 0 − G 0 A −1 β 0) , D 0 = Var g (w 0 s i (β wi , β 0 ) = E c(w wi , β 0 )∇β h (xxi , β 0 ) + h (xxi , β 0 )∇β c(w wi , β 0 ) , G 0 = E ∇β g (w wi , β ) = −c(w wi , β ) ∇β c(w T ∇ m(x xit , β ) ∑t=1 β T m(x xit , β ) ∑t=1 , ∇β h (xxi , β ) is the P × K Jacobian of h (xxi , β ), and ∇β m(xxit , β ) is the 1 × K gradient of m(xxit , β ). ¨ i as the P × K Jacobian of g (w wi , β ) evaluated at different mean values between β Proof. Define G 3 The derivation here is essentially the same as the solution to Wooldridge (2010), Problem 12.17. 72 and β 0 . By a mean value expansion of each element of N N √ wi , β ) around β 0 , N λ = N −1/2 ∑N i=1 g (w N ¨i wi , β ) =N −1/2 ∑ g (w wi , β 0 ) + N −1 ∑ G N −1/2 ∑ g (w i=1 i=1 N √ N β −β0 (3.15) i=1 √ wi , β 0 ) + G 0 N β − β 0 + o p (1) =N −1/2 ∑ g (w i=1 N N i=1 i=1 (3.16) β 0 ) + o p (1). wi , β 0 ) − N −1/2 ∑ G 0 A −1 =N −1/2 ∑ g (w 0 s i (β (3.17) ¨ p The second equality follows because consistency of β implies N −1 ∑N i=1 G i → G 0 and because √ √ −1 β ) + N β − β 0 = O p (1). The third follows because N β − β 0 = −N −1/2 ∑N 0 i=1 A 0 s i (β o p (1). Therefore, N √ wi , β 0 ) − λ 0 − G 0 A −1 β 0 ) + o p (1) N λ − λ 0 = N −1/2 ∑ g (w 0 s i (β (3.18) i=1 By the Asymptotic Equivalence Lemma, the limiting distribution of √ N λ − λ 0 is the same as wi , β 0 ) − λ 0 − G 0 A −1 β 0 ) , which is easily shown to be the scaled sample avN −1/2 ∑N i=1 g (w 0 s i (β erage of a mean-zero random vector. Therefore, by the Central Limit Theorem for i.i.d. sequences, the result follows. Applying Theorem (2) for the APE of a continuous covariate x j : √ d N δ j − δ j,0 → N(0, D j,0 ), D j,0 = Var (3.19) T T −1 β 0) ∑ c(wwi, β 0)M j (xxit , β 0) − δ j,0 − G j,0 A−1 0 si (β , (3.20) t=1 T wi , β 0 )(T −1 ) ∑ G j,0 = E c(w ∇β M j (xxit , β 0 ) − M j (xxit , β 0 ) t=1 73 T ∇ m(x xit , β 0 ) ∑t=1 β T m(x xit , β 0 ) ∑t=1 (3.21) For the ATE of the binary covariate xk : √ N δk − δk, d 0 Dk, = Var T −1 0 T ∑ c(wwi, β 0) t=1 → N(0, Dk, ), (3.22) 0 β ) , s (β m(xxit(−k) , 1, β 0 ) − m(xxit(−k) , 0, β 0 − δk, − Gk, A −1 0 0 0 i 0 (3.23) T wi , β 0 )(T −1 ) ∑ Gk, = E c(w 0 ∇β mit (1) − ∇β mit (0) − (mit (1) − mit (0)) t=1 T ∇ m ∑t=1 β it T m ∑t=1 it , (3.24) where mit = m(xxit , β 0 ), mit (1) = m(xxit(−k) , 1, β 0 ), and mit (0) = m(xxit(−k) , 0, β 0 ). These asymptotic variances can be consistently estimated from the above expressions by plugging in β for β 0 and forming the sample analogs to the expectation and variance operators. 3.2.1 Exponential Models Since it is a common specification in empirical research, I include a few observations about the exponential conditional mean case. The form of the quasi log-likelihood means that one can estimate coefficients on time-varying x it only. Nevertheless, δ j,0 and δk,0 are still identified when the conditional mean function is exponential and includes time-constant observables. 
To see this, suppose the following: E(yit |xxit , z i , vi ) = vi exp(xxit β 0 + z i γ 0 ), (3.25) where now I use vi to denote the unobserved heterogeneity. Define ci = vi exp(zzi γ 0 ). Then clearly E(yit |xxit , ci ) = ci exp(xxit β 0 ). (3.26) The heterogeneity has absorbed the time-constant observables. Theorems (1) and (2) still hold, but the function ci now serves as a stand-in for the total contribution from all time-constant variables— observed and unobserved. Analogous to the linear case, γ 0 is not identified, nor are the average partial effects of the z i , but given consistent estimates of β 0 , one can still consistently estimate the average partial effects of the time-varying regressors. 74 One alternative estimand studied by Lee and Kobayashi (2001) is the proportional treatment effect, which for a binary treatment and the simple index in (3.25) is: 4 ξk ≡ E(yit , x it(−k) , xitk = 1, z i , vi ) E(yit , xit(−k) , xitk = 0, zi , vi ) − 1 = exp(βk ) − 1 (3.27) Of course, ξk may interesting in its own right, but my analysis shows that estimating the ATE in levels using (3.11) is another option, even when time-constant regressors belong in the model. Furthermore, APE of a continuous variable simplifies in the exponential conditional mean case. δ j,0 =E T −1 =E T −1 = T −1 T ∑ ci exp(xxit β ) β j,0 (3.28) ∑ E(yit |xxit , ci) β j,0 (3.29) t=1 T t=1 T ∑ E(yit ) t=1 β j,0 , (3.30) where the last equality is by the L.I.E. Here, the population scale factor is analogous to the crosssection case and doesn’t depend on the heterogeneity. Moreover, an estimator that treats ci as the unknown ci is equivalent to the sample analog of (3.30). N δ j = (NT )−1 ∑ T N ∑ ci exp(xxit β ) i=1 t=1 β j = (NT )−1 ∑ T ∑ yit βj (3.31) i=1 t=1 Consistency of δ j for δ j is immediate given a consistent estimator of β j,0 . Since δ j does not depend on ci , one could even estimate β 0 without assuming strict exogeneity of x it , using the GMM approach of either Chamberlain (1992) or Wooldridge (1997) based on sequential moment restrictions. The asymptotic variance is simpler as well: Avar √ β 0 )), N(δ j − δ j,0 ) = Var(y¯i β j − δ j − µyT r j A−1 0 s i (β 4 Lee (3.32) and Kobayashi’s model includes multi-valued treatment as well as interactions between the treatment and covariates, so the proportional treatment effect depends on x it and z i , but only involves coefficients on time-varying regressors and interactions. 75 T y ), and r is a 1 × K-vector with jth element equal to 1 and all other where µyT ≡ E(T −1 ∑t=1 it j elements equal to 0. The expression is similar if GMM is used to estimate β 0 . 3.2.2 A note about dropped observations If the dependent variable for an observation l is zero in each time period, then observation l contributes nothing to the quasi log-likelihood, as can be seen in equation (3.4). Clearly, the terms in wl , β ) = 0. δ j and δk corresponding to observation l’s contribution are then equal to zero, since c(w Nevertheless, if interested in an APE or ATE with respect to the entire population of interest, the sample size N in the formulas for δ j and δk should correspond to the number of individuals in the entire the cross-section, not the number of individuals in the estimation sample (that is, with ni > 0). Otherwise, the estimates will be conditional on this particular subsection of the population and be inflated by a factor of N/N p , where N p = ∑N i=1 1 [ni > 0]. 3.3 3.3.1 Monte Carlo Design I employ the following data generating process. For i = 1, . . . , N and t = 1, . . 
. , T : yit |(xxi , d i , ci ) ∼ Poisson [ci exp(β1 xit + β2 dit )] , (3.33) log(ci ) ∼ Normal(0, σ 2 ) (3.34) xit = log(ci ) + ρxi,t−1 + vit , t > 1 (3.35) xi1 = log(ci )/(1 − ρ) + vi1 / 1 − ρ 2 , vit ∼ N(0, 1/2), (3.36) ρ = 0.3 − 0.5σ (3.37) dit = 1 [xit + log(ci ) + hit > 0] , hit ∼ N(0, 1/2) (3.38) 76 I study panels of dimensions N ∈ {500, 1000, 2000} and T ∈ {2, 4, 10}. The conditional marginal distribution of yt is Poisson with an exponential mean function. I set β1 = 0.5 and β2 = −0.5. I vary the degree of heterogeneity, with σ ∈ {0, 0.25.0.5, 0.75, 1}. The continuous covariate xt and the binary covariate dt are both correlated with the heterogeneity, and the strength of the correlation increases with σ . The scaling of xi1 is intended to keep Var(xt ) constant across the different T .5 That the autoregressive parameter in the equation for xt depends on σ is an attempt to keep the autocovariance structure of xt more consistent as σ increases. I estimate β1 and β2 using FEP, and employ the APE and ATE estimators proposed in equations (3.10) and (3.11). In the tables to follow, FEP estimates are denoted with a “ ”. For reference, I also estimate the slopes, APE, and ATE using pooled Poisson QMLE, which ignores ci entirely. These estimates are denoted with a “ ”. Both FEP and Poisson QMLE are consistent when σ = 0, but only FEP is consistent when σ > 0. Reporting the results for Poisson QMLE is intended to give the reader a sense of how large a problem neglected heterogeneity causes under this particular DGP. For each estimator and parameter combination, I report the mean and standard deviation of the empirical distribution, the estimated bias, the ratio of the mean standard error to the empirical standard deviation, and the probability of rejecting a true null hypothesis at the 5 percent significance level. I use cluster robust asymptotic standard errors with the slope estimates, though they are technically not necessary with this DGP. For the APE and ATE estimates, I use the “unconditional” asymptotic standard errors derived in this chapter for the FEP case, as well as the analogous versions for Poisson QMLE. For each parameter combination, I draw 2000 replications. 3.3.2 Results Full tables of simulation results can be found in Appendix D. I focus attention on the APE and ATE estimates, though the slope estimates are included for reference. As expected, across all values of N and T , there is virtually no finite sample bias in the Poisson QMLE and the FEP estimates in the absence of heterogeneity (σ = 0). In the presence of heterogeneity, however, Poisson QMLE slopes 5 See Vamos, Soltuz, and Craciun 2007. 77 and APEs are biased. As heterogeneity increases, bias increases, and the probability of rejecting a true null hypothesis quickly approaches one. Therefore, this DGP succeeds in simulating settings where controlling for individual effects is important. Finite sample bias in δ1 is less than 0.01 for all values of N, T , and σ , which is not surprising given that in the exponential case, the APE scale factor does not even depend on ci . Some ATE estimates at higher levels of σ are slightly biased away from zero when the panel is short and the sample is smaller. For instance, when N = 500 and T = 2, finite sample bias is between 2 and 2.5 percent of the true value when σ ≥ 0.5. However, the magnitudes of these biases decrease to 1 − 1.5 percent when N = 1000 and 0 − 1 percent when N = 2000. 
In the T = 4 and T = 10 cases, the finite sample bias is less than 1 percent and quite small in the larger cross-sections. The finite sample standard deviations behave in predictable ways, decreasing as either N or T increases. The variability in δ2 seems to be greater than that of δ1 , and the spread between them increases with σ , which might be related to the fact that δ1 does not actually use ci in the exponential case. The standard errors derived in this chapter perform reasonably well, particularly with the largest cross-section, where at worst, their empirical mean underestimates the empirical standard deviation by about 4 percent. This occurs for the standard error of δ1 in the T = 4, σ = 1 case, where as a point of comparison, the mean standard error for β1 also underestimates the finite sample standard deviation of β1 by a similar amount. For the most part, the results suggest the approximations get better as N increases, though simulating more replications may be necessary to reduce sampling error. When σ is high, the apparent underestimation by the standard errors leads to slight over-rejection by about one or two percentage points, but larger N also mitigates this problem. Overall, these simulations support this chapter’s theoretical findings. The asymptotic properties derived in Section 2 for the APE and ATE estimators that use estimated incidental parameters seem to approximate their finite sample behavior very well. 78 3.4 Conclusion It is already well-known that in static multiplicative panel models under strict exogeneity, estimating the heterogeneity still leads to consistent estimation of the parameters of a correctly-specified conditional mean function. This chapter adds the result that APE and ATE estimators that use es√ timated heterogeneity are also consistent and N-asymptotically normal with T fixed. In fact, the results hold for estimating the mean of a wider class of random quantities where the heterogeneity is multiplicatively separable from functions of the data. I derive asymptotic standard errors for these estimators that perform well in simulations for a leading case in empirical research. One area for future research would be to use higher order expansions to derive standard errors that better approximate the standard deviation of the sampling distribution. 79 APPENDICES 80 APPENDIX A ANALYTICAL BIAS CORRECTION EXPRESSIONS FROM CHAPTER 1 From Hahn and Newey (2004), and Fernandez-Val (2009), the one-step bias corrected estimator is formed as θbc = θ − B(θ )/T, (A.1) where B(θ ) = I (θ )−1 b(θ ). Here θ denotes a generic coefficient vector, and θ is the uncorrected MLE. A.1 Hahn and Newey’s bias correction for M-estimators With strictly exogenous regressors x it : N I (θ ) = − (NT )−1 ∑ T T T ∑ [uitθ (θ ) − uitα (θ )] ∑ vitθ (θ ) / ∑ vitα (θ ) (A.2) uitα (θ ) βi (θ ) + ψit (θ ) + uitαα σi2 (θ )/2 , (A.3) i=1 t=1 N b(θ ) = (NT )−1 ∑ T ∑ t=1 t=1 i=1 t=1 where −1 T T βi (θ ) = − ∑ visα (θ ) s=1 ∑ visα (θ )ψit (θ ) + visαα (θ )σi2 (θ )/2 , (A.4) s=1 T σi2 (θ ) = T −1 ∑ ψit (θ )2. (A.5) s=1 In these expressions, uit (θ ) and vit (θ ) are derivatives of the log-likelihood with respect to θ and T αi , respectively, evaluated at αi = αi (θ ) = arg max ∑t=1 it (θ , αi )/T . Partial derivatives of uit (θ ) α and vit (θ ) are denoted by the θ and α subscripts. The terms ψit (θ ), σi2 (θ ), βi (θ ) are estimators for the influence function, asymptotic variance, and higher order asymptotic bias, respectively, of αi (θ ) as T grows. 
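For readers who wish to apply (A.1) directly, the following is a minimal Python/NumPy sketch of the one-step correction once the quantities in (A.2)-(A.5) have been computed; the inputs are assumed to already be evaluated at the uncorrected MLE, and the function name is an illustrative placeholder.

import numpy as np

def one_step_bias_correction(theta_hat, I_hat, b_hat, T):
    # Equation (A.1): theta_bc = theta_hat - B(theta_hat)/T, with B(theta) = I(theta)^{-1} b(theta).
    # theta_hat: uncorrected MLE (length-K vector); I_hat: K x K matrix as in (A.2);
    # b_hat: length-K vector as in (A.3); T: number of time periods.
    B = np.linalg.solve(I_hat, b_hat)
    return theta_hat - B / T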
81 A.2 Fernandez-Val’s bias correction based on conditional expectations Fernandez-Val (2009) simplifies the Hahn and Newey (2004) corrections by taking expectations conditional on {xxi , αi } and using the Law of Iterated Expectations. For static probit models with strictly exogenous regressors, I (θ ) = N −1 N ∑ T −1 i=1 b(θ ) = N −1 T ∑ Git (θ )xxit xit − T −1 ∑ −T −1 i=1 T ∑ Git (θ )xxit ηi (θ ) + t=1 T −1 T T −1 ∑ Git (θ )xxit t=1 t=1 N T σi2 ∑ Git (θ )xxit (A.6) t=1 T ∑ Git (θ )λi(θ )xxit σi2 /2 , t=1 where [φ (αi (θ ) + x it θ )]2 Git (θ ) = , σ2 = T Φ(αi (θ ) + x it θ )[1 − Φ(αi (θ ) + x it θ )] i ηi (θ ) = (1/2) T −1 T ∑ λit (θ )Git (θ ) T ∑ Git (θ ) −1 , (A.7) t=1 σi4 , (A.8) t=1 T λit (θ ) = αi (θ ) + x it θ , and αi (θ ) = arg max ∑ it (θ , αi )/T α (A.9) t=1 A.3 Average Partial Effects As in equation (14) of Section II, we define the function m(β , γ, α, x it ) as the partial effect of wit on the probability that yit = 1 for w ∈ {x, d}. Using one of the analytical bias-corrected slope estimators, θbc and αbc = αi (θbc ), the bias-corrected estimator for the average partial effect is µw,bc = (NT )−1 N T ∑ ∑ mw(βbc, γbc, αbc, x it ) − ∆/T. i=1 t=1 82 (A.10) Using Hahn and Newey’s method: N ∆ = (NT )−1 ∑ T ∑ mα βi + (1/2)mαα σ 2 , (A.11) i=1 t=1 where βit = βit (θbc ) σi2 = σi2 (θbc ), and mα and mαα denote partial derivatives with respect to α, evaluated at θbc and αbc . Using Fernandez-Val’s method:  T T N  T −1 ∑ mα ηi + (1/2) T −1 ∑ mαα ∆ = N −1 ∑  t=1 t=1 i=1 where λit = λit (θbc ), ηi = ηi (θbc ) and Git = Git (θbc ). 83 T −1 T ∑ Git t=1  −1   (A.12) APPENDIX B SIMULATION RESULTS FOR BIAS CORRECTIONS ON A LARGER CROSS-SECTION Table B.1: Probit Slope Estimates when N = 500, T = 6 ρ = 0.0 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE ρ = 0.4 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE ρ = 0.8 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE β (true value = 1) SE Mean SD cv: .95 SD γ (true value = 1) SE Mean SD cv: .95 SD 1.33 0.95 1.15 0.92 0.90 0.99 0.10 0.06 0.09 0.13 0.07 0.06 0.08 0.92 0.57 0.61 0.68 0.94 0.99 1.13 0.94 0.53 0.98 0.98 1.32 0.97 1.15 0.94 0.92 0.99 0.10 0.07 0.09 0.13 0.07 0.07 0.15 0.99 0.71 0.79 0.91 0.96 1.06 1.34 1.09 0.72 1.26 1.06 1.51 1.02 1.32 0.85 1.02 0.99 0.12 0.06 0.11 0.19 0.08 0.06 0.00 0.98 0.09 0.41 0.92 0.94 0.99 1.17 0.92 0.36 0.90 1.01 1.51 1.05 1.32 0.87 1.03 0.99 0.12 0.07 0.11 0.18 0.08 0.07 0.01 0.99 0.16 0.58 0.97 0.95 1.06 1.41 1.04 0.50 1.14 1.05 2.36 0.79 2.12 0.94 1.60 0.99 0.20 0.21 0.19 0.47 0.17 0.06 0.00 0.41 0.00 0.26 0.00 0.94 1.00 0.32 0.87 0.17 0.62 1.00 2.37 0.76 2.14 0.71 1.59 0.99 0.22 0.24 0.21 1.09 0.18 0.07 0.00 0.50 0.00 0.32 0.01 0.94 0.99 0.40 0.86 0.10 0.66 1.00 84 Table B.2: Probit APE Estimates when N = 500, T = 6 ρ = 0.0 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE LPM ρ = 0.4 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE LPM ρ = 0.8 MLE A-FV09 A-HN04 J-DJ14 J-HN04 CRE LPM µx /µx (true value = 1) SE Mean SD cv: .95 SD µd /µd (true value = 1) SE Mean SD cv: .95 SD 0.99 0.95 1.03 1.09 1.04 1.00 0.93 0.06 0.06 0.07 0.09 0.07 0.06 0.06 0.93 0.85 0.87 0.59 0.85 0.95 0.78 0.94 0.94 0.86 0.66 0.81 0.99 0.99 0.98 0.94 1.01 1.11 1.05 1.00 1.29 0.08 0.08 0.08 0.10 0.09 0.08 0.08 0.95 0.91 0.94 0.68 0.87 0.96 0.05 1.02 1.08 1.00 0.77 0.90 1.05 1.07 0.97 0.93 1.04 1.13 1.05 1.00 0.93 0.06 0.06 0.07 0.09 0.07 0.06 0.06 0.91 0.72 0.86 0.41 0.78 0.95 0.80 0.94 0.93 0.84 0.59 0.77 1.01 1.01 0.98 0.93 1.02 1.13 1.06 1.00 1.29 0.08 0.07 0.08 0.11 0.09 0.08 0.08 0.94 0.87 0.94 0.56 0.84 0.96 0.04 1.01 1.08 0.98 0.71 0.87 1.04 1.05 0.92 0.72 1.04 1.23 1.13 1.00 0.93 0.06 
0.15 0.07 0.10 0.08 0.06 0.06 0.66 0.02 0.82 0.10 0.32 0.95 0.77 0.91 0.33 0.77 0.53 0.62 0.99 1.00 0.96 0.64 1.00 1.14 1.09 1.00 1.29 0.07 0.18 0.07 0.11 0.08 0.08 0.08 0.89 0.01 0.93 0.47 0.69 0.95 0.02 0.94 0.39 0.93 0.62 0.76 1.00 1.02 85 APPENDIX C DERIVATIONS OF TEST STATISTICS FROM CHAPTER 2 C.1 Derivations from Section 2.3.2 From section 3.2, the score of (2.13) evaluated at Λ = 0 is identically zero. Assuming we can pass the derivative through the integral, we can work out the following: β , Λ) = ∇Λ i (β h RK it y T p (x T xi , , b i ) f (uui ) duui ∏t=1 t x i , b i ) it ∑t=1 yit u i ⊗ qt (x f (yyi |xxi , u i , ci , ni ) f (uui ) duui RK ni ! xi , b i ) = ∇b pt (xxi , b i )/pt (xxi , b i ). Evaluating T y ! , qt (x i ∏t=1 it do not depend on u i out of the integrals, we have: where hit = terms that β , Λ) ∇Λ i (β = Λ =00 y T p (x hit ∏t=1 t x i , β ) it T y ∑t=1 it y T p (x hit ∏t=1 t x i , β ) it at Λ = 0, and pulling the u ⊗ qt (xxi , β ) f (uui ) duui RK i RK (C.1) f (uui ) duui (C.2) T = ∑ yit E u i ⊗ qt (xxi , b i ) (C.3) t=1 = 0. The second equality uses that RK f (uui ) duui = 1, while the third follows from independence of x it and u i , as well as E(uui ) = 0 . Following the re-parameterization shown in (2.14), stacking the λ j into K ×1 vector λ , defining β , λ ) , and following similar steps as before, we have: let θ ≡ (β     T β ,λ ) ∂ i (β 1 x u u = y q (x , β ) u f (u ) du ij i i ∑ it t j i  2 λ t=1  ∂λj λ =0 RK j (C.4) λ j =0 where qt j () is the jth element of qt (), The above has 0/0 form since E(uui ) = 0 . Using L’Hopital’s rule, the limit, of β ,λ λ) ∂ i (β ∂λj as each element of λ approaches zero from above 86 is: 1 2 λj h [ p (xx , b )yit ] RK it ∏t t i i ∑t yit rt j (xxi , b i ) + ∑t yit qt j (xxi , b i ) 2 u2i j f (uui ) duui , 2 2 1 λj h [ p (xx , b )yit ] RK it ∏t t i i where rt j () is the ( j, j)th element of ∇b qt (xxi , b i ). The i (C.5) f (uui ) duui 1 2 λj terms cancel, as do the hit the product terms when we evaluate at λ = 0 (bbi = β 0 ). Then using RK f (uui ) duui = 1 and RK u2i j f (uui ) duui = E(u2i j ) = 1, we get the last K elements of (2.15). C.2 Derivations from Section 2.3.3 As before, the restricted score of (2.21 is identically zero. T β , Λ ) = ∑ yit ∇Λ i (β t=1 T ∇Λ pt (xxi , β , Λ ) pt (xxi , β , Λ ) ∑T exp(xxir β + mr (xxi , Λ )) ∇λ mt (xxi , Λ ) − ∇λ mr (xxi , Λ ) = ∑ yit r=1 , T exp(x x x β + m (x , Λ )) ∑ r ir i t=1 r=1 ∇λ mt (xxi , Λ ) = RK (C.6) exp (xxit Λ ui ) (uui ⊗ xit ) f (uui ) duui . exp(xxit Λ 0 u i ) f (uui ) duui RK (C.7) The complication arises because ∇λ mt (xxi , Λ ) Λ =00 = RK (uui ⊗ xit ) f (uui ) duui = 0, f (uui ) duui RK (C.8) which implies β , Λ) ∇Λ i (β = 0. Λ =00 (C.9) After the re-parameterization, for each of the λ j , we have: −1 ∇λ mt (xxi , Λ ) = j RK exp(xxit Λ 0 u i ) f (uui ) duui RK exp (xxit Λ ui ) xit j ui j f (uui ) duui 2 λj When evaluated at λ = 0 , the second factor of (C.10) has the form 0/0. 87 . (C.10) Using L’Hopital’s rule, as each λ j approaches zero from above, we have:  1  2 λj   xit Λ u i ) xit j ui j f (uui ) duui K exp (x  = lim  lim  R  λ ↓0 λ ↓0 2 λj exp (xxit Λ u i ) xit2 j u2i j f (uui ) duui RK 2( 1 ) 2    λj xit2 j RK u2i j f (uui ) duui = 2 1 2 = xit j 2 β , 0 ), we get (2.23). 
Plugging these limits in into the expression for ∇Λ i (β 88  (C.11) APPENDIX D SIMULATION RESULTS FROM CHAPTER 3 Table D.1: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 500 β1 σ T T T σ T T T σ T T T σ T T T σ T T T Mean = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.58 0.58 =4 = 10 0.58 = 0.50 =2 0.74 0.74 =4 = 10 0.74 = 0.75 =2 0.92 =4 0.92 = 10 0.91 = 1.00 =2 1.07 1.08 =4 = 10 1.08 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.06 0.04 0.03 0.98 0.99 0.99 0.06 0.06 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.09 0.06 0.04 0.99 1.00 0.98 0.05 0.05 0.05 0.08 0.08 0.08 0.06 0.04 0.03 0.99 1.00 1.01 0.29 0.52 0.87 -0.38 -0.39 -0.38 0.12 0.11 0.12 0.09 0.06 0.04 0.99 0.99 1.00 0.28 0.46 0.84 0.24 0.24 0.24 0.06 0.04 0.03 0.97 0.95 0.94 0.99 1.00 1.00 -0.19 -0.19 -0.19 0.31 0.31 0.31 0.10 0.07 0.05 0.99 0.95 1.00 0.90 0.99 1.00 0.42 0.42 0.41 0.07 0.06 0.05 0.86 0.84 0.82 1.00 1.00 1.00 -0.04 -0.04 -0.04 0.46 0.46 0.46 0.11 0.09 0.07 0.93 0.93 0.89 0.97 0.99 1.00 0.57 0.58 0.58 0.09 0.09 0.08 0.77 0.70 0.71 1.00 1.00 1.00 0.09 0.08 0.08 0.59 0.58 0.58 0.15 0.13 0.11 0.86 0.78 0.76 0.96 0.96 0.98 89 Table D.2: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 500 β1 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.50 0.50 =4 = 10 0.50 = 1.00 =2 0.50 0.50 =4 = 10 0.50 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.10 0.05 0.03 0.99 0.99 0.99 0.05 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.13 0.07 0.04 0.99 1.00 0.98 0.05 0.05 0.05 0.00 0.00 0.00 0.09 0.05 0.03 0.98 1.01 1.00 0.06 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.13 0.07 0.04 0.99 0.99 1.00 0.05 0.05 0.05 0.00 0.00 0.00 0.08 0.04 0.02 0.99 1.00 1.00 0.05 0.05 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.14 0.08 0.04 1.00 0.98 1.02 0.05 0.06 0.05 0.00 0.00 0.00 0.06 0.03 0.02 1.01 0.99 0.99 0.05 0.06 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.14 0.08 0.05 0.99 0.99 0.99 0.06 0.05 0.05 0.00 0.00 0.00 0.05 0.03 0.02 0.97 0.97 0.97 0.06 0.06 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.15 0.08 0.05 0.97 1.01 1.01 0.06 0.05 0.05 90 Table D.3: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 500 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.74 =4 0.74 = 10 0.74 = 0.75 =2 1.19 1.19 =4 = 10 1.19 = 1.00 =2 2.00 2.03 =4 = 10 2.02 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.05 0.04 0.02 0.99 0.99 0.99 0.05 0.05 0.05 -0.42 -0.42 -0.42 0.00 0.00 0.00 0.08 0.05 0.03 1.00 1.00 0.99 0.05 0.05 0.05 0.07 0.07 0.07 0.05 0.04 0.02 0.99 1.00 1.02 0.24 0.43 0.79 -0.34 -0.34 -0.34 0.12 0.11 0.12 0.08 0.06 0.04 1.00 0.99 1.00 0.30 0.49 0.85 0.24 0.24 0.24 0.07 0.06 0.05 0.97 0.95 0.97 0.94 1.00 1.00 -0.20 -0.20 -0.20 0.36 0.36 0.36 0.10 0.08 0.05 0.99 0.95 1.00 0.91 0.99 1.00 0.54 0.54 0.54 0.15 0.14 0.13 0.91 0.88 0.90 1.00 1.00 1.00 -0.06 -0.06 -0.06 0.70 0.71 0.71 0.16 0.12 0.10 0.92 0.92 0.88 0.96 0.98 0.99 1.07 1.10 1.09 0.36 0.39 0.35 0.83 0.76 0.82 1.00 0.99 1.00 0.13 0.12 0.14 1.28 1.26 1.28 0.28 0.27 0.22 0.84 0.72 0.73 0.95 0.96 0.97 91 Table D.4: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 500 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.43 =4 0.43 = 10 0.43 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.65 0.65 =4 = 10 0.65 = 1.00 =2 0.93 0.93 =4 = 10 
0.93 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.08 0.04 0.02 0.99 0.99 0.99 0.05 0.05 0.05 -0.43 -0.42 -0.42 0.00 0.00 0.00 0.11 0.06 0.04 0.99 1.00 0.99 0.05 0.05 0.05 0.00 0.00 0.00 0.08 0.04 0.02 0.98 1.00 1.01 0.05 0.05 0.05 -0.46 -0.46 -0.45 0.00 0.00 0.00 0.13 0.07 0.04 0.99 0.99 1.01 0.05 0.05 0.04 0.00 0.00 0.00 0.08 0.05 0.03 1.00 0.99 1.02 0.05 0.06 0.05 -0.57 -0.56 -0.56 -0.01 0.00 0.00 0.18 0.10 0.06 0.98 0.98 1.03 0.05 0.06 0.04 0.00 0.00 0.00 0.09 0.06 0.05 1.00 0.98 0.98 0.05 0.06 0.06 -0.79 -0.77 -0.77 -0.02 -0.01 -0.01 0.28 0.16 0.10 0.98 0.99 0.98 0.05 0.06 0.05 0.00 0.00 0.00 0.14 0.12 0.10 0.94 0.90 0.95 0.07 0.08 0.08 -1.17 -1.15 -1.15 -0.03 -0.01 0.00 0.46 0.28 0.19 0.95 0.98 0.97 0.06 0.06 0.06 92 Table D.5: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 1000 β1 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.58 =4 0.58 = 10 0.58 = 0.50 =2 0.74 =4 0.74 = 10 0.74 = 0.75 =2 0.92 0.92 =4 = 10 0.92 = 1.00 =2 1.09 1.09 =4 = 10 1.09 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.04 0.03 0.02 1.00 1.00 0.98 0.05 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.06 0.04 0.03 0.99 0.99 0.99 0.05 0.05 0.05 0.08 0.08 0.08 0.04 0.03 0.02 1.01 1.02 1.03 0.52 0.80 0.99 -0.38 -0.38 -0.38 0.12 0.12 0.12 0.06 0.04 0.03 1.01 0.99 1.03 0.46 0.78 0.99 0.24 0.24 0.24 0.04 0.03 0.02 0.98 0.98 0.97 1.00 1.00 1.00 -0.19 -0.19 -0.19 0.31 0.31 0.31 0.07 0.05 0.03 0.96 1.00 1.00 0.99 1.00 1.00 0.42 0.42 0.42 0.05 0.05 0.04 0.92 0.86 0.83 1.00 1.00 1.00 -0.04 -0.05 -0.04 0.46 0.45 0.46 0.08 0.07 0.05 0.96 0.90 0.88 0.99 0.99 0.99 0.59 0.59 0.59 0.08 0.07 0.07 0.78 0.73 0.75 1.00 1.00 1.00 0.07 0.07 0.07 0.57 0.57 0.57 0.11 0.10 0.09 0.85 0.78 0.77 0.98 0.98 0.99 93 Table D.6: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 1000 β1 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.50 0.50 =4 = 10 0.50 = 1.00 =2 0.50 0.50 =4 = 10 0.50 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.07 0.04 0.02 0.99 1.00 0.98 0.05 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.09 0.05 0.03 1.01 0.98 0.99 0.05 0.05 0.05 0.00 0.00 0.00 0.06 0.03 0.02 0.98 1.00 1.04 0.05 0.05 0.04 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.09 0.05 0.03 1.00 0.99 1.01 0.05 0.06 0.04 0.00 0.00 0.00 0.05 0.03 0.02 1.00 0.98 1.02 0.05 0.06 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.10 0.06 0.03 1.01 1.00 0.99 0.05 0.06 0.05 0.00 0.00 0.00 0.04 0.03 0.01 1.02 0.99 1.01 0.04 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.10 0.06 0.03 0.99 1.00 1.00 0.05 0.05 0.05 0.00 0.00 0.00 0.03 0.02 0.01 0.99 0.97 1.02 0.05 0.06 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.11 0.06 0.03 0.99 1.01 1.00 0.05 0.05 0.05 94 Table D.7: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 1000 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.74 =4 0.74 = 10 0.74 = 0.75 =2 1.19 1.19 =4 = 10 1.19 = 1.00 =2 2.04 2.03 =4 = 10 2.04 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.04 0.03 0.02 0.99 0.99 0.98 0.06 0.05 0.05 -0.42 -0.42 -0.42 0.00 0.00 0.00 0.05 0.04 0.02 0.99 0.99 1.00 0.05 0.05 0.05 0.07 0.07 0.07 0.04 0.03 0.02 1.00 1.00 1.01 0.45 0.71 0.98 -0.34 -0.34 -0.34 0.11 0.12 0.11 0.06 0.04 0.03 1.01 0.99 1.03 0.48 0.79 0.99 0.24 0.24 0.24 0.05 0.04 0.04 0.98 0.98 0.98 1.00 
1.00 1.00 -0.20 -0.20 -0.20 0.36 0.36 0.36 0.08 0.05 0.04 0.96 1.00 1.00 0.99 1.00 1.00 0.54 0.54 0.54 0.11 0.10 0.10 0.95 0.91 0.90 1.00 1.00 1.00 -0.06 -0.06 -0.06 0.70 0.70 0.70 0.11 0.09 0.07 0.96 0.89 0.87 0.99 0.99 0.99 1.10 1.10 1.10 0.28 0.29 0.26 0.84 0.80 0.85 1.00 1.00 1.00 0.12 0.12 0.12 1.26 1.26 1.27 0.22 0.21 0.17 0.83 0.73 0.75 0.97 0.97 0.98 95 Table D.8: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 1000 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.43 =4 0.43 = 10 0.43 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.65 0.65 =4 = 10 0.65 = 1.00 =2 0.93 0.93 =4 = 10 0.93 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.06 0.03 0.02 0.99 0.99 0.98 0.05 0.06 0.05 -0.42 -0.42 -0.42 0.00 0.00 0.00 0.08 0.05 0.03 1.01 0.98 0.99 0.05 0.05 0.05 0.00 0.00 0.00 0.06 0.03 0.02 0.98 1.00 1.02 0.05 0.05 0.05 -0.46 -0.45 -0.46 0.00 0.00 0.00 0.09 0.05 0.03 1.00 0.99 1.02 0.05 0.06 0.04 0.00 0.00 0.00 0.06 0.03 0.02 1.00 0.99 1.01 0.05 0.05 0.05 -0.57 -0.56 -0.56 -0.01 0.00 0.00 0.13 0.07 0.04 1.01 1.01 1.00 0.04 0.05 0.05 0.00 0.00 0.00 0.06 0.04 0.03 1.02 0.98 1.00 0.04 0.06 0.05 -0.78 -0.77 -0.77 -0.01 0.00 0.00 0.19 0.11 0.07 0.99 1.00 1.01 0.05 0.05 0.05 0.00 0.00 0.00 0.10 0.08 0.07 0.95 0.93 0.96 0.06 0.07 0.07 -1.16 -1.15 -1.15 -0.01 0.00 0.00 0.31 0.19 0.13 0.98 0.99 0.99 0.05 0.05 0.05 96 Table D.9: Finite Sample Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 2000 β1 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.58 =4 0.58 = 10 0.58 = 0.50 =2 0.74 =4 0.74 = 10 0.74 = 0.75 =2 0.92 0.92 =4 = 10 0.92 = 1.00 =2 1.09 1.09 =4 = 10 1.09 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.03 0.02 0.01 1.00 1.01 0.97 0.05 0.05 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.04 0.03 0.02 0.99 1.00 0.99 0.05 0.05 0.05 0.08 0.08 0.08 0.03 0.02 0.01 1.02 1.01 0.98 0.81 0.98 1.00 -0.38 -0.38 -0.38 0.12 0.12 0.12 0.04 0.03 0.02 1.02 1.00 0.97 0.75 0.96 1.00 0.24 0.24 0.24 0.03 0.02 0.02 0.98 1.00 0.97 1.00 1.00 1.00 -0.19 -0.19 -0.19 0.31 0.31 0.31 0.05 0.03 0.02 1.00 1.04 0.99 1.00 1.00 1.00 0.42 0.42 0.42 0.04 0.04 0.03 0.92 0.88 0.89 1.00 1.00 1.00 -0.04 -0.05 -0.05 0.46 0.45 0.45 0.06 0.05 0.04 0.97 0.95 0.92 1.00 1.00 1.00 0.59 0.59 0.59 0.06 0.06 0.05 0.83 0.79 0.80 1.00 1.00 1.00 0.07 0.06 0.07 0.57 0.56 0.57 0.09 0.08 0.07 0.84 0.82 0.80 0.99 0.99 0.99 97 Table D.10: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 2000 β1 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.50 =4 0.50 = 10 0.50 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.50 0.50 =4 = 10 0.50 = 1.00 =2 0.50 0.50 =4 = 10 0.50 β2 Bias SD SE/SD RP(0.05) Mean Bias SD SE/SD RP(0.05) 0.00 0.00 0.00 0.05 0.03 0.01 0.99 1.00 0.97 0.06 0.05 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.06 0.04 0.02 0.96 1.00 0.99 0.06 0.05 0.05 0.00 0.00 0.00 0.04 0.02 0.01 1.01 1.02 0.98 0.05 0.05 0.06 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.07 0.04 0.02 0.99 1.00 0.98 0.06 0.05 0.05 0.00 0.00 0.00 0.04 0.02 0.01 0.99 1.02 0.99 0.05 0.05 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.07 0.04 0.02 0.99 1.01 0.98 0.05 0.05 0.05 0.00 0.00 0.00 0.03 0.02 0.01 0.95 0.99 0.98 0.06 0.06 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.07 0.04 0.02 0.97 1.03 0.98 0.05 0.04 0.05 0.00 0.00 0.00 0.02 0.01 0.01 0.98 0.96 1.01 0.05 0.06 0.05 -0.50 -0.50 -0.50 0.00 0.00 0.00 0.08 0.04 0.02 0.98 0.99 1.01 0.05 0.05 0.05 98 Table D.11: Finite Sample 
Properties of Poisson QMLE: β1 = 0.5, β2 = −0.5, N = 2000 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.50 =4 0.50 = 10 0.50 = 0.50 =2 0.74 =4 0.74 = 10 0.74 = 0.75 =2 1.19 1.19 =4 = 10 1.19 = 1.00 =2 2.03 2.04 =4 = 10 2.04 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.02 0.02 0.01 0.99 1.01 0.96 0.05 0.05 0.05 -0.42 -0.42 -0.42 0.00 0.00 0.00 0.04 0.03 0.02 0.99 1.00 0.98 0.05 0.05 0.05 0.07 0.07 0.07 0.03 0.02 0.01 1.02 1.00 0.99 0.75 0.95 1.00 -0.34 -0.34 -0.34 0.11 0.11 0.11 0.04 0.03 0.02 1.02 1.00 0.97 0.77 0.97 1.00 0.24 0.24 0.24 0.04 0.03 0.03 0.98 0.99 0.99 1.00 1.00 1.00 -0.20 -0.20 -0.20 0.36 0.36 0.36 0.05 0.04 0.03 1.00 1.04 0.99 1.00 1.00 1.00 0.54 0.55 0.55 0.08 0.08 0.07 0.95 0.93 0.95 1.00 1.00 1.00 -0.06 -0.06 -0.06 0.71 0.70 0.70 0.08 0.06 0.05 0.96 0.95 0.92 1.00 1.00 1.00 1.10 1.11 1.11 0.19 0.20 0.19 0.91 0.86 0.90 1.00 1.00 1.00 0.11 0.11 0.11 1.26 1.25 1.26 0.16 0.15 0.13 0.84 0.80 0.77 0.99 0.99 0.99 99 Table D.12: Finite Sample Properties of Fixed Effects Poisson: β1 = 0.5, β2 = −0.5, N = 2000 Mean σ T T T σ T T T σ T T T σ T T T σ T T T = 0.00 =2 0.41 =4 0.41 = 10 0.41 = 0.25 =2 0.43 =4 0.43 = 10 0.43 = 0.50 =2 0.50 =4 0.50 = 10 0.50 = 0.75 =2 0.65 0.65 =4 = 10 0.65 = 1.00 =2 0.93 0.93 =4 = 10 0.93 Bias δ1 (APE) SD SE/SD RP(0.05) Mean Bias δ2 (ATE) SD SE/SD RP(0.05) 0.00 0.00 0.00 0.04 0.02 0.01 0.99 0.99 0.97 0.05 0.05 0.06 -0.42 -0.42 -0.42 0.00 0.00 0.00 0.06 0.03 0.02 0.96 1.00 0.98 0.06 0.05 0.05 0.00 0.00 0.00 0.04 0.02 0.01 1.01 1.01 0.98 0.05 0.04 0.05 -0.46 -0.45 -0.46 0.00 0.00 0.00 0.07 0.04 0.02 0.98 1.00 0.99 0.05 0.05 0.05 0.00 0.00 0.00 0.04 0.02 0.02 0.98 1.03 0.99 0.05 0.04 0.05 -0.56 -0.56 -0.56 0.00 0.00 0.00 0.09 0.05 0.03 0.99 1.01 0.97 0.05 0.05 0.05 0.00 0.00 0.00 0.05 0.03 0.02 0.96 0.98 0.99 0.06 0.06 0.06 -0.77 -0.77 -0.77 0.00 0.00 0.00 0.14 0.08 0.05 0.97 1.03 0.99 0.06 0.05 0.05 0.00 0.00 0.00 0.07 0.06 0.05 0.99 0.96 1.00 0.06 0.06 0.05 -1.16 -1.15 -1.15 -0.01 0.00 0.00 0.22 0.14 0.09 0.98 0.99 0.99 0.05 0.06 0.05 100 REFERENCES 101 REFERENCES Alexander, B. and R. Breunig (2014). “A Monte Carlo study of bias corrections for panel probit models”. In: Journal of Statistical Computation and Simulation 86.1, pp. 74–90. DOI: 10.1080/ 00949655.2014.994516. Andersen, E.B. (1970). “Asymptotic Properties of Conditional Maximum-Likelihood Estimators”. In: Journal of the Royal Statistical Society. Series B (Methodological) 32.2, pp. 283–301. ISSN: 00359246. DOI: 10.2307/2984535. URL: http://www.jstor.org/stable/2984535. Arellano, M. and J. Hahn (2007). Understanding Bias in Nonlinear Panel Models: Some Recent Developments. In Advances in Economics and Econometrics, Blundell R, Newey W, Persson T (eds). Cambridge: Cambridge University Press. Bessen, J. (2009). “Matching patent data to compustat firms”. In: NBER working paper. Blundell, R. and J.L. Powell (2003). “Endogeneity in Nonparametric and Semiparametric Regression Models”. In: Advances in Economics and Econometrics: Theory and Applications: Eighth World Congress Vol II, pp. 312–357. DOI: 10.1017/ccol0521818737.010. Bound, J. et al. (1982). “Who does R&D and who patents?” In: Cameron, A.C. and P.K. Trivedi (2013). Regression analysis of count data. 2nd ed. Cambridge University Press. Chamberlain, G. (1980). “Analysis of Covariance with Qualitative Data”. In: Review of Economic Studies 47, pp. 225–238. DOI: 10.2307/2297110. — (1982). “Multivariate Regression Models For Panel Data”. 
In: Journal of Econometrics 18, pp. 5–46. DOI: 10.1016/0304-4076(82)90094-x. — (1992). “Comment: Sequential moment restrictions in panel data”. In: Journal of Business & Economic Statistics 10.1, pp. 20–26. Chay, K.Y. and D.R. Hyslop (2014). “Identification and Estimation of Dynamic Binary Response Panel Data Models: Empirical Evidence Using Alternative Approaches”. In: Safety Nets and Benefit Dependence (Research in Labor Economics), pp. 1–39. DOI: 10.1108/s0147- 9121_ 2014_0000039001. Chesher, A. (1984). “Testing for Neglected Heterogeneity”. In: Econometrica 52.4, p. 865. 10.2307/1911188. 102 DOI : Dhaene, G. and K. Jochmans (2015). “Split-panel Jackknife Estimation of Fixed-effect Models”. In: Review of Economic Studies 82.3, pp. 991–1030. DOI: 10.1093/restud/rdv007. Fernandez-Val, I. (2009). “Fixed Effects Estimation of Structural Parameters and Marginal Effects in Panel Probit Models”. In: Journal of Econometrics 150, pp. 71–85. DOI: 10.1016/j.jeconom. 2009.02.007. Fernández-Val, Iván and Martin Weidner (2016). “Individual and time effects in nonlinear panel models with large N, T”. In: Journal of Econometrics 192.1, pp. 291–312. Gourieroux, C., A. Monfort, and A. Trognon (1984). “Pseudo Maximum Likelihood Methods: Theory”. In: Econometrica 52.3, p. 681. DOI: 10.2307/1913471. Greene, W.H. (2004). “The Behavior of the Fixed Effects Estimator in Nonlinear Models”. In: The Econometrics Journal 7, pp. 98–119. DOI: 10.1111/j.1368-423x.2004.00123.x. — (2012). Econometric analysis. Prentice Hall. Greene, W.H. and C. Mckenzie (2015). “An LM test based on generalized residuals for random effects in a nonlinear model”. In: Economics Letters 127, pp. 47–50. DOI: 10.1016/j.econlet. 2014.12.031. Gurmu, Shiferaw and Fidel Pérez-Sebastián (2008). “Patents, R&D and lag effects: evidence from flexible methods for count panel data on manufacturing firms”. In: Empirical Economics 35.3, pp. 507–526. Hahn, J. and G. Kuersteiner (2011). “Bias reduction for dynamic nonlinear panel models with fixed effects”. In: Econometric Theory 27.06, pp. 1152–1191. Hahn, J., H.R. Moon, and C. Snider (2015). “LM test of neglected correlated random effects and its application”. In: Journal of Business & Economic Statistics forthcoming. Hahn, J. and W. Newey (2004). “Jackknife and Analytical Bias Reduction for Nonlinear Panel Models”. In: Econometrica 72, pp. 1295–1319. DOI: 10.1111/j.1468-0262.2004.00533.x. Hahn, J., W.K. Newey, and R.J. Smith (2014). “Neglected heterogeneity in moment condition models”. In: Journal of Econometrics 178, pp. 86–100. Hall, B., Z. Griliches, and J. Hausman (1986). “Patents and R and D: Is There a Lag?” In: International Economic Review, pp. 265–283. Hall, B., A. Jaffe, and M. Trajtenberg (2001). The NBER patent citation data file: Lessons, insights and methodological tools. Tech. rep. National Bureau of Economic Research. 103 Hausman, J., B. Hall, and Z. Griliches (1984). “Econometric Models for Count Data with an Application to the Patents-R&D Relationship”. In: Econometrica 52.4, p. 909. DOI: 10 . 2307 / 1911191. Lancaster, T. (2000). “The Incidental Parameters Problem since 1948”. In: Journal of Econometrics 95, pp. 391–413. DOI: 10.1016/s0304-4076(99)00044-5. — (2002). “Orthogonal parameters and panel data”. In: The Review of Economic Studies 69.3, pp. 647–666. Lee, L. and A. Chesher (1986). “Specification testing when score test statistics are identically zero”. In: Journal of Econometrics 31.2, pp. 121–149. DOI: 10.1016/0304-4076(86)90045-x. Lee, M. and S. Kobayashi (2001). 
“Proportional treatment effects for count response panel data: effects of binary exercise on health care demand”. In: Health Economics 10.5, pp. 411–428. Mundlak, Y. (1978). “On the pooling of Time Series and Cross Section Data”. In: Econometrica 46, pp. 69–85. Neyman, J. and E. Scott (1948). “Consistent Estimates Based on Partially Consistent Observations”. In: Econometrica 16, pp. 1–32. Pakes, A. and Z. Griliches (1980). “Patents and R and D at the firm level: A first look”. In: Rabe-Hesketh, S. and A. Skrondal (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. English. GB: CRC Press. — (2013). “Avoiding biased versions of Wooldridge’s simple solution to the initial conditions problem”. In: Economics Letters 120.2, pp. 346–349. DOI: 10.1016/j.econlet.2013.05.009. Stoker, T. (1986). “Consistent Estimation of Scaled Coefficients”. In: Econometrica 54.6, pp. 1461–1481. DOI: 10.2307/1914309. Vamo¸s, C., S. ¸ Soltuz, ¸ and M. Cr˘aciun (2007). “Order 1 autoregressive process of finite length”. In: Rev. Anal. Numér. Théor. Approx. 36.2, pp. 199–214. Wang, P., I.M. Cockburn, and M.L. Puterman (1998). “Analysis of patent data—a mixed-Poissonregression-model approach”. In: Journal of Business & Economic Statistics 16.1, pp. 27–41. White, H. (1982). “Maximum likelihood estimation of misspecified models”. In: Econometrica: Journal of the Econometric Society, pp. 1–25. Wooldridge, J.M. (1992). “Some alternatives to the Box-Cox regression model”. In: International Economic Review, pp. 935–955. 104 Wooldridge, J.M. (1997). “Multiplicative panel data models without the strict exogeneity assumption”. In: Econometric Theory 13.05, pp. 667–678. — (1999). “Distribution-free estimation of some nonlinear panel data models”. In: Journal of Econometrics 90.1, pp. 77–97. DOI: 10.1016/s0304-4076(98)00033-5. — (2010). Econometric analysis of cross section and panel data. MIT Press. 105