TRADE-OFFS IN NON-LINEAR MODELS AND ESTIMATION STRATEGIES

By Alyssa Helen Carlson

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Economics — Doctor of Philosophy

2019

ABSTRACT

This dissertation examines the assumptions presumed throughout the literature to establish valid estimation procedures for non-linear models. The following three chapters address issues of identification, consistent and efficient estimation, and the incorporation of heteroskedasticity and serial correlation for binary response models in cross-sectional and panel data settings.

Chapter 1: Parametric Identification of Multiplicative Exponential Heteroskedasticity

Multiplicative exponential heteroskedasticity is commonly seen in latent variable models such as Probit or Logit, where correctly modelling the heteroskedasticity is imperative for consistent parameter estimates. However, the literature appears to lack a formal proof of point identification for the parametric model. This chapter presents several examples showing that the conditions presumed throughout the literature are not sufficient for identification and, as a contribution, provides proofs of point identification for common specifications.

Chapter 2: Relaxing Conditional Independence in an Endogenous Binary Response Model

For binary response models, control function estimators are a popular approach to addressing endogeneity. But these estimators rely on a Control Function assumption that imposes Conditional Independence (CF-CI) to obtain identification. CF-CI places restrictions on the relationship between the latent error and the instruments that are unlikely to hold in an empirical context. In particular, the literature has noted that CF-CI imposes homoskedasticity with respect to the instruments.
This chapter identifies the consequences of CF-CI, provides examples that motivate relaxing CF-CI, and proposes a new consistent estimator under weaker assumptions than CF-CI. The proposed method is illustrated in an application estimating the effect of non-wife income on married women's labor supply.

Chapter 3: Behavior of Pooled and Joint Estimators in Probit Model with Random Coefficients and Serial Correlation

This chapter compares a pooled maximum likelihood estimator (PMLE) to a joint (full) maximum likelihood estimator (JMLE), the dominant estimation method for mixture models, for dealing with potential individual-specific heterogeneity and serial correlation in a binary response Probit Mixture model. The JMLE is more statistically efficient but computationally demanding, and implementation becomes more difficult if one tries to model the serial correlation over time. On the other hand, the PMLE is computationally simple and robust to arbitrary forms of serial correlation. Focusing on the Average Partial Effects, this chapter finds it imperative that the model allow the individual-specific heterogeneity to be potentially correlated with the covariates (not a standard specification in Mixture models). Moreover, the JMLE can produce quite satisfactory estimates that appear robust to serial correlation even under misspecification of the likelihood function. Results are illustrated in an application estimating the effects of different interventions on high-risk men's behavior, complementing the original study of Blattman, Jamison, and Sheridan (2017).

ACKNOWLEDGMENTS

First and foremost, I would like to thank the chair of my dissertation committee, Jeff Wooldridge, for all of his advice, encouragement, and helpful critiques. I would also like to thank Kyoo Il Kim, Joe Herriges, and Nicole Mason for serving on my committee and providing valuable feedback and assistance.
I also appreciate the comments of seminar participants at Michigan State University, the Econometrics Reading Group at MSU, Grand Valley State University, the 2018 MEA Conference, the 2018 and 2019 Annual Meetings of the Midwest Econometrics Group and the corresponding Women's Mentoring Workshops, and the 2018 International Association of Applied Econometrics Conference. I am especially grateful for the financial support I received from the Graduate School and the Department of Economics at Michigan State University, including the Goodman Fellowship, Summer Research Fellowship, and Dissertation Completion Fellowship. I also appreciate the support and advice that Lori Jean Nichols, Steven Haider, and Mike Conlin all gave me as I navigated the graduate program and job market. I am also grateful to my friends and colleagues at Michigan State for making my graduate experience so memorable. I am truly thankful for my endlessly supportive parents Lance and Chim Carlson, whose love and encouragement helped me at every step of my life. Finally, I am especially grateful to my partner, Thom, for picking up his life and moving across the country to start a new adventure with me (multiple times), as well as all the countless ways he has supported my endeavors over the years.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Introduction
Chapter 1  Parametric Identification of Multiplicative Exponential Heteroskedasticity
  1.1 Introduction
  1.2 Identification when there is no bijective transformation
  1.3 No identification when there is a bijective transformation
  1.4 Identification in a common specification
  1.5 Conclusion
Chapter 2  Relaxing Conditional Independence in an Endogenous Binary Response Model
  2.1 Introduction
  2.2 Background and Motivation
  2.3 Model Set Up
  2.4 General Control Function
    2.4.1 Identification
    2.4.2 Simulation: General Control Function in the Demand for Premium Cable
  2.5 Estimation and Interpretation
    2.5.1 Asymptotic Properties
    2.5.2 Average Structural Function
    2.5.3 Average Partial Effects
    2.5.4 Simulation: ASF Estimates for the Effect of Income on Homeownership
  2.6 Empirical Example
  2.7 Extension: Semi-Parametric Distribution Free Estimator
    2.7.1 Observational Equivalence and Identification
    2.7.2 Asymptotic Properties
    2.7.3 Simulation
  2.8 Conclusion
Chapter 3  Behavior of Pooled and Joint Estimators in Probit Model with Random Coefficients and Serial Correlation
  3.1 Introduction
  3.2 Model Set Up
  3.3 Estimation Methods
    3.3.1 Mixed Effects Probit
    3.3.2 Pooled Heteroskedastic Probit
  3.4 Average Partial Effects
  3.5 Simulation
    3.5.1 Computational Results
    3.5.2 Parameter Estimates
    3.5.3 Average Partial Effect Estimates
    3.5.4 ASF
  3.6 Application
  3.7 Discussion
    3.7.1 AR(2)
    3.7.2 No Random Effects
    3.7.3 Logit
  3.8 Conclusion
APPENDICES
  APPENDIX A Figures for Chapter 1
  APPENDIX B Proofs and Notation for Chapter 2
  APPENDIX C Simulation Details for Chapter 2
  APPENDIX D Figures for Chapter 2
  APPENDIX E Tables for Chapter 2
  APPENDIX F Figures for Chapter 3
  APPENDIX G Tables for Chapter 3
BIBLIOGRAPHY

LIST OF TABLES

Table E.1: Summary Statistics
Table E.2: Comparison of Logit Parameter Estimates
Table E.3: Comparison of Price Elasticity Estimates
Table E.4: Comparison of Summary Statistics
Table E.5: Comparison of Parameter Estimates
Table E.6: APE Results and Simulated Distribution (True APE = 0.6448)
Table E.7: Summary Statistics
Table E.8: Coefficient Estimates for Married Women's LFP
Table E.9: Wald Test Results
Table E.10: APE Estimates for Non-Wife Income effect on Wife's LFP
Table E.11: Logistic Distribution (h^1_o = v_{2i})
Table E.12: Uniform Distribution (h^1_o = v_{2i})
Table E.13: Student T Distribution (h^1_o = v_{2i})
Table E.14: Gaussian Mixture Distribution (h^1_o = v_{2i})
Table E.15: Logistic Distribution with Linear GCF (h^2_o)
Table E.16: Uniform Distribution with Linear GCF (h^2_o)
Table E.17: Student T Distribution with Linear GCF (h^2_o)
Table E.18: Gaussian Mixture Distribution with Linear GCF (h^2_o)
Table E.19: Logistic Distribution with Non-Parametric GCF (h^3_o)
Table E.20: Uniform Distribution with Non-Parametric GCF (h^3_o)
Table E.21: Student T Distribution with Non-Parametric GCF (h^3_o)
Table E.22: Gaussian Mixture Distribution with Non-Parametric GCF (h^3_o)
Table E.23: Heteroskedastic Logistic (h^1_o = v_{2i})
Table E.24: Heteroskedastic Logistic with Linear GCF (h^2_o)
Table E.25: Heteroskedastic Logistic with Non-Parametric GCF (h^3_o)
Table G.1: Estimation Times for DGP 1
Table G.2: Estimation Times for DGP 2
Table G.3: Estimation Times for DGP 3
Table G.4: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 1
Table G.5: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 2
Table G.6: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 3
Table G.7: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 1
Table G.8: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 2
Table G.9: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 3
Table G.10: Root Mean Square Error of β̂_{2σ} for Specification (2)
Table G.11: Bias and Std Deviation of Variance Component σ^2_2 for Specification (2)
Table G.12: Bias and Std Deviation (×10) of APE Estimates for DGP 1
Table G.13: Bias and Std Deviation (×10) of APE Estimates for DGP 2
Table G.14: Bias and Std Deviation (×10) of APE Estimates for DGP 3
Table G.15: Comparison of APE and PEA
Table G.16: Select Baseline Summary Statistics
Table G.17: Preliminary OLS Estimates
Table G.18: Scaled Probit Coefficient Estimates for Selling Drugs
Table G.19: Scaled Probit Coefficient Estimates for being Arrested
Table G.20: Scaled Probit Coefficient Estimates for Illicit Activity
Table G.21: ATE Estimates
Table G.22: Bias and Std Deviation of Scaled Coefficient Estimates under AR(2)
Table G.23: Bias and Std Deviation (×10) of APE Estimates under AR(2)
Table G.24: Failure Count under no Random Coefficients
Table G.25: Estimation Times under no Random Coefficients
Table G.26: Bias and Std Deviation (×10) of APE Estimates under no Random Coefficients
Table G.27: Variance Component σ^2_1 Estimates under no Random Coefficients
Table G.28: Variance Component σ^2_2 Estimates under no Random Coefficients
Table G.29: Rejection Rate of LR Test for Random Coefficients
Table G.30: Bias and Std Deviation of De-scaled ME Logit Estimates under a Conditional Logistic AR(1) Process
Table G.31: Bias and Std Deviation of De-scaled ME Logit Estimates under a Marginal Logistic AR(1) Process
Table G.32: Bias and Std Deviation of Scaled Coefficient Estimates under a Conditional Logistic AR(1) Process
Table G.33: Bias and Std Deviation of Scaled Coefficient Estimates under a Marginal Logistic AR(1) Process
Table G.34: Bias and Std Deviation (×10) of APE Estimates under a Conditional Logistic AR(1) Process
Table G.35: Bias and Std Deviation (×10) of APE Estimates under a Marginal Logistic AR(1) Process

LIST OF FIGURES

Figure A.1: Visual representation of bijective transformations
Figure A.2: Parameter estimates from two observationally equivalent models
Figure D.1: Effect of Heteroskedasticity on Parameter Estimate
Figure D.2: ASF for Income equal to $85,000
Figure D.3: ASF Estimates for Misspecified Models
Figure D.4: Consequence of CF-LI Assumption on ASF Estimates
Figure D.5: Comparison of ASF for Families with No Children
Figure D.6: Comparison of ASF for Families with Young Children Only
Figure D.7: Comparison of ASF for Families with Old Children Only
Figure D.8: Comparison of ASF for Families with Both Young and Old Children
Figure D.9: Logistic Distribution (h^1_o = v_{2i})
Figure D.10: Uniform Distribution (h^1_o = v_{2i})
Figure D.11: Student T Distribution (h^1_o = v_{2i})
Figure D.12: Gaussian Mixture Distribution (h^1_o = v_{2i})
Figure D.13: Logistic Distribution with Linear GCF (h^2_o)
Figure D.14: Uniform Distribution with Linear GCF (h^2_o)
Figure D.15: Student T Distribution with Linear GCF (h^2_o)
Figure D.16: Gaussian Mixture Distribution with Linear GCF (h^2_o)
Figure D.17: Logistic with Non-Parametric GCF (h^3_o)
Figure D.18: Uniform with Non-Parametric GCF (h^3_o)
Figure D.19: Student T with Non-Parametric GCF (h^3_o)
Figure D.20: Gaussian Mixture with Non-Parametric GCF (h^3_o)
Figure D.21: Heteroskedastic Logistic (h^1_o = v_{2i})
Figure D.22: Heteroskedastic Logistic with Linear GCF (h^2_o)
Figure D.23: Heteroskedastic Logistic with Non-Parametric GCF (h^3_o)
Figure F.1: Distribution of σ̂^2_1 for T=5 under DGP1
Figure F.2: Distribution of σ̂^2_1 for T=10 under DGP1
Figure F.3: Distribution of σ̂^2_1 for T=20 under DGP1
Figure F.4: Distribution of σ̂^2_1 for T=5 under DGP2
Figure F.5: Distribution of σ̂^2_1 for T=10 under DGP2
Figure F.6: Distribution of σ̂^2_1 for T=20 under DGP2
Figure F.7: Distribution of σ̂^2_1 for T=5 under DGP3
Figure F.8: Distribution of σ̂^2_1 for T=10 under DGP3
Figure F.9: Distribution of σ̂^2_1 for T=20 under DGP3
Figure F.10: ASF Estimates for T=5 under DGP1
Figure F.11: ASF Estimates for T=10 under DGP1
Figure F.12: ASF Estimates for T=20 under DGP1
Figure F.13: ASF Estimates for T=5 under DGP2
Figure F.14: ASF Estimates for T=10 under DGP2
Figure F.15: ASF Estimates for T=20 under DGP2
Figure F.16: ASF Estimates for T=5 under DGP3
Figure F.17: ASF Estimates for T=10 under DGP3
Figure F.18: ASF Estimates for T=20 under DGP3
Figure F.19: ATE for Selling Drugs
Figure F.20: ATE for Being Arrested
Figure F.21: ATE for Engaging in Illicit Activities
Figure F.22: Distribution of σ̂^2_1 for T=5 under AR(2)
Figure F.23: Distribution of σ̂^2_1 for T=10 under AR(2)
Figure F.24: Distribution of σ̂^2_1 for T=20 under AR(2)

Introduction

When the outcome has restricted support – non-negative, binary, discrete, etc. – non-linear models are used to better capture the underlying data generating process, which a linear model can at best approximate. A common example is a binary response model, where the threshold latent variable set-up is more reasonable than a linear approximation in which the predicted outcomes could fall outside the 0 and 1 bounds for probabilities. But unlike the linear regression, non-linear models are not as well understood when the standard assumptions fail to hold. This dissertation addresses three settings in which the standard assumptions fail to hold: heteroskedasticity, endogeneity, and a panel setting with random coefficients and serial correlation.
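The out-of-bounds problem with a linear approximation can be sketched numerically. The following is an illustrative simulation, not taken from the dissertation; all parameter values (the probit slope of 1.2, the covariate scale) are hypothetical choices:

```python
# Illustrative sketch (hypothetical parameter values): a linear probability
# model fit by OLS to binary data generated from a probit can produce fitted
# "probabilities" outside the [0, 1] bounds.
import numpy as np
from math import erf, sqrt

def norm_cdf(t):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(0.0, 2.0, n)                        # covariate with wide support
p_true = np.array([norm_cdf(1.2 * xi) for xi in x])
y = (rng.random(n) < p_true).astype(float)         # binary outcome from a probit DGP

# OLS "linear probability model": y = a + b*x + error
X = np.column_stack([np.ones(n), x])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = a + b * x

outside = np.mean((fitted < 0) | (fitted > 1))
print(f"share of LPM fitted values outside [0, 1]: {outside:.2f}")
```

By construction the probit probabilities `p_true` stay inside [0, 1], while a non-trivial share of the OLS fitted values fall outside it on this wide-support covariate.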
In the first chapter, I address the identification of a multiplicative exponential heteroskedastic model. Although the results are presented in a general setting, introducing heteroskedasticity in a binary response model is usually done through multiplicative exponential heteroskedasticity. Unlike in the linear regression, heteroskedasticity – where the variance of the latent error depends on the covariates – does not just influence the calculation of the asymptotic variance (and consequently the standard error estimates); it changes the conditional mean function in the log-likelihood. This means that ignoring heteroskedasticity in a binary response model will result in inconsistent parameter estimates, not just inaccurate standard error estimates (see Figure D.1). Moreover, introducing a heteroskedastic specification can capture more flexibility in the conditional mean function, as seen in Chapters 2 and 3. In Chapter 2, I utilize an observational equivalence result from Khan (2013) that essentially implies flexibly specified heteroskedasticity allows for distributional misspecification in the latent error. In Chapter 3, a pooled Probit estimator allows for random coefficients through a heteroskedastic specification. Both of these examples provide further motivation for the utility of a heteroskedastic binary response model. But I find the assumptions in the literature are not sufficient to guarantee identification of a multiplicative exponential heteroskedastic model. I provide a proof of identification for a linear-in-parameters specification that will be utilized in the later chapters.

The next two chapters propose and compare estimation procedures for specific binary response settings. In both cases, the alternative estimators considered from the literature are built upon a set of fairly restrictive assumptions. I examine the assumptions underlying the estimators and ask if they are realistic empirically.
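The first of these points, that ignoring multiplicative exponential heteroskedasticity makes probit estimates inconsistent while modelling it restores consistency, can be illustrated with a short simulation. This is my own hypothetical sketch (the parameter values, sample size, and use of scipy's optimizer are assumptions, not the dissertation's code):

```python
# Hypothetical simulation sketch: latent error scaled by exp(delta * x).
# A probit that ignores the scaling fits the wrong conditional mean; a
# heteroskedastic probit that models it recovers the true parameters.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 20000
x = rng.normal(size=n)
beta0, beta1, delta = 1.0, 1.0, 0.5
u = np.exp(delta * x) * rng.normal(size=n)   # multiplicative exponential heteroskedasticity
y = (beta0 + beta1 * x - u > 0).astype(float)

def negll(params, heteroskedastic):
    # negative log-likelihood of the (possibly heteroskedastic) probit
    if heteroskedastic:
        a, b, d = params
        index = (a + b * x) / np.exp(d * x)
    else:
        a, b = params
        index = a + b * x
    p = np.clip(norm.cdf(index), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit_hom = minimize(negll, x0=[0.0, 0.5], args=(False,), method="BFGS")
fit_het = minimize(negll, x0=[0.0, 0.5, 0.0], args=(True,), method="BFGS")

print("homoskedastic probit estimates:  ", fit_hom.x)
print("heteroskedastic probit estimates:", fit_het.x)  # roughly (1.0, 1.0, 0.5)
```

With the heteroskedastic likelihood correctly specified, the three parameters are recovered up to sampling error; the homoskedastic fit has no parameterization that matches the non-monotone conditional mean Φ((1 + x)/exp(0.5x)).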
In both cases, I find simple scenarios in which the underlying assumptions would be violated. In the second chapter, the conditional independence assumption for the control function estimators is violated when there is heteroskedasticity. In the third chapter, the joint maximum likelihood estimator is inconsistent in the presence of serial correlation. Given these limitations, an alternative estimation procedure is proposed in each case. This dissertation aims to supply empirical economists with the estimation tools they need to address these complex issues that commonly arise in binary response estimation.

Chapter 1

Parametric Identification of Multiplicative Exponential Heteroskedasticity

1.1 Introduction

Multiplicative exponential heteroskedasticity was first proposed by Harvey (1976) in the context of a linear conditional mean model. Estimation is undertaken in two stages, requiring first an argument that the conditional variance parameters are identified and then showing that a weighted least squares estimator identifies the conditional mean parameters. More recently, multiplicative exponential functions have been used to model heteroskedasticity in the latent errors of binary response models. However, in the cases of heteroskedastic Logit and Probit, the parameters in the conditional variance function are estimated concurrently with the coefficients of interest, requiring joint identification of the parameters. Standard textbooks such as Greene (2011) and Wooldridge (2010) state that these models are estimable under fairly standard conditions and are more flexible in the specification of the conditional mean function compared to standard Probit and Logit models, leading to widespread use in empirical work. However, the literature has yet to provide proofs of parametric identification.

To fill this gap in the literature, this chapter explores the issue of identification in models with exponential heteroskedasticity.
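Harvey's two-stage idea for the linear model can be sketched as follows. This is an illustrative reconstruction under assumed parameter values, not code from the chapter: the variance parameters are estimated from a regression of log squared OLS residuals on the variance covariate, and the mean parameters are then re-estimated by weighted least squares.

```python
# Illustrative two-stage sketch (hypothetical parameters) for a linear model
# with Var(e|z) = exp(2*delta*z).
import numpy as np

rng = np.random.default_rng(2)
n = 50000
x = rng.normal(size=n)
z = x                                        # variance covariate (a function of x here)
delta = 0.6
e = np.exp(delta * z) * rng.normal(size=n)   # multiplicative exponential heteroskedasticity
y = 2.0 + 1.5 * x + e

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

# Stage 1: OLS residuals, then regress log squared residuals on z.
# log(e^2) = 2*delta*z + log(eps^2); the E[log eps^2] constant is absorbed
# by the intercept, so half the slope estimates delta.
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b_ols
g = np.linalg.lstsq(Z, np.log(resid**2), rcond=None)[0]
delta_hat = g[1] / 2.0

# Stage 2: weighted least squares with weights 1/exp(2*delta_hat*z).
w = np.exp(-2.0 * delta_hat * z)
b_wls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))

print("delta_hat:", delta_hat)   # roughly 0.6
print("b_wls:", b_wls)           # roughly (2.0, 1.5)
```

The sequential structure is what distinguishes this linear case from the binary response case discussed next, where the variance and mean parameters must be identified jointly.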
Although the results may be applied to any model with a multiplicative exponential component, I will use the example of a heteroskedastic binary response model throughout the chapter to give the identification proofs some context. Consider the standard latent binary response model set-up:

Y = 1{Xβo − U > 0}

where U is heteroskedastic with conditional variance

Var(U|Z) = exp(2Zδo)

where Z may include functions of X. Presuming that there is no endogeneity in the usual sense, E(U|X, Z) = 0, and that the scaled latent error, U/exp(Zδo), is independent of (X, Z), the heteroskedastic binary response model has the following conditional probability distribution:

f(y|X, Z, θo) = Φ(Xβo/exp(Zδo))^y · [1 − Φ(Xβo/exp(Zδo))]^(1−y)    (1.1)

where Φ, the known cumulative distribution function of U/exp(Zδo), is monotonic and has support on the real line. If we assume a normal distribution, this is the individual likelihood for a heteroskedastic Probit model, and if we assume a logistic distribution, this is the individual likelihood for the heteroskedastic Logit model.

The following restates the identification definition from Newey and McFadden (1994) for MLE.¹

Definition 1.1.1 (Identification). Let f(y|X, Z, θo) be the conditional probability distribution of Y defined over the measures of X and Z with positive probability. If θ ≠ θo in the parameter space Θ implies P(f(y|X, Z, θ) ≠ f(y|X, Z, θo)) > 0 over the measures of X and Z, then θo is point identified.

¹This discussion can easily be extended to the cases of NLLS or GMM.

If one were to assume that E(X′X) is non-singular and βo is non-zero, then identification requires:

For (β, δ) ∈ Θ, if X(βo − exp(Z(δ − δo))β) = 0 w.p. 1, then (β, δ) = (βo, δo)    (1.2)

where Θ is the joint parameter space. The above statement captures the fundamental identification requirement for models with exponential heteroskedasticity.
This chapter aims to clarify when this statement holds and under what necessary or sufficient conditions. The simplest case of identification is when X and Z are not a bijective transformation of each other, in the sense that the variation in Z cannot be entirely explained by X or vice versa. One of the main contributions of this chapter is to provide a formal proof of identification in this scenario under the standard conditions of the literature. A sufficient condition for X and Z to not be a bijective transformation of one another is to impose an exclusion restriction (in either X or Z). An exclusion restriction would require that one of the random variables in the vector X is not included in the vector Z, or vice versa. By doing so, variation in one of the random vectors has been introduced that cannot be perfectly explained by the other random vector. But when one allows for an arbitrary relationship between X and Z, the standard conditions are no longer sufficient for identification.

To provide some intuition: when X and Z are bijective transformations of each other, showing identification is difficult due to the non-linear nature of the problem. As noted in Lewbel (forthcoming), non-linearity can allow for multiple solutions to the statement in (1.2). Specifically, if the relationship between X and Z allows Xβo to be equal to a scaling of Xβ by exp(Z(δ − δo)), then the model is not identified. This chapter will look at two ways the scaling of Xβ by exp(Z(δ − δo)) can be manipulated: through the joint support of (X, Z) and through the functional form of the heteroskedasticity, exp(Zδo). Section 3 discusses several counter-examples in which the conditions presupposed in the literature are not sufficient for identification.

The non-identification result in section three can be compared to the literature on identification in a binary response model. Identification in this setting has been well-studied in several papers by Manski (1985, 1988).
In the earlier paper, Manski looks at identification of a binary response model under a median restriction. This method allowed for arbitrary heteroskedasticity in the latent error but would, at most, identify the scaled parameters βo/||βo||. Simply put, to obtain identification in his framework, every β in the parameter space with β/||β|| ≠ βo/||βo|| must satisfy P(sgn(Xβ) ≠ sgn(Xβo)) > 0. Consequently, he provided non-identification results depending on the support of X. For instance, if Xβ had support bounded away from 0 (for all values of β in the parameter space), then βo is not identified. However, in our setting, the identification definition is based on the entire likelihood rather than just the sign of the linear index. So, to be clear, the non-identification results presented in this chapter are consequences of the highly non-linear specification of exponential multiplicative heteroskedasticity, as opposed to the limited information in the median restriction framework of Manski (1985).

Manski (1988) looks at identification of the scaled parameters βo in conjunction with the non-parametric identification of the conditional cumulative distribution of the latent error, F_{U|X}(·). Manski is able to show identification in the case of a known cumulative distribution function and statistical independence between U and X. But since there is heteroskedasticity in our setting, statistical independence does not hold. Manski also provides a non-identification result in the case of conditional mean independence and an unknown cumulative distribution function. Although in our setting the conditional distribution of the latent error is parametrically specified, the non-identification result would suggest that there should exist some conditional distribution specification in which identification does not hold. Therefore it is unsurprising that, even in this parametric setting, we can construct counter-examples in which identification is lost.
However, the non-identification results should not discourage the utilization of models with multiplicative heteroskedasticity in empirical work. The counter-examples provided are trivial in nature and are meant to highlight the non-existence of a general identification theorem for these models. As a helpful contribution, this chapter ends with a corollary that provides identification, under a bijective transformation relationship between the random vectors, for possibly the most commonly used specification.

1.2 Identification when there is no bijective transformation

Continuing with the example of a heteroskedastic binary response model described in equation (1.1), standard textbooks such as Greene (2011) and Wooldridge (2010) imply that the parameters are estimable² under the following conditions:

Condition 1. Z does not contain a constant.

Condition 2. E(X′X) is non-singular.

Condition 3. E(Z′Z) is non-singular.

²In this context, estimable is interpretively synonymous with point identified. However, neither text provides proofs of identification nor explicitly states that the models are point identified. Therefore the term "estimable" emphasizes the lack of rigorous treatment in the literature for identification in a parametric model.

Condition 1 implies the model is only identified up to scale. Alternatively, one could assume the normalization that one of the coefficients on X is equal to 1. Conditions 2 and 3 are needed in order to show that Xβo = Xβ and Zδo = Zδ imply βo = β and δo = δ, respectively. Additionally, although not commonly stated, identification requires

Condition 4. βo is non-zero.

Without this assumption, δo is not identified.³ This can easily be addressed by assuming a non-zero intercept as a location normalization. Under these assumptions, identification holds when X and Z are not bijective transformations of each other, as stated in the following theorem.

Theorem 1.2.1.
If Conditions 1-4 hold, and X and Z are not bijective transformations of each other, then the parameters βo and δo are point identified.

Before providing the proof, I will formally characterize ‘bijective transformation’.

Definition 1.2.1 (Bijective Transformation). X is a bijective transformation of Z (and equivalently Z is a bijective transformation of X) if there exists a bijective function f such that

X = f(Z), Z = f−1(X)

where f−1 denotes the inverse of f.

³ Suppose βo = 0; then Xβo is zero with probability 1, so, as long as β = 0, any δ ≠ δo satisfies X(βo − exp(Z(δ − δo))β) = 0 and therefore δo is not identified. This has fairly minor consequences since, in empirical work, researchers tend to be more interested in the coefficient parameters βo.

This definition can also be understood in terms of the support of X and Z. A bijective transformation would require that for every x in the support of X, there exists a unique z in the support of Z such that (x, z) occurs with positive probability in the joint support of (X, Z) and, for any z′ ≠ z in the support of Z, (x, z′) occurs with probability 0 in the joint support. Conversely, for every z in the support of Z there exists a unique x in the support of X such that (x, z) occurs with positive probability in the joint support of (X, Z) and, for any x′ ≠ x in the support of X, (x′, z) occurs with probability 0 in the joint support. This implies that the variation in X can be perfectly described by variation in Z. Figure A.1 visually shows what is implied by a bijective transformation and examples in which a bijective transformation does not hold.

Proof. Suppose there exists a (β, δ) ∈ Θ such that

X(βo − exp(Z(δ − δo))β) = 0    (1.3)

holds for almost all X and Z in their support. Since βo is non-zero and E(X′X) is non-singular, Xβo (and similarly Xβ) is non-zero with positive probability.
Consequently, Xβo/Xβ exists and is strictly positive with positive probability.⁴ Rearranging the equation above for a realization (x, z),

z(δo − δ) = ln(xβo/xβ)    (1.4)

where the realizations are in the following restricted support {(x, z) ∈ support(X, Z) : xβo and xβ are non-zero}. Since X and Z are not bijective transformations of each other, there must exist variation in either X or Z that cannot be explained by the other. When there is variation in Z not explained by X, there exists a realization of X for which more than one realization of Z occurs with positive probability in the joint support. This would allow for different realizations on the left hand side of the above equation while the right hand side is fixed at one possible realization. Since E(Z′Z) is non-singular, the above equation can only hold when δ = δo and consequently β = βo. Similar conclusions follow when there is variation in X not explained by Z. □

⁴ For equation (1.3) to hold, sign(Xβo) = sign(Xβ).

1.3 No identification when there is a bijective transformation

However, confining to the case where X and Z are not bijective transformations of one another is fairly restrictive. Return to the heteroskedastic binary response example where one is interested in modelling the mean of Y conditional on X. Let σ(X) denote the conditional standard deviation of the latent error, where it is reasonable to assume a double index model such that

E(Y|X) = Φ(Xβo/σ(X)) = Φ(Xβo/exp(Zδo))    (1.5)

where Z consists of bijective transformations of the elements in X.⁵ As mentioned before, to get around X and Z being bijective transformations, one could consider imposing an exclusion restriction. But this would require prior knowledge of which elements in X affect the conditional variance and which do not. Nevertheless, more generally, identification is not obtainable in the case of a bijective transformation under the previously stated conditions.
The following two counter-examples provide settings in which identification fails.

⁵ Klein and Vella (2009) discuss identification in the semi-parametric case where they use a re-indexing approach following Ichimura and Lee (1991).

Counter-example: binary support

Suppose X = (1, Z) where Z is a binary variable. Then the first part of statement (1.2) can be decomposed to

X(βo − exp(Z(δ − δo))β) =
    β1o − β1                              if Z = 0
    β1o + β2o − exp(δ − δo)(β1 + β2)      if Z = 1    (1.6)

The first part implies β1 = β1o. Plugging into the second part, equation (1.2) holds if β2 = exp(δo − δ)β2o − β1o(1 − exp(δo − δ)), which does not imply δ = δo or β2 = β2o. Even though Conditions 1-4 are satisfied, identification is lost because, under the binary support of Z, the parameters β2o and δo are inherently linked. Obviously, with binary support there is no possible way to separately identify a non-linear (the exponential component) effect from a linear effect. Therefore specifying an exponential multiplicative heteroskedastic model is naive and illogical in the binary support setting. In fact, it is not possible to discern any non-negative scale function as heteroskedasticity as opposed to the linear mean function, Xβo.⁶ Nevertheless, this concern needs to be addressed when determining conditions for identification.

⁶ For any two non-negative scale functions go(Z) and g(Z),

X(βo − (go(Z)/g(Z))β) =
    β1o − (go(0)/g(0))β1                      if Z = 0
    β1o + β2o − (go(1)/g(1))(β1 + β2)         if Z = 1

which implies β1 = (g(0)/go(0))β1o, and the second part holds as long as β2 = (g(1)/go(1))β2o + β1o((g(1)/go(1)) − (g(0)/go(0))), which does not imply g(Z) = go(Z) or β = βo. Consequently, any non-negative scale function cannot be identified as heteroskedasticity separately from a mean effect.

Counter-example: exponential transformation

Unlike the previous counter-example, which manipulates the support of (X, Z), this counter-example takes advantage of the functional form of the heteroskedasticity. Suppose X = (1, exp(Z)) and Z is univariate and continuous; then the first part of statement (1.2) becomes

X(βo − exp(Z(δ − δo))β) = β1o + exp(Z)β2o − exp(Z(δ − δo))β1 − exp(Z(1 + δ − δo))β2

If δ − δo = −1 and β1 = β2o = 0, then any values β1o = β2 make the above equation equal to 0 for all values of Z. Alternatively, if δ − δo = 1 and β1o = β2 = 0, then any values β1 = β2o also make the above equation equal to 0. This only holds for the exponential transformation because the heteroskedasticity is of exponential form. By imposing the same transformation in the mean term as in the heteroskedastic term, it becomes difficult to differentiate between the mean effect Xβo and the heteroskedastic effect exp(Zδo).

Non-identification in simulation

To illustrate the consequence of non-identification in estimation, the following simulation exercise uses the second counter-example to construct two observationally equivalent data generating processes for a heteroskedastic Probit model. Let Z ∼ N(0, 1) and X = exp(Z), then consider the following two data generating processes:

Y1 = 1{0 + 0.5X + U1 ≥ 0}, where U1 ∼ N(0, exp(4Z))    (1.7)
Y2 = 1{0.5 + U2 ≥ 0}, where U2 ∼ N(0, exp(2Z))    (1.8)

According to the analysis given above, these two models are observationally equivalent. The simulation randomly draws a sample of (X, Z) and then computes two different outcomes, Y1 and Y2, for the same independent variable sample. Then, using the hetprobit command in Stata, two estimations are performed, one using the outcomes Y1 from the first specification and the other using the outcomes Y2 from the second specification.
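The observational equivalence of (1.7) and (1.8) can be checked directly: both specifications imply the response probability Φ(0.5 exp(−Z)). A minimal stdlib-Python sketch of this check (the seed, sample size, and grid of z values are arbitrary choices, not part of the original design):

```python
import math
import random

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_spec1(z):
    # P(Y1 = 1 | Z = z) = Phi((0 + 0.5*exp(z)) / sd(U1)), sd(U1) = exp(2z)
    return norm_cdf(0.5 * math.exp(z) / math.exp(2.0 * z))

def p_spec2(z):
    # P(Y2 = 1 | Z = z) = Phi(0.5 / sd(U2)), sd(U2) = exp(z)
    return norm_cdf(0.5 / math.exp(z))

# Both reduce to Phi(0.5*exp(-z)) at every z
for z in [-2.0, -0.5, 0.0, 0.7, 1.5]:
    assert abs(p_spec1(z) - p_spec2(z)) < 1e-12

# Monte Carlo cross-check: outcome frequencies from the two latent models agree
random.seed(0)
n = 200_000
hits1 = hits2 = 0
for _ in range(n):
    z = random.gauss(0.0, 1.0)
    x = math.exp(z)
    u1 = random.gauss(0.0, math.exp(2.0 * z))  # sd(U1) = exp(2z), so Var = exp(4z)
    u2 = random.gauss(0.0, math.exp(z))        # sd(U2) = exp(z),  so Var = exp(2z)
    hits1 += (0.0 + 0.5 * x + u1 >= 0.0)
    hits2 += (0.5 + u2 >= 0.0)
print(hits1 / n, hits2 / n)  # two estimates of the same P(Y = 1)
```

No likelihood-based estimator can separate the two specifications, since they imply identical conditional distributions of the outcome.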
Figure A.2 shows the empirical distributions of the parameter estimates for a sample size of 1,000. This plainly demonstrates that the estimator cannot distinguish between the different parameter values that construct the two outcomes. Because the distributions of the estimates for Specification 1 and Specification 2 look identical, one could think there is no divergence in the parameter estimates within a sample between the two data generating processes, but looking at the difference of the two parameter estimates (Difference), there appears to be a trimodal distribution. The mass around 0 implies that the outcomes in the two data generating processes are similar enough that the estimation procedure calculates the same parameter values even though the data generating processes are formed using two different parameter values. The mass around -0.5 in the first figure and the mass around 0.5 in the second figure show that in some of the samples, the estimator correctly matches the ‘true parameter value’ to the data generating process. However, the remaining mode that occurs symmetrically across 0 shows that the estimator can also incorrectly match the parameter estimates to the alternate data generating process.

But again, this example is trivial in nature, and an empirical researcher may sidestep it by flexibly specifying the conditional mean with W = (1, 1/X) and a homoskedastic latent error. This specification is observationally equivalent and is identified. The concern is how one could generally show identification in the case of bijectively transformed variables in a way that excludes these types of counter-examples.

The two counter-examples show that Conditions 1-4 are not sufficient for identification. They manipulate the support of the random variables and the form of the heteroskedasticity to lose identification. To better understand why, it is best to re-examine equation (1.4). The left hand side is linear in Z while the right hand side is a logarithmic function of a ratio of X.
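For the exponential-transformation counter-example, this collapse of ln(xβo/xβ) into a linear function of Z can be verified directly. A small sketch using the parameter values implied by the two data generating processes above (Specification 1: βo = (0, 0.5), δo = 2; Specification 2: β = (0.5, 0), δ = 1, where δ indexes the standard deviation exp(Zδ)):

```python
import math

# Equation (1.4): z*(delta_o - delta) = ln(x*beta_o / x*beta), with X = (1, exp(Z)).
beta_o, delta_o = (0.0, 0.5), 2.0   # Specification 1
beta,   delta   = (0.5, 0.0), 1.0   # Specification 2

for z in [-1.5, -0.3, 0.0, 0.8, 2.0]:
    x = (1.0, math.exp(z))
    xb_o = x[0] * beta_o[0] + x[1] * beta_o[1]   # = 0.5*exp(z)
    xb   = x[0] * beta[0]   + x[1] * beta[1]     # = 0.5
    lhs = z * (delta_o - delta)                  # linear in z
    rhs = math.log(xb_o / xb)                    # the log ratio also equals z here
    assert abs(lhs - rhs) < 1e-12
```

Because the exponential transformation in X undoes the logarithm, the right hand side of (1.4) is exactly linear in Z, so the equation holds at parameter values other than the truth.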
If there is a defined relationship between X and Z such that the logarithmic function in X is equivalent to a linear function of Z, then identification does not hold. In the first example, the logarithmic function of X is necessarily linear because of the binary support. In the second example, the transformation undoes the logarithmic function, resulting in a linear function (for specific values of the parameters).

1.4 Identification in a common specification

The previous section provides justification as to why there is no general result on identification for the case of bijective transformations. The concern is that the most prevalent use of multiplicative exponential heteroskedastic models is when there is a bijective transformation between the random vectors. This would require showing identification prior to estimation for every variation of a specification. To provide some assistance on that front, the following shows identification in a general (although not completely general) and commonly used specification.

Example: polynomial transformations

Wanting to allow for a flexibly specified conditional variance function, one might consider polynomial functions as an approximation of the variance function.⁷ The following corollary obtains identification in this commonly used specification.

⁷ Khan (2013) shows that a heteroskedastic Probit model with a non-parametric conditional variance function is observationally equivalent to a model with a median restriction and no distributional assumptions on the latent error. This would motivate flexible specification of the variance function as a way to allow flexibility in the latent error distribution.

Corollary 1.4.1. If X = (1, X2), in which X2 is a univariate continuous random variable, and Z = (X2, X2^2, ..., X2^p), then under Conditions 1, 3, and 4, the parameters βo and δo are identified.

Proof.
Suppose there exists a (β, δ) ∈ Θ such that

X(βo − exp(Z(δ − δo))β) = 0    (1.9)

holds for almost all X and Z in their support. By Condition 4, one can rearrange equation (1.9) to

X2(δ1 − δ1o) + X2^2(δ2 − δ2o) + ... + X2^p(δp − δpo) = ln((β1o + X2β2o)/(β1 + X2β2))    (1.10)

Since X2 is continuous, taking the (p + 1)th derivative of both sides with respect to X2 (the left hand side, a polynomial of degree p, vanishes),

0 = (−1)^p p! [(β2o/(β1o + X2β2o))^(p+1) − (β2/(β1 + X2β2))^(p+1)]

which implies (β1o + X2β2o)/(β1 + X2β2) = β2o/β2. Plugging into equation (1.10), the right hand side becomes a constant. By Conditions 1 and 3, the equality cannot hold for any non-zero (δ − δo); thus δo is identified. Finally, since Condition 2 is inherently implied by the given specification, the identification of δo implies β = βo. □

Note that one of the most common specifications, X2 = Z, is a special case of this result. This result could easily be extended to the cases where X2 is not univariate and contains discrete random variables.

1.5 Conclusion

It has been widely accepted that a model with multiplicative exponential heteroskedasticity was estimable under Conditions 1 through 4 provided in Section 2. This chapter provides a proof of identification when the variables are not bijective transformations of one another. But in the more general case, I supply two examples in which those conditions are satisfied but point identification is not obtainable. Consequently, the conditions previously stated in the literature are not sufficient for distinguishing a linear effect from an exponential effect in all cases. To address many of the concerns arising from the lack of a general identification proof, this chapter also provides a proof of identification in a commonly used specification.

In the next chapter, the results here will be utilized to obtain identification for the proposed estimation of an endogenous binary response model.
The proposed approach relaxes assumptions that were standard in the literature but that I found to be too restrictive in most empirical settings. One of the motivations behind relaxing the assumptions was to allow for potential heteroskedasticity. Obtaining identification in this setting has two challenges: (1) relaxing assumptions that were previously used for identification, and (2) identification with multiplicative exponential heteroskedasticity. This chapter provides the foundation for overcoming the second challenge. Therefore the identification strategy in Chapter 2 emphasizes the importance and utility of the results in Chapter 1.

Chapter 2

Relaxing Conditional Independence in an Endogenous Binary Response Model

2.1 Introduction

In recent years, uncovering causal effects has become a cornerstone of economics research. The interest in causality, as opposed to mere correlation, allows for more plausible policy implications, counter-factual analysis, and the disentanglement of causal mechanisms. Endogeneity, correlation between the unobserved heterogeneity and covariates, is prevalent in economic settings and will bias parameter estimates, which will ultimately affect the causal interpretations. With more realistic assumptions than those provided in the literature, this chapter proposes a new control function estimator in a binary response setting to address endogeneity.

Binary responses, 0 or 1 outcomes, are common in economics research. For instance, employment, graduating from college, and purchasing decisions are all binary outcomes. In order to accurately uncover the true underlying mechanism in a binary response model, many researchers turn to the latent variable set up (sometimes referred to as a hurdle model), resulting in non-linear estimation. But treating endogeneity in a non-separable and non-linear setting is not as straightforward as using a “plug-in” instrumental variables estimator in a simple linear regression.
A series of papers (Smith and Blundell (1986), Rivers and Vuong (1988), Blundell and Powell (2004), and Rothe (2009)) have proposed using a control function method in constructing an estimator that appropriately addresses endogeneity. To gain identification, these papers place strong assumptions on the relationships between the latent unobserved errors and the instruments. Essentially, they impose an exclusion restriction such that the conditional distribution of the latent error cannot be a function of the instruments. These Control Function assumptions (CF-CI) are equivalent to assuming conditional independence between the latent error and the instruments and are unlikely to hold in an empirical setting.

For instance, in models of labor participation, one may be interested in uncovering the effect of non-wage income on the probability of employment. But there are concerns of endogeneity because one of the main sources of non-wage income is the partner’s wages, and labor force participation decisions are usually simultaneously determined within the household. CF-CI would require that shocks to (non-wage) household income affect labor participation decisions independent of any other included covariates, such as education, age, children in the household, or instruments such as husband’s education.

Another example, in the field of health economics, is evaluating the effect of drug rehabilitation treatment on subsequent substance abuse. There is endogeneity because the covariate of interest, the number of visits the client makes during the episode of treatment, is most likely correlated with unobserved characteristics of the client that determine the likelihood of successful treatment. For example, those who are more likely to relapse initially (longer drug use or less community support) are less likely to visit the center during the episode of treatment.
CF-CI would require that the unobserved characteristics of the client cannot have interactive effects with other included covariates such as age, income, or marital status.

In a more structural setting, suppose researchers are interested in understanding the welfare loss from government intervention into insurance markets, using variation in prices to estimate the demand and marginal cost of insurance. But observed prices are endogenously determined since they are likely correlated with unobserved characteristics. Using exogenous variation in prices, possibly through variation in administrative costs or changes in the competitive environment over markets, endogeneity may be addressed. But CF-CI imposes functional form restrictions on the utility function: that unobserved characteristics are additively separable from observed market, product, or individual characteristics.

This chapter proposes an alternative framework and control function estimator that relaxes this strong assumption. This generalization has been explored in other settings, such as the case of endogenous random coefficients for a linear model in Wooldridge (2005) and demand estimation where the unobserved product characteristics do not enter the utility function additively in Gandhi, Kim, and Petrin (2013). More generally, Kim and Petrin (2017) set up a framework for the “general control function,” permitting the unobserved heterogeneity to be a function of the instruments, in the case of additively separable triangular equation models. This chapter extends the general control function approach of Kim and Petrin (2017) to the case of binary response models to propose a new estimator that is valid under the failure of CF-CI.
One of the main contributions of this chapter is to clearly explain why CF-CI would not realistically hold in empirical settings and, given the likely failure of CF-CI, to apply the general control function approach of Wooldridge (2005), Gandhi, Kim, and Petrin (2013), and Kim and Petrin (2017) to a binary response setting. A simulation illustrates that, given the failure of CF-CI, the general control function approach, as opposed to alternative control function methods of the literature, is needed to accurately recover parameter estimates. Under the more general framework, CF-CI implies testable hypotheses for which standard variable addition or Wald tests can be used. In an empirical application on female labor supply, the CF-CI assumption is easily rejected.

This chapter also adds to the larger literature on control function approaches to triangular simultaneous equations for both additively separable¹ and non-separable models.² In the literature, CF-CI has been taken as a required assumption in order to employ a control function approach. In the discussion on identification, this chapter comments on how other control function methods in the literature obtain identification, explains why their approaches to identification can be restrictive, and proposes a simpler alternative. Consequently, this chapter provides an example where, under a reasonable setting, CF-CI with respect to the control variable need not hold to recover structural objects such as the Average Structural Function (ASF) or the Average Partial Effects (APE).

By focusing on the ASF and APE, this chapter contributes to the discussion on interpretation of non-linear models under the presence of endogeneity. Blundell and Powell (2003, 2004) introduced the ASF as a way to interpret binary response models when there is endogeneity.
They note that a conditional mean interpretation cannot capture the causal and structural effect that a model incorporating endogeneity should produce. More recently, Lewbel, Dong, and Yang (2012) propose using an Average Index Function (AIF) as a generally easier to identify alternative to the ASF. Lin and Wooldridge (2015) compare the two approaches and conclude that the ASF is a more appropriate function for interpretation and is able to capture the mechanisms of interest. This chapter further supports the conclusions of Lin and Wooldridge (2015): it is shown that under the more general framework of this chapter, only the proposed estimation procedure recovers the correct ASF. This chapter also shows that the alternative estimator from Rothe (2009), the Semi-parametric Maximum Likelihood (SML) estimator, actually produces estimates for the AIF, which is shown in simulation to be distinctly and interpretively different from the ASF.

¹ Although a latent variable binary response model is non-separable due to the indicator function, separability is imposed inside the indicator function. Consequently, results from the additively separable literature may still apply.

² Literature on additively separable triangular equation models includes Newey, Powell, and Vella (1999), Pinkse (2000), Su and Ullah (2008), Florens, Heckman, Meghir, and Vytlacil (2008), Ai and Chen (2003), Newey and Powell (2003), Newey (2013), Kim and Petrin (2017), and Hoderlein, Holzmann, and Meister (2017). Literature on non-separable triangular equation models includes Imbens and Newey (2009), Kasy (2011), Blundell and Matzkin (2014), Chen, Chernozhukov, Lee, and Newey (2014), Kasy (2014), and Hoderlein, Holzmann, Kasy, and Meister (2016).

The proposed estimator is presented in a parametric framework, but in some empirical contexts the distributional assumptions may be unrealistic. Therefore this chapter also provides a semi-parametric extension that proposes a new distribution free estimator.
Using the observational equivalence results of Khan (2013), the proposed sieve semi-parametric estimator is shown to be consistent under weaker assumptions than those found in the literature. Consequently, this chapter contributes to the literature on semi- and non-parametric estimation as a particular application of a semi-parametric two stage sieve estimator. Sieves (as opposed to kernel methods) are suggested in order to impose necessary shape restrictions on the general control function. Asymptotic results are derived using the works of Ai and Chen (2003), Chen, Linton, and Van Keilegom (2003), and Hahn, Liao, and Ridder (2018). A comprehensive simulation study shows that only the proposed estimator can produce accurate parameter and ASF estimates under the failure of CF-CI.

The remainder of this chapter is organized as follows. Section 2 provides motivation for relaxing CF-CI, specifically in the setting of binary response models, and reviews previous approaches and their potential shortcomings. Section 3 describes the set up of the model and introduces the general control function method of Kim and Petrin (2017) in the binary response setting. Empirical examples are provided to illustrate how CF-CI is unlikely to hold in many economic settings and how the proposed framework captures the potentially complex structure of endogeneity. Section 4 goes into more detail about the operation and interpretation of the general control function approach. Because CF-CI is used to show identification, the generalizations proposed in this chapter put into question whether identification still holds. The Conditional Mean Restriction from Kim and Petrin (2017), which places a shape restriction on the general control function, is used to show identification. This section also provides a simulation to illustrate the failure of estimators that require CF-CI when only the weaker CMR assumption holds.
Section 5 instructs on the implementation of the proposed estimator and derives the asymptotic properties such as consistency and asymptotic normality. Because the parameters of a binary choice model have no direct economic interpretation, this section also discusses the ASF and APE as structural objects of interest and how to recover them under the proposed framework. Section 6 illustrates the proposed estimator in an empirical application. Using 1991 CPS data, this chapter examines the effect of non-wage income on a married woman’s probability of labor force participation. CF-CI implies a testable hypothesis under the proposed framework, and a Wald test finds strong statistical evidence that the assumptions of previous estimators are violated. Although the parametric assumptions are likely to hold in the empirical application provided, there are many economic settings where the distributional assumptions are restrictive and unconvincing. The final section extends the framework to a distribution free setting using a semi-parametric estimator. This section provides the asymptotic properties of the semi-parametric estimator as well as a comprehensive simulation study comparing the proposed approach to other estimators in the literature.

2.2 Background and Motivation

Consider the latent variable triangular system where y1i is a binary response variable, zi = (z1i, z2i) is a 1 × (k1 + k2) vector of “non-endogenous” included and excluded instruments, y2i is a single continuous endogenous regressor, and xi is a 1 × k vector where each element is a function of (z1i, y2i) and includes a constant.

y1i = 1 if y∗1i ≥ 0, and y1i = 0 if y∗1i < 0
y∗1i = xiβo + u1i
y2i = m(zi)πo + v2i    (2.1)

The endogenous variable y2i can be decomposed into its conditional mean and the unobserved endogenous component, v2i.
Alternatively, one could consider a linear probability model, which has the advantages of being easy to estimate and easy to interpret, and for which dealing with endogeneity is relatively simple, or at the very least well studied. But linear probability models are restrictive and cannot be representative of the true underlying mechanisms. Their predicted probabilities can lie outside the [0, 1] bounds, which places limitations on the interpretation of the estimates. Therefore this chapter will focus on the latent variable setting.

In this framework, a series of papers, Smith and Blundell (1986), Rivers and Vuong (1988), Blundell and Powell (2004), and Rothe (2009), developed estimators to address endogeneity using a control function approach. The control function approach supposes that there is a particular function (or variable) that, when included as an additional covariate, is able to control for the endogeneity of the other regressors. For example, Rivers and Vuong (1988) show that if one were to assume that u1i and v2i are bivariate normal and independent of the instruments zi, then one can derive the following conditional distribution,

u1i|v2i, zi ∼ N((ρσ1/σ2)v2i, (1 − ρ²)σ1²)    (2.2)

where σ1 and σ2 are the standard deviations of u1i and v2i respectively and ρ is the correlation coefficient. This conditional distribution provides the foundation for the control function approach in this context. The latent equation can be rewritten as

y∗1i = xiβo + γov2i + ε1i    (2.3)

where ε1i = u1i − γov2i and γo = ρσ1/σ2. Notice that ε1i|v2i, zi ∼ N(0, (1 − ρ²)σ1²), which means there is no endogeneity between the regressors and the new latent error ε1i (i.e., E(ε1i|xi, v2i) = 0). Therefore the reduced form error v2i can be used as a control function; by including v2i as an additional covariate, one can control for the endogeneity in y2i and obtain consistent parameter estimates.
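The conditional moments in (2.2), and hence the control function coefficient γo = ρσ1/σ2 in (2.3), can be verified by simulation. A stdlib-Python sketch (the values of ρ, σ1, σ2, the seed, and the sample size are arbitrary choices for illustration):

```python
import math
import random

random.seed(1)
rho, sig1, sig2 = 0.6, 1.5, 2.0
gamma_o = rho * sig1 / sig2  # control function coefficient in (2.3)

# Draw (u1, v2) bivariate normal with correlation rho via two independent normals
n = 200_000
u1s, v2s = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    v2 = sig2 * z1                                        # v2 ~ N(0, sig2^2)
    u1 = sig1 * (rho * z1 + math.sqrt(1 - rho**2) * z2)   # corr(u1, v2) = rho
    u1s.append(u1)
    v2s.append(v2)

# Sample analogue of E(u1 | v2) = gamma_o * v2: slope = cov(u1, v2) / var(v2)
mean_u = sum(u1s) / n
mean_v = sum(v2s) / n
cov_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(u1s, v2s)) / n
var_v = sum((v - mean_v) ** 2 for v in v2s) / n
slope = cov_uv / var_v

# The residual variance matches (1 - rho^2)*sig1^2, the variance of eps1 in (2.3)
resid_var = sum((u - slope * v) ** 2 for u, v in zip(u1s, v2s)) / n
print(slope, gamma_o, resid_var, (1 - rho**2) * sig1**2)
```

The fitted slope recovers γo and the residual variance recovers (1 − ρ²)σ1², which is why including v2i as a regressor removes the endogeneity in this homoskedastic setting.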
In general, the control function approach constructs a function of the instruments and regressors that can act as a valid proxy for the source of the endogeneity. Of course, in practice, one does not observe v2i, so instead the residuals from a first stage estimation procedure that regresses the endogenous variable on the instruments can be used.

In order to relax the distributional assumptions in Rivers and Vuong (1988), Blundell and Powell (2004) and Rothe (2009) propose alternative semi-parametric estimators. But to obtain non-parametric identification of the distribution of the latent error, they make a rather strong assumption on the relationship between the unobserved errors, u1i and v2i, and the instruments, zi. This Control Function assumption imposes Conditional Independence (CF-CI). The CF-CI assumption requires that the instruments zi are independent of the latent error u1i after conditioning on the reduced form error v2i:

u1i|v2i, zi ∼ u1i|v2i    (2.4)

Note that CF-CI is implicit in the set up of Rivers and Vuong (1988). Interpretively, this means that any source of endogeneity must be fully captured through the control variate v2i, or, in terms of an exclusion restriction, the conditional CDF Fu1|v2,z(u1i|v2i, zi) is only a function of v2i (the instruments zi are excluded). This exclusion restriction must also hold for all moments, which, as will be shown shortly, may be hard to justify.

As a slight relaxation of CF-CI, Rothe (2009) also proposes a Linear Index (CF-LI) sufficiency assumption that, after conditioning on the first stage error and the linear index xiβo, the latent error is independent of the instruments:

u1i|v2i, zi ∼ u1i|v2i, xiβo    (2.5)

Now the instruments can be a part of the conditional distribution, but only through the linear index. The linear index restricts the relative direction and magnitudes of the regressors in the conditional distribution.
So, although it allows for a more relaxed relationship between the instruments and the unobserved heterogeneity, it is hard to justify in a general setting. In either case, these assumptions used to obtain identification may be too stringent in many empirical contexts.

To give a motivating parametric example, consider a slight variation of the Rivers and Vuong (1988) set up where u1i and v2i are still bivariate normal but are allowed to be heteroskedastic in the instruments; i.e.,

(u1i, v2i)′ | zi ∼ N( (0, 0)′, [ [σ1(zi)]²  ρ(zi)σ1(zi)σ2(zi) ; ρ(zi)σ1(zi)σ2(zi)  [σ2(zi)]² ] )    (2.6)

Heteroskedasticity is commonly found in empirical data, whether it is actually caused by variability in the latent error over the regressors or by heterogeneity in the slopes as in a random coefficients setting.³ Even in the linear regression, heteroskedasticity has been accepted as endemic in empirical settings and heteroskedasticity robust inference is always employed. Again, by the properties of the bivariate normal distribution, the following conditional distribution is derived:

u1i|v2i, zi ∼ N( ρ(zi)(σ1(zi)/σ2(zi))v2i, (1 − [ρ(zi)]²)[σ1(zi)]² )    (2.7)

This is a fairly small variation to the framework considered in Rivers and Vuong (1988), but ignoring heteroskedasticity can strongly bias parameter estimates. A simple Monte Carlo exercise, detailed in Figure D.1, illustrates the potential bias. Suppose equations (2.1) and (2.6) hold with a single excluded instrument and no included instruments such that ρ(zi) = 0.6, σ1(zi) = σ2(zi) = exp(0.25zi). Figure D.1 displays the empirical distribution of the estimators for β where the true value is equal to one. The Rivers and Vuong estimator that ignores heteroskedasticity is substantially biased, with estimates of β centered around 1.2. This illustrates that ignoring heteroskedasticity in the context of binary response models produces inconsistent parameter estimates.
Now let us compare the distribution in equation (2.7) to the CF-CI and CF-LI assumptions for the semi-parametric estimators. CF-CI clearly does not hold since the exclusion restriction fails: both the conditional mean and conditional variance depend on the instruments. For the CF-LI assumption to hold, the heteroskedastic functions σ1(·), σ2(·), and ρ(·) could only be functions of the linear index xiβo. This is quite restrictive and would not generally hold. Therefore the semi-parametric estimators do not apply to this simple parametric setting.

This causes some concern that the control function method may not be valid for this example. But the conditional distribution in equation (2.7) suggests that there should still be a control function approach to address endogeneity. If one were to include ρ(zi)(σ1(zi)/σ2(zi)) v2i as an additional covariate, then estimating a heteroskedastic Probit with the following conditional mean will produce consistent parameter estimates:

E(y1i|zi, v2i) = Φ( [xiβo + ρ(zi)(σ1(zi)/σ2(zi)) v2i] / sqrt( (1 − [ρ(zi)]²)[σ1(zi)]² ) )    (2.8)

Building a control function method from a more general conditional distribution of the latent error is the motivation and starting point for the proposed approach.

3This is similar to the set up in Kasy (2011) where he provides a counter-example to the control function approach proposed by Imbens and Newey (2009). In Imbens and Newey (2009), they propose using the control variable Vi = Fy2|z(y2i, zi) which would satisfy CF-CI when the heterogeneity is only one-dimensional, as pointed out in Kasy (2011). In his example, a linear random coefficient model is used to show the failure of CF-CI using the Imbens and Newey control variable. Note that the random coefficient model can be rewritten as a linear model with heteroskedasticity as suggested here.
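Equation (2.8) is easy to operationalize once ρ(·), σ1(·), and σ2(·) are specified. The sketch below uses illustrative functional forms (my own, not the chapter's) to evaluate the corrected response probability, and confirms that with ρ(·) = 0 it collapses to a heteroskedastic Probit with no control function term.

```python
import numpy as np
from scipy.stats import norm

def response_prob(x_beta, v2, z, rho, s1, s2):
    """Conditional mean E(y1|z, v2) from equation (2.8):
    heteroskedastic Probit with control term rho(z)*(s1(z)/s2(z))*v2."""
    mean_shift = rho(z) * (s1(z) / s2(z)) * v2
    scale = np.sqrt((1.0 - rho(z) ** 2) * s1(z) ** 2)
    return norm.cdf((x_beta + mean_shift) / scale)

# illustrative specifications (assumed, for demonstration only)
s1 = s2 = lambda z: np.exp(0.25 * z)
p_endog = response_prob(0.5, v2=1.0, z=0.2, rho=lambda z: 0.6, s1=s1, s2=s2)
p_exog  = response_prob(0.5, v2=1.0, z=0.2, rho=lambda z: 0.0, s1=s1, s2=s2)
# with rho = 0 the control term drops out and the scale is just s1(z)
assert np.isclose(p_exog, norm.cdf(0.5 / s1(0.2)))
```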
This example highlights that the assumptions used to obtain identification in the semi-parametric approaches can be fairly stringent and are not well understood in terms of their consequences. This chapter questions the necessity of the CF-CI assumption and considers its implications for estimation and interpretation. As an alternative to the semi-parametric estimators, I propose an estimator that builds upon the same control function technique but extends the model by relaxing the CF-CI assumption. In the previous heteroskedastic bivariate normal example, this means the instruments can be a part of the conditional variance and the conditional mean. But the CF-CI and CF-LI assumptions are not imposed superfluously; they are used to obtain identification of the conditional distribution of the latent error. In this chapter, I will first consider a parametric alternative to isolate the necessary conditions for identification. At the end of this chapter, I present a distribution-free extension that proposes a semi-parametric estimator with strictly weaker identification requirements compared to the estimators of Blundell and Powell (2004) and Rothe (2009).

In a related strand of literature on non-parametric triangular simultaneous equation models with additively separable unobserved heterogeneity, Kim and Petrin (2017) question the restrictive control function assumptions in the Non-Parametric Control Function (NP-CF) literature (Newey, Powell, and Vella (1999), Pinkse (2000), Su and Ullah (2008), and Florens, Heckman, Meghir, and Vytlacil (2008)). Because the control function method requires the additional control function assumption for identification, the Non-Parametric Instrumental Variables (NP-IV) approach, as in Ai and Chen (2003), Newey and Powell (2003), Hall and Horowitz (2005), and Newey (2013), appears to be a superior approach that only requires the weaker Conditional Mean Restriction (CMR).
Kim and Petrin (2017) show that a control function approach is still valid under the weaker CMR when a general control function is specified. This chapter extends their results to a binary response model under a latent variable framework and will use the CMR in showing identification.

Alternative estimators that do not require the CF-CI assumption in estimating endogenous binary response models are the special regressor estimator proposed in Lewbel (2000) and Dong and Lewbel (2015) and the maximum score and smoothed maximum score estimators in Hong and Tamer (2003) and Krief (2014). The special regressor estimator requires a regressor, independent of all unobserved heterogeneity, that has large support, and without this "special regressor" the procedure is invalid. Alternatively, Hong and Tamer (2003) and Krief (2014) extend the maximum score and smoothed maximum score methods of Manski (1985) and Horowitz (1992) to estimate the structural parameters βo in the linear index. For identification they require conditional median independence: Med(u1i|v2i, zi) = Med(u1i|v2i). This would allow for general forms of heteroskedasticity but, as seen in the heteroskedastic bivariate normal example, the conditional median independence assumption is still quite restrictive and would not necessarily hold. Moreover, the conditional median independence assumption does not identify the distribution of the latent error, and therefore these estimators cannot recover the Average Structural Function (ASF) and Average Partial Effects (APE).

The proposed framework that allows for relaxation of CF-CI and CF-LI in a parametric setting is introduced in the next section. The proposed general control function estimator directly follows from the conditional distribution of the latent error provided in the framework.

2.3 Model Set Up

Return to the set up described in equation (2.1). The distributional assumptions for u1i and v2i determine the consistent estimation procedure.
Although most of the assumptions in the literature are based on a specification of the joint distribution of u1i and v2i (see Rivers and Vuong (1988) and Petrin and Train (2010)), one merely needs to specify the conditional distribution to use the control function approach. For example, if one were to assume u1i|v2i, zi ∼ N(0, 1), so there is no endogeneity and no heteroskedasticity, then a standard Probit maximum likelihood estimation (MLE) procedure yields consistent estimates. On the other hand, if u1i|v2i, zi ∼ N(0, exp(2ziδ)) such that heteroskedasticity is present, then the standard Probit MLE procedure would be inconsistent but a Het-Probit MLE procedure, included in many statistical packages, would be consistent. If u1i|v2i, zi ∼ N(ρv2i, 1), similar to the setting in equation (2.2), then the two step CMLE developed by Smith and Blundell (1986) and Rivers and Vuong (1988) would be consistent and other methods that ignore the endogeneity would be inconsistent. More generally, if the CF-CI assumption holds such that u1i|v2i, zi ∼ u1i|v2i with some unknown distribution, Blundell and Powell (2004) (for the remainder of the chapter referred to as BP) and Rothe (2009) provide semi-parametric methods that estimate the parameters consistently.

As a first step in relaxing the CF-CI assumption, the following assumption proposes an alternative framework which assumes a more flexible conditional distribution of the latent error.4

Assumption 2.3.1. Consider the set up in equation (2.1), where {y1i, zi, y2i}, i = 1, ..., n, is iid. Assume the linear reduced form in the first stage is the true conditional mean, E(y2i|zi) = m(zi)πo, and the unobserved latent error has the following conditional distribution:

u1i | zi, v2i, y2i = u1i | zi, v2i ∼ N( h(v2i, zi)γo , exp(2 g(y2i, zi)δo) )

where zi = (z1i, z2i) and m(zi), h(v2i, zi), and g(y2i, zi) are known vectors and h(v2i, zi) is differentiable in v2i.
The first part of the assumption breaks up the endogenous variable into its conditional mean and what I will refer to as the control variate v2i. Note that by construction, the control variate is mean independent of the instruments. This assumption does not take a stand on the true data generating process of the endogenous variable.5 In the more general setting of non-separable triangular equation models, Imbens and Newey (2009) consider the case of a non-separable first stage

y2i = d(zi, ηi)    (2.9)

where zi are the instruments, ηi is unobserved heterogeneity independent of the instruments, and d(·,·) is the unknown and true data generating process in the first stage. In this setting they suggest using the conditional CDF, e2i = Fy2|z(y2i, zi), as the control variable. They show that their proposed control variable satisfies CF-CI and therefore the control function method recovers the parameters of the model. Assumption 2.3.1 does not require full independence between the control variable and the instruments and therefore can use the population residual v2i = y2i − E(y2i|zi) as a control variable with the knowledge that it does not satisfy CF-CI. I will discuss the differences between these two approaches after explaining the second part of the assumption.

The second part of Assumption 2.3.1 specifies the conditional distribution that allows for the violation of CF-CI.

4The normality assumption could be easily generalized to just a known distribution with CDF G(·). This allows for a Logit specification, which is also explored in one of the simulations.

5For now this does presume a linear reduced form, but when discussing the asymptotic properties of the estimator, if m(zi) is a sequence of basis functions so that the first stage acts as a non-parametric sieve regression, then this will not affect the asymptotic variance estimates.
Both the conditional mean and the conditional variance are functions of the instruments, so the exclusion restriction implied by CF-CI is violated. Under this assumption, the conditional mean of y1i is:

E(y1i|zi, y2i, v2i) = E(y1i|zi, v2i) = Φ( [xiβo + h(v2i, zi)γo] / exp(g(y2i, zi)δo) )    (2.10)

Note that there is a one-to-one mapping between y2i and v2i given the instruments zi. This implies the mean is preserved regardless of which term is included in the conditioning argument. This result should be unsurprising, as the conditional mean appears to be a heteroskedastic Probit model that adjusts for endogeneity using the control function approach, both of which have been discussed extensively in the literature. But in this case, the control function (h(v2i, zi)γo) is a function of both the control variate v2i and the instruments zi. This was first introduced in Wooldridge (2005), where he suggests using the following control function,

h(v2i, zi)γo = γ1o v2i + v2i zi γ2o    (2.11)

in a linear regression with random coefficients.6 Gandhi, Kim, and Petrin (2013) adopted a similar generalization for demand estimation and Kim and Petrin (2017) provide a general control function framework for the case of non-linear but additively separable triangular equation models. As in Kim and Petrin (2017), this generalization will be referred to as the "general control function," as opposed to a more traditional control function that upholds the exclusion restriction (not a function of the instruments) as in Rivers and Vuong (1988) and Petrin and Train (2010).

The proposed framework suggests a simple two step estimator. In the first step, the conditional mean of y2i is estimated to construct the control variate from the residuals (ˆv2i). In the second step, the residuals are plugged into the conditional mean in equation (2.10) and parameter estimates are obtained via maximum likelihood estimation. This will be discussed in more detail in Section 5.
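The two step estimator can be sketched as follows. This is a simplified illustration under my own choices of h(·) and g(·) (a single excluded instrument, h = (v2, v2·z2), g = (z2)), not the chapter's full procedure; asymptotic corrections for the first-stage estimation error are omitted.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def two_step_gcf_probit(y1, y2, z):
    """Two-step general control function estimator per equation (2.10).
    z has a constant in column 0; h and g specifications are illustrative."""
    # Step 1: first-stage OLS of y2 on z; residuals are the control variate
    pi_hat, *_ = np.linalg.lstsq(z, y2, rcond=None)
    v2_hat = y2 - z @ pi_hat

    z2 = z[:, 1]                                  # a single excluded instrument
    x = np.column_stack([np.ones_like(y2), y2])
    h = np.column_stack([v2_hat, v2_hat * z2])    # general control function terms
    g = z2[:, None]                               # heteroskedasticity terms

    kx, kh = x.shape[1], h.shape[1]
    def negll(theta):
        b, gam, dlt = theta[:kx], theta[kx:kx + kh], theta[kx + kh:]
        index = (x @ b + h @ gam) / np.exp(g @ dlt)   # equation (2.10)
        p = norm.cdf(index).clip(1e-10, 1 - 1e-10)
        return -(y1 * np.log(p) + (1 - y1) * np.log(1 - p)).sum()

    res = minimize(negll, np.zeros(kx + kh + g.shape[1]), method="BFGS")
    return res.x[:kx]   # structural index parameters

# simulated check under a DGP consistent with Assumption 2.3.1
rng = np.random.default_rng(1)
n = 5000
z = np.column_stack([np.ones(n), rng.normal(size=n)])
v2 = rng.normal(size=n)
y2 = z[:, 1] + v2
u1 = 0.5 * v2 + 0.3 * v2 * z[:, 1] + rng.normal(size=n) * np.exp(0.2 * z[:, 1])
y1 = (0.5 + y2 + u1 > 0).astype(float)
print(two_step_gcf_probit(y1, y2, z))   # should be near the true values (0.5, 1)
```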
How does the proposed approach differ from the setting considered in Imbens and Newey (2009)? In Imbens and Newey (2009), they attempt to flexibly model the true data generating process of the first stage as a possibly non-separable function of instruments and unobserved heterogeneity.7 They then construct a control variable, e2i = Fy2|z(y2i, zi), that they show satisfies the CF-CI assumption. But they require the instruments to be completely independent of any unobserved heterogeneity and allow for only a single source of unobserved heterogeneity in the first stage. In this chapter, I use a control variate v2i that is always obtainable and must satisfy conditional mean independence, by construction. Then, to make up for the relaxation of CF-CI, I flexibly model the relationship between the structural heterogeneity u1i, the control variable v2i, and the instruments zi using a general control function.

A major critique of the approach of Imbens and Newey (2009) is the caveat to their framework brought up in Kasy (2011), noting their method only allows for one source of heterogeneity (independent of the instruments) in the first stage. This would prohibit the simple example of random coefficients in the first stage

y2i = η1i + η2i zi    (2.12)

The approach in this chapter allows for this possibility since equation (2.12) can be rewritten in terms of a linear conditional mean with heteroskedasticity in the first stage error.

One may object to the linear in parameters and known distribution specifications in Assumption 2.3.1.

6He also discusses in that chapter the implementation of the control function approach in a binary response setting such as Probit. But in that example, he does not propose interaction with the instruments and instead only suggests including higher order moments of the reduced form error. So his analysis stops short of what is proposed in this chapter.

7The non-separable first stage needs to be monotonic in the unobserved heterogeneity.
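The rewriting of (2.12) is immediate: y2i = E(η1i) + E(η2i)zi + v2i with v2i = (η1i − Eη1i) + (η2i − Eη2i)zi, so Var(v2i|zi) is quadratic in zi. The short simulation below (the distributional choices are my own) verifies that the residual from the linear conditional mean is mean independent of the instrument but heteroskedastic in it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200000
z = rng.normal(size=n)
eta1 = 1.0 + rng.normal(size=n)           # random intercept, mean 1
eta2 = 0.5 + 0.5 * rng.normal(size=n)     # random slope, mean 0.5
y2 = eta1 + eta2 * z                      # first stage (2.12)

# linear conditional mean: E(y2|z) = 1 + 0.5 z, so v2 = y2 - (1 + 0.5 z)
v2 = y2 - (1.0 + 0.5 * z)

low, high = np.abs(z) < 0.5, np.abs(z) > 1.5
print(v2[low].mean(), v2[high].mean())    # both near 0: mean independence
print(v2[low].var(), v2[high].var())      # variance grows with |z|
```

Here Var(v2|z) = 1 + 0.25 z², exactly the heteroskedastic first stage that Assumption 2.3.1 accommodates.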
If these specifications are not true, this leads to the misspecification of equation (2.10) as the true conditional mean. The general control function and heteroskedastic function, h(v2i, zi)γo and g(y2i, zi)δo, respectively, are assumed to be linear in parameters, which facilitates the identification discussion because it allows for lower level conditions. Alternatively, one can consider any parametric specification, but then lower level conditions for identification would need to be derived to fit the specification. In the extension provided in Section 6, I allow both the general control function and the heteroskedastic function to be non-parametrically specified.

The distributional assumption is particularly pertinent in contrast to the estimators from BP and Rothe, which impose no distributional assumptions. Preserving the distributional assumption keeps the difficult discussions on identification and on interpreting the ASF and APE clear.8 But to appease any concerns, the extension provided in Section 7 proposes a semi-parametric estimator that is free of any distributional assumptions.

Up to this point I have only explained theoretically the consequences of CF-CI in terms of exclusion restrictions on the conditional distribution. But as a researcher with empirical data, how is one to determine whether CF-CI may fail to hold? To further motivate the generalization provided in Assumption 2.3.1, the following are two examples taken from applications in the literature where the empirical settings may suggest a violation of CF-CI.

Example 1: Demand for Premium Cable from Petrin and Train (2010)

This example is a simplified version of the application given in Petrin and Train (2010) (hereafter PT), who propose a control function approach for estimating structural models of demand. In their application, they use a multinomial logit in modelling consumers' choice of television reception.
To fit a binary response setting, consider the choice of selecting premium cable conditional on already selecting cable as the television reception. Let Uim be the marginal utility of individual i choosing premium cable in market m over the utility from not selecting premium cable (so the utility from the outside option is normalized to 0). Violation of CF-CI can be easily invoked by allowing for a utility that is not additively separable between the unobserved utility (u1im) and the observed utility, as in Gandhi, Kim, and Petrin (2013). Suppose the observed utility is

Uim = β1 pm + Σ_{g=2}^{5} β2g pm dgi + z11m β3 + z12i β4 + (1 + pm γ1 + z11m γ2 + z12i γ3) u1im    (2.13)

where the variables in z11m include the market and product characteristics and the variables in z12i include individual characteristics. The variables dgi are dummies of an index of 5 different income levels; this allows the price elasticity to be heterogeneous in income. The unobserved utility consists of two components, u1im = ξm + εim, where εim is iid logistic while ξm represents unobserved (to the researcher but not to the consumer or producer) attributes of the product. Consequently, ξm captures the component of the unobserved utility that is not independent from price. Note that this specification, like Gandhi, Kim, and Petrin (2013), allows for potential interactions between the observable covariates (price, market and product characteristics, and individual characteristics) and the unobserved attributes of the product.

This specification, previously discussed in Gandhi, Kim, and Petrin (2013), can be motivated using the example of unobserved advertisement.

8An added benefit is that the proposed estimator is much easier to implement compared to the semi-parametric approaches. For instance, estimates can be obtained using canned commands in Stata. Hopefully this will persuade empirical economists that implementing generalizations to previous estimators need not be computationally burdensome.
For instance, one would expect not only unobserved advertisement to affect utility (through ξm) but would also expect an interactive effect with product characteristics. For example, suppose premium cable is marketed with advertisement that emphasizes the number of channels provided. Then advertisement should contribute to the utility of consumption interactively with the number of premium channels actually provided.

Even if a researcher were to impose an additively separable form on the utility, it is still unlikely that a simple control function from a reduced form pricing equation can capture the true endogenous structure. Suppose the utility from purchasing premium is

Uim = β1 pm + Σ_{g=2}^{5} β2g pm dgi + z11m β3 + z12i β4 + u1im    (2.14)

where the unobserved utility is composed of two components, u1im = ξm + εim; εim is iid logistic while ξm represents unobserved attributes of the product. The probability consumer i chooses premium cable in market m is

Pim = P(Uim > 0 | pm, dgi, z11m, z12i, ξm)
    = exp(β1 pm + Σ_{g=2}^{5} β2g pm dgi + z11m β3 + z12i β4 + ξm) / [1 + exp(β1 pm + Σ_{g=2}^{5} β2g pm dgi + z11m β3 + z12i β4 + ξm)]    (2.15)

and let the expected demand from the perspective of the monopolist be E(Pim|pm, z11m, ξm). A monopolist will maximize expected profit with respect to price,

pm = arg max_p (p − MC(z2m, ωm)) E(Pim|p, z11m, ξm)    (2.16)

From the first order conditions, the optimal price satisfies

pm = pm / |e(z11m, ξm)| + MC(z2m, ωm)    (2.17)

where e(z11m, ξm) is the price elasticity of demand. It is evident that prices are not separable in ξm and the exogenous characteristics z11m, z2m. If one were to still use the control variable v2m = pm − E(pm|z11m, z2m), then the CF-CI assumption implies E(ξm|z11m, z2m, v2m) = E(ξm|v2m), which would generally not hold. Therefore the estimators based on the CF-CI assumption would not be valid in this setting.
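The step from (2.16) to (2.17) is the standard monopoly markup condition. Writing Q(p) = E(Pim|p, z11m, ξm) for expected demand, the first order condition rearranges as follows; this derivation is a reconstruction consistent with equations (2.16) and (2.17), not taken verbatim from the chapter.

```latex
% FOC of the monopolist's problem \max_p \; (p - MC)\,Q(p):
\begin{align*}
0 &= Q(p_m) + (p_m - MC)\,Q'(p_m)
    && \text{(differentiate (2.16))} \\
p_m - MC &= -\frac{Q(p_m)}{Q'(p_m)}
          = -\frac{p_m}{e(z_{11m},\xi_m)}
    && \text{(since } e = Q'(p_m)\,p_m/Q(p_m)\text{)} \\
p_m &= \frac{p_m}{|e(z_{11m},\xi_m)|} + MC(z_{2m},\omega_m)
    && \text{(demand slopes down, so } e < 0\text{)}
\end{align*}
```

which is equation (2.17). Because the elasticity of the logit demand in (2.15) depends on ξm and z11m jointly, the optimal price inherits their non-separable interaction.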
Kim and Petrin (2017) provide a similar example to motivate their general control function in a non-linear but additively separable setting.

Example 2: Home-ownership and Income from Rothe (2009)

This example is the application in Rothe (2009), where he considers the effect of income on home-ownership in Germany for low-educated middle age married men. The controls, z1i, include age and an indicator for the presence of children under the age of 16. The instruments, z2i, are the wife's education level and an indicator for the wife's employment status, which should only affect home-ownership through family income. Rothe relaxes CF-CI slightly by proposing the alternative CF-LI assumption. Recall that the CF-LI assumption requires the conditional distribution of the latent error to only be a function of the control variable v2i and the linear index xiβo = z1iβ1o + y2iβ2o. In this example, endogeneity can be explained by omitted variables, such as accessibility to loans via credit, that are correlated with income. Moreover, as the accessibility to loans via credit lowers (i.e., a low credit score), the effect of income and whether or not you have children becomes less important in the decision to purchase a home. So if credit score were observable, one would expect interactive effects between the linear index and credit score. Since v2i acts as a proxy for the omitted variable, the conditional mean of the latent error should include the interactive effect,

E(u1i|v2i, xiβo) = v2i γ1 + v2i (xiβo) γ2    (2.18)

Under this specification, the CF-LI assumption in Rothe is satisfied while CF-CI is violated. However, this places fairly strong restrictions on the coefficients of the interactive effects between the omitted variable (credit score) and the included regressors (age, presence of children, and income): they must be proportional to the linear index coefficients, βo.
Alternatively, the proposed estimation procedure would recognize the interactive relationship of these effects and could also allow the interactions to have effects not necessarily proportional to the index coefficients.

These two examples provide some economic motivation for relaxing the CF-CI assumption. However, the CF-CI assumption was used in the literature to gain identification. In equation (2.10), xi is a function of z1i and y2i, which both comprise the control function h(v2i, zi). Without any restrictions on the control function, the two effects may not be separately identifiable. Wooldridge (2005) notes that the exclusion of z2i from the structural equation allows for identification of the general control function considered in equation (2.11). I will use the more general CMR from Kim and Petrin (2017) to show identification of the general control function, which also helps to illustrate which general control functions are or are not identified.

2.4 General Control Function

The previous section set up the framework for using a general control function approach in a binary response model. In contrast to other control function methods, the general control function allows for the relaxation of the CF-CI assumption. This section is composed of two parts. The first part explains how identification can still be obtained under the CMR and how the CMR relates to the other control function assumptions in the literature. The second part is a short simulation to illustrate how the general control function aids in estimation when CF-CI does not hold but the true data structure satisfies the CMR. In this simulation I emulate the application in Petrin and Train (2010) concerning the demand for cable as the empirical context.
2.4.1 Identification

Recently, there has been growing interest in the question of identification for the control function approach in non-parametric non-separable triangular simultaneous equation models.9 However, the discussion usually starts with independence assumptions between the instruments and the unobservables. Then one searches for a control function that will satisfy strong identification assumptions such as CF-CI in BP or CF-CI and monotonicity in Imbens and Newey (2009). In the setting considered here, I allow for a more flexible relationship between the instruments and the unobserved heterogeneity and then allow for a general control function, h(v2i, zi)γo, that can address endogeneity in a flexible manner. Of course, since I am only concerned with a binary response model, there are gains to knowing the structure of the non-separability in the outcome equation.

The main concern for identification is separately identifying the mean effect xiβo and the control function h(v2i, zi)γo. Because both of these terms are perfectly determined by zi and v2i, without any additional assumptions on the construction of h(v2i, zi), perfect multicollinearity is possible such that the parameters βo and γo are not identified.10 When linearity of the control function is imposed, as in Assumption 2.3.1, identification requires that E((xi, h(v2i, zi))′(xi, h(v2i, zi))) is non-singular.11 However, that does not place clear restrictions on the composition of the control function. The following assumption provides lower level conditions in which identification is shown.

Assumption 2.4.1. Let πo ∈ Π and βo, γo, δo ∈ Θ where Π and Θ denote the respective parameter spaces.

(i) E(m(zi)′m(zi)) is non-singular.

(ii) E(xi′xi) is non-singular and the variance-covariance matrix of E(xi|zi) has full rank.

(iii) E(h(v2i, zi)′h(v2i, zi)) is non-singular.

(iv) (CMR) E(h(v2i, zi)|zi) = 0.

(v) g(y2i, zi) consists of polynomial functions of the elements in (xi, h(v2i, zi)), does not include a constant, and E(g(y2i, zi)′g(y2i, zi)) is non-singular.

(vi) (βo′, γo′) is a non-zero vector.

The first condition ensures identification of the first stage parameters. The next three conditions are used to show E((xi, h(v2i, zi))′(xi, h(v2i, zi))) is non-singular. The CMR is the more realistic identification assumption used in Kim and Petrin (2017). The last two conditions help in showing identification in the highly non-linear heteroskedastic Probit model. The following theorem states the identification result.

Theorem 2.4.1. In the set-up described by equation (2.1) and Assumption 2.3.1, if Assumption 2.4.1 holds then the parameters πo and (βo, γo, δo) are identified.

The proof of Theorem 2.4.1 is provided in the Appendix. The CMR approach to obtaining identification using a control function is adopted from Kim and Petrin (2017), where they show non-parametric identification following a control function approach in a triangular system with an additively separable error.12 The CMR can be interpreted as a way to distinguish between the endogeneity of y2i and the "non-endogeneity" of zi.

9Imbens and Newey (2009), Kasy (2011), Hahn and Ridder (2011), Blundell and Matzkin (2014), Chen, Chernozhukov, Lee, and Newey (2014), Torgovitsky (2015), and D'Haultfœuille and Février (2015).

10For example, if xi = (1, z1i, y2i) then a general control function of the form h(v2i, zi) = (z1i, z2i, v2i) creates perfect multicollinearity. Even when z1i is excluded from the general control function (so xi and h(v2i, zi) do not include the same terms), there is multicollinearity when y2i = π1 + z2iπ2 + v2i.

11Alternatively, if one were to assume that the control function and the heteroskedastic function were non-linear functions, then one can verify the rank conditions from Rothenberg (1971) for identification.
By the law of iterated expectations,

E(u1i|zi) = E( E(u1i|zi, v2i) | zi ) = E( h(v2i, zi)γo | zi ) = 0

The middle equality holds by the specification provided in Assumption 2.3.1 and the last equality holds by the CMR. As a result, the CMR only implies that zi is mean independent of u1i and does not require any stronger forms of independence. This is a fairly standard and weak exogeneity assumption on an instrument. In practice, if one is concerned that this restriction is violated, then the included instruments, z1i, should be treated as endogenous variables and the excluded instruments, z2i, should not be used as valid instruments.

To provide some intuition for the implications: because v2i is mean independent of zi, the CMR requires each element of h(v2i, zi) to include functions of v2i and to be conditionally demeaned. For instance, v2i² could not be an element of the control function, but v2i² − E(v2i²|zi) could be. In addition, no element can be a function of zi alone; the instruments can only enter as interactions with functions of v2i. Notice that in the examples provided in the previous section the general control functions satisfy the CMR. This prevents any issues of linear dependence between elements of xi and h(v2i, zi).

12Hahn and Ridder (2011) show that a "Conditional Mean Restriction" is insufficient for identifying the ASF in a general non-parametric non-separable model. However, I would like to be clear that the CMR they consider is E(y1i − Ψ(xi)|zi) = 0 where Ψ(xi) is the unknown ASF. This differs from the CMR considered here, which is on the latent error. Although the binary response model is non-separable, since the latent error is additively separable from the mean component xiβo within the indicator function, identification follows analogously from Kim and Petrin (2017).
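The conditional demeaning requirement can be implemented by projecting functions of the residual on the instruments. The sketch below (specifications are my own, and Ê(v2i²|zi) is approximated linearly in zi rather than fully non-parametrically) builds the term v2i² − Ê(v2i²|zi) and checks that, unlike v2i² alone, it is uncorrelated with the instrument.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100000
z = np.column_stack([np.ones(n), rng.normal(size=n)])
v2 = np.exp(0.3 * z[:, 1]) * rng.normal(size=n)   # heteroskedastic first-stage error

# invalid element: v2^2 alone (its conditional mean depends on z)
# valid element:   v2^2 - E_hat(v2^2 | z), conditionally demeaned
coef, *_ = np.linalg.lstsq(z, v2**2, rcond=None)  # linear approx of E(v2^2|z)
h_term = v2**2 - z @ coef

print(np.corrcoef(z[:, 1], v2**2)[0, 1])   # noticeably nonzero
print(np.corrcoef(z[:, 1], h_term)[0, 1])  # approximately zero
```

The demeaned term can then be interacted with functions of the instruments, exactly as the CMR permits.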
Wooldridge (2005) explains that identification holds given an exclusion restriction on the instruments z2i in the structural equation that creates variation in the control variate unexplained by xi.13 Consequently, the extra variation in the control variate needs to be used to identify the parameters in the general control function. This can be demonstrated explicitly under the assumptions stated above. Let (a′, b′)′ be a non-random vector such that

(xi, h(v2i, zi)) (a′, b′)′ = xi a + h(v2i, zi) b = 0

Taking the conditional expectation with respect to zi,

E(xi|zi) a + E(h(v2i, zi)|zi) b = E(xi|zi) a = 0

Because the variance-covariance matrix of E(xi|zi) is full rank, a is a zero vector and it follows that b is also a zero vector.

Now how does the CMR compare to CF-CI? In the heteroskedastic bivariate Probit example it is easy to see how CF-CI is violated while the CMR continues to hold. However, the CMR is not strictly weaker in the technical sense that CF-CI implies the CMR.14 However, given the earlier discussion, the CMR is more in line with our prior beliefs about what endogeneity is. Consider the following example in which CF-CI holds but the CMR does not. Let E(u1i|zi, v2i) = v2i + v2i² − σv² where σv² = Var(v2i), so that CF-CI holds. But suppose there is heteroskedasticity in the first stage such that E(v2i²|zi) ≠ σv², which implies

E(v2i + v2i² − σv² | zi) = E(v2i²|zi) − σv² ≠ 0

and the CMR does not hold. Interpretively, what does this mean?

13An alternative identification strategy is used in Escanciano, Jacho-Chávez, and Lewbel (2016) that does not require an exclusion restriction on the instruments z2i. But in their setting, identification is dependent on non-linearity in the reduced form and they still impose CF-CI as a control function assumption.

14I would like to thank David Kaplan for pointing this out.
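The counter-example can be checked numerically. Below (the binary instrument and the variance values are my own choices), the conditional mean of u1i given v2i is the same function of v2i for both values of the instrument, so the CF-CI mean restriction holds, yet E(u1i|zi) differs from zero because E(v2i²|zi) ≠ σv².

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400000
z = rng.choice([0.0, 1.0], size=n)          # binary instrument
sd = np.where(z == 1, 2.0, 1.0)             # first-stage heteroskedasticity
v2 = sd * rng.normal(size=n)

sigma_v2 = v2.var()                         # unconditional variance, about 2.5
u1_mean = v2 + v2**2 - sigma_v2             # E(u1|z, v2): a function of v2 only

# CMR fails: E(u1|z) = E(v2^2|z) - sigma_v2 is nonzero at each value of z
print(u1_mean[z == 0].mean())   # about 1 - 2.5 = -1.5
print(u1_mean[z == 1].mean())   # about 4 - 2.5 = +1.5
```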
It means that the specification of endogeneity is quadratic in the first stage residual while the quadratic term is in deviations from the unconditional variance, even when there is heteroskedasticity in the first stage. So for the CMR to fail, the endogeneity must depend on v2i² − σv² instead of v2i² − E(v2i²|zi), deviations from the conditional variance. Therefore I would argue that although the CMR is not strictly weaker than CF-CI, it better reflects how we perceive endogeneity and is much more plausible to hold in empirical settings.

By now, it seems that the relaxation of CF-CI, especially compared to the parametric approach of Rivers and Vuong (1988), is fairly straightforward. Putting aside heteroskedasticity for a moment, the difference is only between including the control variate ˆv2i as an additional covariate, as suggested in Rivers and Vuong (1988), or including terms such as (ˆv2i, ˆv2i² − Ê(v2i²|zi), zi ˆv2i), as in the general control function approach proposed in this chapter. One may wonder whether the relaxation of CF-CI to allow for a general control function is really necessary and whether it would have an impact empirically. The following simulation aims to show the importance of allowing for a general control function when it is called for. The results of the simulation suggest that there is a high cost to not specifying a general control function when CF-CI fails, but there is very little cost in allowing for a more flexible specification of the general control function when it is not truly present. The detrimental impact of presuming CF-CI when it does not hold is seen not only in the parameter estimates but in economic objects of interest such as the estimated choice probabilities and price elasticities.

2.4.2 Simulation: General Control Function in the Demand for Premium Cable

The data generating process will emulate the setting described in Example 1 above, which is a simplification of the application given in PT.
Recall that in this example I wish to estimate the demand for premium cable (conditional on already selecting cable as the television provider) but am concerned that price is endogenous and correlated with unobserved attributes. The latent utility function given in equation (2.13) is a function of product characteristics, such as the number of channels (z11m), and individual characteristics of the consumers (z12i), including income, a single-family-household indicator, a rent indicator, age, and age squared. Building on the example of advertisement and marketing (part of the unobserved product attributes), I interact ξm with product characteristics (number of channels) and individual characteristics (age). For simplicity, it is assumed that there is an exogenous cost shifter that acts as a valid instrument. As in PT, price is interacted with 5 income-level dummies to allow the price elasticity of premium cable to differ by income level. A discussion of the construction of the variables as well as a table of summary statistics is provided in Appendix C. As mentioned previously, it is important to note that the data generating process specifies a general control function that satisfies the CMR but does not satisfy CF-CI.

Table E.2 provides the parameter estimates for the different Logit specifications. As found in PT, without addressing any endogeneity (column (1)), there is actually a positive effect of price for the higher income groups (in this simulation only the highest income group) and a negative effect for the number of cable channels offered. Addressing endogeneity by including just the control variable (column (2)), as in PT, significantly strengthens the coefficient estimate on price and alters the coefficient estimate on number of channels to be positive.
This is because price is strongly correlated with the number of channels offered, and therefore addressing the endogeneity in price also affects the estimated coefficient on number of channels. But allowing for the general control function in columns (3) and (4), the parameter estimates are much closer to their true values. For instance, the number of channels becomes much less impactful once the unobserved attributes of the premium channels, such as advertisement, are controlled for. In addition, the income effects are slightly higher than they are in column (2).

The difference between columns (3) and (4) is that in column (3) the general control function is correctly specified by including interactions between the control variate and the relevant instruments (number of channels and age), while in column (4) the general control function includes additional terms that are not actually relevant, such as interactions between the control variate and household size and income. This explores the realistic situation in which researchers would not typically have prior knowledge of which terms to include in the general control function. The simulation results illustrate that there is very little loss of precision in the parameter estimates when one over-specifies the general control function (column (4)). This shows there is very little cost to allowing the flexibility in estimation even when the true form may be simpler.

But these parameter estimates provide little interpretative value. Usually of more interest are the choice probabilities, which in this binary context correspond to the ASF (the derivation of the ASF will be discussed in more detail in Section 4). Figure D.2 illustrates how the ASF varies over price for an additional 5 channels of premium cable, assuming the individual is 35 years old in a family of 3 with income equal to $85,000. Estimates from a linear probability model, OLS and 2SLS (in orange), are also included as a comparison.
OLS and Logit (dotted lines), which do not address endogeneity, result in an upward-sloping ASF, while the remaining estimators more realistically produce a downward-sloping ASF. The correctly-specified Logit (GCF) and over-specified Logit (Over) both follow the true ASF quite closely. Although Logit (CV) performs better compared to Logit or the linear specifications, there is still some cost to not allowing for a flexible control function.

The price elasticity of demand for premium cable is calculated as

Elasticity = E[ (∂E(y1i|z1i, p)/∂p) × p / E(y1i|z1i, p) ]   (2.19)

The linear probability models estimated by OLS or 2SLS produce a conditional mean, E(y1i|z1i, p), that is neither strictly positive nor bounded below 1, which results in imprecise and extreme elasticity estimates. Table E.3 presents the estimated price elasticities for the different estimation procedures. OLS and 2SLS unsurprisingly provide poor estimates, and Logit, which does not address endogeneity, greatly underestimates the price elasticity as inelastic. The Logit (CV) estimate is in a similar range to that produced in PT, but the specifications that allow for more flexibility, Logit (GCF) and Logit (Over), are much closer to the true value. Again, as seen in the parameter estimates, there is very little cost in terms of efficiency to including more terms in Logit (Over) when a simpler control function is the true specification.

Now that the general control function is shown to be consequential and the complications concerning identification have been addressed, consistency of the estimation procedure is ensured under standard regularity conditions. The next section discusses the estimation procedure in more detail and derives consistent estimates of the asymptotic variance. Since the parameters are usually of little interest in a latent variable model, the next section also discusses the formulation, identification, and estimation of the ASF and APE using the proposed estimation procedure.
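As a concrete illustration of equation (2.19), the sketch below computes the sample analogue of the elasticity for a logit choice probability with a hypothetical linear index; the coefficients and data are invented for illustration and are not the simulation design of this section.

```python
# Sample analogue of Elasticity = E[(dE(y|z,p)/dp) * p / E(y|z,p)] for a logit
# response; all names and parameter values here are illustrative assumptions.
import numpy as np

def logit_price_elasticity(X, p, beta, beta_p):
    """Average elasticity for P = Lambda(X @ beta + p * beta_p)."""
    index = X @ beta + p * beta_p
    P = 1.0 / (1.0 + np.exp(-index))
    dPdp = P * (1.0 - P) * beta_p        # logit derivative with respect to price
    return np.mean(dPdp * p / P)

rng = np.random.default_rng(2)
n = 10_000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
p = np.exp(0.2 * rng.standard_normal(n) + 1.0)   # strictly positive prices
elas = logit_price_elasticity(X, p, np.array([0.5, 0.3]), beta_p=-0.8)
print(round(elas, 2))                    # negative: demand slopes downward
```

Because the logit probability is strictly inside (0, 1), the ratio p/P is always well defined, unlike in the linear probability model criticized above.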
2.5 Estimation and Interpretation

The estimation procedure proposed in this chapter for the parametric model is a standard two-step estimator. In the first step, the conditional mean function E(y2i|zi) = m(zi)πo is estimated using standard LS regression techniques.15 The control variable is constructed from the reduced form residuals, v̂2i = y2i − m(zi)π̂, and used in the second step. In the second step, one maximizes the following likelihood

L(y1i, xi, zi; π̂, β, γ, δ) = Σ_{i=1}^{n} { y1i log Φ[ (xiβ + h(v̂2i, zi)γ) / exp(g(y2i, zi)δ) ] + (1 − y1i) log( 1 − Φ[ (xiβ + h(v̂2i, zi)γ) / exp(g(y2i, zi)δ) ] ) }   (2.20)

with respect to β, γ, and δ to obtain estimates of the parameters. In addition to relaxing assumptions in the literature, the proposed estimation procedure is quite simple to implement using commands from standard statistical packages.16 However, the estimated standard errors need to be adjusted to account for the variation from using the residual from the first stage as an approximation for the control variate. Asymptotic variance formulas that account for the multi-step approach are given in the next section, although a common alternative would be to bootstrap the standard errors.

15 Alternatively, one may consider a non-parametric first-stage regression to obtain estimates of the conditional mean function. Using sieves, asymptotic results would follow directly from Newey (1994), which differs from the asymptotic theory presented in this chapter. However, Ackerberg, Chen, and Hahn (2012) explain that the asymptotic variance estimator under the framework of the semi-parametric plug-in two-step estimator is numerically equivalent to the asymptotic variance estimator in the parametric framework as long as the parametric specification is flexible enough.

16 For example, the parameter estimates can be obtained using the reg and hetprobit commands in Stata.
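The two steps above, and the pairs bootstrap mentioned as an alternative to the analytic standard errors, can be sketched in a few lines. The sketch below is a simplified version of my own: the heteroskedastic scale is omitted (δ = 0), h(v̂2i, zi) = v̂2i, and the toy data generating process is invented for illustration, not the chapter's full specification.

```python
# Sketch of the two-step estimator: OLS first stage, probit second stage with
# the estimated control variate included.  Both steps are re-run inside each
# bootstrap resample so the standard errors reflect the generated-regressor
# variation.  Simplified illustration only.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def two_step(y1, y2, Z):
    pi_hat, *_ = np.linalg.lstsq(Z, y2, rcond=None)   # step 1: first-stage OLS
    v_hat = y2 - Z @ pi_hat                           # control variate
    X = np.column_stack([np.ones(len(y1)), y2, v_hat])
    def nll(beta):                                    # step 2: probit MLE
        p = np.clip(norm.cdf(X @ beta), 1e-10, 1 - 1e-10)
        return -np.sum(y1 * np.log(p) + (1 - y1) * np.log(1 - p))
    return minimize(nll, np.zeros(3), method="BFGS").x

rng = np.random.default_rng(3)
n = 4_000
z = rng.standard_normal(n)
v2 = rng.standard_normal(n)
y2 = 1.0 + 0.8 * z + v2                               # first-stage equation
y1 = (0.2 + 1.0 * y2 + 0.5 * v2 + rng.standard_normal(n) > 0).astype(float)
Z = np.column_stack([np.ones(n), z])

theta_hat = two_step(y1, y2, Z)                       # (const, y2, v-hat)
draws = [two_step(y1[idx], y2[idx], Z[idx])
         for idx in (rng.integers(0, n, n) for _ in range(50))]
se = np.std(draws, axis=0, ddof=1)                    # pairs-bootstrap SEs
print(np.round(theta_hat, 2), np.round(se, 3))
```

Resampling (y1i, y2i, zi) jointly and repeating both steps is what makes the bootstrap account for the first-stage estimation error that naive second-stage standard errors ignore.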
As for consistency and asymptotic normality, this is a simple application of MLE to a heteroskedastic Probit model with a generated regressor, for which asymptotics are well-established. The next subsection provides the asymptotic properties and the asymptotic variance derivation under a two-step M-estimation framework.

2.5.1 Asymptotic Properties

The two-step estimator can be written in a GMM framework by stacking the moment conditions,

E( M(y1i, y2i, zi; πo, βo, γo, δo) ) = E[ ( (y2i − m(zi)πo)m(zi)′ ; Si(πo, βo, γo, δo) ) ] = 0   (2.21)

where Si(π, β, γ, δ) = ∂L(y1i, xi, zi; π, β, γ, δ)/∂θ denotes the score,

Si(π, θ) = [ (y1i − Φi(π, θ)) φi(π, θ) / ( Φi(π, θ)(1 − Φi(π, θ)) exp(g(y2i − m(zi)π, zi)δ) ) ] × ( xi′ ; h(y2i − m(zi)π, zi)′ ; −(xiβ + h(y2i − m(zi)π, zi)γ) g(y2i − m(zi)π, zi)′ )

and Φi(·) and φi(·) are shorthand for the standard normal CDF and PDF evaluated at the index (xiβ + h(v2i, zi)γ)/exp(g(y2i, zi)δ). Note that estimation using the stacked moment conditions is equivalent to the two-step approach previously described. Although the GMM framework is useful for deriving the asymptotic variance of the estimator, it is suggested to use the two-step approach in implementation to avoid issues of slow convergence.

Let θ′ = (β′, γ′, δ′) and let Π and Θ denote the parameter spaces of π and θ respectively. Consistency follows from Theorem 2.6 of Newey and McFadden (1994).

Theorem 2.5.1. In the set-up described by equation (2.1) where Assumptions 2.3.1 and 2.4.1 hold, if πo ∈ Π and θo ∈ Θ, both of which are compact, then the GMM estimators that solve

(π̂, θ̂) = argmin_{(π,θ)∈Π×Θ} [ n⁻¹ Σ_{i=1}^{n} M(y1i, y2i, zi; π, θ) ]′ [ n⁻¹ Σ_{i=1}^{n} M(y1i, y2i, zi; π, θ) ] + op(1)   (2.22)

where M(y1i, y2i, zi; π, θ) are the stacked moment conditions in equation (2.21), are consistent: π̂ − πo = op(1) and θ̂ − θo = op(1).

The proof is provided in the appendix.
Asymptotic normality follows from Theorem 6.1 in Newey and McFadden (1994).

Theorem 2.5.2. In the set-up described by equation (2.1) where Assumptions 2.3.1 and 2.4.1 hold, if πo ∈ int(Π) and θo ∈ int(Θ), both of which are compact, then for (π̂, θ̂) that solves equation (2.22),

√n(θ̂ − θo) →d N(0, V) where V = G2θ⁻¹ E( Ξi(πo, θo) Ξi(πo, θo)′ ) G2θ⁻¹′   (2.23)

where Ξi(πo, θo) = Si(πo, θo) + G2π G1π⁻¹ (y2i − m(zi)πo)m(zi)′ and

( G1π  G1θ ; G2π  G2θ ) = E( ∇(π,θ) M(y1i, y2i, zi; πo, θo) )

is defined in detail in the appendix.

The proof is provided in the appendix. Note that the asymptotic variance takes into account the variation introduced from the first stage. A consistent estimator for the asymptotic variance is the method of moments estimator that replaces all the unknown parameters with their consistent estimates and uses sample averages in place of expectations.

Although this section provides consistency and √n-asymptotic normality for the second-stage parameter estimates, the parameters themselves bear very little interpretative value. The next two subsections discuss the derivation of the ASF and the APE and their importance for economic interpretation. These structural objects are magnitudes of effects that empirical researchers can use to discuss the effectiveness of a particular policy or the average probability of a successful outcome for an individual with a particular set of characteristics.

2.5.2 Average Structural Function

Researchers are often interested in using the data and model estimates to infer the average predicted probability of success at a particular point of the observed data. When there is no endogeneity, this quantity is easily described by the conditional mean, which in the case of binary response is equivalent to the propensity score.
As explained in BP, when endogeneity is present, the conditional mean is unable to capture the structural relationship between the endogenous variable and the outcome. In particular, most studies wish to uncover the effect of a structural intervention on the endogenous variable on the outcome, while the conditional mean can only capture a reduced form effect of changes in the instruments. For clarification, consider a simple linear structural equation.

yi = xiβo + ui   (2.24)

Without endogeneity, E(ui|xi) = 0 and the interpretation of the average outcome for a given observation xo is simply the conditional mean: xoβo. The corresponding partial effect would be the slope parameter βo. But when endogeneity is introduced, E(ui|xi) ≠ 0, the conditional mean is composed of two parts:

E(yi|xi = xo) = xoβo + E(ui|xi = xo)   (2.25)

The first component is the structural direct effect of xi while the second component is the endogenous indirect effect of xi due to the presence of endogeneity. For instance, consider the ubiquitous example of returns to education, where education is endogenous due to unobserved ability. Then the structural direct effect is the average wage for a particular education level (independent of ability) and the endogenous indirect effect is the contribution to wages of average ability at that education level. But BP argue that one should only be interested in the structural direct effect: if one were to consider a policy intervention on the level of education (i.e., mandatory schooling), there would be no changes in the distribution of ability, and therefore one would only want to capture the structural direct effect.

To derive the ASF, BP instruct that one should integrate over the unconditional distribution of the unobserved heterogeneity in the structural equation.
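The decomposition in equation (2.25) can be seen in a small simulation of an invented linear DGP in which xi and ui share a common factor, so that E(ui|xi) = 0.25 xi and the local conditional mean of yi exceeds the structural part xoβo.

```python
# Numeric sketch of E(y | x = x^o) = x^o * beta + E(u | x = x^o) from (2.25);
# the DGP (common factor c driving both x and u) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(8)
n = 1_000_000
c = rng.standard_normal(n)               # common factor creating endogeneity
x = c + rng.standard_normal(n)
u = 0.5 * c + rng.standard_normal(n)     # implies E(u | x) = 0.25 * x
beta = 2.0
y = beta * x + u

x_o = 1.0
near = np.abs(x - x_o) < 0.05            # local average around x = x^o
print(round(y[near].mean(), 2), beta * x_o)   # conditional mean vs. x^o * beta
```

The conditional mean comes out close to 2.25 = xoβo + 0.25 xo rather than 2.0: the structural direct effect and the endogenous indirect effect are bundled together, which is exactly what the ASF is designed to strip out.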
If the structural equation (2.24) includes an intercept, then E(ui) = 0 and the ASF is xoβo, not equal to the conditional mean but still the same as in the case of no endogeneity.

Next, extend the analysis to the binary response model.

yi = 1{xiβo + ui > 0}   (2.26)

When there is independence between the latent error ui and the regressors xi, the conditional mean – equivalent to the propensity score – is

E(yi|xi = xo) = F−u(xoβo)   (2.27)

which calculates the probability of success for an individual with characteristics xo. Now consider the case when there is no longer independence between the latent error and the regressors, so the unconditional CDF is not equal to the conditional CDF; i.e., F−u(−u) ≠ F−u|x(−u; x), where F−u|x(·;·) is the conditional CDF in which the first argument is the point of evaluation and the second argument is the conditioning argument. One can understand the violation of independence either through the standard interpretation of endogeneity, E(ui|xi) ≠ 0, or possibly through endogeneity at higher moments, such as heteroskedasticity, Var(ui|xi) ≠ Var(ui). Then the propensity score is

E(yi|xi = xo) = F−u|x(xoβo; xo)   (2.28)

in which the first argument in F−u|x(xoβo; xo) is the point of evaluation, which corresponds to the structural direct effect, and the second argument is the conditioning argument, which corresponds to the endogenous indirect effect. As in the linear case, the conditional mean does not capture a structural interpretation. Therefore, to obtain the ASF, one can integrate over the unconditional distribution of the unobserved heterogeneity to obtain F−u(xoβo). Now the ASF only captures the structural direct effect of xi and is not clouded by the influence of endogeneity. However, in calculating the ASF, the unconditional distribution of ui is usually unknown, or at least not specified when estimating the structural parameters βo.
Wooldridge (2005) studies the ASF in more depth and provides a more rigorous investigation of its derivation. Using the same notation as above, the structural model of interest is E(yi|xi, ui) = µ1(xi, ui), where xi are observed covariates and ui is unobserved heterogeneity. Then the ASF is defined as ASF(xo) = Eu(µ1(xo, ui)), where the subscript u emphasizes that the expectation is taken with respect to the unconditional distribution of ui. Using Lemma 2.1 from Wooldridge (2005), which is essentially an application of the law of iterated expectations, the ASF can also be calculated from

ASF(xo) = Ew(µ2(xo, wi)),  µ2(xo, wi) = ∫_U µ1(xo, u) fu|w(u; wi) η(du)

where U is the support of ui and fu|w(·;·) is the conditional density of the unobserved heterogeneity ui given wi with respect to a σ-finite measure η(·). Essentially, one can use a conditioning argument wi to help identify the ASF. In many instances the conditioning argument wi will include components of the covariates xi, but it is important to note that the evaluation of the ASF requires the ability to distinguish between the point of evaluation xo and the conditioning argument wi.17 This will be important when I discuss the implications of the CF-LI assumption on the derivation and estimation of the ASF. To apply Lemma 2.1 from Wooldridge (2005), the following conditions must hold:

(i) (Ignorability) E(yi|xi, ui, wi) = E(yi|xi, ui)

(ii) (Conditional Independence) D(ui|xi, wi) = D(ui|wi)

Notice that conditional independence in this context is with respect to the conditioning argument wi, which has yet to be specified in our context. When the conditioning argument is simply the control variate, v2i, then it is in fact the same conditional independence assumption of BP. Therefore the CF-CI assumption of BP is also used to obtain identification of the ASF. But, as seen when showing identification of the parameters, is CF-CI really necessary?
Consider the conditioning argument as both the control variate v2i and the instruments zi. This easily satisfies the ignorability assumption

E(y1i|xi, u1i, v2i, zi) = E(y1i|xi, u1i)

given ignorability of the excluded instruments z2i, and automatically satisfies this version of conditional independence,

D(u1i|xi, v2i, zi) = D(u1i|v2i, zi)

since xi is composed of v2i and zi. Then under Assumption 2.3.1,

µ2(xo, (v2i, zi)) = ∫_ℝ 1{xoβo + u > 0} fu|v,z(u; v2i, zi) du = Eu(1{xoβo + u > 0}|v2i, zi) = Φ( (xoβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) )

and the ASF is

ASF(xo) = Ev2,z(µ2(xo, (v2i, zi))) = Ev2,z[ Φ( (xoβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) ) ]   (2.29)

where the expectation is taken with respect to the unconditional distribution of v2i and zi. A consistent method of moments estimator replaces the unknown parameter values with their consistent estimates, (π̂, β̂, γ̂, δ̂), and takes sample averages in place of the expectation. Therefore identification of the ASF is still possible without the CF-CI assumption of BP. Next, examine the derivation of the ASF in the BP framework where CF-CI is assumed.

17 Wooldridge (2005) considers the example of the heteroskedastic Probit model where, in equation (2.26), it is assumed ui is normally distributed with Var(ui|xi) = exp(2xiδ). Then the covariates xi are used as the conditioning argument (i.e., wi = xi) such that

ASF(xo) = Exi( ∫_ℝ 1{xoβo + u > 0} fu|x(u; xi) du ) = Exi( Φ( xoβo / exp(xiδ) ) )

where the expectation is taken with respect to the xi in the heteroskedastic function (part of the conditioning argument) and not with respect to the structural direct effect of xo. Therefore, even when the conditioning argument is the same as the covariates in the structural equation, it is necessary to be able to distinguish between the two when composing the ASF.
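The method of moments estimator of the ASF in equation (2.29) is a straightforward sample average. In the sketch below, h and g are reduced to single illustrative terms (h = γv̂2i and g = δzi), an assumption of the sketch rather than the chapter's general specification, and the simulated (v̂2i, zi) stand in for first-stage output.

```python
# Sample analogue of ASF(x^o) in (2.29): average the probit response over the
# empirical distribution of (v-hat, z), holding the evaluation point fixed.
import numpy as np
from scipy.stats import norm

def asf(x_o, beta, gamma, delta, v_hat, z):
    """n^-1 * sum_i Phi((x^o beta + gamma * v_hat_i) / exp(delta * z_i))."""
    index = x_o @ beta + gamma * v_hat
    return np.mean(norm.cdf(index / np.exp(delta * z)))

rng = np.random.default_rng(5)
n = 100_000
v_hat = rng.standard_normal(n)           # stand-in for first-stage residuals
z = rng.standard_normal(n)
val = asf(np.array([1.0, 0.5]), np.array([0.2, 1.0]), 0.5, 0.0, v_hat, z)
print(round(val, 2))   # with delta = 0, roughly Phi(0.7 / sqrt(1.25))
```

With δ = 0 and standard normal v̂2i, the closed form E[Φ(a + bV)] = Φ(a/√(1 + b²)) gives a check on the sample average.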
Let G−u1|v2(·; v2i) be the CDF of −u1i given (v2i, zi), and recall that CF-CI implies that zi is excluded from the conditional distribution function. Then the ASF is easily calculated as

ASF_CF-CI(xo) = Ev2( G−u1|v2(xoβo; v2i) )   (2.30)

where the expectation is taken with respect to v2i. Comparing equations (2.29) and (2.30) highlights the impact of CF-CI on interpretation. Since there is endogeneity, the effect of xo on the predicted probability of success can be broken down between the structural direct effect and an endogenous indirect effect. The allure of the CF-CI assumption is that it immediately distinguishes between the two effects in the conditional distribution function G−u1|v2(xoβo; v2i), where the first argument captures the structural direct effect and the second argument should entirely control for the endogenous indirect effect. But when CF-CI fails, and this structure of the conditional CDF is still presumed, the lines between the structural direct effect and the endogenous indirect effect become blurred. Consequently, in estimation, the ASF calculated when incorrectly imposing CF-CI will not be able to correctly average out the endogenous indirect effect.

In a more flexible framework, Rothe assumes CF-LI, which slightly relaxes CF-CI by allowing the conditional distribution to be a function of the instruments through the linear index xiβo. Recall that CF-LI means D(u1i|v2i, zi) = D(u1i|v2i, xiβo). Using results from Manski (1988), identification of βo and of G_CF-LI(xiβo, v2i) = F−u1|v2,xβo(xiβo; v2i, xiβo), the conditional CDF of −u1i evaluated at xiβo, can be obtained. As mentioned before, the CF-LI assumption is still a fairly strong restriction on the conditional distribution of u1i|v2i, zi. Compared to the specification in Assumption 2.3.1, it requires the control function and the heteroskedastic function to be constructed with the linear index and not as more flexible functions of the instruments.
But for now, consider the most optimistic case where the CF-LI assumption holds; then how does one calculate and estimate the ASF? Again applying the framework provided in Wooldridge (2005), the conditioning argument now includes the control variate and the linear index, wi = (v2i, xiβo):

µ2(xo, wi) = ∫_ℝ 1{xoβo + u > 0} fu|v,xβ(u; v2i, xiβo) du = Eu(1{xoβo + u > 0}|v2i, xiβo) = F−u1|v2,xβo(xoβo; v2i, xiβo)

Notice that the linear index appears twice as an argument: first at the point of evaluation for the conditional CDF, xoβo (the structural direct effect), and then as part of the conditioning argument, xiβo (the endogenous indirect effect). Applying Lemma 2.1 from Wooldridge (2005), the ASF when the true data generating process satisfies the CF-LI assumption is

ASF_CF-LI(xo) = Ev2,xβo( F−u1|v2,xβo(xoβo; v2i, xiβo) )   (2.31)

where the expectation is taken with respect to the joint distribution of the conditioning arguments (v2i, xiβo). The immediate issue is that the ASF cannot be written in terms of the identified function G_CF-LI(xiβo, v2i) that is estimated using the proposed SML estimator in Rothe. The identified function is the conditional CDF evaluated at and conditioned on the same linear index. Therefore one cannot distinguish between the direct structural effect and the indirect endogenous effect of the linear index. This reiterates the importance of being able to separately identify the conditioning argument from the point of evaluation for estimation. Rothe suggests using Ev2(G_CF-LI(xoβo, v2i)) as the ASF, but this only averages out the part of the endogenous indirect component due to v2i and does not average out any of the effect due to the linear index. Therefore, the ASF proposed by Rothe is equal to the true ASF only when the CF-CI assumption of BP holds.
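The gap between the true ASF in (2.31) and the object Ev2(G_CF-LI(xoβo, v2i)) can be made concrete with a small numerical example of my own. Suppose u1i|v2i, xiβo ∼ N(ρv2i, exp(2δ·xiβo)), so CF-LI holds; the true ASF averages the heteroskedastic scale over the population distribution of the linear index, while the AIF-style object freezes the scale at the evaluation point xoβo. The normal DGP and parameter values below are illustrative assumptions, not the chapter's simulation design.

```python
# ASF vs. the AIF-style average under CF-LI; illustrative DGP of my own.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 200_000
xb = rng.standard_normal(n)              # population draws of x_i * beta_o
v2 = rng.standard_normal(n)
rho, delta = 0.5, 0.4                    # endogeneity and heteroskedasticity
xb_o = 1.0                               # evaluation point x^o * beta_o

asf = np.mean(norm.cdf((xb_o + rho * v2) / np.exp(delta * xb)))   # averages scale
aif = np.mean(norm.cdf((xb_o + rho * v2) / np.exp(delta * xb_o))) # freezes scale
print(round(asf, 2), round(aif, 2))      # the two objects clearly differ
```

Only the argument of the heteroskedastic scale differs between the two lines, yet the averages differ noticeably, which is the identification failure described above.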
So although it may be tempting to consider the CF-LI assumption as a compromise that allows for flexibility in the relationship between the unobserved heterogeneity and the instruments, the true ASF is not identified under the CF-LI assumption. In fact, the ASF proposed by Rothe is estimating the AIF of Lewbel, Dong, and Yang (2012), who suggest using it as an alternative to the ASF since it is generally much easier to identify. They define the AIF as

AIF(xo) = E(1{xiβo + u1i > 0}|xiβo = xoβo) = Ev2( G_CF-LI(xoβo, v2i) )   (2.32)

Given the choices of propensity score, ASF, and AIF as possible ways to interpret the estimates of the model, how should one proceed? Lin and Wooldridge (2015) address this issue by comparing these functions, proposed in the context of binary response, in the linear regression case (as in equation (2.24)) to see if they uncover the direct structural component, xoβo. Earlier in this section, the propensity score was shown not to reflect the mechanisms researchers are interested in when endogeneity is present. Lin and Wooldridge (2015) also show this for the AIF, explaining that "the AIF suffers from essentially the same shortcomings as the propensity score because it is affected by correlation between the unobservables and the observed [endogenous explanatory variables]."

The next section discusses the derivation of the APEs. Again, the APEs should isolate the structural impact of varying a particular covariate and therefore should be derived from the ASF. Consequently, presuming CF-CI or CF-LI affects the derivations and interpretations of the APEs.

2.5.3 Average Partial Effects

Similar to the interpretation of βo in a linear regression (as in equation (2.24)), the APE should capture the causal (structural direct) effect of a regressor on the outcome variable.
In the binary response framework, the parameters are identified only up to scale and therefore provide very little for interpretation and are generally not comparable across different specifications. Alternatively, the APE provides a comparable statistic that can be used for interpretation.

Let xo_j and βjo denote the jth elements of xo and βo respectively. Then the partial effect of the jth element of xi is defined as the partial derivative of the ASF with respect to xo_j. Under the setting considered in Assumption 2.3.1, the partial effect is

∂ASF(xo)/∂xo_j = ∂Ev2,z[ Φ( (xoβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) ) ] / ∂xo_j
= Ev2,z[ ∂Φ( (xoβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) ) / ∂xo_j ]
= Ev2,z[ φ( (xoβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) ) βjo / exp(g(y2i, zi)δo) ]

To obtain the APE, one plugs in xi for xo and averages over the joint distribution of (xi, v2i, zi):

APE = E[ φ( (xiβo + h(v2i, zi)γo) / exp(g(y2i, zi)δo) ) βjo / exp(g(y2i, zi)δo) ]   (2.33)

To estimate, use sample averages in place of expectations and consistent estimates of the parameters. Notice that the derivative is taken only with respect to the structural direct component, the argument of the ASF, but after the derivative is taken, one averages over the joint distribution of (xi, v2i, zi) in both the structural direct effect and the endogenous indirect effect together.

How does using the AIF instead of the ASF affect the APE derivation under the CF-LI assumption? First, the correct APE under the CF-LI assumption is

APE_CF-LI = ∂ASF(xi)/∂xji = E( f−u|v2,xβo(xiβo; v2i, xiβo) βjo )   (2.34)

where f−u|v2,xβo(·; v2i, xiβo) is the conditional PDF.
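The sample analogue of the APE in equation (2.33) mirrors the ASF computation. The sketch below uses an illustrative single-regressor setup with h = γv̂2i and g = δzi; these simplifications, and the simulated data, are assumptions of the sketch, not the chapter's general forms.

```python
# Sample analogue of the APE in (2.33): average the probit density times the
# scaled coefficient over the joint empirical distribution of (x, v-hat, z).
import numpy as np
from scipy.stats import norm

def ape(X, beta, gamma, delta, v_hat, z, j):
    """n^-1 * sum_i phi(index_i) * beta_j / exp(delta * z_i)."""
    scale = np.exp(delta * z)
    index = (X @ beta + gamma * v_hat) / scale
    return np.mean(norm.pdf(index) * beta[j] / scale)

rng = np.random.default_rng(6)
n = 100_000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
v_hat = rng.standard_normal(n)
z = np.zeros(n)                          # homoskedastic special case
val = ape(X, np.array([0.0, 1.0]), 0.5, 0.0, v_hat, z, j=1)
print(round(val, 2))
```

In the homoskedastic special case above, the closed form E[φ(W)] = 1/√(2π(1 + σ²)) for W ∼ N(0, σ²) gives a check on the sample average.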
Since the averaging is over the point of evaluation and the conditioning argument, one may be hastily optimistic in thinking this is identified from the conditional CDF, G_CF-LI(xiβo, v2i) = F−u1|v2,xβo(xiβo; v2i, xiβo). However, the correct PDF cannot be derived from this function since

∂G_CF-LI(xiβo, v2i)/∂[xiβo] ≠ f−u|v2,xβo(xiβo; v2i, xiβo)

Consequently, the APEs in Rothe are also incorrectly calculated from the AIF:

APE_AIF = E( (∂G_CF-LI(xiβo, v2i)/∂[xiβo]) βjo )   (2.35)

This discussion has provided a theoretical argument for the differences between the AIF and ASF and why one should prefer the ASF. But one may wonder whether all of this matters in practice. Once one starts averaging over components, minute differences in calculations may be diminished in their impact. Perhaps the AIF used by Rothe does a "good enough" job of approximating the true ASF. This is investigated in a simulation study in the next section, where the CF-LI assumption holds true in the underlying data generating process and the ASF from the proposed estimator and the AIF from the SML estimator proposed by Rothe are calculated and compared.18 The simulation results suggest that when the CF-LI assumption holds, both the proposed method and the SML estimator from Rothe perform quite well in terms of parameter estimates. However, there is a stark difference in ASF estimates, consistent with the previous analysis. In the simulation, the poor ASF estimates using the SML procedure can be entirely attributed to the fact that under CF-LI, the SML can only recover the AIF, which can be starkly different from the ASF.

2.5.4 Simulation: ASF Estimates for the Effect of Income on Home-ownership

This simulation models the home-ownership and income application in Rothe (2009) as a contextual setting.
Rothe uses a sample of '981 married men aged 30 to 50 that are working full time and have completed at most the lowest secondary school track of the German education system' from the 2004 wave of the German Socio-Economic Panel. The outcome y1i is an indicator that takes on the value 1 if an individual owns their home and 0 if they are renting. The included instruments z1i are the individual's age (z11i) and an indicator for the presence of children younger than 16 (z12i). The endogenous variable of interest, y2i, is household income, and the excluded instruments are indicators for the wife's education level (intermediate z21i and advanced z22i) and an indicator for her employment status (z23i). A more detailed discussion of the data generating process and a table of summary statistics are presented in the Appendix.

In this simulation the CF-LI assumption holds in the underlying data generating process such that the distribution of u1i conditional on v2i and the exogenous regressors zi is

u1i|zi, v2i ∼ N( v2iγ1o + v2i(xiβo)γ2o, exp(2(xiβo)δo) )

Motivated by the explanation in Example 2, by allowing for an interactive effect in the conditional mean function, I am allowing for interactions between the omitted variable, credit score, and the linear index. Heteroskedasticity is also introduced, which allows for variability in the variance of the unobservables conditional on observables. Since D(u1i|v2i, zi) = D(u1i|v2i, xiβo), the CF-CI assumption is violated but the CF-LI assumption holds.

18 Rothe (2009) also considers the case that the CF-LI assumption holds but CF-CI does not hold in the simulation study. The second design introduces heteroskedasticity, as a function of the linear index xiβo, in the unobserved latent error u1i. However, only results on coefficient estimates, and not the ASF or APE estimates, are reported.
This simulation will examine in more detail the CF-LI assumption as a relaxation of the CF-CI assumption, investigating whether the discussion of the ASF, AIF, and APE holds true in practice. Given the analysis of the previous section, the Rothe SML estimator should be able to estimate the parameters βo well but be unable to correctly calculate the ASF and APE because it cannot distinguish between the two effects (structural direct and endogenous indirect) of the linear index, xiβo. Implementation of the SML and the proposed estimator are explained in more detail in the appendix.

Table E.5 reports the coefficient estimates for the simulated data as well as the estimates in the Rothe application as a comparison.19 All second-stage coefficient estimates are normalized such that the coefficient on Children in the Household is one. The simulated data is not an exact replica, but the estimates are in the same range, and the changes in the coefficient estimates as one starts to control for the endogeneity all move in the same direction. As expected, the estimates for Het-Probit (GCF) – the proposed estimator – and SML, columns (8) and (9), are quite similar and close to the true values in column (10).

Figures D.3 and D.4 show the ASF estimates for a 40-year-old with children under the age of 16 in the household as the ASF varies over the endogenous regressor, log(total income). In Figure D.3, the OLS and Probit estimators perform poorly since they do not address endogeneity at all. The 2SLS and Probit (CV) estimates are much closer to the true ASF but predict a slightly flatter ASF. Recall from the earlier discussion that even if the SML produces consistent parameter estimates, it incorrectly estimates the ASF and consequently the APE. This is because, under the CF-LI assumption, the SML does not correctly average out the distribution of the unobserved heterogeneity.
Figure D.4 reports the true ASF, the AIF (not a structural object of interest), and the estimated ASFs for the proposed Het-Probit (GCF) and the semi-parametric SML. The true ASF correctly averages over the distribution of the unobserved heterogeneity, while the wrong ASF only averages out the v2i components of the unobserved heterogeneity. As expected, the proposed estimator does a good job estimating the true ASF, while the SML estimator does a good job estimating the AIF. In this simulation I find that there can be a fairly stark difference between the ASF and AIF, which means the differences in the estimators are consequential. For instance, the AIF would predict the average probability of home-ownership for an individual with a log total income of 7.65 to be 0.595, while the ASF would predict an average probability of 0.461, a substantial difference. This further reiterates the discussion in Section 4; i.e., even under the CF-LI assumption, the SML estimator is not capturing the true ASF. The simulated distribution of the APE is reported in Table E.6. The true APE is 0.6448, so the proposed Het-Probit (GCF) estimator has the closest mean, whereas the mean of the SML estimates is the third closest, following the mean of the 2SLS estimates. But the difference between these estimators in interpretation is minimal: a 10% increase in total income results in either a 0.0626 increase in the probability of home ownership according to the Het-Probit (GCF) or a 0.0699 increase according to Rothe's SML. The estimators that suffer the most are the ones that do not address the issue of endogeneity at all, OLS and Probit, which are distinctly biased downwards. Therefore I find in this simulation study that, under the CF-LI assumption, parameter estimates are similar across the two estimators. But when looking at the ASF, the estimates diverge significantly.

Footnote 19: SML and Het-Probit are estimated using a Nelder-Mead simplex method in Matlab.
I show that this can be entirely accounted for by the fact that the SML estimator is actually estimating the AIF, which is not equal to the ASF (and should not be interpreted in the same way).

2.6 Empirical Example

To showcase the estimator in an empirical example, I examine married women's labor force participation using 1991 CPS data (see footnote 20). All tables and figures referenced in this section can be found in Appendix D. Table E.7 provides some summary statistics for the data set. The dependent variable is Employed (= 1 when the individual is in the labor force), where approximately 58% of married women in the sample participate in the labor force. The last two columns divide the sample over the binary outcome and report the summary statistics for the other observable characteristics. The structural outcome equation is

\text{Employed}_i = 1\{\beta_1 + \text{nwifeinc}_i\beta_2 + \text{educ}_i\beta_3 + \text{exper}_i\beta_4 + \text{kidslt6}_i\beta_5 + \text{kidsge6}_i\beta_6 + \text{nwifeinc}_i \times \text{kidslt6}_i\,\beta_7 + \text{nwifeinc}_i \times \text{kidsge6}_i\,\beta_8 + u_{1i} > 0\}   (2.36)

where the economic interest is in estimating the effect of non-wife income on the probability of being in the labor force. Since there is a trade-off between work and leisure, by relaxing the budget constraint such that an individual has other sources of income, one would expect the individual to be less likely to work. From the summary statistics, those not working tend to have higher non-wife income. But this cannot be interpreted as a causal effect, since there is concern that other sources of income are endogenously determined with the wife's labor force participation. In particular, the husband's employment, which partly determines the non-wife income, would probably be decided simultaneously with the wife's employment.

Footnote 20: The data are part of the supplementary material provided with the textbook "Econometric Analysis of Cross Section and Panel Data" by Jeffrey Wooldridge, and can be downloaded at https://mitpress.mit.edu/books/econometric-analysis-cross-section-and-panel-data
Utilizing the husband's education level as an instrument, the causal effect of non-wife income on the wife's labor force participation can be parsed out. Since education and the probability of working are generally correlated, the instrument is easily argued to be relevant. In fact, the F-statistic of significance for the first stage is quite large, as seen in Table E.8 (see footnote 21). Excludability of the instrument follows from the argument that the husband's education level should not directly affect the wife's choice of labor force participation except through the channels by which it affects the non-wife income. The other controls considered in this example are the wife's education level, experience, and dummy variables for whether or not they have kids younger than 6 and kids 6 and older.

Table E.8 reports the reduced form coefficient estimates, the second stage parameter estimates for several different specifications of a Probit model, and the SML estimator proposed by Rothe. For the second stage estimates the coefficient on Education is normalized to 1, since the model is only identified up to scale. This allows for comparisons across the different estimators. The second column specifies a standard Probit model, which assumes no endogeneity and homoskedasticity in the latent error. This is slightly relaxed in column (3), where heteroskedasticity is allowed. The specifications in columns (4)-(8) all address endogeneity in one form or another. The fourth column corresponds to the setting of Rivers and Vuong (1988), where endogeneity is addressed by only including the control variable as an additional covariate, maintaining the CF-CI assumption.

Footnote 21: The standard benchmark of 10 from Stock, Wright, and Yogo (2002) only applies to the relative bias of 2SLS under homoskedasticity. It is an open area of research to determine benchmarks for non-standard cases.
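Returning to the first-stage relevance check mentioned above: the F-statistic has the standard form built from restricted and unrestricted sums of squared residuals. The sketch below runs a first-stage regression on simulated stand-in data (the numbers are hypothetical, not the CPS sample) with a single exclusion restriction.

```python
import random

random.seed(1)

# Stand-in data: husband's education and non-wife income (hypothetical DGP).
n = 500
huseduc = [random.choice(range(8, 18)) for _ in range(n)]
nwifeinc = [5.0 + 1.2 * e + random.gauss(0.0, 4.0) for e in huseduc]

# First-stage OLS of nwifeinc on a constant and the excluded instrument.
xbar = sum(huseduc) / n
ybar = sum(nwifeinc) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(huseduc, nwifeinc))
sxx = sum((x - xbar) ** 2 for x in huseduc)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

ssr_u = sum((y - b0 - b1 * x) ** 2 for x, y in zip(huseduc, nwifeinc))
ssr_r = sum((y - ybar) ** 2 for y in nwifeinc)   # instrument excluded

# F-statistic for the single exclusion restriction (q = 1, k = 2).
F = (ssr_r - ssr_u) / (ssr_u / (n - 2))
```

With a strong instrument by construction, F comfortably exceeds the usual benchmark of 10, though, as noted in footnote 21, that benchmark is only formally justified for 2SLS under homoskedasticity.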
The next three columns are variations on the proposed estimator, all of which relax the CF-CI assumption by allowing for heteroskedasticity in the latent error and/or allowing for a general control function. The final column presents results using the SML estimator (from Rothe), which imposes no distributional assumptions but requires either the CF-CI or CF-LI assumption. I find that addressing endogeneity with only a control variable reduces the effect of non-wife income (in columns (4) and (5)). But then allowing for a general control function, where the control variate interacts with the children dummies, raises the effect of non-wife income and also switches the signs of the interactions with the children (although these are not statistically significant). When a general control function is used without allowing for heteroskedasticity, the bootstrapped standard errors increase substantially. This is due to very small (almost 0) coefficient estimates for education, which blow up the scaled parameter estimates. When heteroskedasticity is allowed, the coefficient estimate on education becomes statistically different from 0, which results in lower standard errors for the scaled parameter estimates. Finally, the SML estimates are found to be quite similar to the proposed estimator's results in column (6). This would suggest that the CF-LI assumption may in fact hold in this setting. When looking at the control function parameters, I see particularly large effects when interacting the reduced form error v̂2i with the children dummies (in specification (7)). Intuitively this makes sense, since one would imagine that the endogenous decision making process of who in the household should work (husband, wife, or both) depends a lot on the presence of children in the household. For instance, if there are very young children in the household, then the trade-off is not just between work and leisure but must also consider the cost of childcare if both parents enter the workforce.
Therefore it would make sense that there is a negative interactive effect, such that when one partner is working the other is less likely to work, in order to provide childcare. Since this chapter proposes a more flexible specification, Table E.9 provides Wald test results for different specifications. The first four columns test the null hypothesis that non-wife income is in fact exogenous. One of the benefits of the control variable approach is the variable addition test it supplies: one can test the null hypothesis of no endogeneity by testing whether all the coefficients in the control function are 0. Under all combinations of modelling assumptions (such as homoskedasticity/heteroskedasticity and control variable/general control function) I find strong evidence of endogeneity. However, this is conditional on the instrument being exogenous and, since the model is just identified, there is no way to test for exogeneity of the instrument. The remainder of the table tests the different components of the CF-CI assumption in alternative specifications. The middle two columns test the null hypothesis that the control variable is sufficient to capture the full impact of the endogenous part of y2i; in other words, they test the significance of the coefficients on the additional terms in the general control function. The null is rejected at the 10% level under homoskedasticity and at the 5% level under heteroskedasticity. This gives statistical evidence of the violation of CF-CI, through the general control function, in this empirical application. Finally, the last three columns test the null hypothesis of homoskedasticity (i.e., that all of the coefficients in the heteroskedastic function are 0). There is strong statistical evidence of the violation of CF-CI through the presence of heteroskedasticity, easily rejecting homoskedasticity at the 5% level.
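The Wald statistics behind these tests take the standard quadratic form W = θ̂′V̂⁻¹θ̂ for the block of coefficients under test, compared against a chi-squared critical value. A minimal sketch for a two-coefficient block follows; the numbers are made up for illustration and are not the estimates from Table E.9.

```python
# Hypothetical tested coefficients and their estimated covariance block.
theta = [0.42, -0.31]
V = [[0.040, 0.010],
     [0.010, 0.025]]

# Invert the 2x2 covariance block and form W = theta' V^{-1} theta.
det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
Vinv = [[ V[1][1] / det, -V[0][1] / det],
        [-V[1][0] / det,  V[0][0] / det]]
W = sum(theta[i] * Vinv[i][j] * theta[j]
        for i in range(2) for j in range(2))

# Under H0 the statistic is chi-squared with 2 degrees of freedom;
# the 5% critical value is about 5.99.
reject_at_5pct = W > 5.99
```

Testing the control-function coefficients this way is the variable addition test mentioned above; testing the variance-equation coefficients gives the homoskedasticity test in the last columns of the table.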
Given these results, the preferred specification is the Het-Probit (GCF): I reject homoskedasticity and reject the inclusion of only the control variable in favor of a general control function. To understand the consequences of the different specifications for the interpretation of the results, Table E.10 provides estimates of the APEs with respect to the endogenous variable, non-wife income, along with their bootstrapped standard errors. The most significant change in the estimates occurs when one starts to address the issue of endogeneity. In the linear models, the 2SLS APE estimates shrink to about 3/4 of the OLS estimates but are still statistically significant. In the non-linear models, a similar reduction in APE estimates is observed when controlling for endogeneity (to about 3/5). But in the models that relax CF-CI completely (Het-Probit (GCF)), the APE is no longer statistically significant and even switches its sign. Putting the APE estimates into an interpretive setting: if the non-wife income increases by $10,000 – a fairly substantial increase – then, according to the preferred specification, the likelihood of the wife working decreases by around 1.16 percentage points, a fairly negligible effect. As suggested in the discussion of the coefficient estimates, this small effect is most likely driven by heterogeneity over the presence of children in the household. Therefore, examining the ASF for different combinations of ages of children present in the household can be informative. Since the APEs average over the distribution of all the covariates, these differing effects tend to be washed out in the single statistic. Figures D.5-D.8 show the ASF with respect to non-wife income for a married woman with a high school education, 20 years of experience, and different combinations of ages of children present in the household. The Probit models have a fairly linear ASF, which explains why 2SLS gives fairly similar estimates of the APE.
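For reference, the APE of non-wife income in the interacted model is the sample average of the scaled standard normal density multiplied by the marginal coefficient on nwifeinc. The sketch below assumes, for simplicity, that nwifeinc does not enter the variance function, so the effect works only through the mean index; the fitted values and coefficients are hypothetical.

```python
import math

def phi(t):
    # Standard normal density.
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def ape_nwifeinc(data, beta2, beta7, beta8, index_fn, sd_fn):
    """Average partial effect of nwifeinc in the interacted probit model.
    index_fn(obs) returns the fitted mean index (including the control
    function); sd_fn(obs) returns the fitted standard deviation exp(g)."""
    total = 0.0
    for obs in data:
        # Marginal coefficient on nwifeinc, including child interactions.
        slope = beta2 + beta7 * obs["kidslt6"] + beta8 * obs["kidsge6"]
        total += phi(index_fn(obs) / sd_fn(obs)) * slope / sd_fn(obs)
    return total / len(data)

# Toy usage with made-up fitted values.
toy = [{"kidslt6": 1, "kidsge6": 0, "xb": 0.3, "sd": 1.2},
       {"kidslt6": 0, "kidsge6": 1, "xb": -0.5, "sd": 0.9}]
ape = ape_nwifeinc(toy, beta2=-0.1, beta7=0.05, beta8=0.02,
                   index_fn=lambda o: o["xb"], sd_fn=lambda o: o["sd"])
```

Because the slope term varies with the children dummies, averaging over the sample washes out the heterogeneous effects, which is exactly why the text turns to the ASF by child composition.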
The first thing to note is that the ASF using Probit estimates is much more negatively sloped in all the figures. This is because, without addressing the issue of endogeneity, I would expect to see this sort of substitution effect (i.e., if the husband works then the wife is less likely to work and vice versa, but these decisions are made simultaneously). Once endogeneity is controlled for in Probit (CV), the slope lessens. The much more striking revelation is with the proposed Het-Probit (GCF) and SML estimators: I see a positive slope when both younger and older children are present in the household and a fairly flat ASF when there are only older children. When there are no children or just very young children in the household, the ASF is much more negatively sloped. This goes to show that relaxing CF-CI and allowing for a much more flexible specification in the conditional distribution of the latent error makes an interpretive impact. In this setting, the normality assumption seems likely to hold, especially since the semi-parametric estimator (SML) produces fairly linear ASF estimates. But in other empirical applications, the normality assumption may be too restrictive and unconvincing. Consequently, neither the SML estimator nor the proposed Het-Probit (GCF) estimator is strictly weaker in its assumptions. The SML estimator imposes CF-CI (or CF-LI) by not allowing for general heteroskedasticity and a flexible general control function. But on the other hand, the proposed approach imposes distributional assumptions. This divergence between the proposed parametric estimator and the semi-parametric estimators offered in the literature leads to the following distribution free extension.

2.7 Extension: Semi-Parametric Distribution Free Estimator

In some empirical settings, imposing normality on the latent error may not be a reasonable assumption. Therefore this section offers an alternative semi-parametric estimator that does not depend on distributional assumptions.
The main result of this section is that, by allowing the heteroskedastic function and the general control function to be non-parametrically specified, the semi-parametric variation of the proposed Het-Probit (GCF) is actually a distribution free estimator. This section goes into detail as to why this is true, how to obtain non-parametric identification, and what the asymptotic properties of the semi-parametric estimator are. For an applied researcher, the results of this section imply that, as long as the heteroskedastic function and the general control function are flexibly specified (e.g., with sieve basis functions), the normality assumption that appears to be used in Assumption 2.3.1 is non-binding. How is this distribution free estimator possible? A recent paper, Khan (2013), has noted an observational equivalence result concerning binary response models: a heteroskedastic Probit model with a non-parametric heteroskedastic function is observationally equivalent to a "distribution free" model with only a conditional median restriction. The utility of this result is that one may use simple estimation procedures (such as a semi-parametric Het-Probit MLE) without making any strong distributional assumptions to obtain structural parameter estimates and possibly even identify and estimate choice probabilities and marginal effects. This section extends the result to the case of endogeneity in a flexible manner that allows for the relaxation of CF-CI. By introducing a general control function into the conditional median restriction, the observational equivalence holds under endogeneity and a simple estimation procedure is obtainable. Consequently, this section proposes a semi-parametric estimator based on assumptions that are more realistic than those of any other control function method in the literature for endogenous binary response models. Section 2.7.1 reviews the observational equivalence result in Khan (2013) and extends it to the case of endogeneity.
Since this framework considers non-parametric functions for both the general control function and the heteroskedastic function, identification will be shown under this more general scenario. Section 2.7.2 derives the asymptotic properties of the semi-parametric estimator. Using the results in Song (2016), the proofs of consistency and the rate of convergence only need to be slightly altered to allow for the semi-parametric general control function. Finally, this extension ends with a comprehensive simulation study. Over a variety of conditional distributions (some satisfying CF-CI and some not), the performance of the proposed semi-parametric Het-Probit (GCF) estimator is compared to the parametric Rivers and Vuong (1988) estimator and the SML of Rothe (2009). The simulation results suggest the proposed approach can handle a variety of alternative distributions while still allowing for the violation of CF-CI.

2.7.1 Observational Equivalence and Identification

Consider the following binary response setting without endogeneity,

y_i = 1\{x_i\beta_o + u_i \ge 0\}   (2.37)

where xi is a vector of covariates and ui is the unobserved heterogeneity. The following two assumptions restate the setting of the two observationally equivalent models in Khan (2013) (see footnote 22).

Assumption 2.7.1 (Conditional Median Restriction). In the set up described by equation (2.37):
(i) xi ∈ ℝ^k is assumed to have a density with respect to Lebesgue measure, which is positive on the set X ⊆ ℝ^k.
(ii) Let po(t, xi) denote P(−ui < t | xi), and assume
  (a) po(·, ·) is continuous on ℝ × X.
  (b) p′o(t, xi) = ∂po(t, xi)/∂t exists and is continuous and positive on all of ℝ for all xi ∈ X.
  (c) po(0, xi) = 1/2.
  (d) lim_{t→−∞} po(t, xi) = 0 and lim_{t→∞} po(t, xi) = 1.

Assumption 2.7.2 (Heteroskedastic Probit). In the set up described by equation (2.37):
(i) xi ∈ ℝ^k is assumed to have a density with respect to Lebesgue measure, which is positive on the set X ⊆ ℝ^k.
(ii) ui = σo(xi)ei, where σo(·) is continuous and positive on X a.s., and ei is independent of xi with any known (e.g. logistic, normal) distribution that has median 0 and a density function which is positive and continuous on the real line.

Theorem 2.1 of Khan (2013) states that under the above assumptions the two models are observationally equivalent. The equivalence between the two models is in terms of the choice probabilities P(y1i|xi). In other words, both models will generate the same choice probability functions and therefore cannot be distinguished from one another on that basis. This means that a researcher is able to use estimators developed under Assumption 2.7.2, such as a semi-parametric heteroskedastic Probit, while only imposing the weaker distributional assumptions of Assumption 2.7.1. This allows for easy estimation using canned commands in popular statistical programs such as Stata, Matlab or R, while still preserving the "distribution free" interpretation. Previously, endogeneity was understood as a non-zero conditional mean, but to fit the framework in Khan (2013), endogeneity will be characterized by a non-zero conditional median: Med(−ui|xi) ≠ 0. This would violate Assumptions 2.7.1(ii)(c) and 2.7.2(ii). Therefore the observational equivalence result from Khan (2013) is no longer applicable. Now consider the set up in equation (2.1) and define the non-zero conditional median as

h_o(v_{2i}, z_i) \equiv \operatorname{Med}(-u_{1i} \mid z_i, v_{2i}) = \operatorname{Med}(-u_{1i} \mid z_{1i}, y_{2i})   (2.38)

where the second equality holds because y2i is merely a function of zi and v2i. This function should look familiar, as it is the non-parametric version of the general control function introduced in Assumption 2.3.1. It captures the part of the unobserved latent error that is correlated with the endogenous variable.

Footnote 22: Assumption 2.7.1 corresponds to CM1 and CM2, and Assumption 2.7.2 corresponds to HP1 and HP2, in Khan (2013).
Again, I allow for the violation of CF-CI, since the conditional median is a function of all the conditioning arguments, including the instruments. Thus far, I have made no assumptions on what the function ho(·, ·) should be, and therefore the control function is completely general. Now I can incorporate the general control function into the model assumptions to allow for arbitrary endogeneity. The following assumptions are slight adjustments of Assumptions 2.7.1 and 2.7.2 that incorporate endogeneity.

Assumption 2.7.3 (General Conditional Median Restriction). In the set up described by equation (2.1):
(i) (v2i, zi) ∈ ℝ^{1+k1+k2} is assumed to have a density with respect to Lebesgue measure, which is positive on the set (V × Z) ⊆ ℝ^{1+k1+k2}.
(ii) Let po(t, v2i, zi) denote P(−u1i < t | v2i, zi), and assume
  (a) po(·, ·, ·) is continuous on ℝ × (V × Z).
  (b) p′o(t, v2i, zi) = ∂po(t, v2i, zi)/∂t exists and is continuous and positive on all of ℝ for all (v2i, zi) ∈ (V × Z).
  (c) po(ho(v2i, zi), v2i, zi) = 1/2, where ho(v2i, zi) is continuous for all (v2i, zi) ∈ (V × Z).
  (d) lim_{t→−∞} po(t, v2i, zi) = 0 and lim_{t→∞} po(t, v2i, zi) = 1.

Assumption 2.7.4 (Endogenous Heteroskedastic Probit). In the set up described by equation (2.1):
(i) (v2i, zi) ∈ ℝ^{1+k1+k2} is assumed to have a density with respect to Lebesgue measure, which is positive on the set (V × Z) ⊆ ℝ^{1+k1+k2}.
(ii) u1i = σo(v2i, zi)e1i + ho(v2i, zi), where σo(v2i, zi) is continuous and positive on (V × Z), ho(v2i, zi) is continuous for all (v2i, zi) ∈ (V × Z), and e1i is independent of (v2i, zi) with any known (e.g. logistic, normal) distribution that has median 0 and a density function which is positive and continuous on the real line.

Modifying the observational equivalence result to this setting is almost trivial. Instead of imposing a zero median restriction, it is acknowledged that the median is non-zero, and a general conditional median function ho(v2i, zi) is specified.
Theorem 2.7.1 states this result, with the proof provided in the appendix.

Theorem 2.7.1 (Observational Equivalence). Under Assumptions 2.7.3 and 2.7.4, the two models are observationally equivalent.

Extending the results in Khan (2013) to the case of endogenous regressors is not novel. In a working paper, Song (2016) uses a more traditional control function method to address endogeneity. He imposes an exclusion restriction on the conditional median function, the same as the conditional median independence assumption proposed in Krief (2014): Med(u1i|v2i, zi) = f(v2i). Although these assumptions are weaker than CF-CI, the exclusion restrictions that are imposed are restrictive and unnecessary. Utilizing Theorem 2.7.1, the following assumption is an alternative to Assumption 2.3.1 that allows for a non-parametric general control function and a non-parametric heteroskedastic function.

Assumption 2.7.5. Consider the set up in equation (2.1), where {y1i, zi, y2i}, i = 1, ..., n, is iid. In the first stage, the true conditional mean is E(y2i|zi) = mo(zi), and the unobserved latent error has the following conditional distribution:

u_{1i} \mid z_i, v_{2i}, y_{2i} = u_{1i} \mid z_i, v_{2i} \sim N\big( h_o(v_{2i}, z_i),\ \exp(2\, g_o(y_{2i}, z_i)) \big)

where zi = (z1i, z2i) and mo(zi), ho(v2i, zi), and go(y2i, zi) are unknown functions. Since the normal distribution used in this framework is symmetric, the conditional median is equal to the conditional mean. Therefore the remaining discussion returns to the conditional mean interpretation of endogeneity. With this slight variation on the parametric framework, Assumption 2.7.5 suggests a distribution free estimator comparable to the other semi-parametric estimators in the literature. To reiterate: in implementation, the normal distribution is used according to Assumption 2.7.5, but in interpretation there are no distributional strings attached, because of the observational equivalence result.
However, this generalization further complicates identification. First, the model is only identified up to scale, which can be solved with a normalization that sets the last coefficient in the linear index xiβo equal to 1: βko = 1. Similar to the previous literature, identification of the non-parametric heteroskedastic function is obtained by assuming that the last regressor, xki, conditional on all other random variables in the numerator, has a density function with respect to Lebesgue measure that is positive on ℝ, while all other terms in the numerator have bounded support (see footnote 23). Second, as in the parametric model, a general control function introduced without any restrictions will not be identified relative to the linear index xiβ, because the two rely on the same sources of variation. Using an analogous CMR, a shape restriction on the general control function ensures that there is variation unexplained by the linear index.

Footnote 23: This is essentially assumption RC2(i) in Khan (2013). As he notes in Remark 3.2, the bounded support condition can be relaxed to finite fourth moments. To illustrate how this condition is used in identification, consider the non-endogenous case where one would like to show identification of βo and σo from the choice probability Φ(xiβo / exp(σo(xi))). Suppose not: suppose there exist β ≠ βo and σ ≠ σo such that

\frac{x_i\beta}{\exp(\sigma(x_i))} = \frac{x_i\beta_o}{\exp(\sigma_o(x_i))} \quad \text{for all } x_i \in X.

But because xki, conditional on x−ki, has a density function with respect to Lebesgue measure that is positive on ℝ and x−ki is bounded, for any realization x*−k in the support there exists an x*k also in the support such that

\frac{x^*_{-k}\beta_{-k} + x^*_k}{\exp(\sigma(x^*))} > 0 \quad \text{and} \quad \frac{x^*_{-k}\beta_{-ko} + x^*_k}{\exp(\sigma_o(x^*))} < 0,

which is a contradiction.

Assumption 2.7.6. Let mo(·) ∈ M, ho(·) ∈ H, and go(·) ∈ G denote the function spaces and β−ko ∈ B denote the parameter space.
(i) E(x′i xi) is non-singular and E(xi|zi) is full rank.
(ii) (CMR) E(ho(v2i, zi) | zi) = 0.
(iii) The last component of xi, xki, is an included instrument whose coefficient is normalized to 1, such that xiβo = x−kiβ−ko + xki, and xki conditional on (x−ki, ho(v2i, zi)) has a density function with respect to Lebesgue measure that is positive on ℝ, while (x−ki, ho(v2i, zi)) has bounded support.

The first two parts are taken from Assumption 2.4.1. The last part imposes the scale normalization and is crucial in identifying the heteroskedastic function. There is no consensus in the literature on how to choose which regressor should have the scaled coefficient. Song (2016) uses the endogenous regressor, since it will be continuously distributed and more likely to satisfy the support requirements. But then no inference can be made on that structural parameter, whose value must be assumed to be non-zero. Therefore I suggest scaling on an instrument whose relevance is not in question and which has sufficient support. A quick remark on parts (ii) and (iii): in some simple scenarios, the CMR in part (ii) is sufficient for the support condition in part (iii), as long as xki conditional on x−ki has a density function with respect to Lebesgue measure that is positive on ℝ. For instance, consider the linear case where the general control function is of the form

h_o(v_{2i}, z_i) = \sum_{p \in P} b_p(z_i)\big( v_{2i}^p - E(v_{2i}^p \mid z_i) \big)   (2.39)

where bp : Z → ℝ and the set P consists of unique elements of the real line. Supposing P does not include 0, the CMR is satisfied. Also, for ease of understanding, consider the common case that x−ki = y2i, so there is only one included instrument, acting as the normalized covariate xki = z1i. Then, conditional on any realization h = ho(v2i, zi) and y2 = y2i, one cannot precisely determine the corresponding z1i. Therefore z1i conditional on (y2i, ho(v2i, zi)) has a density function with respect to Lebesgue measure that is positive on ℝ.
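The construction in equation (2.39) can be made concrete. If, purely for this sketch, v2i | zi ∼ N(0, 1) independently of zi, then E(v2i|zi) = 0, E(v2i²|zi) = 1, and E(v2i³|zi) = 0, so centered powers of v2i satisfy the CMR by construction; multiplying them by functions bp(zi) of the instruments yields CMR-compliant basis terms.

```python
import random

def cmr_basis(v2, z1):
    # Centered powers v2^p - E(v2^p | z) for the sketch case v2 | z ~ N(0, 1);
    # each term has conditional mean zero given z.
    centered = [v2, v2 ** 2 - 1.0, v2 ** 3]
    # b_p(z) taken to be 1 and z1 for illustration.
    return [b * f for b in (1.0, z1) for f in centered]

# Check the conditional mean restriction by simulation at a fixed z1.
random.seed(4)
z1 = 0.7
draws = [cmr_basis(random.gauss(0.0, 1.0), z1) for _ in range(100000)]
means = [sum(d[j] for d in draws) / len(draws) for j in range(6)]
```

When v2i is not independent of zi, the conditional moments E(v2i^p | zi) must themselves be estimated, but the demeaning logic is the same.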
The following theorem states the general identification result.

Theorem 2.7.2. In the set-up described by equation (2.1) and Assumption 2.7.5, if Assumption 2.7.6 holds, then (mo(·), βo, ho(·), go(·)) are identified.

The proof is given in the appendix. Alternatively, if one is concerned that Assumption 2.7.6(iii) is unlikely to hold, then the researcher may always turn to exclusion restrictions as a sufficient condition. If xki is assumed to be excluded from the control function, then, given the proper bounded and unbounded supports, part (iii) of Assumption 2.7.6 is easily satisfied. With identification, the proposed estimation procedure is quite simple and similar to the parametric version in Section 4. In the first stage, the conditional mean function E(y2i|zi) = mo(zi) is estimated using standard non-parametric regression techniques such as sieves or kernels. The control variable is constructed from the residuals v̂2i = y2i − m̂(zi) and plugged into the second step. In the second stage, one would use sieves to estimate the non-parametric components. In this case sieves are preferred over other non-parametric methods, since it is easier to impose the CMR on the general control function. Let {bl(v2i, zi), l = 1, 2, ..., Lhn} and {cl(y2i, zi), l = 1, 2, ..., Lgn} be sequences of basis functions such that the bl(v2i, zi) satisfy the CMR for all l = 1, 2, ..., Lhn. The sieve spaces are defined as

H_n = \Big\{ h : (V \times Z) \to \mathbb{R},\ h(v_{2i}, z_i) = \sum_{l=1}^{L_{hn}} b_l(v_{2i}, z_i)\gamma_l : E(b_l(v_{2i}, z_i) \mid z_i) = 0 \text{ and } \gamma_1, ..., \gamma_{L_{hn}} \in \mathbb{R} \Big\}   (2.40)

G_n = \Big\{ g : (V \times Z) \to \mathbb{R},\ g(y_{2i}, z_i) = \sum_{l=1}^{L_{gn}} c_l(y_{2i}, z_i)\delta_l : \delta_1, ..., \delta_{L_{gn}} \in \mathbb{R} \Big\}   (2.41)

Finally, one would maximize the following likelihood

\mathcal{L}(y_{1i}, x_i, z_i; \hat{m}, \beta, \gamma, \delta) = \sum_{i=1}^{n} \left[ y_{1i} \log \Phi\!\left( \frac{x_i\beta + h(\hat{v}_{2i}, z_i)}{\exp(g(y_{2i}, z_i))} \right) + (1 - y_{1i}) \log\left( 1 - \Phi\!\left( \frac{x_i\beta + h(\hat{v}_{2i}, z_i)}{\exp(g(y_{2i}, z_i))} \right) \right) \right]   (2.42)

with respect to β, h(·) ∈ Hn, and g(·) ∈ Gn.
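The two-step procedure just described can be sketched end to end. A toy data generating process, a fixed quadratic basis in the first stage, and a single sieve term in each of h and g are assumptions of this illustration; in practice the basis dimensions Lhn and Lgn grow with n, and the likelihood is maximized rather than merely evaluated.

```python
import math
import random

random.seed(2)

def Phi(t):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def solve3(A, b):
    # Gauss-Jordan elimination for a small 3x3 system (no pivoting; fine
    # for this well-conditioned illustration).
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(3):
        M[i] = [m / M[i][i] for m in M[i]]
        for j in range(3):
            if j != i:
                M[j] = [mj - M[j][i] * mi for mj, mi in zip(M[j], M[i])]
    return [M[i][3] for i in range(3)]

# Step 1: series regression of y2 on basis functions of z; the residuals
# are the estimated control variable v2hat.
z = [random.uniform(-2.0, 2.0) for _ in range(400)]
y2 = [math.sin(zi) + random.gauss(0.0, 0.3) for zi in z]   # toy first stage

basis = lambda zi: [1.0, zi, zi * zi]
X = [basis(zi) for zi in z]
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y2)) for i in range(3)]
coef = solve3(XtX, Xty)
v2hat = [yi - sum(c * b for c, b in zip(coef, basis(zi)))
         for zi, yi in zip(z, y2)]

# Step 2: evaluate the sieve log likelihood in (2.42) at a candidate
# parameter (toy binary outcome; a real implementation maximizes this).
y1 = [1 if math.sin(zi) + vi > 0 else 0 for zi, vi in zip(z, v2hat)]

def loglik(beta, gamma, delta):
    ll = 0.0
    for y1i, zi, y2i, vi in zip(y1, z, y2, v2hat):
        h = gamma * vi                 # one CMR-compliant sieve term
        g = delta * y2i                # one variance sieve term
        p = Phi((zi * beta + h) / math.exp(g))
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard the logs
        ll += y1i * math.log(p) + (1 - y1i) * math.log(1.0 - p)
    return ll

ll0 = loglik(beta=1.0, gamma=0.5, delta=0.1)
```

Using sieves in the second stage makes it straightforward to restrict attention to basis terms satisfying the CMR, which is the reason the text prefers them over kernels here.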
As in the parametric version, implementation is as simple as running the hetprobit command in Stata for the second stage. However, inference should reflect both the two step estimation process and the non-parametric specification. The next section provides the asymptotic results for the proposed distribution free estimator.

2.7.2 Asymptotic Properties

Song (2016) derives consistency and convergence rates of the semi-parametric Het-Probit (CF) estimator when the control function satisfies CF-CI and imposes the exclusion restriction (i.e., ho(v2i, zi) = ho(v2i)). Therefore the asymptotic results only need to be slightly augmented to allow for the general control function. Explanations of the notation are left to the appendix. The following assumption collects the remaining low level regularity conditions needed for consistency of the second stage parameters.

Assumption 2.7.7.
(i) Let (β−ko, ho(·), go(·)) ∈ (B × H × G) = Θ denote the joint parameter space. For any (β−k, h(·), g(·)) ∈ Θ,
  (a) β−k ∈ B ⊂ ℝ^{k−1}, where B is compact;
  (b) Φ( (xiβ + h(v2i, zi)) / exp(g(y2i, zi)) ) ∈ Λ^s_c((V × Z), w1) for s > 0 and w1 ≥ 0;
  (c) h(v, z) is continuously differentiable with respect to its first component, such that

  \sup_{v \in V} \sup_{z \in Z} \left| \frac{\partial h(v_2, z)}{\partial v_2} \right| < C < \infty.

(ii) ∫ (1 + ||(v, z)||²)^{w2} f_{v,z}(v, z) d(v, z) < ∞, where f_{v,z}(·) denotes the joint density function and w2 > w1 > 0.
(iii) For b^{Lhn}(v2i, zi) = (b1(v2i, zi), ..., b_{Lhn}(v2i, zi)) and c^{Lgn}(y2i, zi) = (c1(y2i, zi), ..., c_{Lgn}(y2i, zi)), the matrices E(b^{Lhn}(v2i, zi)′ b^{Lhn}(v2i, zi)) and E(c^{Lgn}(y2i, zi)′ c^{Lgn}(y2i, zi)) are non-singular for all n.

Part (i) collects conditions on the parameter and functional spaces. The second component constrains the predicted probability function to lie in a weighted Hölder ball with radius c, smoothness s, and weight function (1 + ||·||²)^{w1/2}, as defined in equation (B.12) in the appendix.
Since the parameter space is a weighted Hölder ball, there exists a projection mapping from standard sieve spaces constructed from power series, Fourier series, splines, or wavelets to the parameter space as n → ∞. The last component allows for a Taylor expansion around the control variate v2i, since it is estimated in the first stage. Part (ii) replaces any compactness conditions on (v2i, zi), and part (iii) ensures point identification of the sieve coefficients. The following theorem provides consistency of the proposed estimator. The proof is omitted, since the arguments are identical to those provided in Song (2016).

Theorem 2.7.3. In the set-up described by equation (2.1), where Assumptions 2.7.5, 2.7.6, and 2.7.7 hold, if the first stage estimator v̂2i satisfies

\sup_{(x_i, z_i) \in X \times Z} |\hat{v}_{2i} - v_{2i}| = O_p(\tau_v)

where τv = op(1), then the estimators that maximize the log likelihood in equation (2.20) are consistent:

\|\hat{\beta}_{-k} - \beta_{-ko}\| = o_p(1)

\left\| \Phi\!\left( \frac{x_i\beta_o + h_o(v_{2i}, z_i)}{\exp(g_o(y_{2i}, z_i))} \right) - \Phi\!\left( \frac{x_i\hat{\beta} + \hat{h}(\hat{v}_{2i}, z_i)}{\exp(\hat{g}(y_{2i}, z_i))} \right) \right\|_{\infty, w_1} = o_p(1)

Note that consistency of the first stage non-parametric estimator can be obtained using Proposition 3.6 of Chen (2007) under fairly standard and relaxed conditions (see footnote 24). This theorem provides consistency of both the parametric component of the second stage estimator and the predicted probability function. The latter is used to establish consistency of the APE estimates in the following corollary.

Corollary 2.7.1.
Under the conditions of Theorem 2.7.3, the APE estimator with respect to component j of xi is consistent:

| n^(−1) Σ_{i=1}^n (β̂j/exp(ĝ(y2i, zi))) φ((xi β̂ + ĥ(v̂2i, zi))/exp(ĝ(y2i, zi))) − E[ (βjo/exp(go(y2i, zi))) φ((xiβo + ho(v2i, zi))/exp(go(y2i, zi))) ] | = op(1)

24 The conditions for consistency of the first stage estimator include: Z ⊂ ℝ^(k1+k2) is a Cartesian product of compact intervals, E(y2i|zi) is bounded, and mo(zi) = E(y2i|zi) ∈ Λs where s > (k1 + k2)/2.

The next assumption collects the remaining low level conditions needed to derive the convergence rate of the parametric component of the second stage estimator.

Assumption 2.7.8.

(i) Any h(·) ∈ H satisfies h(v2i, zi) ∈ Λs_c((V × Z), w1) for some s > 0 and w1 ≥ 0.

(ii) Any g(·) ∈ G satisfies g(y2i, zi) ∈ Λs_c((V × Z), w1) for some s > 0 and w1 ≥ 0.

(iii) The smoothness exponent of the Holder space satisfies 2s ≥ 1 + k1 + k2.

(iv) In Assumption 2.7.7(ii), w2 > w1 + s.

This assumption places stronger smoothness assumptions as well as further restricting the tail behavior of the covariates. The convergence rates are derived for the one-step estimator where the control variate v2i is assumed to be known and not estimated in a first stage. Therefore the estimator is defined as

θ̃ ≡ (β̃−k, h̃, g̃) = arg max_{(β−k,h,g)∈B×Hn×Gn} Σ_{i=1}^n { y1i log Φ((xiβ + h(v2i, zi))/exp(g(y2i, zi))) + (1 − y1i) log[1 − Φ((xiβ + h(v2i, zi))/exp(g(y2i, zi)))] }   (2.43)

The following theorem states the rate results; again the proof is omitted since the arguments are almost identical to those given in Song (2016).25

Theorem 2.7.4.
In the set-up described by equation (2.1) under Assumptions 2.7.5, 2.7.6, 2.7.7, and 2.7.8, the estimator described in equation (2.43) satisfies

||β̃−k − β−ko|| = Op( √(max(Lhn, Lgn)/n) + Lhn^(−s/(1+k1+k2)) + Lgn^(−s/(1+k1+k2)) )

25 The only difference is that now the control function is a function of both v2i and the instruments zi. Derivation of the convergence rates still follows from Theorem 3.2 of Chen (2007), but now the approximation rate of the control function converges at a slightly different rate: ||ho − πn ho|| = Lhn^(−s/(1+k1+k2)).

The proposed estimator converges much more slowly than the parametric rate. This will affect the performance of the estimator, as seen in the simulation study given in the following section.

2.7.3 Simulation

This simulation study is a broad examination of the proposed Semi-Parametric Heteroskedastic Probit with a General Control Function (SP Het Probit (GCF)) in a variety of settings. There is one included and one excluded instrument drawn from the following joint distribution:

(z1i, z2i)′ ∼ N( (0, 0)′, diag(3, 1) )

The common data generating process is

y1i = 1 if y*1i ≥ 0, and y1i = 0 if y*1i < 0, where y*1i = y2iβo + z1i + u1i   (2.44)

y2i = π1o + π2o z1i + π3o z2i + v2i   (2.45)

where βo = 1 and πo = (−1/√2, −1/√6, 1/√2)′. The control variate v2i is drawn from a N(0, 1). This means that there is a strong first stage with an R² of approximately 0.50. The unobserved heterogeneity u1i is decomposed into the general control function and a mean zero random variable e1i that determines the conditional distribution of the latent error,

u1i = ho(v2i, zi) + √2 e1i   (2.46)

such that e1i and ho(v2i, zi) are standardized to have variance equal to one. This means Var(u1i) ≈ 3 and Var(y2iβo + z1i) ≈ 2.45.
The simulation experiment considers three different control functions that satisfy the CMR:

h1o(v2i, zi) = v2i

h2o(v2i, zi) = (z1i/3 + z2i/√3) v2i

h3o(v2i, zi) = (v2i/√2.5) (1 + z1i/(1 + (z1i/3 − 2z2i/3)²))

The coefficients in the linear control function, h2o(v2i, zi), are chosen so that a projection of h2o(v2i, zi) on just the control variate (v2i) only explains about 35% of the variation in h2o(v2i, zi). This means relaxing CF-CI should have meaningful consequences. The functional form of the non-linear control function, h3o(v2i, zi), is chosen so that a projection of h3o(v2i, zi) on (v2i, z1iv2i, z2iv2i) explains 90% of the variation, and therefore a linear approximation is very reasonable. Moreover, the decomposition of variance explained is split 50-50 between just the control variate (v2i) and the terms that are interacted with instruments (z1iv2i and z2iv2i). Again, this means relaxing CF-CI should have meaningful consequences.

This simulation also considers five different conditional distributions for the latent error:

e1_1i ∼ Logistic(0, 3/π²)

e2_1i ∼ Uniform(−√12/2, √12/2)

e3_1i ∼ T(0, 3)/√3

e4_1i ∼ 0.5 N(−0.8, 1 − 0.8²) + 0.5 N(0.8, 1 − 0.8²)

e5_1i ∼ Logistic(0, 3(z²1i/2 + 3y²2i/4)/π²)

Notice that a combination of the first control function with any of the first three conditional distributions does not violate CF-CI. Only when the conditional distribution of the latent error is a function of the instruments, either through the control function or heteroskedasticity (as in e5_1i), is CF-CI violated.

The simulation results will be presented in three segments. In the first segment, the data generating process satisfies CF-CI. This only allows for the first control function h1o(v2i, zi) and the first four conditional distributions (e1_1i, e2_1i, e3_1i, e4_1i).
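The data generating process in equations (2.44)-(2.46) can be simulated directly. Below is a minimal sketch under the simplest control function h1o(v2i, zi) = v2i and a standardized logistic latent error; the instrument variances (3 and 1) are inferred from the reported first-stage R² of 0.50 and Var(y2iβo + z1i) ≈ 2.45, so treat them as an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 100_000

# Instruments: one included (z1, variance 3) and one excluded (z2, variance 1).
z1 = rng.normal(0.0, np.sqrt(3.0), size=n)
z2 = rng.normal(0.0, 1.0, size=n)

# First stage (2.45): y2 = pi1 + pi2*z1 + pi3*z2 + v2 with v2 ~ N(0, 1).
pi_o = np.array([-1/np.sqrt(2), -1/np.sqrt(6), 1/np.sqrt(2)])
v2 = rng.normal(size=n)
y2 = pi_o[0] + pi_o[1]*z1 + pi_o[2]*z2 + v2

# Latent error (2.46) with h1(v2, z) = v2 and a unit-variance logistic e1
# (scale sqrt(3)/pi standardizes the logistic to variance one).
e1 = rng.logistic(0.0, np.sqrt(3.0)/np.pi, size=n)
u1 = v2 + np.sqrt(2.0)*e1

# Outcome equation (2.44) with beta_o = 1.
y1 = (y2*1.0 + z1 + u1 >= 0).astype(int)

# First-stage R^2 should come out near 0.50, and Var(u1) near 3.
r2 = 1.0 - v2.var()/y2.var()
```

This makes the variance bookkeeping in the text easy to check: the explained first-stage variance is (1/6)(3) + (1/2)(1) = 1 against a total of 2, giving R² = 0.50.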
The SML estimator is expected to perform as well, if not better, than the proposed method on all accounts (parameter estimates, ASF estimates, and APE estimates). The second segment considers the remaining two general control functions that do not satisfy CF-CI. Notice that h2o(v2i, zi) is linear in parameters and therefore can be estimated parametrically, but h3o(v2i, zi) is a non-linear function whose functional form will be treated as unknown and estimated non-parametrically. For the final segment, heteroskedasticity is introduced so that the non-parametric heteroskedastic function in the proposed estimator must capture both the misspecified distribution and the heteroskedasticity in the latent error.

In addition to the proposed semi-parametric estimator, the simulation experiment employs the two step control function Probit estimator of Rivers and Vuong (1988) (Probit (CF)) and the SML estimator of Rothe (2009) as comparisons. The SML estimator is implemented with a Gaussian kernel of order 1. Although asymptotically the SML estimator requires higher order kernels, Rothe finds that lower order kernels perform better in small samples. As suggested in Rothe (2009), bandwidths for the SML estimator are treated as additional parameters to be optimized over. All three estimators use the same first stage estimates for v2i: the residual from regressing y2i on zi.

Two issues with the proposed method arose during implementation. First, the proposed estimator is fairly sensitive to different starting values, but using 15 randomized starting values helps to avoid local maxima. Second, since the estimator incorporates two non-parametric functions that need to be approximated via sieves, the number of parameters increases quite quickly. To reduce the number of parameters, a reasonable restriction on the general control function, ho(v2i, zi) = ho(zi)v2i, is used.
For sample size n = 250, the polynomial series approximating ho(zi) and go(y2i, zi) both include only first order terms. For sample size n = 500, the polynomial series approximating ho(zi) includes only first order terms, while the polynomial series approximating go(y2i, zi) includes both first and second order terms. For sample size n = 1,000, both polynomial series include terms up to order 2. Alternatively, one can consider a penalization method to restrict the number of non-zero covariates. This extension is left to future research.

All tables and figures referenced in this section can be found in the Appendix. They report the bias, standard deviation (Std. Dev.), root mean squared error (RMSE), and the 25th, 50th, and 75th sample quantiles of the parameter estimates for 1,000 repetitions of the simulation. The following summarizes the results for the three cases: CF-CI holds, CF-CI is violated by a general control function, and CF-CI is violated by heteroskedasticity.

CF-CI holds

Tables E.11-E.14 report the simulation results for the estimates of βo when CF-CI holds. The results show all three estimators perform fairly well in terms of bias, even though the Probit (CF) estimator imposes a misspecified distribution (it assumes normality when it does not hold). The only exceptions are the proposed estimator under a Uniform and a Gaussian Mixture distribution for the latent error. This bias is stronger when n = 500 and n = 1,000, which is when there are a larger number of higher order terms in the non-parametric functions. This suggests that there may be gains to adding a penalized approach to control for a potentially large number of irrelevant terms.

Also, the proposed estimator is much less efficient than the alternative methods. The efficiency gain from using the SML estimator can be substantial. In the cases of the Uniform distribution and the Gaussian Mixture, the standard deviation can be reduced by half when using the SML estimator instead of the proposed approach.
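The parameter-count concern above is easy to quantify. The sketch below counts polynomial sieve terms up to a given total degree; the variable counts (two instruments for ho(zi), three arguments for go(y2i, zi)) follow the restriction ho(v2i, zi) = ho(zi)v2i described above, while the exact basis used in the simulations may differ.

```python
from itertools import combinations_with_replacement

def n_poly_terms(n_vars, order):
    """Number of polynomial terms (excluding the constant) in n_vars
    variables, with total degree from 1 up to `order`."""
    return sum(len(list(combinations_with_replacement(range(n_vars), d)))
               for d in range(1, order + 1))

# h(z1, z2) uses terms in the two instruments; g(y2, z1, z2) uses three
# variables, so the sieve parameter count grows quickly with the order.
growth = {order: n_poly_terms(2, order) + n_poly_terms(3, order)
          for order in (1, 2, 3)}
```

With both approximations at order 1 there are only 5 sieve coefficients; raising both to order 2 already gives 14, and order 3 gives 28, which motivates the penalization extension mentioned above.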
Nevertheless, all the estimators perform well in terms of ASF estimates in Figures D.9-D.12.

Violation of CF-CI with General Control Function

Now the data generating process includes general control functions that violate CF-CI by including the instruments. Recall that there are two possible general control functions: one that is linear in parameters (h2o), and consequently parametrically specified, and another that is non-linear and must be estimated non-parametrically. Tables E.15-E.18 report the βo estimate results when the control function is linear. The simulation results suggest the SML estimator has a fairly strong negative bias compared to the other two estimators; in some cases the 75th sample quantile falls below the true parameter value of 1. Surprisingly, the parametric Probit (CF) estimator does not have as strong a bias as the SML estimator, even though it also implicitly imposes CF-CI. But the proposed estimation approach still incurs a fairly large standard deviation due to the numerous parameters needed in estimation. Therefore both of the other approaches, Probit (CF) and SML, fare better in terms of RMSE. This places some doubt on the proposed approach and whether any realistic gains over previous approaches are even possible. But examining the ASF estimates clarifies the matter. The figures show that the proposed SP Het Probit (GCF) substantially outperforms the other estimators by better estimating the true ASF across all the different distributions.

Tables E.19-E.22 and Figures D.17-D.20 report the results when the control function is unknown and estimated non-parametrically. The conclusions stay the same, although there tends to be a larger bias (compared to the case of a linear general control function) for the proposed estimator. But this is because the form of the general control function is unknown and can only be approximated.
Comparing these results to the previous segment, distributional misspecification appears to have a lighter effect on the parameter estimates than violations of CF-CI. The Probit (CF) estimator performs quite well under distributional misspecification, while the Probit (CF) and SML estimators display stronger bias and poor ASF estimates when CF-CI does not hold. Moreover, the proposed estimator always has a larger RMSE compared to the other approaches, which suggests a consequential trade-off between the efficiency of the SML estimator and the smaller bias of the proposed approach.

Violation of CF-CI with Heteroskedasticity

The final segment only looks at the Logistic distribution for the latent error but introduces heteroskedasticity as a further violation of CF-CI. Simulation results are reported in Tables E.23-E.25. When the sample size is 250, only the first order polynomials are included in approximating the heteroskedasticity. Consequently, the proposed estimator performs poorly and on par (in terms of bias) with the SML estimator. But when higher order terms are allowed as the sample size increases, the bias of the proposed estimator diminishes significantly compared to the alternative methods. Examining the ASF estimates in Figures D.21-D.23, only the SP Het Probit (GCF) estimator follows the true ASF closely, while the other two estimators suffer even under the simplest control function h1o(v2i, zi) = v2i.

Overall, the proposed estimator correctly adapts to the scenario in which CF-CI is violated, while the alternative estimation methods are restricted by imposing the assumption. An additional benefit of the proposed method is that it is much simpler to implement and can be done using canned commands in common statistical packages. So for an applied researcher the proposed method is more general than alternative estimators and is easier to implement. However, this simulation also brings to light some weaknesses of the SP Het Probit (GCF) estimator.
First, this estimator is quite inefficient due to the large number of parameters it needs to estimate. Therefore the proposed procedure could benefit from a dimension reduction. Second, the proposed estimator is quite sensitive to starting values, and without prior knowledge of what the true parameter value should be, this may pose some challenge to implementation. Randomizing around scaled parameter estimates from the Probit (CF) estimator for starting values is a promising possibility, since, as this simulation study shows, the parametric estimator tends to perform fairly well even under violation of CF-CI and with a misspecified distribution.

2.8 Conclusion

This chapter presents a new control function approach to endogeneity in a binary response model that does not impose CF-CI. Applying a similar framework as Kim and Petrin (2017), this chapter uses a general control function method that allows the instruments to be part of the conditional distribution of the unobserved heterogeneity. The proposed estimator is consistent and asymptotically normal. In simulations, it is shown that the general control function method is necessary to obtain accurate parameter estimates under the weaker CMR setting. Moreover, structural objects of interest such as the ASF and APE can be recovered in the general framework presented in the chapter. Without CF-CI, other estimators in the literature are unable to correctly estimate the ASF and APE, resulting in inaccurate economic interpretations. In the empirical application, a Wald test uncovers strong statistical evidence for the violation of CF-CI, although there are fairly minimal differences in the economic interpretations produced by the different estimators.

The proposed estimator is introduced in a parametric framework, which may be unrealistic in some economic settings. Therefore a semi-parametric extension is provided that places no distributional assumptions on the unobserved heterogeneity.
Simulations show that when CF-CI is violated and the distribution of the latent error is misspecified, the proposed semi-parametric estimator consistently estimates the parameters and the ASF. But the simulations also uncover some drawbacks to the proposed semi-parametric approach. Due to the fairly large dimension of the parameter space, the proposed approach is quite inefficient relative to other estimators in the literature. An interesting avenue for further research is to develop a more efficient semi-parametric estimator that still allows for the relaxation of CF-CI.

The motivation for this chapter was to propose an estimation procedure built upon a model and assumptions that are much more reflective of what we would expect in empirical data. By creating a model that is more flexible and realistic, as well as an estimation procedure that is easy to implement, the proposed approach will be a useful addition to an economist's tool-kit of estimators. The next chapter approaches a different setting but with a similar purpose. In Chapter 3, joint work with Jeffrey Wooldridge and Ying Zhu, we consider a panel binary response model (large N, small T) in which the standard joint maximum likelihood approaches have simple and restrictive specifications for the individual heterogeneity and do not allow for serial correlation. Empirical data calls for more flexibility, so we propose an approach that can capture individual persistence through several mechanisms. First, we introduce individual heterogeneity in the levels and the slopes that is allowed to be potentially correlated with the covariates. Second, we allow for serial correlation in the latent error. The resulting estimator is a pooled correlated random effects heteroskedastic Probit in which identification again relies on the results provided in the first chapter.
Both of the proposed approaches in these two chapters will find their utility in empirical work as they push the frontiers of the literature on how to incorporate flexibility driven by the demands of data.

Chapter 3

Behavior of Pooled and Joint Estimators in Probit Model with Random Coefficients and Serial Correlation1

1This is joint work with Jeffrey Wooldridge and Ying Zhu.

3.1 Introduction

Multilevel data analysis is among the long standing statistical tools that leverage heterogeneity in the data. One of the most frequent occurrences in application is panel data, where the first level is time and the second level is individuals. Despite the broad framework provided by a multilevel setting, there is an absence of the time series analysis that appears in panel data settings from the multilevel literature. In particular, when observations are recorded over time we expect the data to display a strong amount of persistence. This persistence can arise from individual-specific heterogeneity or from serially correlated errors.

Economic theory can provide motivation as to why one would expect to see persistence in the data. In modelling demand, purchasing behavior can be traced back to a utility maximization problem where, if one allows for heterogeneous agents – in preferences or income effects – the estimating equation should allow for individual heterogeneity. In Wooldridge (2010), the individual-specific heterogeneity in a program evaluation framework is motivated by "the usual omitted ability story." The individual-specific heterogeneity controls for any individual characteristic – such as ability or motivation – that may be correlated with program participation. In the application of these examples, the individual heterogeneity is an unobserved random variable.

Even after allowing for individual-specific heterogeneity, one would expect a strong presence of serial correlation in the errors.
In the field of labor economics, outcomes such as employment, wages, and health outcomes are strongly persistent and exhibit clear signs of autocorrelation. Bertrand, Duflo, and Mullainathan (2004) survey the empirical literature on evaluating treatment effects using the Difference in Difference technique and find that out of 69 studies, only 5 explicitly address serial correlation. They also show that the consequences of not correcting for serial correlation can be severe for inference: by evaluating placebo interventions, they find that ignoring serial correlation can result in concluding an "effect" at the 5 percent level for up to 45 percent of the placebo interventions.

This chapter intends to further explore the effects of persistence – individual-specific heterogeneity and serial correlation – on popular estimation procedures in a binary response setting. We are interested in any robustness properties these estimation procedures may provide, either theoretically or in simulations.

The most common formulation of a model for a panel binary response, yit ∈ {0, 1}, is derived from the latent variable set up that allows for level individual heterogeneity,

yit = 1{ai + xitβ + εit > 0}   (3.1)

where xit is a vector of observed random variables – the covariates – and ai is an unobserved random variable – the individual heterogeneity. To begin, we will assume ai, xit, and εit are all independent from one another. If we make the following "random effects" assumption,

ai|xi1, ..., xiT ∼ N(α, σ²1)   (3.2)

then a Joint Maximum Likelihood Estimation (JMLE) procedure that integrates out the random effect ai is consistent. If we assume the idiosyncratic error εit is normally distributed, this results in the random effects Probit estimator. Alternatively, if we assume a logistic distribution, then the random effects Logit estimator is used. Estimation can be computationally more difficult given a logistic distribution since it does not mix well with any other distribution.
Consequently, this chapter will more heavily examine the Probit case, but most of the analysis can be extended to the Logit case as well.

However, the conditional independence assumption that is implicit in the random effects assumption in equation (3.2) can be quite stringent. In the linear panel data literature in econometrics, there are two popular modelling approaches to relax the conditional independence of ai and the x's. One is the Fixed Effects (FE) approach and the other is the Mundlak device. The FE approach runs a pooled OLS regression of the form (yit − ȳi) on (xit − x̄i). This allows for arbitrary correlation between the individual heterogeneity ai and the x's. Alternatively, one can model the correlated random effects using a Mundlak device. The Mundlak device allows ai to be correlated with the x's through time constant functions of the data. A common implementation is to use the time averages x̄i. Wooldridge (2018), Proposition 2.1, shows that running a pooled OLS regression of the form yit on xit, x̄i yields the same estimates for β as the FE approach. However, the FE approach does not allow us to estimate functions that involve the conditional mean of the heterogeneity, which might be of interest in certain applications (as we will see soon).

Extending the discussion to a non-linear setting, using the FE approach results in an incidental parameters problem and consequently serious biases in the coefficient estimates. Greene (2004) and Fernández-Val (2009) provide a more complete discussion of the FE approach for Probit. To avoid these issues, we propose applying the Mundlak device to the setting of equation (3.1) by assuming

ai = α + x̄iξa + u1i where u1i|xi1, ..., xiT ∼ N(0, σ²1).   (3.3)

We could use a JMLE procedure that integrates out the random effect u1i or, as in the linear case, we could consider a simpler Pooled Maximum Likelihood (PMLE) approach.
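The FE/Mundlak equivalence in the linear case is easy to verify numerically. The sketch below, with a hypothetical linear panel DGP (all parameter values are illustrative), shows that pooled OLS of yit on (1, xit, x̄i) reproduces the within (FE) slope exactly.

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 500, 4

# Panel with heterogeneity correlated with x through its time average.
x = rng.normal(size=(N, T))
a = 1.0 + 0.8 * x.mean(axis=1) + rng.normal(size=N)   # correlated intercept
y = a[:, None] + 2.0 * x + rng.normal(size=(N, T))    # true slope beta = 2

# Fixed effects (within) estimator: pooled OLS of demeaned y on demeaned x.
xd = (x - x.mean(axis=1, keepdims=True)).ravel()
yd = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_fe = (xd @ yd) / (xd @ xd)

# Mundlak device: pooled OLS of y on (1, x_it, xbar_i).
X = np.column_stack([np.ones(N * T), x.ravel(),
                     np.repeat(x.mean(axis=1), T)])
beta_mundlak = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][1]
```

The two slope estimates coincide to machine precision, which is the content of the proposition cited above: the Mundlak term x̄i absorbs exactly the between variation that the within transformation removes.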
The PMLE approach makes no assumptions on the joint distribution over the time observations but pools the likelihood over i and t.

So far, we have only introduced individual heterogeneity into the level, but there is little reason why the individual heterogeneity should be restricted to a level effect. While introducing random slopes is much more common in the linear regression literature (see Hall, Horowitz, et al. (2005), Swamy (1970), and Swamy and Tavlas (1995)), there have been fewer papers that attempt to account for the unobserved heterogeneity in slope parameters in a nonlinear model like Probit.2 One of the reasons has to do with the fact that joint estimation methods are so far the dominant approach. A JMLE approach, which requires obtaining the joint distribution of (yi1, ..., yiT) conditional on (xi1, ..., xiT), can be computationally difficult, and we may not even have enough assumptions to obtain the joint distribution. In any case, the JMLE will generally require more assumptions to consistently estimate the parameters. The benefit from the additional assumptions and computational burden is greater asymptotic efficiency. Extending the specification in equation (3.1), we will assume

yit = 1{ai + xitbi + εit > 0}   (3.4)

where now both ai and bi are unobserved random variables capturing the level and slope individual heterogeneity.

Before we dive into the more complicated joint methods, perhaps it would be wise to take a step back and ask the following question: What features of the model should we focus on? In evaluating policy interventions, the ultimate interest usually concerns the treatment effect.

2Hausman and Wise (1978) and Akin, Guilkey, and Sickles (1979) introduce random coefficients in the multinomial and ordered Probit models.
While the average treatment effect coincides with the slope coefficient in a linear model with only additive heterogeneity, in a nonlinear model like the one above, the average treatment effect is much more complex in its derivation. The concept of the Average Structural Function (ASF) was proposed simultaneously by Blundell and Powell (2004) and Wooldridge (2005), and the average treatment effect should be derived from the ASF. Using the notation in Wooldridge (2005), the conditional mean of y is defined as

E(y|x, q) = μ1(x, q)

where x are observed covariates – xit in the setup above – and q is unobserved heterogeneity – ai and bi in the setup above. Assuming the standard distributional assumptions in the Probit case (ε ∼ N(0, 1) and independent of all other random variables) and applying them to our model of interest,

μ1(xit, (ai, bi)) = Φ(ai + xitbi).   (3.5)

Then the ASF averages the above equation over the distribution of unobserved heterogeneity. The treatment effect is the difference of the ASF over the treated and not treated, but this can vary over the observed covariates. Unlike the linear model, the complexity of the non-linear model allows the treatment effect to be heterogeneous over the distribution of the covariates. Averaging over the distribution of the observed covariates then renders the average treatment effect.

It is useful to begin with a framework that unifies the discussion of treatment effects in models with unobserved heterogeneity. "Average treatment effect" is usually reserved for the case of binary treatment and "average partial effect" for the continuous analogue. In the continuous case, in lieu of taking a difference, the partial derivative of the ASF produces the partial effect. We will refer to the Average Partial Effect (APE) synonymously for the binary and continuous case. If our focus is on the APEs, then adopting the Mundlak device to model the unobserved heterogeneity in slope parameters would seem a sensible approach.
Combined with a pooled estimation method, the Mundlak approach treats the data as if it is one long cross section and computation is typically straightforward. Motivated by Mundlak, we return to the setting of equation (3.4) and assume

ai = α + x̄iξa + u1i where u1i|xi1, ..., xiT ∼ N(0, σ²1)   (3.6)

bi = β + x̄iξb + u2i where u2i|xi1, ..., xiT ∼ N(0, σ²2).   (3.7)

As previously mentioned, we will focus on two different estimation routes: JMLE and PMLE. The JMLE procedure derives the joint distribution of (yi1, ..., yiT) conditional on (xi1, ..., xiT) by integrating out the random effects (u1i, u2i). This integral cannot be solved in closed form and in estimation is approximated using numerical methods. This can cause computational issues, including failures due to non-convergence and long estimation times. Because it is a full MLE method, the JMLE produces efficient estimates of the parameters α, β, ξa, ξb, σ²1, and σ²2. However, it does assume εit is iid over i and t and (u1i, u2i) are bivariate normal, independent of εit and (xi1, ..., xiT). These assumptions could be relaxed theoretically, but trying to implement the more flexible models turns out to be computationally costly, so the current state of statistical software imposes these assumptions in implementation.

In the alternative pooled framework, it is computationally easy to relax the assumption that εit is independent over i and t. Since we are considering a panel setting, it is natural to expect serial correlation in the latent error. Although it is not as efficient as a joint procedure, the pooled approach is robust to serial correlation. The drawback to this approach is that it cannot separately identify the coefficients (α, β, ξa, ξb) from the scaling factor 1/√(1 + σ²1), but it is consistent in estimating the scaled parameters, θσ = θ/√(1 + σ²1), where θ represents any of the coefficients.

This leads to an interesting trade-off between the two estimation procedures.
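The scaling factor comes from the normal mixing identity E[Φ(z + u)] = Φ(z/√(1 + σ²)) for u ∼ N(0, σ²): the response probability the pooled probit fits has the scaled index, so only θ/√(1 + σ²1) is recoverable. The identity can be checked numerically (the index and variance values below are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def response_prob(index, sigma1, n_nodes=64):
    """P(y = 1 | x) = integral of Phi(index + u) against N(0, sigma1^2),
    computed by Gauss-Hermite quadrature (probabilists' weight)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    # hermegauss integrates against exp(-u^2/2); dividing the weights by
    # sqrt(2*pi) normalizes them to the standard normal density.
    return (weights / np.sqrt(2 * np.pi)) @ norm.cdf(index + sigma1 * nodes)

index, sigma1 = 0.7, 1.5
lhs = response_prob(index, sigma1)                 # mixture of probits
rhs = norm.cdf(index / np.sqrt(1 + sigma1**2))    # single probit, scaled index
```

Since lhs equals rhs for every index value, a pooled probit of yit on the Mundlak regressors consistently estimates the coefficients only up to the factor 1/√(1 + σ²1); for APEs, which depend on the scaled index, this is harmless.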
The JMLE can separately identify and estimate the variances of the random effects and should be more efficient, but it is not robust to serial correlation and may be computationally more demanding. On the other hand, the PMLE is robust to serial correlation but is less efficient and can only estimate the scaled coefficients. However, if we focus on the APEs, then the lack of identification does not pose any issue in the interpretation of the results. In this case, it is possible that precise estimates of individual coefficients have a much smaller impact on the estimates of the APEs.

We conduct extensive simulation experiments for the Probit model comparing the JMLE and PMLE. We look at both the continuous – with and without strong dependence – and binary treatment cases. The pooled approach performs as we expect: less efficient but consistent over different levels of serial correlation in the latent error. We do find some surprising trends in the coefficient estimates using the JMLE procedure. We find that even under no serial correlation, the coefficient estimates have a serious negative bias. The driving factor appears to be poor estimation of the variance components, σ1 and σ2, which tend to also have significant negative bias. But these biases seem to cancel each other out when examining the estimates of the scaled coefficients, even under the presence of serial correlation. Consequently, the APEs calculated from the JMLE estimates appear to have robustness properties with respect to serial correlation.

The remainder of this chapter is organized as follows. Section 2 presents the model set up and assumptions. Section 3 goes into more detail in deriving the two estimation procedures. In particular, we discuss how the JMLE procedure fits into the Generalized Linear Mixed Effects Model literature and how the pooled approach results in a heteroskedastic Probit estimator. Section 4 derives the average structural function and the corresponding APE.
We provide a more detailed discussion on how the heterogeneity is incorporated into the APEs. Section 5 presents the specifications for the simulation study and discusses the results. Section 6 uses the two estimators in an application to investigate whether our simulation results hold with empirical findings. Section 7 presents a short discussion on extending this analysis to the Logit case; there is no easy implementation of a pooled approach since no distribution mixes well with the logistic distribution, and the JMLE approach does not provide consistent estimates of the coefficients, which raises the question of which statistic is more robust: the APEs or the log-odds. Finally we conclude with a summary of our results and their implications.

3.2 Model Set Up

We are considering a binary response model in a panel setting with small T and large N, allowing for correlated random effects in the intercept as well as the coefficient of interest,

yit = 1{ai + x1it b1i + x2itβ2 + εit > 0}

ai = α + g1(x̄i)ξa + u1i

b1i = β1 + g2(x̄i)ξb + u2i.   (3.8)

Motivated by the Mundlak device, we will assume that the elements of g1(·) and g2(·) are known functions of the time averages x̄i. This could of course be generalized to known functions of all the time observations. The random intercept and coefficient are allowed to be correlated with all time observations of the x's, (xi1, ..., xiT), through the linear functions g1(x̄i)ξa and g2(x̄i)ξb. We will assume the following independence and distributional assumptions hold:

(u1i, u2i)′ | xi1, ..., xiT ∼ N( (0, 0)′, Σ ),  Σ = [ σ²1  σ12 ; σ12  σ²2 ].   (3.9)

This means that the random effects ai and b1i have a known distribution and are independent of the x's conditional on the time averages x̄i.
Generally, u_{1i} and u_{2i} are allowed to be drawn from a multivariate normal distribution with possible correlation; however, we found that, in the simulations, allowing for a general variance-covariance structure was quite computationally demanding. In the remainder of this chapter we will assume σ_{12} = 0, but the analysis can be easily extended to allow for the correlated case. Finally, the idiosyncratic error ε_{it} is assumed to be independent of all other random variables in the model and independent over i. In order to allow persistence in the outcome that is unrelated to the covariates, ε_{it} is serially correlated over t, following an AR(1) process,

$$\varepsilon_{it} = \rho \varepsilon_{it-1} + e_{it}. \qquad (3.10)$$

An AR(1) process does a fair job of modeling the persistence in outcomes that we see in empirical data. However, this could be extended even further by allowing for an AR(p) or even an ARMA(p, q) process.

3.3 Estimation Methods

Given the set-up above, we will consider two different estimation procedures: JMLE and PMLE. Section 15.8 of Wooldridge (2010) reviews these two estimation methods (as well as others) and their accompanying assumptions and implications with only additive individual heterogeneity. We extend this by introducing individual heterogeneity in the slope and by focusing on the implications of serial correlation.

3.3.1 Mixed Effects Probit

We will first look at the JMLE, referred to as the Mixed Effects (ME) Probit, which can be derived through two similar but distinct frameworks: one can be viewed as an extension of the Random Effects Probit (described in Wooldridge (2010), chapter 15.8) that allows for a random coefficient, and the other as a Generalized Linear Mixed Model (GLMM) with a Bernoulli distribution and a Probit link function.
Under the first framework, we may consider the set up described in equation (3.8), so that the marginal density of y_{it} conditional on the contemporaneous regressors and random coefficients is

$$f(y_{it}|x_{it}, a_i, b_i) = \Phi(a_i + x_{1it}b_{1i} + x_{2it}\beta_2)^{y_{it}} \left(1 - \Phi(a_i + x_{1it}b_{1i} + x_{2it}\beta_2)\right)^{1-y_{it}}. \qquad (3.11)$$

If one were to assume independence across t, the joint density (over time) would be the product of the marginal densities. By allowing for correlated random effects and integrating out the random effects (u_{1i}, u_{2i}) we obtain

$$f(y_{i1}, \dots, y_{iT}|x_{i1}, \dots, x_{iT}) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \prod_{t=1}^{T} \Phi\left(\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b + u_1 + x_{1it}u_2\right)^{y_{it}}$$
$$\times \left(1 - \Phi\left(\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b + u_1 + x_{1it}u_2\right)\right)^{1-y_{it}} \times \frac{1}{\sigma_1\sigma_2}\,\phi\!\left(\frac{u_1}{\sigma_1}\right)\phi\!\left(\frac{u_2}{\sigma_2}\right) du_1\, du_2 \qquad (3.12)$$

where σ_1 and σ_2 are the standard deviations of u_{1i} and u_{2i}, respectively. The integral has no closed-form solution but can be estimated using numerical methods.³ Taking the log of equation (3.12) gives the conditional log likelihood for each i. Maximizing the sum, over i, of the log likelihood with respect to the parameters α, β_1, β_2, σ_1, and σ_2 produces the JMLE estimator.

As for the second framework, let us consider the definition of a GLMM in chapter 4 of McCulloch and Neuhaus (2001) (with notation altered slightly),

$$Y_i (= (y_{i1}, \dots, y_{iT}))\,|\,u \sim \text{independent } f_{Y_i|u}(y_i|u)$$
$$h(E(Y_i|u)) = X_i B + Z_i u$$
$$u \sim f_U(u) \qquad (3.13)$$

where X and Z are considered fixed design matrices and u is the only random effect, such that XB is the fixed component and Zu is the random component. Note that in our set up, h(·) is the inverse of the standard normal CDF, which is why this estimator will be referred to as the Mixed Effects (ME) Probit estimator.

³The simulation uses a mean-variance adaptive Gauss-Hermite quadrature, but other procedures such as a Laplacian approximation could be used.
Then defining the components in equation (3.13) to match the set up described by equation (3.8) yields

$$X_i = \left(1_T,\; x_{1i},\; x_{2i},\; 1_T\, g_1(\bar{x}_i),\; x_{1i}\, g_2(\bar{x}_i)\right), \qquad B' = \left(\alpha,\; \beta_1,\; \beta_2',\; \xi_a',\; \xi_b'\right),$$
$$Z_i = \left(1_T,\; x_{1i}\right), \qquad u = \begin{pmatrix} u_{1i} \\ u_{2i} \end{pmatrix}$$

where 1_j is a j × 1 vector of ones and x_{1i} and x_{2i} are the stacked time observations for individual i. The standard GLMM estimator then computes the log likelihood under the assumption of independence across t and integrates out the random effects u = (u_{1i}, u_{2i})'.

There are several concerns that should be addressed with this estimator. First, in practice, the second level equations (defining a_i and b_i) are often not given a flexible specification that allows for correlation with the regressors x_{1i} and x_{2i}. It is perhaps the assumption of "fixed design matrices" in the GLMM literature that leads to a general lack of concern for correlation between the random effect u and the design matrices X and Z. As our simulation study will show, not allowing for correlated random effects results in heavily biased parameter estimates.

Second, the JMLE can be quite computationally demanding. The discussion of the simulation study results will provide more detail but, to summarize, the ME Probit estimator is more likely to fail to converge and, when it does converge, takes much longer than the alternative estimator. Failures to converge are more frequent when the true data generating process does not have any random effects, so that the parameters are at the boundary of the identified set (σ₁² = σ₂² = 0). The slower speeds arise because the ME Probit estimator must numerically approximate several integrals.

Third, note that this estimator depends on independence across t. Since we are introducing serial correlation through an AR(1) process, we would expect the estimator to be inconsistent.
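To make the computational burden concrete, the double integral in equation (3.12) can be approximated by product Gauss-Hermite quadrature. The following Python sketch (illustrative only: the estimation in this chapter was done in STATA, this rule is non-adaptive unlike the mean-variance adaptive quadrature of footnote 3, and the choice g_1(·) = g_2(·) = x̄_{1i} is a placeholder) evaluates the joint likelihood for one individual:

```python
import numpy as np
from scipy.stats import norm

def joint_likelihood_i(y, x1, x2, theta, n_nodes=15):
    """Approximate the double integral in equation (3.12) for one individual
    by a product Gauss-Hermite rule (non-adaptive, for illustration only)."""
    alpha, b1, b2, xi_a, xi_b, s1, s2 = theta   # s1, s2 play the role of sigma_1, sigma_2
    g1 = g2 = x1.mean()                          # placeholder choice of g_1(.), g_2(.)
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    lik = 0.0
    # change of variables u_k = sqrt(2)*sigma_k*z_k turns each N(0, sigma_k^2)
    # integral into a Gauss-Hermite sum; the 1/pi below collects both 1/sqrt(pi) factors
    for u1, w1 in zip(np.sqrt(2.0) * s1 * nodes, weights):
        for u2, w2 in zip(np.sqrt(2.0) * s2 * nodes, weights):
            idx = alpha + x1 * b1 + x2 * b2 + g1 * xi_a + x1 * g2 * xi_b + u1 + x1 * u2
            p = norm.cdf(idx)
            lik += w1 * w2 * np.prod(p ** y * (1.0 - p) ** (1 - y))
    return lik / np.pi
```

Summing the log of this quantity over i and maximizing gives the JMLE; with n_nodes quadrature points per dimension, each likelihood evaluation requires n_nodes² passes over every individual's data, which is one source of the computational cost discussed in the simulation results.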
In particular, let us re-examine the joint distribution of (y_{i1}, ..., y_{iT}) conditional on the x's when the latent errors are correlated over t. Suppose

$$(\varepsilon_{i1}, \dots, \varepsilon_{iT})' \sim N(0, \Sigma_\varepsilon)$$

where, under an AR(1) process with correlation coefficient ρ, Σ_ε has the form

$$\Sigma_\varepsilon = \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\ \rho & 1 & \rho & \cdots & \rho^{T-2} \\ \vdots & & \ddots & & \vdots \\ \rho^{T-1} & \rho^{T-2} & \cdots & \rho & 1 \end{pmatrix}. \qquad (3.14)$$

By the properties of conditional distributions,

$$f(y_{i1}, \dots, y_{iT}|x_{1i}, x_{2i}, a_i, b_i) = f(y_{iT}|y_{i1}, \dots, y_{iT-1}, x_{1i}, x_{2i}, a_i, b_i) \times f(y_{iT-1}|y_{i1}, \dots, y_{iT-2}, x_{1i}, x_{2i}, a_i, b_i) \times \cdots \times f(y_{i2}|y_{i1}, x_{1i}, x_{2i}, a_i, b_i) \times f(y_{i1}|x_{1i}, x_{2i}, a_i, b_i), \qquad (3.15)$$

and solving for the conditional means of y_{it} for t ≥ 2 yields

$$E(y_{it}|y_{i1}, \dots, y_{it-1}, x_{1i}, x_{2i}, a_i, b_i) = E\left(E(y_{it}|u_{it-1}, y_{i1}, \dots, y_{it-1}, x_{1i}, x_{2i}, a_i, b_i)\,\Big|\, y_{i1}, \dots, y_{it-1}, x_{1i}, x_{2i}, a_i, b_i\right)$$
$$= E\left(\Phi\!\left(\frac{a_i + x_{1it}b_{1i} + x_{2it}\beta_2 - \rho u_{it-1}}{\sqrt{1-\rho^2}}\right)\,\Big|\, y_{i1}, \dots, y_{it-1}, x_{1i}, x_{2i}, a_i, b_i\right)$$
$$= \left[\int_{a_i + x_{1it-1}b_{1i} + x_{2it-1}\beta_2}^{\infty} \Phi\!\left(\frac{a_i + x_{1it}b_{1i} + x_{2it}\beta_2 - \rho u}{\sqrt{1-\rho^2}}\right)\phi(u)\,du\right]^{y_{it-1}} \left[\int_{-\infty}^{a_i + x_{1it-1}b_{1i} + x_{2it-1}\beta_2} \Phi\!\left(\frac{a_i + x_{1it}b_{1i} + x_{2it}\beta_2 - \rho u}{\sqrt{1-\rho^2}}\right)\phi(u)\,du\right]^{1-y_{it-1}}$$
$$\equiv E(y_{it}|y_{it-1}, x_{1it}, x_{1it-1}, x_{2it}, x_{2it-1}, a_i, b_i).$$

In words, the mean of y_{it} conditional on all past observations and the random effects a_i and b_i is only a function of the data from the last time period and the random effects. However, due to the nonlinearity of the Probit model, the conditional mean depends on the past data in a complicated manner. It is the difference between E(y_{it}|y_{it-1}, x_{1it}, x_{1it-1}, x_{2it}, x_{2it-1}, a_i, b_i) and E(y_{it}|x_{1it}, x_{2it}, a_i, b_i) that suggests the ME Probit estimator is inconsistent under an AR(1) process. Of course, one could consider estimating using a joint likelihood based on an AR(1) model rather than assuming independence.
However, this would require correctly specifying the dependence structure as AR(1), and in empirical data a simple AR(1) process may not be able to fully capture the complex time dependencies. Moreover, Keane (1994) discusses the difficulty of doing so directly and instead proposes a simulated variation of a Method of Moments estimator. In this chapter we do not consider a simulated version of the JMLE method, since this is rarely done in practice.

3.3.2 Pooled Heteroskedastic Probit

The PMLE method is an alternative to a JMLE method and requires fewer assumptions. Unlike the ME Probit, the Pooled Heteroskedastic Probit does not depend on correct specification of the joint likelihood and therefore is consistent in the presence of serial correlation. Note that the pooled method produces a conditional mean similar to that of a heteroskedastic Probit. The heteroskedasticity in the pooled method is due to the heterogeneous slope coefficient, rather than the traditional interpretation of heteroskedasticity in the latent error. In deriving the Pooled Heteroskedastic Probit estimator, we apply the assumptions stated in section 2,

$$(u_{1i} + x_{1it}u_{2i} + \varepsilon_{it})\,|\,x_{1i}, x_{2i} \sim N(0,\; 1 + \sigma_1^2 + x_{1it}^2\sigma_2^2) \quad \text{and} \quad E(y_{it}|x_{1i}, x_{2i}) = \Phi\!\left(\frac{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1it}^2\sigma_2^2}}\right). \qquad (3.16)$$

Plugging the conditional mean into the Bernoulli density, taking logs, and pooling over i and t yields the following log likelihood:

$$\sum_{i=1}^{N}\sum_{t=1}^{T} \left[\, y_{it} \ln \Phi\!\left(\frac{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1it}^2\sigma_2^2}}\right) + (1 - y_{it}) \ln\!\left(1 - \Phi\!\left(\frac{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1it}^2\sigma_2^2}}\right)\right) \right]. \qquad (3.17)$$

A standard practice in statistical packages is to assume an exponential variance function, which ensures a strictly positive variance in estimation.
We can then rewrite equation (3.16) as

$$E(y_{it}|x_{1i}, x_{2i}) = \Phi\!\left(\frac{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b}{\exp\!\left(\tfrac{1}{2}\ln(1 + \sigma_1^2 + x_{1it}^2\sigma_2^2)\right)}\right)$$
$$= \Phi\!\left(\frac{\left(\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b\right)\big/\sqrt{1 + \sigma_1^2}}{\exp\!\left(\tfrac{1}{2}\ln\!\left(1 + \tfrac{\sigma_2^2}{1+\sigma_1^2}x_{1it}^2\right)\right)}\right)$$
$$= \Phi\!\left(\frac{\alpha_\sigma + x_{1it}\beta_{1\sigma} + x_{2it}\beta_{2\sigma} + g_1(\bar{x}_i)\xi_{a\sigma} + x_{1it}g_2(\bar{x}_i)\xi_{b\sigma}}{\exp(v(x_{1it}))}\right) \qquad (3.18)$$

where θ_σ = θ/√(1 + σ₁²) is the scaled coefficient for θ = α, β₁, β₂, ξ_a, ξ_b. As a result, using a Pooled Heteroskedastic Probit approach does not allow us to separately identify (α, β₁, β₂, ξ_a, ξ_b) and √(1 + σ₁²). If we focus on the APEs (to be defined formally in the subsequent section), we only need estimates of the scaled coefficients. The function v(x_{1it}) does not include a constant, a necessary requirement for identification in a heteroskedastic Probit; any constant in v(x_{1it}) would be incorporated into the scaling factor. We can approximate v(x_{1it}) using a polynomial expansion:⁴

$$v(x_{1it}) = \frac{1}{2}\ln\!\left(1 + \frac{\sigma_2^2}{1+\sigma_1^2}x_{1it}^2\right) \approx \frac{1}{2}\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n}\left(\frac{\sigma_2^2}{1+\sigma_1^2}x_{1it}^2\right)^{n} \qquad (3.19)$$

where we have used a Taylor expansion in the second expression.

Compared to the ME Probit estimator, the Pooled Heteroskedastic Probit is computationally simple, identified when there are no random effects, and consistent under serial correlation. The drawbacks are having to approximate the function v(x_{1it}) when using a preprogrammed command, a loss of efficiency relative to the JMLE, and not being able to separately identify the scaled parameters.

⁴One could maximize the log likelihood in equation (3.17) directly without approximating the heteroskedastic function v(x_{1it}). However, to allow for easy implementation using preprogrammed commands such as hetprobit in STATA, v(x_{1it}) can be well approximated by x_{1it}δ₁ + x²_{1it}δ₂ + x³_{1it}δ₃ + x⁴_{1it}δ₄.

3.4 Average Partial Effects

As discussed in the introduction, a more meaningful statistic in our model of interest is the Average Partial Effect (APE). This section discusses the identification and formulation of the ASF and APEs using the results of Wooldridge (2005). Identification is shown using his Lemma 2.2 and then the ASF is calculated using the results of his Lemma 2.1. We will then discuss the derivation and interpretation of the APEs and contrast them with the Partial Effect at the Average that is commonly computed in lieu of the APEs.

The set up explained in Section 2 can be seen as an extension of the Probit example given in Wooldridge (2005) that allows for a random coefficient. As a consequence of the Mundlak device, the observable random variables (w in his notation) that help identify the unobserved heterogeneity (q in his notation) are the time averages of the covariates, x̄_i. We start from the Average Structural Function (ASF) defined in Blundell and Powell (2004). The ASF defines the structural relationship between the expected outcome and the covariates, averaging out all the unobserved heterogeneity. To obtain identification of the ASF using Lemma 2.2, the following ignorability assumptions must be satisfied. Applied to the notation of our model, the first is an excludability assumption that requires

$$E(y_{it}|x_{it}, a_i, b_i, \bar{x}_i) = E(y_{it}|x_{it}, a_i, b_i), \qquad (3.20)$$

and the second is a selection on observables assumption that requires

$$D(a_i, b_i|\bar{x}_i, x_{it}) = D(a_i, b_i|\bar{x}_i), \qquad (3.21)$$

where D(·) denotes the distribution. Equation (3.9) satisfies the ignorability assumptions and therefore, following Lemma 2.2, the ASF is identified from μ₂(x_{it}, x̄_i) = E(y_{it}|x_{it}, x̄_i), where

$$\mu_2(x_{it}, \bar{x}_i) = E(y_{it}|x_{it}, \bar{x}_i) = E(1\{a_i + x_{1it}b_{1i} + x_{2it}\beta_2 + \varepsilon_{it} > 0\}|x_{it}, \bar{x}_i)$$
$$= E(1\{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b + u_{1i} + x_{1it}u_{2i} + \varepsilon_{it} > 0\}|x_{it}, \bar{x}_i)$$
$$= \Phi\!\left(\frac{\alpha + x_{1it}\beta_1 + x_{2it}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1it}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1it}^2\sigma_2^2}}\right). \qquad (3.22)$$

The ASF, E_{(a_i,b_i)}(μ₁(x_o, (a_i, b_i))), can then be calculated using Lemma 2.1 as

$$E_{(a_i,b_i)}(\mu_1(x_o, (a_i, b_i))) = E_{\bar{x}_i}(\mu_2(x_o, \bar{x}_i)) = E_{\bar{x}_i}\!\left[\Phi\!\left(\frac{\alpha + x_{1o}\beta_1 + x_{2o}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1o}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1o}^2\sigma_2^2}}\right)\right].$$

Next we take the partial derivative of the ASF with respect to x_{1o} (the variable of interest). Under typical regularity conditions that allow the derivative to pass through the integration, the partial effect with respect to x_{1o}, evaluated at the values x_{1o} and x_{2o}, takes the form

$$PE(x_{1o}, x_{2o}) = \frac{\partial E_{\bar{x}_i}(\mu_2(x_o, \bar{x}_i))}{\partial x_{1o}}$$
$$= E_{\bar{x}_i}\!\left[\phi\!\left(\frac{\alpha + x_{1o}\beta_1 + x_{2o}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1o}g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1o}^2\sigma_2^2}}\right) \times \left(\frac{\beta_1 + g_2(\bar{x}_i)\xi_b}{\sqrt{1 + \sigma_1^2 + x_{1o}^2\sigma_2^2}} - \frac{x_{1o}\sigma_2^2\left(\alpha + x_{1o}\beta_1 + x_{2o}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1o}g_2(\bar{x}_i)\xi_b\right)}{(1 + \sigma_1^2 + x_{1o}^2\sigma_2^2)^{3/2}}\right)\right]. \qquad (3.23)$$

We then define APE = E(PE(x_{1it}, x_{2it})), where the inner expectation is with respect to x̄_i and the outer expectation is with respect to x_{1it} and x_{2it}. Since we will have estimates of σ₁² and σ₂² from the ME Probit estimator, the APE can be estimated by plugging in the parameter estimates and replacing the expectations with sample averages.

Alternatively, following the Pooled Heteroskedastic Probit estimator, in which we use an exponential function for the heteroskedasticity and only obtain estimates of the scaled parameters, the partial effect can also be written as

$$PE(x_{1o}, x_{2o}) = \frac{\partial E_{\bar{x}_i}(\mu_2(x_o, \bar{x}_i))}{\partial x_{1o}}$$
$$= E_{\bar{x}_i}\!\left[\phi\!\left(\frac{\alpha_\sigma + x_{1o}\beta_{1\sigma} + x_{2o}\beta_{2\sigma} + g_1(\bar{x}_i)\xi_{a\sigma} + x_{1o}g_2(\bar{x}_i)\xi_{b\sigma}}{\exp(v(x_{1o}))}\right) \times \left(\frac{\beta_{1\sigma} + g_2(\bar{x}_i)\xi_{b\sigma} - (\partial v(x_{1o})/\partial x_{1o})\left(\alpha_\sigma + x_{1o}\beta_{1\sigma} + x_{2o}\beta_{2\sigma} + g_1(\bar{x}_i)\xi_{a\sigma} + x_{1o}g_2(\bar{x}_i)\xi_{b\sigma}\right)}{\exp(v(x_{1o}))}\right)\right]. \qquad (3.24)$$
To estimate the above quantity, we replace the scaled coefficients and the heteroskedastic function v(x_{1o}) with the Pooled Heteroskedastic Probit estimates.

In this chapter, we advocate the APE calculated from the ASF as the statistic that most appropriately captures the effect of interest. However, the literature also places value on what we will refer to as the Partial Effect at the Average (PEA). In the linear case the APE and PEA are equivalent, whereas in the nonlinear case they can be quite different. The source of their difference follows from the basic principle that expectations cannot pass through nonlinear functions. In our model, the PEA is simply the partial derivative of E_{x̄_i}(μ₁(x_o, (E(a_i|x̄_i), E(b_i|x̄_i)))) where the unobserved heterogeneity is evaluated at its conditional mean:

$$PEA(x_{1o}, x_{2o}) = \frac{\partial E_{\bar{x}_i}(\mu_1(x_o, (E(a_i|\bar{x}_i), E(b_i|\bar{x}_i))))}{\partial x_{1o}}$$
$$= E_{\bar{x}_i}\!\left[\phi\!\left(\alpha + x_{1o}\beta_1 + x_{2o}\beta_2 + g_1(\bar{x}_i)\xi_a + x_{1o}g_2(\bar{x}_i)\xi_b\right) \times \left(\beta_1 + g_2(\bar{x}_i)\xi_b\right)\right]. \qquad (3.25)$$

Note that the PEA only incorporates the part of the heterogeneity that is correlated with the observables. In fact, there is no distinction between the PEA in our model, which allows for heterogeneity in the level and slope, and the APE in a model that assumes constant effects but lets the time averages enter the structural function as additional covariates. We argue that the APEs calculated from equations (3.23) and (3.24) truly capture the heterogeneous effect, while the PEA mutes the genuine impact of heterogeneity.

3.5 Simulation

In this section, we investigate the behaviour of the two estimators with simulated data. In particular, we are interested in the trade-off between the robustness of the Pooled Heteroskedastic Probit estimator under serial correlation and the ability of the ME Probit estimator to separately identify the variance components. We consider several variations on the model described in equation (3.8).

1.
The covariates are iid over i and t, drawn from the multivariate normal distribution

$$\begin{pmatrix} x_{1it} \\ x_{2it} \end{pmatrix} \sim N\left(\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 & 0.3 \\ 0.3 & 1 \end{pmatrix}\right), \qquad (3.26)$$

and the random coefficients are generated as

$$a_i = -0.25 - 0.5\bar{x}_{1i} - 0.25\bar{x}_{1i}^2 - 0.1\bar{x}_{1i}\bar{x}_{2i} + u_{1i}$$
$$b_i = 1.25 - 0.5\bar{x}_{1i} - 0.25\bar{x}_{1i}^2 - 0.1\bar{x}_{1i}\bar{x}_{2i} + u_{2i} \qquad (3.27)$$

where the random effects u_{1i}, u_{2i} are generated from the multivariate normal distribution

$$\begin{pmatrix} u_{1i} \\ u_{2i} \end{pmatrix} \sim N\left(0, \begin{pmatrix} 0.5 & 0 \\ 0 & 0.25 \end{pmatrix}\right). \qquad (3.28)$$

2. The variable of interest is iid over i but correlated over t through an AR(1) process,

$$x_{1i1} = a_{xi} + e_{1i1}$$
$$x_{1it} = 0.5a_{xi} + 0.5x_{1it-1} + e_{1it}, \quad t = 2, 3, \dots, T \qquad (3.29)$$

where a_{xi} ∼ iid N(1, 0.2) is the persistent individual effect, and e_{1i1} ∼ iid N(0, 0.2) and e_{1it} ∼ iid N(0, 1 − 0.5² − 0.2(0.5²)) are additional noise terms. Although x_{1it} is not independent over t, it is identically distributed. To induce correlation between the regressors, we let x_{2it} = 0.7 + 0.3x_{1it} + e_{2it}, with e_{2it} ∼ iid N(0, 1 − 0.3²). The correlated random coefficients are generated in the same way as in the first DGP, described in equations (3.27)-(3.28).

3. As in DGP 1, the covariates are generated iid over i and t and distributed as in equation (3.26). In this DGP, we are interested in whether using a simple specification for the correlated random coefficients, when the random coefficients are generated in a more flexible manner, would still result in the correlated random effects specification performing better than the random effects specification that does not allow for any correlation between the random coefficients and the covariates. The random coefficients are generated with the equations

$$a_i = -0.25 - 0.25\bar{x}_{1i}^3 - 0.15\bar{x}_{2i}^4 + u_{1i}$$
$$b_i = 1.25 - 0.25\bar{x}_{1i}^3 - 0.15\bar{x}_{2i}^4 + u_{2i} \qquad (3.30)$$

but in estimation we will only include polynomial functions of the time averages up to order 2.
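As a cross-check, DGP 1 can be simulated in a few lines. The Python sketch below is illustrative (the chapter's simulations were run in STATA); in particular, the value beta2 = 1.0 is a placeholder, since the true coefficient on x_{2it} is not reported in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dgp1(N=300, T=5, rho=0.4, beta2=1.0):
    """Simulate DGP 1: covariates from (3.26), correlated random coefficients
    from (3.27)-(3.28), and AR(1) latent errors from (3.10).
    beta2 is a placeholder; the true beta_2 is not reported in the excerpt."""
    x = rng.multivariate_normal([1.0, 1.0], [[1.0, 0.3], [0.3, 1.0]], size=(N, T))
    x1, x2 = x[..., 0], x[..., 1]
    x1bar, x2bar = x1.mean(axis=1), x2.mean(axis=1)
    u1 = rng.normal(0.0, np.sqrt(0.5), N)        # Var(u1) = 0.5
    u2 = rng.normal(0.0, np.sqrt(0.25), N)       # Var(u2) = 0.25, sigma_12 = 0
    a = -0.25 - 0.5 * x1bar - 0.25 * x1bar**2 - 0.1 * x1bar * x2bar + u1
    b = 1.25 - 0.5 * x1bar - 0.25 * x1bar**2 - 0.1 * x1bar * x2bar + u2
    eps = np.empty((N, T))                       # AR(1) with unit marginal variance
    eps[:, 0] = rng.normal(0.0, 1.0, N)
    for t in range(1, T):
        eps[:, t] = rho * eps[:, t - 1] + rng.normal(0.0, np.sqrt(1.0 - rho**2), N)
    y = (a[:, None] + b[:, None] * x1 + beta2 * x2 + eps > 0).astype(int)
    return y, x1, x2
```

Scaling the AR(1) innovations by √(1 − ρ²) keeps the marginal variance of ε_{it} at one for every ρ, so changing ρ changes only the persistence, not the scale, of the latent error.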
Finally, we vary the cases over the number of time periods observed (T = 5, 10, 20) and the level of serial correlation in the latent error ε_{it} (ρ = 0, 0.4, 0.8 in equation (3.10)). We expect the ME Probit estimator to be inconsistent under serial correlation, while the Pooled Heteroskedastic Probit estimator should be consistent under either serial correlation or independence. In addition, we would expect the ME Probit estimator to be more efficient than the pooled estimator under serial independence. We estimate two specifications for both the ME Probit and Pooled Heteroskedastic Probit estimators. The first specification incorrectly assumes that the random effects a_i and b_i are uncorrelated with the x's, while the second specification assumes that a_i and b_i are random effects correlated with the x's through their time averages.

3.5.1 Computational Results

In addition to providing results on estimation consistency, it seems prudent to report results on the computational ease of implementation. All estimation was performed in STATA⁵ using the commands meprobit and hetprobit. Since the JMLE requires numerically integrating out the random effects, we expect the ME Probit estimator to take longer. Tables G.1–G.3 present the average run times of the two estimators, with several notable features. Although the estimation times may seem short (5 seconds at most), this reflects the simple specification of only two covariates and the fairly small sample size of 300 individuals. As the specifications become more complex and the sample size increases, the time to compute will lengthen. Therefore we will focus on the relative speed of the two estimators.

⁵STATA is among the most popular software used by researchers in social science. Other software such as Matlab and R also have built-in commands that perform similar functions to those in STATA.
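Outside of a preprogrammed command, the pooled estimator is also straightforward to code directly. The sketch below maximizes the pooled log likelihood (3.17) in the exponential-variance form (3.18) using scipy; it is a hypothetical interface, not a reimplementation of hetprobit, and the caller supplies the mean regressors X and the variance regressors Z (e.g., the polynomial terms of footnote 4 approximating v(x_{1it}), with no constant in Z).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_pooled_het_probit(y, X, Z):
    """Pooled Heteroskedastic Probit: maximize the pooled log likelihood
    (3.17) in the exponential-variance form (3.18). X holds the mean
    regressors (including a constant); Z holds the variance regressors,
    which must not include a constant (see the identification discussion)."""
    k, m = X.shape[1], Z.shape[1]

    def negll(par):
        beta, delta = par[:k], par[k:]
        p = norm.cdf((X @ beta) / np.exp(Z @ delta))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)       # guard the logs
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    res = minimize(negll, np.zeros(k + m), method="BFGS")
    return res.x[:k], res.x[k:]                   # scaled coefficients, variance parameters
```

The returned mean coefficients are the scaled parameters θ_σ of equation (3.18). Because observations are pooled over t, valid standard errors would require a cluster-robust (sandwich) estimator, which this sketch omits.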
As expected, the ME Probit estimator always takes much longer than the Pooled Heteroskedastic Probit estimator, almost six times as long. In addition, the distribution of the ME Probit times is much more variable, with standard deviations as large as 9.03 (in DGP 1), so there appear to be some outliers skewing the distribution to the right. This confirms our initial expectation that the ME Probit estimator suffers computationally compared to the pooled method.

3.5.2 Parameter Estimates

The estimates from the ME Probit estimator and Pooled Heteroskedastic Probit estimator are not comparable "as is" because the Pooled Heteroskedastic Probit estimator can only estimate the scaled coefficients. Therefore we will first look at the "de-scaled" ME Probit coefficient estimates and then compare the two estimation procedures using the "scaled" coefficient estimates. Recall that we calculate the scaled ME Probit estimates by dividing the coefficient estimates by the estimate of the scale value, √(1 + σ̂₁²).

Tables G.4–G.6 present the bias and standard deviation (in parentheses) of the de-scaled ME Probit estimates. The first thing to note is that specification (1) performs quite poorly at any level of serial correlation. Since specification (1) incorrectly assumes that the random effects a_i and b_i are uncorrelated with the covariates, this emphasizes the importance of allowing for correlation between the random coefficients and the covariates. In this particular setting, not doing so would lead to the interpretation that the regressor x_{1it} has no strong predictive power for the outcome y_{it}. Turning to the correct specification that allows for correlated random coefficients, specification (2), in DGP 1 and 2 under no serial correlation (ρ = 0) the ME Probit appears to perform well with very little bias.
But in DGP 3, where specification (2) acts as an approximation to a more flexible correlated random coefficients structure, there appears to be a significant amount of bias, though of a lesser degree than not attempting to control for correlated random coefficients at all. As the level of serial correlation increases, we see an increasingly positive bias in the de-scaled coefficient estimates. This confirms our earlier discussion, where we argued that the JMLE procedure should be sensitive to correlation over the time dimension.

We do see quite a loss of efficiency when using specification (2) relative to specification (1). This is because the correlated random coefficient approach requires including several additional terms, polynomials of the time averages, that may be strongly collinear with one another. We would expect that as the distribution of the time averages becomes more concentrated (i.e., as the number of time observations increases), the terms in the correlated random coefficient specification become more collinear. So even though the standard deviations for specification (1) decrease as the number of time observations increases, the standard deviations for specification (2) increase. Finally, as the level of serial correlation increases, there is also an increase in the standard deviations. Therefore, using the JMLE procedure, introducing serial correlation results in increasing bias and increasing variance in the de-scaled parameter estimates.

To compare the two estimation procedures, however, we need to look at the scaled parameter estimates. Both the scaled Pooled Heteroskedastic Probit and ME Probit estimates are presented in Tables G.7–G.9. Again, and for both estimators, there is strong bias with specification (1), which does not allow the random coefficients to be correlated with the x's. Moreover, as we expected, the bias of the Pooled Heteroskedastic Probit estimator is unaffected by the level of serial correlation.
More surprising is that the bias of the scaled ME Probit estimates induced by serial correlation is diminished. In fact, the bias of the scaled ME Probit estimates appears to approach the level of bias observed for the Pooled Heteroskedastic Probit estimates. We also see that the ME Probit scaled coefficient estimates are slightly more efficient, even under serial correlation. Of course, one should expect a JMLE procedure to be more efficient than a PMLE procedure when the distribution is correctly specified. However, this need not be true when the joint likelihood is misspecified and the pooled likelihood is still correctly specified. The efficiency of the ME Probit estimator in this simulation should not be mistaken for a general result; however, since it appears to be fairly stable across all three DGPs, it may be worth investigating theoretically. There therefore seems to be a bias-efficiency trade-off in terms of the scaled parameter estimates. If one compares on the basis of Root Mean Squared Error, as in Table G.10, the ME Probit estimator appears superior to the Pooled Heteroskedastic Probit estimator under all of the different sampling scenarios.

Why are the ME Probit estimates of the scaled coefficients performing so well when the de-scaled coefficients are quite poor? The answer lies in the estimation of the scaling factor, in which the ME Probit is able to identify and estimate σ₁². Figures F.1–F.9 present the empirical distribution of σ̂₁². Recall the true variance is 0.5, so under no serial correlation the ME Probit estimator does a fair job of estimating the variance component. But as serial correlation increases, the distribution of the variance estimates moves to the right, suggesting an upward bias. Since we also see an upward bias in the de-scaled coefficient estimates, the biases cancel each other when calculating the scaled coefficients.
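The scaled coefficients behind this comparison are a simple transformation of the ME Probit output, θ_σ = θ/√(1 + σ₁²) from equation (3.18). A small sketch with illustrative (stylized, not simulation) numbers shows how upward biases in θ̂ and σ̂₁² can offset:

```python
import numpy as np

def descale_to_scaled(coefs, sigma1_sq):
    """Convert ME Probit coefficient estimates into the scaled coefficients
    theta_sigma = theta / sqrt(1 + sigma_1^2) of equation (3.18), the
    quantities comparable with Pooled Heteroskedastic Probit output and the
    inputs to the APE."""
    return np.asarray(coefs, dtype=float) / np.sqrt(1.0 + sigma1_sq)

# Stylized numbers: a 20% upward bias in the coefficient, paired with a
# variance estimate inflated so that 1 + 1.16 = 1.2^2 * (1 + 0.5), cancels
# exactly in the scaled coefficient.
truth = descale_to_scaled([1.0], 0.5)    # theta = 1.0, sigma_1^2 = 0.5
biased = descale_to_scaled([1.2], 1.16)  # both give 1/sqrt(1.5)
```

Both calls return approximately 0.8165, so the APE, which depends only on the scaled coefficients, is left essentially unchanged by this pattern of biases.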
Since the ME Probit assumes no serial correlation, it interprets the persistence in the latent error as part of the individual heterogeneity. Returning to the latent variable set up in equation (3.8), the unobserved error that the estimator is trying to parse is

$$(\text{unobserved error})_{it} = u_{1i} + x_{1it}u_{2i} + \varepsilon_{it}. \qquad (3.31)$$

The serial correlation in ε_{it} looks a lot like the persistence induced by the additive heterogeneity u_{1i}. Consequently, the estimate of the variance component is biased upward and less precise compared to the case of no serial correlation. An advantage of the JMLE procedure is that it is able to identify both of the variance components σ₁² and σ₂². However, given this analysis, one should be wary of the validity of the σ₁² estimates if there is concern about serial correlation. As for σ₂², the results reported in Table G.11 mirror the other coefficient estimates: the de-scaled estimates display increasing bias over the level of serial correlation, which cancels with the scaling such that the scaled estimates appear unbiased. Incorrectly assuming that there is no serial correlation would lead a researcher to incorrectly conclude that the distribution of the random coefficient is much more dispersed than it truly is. Nevertheless, the bias works in our favor if one is more interested in the APEs, which we typically are. As mentioned previously, the scaled coefficients are what determine the APE estimates. The results from the simulation thus far suggest that both estimation procedures will perform reasonably well, given that both estimate the scaled coefficients with fairly small bias (when they allow for correlated random effects). The question remains whether any efficiency gains will be observed when using a misspecified JMLE over a PMLE. Moreover, we saw that not allowing for correlated random effects (specification (1)) biases the parameter estimates.
Some may hope that the simple specification will still be able to capture an average effect. But from theory, we know that this is unlikely in a nonlinear model such as the Probit, because the average does not pass through nonlinear functions. Finally, we saw poor parameter estimates for DGP 3, where the correlated random coefficient structure was only approximated. If we are only interested in the APEs, can an approximation for the random coefficient be sufficient to capture the correlation structure with the covariates?

3.5.3 Average Partial Effect Estimates

Estimates of the Average Partial Effects with respect to x_{1it} are presented in Tables G.12–G.14. In line with the results on the coefficient estimates, not allowing the random effects to be correlated with the x's (specification (1)) results in significant bias across all three DGPs. When we allow the random effects to be correlated with the x's (specification (2)), the bias in the APE estimates shrinks considerably for both estimation procedures.

In DGP 1 and 2, the ME Probit estimates appear to have smaller bias and are more efficient than the Pooled Heteroskedastic Probit estimates. The bias in the de-scaled coefficients and the bias in the variance component of the scaling factor neutralize each other, resulting in very little bias in the ME Probit estimates of the APE at any level of serial correlation. In DGP 3, the Pooled Heteroskedastic Probit estimator tends to have slightly smaller bias but is less efficient than the ME Probit estimator. This suggests that there might be some bias-efficiency trade-off, but after a quick examination of the RMSE, the ME Probit estimator is preferred.

But what does this mean for a researcher working with empirical data? At first glance, one should trust the Pooled Heteroskedastic Probit estimates over the ME Probit estimates, since the specified likelihood is robust to arbitrary correlation over the time dimension.
But the simulation results suggest that under a simple correlation structure, such as AR(1), there may be robustness in the scaled parameter estimates and APEs from a JMLE procedure in which the joint likelihood is misspecified. However, it should also be emphasized that similar scaled coefficient and APE estimates between the ME Probit and Pooled Heteroskedastic Probit procedures should not trick a researcher into thinking that there is statistical evidence that the assumptions underlying the JMLE necessarily hold. Consequently, any interpretations based on the variance estimates, such as the amount of variation explained by the "random" and "fixed" components, should be taken with a hearty amount of skepticism.

Finally, to emphasize the importance of correctly understanding the unobserved heterogeneity and how to incorporate it into the descriptive statistics, we provide calculations of the true PEA as a comparison to the true APE over a single sample. Many researchers turn to the PEA as a simpler-to-calculate approximation of the APE. As explained earlier, the PEA plugs in the averages of the unobserved heterogeneity rather than integrating it out. Although quicker to compute, this does not truly reflect the data structure we believe is present. Table G.15 presents the results over all 3 DGPs and increasing time observations. Note that the true APE and PEA should not vary with any serial correlation in the latent error.

The PEA systematically over-estimates the effect in comparison to the APE. This can be explained by comparing the partial effect equations (3.23) and (3.25). First, the PEA does not incorporate the scaling factor 1/√(1 + σ₁² + x²_{1o}σ₂²). Because of the chain rule, the APE and PEA can each be broken into two similar terms. We refer to the first as the "Probit scaling," which consists of the standard normal PDF (φ(·)) evaluated at some point in relation to the covariates x_{1o} and x_{2o}.
This term ensures that the partial effect diminishes as the covariates x1o and x2o get large in absolute value. It is a consequence of the Probit functional form that bounds the average structural function between 0 and 1. The second term multiplied by the Probit scaling is the “latent” effect. This is the effect of the random variable of interest on the latent index. In a standard Probit with random coefficients, this is just the coefficient on the random variable of interest. By not incorporating the scaling factor, the PEA diminishes the Probit scaling and enlarges the latent effect. This means that the PEA will be shifted up (since β1 is positive) and flatter over the support of the x’s. Second, the latent effect in the PEA does not include the impact of the part of the heterogeneous effect that is uncorrelated with the x’s. This effect varies with the value of x1o and can either enlarge or diminish the latent effect. The main takeaway is that any patterns or significant biases caused by serial correlation in the latent error appear to be significantly muted when it comes to computing the APEs. A major contributing factor is the ability of the ME Probit estimator to somewhat preserve the consistency of the scaled coefficient estimates under specifications in which we would otherwise deem the procedure inconsistent. This calls for a theoretical investigation of the consequences of misspecifying the joint likelihood under serial correlation. 3.5.4 ASF Figures F.10 - F.15 provide ASF estimates for DGP 1 and 2 over the relevant values of x1i, fixing x2i at its mean (1). There is very little difference between the two estimation procedures, over the different levels of serial correlation or over the number of time observations.
This again reiterates that because the ME Probit estimator is able to estimate the scaled coefficients well, statistics such as the APE and ASF that depend only on the scaled coefficient estimates tend to be well estimated. This simulation study has uncovered a surprising number of results which we will summarize here. First, regardless of the estimation procedure, not specifying correlated random effects (i.e., not allowing the random coefficients to be correlated with the covariates) significantly affects the results and their interpretation. Second, the Pooled Heteroskedastic Probit estimator has performed quite well in terms of the scaled parameter estimates, the APE estimates, and the ASF estimates. It produces estimates with fairly low bias and is much quicker, running in 15.3% of the time ME Probit takes to run. However, one of the main drawbacks to the pooled approach is that it is unable to identify the variance components, and therefore some information is lost using this approach. Alternatively, the ME Probit estimator is able to identify and estimate the variance components but relies on specifying the whole joint distribution, which is generally assumed to be independent over time. We saw that there are biases in the de-scaled coefficient and variance component estimates under serial correlation. Therefore, even if the ME Probit can identify these parameters, interpretation should be taken lightly when one is concerned about the presence of serial correlation. But surprisingly these biases appear to counterbalance when calculating the scaled coefficient. This leads to good estimates of the APE and ASF under misspecification of the joint likelihood. Finally, there appear to be efficiency gains using the JMLE approach regardless of whether or not the joint likelihood is misspecified. This is somewhat surprising since there are no theoretical results that would suggest efficiency under misspecification.
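To make the counterbalancing mechanism concrete, here is a minimal numerical sketch (the specific numbers are purely illustrative and not taken from the simulation tables): if serial correlation inflates the de-scaled coefficient estimate and the estimated latent variance by a common factor in the latent scale, the scaled coefficient is left unchanged.

```python
import math

# Illustrative (hypothetical) values, not from the chapter's tables.
beta_true, sig1sq_true = 1.0, 1.0           # de-scaled coefficient, RE variance
scaled_true = beta_true / math.sqrt(1 + sig1sq_true)

# Suppose serial correlation biases the JMLE estimates by a common
# factor k on the latent scale: beta_hat = k*beta and the estimated
# total latent variance 1 + sig1sq_hat = k^2 * (1 + sig1sq_true).
k = 1.25
beta_hat = k * beta_true                     # biased de-scaled coefficient
sig1sq_hat = k**2 * (1 + sig1sq_true) - 1    # biased variance component (2.125)
scaled_hat = beta_hat / math.sqrt(1 + sig1sq_hat)

print(scaled_true, scaled_hat)  # both 0.7071...: the biases cancel in the ratio
```

The point is purely algebraic: any proportional distortion of the latent scale drops out of β/√(1 + σ₁²), which is why the scaled coefficients, APE, and ASF can survive a misspecified joint likelihood while the de-scaled coefficients and variance components do not.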
It should be reiterated that these apparent robustness results could be an artifact of the particular data generating processes considered, but we tried to provide a range of interesting DGPs to investigate them. Although beyond the scope of this chapter, there may be a theoretical result in terms of deriving the bias of the scaled coefficient for the JMLE under relatively simple dependency structures. The next section examines whether the results found here are consistent with real data. Given these results, we would expect the Pooled Heteroskedastic Probit and ME Probit approaches to provide similar scaled coefficient estimates and APE estimates even when we find evidence of serial correlation. 3.6 Application Our application utilizes data from Blattman, Jamison, and Sheridan (2017) (for the remainder of the chapter, referred to as BJS), who study the effect of Cognitive Behavioral Therapy (CBT) on criminal and violent behavior of men in Liberia. After identifying and approaching potential high risk men, the research team obtained 999 men who agreed to enter the sample. Then treatment was assigned randomly within blocks as described in their accompanying appendix. The three possible treatments were: CBT, cash, and both CBT and cash. CBT works to make the patient aware of their automatic negative or self-destructive thoughts so they may be better able to actively change their behavior. Supplying cash should reduce criminal behavior for budget constrained individuals. BJS provides a more thorough discussion of the mechanisms through which these interventions may change behavioral and economic outcomes. The data was collected as a series of 5 surveys. The initial survey provided baseline covariates on the men from the study and was taken prior to treatment. Table G.16 provides a summary of a selection of these variables.
The remaining four endline surveys were taken after 2 weeks, 5 weeks, 12 months and 13 months.6 6Because the surveys are taken unevenly over time, it would be difficult to conclude that the dependency in the latent error follows a simple AR(1) process as we used in the simulations. We believe that this makes any similarities found between the two procedures even more convincing that a robustness property may hold. One of the major differences between our analysis and the initial work done in BJS is that they average the first two surveys and the last two surveys to construct short-run outcomes and long-run outcomes and calculate the effects separately, while we treat it as a panel structure. By doing so, they are able to investigate heterogeneity in treatment effect over time, while we are more interested in heterogeneity that is correlated with the controls. Although many different types of outcomes were recorded and analyzed by BJS, we will look at only some of the antisocial behavior outcomes in more detail. They define antisocial behavior as “disruptive or harmful acts toward others, such as crime or aggression.” We will look at the binary outcomes of selling drugs, being arrested, and engaging in illicit activity.7 Over all the observations, each outcome occurred on average for around 10–13 percent of the population. The last four variables (antisocial behavior index, perseverance index, reward responsiveness and impulsiveness index) are combinations of survey responses that capture the individual's inclination toward a particular characteristic. All are standardized to mean 0 and variance 1. These will be the dimensions along which we will investigate a heterogeneous treatment effect using a correlated random effects approach.
BJS does investigate heterogeneous treatment in their appendix (Table E.7) but uses the endline survey responses to construct the outcome antisocial behavior index and only looks at heterogeneity correlated with the baseline antisocial behavior index and a baseline measure of self-control/patience. To motivate a heterogeneous treatment in the nonlinear ME Probit and Pooled Heteroskedastic Probit models, Table G.17 provides the OLS estimates of a linear probability CRE model. In this setting, a simple linear analysis should provide fairly good estimates of the treatment effects because of the random assignment of the treatment. 7In the survey the respondents are asked if each of these outcomes occurred within the last two weeks. Sells Drugs – All of the interventions decrease the probability of selling drugs, and the sum of the CBT and cash effects is comparable to the effect of both as a treatment. However, the cash intervention effect is not very large and is also not statistically different from 0. The other two treatments are statistically significant. We also see strong evidence of heterogeneity in treatment over the antisocial behavior index and some evidence of heterogeneity over the perseverance index (not statistically significant). The direction of the heterogeneity suggests that those who initially demonstrate antisocial behavior (i.e., one standard deviation away from the average level of antisocial behavior) tend to have a stronger treatment effect (i.e., the treatment effect of both CBT and cash changes from -0.0724 to -0.1432, almost doubling). This is consistent with expecting CBT to have decreasing returns over the level of antisocial behavior (i.e., those who already display low levels of antisocial behavior do not gain much from CBT whereas those with high levels of antisocial behavior can gain much more).
The treatments that include cash have more of an effect if the individual displayed poor perseverance (lower on the perseverance index). A possible explanation is that those with poor perseverance are more likely to have binding budget constraints compared to those with better perseverance. Arrested – None of the interventions show a statistically significant effect on the probability of arrest. On top of that, the cash intervention led to a slightly positive (but not statistically significant) effect, the opposite direction of what one would expect. Even so, there is statistically significant evidence of a heterogeneous effect for the treatment of both CBT and cash over the antisocial behavior index. It is important to note that being arrested is not just a measure of behavior but also a measure of the government's ability to enforce the law. The next outcome looks to isolate the effect on behavior. Illicit – All of the treatments are estimated to have the expected negative effect, and the treatments including CBT have significant effects. Interestingly, the marginal effect of providing cash in addition to therapy is minimal since the treatment effects are estimated to be about the same. The treatment effects are heterogeneous in antisocial behavior for the interventions that include CBT and slightly heterogeneous in perseverance for only both CBT and cash. In BJS, the characteristic of perseverance is sometimes referred to as grit, measured from the responses to seven questions on “the ability to press on in the face of difficulty” from the GRIT scale (Duckworth and Quinn (2009)). The positive direction of the heterogeneity in perseverance can be interpreted as follows: an individual with one standard deviation more perseverance than the average level has a treatment effect from both cash and CBT of -0.0227 compared to -0.0622 (a reduction of almost two-thirds).
This would suggest that perseverance may be a detriment in trying to change individuals’ behaviors and actions. It appears that overall, both CBT and cash produce stronger effects, which may indicate that cash is necessary to loosen the budget constraint such that an individual may change their behavior as influenced by the CBT. Moreover, most of the heterogeneity seems to be captured by the antisocial behavior or the perseverance indexes. Finally, this panel structure suggests the possibility of serial correlation. Following the suggestion in Wooldridge (2010), we test for serial correlation in the linear model by regressing first differences of yit on their lag and testing if the coefficient is equal to -0.5 (as implied by the case of no serial correlation). We are only able to reject the null hypothesis of no serial correlation for the outcome of physical fights (p-value = 0.7689). Motivated by the evidence of a treatment effect and possible heterogeneity in the treatment effect in a simpler linear model, Tables G.18–G.20 provide the parameter estimates for the different Probit specifications. First note that the estimates are scaled parameter estimates and therefore comparable between the different estimation methods. As usual, the reported standard errors for the PMLE are robust to arbitrary serial correlation. But we also report JMLE standard errors that are robust to arbitrary serial correlation. This is usually not done, since serial independence is assumed for consistent estimation. However, given the results of the earlier simulation, we observed fairly accurate scaled coefficient and APE estimates from the JMLE even under serial correlation. Therefore we treat the estimator as if it were a quasi-MLE, knowing the likelihood is misspecified, and adjust the inference accordingly. For all three outcomes, the coefficient estimates on the treatments are quite different between the JMLE and PMLE procedures.
For instance, the cash treatment is estimated to have a negative coefficient for the outcomes of selling drugs and being arrested when estimated using the JMLE procedure, but a positive coefficient in the PMLE approach. In the end, this may still have very little impact on the treatment effect estimates since they are also strongly determined by the heteroskedastic coefficients in the Pooled Heteroskedastic Probit model. An explanation for the stark differences could be that the pooled estimator is quite inefficient, with some standard error estimates approaching 4× the JMLE standard errors. Consequently, the JMLE coefficients are more frequently statistically different from 0, whereas the PMLE coefficients are almost never statistically significant. On the other hand, the efficiency of the JMLE also comes at a computational cost. The Pooled Heteroskedastic Probit estimator was always able to compute within a couple of seconds while the ME Probit estimator took as long as 4 hours to compute. This makes bootstrapping, the common procedure for obtaining standard errors for the ATE and ASF, impractical. Given the evidence in the simulation studies of Section 5, one should not readily trust the variance component estimates in the ME Probit model. However, it is interesting to note that for the outcomes of selling drugs and engaging in illicit activity, the estimator would suggest that the cash treatment and the both-CBT-and-cash treatment do not have a random effect at all. As for the CRE specifications, estimates of the coefficients on the interaction terms appear more similar in direction, magnitude, and efficiency across the two estimators. For the outcome of selling drugs, the most important dimensions of heterogeneity appear to be antisocial behavior and perseverance, especially for the treatments that include cash.
As for being arrested, there is little heterogeneity in the treatment of CBT, but the treatment of cash is heterogeneous in perseverance and reward. Unlike the results of the linear specification, these results suggest that those with more perseverance are less likely to be arrested after being given cash. The reward index compiles responses from eight survey questions to measure “whether [an individual is] motivated by immediate, typically emotional rewards.” The results indicate that an individual more motivated by rewards is less likely to respond well to cash treatments in reducing the probability of being arrested. This may be because, without any changes in their behavior prior to the treatment, they were then rewarded with cash, which provides positive reinforcement of their bad behavior. Similar to what we observe in the linear specification, treatment is heterogeneous with respect to antisocial behavior and perseverance for the outcome of engaging in illicit activity. In particular, the treatment of CBT is fairly heterogeneous in antisocial behavior and the treatment of both CBT and cash is heterogeneous in perseverance, in the same directions as the linear estimates. The implications for the ATE can be seen in Table G.21. We find that, in the OLS and ME Probit estimates, allowing for correlation between the unobserved heterogeneity and the covariates tends to lower the ATE with very little cost to efficiency. It is more of a mixed bag when we look at the Pooled Heteroskedastic Probit estimator. Again, this may be due to the inefficiency of the estimator. Consequently, we tend to see stronger similarities between the ME Probit and OLS estimates. This reiterates the robustness of APEs using ME Probit seen in the simulation study. In all outcomes, the strongest treatment among the three is both CBT and cash. For selling drugs, we find a statistically significant effect of both CBT and cash as well as therapy only.
Interpreting the ME Probit estimates, both CBT and cash reduce, on average, the probability of selling drugs in the future by 7.6 percentage points while the treatment of therapy only reduces the probability by 6.4 percentage points. The Pooled Heteroskedastic Probit estimates differ slightly, estimating a 4.9 percentage point decrease in probability for both CBT and cash. However, there appears to be a significant jump in the standard error, so the difference between the two estimates is not likely to be statistically significant. We find no statistically significant treatment effects for the outcome of being arrested at any conventional level of significance. For illicit activity we find fairly similar estimates between the ME Probit and Pooled Heteroskedastic Probit ATEs. An interesting result is that for some specifications and outcomes, we find the Pooled Heteroskedastic Probit estimator to be more efficient. This was not seen in our simulation results, where the ME Probit estimator always appeared more efficient (even under a misspecified log likelihood). Since we saw ample statistical significance in the correlated random effects, Figures F.19 - F.21 show the surface of the treatment effects over relevant values of two characteristics. As discussed previously, the most influential characteristics are antisocial behavior and perseverance, except in the case of being arrested, in which reward has more impact than perseverance. When looking at the outcome of selling drugs, the first thing to note is that at every combination of relevant characteristic values, there is a treatment that induces an effect in the desired direction. Moreover, this figure tells us that those with low levels of perseverance require the treatment of both therapy and cash whereas those with higher perseverance are better served by just receiving therapy. Finally, both estimation procedures (JMLE and PMLE) produce similar figures with inconsequential differences in interpretation.
Moving to the treatment effects for being arrested, we find that there are areas in which no treatment is able to produce an effect in the desired direction. For those who are relatively better behaved initially and are very responsive to rewards, none of the treatments produce desirable effects. On the other end of the spectrum, a therapy and cash treatment produces a strong effect (i.e., for those with antisocial behavior = 2 and reward = -2, the treatment of both therapy and cash reduces the probability of being arrested by approximately 25 to 30 percentage points). Although the broad conclusions are the same, there are small differences between the two estimators. Unlike the JMLE, the PMLE requires much lower values of reward and antisocial behavior to find cash to be the best treatment. Moreover, the findings of the JMLE show a slightly larger area in which none of the treatments produce desirable effects compared to the PMLE. The conclusions for the outcome of engaging in illicit activities are similar to those found in studying the outcome of selling drugs. Those with lower levels of perseverance require both therapy and cash while those with higher perseverance suffice with just therapy. The conclusion of either only therapy or only cash as the optimal treatment may be unexpected, as it suggests that there is actually a marginal detriment in providing cash when also providing therapy (or vice versa). A possible explanation for this conclusion is the limitations of the model specification. We have assumed that the heterogeneous treatment is linear in the individual's characteristics. Therefore the crossing from both therapy and cash to just therapy or just cash as the optimal treatment may be an unsubstantiated consequence of the marginally strong effect of therapy and cash over just therapy or just cash on the other end of the characteristic spectrum. A possible solution would be to allow for a much more flexible specification of the correlated random effect.
Instead of only specifying linear terms in the random effects, we could also include higher order terms to capture any nonlinear relationship. However, this will increase the dimension of the parameter space fairly quickly, providing grounds for utilizing high dimensional approaches. For instance, including second order terms for the four characteristics (10 terms) for the intercept and each of the treatments will increase the number of parameters by 40. One could extend the work of Wooldridge and Zhu (Forthcoming), who use a debiased estimator of an L1-penalized pooled Probit with correlated random effects (only in the intercept). Unfortunately, to our knowledge, there are no published commands in common statistical packages such as Stata or MATLAB that allow for either a penalized ME Probit (or any penalized ME Generalized Linear Model) estimator or a penalized Heteroskedastic Probit pooled estimator. 3.7 Discussion The results from the simulations and application leave some open ended questions that we wish to examine in more depth. First, we are concerned that the robustness of the JMLE when independence over time does not hold may be because we have introduced serial correlation in a fairly simplistic manner. We will examine DGP 1 under an AR(2) process. This introduces a much more complex model of serial correlation rather than merely more or less persistence. Second, we are concerned that many researchers are attracted to the JMLE because it is able to identify the variance parameters. As we showed in simulation, the estimates are strongly biased under the presence of serial correlation. But in our simulations we have always assumed the presence of random effects. Alternatively, in this simulation we consider what happens when the coefficients are in fact non-random. We also find that the presence of serial correlation can mislead one to believe there are random effects when there are none.
This further illustrates the caution that should be taken when interpreting the variance components from the ME Probit estimator. Finally, there is a growing interest in utilizing a Logit model as an alternative to a Probit model. Therefore we repeat DGP 1 but specify that the latent error is logistically distributed. The analogue of the ME Probit estimator is the ME Logit estimator, for which we employ the command melogit in Stata. As in the Probit case, the random components are still assumed to be normally distributed and integrated out numerically. However, this means there is no good analogue of the pooled approach with correct distributional assumptions since the logistic distribution does not mix well with the normal distribution. Consequently, this section will not focus on the comparison of the JMLE to the PMLE but rather on whether the JMLE is itself consistent under serial correlation in terms of the parameter, variance component, and partial effect estimates. 3.7.1 AR(2) Consider the following AR(2) process in the latent error: εit = 0.6 εit−1 − 0.3 εit−2 + eit (3.32) where eit ∼ N(0, 1 − 0.6² − 0.3²). This means that each error is positively correlated with the first lagged error and negatively correlated (conditional on the first lag) with the second lagged error. With a simple AR(1) process, serial correlation in the latent error appears similar to individual heterogeneity and, as we found in Section 5, does not bias the scaled coefficient or APE estimates. But with a more complex AR(2) process, the correlation over time cannot be as easily mistaken for individual heterogeneity. Table G.22 presents the scaled coefficient estimates. Again, we find no strong bias in the ME Probit estimates even though the joint likelihood is misspecified.
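As a quick check on the dependence structure implied by equation (3.32), the process can be simulated and its first two autocorrelations compared with the Yule-Walker values ρ(1) = 0.6/(1 − (−0.3)) ≈ 0.46 and ρ(2) = 0.6·ρ(1) − 0.3 ≈ −0.02 (a sketch; the seed and series length are arbitrary choices, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
phi1, phi2 = 0.6, -0.3
sigma_e = np.sqrt(1 - 0.6**2 - 0.3**2)  # innovation sd from equation (3.32)

# Simulate one long series (with burn-in) rather than a panel,
# which is enough to read off the autocorrelations.
n, burn = 200_000, 500
e = rng.normal(0.0, sigma_e, size=n + burn)
eps = np.zeros(n + burn)
for t in range(2, n + burn):
    eps[t] = phi1 * eps[t - 1] + phi2 * eps[t - 2] + e[t]
eps = eps[burn:]

def acf(x, k):
    """Sample autocorrelation at lag k."""
    x = x - x.mean()
    return (x[k:] * x[:-k]).sum() / (x * x).sum()

print(round(acf(eps, 1), 3))  # ~ 0.462 (= 0.6 / 1.3)
print(round(acf(eps, 2), 3))  # ~ -0.023: lag-2 dependence is weak
```

Note the distinction: the lag-2 partial autocorrelation is −0.3 (the conditional negative correlation described in the text), while the unconditional lag-2 autocorrelation is close to zero, which is precisely why this dependence cannot be mimicked by an individual-specific effect.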
We do see an increase in bias as the number of time observations increases, which might suggest that the JMLE starts to waver in its capability of addressing a more complex correlation structure as more observations are present. But this holds true for the Pooled Heteroskedastic Probit estimator as well. Turning to the APE estimates in Table G.23, both the ME Probit and the Pooled Heteroskedastic Probit estimates have low bias. We find that there are still fairly substantial efficiency gains from utilizing the JMLE even when the joint likelihood is misspecified. So even when introducing a more complex structure to the serial correlation, we find that the bias in estimating the variance component σ₁² fully captures the consequences of the serial correlation in the latent error. Figures F.22 - F.24 show the empirical distribution of the ME Probit estimates of σ₁². As one would expect, there is an upward bias since overall, the AR(2) process in equation (3.32) induces positive correlation among the time observations. Overall, this further illustrates a possible robustness to serial correlation in the scaled parameter and ASF/APE estimates under JMLE, which should be theoretically investigated in further studies. 3.7.2 No Random Effects Since the JMLE seems able to address serial correlation through the variance component σ₁², it would be interesting to observe what occurs when no random effects are actually present (σ₁² = σ₂² = 0). Tables G.24 and G.25 present the computational results. Since σ₁² = σ₂² = 0 is at the boundary of the valid parameter space, we would expect the ME Probit estimator to struggle. Table G.24 presents the number of convergence failures of the estimator prior to obtaining 1,000 successes. When there is no serial correlation, there are upwards of 700 failures for the ME Probit estimator, but as serial correlation is introduced, the failures reduce dramatically.
This is because the introduction of serial correlation allows for estimates of the variance components away from the boundary. Table G.25 reports the estimation times. Now we see a much stronger contrast between the Pooled Heteroskedastic Probit estimator and the ME Probit estimator, where the ME Probit can take up to 22 times longer. Instead of looking at the estimates of α and β (which follow the trends of all the previous simulations), we will simply note that the APEs, in Table G.26, are well estimated regardless of the estimation procedure used or whether correlated random effects are specified. Since there are no random coefficients, there cannot be correlation between the fixed parameters and the covariates. Consequently, specifying correlated random coefficients does not necessarily hurt the estimators in terms of bias, but it does result in a less efficient estimator as it calls for the inclusion of many irrelevant covariates. We will focus on the estimates of the variance components and whether or not standard LR tests are valid in detecting the presence of random coefficients under serial correlation. Tables G.27-G.28 present the averages and standard deviations of the predicted variance components from ME Probit under specifications (1) and (2). When there is no serial correlation, both the estimates of σ₁² and σ₂² are quite close to zero, which is what we would hope for when there are no random coefficients actually present. As the level of serial correlation increases, the variance component σ₂² remains low while the estimates of σ₁² are increasingly biased upwards. This means that serial correlation can be misinterpreted as individual heterogeneity. This leads us to caution any researcher that would like to make inference on the variance component estimates and use them interpretatively. One would hope that the LR test would be able to reject the model of random coefficients in favour of a more simple non-random coefficient Probit model.
Table G.29 reports the rejection rates at the 5% significance level. We find that under no serial correlation the test performs as expected, but as the serial correlation increases the rejection rates also increase. In fact, with a correlation coefficient of 0.8, we reject in 100% of the simulated samples. The earlier simulation results suggest that the ME Probit estimator is favourable, given that it is more efficient than the pooled approach and there appears to be little bias in the scaled coefficient and APE estimates under misspecification of the joint likelihood. But when there is in fact no random coefficient, we find the ME Probit estimator struggles computationally compared to the Pooled Heteroskedastic Probit estimator. The ME Probit estimator takes much longer to compute and fails to converge at all in many instances. Moreover, this simulation re-emphasizes the caution that should be taken when interpreting the variance components. 3.7.3 Logit This simulation utilizes a logistic distribution for the latent error compared to the normal distribution used in a Probit. Although it seems to be favoured particularly in applied work, the logistic distribution does not easily incorporate a mixed effects framework. The logistic distribution does not mix well with itself nor with the normal distribution. This raises two issues: 1. Assuming that the random coefficients are normally distributed, as is usually done in the Mixed Effects literature, there is no equivalent pooled approach. Specifically, the unobserved components u1i + x1i u2i + εit (3.33) are the sum of two normals and a logistic random variable, whose distribution is generally unknown. Consequently we cannot evaluate the conditional distribution to obtain the contemporaneous conditional mean of yit as we did in equation (3.16) when all the unobserved components are assumed to be normally distributed. 2.
It is unclear how to implement an AR(1) process for the logistic errors since the logistic distribution does not mix well with other logistically distributed random variables. How the AR(1) process is implemented greatly affects how we can approach the estimation problem. We consider two approaches. With the following autoregressive process of order 1, εit = ρ εit−1 + eεit (3.34), we can either aim to ensure that εit|εit−1 is logistically distributed or that the marginal distributions of all the time observations are identically logistically distributed. These two challenges have to be considered when constructing our simulation study. With respect to the first point, in our simulation we of course use the ME Logit estimator, but we also consider the Pooled Heteroskedastic Probit estimator, where imposing normality on the latent error is used as an approximation to what is usually an unknown latent distribution. In addressing the second point, we run the simulations on two different implementations of the serial correlation. In the first case, which we will refer to as a conditional logistic AR(1), we generate the process from the following: εi1 ∼ logistic(0, √3/π), eεit ∼ logistic(0, √(3(1 − ρ²))/π) for t = 2, ..., T (3.35). This means εit|εit−1 is logistically distributed, but since the logistic distribution does not mix with itself, the marginal distributions are not identically distributed over t (although they will all have the same standardized first two moments). The second case, which we will refer to as a marginal logistic AR(1), will be generated from the following distributions: εi1 ∼ logistic(0, √3/π), eεit = log( sin(ρUπ) / sin(ρ(1 − U)π) ) where U ∼ Uniform(0, 1), for t = 2, ..., T (3.36), as proposed in Sim (1993). Now the marginal distributions will be identically logistically distributed. We feel that this gives more credence to a pooled, although misspecified in distribution, approach.
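The conditional logistic AR(1) in (3.35) is straightforward to generate; a minimal sketch (arbitrary seed, ρ, and sample sizes) confirms the property noted above: the marginals for t > 1 are not logistic, but every time period retains mean 0 and variance 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
rho, n_draws, T = 0.8, 200_000, 5
s0 = np.sqrt(3.0) / np.pi                  # logistic scale giving variance 1
s_e = np.sqrt(3.0 * (1 - rho**2)) / np.pi  # innovation scale, variance 1 - rho^2

eps = np.empty((n_draws, T))
eps[:, 0] = rng.logistic(0.0, s0, size=n_draws)
for t in range(1, T):
    # eps_t | eps_{t-1} is logistic by construction (equation (3.35))
    eps[:, t] = rho * eps[:, t - 1] + rng.logistic(0.0, s_e, size=n_draws)

# The marginal distributions differ across t (a sum of logistics is not
# logistic), but the first two moments are preserved for every t.
print(np.round(eps.mean(axis=0), 2))  # all ~ 0.0
print(np.round(eps.var(axis=0), 2))   # all ~ 1.0
```

The variance recursion is immediate: Var(εit) = ρ²·1 + (1 − ρ²) = 1 by induction, since a logistic(0, s) variable has variance s²π²/3.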
Under no serial correlation (ρ = 0) both processes are identical and the ME Logit likelihood is correctly specified. Tables G.30 and G.31 report the de-scaled coefficient estimates. Similar to the Probit case, as the level of serial correlation increases, there is an increasingly positive bias for both AR(1) specifications. Tables G.32 and G.33 report the scaled coefficient estimates for the conditional logistic AR(1) process and the marginal logistic AR(1) process. The ME Logit parameter estimates are scaled by 1/√(π²/3 + σ̂1²), which should match (at least in terms of scaling) the Pooled Heteroskedastic Probit scaled coefficient estimates. As we saw in the Probit case, the bias of the ME Logit estimator is countered by bias in the variance component estimate, σ̂1², which results in unbiased scaled coefficient estimates. In fact, the ME Logit estimator is far superior to the Pooled Heteroskedastic Probit estimator in terms of bias and efficiency. Finally, the APE estimates are presented in Tables G.34 and G.35. As expected, since the marginal logistic AR(1) process produces identically distributed errors, the Pooled Heteroskedastic estimator performs better under it than under the conditional logistic AR(1) data generating process. But in either case the ME Logit estimator has lower bias and is much more efficient than the pooled approach, even though we know that the joint likelihood is misspecified. The results from this simulation suggest that the robustness of the JMLE under serial correlation is not limited to the normal distribution.

3.8 Conclusion

This study has been a comprehensive investigation into the behavior of PMLE and JMLE for panel random coefficient binary response models under serial correlation. After introducing the two estimators and the context in which they are usually implemented, we explored their potential in a diverse simulation study and sought to confirm our results with an application.
Consistent with our initial intuition, there are several points that need to be considered when implementing these estimators. First and foremost, specifying correlated random effects matters enormously whether one considers a PMLE or a JMLE approach. We saw this regardless of the data generating process (DGP1, DGP2, DGP3), the level of serial correlation (ρ = 0, 0.4, 0.8), the type of serial correlation (AR(1) vs AR(2)), or the distribution of the latent error (Probit vs Logit). Our biggest concern is that those who implement Mixed Effects approaches may be swayed by the language into thinking that they can easily model the heterogeneity with such a flexible framework without considering potential correlation with the covariates. Another expected result is that the pooled approach is much quicker to implement, which may matter more for much larger datasets with many more covariates. As we saw in our application, the difference ranged between seconds for the PMLE approach and hours for the JMLE approach. More intriguingly, we find quite a number of surprising results that should change some of the perceptions of JMLE and PMLE. JMLE estimates of the scaled coefficients, ASF, and APE appear to be robust to fairly simple specifications of serial correlation, even though the presence of serial correlation implies that the joint likelihood is misspecified. This is because the bias in the de-scaled parameter estimates is countered by a bias in the variance component estimate. Consequently, interpretation of the de-scaled parameter estimates and variance components is ill-advised when one is concerned there may be serial correlation in the latent error. In simulation, we repeatedly found the JMLE to be the more efficient estimator even under misspecification of the likelihood. There is no theoretical guarantee of this: a pooled estimator could well be more efficient than a misspecified JMLE.
In fact, we do see the PMLE become more efficient than the JMLE in the application, but we were unable to reproduce these results in simulations. Therefore the questions of efficiency, and of the settings in which the PMLE becomes more efficient than the JMLE, remain open and require further investigation. In our discussion of the case of no random effects, it was surprising to see the large number of failures to converge for the ME Probit under no serial correlation, and then the dramatic drop as the level of serial correlation increases. This adds to the computational advantage of the pooled approach. Although it does not exactly model the heterogeneity, it is much more likely to converge when there are no random effects, and at a much faster rate. Should we really care about differentiating between serial correlation and random effects? One could argue that they are both ways for an econometrician to model persistence in the data and that there is no particular reason to prefer one over the other. As we see in the simulation, this idea appears to be consistent with the robustness of the JMLE under serial correlation. But it would warn against strict interpretation of the variance components: in the end they capture the variability of the persistence over time, which is not necessarily equivalent to the true variance of the random effect.

APPENDICES

APPENDIX A

Figures for Chapter 1

Figure A.1: Visual representation of bijective transformations
(a) (b) (c) (d)
Each oval represents the support of either X or Z, the objects inside represent possible realizations in the support, and the lines connecting the realizations represent pairs of realizations that occur in the joint support with positive probability. From left to right: (a) shows a bijective transformation from X to Z and, equivalently, a bijective transformation from Z to X.
There is no variation in one of the random variables that cannot be perfectly described by the variation in the other. (b), (c), and (d) show examples where there is not a bijective transformation. In (b) there is extra variation in X that cannot be explained by Z, and in (c) there is extra variation in Z that cannot be explained by X. The case of an exclusion restriction is presented in (d). Imagine there is an element of Z that is excluded from X and that can take on 3 values. Then for every point in the support of X there are 3 possible realizations in Z that occur with positive probability.

Figure A.2: Parameter estimates from two observationally equivalent models
From top to bottom: intercept estimates where the true values are 0 (Specification 1) or 0.5 (Specification 2), coefficient estimates where the true values are 0.5 (Specification 1) or 0 (Specification 2), and the heteroskedastic coefficient estimates where the true values are 2 (Specification 1) or 1 (Specification 2). 2,000 simulations of sample size 1,000 using the Stata command hetprobit.

APPENDIX B

Proofs and Notation for Chapter 2

Identification

The following lemma is an extension of Corollary 1.4.1 given in Chapter 1 to allow for multivariate X and Z.

Lemma B.1. Let X and Z be vectors of random variables with continuous support. Suppose the following conditions hold:
(i) Z does not contain a constant.
(ii) E(X′X) is non-singular.
(iii) E(Z′Z) is non-singular.
(iv) βo is non-zero.
(v) Each element of Z is a polynomial function of an element of X, such that Zj = Xk^(pjk), where K is the dimension of X and pjk ∈ {1, 2, 3, ...} is the order of the polynomial on the kth term in X that composes the jth term in Z.
Then for all parameters (β, δ) ∈ Θ (the parameter space), if X(βo − exp(Z(δ − δo))β) = 0 with probability 1, then (β, δ) = (βo, δo).

Proof. Suppose there is a (β, δ) ∈ Θ such that X(βo − exp(Z(δ − δo))β) = 0 with probability 1.
Then, given condition (iv), I can rearrange this equality as

Z(δ − δo) = ln( Xβo / (Xβ) )   (B.2)

Let A denote the set of k for which some element of Z is a polynomial in Xk, and for each k ∈ A let p̃k denote the maximum polynomial order on Xk across the elements of Z. Then for each k ∈ A, take the partial derivative with respect to Xk, p̃k + 1 times:

0 = (−1)^(p̃k+1) [ (βko/(Xβo))^(p̃k+1) − (βk/(Xβ))^(p̃k+1) ]   (B.3)

which implies βko/(Xβo) = βk/(Xβ). There are two cases: either βk = βko = 0 for all k ∈ A, or there exists at least one k̂ ∈ A such that βk̂ ≠ 0 and βk̂o ≠ 0. In the first case, the problem reduces to the scenario in which X and Z are not functionally related, and Theorem 1.2.1 of Chapter 1 can be applied to obtain identification. In the second case, equation (B.3) implies βk̂o/βk̂ = Xβo/(Xβ), and plugging this into equation (B.2),

Z(δ − δo) = ln( βk̂o / βk̂ )   (B.4)

the right hand side of which is a constant. By conditions (i) and (iii), equation (B.4) can only hold if δ − δo = 0, and by condition (ii) this implies βo = β.

Proof of Theorem 2.4.1: Using Lemma B.1: parts (i)–(iii) of Assumption 2.4.1 ensure that

E( (xi, h(v2i, zi))′ (xi, h(v2i, zi)) )   (B.5)

is non-singular (shown in the paper). Part (iv) restricts how the heteroskedastic function may be specified, to avoid non-identification due to the non-linear setting, and corresponds to conditions (i), (iii), and (v) of Lemma B.1. Part (v) ensures identification of the heteroskedastic components and corresponds to condition (iv) of Lemma B.1. Applying Lemma B.1, identification follows.
Asymptotics for the Parametric Estimator

Proof of Theorem 2.5.1: Using Theorem 2.6 of Newey and McFadden (1994), since there is no weighting matrix, I merely need to show the following:
(i) E(M(y1i, y2i, zi; π, θ)) = 0 only if π = πo and θ = θo;
(ii) πo ∈ Π and θo ∈ Θ, both of which are compact;
(iii) E(M(y1i, y2i, zi; π, θ)) is continuous at each π ∈ Π and each θ ∈ Θ;
(iv) E(sup_(π,θ)∈Π×Θ ||M(y1i, y2i, zi; π, θ)||) < ∞.
Identification, part (i), holds under Assumption 2.4.1. Part (ii) is assumed. Part (iii) is evident given the linear LS and Probit specifications, and part (iv) is satisfied given the finite second moment conditions in Assumption 2.4.1; in more detail,

||M(y1i, y2i, zi; π, θ)||
≤ ||(y2i − m(zi)π)m(zi)|| + ||Si(π, β, γ, δ)||
≤ ||(y2i − m(zi)π)m(zi)|| + [ |(y1i − Φi(π, θ))φi(π, θ)| / (Φi(π, θ)(1 − Φi(π, θ)) exp(giδ)) ] × (||xi|| + ||hi(π)|| + ||xiβ + hi(π)γ|| ||gi||)
≤ ||(y2i − m(zi)π)m(zi)|| + [ max(λi(π, β, γ, δ), λi(π, −β, −γ, δ)) / exp(giδ) ] × (||xi|| + ||hi(π)|| + ||xiβ + hi(π)γ|| ||gi||)
≤ ||(y2i − m(zi)π)m(zi)|| + (C / exp(giδ)) ( 1 + |xiβ + hi(π)γ| / exp(giδ) ) × (||xi|| + ||hi(π)|| + ||xiβ + hi(π)γ|| ||gi||)

where λi(·) is the inverse Mills ratio, C is a finite constant, and, notationally, hi(π) = h(y2i − m(zi)π, zi) and gi = g(y2i, zi). Therefore E(sup_(π,θ)∈Π×Θ ||M(y1i, y2i, zi; π, θ)||) is finite as long as the second moments of m(zi), xi, hi(π), and gi are bounded (which is presumed under Assumption 2.4.1).

Proof of Theorem 2.5.2: Using Theorem 6.1 of Newey and McFadden (1994), I merely need to show the following:
(i) πo ∈ int(Π) and θo ∈ int(Θ), both of which are compact;
(ii) M(y1i, y2i, zi; π, θ) is continuously differentiable in a neighborhood of (πo, θo) with probability approaching one;
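The dominance argument above relies on the inverse Mills ratio growing at most linearly in its argument, λ(a) ≤ C(1 + |a|). A quick numerical check of this envelope (a sketch; taking C = 1, which suffices for the standard normal):

```python
import math


def phi(a):
    """Standard normal pdf."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def Phi(a):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def inv_mills(a):
    """Inverse Mills ratio lambda(a) = phi(a)/Phi(a)."""
    return phi(a) / Phi(a)


# lambda(a) <= 1 + |a| for all a, hence max(lambda(a), lambda(-a)) <= 1 + |a|,
# the linear envelope used to dominate the Probit score.
grid = [x / 100.0 for x in range(-600, 601)]
gap = max(inv_mills(a) - (1.0 + abs(a)) for a in grid)  # negative everywhere
```

The gap stays strictly negative on the grid, consistent with the bound invoked in the proof.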
(iii) E(M(y1i, y2i, zi; πo, θo)) = 0;
(iv) E(||M(y1i, y2i, zi; πo, θo)||²) is finite;
(v) E(sup_(π,θ)∈Π×Θ ||∇(π,θ) M(y1i, y2i, zi; π, θ)||) < ∞;
(vi) G′G is non-singular, where G = E(∇(π,θ) M(y1i, y2i, zi; πo, θo)).
Part (i) is assumed and part (ii) is evident given the linear LS and Probit specifications. Part (iii) holds by Assumption 2.3.1 (correct conditional mean specification in the first stage and Fisher consistency in the second stage). Part (iv) can be verified:

||M(y1i, y2i, zi; πo, θo)||² = ||(y2i − m(zi)πo)m(zi)||² + ||Si(πo, βo, γo, δo)||²
= ||(y2i − m(zi)πo)m(zi)||² + [ (y1i − Φi(πo, θo))² φi(πo, θo)² / (Φi(πo, θo)²(1 − Φi(πo, θo))² exp(2giδo)) ] × (||xi||² + ||hi(πo)||² + ||xiβo + hi(πo)γo||² ||gi||²)

Applying the law of iterated expectations,

E(||M(y1i, y2i, zi; πo, θo)||² | zi, y2i)
= ||(y2i − m(zi)πo)m(zi)||² + [ E((y1i − Φi(πo, θo))² | zi, y2i) φi(πo, θo)² / (Φi(πo, θo)²(1 − Φi(πo, θo))² exp(2giδo)) ] × (||xi||² + ||hi(πo)||² + ||xiβo + hi(πo)γo||² ||gi||²)
= ||(y2i − m(zi)πo)m(zi)||² + [ Φi(πo, θo)(1 − Φi(πo, θo)) φi(πo, θo)² / (Φi(πo, θo)²(1 − Φi(πo, θo))² exp(2giδo)) ] × (||xi||² + ||hi(πo)||² + ||xiβo + hi(πo)γo||² ||gi||²)
= ||(y2i − m(zi)πo)m(zi)||² + [ λi(πo, βo, γo, δo) λi(πo, −βo, −γo, δo) / exp(2giδo) ] × (||xi||² + ||hi(πo)||² + ||xiβo + hi(πo)γo||² ||gi||²)

Since λi(πo, βo, γo, δo)λi(πo, −βo, −γo, δo) is bounded, taking the expectation of the above equation shows that E(||M(y1i, y2i, zi; πo, θo)||²) is finite as long as the second moments of m(zi), xi, hi(π), and gi are bounded (which is presumed under Assumption 2.4.1). Part (v) follows from the boundedness of the first derivative of the inverse Mills ratio and the finite second moments of m(zi), xi, hi(π), and gi.
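The last substitution uses the algebraic identity λ(a)λ(−a) = φ(a)² / (Φ(a)(1 − Φ(a))), which holds because φ is even and Φ(−a) = 1 − Φ(a). A quick numerical check (a sketch using the standard library):

```python
import math


def phi(a):
    """Standard normal pdf."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def Phi(a):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))


def mills_product(a):
    """lambda(a) * lambda(-a), with lambda(a) = phi(a)/Phi(a)."""
    return (phi(a) / Phi(a)) * (phi(-a) / Phi(-a))


def phi_ratio(a):
    """phi(a)^2 / (Phi(a) * (1 - Phi(a)))."""
    return phi(a) ** 2 / (Phi(a) * (1.0 - Phi(a)))


grid = [x / 50.0 for x in range(-250, 251)]
max_diff = max(abs(mills_product(a) - phi_ratio(a)) for a in grid)
```

The two expressions agree to machine precision over the grid, and at a = 0 both equal 2/π.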
In showing (vi), let G = (Gπ, Gθ) where

Gπ = [ G1π ; G2π ] = [ E(m(zi)′m(zi)) ; E(Γi(πo, θo) ωi(πo, θo)′ m(zi)) ]
Gθ = [ G1θ ; G2θ ] = [ 0 ; −E(Δi(πo, θo) ωi(πo, θo)′ ωi(πo, θo)) ]

with

Γi(πo, θo) = (∂hi(πo)/∂v2) γo Δi(πo, θo)
Δi(πo, θo) = φi(πo, θo)² / (Φi(πo, θo)(1 − Φi(πo, θo)) exp(2giδo))
ωi(πo, θo) = ( xi, hi(πo), −(xiβo + hi(πo)γo)gi )

Then

G′G = [ Gπ′Gπ  Gπ′Gθ ; Gθ′Gπ  Gθ′Gθ ]

where

Gπ′Gπ = E(m(zi)′m(zi))E(m(zi)′m(zi)) + E(Γi(πo, θo)m(zi)′ωi(πo, θo))E(Γi(πo, θo)ωi(πo, θo)′m(zi))
Gπ′Gθ = −E(Γi(πo, θo)m(zi)′ωi(πo, θo))E(Δi(πo, θo)ωi(πo, θo)′ωi(πo, θo))
Gθ′Gθ = E(Δi(πo, θo)ωi(πo, θo)′ωi(πo, θo))E(Δi(πo, θo)ωi(πo, θo)′ωi(πo, θo))

which is non-singular by Assumption 2.4.1.

Identification and Asymptotics for the Semi-Parametric Estimator

Proof of Theorem 2.7.1: Write u1i = ho(zi, v2i) + εi where Med(εi|zi, v2i) = 0. Plugging into equation (2.1) and redefining x̃i = (xi, ho(zi, v2i)) and β̃o′ = (βo′, 1), one can apply Theorem 2.1 of Khan (2013) to

yi = 1{x̃i β̃o + εi ≥ 0}   (B.6)

and obtain the observational equivalence result.

Proof of Theorem 2.7.2: Identification of mo(·) in the first stage is immediate from part (i) of Assumption 2.7.6. For identification of the second stage parameters and functions, suppose there are β−k ∈ B, h(·) ∈ H, and g(·) ∈ G such that (β−k, h(·), g(·)) ≠ (β−ko, ho(·), go(·)) and

( x−kiβ−k + xki + h(v2i, zi) ) / exp(g(y2i, zi)) = ( x−kiβ−ko + xki + ho(v2i, zi) ) / exp(go(y2i, zi))   (B.7)

with probability 1. By Assumption 2.7.6 (iii), xki conditional on x−ki and h(v2i, zi) has a density with respect to the Lebesgue measure that is positive on ℝ for any h(·) ∈ H.
So for any realization of (x−ki, v2i, zi), there exists an xki such that

x−kiβ−ko + xki + ho(v2i, zi) > 0 and x−kiβ−k + xki + h(v2i, zi) < 0   (B.8)

and since the scaling by exp(g(y2i, zi)) is always positive for any g(·) ∈ G, this is a contradiction. Separate identification of βo and ho(v2i, zi) is maintained by part (ii) and the CMR (part (iv)) of Assumption 2.7.6.

Before providing the remaining proofs from Section 2.7.2, I briefly outline some notation. Let a = (a1, ..., ak) be a 1 × k vector of non-negative integers; then the |a|-th derivative of a function f : ℝ^k → ℝ is defined as

∇^a f(x) = ∂^|a| f(x) / (∂x1^a1 ⋯ ∂xk^ak)   (B.9)

where |a| = Σ_(i=1)^k ai. For any s > 0, let [s] denote the largest integer smaller than s. Define the s-th Hölder norm, ||·||Λs, as

||f||Λs = Σ_(|a|≤[s]) sup_(x∈X) |∇^a f(x)| + Σ_(|a|=[s]) sup_(x≠x̄) |∇^a f(x) − ∇^a f(x̄)| / ||x − x̄||^(s−[s])   (B.10)

where ||·|| denotes the Euclidean norm. Define a Hölder space with smoothness s as

Λ^s(X) = {f ∈ C^[s](X) : ||f||Λs < ∞}   (B.11)

where C^r(X) is the set of continuous functions on X that have continuous first r derivatives.
Define a weighted Hölder ball with radius c, smoothness s, and weight function (1 + ||·||²)^(−w/2) with w > 0,

Λ^s_c(X, w) = {f ∈ Λ^s(X) : ||f(·)(1 + ||·||²)^(−w/2)||Λs ≤ c < ∞}   (B.12)

Finally, define the following two norms:

||f(x)||2 = ( ∫_X f(x)² dFx )^(1/2)   (B.13)
||f(x)||∞,w = sup_(x∈X) |f(x)(1 + ||x||²)^(−w/2)|   (B.14)

Proof of Corollary 2.7.1: Write

ψ̂i = φ( (xiβ̂ + ĥ(v̂2i, zi)) / exp(ĝ(y2i, zi)) ) β̂j / exp(ĝ(y2i, zi))
ψi = φ( (xiβo + ho(v2i, zi)) / exp(go(y2i, zi)) ) βjo / exp(go(y2i, zi))

By the triangle inequality,

| n⁻¹ Σ_(i=1)^n ψ̂i − E(ψi) | ≤ | n⁻¹ Σ_(i=1)^n (ψ̂i − ψi) | + | n⁻¹ Σ_(i=1)^n ψi − E(ψi) |   (B.15)

The first term is op(1) by the results of Theorem 2.7.3 (consistency in the ||·||∞,w1 norm), and the second term is op(1) by the Weak Law of Large Numbers, noting that Var(ψi) is bounded.

APPENDIX C

Simulation Details for Chapter 2

General Control Function in the Demand for Premium Cable

First I construct the distribution of markets and operators, in which there is one operator in each market (but operators serve multiple markets). The market ID is assigned using a truncated (from 0 to 172) exponential distribution with mean 40, so that the higher the market ID (rounded up to the nearest integer), the smaller the market size.¹ I only allow for two operators, to mimic the competition between Time Warner and AT&T (assigned with equal probability to each market). The product characteristics, number of premium channels offered (z11m) and cost shifter (z2m), are the same within a market. The number of premium channels is drawn from a truncated (0 to 10) Normal(4.5, 6.25). The cost shifter is a function of the quality (number of channels) and the operator (efficient/inefficient operator):

costm = 10 + 5om + 2numchm + ε1m

where om ∈ {0, 1} is the operator in the market and ε1m ∼ Normal(0, 1) is a market level cost shock. The endogenous variable price is constructed as a function of the number of channels, cost, and unobserved quality (v2m):

pm = 7.5 + 2numchm + costm + v2m   (C.1)

where v2m is drawn from a Uniform(−8, 8). Then to construct the consumer characteristics (z12i), I draw e1, e2 from independent standard normal distributions; age and income

¹The market identifiers could be constructed any way; this was just so there was a good range in market size.
are constructed as follows:

agei = ⌈20 + 40Φ(0.4e1 − 0.1(e1² − 1) + √0.85 e2)⌉

d1i = 1{Φ(e1) < 0.196}
d2i = 1{0.196 ≤ Φ(e1) < 0.44}
d3i = 1{0.44 ≤ Φ(e1) < 0.685}
d4i = 1{0.685 ≤ Φ(e1) < 0.86}
d5i = 1{Φ(e1) ≥ 0.86}

incomei = Σ_(g=1)^5 dgi incg

where

inc1 ∼ Uniform(10, 25)
inc2 ∼ Uniform(25, 50)
inc3 ∼ Uniform(50, 75)
inc4 ∼ Uniform(75, 100)
inc5 ∼ 20 Exp(1) + 100

Consequently, age and income are positively correlated. The last consumer characteristic, household size, is constructed as the following function of age and income:

hhsi = ⌈exp(−0.75 + 0.0015 incomei + 0.03 agei + ε2i)⌉

where ε2i is drawn from a truncated (−1 to 1) Normal(0, 0.45). In violation of CF-CI, the conditional distribution of u1i is

u1i | z11m, z12i, z2m, v2m ∼ Normal(0.32v2m + 0.15v2m × numchm − 0.02v2m × agei, 1)   (C.2)

So there is no heteroskedasticity in the latent error, but the unobserved product attribute (advertisement) has an interactive effect with the number of channels (the addition of more channels matters more if it was advertised) and with age (younger consumers may be more susceptible to advertisement). Finally, the binary dependent variable y1i is calculated from

y1i = 1{ −9.8 − 0.14pm + 0.017pmd2i + 0.03pmd3i + 0.035pmd4i + 0.045pmd5i + 0.01numchm + 0.005incomei + 0.03hhsi + 0.005agei + 0.006agei² + u1i > 0 }   (C.3)

The summary statistics are presented in Table E.1. Similar to the real data in PT, conditional on choosing cable, about 1/3 of the sample selects premium cable.

ASF Estimates for the Effect of Income on Homeownership

In the construction of the exogenous variables, first z11i (age), z21i and z22i (education of wife) are determined (z11i independent of z21i and z22i; z21i and z22i are mutually exclusive), and then z12i (children in household) and z23i (wife working) are functions of the other exogenous variables to induce correlation.
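As an aside, the CF-CI violation built into equation (C.2) of the premium cable design can be verified by simulation: the latent error is homoskedastic, but its conditional mean depends on v2 interacted with the covariates, which a least squares regression of u1 on those interactions recovers. A minimal sketch assuming NumPy, with simplified stand-in draws for the covariates rather than the full design above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

numch = rng.integers(0, 11, size=n).astype(float)  # stand-in for channels offered
age = rng.uniform(20.0, 60.0, size=n)              # stand-in for consumer age
v2 = rng.uniform(-8.0, 8.0, size=n)                # unobserved quality, as in the text

# Equation (C.2): homoskedastic latent error whose conditional MEAN depends on
# v2 interacted with numch and age -- exactly the pattern CF-CI rules out.
u1 = rng.normal(0.32 * v2 + 0.15 * v2 * numch - 0.02 * v2 * age, 1.0)

# Least squares of u1 on (1, v2, v2*numch, v2*age) recovers (0.32, 0.15, -0.02).
X = np.column_stack([np.ones(n), v2, v2 * numch, v2 * age])
coef, *_ = np.linalg.lstsq(X, u1, rcond=None)
```

A control function approach that includes only v2 (no interactions) would therefore misspecify E(u1 | v2, z), which is what motivates the general control function.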
Since the sample consists of 981 married men aged 30 to 50, z11i is drawn from a Normal(41.8, 60) truncated below at 30 and above at 50. Let e be drawn from a Uniform(0, 1); then the education of the wife is determined by

z21i = 1 if 0.482 < e ≤ 0.897, and 0 otherwise
z22i = 1 if e > 0.897, and 0 otherwise

Since it seems reasonable for having young children in the household to be negatively correlated with age and with higher education, z12i is calculated as

z12i = 1{261.2 − 5z11i − 20z21i − 50z22i + ε12i > 0}   (C.4)

where ε12i is drawn from a Normal(0, 30). Since it seems reasonable that the probability of a wife working is negatively correlated with having young children in the household and with age, but positively correlated with higher education, z23i is calculated as

z23i = 1{17.6 − 0.3z11i − 5z12i + 3z21i + 10z22i + ε23i > 0}   (C.5)

where ε23i is drawn from a Normal(0, 5). The conditional mean of y2i is

y2i = 7.2 + 0.0117z11i + 0.0911z12i + 0.0642z21i + 0.1291z22i + 0.0911z23i + v2i

where v2i is drawn from a Normal(0, 0.088). The linear index is

xiβo = 3.8y2i + 0.09z11i + z12i   (C.6)

so the conditional distribution of u1i is only a function of the linear index and v2i,

u1i | v2i, z11i, z12i, z21i, z22i, z23i ∼ Normal( −2v2i − 2v2i xiβo, exp(2(0.01 xiβo)) )

and the binary dependent variable is calculated from

y1i = 1{−34 + xiβo + u1i > 0}   (C.7)

Table E.4 presents the summary statistics of the simulated data as well as the summary statistics from Rothe (2009) as a comparison. The SML estimator proposed in Rothe (2009) maximizes the following log likelihood:

β̂SML = arg max_β Σ_(i=1)^n [ y1i log( Ĝ(xiβ, v̂2i) ) + (1 − y1i) log( 1 − Ĝ(xiβ, v̂2i) ) ]   (C.8)

where Ĝ(xiβ, v̂2i) = Φ( Σ_(j=1)^n Kh([xjβ, v̂2j] − [xiβ, v̂2i]) y1j / Σ_(j=1)^n Kh([xjβ, v̂2j] − [xiβ, v̂2i]) ) and Kh(·) is a bivariate kernel based on bandwidth h and scaled by that bandwidth.
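The smoothing step inside the SML objective (C.8) is a Nadaraya–Watson estimate of P(y1 = 1 | xβ, v̂2), which is then passed through Φ. A sketch of that inner step with a Gaussian product kernel (the function names and the simplified leave-in weighting are ours):

```python
import numpy as np


def nw_prob(index, v2hat, y, h1, h2):
    """Nadaraya-Watson estimate of P(y=1 | index, v2hat) at each sample point,
    using a bivariate Gaussian product kernel with bandwidths (h1, h2)."""
    n = len(y)
    g = np.empty(n)
    for i in range(n):
        w = (np.exp(-0.5 * ((index - index[i]) / h1) ** 2)
             * np.exp(-0.5 * ((v2hat - v2hat[i]) / h2) ** 2))
        g[i] = (w @ y) / w.sum()  # weighted average of the 0/1 outcomes
    return g


def rule_of_thumb_h(x):
    """Bandwidth used in the text: h = 1.06 * sd(x) * n^(-1/5)."""
    return 1.06 * np.std(x) * len(x) ** (-0.2)
```

In (C.8) these fitted values are additionally transformed by Φ before entering the log likelihood; the sketch above only illustrates the kernel-weighted probability estimate itself.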
In order to eliminate the asymptotic bias, the SML estimator requires the use of higher order kernels. However, Rothe finds that lower order kernels tend to perform better in finite samples; therefore I use a first order Gaussian kernel. The normal CDF transformation ensures the estimates fall between 0 and 1 and imposes the correct distribution as a transformation; I find that this helps the parameter estimates. In determining optimal bandwidths, Rothe suggests maximizing the above likelihood with respect to both the parameters β and the bandwidths h. I find that this can result in a number of extreme outliers that corrupt the analysis. Therefore I equate the bandwidths to the optimal value (given the distribution is truly normal) as a function of the parameters, h = 1.06 √(Var([xiβ, v̂2i])) n^(−1/5). This is then plugged into the likelihood, which is maximized with respect to the parameters β. The proposed Het Probit (GCF) estimator maximizes the following likelihood:

(β̂, γ̂, δ̂)HetProbit(GCF) = arg max_(β,γ,δ) Σ_(i=1)^n [ y1i log( G(xi, v̂2i; β, γ, δ) ) + (1 − y1i) log( 1 − G(xi, v̂2i; β, γ, δ) ) ]   (C.9)

where G(xi, v̂2i; β, γ, δ) = Φ( (xiβ + v̂2iγ1 + v̂2i xiγ2) / exp(xiδ) ). I found that the estimates are sensitive to the starting values; therefore I used [1, 0.5, 0.75, 1.5, 2] × (βo, γo, δo) as the starting values and chose the estimates with the largest log likelihood.

APPENDIX D

Figures for Chapter 2

Figure D.1: Effect of Heteroskedasticity on Parameter Estimate
Shows the empirical distribution of estimates of β in the model y1i = 1{y2iβ + u1i > 0} and y2i = zi + v2i. The unobserved heterogeneity is generated from a heteroskedastic bivariate normal as in equation (2.6), where ρ(zi) = 0.6 and σ1(zi) = σ2(zi) = exp(0.25zi). The Homoskedastic estimator assumes the data generating process in equation (2.2), while the Heteroskedastic estimator correctly scales by the true conditional variance. Calculated from 1,000 simulations of sample size 1,000.
General Control Function in the Demand for Premium Cable

Figure D.2: ASF for Income equal to $85,000
ASF evaluated over different prices for an additional 5 channels of premium cable offered to a consumer who is 35 years old in a family of 3 with income equal to $85,000.

ASF Estimates for the Effect of Income on Homeownership

Figure D.3: ASF Estimates for Misspecified Models
ASF evaluated over different levels of log(total income) for a 40 year old with children under the age of 16 in the household.

Figure D.4: Consequence of CF-LI Assumption on ASF Estimates
ASF evaluated over different levels of log(total income) for a 40 year old with children under the age of 16 in the household.

Empirical Example

Figure D.5: Comparison of ASF for Families with No Children
1991 CPS data on married women's labor force participation. ASF evaluated over different levels of non-wife income for a family with no children in the household.

Figure D.6: Comparison of ASF for Families with Young Children Only
1991 CPS data on married women's labor force participation. ASF evaluated over different levels of non-wife income for a family with only young (under 6) children in the household.

Figure D.7: Comparison of ASF for Families with Old Children Only
1991 CPS data on married women's labor force participation. ASF evaluated over different levels of non-wife income for a family with only old (over 6) children in the household.

Figure D.8: Comparison of ASF for Families with Both Young and Old Children
1991 CPS data on married women's labor force participation. ASF evaluated over different levels of non-wife income for a family with both old (over 6) and young (under 6) children in the household.

Extension: Semi-Parametric Distribution Free Estimator

Figure D.9: Logistic Distribution (h1o = v2i)
1,000 simulations of sample size 1,000.

Figure D.10: Uniform Distribution (h1o = v2i)
1,000 simulations of sample size 1,000.
Figure D.11: Student T Distribution (h1o = v2i)
1,000 simulations of sample size 1,000.

Figure D.12: Gaussian Mixture Distribution (h1o = v2i)
1,000 simulations of sample size 1,000.

Figure D.13: Logistic Distribution with Linear GCF (h2o)
1,000 simulations of sample size 1,000.

Figure D.14: Uniform Distribution with Linear GCF (h2o)
1,000 simulations of sample size 1,000.

Figure D.15: Student T Distribution with Linear GCF (h2o)
1,000 simulations of sample size 1,000.

Figure D.16: Gaussian Mixture Distribution with Linear GCF (h2o)
1,000 simulations of sample size 1,000.

Figure D.17: Logistic with Non-Parametric GCF (h3o)
1,000 simulations of sample size 1,000.

Figure D.18: Uniform with Non-Parametric GCF (h3o)
1,000 simulations of sample size 1,000.

Figure D.19: Student T with Non-Parametric GCF (h3o)
1,000 simulations of sample size 1,000.

Figure D.20: Gaussian Mixture with Non-Parametric GCF (h3o)
1,000 simulations of sample size 1,000.

Figure D.21: Heteroskedastic Logistic (h1o = v2i)
1,000 simulations of sample size 1,000.

Figure D.22: Heteroskedastic Logistic with Linear GCF (h2o)
1,000 simulations of sample size 1,000.

Figure D.23: Heteroskedastic Logistic with Non-Parametric GCF (h3o)
1,000 simulations of sample size 1,000.

APPENDIX E

Tables for Chapter 2

General Control Function in the Demand for Premium Cable

Table E.1: Summary Statistics

Variable                            Mean     Std. dev.
Premium Cable          (y1im)       0.329      0.470
Age                    (z12i)      40.598     11.513
Income (in thousands)  (z12i)      60.019     34.782
Family Size            (z12i)       2.596      1.387
Price                  (y2m)       40.691     11.594
Number of Channels     (z11m)       5.139      2.420
Cost                   (z2m)       22.912      6.086

1,000 simulations of sample size 7,677.
Table E.2: Comparison of Logit Parameter Estimates

Variables                  Logit (1)       Logit (CV) (2)  Logit (GCF) (3)  Logit (Over) (4)  TRUE
Price                      -0.032 (0.011)  -0.109 (0.018)  -0.141 (0.021)   -0.141 (0.021)    -0.14
Price × Income Group 2      0.015 (0.005)   0.015 (0.005)   0.017 (0.006)    0.017 (0.006)     0.017
Price × Income Group 3      0.026 (0.006)   0.026 (0.006)   0.030 (0.007)    0.030 (0.007)     0.03
Price × Income Group 4      0.031 (0.008)   0.031 (0.008)   0.035 (0.008)    0.035 (0.009)     0.035
Price × Income Group 5      0.040 (0.010)   0.040 (0.010)   0.045 (0.011)    0.045 (0.011)     0.045
Number of Channels         -0.231 (0.046)   0.094 (0.076)   0.011 (0.091)    0.011 (0.091)     0.01
Income                      0.002 (0.004)   0.002 (0.004)   0.005 (0.004)    0.005 (0.004)     0.005
Household Size              0.024 (0.038)   0.024 (0.038)   0.031 (0.044)    0.031 (0.044)     0.03
Age                         0.077 (0.114)   0.073 (0.115)   0.025 (0.147)    0.024 (0.147)     0.005
Age²                        0.004 (0.001)   0.004 (0.001)   0.006 (0.002)    0.006 (0.002)     0.006

1,000 simulations of sample size 7,677; standard deviations are given in parentheses. Logit (CV) only includes the control variable v2i to address the issue of endogeneity, Logit (GCF) uses the correct specification that allows for a general control function, and Logit (Over) over-specifies the control function by including terms that are not in the true specification.

Table E.3: Comparison of Price Elasticity Estimates

Estimator       Mean     Std. dev.
OLS              0.485    12.009
CF               0.082    55.989
Logit           -0.386     0.267
Logit (CV)      -2.536     0.489
Logit (GCF)     -3.348     0.571
Logit (Over)    -3.350     0.571
TRUE            -3.320

1,000 simulations of sample size 7,677.

ASF Estimates for the Effect of Income on Homeownership

Table E.4: Comparison of Summary Statistics

                                        Rothe                Simulated Data
Variable                          Mean     Std. dev.     Mean     Std. dev.
Homeowner              (y1)       0.599    0.490         0.608    0.488
log(total income)      (y2)       7.853    0.324         7.857    0.316
Age                    (z11)     40.613    5.374        40.633    5.330
Children in HH         (z12)      0.848    0.359         0.851    0.356
Education of Wife
  Intermediate Degree  (z21)      0.415    0.493         0.422    0.494
  High Degree          (z22)      0.103    0.304         0.111    0.314
Wife Working           (z23)      0.699    0.459         0.689    0.463

1,000 simulations of sample size 981.

Table E.5: Comparison of Parameter Estimates

Rothe:
Variables               RF (1)           Probit (2)       Probit (CV) (3)   SML (4)
log(Income)   (y2)                       2.1343 (0.5571)   4.7923 (1.5135)  3.8533 (1.3338)
Age           (z11)     0.0117 (0.0117)  0.2076 (0.0257)   0.0863 (0.0209)  0.0982 (0.0889)
Child         (z12)     0.0911 (0.0194)  1                 1                1
CV            (v̂2)                                        -3.0348 (1.3048)
Ed. of Wife
  Intm.       (z21)     0.0642 (0.0185)
  High        (z22)     0.1291 (0.0298)
Wife Emp      (z23)     0.0911 (0.0194)
R²                      0.1072

Simulated Data:
Variables               RF (5)           Probit (6)       Probit (CV) (7)   SML (8)          Het-Probit (GCF) (9)
log(Income)   (y2)                       1.3789 (0.0122)   4.1969 (0.0725)  3.9605 (0.0237)  3.9202 (0.0511)
Age           (z11)     0.0117 (0.0001)  0.1084 (0.0008)   0.0971 (0.0010)  0.0925 (0.0007)  0.0907 (0.0008)
Child         (z12)     0.0911 (0.0009)  1                 1                1                1
CV            (v̂2)                                        -2.6510 (0.0600)
Ed. of Wife
  Intm.       (z21)     0.0646 (0.0008)
  High        (z22)     0.1286 (0.0012)
Wife Emp      (z23)     0.0914 (0.0008)
R²                      0.1252

1,000 simulations of sample size 981. Standard errors (for Rothe) and standard deviations (for Simulated Data) are given in parentheses. RF reports the reduced form first stage estimates; Probit does not address endogeneity at all; Probit (CV) is the Rivers and Vuong (1988) estimator, a Probit model that includes only the control variable v̂2i as an additional covariate to address endogeneity; SML is the estimator proposed in Rothe (2009); Het-Probit (GCF) is the proposed heteroskedastic Probit with a flexible control function. Since coefficients are only identified up to scale, the coefficients in columns (2)-(4) and (6)-(9) are normalized so the coefficient on Children in HH is 1, allowing comparisons across the different specifications. True values of the coefficients on log(Income) and Age are 3.80 and 0.09, respectively.
Table E.6: APE Results and Simulated Distribution (True APE = 0.6448)

Specification         Mean     SD       10%      25%      50%      75%      90%
Het-Probit (GCF)      0.6260   0.0034   0.4839   0.5603   0.6350   0.7017   0.7528
SML (Sieve)           0.6996   0.0025   0.5976   0.6475   0.6972   0.7512   0.8035
Probit                0.2851   0.0015   0.2247   0.2546   0.2858   0.3170   0.3434
Probit (CV)           0.5802   0.0037   0.4274   0.5098   0.5922   0.6643   0.7184
Lin. Prob. (OLS)      0.3117   0.0016   0.2462   0.2795   0.3122   0.3451   0.3739
Lin. Prob. (2SLS)     0.6960   0.0062   0.4553   0.5620   0.6886   0.8213   0.9475

1,000 simulations of sample size 981.

Empirical Example

Table E.7: Summary Statistics

Variables                         Mean     Std. Dev.   Mean (If Employed=0)   Mean (If Employed=1)
Employed (y1)                     0.583    0.493
Non-Wife Inc ($1000) (y2)        30.269   27.212        34.771                 27.053
Education (z11)                  12.984    2.615        12.395                 13.405
Experience (z12)                 20.444   10.445        22.080                 19.274
Has kids (age<6) (z13)            0.279    0.449         0.324                  0.247
Has kids (age≥6) (z14)            0.308    0.462         0.259                  0.342
Husband's Education (z2)         13.148    2.977        12.811                 13.388
Observations                      5,634                  2,348                  3,286

1991 CPS data on married women's labor force participation.

Table E.8: Coefficient Estimates for Married Women's LFP Probit Het Probit Probit Het Probit (2) (3) -0.071 (0.011) -0.058 (0.016) -0.005 (0.014) 1 -0.168 (0.028) -2.782 (0.795) 1.102 (0.665) -0.071 (0.010) -0.034 (0.015) -0.010 (0.011) 1 -0.134 (0.036) -2.870 (0.813) 1.337 (0.614) (CV) (4) -0.024 (0.029) -0.069 (0.021) -0.008 (0.017) 1 -0.224 (0.052) -3.432 (1.015) 1.036 (0.770) (CV) (5) -0.011 (0.026) -0.045 (0.019) -0.017 (0.015) 1 -0.208 (0.047) -3.886 (0.996) 1.322 (0.774) Probit (GCF) (6) -0.073 (1.873) 0.010 (3.128) 0.068 (3.825) 1 -0.221 (2.946) -5.693 (363.525) -1.274 (484.991) Het Probit (GCF) (7) -0.061 (0.025) 0.030 (0.037) 0.065 (0.035) 1 -0.223 (0.062) -6.263 (1.816) -1.196 (1.142) SML (Sieve) (8) -0.072 (0.004) 0.031 (0.003) 0.059 (0.007) 1 -0.223 (0.011) -6.031 (0.259) -1.306 (0.117) Variables Non-wife Income RF (1) Non-wife Income × Has Kids (Age<6) × Has Kids (Age≥6) Non-wife Income Education Experience Has
Kids (Age<6) Has Kids (Age≥6) Husband's Education Husband's Education × Has Kids (Age<6) Husband's Education × Has Kids (Age≥6) 0.002 0.003 (2.42×10−4) (2.68×10−4) 1.09×10−4 (2.40×10−5) 1.10×10−4 (2.3×10−5) (4.54×10−4) 0.002 0.014 (0.004) 0.012 (0.002) 1991 CPS data on married women's labor force participation. Standard errors given in parentheses and calculated using 100 bootstrap replications. The F-test in the reduced form is a joint test of significance for the terms that include the instrument, husband's education.

Table E.8 (cont'd) RF Probit Het Probit Variables (1) (2) (3) Probit Het Probit Probit Het Probit (CV) (4) (CV) (5) (6) (GCF) (GCF) (7) SML (Sieve) (8) Control Function v̂2i v̂2i × Has Kids (Age<6) v̂2i × Has Kids (Age≥6) Heteroskedasticity Non-wife Income Education Experience F-Stat 45.304 -0.062 (0.035) -0.080 (0.031) -0.006 (4.006) -0.093 (5.647) -0.088 (7.727) -0.004 (0.001) 1.25×10−4 (1.92×10−4) 0.008 (0.006) -0.004 (0.001) 0.003 (0.005) 0.009 (0.006) -0.025 (0.032) -0.088 (0.056) -0.091 (0.045) -0.004 (0.001) 0.000 (0.001) 0.005 (0.006) 1991 CPS data on married women's labor force participation. Standard errors given in parentheses and calculated using 100 bootstrap replications. The F-test in the reduced form is a joint test of significance for the terms that include the instrument, husband's education.

Table E.9: Wald Test Results Null Hypothesis Alternative Hypothesis Non-Wife Income is Exogenous Non-Wife Income is Endogenous Control Variable Homoskedasticity General Control Heteroskedasticity Function Wald Statistic p-value 4.582 0.032 12.219 0.007 4.749 0.029 12.987 0.005 4.949 0.084 7.109 0.029 15.213 0.002 15.851 0.001 10.658 0.014 Additional Assumptions: Homoskedasticity Endogeneity (CV) Endogeneity (GCF) x x x x x x x x x 1991 CPS data on married women's labor force participation. Wald statistics calculated using bootstrapped standard errors.
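The Wald statistics above are built from a bootstrap estimate of the coefficient covariance matrix. A minimal sketch of that generic computation for a null of the form Rθ = 0; the arrays below are illustrative, not the dissertation's estimates:

```python
import numpy as np

def wald_statistic(theta_hat, boot_draws, R):
    """Wald test of H0: R @ theta = 0 using a bootstrapped covariance.

    theta_hat  : point estimates, shape (k,)
    boot_draws : bootstrap replicates of theta_hat, shape (B, k)
    R          : restriction matrix, shape (q, k)
    Asymptotically chi-squared with q degrees of freedom under H0.
    """
    V = np.cov(np.asarray(boot_draws), rowvar=False)   # (k, k) bootstrap covariance
    r = R @ np.asarray(theta_hat, dtype=float)
    return float(r @ np.linalg.solve(R @ V @ R.T, r))

# Illustrative: test whether the first of two coefficients is zero
theta = np.array([1.0, 0.0])
draws = np.array([[1.0, 0.0], [3.0, 0.0], [1.0, 2.0], [3.0, 2.0]])
R = np.array([[1.0, 0.0]])
print(wald_statistic(theta, draws, R))
```

The p-values reported in Table E.9 would then come from the chi-squared distribution with degrees of freedom equal to the number of restrictions.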
Table E.10: APE Estimates for Non-Wife Income Effect on Wife's LFP

Estimator            APE        SE
OLS                 -0.00333    0.00023
2SLS                -0.00253    0.00095
Probit              -0.00266    0.00032
Het Probit          -0.00331    0.00020
Probit (CV)         -0.00155    0.00076
Het Probit (CV)     -0.00116    0.00088
Probit (GCF)        -0.00180    0.00093
Het Probit (GCF)     0.00027    0.00103

1991 CPS data on married women's labor force participation. Standard errors calculated using 100 bootstrap replications.

Extension: Semi-Parametric Distribution-Free Estimator

Table E.11: Logistic Distribution (h_o^1 = v2i)

N       Estimators            Bias      Std. Dev.  RMSE     25%      50%      75%
250     Probit (CF)          -0.0131    0.1717     0.1721   0.8875   0.9988   1.0947
        SML                  -0.0352    0.1778     0.1811   0.8595   0.9702   1.0829
        SP Het Probit (GCF)  -0.0183    0.2210     0.2216   0.8603   0.9873   1.1191
500     Probit (CF)          -0.0073    0.1192     0.1194   0.9172   0.9992   1.0758
        SML                  -0.0219    0.1236     0.1255   0.9024   0.9817   1.0661
        SP Het Probit (GCF)  -0.0124    0.1563     0.1567   0.8975   0.9995   1.0864
1,000   Probit (CF)           0.0007    0.0821     0.0820   0.9499   1.0022   1.0593
        SML                  -0.0075    0.0840     0.0843   0.9404   0.9962   1.0521
        SP Het Probit (GCF)   0.0004    0.1246     0.1245   0.9285   1.0052   1.0831

1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns.

Table E.12: Uniform Distribution (h_o^1 = v2i)

N       Estimators            Bias      Std. Dev.  RMSE     25%      50%      75%
250     Probit (CF)          -0.0210    0.1808     0.1819   0.8676   0.9862   1.1053
        SML                  -0.0402    0.1898     0.1939   0.8400   0.9663   1.0985
        SP Het Probit (GCF)  -0.0354    0.2598     0.2621   0.8240   0.9773   1.1336
500     Probit (CF)          -0.0056    0.1223     0.1224   0.9180   0.9991   1.0734
        SML                  -0.0124    0.1225     0.1230   0.9089   0.9907   1.0730
        SP Het Probit (GCF)  -0.0939    0.2908     0.3055   0.7547   0.9294   1.0814
1,000   Probit (CF)           0.0008    0.0856     0.0855   0.9484   1.0047   1.0613
        SML                  -0.0021    0.0848     0.0848   0.9444   1.0007   1.0559
        SP Het Probit (GCF)  -0.0996    0.2097     0.2320   0.7794   0.9149   1.0371

1,000 simulations.
Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 192 Table E.13: Student T Distribution (h1 o = v2i) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0123 -0.0190 -0.0136 -0.0083 -0.0122 0.0042 -0.0031 -0.0045 0.0100 0.1611 0.1571 0.1872 0.1087 0.1077 0.1254 0.0761 0.0735 0.1023 0.1614 0.1582 0.1876 0.1090 0.1083 0.1254 0.0761 0.0736 0.1027 0.8928 0.8853 0.8841 0.9202 0.9157 0.9161 0.9466 0.9475 0.9473 1.0019 0.9850 0.9917 0.9888 0.9875 1.0096 0.9960 0.9975 1.0168 1.0989 1.0902 1.1018 1.0709 1.0648 1.0915 1.0500 1.0454 1.0764 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.14: Gaussian Mixture Distribution (h1 o = v2i) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0135 -0.0373 -0.0203 -0.0130 -0.0248 -0.0608 -0.0021 -0.0085 -0.0440 0.1840 0.1972 0.2463 0.1201 0.1230 0.2383 0.0877 0.0887 0.1658 0.1845 0.2006 0.2470 0.1207 0.1254 0.2458 0.0877 0.0891 0.1715 0.8783 0.8345 0.8425 0.9066 0.8923 0.8090 0.9380 0.9321 0.8562 1.0023 0.9765 0.9995 0.9910 0.9794 0.9548 1.0011 0.9948 0.9715 1.1095 1.1025 1.1400 1.0759 1.0653 1.0908 1.0558 1.0488 1.0676 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 193 Table E.15: Logistic Distribution with Linear GCF (h2 o) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0119 -0.0710 -0.0127 -0.0064 -0.0602 -0.0059 -0.0013 -0.0517 0.0033 0.1533 0.1539 0.1985 0.1135 0.1044 0.1502 0.0765 0.0696 0.1120 0.1537 0.1694 0.1988 0.1136 0.1205 0.1503 0.0765 0.0867 0.1120 0.8913 0.8296 0.8643 0.9271 0.8737 0.9154 0.9509 0.9014 0.9293 0.9948 0.9213 0.9936 1.0038 0.9397 0.9995 0.9996 0.9509 1.0067 1.0912 1.0314 1.1169 1.0647 1.0111 1.0917 1.0500 0.9972 1.0820 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.16: Uniform Distribution with Linear GCF (h2 o) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0310 -0.0868 -0.0145 -0.0166 -0.0644 -0.0168 -0.0141 -0.0604 -0.0203 0.1794 0.1722 0.2111 0.1194 0.1114 0.2088 0.0836 0.0762 0.1715 0.1820 0.1927 0.2115 0.1205 0.1286 0.2094 0.0848 0.0972 0.1726 0.8609 0.8028 0.8513 0.9068 0.8672 0.8467 0.9311 0.8886 0.8724 0.9757 0.9128 0.9923 0.9889 0.9370 0.9911 0.9869 0.9413 0.9954 1.0863 1.0223 1.1239 1.0649 1.0094 1.1106 1.0437 0.9911 1.0915 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 194 Table E.17: Student T Distribution with Linear GCF (h2 o) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0033 -0.0563 -0.0060 -0.0034 -0.0529 0.0002 -0.0002 -0.0479 0.0086 0.1421 0.1309 0.1674 0.0992 0.0889 0.1208 0.0695 0.0636 0.0967 0.1421 0.1425 0.1674 0.0992 0.1034 0.1208 0.0695 0.0796 0.0970 0.9128 0.8626 0.8890 0.9317 0.8892 0.9257 0.9519 0.9083 0.9473 1.0040 0.9486 1.0017 0.9959 0.9460 1.0039 0.9985 0.9505 1.0118 1.0904 1.0311 1.1013 1.0652 1.0049 1.0789 1.0479 0.9957 1.0684 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.18: Gaussian Mixture Distribution with Linear GCF (h2 o) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0195 -0.0778 -0.0377 -0.0189 -0.0721 -0.0177 -0.0094 -0.0578 -0.0014 0.1773 0.1768 1.0300 0.1163 0.1076 0.1930 0.0822 0.0760 0.1396 0.1783 0.1931 1.0302 0.1178 0.1294 0.1937 0.0827 0.0954 0.1395 0.8781 0.8175 0.8759 0.9042 0.8592 0.8674 0.9351 0.8902 0.9048 0.9915 0.9289 1.0019 0.9863 0.9273 0.9967 0.9904 0.9393 1.0022 1.1023 1.0282 1.1342 1.0605 0.9984 1.1047 1.0469 0.9943 1.0942 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 195 Table E.19: Logistic Distribution with Non-Parametric GCF (h3 o) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) 0.0349 -0.0525 0.0117 0.0354 -0.0451 0.0453 0.0426 -0.0344 0.0010 0.1565 0.1739 0.2313 0.1106 0.1245 0.1537 0.0774 0.0843 0.1231 0.1603 0.1816 0.2315 0.1161 0.1324 0.1601 0.0883 0.0910 0.1230 0.9421 0.8363 0.9123 0.9657 0.8776 0.9581 0.9929 0.9101 0.9219 1.0387 0.9467 1.0244 1.0395 0.9594 1.0551 1.0465 0.9649 1.0074 1.1367 1.0589 1.1405 1.1084 1.0376 1.1472 1.0950 1.0245 1.0872 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.20: Uniform Distribution with Non-Parametric GCF (h3 o) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) 0.0270 -0.0710 -0.0159 0.0381 -0.0488 -0.0063 0.0424 -0.0403 -0.0606 0.1695 0.1966 0.2418 0.1110 0.1249 0.2364 0.0789 0.0870 0.1781 0.1716 0.2089 0.2422 0.1173 0.1341 0.2363 0.0895 0.0958 0.1881 0.9199 0.8151 0.8469 0.9612 0.8735 0.8518 0.9878 0.9030 0.8281 1.0313 0.9353 1.0006 1.0390 0.9541 1.0080 1.0442 0.9643 0.9525 1.1433 1.0599 1.1395 1.1127 1.0344 1.1458 1.0971 1.0211 1.0616 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 196 Table E.21: Student T Distribution with Non-Parametric GCF (h3 o) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) 0.0363 -0.0338 0.0241 0.0373 -0.0291 0.0577 0.0387 -0.0264 0.0078 0.1431 0.1559 0.1870 0.0990 0.1045 0.1205 0.0700 0.0740 0.1036 0.1475 0.1595 0.1885 0.1058 0.1085 0.1336 0.0800 0.0786 0.1038 0.9511 0.8733 0.9318 0.9747 0.9026 0.9807 0.9883 0.9244 0.9395 1.0409 0.9677 1.0314 1.0376 0.9692 1.0638 1.0408 0.9737 1.0132 1.1333 1.0657 1.1473 1.1035 1.0401 1.1440 1.0856 1.0243 1.0760 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.22: Gaussian Mixture Distribution with Non-Parametric GCF (h3 o) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) 0.0303 -0.0668 0.0056 0.0317 -0.0581 0.0106 0.0389 -0.0443 -0.0296 0.1708 0.1945 0.2327 0.1130 0.1247 0.2048 0.0828 0.0922 0.1561 0.1733 0.2056 0.2327 0.1173 0.1375 0.2050 0.0915 0.1023 0.1588 0.9220 0.8071 0.8857 0.9565 0.8525 0.9050 0.9825 0.8920 0.8595 1.0385 0.9462 1.0238 1.0307 0.9403 1.0290 1.0408 0.9565 0.9771 1.1441 1.0685 1.1484 1.1115 1.0303 1.1443 1.0979 1.0225 1.0797 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 197 Table E.23: Heteroskedastic Logistic (h1 o = v2i) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0687 -0.1041 0.1213 -0.0593 -0.0926 -0.0263 -0.0516 -0.0816 -0.0043 0.1798 0.1935 0.2858 0.1319 0.1460 0.1532 0.0854 0.0953 0.1160 0.1925 0.2197 0.3104 0.1446 0.1728 0.1554 0.0998 0.1254 0.1160 0.8222 0.7730 0.9453 0.8557 0.8136 0.8836 0.8951 0.8529 0.9203 0.9295 0.8895 1.1030 0.9456 0.9040 0.9712 0.9497 0.9231 0.9940 1.0513 1.0225 1.2574 1.0256 1.0116 1.0713 1.0034 0.9824 1.0705 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. Table E.24: Heteroskedastic Logistic with Linear GCF (h2 o) N Estimators Bias Std. Dev. RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) -0.0171 -0.0843 0.1245 -0.0148 -0.0609 -0.0019 -0.0105 -0.0557 0.0026 0.1649 0.1601 0.2358 0.1204 0.1163 0.1350 0.0816 0.0780 0.1025 0.1657 0.1809 0.2665 0.1213 0.1313 0.1349 0.0822 0.0958 0.1025 0.8795 0.8114 0.9792 0.9132 0.8621 0.9110 0.9380 0.8905 0.9343 0.9869 0.9116 1.1112 0.9870 0.9365 0.9996 0.9910 0.9449 1.0019 1.0861 1.0197 1.2494 1.0650 1.0193 1.0843 1.0399 0.9964 1.0628 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 198 Table E.25: Heteroskedastic Logistic with Non-Parametric GCF (h3 o) N Estimators Bias Std. Dev. 
RMSE 25% 50% 75% 250 500 1, 000 Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) Probit (CF) SML SP Het Probit (GCF) 0.0044 -0.0846 0.1307 0.0051 -0.0721 0.0208 0.0110 -0.0646 0.0042 0.1618 0.1823 0.2867 0.1182 0.1360 0.1264 0.0795 0.0906 0.1106 0.1618 0.2009 0.3149 0.1182 0.1539 0.1281 0.0803 0.1112 0.1106 0.8972 0.7910 0.9974 0.9308 0.8405 0.9343 0.9568 0.8753 0.9242 1.0054 0.8995 1.1174 1.0111 0.9251 1.0210 1.0131 0.9398 1.0001 1.1102 1.0314 1.2534 1.0815 1.0204 1.1102 1.0662 0.9999 1.0810 1,000 simulations. Root mean square error is reported in the third column and the 25th, 50th and 75th percentiles of the empirical distribution are reported in the last three columns. 199 APPENDIX F Figures for Chapter 3 Simulation Figure F.1: Distribution of ˆσ2 1 for T=5 under DGP1 1,000 simulations of N=300. Increasing serial correlation as color lightens. 200 Figure F.2: Distribution of ˆσ2 1 for T=10 under DGP1 1,000 simulations of N=300. Increasing serial correlation as color lightens. 201 Figure F.3: Distribution of ˆσ2 1 for T=20 under DGP1 1,000 simulations of N=300. Increasing serial correlation as color lightens. 202 Figure F.4: Distribution of ˆσ2 1 for T=5 under DGP2 1,000 simulations of N=300. Increasing serial correlation as color lightens. 203 Figure F.5: Distribution of ˆσ2 1 for T=10 under DGP2 1,000 simulations of N=300. Increasing serial correlation as color lightens. 204 Figure F.6: Distribution of ˆσ2 1 for T=20 under DGP2 1,000 simulations of N=300. Increasing serial correlation as color lightens. 205 Figure F.7: Distribution of ˆσ2 1 for T=5 under DGP3 1,000 simulations of N=300. Increasing serial correlation as color lightens. 206 Figure F.8: Distribution of ˆσ2 1 for T=10 under DGP3 1,000 simulations of N=300. Increasing serial correlation as color lightens. 207 Figure F.9: Distribution of ˆσ2 1 for T=20 under DGP3 1,000 simulations of N=300. Increasing serial correlation as color lightens. 
Figure F.10: ASF Estimates for T=5 under DGP1
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.11: ASF Estimates for T=10 under DGP1
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.12: ASF Estimates for T=20 under DGP1
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.13: ASF Estimates for T=5 under DGP2
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.14: ASF Estimates for T=10 under DGP2
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.15: ASF Estimates for T=20 under DGP2
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.16: ASF Estimates for T=5 under DGP3
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.17: ASF Estimates for T=10 under DGP3
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Figure F.18: ASF Estimates for T=20 under DGP3
1,000 simulations of N=300. Increasing serial correlation as color darkens. Simulated 95% confidence intervals given with dotted lines.

Application

Figure F.19: ATE for Selling Drugs
Vertical axis is the negative ATE, so a higher value indicates a better treatment (one that more strongly reduces the probability of the antisocial behavior outcomes).
The transparent gray plane is flat at an ATE equal to 0, so any treatment effect above the plane is a desired outcome. The ATEs are calculated over a grid of the characteristics of interest valued between [0,1] (recall that the characteristics are standardized to mean 0 and standard deviation 1).

Figure F.20: ATE for Being Arrested
Vertical axis is the negative ATE, so a higher value indicates a better treatment (one that more strongly reduces the probability of the antisocial behavior outcomes). The transparent gray plane is flat at an ATE equal to 0, so any treatment effect above the plane is a desired outcome. The ATEs are calculated over a grid of the characteristics of interest valued between [0,1] (recall that the characteristics are standardized to mean 0 and standard deviation 1).

Figure F.21: ATE for Engaging in Illicit Activities
Vertical axis is the negative ATE, so a higher value indicates a better treatment (one that more strongly reduces the probability of the antisocial behavior outcomes). The transparent gray plane is flat at an ATE equal to 0, so any treatment effect above the plane is a desired outcome. The ATEs are calculated over a grid of the characteristics of interest valued between [0,1] (recall that the characteristics are standardized to mean 0 and standard deviation 1).

Discussion

Figure F.22: Distribution of σ̂₁² for T=5 under AR(2)
1,000 simulations of N=300.

Figure F.23: Distribution of σ̂₁² for T=10 under AR(2)
1,000 simulations of N=300.

Figure F.24: Distribution of σ̂₁² for T=20 under AR(2)
1,000 simulations of N=300.
223 APPENDIX G Tables for Chapter 3 224 Simulation Table G.1: Estimation Times for DGP 1 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) 0.696 (0.11) 0.613 (0.08) 0.676 (0.06) 0.656 (0.10) 0.688 (0.08) 0.873 (0.08) 1.581 (0.14) 2.429 (0.18) 4.277 (0.36) 1.845 (0.25) 2.755 (0.23) 4.837 (0.39) 0.708 (0.12) 0.619 (0.07) 0.681 (0.06) 0.668 (0.13) 0.691 (0.07) 0.879 (0.07) 1.617 (0.15) 2.412 (0.17) 4.251 (0.34) 1.852 (0.14) 2.776 (0.21) 4.755 (0.34) 0.678 (0.11) 0.624 (0.08) 0.687 (0.07) 0.651 (0.11) 0.699 (0.08) 0.882 (0.08) 1.950 (0.28) 2.621 (0.21) 4.471 (0.42) 2.407 (9.03) 2.982 (0.23) 4.692 (0.40) T 5 10 20 Average estimation time in seconds and standard deviations given in parenthesis. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 225 Table G.2: Estimation Times for DGP 2 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) 0.643 (0.13) 0.614 (0.09) 0.671 (0.09) 0.609 (0.09) 0.616 (0.07) 0.779 (0.08) 1.613 (0.17) 2.199 (0.23) 3.935 (0.36) 1.881 (0.86) 2.647 (0.25) 4.542 (0.41) 0.655 (0.15) 0.633 (0.10) 0.676 (0.09) 0.617 (0.09) 0.633 (0.07) 0.789 (0.08) 1.606 (0.19) 2.213 (0.23) 3.908 (0.34) 1.833 (0.58) 2.624 (0.24) 4.403 (0.36) 0.634 (0.13) 0.643 (0.11) 0.682 (0.09) 0.610 (0.12) 0.638 (0.08) 0.792 (0.08) 3.121 (3.38) 2.579 (0.31) 3.891 (0.34) 2.189 (0.72) 2.844 (0.67) 4.492 (0.43) T 5 10 20 Average estimation time in seconds and standard deviations given in parenthesis. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
226 Table G.3: Estimation Times for DGP 3 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) 0.169 (0.03) 0.385 (0.04) 0.411 (0.03) 0.208 (0.20) 0.416 (0.04) 0.523 (0.03) 1.025 (0.25) 1.692 (0.13) 2.970 (0.25) 1.053 (0.25) 1.831 (0.13) 3.108 (0.21) 0.306 (0.05) 0.390 (0.05) 0.416 (0.03) 0.302 (0.06) 0.430 (0.05) 0.527 (0.03) 1.142 (0.11) 1.680 (0.13) 2.928 (0.22) 1.104 (0.20) 1.860 (0.10) 3.043 (0.19) 0.445 (0.08) 0.397 (0.05) 0.428 (0.03) 0.405 (0.08) 0.430 (0.04) 0.539 (0.03) 1.610 (0.81) 1.882 (0.10) 2.954 (0.21) 1.371 (0.33) 1.988 (0.10) 3.174 (0.20) T 5 10 20 Average estimation time in seconds and standard deviations given in parenthesis. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 227 Table G.4: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 1 ρ = 0 ρ = 0.4 ρ = 0.8 T (1) (2) (1) (2) (1) (2) α 5 β1 β2 α 10 β1 β2 α 20 β1 β2 -0.8495 (0.114) -0.9902 (0.080) 0.0176 (0.078) -0.8789 (0.079) -0.8990 (0.056) 0.0062 (0.047) -0.8688 (0.059) -0.8687 (0.042) 0.0020 (0.030) -0.0003 (0.423) 0.0698 (0.436) 0.0073 (0.075) -0.0280 (0.715) 0.0439 (0.577) 0.0007 (0.046) 0.0067 (1.120) 0.0361 (0.891) 0.0004 (0.030) -1.0358 (0.135) -0.9320 (0.092) 0.1652 (0.094) -0.9679 (0.089) -0.8702 (0.058) 0.0795 (0.051) -0.9107 (0.065) -0.8556 (0.044) 0.0360 (0.033) -0.0485 (0.509) 0.2210 (0.454) 0.1439 (0.088) -0.0419 (0.747) 0.1347 (0.623) 0.0734 (0.051) -0.0910 (1.243) 0.1208 (0.926) 0.0345 (0.032) -1.8101 (0.261) -0.6902 (0.140) 0.7966 (0.189) -1.3745 (0.131) -0.7305 (0.076) 0.4223 (0.077) -1.1243 (0.084) -0.7811 (0.050) 0.2212 (0.042) -0.2168 (0.863) 0.9404 (0.666) 0.7320 (0.173) -0.1355 (1.094) 0.6394 (0.813) 0.4120 (0.076) -0.1522 (1.705) 0.3639 (1.102) 0.2194 (0.042) R=1,000, N=300. Standard Deviations are given in parenthesis. 
The true coeffi- cient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 228 Table G.5: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 2 ρ = 0 ρ = 0.4 ρ = 0.8 T (1) (2) (1) (2) (1) (2) α 5 β1 β2 α 10 β1 β2 α 20 β1 β2 -0.7149 (0.119) -1.2177 (0.100) 0.0103 (0.082) -0.8690 (0.086) -0.9803 (0.071) 0.0159 (0.049) -0.8924 (0.064) -0.9041 (0.047) 0.0062 (0.032) -0.0048 (0.285) 0.0516 (0.311) 0.0107 (0.079) -0.0263 (0.426) 0.0166 (0.375) 0.0058 (0.048) -0.0110 (0.700) 0.0354 (0.562) 0.0020 (0.032) -0.9152 (0.145) -1.1591 (0.113) 0.1668 (0.098) -0.9867 (0.101) -0.9423 (0.074) 0.0976 (0.054) -0.9481 (0.070) -0.8816 (0.052) 0.0492 (0.033) -0.0385 (0.336) 0.2198 (0.365) 0.1518 (0.093) -0.0122 (0.471) 0.1309 (0.421) 0.0845 (0.053) -0.0213 (0.762) 0.0945 (0.591) 0.0446 (0.033) -1.6801 (0.258) -0.9264 (0.173) 0.8092 (0.182) -1.4536 (0.149) -0.7891 (0.093) 0.4769 (0.086) -1.1919 (0.093) -0.7962 (0.060) 0.2514 (0.044) -0.2266 (0.573) 0.9672 (0.538) 0.7461 (0.165) -0.1076 (0.733) 0.6271 (0.562) 0.4544 (0.083) -0.0525 (1.027) 0.3712 (0.706) 0.2464 (0.043) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coeffi- cient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
229 Table G.6: Bias and Std Deviation of De-scaled ME Probit Estimates for DGP 3 ρ = 0 ρ = 0.4 ρ = 0.8 T (1) (2) (1) (2) (1) (2) α 5 β1 β2 α 10 β1 β2 α 20 β1 β2 -0.6468 (0.116) -0.6229 (0.099) -0.0332 (0.084) -0.5603 (0.077) -0.5215 (0.067) -0.0094 (0.051) -0.4860 (0.061) -0.4739 (0.050) -0.0004 (0.033) -0.2268 (0.451) -0.0972 (0.474) 0.0095 (0.080) -0.4977 (0.710) -0.4936 (0.640) 0.0013 (0.050) -0.5450 (1.182) -0.6094 (0.958) 0.0033 (0.033) -0.8121 (0.145) -0.5174 (0.122) 0.1248 (0.103) -0.6190 (0.084) -0.4707 (0.074) 0.0626 (0.054) -0.5140 (0.064) -0.4437 (0.052) 0.0333 (0.035) -0.3008 (0.523) 0.0199 (0.522) 0.1484 (0.093) -0.5812 (0.792) -0.4200 (0.680) 0.0705 (0.052) -0.6948 (1.260) -0.5018 (0.970) 0.0367 (0.035) -1.4530 (0.256) -0.0816 (0.204) 0.7717 (0.216) -0.9248 (0.127) -0.2153 (0.098) 0.4171 (0.085) -0.6467 (0.084) -0.3001 (0.061) 0.2138 (0.048) -0.6283 (0.894) 0.6856 (0.752) 0.7304 (0.193) -0.9415 (1.142) -0.1168 (0.846) 0.4133 (0.082) -0.7962 (1.669) -0.4328 (1.226) 0.2152 (0.047) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coeffi- cient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
230 Table G.7: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 1 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ β2σ -0.5288 (0.090) -0.9152 (0.086) -0.0751 (0.085) -0.6001 (0.070) -0.8329 (0.055) -0.0388 (0.056) -0.6442 (0.054) -0.7695 (0.043) -0.0174 (0.040) 0.0055 (0.357) 0.0388 (0.392) 0.0069 (0.087) -0.0254 (0.602) 0.0218 (0.517) -0.0013 (0.055) -0.0124 (0.944) 0.0462 (0.805) 0.0010 (0.040) -0.6084 (0.073) -0.8288 (0.057) -0.0636 (0.054) -0.6609 (0.058) -0.7517 (0.042) -0.0453 (0.038) -0.6812 (0.046) -0.7189 (0.033) -0.0235 (0.027) -0.0009 (0.347) 0.0606 (0.354) 0.0091 (0.058) -0.0245 (0.588) 0.0422 (0.476) 0.0049 (0.039) 0.0050 (0.917) 0.0350 (0.732) 0.0046 (0.027) -0.5318 (0.092) -0.9164 (0.088) -0.0726 (0.082) -0.6043 (0.074) -0.8318 (0.056) -0.0351 (0.059) -0.6458 (0.059) -0.7706 (0.041) -0.0171 (0.040) -0.0060 (0.382) 0.0233 (0.375) 0.0065 (0.087) -0.0156 (0.585) 0.0188 (0.530) 0.0027 (0.058) -0.0692 (1.012) 0.0734 (0.842) 0.0005 (0.040) -0.6255 (0.074) -0.8156 (0.058) -0.0641 (0.053) -0.6678 (0.061) -0.7486 (0.041) -0.0434 (0.039) -0.6828 (0.050) -0.7192 (0.034) -0.0247 (0.028) -0.0116 (0.367) 0.0435 (0.323) 0.0119 (0.056) -0.0208 (0.573) 0.0424 (0.479) 0.0074 (0.040) -0.0661 (0.986) 0.0661 (0.734) 0.0035 (0.027) -0.5368 (0.096) -0.9140 (0.088) -0.0704 (0.084) -0.6040 (0.081) -0.8310 (0.056) -0.0344 (0.060) -0.6500 (0.064) -0.7697 (0.042) -0.0140 (0.043) -0.0223 (0.438) 0.0155 (0.401) 0.0069 (0.085) -0.0114 (0.675) 0.0620 (0.554) 0.0042 (0.060) -0.0684 (1.193) 0.0563 (0.884) 0.0055 (0.042) -0.6439 (0.082) -0.7904 (0.054) -0.0764 (0.057) -0.6730 (0.067) -0.7401 (0.041) -0.0483 (0.042) -0.6893 (0.055) -0.7157 (0.033) -0.0225 (0.030) -0.0194 (0.411) 0.0237 (0.309) 0.0090 (0.058) -0.0201 (0.634) 0.0792 (0.470) 0.0062 (0.043) -0.0672 (1.151) 0.0691 (0.743) 0.0075 (0.030) R=1,000, N=300. Standard Deviations are given in parenthesis. 
The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 231 Table G.8: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 2 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ β2σ -0.2231 (0.235) -1.3919 (0.241) -0.1331 (0.140) -0.5102 (0.065) -0.9603 (0.094) -0.0786 (0.060) -0.5803 (0.055) -0.8798 (0.048) -0.0422 (0.042) 0.0042 (0.240) 0.0087 (0.288) -0.0015 (0.088) -0.0232 (0.362) 0.0025 (0.345) 0.0039 (0.056) -0.0156 (0.585) 0.0228 (0.515) 0.0010 (0.040) -0.5189 (0.074) -0.9973 (0.075) -0.0577 (0.060) -0.6169 (0.057) -0.8227 (0.051) -0.0705 (0.040) -0.6620 (0.047) -0.7584 (0.036) -0.0534 (0.028) -0.0034 (0.232) 0.0403 (0.253) 0.0075 (0.064) -0.0222 (0.349) 0.0174 (0.306) 0.0082 (0.040) -0.0094 (0.574) 0.0326 (0.459) 0.0049 (0.028) -0.2203 (0.232) -1.3901 (0.244) -0.1326 (0.143) -0.5080 (0.069) -0.9625 (0.091) -0.0845 (0.060) -0.5802 (0.055) -0.8787 (0.052) -0.0440 (0.041) 0.0007 (0.250) 0.0044 (0.303) -0.0014 (0.088) 0.0036 (0.373) 0.0160 (0.356) 0.0024 (0.056) -0.0190 (0.626) 0.0358 (0.529) 0.0002 (0.040) -0.5477 (0.075) -0.9627 (0.073) -0.0620 (0.060) -0.6304 (0.063) -0.8131 (0.049) -0.0753 (0.040) -0.6670 (0.050) -0.7526 (0.039) -0.0535 (0.028) -0.0040 (0.243) 0.0374 (0.263) 0.0124 (0.063) 0.0043 (0.359) 0.0337 (0.322) 0.0113 (0.041) -0.0100 (0.600) 0.0400 (0.467) 0.0073 (0.028) -0.2491 (0.241) -1.3667 (0.247) -0.1207 (0.140) -0.5110 (0.080) -0.9539 (0.090) -0.0796 (0.065) -0.5803 (0.060) -0.8777 (0.052) -0.0417 (0.042) -0.0224 (0.286) 0.0024 (0.304) 0.0059 (0.090) 0.0011 (0.442) 0.0319 (0.391) 0.0047 (0.061) 0.0037 (0.714) 0.0294 (0.550) 0.0039 (0.041) -0.5905 (0.084) -0.8890 (0.067) -0.0702 (0.063) -0.6458 (0.068) -0.7907 (0.046) -0.0791 (0.045) -0.6731 
(0.055) -0.7445 (0.037) -0.0549 (0.031) -0.0223 (0.272) 0.0357 (0.248) 0.0156 (0.062) -0.0008 (0.419) 0.0522 (0.321) 0.0149 (0.046) 0.0044 (0.681) 0.0536 (0.468) 0.0093 (0.031) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 232 Table G.9: Bias and Std Deviation of Scaled Coefficient Estimates for DGP 3 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ β2σ -0.2446 (0.090) -0.5314 (0.115) -0.2057 (0.079) -0.2998 (0.068) -0.4438 (0.075) -0.1064 (0.054) -0.3181 (0.053) -0.3982 (0.058) -0.0486 (0.038) -0.1737 (0.378) -0.0883 (0.420) -0.0024 (0.090) -0.3742 (0.592) -0.4378 (0.549) -0.0033 (0.056) -0.4102 (1.006) -0.5671 (0.879) 0.0019 (0.038) -0.4033 (0.071) -0.5951 (0.068) -0.1609 (0.054) -0.3981 (0.056) -0.4789 (0.052) -0.0802 (0.038) -0.3739 (0.048) -0.4110 (0.041) -0.0314 (0.028) -0.1846 (0.367) -0.0812 (0.384) 0.0068 (0.063) -0.4071 (0.581) -0.4021 (0.522) 0.0034 (0.040) -0.4487 (0.969) -0.4956 (0.785) 0.0067 (0.029) -0.2453 (0.090) -0.5305 (0.115) -0.2075 (0.076) -0.2962 (0.065) -0.4441 (0.078) -0.1043 (0.055) -0.3192 (0.054) -0.3959 (0.057) -0.0499 (0.039) -0.1748 (0.395) -0.1363 (0.406) -0.0058 (0.085) -0.3994 (0.627) -0.4221 (0.575) 0.0003 (0.055) -0.5041 (1.022) -0.5069 (0.858) 0.0010 (0.039) -0.4163 (0.073) -0.5920 (0.069) -0.1585 (0.053) -0.3975 (0.056) -0.4806 (0.054) -0.0805 (0.038) -0.3754 (0.048) -0.4088 (0.041) -0.0325 (0.029) -0.1907 (0.374) -0.1101 (0.369) 0.0081 (0.060) -0.4338 (0.608) -0.3842 (0.522) 0.0054 (0.040) -0.5452 (1.000) -0.4277 (0.769) 0.0056 (0.029) -0.2485 (0.096) -0.5291 (0.117) -0.2082 (0.076) -0.3017 (0.075) -0.4407 (0.077) -0.1018 (0.057) -0.3152 (0.059) -0.3979 (0.055) 
-0.0518 (0.041) -0.2025 (0.454) -0.1006 (0.434) -0.0032 (0.084) -0.4392 (0.690) -0.4056 (0.568) 0.0052 (0.060) -0.4641 (1.141) -0.5270 (0.964) 0.0004 (0.043) -0.4216 (0.079) -0.5910 (0.067) -0.1659 (0.053) -0.4079 (0.065) -0.4814 (0.052) -0.0781 (0.042) -0.3739 (0.054) -0.4081 (0.041) -0.0339 (0.032) -0.2142 (0.426) -0.1054 (0.347) 0.0023 (0.059) -0.4909 (0.664) -0.3591 (0.493) 0.0099 (0.045) -0.5001 (1.126) -0.4683 (0.829) 0.0047 (0.033) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 233 Table G.10: Root Mean Square Error of ˆβ2σ for Specification (2) ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP 0.1552 0.2677 0.6496 0.0832 0.1188 0.2662 0.1845 0.4935 1.0950 0.1286 0.2284 0.5370 0.0654 0.0937 0.2118 0.1540 0.4338 0.8625 0.1415 0.2808 0.7146 0.0920 0.1273 0.2811 0.1833 0.5091 0.9926 0.1062 0.2311 0.5432 0.0707 0.1049 0.2198 0.1483 0.4200 0.7738 0.1611 0.3109 0.7839 0.0926 0.1541 0.3029 0.1985 0.4876 1.2071 0.0962 0.2275 0.5572 0.0626 0.1056 0.2219 0.1317 0.3717 0.9059 T 5 10 20 5 10 20 5 10 20 DGP 1 DGP 2 DGP 3 R=1,000, N=300 234 Table G.11: Bias and Std Deviation of Variance Component σ2 2 for Specification (2) DGP 1 ρ = 0.4 ρ = 0.8 0.0633 (0.117) 0.0302 (0.069) 0.0100 (0.040) 0.4829 (0.305) 0.2278 (0.098) 0.1113 (0.060) ρ = 0 -0.0165 (0.105) -0.0064 (0.054) -0.0067 (0.040) DGP 2 ρ = 0.4 ρ = 0.8 0.1090 (0.149) 0.0632 (0.079) 0.0365 (0.048) 0.5318 (0.372) 0.3325 (0.140) 0.1913 (0.068) ρ = 0 0.0084 (0.281) -0.0506 (0.237) -0.1004 (0.228) DGP 3 ρ = 0.4 ρ = 0.8 0.0458 (0.354) -0.0079 (0.314) -0.0695 (0.258) 0.3359 (0.804) 0.2285 (0.536) 0.0139 (0.386) T 5 10 20 ρ = 0 0.0050 (0.095) -0.0023 (0.053) -0.0035 (0.042) R=1,000, N=300 235 Table G.12: Bias and Std Deviation (×10) of APE 
Estimates for DGP 1 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -0.6680 (0.123) -0.4028 (0.102) -0.2163 (0.083) 0.0027 (0.128) 0.0011 (0.098) -0.0012 (0.078) -0.3953 (0.123) -0.1537 (0.096) -0.0600 (0.075) 0.0025 (0.125) 0.0009 (0.096) -0.0004 (0.075) -0.6715 (0.133) -0.3999 (0.102) -0.2145 (0.083) -0.0037 (0.131) -0.0027 (0.096) 0.0002 (0.078) -0.3593 (0.126) -0.1408 (0.092) -0.0561 (0.077) -0.0043 (0.126) -0.0010 (0.092) 0.0004 (0.077) -0.6714 (0.132) -0.3983 (0.101) -0.2128 (0.084) -0.0082 (0.126) -0.0021 (0.093) 0.0011 (0.076) -0.2852 (0.115) -0.1127 (0.089) -0.0454 (0.073) -0.0055 (0.118) -0.0007 (0.090) 0.0028 (0.073) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.0732. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 236 Table G.13: Bias and Std Deviation (×10) of APE Estimates for DGP 2 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -1.7912 (0.433) -0.7884 (0.132) -0.5333 (0.084) 0.0030 (0.148) -0.0013 (0.110) -0.0022 (0.082) -0.8927 (0.144) -0.3996 (0.111) -0.1712 (0.080) 0.0027 (0.143) -0.0012 (0.107) -0.0014 (0.080) -1.7813 (0.438) -0.7904 (0.136) -0.5307 (0.091) -0.0017 (0.149) -0.0019 (0.108) 0.0013 (0.088) -0.8139 (0.139) -0.3735 (0.105) -0.1624 (0.085) 0.0001 (0.143) 0.0003 (0.106) 0.0027 (0.086) -1.7429 (0.442) -0.7830 (0.138) -0.5338 (0.091) -0.0014 (0.143) -0.0016 (0.103) -0.0011 (0.083) -0.6664 (0.136) -0.3161 (0.101) -0.1436 (0.080) -0.0003 (0.134) -0.0006 (0.099) 0.0025 (0.079) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.0731. 
Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 237 Table G.14: Bias and Std Deviation (×10) of APE Estimates for DGP 3 ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -0.6549 (0.136) -0.3192 (0.110) -0.1452 (0.079) -0.0037 (0.122) -0.0010 (0.089) -0.0039 (0.071) -0.3342 (0.119) -0.1419 (0.089) -0.0610 (0.069) -0.0092 (0.119) -0.0040 (0.087) -0.0032 (0.068) -0.6575 (0.141) -0.3273 (0.109) -0.1425 (0.078) -0.0089 (0.127) -0.0053 (0.091) -0.0013 (0.070) -0.3137 (0.120) -0.1393 (0.088) -0.0559 (0.067) -0.0157 (0.121) -0.0082 (0.088) -0.0003 (0.0680) -0.6564 (0.143) -0.3266 (0.110) -0.1414 (0.078) -0.0037 (0.125) -0.0021 (0.088) -0.0010 (0.068) -0.2646 (0.111) -0.1181 (0.085) -0.0491 (0.065) -0.0139 (0.115) -0.0046 (0.084) -0.0003 (0.065) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is .1201. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 238 Table G.15: Comparison of APE and PEA T 5 10 20 DGP 1 DGP 2 DGP 3 APE PEA APE PEA APE PEA 0.0595 (0.050) 0.0640 (0.047) 0.0692 (0.045) 0.0719 (0.056) 0.0797 (0.050) 0.0879 (0.047) 0.0487 (0.055) 0.0531 (0.052) 0.0565 (0.052) 0.0572 (0.065) 0.0635 (0.061) 0.0682 (0.059) 0.1005 (0.076) 0.0986 (0.083) 0.0913 (0.086) 0.1147 (0.090) 0.1150 (0.099) 0.1071 (0.106) APE and PEA values using simulated data of sample size 1,000 and calculated using the true parameter values. Standard deviations of the distribution of the Partial effects given in parenthesis. 
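Table G.15 contrasts the average partial effect (APE), which averages the probit partial effect over the sample distribution of the covariates, with the partial effect at the average (PEA), which evaluates it at the covariate means. A minimal sketch of the distinction on hypothetical data, reusing the simulations' true coefficient values (the data-generating step here is illustrative, not the dissertation's DGP):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 1000
x1 = rng.normal(size=N)               # continuous covariate
x2 = rng.binomial(1, 0.5, size=N)     # binary covariate
alpha, b1, b2 = -0.25, 1.25, 1.0      # true values used in the simulations

# Probit partial effect of x1: derivative of Phi(index) w.r.t. x1 is b1*phi(index)
index = alpha + b1 * x1 + b2 * x2

ape = np.mean(b1 * norm.pdf(index))                              # average over units
pea = b1 * norm.pdf(alpha + b1 * x1.mean() + b2 * x2.mean())     # evaluate at the averages
```

Because phi(.) is nonlinear, the two generally differ; the gap widens as the index becomes more dispersed across units.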
Application

Table G.16: Select Baseline Summary Statistics

Baseline Covariate                    Therapy        Cash           Both           Control        Total
Age                                   25.16 (4.82)   25.69 (5.01)   25.45 (5.09)   25.37 (4.65)   25.41 (4.88)
Married or partnered                  0.152 (0.36)   0.133 (0.40)   0.198 (0.40)   0.143 (0.35)   0.155 (0.36)
Number of children <15 in household   2.031 (3.10)   2.058 (3.09)   2.528 (3.50)   1.865 (3.00)   2.100 (3.17)
Years of Schooling                    7.576 (3.38)   7.832 (3.29)   7.766 (3.27)   7.647 (3.06)   7.702 (3.25)
Has any disabilities                  0.063 (0.24)   0.062 (0.24)   0.061 (0.24)   0.095 (0.29)   0.071 (0.26)
Ex-Combatant                          0.375 (0.48)   0.394 (0.49)   0.315 (0.47)   0.389 (0.49)   0.370 (0.48)
Currently sleeping on street          0.228 (0.42)   0.252 (0.44)   0.244 (0.43)   0.258 (0.44)   0.246 (0.43)
Saving Stock (US$)                    33.92 (70.16)  28.16 (63.35)  41.11 (79.37)  27.37 (47.58)  32.21 (65.39)
Hrs/week, illicit activities          15.56 (32.54)  12.11 (23.74)  14.20 (26.08)  13.68 (27.12)  13.87 (27.60)
Hrs/week, agriculture                 0.565 (4.61)   0.128 (0.71)   0.197 (1.35)   0.604 (5.71)   0.385 (3.87)
Hrs/week, low-skill wage labor        18.44 (30.63)  17.88 (26.91)  19.09 (28.44)  18.97 (27.22)  18.59 (28.29)
Hrs/week, low-skill business          14.19 (28.68)  8.657 (20.92)  9.515 (20.80)  10.26 (22.41)  10.67 (23.55)
Hrs/week, illicit high skill work     1.813 (8.15)   1.795 (8.08)   0.947 (5.10)   1.054 (6.32)   1.406 (7.07)

899 individuals. BJS provides tests of balance for select baseline covariates in Table 1.

Table G.16 (cont'd)

Baseline Covariate                    Therapy         Cash            Both            Control         Total
Sells Drugs                           0.223 (0.42)    0.177 (0.38)    0.198 (0.40)    0.194 (0.40)    0.198 (0.40)
Uses marijuana daily                  0.464 (0.50)    0.465 (0.50)    0.416 (0.49)    0.484 (0.50)    0.459 (0.50)
Indicator for usually takes hard drugs 0.272 (0.45)   0.279 (0.45)    0.269 (0.44)    0.246 (0.43)    0.266 (0.44)
Uses hard drugs daily                 0.134 (0.34)    0.177 (0.38)    0.157 (0.36)    0.115 (0.32)    0.145 (0.35)
Committed theft, past 2 wks           0.576 (0.49)    0.540 (0.50)    0.533 (0.50)    0.579 (0.49)    0.558 (0.50)
Antisocial behavior index             -0.066 (0.962)  0.078 (1.090)   -0.036 (1.048)  0.035 (1.084)   0.005 (1.050)
Perseverance Index                    -0.026 (1.02)   -0.036 (1.12)   -0.009 (1.09)   0.008 (1.05)    -0.015 (1.07)
Reward Responsiveness                 -0.066 (1.05)   0.111 (1.08)    0.080 (1.07)    0.020 (1.03)    0.035 (1.06)
Impulsiveness Index                   -0.085 (1.05)   0.019 (1.09)    -0.004 (1.07)   0.109 (1.04)    0.013 (1.07)

899 individuals. BJS provides tests of balance for select baseline covariates in Table 1.

Table G.17: Preliminary OLS Estimates

                            Sells Drugs                           Arrested
                            Coeff     SE       Coeff     SE      Coeff     SE       Coeff     SE
Treatment
  Therapy                  -0.0558  (0.023)  -0.0559  (0.023)   -0.0010  (0.018)  -0.0003  (0.019)
  Cash                     -0.0087  (0.024)  -0.0068  (0.024)    0.0044  (0.019)   0.0125  (0.019)
  Both                     -0.0724  (0.022)  -0.0642  (0.023)   -0.0195  (0.019)  -0.0153  (0.019)
Interact with Therapy
  Bad Behavior             -0.0522  (0.024)  -0.0541  (0.024)   -0.0099  (0.021)  -0.0064  (0.021)
  Perseverance              0.0052  (0.021)   0.0007  (0.021)   -0.0181  (0.017)  -0.0199  (0.017)
  Reward                   -0.0032  (0.025)  -0.0065  (0.025)    0.0049  (0.016)   0.0007  (0.016)
  Impulsiveness            -0.0179  (0.021)  -0.0198  (0.021)   -0.0052  (0.018)  -0.0061  (0.019)
Interact with Cash
  Bad Behavior             -0.0280  (0.026)  -0.0201  (0.027)   -0.0021  (0.023)   0.0097  (0.022)
  Perseverance              0.0377  (0.021)   0.0387  (0.021)   -0.0238  (0.017)  -0.0173  (0.017)
  Reward                    0.0068  (0.021)   0.0032  (0.022)    0.0305  (0.017)   0.0251  (0.018)
  Impulsiveness             0.0292  (0.022)   0.0155  (0.023)   -0.0012  (0.020)  -0.0121  (0.020)
Interact with Both
  Bad Behavior             -0.0708  (0.023)  -0.0729  (0.024)   -0.0480  (0.023)  -0.0561  (0.023)
  Perseverance              0.0380  (0.019)   0.0260  (0.020)    0.0166  (0.017)   0.0167  (0.018)
  Reward                   -0.0080  (0.019)  -0.0145  (0.020)    0.0244  (0.017)   0.0187  (0.017)
  Impulsiveness             0.0116  (0.019)   0.0007  (0.020)    0.0038  (0.018)   0.0027  (0.019)
Includes Block FE           No                Yes                No                Yes
Number of Individuals       890               890                890               890
Number of Observations      3,312             3,312              3,312             3,312

Standard errors for both estimators are robust and clustered at the individual level. Treatment is randomly assigned within Blocks.
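The OLS estimates in Table G.17 use standard errors clustered at the individual level. A minimal sketch of the cluster-robust "sandwich" variance on hypothetical panel data (variable names and numbers are illustrative, not the study's data):

```python
import numpy as np

rng = np.random.default_rng(1)
n_id, T = 200, 4
ids = np.repeat(np.arange(n_id), T)
x = rng.normal(size=n_id * T)
# Individual effect plus idiosyncratic noise induces within-person correlation
u = rng.normal(size=n_id)[ids] + rng.normal(size=n_id * T)
y = 0.5 + 0.3 * x + u

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta

# Cluster-robust variance: sum score outer-products within each individual
XtX_inv = np.linalg.inv(X.T @ X)
meat = np.zeros((2, 2))
for g in range(n_id):
    s = X[ids == g].T @ e[ids == g]
    meat += np.outer(s, s)
V = XtX_inv @ meat @ XtX_inv
se = np.sqrt(np.diag(V))
```

Summing scores within individuals (rather than observation by observation) is what makes the variance robust to arbitrary serial correlation inside each cluster.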
Table G.17 (cont'd)

                            Illicit Activity
                            Coeff     SE       Coeff     SE
Treatment
  Therapy                  -0.0575  (0.021)  -0.0565  (0.022)
  Cash                     -0.0204  (0.021)  -0.0163  (0.021)
  Both                     -0.0622  (0.021)  -0.0570  (0.022)
Interact with Therapy
  Bad Behavior             -0.0642  (0.024)  -0.0678  (0.024)
  Perseverance              0.0094  (0.019)   0.0043  (0.019)
  Reward                   -0.0010  (0.023)  -0.0043  (0.023)
  Impulsiveness             0.0039  (0.020)   0.0064  (0.019)
Interact with Cash
  Bad Behavior             -0.0271  (0.023)  -0.0270  (0.024)
  Perseverance              0.0253  (0.018)   0.0253  (0.018)
  Reward                    0.0043  (0.018)  -0.0028  (0.019)
  Impulsiveness             0.0175  (0.019)   0.0091  (0.019)
Interact with Both
  Bad Behavior             -0.0567  (0.024)  -0.0671  (0.025)
  Perseverance              0.0495  (0.020)   0.0339  (0.020)
  Reward                   -0.0092  (0.018)  -0.0129  (0.019)
  Impulsiveness             0.0028  (0.019)  -0.0031  (0.019)
Includes Block FE           No                Yes
Number of Individuals       890               890
Number of Observations      3,328             3,328

Standard errors for both estimators are robust and clustered at the individual level. Treatment is randomly assigned within Blocks.

Table G.18: Scaled Probit Coefficient Estimates for Selling Drugs ME Probit Pooled Probit RE CRE RE CRE Coeff SE Coeff SE Coeff SE Coeff SE Treatment Therapy Cash Both Therapy × Antisocial Perseverance Reward Impulsiveness Cash × Antisocial Perseverance Reward Impulsiveness Both × Antisocial Perseverance Reward Impulsiveness Var. Components Therapy Cash Both Intercept Het.
Coefficients Therapy Cash Both -0.760 -0.086 -0.392 (0.24) (0.12) (0.12) -0.740 -0.080 -0.337 (0.26) (0.13) (0.13) -0.679 0.303 -0.513 (0.55) (0.21) (0.55) -0.233 0.494 0.104 -0.155 -0.140 0.127 -0.176 -0.146 0.213 0.106 0.114 -0.307 0.237 -0.055 0.081 1.550 0.000 0.000 0.786 (0.16) (0.16) (0.18) (0.16) (0.12) (0.12) (0.12) (0.12) (0.13) (0.13) (0.12) (0.12) (0.89) (0.00) (0.00) (0.18) 1.621 0.000 0.000 0.838 (0.92) (0.00) (0.00) (0.19) -0.198 -0.041 0.046 -0.081 -0.223 0.191 0.014 0.060 -0.279 0.229 -0.073 0.074 (0.49) (0.20) (0.45) (0.13) (0.13) (0.15) (0.13) (0.10) (0.10) (0.09) (0.08) (0.11) (0.11) (0.10) (0.09) 0.256 -0.546 0.078 (0.33) (0.33) (0.37) -0.046 -0.819 -0.431 (0.37) (0.37) (0.48) Time (in Seconds) 11369.869 11973.093 1.029 2.208 3,312 total observations for 890 individuals, in which dummies for the number of time observations for each individual are included to address the unbalanced panel. Standard errors for both estimators are robust and clustered at the individual level. ME Probit coefficient estimates are scaled by 1/√(1 + σ²ₐ) and standard errors are calculated using the delta method. Table G.19: Scaled Probit Coefficient Estimates for being Arrested ME Probit Pooled Probit RE CRE RE CRE Coeff SE Coeff SE Coeff SE Coeff SE Treatment Therapy Cash Both Therapy × Antisocial Perseverance Reward Impulsiveness Cash × Antisocial Perseverance Reward Impulsiveness Both × Antisocial Perseverance Reward Impulsiveness Var. Components Therapy Cash Both Intercept Het.
Coefficients Therapy Cash Both -0.020 -0.031 -0.305 (0.13) (0.15) (0.17) -0.019 -0.098 -0.222 (0.14) (0.16) (0.17) -0.013 0.397 -1.185 (0.32) (0.21) (0.87) -0.060 0.335 -0.730 -0.055 -0.099 0.027 -0.039 -0.046 -0.182 0.226 -0.021 -0.308 0.072 0.187 0.040 0.068 0.190 0.334 0.149 (0.10) (0.10) (0.10) (0.10) (0.11) (0.11) (0.11) (0.10) (0.13) (0.11) (0.11) (0.11) (0.21) (0.25) (0.27) (0.15) 0.062 0.177 0.522 0.156 (0.20) (0.26) (0.31) (0.14) -0.028 -0.097 0.032 -0.045 -0.088 -0.109 0.180 -0.008 -0.286 0.072 0.229 0.052 (0.47) (0.28) (0.87) (0.12) (0.12) (0.10) (0.10) (0.10) (0.10) (0.10) (0.09) (0.16) (0.14) (0.16) (0.13) 0.006 -0.455 0.646 (0.28) (0.30) (0.37) 0.047 -0.403 0.431 (0.39) (0.36) (0.46) Time (in Seconds) 818.475 827.293 4.746 5.988 3,302 total observations for 880 individuals, in which dummies for the number of time observations for each individual are included to address the unbalanced panel. Standard errors for both estimators are robust and clustered at the individual level. ME Probit coefficient estimates are scaled by 1/√(1 + σ²ₐ) and standard errors are calculated using the delta method. Table G.20: Scaled Probit Coefficient Estimates for Illicit Activity ME Probit Pooled Probit RE CRE RE CRE Coeff SE Coeff SE Coeff SE Coeff SE Treatment Therapy Cash Both Therapy × Antisocial Perseverance Reward Impulsiveness Cash × Antisocial Perseverance Reward Impulsiveness Both × Antisocial Perseverance Reward Impulsiveness Var. Components Therapy Cash Both Intercept Het.
Coefficients Therapy Cash Both -0.972 -0.101 -0.704 (0.31) (0.12) (0.27) -0.877 -0.095 -0.538 (0.30) (0.12) (0.25) -2.003 -0.498 -1.320 (1.65) (0.46) (1.72) -1.242 -0.356 -0.267 -0.309 -0.025 0.137 0.002 -0.118 0.105 0.121 0.007 -0.157 0.382 -0.087 -0.017 1.761 0.000 0.568 0.461 (0.17) (0.17) (0.18) (0.16) (0.11) (0.11) (0.11) (0.11) (0.14) (0.15) (0.13) (0.13) (0.89) (0.00) (0.55) (0.17) 1.988 0.000 1.018 0.422 (0.98) (0.00) (0.66) (0.15) -0.288 -0.147 0.153 -0.028 -0.051 0.084 0.117 0.035 -0.167 0.342 -0.081 -0.013 (1.36) (0.55) (0.90) (0.19) (0.23) (0.27) (0.20) (0.17) (0.13) (0.14) (0.12) (0.15) (0.17) (0.12) (0.13) 0.830 0.327 0.557 (0.54) (0.30) (0.74) 0.542 0.205 -0.075 (0.58) (0.37) (0.71) Time (in Seconds) 4642.346 6735.654 1.063 1.872 3,320 total observations for 882 individuals, in which dummies for the number of time observations for each individual are included to address the unbalanced panel. Standard errors for both estimators are robust and clustered at the individual level. ME Probit coefficient estimates are scaled by 1/√(1 + σ²ₐ) and standard errors are calculated using the delta method.
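The table notes state that the ME Probit coefficients are rescaled by 1/√(1 + σ²ₐ), with standard errors obtained by the delta method. A sketch of that calculation for a single coefficient, using purely illustrative estimates and an illustrative covariance matrix (none of these numbers come from the tables):

```python
import numpy as np

# Hypothetical estimates of one coefficient and the variance component sigma2_a,
# with their joint estimated covariance (illustrative numbers only)
beta_hat, sigma2_hat = -1.2, 0.8
V = np.array([[0.040, 0.005],
              [0.005, 0.090]])

scale = 1.0 / np.sqrt(1.0 + sigma2_hat)
theta = beta_hat * scale                 # scaled coefficient beta / sqrt(1 + sigma2_a)

# Delta method: gradient of g(beta, s) = beta * (1 + s)^(-1/2)
grad = np.array([scale, -0.5 * beta_hat * (1.0 + sigma2_hat) ** -1.5])
se_theta = np.sqrt(grad @ V @ grad)
```

The second gradient entry captures how uncertainty in the estimated variance component propagates into the scaled coefficient's standard error.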
Table G.21: ATE Estimates

                    Therapy Only         Cash Only            Both Cash and Therapy
                    Coeff      SE        Coeff      SE        Coeff      SE
Sells Drugs
  OLS (1)          -0.0569   (0.023)   -0.0074   (0.024)   -0.0704   (0.023)
  OLS (2)          -0.0573   (0.023)   -0.0095   (0.024)   -0.0747   (0.023)
  ME (1)           -0.0621   (0.033)   -0.0162   (0.023)   -0.0653   (0.021)
  ME (2)           -0.0648   (0.033)   -0.0207   (0.023)   -0.0764   (0.021)
  PHP (1)          -0.0638   (0.019)   -0.0202   (0.020)   -0.0686   (0.019)
  PHP (2)          -0.0642   (0.031)    0.0065   (0.059)   -0.0495   (0.055)
Arrested
  OLS (1)          -0.0006   (0.018)    0.0082   (0.019)   -0.0187   (0.019)
  OLS (2)          -0.0005   (0.018)    0.0059   (0.019)   -0.0201   (0.019)
  ME (1)            0.0011   (0.024)    0.0077   (0.025)   -0.0140   (0.027)
  ME (2)            0.0021   (0.025)    0.0072   (0.024)   -0.0163   (0.026)
  PHP (1)          -0.0012   (0.018)   -0.0035   (0.019)   -0.0363   (0.020)
  PHP (2)           0.0002   (0.019)   -0.0001   (0.020)   -0.0298   (0.020)
Illicit Activity
  OLS (1)          -0.0563   (0.021)   -0.0180   (0.021)   -0.0600   (0.021)
  OLS (2)          -0.0587   (0.022)   -0.0211   (0.021)   -0.0649   (0.022)
  ME (1)           -0.0536   (0.034)   -0.0174   (0.020)   -0.0549   (0.029)
  ME (2)           -0.0597   (0.034)   -0.0219   (0.021)   -0.0639   (0.029)
  PHP (1)          -0.0238   (0.034)    0.0049   (0.027)   -0.0246   (0.034)
  PHP (2)          -0.0473   (0.026)   -0.0112   (0.021)   -0.0533   (0.018)

Specification (1) assumes a RE structure such that the random treatment effects are not heterogeneous in terms of the individual characteristics. Specification (2) implements a flexible CRE specification that allows the treatment effects to be heterogeneous in individual characteristics.
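For the probit-based estimators, an ATE like those in Table G.21 is the average difference in fitted probabilities with the treatment dummy switched on and off for every individual. A minimal sketch for a single treatment indicator (the coefficients and data below are hypothetical, not the application's estimates):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
N = 500
z = rng.normal(size=N)          # an individual characteristic

# Hypothetical scaled probit coefficients: intercept, treatment, covariate
a, tau, g = -0.4, -0.35, 0.25

# ATE: average the counterfactual probability difference over the sample
p1 = norm.cdf(a + tau + g * z)  # everyone treated
p0 = norm.cdf(a + g * z)        # no one treated
ate = np.mean(p1 - p0)
```

Averaging the difference, rather than differencing at average covariates, keeps the estimate comparable to the OLS (linear probability) ATEs reported alongside it.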
247 Discussion Table G.22: Bias and Std Deviation of Scaled Coefficient Estimates under AR(2) PHP MEP T (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ β2σ -0.5520 (0.091) -0.9095 (0.087) -0.0542 (0.083) -0.6284 (0.074) -0.8229 (0.055) -0.0115 (0.059) -0.6678 (0.060) -0.7616 (0.043) 0.0057 (0.043) -0.0159 (0.388) 0.0398 (0.397) 0.0284 (0.088) -0.0580 (0.592) 0.0510 (0.531) 0.0297 (0.058) -0.0247 (1.031) 0.1005 (0.821) 0.0255 (0.043) -0.6439 (0.076) -0.8112 (0.057) -0.0441 (0.056) -0.6898 (0.061) -0.7401 (0.041) -0.0221 (0.039) -0.7055 (0.053) -0.7095 (0.034) -0.0016 (0.030) -0.0193 (0.375) 0.0659 (0.343) 0.0358 (0.059) -0.0647 (0.573) 0.0695 (0.484) 0.0321 (0.040) -0.0063 (1.002) 0.0918 (0.718) 0.0285 (0.030) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 248 Table G.23: Bias and Std Deviation (×10) of APE Estimates under AR(2) T 5 10 20 PHP MEP (1) (2) (1) (2) -0.6657 (0.1284) -0.3966 (0.0987) -0.2096 (0.0820) 0.0024 (0.1274) 0.0044 (0.0936) 0.0059 (0.0768) -0.3645 (0.1238) -0.1409 (0.0898) -0.0510 (0.0743) 0.0042 (0.1228) 0.0058 (0.0914) 0.0072 (0.0750) R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.0735. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
Table G.24: Failure Count under no Random Coefficients ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) MEP (2) (1) 5 10 20 0 0 0 6 0 0 525 559 652 448 567 721 0 0 0 4 0 0 6 0 4 8 4 4 0 0 0 7 0 0 0 0 0 1 0 0 Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 249 Table G.25: Estimation Times under no Random Coefficients ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) 0.301 (0.06) 0.538 (0.10) 0.605 (0.07) 0.464 (0.13) 0.695 (0.14) 0.845 (0.15) 6.826 (3.90) 7.991 (4.69) 13.688 (8.29) 10.056 (5.06) 11.219 (6.23) 18.954 (10.54) 0.389 (0.08) 0.533 (0.08) 0.618 (0.08) 0.525 (0.15) 0.686 (0.14) 0.863 (0.16) 3.681 (2.41) 5.078 (3.39) 8.532 (6.13) 5.442 (3.15) 7.038 (4.22) 11.741 (7.63) 0.478 (0.08) 0.532 (0.08) 0.629 (0.08) 0.613 (0.26) 0.692 (0.13) 0.878 (0.16) 4.269 (1.78) 7.455 (3.18) 9.070 (6.56) 5.592 (2.00) 9.362 (3.78) 10.434 (8.16) T 5 10 20 Average estimation time in seconds and standard deviations given in parenthesis. Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
250 Table G.26: Bias and Std Deviation (×10) of APE Estimates under no Random Coefficients ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -0.0023 (0.084) 0.0016 (0.058) -0.0035 (0.040) -0.0005 (0.092) 0.0006 (0.060) -0.0036 (0.040) -0.0001 (0.081) 0.0031 (0.057) -0.0027 (0.040) 0.0016 (0.088) 0.0029 (0.059) -0.0025 (0.041) 0.0005 (0.093) -0.0005 (0.059) 0.0018 (0.044) 0.0052 (0.099) -0.0005 (0.063) 0.0022 (0.045) 0.0013 (0.089) 0.0005 (0.059) 0.0022 (0.044) 0.0057 (0.095) 0.0013 (0.061) 0.0028 (0.044) 0.0045 (0.094) 0.0017 (0.0672) -0.0008 (0.0438) 0.0044 (0.098) 0.0025 (0.068) -0.0005 (0.044) 0.0018 (0.087) 0.0008 (0.065) -0.0008 (0.043) 0.0024 (0.090) 0.0018 (0.066) -0.0004 (0.044) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.1511. Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 251 Table G.27: Variance Component σ2 1 Estimates under no Random Coefficients T 5 10 20 ρ = 0 ρ = 0.4 ρ = 0.8 (1) (2) (1) (2) (1) (2) 0.0556 (0.064) 0.0232 (0.026) 0.0117 (0.012) 0.0463 (0.059) 0.0186 (0.024) 0.0092 (0.011) 0.2944 (0.154) 0.1456 (0.060) 0.0688 (0.026) 0.2771 (0.153) 0.1366 (0.060) 0.0642 (0.026) 2.1492 (0.916) 0.9861 (0.205) 0.4779 (0.077) 2.1260 (0.944) 0.9650 (0.203) 0.4667 (0.076) Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
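Specification (2) throughout these tables lets the random effects ai and bi correlate with the covariates through the covariates' time averages, the Chamberlain-Mundlak device. A minimal sketch of how that design matrix is built (hypothetical panel data):

```python
import numpy as np

rng = np.random.default_rng(3)
n_id, T = 100, 5
ids = np.repeat(np.arange(n_id), T)
x = rng.normal(size=(n_id * T, 2))

# Mundlak device: append each individual's time averages xbar_i as extra
# regressors, so the random effects may be correlated with x through them
xbar = np.zeros_like(x)
for g in np.unique(ids):
    xbar[ids == g] = x[ids == g].mean(axis=0)

X_spec2 = np.column_stack([np.ones(n_id * T), x, xbar])
```

Specification (1) simply omits the `xbar` columns, which is what makes it misspecified whenever the heterogeneity is in fact correlated with the covariates.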
Table G.28: Variance Component σ2 2 Estimates under no Random Coefficients T 5 10 20 ρ = 0 ρ = 0.4 ρ = 0.8 (1) (2) (1) (2) (1) (2) 0.0164 (0.041) 0.0106 (0.024) 0.0065 (0.013) 0.0107 (0.033) 0.0076 (0.021) 0.0047 (0.011) 0.0383 (0.063) 0.0245 (0.037) 0.0134 (0.019) 0.0268 (0.055) 0.0178 (0.033) 0.0102 (0.017) 0.0808 (0.188) 0.0337 (0.072) 0.0324 (0.039) 0.0524 (0.160) 0.0260 (0.063) 0.0282 (0.035) Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 252 Table G.29: Rejection Rate of LR Test for Random Coefficients ρ = 0 ρ = 0.4 (1) (2) (1) (2) ρ = 0.8 (1) (2) 0.054 0.047 0.055 0.033 0.026 0.027 0.723 0.858 0.881 0.638 0.799 0.841 1 1 1 1 1 1 T 5 10 20 True value of σ2 1 is 0.5. Specification (1) assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
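The rejection rates in Table G.29 come from a likelihood-ratio test for the random-coefficient variance. A sketch of the test statistic with illustrative maximized log-likelihoods (the numbers are hypothetical); note that because a variance is restricted to its boundary under the null, the usual chi-squared reference distribution is conservative:

```python
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods: restricted model imposes sigma2_1 = 0,
# unrestricted model estimates it freely (illustrative values only)
ll_restricted, ll_unrestricted = -1482.7, -1474.1

lr = 2.0 * (ll_unrestricted - ll_restricted)
p_value = chi2.sf(lr, df=1)              # one restricted parameter
reject = lr > chi2.ppf(0.95, df=1)       # conservative, given the boundary issue
```

A refinement under these assumptions would use the 50:50 mixture of chi-squared(0) and chi-squared(1) critical values for the boundary case.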
253 Table G.30: Bias and Std Deviation of De-scaled ME Logit Estimate under a Conditional Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 T (1) (2) (1) (2) (1) (2) α 5 β1 β2 α 10 β1 β2 α 20 β1 β2 -0.7919 (0.129) -1.0646 (0.100) -0.0034 (0.090) -0.8592 (0.089) -0.9314 (0.066) 0.0034 (0.061) -0.8660 (0.067) -0.8784 (0.049) 0.0020 (0.040) -0.0614 (0.537) 0.0599 (0.523) 0.0030 (0.091) -0.0628 (0.822) 0.0007 (0.688) 0.0030 (0.061) 0.0009 (1.397) 0.0217 (1.045) 0.0019 (0.039) -1.0504 (0.164) -0.9749 (0.113) 0.1738 (0.102) -0.9699 (0.108) -0.8845 (0.070) 0.0867 (0.065) -0.9119 (0.077) -0.8596 (0.051) 0.0441 (0.042) -0.0641 (0.701) 0.2782 (0.586) 0.1705 (0.099) 0.0248 (1.013) 0.1203 (0.689) 0.0854 (0.065) -0.0367 (1.486) 0.0820 (1.094) 0.0437 (0.042) -1.9744 (0.310) -0.6782 (0.159) 0.8856 (0.192) -1.4747 (0.173) -0.7167 (0.087) 0.5012 (0.093) -1.1741 (0.113) -0.7661 (0.061) 0.2643 (0.054) -0.2474 (1.202) 1.1248 (0.895) 0.8505 (0.187) -0.1786 (1.599) 0.6357 (0.980) 0.4960 (0.093) -0.1120 (2.326) 0.3829 (1.324) 0.2633 (0.054) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
Conditional logistic AR(1) is generated according to equation (3.35) 254 Table G.31: Bias and Std Deviation of De-scaled ME Logit Estimate under a Marginal Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 T (1) (2) (1) (2) (1) (2) α 5 β1 β2 α 10 β1 β2 α 20 β1 β2 -0.7962 (0.129) -1.0646 (0.098) 0.0057 (0.086) -0.8561 (0.095) -0.9296 (0.063) 0.0017 (0.059) -0.8632 (0.070) -0.8780 (0.050) 0.0004 (0.039) -0.0541 (0.538) 0.0399 (0.508) 0.0119 (0.087) -0.0393 (0.818) 0.0449 (0.659) 0.0010 (0.059) 0.0045 (1.369) 0.0646 (1.054) 0.0003 (0.040) -1.0212 (0.167) -0.9827 (0.104) 0.1510 (0.106) -0.9424 (0.106) -0.8944 (0.069) 0.0633 (0.061) -0.8933 (0.079) -0.8695 (0.052) 0.0234 (0.039) -0.0769 (0.690) 0.2104 (0.578) 0.1480 (0.106) -0.1014 (0.995) 0.1266 (0.750) 0.0618 (0.061) -0.0981 (1.510) 0.0737 (1.093) 0.0232 (0.040) -1.8377 (0.276) -0.7226 (0.153) 0.7812 (0.177) -1.3900 (0.168) -0.7464 (0.088) 0.4205 (0.086) -1.0903 (0.109) -0.7932 (0.057) 0.1897 (0.049) -0.2540 (1.259) 0.9752 (0.818) 0.7508 (0.171) -0.1505 (1.688) 0.5626 (0.961) 0.4153 (0.085) -0.0510 (2.344) 0.3125 (1.285) 0.1891 (0.049) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
Marginal logistic AR(1) is generated according to equation (3.36) 255 Table G.32: Bias and Std Deviation of Scaled Coefficient Estimates under a Conditional Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEL PHP MEL PHP MEL T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ β2σ -0.3808 (0.078) -0.5884 (0.081) 0.0035 (0.074) -0.4309 (0.058) -0.5172 (0.045) 0.0270 (0.050) -0.4536 (0.043) -0.4716 (0.033) 0.0331 (0.032) -0.0389 (0.305) 0.0664 (0.323) 0.0333 (0.074) -0.0447 (0.460) 0.0429 (0.408) 0.0386 (0.051) -0.0053 (0.777) 0.0398 (0.603) 0.0371 (0.032) -0.3876 (0.060) -0.5505 (0.049) -0.0197 (0.044) -0.4267 (0.043) -0.4826 (0.033) -0.0114 (0.030) -0.4367 (0.034) -0.4539 (0.025) -0.0062 (0.020) -0.0321 (0.277) 0.0321 (0.270) 0.0026 (0.047) -0.0321 (0.423) 0.0018 (0.353) 0.0029 (0.031) 0.0003 (0.720) 0.0125 (0.538) 0.0021 (0.020) -0.3824 (0.080) -0.5873 (0.081) 0.0024 (0.074) -0.4268 (0.064) -0.5162 (0.046) 0.0264 (0.050) -0.4502 (0.045) -0.4721 (0.034) 0.0333 (0.031) -0.0172 (0.340) 0.0750 (0.317) 0.0374 (0.072) 0.0117 (0.531) 0.0557 (0.389) 0.0384 (0.050) -0.0185 (0.803) 0.0525 (0.620) 0.0373 (0.032) -0.4311 (0.064) -0.5239 (0.048) -0.0081 (0.042) -0.4410 (0.049) -0.4715 (0.032) -0.0064 (0.029) -0.4409 (0.037) -0.4508 (0.025) -0.0020 (0.020) -0.0135 (0.318) 0.0512 (0.265) 0.0176 (0.043) 0.0198 (0.489) 0.0190 (0.332) 0.0101 (0.030) -0.0146 (0.741) 0.0221 (0.546) 0.0066 (0.020) -0.3783 (0.086) -0.5858 (0.074) 0.0002 (0.077) -0.4238 (0.071) -0.5210 (0.044) 0.0264 (0.049) -0.4511 (0.055) -0.4737 (0.036) 0.0328 (0.034) -0.0183 (0.383) 0.0583 (0.339) 0.0382 (0.073) -0.0292 (0.629) 0.0459 (0.458) 0.0404 (0.049) -0.0260 (1.034) 0.0681 (0.654) 0.0375 (0.033) -0.4683 (0.072) -0.4887 (0.042) -0.0073 (0.044) -0.4696 (0.059) -0.4572 (0.030) 0.0069 (0.031) -0.4619 (0.047) -0.4415 (0.025) 0.0104 (0.023) -0.0176 (0.352) 0.0528 (0.257) 0.0282 (0.046) -0.0269 (0.579) 0.0414 (0.356) 0.0283 (0.032) -0.0254 (0.985) 0.0484 (0.559) 0.0209 (0.023) 
R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. Conditional logistic AR(1) is generated according to equation (3.35) 256 Table G.33: Bias and Std Deviation of Scaled Coefficient Estimates under a Marginal Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEL PHP MEL PHP MEL T (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) ασ 5 β1σ β2σ ασ 10 β1σ β2σ ασ 20 β1σ -0.3855 (0.078) -0.5856 (0.079) 0.0083 (0.072) -0.4298 (0.060) -0.5160 (0.045) 0.0271 (0.048) -0.4515 (0.044) -0.4715 (0.033) -0.0365 (0.304) 0.0654 (0.307) 0.0428 (0.073) -0.0279 (0.457) 0.0626 (0.397) 0.0369 (0.048) -0.0001 (0.760) 0.0620 (0.612) -0.3901 (0.059) -0.5503 (0.048) -0.0149 (0.041) -0.4249 (0.046) -0.4818 (0.031) -0.0125 (0.029) -0.4351 (0.034) -0.4538 (0.026) -0.0280 (0.277) 0.0226 (0.262) 0.0077 (0.044) -0.0199 (0.421) 0.0243 (0.339) 0.0016 (0.029) 0.0024 (0.705) 0.0338 (0.542) -0.3738 (0.084) -0.5854 (0.078) -0.0064 (0.075) -0.4183 (0.059) -0.5190 (0.047) 0.0187 (0.046) -0.4428 (0.046) -0.4763 (0.034) -0.0261 (0.338) 0.0458 (0.310) 0.0292 (0.073) -0.0521 (0.521) 0.0652 (0.426) 0.0294 (0.046) -0.0584 (0.818) 0.0514 (0.617) -0.4223 (0.067) -0.5263 (0.045) -0.0147 (0.043) -0.4311 (0.047) -0.4752 (0.032) -0.0147 (0.027) -0.4336 (0.038) -0.4550 (0.025) -0.0211 (0.314) 0.0244 (0.263) 0.0102 (0.046) -0.0414 (0.483) 0.0245 (0.363) 0.0005 (0.028) -0.0458 (0.754) 0.0197 (0.545) -0.3639 (0.087) -0.5894 (0.079) -0.0123 (0.074) -0.4062 (0.069) -0.5265 (0.045) 0.0066 (0.049) -0.4230 (0.054) -0.4820 (0.033) -0.0273 (0.406) 0.0331 (0.325) 0.0219 (0.069) -0.0410 (0.655) 0.0626 (0.438) 0.0193 (0.047) -0.0117 (1.054) 0.0491 (0.640) -0.4549 (0.069) -0.4948 (0.042) -0.0157 (0.043) -0.4524 (0.058) -0.4638 (0.031) -0.0105 (0.029) 
-0.4350 (0.046) -0.4501 (0.024) -0.0262 (0.380) 0.0333 (0.246) 0.0175 (0.043) -0.0209 (0.623) 0.0274 (0.353) 0.0088 (0.030) -0.0006 (1.004) 0.0270 (0.550) β2σ 0.0323 (0.032) 0.0364 (0.032) -0.0040 (0.022) R=1,000, N=300. Standard Deviations are given in parenthesis. The true coefficient values are α = −0.25, β1 = 1.25, and β2 = 1. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. Marginal logistic AR(1) is generated according to equation (3.36) -0.0073 (0.019) -0.0136 (0.021) 0.0072 (0.033) 0.0009 (0.020) 0.0242 (0.031) 0.0284 (0.030) -0.0106 (0.019) -0.0022 (0.019) 0.0117 (0.033) 257 Table G.34: Bias and Std Deviation (×10) of APE Estimates under a Conditional Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -0.4665 (0.155) -0.2824 (0.104) -0.1498 (0.080) 0.1036 (0.148) 0.0487 (0.101) 0.0272 (0.075) -0.2912 (0.144) -0.1052 (0.101) -0.0322 (0.078) 0.0033 (0.142) 0.0003 (0.099) 0.0017 (0.077) -0.4681 (0.156) -0.2836 (0.104) -0.1508 (0.080) 0.0985 (0.138) 0.0498 (0.098) 0.0241 (0.076) -0.2173 (0.139) -0.0755 (0.098) -0.0258 (0.077) -0.0030 (0.132) 0.0017 (0.097) 0.0000 (0.077) -0.4643 (0.139) -0.2919 (0.100) -0.1520 (0.084) 0.1040 (0.125) 0.0438 (0.089) 0.0230 (0.074) -0.1036 (0.117) -0.0353 (0.088) -0.0096 (0.076) 0.0051 (0.116) -0.0029 (0.088) 0.0000 (0.075) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.0585. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. 
Conditional logistic AR(1) is generated according to equation (3.35) 258 Table G.35: Bias and Std Deviation (×10) of APE Estimates under a Marginal Logistic AR(1) Process ρ = 0 ρ = 0.4 ρ = 0.8 PHP MEP PHP MEP PHP MEP (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) -0.4652 (0.151) -0.2827 (0.098) -0.1498 (0.081) 0.0975 (0.140) 0.0498 (0.099) 0.0264 (0.077) -0.2982 (0.144) -0.1051 (0.099) -0.0314 (0.077) -0.0028 (0.136) -0.0002 (0.098) 0.0023 (0.077) -0.4634 (0.148) -0.2878 (0.105) -0.1551 (0.082) 0.0947 (0.135) 0.0404 (0.100) 0.0196 (0.076) -0.2191 (0.132) -0.0839 (0.098) -0.0326 (0.078) -0.0053 (0.130) -0.0052 (0.097) -0.0055 (0.077) -0.4708 (0.150) -0.2913 (0.105) -0.1629 (0.078) 0.0896 (0.126) 0.0365 (0.094) 0.0096 (0.073) -0.1200 (0.118) -0.0442 (0.090) -0.0213 (0.072) -0.0057 (0.117) -0.0109 (0.090) -0.0109 (0.071) T 5 10 20 R=1,000, N=300. Standard deviations are given in parenthesis. Both bias and standard deviations are multiplied by 10. True APE value is 0.0582. Specification (1) incorrectly assumes that the random effects ai and bi are uncorrelated with the x’s while specification (2) assumes that ai and bi are correlated with the x’s through their time averages. Marginal logistic AR(1) is generated according to equation (3.36) 259 BIBLIOGRAPHY 260 BIBLIOGRAPHY Ackerberg, D., X. Chen, and J. Hahn (2012): ‘A practical asymptotic variance estimator for two-step semiparametric estimators,’ Review of Economics and Statistics, 94(2), 481–498. Ai, C., and X. Chen (2003): ‘Efficient estimation of models with conditional moment restric- tions containing unknown functions,’ Econometrica, 71(6), 1795–1843. Akin, J. S., D. K. Guilkey, and R. Sickles (1979): ‘A random coefficient probit model with an application to a study of migration,’ Journal of Econometrics, 11(2), 233 – 246. Bertrand, M., E. Duflo, and S. Mullainathan (2004): ‘How Much Should We Trust Differences-In-Differences Estimates?*,’ The Quarterly Journal of Economics, 119(1), 249– 275. Blattman, C., J. 
C. Jamison, and M. Sheridan (2017): ‘Reducing Crime and Violence: Experimental Evidence from Cognitive Behavioral Therapy in Liberia,’ American Economic Review, 107(4), 1165–1206.

Blundell, R., and R. L. Matzkin (2014): ‘Control functions in nonseparable simultaneous equations models,’ Quantitative Economics, 5(2), 271–295.

Blundell, R., and J. L. Powell (2003): ‘Endogeneity in nonparametric and semiparametric regression models,’ Econometric Society Monographs, 36, 312–357.

Blundell, R. W., and J. L. Powell (2004): ‘Endogeneity in Semiparametric Binary Response Models,’ The Review of Economic Studies, 71(3), 655–679.

Chen, X. (2007): ‘Large sample sieve estimation of semi-nonparametric models,’ Handbook of Econometrics, 6, 5549–5632.

Chen, X., V. Chernozhukov, S. Lee, and W. K. Newey (2014): ‘Local identification of nonparametric and semiparametric models,’ Econometrica, 82(2), 785–809.

Chen, X., O. Linton, and I. Van Keilegom (2003): ‘Estimation of semiparametric models when the criterion function is not smooth,’ Econometrica, 71(5), 1591–1608.

D’Haultfœuille, X., and P. Février (2015): ‘Identification of nonseparable triangular models with discrete instruments,’ Econometrica, 83(3), 1199–1210.

Dong, Y., and A. Lewbel (2015): ‘A simple estimator for binary choice models with endogenous regressors,’ Econometric Reviews, 34(1-2), 82–105.

Duckworth, A. L., and P. D. Quinn (2009): ‘Development and validation of the Short Grit Scale (GRIT–S),’ Journal of Personality Assessment, 91(2), 166–174.

Escanciano, J. C., D. Jacho-Chávez, and A. Lewbel (2016): ‘Identification and estimation of semiparametric two-step models,’ Quantitative Economics, 7(2), 561–589.

Fernández-Val, I. (2009): ‘Fixed effects estimation of structural parameters and marginal effects in panel probit models,’ Journal of Econometrics, 150(1), 71–85.

Florens, J.-P., J. J. Heckman, C. Meghir, and E.
Vytlacil (2008): ‘Identification of treatment effects using control functions in models with continuous, endogenous treatment and heterogeneous effects,’ Econometrica, 76(5), 1191–1206.

Gandhi, A., K. I. Kim, and A. Petrin (2013): ‘Identification and Estimation in Discrete Choice Demand Models when Endogenous Variables Interact with the Error.’

Greene, W. (2004): ‘The behaviour of the maximum likelihood estimator of limited dependent variable models in the presence of fixed effects,’ The Econometrics Journal, 7(1), 98–119.

Greene, W. (2011): Econometric Analysis. Pearson Education.

Hahn, J., Z. Liao, and G. Ridder (2018): ‘Nonparametric two-step sieve M estimation and inference,’ Econometric Theory, pp. 1–44.

Hahn, J., and G. Ridder (2011): ‘Conditional moment restrictions and triangular simultaneous equations,’ Review of Economics and Statistics, 93(2), 683–689.

Hall, P., and J. L. Horowitz (2005): ‘Nonparametric methods for inference in the presence of instrumental variables,’ The Annals of Statistics, 33(6), 2904–2929.

Harvey, A. C. (1976): ‘Estimating regression models with multiplicative heteroscedasticity,’ Econometrica, 44(3), 461–465.

Hausman, J. A., and D. A. Wise (1978): ‘A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences,’ Econometrica, 46(2), 403–426.

Hoderlein, S., H. Holzmann, M. Kasy, and A. Meister (2016): ‘Erratum: Instrumental Variables with Unrestricted Heterogeneity and Continuous Treatment,’ The Review of Economic Studies, forthcoming.

Hoderlein, S., H. Holzmann, and A. Meister (2017): ‘The triangular model with random coefficients,’ Journal of Econometrics, 201(1), 144–169.

Hong, H., and E. Tamer (2003): ‘Endogenous binary choice model with median restrictions,’ Economics Letters, 80(2), 219–225.

Horowitz, J. L. (1992): ‘A smoothed maximum score estimator for the binary response model,’ Econometrica: Journal of the Econometric Society, pp.
505–531.

Ichimura, H., and L.-F. Lee (1991): ‘Semiparametric least squares estimation of multiple index models: single equation estimation,’ in Nonparametric and Semiparametric Methods in Econometrics and Statistics: Proceedings of the Fifth International Symposium in Economic Theory and Econometrics, pp. 3–49. Cambridge University Press.

Imbens, G. W., and W. K. Newey (2009): ‘Identification and estimation of triangular simultaneous equations models without additivity,’ Econometrica, 77(5), 1481–1512.

Kasy, M. (2011): ‘Identification in triangular systems using control functions,’ Econometric Theory, 27(3), 663–671.

(2014): ‘Instrumental variables with unrestricted heterogeneity and continuous treatment,’ The Review of Economic Studies, 81(4), 1614–1636.

Keane, M. P. (1994): ‘A Computationally Practical Simulation Estimator for Panel Data,’ Econometrica, 62(1), 95–116.

Khan, S. (2013): ‘Distribution free estimation of heteroskedastic binary response models using Probit/Logit criterion functions,’ Journal of Econometrics, 172(1), 168–182.

Kim, K. I., and A. Petrin (2017): ‘A New Control Function Approach for Non-Parametric Regressions with Endogenous Variables.’

Klein, R., and F. Vella (2009): ‘A semiparametric model for binary response and continuous outcomes under index heteroscedasticity,’ Journal of Applied Econometrics, 24(5), 735–762.

Krief, J. M. (2014): ‘An integrated kernel-weighted smoothed maximum score estimator for the partially linear binary response model,’ Econometric Theory, 30(3), 647–675.

Lewbel, A. (2000): ‘Semiparametric qualitative response model estimation with unknown heteroscedasticity or instrumental variables,’ Journal of Econometrics, 97(1), 145–177.

(forthcoming): ‘The identification zoo: meanings of identification in econometrics,’ Journal of Economic Literature.

Lewbel, A., Y. Dong, and T. T.
Yang (2012): ‘Comparing features of convenient estimators for binary choice models with endogenous regressors,’ Canadian Journal of Economics/Revue canadienne d’économique, 45(3), 809–829.

Lin, W., and J. M. Wooldridge (2015): ‘On different approaches to obtaining partial effects in binary response models with endogenous regressors,’ Economics Letters, 134, 58–61.

Manski, C. F. (1985): ‘Semiparametric analysis of discrete response: asymptotic properties of the maximum score estimator,’ Journal of Econometrics, 27(3), 313–333.

Manski, C. F. (1988): ‘Identification of Binary Response Models,’ Journal of the American Statistical Association, 83(403), 729–738.

McCulloch, C. E., and J. M. Neuhaus (2001): Generalized Linear Mixed Models. Wiley Online Library.

Newey, W. K. (1994): ‘The asymptotic variance of semiparametric estimators,’ Econometrica: Journal of the Econometric Society, pp. 1349–1382.

(2013): ‘Nonparametric instrumental variables estimation,’ American Economic Review, 103(3), 550–56.

Newey, W. K., and D. McFadden (1994): ‘Chapter 36: Large sample estimation and hypothesis testing,’ Handbook of Econometrics, 4, 2111–2245.

Newey, W. K., and J. L. Powell (2003): ‘Instrumental variable estimation of nonparametric models,’ Econometrica, 71(5), 1565–1578.

Newey, W. K., J. L. Powell, and F. Vella (1999): ‘Nonparametric estimation of triangular simultaneous equations models,’ Econometrica, 67(3), 565–603.

Petrin, A., and K. Train (2010): ‘A Control Function Approach to Endogeneity in Consumer Choice Models,’ Journal of Marketing Research, 47(1), 3–13.

Pinkse, J. (2000): ‘Nonparametric Two-Step Regression Estimation When Regressors and Error Are Dependent,’ The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 28(2), 289–300.

Rivers, D., and Q. H. Vuong (1988): ‘Limited information estimators and exogeneity tests for simultaneous probit models,’ Journal of Econometrics, 39(3), 347–366.

Rothe, C.
(2009): ‘Semiparametric estimation of binary response models with endogenous regressors,’ Journal of Econometrics, 153(1), 51–64.

Rothenberg, T. J. (1971): ‘Identification in parametric models,’ Econometrica: Journal of the Econometric Society, pp. 577–591.

Sim, C. H. (1993): ‘First-Order Autoregressive Logistic Processes,’ Journal of Applied Probability, 30(2), 467–470.

Smith, R. J., and R. W. Blundell (1986): ‘An exogeneity test for a simultaneous equation Tobit model with an application to labor supply,’ Econometrica: Journal of the Econometric Society, pp. 679–685.

Song, W. (2016): ‘A Semiparametric Estimator for Binary Response Models with Endogenous Regressors.’

Stock, J. H., J. H. Wright, and M. Yogo (2002): ‘A survey of weak instruments and weak identification in generalized method of moments,’ Journal of Business & Economic Statistics, 20(4), 518–529.

Su, L., and A. Ullah (2008): ‘Local polynomial estimation of nonparametric simultaneous equations models,’ Journal of Econometrics, 144(1), 193–218.

Swamy, P., and G. S. Tavlas (1995): ‘Random Coefficient Models: Theory and Applications,’ Journal of Economic Surveys, 9(2), 165.

Swamy, P. A. V. B. (1970): ‘Efficient Inference in a Random Coefficient Regression Model,’ Econometrica, 38(2), 311–323.

Torgovitsky, A. (2015): ‘Identification of nonseparable models using instruments with small support,’ Econometrica, 83(3), 1185–1197.

Wooldridge, J. (2010): Econometric Analysis of Cross Section and Panel Data. MIT Press.

Wooldridge, J. M. (2005): ‘Unobserved heterogeneity and estimation of average partial effects,’ Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, pp. 27–55.

(2018): ‘Correlated Random Effects Models with Unbalanced Panels,’ Journal of Econometrics.

Wooldridge, J. M., and Y.
Zhu (forthcoming): ‘Inference in Approximately Sparse Correlated Random Effects Probit Models,’ Journal of Business and Economic Statistics.