QUANTIFYING STRENGTH OF EVIDENCE IN EDUCATION RESEARCH: ACCOUNTING FOR SPILLOVER, HETEROGENEITY, AND MEDIATION

By

Qinyun Lin

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods – Doctor of Philosophy

2019

ABSTRACT

QUANTIFYING STRENGTH OF EVIDENCE IN EDUCATION RESEARCH: ACCOUNTING FOR SPILLOVER, HETEROGENEITY, AND MEDIATION

By Qinyun Lin

It is very rare that education studies have constant intervention effects that operate through simple mechanisms on independent individuals. It is well documented that schooling is a complex process because teachers, students, and administrators interact with each other in a diverse set of social contexts (e.g., An, 2018; Frank, 1998; Hong, 2015; Kim, Frank, & Spillane, 2018; Maroulis et al., 2010). As such, considering potential bias due to unobserved or uncontrolled spillover, heterogeneity, and alternative mediators is important to making an inference for policy implications. Additionally, since the ultimate goal of education research is to inform decision-making in the allocation of educational resources regarding curricula, pedagogy, practices, or school organizations (e.g., Bulterman-Bos, 2008; Cook, 2002), education research must be accessible to practitioners. Consequently, a sensitivity framework that can account for all potential sources of bias, including spillover, heterogeneity, and alternative mediators, is required to allow all stakeholders to conceptualize the quality of evidence independently so that the debate over future policy manipulations can take place in a more transparent, effective, and equitable way.

Drawing on the work of Frank, Maroulis, Duong, and Kelcey (2013), Chapters 1 and 2 of this dissertation propose a non-parametric case replacement approach to quantify the robustness of inference in multisite randomized control trials and in value-added measures of teacher effectiveness, accounting for spillover and heterogeneity. Throughout, the Tennessee class size experiment (Project STAR) is used to demonstrate the case replacement approach. Chapters 3 and 4 focus on unobserved mediators in a single-mediator model. Specifically, Chapter 3 examines whether and how omitting an alternative mediator can bias causal mediation effect estimates in a cross-sectional single-mediator model. Further, a sensitivity analysis approach is proposed to evaluate the robustness of causal mediation inference to missing a potential confounding mediator. Chapter 4 continues the discussion in Chapter 3, and a parameter framework is developed to characterize inconsistency in mediation models. This parameter framework is also applied to a longitudinal design with a post-treatment confounder.

Copyright by QINYUN LIN 2019

This dissertation is dedicated in memory of my mother, Minhong Du. I miss you every day, but I believe you would be glad to see this process through to completion.

ACKNOWLEDGMENTS

Doctoral study is such a long and challenging journey that I would not have been here without help and support from so many great people. I have been waiting for this moment when I can say THANK YOU to all the great people that I have met and learned from during the past five years. First, I would like to show my greatest gratitude to my incredible adviser, Dr. Ken Frank, for all his encouragement, guidance, trust, and support. I have learned so much from him that is far beyond just doing good research.
I was so fortunate to have so many opportunities to travel with him and talk with him to learn about his journey as a scholar, a professor, and a father, which has inspired me greatly along my journey as a graduate student, especially when I struggled or lacked confidence about my future as a scholar. He has brought me to a world where I found genuine joy in research, especially those moments when I finally figured out those intriguing questions. Thank you so much for always believing in and trusting me. It is your encouragement and support that has given me the confidence to continue pursuing the research journey I have dreamed about!

I want to express my sincere appreciation to all my committee members for all the constructive suggestions and advice that they have provided. My dissertation is very interdisciplinary, as it relates to methods and topics across education, econometrics, and psychology. It is my committee members' open-mindedness that has made this dissertation possible. I would like to thank Dr. Amy Nuttall for her dedicated and motivating guidance, which has made her not only a great instructor but also an inspiring mentor who led me through a challenging period of graduate studies. For the past two years, she has been so generous with her time and excellent advice. I also want to express my appreciation for Dr. Jeffrey Wooldridge. Your econometrics courses have introduced me to such an amazing world that I became fascinated by the beauty of quantitative methods. I believe these are the best courses I have ever taken! I also owe a thank you to Dr. Spiro Maroulis and Dr. Spyros Konstantopoulos. They have provided me with so many constructive suggestions and comments on how to frame this dissertation and keep moving forward. Although Dr. Qian Zhang is not my committee member, she has provided me with a great amount of comments and suggestions for the last two chapters, which I appreciate a lot! I also want to send special thanks to Dr. Andy Anderson for your generous support. You have been a role model for me as a good scientist and researcher. Every time we discussed quantitative analysis, your questions pushed me to think more and understand deeper. I also feel so fortunate to have had the opportunity to work with Christie Thomas, Dr. Stefanie Marshall, and everyone else in Carbon TIME in the past five years. I have learned so much from working with you. Thank you for all the help and support!

I also owe so many thanks to my friends and colleagues. Dr. Siwen Guo and Dr. Ran Xu, I appreciate all your help along the way. Talking with you could always help me understand a problem much better when I got stuck somewhere. I also owe special thanks to everyone in our research group: Tingqiao Chen, Zixi Chen, Yuqing Liu, and Dr. I-Chien Chen. I deeply appreciate the many discussions and the emotional support that we have shared. Finally, I would like to give my most special thanks to my husband, Xukun Xiang, and my father, Hui Lin. You are always there with me, no matter whether I am struggling, lost, or enthusiastic about some new progress. I am the luckiest person in the world to have such a great family in my life, always supporting me to pursue the life I really want.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1  QUANTIFYING STRENGTH OF EVIDENCE FOR INFERENCES IN MULTISITE RANDOMIZED CONTROL TRIALS: CASE REPLACEMENT, SPILLOVER, AND HETEROGENEITY
  1.1 Introduction
  1.2 Multisite Randomized Control Trials (MSTs)
  1.3 Spillover and Heterogeneity in MSTs
  1.4 Strength of Evidence in MSTs
  1.5 Case Replacement as a Counterfactual Thought Experiment
  1.6 Case Replacement for Quantifying the Strength of Evidence in MSTs
    1.6.1 Sources of bias in MSTs.
    1.6.2 Case replacement for each site when SUTVA holds.
    1.6.3 Case replacement for within-site spillover effects.
    1.6.4 Case replacement for within-site heterogeneous treatment effects.
    1.6.5 Heterogeneous treatment effects across sites.
  1.7 Illustrative Example of the Study of Class Size Effect in Project STAR
    1.7.1 Case replacement when SUTVA holds.
    1.7.2 Case replacement for within-site heterogeneous treatment effect.
    1.7.3 Case replacement for spillover effects.
    1.7.4 Case replacement for cross-site heterogeneity.
  1.8 Discussion

CHAPTER 2  QUANTIFYING STRENGTH OF EVIDENCE FOR INFERENCES IN VALUE ADDED MEASURES: CASE REPLACEMENT, SPILLOVER, AND HETEROGENEITY
  2.1 Introduction
  2.2 Spillover and Heterogeneity in Value-added Measures (VAMs)
  2.3 Strength of Evidence in VAMs
  2.4 Case Replacement as a Counterfactual Thought Experiment
  2.5 Case Replacement for Quantifying Uncertainty in VAMs
    2.5.1 Sources of bias in VAMs.
    2.5.2 Case replacement when SUTVA holds.
    2.5.3 Selective case replacement for heterogeneous effects.
    2.5.4 Case replacement for peer effect (violation of SUTVA).
  2.6 Illustrative Example of Evaluating Grade 1 Math Teachers in Project STAR
  2.7 Discussion

CHAPTER 3  UNOBSERVED MEDIATOR IN A SINGLE-MEDIATOR MODEL
  3.1 Introduction
  3.2 Dual-Mediator Designs
  3.3 Illustrative Data Example about Consequences of Omitting an Alternative Related Mediator
  3.4 Unobserved Causally Related Mediator as Posttreatment Confounder
  3.5 Goals of the Study
  3.6 Inconsistency When the Alternative Mediator Is Omitted
    3.6.1 Conditions for consistent estimates when omitting the alternative mediator.
    3.6.2 Direction of inconsistency when omitting the alternative mediator.
    3.6.3 How inconsistency changes with parameters related to the alternative mediator.
  3.7 How Serious Inconsistency Could Be at Different Levels of Parameters Related to the Alternative Mediator
  3.8 Correlation Framework and Sensitivity Analysis for Omitted Alternative Mediators
  3.9 Discussion
  3.10 Limitations and Future Directions

CHAPTER 4  APPLYING A PARAMETER FRAMEWORK TO QUANTIFY INCONSISTENCY IN A TIME VARYING MEDIATION MODEL
  4.1 Introduction
  4.2 Deriving Inconsistency Using the Law of Iterated Expectation for a Parameter Framework
  4.3 Understanding What Happens When Omitting the Alternative Mediator
  4.4 Using the Mechanism to Understand How Inconsistency Changes with Parameters Related to the Alternative Mediator
    4.4.1 How inconsistency changes with different levels of .
    4.4.2 How inconsistency changes with different levels of 2.
    4.4.3 How inconsistency changes with different levels of 2.
  4.5 Using the Mechanism to Understand the Inconsistency of Indirect and Direct Effects When Omitting the Alternative Mediator
  4.6 Applying the Mechanism to Understand the Inconsistency in Longitudinal Designs When Omitting the Alternative Mediator
  4.7 Discussion
  4.8 Limitations and Future Directions

DISCUSSION
APPENDIX
REFERENCES

LIST OF TABLES

Table 1.1: Quantifying robustness of inference to potential spillover effects under restricted assumptions.
Table 1.2: A general approach for quantifying robustness of inference to potential spillover effects.
Table 1.3: Quantifying robustness of inference to potential spillover effects from treatment to control group.
Table 1.4: Summary Statistics of Math Achievement by Class Type in Schools A and B.
Table 2.1: Case replacement approach for VAMs estimated by EB and DOLS.
Table 3.1: Bias in the Estimated Direct Effect of the Treatment on the Outcome and Estimated Indirect Effect via the Observed Mediator.
Table 3.2: Bias in the Estimated Direct Effect of the Treatment on the Outcome and Estimated Indirect Effect via the Observed Mediator ( = 0).
Table 3.3: Bias in the Estimated Direct Effect of the Treatment on the Outcome and Estimated Indirect Effect via the Observed Mediator (1 = 2 = 0).
Table 3.4: Bias in the Estimated Direct Effect of the Treatment on the Outcome and Estimated Indirect Effect via the Observed Mediator (2 = 0).

LIST OF FIGURES

Figure 1.1: Bias introduced by spillover effects.
Figure 1.2: Successive extreme replacement for within-site heterogeneous treatment effects.
Figure 1.3: Cross-site variation of the intervention effect.
Figure 1.4: Math achievement by class type in school B.
Figure 1.5: Cross-site variation in the robustness of inference of a small class size effect on math achievement (Grade K).
Figure 1.6: Summarizing the small class size effect across schools in one figure.
Figure 1.7: Cross-site variation in the small class size effect on math achievement (Grade 1).
Figure 2.1: Teacher effects estimated by VAM (hypothetical example).
Figure 2.2: Example replacement of students to invalidate Ashley's evaluation based on VAM.
Figure 2.3: Case replacement for peer effects.
Figure 2.4: VAMs for 268 teachers.
Figure 2.5: VAMs for 14 ineffective teachers and their robustness in terms of the percentage of students that need to be replaced.
Figure 2.6: Sampling variability of the percentage of students that need to be replaced, for two teachers.
Figure 2.7: VAMs for 14 ineffective teachers and their robustness to potential peer effects as bias.
Figure 2.8: VAMs (estimated by DOLS) for 101 teachers who taught small classes.
Figure 2.9: Gain score distributions for 5 teachers below the threshold in the DOLS approach.
Figure 3.1: Simple mediation and dual-mediator designs.
Figure 3.2: Illustrative data example of presumed media influence.
Figure 3.3: Confounder in mediation.
Figure 3.4: True model with two mediators and the model omitting the alternative mediator.
Figure 3.5: How bias changes with different levels of .
Figure 3.6: How bias changes with different levels of 2.
Figure 3.7: How bias changes with different levels of 2.
Figure 3.8: Sensitivity analysis for an unobserved post-treatment confounder.
Figure 4.1: Parameter framework to understand what happens when omitting the alternative mediator.
Figure 4.2: Understanding how bias changes with different levels of 2.
Figure 4.3: Different causal pathways from the treatment to the outcome by model: the true model with two mediators versus the model that omits the alternative mediator.
Figure 4.4: Two special situations where all inconsistency goes to the direct effect (treatment → outcome) or all inconsistency goes to the indirect effect (treatment → mediator → outcome).
Figure 4.5: Longitudinal mediation model with a two-unit lag for the direct effect of the treatment on the outcome.
Figure 4.6: Longitudinal mediation model with an unobserved latent mediator.
Figure 4.7: Parameter framework extended to longitudinal designs.
Figure .8: Sign of ˜1 2.
Figure .9: Sign of ˜ 2.
Figure .10: Sign of ˜ 2.

INTRODUCTION

This dissertation is centered on causal inference regarding the evaluation of an intervention, from whether an intervention works to why it works, accounting for the social and dynamic contexts in which interventions are implemented. The first two chapters propose an approach to quantify the robustness of inference in multisite randomized control trials and value-added measures for teacher effectiveness. Drawing on the work of Frank et al. (2013), these two chapters further extend the non-parametric case replacement approach to quantify how much bias due to spillover (a violation of SUTVA) and the presence of heterogeneous treatment effects must be present to invalidate an inference. Throughout, the Tennessee class size experiment (Project STAR) is used to demonstrate the case replacement approach in both contexts: multisite randomized control trials and value-added measures.

One of the main goals of Project STAR is to study the effect of small class size on student achievement. It is well documented that small class size has a positive effect on boosting students' learning, but the effect can vary substantially depending on grade levels and schools (e.g., Hanushek, 1999; Konstantopoulos, 2011; Pedder, 2006; Schanzenbach, 2007). As shown in Chapter 1, this also includes the fact that some schools in Project STAR show strong evidence for negative small class size effects. Additionally, we need to evaluate each school on its own, and to do so we need to quantify its effect relative to a unique threshold for that school. We also need to account for how students might influence each other within or between classrooms, as these influences may bias the estimation of the intervention effects.

Mediating Mechanisms

Underlying this treatment effect heterogeneity are potential mechanisms for how an intervention affects students' learning. Specifically, these mechanisms require in-depth studies of the complex classroom and school processes that mediate interventions on students' learning (Pedder, 2006), including factors such as teachers' practices, classroom discourse routines, teacher-student interactions, and peer relations. Studying these mechanisms can tell us why an intervention works or not in certain contexts so that future policy manipulations can be better informed. For example, Harfitt (2013) shows that reducing class sizes may not work if teachers do not seek to exploit the advantages of a smaller class size by changing their pedagogies. With teaching practices serving as a crucial mediator, class size reduction may only work as expected when coupled with professional development for teachers. As such, studying the causal mechanism may allow us to explain the heterogeneity of treatment effects and to figure out necessary conditions for realizations of intervention effects, so that later policy manipulations can be better informed.
Otherwise, heterogeneous treatment effects may lead to unexplained inconsistencies that allow politicians of different persuasions to intentionally select findings that support their preferred policy choices (Blatchford & Martin, 1998; Pedder, 2006).

Recognizing the importance of mediation analysis, Chapters 3 and 4 focus on unobserved mediators in a single-mediator model. Specifically, Chapter 3 examines whether and how omitting an alternative mediator that is confounded with an observed mediator can bias causal mediation effect estimates in a cross-sectional single-mediator model. Further, a sensitivity analysis approach is proposed to evaluate the robustness of causal mediation inference to missing a potential confounding mediator. Chapter 4 continues the discussion in Chapter 3 about an unobserved mediator but further leverages a parameter framework to discuss how the bias (more precisely, inconsistency) is generated for each path coefficient of interest in a time-varying model consistent with a dynamic process. Applying the Law of Iterated Expectation and the linear regression framework, the bias (more precisely, inconsistency) generation mechanism underlying the cross-sectional model can also be applied to a post-treatment confounder in a time-varying single-mediator model.

CHAPTER 1
QUANTIFYING STRENGTH OF EVIDENCE FOR INFERENCES IN MULTISITE RANDOMIZED CONTROL TRIALS: CASE REPLACEMENT, SPILLOVER, AND HETEROGENEITY

1.1 Introduction

In the introduction I described how threats to validity can be interpreted in terms of the sampling mechanism. The goal of this chapter is to characterize the robustness of a causal inference by interpreting it as the percentage of a sample that must be replaced with counterfactual no-effect cases to alter the inference. Most importantly, I will extend this case replacement framework to attend to violations of the Stable Unit Treatment Value Assumption (SUTVA) and the presence of heterogeneous treatment effects.

The ultimate goal of any educational research is to inform decision-making in the allocation of educational resources regarding curricula, pedagogy, practices, or school organizations (e.g., Bulterman-Bos, 2008; Cook, 2002). Consequently, education research must be accessible to practitioners. Two overarching principles underlying the AERA's "Standards for Reporting on Empirical Social Science Research" guide education researchers in engaging stakeholders: the sufficiency of the warrants and the transparency of the report. But it is not easy for education researchers to achieve these two principles in practice, especially for varied audiences that may include stakeholders from various backgrounds: policymakers and practitioners, including administrators, teachers, and parents. Effective communication requires a framework for informing discussions and debate about inferences that makes sense to all stakeholders.

Sensitivity analyses can serve as a useful tool to inform debate about specific inferences by quantifying the strength of evidence in education research. The quality of evidence is quantified by discussing the conditions that would alter the inference (e.g., Frank, 2000; Imbens, 2003; Rosenbaum, 2002; VanderWeele & Arah, 2011).
These analyses generate statements such as "an omitted variable would have to be correlated at __ with the treatment and with the outcome to invalidate an inference of an effect of the treatment on the outcome." As such, recent approaches to sensitivity analysis help interpreters of research quantify the conditions necessary to invalidate an inference, drawing on familiar quantities such as correlations (Frank, 2000), percentage of variance explained (Cinelli & Hazlett, 2018), or graphical representations such as contour plots (Imbens, 2003). But these existing sensitivity analysis approaches are constrained by specific models, and the discourse is in the language of correlations and variances. We argue that a well-designed sensitivity analysis framework should go beyond the constraints of specific models and make sense to varied audiences, including those without any statistics background. As a result, a powerful sensitivity framework should allow all stakeholders to conceptualize the quality of evidence independently so that the debate can take place in a more transparent, effective, and equitable way. Accordingly, the resulting policy manipulations can also be based on a well-informed discussion. In addition, the advantage of going beyond model constraints is that it allows comparisons among different studies.

As an attempt to provide a powerful sensitivity analysis tool that serves these requirements, we will introduce a non-parametric case replacement approach illustrated by Frank et al. (2013) that draws on Rubin's causal model (RCM) (Rubin, 1974) to express concerns about bias in terms of the characteristics of unobserved, counterfactual data. To demonstrate how this case replacement framework differs from other existing approaches, we will contextualize our discussions in multisite randomized control trials (MSTs). MSTs are highly relevant, as they provide evidence to inform policy manipulations and demonstrate typical education scenarios where SUTVA and the constant treatment effect assumption are rarely satisfied in practice.

For purposes of illustration, we will use the Tennessee class size experiment, or Project STAR (Student-Teacher Achievement Ratio), to demonstrate our sensitivity approach. There were 79 elementary schools in 42 school districts involved in this four-year project to study the effect of class sizes on student achievement. In each school, kindergarten students were randomly assigned into small classes (13-17 students), regular classes (22-26 students), or regular classes with a full-time aide. Teachers were also randomly assigned to these different types of classes. The assignments of students and teachers to different class types were maintained from kindergarten through the third grade.

1.2 Multisite Randomized Control Trials (MSTs)

Regarded as the "gold standard" and the most powerful experimental design, randomized control trials (RCTs) are being applied with increasing frequency to measure the effectiveness of educational interventions. More than 160 evaluations that randomized individuals or groups to treatment and control conditions have been funded by the National Center for Education Research of the Institute of Education Sciences (IES) since 2002 (Bloom & Spybrook, 2017). Among these, MSTs are gaining increasing popularity (Spybrook & Raudenbush, 2009; Spybrook, Shi, & Kelcey, 2016), where individuals within each site are randomly assigned to treatment and control conditions.
With a large and diverse sample from different sites, MSTs may have several potential advantages, including stronger generalizability of findings and allowing estimation of cross-site treatment effect variation (e.g., Bloom, Raudenbush, Weiss, & Porter, 2017; Bloom & Spybrook, 2017). The findings regarding the overall mean treatment effect as well as the variance of cross-site treatment effects are then both used to inform later policy manipulations. For example, some studies based on Project STAR suggest that the positive effect of small class size on student achievement in the early grades is large enough to inform education policy (Nye, Hedges, & Konstantopoulos, 2000), but the small class effect is not consistent in all schools: although students in many schools benefit considerably, in other schools being assigned to a small class shows no effect or even negative effects on student achievement (Konstantopoulos, 2011). Thus, based on these findings, a policymaker might conclude that reduction of class sizes might benefit some students, but not all.

Even randomized control trials are not free of bias. The underlying idea of randomization is to eliminate possible contaminating effects by trying to ensure no systematic differences in participants' baseline characteristics. But randomization cannot exclude other sources of error that can happen in educational settings, such as non-compliance, attrition, and problems in intervention implementation fidelity (e.g., Hanushek, 1999; Sullivan, 2011). These potential sources of bias can create validity problems for MSTs as well. For example, it is well documented that Project STAR suffers from a number of important design and implementation issues that can create potential bias (Hanushek, 1999; Konstantopoulos, 2011; Nye et al., 2000). First, the manipulation of class size, as the intervention of interest, was not implemented with fidelity in all schools. Second, there was sizable attrition as well as missing test scores in each year. When attrition occurred, new students were added, but there were no pretests available for these new students to verify the randomization throughout the experiment. Third, the participating schools were not randomly selected; instead, they had to volunteer to participate. Evidence shows the sample does differ from the total student population in Tennessee in fall 1986 (Hanushek, 1999). Finally, some students moved between different classroom types (treatment or control conditions in this project) throughout the experiment.

1.3 Spillover and Heterogeneity in MSTs

MSTs demonstrate typical education scenarios where SUTVA is rarely satisfied in practice. As a fundamental assumption for causal inferences, SUTVA requires that the treatment status experienced by one unit does not affect the outcomes of another (Rubin, 1986, 1990). But it is well documented that schooling is a complex process because teachers, students, and administrators interact with each other in a diverse set of social contexts (e.g., An, 2018; Frank, 1998; Kim et al., 2018; Maroulis et al., 2010). In many MSTs for educational interventions, each site can be one classroom or one school, where peer effects can bias even the estimation of the treatment effect in a single site. For example, in Project STAR, students were randomly assigned to each treatment condition in kindergarten, but because of attrition, new students were added to each grade every year.
With no guarantee that these new students were randomly assigned, they might become distractors or contributors that can affect other students in the same classroom. It is also possible that students from different classrooms (i.e., treatment conditions) might interact and learn from each other, introducing bias to the estimation of the between-class achievement difference (i.e., the intervention effect).

Whenever a single treatment effect is estimated, there can be heterogeneous treatment effects. The heterogeneity can stem from differences in any aspect that relates to the realization of the treatment effect, including individual characteristics, contextual effects, and mediating mechanisms. In MSTs, heterogeneous treatment effects can appear both within sites and across sites, among which the cross-site inconsistencies of treatment effects can play a crucial role in the generalizability of the findings and policy implications.

1.4 Strength of Evidence in MSTs

In MSTs, the estimated effects are compared to a certain threshold to make an inference as a basis for policy implications. Specifically, thresholds based on statistical significance can be applied to claim whether the overall mean intervention effect is significantly positive or negative and whether there are significant cross-site variations. Regardless of the specific definition, a threshold represents "the point at which evidence from a study would make one indifferent to the policy choices" (Frank et al., 2013). As argued in Frank et al. (2013), the comparison between the threshold and the estimate then represents the strength of evidence that supports the inference that directly links to the policy choice. Thus, all stakeholders should be able to understand this comparison so that the policy choice can be made after comprehensive consideration and evaluation of the strength of evidence against potential costs. For example, if Project STAR supports an inference of a positive small class size effect on student achievement, then future educational policies would be informed to reduce class sizes. However, there are debates about whether the inference is strong enough to inform policies. For instance, Hanushek (1999) maintains that considerable uncertainty about the class size effects is suggested by a number of important design and implementation issues in the project, and that the evidence is not strong enough to show a systematic effect from overall class size reduction policies. In contrast, Nye et al. (2000) argue that even with shortcomings in implementation, the estimated class size effects are large enough to inform policies. The debate here is essentially about how far the estimated effect exceeds the threshold, and how consistent the effect is across different sites.

Thresholds based on statistical significance require thinking about a repeated sampling framework that conjures a scenario beyond the observed data. This can create difficulties for people without any statistical background. As we will demonstrate, the case replacement approach proposed in Frank et al. (2013) provides a more intuitive alternative for quantifying the comparison between the threshold and the estimated effect to inform discussions among all stakeholders. Additionally, this framework can be applied to any type of threshold. This study has two goals. First, we want to demonstrate how the case replacement approach can be applied in MSTs. Second, we aim to extend Frank et al.'s
(2013) work by discussing how we can quantify the robustness of inferences to violations of SUTVA and to the presence of heterogeneous treatment effects. In the following section, we will first review the case replacement framework proposed by Frank et al. (2013).

1.5 Case Replacement as a Counterfactual Thought Experiment

In Frank et al. (2013), the authors showed an approach to quantify how much bias there must be in an estimate to invalidate an inference. The bias is then interpreted in terms of sample replacement to allow a more intuitive interpretation. In other words, to show how robust an inference is, we ask a question based on a counterfactual thought experiment: what percentage of the sample would have to be replaced with counterfactual (unobserved) no-effect cases to invalidate an inference made from the data? Or, if the concern is about external validity, we consider what percentage of the sample would have to be replaced with no-effect cases from an unsampled population. The larger the percentage, the more robust the conclusion or inference, and the less likely that the finding is due only to chance or bias.

This case replacement idea can be applied in various ways to characterize the strength of evidence in MSTs. But the general idea is always about replacing some observed cases with some unobserved cases. The replacement process can become very flexible depending on (1) how we select cases from the observed sample to be replaced; (2) what cases are regarded as replacement cases; and (3) whether the non-replaced cases in the sample experience any changes during the replacement. In the following analysis, we will discuss different ways to consider the hypothetical replacement and how each approach helps inform the comparison between the threshold and the estimated effect under different contexts.

1.6 Case Replacement for Quantifying the Strength of Evidence in MSTs

1.6.1 Sources of bias in MSTs.

As discussed above, there can be various sources of bias in MSTs. However, we can regard all of these as essentially violating the random assignment assumption. That is, sources of bias create differences between the treatment and control groups in addition to the experimental condition and, more importantly, these differences affect the outcome measures. For example, attrition and added students in Project STAR may introduce differences between the small and regular classes that can lead to differential achievement outcomes. Teacher expectations of or reactions to the treatment condition are another example of a potential contaminating factor. When these differences between treatment and control groups are present, they are confounded with the intervention of interest, and we cannot tell whether and how the intervention of interest causes changes separately from other contaminating factors.

1.6.2 Case replacement for each site when SUTVA holds.

In a counterfactual framework, the violation of the random assignment assumption indicates that the control group individuals do not provide an accurate approximation of the counterfactuals for treatment group individuals had they been assigned to the control condition. Or, vice versa, the treatment group individuals do not provide an accurate approximation of the counterfactuals for control group individuals had they been exposed to the treatment condition.
In other words, bias is introduced because 1) treatment group members are compared with observed control group members rather than their counterfactuals (treatment group members if they had received the control), and 2) control group members are compared with observed treatment group members rather than their counterfactuals (control group members if they had received the treatment) (Frank et al., 2013). That is, the way different sources of bias create problems can be understood as missing data for individuals if they were assigned to a different experimental condition (Holland, 1986). This feature, which recasts all potential sources of bias in terms of missing data, has been used in the case replacement approach to quantify the robustness of inference (Frank et al., 2013).

To illustrate, consider in each site we have two experimental conditions: treatment vs. control. The null hypothesis is $H_0: \mu_t = \mu_c$, where $\mu_t$ and $\mu_c$ are the population means under the treatment and control conditions, respectively. Note these are different outcomes for the same population under different conditions. Now consider we observe sample averages of the treatment and control groups, denoted as $\bar{Y}_t$ and $\bar{Y}_c$, respectively, and we have $\bar{Y}_t - \bar{Y}_c > \delta^{\#}$, where $\delta^{\#}$ denotes the threshold, so that we infer a significantly positive intervention effect. In other words, $\bar{Y}_c$ is used as an approximation of the counterfactuals for treatment individuals' outcomes if they had received the control. Now, in the thought experiment, assume a proportion of observations, say $\pi$, in the control group cannot serve as accurate approximations for counterfactuals of the treatment group, due to violations of the random assignment assumption, and that the true intervention effect for these individuals is zero, indicating their true counterfactuals would be $\bar{Y}_t$ rather than $\bar{Y}_c$ if they had received the control. Then the average for the control condition becomes $\bar{Y}_c \cdot (1 - \pi) + \bar{Y}_t \cdot \pi$. The intervention effect, as the difference between the treatment and control conditions, becomes $\bar{Y}_t - [\bar{Y}_c \cdot (1 - \pi) + \bar{Y}_t \cdot \pi]$. Setting this difference equal to the threshold, so that the amount of bias can invalidate the inference, we can solve for $\pi$, where $\pi = 1 - \delta^{\#}/(\bar{Y}_t - \bar{Y}_c)$. That is, $\pi$ characterizes the amount of bias necessary to invalidate the inference (i.e., the difference between the estimated effect and the threshold) as the proportion of treatment cases for which the true treatment effect is zero and for which the control group members provide biased approximations of their counterfactuals had they been assigned to the control condition.

To put it more simply, the case replacement approach quantifies the robustness of an inference by considering replacing a proportion of the observed control cases with unobserved counterfactuals of the treatment cases if they had received the control, assuming these treatment cases experience null treatment effects. This leads to a new estimated effect after replacement: $(\bar{Y}_t - \bar{Y}_c) \cdot (1 - \pi) + 0 \cdot \pi$. By setting this equal to the threshold, we can solve for $\pi$, which represents a value that can be applied to characterize the strength of evidence against a certain threshold for an inference. A larger $\pi$ indicates more cases need to be replaced with null-effect cases to invalidate the inference and, correspondingly, more bias needs to be present to invalidate the inference, which represents a more robust inference against all sources of bias. This is a very brief review of how the case replacement approach applies the counterfactual framework to quantify the robustness of inference. See Frank et al. (2013) for a more detailed formalization of this thought experiment.
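To make this quantity concrete for the case where the estimate exceeds the threshold, the following short sketch (in Python; an illustration added here rather than part of Frank et al.'s (2013) formalization, with purely hypothetical numbers) computes the replacement proportion $\pi$.

```python
def percent_to_invalidate(estimate: float, threshold: float) -> float:
    """Proportion of cases that must be replaced with null-effect cases to
    pull an above-threshold estimate down to the threshold:
    pi = 1 - threshold / estimate."""
    if estimate <= threshold:
        raise ValueError("Estimate must exceed the threshold for the invalidate case.")
    return 1 - threshold / estimate

# Hypothetical example: an estimated effect of 0.80 SD against a
# significance-based threshold of 0.25 SD.
pi = percent_to_invalidate(estimate=0.80, threshold=0.25)
print(f"{pi:.1%} of cases would have to be replaced to invalidate the inference.")
# Prints roughly 68.8%.
```

The same logic, run in reverse, handles estimates that fall short of the threshold, as discussed next.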
Similar arguments can be applied to scenarios where the observed treatment effect is positive but below the threshold: $0 < \bar{Y}_t - \bar{Y}_c < \delta^{\#}$. Now the goal is to quantify how the data must change to sustain an inference. Specifically, consider that the observed treatment group average represents a combination of cases experiencing zero effects and threshold-level effects, with proportions $\pi$ and $(1 - \pi)$, respectively. This gives us $\bar{Y}_t = (0 + \bar{Y}_c) \cdot \pi + (\delta^{\#} + \bar{Y}_c) \cdot (1 - \pi)$, from which we can solve for $\pi$: $\pi = 1 - (\bar{Y}_t - \bar{Y}_c)/\delta^{\#}$. In this scenario, $\pi$ represents the proportion of null-effect cases that need to be replaced with threshold-level-effect cases to sustain the inference. A larger $\pi$ indicates more evidence is needed to sustain the inference, which means the estimated effect is further away from the threshold for making an inference.

1.6.3 Case replacement for within-site spillover effects.

Now we want to relax the assumption of no spillover effects: the experimental condition of one unit can affect the outcome of another unit. That is, we want to conceptualize the potential bias due to spillover effects within each site so that we can characterize how sensitive the inference of each randomized control trial is to potential bias caused by spillover effects. It is important to note that, in order to generate bias, the spillover effects should not be introduced by the treatment condition; otherwise the spillover becomes a mediator that should be counted as part of the total treatment effect. We argue that spillover effects should satisfy at least two conditions to introduce bias: (1) inhere in direct interactions or indirect exposures among individuals; and (2) operate in a way that is independent of the treatment assignment.

As before, consider we observe sample averages of the treatment and control groups, denoted as $\bar{Y}_t$ and $\bar{Y}_c$, respectively, and we have $\bar{Y}_t - \bar{Y}_c > \delta^{\#}$ for a significantly positive intervention effect. But we wonder if the observed difference $\bar{Y}_t - \bar{Y}_c$ might be biased by either positive spillover effects in the treatment group or negative spillover effects in the control group, or even both.

Specifically, assume in the treatment group (with sample size $n_t$) we have a proportion (represented by $p$) of cases teaching the other $n_t \cdot (1 - p)$ treatment cases. Each case in the teaching group benefits from teaching the other cases. Define the positive effect of teaching one case as $TE$ (teaching effect); then teaching the other $n_t \cdot (1 - p)$ cases gives each of them a benefit of $n_t \cdot (1 - p) \cdot TE$. Similarly, defining the positive effect of learning from one case as $LE$ (learning effect), each case in the learning group experiences a positive learning effect of $n_t \cdot p \cdot LE$ through studying with the other $n_t \cdot p$ cases. Define the isolated average outcome in the treatment group, without the spillover effects, as $\bar{Y}_t^{iso}$; then we should be able to get the following formula, representing that the observed treatment average is the sum of the isolated treatment average and the positive spillover effects:

$\bar{Y}_t = \bar{Y}_t^{iso} + p \cdot n_t \cdot (1 - p) \cdot TE + (1 - p) \cdot n_t \cdot p \cdot LE = \bar{Y}_t^{iso} + n_t \cdot p \cdot (1 - p) \cdot (TE + LE)$

Similarly, assume in the control group (with sample size $n_c$) we have a proportion (represented by $q$) of cases distracting the other $n_c \cdot (1 - q)$ control cases. Then each case in the first (distracting) group experiences a self-initiated distraction effect of $n_c \cdot (1 - q) \cdot DE_s$, and each case in the second group experiences a peer-initiated distraction effect of $n_c \cdot q \cdot DE_p$, where $DE_s$ and $DE_p$ denote the per-pair self-initiated and peer-initiated distraction effects, respectively.
Denoting the isolated average outcome in the control group as $\bar{Y}_c^{iso}$, we should be able to get:

$\bar{Y}_c = \bar{Y}_c^{iso} - q \cdot n_c \cdot (1 - q) \cdot DE_s - (1 - q) \cdot n_c \cdot q \cdot DE_p = \bar{Y}_c^{iso} - n_c \cdot q \cdot (1 - q) \cdot (DE_s + DE_p)$

Now we can write out the formulas for the isolated treatment and control group averages, without any spillover effects, as follows:

$\bar{Y}_t^{iso} = \bar{Y}_t - n_t \cdot p \cdot (1 - p) \cdot (TE + LE)$
$\bar{Y}_c^{iso} = \bar{Y}_c + n_c \cdot q \cdot (1 - q) \cdot (DE_s + DE_p)$

By subtracting the second equation from the first, we get the difference between $\bar{Y}_t^{iso}$ and $\bar{Y}_c^{iso}$ as the treatment effect. Setting this effect equal to the threshold, so that the amount of bias can invalidate the inference, we get:

$\bar{Y}_t - \bar{Y}_c - \delta^{\#} = n_t \cdot p \cdot (1 - p) \cdot (TE + LE) + n_c \cdot q \cdot (1 - q) \cdot (DE_s + DE_p)$

This illustrates the general situation in which the bias comes from both positive spillover effects in the treatment group and negative spillover effects in the control group. We can also consider two special situations: 1) all the bias comes from positive spillover effects in the treatment group (i.e., $DE_s + DE_p = 0$), which gives $\bar{Y}_t - \bar{Y}_c - \delta^{\#} = n_t \cdot p \cdot (1 - p) \cdot PE$, where $PE$ (positive effect) $= TE + LE$; and 2) all the bias comes from negative spillover effects in the control group (i.e., $TE + LE = 0$), which gives $\bar{Y}_t - \bar{Y}_c - \delta^{\#} = n_c \cdot q \cdot (1 - q) \cdot NE$, where $NE$ (negative effect) $= DE_s + DE_p$.

In all three situations, we see that the difference between the estimated treatment effect $\bar{Y}_t - \bar{Y}_c$ and the threshold is written either as a linear function of the spillover effect (i.e., $PE$ or $NE$), given the proportion and the group size, or as a quadratic function of $p$ (or $q$), given the spillover effect and the group size. To simplify the discussion, take the scenario where all bias comes from positive effects in the treatment group as an example. Figure 1.1 shows the bias introduced by spillover effects as a quadratic function of $p$ at different levels of the spillover effect (i.e., $PE$). The axis of symmetry is $p = 0.5$. Consider the situations when $0 < p < 0.5$; if $p > 0.5$, we can always find another $p$ from $(0, 0.5)$ that generates the same $(\bar{Y}_t - \bar{Y}_c - \delta^{\#})$ by symmetry. Then, given a fixed amount of $PE$, a larger $p$ is required to account for a larger difference between the observed estimated effect and the threshold. That is, we would need stronger spillover effects (more individuals teaching others in the treatment group) to invalidate the inference, indicating a more robust inference. Similar conclusions hold for sites whose estimated effect is so large that the distance to the threshold $(\bar{Y}_t - \bar{Y}_c - \delta^{\#})$ is larger than the maximum value of this function, which is $0.25 \cdot n_t \cdot PE$.

Figure 1.1: Bias introduced by spillover effects.

Another way to interpret this is to assume that 50% of cases teach the other 50% of cases in the treatment group (i.e., $p = 0.5$). Then we can solve for $PE = TE + LE$. That is, $PE$ tells us how large the positive spillover effects must be to invalidate the inference of a positive intervention effect. The larger the resulting $PE$ is, the more robust the inference is to the threat of spillover effects.
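As an illustration of this interpretation, the sketch below (my own, with hypothetical inputs; the function name and the numbers are not taken from the dissertation) solves the quadratic-in-$p$ expression for the unit spillover effect $PE$ required to account for the gap between the estimate and the threshold, assuming half the treatment group teaches the other half.

```python
def spillover_to_invalidate(estimate: float, threshold: float,
                            n: int, p: float = 0.5) -> float:
    """Unit spillover effect (PE = TE + LE) in the treatment group needed to
    account for the gap between the estimate and the threshold:
    estimate - threshold = n * p * (1 - p) * PE."""
    return (estimate - threshold) / (n * p * (1 - p))

# Hypothetical example: a treatment group of 13 students, an estimated
# effect of 1.74 SD, and a threshold of 0.49 SD, with half the group
# assumed to teach the other half (p = 0.5).
pe = spillover_to_invalidate(estimate=1.74, threshold=0.49, n=13)
print(f"Each unit spillover effect must exceed {pe:.2f} to invalidate the inference.")
# Prints roughly 0.38.
```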
Additionally, by assuming $n_t = n_c = n$ and $p = q$, we can further simplify the formula for the general scenario where bias is introduced by both positive spillover effects in the treatment group and negative spillover effects in the control group: $\bar{Y}_t - \bar{Y}_c - \delta^{\#} = (PE + NE) \cdot n \cdot p \cdot (1 - p)$. Then we can apply the same approach as above, using either $p$ or the combined spillover effect $(PE + NE)$ to describe the robustness of the inference to potential spillover effects as bias. This way we can reduce the number of sensitivity parameters while considering spillover effects in both treatment conditions.

Underlying the discussions of spillover effects are also counterfactual interpretations. By removing the spillover effects among participants, we are in fact trying to come up with the estimated effect we would have if the SUTVA assumption were satisfied and individuals were independent of each other. Alternatively, we can consider replacing individuals with other individuals who experience the same treatment effect but no spillover effect.

The discussion above has made several assumptions about spillover effects. To illustrate, we use Table 1.1 to present the relations between senders and receivers of spillover effects indicated by the discussion above, when we wonder whether the estimated effect is biased by positive spillover effects within the treatment group. Assume four individuals are in the treatment group: A, B, C, and D; 50% of them teach the other 50% ($p = 0.5$). Specifically, consider A and B teach C and D. Then Table 1.1 displays the specific spillover effect experienced by each pair of individuals based on the discussion above. For example, in the first row, A experiences $TE$ (the teaching effect) by teaching C and D but does not experience any spillover effect from B. Similarly, C experiences $LE$ by learning from A and B but does not experience any spillover effect from D.

Table 1.1: Quantifying robustness of inference to potential spillover effects under restricted assumptions.

| Individuals within treatment group | A (teaching) | B (teaching) | C (learning) | D (learning) |
| A (teaching) | NA | 0 | TE | TE |
| B (teaching) | 0 | NA | TE | TE |
| C (learning) | LE | LE | NA | 0 |
| D (learning) | LE | LE | 0 | NA |
Note. Each cell gives the spillover effect experienced by the row individual through the pair with the column individual.

As such, several assumptions are implied here. First, we assume spillover effects are present for pairs of individuals within one treatment-condition group (i.e., either treatment or control) but across the teaching and learning groups (or the self-initiated distracting and peer-initiated distracted groups). Second, each type of spillover effect (whether $TE$, $LE$, $DE_s$, or $DE_p$) is constant across individuals. Inspired by the linear-in-means model in the peer effects literature, these assumptions help us simplify the sensitivity analysis technique, and we can easily apply it to any unknown types of spillover effects.

But what if we have specific spillover effects that violate these assumptions? Then a weighted matrix of relations between senders and receivers of spillover effects provides a more powerful and flexible tool to accommodate any possibilities. For simplicity, still assume we have four individuals: A, B, C, and D. But now A and B are from the treatment group and C and D are from the control group. Then we can use a four-by-four table, as presented in Table 1.2, to specify the spillover effect between each pair of individuals, whether they are from the same experimental condition or not. For example, the first row lists the spillover effects experienced by individual A from individuals B, C, and D. Importantly, all the off-diagonal cells can have different values (the diagonal elements are meaningless because they would represent one experiencing a spillover effect from oneself). Assume we have weighted friendship data for all individuals in the sample; then we can use this table to ask how strong the unit spillover effect through the friendship network needs to be to invalidate the inference.
Table 1.2: A general approach for quantifying robustness of inference to potential spillover effects.

| Individuals | A (treatment) | B (treatment) | C (control) | D (control) |
| A (treatment) | NA | B → A | C → A | D → A |
| B (treatment) | A → B | NA | C → B | D → B |
| C (control) | A → C | B → C | NA | D → C |
| D (control) | A → D | B → D | C → D | NA |

Now consider a simple application of this weighted matrix for a scenario where we have a positive but insignificant estimated treatment effect (i.e., $\delta^{\#} > \bar{Y}_t - \bar{Y}_c > 0$). We wonder if the observed difference $\bar{Y}_t - \bar{Y}_c$ might be downward biased by positive spillover effects from the treatment group to the control group. That is, individuals in the control group experience positive effects by interacting with individuals in the treatment group. Table 1.3 demonstrates how Table 1.2 can be applied to study this scenario, where only the bottom-left panel shows spillover effects, because that panel represents spillover effects from all individuals in the treatment group to all individuals in the control group.

Table 1.3: Quantifying robustness of inference to potential spillover effects from treatment to control group.

| Individuals | A (treatment) | B (treatment) | C (control) | D (control) |
| A (treatment) | NA | 0 | 0 | 0 |
| B (treatment) | 0 | NA | 0 | 0 |
| C (control) | A → C | B → C | NA | 0 |
| D (control) | A → D | B → D | 0 | NA |

Now define the positive learning effect experienced by each individual in the control group through interacting with one individual in the treatment group as $LE_{tc}$; each control individual then receives an amount $n_t \cdot LE_{tc}$ of positive spillover. Denoting the isolated control mean as $\bar{Y}_c^{iso}$, we can write $\bar{Y}_c^{iso} = \bar{Y}_c - n_t \cdot LE_{tc}$. Now, by setting the difference between the treatment mean ($\bar{Y}_t$) and the isolated control mean ($\bar{Y}_c^{iso}$) equal to the threshold, we obtain $\delta^{\#} - (\bar{Y}_t - \bar{Y}_c) = n_t \cdot LE_{tc}$. From this we can calculate how large $LE_{tc}$ must be to sustain a positive treatment effect inference. A larger difference between the threshold and the estimated treatment effect indicates that larger positive spillover effects (i.e., $LE_{tc}$) must be present to sustain the inference.
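The following sketch illustrates one way such a weighted-matrix calculation could be carried out, assuming the simple linear exposure model described above; the function, the variable names, and the example network are hypothetical rather than taken from the dissertation.

```python
import numpy as np

def unit_spillover_to_sustain(estimate, threshold, W, treated):
    """How large the spillover effect per unit of network exposure must be
    for positive treatment-to-control spillover to explain the gap between
    an (insignificant) estimate and the threshold.

    W[i, j] is the exposure of individual i to individual j (e.g., a weighted
    friendship tie); `treated` is a boolean array marking the treatment group.
    Only control <- treatment ties are assumed to transmit spillover here.
    """
    W = np.asarray(W, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    control = ~treated
    # Average exposure of control members to treatment members.
    exposure = W[np.ix_(control, treated)].sum(axis=1).mean()
    # threshold - estimate = exposure * unit_effect  =>  solve for unit_effect.
    return (threshold - estimate) / exposure

# Hypothetical example mirroring Table 1.3: two treated (A, B) and two
# control (C, D) individuals, with every pair tied with weight 1.
W = np.ones((4, 4)) - np.eye(4)
treated = [True, True, False, False]
u = unit_spillover_to_sustain(estimate=0.20, threshold=0.35, W=W, treated=treated)
print(f"Each unit of exposure must carry a spillover of {u:.3f} to sustain the inference.")
```

With a fully connected network this reduces to the $n_t \cdot LE_{tc}$ expression above; with real friendship data, weaker or sparser ties imply that a larger unit spillover effect is needed to alter the inference.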
Specifically, we start our replacement from the individual who has the most extreme outcome in the group, which means the highest in the treatment and the lowest in the control group 17 under this scenario. For replacement cases, we may consider the overall grand mean outcome across all participants within this site, favoring the null hypothesis that there is no treatment effect. Alternatively, we can replace extreme cases in each group with their own group mean, which means replacing the highest in the treatment with the treatment mean and replacing the lowest in the control with the control mean. Figure 1.2 shows both approaches, where the blue dotted line represents the Figure 1.2: Successive extreme replacement for within-site heterogeneous treatment effects. control distribution and red solid line represents the treatment distribution. If replacing the most extreme case is not enough to cross the threshold, we continue by replacing the second extreme case. We continue this process until the difference between the treatment and control group reduces to the threshold and we record how many cases need to be replaced to invalidate the inference. The more we need to replace, the more robust the inference is. 18 1.6.5 Heterogeneous treatment effects across sites. All previous discussions focus on randomized control trials within each site. But one important advantage of the MST is that it allows researchers to study how consistent the intervention effect is across different sites. To characterize this cross-site variation, we apply the case replacement approach for each single site and consider four groups of sites. Figure 1.3 presents how this may Figure 1.3: Cross-site variation of the intervention effect. work. The first group (Group 1) includes all sites that show a significantly positive intervention effect. Applying our case replacement approach, we can quantify the strength of evidence for each site by considering what percent of observed cases need to be replaced with no effect cases to invalidate the inference of a positive intervention effect. As such, we get a proportion   for each site , which allows us to generate a distribution of robustness represented by  . The more sites with larger   (towards 1) the stronger the evidence for a positive intervention effect across sites. The second group (Group 2) includes all sites that show a positive but not significant effect within site, under which the case replacement approach allows us to have   indicating how much more evidence we need to sustain a positive effect inference. Then the more sites with a smaller   (towards 0), the more evidence is given towards a positive intervention effect. The third group 19 (Group 3) includes all sites with significantly negative within-site treatment effect. More sites showing a large   (towards 1) indicate stronger evidence towards a negative effect. The final group (Group 4) includes all sites with negative but not significant effects. The more sites showing small   (towards 0) indicate a trend towards a negative intervention effect. More importantly, from how all sites are distributed across all four groups, we can see 1) where most sites are; 2) whether sites are distributed towards inferences of different directions of intervention effect; and 3) variation in the robustness of inference. 
For example, if there are many sites that are located in the center of both group 1, 2 and group 3, 4, as shown by both the green and red brackets, then we know that the variation of cross-site intervention effect is substantial as we have pretty robust evidence towards inferences of both positive and negative intervention effects. 1.7 Illustrative Example of the Study of Class Size Effect in Project STAR In this section, we will use Project STAR as an example to demonstrate how the discussion above can be applied to quantify the strength of evidence in multisite randomized trials to potential bias. For simplicity, we will focus on the math achievement of students in Grade K who were assigned to either small classes or regular classes. We excluded those students in regular classes with a full-time aide so that we have two experimental conditions to be consistent with the previous discussion. In general, we consider the small class to be the treatment group and the regular class to be the control group. In total, there are 3,794 students from 79 schools included in the following analysis. We standardized the math achievement scores for each school to make the interpretation easier. 1.7.1 Case replacement when SUTVA holds. We start with the general case replacement approach when the SUTVA assumption holds. Specif- ically, we look at two schools as examples, both of which have a significantly positive treatment effect but having very different levels of robustness of inference. Table 1.4 shows the summary statistics of math achievement by class type for each school. Because students and teachers are 20 Table 1.4: Summary Statistics of Math Achievements by Class Type in School A and B. Math achievement  School A Small class Regular class Small class Regular class  13 34 14 23 1.321 -0.426 0.498 -0.154  0.794 0.735 0.941 0.916  0.183 -1.761 -1.286 -2.153  3.171 1.037 2.002 1.314 School B Notes. ,  and  represent sample size, mean, and standard deviation, respectively.  and  represent minimum and maximum values, respectively. randomly assigned to each class type, we applied an independent samples t-test to estimate the small class effect1: the estimated effect for school A is 1.738 with a p value < 0.001 and the estimated effect for school B is 0.652 with a p value of 0.045. If we apply the commonly used threshold of 0.05, we may conclude that both schools show a significantly positive class size effect. But do they have the equal strength of evidence or the same level of robustness of inference? To consider this, we apply our case replacement framework, which tells us: around 71.63% of the cases in school A must be replaced with null effect cases to invalidate the inference while only 2.26% of the cases in school B must be replaced with null effect cases to invalidate the inference. This shows the inference of an effect in school A is much more robust than that in school B. 1.7.2 Case replacement for within-site heterogeneous treatment effect. As shown above, the inference of a positive small class effect in school B is not robust. Now we use the selective replacement approach to see whether this inference is very sensitive to outlier effects. Following Figure 1.4 shows a distribution of math achievement by class type in school B. We observe that there are a few students showing very low math achievement in the control group (regular class). Specifically, the lowest score is −2.152. 
Once we replace this student (the blue part) with a hypothetical average student with the grand mean of 0.093 (the pink part), the average score in regular class increases to −0.056 and the corresponding difference from the average in small class reduces from 0.652 to 0.554, which is lower than the threshold of 0.637. This means only by 1This is essentially the same approach as Konstantopoulos (2011) used. The difference is that we excluded the full class with aide and accordingly, we did not use Bonferroni correction. 21 Figure 1.4: Math achievement by class type in school B. replacing one student who has the lowest score in the regular class, the inference of a positive small class effect would be invalidated, indicating a very weak inference to potential outlier effects. 1.7.3 Case replacement for spillover effects. Now we consider how sensitive the inferences in school A and B are to potential spillover effects. Assume 50% of students assigned to small classes teach other 50% of students in small classes. If all the bias comes from these positive learning and teaching effects, each unit of the spillover effect needs to be larger than 0.38 (more than 22% of the estimated treatment effect) in school A to invalidate the inference while in school B the positive spillover effect among students assigned in small classes only needs to be larger than 0.004 (less than 1% of the estimated treatment effect) to invalidate the inference. Alternatively, assume 50% of the students assigned to regular classes distract other 50% of students. If all the bias comes from these negative distraction effects among 22 students who were assigned to regular classes for school A, the distraction effect needs to exceed 0.146 (more than 8% of the estimated treatment effect) to invalidate the inference while in school B, the distraction spillover effect only needs to exceed 0.0026 (less than 0.4% of the estimated treatment effect) to invalidate the inference. Again, this comparison between school A and B shows the large difference in terms of robustness of inference to potential bias. We can also consider positive spillover effects from small class students to regular class students, leading to underestimation of the small class size effect. To illustrate this, we look at one school where 17 students were in the small class with an average math achievement of 0.264, and 20 students were in the regular class with an average math achievement of -0.423. The difference is 0.687, which is just below the threshold for statistical significance ( = 0.5) of 0.707. In order to sustain an inference of a positive small class size effect, we can apply Table 1.3 as a tool to quantify how large the positive spillover effect from small class size to regular class must be. After calculation2 , each student in the regular class must experience an amount of 0.0012 spillover effect from each student in small class, which is only about 0.17% of the estimated treatment effect, to alter the inference regarding a positive small class size effect. This quantifies the robustness of no effect in the school. 1.7.4 Case replacement for cross-site heterogeneity. Figure 1.5 applies the case replacement approach to present cross-site variation in small class size effects on math achievement. First, many schools do not show any evidence of either positive or negative effects, indicated by the concentration of the distribution on the very right of Group 2 and Group 4. 
Second, many schools are in Group 1 or 2, indicating the estimated small class effects are positive. Additionally, 7 schools in Group 1 have a robust inference for positive small class size effects: more than 40% of observed cases need to be replaced by unobserved zero effect cases 2To sustain the inference, the average in regular class must be lower than 0.264−0.707 = −0.443. Then the difference between this and the observed average in regular class is −0.423 − (−0.443) = 0.02. Each student in regular must learn at least 0.02 from all students in small class, then each student must learn 0.02/17 ≈ 0.0012, which is only about 0.0012/0.687 ≈ 0.17% of the estimated treatment effect. 23 Figure 1.5: Cross-site variation in the robustness of inference of a small class size effect on math achievement (Grade K). to invalidate the inference. Meanwhile, a few schools in Group 2 are close to the threshold: there are 3 schools in which fewer than 10% of zero effect cases need to be replaced with threshold level cases to sustain a positive intervention effect inference. However, as shown by the two frequency distributions for Group 3 and 4 in the second row, there are a few schools that show negative estimated effect, among which 3 are significantly negative. One school in Group 3 has such strong evidence to support a negative effect of small class size that one would need to replace more than 70% of the cases with zero-effect case to invalidate the inference. Additionally, in Group 4, a few schools are not too far away from the threshold of negative class size effects. Therefore, we may conclude that schools do differ from each other in terms of the inference of an effect of small class size on students’ math achievement in kindergarten. The red dotted lines in each group in Figure 1.5 represents the overall robustness for each group when school sizes are considered to weigh each school. That is, within each group, we summed up 24 the number of students needed to be replaced from all schools and divided this by the total number of students. Figure 1.6 further aggregates all information into one figure: the distribution illustrates Figure 1.6: Summarizing small class size effect across schools in one Figure. the estimated treatment effect across all schools, and the dotted line indicates the threshold for a positive treatment effect, for which we used the standard error as Konstantopoulos (2011) calculated for the average small class effect across schools. Comparing Figure 1.6 with Figure 1.5, we argue that Figure 1.5 provides much more detailed information about how inconsistent the small class effect is across schools. Although one estimate that summarizes all schools is appealing, missing the heterogeneity across schools can provide misleading information for policymakers. Moreover, summarizing all schools with one estimated effect ignores the fact that each school will make its own inference about the effectiveness of small classes. This applies to scale-up because schools have local control over policy. Thus, schools must ask if the intervention worked in their context. To further show how our approach can provide richer information regarding cross-site hetero- geneity in MSTs, we carried out a similar analysis for Grade 1 students regarding whether small class size influence their math achievement. Figure 1.7 presents the result. Comparing Figure 1.5 25 Figure 1.7: Cross-site variation in small class size effect on math achievement (Grade 1). 
with Figure 1.7, we can tell the inferences regarding small class size effects on math achievement in Grade 1 and Grade K share both similarities and differences3. First, compared to Grade K, there are more schools in Grade 1 showing robust findings for positive small class size effects; in Group 1 of Grade 1, there are 11 schools for which more than 50% observed cases need to be replaced by unobserved zero effect cases to invalidate the inference. Meanwhile, both Grade 1 and Grade K have schools with strong evidence for negative effect as well. In Grade 1, two schools in Group 3 show strong evidence for negative small class size effects. In the strongest school more than 70% of the cases must be replaced with zero-effect case to invalidate the inference. Therefore, we may conclude that Grade 1 has even stronger evidence for positive small class size effects but both Grade K and Grade 1 show cross-site heterogeneity with evidence for both positive and negative intervention effects. 3The sample for Grade 1 includes 76 schools, compared to 79 schools for Grade K. 26 1.8 Discussion This chapter extends the case replacement approach (Frank et al., 2013) to quantify strength of evidence in multisite randomized control trials, accounting for spillover and heterogeneity. Throughout, Project STAR has been applied as an example to demonstrate how this non-parametric approach can better inform debate regarding relevant policy choice. Drawing on Rubin’s causal model (RCM) (Rubin, 1974), concerns about bias are expressed in terms of unobserved, counter- factual data. Specifically, the robustness of a causal inference is interpreted as the percentage of a sample that must be replaced with counterfactual no-effect cases to alter the inference. Most importantly, by carefully considering how to select cases from the observed sample to be replaced, what cases are regarded as replacement cases, and whether the non-replaced cases in the sample experience any changes during the replacement, this case replacement framework can be extended to attend to violations of the Stable Unit Treatment Value Assumption (SUTVA) and presence of heterogeneous treatment effects. One limitation of this chapter is that the case replacement approach might be better motivated when an overall inference across sites is based on the accumulation of inferences within each site. In that case the inference within each site must be compared with the site-specific threshold as I have done here. Additionally, the discussion of spillover effects can be more thorough and better tailored for the context of multisite randomized control trials, considering that the randomization can help eliminate some type of spillover effects but not others. For example, assuming randomization is implemented with fidelity, we should not expect any presence of within treatment group (or control group) spillover effects due to non-random assignment of distractors or contributors. However, spillover effects between different treatment groups cannot be excluded by randomization, especially in the context of educational interventions, where the subjects are always students who interact with each other in a social context. While these issues are present in any study, they are especially prominent when we compare inferences across sites, as in the multisite trial. 
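As a compact restatement of the index used throughout this chapter, the sketch below computes the percentage of cases that must be replaced with zero-effect cases to bring an estimate down to its threshold, in the form used by Frank et al. (2013); applied to the school B numbers from Section 1.7.1 (estimate 0.652, threshold about 0.637) it gives roughly the 2.3% reported there. This is a sketch, not the dissertation's own code.

```python
def percent_to_invalidate(estimate, threshold):
    """Fraction of the sample to replace with zero-effect cases so the
    estimate falls to the threshold (assumes 0 < threshold < estimate)."""
    return 1.0 - threshold / estimate

print(f"{percent_to_invalidate(0.652, 0.637):.1%}")  # ~2.3%
```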
27 CHAPTER 2 QUANTIFYING STRENGTH OF EVIDENCE FOR INFERENCES IN VALUE ADDED MEASURES: CASE REPLACEMENT, SPILLOVER, AND HETEROGENEITY 2.1 Introduction The introduction establishes the goal of understanding the robustness of inferences in terms of how the sampling mechanism could be altered to replace cases in the data. In the previous chapter, we derived the case replacement approach to quantify strength of evidence for inferences in multisite randomized control trials, accounting for spillover effects and heterogeneous treatment effects. In this chapter, we will continue to demonstrate how this case replacement framework can be applied to quantify uncertainty in value-added measures (VAMs) that have been used to evaluate teacher effectiveness. The VAM context is highly relevant for policy and personnel decisions because the evaluation of teacher effectiveness is related to high-stake decisions. By comparing students’ expected test scores to their actual ones, the “deflections” are inferred to be the “added value” from the teacher (Raudenbush & Bryk, 2002). Proponents of value-added models cite research that shows teachers’ considerable and long-lasting influences on student achievement (e.g., Chetty, Friedman, & Rockoff, 2011; Hill, Kapitula, & Umland, 2011; Rivkin, Hanushek, & Kain, 2005). They argue that there is important variation in teachers’ effectiveness that can be better identified by VAM (e.g., Aaronson, Barrow, & Sander, 2007; Hanushek & Rivkin, 2010). By selecting or deselecting teachers based on value-added we can improve teacher quality and increase student achievement and long-term outcomes (e.g., Chetty et al., 2011; Gordon, Kane, & Staiger, 2006; Winters & Cowen, 2013a, 2013b). For example, under the evaluation system of IMPACT in the District of Columbia Public Schools (DCPS), teachers were dismissed with rare exceptions once evaluated as “ineffective” and the dismissal threats are claimed to have helped improve teacher performance and student achievement (Adnot, Dee, Katz, & Wyckoff, 2017; Dee & Wyckoff, 2015). Various concerns have been raised about the validity and reliability of value added measures 28 as a basis to inform teacher evaluation (e.g., Guarino, Reckase, & Wooldridge, 2015; Harris, 2009; Raudenbush, 2015). These issues may include test unreliability, missing data and model misspecifications. In order to get a more reliable value added measure, researchers recommend model specification with two years of prior tests (Goldhaber & Hansen, 2010; Kane & Staiger, 2012; Rothstein, 2009). This can reduce the population of teachers who can be measured because it is not uncommon for teachers to change grades or students to change schools within three-year duration. Furthermore, unreliability in test scores can appear as measurement errors or differences among different achievement measures. Previous research has shown bias in value-added caused by measurement errors alone (Lockwood, Louis, & McCaffrey, 2002) as well as large variation in the estimated effects of applying different achievement measures (Lockwood et al., 2007). Therefore, those high-stake decisions for a specific teacher (e.g., hiring, retention, and professional development) require all stakeholders to be able to understand and conceptualize the uncertainty and quality of the measures to avoid unfairness and loss of investments and resources. 
2.2 Spillover and Heterogeneity in Value-added Measures (VAMs) Like MSTs, VAMs also demonstrate typical education scenarios where SUTVA is rarely satisfied in practice. In the VAM context for teacher evaluation, SUTVA suggests there are no peer effects that can affect a student’s achievement, which is rarely satisfied in real life. Previous researchers have controlled peer effects through model specification (e.g., Carrell, Fullerton, & West, 2009; Hoxby & Weingarth, 2005; Levin, 2002; Van Ewijk & Sleegers, 2010). For example, the most well-known peer effects are based on classmates’ achievement levels. Other peer effects can be generated by racial, ethnic or economic forces. Unfortunately, identifying all possible sources of peer effects is challenging as it would have to include non-cognitive attributes as well as interaction styles. As a result, we will never know whether we have controlled for all potential significant peer effect mechanisms in a model. For example, how can we control the peer effects from friendships if we do not have adequate information on friendship to define closer knit peer effects more than simply class membership? 29 In the VAM context, heterogeneous treatment effects are present if one teacher is good at teaching certain students but not others, which can cause debate in teacher evaluation. Additionally, the VAM context exemplifies the scenario where violations of one single treatment (as a fundamental component of SUTVA) may lead to heterogeneity of treatment effects, as discussed in Chapter 1. For example, teachers may modify their way of teaching to better suit the special requirements of each student, which may lead to heterogeneous teacher effects experienced by different students in one classroom. Even with the same teaching approach, different students can still benefit differently due to variations in their aptitudes, motivations, prior knowledge or any other individual or contextual factors. 2.3 Strength of Evidence in VAMs In VAM, the threshold for being an effective teacher is generally arbitrary since the estimation of standard errors in VAM is controversial; the standard errors can be quite sensitive to how one conceptualizes the level of analysis and what formula one chooses. Regardless of the specific definition/calculation, a threshold in VAM represents the point at which the evidence from VAMs would make one indifferent to the final teacher evaluation result, such as effective or ineffective. As argued in Frank et al. (2013), the comparison between the threshold and the VAM then represents the strength of evidence that supports the evaluation result that directly links to high-stake personnel decision-making. Thus, all stakeholders, including teachers, administrators and parents, should be able to understand this comparison so that the personnel decision-making can be made after comprehensive consideration and evaluation of the strength of evidence against potential costs. For example, consider a debate between an administrator and a teacher whose VAM is below the threshold of being effective. The administrator may use this below-threshold VAM to evaluate this teacher as ineffective, but the teacher may argue that her low VAM is primarily because she has been assigned to lower-end students. 
Essentially, this debate is about how far the teacher’s VAM is below the threshold, and whether this difference generates strong evidence for this teacher’s lack of effectiveness and potential dismissal, considering all potential sources of bias in estimating VAM. 30 As we will demonstrate in this chapter, the case replacement approach proposed in Frank et al. (2013) provides an intuitive alternative to quantify the comparison between the threshold and the estimated VAM to inform discussions among all stakeholders. Additionally, it does not rely on the controversial calculation of standard errors. Like Chapter 1, this chapter has two goals. First, we want to demonstrate how the case replacement approach can be applied in VAMs. Second, we aim to extend Frank et al. (2013) work by discussing how we can quantify the uncertainty to violations of SUTVA and presence of heterogeneity in VAMs. In the following section, we will first review the case replacement framework proposed by Frank et al. (2013). 2.4 Case Replacement as a Counterfactual Thought Experiment In Frank et al. (2013), the authors showed an approach to quantify how much bias there must be in an estimate to invalidate an inference. Then the bias is interpreted in terms of sample replacement to inform more intuitive interpretation. In other words, to show how robust an inference is, we ask a question based on a thought experiment which is counterfactual: what percentage of the sample should be replaced with counterfactual (unobserved) no-effect cases to invalidate an inference made from the data? Or if the concern is about external validity, we consider what percentage of the sample should be replaced with no-effect cases from an unsampled population. The larger the percentage is, the more robust the conclusion/inference is, the less likely that the finding is only due to chance or bias. This case replacement idea can be applied in various ways to characterize the strength of evidence in VAMs. But the general idea is always about replacing some students taught by one teacher with counterfactuals of other students, if they were taught by that teacher, as a thought experiment. The replacement process can become very flexible depending on: (1) how we select students from the teacher’s class to be replaced (2) what students are regarded as replacement students (3) whether the non-replaced, remaining students in the class experience any changes in achievement during the replacement. In the following analysis, we will discuss different ways to 31 consider the hypothetical replacement and how each approach helps inform the comparison between the threshold and estimated VAM under different contexts. 2.5 Case Replacement for Quantifying Uncertainty in VAMs 2.5.1 Sources of bias in VAMs. Various concerns have been raised about the validity and reliability of value added measures (VAMs) as a basis to inform high stake decisions (e.g., hiring, retention, and professional development) for a specific teacher (e.g., Guarino et al., 2015; Harris, 2009; Raudenbush, 2015). We will begin our review of these concerns with the conditional random assignment assumption. Almost all the potential inconsistencies of value added, such as those caused by test unreliability, missing data or model specification, can be represented in terms of the violations of conditional random assignment assumption. Our approach to quantify the uncertainty of value-added also draws on this framework of the student-teacher assignment mechanism. 
The conditional random assignment assumption is that students are randomly assigned to every teacher conditional on the other variables (Rothstein, 2009, 2010). However, research has shown that there is a nontrivial amount of sorting based on students’ prior test scores as well as a nontrivial amount of non-random assignment of teachers to classrooms. For example, recent research has shown that teachers who are nominated as help-providers to other teachers and with leadership positions are assigned better students (Kim et al., 2018). These nonrandom assignments may cause substantial bias in value added estimates if not captured by the controls in the model specification (Paufler & Amrein-Beardsley, 2014; Rothstein, 2010). In some cases, the estimates based on value added may even have the opposite sign of the true teacher effect (Dieterle, Guarino, Reckase, & Wooldridge, 2015). Hiring or dismissing a teacher based on this flipped ranking can be unfair for teachers and cause unwanted competition that can lead to test-driven teaching. 32 2.5.2 Case replacement when SUTVA holds. To better motivate the case replacement approach, we start our discussion with a hypothetical example. Consider three teachers’ VAMs as presented in following Figure 2.1. Both Ashley and Jessica are below the threshold of 0.15. If the threshold represents a serious lack of effectiveness, the administrator may decide to dismiss both of them. However, we can see that Ashley is much closer to the threshold than Jessica. This indicates that an evaluation of Ashley as ineffective is much less robust than that for Jessica. It might be some bias in VAM estimation causes teacher Ashley to be below the threshold. As a result, the personnel decision should be considered more seriously, or other measures should be referred to. Or the administrator may direct Ashley to professional development if the school resources can only support one teacher for this opportunity. Similarly, an administrator may want to provide professional development to teacher Emily who is above the threshold, but just barely so. Figure 2.1: Teacher effects estimated by VAM (hypothetical example). In the hypothetical example, how can we better quantify the difference between Ashley’s VAM and the threshold as strength of evidence for her evaluation? The case replacement approach leads us to ask how many students in her class need to be replaced with average students to get her to the threshold. Ashley has a VAM of 0.14, which could be the average of her students’ gain scores. The threshold is 0.15. Assume she has 20 students, whose gain scores have a distribution represented 33 as black and grey parts shown in 2.2. Hypothetically, to improve Ashley’s VAM from 0.14 to 0.15, we can replace two students (the grey parts) with two average students whose gain scores are 0.16 (the white parts with black outline). This counterfactual thought experiment tells us that via replacing the two worst students with grade-average students, Ashely could achieve the threshold of being effective. We can also say that 2 out of 20, that is about 10% of Ashley’s students need to be replaced with grade average students to alter the evaluation. Figure 2.2: Example replacement of students to invalidate Ashley’s evaluation based on VAM. We now formalize the intuition in 2.2. For the following discussion, assume that all grade nine math teachers in one middle school were evaluated based on their students’ achievement scores. 
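Before the formal model, a quick arithmetic check of the Figure 2.2 example above (a sketch with the hypothetical numbers only):

```python
n_students, vam, threshold = 20, 0.14, 0.15
needed_total_gain = round((threshold - vam) * n_students, 2)   # 0.2 in total
# swapping the two lowest-scoring students (gain 0.06) for grade-average
# students (gain 0.16) adds exactly that amount:
added = round(2 * (0.16 - 0.06), 2)
print(needed_total_gain, added)   # 0.2 0.2 -> just enough to reach the threshold
```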
Suppose we have a general value-added model as follows:

$y_{it} = \beta_0 + \beta_1 y_{i,t-1} + T_i \tau + X_i \gamma + \varepsilon_{it}$,¹

where $y_{it}$ is student $i$'s test score at time $t$ (post-test score); $\beta_0$ is the intercept; $\beta_1$ is the coefficient (scalar) for the pre-test score $y_{i,t-1}$; $y_{i,t-1}$ is student $i$'s test score at time $t-1$ (pre-test score); $T_i$ is a row vector of teacher indicators;² $\tau$ is a column vector of teacher fixed effects;³ $X_i$ is a row vector of covariates that control for student heterogeneity, such as student family background; $\gamma$ is a column vector of coefficients for the covariates $X_i$; and $\varepsilon_{it}$ is an unobserved error term. After estimating these parameters, we obtain a "purified" gain score $g_{ijt}$ for each student $i$ in teacher $j$'s class at time $t$ by removing the effects of the observed characteristics ($y_{i,t-1}$ and $X_i$) included in the value-added model. This is shown in the following equation:

$g_{ijt} = y_{it} - (\hat{\beta}_0 + \hat{\beta}_1 y_{i,t-1} + X_i \hat{\gamma})$.

This gain score $g_{ijt}$ can also be understood as a "deflection" score: the difference between a student's expected score (based on the covariates $y_{i,t-1}$ and $X_i$ that are outside teacher $j$'s control) and the actual score. We assume that this deflection is caused by teacher $j$, who teaches student $i$.⁴ To clarify, all the gain scores in the following discussions refer to this $g_{ijt}$. We can decompose $g_{ijt}$ using an ANOVA parameterization:

$g_{ijt} = \mu + \alpha_j + e_{ij} = VAM_j + e_{ij}$,

where $\mu$ is the grand mean gain score of all students in this grade. For simplicity, we assume that each teacher teaches one class and each class has the same number of students $n$; then $\mu = \overline{VAM}$, the average VAM of all teachers in this grade; $\alpha_j$ is how far teacher $j$'s value-added score ($VAM_j$) departs from $\overline{VAM}$; $VAM_j$⁵ is the VAM for teacher $j$; and $e_{ij}$ is how far the score of student $i$ in teacher $j$'s classroom departs from the classroom mean.

To simplify the discussion for now, assume that we are evaluating teachers for one grade within one school. One way to set the threshold is to use a certain percentile, such as the 5th percentile of all teachers' VAM distribution. For teacher $j$, we can only observe her effect on the students in her class.

1 There are various value-added models. This particular (simplified) model is only used as an example to illustrate that all the following discussions are based on the "purified" or adjusted gain scores. In other words, the gain score here is obtained after adjusting for the student characteristics included in the value-added model. Theoretically, the change in students' test scores can be decomposed into teacher effectiveness and student heterogeneity; by removing the latter part, we can estimate the teacher effectiveness.
2 If there are 10 teachers in this grade, then $T_i$ is a 1 × 10 row vector for each student (observation). Correspondingly, $\tau$ is a 10 × 1 column vector, with each element representing one teacher's fixed effect.
3 Here "fixed effect" means that we are NOT viewing all teachers as a population and then getting an estimate based on a random sample drawn from this population of teachers. Instead, we are interested in learning each individual teacher's effect on student achievement. Therefore, we just use dummy variables to indicate each teacher. There are also other estimation methods in the value-added literature. For example, when we apply Project STAR later to illustrate our approach, we will demonstrate two approaches: the EB residual approach and the Dynamic OLS approach.
4 Here we contextualize our discussion by using student-level data to evaluate teachers within a school as an example. Therefore, the school-level effect is not included.
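A minimal sketch of how the purified gain scores $g_{ijt}$ and the $VAM_j$ in the decomposition above could be computed (Python with simulated data; the variable names and the simple OLS-with-dummies fit are illustrative assumptions, not the dissertation's code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_teachers, n_per_class = 10, 25
teacher = np.repeat(np.arange(n_teachers), n_per_class)
pre = rng.normal(size=teacher.size)                    # y_{i,t-1}
x = rng.normal(size=teacher.size)                      # one student covariate in X_i
tau = rng.normal(scale=0.3, size=n_teachers)           # true teacher effects
post = 0.2 + 0.6 * pre + 0.1 * x + tau[teacher] + rng.normal(scale=0.5, size=teacher.size)

# OLS of post on intercept, pre, x, and teacher dummies (teacher 0 as reference,
# so the VAMs below are identified up to a common constant; only their
# normative comparison matters)
D = np.column_stack([np.ones_like(pre), pre, x,
                     (teacher[:, None] == np.arange(1, n_teachers)).astype(float)])
b0, b1, g1, *_ = np.linalg.lstsq(D, post, rcond=None)[0]

gain = post - (b0 + b1 * pre + g1 * x)                 # purified gain scores
vam = pd.Series(gain).groupby(teacher).mean()          # VAM_j = class mean gain
mu = gain.mean()                                       # grand mean gain
print(vam.round(2).to_dict(), round(mu, 2))
```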
For the other students taught by other teachers, we cannot know their scores if they were taught by teacher  because this is counterfactual. Because of this, teacher  may argue that her value-added   is below the threshold ( Ò) because of the students she is assigned. She may argue she has the average teacher effect and she will be able to achieve the threshold if she is assigned more grade average students (this could well be the argument of a beginning teacher – see Kim et al. (2018)). However, the evaluator, such as the principal, may argue that the teacher’s low   reflects teacher ’s lack of effectiveness. While the dispute is about the point estimate of the VAM, the debate about the teacher’s evaluation is informed by understanding the uncertainty of the VAM. To formalize the discourse above for the uncertainty of value-added, the case replacement approach leads us to consider how many grade average students need to replace to alter teacher ’s evaluation. Another argument for replacing with grade-average students is that if we randomly choose one student from the grade then a grade-average student will be the expectation for a student being selected. In order to carry out the replacement analysis, we need to know the grade average student’s gain score. As before, this gain score is achieved after adjusting for all those covariates included in the value-added model. Two possible ways are presented as follows to get an estimate for this grade-average student’s gain score . In the first approach, we can just use  as an estimate for . This approach is convenient and 5If the pre-test score (test score for time 1) is set before the teacher encounters the student, then we can think about this VAM as a function of the post-test score (test score for time 2). 36 the resulting  will be the same for all teachers in the replacement thought experiment. From the teacher’s argument illustrated before, she has the average teacher effect in this grade and this  might be a good estimate for an average teacher’s effect on a grade average student. The disadvantage is that this average gain score is under the observed teacher-student assignment condition and we are assuming that this grade average student will keep the grade average score if taught by this teacher. Another possible way to estimate this grade average student’s gain score  is still conditioning on covariates in the current model and the observed teacher-student assignment but is more conser- vative. Specifically, rather than look for an estimate for an average teacher’s effect on a grade average student, we try to estimate this particular teacher ’s effect on a grade average student. The “average” here refers to having the grade average pre-test scores and other controlled characteristics. This means looking for a student  in teacher ’s class so that the value of is minimized (where ¯−1 and ¯ are the grade average covariates). Then we use this student ’s gain score   as . The second approach here seems to provide a “closer guess” for a grade aver- age student’s gain score if taught by teacher . However, since we only use one specific student’s observed gain score for the replacement, the reliability will be a more serious issue than the first approach. (cid:17) − ( ¯−1 + ¯)(cid:12)(cid:12)(cid:12)  ,−1 +   (cid:12)(cid:12)(cid:12)(cid:16) Once we get , we consider randomly selecting students from teacher ’s class to be replaced with grade average students, then the formula is shown as follows.  
$VAM^{thr} = (1 - \pi) \cdot VAM_j + \pi \cdot g^* = (g^* - VAM_j) \cdot \pi + VAM_j$

From this we can get:

$\pi = \dfrac{VAM^{thr} - VAM_j}{g^* - VAM_j} = 1 - \dfrac{g^* - VAM^{thr}}{g^* - VAM_j}$,

where $\pi$ is the percentage of students that need to be replaced and $VAM^{thr}$ is the threshold of value-added above which the teacher will be evaluated as effective. In this chapter, we assume that the threshold ($VAM^{thr}$) is below the average value-added score ($\overline{VAM}$) and that all the $VAM_j$ we are interested in are below the threshold ($VAM^{thr}$). That is, we have $VAM_j < VAM^{thr} < \overline{VAM}$. Suppose $g^*$ is bigger than $VAM^{thr}$ and $VAM_j$, and that $g^*$ is the same for all teachers (the first approach discussed previously). Then with higher $VAM_j$, $\pi$ gets smaller. This makes sense intuitively because teachers who are closer to the threshold need to replace fewer students. However, we can see that the relationship between $VAM_j$ and $\pi$ is not linear. If we treat $\pi$ as a function of $VAM_j$ (and assume $g^*$ is a known constant for now), we can apply the delta method to get a standard error for $\pi$ as follows:

$\dfrac{\partial \pi}{\partial VAM_j} = (VAM^{thr} - g^*) \cdot (g^* - VAM_j)^{-2}$

$\left(\dfrac{\partial \pi}{\partial VAM_j}\right)^2 = (VAM^{thr} - g^*)^2 \cdot (g^* - VAM_j)^{-4}$

Then we get:

$Avar[\sqrt{n}(\hat{\pi} - \pi)] = (VAM^{thr} - g^*)^2 \cdot (g^* - VAM_j)^{-4} \cdot Avar[\sqrt{n}(\widehat{VAM}_j - VAM_j)]$

From this formula, we note that as the VAM moves further away from the grade-average gain score (that is, as $(g^* - VAM_j)$ gets larger), the asymptotic variance of $\hat{\pi}$ approaches 0 because of the term $(g^* - VAM_j)^{-4}$. This indicates that for a teacher with a relatively low VAM, we can get a quite precise estimate of $\pi$. In other words, we become more certain as the VAM gets further away from the threshold.⁶ In that case, a relatively large $\hat{\pi}$ can represent a quite robust evaluation of a teacher's ineffectiveness relative to a threshold.

In order to account for randomness in estimating $\hat{\pi}$, we consider a bootstrap approach. To generate one bootstrap sample, we resample students with replacement within each teacher. For each such sample, we can calculate $\widehat{VAM}_j$, $\hat{g}^*$ and $\hat{\pi}$. Repeating this step allows us to get a distribution of $\hat{\pi}$ from which a confidence interval can be obtained that accommodates both randomness and potential bias. One advantage of this approach is that it also accounts for the uncertainty of $VAM_j$ via the bootstrap, without going into debates about what formula we should use to calculate the standard error. If this confidence interval is close to 0, then the evaluation of this teacher as ineffective is sensitive to potential bias, sampling variability, or both.

Similar arguments can be applied to scenarios where the estimated VAM is above the threshold: $VAM^{thr} < VAM_j$. Now the goal is to quantify how the data must change to get this teacher below the threshold, because we wonder whether the teacher's above-threshold VAM might be due to upward bias; if so, the teacher may need some professional development. The case replacement approach here is similar to that in Chapter 1, where we wanted to quantify how the data must change to sustain an inference. Specifically, consider the estimated VAM as a combination of students with grade-average gain scores and students with threshold-level gain scores, with proportions $\pi$ and $(1 - \pi)$, respectively. This gives us:

$VAM_j = \pi \cdot g^* + (1 - \pi) \cdot VAM^{thr}$,

from which we can solve for $\pi$:

$\pi = \dfrac{VAM_j - VAM^{thr}}{g^* - VAM^{thr}}$.

In this scenario, $\pi$ represents the proportion of students with grade-average gain scores that must be replaced with students who have threshold-level gain scores to bring this teacher below the threshold. A larger $\pi$ indicates more data must be changed to change the teacher's evaluation to ineffective, which means the estimated VAM is higher and further away from the threshold.

6 Note the discussion here assumes the uncertainty of the VAM (i.e., $Avar[\sqrt{n}(\widehat{VAM}_j - VAM_j)]$) stays unchanged.

2.5.3 Selective case replacement for heterogeneous effects.

As we discussed for MSTs, we can apply the case replacement approach to quantify how robust a teacher evaluation inference is to outlier students in his or her class. Note that in the VAM context, the constant effect assumption does not necessarily relate to student grouping based on prior test results. Prior test scores may reflect students' ability, but the constant effect assumption is about teachers' effects on students, and students' ability may or may not relate to how much their improvement is affected by the teacher. The treatment effect in this context is more a question of whether this teacher's teaching works for a given student (a matching problem). Therefore, even in the most homogeneous case, where students are grouped based on their pre-test scores, we still need to think about violations of the constant effect assumption. Another important note is that heterogeneous teacher effects can be caused by violations of SUTVA, or more specifically of the single-treatment-level assumption in SUTVA. This happens when a teacher modifies their teaching for different students. But as discussed earlier, the presence of heterogeneity of teacher effects on student achievement does not rely on violations of this single-treatment-level assumption: whether and how one teacher's teaching benefits student learning can be affected by various factors.

This constant effect assumption is also highly relevant in the VAM context: if we want to rank all teachers, we should consider the heterogeneity of the students in their classes, and ideally the teacher whose teaching works for more students should be favored. Consider two teachers who have equivalent value-added. In teacher $j$'s class, there is only one student who gets an extremely low gain score, and it is this score that makes the teacher's value-added fall below the threshold. However, teacher $j'$ has several students who get quite low gain scores. In addition to the estimated value-added, we can also use this information about mismatching as another measure of teacher effectiveness. Even if we are only interested in the average level, we may still be concerned about the effects of outliers. To quantify how sensitive the VAM is to such heterogeneous or outlier effects, we propose three selective replacement approaches, assuming the teacher has a below-threshold VAM.

(1) Successive extreme replacement: this process is data-dependent and there is no closed-form formula. It is very similar to what we discussed in the context of MSTs: we start our replacement with the student who has the lowest gain score in teacher $j$'s class. If the teacher's value-added is still lower than the threshold, we then replace the student with the second-lowest gain score. We continue this process until teacher $j$'s value-added reaches the threshold, and we record how many students need to be replaced with $g^*$.

(2) Purposeful sampling process: the lower student $i$'s gain score is, the higher the probability that this student is selected to be replaced. This is shown in the following formula, where $\bar{g}_j$ is the class mean gain score:

$\Pr(\text{student } i \text{ selected}) = \dfrac{\bar{g}_j - g_{ij}}{\sum_{g_{ij} < \bar{g}_j} (\bar{g}_j - g_{ij})}, \quad \text{for } g_{ij} < \bar{g}_j.$

Then the formula for the replacement is as follows:

$VAM^{thr} = \pi \cdot \sum_{g_{ij} < \bar{g}_j} \left[ \dfrac{\bar{g}_j - g_{ij}}{\sum_{g_{ij} < \bar{g}_j} (\bar{g}_j - g_{ij})} \cdot (g^* - g_{ij}) \right] + VAM_j.$

From this we obtain:

$\pi = \dfrac{VAM^{thr} - VAM_j}{\sum_{g_{ij} < \bar{g}_j} \left[ \dfrac{\bar{g}_j - g_{ij}}{\sum_{g_{ij} < \bar{g}_j} (\bar{g}_j - g_{ij})} \cdot (g^* - g_{ij}) \right]}.$

For now we are only replacing students who are below the class average (all $g_{ij} < \bar{g}_j$). But we can also consider including those students who are above the class average but below the threshold ($VAM^{thr} > g_{ij} > \bar{g}_j$).
In this case, the formula will be the following one.  =  < Ò (cid:21)  Ò −   ( − ) · ( Ò−)  Ò− (3) Replace the teacher’s median student(s) with grade average students. This approach is proposed considering that median is less sensitive to extreme values than mean. Instead of replacing the mean student gain score in the teacher’s class, we select the median student to think about the replacement (). The formula is represented as follows.  =  Ò −    −  For any of these approaches, the magnitude of  gives us an intuitive understanding about how far teacher  is from the threshold if we assume the value-added is a valid and reliable evaluation. It also quantifies how much bias there needs to be to invalidate this evaluation. A small  indicates a lack of robustness or a small departure from the threshold. Similarly, we may get a confidence interval for  by applying the delta method or bootstrap. Additionally, the three selective replacement schemes can help provide a supplemental measure for teacher evaluation. For instance, when two teachers have the same value-added, we can use  from selective replacements as another measure for evaluation purpose. The teacher with a smaller  may be favored because her VAM is more likely to have been negatively affected by just a few outlier students. 2.5.4 Case replacement for peer effect (violation of SUTVA). In this section, we will conceptualize the potential bias due to peer effects in value added models in terms of random or purposeful resampling of students in a counterfactual scenario. To start, we 41 will discuss how peer effects generate bias in VAMs. In the VAM context, we prefer to use “peer effects” rather than “spillover effects” because spillover can occur through teachers (such as one naughty student distracts the teacher’s attention from the rest of the class). But we define peer effects as those direct effects among students. These peer effects are then just like other factors such as students’ own background that we should control in value-added models. To further define peer effects as a potential bias in value-added, we need first consider the baseline peer effects. The value-added model is essentially a normative comparison among teachers. Therefore, the baseline peer effects that are present in all teachers’ classes should not be credited to one teacher. That is, to generate “bias” in value-added measures, peer effects should satisfy several conditions: (1) inhere in direct interactions or indirect exposures among students; (2) perform in a way that is independent of particular teachers; (3) is unique for a particular classroom that is not covered by the normative baseline peer effects. Specifically, the second requirement indicates that the peer effects do not depend on which teacher teaches the class, while the third requirement illustrates the existence of a biasing peer effect needs to be different from baseline peer effects that are experienced by all teachers. Assume we are comparing all the teachers within a school. Then the baseline peer effect depicts what happens in all teachers’ classrooms in that school. But some peer effects are unique for the classroom. Consider teacher Ashley’s classroom in which peer effects are not due to Ashley’s instruction. These effects can bias our evaluation for Ashley. It is also important to note that the level of baseline peer effects can be quite different in various school environments. 
Students involved in project-based learning can experience strong peer effects through frequent group discussions while students in schools with conventional teaching styles may only experience a minimal level of peer effect by observation. Therefore, we should consider these variations when discussing peer effects as potential bias in different contexts. The non-random assignment of students to teachers can introduce bias to VAM via non-random assignment of contributors and distractors to teachers. For example, a teacher can be assigned with more students who are really distractors that may have negative effects on other students in the 42 class. This negative spillover effect can introduce bias to the VAM. To quantify how sensitive one teacher’s value-added is to potential bias due to such peer effect, we propose an approach that is very similar to what we discussed in Chapter 1 for spillover effects in the MST context. Consider a group of students  distracting the other students (1 − ) in the class and they experience self-initiated ( ) and peer-initiated distraction effects ( ), respectively. The derivation is the same as the MST context. Following we present the result for the VAM context when we worry about negative peer effects as bias that may threat teacher’s evaluation result. The notation is the same as before and   indicates the isolated/true VAM score for teacher .   =   −  · (1 − ) ·   − (1 − ) ·  ·   By setting   equivalent to the threshold of being an effective teacher, we can solve for . As before, we only need to specify the negative effect   =   +   rather than specifying the self-initiated and peer-initiated distracting effects separately. Figure 1.1 in Chapter 1 still applies to show the relationship among the  ,  and the difference between teacher’s VAM and the threshold (i.e.,  Ò −  ).  Ò −   = − ·  · ( − 0.5)2 + 0.25 ·  ·   As mentioned before, underlying these discussions for potential peer effects are counterfactual interpretations. The hypothetical example presented in following Figure 2.3 illustrates this coun- terfactual idea in the VAM context, based on our previous example about teacher Ashley. Recall Ashley has 20 students, where the first figure represents her 20 students’ distribution of gain scores before replacement. Then in the thought-experiment, we replace the students indicated by the green part with the other students represented by the red part. Importantly, these two groups of students have comparable gain scores, but they have different peer effects on the remaining students: the green students will distract others but the red students only have baseline peer effects on others. Therefore, after the replacement, the remaining students in the class (represented by the black part) will not experience the peer-initiated distraction effects anymore. As such their scores get higher, causing the change in the teacher’s VAM. 43 Figure 2.3: Case replacement for peer effects. Importantly, the crucial trick in this thought experiment for peer effects is the remaining students change their gain scores during the replacement. When there is no peer effect, the change of teacher ’s value-added is only from the difference between the new students’ and original students’ gain scores after the replacement. For example, in the hypothetical example in Figure 2.2, teacher Ashley needs (0.15 − 0.14) × 20 (Ò Ò Ò 20 ) = 0.2 total increase to achieve the threshold. 
This 0.2 comes from the difference between the two original students in the class (denoted in grey parts) and two replacement students (represented as the white parts), i.e., (0.16− 0.06) + (0.16− 0.06) = 0.2. Those original students who are not replaced do not change their scores in the replacement process. In contrast, in Figure 2.3, the gain in the VAM occurs because of the change experienced by the students remaining in the classroom. Finally, 1.2 in Chapter 1 can again be applied as a more flexible framework for quantifying more 44 specific peer effect mechanism if we know specific social interaction patterns within the teacher’s classroom. 2.6 Illustrative Example of Evaluating Grade 1 Math Teachers in Project STAR We will use Project STAR as an example to demonstrate how the discussion above can be applied to quantify uncertainty in VAMs. Different from the MST context, Project STAR is not designed for teacher evaluation purpose. The discussion here is mainly for demonstrating our case replacement approach. But as previous research illustrated (Nye, Konstantopoulos, & Hedges, 2004), the randomization of teacher assignments within schools and the broad range of schools from throughout a diverse state make Project STAR a great resource to study teacher effect variance on student achievement.7 Specifically, because both students and teachers were randomly assigned to different classes, any systematic between-classroom variance in achievements should be due to either the treatment effects (class types) or teacher effectiveness. As such, in this paper, we first follow the approach applied in Nye et al. (2004) to study teacher effectiveness. Specifically, schools are included in our demonstration sample if there were more than three classrooms in the same grade so that within each school at least two classrooms were assigned to the same class type.8 A three-level hierarchical linear models is applied, specified as follows. Level 1 (student )    = 0  + 1      + 2      + 3       + 4      +    7It is important to distinguish between: evaluate specific teachers versus evaluate teacher effect 8Nye et al. (2004) has shown that the constrained sample is very similar to the complete sample variance. on important characteristics. 45 Level 2 (teacher/classroom )9 Level 3 (school ) 0   = 00 + 01    + 01    + 0   1   = 10 , 2   = 20 , 3   = 30 , 4   = 40 00 = 000 + 00 10 = 100 , 20 = 200, 30 = 300, 40 = 400 At the student level, we control pretest, gender, whether the student is eligible for free and reduced lunch and whether the student is minority. At the classroom/teacher level, the treatment condition is controlled, and teacher random effect is applied. That is, 0   is the teacher effect for teacher  at school . At the school level, the random component 00 captures school ’s effect on student’s achievement score.10 The Empirical Bayes estimates for the teacher random component 0   are then regarded as the estimated VAM for teacher  in school .11 For simplicity, we focus on math achievements in Grade 1. As such, we get a sample with 3, 209 students taught by 268 teachers in 54 schools.12 After fitting the three-level model illustrated above, we used the Empirical Bayes estimates for the 268 teachers as their VAM scores. 9For purpose of evaluating individual teachers, it might be better to exclude students in the regular class with an aide since in those classrooms, we cannot distinguish between teacher’s effect from the aide’s effect. 
Here we include these students in our sample to better align with earlier research and also this is only for demonstration of the case replacement approach. 10We decided to follow Nye et al. (2004) to treat teacher and school effect as random effects here for two considerations: (1) teacher fixed effect (i.e., teacher dummies) can be collinear with the treatment effect of different class types; (2) we acknowledge that school fixed effect can better control school effects on students but within each school there were only a few classrooms. Moreover, for most schools within each treatment assignment there were at most two classrooms, which can make the collinearity even worse. We also compared the coefficients of students’ pretest, gender, SES and minority between school random effect and school fixed effect estimation. The results (both point estimate and standard error) are very similar to each other. 11Although we call teacher random effect and school random effect, this is different from random effect estimation in econometrics literature. In econometrics literature, random effect estimation in the VAM context mainly refers to student random effect when panel data is available. See Guarino et al. (2015) for more detailed information for different estimators for VAMs. What we use here is 46 Figure 2.4: VAMs for 268 teachers. Figure 2.4 shows a distribution of these VAM scores. Now consider the 5Ò percentile of −0.43 as a threshold for teacher to be effective, indicated by the red dotted line in Figure 2.4. We choose this threshold because researchers have illustrated that student achievement in US can get to the level of Canadian by eliminating bottom 5 − 8 percent of teachers (Hanushek, 2014). Additionally, in the controversial teacher evaluation system introduced in the District of Columbia Public Schools (IMPACT), teachers were dismissed with rare exceptions if they were evaluated as ineffective (bottom 3%) and researchers argued this dismissal threat increased performance of remaining teachers (Adnot et al., 2017; Dee & Wyckoff, 2015). Based on this threshold of the 5Ò percentile of −0.43, there are in total 14 teachers falling into this category of “being ineffective”. Figure 2.5 presents the estimated VAMs for these teachers, where each blue bar represents VAM for teacher  through teacher . Note the VAMs are negative, and a longer/lower bar indicates a worse estimate for teacher effectiveness. The red horizontal line what they call “Empirical Bayes and Related Estimators”. 12This sample is slightly different from that in Nye et al. (2004), but the results are very similar. This particular sample is selected based on several constraints: (1) school has at least four classrooms in grade 1 (2) students have both Grade K and Grade 1 math achievement score available, also not missing information about gender, race and whether the student was eligible for free and reduced lunch (3) we remove two teachers, for whom only one and three students’ data are available. 47 represents the threshold of being evaluated as effective. Then the more the blue bar below the red threshold, the more robust teacher evaluation is. Figure 2.5: VAMs for 14 ineffective teachers and their robustness in terms of percent of students need to be replaced (). Now we apply our student replacement approach to quantify how robust the teacher evaluation is for these teachers. We first assume the SUTVA holds. 
To carry out the case replacement thought experiment, we first calculated the value of the replacement case (i.e.,  in discussion above). In this context, the teachers are being evaluated against all the other teachers in this sample, and the argument made by an ineffective teacher is that they could achieve the threshold if they had been assigned with more average students in this sample. To approximate this counterfactual situation of being assigned with “an average student”, we calculated an average VAM weighted by each teacher’s number of students, to serve as an estimate for the teacher effect an average student can experience with an average teacher. As such we get an average of −0.0086.13 Now we can apply the 13This is not too far away from the unweighted average of 0. We argue for this weighted version because in the counterfactual situation the teacher had been assigned with an average student and the weighted version captured this student assignment by focusing on the student level. But the results should be very similar no matter which one we use, as long as assignment of students to 48 student replacement approach to quantify how robust each teacher’s evaluation is to any potential sources of bias (when SUTVA holds). The results are presented by the orange dots for each teacher. For example, for teacher , more than 57% students need to be replaced with average students to get this teacher above the threshold. In comparison, teacher  only needs to replace a little more than 1% students in his/her class to get to the threshold. That is, the evaluation for teacher  is much more robust for the evaluation for teacher . We ordered the 14 teachers in Figure 2.5 to have their VAMs getting closer and closer to the threshold from teacher  to . Correspondingly, the percentage of students need to be replaced in the thought experiment also gets smaller. For teacher  through , we only need to replace less than 10% of their students to get them to the threshold, indicating their evaluation is very sensitive to potential bias and accordingly, the related personnel decisions may need more considerations. We acknowledge that the calculation of  (percentage of students need to be replaced) involves sampling uncertainty. Now we want to take into account the sampling variability for our index of inference robustness. To achieve this goal, we apply bootstrap approach to generate a distribution of  for each teacher. The bootstrap is based on within-teacher resampling with replacement. Following Figure 2.6 presents the result for teacher  and  (based on 1, 000 iterations). First all 1, 000 iterations generated a below-threshold VAM for teacher . Additionally, as presented in Figure 2.6, the 95% confidence interval for this teacher’s  is (42.09%, 60.93%). Thus, we may conclude that the evaluation for teacher  is pretty robust, even after we consider the sampling variability. In comparison, although teacher  also has a pretty large  of 34.82%, the sampling variability is much larger compared to that for teacher . First there are 53 times out of 1, 000 iterations where ’s VAM is actually higher than threshold. For the remaining 947 times, the 95% confidence interval is (3.88%, 46.12%), which is much closer to 0 compared to that for teacher . This comparison shows teacher ’s students are more diverse compared to teacher , and thus teacher  is much more affected by sampling variability. 
Now we relax SUTVA and examine how sensitive the teacher evaluations are to potential peer effects that could act as a source of bias. Figure 2.7 presents the result of applying our approach to the 14 teachers below the threshold. As in Figure 2.5, the blue bars indicate the teachers' VAMs and the red horizontal line indicates the threshold. Now, however, the green dots represent the smallest negative distraction effect that would invalidate the teacher evaluation if we assume 50% of the students in the class are distracting the others, and the orange dots represent the corresponding distraction effect if we assume only 10% of the students are distracting the others.

Figure 2.7: VAMs for 14 ineffective teachers and their robustness to a potential peer effect as a source of bias.

First, we observe that when more students are assumed to be distracting others, a smaller unit distraction effect is needed to invalidate the inference (the orange dot is always above the green dot). As discussed before, assuming that 50% of students distract the others yields the smallest peer effect possible. Second, there is a difference between Figure 2.7 and Figure 2.5: in Figure 2.5, as a teacher's VAM gets closer to the threshold, the percentage of students that need to be replaced to invalidate the evaluation always gets smaller. This is not always the case in Figure 2.7. For example, one teacher may have a higher VAM than another and yet require a stronger peer effect as bias to invalidate her evaluation, because the calculation of the peer effect takes the number of students in the class into account. This makes sense: peer effects essentially describe interactions among students, which should depend on how many students there are. However, as mentioned before, if one prefers a more specific interaction structure, one can apply the more flexible approach presented in Table 1.2 in Chapter 1.

As mentioned before, Figure 2.7 can always be interpreted in two ways: either the plotted value is the smallest negative peer effect necessary to invalidate the evaluation, or it indicates what percentage of students would need to be replaced with students having comparable scores who influence the other students only through baseline peer effects. Take one of these teachers as an example. Assuming 50% of the students in the class were distracting the others, there must be a negative peer effect of at least 0.226 acting as bias for this teacher to reach the threshold. What does 0.226 mean? It assumes that each student in the distracting half of the class (the first group) is distracting each student in the other half (the second group), and 0.226 is the sum of two effects: (1) one unit of self-initiated distraction effect suffered by a student in the first group from distracting a student in the second group, and (2) one unit of peer-initiated distraction effect experienced by a student in the second group from being distracted by a student in the first group. Because we standardized the outcome variable (Grade 1 math achievement), the 0.226 is expressed in standard deviation units and represents a large effect. Alternatively, taking the negative peer effect of 0.226 as given, the 50% can be interpreted as follows: 50% of the students would need to be replaced with students who have comparable math achievement but exert only baseline peer effects on their classmates.
Therefore, each of the remaining 50% of students would experience an increase of 0.226 standard deviations in math achievement once the distracting students were replaced. By applying the same bootstrap approach as before, we can also obtain a 95% confidence interval for the necessary negative peer effect, which is (0.147, 0.285) for this teacher. Additionally, consistent with the earlier discussion, the sampling variability of the necessary negative peer effect is also large for the other teacher examined above: among the 947 out of 1,000 iterations in which that teacher falls below the threshold, the necessary negative peer effect has a 95% confidence interval of (0.010, 0.193).

The demonstration above uses the Empirical Bayes (EB) estimates of the VAMs. As is well known (e.g., Guarino et al., 2015), there are several different estimation methods for VAMs. In the following discussion, we apply a different estimation method, the Dynamic Ordinary Least Squares (DOLS) approach, for two reasons. First, the DOLS approach aligns better with our derivation above and allows us to demonstrate selective sampling for heterogeneity in an intuitive way. As Guarino et al. (2015) point out, the EB estimates amount to shrinking teacher average residuals toward the overall mean, where the residuals are obtained after regressing the posttest on the pretest and other covariates, excluding teacher assignments. Because this shrinkage occurs at the teacher level and its magnitude differs across teachers, the VAM can no longer be viewed as a simple average of the students' adjusted gain scores unless we are willing to conceptualize the shrinkage at the student level; going into the shrinkage procedure at that depth would complicate the thought experiment and make the sensitivity analysis less accessible to its intended audience. Second, presenting two estimation approaches lets us show how the student replacement approach works for different teachers and different estimation methods. We will show that the student replacement approach provides a general framework that can easily be used to compare evaluation results generated by different estimation methods.

Specifically, we use the DOLS approach for a model specified as

$$A_{1i} = \beta_0 + \beta_1 A_{0i} + \beta_2\,\text{gender}_i + \beta_3\,\text{SES}_i + \beta_4\,\text{minority}_i + T_i\gamma + \varepsilon_i,$$

where $T_i$ is a row vector of teacher indicators for student $i$ and $\gamma$ is a column vector of teacher fixed effects (i.e., the VAMs). The other notation is the same as before: $A_{1i}$ is student $i$'s math achievement in Grade 1 and $A_{0i}$ is his or her math achievement in Grade K. We restrict our sample to teachers who taught small classes so that we need not worry about collinearity between the teacher effects and the class-type effect. This leaves a smaller sample of 1,128 students taught by 101 teachers. Figure 2.8 shows the distribution of these 101 teachers' VAMs (i.e., the estimated $\gamma$ in the equation above), where the red dotted line marks the 5th percentile (−1.60) as the threshold for being an effective teacher; 5 teachers' VAMs fall below this threshold.

Figure 2.8: VAMs (estimated by DOLS) for 101 teachers who taught small classes.

Now we apply the student replacement approach to characterize the robustness of the VAM-based evaluations for the 5 teachers who are below the threshold. Table 2.1 presents the results, together with the robustness results for the 14 teachers who are below the threshold under the EB approach. Specifically, the second column (# of students in total) reports the total number of students for the teacher.
The third column (EB_) reports the percentage of students that must be replaced to change the teacher evaluation in the EB approach. There are 5 ineffective teachers in the EB approach who did not teach small classes, thus the robustness of their evaluation in the DOLS approach is not applied ( ). The fourth column (DOLS_) reports the percentage of students that must be replaced to change the teacher evaluation in the DOLS approach. In both the third and fourth column, if the percentage is within a parenthesis, then the teacher is above the threshold and the percentage indicates how much the data must change to get the teacher below the threshold. The last two columns (i.e., DOLS_het_1 and DOLS_het_2) present the results of two selective replacement approaches to examine the robustness of teacher evaluation to potential heterogeneous effects. Table 2.1 allows us to see how our robustness index works for different teachers and different estimation methods. Note that the two approaches (i.e., EB and DOLS) are very different. The EB approach compares teachers from all treatment conditions (i.e., small class, regular class and regular class with aide). But the DOLS approach is only applied to teachers who taught small 54 DOLS_het_1 (# of students must be replaced) 33.33% (4) 10.00% (1) DOLS_het_2 26.93% 0.96% 18.18% (2) 14.29% 10.00% (1) 1.99% A B C D E F G H I J K L M N P 12 10 15 14 11 14 10 16 10 14 15 13 12 15 8 (0%) 35.52% 1.56% NA NA 57.74% 34.82% 31.14% 27.29% 24.52% 29.27% 23.27% (24.27%) 20.75% 18.74% (27.18%) 4.91% 9.73% 8.14% NA 7.95% NA 5.69% (32.47%) 3.02% (10.94%) 1.07% (10.87%) NA 22.67% Table 2.1: Case replacement approach for VAMs estimated by EB and DOLS. Teacher # of students in total EB_ DOLS_ Note: The DOLS approach only includes teachers for teaching small classes. NA indicates the teacher did not teach small class. 25.00% (2) 16.70% classes. All teachers in the DOLS approach are included in the EB approach, but not vice versa. As such, the two approaches generate different coefficients for the same predictors (i.e., students’ gender, race and free and reduced lunch information), different VAMs, and different thresholds. But our student replacement approach allows us to compare the robustness of teacher evaluation results across these two approaches. Comparing the third and fourth columns (i.e., EB_ and DOLS_), first we observe that four out of five ineffective teachers in the DOLS approach are also below the threshold in the EB approach: In both approaches, we have strong evidence for teacher  to be ineffective. The only teacher that is below the threshold in EB but above the threshold in DOLS is teacher . But our thought experiment tells us that in the EB approach, the evidence for teacher  being effective is not very strong since only 10.87% of average students must be replaced with threshold-level students to bring this teacher below the threshold. Similarly, our approach tells us that two estimation approaches give similar results in general for small class teachers who are ineffective in the EB approach but effective in the DOLS approach. teacher , ,  and . 55 For example, teacher  is below the threshold in the EB approach and is just at the threshold in the DOLS approach (his/her VAM is exactly at the threshold of −1.60). 
Another example is: among the nine small class teachers below the EB approach threshold, the three teachers with the strongest evidence of being ineffective are also below the threshold in the DOLS approach (i.e., teacher ,  and ); it is the teachers with the weakest evidence in the EB approach that turn to be effective in the DOLS approach (i.e., teacher  and ). This indicates the VAMs generated by different approaches can be highly correlated. However, when it comes to the evaluation for one individual teacher, the inference can be reversed when the teacher is close to the threshold. As we see here, in total there are five teachers whose evaluation changes in different approaches and none of them shows very strong evidence of being ineffective in either approach, which illustrates the importance of quantifying the robustness of the inferences in evaluating individual teachers based on VAMs. As mentioned before, the last two columns in Table 5 present the results of selective replacement approaches for the five teachers who are below the threshold in the DOLS approach. Specifically, the left column (DOLS_het_1) applies the successive extreme replacement approach and the right column (DOLS_het_2) applies the purposeful replacement approach. For example, for teacher A, the four lowest score students (33.33%) in the class must be replaced with average students to get this teacher to the threshold. Alternatively, 26.93% of below class average students must be replaced. The motivation for the selective replacement is to characterize how sensitive the teacher evaluation is to potential heterogeneous/outlier effects. That is, we want to use this information to capture not only how far the VAM (which is also the class average gain score) is from the threshold, but also to what extent the difference between VAM and threshold is due to students with very low gain scores in the class (i.e., tail part of the distribution). To illustrate how the selective replacement approaches achieve this goal, we present the results together with gain score distributions for the five teachers, as shown in Figure 2.9, where the red solid line represents the threshold and the black dashed line represents each teacher’s VAM, which is also the class average gain score. First, we compare teacher  and . As shown in Figure 2.9, both  and  have students with very low gain scores. But teacher ’s VAM is much closer to the threshold than . Reflected in 56 Figure 2.9: Gain score distributions for 5 teachers below the threshold in the DOLS approach. 57 Table 2.1, the selective replacement approach tells us that teacher  only needs to replace very few low-end students to get to threshold while teacher  needs to replace more low-end students. Second, we compare teacher  and .  is a little further away from the threshold compared to : 29.27% of ’s students must be replaced with average students to change his/her evaluation and 22.67% of ’s students must be replaced with average students to change his/her evaluation. But  has a few students with extremely low gain scores. As such, the selective replacement tells us that ’s evaluation is more sensitive to students in the lower tail than teacher . 2.7 Discussion This Chapter discusses how the case replacement approach can be applied in the VAM context, accounting for spillover and heterogeneity. 
The general approach is very much like that in the MST context: to quantify the strength of evidence for an inference by considering how many observed cases need to be replaced with unobserved cases to change the inference (Frank et al., 2013). As we have shown, this is a very powerful non-parametric framework that can be easily generalized for violations of SUTVA and presence of heterogeneity. The extensions to account for spillover and heterogeneity are also essentially the same in both MST and VAM contexts. Meanwhile, it is important to note two important differences in MST and VAM. First, sampling variability is accommodated by statistical significance attached to the threshold in the MST context. This is also the most general situation in empirical research. However, in the context of VAM, the threshold is arbitrary and thus we consider sampling variability with the number of cases that need to be replaced to invalidate the inference. In other words, instead of adding sampling variability to the threshold, we accommodate the sampling variability in the index of inference robustness. Second, the heterogeneity in VAM only exists within class (one level) but the MST has two levels of heterogeneity. Importantly, we argue that the second level of heterogeneity really needs more attention from researchers since the cross-site variation in treatment effects always has significant policy implications. The study of cross-site variation also leads to more research about how and why an intervention works in certain contexts but not others. We argue that our approach has a 58 strong potential in helping researchers, as well as policymakers in understanding and interpreting these important variations. 59 CHAPTER 3 UNOBSERVED MEDIATOR IN A SINGLE-MEDIATOR MODEL 3.1 Introduction Chapters 1 and 2 focus on how researchers can present strength of evidence in educational research with an intuitive framework so that all stakeholders can evaluate the effectiveness of one intervention against potential costs to better inform policy choices. Now Chapters 3 will turn to the second goal stated in the introduction: exploring why an intervention works through mediation mechanisms so that policy choices may be better informed. Under a simple mediation model, a binary variable  is randomly assigned (e.g., treatment vs. control groups) and causes change in the outcome variable . A mediator variable  explains one mechanism or process through which  affects . In other words, the mediator is intermediate in the causal sequence that explains why the intervention () causes the outcome () (e.g., Baron & Kenny, 1986; MacKinnon, 2008). This mediation process with a single mediator is presented in Figure 3.1(a), where the product of the effects from  to  and from  to  is defined as the indirect effect via the mediator  (e.g., Hayes, 2013; MacKinnon, 2008). The basic single mediator model can also be extended to more than one simultaneous mediator. Figure 3.1(b) presents a two-mediator parallel mediation model, where there are two simultaneous mediation processes, one through each mediator, connecting  to . Preacher and Hayes (2008) discussed approaches to test hypotheses for individual mediators and contrast the magnitude of indirect effects for different mediators in a multiple mediation model like this, where more than one simultaneous mediators are involved to explain why the intervention  causes the outcome . 
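Before turning to omitted mediators, a minimal sketch of how a specific indirect effect of this kind is commonly estimated — the product of the two path coefficients, with a percentile bootstrap confidence interval — is given below. The column names (treat, med, out) and the simulated data are hypothetical placeholders, not part of the studies discussed here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def indirect_effect(data):
    """Product-of-coefficients specific indirect effect in a single-mediator model:
    a = path from treatment to mediator, b = path from mediator to outcome given treatment."""
    a = smf.ols("med ~ treat", data=data).fit().params["treat"]
    b = smf.ols("out ~ treat + med", data=data).fit().params["med"]
    return a * b

def percentile_bootstrap_ci(data, n_boot=1000, seed=0):
    """Percentile bootstrap confidence interval for the indirect effect."""
    rng = np.random.default_rng(seed)
    draws = [
        indirect_effect(data.sample(frac=1.0, replace=True, random_state=int(s)))
        for s in rng.integers(0, 2**31 - 1, size=n_boot)
    ]
    return np.percentile(draws, [2.5, 97.5])

# Simulated example: binary treatment, one mediator, one outcome (arbitrary coefficients).
rng = np.random.default_rng(1)
n = 200
treat = rng.integers(0, 2, n).astype(float)
med = 0.4 * treat + rng.normal(size=n)
out = 0.2 * treat + 0.5 * med + rng.normal(size=n)
df = pd.DataFrame({"treat": treat, "med": med, "out": out})

print("indirect effect (a * b):", indirect_effect(df))
print("95% percentile bootstrap CI:", percentile_bootstrap_ci(df))
```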
Researchers have emphasized the importance of testing multiple mediators in a single model rather than in separate models to prevent potential parameter bias due to omitted variables (e.g., Hayes, 2013; Judd & Kenny, 1981; Preacher & Hayes, 2008). In many cases, however, a single mediation model may be tested because a researcher has not measured another relevant potential mediator or 60 Figure 3.1: Simple mediation and dual mediator designs. is unable to measure a theoretically relevant mediator. In such instances, the second potentially relevant mediator is unobserved (denoted by ). For example, suppose a researcher studying the impact of tracking/grouping students () on students’ learning outcomes () is interested in potential mediators that might explain associations between tracking and learning. If the researcher hypothesizes that stigma and teachers’ mindset might both be mediators of interest but teachers’ mindset was not measured under the research design, then in such an instance stigma is observed () and teachers’ mindset is unobserved (). When  is associated with the observed mediator (denoted by ) (Figure 3.1(c)), it may threaten inference of the mediation effect via . In this case,  is also a posttreatment confounder for the mediation path from  to  to  as  explains the  to  relation and is caused by  (e.g., Fritz, Kenny, & MacKinnon, 2016). Many recent studies have investigated 61 influence of potential confounders on causal mediation inferences by providing approaches to sensitivity analyses (e.g., Hong, Qin, & Yang, 2018; Imai, Keele, & Tingley, 2010; Imai, Keele, & Yamamoto, 2010; Imai & Yamamoto, 2013; VanderWeele, 2015). Specifically, sensitivity analyses may describe the responsiveness of an estimated effect or the robustness of a causal inference to a potential confounder. However, based on our knowledge, existing sensitivity analysis strategies are either 1) intended primarily for unobserved confounders, which are not potential mediators (e.g., Imai, Keele, & Yamamoto, 2010; Imai & Yamamoto, 2013; VanderWeele, 2010), or 2) use an alternative coun- terfactual framework targeting the “natural indirect effect” (NIE), “natural direct effect” (NDE) or “controlled direct effect” (CDE) (e.g., Hong et al., 2018; VanderWeele, 2015; VanderWeele & Chiba, 2014). Under models with multiple mediators, NIE, NDE and CDE can be very different from the specific indirect and direct effects defined from a path-specific perspective. For example, the NIE is defined as the difference in outcome  if the mediator of interest changed to what it would have been if the exposure  changed to control (assuming  is binary for simplicity), while the exposure  stays at the treatment condition. In our dual mediator model of Figure 3.1(c), then the NIE transmitted through  includes not only the pathway  →  →  but also  →  →  → . Additionally, the NDE is defined as the difference in outcome  if only  changes from control to treatment but the mediator of interest  stays at the level that would have taken under the control condition of exposure . Then in Figure 3.1(c), the NDE includes both  →  and  →  → . Thus, the sensitivity approach under the counterfactual framework cannot be applied directly if we take the path-specific perspective commonly used in psychology (e.g., Hayes, 2013; MacKinnon, 2008). 
Extending previous research, the goal of this study is to examine whether and how omitting an alternative mediator that is also a confounder can bias causal mediation effect estimates from the path-specific perspective. Furthermore, we propose a sensitivity analysis to evaluate robustness of a causal mediation inference to an unobserved mediator. 62 3.2 Dual-Mediator Designs Hayes (2013) presented a complex serial mediation model with two mediators (Figure 3.1(c), with the dotted lines) where one mediator () has a serial or predictive effect on the second mediator (), en route to Y. The parallel or serial mediation models (Figure 3.1(e) or 3.1(f)) are special cases of the complex mediation model presented in Figure 3.1(c). Specifically, when the path coefficient for  →  is zero, the model in Figure 3.1(c) becomes the parallel mediation model in Figure 3.1(e). When the path coefficients for both  →  and  →  are zero, the model in Figure 3.1(c) becomes the serial mediation model in Figure 3.1(f). We focus on the complex serial mediation model (Figure 3.1(c), with the dotted line), as a general model, to evaluate whether and how an unobserved mediator may bias estimation of the direct and indirect effects via the observed mediator. Specifically, we study how excluding  affects estimation of the specific mediation effect via , defined as the product of the paths from  to  and  to . That is, when  is no longer observed in our analysis, all the pathways that relate to this omitted mediator  are excluded from the analysis, as shown by dotted lines in Figure 3.1(c). To illustrate potential effects by omitting such a mediator, we will apply a real data example also used by Hayes (2013) about how media use affects behaviors. 3.3 Illustrative Data Example about Consequences of Omitting an Alternative Related Me- diator Illustrative data were originally drawn from an experimental study conducted by Tal-Or, Cohen, Tsfati, and Gunther (2010). The research was to test a hypothesis about the influence of media use: whether media () affects people’s attitudes or behaviors () through changing people’s perceptions regarding how other people may be influenced by the media (). For example, when a person Ashley reads a media report, she may tend to predict that others in her community will respond to this report in certain ways. This prediction can further affect Ashley’s attitudes or behaviors. To test this hypothesis, participants were randomly assigned into two groups (). Both groups were asked to read a newspaper article about an economic crisis that can affect the price and supply 63 of sugar in Israel. However, one group were told that this article came from the front page of a major newspaper whereas the other group were told that the article appeared in the middle of an economic supplement of the newspaper. A condition variable (, denoted by COND) was used to indicate whether the participant was manipulated to consider this article as a front-page article or interior-page article. After a participant finished reading the article, he/she was asked how much he/she believed that others in the community would be encouraged to buy sugar after they read this article. This presumed media influence served as one mediator (, denoted by PMI) in the model. The participants were also asked about how important they thought this article was. This perceived issue importance was a second mediator (, denoted by IMPORT) in the model. 
Finally, the participants in both groups were asked about how soon they intended to buy sugar and how much they would buy. These responses were then aggregated to generate a variable that measured their intention to buy sugar. This is then the outcome variable (, denoted by REACTION). The first fitted model includes two mediators PMI and IMPORT as presented in the left panel of Figure 3.2. For the mediation path via PMI, it is hypothesized that others are more likely to Figure 3.2: Illustrative data example of presumed media influence. be affected by an article appearing on the front page than on the interior page. Therefore, before others act to buy sugar forcing the price up, one would act as soon as possible to purchase sugar when it is available at an acceptable price. For the mediation path via IMPORT, people can infer 64 the importance of the article based on where it is published and consequently, people act upon the importance of the issue. Moreover, IMPORT is also presumed to predict media influence, under the hypothesis that the more important people believe the article is, the more likely they believe others would be influenced by that article. We fit the model in the left panel of Figure 3.2 to data using standardized variables1. The estimated path coefficients were therefore standardized coefficients. Both the effects from COND to IMPORT and from IMPORT to REACTION are significantly positive. Using the joint significance approach (Fritz & MacKinnon, 2007), the estimated indirect effect through IMPORT was 0.181 · 0.363 = 0.066. The 95% percentile bootstrap confidence interval of the indirect effect is 0.001 to 0.150 (based on 5,000 bootstrap samples). In contrast, the other indirect effect via presumed media influence was not significant, with a 95% percentile bootstrap confidence interval of -0.013 to 0.114 (based on 5,000 bootstrap samples). That is to say, the specific mediated effect of the article’s location did not have statistically significant influence on participants’ reactions through presumed media influence. Additionally, the predictive relationship from IMPORT to PMI was significantly positive (0.258,  = 0.003). This confirms the hypothesis that as participants think the issue is more important, they have a stronger belief that others are going to be affected by that article. Now we ask, “What would happen if the mediator IMPORT was excluded from the fitted model?” This may be the case if an alternative theoretically important mediator (such as IMPORT) is not measured in the research design, as can occur in almost any example of a mediation analysis. In this case, PMI is the only observed mediating variable included in the model. The results are presented in the right panel of Figure 3.2. After excluding IMPORT, the specific indirect effect via PMI changed from being non-significant to significantly positive: both paths in the indirect effect are statistically significantly positive and the indirect effect has a 95% confidence interval of 0.078 to 0.042 (based on 5,000 bootstrap samples). Specifically, the direct path from article location to presumed media influence increased from 0.134 to 0.181 and the direct path from presumed 1We standardized all variables for analysis to be consistent with our later derivation. Therefore, the coefficients are different from those in Hayes (2013). 65 media influence to participants’ reaction increased from 0.338 to 0.432. 
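A sketch of this model comparison in code may help fix ideas: fit the two-mediator model, compute the specific indirect effect via PMI as the product of its two paths, then drop IMPORT and recompute. The synthetic data frame below is a hypothetical stand-in for the Tal-Or et al. (2010) data; its generating coefficients are arbitrary and are not the published estimates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the standardized COND, IMPORT, PMI, REACTION variables.
rng = np.random.default_rng(0)
n = 200
cond = rng.integers(0, 2, n).astype(float)
importance = 0.2 * cond + rng.normal(size=n)
pmi = 0.15 * cond + 0.25 * importance + rng.normal(size=n)
reaction = 0.05 * cond + 0.35 * pmi + 0.35 * importance + rng.normal(size=n)
pmi_data = pd.DataFrame({"COND": cond, "IMPORT": importance,
                         "PMI": pmi, "REACTION": reaction})
pmi_data = (pmi_data - pmi_data.mean()) / pmi_data.std()  # standardize, as in the text

# Two-mediator model: IMPORT and PMI both included, IMPORT allowed to precede PMI.
m_pmi   = smf.ols("PMI ~ COND + IMPORT", data=pmi_data).fit()
m_react = smf.ols("REACTION ~ COND + PMI + IMPORT", data=pmi_data).fit()
ie_full = m_pmi.params["COND"] * m_react.params["PMI"]   # specific indirect via PMI

# Single-mediator model that omits IMPORT entirely.
m_pmi_o   = smf.ols("PMI ~ COND", data=pmi_data).fit()
m_react_o = smf.ols("REACTION ~ COND + PMI", data=pmi_data).fit()
ie_omit = m_pmi_o.params["COND"] * m_react_o.params["PMI"]

print("indirect via PMI, IMPORT included:", round(ie_full, 3))
print("indirect via PMI, IMPORT omitted :", round(ie_omit, 3))
print("direct COND -> REACTION, included:", round(m_react.params["COND"], 3))
print("direct COND -> REACTION, omitted :", round(m_react_o.params["COND"], 3))
```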
Therefore, excluding IMPORT, both paths for this indirect effect via presumed media influence were greater. Omitting this alternative mediator has given us a different conclusion that the indirect effect via presumed media influence is significantly positive. We also observe that in both models the direct path from article location to participants’ reaction was not statistically significant from 0 but the point estimate increases from 0.033 to 0.082 once we exclude the IMPORT mediator from our model. It is important to note that when we compare the two fitted models above, we are mainly focusing on two estimates: 1) the specific indirect effect via PMI, defined as the product of the paths from COND to PMI and from PMI to REACTION; 2) the direct path from COND to REACTION. The specific indirect effect defined as the product of two path coefficients (e.g., Hayes, 2013) is not equal to either the natural indirect effect or the randomized interventional analogue of the natural direct effect (e.g., VanderWeele, 2015). The direct effect we are interested here is also different from both natural and control direct effect defined in the counterfactual framework (e.g., VanderWeele & Chiba, 2014). 3.4 Unobserved Causally Related Mediator as Posttreatment Confounder The example above illustrates that omitting an alternative mediator can alter estimates of the mediating path coefficients for PMI and change the inference for this mediating effect. To understand why omitting IMPORT can generate such effects, we first present a general representation of a confounder in Figure 3.3(a): a confounder  of  and  can bias the estimate and inference of the effect from  to  by serving as an alternative cause of both  and . Following the illustrative data example, in such a case IMPORT is associated with both the mediator PMI and the outcome REACTION. Thus, if IMPORT is omitted, it serves as a potential confounder for PMI and REACTION. Previous studies on sensitivity analysis in the mediation literature have focused on the case where a potential confounder (e.g., IMPORT) between a mediator (e.g., PMI) and an outcome (e.g., REACTION) is assumed to be independent from the input variable (e.g., COND) (Figure 3.3(b)). 66 Figure 3.3: Confounder in mediation. For example, Imai, Keele, and Yamamoto (2010) presented a sensitivity analysis based on residual correlations to deal with unmeasured pre-treatment covariates “that confound the relationship between the mediator and the outcome”. The “pre-treatment” covariates precede the input variable  and therefore are not influenced by . Fritz et al. (2016) combined the effects of measurement errors and omitting confounders on bias of the mediation effect, but they considered the confounder  as independent of  throughout their analytical study (Figure 3.3(b)). Meanwhile, empirical studies across the social sciences have shown considerable evidence for the existence of multiple causal mechanisms that may involve simultaneous or causally related mediators. For example, Bekman, Cummins, and Brown (2010) examined a parallel multiple- mediator model where depression affected adolescent alcohol use through simultaneous mediators perceptions and expectancies. Imai and Yamamoto (2013) discussed several empirical studies where both content and importance mechanisms mediated the effect of political issue framing on citizens’ political opinion and behaviors. 
Singh, Chen, and Wegener (2014) showed evidence for several sequential multiple-mediator models for how attitude similarity affected inferred attraction through several mediators. As such, the parallel mediation (Figure 3.1(b)) and serial mediation (Figure 3.1(d)) models are often observed to characterize mediational effects, which emphasizes the importance of understanding unobserved/omitted mediators in cases where there is an association between the predictor and the unobserved mediator. Therefore, it is of critical importance to fill the research gap regarding how unobserved/omitted mediators affect estimates and inferences for observed mediators as these post-treatment con- 67 founders can easily occur in practice. In this chapter, we focus on the two-mediator model discussed by Hayes (2013) (Figure 3.1(c) with dotted lines) to study how an unobserved/omitted mediator () may bias direct and indirect effect estimates via the observed mediator (). This selected two-mediator model also makes our study results generalizable from two perspectives. First, as discussed before, this model can be simplified to other popular two-mediator designs: parallel two- mediator model (Figure 3.1(e)) and serial two-mediator model (Figure 3.1(f)). Second, by setting the path coefficient from  to the omitted mediator  to zero, the studied scenario becomes the same as that for previous studies of the sensitivity of mediation effects (e.g., Fritz et al., 2016; Imai, Keele, & Yamamoto, 2010; Imai & Yamamoto, 2013; VanderWeele, 2010). Therefore, we can also compare our findings with these previous studies and extend their findings. 3.5 Goals of the Study This study has three goals. Our first goal is to establish conditions under which omitting a mediator can yield consistent results. However, we show that the conditions to obtain consistent estimates for the direct and indirect effect via  are not the same, so that one cannot simultaneously obtain consistent direct and indirect effects. Second, recognizing that these conditions may be difficult to meet in practice, we evaluate how omitting  biases the mediation effect estimate via . For example, we find that when the path coefficients related to  are all positive or all negative, the specific indirect effect via  is always overestimated while the direct effect from  to  can be either positively or negatively biased. We further show that as the path coefficient of  →  becomes larger, the positive bias in estimating the indirect effect via  becomes larger, but the estimation of the direct effect  →  changes from overestimation to underestimation. Third, we also seek to identify the magnitude of bias under different scenarios in order to contextualize the potential threats to inference by omitting the post-treatment confounder . Finally, stemming from the first three goals, we seek to propose a sensitivity analysis to assessing the robustness of the causal mediation inference to omission of an unobserved mediator that is confounded with the  paths through its relationship to  and . 68 We present our analytical findings in terms of path coefficients as well as correlations among variables in the underlying model. The path coefficient framework shows how different levels of parameters for causal pathways in the true model affect the direction and magnitude of inconsistency. Alternatively, the correlation framework shows how estimates of path coefficients vary with different magnitudes of correlations among the variables. 
We have elected to present both frameworks because the parameter framework allows us to understand how inconsistency is generated when we omit the other mediator, assuming we know the "truth," while the correlation framework facilitates the development of sensitivity analysis methods that substantive researchers can apply directly in real studies. We expect these two frameworks to complement each other and together provide a more comprehensive picture, building a solid foundation for substantive researchers dealing with an unobserved mediator as a potential confounder. Throughout, we will draw on the previously discussed illustrative example of presumed media influence to demonstrate the effect of omitting a related mediator.

3.6 Inconsistency When Mu is Omitted

The first and primary goal of this chapter is to derive the effect estimates obtained when a confounding mediator is omitted, and their deviations from the true effects in the underlying model. Throughout this derivation we write T for the randomized treatment, M for the observed mediator, Mu for the unobserved (omitted) mediator, and Y for the outcome; a₁ and b₁ denote the paths T → M and M → Y, a₂ and b₂ denote the paths T → Mu and Mu → Y, d denotes the path Mu → M, and c denotes the direct path T → Y. Specifically, we would like to study the differences between ã₁, b̃₁, c̃ — the coefficients obtained when Mu is omitted — and a₁, b₁, c, as shown in Figure 3.4.

Figure 3.4: True model with two mediators and the model omitting Mu.

We applied the Law of Iterated Expectations to derive ã₁, b̃₁, and c̃. The Law of Iterated Expectations states that E(Y | X) = E[E(Y | X, Z) | X], where X, Y and Z are three variables and E(Y | X) represents the conditional expectation (conditional mean) of Y given X (Wooldridge, 2009). The Appendix shows the details of the derivation; the results are presented in Equations 3.1 through 3.4:

$$\tilde{a}_1 = a_1 + d\,a_2, \qquad (3.1)$$

$$\tilde{b}_1 = b_1 + b_2\,d\,\frac{1 - a_2^2}{1 - (d\,a_2 + a_1)^2}, \qquad (3.2)$$

$$\tilde{c} = c + b_2\,\frac{a_2 - (d + a_1 a_2)(d\,a_2 + a_1)}{1 - (d\,a_2 + a_1)^2}, \qquad (3.3)$$

$$\tilde{a}_1\tilde{b}_1 = a_1 b_1 + d\,a_2\,b_1 + b_2\,d\,\frac{(a_1 + d\,a_2)(1 - a_2^2)}{1 - (d\,a_2 + a_1)^2}. \qquad (3.4)$$

As such, we obtain ã₁, b̃₁, and c̃ as functions of the true parameters. Specifically, a₁ and b₁ represent the true effects via the observed mediator M, while a₂ and b₂ represent the true effects via the omitted mediator Mu. In our illustrative example, the product a₁b₁ is the true mediation effect through the mediator PMI, the product a₂b₂ is the true mediation effect through the omitted mediator IMPORT, and d is the path coefficient from IMPORT to PMI. As Equations 3.1 through 3.4 show, only under very stringent conditions are ã₁, b̃₁, and c̃ equal to a₁, b₁, and c. In the following subsections, we show what these stringent conditions are and how the key parameters drive the differences between ã₁, b̃₁, c̃ and the true parameters a₁, b₁, c. Note that the derivation is at the population level and is therefore free of sampling error. Technically speaking, the differences between ã₁, b̃₁, c̃ and a₁, b₁, c are inconsistencies, not biases: ã₁, b̃₁, c̃ are the quantities we would recover even with all population data available (i.e., as the sample size goes to infinity). In what follows, "bias" in the population parameters is used loosely to describe this inconsistency for an audience more comfortable with conceptualizations of bias.²

² Take ã₁ as an example of why we derive inconsistencies rather than biases. The derivation in the Appendix defines ã₁ through the population regression E(M | T) = ã₁T. For one sample, the estimator can be written as (T′T)⁻¹(T′M), where T and M denote the sample data; this is a nonlinear function of T. To derive the magnitude of bias we would need the expectation of (T′T)⁻¹(T′M), but only probability limits, not expectations, pass through nonlinear functions, so we can characterize the probability limit of (T′T)⁻¹(T′M) but not its expectation. That is why in this chapter (and in Chapter 4) we in fact derive inconsistency, which is defined at the population level.
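To make the algebra tangible, the short sketch below implements Equations 3.1 through 3.3 (and the implied indirect effect of Equation 3.4) and checks them against a large simulated sample from the true standardized model. The parameter values are arbitrary illustrations, and the variable names follow the notation introduced above.

```python
import numpy as np

def omitted_mediator_estimates(a1, b1, c, a2, b2, d):
    """Population-level coefficients recovered when the alternative mediator Mu
    is omitted (Equations 3.1-3.3) and the implied indirect effect via M."""
    denom = 1.0 - (d * a2 + a1) ** 2
    a1_t = a1 + d * a2
    b1_t = b1 + b2 * d * (1.0 - a2 ** 2) / denom
    c_t = c + b2 * (a2 - (d + a1 * a2) * (d * a2 + a1)) / denom
    return a1_t, b1_t, c_t, a1_t * b1_t

# Arbitrary illustrative parameter values (all paths positive).
a1, b1, c, a2, b2, d = 0.2, 0.15, 0.1, 0.25, 0.25, 0.25

# Large-sample check: simulate the true model with standardized T, Mu, M,
# then fit the misspecified regressions that exclude Mu.
rng = np.random.default_rng(0)
n = 1_000_000
T = rng.normal(size=n)
Mu = a2 * T + np.sqrt(1 - a2**2) * rng.normal(size=n)
M = a1 * T + d * Mu + np.sqrt(1 - a1**2 - d**2 - 2 * a1 * d * a2) * rng.normal(size=n)
# Y's error scale affects only its variance, not the regression coefficients below.
Y = c * T + b1 * M + b2 * Mu + 0.8 * rng.normal(size=n)

X1 = np.column_stack([np.ones(n), T])
a1_hat = np.linalg.lstsq(X1, M, rcond=None)[0][1]
X2 = np.column_stack([np.ones(n), T, M])
c_hat, b1_hat = np.linalg.lstsq(X2, Y, rcond=None)[0][1:]

print("formulas  :", omitted_mediator_estimates(a1, b1, c, a2, b2, d))
print("simulation:", (a1_hat, b1_hat, c_hat, a1_hat * b1_hat))
```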
3.6.1 Conditions for consistent estimates when omitting Mu

Next, we examine the conditions under which omitting Mu still yields consistent results. If empirical researchers can justify that their studies meet these conditions, there is less concern about omitting Mu.

From Equation 3.1, ã₁ = a₁ if and only if d = 0 or a₂ = 0. When d = 0, there is no causal relation between the two mediators, which reduces the model to the parallel mediation model represented in Figure 3.1(e). In our illustrative data example, this condition is satisfied when IMPORT has no effect on PMI; under this condition, excluding IMPORT does not affect the effect estimate from COND to PMI. Alternatively, ã₁ = a₁ when a₂ = 0, that is, when the treatment T has no effect on the omitted variable Mu. This condition is equivalent to the situation in which the omitted Mu is only a confounder of the M–Y relation and is not itself caused by T.

Based on Equation 3.2, b̃₁ = b₁ requires one of three conditions: (1) b₂ = 0, indicating no effect of Mu on Y; (2) d = 0, indicating no effect of Mu on M (a parallel two-mediator model); or (3) a₂² = 1, indicating that the effect of T on Mu is −1 or 1. If condition (3) holds, the omitted mediator Mu is fully determined by the treatment T (because all variables are standardized, a₂² = 1 is equivalent to a zero error term for Mu, so the omitted mediator depends only on the treatment variable). In our illustrative example, condition (3) would mean that the value of IMPORT depends solely on COND.

In terms of the direct effect, Equation 3.3 shows that c̃ = c if and only if a₂ = (d + a₁a₂)(d a₂ + a₁) or b₂ = 0. The first condition, a₂ = (d + a₁a₂)(d a₂ + a₁), is equivalent to ρ_{T,Mu} = ρ_{M,Mu} · ρ_{T,M}, where ρ_{T,Mu}, ρ_{M,Mu}, and ρ_{T,M} are the correlations between T and Mu, between M and Mu, and between T and M, respectively. In the illustrative example, this means the correlation between COND and the omitted mediator IMPORT equals the correlation between PMI and IMPORT multiplied by the correlation between COND and PMI. The second condition, b₂ = 0, indicates no effect of Mu on Y.

In sum, as long as the mediation effect via the omitted mediator is nonzero (a₂ ≠ 0 and b₂ ≠ 0) and Mu has a nonzero effect on the observed mediator M (d ≠ 0), the estimates of a₁, b₁, and c will be inconsistent (i.e., a₁ ≠ ã₁, b₁ ≠ b̃₁, c ≠ c̃). Furthermore, the conditions for obtaining a consistent direct effect and a consistent indirect effect are not the same, so one cannot in general obtain both simultaneously. A consistent indirect effect can be obtained, despite omitting Mu, when Mu has no effect on the observed mediator M (d = 0), that is, when the underlying model is a parallel two-mediator model; in fact, under a parallel two-mediator model we obtain not only a consistent indirect effect but also consistent estimates of a₁ and b₁ individually. A consistent direct effect from T to Y can be obtained, despite omitting Mu, when ρ_{T,Mu} = ρ_{M,Mu} · ρ_{T,M} or when b₂ = 0.
3.6.2 Direction of inconsistency when omitting Mu

From the previous discussion, we note that the conditions for obtaining unbiased estimates of the direct and/or indirect effects are difficult to justify in practice. This leads to the following discussion, which addresses our second goal: to examine the direction and magnitude of the bias in the direct and indirect effect estimates when Mu is omitted.

From Equation 3.1, ã₁ > a₁ as long as d and a₂ are in the same direction; that is, when d and a₂ are both positive or both negative, a₁ is overestimated. Similarly, b̃₁ > b₁ when d and b₂ are in the same direction (either both positive or both negative, but not zero; see the Appendix for a detailed proof). Therefore, the indirect effect estimate via M is overestimated when d, a₂, and b₂ — the three path coefficients related to Mu — are all positive or all negative. The amount of bias is d·a₂·b₁ + d·b₂·(a₁ + d·a₂)(1 − a₂²)/(1 − (d·a₂ + a₁)²). In our earlier example, the three path coefficients related to IMPORT were all positive; correspondingly, when IMPORT was excluded from the model, the indirect effect through PMI became larger, from 0.045 to 0.078.

In terms of the direct effect c, the direction of bias depends on whether b₂ and [a₂ − (d + a₁a₂)(d·a₂ + a₁)] have the same sign. In the Appendix we show that [a₂ − (d + a₁a₂)(d·a₂ + a₁)] equals (ρ_{T,Mu} − ρ_{T,M} · ρ_{M,Mu}), which is also the numerator of the partial correlation between T and Mu conditional on M. When b₂ > 0 and ρ_{T,Mu} > ρ_{T,M} · ρ_{M,Mu}, c is overestimated; in contrast, when b₂ > 0 and ρ_{T,Mu} < ρ_{T,M} · ρ_{M,Mu}, c is underestimated. One way to think about the sign of (ρ_{T,Mu} − ρ_{T,M} · ρ_{M,Mu}) is to consider the observed mediator M as a common outcome of the omitted variable Mu and the treatment T. When ρ_{T,Mu} < ρ_{T,M} · ρ_{M,Mu}, the association between T and Mu changes sign once M is partialled out; in this case M functions similarly to a suppressor, since it distorts the relation between T and Mu (e.g., Cohen & Cohen, 1983; Rosenberg, 1968). In our illustrative example of presumed media influence, the path coefficient b₂, from the omitted mediator IMPORT to the outcome REACTION, was positive. Additionally, the correlation between COND and IMPORT was larger than the product of the correlation between COND and PMI and the correlation between PMI and IMPORT, that is, ρ_{T,Mu} > ρ_{T,M} · ρ_{M,Mu}. Accordingly, the direct effect became larger, from 0.033 to 0.082, when IMPORT was excluded from the model.

3.6.3 How inconsistency changes with the parameters related to Mu

Next we demonstrate how the omission of Mu generates bias under different scenarios. To do so, we manipulate each of the parameters related to Mu in turn. The discussion exhausts the possible scenarios for how each Mu-related path coefficient affects the bias, in terms of increasing or decreasing it. For example, the path coefficient d from Mu to M is always positively associated with the bias in estimating a₁, no matter what values the other parameters take. In contrast, the path coefficient a₂ from T to Mu can be either positively or negatively related to the bias in b̃₁, depending on the values of the other parameters. To simplify, we constrain the discussion to the case in which all parameters in the true model are positive (see the Appendix for details on how these scenarios were derived by studying the first derivatives of the bias of ã₁, b̃₁, and c̃ with respect to the Mu-related parameters).

We start with the effect of Mu on M, the path d. Figure 3.5 plots the bias in estimating a₁, b₁, and c against different values of d, with the other parameters fixed (a₁ = a₂ = 0.2 and b₂ = 0.1 in the figure). First, the magnitude of the bias in estimating both a₁ (T → M) and b₁ (M → Y) increases as d becomes larger. Focusing on the chain T → Mu → M in the true model (left panel of Figure 3.4), the larger d is, the more of the effect attributed to a₁ should in fact be attributed to Mu as a mediator; therefore a₁ is overestimated, and the difference between ã₁ and a₁ grows as d gets bigger. Similarly, d plays a role in the chain T → Mu → M → Y: the larger the magnitude of d, the larger the bias in the estimated effect of M on Y. Thus, as d increases, the bias in both ã₁ and b̃₁ increases, leading to more serious overestimation of the indirect effect T → M → Y.
Figure 3.5 also shows that after the positive bias in the estimated direct effect  →  falls to 0 it continues decreasing to negative values as  increases to 1. That is, when  is relatively small we overestimate  while when  becomes larger we underestimate . As  gets closer to 1, the magnitude of negative bias in ˜ keeps growing. This is consistent with our previous discussion about the direction of bias in ˜. As  becomes larger, the “suppression” function from the common outcome  becomes more serious as the difference between , and , ·  , becomes larger. To summarize, when we manipulate  to vary from 0 to 1, we always overestimate the indirect effect via  and the bias becomes larger. However, the increase in  first countermands the overestimation in  and as  gets closer to 1 it finally leads to underestimation of the direct effect  → . The Appendix also proves that this pattern of associations between  and the bias of ˜1, 4See the Appendix for more detailed discussion about how we calculated these different scenarios by studying the first derivatives of the bias of ˜1, ˜1 and ˜ as a function of -related parameters. 74 Figure 3.5: How bias changes with different levels of . ˜1 and ˜ is consistent no matter what values other parameters take on. Now we manipulate 2 and see how this direct effect  →  relates to changes in bias. The results are presented in Figure 3.6(a) and Figure 3.6(b). In both figures, a larger direct effect  →  (2) is associated with larger bias in the estimated direct effect  → . The intuition the indirect effect  →  →  is allocated to is similar to how  affects the bias of ˜1: the biased effect  →  (reflected as ˜1), where the omitted  plays a role of mediator. Consistent with our previous discussion, Figure 3.6(a) and 3.6(b) shows the direct effect  →  is always overestimated (the bias is positive) (when all parameters are positive) while ˜ can be either positively or negatively biased. Note that Figure 3.6(a) and 3.6(b) present different trends of changes in the bias of ˜1 and ˜ as 2 increases5. Figure 3.6(a) shows the bias of ˜1 decreases until 0 as 2 increases while Figure 3.6(b) shows the bias of ˜1 increases as 2 gets bigger. On the contrary, the bias of ˜ becomes 5The Appendix shows how we calculated these two different patterns by studying the partial derivatives of the bias in estimating 1, 1 and  with respect to 2. 75 Figure 3.6: How bias changes with different level of 2. larger in Figure 3.6(a) but Figure 3.6(b) gives a decreasing curve for the bias in estimating . The decreasing curve indicates the underestimation of  becomes more extreme as 2 increases. The distinction between Figure 3.6(a) and 3.6(b) illustrates that the effect of 2 on the bias in estimating 2 and  relies on the values of other parameters. Specifically, Figure 3.6(a) presents the scenario where the magnitude of 1 and  are both relatively small (with values of 0.15 and 0.22) while in Figure 3.6(b) 1 and  are both relatively large (with values of 0.6 and 0.5). The last parameter that relates to  is the effect  → , represented as 2. Figure 3.7(a)6 and Figure 3.7(b) 7 present how the bias in estimating 1, 1 and  vary with differing levels of 2 under two different scenarios 8. In both Figure 3.7(a) and 3.7(b), ˜1 and ˜1 are always positively biased. But the positive bias of ˜1 becomes larger as 2 gets closer to 1 while the level of 2 has no effect on the bias in estimating 1. 
To see why 2 has no effect on the magnitude of the bias of ˜1, we focus on  →  →  in the true model because this is where the bias of ˜1 comes from (based on Equation 3.1). It is 61 = 0.2, 2 = 0.2,  = 0.2. 71 = 0.55, 2 = 0.2,  = 0.6. 8The Appendix shows how we calculated these two different patterns by studying the partial derivatives of the bias in estimating 1, 1 and  with respect to 2. 76 Figure 3.7: How bias changes with different levels of 2. obvious then that 2 plays no role in  →  → . That is, the direct effect  →  (2) is not directly connected with the direct effect  →  (1). Alternatively, we can consider this “no effect” by noticing that the path 2 is invoked after the direct effect  →  (1) in the sequence of causality. Equation 3.2 also shows that the bias in estimating 1 is a linear function of 2 as in both Figure 3.7(a) and 3.7(b). The distinction between Figure 3.7(a) and 3.7(b) is how different levels of 2 affect the bias of ˜. In Figure 3.7(a),  becomes overestimated and the positive bias grows but in Figure 3.7(b) we always underestimate  and more importantly, the negative bias becomes more serious as 2 increases. This distinction is due to the direction of ˜’s bias while the effect of 2 on the magnitude of ˜’s bias follows the same pattern in Figure 3.7(a) and 3.7(b). In Figure 3.7(a) we have 1 =  = 2 = 0.2 but in Figure 3.7(b), 1(= 0.55) and (= 0.6) are much larger than 2(= 0.2). < , ·  , Correspondingly, , for Figure 3.7(b). Based on the previous discussion, these two scenarios exemplify the cases in which  gets overestimated and underestimated, respectively. Importantly, in both scenarios, 2 has no effect on the direction of ˜’s bias but only exhibits a positive influence on the magnitude of > , ·  , for Figure 3.7(a) and , 77 ˜’s bias. When ˜ is positively biased, this influence then manifests as more serious overestimation. Alternatively, when the bias is negative, it results in more serious underestimation. 3.7 How Serious Inconsistency Could be at Different Levels of -related Parameters In the previous section we demonstrate how differing levels of -related parameters may affect bias. In this section, we will solve for bias under different conditions of 2, 2 and  as small (0.1), medium (0.25) and large (0.4) to demonstrate how serious the bias could be and how large the bias is relative to the true effect. Starting from general situations, we will also discuss special situations where parallel ( = 0) or serial (1 = 2 = 0) two-mediator model is the true underlying mediation process. To link our analysis with previous literature, we will also present situations where the omitted  is only a confounder for the mediator-to-outcome relation but not a mediator itself (2 = 0). Table 3.1 depicts situations where the dual mediator design in Figure 1.7(c) (with dotted lines) is the true model, with all pathway coefficients being positive. Specifically, in all models, 1 = 0.2, 1 = 0.15 and  = 0.1 (except in Table 3.3 of serial mediation models where 1 = 0). That is, the true indirect effect is 0.03 and the true direct effect is 0.1. Each row displays the bias in one scenario with different values of 2, 2 and  either small (0.1), medium (0.25) or large (0.4). Consistent with previous discussion, all cases in Table 3.1 generate positively biased values for the indirect effect via  from  to , since all -related path coefficients (2, 2 and ) take on positive values. 
But the magnitude of bias for the estimated indirect effect via  can vary substantially in different cases: in the worst scenario, large 2, medium 2 and large  generates a biased indirect effect around 0.09, almost three times as large as the true value of 0.03; while in the best scenario, small 2, 2 and  only generates an indirect effect of 0.034, just slightly larger than the true value of 0.03. The magnitude of bias for the estimated direct effect ˜ also varies greatly from case to case, with the worst scenario displaying a bias as large as 0.15. In our illustrative example, when IMPORT was included in the model, 2 was a small to medium value of 0.181,  was a medium value of 0.258, and 2 was a medium to large value of 0.363. As such, when 78 IMPORT was excluded, the indirect effect via PMI increased around 73.33%, from 0.045 to 0.078. The direct effect from COND to REACTION increased almost 150%, from 0.033 to 0.082. Table 3.1: Bias in the Estimated Direct Effect of  on  and Estimated Indirect Effect via . Parameters 2 2  ( ˜1−1) (˜1−1) (˜1−1) /1 Bias/True value ( ˜ −) / ( ˜1 ˜1 ( ˜1−1) −11) /1 /11 20.0% 23.8% 151.4% 48.5% 50.0% 61.5% 132.3% 142.3% 80.0% 102.9% 104.4% 265.3% 20.0% 14.9% 94.7% 37.8% 50.0% 38.5% 82.7% 107.7% 80.0% 64.3% 65.3% 195.8% 20.0% 5.9% 37.9% 27.1% 50.0% 15.4% 33.1% 73.1% 80.0% 25.7% 26.1% 126.3% 12.5% 26.3% 91.1% 42.1% 31.3% 67.1% 73.6% 119.4% 50.0% 109.9% 50.5% 214.8% 12.5% 16.5% 56.9% 31.0% 31.3% 42.0% 46.0% 86.3% 50.0% 68.7% 31.6% 153.0% 12.5% 6.6% 22.8% 19.9% 31.3% 16.8% 18.4% 53.3% 50.0% 27.5% 12.6% 91.2% 5.0% 27.6% 31.3% 34.0% 12.5% 69.5% 16.5% 90.7% 20.0% 112.1% -0.3% 154.5% 5.0% 17.3% 19.6% 23.1% 12.5% 43.4% 10.3% 61.4% 20.0% 70.0% -0.2% 104.0% 5.0% 6.9% 12.2% 12.5% 17.4% 32.1% 20.0% 28.0% -0.1% 53.6% 7.8% 4.1% 0.4 0.4 0.4 0.25 0.25 0.25 0.1 0.1 0.1 0.4 0.4 0.4 0.25 0.25 0.25 0.1 0.1 0.1 0.4 0.4 0.4 0.25 0.25 0.25 0.1 0.1 0.1 0.1 0.25 0.4 0.1 0.25 0.4 0.1 0.25 0.4 0.1 0.25 0.4 0.1 0.25 0.40 0.1 0.25 0.4 0.1 0.25 0.4 0.1 0.25 0.4 0.1 0.25 0.4 0.04 0.10 0.16 0.04 0.10 0.16 0.04 0.10 0.16 0.03 0.06 0.10 0.03 0.06 0.10 0.03 0.06 0.10 0.01 0.03 0.04 0.01 0.03 0.04 0.01 0.03 0.04 0.04 0.09 0.15 0.02 0.06 0.10 0.01 0.02 0.04 0.04 0.10 0.16 0.02 0.06 0.10 0.01 0.03 0.04 0.04 0.10 0.17 0.03 0.07 0.11 0.01 0.03 0.04 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 Note: 2 = direct effect of  → ; 2 = direct effect of  → ;  = direct effect of  → ; 1 = direct effect of  → ; 1 = direct effect of  → ;  = direct effect of  → ; ˜1 = direct effect of  →  with the omission of ; ˜1 = direct effect of  →  with the omission of ; ˜ = direct effect of  →  with the omission of .  Hypothetical path coefficients for the model depicted in Figure 3.4. Bias ( ˜ −) 0.15 0.13 0.10 0.09 0.08 0.07 0.04 0.03 0.03 0.09 0.07 0.05 0.06 0.05 0.03 0.02 0.02 0.01 0.03 0.02 0.00 0.02 0.01 -0.0002 0.01 0.004 -0.0001 ( ˜1 ˜1 −11) 0.01 0.04 0.08 0.01 0.03 0.06 0.01 0.02 0.04 0.01 0.04 0.06 0.01 0.03 0.05 0.01 0.02 0.03 0.01 0.03 0.05 0.01 0.02 0.03 0.004 0.01 0.02 79 Additionally, most cases in Table 3.1 generate positively biased values for the direct effect from  to , with only two exceptions of underestimation but the magnitude of bias is very small in both cases (-0.0002 and -0.0001). Additionally, it is very important to note Equation 3.3 also implies the bias in estimating  does not depend on the true value of . 
From a practical perspective, this suggests even with a very large sample, omitting  can yield evidence of a medium positive (or a small negative) direct effect from  to  when the actual direct effect is zero. Table 3.2: Bias in the Estimated Direct Effect of  on  and Estimated Indirect Effect via ( = 0). Bias ( ˜ −) 0.0100 0.0250 0.0400 0.0250 0.0625 0.1000 0.0400 0.1000 0.1600 Bias/True value ( ˜ −) / (˜1−1) /1 ( ˜1 ˜1 ( ˜1−1) −11) /1 /11 0.0% 0.0% 10.0% 0.0% 0.0% 0.0% 25.0% 0.0% 0.0% 0.0% 40.0% 0.0% 0.0% 0.0% 25.0% 0.0% 0.0% 0.0% 62.5% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 40.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 0.0% 160.0% 0.0% ( ˜1 ˜1 −11) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Parameters 2 2  ( ˜1−1) (˜1−1) 0 0 0 0 0 0 0 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.1 0.25 0.4 0.1 0.25 0.4 0.1 0.25 0.4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.1 0.1 0.1 0.25 0.25 0.25 0.4 0.4 0.4 Note: 2 = direct effect of  → ; 2 = direct effect of  → ;  = direct effect of  → ; 1 = direct effect of  → ; 1 = direct effect of  → ;  = direct effect of  → ; ˜1 = direct effect of  →  with the omission of ; ˜1 = direct effect of  →  with the omission of ; ˜ = direct effect of  →  with the omission of .  Hypothetical path coefficients for the model depicted in Figure 3.4. Tables 3.2 and 3.3 depict situations where the true underlying mediation process are parallel (Figure 3.1(e),  = 0) and serial (Figure 3.1(f), 1 = 2 = 0) two-mediator models, respectively. Importantly, all cases in Table 3.2 (parallel mediation model) yield positively biased direct effect from  to  but unbiased indirect effect via ; while all cases in Table 3.3 (serial mediation model) yield positively biased indirect effect via  but unbiased direct effect from  to . These patterns are also implied by Equation 3.4 and 3.3, where  = 0 gives unbiased effect for 11 in 80 the parallel model and 2 = 0 gives unbiased effect for  in the serial model. Additionally, the positive bias in both Table 3.2 and Table 3.3 increases as -related path coefficients get larger. Most importantly, the actual indirect effect in the serial model is 0 (since 1 = 0) but the omission of  can introduce bias to estimate a considerable indirect effect even when the researcher has a very large sample, especially when 2 and  have large values in the true serial mediation model. Table 3.3: Bias in the Estimated Direct Effect of  on  and Estimated Indirect Effect via (1 = 2 = 0). Bias/True value 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Parameters 2 2  ( ˜1−1) (˜1−1) Bias ( ˜ −) 0 0 0 0 0 0 0 0 0 0.1 0.25 0.4 0.1 0.25 0.4 0.1 0.25 0.4 0.01 0.03 0.04 0.03 0.06 0.10 0.04 0.10 0.16 ( ˜1−1) ( ˜1 ˜1 −11) /1 0.0015 NA 0.0038 NA 0.0060 NA 0.0038 NA 0.0094 NA 0.0150 NA 0.0060 NA 0.0150 NA 0.0240 NA (˜1−1) ( ˜ −) / /1 0.1 0% 0% 0.1 0% 0% 0.1 0% 0% 0.25 0% 0% 0.25 0% 0% 0.25 0% 0% 0.4 0% 0% 0.4 0% 0% 0.4 0% 0% Note: 2 = direct effect of  → ; 2 = direct effect of  → ;  = direct effect of  → ; 1 = direct effect of  → ; 1 = direct effect of  → ;  = direct effect of  → ; ˜1 = direct effect of  →  with the omission of ; ˜1 = direct effect of  →  with the omission of ; ˜ = direct effect of  →  with the omission of .  Hypothetical path coefficients for the model depicted in Figure 3.4. ( ˜1 ˜1 −11) /11 NA NA NA NA NA NA NA NA NA Table 3.4 depicts situations where the omitted  is only a confounder for the mediator-to- outcome relation but not a mediator (i.e., 2 = 0). In our illustrative example, this means the path coefficient from COND to IMPORT is zero. 
All cases in this table negatively bias the direct effect from  to  and positively bias the indirect effect via , which is consistent with previous literature (Fritz et al., 2016). By setting 2 = 0 in Equation 3.1 through 3.3, we get the formulas for bias in estimating 1 and , which are the same as that in previous literature9. To further understand 9By setting 2 = 0 in Equations 3.1 through 3.3, we can get that the bias in ˜1 is 0, the bias in 81 Table 3.4: Bias in the Estimated Direct Effect of  on  and Estimated Indirect Effect via (2 = 0). Bias ( ˜ −) -0.0021 -0.0052 -0.0083 -0.0052 -0.0130 -0.0208 -0.0083 -0.0208 -0.0333 -2.1% (˜1−1) /1 6.9% Bias/True value ( ˜ −) / ( ˜1 ˜1 ( ˜1−1) −11) /1 /11 6.9% 0% 0% 17.4% -5.2% 17.4% 0% 27.8% -8.3% 27.8% 0% 17.4% -5.2% 17.4% 0% 43.4% -13.0% 43.4% 0% 69.4% -20.8% 69.4% 0% 27.8% -8.3% 27.8% 0% 69.4% -20.8% 69.4% 0% 111.1% -33.3% 111.1% ( ˜1 ˜1 −11) 0.002 0.01 0.01 0.01 0.01 0.02 0.01 0.02 0.03 Parameters 2 2  ( ˜1−1) (˜1−1) 0 0 0 0 0 0 0 0 0 0.1 0.25 0.4 0.1 0.25 0.4 0.1 0.25 0.4 0.1 0.1 0.1 0.25 0.25 0.25 0.4 0.4 0.4 0.01 0.03 0.04 0.03 0.07 0.10 0.04 0.10 0.17 0 0 0 0 0 0 0 0 0 Note: 2 = direct effect of  → ; 2 = direct effect of  → ;  = direct effect of  → ; 1 = direct effect of  → ; 1 = direct effect of  → ;  = direct effect of  → ; ˜1 = direct effect of  →  with the omission of ; ˜1 = direct effect of  →  with the omission of ; ˜ = direct effect of  →  with the omission of .  Hypothetical path coefficients for the model depicted in Figure 3.4. this Table 3.4, we consider cases in this table as special situations of Table 3.1, where 2 takes on the value of 0 in all cases. Implied by Equations 3.1, 2 = 0 suggests an unbiased direct effect estimate from  to  and thus only the bias of the estimated direct effect from  to  remains and contributes to the positively biased indirect effect via . Implied by Equation 3.3, 2 = 0 indicates a negative bias for the estimated direct effect from  to , under the conditions that 1,  ˜1 is 2 · /(1− 2 2). These are exactly the same as the bias formula presented in Fritz et al. (2016) (p. 684), where they wrote 2 as 1 and they use 1 to denote the confounder. Their 1  is our  , and their    is our  . The Appendix presents the formulas of correlations as a function of path coefficients. When 2 = 0, we get   =  and   = 1. Thus, their bias in estimating 1, which is 1 · , (1−22) ; their bias in estimating , which is 1 · is equivalent to our 2 · equivalent to our 2 · −·1 1−22 . (cid:34) 1  (cid:16) 1 (cid:34)−  1 (cid:17)(cid:35) (cid:16) 1 2), and the bias for the bias in ˜ is 2 · (− · 1)/(1− 2 (cid:17)(cid:35) 1−2 , is 1−2        82 and 2 are all positive. 3.8 Correlation Framework and Sensitivity Analysis for Omitted Alternative Mediators We have demonstrated that omitting alternative mediators/post-treatment confounders may bias the estimate of the direct effect from  to  as well as the indirect effect via . The direction and magnitude of bias can vary substantially under different underlying mediating processes. From a practical perspective, we never know the actual parameters in the true model, but we can use our analysis to quantify the robustness of any inference regarding  to a potential post-treatment confounder. Sensitivity analyses can serve as a useful tool to inform the strength of evidence for specific inferences by quantifying the conditions that would alter the inference (e.g., Frank, 2000; Imbens, 2003; Rosenbaum, 2002; VanderWeele & Arah, 2011). 
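Before turning to that sensitivity framework, the confounder-only case just discussed (Table 3.4, where a2 = 0 so MU affects M and Y but not X) can be checked numerically. A minimal sketch with hypothetical values of my own choosing:

    # Sketch of the Table 3.4 situation: MU is a mediator-outcome confounder, not a mediator
    set.seed(2)
    n  <- 5e5
    a1 <- 0.3; g <- 0.4; b1 <- 0.3; b2 <- 0.4; cc <- 0.1
    X  <- rbinom(n, 1, 0.5)
    MU <- rnorm(n)                               # unrelated to X: a2 = 0
    M  <- a1 * X + g * MU + rnorm(n)
    Y  <- cc * X + b1 * M + b2 * MU + rnorm(n)
    red <- lm(Y ~ X + M)                         # MU omitted
    round(c(a1_hat = unname(coef(lm(M ~ X))["X"]),   # close to 0.30 (consistent)
            b1_hat = unname(coef(red)["M"]),         # above 0.30 (inflated)
            c_hat  = unname(coef(red)["X"])), 3)     # below 0.10 (underestimated)

The output mirrors the table: the X-to-M path is untouched, the indirect effect via M is inflated, and the direct effect is pushed downward.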
If a study focuses on estimating a treatment effect, for example, sensitivity analyses can generate statements about how strongly an omitted variable would have to be correlated with the treatment and with the outcome to invalidate an inference of an effect of the treatment on the outcome. As such, recent approaches to sensitivity analysis help interpreters of research quantify the conditions necessary to invalidate an inference, drawing on familiar quantities such as correlations (Frank, 2000), percentage of variance explained (Cinelli & Hazlett, 2018), or graphical representations such as contour plots (Imbens, 2003). In this section, we quantify the robustness of inferences by evaluating how sensitive the estimated direct and indirect effects of X are to a potential post-treatment confounder MU. For example, we can quantify how large the effects of MU on M and on Y must be to change a statistically significant direct or indirect effect to zero. Specifically, we quantify the sensitivity of estimates and the robustness of inferences as functions of the correlations between MU and the other observed variables X, M and Y, so that empirical researchers can use these quantities to discuss the robustness of their inferences in terms of MU.

To introduce this sensitivity analysis approach, we first present the correlation framework: based on Equations 3.1 through 3.4, we write the bias in estimating a1, b1, and c as functions of the correlations among the four variables X, M, Y and MU: r(X,M), r(X,Y), r(M,Y), r(X,MU), r(M,MU), and r(Y,MU). The Appendix provides details and the final derivation results. Compared to the parameter framework in Equations 3.1 through 3.4, where all the path parameters (i.e., a1, b1, c, a2, b2 and g) are unknown to substantive researchers, we can always estimate r(X,M), r(X,Y), and r(M,Y) from the sample data. As such, the bias depends only on the three unknown MU-related correlations: r(X,MU), r(M,MU), and r(Y,MU). These three unknown correlations are the key parameters in our sensitivity analysis approach.

To demonstrate this sensitivity analysis approach, we go back to our illustrative example regarding the presumed influence of media use. Assume we fit the model with only one mediator, PMI, with the results presented in the right panel of Figure 3.2. The specific indirect effect via PMI was estimated as 0.078 and significantly positive. Next, we ask how sensitive this estimate is to an unobserved post-treatment confounder/mediator. For example, we consider IMPORT as a potential confounder, but we fail to measure this variable. Focusing on the indirect effect via PMI, Figure 3.8 presents the results of the sensitivity analyses based on the three correlations as sensitivity parameters: the correlations between the unobserved post-treatment confounder MU and the other observed variables X, M and Y. Because we have three sensitivity parameters, we include 15 scenarios in Figure 3.8 to present the sensitivity analyses comprehensively. In all scenarios, the horizontal dotted line is drawn at the point estimate of 0.078, which is the estimated indirect effect via PMI when we fit the model excluding the confounder MU. The solid black line plots the estimated indirect effect via PMI against differing values of one correlation related to the confounder MU. The grey region represents the 95% confidence interval based on the delta method (Sobel, 1982).10
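One way to carry out the computation behind this correlation framework is sketched below: given the three observed correlations and hypothesized values for the three MU-related correlations, the MU-adjusted standardized coefficients, and hence the adjusted indirect effect via M, follow from solving the usual normal equations. The function name and every numeric value are hypothetical, not the estimates from the presumed-media-influence data; the figure's confidence band would additionally require the delta-method standard error.

    # Sketch: MU-adjusted standardized coefficients from observed + hypothesized correlations.
    # Hypothesized values must keep the full correlation matrix positive definite.
    adjusted_effects <- function(r_xm, r_xy, r_my, r_xmu, r_mmu, r_ymu) {
      R <- matrix(c(1,     r_xm,  r_xmu,
                    r_xm,  1,     r_mmu,
                    r_xmu, r_mmu, 1), 3, 3, byrow = TRUE)      # cor(X, M, MU)
      beta_y <- solve(R, c(r_xy, r_my, r_ymu))                 # (c, b1, b2) adjusting for MU
      beta_m <- solve(matrix(c(1, r_xmu, r_xmu, 1), 2, 2),
                      c(r_xm, r_mmu))                          # (a1, g) adjusting for MU
      c(a1 = beta_m[1], b1 = beta_y[2], c = beta_y[1],
        indirect = beta_m[1] * beta_y[2])
    }
    # Illustration: hold r(M,MU) = 0.1 and r(Y,MU) = 0.5 fixed and vary r(X,MU),
    # mirroring the left column of Figure 3.8 (the observed correlations are made up here)
    sapply(seq(0, 0.3, 0.1),
           function(r) adjusted_effects(0.35, 0.25, 0.45, r, 0.1, 0.5)["indirect"])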
Each column includes five scenarios plotting how the estimated indirect effect varies with one particular - related correlation: the left column has x-axis representing the correlation between  and ; the middle column has x-axis representing the correlation between  and ; the right column has x-axis representing the correlation between  and . Each row includes three scenarios where 10The standard error of the indirect effect is approximated by 84 (cid:113) 12 (1) + 1 2 (1). Figure 3.8: Sensitivity analysis for unobserved post-treatment confounder. 85 the other two unmanipulated -related correlations are fixed at a certain level: the 1 to 3 rows show scenarios where the two correlations taking on low, medium and high values11 , respectively; the 4Ò and 5Ò rows show scenarios where the two correlations taking on one low, one high and one high, one low values, respectively. We interpret the sensitivity analyses by looking at each column separately. The five plots in the left column show how the estimated indirect effect changes with different values of the correlation between COND and the potential posttreatment confounder, when the correlation between PMI and the confounder  (rmomu) and the correlation between REACTION and the confounder  (rymu) are fixed at different values. For example, the fourth plot in the left column indicates when the correlation between PMI and  (rmomu) is low (0.1) and the correlation between REACTION and  (rymu) is high (0.5), the original inference about the direction of the indirect effect via PMI would always be maintained, no matter how large the correlation between COND and . However, if we consider sampling variability, the confidence interval covers the value of zero when the correlation between COND and  is between -0.043 and 0.834. This implies that the conclusion of a positive indirect effect via PMI is sensitive to a posttreatment confounder. Similar conclusions can be drawn based on other plots in the first column: although only extreme values of the correlation between COND and  can alter the direction of the indirect effect from positive to negative, once we take into account the sampling variability, the confidence interval can cover zero even when the correlation between COND and  is around 0. The five plots in the middle column imply the same conclusion: the direction of the indirect effect via PMI can be maintained in most cases unless the correlation between PMI and  takes on some extreme values; however, once sampling variability is taken into account, the confidence interval can cover zero even when the correlation between PMI and  is close to zero. The scenarios in the right column, which plot the estimated indirect effect against differing values of the correlation between 11We used 0.1, 0.3 and 0.5 for low, medium and high correlations between continuous variables (i.e., ,  and ). For correlations involving the binary variable , we used 0.2, 0.5 and 0.8 as low, medium and high Cohen’s D, which corresponds to correlations of 0.0995, 0.2425 and 0.3713 as low, medium and high, respectively. 86 REACTION and , support an even stronger conclusion that the positive indirect effect via PMI is very sensitive to a post-treatment confounder : in all five plots, the confidence interval can cover zero for most values of the correlation between REACTION and , especially when the other two -related correlations are high (the 3 plot). 
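For reference, the delta-method interval described in footnote 10 can be computed directly from two fitted models. The snippet below simulates a small data set only so the example is self-contained; variable names and values are hypothetical.

    # Delta-method (Sobel) interval for the indirect effect a1*b1, per footnote 10
    set.seed(6)
    dat <- data.frame(X = rbinom(500, 1, 0.5))
    dat$M <- 0.4 * dat$X + rnorm(500)
    dat$Y <- 0.1 * dat$X + 0.3 * dat$M + rnorm(500)
    fit_m <- lm(M ~ X, data = dat)
    fit_y <- lm(Y ~ X + M, data = dat)
    a1 <- coef(fit_m)["X"]; b1 <- coef(fit_y)["M"]
    se <- sqrt(b1^2 * vcov(fit_m)["X", "X"] + a1^2 * vcov(fit_y)["M", "M"])
    round(c(indirect = unname(a1 * b1),
            lower = unname(a1 * b1 - 1.96 * se),
            upper = unname(a1 * b1 + 1.96 * se)), 3)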
3.9 Discussion

The major conclusion of this article is that omitting a mediator MU will typically generate biased (more precisely, inconsistent) estimates of the specific indirect effect via M, as well as of the direct effect from X to Y. Additionally, the magnitude of bias (inconsistency) can be substantial. In our illustrative example, which studies whether media (X) affects people's attitudes or behaviors (Y) through changing people's perceptions regarding how other people may be influenced by the media (M), excluding the alternative mediator of the perceived importance of the article (MU) gave us a different conclusion, namely that the indirect effect via presumed media influence (M) is significantly positive. Once the perceived media importance (MU) was included, the indirect effect via presumed media influence (M) was no longer significantly different from 0, decreasing from 0.078 to 0.045. Though the inference about the direct path from article location (X) to participants' reaction (Y) did not change whether or not the perceived media importance (MU) was included, the point estimate decreased from 0.082 to 0.033 once we included MU in our model.

The exact pattern of bias (inconsistency) depends on the specific underlying mediation process. In the parallel two-mediator model (Figure 3.1(e)), where the omitted mediator is independent of the observed mediator, the estimate of the specific indirect effect via M is not biased, but the estimate of the direct effect from X to Y can be either positively or negatively biased, depending on the specific indirect effect via the omitted mediator MU. If the specific indirect effect via the omitted mediator MU is positive, then the direct path from X to Y is overestimated: some of the credit assigned to the direct effect from X to Y should instead be attributed to MU as a mediator. If the specific indirect effect via the omitted mediator MU is negative, then the direct path from X to Y is underestimated. The magnitude of the bias (inconsistency) in estimating c is exactly the true specific indirect effect via MU.

In the serial two-mediator model (Figure 3.1(f)), the estimate of the direct effect from X to Y is unbiased, but the estimate of the specific indirect effect via M can be either positively or negatively biased, depending on the path coefficients directly related to the unobserved mediator MU. The estimated effect attributed to X → M should be attributed to X → MU → M. Note that the true path from X to M is zero, indicating that the true specific indirect effect via M is zero. By excluding MU, we may conclude a nonzero indirect effect via M.

If the underlying mediating process is a more general two-mediator model where all path coefficients are positive, then the indirect effect via M is overestimated and the magnitude of bias can vary substantially. The estimate of the direct effect from X to Y can be either positively or negatively biased. When the path from MU to M (g) is relatively small, we overestimate the direct effect (c), while when g gets close to 1, we underestimate the direct effect (c). Further, the larger the effect of the omitted mediator MU on the outcome Y, the larger the positive bias in estimating the indirect effect via M. Importantly, the conditions under which the direct and the indirect effect estimates are unbiased are not the same, so one generally cannot obtain accurate direct and indirect effects simultaneously. Situations become even more complicated once we consider sampling variability. Omitting an alternative mediator can have either no implications or disastrous consequences for hypothesis testing regarding the observed mediator M.
Therefore, we propose a sensitivity analysis approach in which the three MU-related correlations serve as sensitivity parameters. We are also developing an easy-to-use R package for empirical researchers to examine the robustness of their inference regarding M to a potential omitted mediator MU. The demonstration of the sensitivity analysis in this paper shows only one way this package can be applied. If the researcher has any knowledge about a particular MU, they can choose to specify two of the unknown MU-related correlations so that they can focus on one figure to see how the remaining MU-related correlation affects the estimated effect and the inference.

One may ask: what if more than one mediator is omitted? From a practical perspective, we can never know the true underlying mediating process, and there can always be another omitted mediator. We argue that, for the purpose of quantifying the strength of evidence in making an inference, it is enough to consider one mediator that captures all sources of bias. To better conceptualize the omitted mediator in the thought experiment of the sensitivity analysis, we can think of MU as one latent variable that captures all potential omitted mediators that may bias our inference for M.

Finally, it is important to note that the definitions of the indirect effect and the direct effect here differ from the natural indirect effect (NIE), the natural direct effect, and the controlled direct effect in the counterfactual framework. As reviewed before, this distinction can be considerable in models with multiple mediators, which emphasizes the importance of clarifying the definition of the indirect effect and the direct effect in applied research. Additionally, although the counterfactual framework allows us to relax those parametric assumptions, we argue that, by specifying a parametric model, we can see how omitting another mediator may bias our estimation of each specific path coefficient we are interested in (i.e., a1, b1 and c).

3.10 Limitations and Future Directions

Several limitations of the current work suggest avenues for future research. First, we make strong assumptions about the model specification, including no mediator-outcome interaction and random assignment of the original conditions (X). A more complex situation arises when these assumptions are relaxed. Second, we applied the delta method to approximate the sampling variability, which may not be accurate in some scenarios. Third, we acknowledge that three sensitivity parameters are a lot to consider. Though we provided several plots in the sensitivity approach to accommodate different scenarios, it would be valuable if future studies could reduce the number of sensitivity parameters. Furthermore, we only examined the cross-sectional dual-mediator model. Recent research has suggested longitudinal designs to test mediation because cross-sectional examination of mediation can generate biased estimates (e.g., Maxwell & Cole, 2007; Maxwell, Cole, & Mitchell, 2011; Mitchell & Maxwell, 2013). Accordingly, future studies should consider whether omitting another mediator may bias estimation in longitudinal designs.

CHAPTER 4
APPLYING A PARAMETER FRAMEWORK TO QUANTIFY INCONSISTENCY IN A TIME VARYING MEDIATION MODEL

4.1 Introduction

This chapter continues the discussion in Chapter 3 about an unobserved mediator as a posttreatment confounder in a single-mediator model.
Specifically, I will leverage the parameter framework to further discuss how inconsistency is generated for each path coefficient we are interested in when a posttreatment confounder is omitted. Therefore in the first section of this chapter I will develop a parameter framework for characterizing inconsistency in mediation models. Then in the second part I will apply this parameter framework to a longitudinal design. Two important tools will serve as the basis for all the discussion in this chapter: namely the Law of Iterated Expectation (LIE) and the Linear Regression framework. These powerful tools allow us to grasp a deep and intuitive understanding about the mechanisms underlying inconsistency generation. More importantly, this understanding can be applied in several ways: (1) it helps us better understand the patterns of how each -related parameter affects inconsistency; (2) it has implications for how we may consider the effects of the unobserved confounder  on the indirect effect via  and the direct effect from  to ; (3) this understanding also goes beyond the cross- sectional model and depicts the underlying story of a post-treatment confounder in a longitudinal single-mediator model as well. 4.2 Deriving Inconsistency Using the Law of Iterated Expectation for a Parameter Frame- work The Law of Iterated Expectation (LIE) states that (|) = [ (|, ) |] where ,  and are  three variables and  (|) represents the conditional expectation (or conditional mean) of  given  (Wooldridge, 2009). The LIE allows us to apply a two-step approach to solve the conditional mean: (|). First, we find  (|, ), which is the conditional mean of  given 90  and . As such, we get a function of  and  for this conditional mean. Second, we use the information from (|), which is a function of , to solve the expected value of  (|, ) conditional on . The LIE can be a very useful tool to derive inconsistency at the population level when we omit a related independent variable in a regression model (model misspecification). For example, consider that the true model is  (|, ) =  +  and the correlation between  and  is not zero. But when we fit a regression model with all population data, we omit  and only regress  on . This can lead to an inconsistent estimator for the parameter . In this scenario, we can apply the LIE to derive the inconsistency. We do this by writing  (|) =  [ (|, ) |] =  ·  +  ·  (|) and then plugging in  (|) (which is a function of ) to solve the question. The underlying intuition is that part of the explanatory credit that belongs to the omitted  has been allocated to . As a result, the mistakenly attributed credit generates inconsistency when we estimate  while omitting . To note, the derivation here is at the population level and therefore is free of sampling error. Technically speaking, the difference between ˜1, ˜1 , ˜ and 1, 1,  are inconsistencies, not bias. That is, ˜1, ˜1, ˜ are the estimates we get even when we have all population data available (i.e., sample size  → ∞). In all following discussion, “bias” in the population parameters is a rough way to describe “inconsistency” for an audience more comfortable with conceptualizations of bias. Recall that the true model and the model that omits the unobserved mediator can be written as follows, in Figure 3.4 (from Chapter 3).  is the exposure variable,  is the observed mediator of interest,  is the outcome, and  is the unobserved mediator. We are interested in estimating the specific indirect effect via  and the direct path from  to . 
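The omitted-variable example given earlier in this section can be checked numerically before we work through the mediation model itself. A minimal sketch: the variable names (here the omitted regressor is called Q), the parameter values, and the R code are illustrative, not taken from the text.

    # Sketch: with Q omitted, the fitted slope on X converges to beta + gamma * delta,
    # where delta is the slope from regressing Q on X
    set.seed(3)
    n     <- 1e6
    beta  <- 0.3; gamma <- 0.4; delta <- 0.5
    X <- rnorm(n)
    Q <- delta * X + rnorm(n)              # omitted variable, correlated with X
    Y <- beta * X + gamma * Q + rnorm(n)
    c(slope_omitting_Q = unname(coef(lm(Y ~ X))["X"]),
      implied_by_LIE   = beta + gamma * delta)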
As such, three true effects are of key interest: 1 that represents the path  → , 1 that represents the path  → , and  that represents the path  → . When  is omitted, ˜1, ˜1 and ˜ are the estimated effect for  → ,  →  and  → , respectively. Now we apply LIE to express the inconsistency due to omitting . That is, we want to derive the differences between the estimators and their true effects when the confounder  is present yet 91 not included, and we have the population data (i.e., sample size  → ∞). In this case, the omitted  plays the role of  in our previous example. Specifically, we first write our true models as Equations 4.1 through 4.3: (|, ) =  ·  + 1 ·  (|) = 2 ·  (|, , ) = 1 ·  + 2 ·  +  ·  (4.1) (4.2) (4.3) By LIE, we can further write equations 4.4 to 4.5 to see what happens when  is excluded: (|) =  [ (|, ) |] =  · (|) + 1 ·  (|, ) =  [ (|, , ) |, ] = 1· + 2 · (|, ) +  ·  Now write:  (|) = 1 ·   (|, ) = 2 ·  + 3 ·  (4.4) (4.5) (4.6) (4.7) Plugging Equation 4.6 and 4.7 back to Equation 4.4 and 4.5, we can get formulas for inconsistencies in ˜1, ˜1 and ˜, as follows: ˜1 − 1 =  · 1 ˜1 − 1 = 2 · 2 ˜ −  = 2 · 3 (4.8) (4.9) (4.10) Importantly, as implied by Equation 4.6 and 4.7, 1, 2 and 3 are three regression coefficients (again, at the population level). Specifically, 1 is the regression coefficient of  when we regress  on , which is also equivalent to 2. 2 and 3 are regression coefficients of  and , respectively, when regressing  on  and . Before going to deep discussion about the inconsistency, it is necessary to clarify that I use the linear regression framework here in a “loose” way to serve as a tool for easy interpretation 92 only. No causality is implied in the regression models. For example, Equation 4.7 implies a linear regression model where  and  are predictors for , which only has statistical meaning but no causal implications. In other words, the coefficients 2 and 3 only represent statistical association between ,  and . They do not imply that  and  cause . 4.3 Understanding What Happens When Omitting  Now I will present how Equations 4.8 to 4.10 can help us understand the underlying mechanisms of sources of bias. To simplify the discussion, we assume all parameters are positive. Chapter 3 shows that the bias in estimating 1 and 1 are always positive. But the direction of ˜’s bias depends on the relative magnitude of , and , ·  ,. As we will see later, this is much more than just a mathematical result. In fact, the discussion for ˜’s story has inspired me to consider more deeply about what’s actually happening when omitting . Importantly, this parameter framework allows us to have a powerful tool to understand how the omittance of  generates the bias. Following Figure 4.1 presents a summary of this interpretative framework, in which the formulas in the right columns are based on Equations 4.8 through 4.10. We will start our consideration of this framework from the first row for ˜1, which happens to be the most direct and easy result. The formula tells us that the bias is the product of  and 2. From the causal pathways in the left column, we can see that this bias comes from the causal pathway  →  → . That is, ˜1 measures the effect  →  plus the effect  →  → . Therefore, omitting  is basically omitting a mediator when estimating 1. The stories for ˜1 and ˜ are more complicated. First we observe that the bias in estimating 1 and  are both a weighted version of 2, which represents the causal pathway  → . 
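Before unpacking these weights, Equations 4.8 through 4.10 themselves can be verified by simulation: the gaps between the reduced-model estimates and the true a1, b1 and c should match g and b2 times the auxiliary regression coefficients just described (labeled p1, p2, p3 below). The parameter values and the R code are illustrative assumptions.

    # Sketch: numerical check of Equations 4.8-4.10 (a population result, so n is large)
    set.seed(8)
    n  <- 1e6
    a1 <- 0.3; a2 <- 0.4; g <- 0.5; b1 <- 0.4; b2 <- 0.3; cc <- 0.2
    X  <- rnorm(n)
    MU <- a2 * X + rnorm(n)
    M  <- a1 * X + g * MU + rnorm(n)
    Y  <- cc * X + b1 * M + b2 * MU + rnorm(n)
    p1  <- unname(coef(lm(MU ~ X))["X"])          # Equation 4.6
    aux <- coef(lm(MU ~ M + X))                   # Equation 4.7: p2 on M, p3 on X
    a1_t <- unname(coef(lm(M ~ X))["X"])          # reduced-model a1
    red  <- lm(Y ~ X + M)                         # reduced-model b1 and c
    b1_t <- unname(coef(red)["M"]); c_t <- unname(coef(red)["X"])
    round(rbind(observed_gap    = c(a1 = a1_t - a1, b1 = b1_t - b1, c = c_t - cc),
                eqs_4.8_to_4.10 = c(a1 = g * p1, b1 = b2 * unname(aux["M"]),
                                    c  = b2 * unname(aux["X"]))), 3)

The two rows agree up to sampling error, which is the decomposition the chapter builds on.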
Interestingly enough, the two weights for the bias in ˜1 and ˜ are the two partial regression coefficients from one regression model. The last row of Figure 4.1 presents this crucial regression model, in which we use  and  to predict . Specifically, the weight for ˜1’s bias is the coefficient for  and the weight for ˜’s bias is the coefficient for . We know that a partial regression coefficient in a multiple regression model measures the unique contribution of the predictor to predict/explain the variance 93 Figure 4.1: Parameter framework to understand what happens when omitting . 94 of the outcome (again, no causality is implied here). In this context, the unique contribution made by  to explain  composes the weight applied to 2 to generate the bias in estimating 1. On the other hand, the unique contribution attributed by  to explain  becomes the weight of 2 to form the bias . This observation that centers on regressing  on  and  actually points us to understand two competing mechanisms about how omitting  affects the estimation of 1 and . If we trace all the causal pathways that go through , the beginning point is surely  → . But then this causal pathway splits into two parts. The first part goes through  to the ending point of . This first part passes through the 1 pathway and thus it contributes to the bias in estimating 1. The second part, on the other hand, passes directly to  after . The second part here  →  →  picks up a mediation pathway from  to  and as a result, it is responsible for bias in estimating . Importantly, the role of the omitted  is different in these two parts though confounding and mediation are known to have statistical similarities (MacKinnon, Krull, & Lockwood, 2000). In the first part for the bias in ˜1,  is a real confounder because it has an effect on both  and . In other words, the explanatory credit that belongs to  via direct effects  →  and  →  are allocated to the direct effect  → . This explains why ˜1 is the sum of 1 and 2 · 2. Intuitively, we can consider 2 as a measure for effects that flow through  →  → . On the contrary,  is only a mediator instead of confounder for estimating . As the third row in Figure 4.1 shows, the estimated effect of  is biased because we count the indirect effects  →  →  as part of the direct effect  → . Because part of the effect from  →  flows to , we only get the remaining part multiplied by 2 to form the bias in estimating . The remaining part of  →  is exactly what is represented by 3. Because these two flows (i.e.,  →  →  →  and  →  → ) both orginate from  → , they are in fact competing with the path  → . This competition manifests iteself by the linear regression model in our previous discussion, in which  is regressed on  and . In other words,  and  are competing with each other to explain the variation in . The part that  wins then continues to the flow toward  (i.e.,  →  → ), which results in the bias 95 in estimating . On the other hand, the unique contribution made by  passes through  to  (i.e.,  →  →  → ), which leads to the bias in estimating 1. (cid:16) (cid:17) Equipped with this framwork, we are now able to have a more intuitive understanding for the direction of bias in ˜ discussed in Chapter 3. Previously, we said that the sign of the bias in estimating  depends on the relative magnitude of , versus , ·  ,. If , > , · < , ·  ,.  ,, we overestimat . Otherwise, we underestimate  when , The comparison between , and , ·  , reflects exaclty the competition between  and  to explain the variation in . 
This is because is just the numerator of the regression coefficient for . Alternatively, this is also equivalent to the partial < , ·  , (and correlation between  and  conditional on . When , assuming all correlations are positive), we have the unconditional (zero-order) correlation between  and  reversing the sign once conditional on . This effect is also known as suppression in the literature (Cohen & Cohen, 1983) and somestimes  is called a distorter variable (Rosenberg, 1968). That is, the relationship between  and  gets suppressed or distorted once we include  to explain the outcome . Intuitively, we may consider this as  losing the competion with  to explain . When this is the case, we underestimate c. (Again, we only refer to statistical relations rather than any causal relationships when talking about a suppressor.) , − , ·  , Because parameters represent causal pathways, we see that the parameter framework allows us to tell stories about mechanisms generating the bias. In the following section, more analyses will be presented that applies this framework to better understand how -related parameters affect the magnitude of bias. As we will see, this framework (Figure 4.1) serves as a very useful instrument in interpreting how bias vary under different conditions, providing us a deeper discussion compared to that in Chapter 3. 96 4.4 Using the Mechanism to Understand How Inconsistency Changes with -related Pa- rameters 4.4.1 How inconsistency changes with different levels of . We start with the causal effect from  to , which is represented as . In Chapter 3, we have mentioned that the magnitude of bias in estimating both 1 and 1 increase as the causal effect  becomes stronger. But now we have a framework to better understand why this is the case. Focusing on the effects  →  →  as shown in the first row of Figure 4.1, larger  leads us to allocate more explanatory power to  that in fact belongs to  as a mediator. Therefore, we overestimate 1 and the difference between ˜1 and 1 grows as  gets bigger. Similary,  plays a role in the chain of causal effects  →  →  →  (the second row of Figure 4.1). Then larger  leads more explanatory power via  to be attributed to , which manifests as larger bias in estimating 1. Thus, bias in both ˜1 and ˜1 increase, leading to more serious overestimation of the indirect effect  →  → . Alternatively, we can interpret why the bias in ˜1 increases with  by considering the role of  in the key regression model discussed before. In Figure 4.1 (last row), this key regression model is presented with the zero-order bivariate correlations as functions of parameters. We can tell  plays an important role in the correlation between  and  but the correlation between  and  only depends on 2. In the example of Figure 3.5 (from Chapter 3), we fix the value of 2 as 0.2 while increasing the value of . That is, the correlation between  and  becomes larger and larger compared to that between  and . This leads to greater importance attached to  in the competition between  and  to explain , which also means that  becomes less and less important. Correspondingly, the partial regression coefficient for (2) becomes larger while the partial regression coefficient for (3) turns to be smaller as  increases. Based on previous discussion, we know that 2 is responsible for the bias in ˜1 and 3 is responsible for the bias in ˜. Therefore, the increase in 2 explains the growing bias in ˜1 and the decrease in 3 explains the declining bias in ˜. 
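These patterns can be reproduced directly from closed-form expressions under the chapter's standardized setup. The sketch below sweeps g while holding the other paths fixed; the helper name, the parameter values, and the R code are mine, and the correlation expressions inside assume X, M and MU are standardized.

    # Sketch: population bias in a1, b1 and c as g (MU -> M) grows, other paths fixed.
    # Assumes standardized X, M, MU, so cor(X,M) = a1 + g*a2, cor(X,MU) = a2,
    # cor(M,MU) = g + a1*a2; all values are illustrative.
    bias_by_g <- function(g, a1 = 0.2, a2 = 0.3, b2 = 0.4) {
      r_xm  <- a1 + g * a2
      r_xmu <- a2
      r_mmu <- g + a1 * a2
      p2 <- (r_mmu - r_xm * r_xmu) / (1 - r_xm^2)   # weight behind the bias in b1
      p3 <- (r_xmu - r_xm * r_mmu) / (1 - r_xm^2)   # weight behind the bias in c
      c(g = g, bias_a1 = g * a2, bias_b1 = b2 * p2, bias_c = b2 * p3)
    }
    round(t(sapply(c(0.1, 0.3, 0.5, 0.7, 0.9), bias_by_g)), 3)
    # bias_a1 and bias_b1 grow with g, while bias_c shrinks and eventually turns negative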
97 Moreover, in Chapter 3 we also emphasized that after the positive bias in ˜ falls to 0 it continues decreasing to negative values as  increases to 1. That is, when  is relatively small we overestimate  while when  becomes larger we underestimate . And as  gets closer to 1, the magnitude of negative bias in estimating  keeps growing. This is in fact consistent with our previous discussion regarding the direction of ˜’s bias. As  becomes larger, suppression starts to take place and becomes increasingly serious. That is, the difference between , and , ·  , becomes larger. Again, we can apply our competition story to interpret this observation: as  becomes larger,  becomes weaker and weaker in its competition with . This leads more effect from  →  to be attracted towards  →  rather than directly flowing to , which means that the overestimation of the direct effect  → (1) keeps growing. Meanwhile, as less effect remains to pass through  →  → , the bias in ˜ first decreases to 0 and then further becomes more and more negative. 4.4.2 How inconsistency changes with different levels of 2 In Chapter 3, we discussed the larger the direct effect  → (2), the larger the bias is estimating 1. Relying on the first row in Figure 4.1, we are able to explain this as the indirect effect  →  →  is allocated to  →  , where the omitted  plays a role of mediator. In Chapter 3, we also discussed that the effects of 2 on the bias in ˜2 and ˜ rely on the values of other parameters. Specifically, Figure 3.6(a) in Chapter 3 gives the scenario where the magnitude of 1 and  are both relatively small (with values of 0.15 and 0.22) while in Figure 3.6(b) 1 and  are both relatively large (with values of 0.6 and 0.5). Figure 4.2 presents how the competing story in the parameter framework we discussed before for the bias in estimating 1 and  can be served as a useful tool to interpret the distinctions between Figure 3.6(a) and 3.6(b). We know that the direct effect  →  (2) splits into two causal pathways after the point of , one toward  and the other towards , generating the bias in ˜1 and ˜, respectively. And the way the effect  →  (2) gets split can be understood by a competition between  and  to predict . Interestingly, how the competition is affected by 98 2 depends on the magnitudes of 1 and , which is exactly the distinction between Figure 3.6(a) and 3.6(b). Figure 4.2: Understand how bias changes with different levels of 2. In Figure 3.6(a), 1 and  are relatively small, the increase in 2 has larger effect on , compared to , and  ,. This is because the latter two correlations can be read as 2 weighted by 1 and . That is, the increase in 2 needs to be discounted when reflected in , and  ,. In comparison, , can get the 100% growth from 2 because , is just equivalent to 2. Therefore, when 1 and  are small in Figure 3.6(a), the increase in 2 leads  to gain more and more advantage in the competition with . In this context, larger 2 causes more explanatory credit to be allocated to  →  →  with less credit attributed to  →  →  → . As such, though  can be underestimated in the very beginning (2 is very close to 0), the negative bias becomes positive quickly and keeps growing larger. At the same time, the overestimation in ˜1 slowly reduces to 0 as 2 gets closer to 1. As a comparison, Figure 3.6(b) shows the opposite scenario where the increase in 2 has smaller 99 effect on , compared to , and  ,. That is, the increase in 2 gets enlarged in , and  , due to the large magnitudes of 1 and  in , and  ,. 
As a result,  gains more and more comparative advantage in its competition with . Under this scenario the increase in 2 leads more explanatory power to the path  →  →  →  relative to the credit assigned to the pathway  →  → . Figure 4.2 summarizes these patterns and interpretations about how 2 affects the bias in estimating 1 and  under different scenarios. Again, we see that the key regression model plays a significant role in helping us understand the mechanisms that generate the bias in ˜1 and ˜ when we omit . 4.4.3 How inconsistency changes with different levels of 2. For how 2 affects the bias, we also presented two different scenarios in Chapter 3, exemplified by Figure 3.6(a) and 3.6(b). In both scenarios, the positive bias of ˜1 becomes larger as 2 gets closer to 1 while the level of 2 has no effect on the magnitude of bias in 1. The first two rows of Figure 4.1 can help us interpret these patterns by applying our framework. First for ˜1, we know that its bias comes from the explanatory power that belongs to the  →  →  and there is no role of 2 here. That is, the direct effect  → (2) is not directly connected with the direct effect  → (1). Alternatively, we can think about this “no effect” by noticing that the parameter we manipulate 2 occurs after the direct effect  → (1) in the sequence of causality. Figure 4.1 illustrates the bias in estimating 1 is the product of 2 and 2. The magnitude of 2 depends on the competition between  and  that happens in the key regression model represented in Figure 4.1. We can tell from the triangle of the key regression model that 2 has no effect on this model. Therefore, the only way for 2 to influence the bias in ˜1 is to through 2 but not 2. This helps explain why the bias in ˜1 is a linear relationship of 2 in both Figure 3.6(a) and 3.6(b). The distinction between Figure 3.6(a) and 3.6(b) is the pattern for the bias of . In Figure 100 3.6(a), c gets overestimated and the positive bias keeps growing but in Figure 3.6(b) we always underestimate  and importantly, the negative bias gets more serious as 2 increases. In fact, the fundamental reason underlying this distinction between two figures (3.6(a) and 3.6(b)) is the direction of bias while the effect of 2 on the magnitude of bias follows the same pattern in both two figures. This ties back to the earlier discussion of the direction of ˜’s bias, which can be interpreted with our parameter framework again. In Figure 3.6(a) we have 1 =  = 2 = 0.2 but in Figure 3.6(b), 1(= 0.55) and (= 0.6) are much larger than 2(= 0.2). Correspondingly, we have < , ·  , for Figure 3.6(b). In , other words, in Figure 3.6(b), the relationship between  and  gets suppressed or distorted once we include  to explain the outcome  but this suppression does not occur in Figure 3.6(a). > , ·  , for Figure 3.6(a) and , 4.5 Using the Mechanism to Understand the Inconsistency of Indirect and Direct Effects When Omitting  In the previous section, we applied our parameter framework to interpret how bias is generated in estimating each parameter we are interested in (i.e., 1, 1 and ). But we are also interested in the indirect effect via , which is the product of 1 and 1. Importantly, as we will see later, this discussion can provide empirical researchers some ideas to consider how different potential candidates of  may impact our estimation of the indirect effect 11 and direct effect  differently. 
We start our discussion by getting the following equation for the bias in estimating 11 (the specific indirect effect via ) based on our previous derivations. ˜1 ˜1 − 1 · 1 = 2 ·  · 1 + 2 · 2 · (1 +  · 2) Combing Equation 4.11 and 4.10, we can get: (cid:2) ˜1 ˜1 − 1 · 1 (cid:3) + [ ˜ − ] = 2 ·  · 1 + 2 · 2 · (1 +  · 2) + 2 · 3 (4.11) (4.12) In Equation 4.12, the two last components 2 · 2 · (1 +  · 2) + 2 · 3 can be written as 2 · 2 (see Appendix for more detailed proof). This allows us to get the following equation 4.13. (cid:2) ˜1 ˜1 − 1 · 1 (cid:3) + [ ˜ − ] = 2 ·  · 1 + 2 · 2 (4.13) 101 Equation 4.13 tells us an important fact: the bias in estimating the indirect effect 11 and the bias in estimating the direct effect  adds up to the two pathways:  →  →  →  and  →  → . Figure 4.3 summarizes this finding by comparing different pathways from  to  by models: the true model with two mediators versus the model that omits . The left column Figure 4.3: Different causal pathways from  to  by models: the true model with two mediators versus the model that omits . shows four causal pathways in the true model from  to . As a comparison, there are only two causal pathways in the right column to account for the total effect from  to . Because the total 102 effect from  to  is fixed, the effect via the two missing pathways, namely  →  →  →  and  →  → , must be assigned to the two remaining pathways when  is excluded. In other words, the explanatory power via  →  →  →  and  →  →  either goes to the indirect effect ˜1 ˜1 or the direct effect ˜. When more effect is allocated to the indirect effect, then omitting  results in more biased estimate of the indirect effect and less biased estimate of the direct effect. The same vice versa: if more effect is allocated to the direct effect, then omitting  generates more biased direct effect and less biased indirect effect. What factors affect how the effect via  →  →  →  and  →  →  is allocated to the bias in the estimated indirect effect or the bias in the estimated direct effect? The answer to these questions goes back to our parameter mechanisms, more specifically, the competition between  and  in the key regression model. We know the bias either goes to the estimated indirect effect or the estimated direct effect, and it is easier to focus on the direct effect. As we discussed before, the bias in estimating the direct effect  is 2 · 3. 2 does not even show up in the key regression model or the competition between  and . Then as the exposure variable  gets stronger at predicting the unobserved mediator , compared to the observed mediator , more bias will be allocated to the estimated direct effect and less bias will be allocated to the estimated indirect effect. In other words, when 2 is fixed at a certain level (not zero), if the unobserved mediator  becomes more correlated to  and less correlated to , then omitting  generates more biased direct effect and less biased indirect effect. It is important to note that this “more” or “less” is not a comparison between the amount of bias between the estimated indirect effect and the estimated direct effect. Instead, it is a trend in the change of the bias of either the indirect effect itself or the direct effect itself. Assume an example where 10% of the total bias (i.e., the effects via  →  →  →  and  →  → ) is allocated to the estimated direct effect and other 90% of the total bias is allocated to the estimated indirect effect. 
Now as the unobserved mediator  gets more correlated with  and less correlated with , maybe 20% of the bias is allocated to the estimated direct effect and the other 80% is allocated to the direct effect. But still more bias gets assigned to the indirect effect in this case 103 (80% versus 20%). With this understanding, now we can better understand two special situations, as summarized in Figure 4.4. In the first context, the true underlying process is a parallel mediation model.  does not have a direct effect on  and consequently,  and  are independent once conditional on . That is,  wins all possible explanatory power in the competition with . Once we have ,  has no predictive power for . As such, the direct effect is mostly overestimated, accounting for all the explanatory credit that belongs to the omitted  →  → . In contrast, the indirect effect  →  →  is not affected at all by the omitted mediator (i.e., the estimated indirect effect is consistent). Figure 4.4: Two special situations where all inconsistency goes to the direct effect  →  or all inconsistency goes to the indirect effect  →  → . In the second context, we have , − , ·  , = 0. This means  has no predictive power for  once conditional on . Then  wins all the explanatory credit possible in its competition with  to predict . As such, all the bias is allocated to the estimated indirect effect  →  →  and the direct effect  →  is consistent. Note the key in the second situation is: once conditional on ,  has no predictive power 104 for . This does not happen when there is no effect from  to (2 = 0). That is, when  is only a confounder for  →  but not a mediator, even conditional on ,  can still have a predictive effect on . Another important note here is that in both two situations in Figure 4.4, there is a path of  →  (i.e., 2 ࣔ 0). When 2 = 0, only ˜1 is biased. As such, only the estimated indirect effect is biased, and direct effect is unbiased. The serial mediation model is a special situation of this where 1 = 2 = 0. 4.6 Applying the Mechanism to Understand the Inconsistency in Longitudinal Designs When Omitting  Recent research has suggested longitudinal designs to test mediation because cross-sectional examination of mediation can generate biased estimates and longitudinal designs provide more rigorous inference for mediation effects (e.g., Maxwell & Cole, 2007; Maxwell et al., 2011; Mitchell & Maxwell, 2013). Yet there may be post-treatment confounders  invalidating mediation effects even in longitudinal designs. This section will present how our parameter framework can help us understand the bias generation in a longitudinal design. More specifically, this section will focus on one autoregressive model presented by Maxwell et al. (2011). Following Figure 4.5 presents the path diagram for this model and the formal equations are followed. This model can be written as follows: ,+1 =  , + ,+1 ,+1 = , + , + ,+1 ,+2 = ,+1 + ,+1 + , + ,+2 (4.14) (4.15) (4.16) where ,+1 is the treatment status for individual  at time  + 1, , is the treatment status for individual  at time , ,+1 is the value for individual  on mediator  at time  + 1, , is the value for individual  on mediator  at time , ,+2 is the value for individual  on outcome  at time  + 2, ,+1 is the value for individual i on outcome  at time  + 1, and similarly, , is the value for individual  on outcome  at time . We can also tell that the direct effect in this model is the product of  and  while the indirect effect is . 
We assumed that this longitudinal model 105 Figure 4.5: Longitudinal mediation model with two unit lag for direct effect of  on . satisfies the stationarity and equilibrium. This means that the causal relationships among variables and the within-wave correlations are unchanged over time (the variance-covariance matrix among ,, , and , is time invariant). We then introduced an unobserved mediator  into this model and allowed this mediator to have an effect on the observed mediator. To make a clear distinction between these two mediators and make the notation consistent with our cross-sectional design, we note the observed mediator of interest as . The path diagram is presented in the following Figure 4.6. This model can be written formally as follows: ,+1 =  , + ,+1 ,+1 = 1, +  , + 1, + ,+1 ,+1 = 2, + 2, + ,+1 ,+2 = ,+1 + 1,+1 + 2,+1 + , + ,+2 (4.17) (4.18) (4.19) (4.20) where the notation is almost the same as before. The only distinction is that another mediator  is introduced so we use  to represent the observed mediator. Similarly, the indirect effect 106 Figure 4.6: Longitudinal mediation model with unobserved latent mediator . associated with the observed mediator is the product of 1 and 1. Additionally, the autoregressive parameters for the two mediators are noted as 1 and 2 and  represents the causal effects of , on ,+1. As before, we assumed stationarity and equilibrium, indicating that the causal relationship and the correlation matrix among , , , and  are time-invariant. Note in this longitudinal model, the unobserved mediator  serves as a lagged confounding, but not simultaneous confounding. That is, the longitudinal design only has an effect , → ,+1 but not ,+1 → ,+1. Now we apply the same approach LIE to derive the inconsistency in estimating 1, 1, and  when omitting . The procedure is essentially the same. We first write our true models as:  (+1|) =  ·  (cid:0),+1|,, ,, (cid:1) = 1 · , +  · , + 1 ·  (cid:0)+2|,+1, ,+1, (cid:1) =  · +1 + 1 · ,+1 + 2 · ,+1 +  ·  (cid:0),+1|, ,(cid:1) = 2 · , + 2 ·  (4.21) (4.22) (4.23) (4.24) 107 By LIE, we can further write above Equation 4.22 and 4.24 to see what happens when  gets excluded: (cid:0),+1|,, (cid:1) = (cid:2)(cid:0),+1|,, ,, (cid:1) |,, (cid:3) = 1 · , +  · (cid:0),|, ,(cid:1) + 1 ·  (cid:0)+2|,+1, (cid:1) = (cid:2)(cid:0)+2|,+1, ,+1, (cid:1) |,+1, (cid:3) =  · +1 + 1 · ,+1 + 2 · (cid:0),+1|, ,+1, +1(cid:1) +  ·  (cid:0),|, ,(cid:1) = 1 · , + 2 ·  (cid:0),+1|, ,+1, +1(cid:1) = 1 · +1 + 2 · ,+1 + 3 ·  (4.25) (4.26) (4.27) (4.28) Now write: Plugging these two equations back to Equations 4.25 and 4.26, we can get formulas for inconsistency in estimating 1, 1 and , as follows, where ˜1, ˜1, and ˜ are estimated effect of 1, 1 and  when  is excluded (again, assuming we have population level data,  → ∞). ˜1 − 1 =  · 2 ˜1 − 1 = 2 · 2 ˜ −  = 2 · 3 (4.29) (4.30) (4.31) Importantly, as implied by Equation 4.27 and 4.28, 2, 2 and 3 are three regression coefficients. Specifically, 2 is the regression coefficient of  when we regress , on  and ,. 2 and 3 are regression coefficients of ,+1 and , respectively, when regressing ,+1 on +1, ,+1 and . Figure 31 presents how our parameter mechanism can be extended to this autoregressive longi- tudinal mediation model to understand how bias (more precisely, inconsistency) is generated when omitting . As we will see, the major distinction between the cross-sectional and longitudinal design is the control of prior , ,  and . We start our consideration with ˜1. 
In the cross-sectional design, it is evident to see from the ˜1 path diagram that the bias in ˜1 comes from the omitted causal pathway  →  → . 108 Figure 4.7: Parameter framework extended to longitudinal designs. 109 measures the effect  →  plus the effect  →  → . Now in the longitudinal scenario, the mechanism is essentially the same: ˜1 is biased because it accounts for the effect  → , → ,+1 that should be attributed to , as a mediator. The effect of , → ,+1 is captured by  in the formula but we do not have a parameter for  → ,. We only have 2 to represent  → ,+1 in the longitudinal scenario. This is why in the formula we have 2 rather than 2. Loosely speaking, 2 captures what the effect of  → ,+1(2) projects on  → ,. However, 2 is not equivalent to the correlation between  and , because both  and , are also determined by their prior values, namely −1 and ,−1. That explains why 2 is the regression coefficient of  when we regress , on  and ,. The presence of , in this regression is exactly controlling for −1 and ,−1, because our model implies that , is determined by −1, ,−1 and its own prior value ,−1. The stories for ˜1 and ˜ are also essentially the same with the cross-sectional design. In the longitudinal case we still observe that the bias in estimating 1 and  are both a weighted version of 2(, → +1). The two weights for the bias in estimating 1 and  are still two partial regression coefficients from one regression model. The last row of Figure 4.7 presents a comparison between the key regression model in the cross-sectional desgin versus that in the longitudinal design. In both situations, we use  and  to predict , and the weight for the bias in estimating 1 is the coefficient of  and the weight for the bias in estimating  is the coefficient of . This similarity illustrates the competition between  and  is still present in the longitudinal situation: the unique contribution made by  to explain  composes the weight before 2 to generate the bias in ˜1, and the unique contribution attributed by  to explain  becomes the weight of 2 to form the bias in ˜. The argument about  serving as a confounder in the bias of ˜1 and as a mediator in the bias of ˜ is also the same in the longitudinal design. What is different in the longitudinal design is that now we need to consider time points and controlling for prior values. In the cross-sectional design, we argue that the competition between  and  comes from the fact: pathway  →  splits into two pathways  →  →  →  and  →  → . Now the longitudinal story is a little different: it is the pathway of  → ,+1 110 that splits into  → ,+1 → ,+1 → +2 and  → ,+1 → +2. This first part (i.e.,  → ,+1 → ,+1 → +2) passes through the 1 pathway and contributes to the bias in ˜1. The second part (i.e.,  → ,+1 → +2) picks up a mediation effect transmitted through ,+1 and generates the bias in ˜. This helps us interpret why the key regression model in the longitudinal design is regressing ,+1 on ,+1 and  and why the coefficient of ,+1 is responsible for the bias in ˜1 while the coefficient of  is responsible for the bias in ˜. Note the longitudinal design does not really have the effect of ,+1 → ,+1 since ,+1 and ,+1 are at the same time point. But one can loosely interpret this as a projection of the effect ,+1 → ,+2, just as previously how we interpret 2 as a projection of the effect  → ,+1. Now we only need to explain why +1 shows up as a control variable in the key regression model in longitudinal design. This is similar to why , is controlled to generate 2. 
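As a concrete check on the longitudinal story, the autoregressive model of Figure 4.6 can be simulated and the wave-to-wave regressions that omit MU compared with the correctly specified ones. Everything below, including the parameter values, the number of waves, and the R code, is an illustrative sketch rather than the dissertation's own simulation.

    # Sketch: stationary autoregressive mediation model with an omitted time-varying MU
    set.seed(9)
    n <- 1e5; waves <- 40
    sx <- 0.5; s1 <- 0.4; s2 <- 0.4; sy <- 0.3             # autoregressive paths
    a1 <- 0.3; g <- 0.5; a2 <- 0.4; b1 <- 0.4; b2 <- 0.3; cc <- 0.2
    X_l1 <- X_c <- M_c <- MU_c <- Y_c <- rnorm(n)           # arbitrary starting values
    for (w in 1:waves) {
      X_l2 <- X_l1; X_l1 <- X_c                             # keep X two waves back
      M_l1 <- M_c; MU_l1 <- MU_c; Y_l1 <- Y_c
      X_c  <- sx * X_l1 + rnorm(n)
      MU_c <- s2 * MU_l1 + a2 * X_l1 + rnorm(n)
      M_c  <- s1 * M_l1 + g * MU_l1 + a1 * X_l1 + rnorm(n)
      Y_c  <- sy * Y_l1 + b1 * M_l1 + b2 * MU_l1 + cc * X_l2 + rnorm(n)
    }
    full_m <- lm(M_c ~ M_l1 + MU_l1 + X_l1)                 # mediator equation with MU
    red_m  <- lm(M_c ~ M_l1 + X_l1)                         # ... omitting MU
    full_y <- lm(Y_c ~ Y_l1 + M_l1 + MU_l1 + X_l2)          # outcome equation with MU
    red_y  <- lm(Y_c ~ Y_l1 + M_l1 + X_l2)                  # ... omitting MU
    round(rbind(
      truth        = c(a1 = a1, b1 = b1, c = cc),
      including_MU = c(unname(coef(full_m)["X_l1"]), unname(coef(full_y)["M_l1"]),
                       unname(coef(full_y)["X_l2"])),
      omitting_MU  = c(unname(coef(red_m)["X_l1"]), unname(coef(red_y)["M_l1"]),
                       unname(coef(red_y)["X_l2"]))), 3)

The fits that include MU recover the generating values, while the fits that omit it do not, even with many individuals and after the burn-in waves, which is the longitudinal analogue of the cross-sectional inconsistency.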
Here +1 is present as a control for prior values that can have an effect on ,+1, ,+1 and , namely ,, , and −1. This ties back to the argument about the fundamental difference between the cross-sectional and longitudinal design: as Maxwell and Cole (2007) argued, what is missing in cross-sectional examination of mediation is the failure to capture the autoregressive effects. To summarize, although the longitudinal design to examine mediation is much more compli- cated, the interpretation of how bias (more precisely, inconsistency) is generated with omitting  is essentially the same as the cross-sectional design. The parameter framework depicted in this chapter allows us to grasp an intuitive path-based understanding of how omitting  generates bias in both cross-sectional and longitudinal designs. 4.7 Discussion This chapter leverages the parameter framework to explain how inconsistency is generated when an unobserved mediator is omitted. Briefly speaking, the inconsistency of ˜1 is due to omitting  →  →  and thus part of the direct effect  →  should be allocated to this omitted indirect effect  →  →  via . The inconsistency of ˜1 comes from the omission of  →  →  →  and the inconsistency of ˜ comes from the omission of  →  → . 111 Importantly, we can consider the fact that the inconsistency of ˜1 and ˜ both originates from  →  as competition between  and  to predict  in a linear regression framework. As the exposure variable  gets stronger at predicting the unobserved mediator , compared to the observed mediator , more bias will be assigned to ˜ and less bias will be assigned to ˜1. Applying this framework, we can better understand how each -related parameter affects the inconsistency, how inconsistency is allocated to either the direct effect from  to  or the indirect effect via , as well as inconsistency in some special situations including the parallel and serial mediation models. Additionally, I showed that the inconsistency underlying a longitudinal design is essentially the same, except that prior values are controlled in the longitudinal design. 4.8 Limitations and Future Directions There are at least two limitations of the current chapter that suggest avenues for future research. First, this chapter focus on understanding how inconsistency is generated when the alternative mediator is omitted. Future studies can look at how this understanding can be applied in a more practical perspective. For example, can we bound the inconsistency for the indirect ( ˜1 ˜1) and direct ( ˜) effects if we have some ideas about the correlation between the unobserved mediator  and the intervention , and the correlation between  and the mediator of interest ? Second, it would be valuable if future studies can develop a sensitivity approach for an unobserved post- treatment confounder in a longitudinal design, as what we do in Chapter 3 for the cross-sectional model. 112 DISCUSSION As mentioned in the introduction, this dissertation is centered on causal inference regarding eval- uation of an intervention, from whether an intervention works to why it works, accounting for the social and dynamic contexts in which interventions are implemented. The first two chapters focus on whether an intervention works by proposing a non-parametric case replacement framework to quantify strength of evidence for inferences in multisite randomized control trials and value-added measures for teacher effectiveness. 
The last two chapters focus on why an intervention works by studying post-treatment confounders in mediating processes. From a practical perspective, the four chapters represent the effort to refine policy analysis so that policies regarding the allocation of educational resources can be better informed. As the first step, we are interested in the total average intervention effect. But summarizing an intervention with only one estimated effect can be misleading, especially in the context of multisite randomized control trials. For example, presence of heterogeneity in multisite randomized control trials emphasizes the importance of considering local contextual effects and causal mechanisms that can help explain why an intervention works in some sites but not others. Identifying important mediating pathways may point out an alternative intervention option, especially when the mediating pathway can explain a large proportion of the causal effect and it is easier or more cost-efficient to manipulate the mediator directly. Further, considering alternative mediators as potential confounders can help researchers and policy makers evaluate the robustness of the inference regarding the identified mediator of interest so that the reallocation of educational resources regarding the mediator of interest can be made after comprehensive evaluation against potential costs and other alternatives. There are other ways that mediation studies can inform policy manipulations. For example, sometimes mediation analysis may present us with two mediating pathways with opposite directions that cancel out each other and lead to a zero total effect (i.e., average treatment effect), which provides another example for considering manipulating one mediator instead to achieve the expected changes in the outcome. Under other scenarios, a mediator can also be a side effect of interest that inheres 113 in the intervention. Then knowing the presence of this mediator can allow us to figure out ways to minimize the negative side effects. Problems of causal inference can be conceptualized in terms of omitted variables, or in terms of sampling bias. Both mediators and confounders are third variables that influence the associ- ation between the predictor of interest and the outcome, and they are statistically identical (e.g., MacKinnon et al., 2000). In Chapters 1 and 2, the presence of spillover effects violates SUTVA and generates bias via playing a similar role as a confounder: by associating with both exper- iment conditions and outcome measures. Importantly, the association between spillover effects and experiment condition implies the treatment and control group individuals experience different levels of spillover effects. For example, positive spillover effects within the treatment group, or negative spillover effects within the control group, due to non-random assignment of contributors and spoilers to treatment and control conditions, can bias the treatment effect estimate negatively (assuming a positive treatment effect). Sometimes capacity constraints can also lead to negative spillover effects within treatment group (e.g., Maroulis, 2016), leading to downward bias in the treatment effect estimation. 
Alternatively, the treatment condition may trigger positive spillover effects from treatment individuals to control individuals, or the control condition may trigger negative spillover effects from the control group to the treatment group, both of which can cause downward bias in estimating the treatment effect (again, assuming a positive treatment effect). In other situations, affecting individuals' interactions so as to introduce spillover effects can be one mediating process that explains why the intervention works. For example, if reducing class size improves students' learning by intensifying the learning and teaching that occur among students, then the peer effect becomes a mediator that helps explain how smaller classes boost students' performance.

Furthermore, constant effects operating through simple mechanisms on independent individuals rarely occur in education research (Frank, Saw, & Xu, 2016). Hong (2015) conceptualizes causal inference regarding moderation, mediation, and spillover as weighting issues in a sampling framework. Relatedly, the case replacement approach proposed in the first two chapters for spillover and heterogeneity applies the feature of the counterfactual framework that recasts potential sources of bias in terms of missing data (Frank et al., 2013). Both replacing observed cases with unobserved cases and adjusting weights on observed cases are rooted in a sampling framework.

114

Chapters 3 and 4 study mediation within a parametric framework, but they also contribute to the existing literature that applies a non-parametric approach (e.g., Hong et al., 2018) by taking a path-coefficient perspective and digging into the effects of a confounding mediator on each path-coefficient estimate. We argue that parametric and non-parametric approaches complement each other so that researchers can better understand spillover, heterogeneity, and mediation in education research.

Finally, it is important to note that sensitivity analysis cannot rule out bias, regardless of what type of approach one uses, whether traditional approaches that draw on familiar quantities such as correlations or percentage of variance explained, or the non-parametric case replacement approach proposed in Chapters 1 and 2. Before applying sensitivity analysis, one should make sure that careful consideration has been given to research design, model specification, and choice of estimation approach. Sensitivity analysis cannot substitute for any of these crucial steps; rather, it provides a discourse for researchers to communicate with all potential stakeholders regarding the strength of evidence, after all those efforts have been made to remove as much bias as possible.

115

APPENDIX

116

DERIVATION NOTE FOR UNOBSERVED MEDIATOR IN A CROSS-SECTIONAL DESIGN

Introduction

This Appendix derives the inconsistency that an unobserved mediator U may bring to the estimation of the indirect and direct effects in a cross-sectional design. The key approach applied in this derivation is the Law of Iterated Expectations (LIE). I introduce the true model in Part 1 and then write the true model in terms of conditional means (Part 2). After that, I derive several correlations that are useful later in the derivation (Part 3). In Part 4, I apply the LIE to derive the inconsistency as a function of the parameters derived in Part 3. I work out the formulas for the inconsistency as a function only of correlations in Part 5, and the formulas for the percent of inconsistency in Part 6.
Part 7 includes the derivation of the error variances for the purpose of simulation (to generate standardized variables) and the resulting constraints on the parameters. Parts 8 and 9 discuss the directions of inconsistency (Part 8) and how the inconsistency changes with the parameters (Part 9). The last part decomposes the total inconsistency into the two omitted pathways.

One crucial assumption made in this derivation is that all variables are standardized. Throughout, T denotes the treatment, M the observed mediator of interest, U the unobserved (alternative) mediator, and Y the outcome; a_1, a_2, a, b_1, b_2, and c denote the corresponding path coefficients, as in Eq. 32 below.

True model

Note: the causal relationship between the two mediators is that U causes M.

M = a U + a_1 T + e_M    (32a)
U = a_2 T + e_U    (32b)
Y = b_1 M + b_2 U + c T + e_Y    (32c)

117

True model in terms of conditional means

Note: the assumptions underlying the model are actually stronger than this, but the assumptions listed here are enough for the purpose of the derivation in this Appendix.

E(M | T, U) = a U + a_1 T    (33a)
E(U | T) = a_2 T    (33b)
E(Y | T, M, U) = b_1 M + b_2 U + c T    (33c)

Correlations

Correlation between T and U. From Eq. 32b: Cov(T, U) = a_2 Var(T). Assuming all variables are standardized: \rho_{T,U} = a_2.

Correlation between T and M. From Eq. 32a: Cov(T, M) = a Cov(T, U) + a_1 Var(T). Assuming all variables are standardized: \rho_{T,M} = a a_2 + a_1.

Correlation between M and U. From Eq. 32a: Cov(M, U) = a Var(U) + a_1 Cov(T, U).

118

Assuming all variables are standardized: \rho_{M,U} = a + a_1 a_2.

Correlation between Y and U. From Eq. 32c: Cov(Y, U) = b_1 Cov(M, U) + b_2 Var(U) + c Cov(T, U). Assuming all variables are standardized: \rho_{Y,U} = b_1 a + b_1 a_1 a_2 + b_2 + c a_2.

To summarize,

\rho_{T,U} = a_2    (34a)
\rho_{T,M} = a a_2 + a_1    (34b)
\rho_{M,U} = a + a_1 a_2    (34c)
\rho_{Y,U} = b_1 a + b_1 a_1 a_2 + b_2 + c a_2    (34d)

Inconsistency if omitting U

By the Law of Iterated Expectations (LIE), the model that excludes U can be written as follows:

E(M | T) = E[E(M | T, U) | T] = a E(U | T) + a_1 T    (35a)
E(Y | T, M) = E[E(Y | T, M, U) | T, M] = b_1 M + b_2 E(U | T, M) + c T    (35b)

Write:

E(U | T) = \gamma_1 T    (36a)
E(U | T, M) = \gamma_2 M + \gamma_3 T    (36b)

119

Then we get:

E(M | T) = a \gamma_1 T + a_1 T    (37a)
E(Y | T, M) = b_1 M + b_2 \gamma_2 M + b_2 \gamma_3 T + c T    (37b)

Therefore, we can see that if we omit U, we get inconsistent estimates:

\tilde{a}_1 = a_1 + a \gamma_1    (38a)
\tilde{b}_1 = b_1 + b_2 \gamma_2    (38b)
\tilde{c} = c + b_2 \gamma_3    (38c)

The inconsistency for \tilde{a}_1 is a \gamma_1; the inconsistency for \tilde{b}_1 is b_2 \gamma_2; the inconsistency for the indirect effect \tilde{a}_1 \tilde{b}_1 is (a_1 + a \gamma_1)(b_1 + b_2 \gamma_2) - a_1 b_1; and the inconsistency for \tilde{c} is b_2 \gamma_3.

Following are the derivations for \gamma_1, \gamma_2 and \gamma_3. From Eq. 36a, \gamma_1 is the regression coefficient of T when we regress U on T (at the population level). Therefore,

\gamma_1 = \rho_{T,U} = a_2    (39a)

Similarly, from Eq. 36b, \gamma_2 is the regression coefficient of M when we regress U on T and M, and \gamma_3 is the regression coefficient of T when we regress U on T and M (at the population level):

\gamma_2 = (\rho_{M,U} - \rho_{T,U} \rho_{T,M}) / (1 - \rho_{T,M}^2)    (39b)
\gamma_3 = (\rho_{T,U} - \rho_{M,U} \rho_{T,M}) / (1 - \rho_{T,M}^2)    (39c)

where all elements have been derived in Eq. 34. Note: \gamma_2 and \gamma_3 are derived from the formula for regression coefficients of standardized variables (I derived both through the well-known (X'X)^{-1} X'y and some linear algebra).

120

When we regress Z on X_1 and X_2, the coefficient of X_1 is (in terms of correlations):

(\rho_{Z,X_1} - \rho_{Z,X_2} \rho_{X_1,X_2}) / (1 - \rho_{X_1,X_2}^2)    (40)

To summarize, we get the following estimates if we omit U:

\tilde{a}_1 = a_1 + a a_2    (41a)
\tilde{b}_1 = b_1 + b_2 a (1 - a_2^2) / [1 - (a a_2 + a_1)^2]    (41b)
\tilde{c} = c + b_2 [a_2 - (a + a_1 a_2)(a a_2 + a_1)] / [1 - (a a_2 + a_1)^2]    (41c)

We can also get:

\tilde{a}_1 \tilde{b}_1 = a_1 b_1 + a a_2 b_1 + (a_1 + a a_2) b_2 a (1 - a_2^2) / [1 - (a a_2 + a_1)^2]    (42)
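To make the results in Eq. 38 and Eq. 41 concrete, below is a minimal simulation sketch. The code and the parameter values (a_1 = 0.3, a_2 = 0.4, a = 0.3, b_1 = 0.4, b_2 = 0.3, c = 0.2) are my illustration and are not taken from the dissertation; the values are chosen only to satisfy the standardization constraints derived later (Eq. 50). The sketch generates standardized T, U, M, and Y according to Eq. 32 and then fits the misspecified regressions that omit U, checking that the coefficients converge to \tilde{a}_1, \tilde{b}_1 and \tilde{c}.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Structural parameters (illustrative values chosen to satisfy Eq. 50)
a1, a2, a = 0.3, 0.4, 0.3       # T -> M, T -> U, U -> M
b1, b2, c = 0.4, 0.3, 0.2       # M -> Y, U -> Y, T -> Y (direct)

# Population correlations implied by Eq. 34
rho_TU, rho_TM, rho_MU = a2, a*a2 + a1, a + a1*a2

# Error variances from Eq. 49 so that T, U, M, Y are standardized in the population
var_eU = 1 - a2**2
var_eM = 1 - a**2 - a1**2 - 2*a*a1*a2
var_eY = (1 - b1**2 - b2**2 - c**2
          - 2*b1*b2*rho_MU - 2*b1*c*rho_TM - 2*b2*c*rho_TU)

# Generate the true model (Eq. 32)
T = rng.normal(0.0, 1.0, n)
U = a2*T + rng.normal(0.0, np.sqrt(var_eU), n)
M = a*U + a1*T + rng.normal(0.0, np.sqrt(var_eM), n)
Y = b1*M + b2*U + c*T + rng.normal(0.0, np.sqrt(var_eY), n)

def ols(y, *xs):
    """Slopes from an OLS regression of y on the given regressors (with intercept)."""
    X = np.column_stack([np.ones_like(y)] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

# Misspecified regressions that omit U
a1_tilde = ols(M, T)[0]            # M on T
b1_tilde, c_tilde = ols(Y, M, T)   # Y on M and T

# Theoretical inconsistent estimands (Eq. 38 with Eq. 39)
g1 = a2
g2 = (rho_MU - rho_TU*rho_TM) / (1 - rho_TM**2)
g3 = (rho_TU - rho_MU*rho_TM) / (1 - rho_TM**2)
print("a1_tilde:", a1_tilde, "vs", a1 + a*g1)
print("b1_tilde:", b1_tilde, "vs", b1 + b2*g2)
print("c_tilde: ", c_tilde,  "vs", c  + b2*g3)
```

With a large sample, the fitted coefficients match a_1 + a a_2, b_1 + b_2 \gamma_2 and c + b_2 \gamma_3 up to simulation error.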
Inconsistency as a function of correlations only

This section aims to write the inconsistent estimators as functions only of correlations. To achieve this, we start from Eq. 38a, Eq. 38b, Eq. 38c and the formulas for \gamma_1, \gamma_2 and \gamma_3. From these equations, we can write the inconsistency as a function of correlations and the parameters a and b_2. Therefore, we only need to derive a and b_2 as functions of correlations and then plug them into the previous equations.

From Eq. 32a, a is the regression coefficient of U when we regress M on T and U. Similarly, from Eq. 32c, b_2 is the regression coefficient of U when we regress Y on T, M and U.

121

Therefore, we can write the results in terms of correlations:

a = (\rho_{M,U} - \rho_{T,M} \rho_{T,U}) / (1 - \rho_{T,U}^2)

b_2 = [\rho_{Y,U} + \rho_{Y,M} \rho_{T,U} \rho_{T,M} + \rho_{Y,T} \rho_{M,U} \rho_{T,M} - \rho_{Y,U} \rho_{T,M}^2 - \rho_{Y,M} \rho_{M,U} - \rho_{Y,T} \rho_{T,U}] / [1 + 2 \rho_{T,M} \rho_{M,U} \rho_{T,U} - \rho_{T,M}^2 - \rho_{M,U}^2 - \rho_{T,U}^2]

so that

\tilde{a}_1 = a_1 + a \gamma_1 = a_1 + \rho_{T,U} (\rho_{M,U} - \rho_{T,M} \rho_{T,U}) / (1 - \rho_{T,U}^2)    (43)
\tilde{b}_1 = b_1 + b_2 \gamma_2 = b_1 + b_2 (\rho_{M,U} - \rho_{T,U} \rho_{T,M}) / (1 - \rho_{T,M}^2)    (44)
\tilde{c} = c + b_2 \gamma_3 = c + b_2 (\rho_{T,U} - \rho_{M,U} \rho_{T,M}) / (1 - \rho_{T,M}^2)    (45)

with b_2 as given above. Note: b_2 is derived from the formula for regression coefficients of standardized variables (again through the well-known (X'X)^{-1} X'y and some linear algebra). When we regress Z on X_1, X_2 and X_3, the coefficient of X_1 is (in terms of correlations):

[\rho_{Z,X_1} + \rho_{Z,X_2} \rho_{X_1,X_3} \rho_{X_2,X_3} + \rho_{Z,X_3} \rho_{X_1,X_2} \rho_{X_2,X_3} - \rho_{Z,X_1} \rho_{X_2,X_3}^2 - \rho_{Z,X_2} \rho_{X_1,X_2} - \rho_{Z,X_3} \rho_{X_1,X_3}] / [1 + 2 \rho_{X_1,X_2} \rho_{X_2,X_3} \rho_{X_1,X_3} - \rho_{X_1,X_2}^2 - \rho_{X_2,X_3}^2 - \rho_{X_1,X_3}^2]    (46)

122

Percent of inconsistency

In this section, we derive the percent of inconsistency as a function of correlations. From Eq. 32 we know that a_1, b_1 and c are also functions of correlations, which can be shown as follows:

a_1 = (\rho_{T,M} - \rho_{M,U} \rho_{T,U}) / (1 - \rho_{T,U}^2)

b_1 = [\rho_{Y,M} + \rho_{Y,U} \rho_{T,M} \rho_{T,U} + \rho_{Y,T} \rho_{M,U} \rho_{T,U} - \rho_{Y,M} \rho_{T,U}^2 - \rho_{Y,U} \rho_{M,U} - \rho_{Y,T} \rho_{T,M}] / [1 + 2 \rho_{T,M} \rho_{M,U} \rho_{T,U} - \rho_{T,M}^2 - \rho_{M,U}^2 - \rho_{T,U}^2]

c = [\rho_{Y,T} + \rho_{Y,M} \rho_{T,U} \rho_{M,U} + \rho_{Y,U} \rho_{T,M} \rho_{M,U} - \rho_{Y,T} \rho_{M,U}^2 - \rho_{Y,M} \rho_{T,M} - \rho_{Y,U} \rho_{T,U}] / [1 + 2 \rho_{T,M} \rho_{M,U} \rho_{T,U} - \rho_{T,M}^2 - \rho_{M,U}^2 - \rho_{T,U}^2]

Together with the previous section, and noting that \tilde{a}_1 = \rho_{T,M}, \tilde{b}_1 = (\rho_{Y,M} - \rho_{Y,T} \rho_{T,M}) / (1 - \rho_{T,M}^2) and \tilde{c} = (\rho_{Y,T} - \rho_{Y,M} \rho_{T,M}) / (1 - \rho_{T,M}^2), we get the percent of inconsistency as a function of correlations:

(\tilde{a}_1 - a_1) / \tilde{a}_1 = \rho_{T,U} (\rho_{M,U} - \rho_{T,M} \rho_{T,U}) / [\rho_{T,M} (1 - \rho_{T,U}^2)]

(\tilde{b}_1 - b_1) / \tilde{b}_1 = b_2 (\rho_{M,U} - \rho_{T,U} \rho_{T,M}) / (\rho_{Y,M} - \rho_{Y,T} \rho_{T,M})

(\tilde{c} - c) / \tilde{c} = b_2 (\rho_{T,U} - \rho_{M,U} \rho_{T,M}) / (\rho_{Y,T} - \rho_{Y,M} \rho_{T,M})

with b_2 as given above.
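The formulas in this and the previous section can be evaluated directly once values are posited for the three correlations involving U. The short sketch below is my illustration: the correlation values are hypothetical (chosen to be consistent with the simulation parameters used earlier), and the code simply applies the regression-coefficient formulas in Eq. 40 and Eq. 46 to compute the naive estimands, the implied true coefficients, and the percent of inconsistency from correlations alone.

```python
# Correlations among the observed variables (hypothetical values)
r_TM, r_TY, r_MY = 0.42, 0.488, 0.61
# Hypothesized correlations involving the unobserved mediator U (sensitivity parameters)
r_TU, r_MU, r_YU = 0.40, 0.42, 0.548

def coef3(r_z1, r_z2, r_z3, r_12, r_13, r_23):
    """Coefficient of X1 when regressing Z on X1, X2, X3 (standardized variables, Eq. 46)."""
    det = 1 + 2*r_12*r_13*r_23 - r_12**2 - r_13**2 - r_23**2
    num = (r_z1 + r_z2*r_13*r_23 + r_z3*r_12*r_23
           - r_z1*r_23**2 - r_z2*r_12 - r_z3*r_13)
    return num / det

# Naive (U omitted) estimands
a1_tilde = r_TM
b1_tilde = (r_MY - r_TY*r_TM) / (1 - r_TM**2)
c_tilde  = (r_TY - r_MY*r_TM) / (1 - r_TM**2)

# True coefficients implied by the full correlation structure (Eq. 32)
a1 = (r_TM - r_MU*r_TU) / (1 - r_TU**2)          # M on T, U: coefficient of T (Eq. 40)
b1 = coef3(r_MY, r_YU, r_TY, r_MU, r_TM, r_TU)   # Y on M, U, T: coefficient of M
c  = coef3(r_TY, r_MY, r_YU, r_TM, r_TU, r_MU)   # Y on T, M, U: coefficient of T

for name, naive, true in [("a1", a1_tilde, a1), ("b1", b1_tilde, b1), ("c", c_tilde, c)]:
    pct = (naive - true) / naive
    print(f"{name}: naive = {naive:.3f}, true = {true:.3f}, percent of inconsistency = {pct:.1%}")
```

With these hypothetical values the implied true coefficients are a_1 = 0.3, b_1 = 0.4 and c = 0.2, matching the parameter values used in the simulation sketch above.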
123

Derive error variation for standardization

In order to have all variables standardized, we need to constrain the error variances. From Eq. 32:

Var(M) = Var(a U) + Var(a_1 T) + 2 Cov(a U, a_1 T) + Var(e_M)
Var(U) = Var(a_2 T) + Var(e_U)
Var(Y) = Var(b_1 M) + Var(b_2 U) + Var(c T) + 2 Cov(b_1 M, b_2 U) + 2 Cov(b_1 M, c T) + 2 Cov(b_2 U, c T) + Var(e_Y)

Setting each variance to 1 gives:

Var(e_M) = 1 - a^2 - a_1^2 - 2 a a_1 \rho_{T,U} = 1 - a^2 - a_1^2 - 2 a a_1 a_2    (49a)
Var(e_U) = 1 - a_2^2    (49b)
Var(e_Y) = 1 - b_1^2 - b_2^2 - c^2 - 2 b_1 b_2 \rho_{M,U} - 2 b_1 c \rho_{T,M} - 2 b_2 c \rho_{T,U} = 1 - b_1^2 - b_2^2 - c^2 - 2 b_1 b_2 (a + a_1 a_2) - 2 b_1 c (a a_2 + a_1) - 2 b_2 c a_2    (49c)

Note: from Eq. 49, we can tell that we have the following constraints on the parameters:

1 - a^2 - a_1^2 - 2 a a_1 a_2 > 0    (50a)
1 - a_2^2 > 0    (50b)
1 - b_1^2 - b_2^2 - c^2 - 2 b_1 b_2 (a + a_1 a_2) - 2 b_1 c (a a_2 + a_1) - 2 b_2 c a_2 > 0    (50c)

124

Discussion about the directions of inconsistency

In this section, we discuss the directions of inconsistency based on the previous derivations.

With respect to a_1. From Eq. 41a, we can tell that \tilde{a}_1 - a_1 > 0 as long as a and a_2 have the same sign. That is, we overestimate a_1 when a and a_2 are both positive or both negative. If a > 0 and a_2 < 0, or a < 0 and a_2 > 0, we underestimate a_1.

With respect to b_1. From Eq. 41b, we can show that the sign of \tilde{b}_1 - b_1 depends on whether a and b_2 have the same sign. First, 1 - a_2^2 > 0 based on Eq. 50b. Second, 1 - (a a_2 + a_1)^2 > 0 because

(a a_2 + a_1)^2 = a^2 a_2^2 + a_1^2 + 2 a a_1 a_2 < a^2 + a_1^2 + 2 a a_1 a_2 < 1,

where the last inequality is based on Eq. 50a. Therefore, we overestimate b_1 when a and b_2 are both positive or both negative. If a > 0 and b_2 < 0, or a < 0 and b_2 > 0, we underestimate b_1.

With respect to c. From Eq. 41c, we can tell that the sign of \tilde{c} - c depends on whether b_2 and a_2 - (a + a_1 a_2)(a a_2 + a_1) have the same sign. This is because we have already shown that 1 - (a a_2 + a_1)^2 > 0 when discussing the inconsistency for b_1. We overestimate c when b_2 and a_2 - (a + a_1 a_2)(a a_2 + a_1) are both positive or both negative; if one is positive and one is negative, we underestimate c. Interestingly, from Eq. 34 we can tell that

a_2 - (a + a_1 a_2)(a a_2 + a_1) = \rho_{T,U} - \rho_{M,U} \rho_{T,M}

125

Parameter framework: how inconsistency changes with parameters

In this section, we work out the partial derivatives of \tilde{a}_1, \tilde{b}_1 and \tilde{c} with respect to the parameters related to U: a, a_2 and b_2.

With respect to a. From Eq. 41a, Eq. 41b and Eq. 41c:

\partial \tilde{a}_1 / \partial a = a_2    (51a)
\partial \tilde{b}_1 / \partial a = b_2 (1 - a_2^2) [a^2 a_2^2 + (1 - a_1^2)] / [1 - (a_1 + a a_2)^2]^2    (51b)
\partial \tilde{c} / \partial a = b_2 (1 - a_2^2) {-2 a a_2 + a_1 [(a_1 + a a_2)^2 - 1]} / [1 - (a_1 + a a_2)^2]^2    (51c)

When all parameters are positive and follow the constraints given by Eq. 50, it can easily be shown that \partial \tilde{a}_1 / \partial a and \partial \tilde{b}_1 / \partial a are always positive. For \partial \tilde{c} / \partial a, we can show that it is always negative using

(a_1 + a a_2)^2 = a_1^2 + a^2 a_2^2 + 2 a a_1 a_2 < a_1^2 + a^2 + 2 a a_1 a_2 < 1,    (52)

where the last inequality is based on Eq. 50a.

126

With respect to a_2. From Eq. 41a, Eq. 41b and Eq. 41c:

\partial \tilde{a}_1 / \partial a_2 = a    (53a)
\partial \tilde{b}_1 / \partial a_2 = 2 a b_2 [a a_1 (1 + a_2^2) - a_2 (1 - a^2 - a_1^2)] / [1 - (a_1 + a a_2)^2]^2    (53b)
\partial \tilde{c} / \partial a_2 = b_2 {[(1 - a^2 - a_1^2) - 2 a a_1 a_2][1 - (a_1 + a a_2)^2] + 2 a (a_1 + a a_2)[a_2 (1 - a^2 - a_1^2) - a a_1 (1 + a_2^2)]} / [1 - (a_1 + a a_2)^2]^2    (53c)

When all parameters are positive, \partial \tilde{a}_1 / \partial a_2 is positive. \partial \tilde{b}_1 / \partial a_2 and \partial \tilde{c} / \partial a_2 can be either positive or negative, depending on the magnitudes of a_1 and a. Discussions are as follows.

Fig. .8 shows how the sign of \partial \tilde{b}_1 / \partial a_2 changes when a_2, a_1 and a take on different values. The dashed line represents the constraint given by Eq. 50a; the constraint says that we can only take on values below the dashed line. The solid line, on the other hand, shows where \partial \tilde{b}_1 / \partial a_2 is equal to 0. Importantly, this partial derivative is positive when a_1 and a take on values above the solid line and negative below the solid line. These two lines become closer to each other as a_2 becomes larger, as shown from the left panel to the right one.

Similarly, Fig. .9 shows how the sign of \partial \tilde{c} / \partial a_2 changes when a_2, a_1 and a take on different values. Again, the dashed line represents the constraint given by Eq. 50a, and we can only take on values below the dashed line. The solid line shows where \partial \tilde{c} / \partial a_2 is equal to 0. Above this solid line the partial derivative is negative and below the solid line it is positive. The three panels from left to right show what happens as a_2 becomes larger.

Therefore, we can summarize as follows. When a_1 and a are relatively small (in the lower-left corner), \partial \tilde{b}_1 / \partial a_2 can be positive at first and then quickly becomes negative as a_2 becomes larger, while \partial \tilde{c} / \partial a_2 is always positive. When a_1 and a are relatively large (in the center part), \partial \tilde{b}_1 / \partial a_2 is always positive, while for \partial \tilde{c} / \partial a_2 the sign can be negative, or first positive and then negative.

Figure .8: Sign of \partial \tilde{b}_1 / \partial a_2

127

Figure .9: Sign of \partial \tilde{c} / \partial a_2

With respect to b_2. From Eq. 41a, Eq. 41b and Eq. 41c:

\partial \tilde{a}_1 / \partial b_2 = 0    (54a)
\partial \tilde{b}_1 / \partial b_2 = a (1 - a_2^2) / [1 - (a a_2 + a_1)^2]    (54b)
\partial \tilde{c} / \partial b_2 = [a_2 - (a + a_1 a_2)(a a_2 + a_1)] / [1 - (a a_2 + a_1)^2]    (54c)

When all parameters are positive (and the other restrictions required by standardization hold), it can be shown using Eq. 52 that \partial \tilde{b}_1 / \partial b_2 is always positive. \partial \tilde{c} / \partial b_2 can be either positive or negative, depending on the magnitudes of a_1, a_2 and a.

Fig. .10 shows how the sign of \partial \tilde{c} / \partial b_2 changes when a_2, a_1 and a take on different values. The dashed line represents the constraint given by Eq. 50a; we can only take on values below the dashed line. The solid line, on the other hand, shows where \partial \tilde{c} / \partial b_2 is equal to 0.
This partial derivative is positive when a_1 and a take on values below the solid line and negative above the solid line. The three panels from left to right show what happens as a_2 turns larger. Therefore, when a_1 and a are relatively small (in the lower-left corner) and a_2 is also very small, \partial \tilde{c} / \partial b_2 is positive. When a_1 and a are relatively large compared to a_2 (in the center part, between the two lines), \partial \tilde{c} / \partial b_2 is negative.

128

Importantly, because we are taking the partial derivative with respect to b_2 and b_2 does not appear in this derivative, the boundary shown by the solid line stays the same no matter what value b_2 takes.

Figure .10: Sign of \partial \tilde{c} / \partial b_2

Decompose total inconsistency into two missing pathways when omitting U

First we write the following equation based on the results in Eq. 41:

[\tilde{a}_1 \tilde{b}_1 - a_1 b_1] + [\tilde{c} - c] = a_2 a b_1 + b_2 \gamma_2 (a_1 + a a_2) + b_2 \gamma_3    (55)

The following derivation then shows that the last two components, b_2 \gamma_2 (a_1 + a a_2) + b_2 \gamma_3, are equivalent to b_2 a_2:

b_2 \gamma_2 (a_1 + a a_2) + b_2 \gamma_3
= b_2 a (1 - a_2^2)(a_1 + a a_2) / [1 - (a a_2 + a_1)^2] + b_2 [a_2 - (a + a_1 a_2)(a a_2 + a_1)] / [1 - (a a_2 + a_1)^2]    (56a)
= b_2 {a (1 - a_2^2)(a_1 + a a_2) + a_2 - (a + a_1 a_2)(a a_2 + a_1)} / [1 - (a a_2 + a_1)^2]
= b_2 {(a a_2 + a_1)[a (1 - a_2^2) - (a + a_1 a_2)] + a_2} / [1 - (a a_2 + a_1)^2]
= b_2 {-a_2 (a a_2 + a_1)^2 + a_2} / [1 - (a a_2 + a_1)^2]
= b_2 a_2 [1 - (a a_2 + a_1)^2] / [1 - (a a_2 + a_1)^2]
= b_2 a_2    (56b)

Therefore, the total inconsistency in the indirect and direct effects decomposes into the two omitted pathways: a_2 a b_1 for the omitted path T → U → M → Y, and b_2 a_2 for the omitted path T → U → Y.

129

REFERENCES

130

Aaronson, D., Barrow, L., & Sander, W. (2007). Teachers and student achievement in the Chicago public high schools. Journal of labor Economics, 25(1), 95–135.

Adnot, M., Dee, T., Katz, V., & Wyckoff, J. (2017). Teacher Turnover, Teacher Quality, and Student Achievement in DCPS. Educational Evaluation and Policy Analysis, 39(1), 54–76.

An, W. (2018). Causal Inference with Networked Treatment Diffusion. Sociological Methodology, 48(1), 152–181.

Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of personality and social psychology, 51(6), 1173.

Bekman, N. M., Cummins, K., & Brown, S. A. (2010). Affective and personality risk and cognitive mediators of initial adolescent alcohol use. Journal of studies on alcohol and drugs, 71(4), 570–580.

Blatchford, P., & Martin, C. (1998). The Effects of Class Size on Classroom Processes: ‘It’s a Bit Like a Treadmill – Working Hard and Getting Nowhere Fast!’. British Journal of Educational Studies, 46(2), 118–137.

Bloom, H. S., Raudenbush, S. W., Weiss, M. J., & Porter, K. (2017). Using Multisite Experiments to Study Cross-Site Variation in Treatment Effects: A Hybrid Approach With Fixed Intercepts and a Random Treatment Coefficient. Journal of Research on Educational Effectiveness, 10(4), 817–842.

Bloom, H. S., & Spybrook, J. (2017). Assessing the Precision of Multisite Trials for Estimating the Parameters of a Cross-Site Population Distribution of Program Effects. Journal of Research on Educational Effectiveness, 10(4), 877–902.

Bulterman-Bos, J. A. (2008). Will a Clinical Approach Make Education Research More Relevant for Practice? Educational Researcher, 37(7), 412–420.

Carrell, S. E., Fullerton, R. L., & West, J. E. (2009). Does Your Cohort Matter? Measuring Peer Effects in College Achievement. Journal of Labor Economics, 27(3), 439–464.

Chetty, R., Friedman, J. N., & Rockoff, J. E. (2011). The Long-Term Impacts of Teachers: Teacher Value-Added and Student Outcomes in Adulthood. Working Paper(17699).

Cinelli, C., & Hazlett, C.
(2018). Making Sense of Sensitivity: Extending Omitted Variable Bias. Working Paper, 40. 131 Cohen, J., & Cohen, P. (1983). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, New Jersey: Erlbaum Associates. Cook, T. D. (2002, September). Randomized Experiments in Educational Policy Research: A Critical Examination of the Reasons the Educational Evaluation Community has Offered for not Doing Them. Educational Evaluation and Policy Analysis, 24(3), 175–199. Dee, T. S., & Wyckoff, J. (2015). Incentives, Selection, and Teacher Performance: Evidence from IMPACT: Incentives, Selection, and Teacher Performance. Journal of Policy Analysis and Management, 34(2), 267–297. Dieterle, S., Guarino, C. M., Reckase, M. D., & Wooldridge, J. M. (2015). How do Principals Assign Students to Teachers? Finding Evidence in Administrative Data and the Implications for Value Added: How do Principals Assign Students to Teachers? Journal of Policy Analysis and Management, 34(1), 32–58. Frank, K. A. (1998). Chapter 5: Quantitative Methods for Studying Social Context in Multilevels and Through Interpersonal Relations. Review of Research in Education, 23(1), 171–216. Frank, K. A. (2000). Impact of a Confounding Variable on a Regression Coefficient. Sociological Methods & Research, 29(2), 147–194. Frank, K. A., Maroulis, S. J., Duong, M. Q., & Kelcey, B. M. (2013). What Would It Take to Change an Inference? Using Rubin’s Causal Model to Interpret the Robustness of Causal Inferences. Educational Evaluation and Policy Analysis, 35(4), 437–460. Frank, K. A., Saw, G. K., & Xu, R. (2016). Book review of “Causality in a Social World” by Guanglei Hong. Observational Studies, 4. Fritz, M. S., Kenny, D. A., & MacKinnon, D. P. (2016). The Combined Effects of Measurement Error and Omitting Confounders in the Single-Mediator Model. Multivariate Behavioral Research, 51(5), 681–697. Fritz, M. S., & MacKinnon, D. P. (2007). Required Sample Size to Detect the Mediated Effect. Psychological Science, 18(3), 233–239. Goldhaber, D., & Hansen, M. (2010). Assessing the Potential of Using Value-Added Estimates of Teacher Job Performance for Making Tenure Decisions. Working Paper 31. National Center for Analysis of Longitudinal Data in Education Research. Gordon, R. J., Kane, T. J., & Staiger, D. (2006). Identifying effective teachers using performance on the job. Brookings Institution Washington, DC. Guarino, C. M., Reckase, M. D., & Wooldridge, J. M. (2015). Can value-added measures of teacher performance be trusted? Education Finance and Policy. 132 Hanushek, E. A. (1999). Some Findings From an Independent Investigation of the Tennessee STAR Experiment and From Other Investigations of Class Size Effects. Educational Evaluation and Policy Analysis, 21(2), 143–163. Hanushek, E. A. (2014). Boosting teacher effectiveness. What lies ahead for America’s children and their schools, 23–35. Hanushek, E. A., & Rivkin, S. G. (2010). Generalizations about using value-added measures of teacher quality. The American Economic Review, 100(2), 267–271. Harfitt, G. J. (2013). Why ‘small’ can be better: an exploration of the relationships between class size and pedagogical practices. Research Papers in Education, 28(3), 330–345. Harris, D. N. (2009). Teacher value-added: Don’t end the search before it starts. Journal of Policy Analysis and Management, 28(4), 693–699. Hayes, A. F. (2013). Introduction to mediation, moderation, and conditional process analysis: a regression-based approach. 
New York: The Guilford Press. Hill, H. C., Kapitula, L., & Umland, K. (2011). A validity argument approach to evaluating teacher value-added scores. American Educational Research Journal, 48(3), 794–831. Holland, P. W. (1986). Statistics and Causal Inference. Journal of the American Statistical Association, 81(396), 945–960. Hong, G. (2015). Causality in a Social World: Moderation, Meditation and Spill-over. Chichester, UK: John Wiley & Sons, Ltd. Hong, G., Qin, X., & Yang, F. (2018). Weighting-Based Sensitivity Analysis in Causal Mediation Studies. Journal of Educational and Behavioral Statistics, 43(1), 32–56. Hoxby, C. M., & Weingarth, G. (2005). TAKING RACE OUT OF THE EQUATION: SCHOOL REASSIGNMENT AND THE STRUCTURE OF PEER EFFECTS. Working paper, 47. Imai, K., Keele, L., & Tingley, D. (2010). A general approach to causal mediation analysis. Psychological Methods, 15(4), 309–334. Imai, K., Keele, L., & Yamamoto, T. (2010). Identification, Inference and Sensitivity Analysis for Causal Mediation Effects. Statistical Science, 25(1), 51–71. Imai, K., & Yamamoto, T. (2013). Identification and Sensitivity Analysis for Multiple Causal Mechanisms: Revisiting Evidence from Framing Experiments. Political Analysis, 21(2), 141–171. Imbens, G. W. (2003). Sensitivity to Exogeneity Assumptions in Program Evaluation. American 133 Economic Review, 93(2), 126–132. Judd, C. M., & Kenny, D. A. (1981). Process Analysis: Estimating Mediation in Treatment Evaluations. Evaluation Review, 5(5), 602–619. Kane, T. J., & Staiger, D. O. (2012). Gathering Feedback for Teaching: Combining High-Quality Observations with Student Surveys and Achievement Gains. Research Paper. MET Project. Bill & Melinda Gates Foundation. Kim, C. M., Frank, K. A., & Spillane, J. P. (2018). Relationships among teachers’ formal and informal positions and their incoming student composition. Teachers College Record, 120(3), n3. Konstantopoulos, S. (2011). How Consistent Are Class Size Effects? Evaluation Review, 35(1), 71–92. Levin, J. (2002). For whom the reductions count: A quantile regression analysis of class size and peer effects on scholastic achievement. In Economic applications of quantile regression (p. 221-246). Springer. Lockwood, J. R., Louis, T. A., & McCaffrey, D. F. (2002). Uncertainty in rank estimation: Implications for value-added modeling accountability systems. Journal of educational and behavioral statistics, 27(3), 255–270. Lockwood, J. R., McCaffrey, D. F., Hamilton, L. S., Stecher, B., Le, V.-N., & Martinez, J. F. (2007). The sensitivity of value-added teacher effect estimates to different mathematics achievement measures. Journal of Educational Measurement, 44(1), 47–67. MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. New York: Lawrence Erlbaum Associates. (OCLC: ocn122701406) MacKinnon, D. P., Krull, J. L., & Lockwood, C. M. (2000). Equivalence of the mediation, confounding and suppression effect. Prevention science, 1(4), 173–181. Maroulis, S. (2016). Interpreting school choice treatment effects: Results and implications from computational experiments. Journal of Artificial Societies and Social Simulation, 19(1), 7. Maroulis, S., Guimera, R., Petry, H., Stringer, M. J., Gomez, L. M., Amaral, L. A. N., & Wilensky, U. (2010). Complex Systems View of Educational Policy Research. Science, 330(6000), 38–39. Maxwell, S. E., & Cole, D. A. (2007). Bias in cross-sectional analyses of longitudinal mediation. Psychological Methods, 12(1), 23–44. Maxwell, S. E., Cole, D. A., & Mitchell, M. A. (2011). 
Bias in Cross-Sectional Analyses of 134 Longitudinal Mediation: Partial and Complete Mediation Under an Autoregressive Model. Multivariate Behavioral Research, 46(5), 816–841. Mitchell, M. A., & Maxwell, S. E. (2013). A Comparison of the Cross-Sectional and Sequential Designs when Assessing Longitudinal Mediation. Multivariate Behavioral Research, 48(3), 301–339. Nye, B., Hedges, L. V., & Konstantopoulos, S. (2000). The Effects of Small Classes on Academic Achievement: The Results of the Tennessee Class Size Experiment. American Educational Research Journal, 37(1), 123–151. Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237-257. Paufler, N. A., & Amrein-Beardsley, A. (2014). The random assignment of students into ele- mentary classrooms: Implications for value-added analyses and interpretations. American Educational Research Journal, 51(2), 328–362. Pedder, D. (2006). Are small classes better? Understanding relationships between class size, classroom processes and pupils’ learning. Oxford Review of Education, 32(2), 213–234. Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40(3), 879–891. Raudenbush, S. W. (2015). Value added: A case study in the mismatch between education research and policy. Educational Researcher, 44(2), 138–141. Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Vol. 1). Sage. Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools, and academic achievement. Econometrica, 73(2), 417–458. Rosenbaum, P. R. (2002). Observational Studies. Springer. Rosenberg, M. (1968). The logic of survey analysis. New York: Basic. Rothstein, J. (2009, October). Student Sorting and Bias in Value-Added Estimation: Selection on Observables and Unobservables. Education Finance and Policy, 4(4), 537–571. Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay, and student achievement. The Quarterly Journal of Economics, 125(1), 175–214. Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized 135 studies. Journal of Educational Psychology, 66(5), 688–701. Rubin, D. B. (1986). Statistics and Causal Inference: Comment: Which Ifs Have Causal Answers. Journal of the American Statistical Association, 81(396), 961. Rubin, D. B. (1990). Formal mode of statistical inference for causal effects. Journal of Statistical Planning and Inference, 25(3), 279–292. Schanzenbach, D. W. (2007). What Have Researchers Learned from Project STAR? Brookings Papers on Education Policy(9), 205–228. Singh, R., Chen, F., & Wegener, D. T. (2014). The Similarity-Attraction Link: Sequential Ver- sus Parallel Multiple-Mediator Models Involving Inferred Attraction, Respect, and Positive Affect. Basic and Applied Social Psychology, 36(4), 281–298. Sobel, M. E. (1982). Asymptotic confidence intervals for indirect effects in structural equation models. Sociological methodology, 13, 290–312. Spybrook, J., & Raudenbush, S. W. (2009). An Examination of the Precision and Technical Accuracy of the First Wave of Group-Randomized Trials Funded by the Institute of Education Sciences. Educational Evaluation and Policy Analysis, 31(3), 298–318. Spybrook, J., Shi, R., & Kelcey, B. (2016). 
Progress in the past decade: an examination of the precision of cluster randomized trials funded by the U.S. Institute of Education Sciences. International Journal of Research & Method in Education, 39(3), 255–267. Sullivan, G. M. (2011). Getting Off the “Gold Standard”: Randomized Controlled Trials and Education Research. Journal of Graduate Medical Education, 3(3), 285–289. Tal-Or, N., Cohen, J., Tsfati, Y., & Gunther, A. C. (2010). Testing Causal Direction in the Influence of Presumed Media Influence. Communication Research, 37(6), 801–824. VanderWeele, T. J. (2010). Bias Formulas for Sensitivity Analysis for Direct and Indirect Effects:. Epidemiology, 21(4), 540–551. VanderWeele, T. J. (2015). Explanation in causal inference: methods for mediation and interaction. New York: Oxford University Press. VanderWeele, T. J., & Arah, O. A. (2011). Bias Formulas for Sensitivity Analysis of Unmeasured Confounding for General Outcomes, Treatments, and Confounders:. Epidemiology, 22(1), 42–52. VanderWeele, T. J., & Chiba, Y. (2014). Sensitivity analysis for direct and indirect effects in the presence of exposure-induced mediator-outcome confounders. Epidemiology, Biostatistics and Public Health(ONLINE FIRST). 136 Van Ewijk, R., & Sleegers, P. (2010). The effect of peer socioeconomic status on student achieve- ment: A meta-analysis. Educational Research Review, 5(2), 134-150. Winters, M. A., & Cowen, J. M. (2013a). Who Would Stay, Who Would Be Dismissed? An Em- pirical Consideration of Value-Added Teacher Retention Policies. Educational Researcher, 42(6), 330–337. Winters, M. A., & Cowen, J. M. (2013b). Would a Value-Added System of Retention Improve the Distribution of Teacher Quality? A Simulation of Alternative Policies: Value Added Teacher Retention Policies. Journal of Policy Analysis and Management, 32(3), 634–654. Wooldridge, J. M. (2009). Introductory econometrics: a modern approach (4th ed ed.). Mason, OH: South Western, Cengage Learning. 137