QUANTIFYING THE BIAS OF STANDARD ERROR ESTIMATES DUE TO OMITTED CLUSTER LEVELS IN COMPLEX MULTILEVEL DATA: A SENSITIVITY ANALYSIS FOR EMPIRICAL RESEARCHERS

By

Zixi Chen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Measurement and Quantitative Methods, Doctor of Philosophy

2020

ABSTRACT

QUANTIFYING THE BIAS OF STANDARD ERROR ESTIMATES DUE TO OMITTED CLUSTER LEVELS IN COMPLEX MULTILEVEL DATA: A SENSITIVITY ANALYSIS FOR EMPIRICAL RESEARCHERS

By

Zixi Chen

Educational phenomena occur in multilevel contexts, such as students nested within classrooms and classrooms nested within schools. This multilevel structure is also reflected in the multi-stage sampling designs and randomized experimental designs by clusters used in educational data collection and research design. The consequential challenge of dependent observations within clusters at each nesting layer is prevalently dealt with by Hierarchical Linear Modeling (HLM) in education studies. In practice, however, cluster levels can be unidentified or misspecified, such that the complex multilevel data structure is only partially represented. Thus, even with advanced statistical tools, estimated models with omitted clustering levels will still produce erroneous standard error estimates and result in either Type I or Type II errors that compromise, and even distort, interpretations of educational mechanisms. Practical guidance is urgently needed for empirical research confronting this issue to judge and detect whether the estimated models are adequate in taking account of the clustering dependency. This paper contributes by investigating when a cluster level should be explicitly modeled but is omitted, and how much the standard error estimates would be biased as a result. The paper examines these questions in settings with a true three-level clustered data structure, while a cluster level, either at the highest, middle, or lowest level, is omitted in the estimated two-level models. The theoretical discussion of which clustering levels are essential in modeling, owing to multi-stage sampling designs and randomized experiments by clusters, draws on insights from Abadie et al. (2017) and Hedges and Rhoads (2011). The current study then derives corresponding mathematical formulas that quantify this effect. These derived formulas are functions of the intraclass correlation coefficients and cluster sizes of the estimated and omitted cluster levels. Further, building on Frank, Maroulis, Duong, and Kelcey (2013), the current paper develops a sensitivity analysis framework with a scientific language to quantify the degree of robustness of statistical inferences based on the clustering characteristics of the omitted levels of clusters. For each omitted clustering scenario at the lowest, middle, and highest level, empirical studies are provided as implication examples of the sensitivity analysis to demonstrate the potential risks to inference robustness due to omitted clustering.

Copyright by ZIXI CHEN 2020

ACKNOWLEDGEMENTS

I would like to express my heartfelt gratitude to my advisor and committee chair, Dr. Kenneth Frank, for giving me invaluable guidance throughout my doctoral study and this research. His vision, experience, and sincerity in research have deeply inspired me. He taught me the spirit of growing into a good scientist and the philosophy of conducting scientific research. His kindness and empathy have continuously supported and encouraged me to overcome all the challenges an international student faces.
Without his support, my doctoral study would not have been possible. I would also like to extend my sincere appreciation to my committee, Dr. Jeffery Wooldridge, Dr. Kimberly Kelly, and Dr. Spyros Konstantopoulos, for their expertise and support. Their insights on this study have largely inspired me to learn the different languages of economics, education, and statistics while seeking the common ground and the essence of the research goals. I would like to acknowledge the Dissertation Completion Fellowship from the Graduate School of Michigan State University, which financially supported this study. I owe my deepest gratitude to my parents, Zaimin Chen and Yueping Yu, and my grandparents, who love me and always support me unconditionally. For the past nine years, I have been living on another continent; endless love is what carries me to accomplish this journey. I would also like to thank the long-standing support and love from my fiancé, Haoran Tan, and his family. Last but not least, I sincerely appreciate my dear friends and colleagues for their intellectual sharing, emotional support, and companionship: Dr. Kaitlin Knake, Dr. Qinyun Lin, Yuqing Liu, Jordan Tait, Dr. Kimberly Jensen, Dr. Tenglong Li, Dr. Jihyun Kim, Dr. Sihua Hu, Dr. Faran Zhou, Dr. Zheng Gu, Wanqing Apa, Bixi Zhang, and Xuran Wang. Thank you!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 Background
1.2 Problem Statement
1.3 Research Questions and Goals
1.4 Combining the Benefits of the Model-Based and the Design-Based Approaches
1.5 Summary of Findings
1.6 Structure of this Study
CHAPTER 2 OMITTED THE MIDDLE CLUSTER LEVEL
2.1 Introduction
2.2 Omitted Middle Level Due to Sampling Design
2.2.1 Omitting SSUs in a Three-stage Sampling Structure Data
2.2.2 Incidental Middle Level between PSUs and SSUs (or USUs)
2.3 Omitted Middle Level in Cluster Randomized Trial
2.4 Quantification of Standard Error Bias
2.4.1 Model Setting
2.4.2 Quantifying the Standard Error Estimate Bias
2.4.3 Simulation Results
2.5 Discussion and Conclusion
CHAPTER 3 SENSITIVITY ANALYSIS FRAMEWORK OF OMITTED CLUSTERING
3.1 Introduction
3.2 Constructing the Sensitivity Measures for Inference Robustness of Clustering
3.2.1 Scenario of No Type I Error
3.2.2 Scenario of Having a Type I Error
3.2.3 Heuristics Diagram and Interpretations of the Sensitivity Analysis
3.3 Implication of the Sensitivity Analysis: Using an Empirical Example
CHAPTER 4 OMITTED HIGHEST CLUSTER LEVEL
4.1 Introduction
4.2 Omitted Highest Cluster Level in Sampling and Experimental Design
4.2.1 Omitting PSUs in a Three-Stage Sampling Structure Data
4.2.2 Incidental Highest Level above PSUs
4.2.3 Omitted PSUs above the Level of Treatment Assignment
4.3 Quantification of Standard Error Bias
4.3.1 Model Setting
4.3.2 Quantifying the Standard Error Estimate Bias
4.3.3 Simulation Results
4.4 Empirical Example and Sensitivity Analysis
4.5 Discussion and Conclusion
CHAPTER 5 OMITTED SERIAL CORRELATIONS IN LOWEST CLUSTER LEVEL
5.1 Introduction
5.2 Alternative R Structures with Serial Correlations
5.2.1 Study Motivation
5.3 Quantification of Standard Error Bias
5.3.1 Model Setting
5.3.2 Quantifying the Standard Error Estimate Bias
5.3.3 Simulation Results
5.4 Empirical Example and Sensitivity Analysis
5.5 Conclusion and Future Research
APPENDICES
APPENDIX 2A Intraclass Correlation Coefficients in a Three-Level Model
APPENDIX 2B A Summary of Model Specification, Assumption, and Estimation
APPENDIX 2C Simulation Parameter Settings and Results of VOCs of Omitting the Middle Cluster Level
APPENDIX 3A Quantifying the Robustness of Inference with Type II Error
APPENDIX 4A Simulation Parameter Settings and Results of VOCs of Omitting the Highest Cluster Level
APPENDIX 5A Simulation Parameter Settings and Results of VOCs of Omitting the Lowest Cluster Level
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Scenarios of the middle cluster level in two- and three-stage sampling designs and two- and three-level estimated models.
Table 2.2 A summary of VOCs when the middle cluster level is omitted in three-level structured clustering data.
Table 3.1 Sensitivity analysis of the student-level predictor: Internet use for economic data.
Table 4.1 A summary of VOCs when the highest cluster level is omitted in three-level structured clustering data.
Table 4.2 Sensitivity analysis of the school-level predictor: economics required for graduation.
Table 5.1 A summary of VOCs when the serial correlation is omitted.
Table 5.2 Sensitivity analysis of the time-varying predictor: competence.
Table 5.3 Sensitivity analysis of the time-varying predictor: intrinsic regulation.
Table 5.4 Sensitivity analysis of the time-varying predictor: external regulation.
Table 2B.1 Summary of model specification, assumption, and estimation contrasting the two-level estimated model omitting the middle cluster level and the three-level satisfactory model.
Table 2C.1 Simulation parameter settings.
Table 2C.2 Relative bias of estimates of variances under simulation parameter setting 1.
Table 2C.3 Relative bias of estimates of variances under simulation parameter setting 2.
Table 2C.4 Relative bias of estimates of variances under simulation parameter setting 3.
Table 2C.5 Relative bias of estimates of variances under simulation parameter setting 4.
Table 4A.1 Simulation parameter settings.
Table 4A.2 Relative bias of estimates of variances under simulation parameter setting 1.
Table 4A.3 Relative bias of estimates of variances under simulation parameter setting 2.
Table 4A.4 Relative bias of estimates of variances under simulation parameter setting 3.
Table 5A.1 Simulation parameter settings.
Table 5A.2 Relative bias of estimates of variances under lag-1 autocorrelation setting 1.
Table 5A.3 Relative bias of estimates of variances under lag-1 autocorrelation setting 2.
Table 5A.4 Relative bias of estimates of variances under lag-1 autocorrelation setting 3.
Table 5A.5 Relative bias of estimates of variances under lag-1 autocorrelation setting 4.

LIST OF FIGURES

Figure 2.1 Data correlation structures of three-stage sampling designs when the secondary sampling stage is included and omitted.
Figure 2.2 Correlation structures of the error variance-covariance matrices of the three-level model and of the two-level model omitting the middle cluster level.
Figure 2.3 Relationship among the two-level and three-level intraclass correlation coefficients.
Figure 3.1 Graphic demonstration of conceptualizing the sensitivity analysis framework.
Figure 3.2 Two scenarios of comparing t statistics of the estimated model and the hypothesized models.
Figure 3.3 Heuristics diagram of sensitivity analysis when the predictor of interest in the original single-level model is statistically significant.
Figure 4.1 Data correlation structures of three-stage sampling designs when the first sampling stage is included and omitted.
Figure 4.2 Omitted highest cluster level in a two-level CRT design.
Figure 4.3 Correlation structures of the error variance-covariance matrices of the three-level model and of the two-level model omitting the highest cluster level.
Figure 3.A.1 Two scenarios of comparing t statistics of the estimated model and the hypothesized satisfactory models.

CHAPTER 1 INTRODUCTION

1.1 Background

Educational phenomena occur in a nested context, such as students nested within classes within schools (Barr & Dreeben, 1983).
In this multilevel schooling system, higher-level school actors, such as administrators and principals, as well as the school social context, shape and respond to the educational activities of lower-level actors, such as teachers and students, through flows of resource allocations and routine organizational designs (Gamoran & Dreeben, 1986; Goddard et al., 2007; Hallinger & Murphy, 1986; Heck et al., 1990; Seashore Louis & Lee, 2016; Spillane et al., 2011). These units, such as schools, classrooms, and students, are inherent (or innate) levels in the formed organizational system of schooling (Krull & MacKinnon, 2001).

The multi-stage sampling design of education data collection corresponds to the multilevel structure of the schooling system (Konstantopoulos, 2008a, 2008b; Snijders & Bosker, 2011; Hedges & Rhoads, 2011). Larger units, such as schools from the population of interest, are randomly selected in the first stage and are referred to as Primary Sampling Units (PSUs) (Leeuw & Meijer, 2008). In the second stage, researchers sample smaller units, such as classrooms, from the PSUs; the sampled classrooms are thus Secondary Sampling Units (SSUs) nested within school clusters. Stages of sampling can continue until the Ultimate Sampling Units (USUs), normally the targeted research units such as students, are reached (Battaglia, 2008). These sampling stages define the deliberate cluster levels by design in analysis.

The multilevel nesting structure results in dependencies between individual actors within clusters, challenging the independent-observation assumption of conventional regression analysis using Ordinary Least Squares (OLS) estimation. For instance, students who are similar in motivation, achievement, and family background are more likely to be grouped in the same classrooms and schools (Goldstein, 2011; Snijders & Bosker, 2011). It is also possible that students become more similar after they are assigned to the same classes and schools, as they share similar learning experiences and social contexts (see empirical examples in Frank, Muller, et al., 2008, and Rhoads, 2011). Teachers could become similar in instruction through professional training, collaboration, and social interactions, which they ultimately expose their students to in learning activities, forming a within-school shared culture and collectively reacting to policy enactment (see empirical examples in Coburn et al., 2012; Goddard et al., 2007; Penuel et al., 2009, and a survey in Voogt et al., 2016).

With the clustering dependency, the independent-error assumption treating the data as a simple random sample is violated. It is well documented that the standard error estimates of coefficients from OLS estimation are underestimated even though the coefficient estimates are unbiased, which leads to Type I errors (McNeish, 2014; Mundfrom & Schults, 2002; Musca et al., 2011).
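To make this risk concrete before turning to HLM, consider the following minimal simulation sketch, written in Python purely for illustration (it is not part of the original study, and all names and parameter values are invented). It generates three-level data, fits a single-level OLS regression of the outcome on a school-level predictor, and compares the average nominal OLS standard error with the empirical standard deviation of the slope across replications; with nonzero ICCs, the nominal OLS standard error is visibly too small.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

def simulate(K=50, J=4, n=20, tau_school=0.1, tau_class=0.1, sigma2=0.8):
    """One draw of three-level data: n students per classroom, J classrooms
    per school, K schools; the true slope of the school-level predictor z is 0.3."""
    rows = []
    for k in range(K):
        u_k = rng.normal(0.0, np.sqrt(tau_school))      # school random intercept
        z_k = rng.standard_normal()                     # school-level predictor
        for j in range(J):
            r_jk = rng.normal(0.0, np.sqrt(tau_class))  # classroom random intercept
            e = rng.normal(0.0, np.sqrt(sigma2), size=n)
            rows += [(k, z_k, 0.3 * z_k + u_k + r_jk + ei) for ei in e]
    return pd.DataFrame(rows, columns=["school", "z", "y"])

slopes, nominal_ses = [], []
for _ in range(200):
    fit = smf.ols("y ~ z", data=simulate()).fit()       # single-level OLS
    slopes.append(fit.params["z"])
    nominal_ses.append(fit.bse["z"])

# The nominal OLS SE understates the true sampling variability of the slope.
print("mean nominal OLS SE  :", round(float(np.mean(nominal_ses)), 4))
print("empirical SD of slope:", round(float(np.std(slopes)), 4))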
The research interest in multilevel-structured educational and social phenomena and the methodological need to deal with the clustering dependency lead to the prevalent use of the Hierarchical Linear Model (HLM) (Raudenbush & Bryk, 2002; Frank, 1998; Musca et al., 2011; Niehaus et al., 2014; Snijders & Bosker, 2011). HLM explicitly models the multilevel clustering dependency by including the between-cluster variation, and it identifies cluster-specific effects beyond the population-averaged estimates (McNeish et al., 2017; Snijders & Bosker, 2011). In a three-stage sampled data structure, a three-level HLM model can account for dependencies of USUs nested within SSUs within PSUs. This structure aligns well with the conventional schooling system mentioned above, in which students are nested within classes and classes are nested within schools.

The advantages of using HLM to make robust statistical inferences with clustered data can easily vanish if the imperative modeling assumptions relevant to the random effects do not hold (Dedrick et al., 2009; Huang, 2018; McNeish et al., 2017; Snijders & Berkhof, 2008). Since the random effects variance is taken into account in estimating the standard errors of the regression coefficients (Raudenbush & Bryk, 2002; Snijders & Bosker, 2011), an essential assumption is that the cluster levels modeled as random effects are sufficient and that the corresponding random effects are correctly specified. For example, in practice, researchers may purposively exclude a cluster level, such as the classrooms, from the model for parsimony, regardless of whether this omission results in ignored clustering dependency and false inferences.
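For reference, a three-level random intercept model of the kind just described can be fit with standard mixed-model software. The sketch below uses Python's statsmodels as an illustrative choice (it is not the software used in this study) and assumes a data frame df with columns y, x, school, and classroom, where classroom identifiers are unique within each school; the classroom random intercept enters as a variance component nested within the school groups.

import statsmodels.formula.api as smf

# Three-level random intercept model: students within classrooms within schools.
model = smf.mixedlm(
    "y ~ x",                                        # fixed effects
    data=df,                                        # df is assumed to exist
    groups="school",                                # school random intercept
    re_formula="1",
    vc_formula={"classroom": "0 + C(classroom)"},   # nested classroom intercept
)
result = model.fit(reml=True)
print(result.summary())

The summary reports separate variance components for schools and classrooms, which is exactly the information that is entangled or lost when the classroom level is omitted.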
1.2 Problem Statement

The omission of cluster levels or the misspecification of random effects masks some true sources of the clustering dependency, which misguides the confirmation of the tested hypothesis and the deduction of theories. A substantial body of methodological studies in the early 2000s highlighted this issue (e.g., Luo & Kwok, 2009; Moerbeek, 2004; Van den Noortgate et al., 2005; Opdenakker & Van Damme, 2000; Tranmer & Steel, 2001). In general, the variances from an omitted intermediate level are redistributed into the two adjacent levels; if the lowest or the highest level is omitted, the variances are redistributed to the adjacent higher or lower level, respectively. As the omitted dependencies are redistributed to the wrong cluster levels, the variance estimates of random effects (i.e., variance components) and the standard error estimates of fixed effects (i.e., regression coefficients) are still biased even if the estimated model captures some clustering effects. Also, critical dependencies within the modeled clusters must be represented and accounted for in the model. In other words, crucial clustering dependencies should be neither falsely left out nor over-specified. The debate over this condition has mostly concerned misspecifications of the error variance-covariance structure of repeated measures in longitudinal data analysis (e.g., Baek & Ferron, 2013; Ferron et al., 2010; LeBeau, 2018; Murphy & Pituch, 2009).

Ideally, empirical researchers are encouraged to provide the strongest models that best fit their data, theories, and research designs. On the one hand, if any cluster levels are necessary, the corresponding clustering dependencies should be modeled for robust inferences. On the other hand, we do not want to model unnecessary clustering and fall into the opposite extreme of overcorrection (Abadie et al., 2017; MacKinnon & Webb, 2019).

The first practice is often guided by several conventional criteria for clustering specification (Opdenakker & Van Damme, 2000). A basic one inheres in the conceptualization of the studied mechanism: researchers usually split models into the corresponding multiple levels (Cheong et al., 2001). However, this criterion alone often fails if a cluster level of the mechanism is historically overlooked (Martínez, 2012). Other criteria define cluster levels based on the stages of a sampling design and on the levels of treatment assignment in an experimental design (Abadie et al., 2017, 2020; Hedges & Rhoads, 2011; MacKinnon & Webb, 2020; Opdenakker & Van Damme, 2000; Raudenbush & Schwartz, 2020). However, a researcher may inadvertently omit a cluster level if she ignores the complex sampling structure (Niehaus et al., 2014; Wang et al., 2019; Zhu et al., 2012; Skinner & Wakefield, 2017). In some cases, the omission of a cluster level is forced by data restrictions. For example, many publicly available datasets do not provide linkable identification numbers across cluster levels (e.g., classrooms) due to data ethics concerns (Conaway et al., 2015). Also, it is not surprising that many published studies do not fully describe their sampling designs or provide original data, so readers could reasonably question whether the clustering dependencies are modeled correctly.

Conventionally, researchers may also model a cluster level if the size of the clustering dependency, measured by the intraclass correlation coefficient (ICC), is considerable. Earlier research suggests a rule of thumb of including a cluster level in modeling when its ICC is larger than 0.05. Noticeably, since there are no statistical tests or definite thresholds of ICCs for making a modeling decision, researchers may judge the ICCs based on evidence from previous research. However, evidence from prior research could come from contexts different from the current one, thus leading to an inaccurate judgment of the empirical ICCs. Nonetheless, Musca et al. (2011) found that the Type I error rate is always higher than the conventional 5% when clustering is ignored, even with an ICC as low as 0.01, across many conditions of group size. Therefore, the ICC may not be sufficient for deciding whether a level should be included in the analysis. In modeling with more than one level of clustering, judgments based on multiple ICCs may become even more complicated. Alternatively, power analysis of experimental designs cannot solve the question of how many levels to model, either. Designed to determine the sample size needed to achieve the desired power of the statistical hypothesis tests, a power analysis is conducted with a presumption about the levels of clustering in the design (Berger & Wong, 2009; Cohen, 1992; van Breukelen & Moerbeek, 2016). If a level of clustering matters but is omitted in the design, a detected educational mechanism or effective intervention may have adequate power, but for an incorrect inference (see Konstantopoulos, 2008a).

The assumptions associated with the second practice, not modeling unnecessary clustering, are also often unsatisfied, since current guidelines on when to account for clustering remain vague. For example, analytical guidance would state that a clustering of interest should be accounted for when the estimates change compared with models that exclude the cluster level (e.g., Van den Noortgate et al., 2005; Cameron & Miller, 2015). This kind of statement does not explain the rationale of why the cluster gives rise to clustering or when to cluster. This gap causes two major misconceptions. One is that whenever a level of cluster can be defined, whether inherently or by design, a clustering dependency is possible and thus needs to be modeled.
Another misconception is that a cluster level needs to be modeled whenever adding it would change the standard error estimates. It is often the case that empirical researchers choose the larger standard error estimates that account for clustering dependency in order to avoid committing a Type I error, without justifying whether the clustering is real and must be adjusted for (Robinson, 2020). To dispel these misconceptions, a theoretical framework of when a cluster level is necessary, and thus should be controlled, is pivotal. This argument has been highlighted in Abadie et al. (2017). Along with Hedges and Rhoads (2011), these studies clarify that the standard error estimates should be corrected if the clustering is due to a multi-stage sampling design or a randomized experiment by clusters.

1.3 Research Questions and Goals

Despite strong analytical evidence of the risks of omitting levels of clustering and the urgent need for practical guidance on judging whether an estimated model adequately accounts for clustering, it is still unclear which misspecifications of the random effects of the cluster levels may or may not lead to incorrect results. The above discussion motivates the current study to ask the following questions: 1) When should a cluster level and the corresponding clustering dependency be explicitly modeled, and when could they be omitted? 2) If an essential level is omitted in modeling, whether, and by how much, would the omitted clustering dependency affect the robustness of inference?

This study investigates these questions in settings with a true three-level clustered data structure, while a cluster level, either at the highest, middle, or lowest level, is omitted in the estimated two-level models. Applying insights from Abadie et al. (2017) and Hedges and Rhoads (2011), the first research question is answered by building a theoretical framework of when a middle or highest cluster level is produced by sampling and experimental designs but is omitted in modeling. In the case of an omitted lowest cluster level, the theoretical argument switches to the serial correlation dependency due to the chronologically ordered nature of repeated measures. Answering the first question helps clarify empirical decisions about whether a cluster level should or should not be modeled to avoid either Type I or Type II errors, and it improves the analytical identification of the consequences of an omitted cluster level.

The second question is answered by analytically quantifying the magnitude of the bias of the standard error estimates of the slope estimates of predictors at each level. Previous studies examining the issues of omitting a cluster level commonly use simulations to show empirical evidence of bias in estimates and threats to robust inferences. The simulation approach, which assumes a known correct model to compare with the other, false ones, has advantages in setting extensive ranges for parameters and models. Nonetheless, though those simulations reveal valuable general patterns, how the bias is produced mathematically remains a black box. This study complements that simulation-based evidence with closed-form standard error correction formulas. These formulas, showing the relationship between the bias and the clustering parameters (i.e., ICCs and cluster sample sizes) of the omitted cluster level, can identify where the omitted clustering is hidden or distributed to other levels and how statistical inferences are affected. In other words, aligning with the theoretical framework already established, the sources of clustering dependency are also clarified. The approach is introduced in Section 1.4 below.
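Schematically, and in generic placeholder notation rather than the exact forms derived in Chapters 2, 4, and 5, the corrections take a design-effect-like form in which the estimated standard error is rescaled by the square root of a variance correction driven by the omitted level's clustering parameters:

% Generic form of the correction (VOC: variance correction due to omitted
% clustering, derived separately for each omission scenario in later chapters):
\widehat{SE}_{\mathrm{adjusted}}(\hat{\gamma})
  = \widehat{SE}_{\mathrm{estimated}}(\hat{\gamma})\,\sqrt{\mathrm{VOC}},
\qquad
\mathrm{VOC} = f\!\left(\rho_{\mathrm{omitted}},\, n_{\mathrm{omitted}}\right)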
Finally, with the development of such formulas for standard errors and bias as a function of clustering, this study is further able to develop a sensitivity analysis framework for researchers to quantify the robustness of inferences (or effect sizes) and the risk of making a false hypothesis decision based on the clustering degree of the suspected omitted cluster level. This sensitivity analysis framework contributes to filling the gaps in current methodological research and to bridging to empirical studies that require guidance in making decisions about modeling specific cluster levels. In particular, this sensitivity analysis framework is desirable in practice when the omitted cluster level cannot be included in the modeling.

1.4 Combining the Benefits of the Model-Based and the Design-Based Approaches

While the model-based approach, HLM, explicitly models the multilevel clustering dependency with random effects, the design-based approach provides statistical corrections to the standard error estimates (Cameron & Miller, 2015; Cheong et al., 2001; McNeish & Wentzel, 2017). The design-based approach is prevalent in the fields of survey studies and economics, where the corrections are called the Design Effect (DEFF) and Cluster Robust Standard Errors (CRSE). In a two-level sampling data structure, DEFF is derived from the ratio of the variance of an estimate that takes the clustering into account to the variance that ignores the clustering (Kish, 1995; Snijders, 2005), which is $\text{DEFF} = 1 + (\bar{n}_k - 1)\rho$, where $\rho$ is the ICC and $\bar{n}_k$ is the average size of the clusters k. In the field of economics, CRSE is widely applied to many structures of clustering (see a detailed survey in Cameron & Miller, 2015). Generally, CRSE provides a mathematical expression of the variance-covariance structure with an index measuring the error variance (Snijders & Bosker, 2011). In a simplified approximate CRSE case when the homoscedasticity assumption holds[1] (as set in the current study), this index is derived as the Moulton Factor (MF) (Angrist & Pischke, 2009; Moulton, 1986, 1990). The Moulton Factor is essentially close to DEFF since it is also derived from the ratio of the variance of an estimate with the clustering effect to the variance without the clustering effect[2]. The standard error estimates are corrected by the square root of DEFF or MF, which is equivalent to the model-based two-level HLM (Cheong et al., 2001; Huang, 2018; Niehaus et al., 2014). An empirical example showing this equivalency can be seen in Claessens (2012).

[1] Correlated within-cluster error terms require a covariance matrix estimator that is robust to arbitrary patterns of both heteroskedasticity and intra-cluster correlation (MacKinnon & Webb, 2020). The current work dealing with omitted clustering focuses on the omission of the latter and assumes homoskedasticity. The homoscedasticity setting implies that any heteroskedastic patterns in the specified and modeled clustering have already been corrected, and that the assumption still holds after the omitted cluster level is included. In the later chapters' model settings, the cluster sizes of the omitted cluster level are set to be relatively equal, and there is no heterogeneity across clusters. A discussion of modeling heterogeneous random effect variance within an empirical education setting can be seen in Leckie, French, Charlton, and Browne (2014).

[2] The Moulton factor is $\text{MF} = 1 + [V(n_k)/\bar{n} + \bar{n} - 1]\rho_Z \rho$. The Moulton factor uses $V(n_k)/\bar{n}$ (i.e., the average variance of the cluster size deviation) to account for the variation of unequal cluster sizes; this is equivalent to Skinner's DEFF. Additionally, compared with the DEFF, the Moulton factor also has $\rho_Z$, which is the within-cluster correlation of the predictor Z. When the predictor Z is at the aggregated level, it is perfectly correlated within clusters and $\rho_Z$ equals 1, and MF thus approaches DEFF. Abadie et al. (2017) argue that the size of the clustering adjustment depends on how strongly the predictor is correlated within clusters. In the current study, the uncertainty due to $\rho_Z$ is less of a research interest, as the focus is on cluster-level predictors.
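As a small numerical illustration of this square-root correction (a sketch with invented numbers, not results from any study cited here):

import numpy as np

def deff(n_bar, icc):
    """Kish two-level design effect: 1 + (n_bar - 1) * icc."""
    return 1.0 + (n_bar - 1.0) * icc

se_naive = 0.040           # illustrative SE from an analysis ignoring clustering
n_bar, icc = 20, 0.10      # 20 students per cluster, ICC of .10
se_adjusted = se_naive * np.sqrt(deff(n_bar, icc))
print(round(se_adjusted, 4))   # 0.0681: the corrected SE is about 70% larger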
Moulton factor uses (i.e., average variance of the cluster size deviation) to account into the variation of unequal cluster sizes. This is equiva lent to the Skinner DEFF . Additionally, compared with the DEFF, Moulton factor also has , which is the within - cluster correlation of the predictor Z . When the predictor Z is at the aggregated level, this is perfectly correlated and equals to 1, and , thus, approaches to DEFF . Abadie et al. (2017) argues that t he reg is. In the current study, the uncertainty due to is less of a research interest as for cluster - level predictors. 10 al., 2001; Huang, 201 8 ; Niehaus et al. , 2014). An empirical example s howing this equivalency can be seen in Claessens (2012). The current study considers a three - level clustered data where two layers of clustering are observed while one clustering is omitted in a two - level model, and innovatively applies the method of DE FF to correct for the standard error bias due to the omitted layer of clustering dependency in the estimated two - level HLM model. In t his way, closed forms of formulas of quantifying the bias of the standard error estimates can be derived the same as the D EFF . The only difference from the DEFF is that the denominat or of the formulas here are the variance estimates from the two - level models, where partial clustering dependency were captured albeit not fully. C heong et al. (2001) have shown the potential of t his idea. Employing a national representative survey data, t hey compared the standard error estimates from a three - level model with the ones from a two - level model omitting the middle cluster level while having been corrected by CRSE for the two - level HLM estimated model. Those standard error estimates of the late r approach that combined model - and design - based approaches are found to be comparable to the empirical standard errors and the ones from the true three - level model. Current literature has provi ded other developed approaches to deal with the same issue o f omitting a cluster level. For example, Raykov et al. (2016) addresses the question of omitting a middle cluster level through considering the potential size of the middle cluster level variance in the estimation of the confidence intervals of testing the cluster level variances. In Hedges and Rhodes (2011), corrections were made to F - test statistics in two - level data while the clustering is omitted. Comparing with these studies, the approach deve loped in the current study is beneficial in ways of expressing the different sources of clustering dependency as fu nctions of the clustering parameters of random effects variance and sample size of clusters. Further, plugging 11 in plausible values of the clu stering parameters, empirical research can use the sensitivity analysis to evaluate the estimated model and transpa rently show their analytic decisions (Abe & Gee, 2014) 1.5 Summary of F indings This study presents closed forms of formulas that quantify the standard error estimation bias due to omitted clustering dependency. A general pattern found is that, if a clu ster - level predictor of interest is falsely disaggregated to the lower levels since it is not explicitly modeled, its standard error estimate of the coefficient estimate is underestimated. Specifically, if the middle cluster level is omitted, the middle - le vel cluster predictor that is disaggregated at the lowest individual level has an underestimated standard error estimate of its coefficient. 
If the highest cluster level is omitted, the standard error estimate of the coefficient of the highest-cluster-level predictor (which is falsely disaggregated at the middle level) is underestimated. Similar patterns apply when single-level OLS models are the estimated models.

If the upper adjacent cluster level is omitted, the standard error estimates are overestimated. This pattern is found in the case where the highest cluster level is omitted and the standard error estimate of a coefficient defined at the middle cluster level is upwardly biased, leading to Type II errors. In the same vein, though the lowest-level predictor is usually not the focus of research interest, its standard error estimate is overestimated if the adjacent higher cluster level is omitted. An exceptional pattern is that, if the middle cluster level is omitted in the estimated two-level model, the standard error estimate of the highest-level predictor's coefficient is approximately unbiased. This is because the overall dependency is captured in the estimated two-level model, though the sources of clustering are entangled. Lastly, if the omitted level is not adjacent to the level of the predictor of interest, such as when the lowest-level predictor is the predictor of interest and the highest cluster level is omitted, the corresponding standard error estimate remains unbiased. The magnitude of the standard error estimate bias can be calculated with the derived formulas.

Furthermore, combined with empirical studies as examples, this study encourages empirical researchers to utilize the developed sensitivity analysis framework to diagnose whether a hypothesized omitted clustering would result in estimation bias considerable enough to invalidate the inferences. The sensitivity analysis is of best use when researchers or readers suspect a potential issue of an omitted cluster level due to design, while data restrictions or other reasons make the model-based approach of modeling that level implausible.

1.6 Structure of this Study

This study continues with four chapters: three chapters (i.e., Chapters 2, 4, and 5) discuss the scenarios of cluster omission at the middle, highest, and lowest levels, respectively, and one chapter (i.e., Chapter 3) develops the sensitivity analysis framework. In Chapter 2, the discussion of omitting the middle cluster level in two-level HLM models is based on a theoretical framework of omitted cluster levels due to sampling and experimental designs. For better implication significance, the current study takes the prevalently used nationally representative survey datasets initiated by the National Center for Education Statistics (NCES) as examples. In Chapter 3, the sensitivity analysis framework provides three measures for quantifying inference robustness. An empirical example is provided to demonstrate the sensitivity analysis in testing the inference robustness when a middle cluster level is omitted. The structure of Chapter 4, which discusses the omission of the highest cluster level in two-level models, is identical to that of Chapter 2, including the theoretical framework and the variance inflation factor derivation process, though the specific scenarios and examples of omitting the highest cluster level in sampling and experimental designs are given. Also, an empirical study using the sensitivity analysis framework is presented.
Finally, Chapter 5 discusses the case of omitting the lowest cluster level, where the error variance-covariance structure is misspecified (i.e., the serial correlation in the repeated measures is omitted) in two-level growth modeling.

CHAPTER 2 OMITTED THE MIDDLE CLUSTER LEVEL

2.1 Introduction

The intermediate level reflects important social activities. For example, in educational research, classrooms and teachers, lying between students and schools, contain rich educational processes (Martínez, 2012; Raudenbush, 2008; Raudenbush & Sadoff, 2008). Empirical studies often employ three-level HLM models to fully reveal the relationships among predictors at the student, classroom, and school levels (such as Bryk & Raudenbush, 1989; H. C. Hill et al., 2005; Nye et al., 2004). Empirical studies may also choose, on theoretical or methodological grounds, not to model the middle classroom level and to conduct two-level models instead. For example, Martínez (2012) argued that focusing on school-level effects might overlook the within-school dynamics of classrooms. In the estimated two-level model, the omitted between-classroom variation is repartitioned into the school and student levels. This repartition of random effects largely impacts the conclusions drawn for schools, since the classrooms often explain a significant proportion of variance, often far more than what the schools can explain (Martínez, 2012). The issue of an overlooked middle level arises in other social structures as well. For example, Vaezghasemi et al. (2016) found that households, which lie between individuals and residential communities, are rarely considered as a level when individual outcomes are examined within communities.

Still, the current literature lacks a practical guide to inform under what circumstances the middle cluster level is necessary to model in order to represent a complete and accurate educational or social mechanism. This chapter intends to fill this gap by investigating scenarios of omitting a middle cluster level in two-level HLM analyses due to research design. In Sections 2.2 and 2.3, those scenarios are classified into mechanisms of two- or three-stage sampling designs and of CRTs. This classification helps to clarify when the middle clustering dependency matters in modeling. Furthermore, this chapter aims to answer how the estimates of predictors at each level would be impacted if the middle cluster level is essential due to design but omitted in modeling. Previous research mainly analyzed the impacts on random effects in unconditional models; this chapter extends the settings to conventional empirical models with predictors and covariates. Section 2.4 details the settings of two- and three-level HLM models with predictors of interest at each level, based on the two mechanisms discussed. To answer the question of how much the omitted cluster level matters, this chapter derives mathematical formulas to quantify the estimation bias of random effects and standard errors. The developed formulas adjust the standard error estimates of coefficients and are defined as the variance correction due to omitted clustering (VOC). In a format similar to the design effect, VOC is a function of the intraclass correlation coefficient and the sample size of the omitted cluster level. It further informs the sensitivity analysis framework, with implications for empirical examples, in Chapter 3. A simulation study in Section 2.4.3 is designed to examine the performance of the bias quantification formulas; empirically meaningful VOC parameters are selected for the simulation study. Finally, Section 2.5 concludes.
2.2 Omitted Middle Level Due to Sampling Design

Table 2.1 summarizes when the middle cluster level is omitted in two-level HLM due to sampling design. One scenario is when the SSUs, as the middle cluster level, are excluded from modeling in a three-stage sampling design; the other is when the omitted middle cluster level appears incidentally, instead of deliberately, in a two-stage sampling design. The following considers these two scenarios in a typical educational setting where students are nested within classrooms, classrooms are nested within schools, and a treatment is randomly assigned to schools. The omitted middle level is hypothesized to be classrooms and teachers. The current study examines empirical findings that aim to generalize to a broader population than the sampled schools and classrooms, rather than findings fixed to the sampled schools and classrooms. In this case, the clustering effects of schools and classrooms matter and need to be considered in modeling (Schochet, 2008; Abadie et al., 2017).

Table 2.1 Scenarios of the middle cluster level in two- and three-stage sampling designs and two- and three-level estimated models.

Two-level estimated model:
  Three-stage sampling (e.g., students within classrooms within schools): omits the middle classroom cluster level.
  Two-stage sampling (e.g., students within schools): corresponds with the sampling design, while omitting the incidental cluster level.
Three-level estimated model:
  Three-stage sampling: corresponds with the sampling design.
  Two-stage sampling: counts in the incidental cluster level.

2.2.1 Omitting SSUs in a Three-stage Sampling Structure Data

Consider a dataset that has a three-stage sampling design where schools are PSUs, classrooms are SSUs, and students are USUs. The three-stage design effect accounts for these two sources of clustering to adjust the standard error estimates (Chen & Rust, 2017; Skinner et al., 1989; Valliant et al., 2013)[3]:

$\text{DEFF}_3 = 1 + (n_1 n_2 - 1)\rho_1 + (n_1 - 1)\rho_2$,

where $\rho_1$ and $\rho_2$ are, correspondingly, the first-stage (school-level) and second-stage (classroom-level) intraclass correlation coefficients, and $n_2$ and $n_1$ are the average numbers of classrooms per school and students per classroom. Equivalently, a model-based approach, i.e., a three-level HLM model, explicitly analyzes the clustering dependency of students within classrooms and the clustering dependency of classrooms within schools.

[3] The current study assumes no stratification in the sampling design for simplification purposes. However, stratification is commonly used in educational sampling designs. For example, schools, as PSUs, are first stratified by census units. If the stratification is ignored, a Type II error occurs, but this is less of a concern when studies prefer conservative results. In Chen and Rust (2017), design effect formulas incorporate stratification with multiple stages.

In practice, the second stage of sampling may be purposively omitted for model simplicity, especially when the substantive research question is not directly related to the middle level (Stapleton & Kang, 2018; Konstantopoulos, 2008a). In the case of omitting the middle cluster level, adapting the original two-stage sampling design effect in Kish (1995), the design effect of a simplified structure with PSUs of schools and USUs of students, disregarding the SSUs of classrooms, is

$\text{DEFF}^{(\text{SSU-})} = 1 + (\tilde{n} - 1)\tilde{\rho}$.

The superscript notes the setting of SSU omission. $\tilde{n}$ is the average number of students within a school and equals the product of $n_1$ and $n_2$. $\tilde{\rho}$ measures the similarity of students within schools.

Figure 2.1 visualizes the intraclass correlation structure of the three-stage sampled data in the upper panel and the one with the omitted SSUs in the lower panel.

Figure 2.1 Data correlation structures of three-stage sampling designs when the secondary sampling stage is included and omitted.

In the complete structure of a three-stage sampling, $\rho_1 + \rho_2$ and $\rho_1$ capture the within-classroom-within-school and between-classroom-within-school clustering dependency within a school PSU, as presented in a dashed box. As schools are the randomly sampled PSUs, correlations across schools are zero. When the SSUs are omitted in modeling, as shadowed in the lower panel of the figure, the within-school dependency is captured by $\tilde{\rho}$, regardless of classrooms. The between-school independence assumption still holds. Intuitively, since the overall clustering dependency remains the same as in the complete structure, $\tilde{\rho}$ is a function of $\rho_1$, $\rho_2$, and the cluster sizes. The existing design effect literature has not been extended to define the mathematical relationship between $\tilde{\rho}$ and ($\rho_1$, $\rho_2$), and the consequences for the precision of parameter estimates are less known. This unknown relationship can be solved through the model-based HLM approach, by quantifying the relationship between the variance-covariance (or intraclass correlation) structure of the three-level HLM model and that of the two-level model. Section 2.4 details the solution.
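To see the two design effects side by side, the short sketch below implements the formulas above in the chapter's notation (in Python, with illustrative parameter values). If the within-school correlation of the simplified structure were naively equated with the school-level ICC $\rho_1$, the omitted-SSU design effect would miss the $(n_1 - 1)\rho_2$ within-classroom contribution, which previews why $\tilde{\rho}$ must absorb both sources of dependency.

def deff_three_stage(n1, n2, rho1, rho2):
    """Three-stage design effect: 1 + (n1*n2 - 1)*rho1 + (n1 - 1)*rho2, with
    n1 students per classroom, n2 classrooms per school, rho1 the school-level
    ICC, and rho2 the classroom-level ICC."""
    return 1.0 + (n1 * n2 - 1.0) * rho1 + (n1 - 1.0) * rho2

def deff_omit_ssu(n_school, rho_school):
    """Design effect of the simplified structure: 1 + (n_school - 1)*rho_school,
    with n_school = n1 * n2 students per school."""
    return 1.0 + (n_school - 1.0) * rho_school

n1, n2, rho1, rho2 = 20, 4, 0.10, 0.10
print(deff_three_stage(n1, n2, rho1, rho2))     # 10.8
print(deff_omit_ssu(n1 * n2, rho_school=rho1))  # 8.9 if rho-tilde were naively set to rho1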
In Chen and Rust (2017), design effects formulas incorporate stratification with multiple stages. 18 Figure 2. 1 Data correlation structures of three - stage sampling desig ns when the secondary sampling stage is inc luded and omitted classroom - within - school and between - classroom - within - school clustering dependency within a school PSU as presented in a dashed box. As schools are the randomly sampled PSUs, correlations across s chools are zero. When the SSUs are omitted in modeling, as shadowed in the lower panel figure, the within - school dependency is captured by , regardless of classrooms. Between - school independency assumption still holds. Intuitively, since the overall clustering dependency remains the same as , is a function of , , and . The existing de sign effect literature has not been extended to define the mathematical relationship between and , and the consequences on parameter estimates precision are less known. This unknown relationsh ip can be solved through the model - based 19 HLM approach through quantifying the relationship of the variance - covariance or intraclass correlation structure of the three - level HLM model and the one of the two - level model. Section 2.3 will detail the solution. 2.2.2 Incidental Middle Level b etween P SUs and SSUs (or USUs) Educational datasets commonly provide additional survey data beyond the designed sampling units. For example, NCES datasets, including Early Childhood Longitudinal Study (ECLS), National Asses sment of Educational Progress (NAEP), an d Education Longitudinal Study (ELS), collected classroom - and teacher surveys to facilitate research to understand the within - school dynamics, even though the sampling designs did not present a known probability sam ple from classrooms. Wang et al. (2019) defined this scenario as emerging incidental middle cluster level in sampling. The cluster levels corresponding to sampling designs, such as the PSUs of schools and SSUs of students, are called deliberate levels (McN eish & Wentzel, 2017). When the samplin g design is two - staged structure, such as in NAEP, a two - level model is analytically sufficient to take into account the clustering dependency of students nesting within schools and provides unbiased standard error e stimates of the school - level predictors (Cheong et al., 2001; Moerbeek, 2004; Wang et al., 2019). However, the two - level model does not explicitly model the between - classroom variance and does not satisfy research interests that focus on between - classroom variance. Further, the two - level model c ould completely disregard any potential classroom - level effects or falsely disaggregate the classroom - level predictors at the lower student level. This case may be comparable to the well - documented issue of omitting a single level clustering dependency in a single - level model of OLS estimation . In the setting of disaggregated classroom - level predictors, an artificial homogeneity is introduced at the student level, which produces overestimated standard error estimates of the student - level predictors and 20 underestimated the standard errors for the classroom - level disaggregated predictors (Korendijk, Hox , et al., 2011). Wang et al. (2019) showed simulation evidence that the standard error estimates of the student - level pre dictors are unbiased. Section 2.4.3 of the current study further shows that th e inconsistency evidence in Wang et al. (2019) is not valid, but due to their parameter setting restrictions. 
When the research interests include between - cluster variations at d ifferent levels, conducting a three - level model is beneficial since it simulta neously incorporates the sampling stages and the incidental middle cluster level mechanisms. Even in an extreme situation where the between - classroom variations are nearly zero, the estimated variance of the student - and school - level random effects from th e three - level model would not be biased (Raykov et al., 2016), though the sampling variance estimates would change slightly due to the changes of degrees of freedom by the added cluster sample size and predictors of classrooms. Nevertheless, Wang et al. (2 019) and McNeish et al. (2017) summarized the pitfalls of conducting a three - level model, which are mainly (1) increased complexity of modeling assumptions and the increased risk s of violating the assumptions, and (2) the sparseness of the cluster number o f the incidental middle level may lead to biased estimates of the variance components. With these concerns and when the research interests only focus on the school - level predicto rs, a two - level model is preferred 4 . 4 Many studies have provided several solutions to address the second concern of small cluster numbers. McNeish and St apleton (2014) provided a review of such methods, including restricted maximum likelihood with Kenward Roger adjustment (see Kenward & Roger, 1997, 2009 ) and, alternative to maximum likelihood based approaches, Bayesian Markov Chain Monte Car lo (MCMC) (see Baldwin, & Fellingham ,2013; Hox, van de Schoot, & Matthijsse, 2012 ). However, these discussions mainly focused on addressing the issue within a two - level cluster data structure setting. More studies are needed to extend the discussion in a three - level cl ustering data structure and examine the methods when the middle cluster level sample size is small. In the current discussion of whether to include an incidental middle cluster level in modeling, the above - mentioned limitations could still af fect empirical - making. 21 Wang et al.(2019) provided modeling gui delines depending on the parameter of interest and listed a few empirical examples which employed the same ECLS data while made different modeling decisions of the incidental clus ter level (p. 575). For instance, Jennings and DiPrete (2010) explicitly mod eled the incidental teacher - level cluster since their research goal is to examine teacher effects on students' social and behavioral skills. While in Adelson, McCoach, and Gavin ( 2012), which studied school - level gifted programs' population average effect on student's achievement, the incidental classroom level is not modeled. Their modeling approach is legit since it corresponds to the sampling design that the classroom level is n ot a sampling stage, and the classroom - level effect is not the focus of the s tudy. Their study also avoided the pitfall of overcorrection if model any unnecessary clustering. Yet, practical guidelines of modeling choice with incidental middle cluster leve l have not been widely explored except in Cheong et al. (2001) and Wang et al .(2019). This led to conflicting modeling decisions in empirical research using the same data for similar research questions. For example, Fitchett and Heafner (2017) examined the teacher's professional characteristics and classroom instructions on student s' history achievement using the NAEP data. 
Therefore, even though teachers are not a deliberate sampling stage in NAEP, the authors explicitly modeled the teacher cluster level. However, Heafner, VanFossen, and Fitchett (2019), which employed NAEP as well , conducted a two - level model to examine student characteristics, courses and instructional variables, as well as demographic variables' effects on students' economics content kno wledge. The incidental classroom level is suspected to be omitted, and a key predictor of classroom instruction could be a classroom - level variable but falsely aggregated at the student level. Though the school - level predictors' standard error estimates ar e not biased, the standard error estimates of the key predictors of classroom - level instructions could be 22 underestimated and the ones of the student - level characteristics could be overestimated. That study will be soon introduced in Chapter 3 to demonstrat e the sensitivity analysis. 2.3 Omitted Middle Level in Cluster Randomized Trial Many CRT design a three - level structure with cluster levels of students, classrooms and teachers, as well as schools, where the randomization happens at the schools and the outcomes are at the student level (Spybrook, Kelcey, et al. , 201 6; Westine et al., 2013). Two sources of clustering exist in CRT (Schochet, 2008; Abadie et al., 2017): one is the random assignment of units to the control and treatment groups, and the other is the sampling of two - level of clusters from a broader populat ion as discussed in 2.1. In many cases, two - level models with students and schools are conducted , where the clustering due to assignment is captured while clustering due to sampling could be o nly partially captured. The omission of modeling the classroom l evel clustering effect could be the result of the two scenarios from sampling design that are discussed above. Consistent with the previous review, the point estimates of the school - level int ervention effect and standard error, as well as the minimum dete ctable effect size, which are of the most research interest in CRT, are nearly identical in three - and two - level models, regardless of the size of the teacher - level variance, size of clusters, and number of student - and school - level covariates, as evidence d in Murray et al. (1996) and Zhu et al. (2012). Equivalently, the corresponding design effect for the treatment group of schools is the same as the above , where the overall clu stering dependency within schools is captured (Hedges & Rhoads, 2011). However, the potential classroom - level effects and cross - level moderation effects of the intervention are ignored, which are pivot in CRT studies that aims to detect heterogeneous 23 treat ment effects and answer questions of how and under what conditions the intervention works beyond what works ( Spybrook et al., 2016; Spybrook, Zhang, et al. , 2020). Recently, scholars call for advancing the understanding of the implementation process of int erventions in school settings, such as how teachers deliver the treatment to students (Lendrum & Humphrey, 20 12). For example, teachers may be influenced by the local contexts and adapt the intervention process, and students could be assigned to teachers b ased on certain attributes of teachers, such as the experience of teaching or class schedule(Weiss, 2010; Wei ss et al. , 2016). 
Also, it is not uncommon that teachers are trained in groups for the intervention (as in Jayanthi et al., 2018), so that groups of teachers may conduct the intervention similarly. In these situations, students who share the same teacher, or who are exposed to the same teacher group, could receive the treatment in a similar manner. Therefore, if the CRT design considers the role of teachers, the correlation of treatment and clustering in a CRT would be a composition of treatment correlating within teachers (or teacher groups) and within schools. Abadie et al. (2017) mention potential treatment-provider variation while considering the classroom- and teacher-level effects as fixed, rather than intending to generalize the effects to a superpopulation of classrooms and teachers. The current study, on the contrary, considers the classrooms as SSUs in a three-stage sampling design or as an incidental cluster level that is not a sampling stage. The current work also explores the influence on the coefficients associated with student- and teacher-level predictors when the between-teacher variance is omitted in a two-level model, which is not studied in Zhu et al. (2012).

2.4 Quantification of Standard Error Bias

This section formulates the potential bias of the standard error estimates of predictors when a middle cluster level is omitted. The process of quantifying the bias is, in essence, a design-based approach, which compares the variance estimates from a satisfactory three-level random intercept model and an estimated two-level random intercept model. The models are set to cover the previously discussed scenarios of omitting the middle clusters due to sampling and experimental designs. Meanwhile, the notation used throughout the whole study is explained.

2.4.1 Model Setting

Two-level random intercept model. Consider first a two-level random intercept model with a continuous dependent variable $Y_{ik}$, which indicates the outcome of a student $i$ in a school $k$. The model is:

Student-level: $Y_{ik} = \beta_{0k} + \beta_1 X_{1ik} + \beta_2 X_{2ik} + \epsilon_{ik}$

School-level: $\beta_{0k} = \gamma_{00} + \gamma_{01} W_k + u_k$

Mixed model: $Y_{ik} = \gamma_{00} + \gamma_{01} W_k + \beta_1 X_{1ik} + \beta_2 X_{2ik} + u_k + \epsilon_{ik}$

$X_{1ik}$ and $X_{2ik}$ are treated as student-level predictors. $X_{1ik}$, for instance, can be the prior scores of students, a commonly used student-level covariate (e.g., Bloom et al., 2007). However, $X_{2ik}$ is actually a classroom-level measure, such as an attribute of the teacher, so that all students in the same class have the same value of $X_{2ik}$. This setting reflects the falsely disaggregated incidental-cluster-level predictor case. The random intercept $\beta_{0k}$ is predicted by a school-level predictor $W_k$ to capture the variability between schools. Whether $W_k$ is a continuous variable or a binary treatment indicator as in a CRT does not affect the later quantification of the potential bias of the variance estimates; a later section confirms this note. Additionally, the predictors are assumed to be orthogonal to the random effects at any level, satisfying the exogeneity assumption, because $X_{1ik}$ and $X_{2ik}$ are group-mean centered (Antonakis et al., 2019). To keep the conceptual example simple, I present each level with the minimum number of predictors, albeit many other covariates can be added. As long as the assumptions hold, the following algebraic expressions of the variance estimation and the quantification procedure of bias remain the same. Conventionally, the random effects are assumed to be normally distributed with means of zero and constant variances conditional on the predictors, and to have zero covariance: $\epsilon_{ik} \sim N(0, \tilde{\sigma}^2)$, $u_k \sim N(0, \tilde{\tau}^2)$, and $\mathrm{Cov}(\epsilon_{ik}, u_k) = 0$.
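To make the model pair concrete, the sketch below fits the estimated two-level model and the satisfactory three-level model (introduced formally in the next subsection) side by side with lme4, the same package used in the simulation of Section 2.4.3. The toy data, effect sizes, and variable names are illustrative assumptions made only for this sketch, not quantities from any study cited here.

```r
library(lme4)

# Toy balanced data: 30 schools x 4 classes x 8 students (illustrative only)
set.seed(42)
df <- expand.grid(stu = 1:8, class = 1:4, school = 1:30)
df$w  <- rnorm(30)[df$school]                         # school-level predictor
df$x2 <- rnorm(120)[(df$school - 1) * 4 + df$class]   # classroom-level predictor
df$x1 <- rnorm(nrow(df))                              # student-level predictor
df$y  <- 0.3 * df$w + 0.3 * df$x2 + 0.3 * df$x1 +
  rnorm(30,  sd = 0.5)[df$school] +                   # school random effect
  rnorm(120, sd = 0.5)[(df$school - 1) * 4 + df$class] + # classroom random effect
  rnorm(nrow(df))                                     # student residual

# Estimated two-level model: x2 (a classroom attribute) carried at student level
m2 <- lmer(y ~ x1 + x2 + w + (1 | school), data = df)

# Satisfactory three-level model: students in classrooms in schools;
# (1 | school/class) adds a classroom-within-school intercept
m3 <- lmer(y ~ x1 + x2 + w + (1 | school/class), data = df)

# Compare variance components and fixed-effect standard errors
print(VarCorr(m2)); print(VarCorr(m3))
```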
Tildes over the parameters distinguish the current two-level model from the later three-level model. The total sample size of students is $m = K \times N$, where $K$ is the number of schools and $N$ is the average number of students within a school (Footnote 5). For each school $k$, the error variance-covariance matrix of the combined error $u_k + \epsilon_{ik}$, denoted $\tilde{V}_k$, is composed of a residual variance matrix and a random intercept variance matrix:

$\tilde{V}_k = \tilde{\sigma}^2 I_N + \tilde{\tau}^2 J_N$,   (Eq. 2.1)

where $I_N$ is the $N \times N$ identity matrix, $J_N = \mathbf{1}_N \mathbf{1}_N'$, and $\mathbf{1}_N$ is an $N \times 1$ column vector of ones.

Footnote 5: The current study considers a balanced design as a starting point, where the cluster size is assumed to be the same (or almost identical) across cluster units. This setting provides closed forms of maximum likelihood estimates. Thus, I can compare the estimates across two- and three-level models and the OLS models in later chapters when fixing the coefficient estimates of the HLM and OLS to be equal (Nezlek & Zyzniewski, 1998). Van den Noortgate et al. (2005) provided simulation evidence of omitting a cluster level in unbalanced settings and found variance-covariance repartition patterns similar to those in balanced settings.

Further, $\tilde{V}_k$ can be written as

$\tilde{V}_k = (\tilde{\sigma}^2 + \tilde{\tau}^2)\left[(1-\tilde{\rho}) I_N + \tilde{\rho} J_N\right]$,

where $\tilde{\sigma}^2 + \tilde{\tau}^2$ is the total error variance, and $\tilde{\rho} = \tilde{\tau}^2/(\tilde{\sigma}^2 + \tilde{\tau}^2)$ is the proportion of variance at the school level (Footnote 6), or the intraclass correlation coefficient indicating the expected correlation of any two randomly drawn students in a school. The structure of $\tilde{V}_k$ is consistent with the purple dashed boxes in the lower panel of Figure 2.1. With the new ICC notation, Figure 2.2 below modifies Figure 2.1 to show the correlation structure of $\tilde{V}_k$ (lower panel) and of $V_k$ of the three-level model (upper panel) in the following discussion.

[Figure 2.2 Correlation structures of $V_k$ of the three-level model and $\tilde{V}_k$ of the two-level model omitting the middle cluster level.]

Footnote 6: The intraclass correlation coefficient can be conditioned on the predictors. For simpler notation, I do not add a subscript to indicate this.

Three-level Random Intercept Model. If a necessary classroom-level middle cluster emerges, a three-level model for students $i$ within classrooms $j$ within schools $k$ should be conducted:

Student-level: $Y_{ijk} = \pi_{0jk} + \pi_1 X_{1ijk} + \epsilon_{ijk}$

Classroom-level: $\pi_{0jk} = \beta_{00k} + \beta_{01} X_{2jk} + r_{jk}$

School-level: $\beta_{00k} = \gamma_{000} + \gamma_{001} W_k + u_k$

Mixed model: $Y_{ijk} = \gamma_{000} + \gamma_{001} W_k + \pi_1 X_{1ijk} + \beta_{01} X_{2jk} + u_k + r_{jk} + \epsilon_{ijk}$

Compared with the two-level model above, the three-level model has an additional classroom-level random effect $r_{jk}$, which indicates variability across teachers within schools. The predictor $X_{2jk}$ is now correctly specified at the middle cluster level to explain outcome mean differences across teachers within schools. The random effects are assumed to be normally distributed with means of zero and constant variances, $\epsilon_{ijk} \sim N(0, \sigma^2)$, $r_{jk} \sim N(0, \tau_c^2)$, and $u_k \sim N(0, \tau_s^2)$, and to have zero covariance with each other. The number of schools $K$ and the total sample size of students (i.e., $m = KN$) remain the same, regardless of adding or omitting the middle classroom cluster level. In the three-level model, $n$ is the cluster size of the lower nesting level (i.e., the average class size or the number of students taught by each teacher), and $J$ is the cluster size of the higher nesting level (i.e., the average number of teachers within each school). Also, $N = nJ$ is the average school size or the average number of students within a school. The following is the error variance-covariance matrix of a school $k$, with a structure consistent with the purple dashed boxes of the upper panel in Figures 2.1 and 2.2:

$V_k = \sigma^2 I_N + \tau_c^2 (I_J \otimes J_n) + \tau_s^2 J_N$.   (Eq. 2.2)
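The block structure of Eq. 2.2 can be made tangible by building $V_k$ for a toy school. The sketch below uses illustrative sizes and variance components chosen only for display; it is an illustration of the matrix algebra, not code from the dissertation.

```r
# Toy dimensions and variance components (illustrative values only)
n <- 3; J <- 2; N <- n * J          # class size, classes per school, school size
sigma2 <- 0.6; tau_c2 <- 0.2; tau_s2 <- 0.2

ones <- function(d) matrix(1, d, d)  # d x d matrix of ones (J_d)

# V_k = sigma^2 I_N + tau_c^2 (I_J kron J_n) + tau_s^2 J_N  (Eq. 2.2)
V_k <- sigma2 * diag(N) +
  tau_c2 * (diag(J) %x% ones(n)) +
  tau_s2 * ones(N)

# As a correlation matrix: same-classroom pairs show rho_s + rho_c,
# same-school/different-classroom pairs show rho_s only.
round(cov2cor(V_k), 3)
```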
As shown, $V_k$ and $\tilde{V}_k$ have the same dimensionality of $N \times N$; however, since the single nesting structure of the two-level model is now extended to two levels of nesting, $V_k$ is block-structured by classrooms. Its diagonal block $\Sigma_{jk}$ is of dimension $n \times n$ and is the highlighted area within each purple dashed box in the upper panel of Figure 2.2:

$\Sigma_{jk} = \sigma_t^2\left[(1 - \rho_s - \rho_c) I_n + (\rho_s + \rho_c) J_n\right]$, where $\sigma_t^2 = \sigma^2 + \tau_c^2 + \tau_s^2$.

The off-diagonal element in $\Sigma_{jk}$ is the intraclass correlation coefficient of any two students from the same classroom $j$ in a school $k$, $\rho_s + \rho_c$, with $\rho_s = \tau_s^2/\sigma_t^2$ and $\rho_c = \tau_c^2/\sigma_t^2$. Intuitively, $\rho_s + \rho_c$ combines the similarity of students exposed to the same school $k$ and the similarity of students exposed to the same classroom $j$: the similarity within the school $k$ is measured by $\rho_s$, and the average correlation of any two students from the same classroom is $\rho_s + \rho_c$. Other ways of defining the intraclass correlation coefficient exist; Appendix 2.A compares these approaches and presents the derivation of $\rho_s + \rho_c$. The coefficient $\rho_c$ represents the proportion of between-classroom-within-school variation; the unhighlighted parts within any purple dashed box in the upper panel of Figure 2.2 are the off-diagonal blocks of $V_k$, each of dimension $n \times n$, whose common element $\rho_s \sigma_t^2$ reflects the correlation of students from the same school but different classrooms.

As shown, the expected correlation among students from the same classroom (i.e., $\rho_s + \rho_c$) is larger than the expected correlation among students from the same school but different classrooms (i.e., $\rho_s$); this difference is measured by $\rho_c$. In the estimated two-level model, this similarity difference among cluster levels is ignored. Finally, since the schools as PSUs are the highest cluster level and are independent of each other, the correlations among students from different schools are set to 0, as shown in the cells outside of all dashed purple boxes in Figure 2.2. With some algebraic operations, $V_k$ can be written as:

$V_k = \sigma_t^2\left[(1 - \rho_s - \rho_c) I_N + \rho_c (I_J \otimes J_n) + \rho_s J_N\right]$,

where $I_N$ and $I_J$ are $N \times N$ and $J \times J$ identity matrices, and $J_n$ and $J_N$ are $n \times n$ and $N \times N$ matrices of ones.

Evidenced in Moerbeek (2004), Tranmer and Steel (2001), and Konstantopoulos (2007), the random effect variances of the three-level model are approximately repartitioned into those of the two-level model. Specifically, the omitted teacher-level variance is partially distributed to the flanking student and school levels as:

$\tilde{\sigma}^2 \approx \sigma^2 + (1 - r)\,\tau_c^2$   (Eq. 2.3)

and

$\tilde{\tau}^2 \approx \tau_s^2 + r\,\tau_c^2$,   (Eq. 2.4)

where $r = (n-1)/(N-1)$. Thus, the ratio measure $r$ of the classroom size to the school size decides the extent of the repartition of the omitted classroom-level variance into the student- and school-level variances. $r$ is restricted to $0 \le r \le 1$, since $1 \le n \le N$ and $J = N/n$ is an integer larger than or equal to 1. When $n = 1$ and $r = 0$, each classroom SSU has only one sampled student, and the between-classroom variance is absorbed by the estimated student-level variance in the two-level model. When $n = N$ and $r = 1$, all sampled students come from the only classroom SSU in a school PSU; the between-classroom-within-school variation is then actually zero, so the estimated two-level model is appropriate. Figure 2.3 below shows the range of $\tilde{\rho}$ in an example setting of class size $n$ and school size $N$. This restriction follows from Eq. 2.5 below, where $\tilde{\rho} = \rho_s + r\rho_c$, $0 \le r \le 1$, and $\rho_s + \rho_c \le 1$. In practice, defining the value of $\rho_c$ needs to respect this restriction when setting empirically meaningful random effect variances. For example, the value of $\tilde{\rho}$ decides the maximum value of $\rho_c$ that a researcher can set to satisfy the conditions $0 \le r \le 1$ and $\rho_s \ge 0$ when fixing $n$ and $N$ (or $J$). This discussion is further shown in the empirical example of implementing the sensitivity analysis in Chapter 3.

[Figure 2.3 Relationship among $\tilde{\rho}$, $\rho_c$, and $r$.]
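A small helper makes the repartition in Eqs. 2.3-2.5 concrete. The function below, a sketch written for this exposition with illustrative inputs, returns the two-level variance components and $\tilde{\rho}$ implied by assumed three-level components and cluster sizes.

```r
# Repartition of an omitted classroom-level variance (sketch of Eqs. 2.3-2.5).
# sigma2, tau_c2, tau_s2: three-level variance components;
# n: average class size; N: average school size.
repartition <- function(sigma2, tau_c2, tau_s2, n, N) {
  r <- (n - 1) / (N - 1)                 # repartition weight
  sigma2_tilde <- sigma2 + (1 - r) * tau_c2
  tau2_tilde   <- tau_s2 + r * tau_c2
  rho_tilde    <- tau2_tilde / (sigma2_tilde + tau2_tilde)
  c(sigma2_tilde = sigma2_tilde, tau2_tilde = tau2_tilde,
    rho_tilde = rho_tilde, r = r)
}

# Example: class size 10 in schools of 50 students
repartition(sigma2 = 0.6, tau_c2 = 0.2, tau_s2 = 0.2, n = 10, N = 50)
```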
The original $\tilde{\rho}$ of the two-level model now turns into a function of the two intraclass correlation coefficients of the three-level model:

$\tilde{\rho} = \rho_s + r\,\rho_c$.   (Eq. 2.5)

Further, $\rho_s \le \tilde{\rho} \le \rho_s + \rho_c$ (Footnote 7). Thus, if the classroom-level cluster is omitted, the estimated correlation overstates the between-classroom-within-school student correlation $\rho_s$ by $r\rho_c$ and understates the within-classroom student correlation $\rho_s + \rho_c$ by $(1-r)\rho_c$.

Footnote 7: The current paper assumes that the homogeneity assumption still holds when the teacher cluster level is included in the three-level model. Particularly, when the cluster sizes are equal (or at least have relatively small variance across clusters), the repartitioned variances, though their values depend on the size of $r$, remain constant across groups.

Throughout the whole study, I use the term satisfactory model to refer to the three-level model stated above, as it satisfies the specification of the three clustering levels and the corresponding random effects. I then name the two-level model omitting a necessary cluster level the estimated model. Table 2B.1 in Appendix 2B summarizes and compares the model specification, assumption, and estimation considerations of the two-level estimated model omitting the middle cluster level and the three-level satisfactory model. As shown, the only distinctions between the two-level and three-level models occur in the random effect specifications of the cluster levels and the allocation of the omitted cluster level's predictor. These distinctions due to omitting cluster levels are the research focus of the current study. Other model assumptions and specifications are conventional settings. Discussions of how to handle violations of those conventional assumptions in practice are beyond the current study's scope; some corrective techniques (such as for small cluster sizes) are noted in footnotes. In the later Discussion section, some assumptions closely related to the random effect specifications (such as the balanced design and the absence of random slopes) are considered as limitations and directions for future studies.

2.4.2 Quantifying the Standard Error Estimate Bias

Bias of the Standard Error Estimates of the Coefficient of $W_k$ (i.e., $\hat{\gamma}_{01}$ and $\hat{\gamma}_{001}$)

In the two-level model, the estimated variance of the coefficient of $W_k$ is:

$\widehat{\mathrm{Var}}_{2L}(\hat{\gamma}_{01}) = \widehat{\mathrm{Var}}_{OLS}(\hat{\gamma}_{01}) \times \lambda_2$, where $\lambda_2 = 1 + (N-1)\tilde{\rho}$.

In the CRT setting, $\mathrm{Var}(\hat{\gamma}_{01})$ is the variance of the intervention effect or, in other words, the variance of the group mean difference in outcomes, such that $\hat{\gamma}_{01} = \bar{Y}_T - \bar{Y}_C$. The standard error estimate is the square root of the corresponding diagonal element of the variance matrix. The subscript $2L$ indicates the two-level model. In a single-level analysis with OLS estimation, the variance estimate is $\widehat{\mathrm{Var}}_{OLS}(\hat{\gamma}_{01})$. Compared with $\widehat{\mathrm{Var}}_{2L}$, $\widehat{\mathrm{Var}}_{OLS}$ is smaller and thus leads to a Type I error. The ratio of $\widehat{\mathrm{Var}}_{2L}$ to $\widehat{\mathrm{Var}}_{OLS}$ is $\lambda_2$, which is known as the design effect (DEFF) of a two-stage sampling design in survey studies. It quantifies the variance inflation, or the overstated precision of the effect of $W_k$ if the sampling scheme were treated as a simple random sample. In economics, $\lambda_2$ is the Moulton factor (MF), which is robust to clustering but assumes homoskedasticity. The detailed derivation of $\lambda_2$ can be found in Angrist and Pischke (2008) and Cameron and Miller (2015).
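As a quick illustration of the two-stage DEFF (Moulton factor) logic, the sketch below inflates a naive OLS standard error by $\sqrt{\lambda_2}$. The numeric inputs are illustrative assumptions, not estimates from any study cited here.

```r
# Two-stage design effect (Moulton factor) adjustment of an OLS SE (sketch).
deff2 <- function(N, rho) 1 + (N - 1) * rho   # N: cluster size, rho: ICC

se_ols  <- 0.05                                # naive OLS standard error
lambda2 <- deff2(N = 50, rho = 0.20)
se_adj  <- se_ols * sqrt(lambda2)              # clustering-adjusted SE

c(lambda2 = lambda2, se_adjusted = se_adj)     # lambda2 = 10.8, SE ~ 0.164
```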
Similarly, when $W_k$ is modeled in the three-level model, the variance estimate yields

$\widehat{\mathrm{Var}}_{3L}(\hat{\gamma}_{001}) = \widehat{\mathrm{Var}}_{OLS}(\hat{\gamma}_{001}) \times \lambda_3$, where $\lambda_3 = 1 + (n-1)\rho_c + (N-1)\rho_s$.

Again, in a CRT, $\hat{\gamma}_{001} = \bar{Y}_T - \bar{Y}_C$. The index $\lambda_3$ is derived from the error variance-covariance matrix of the three-level model (i.e., Eq. 2.2); it is identical to the three-stage sampling design effect formulas shown in Chen and Rust (2017) and to an earlier three-level clinical CRT design in Heo and Leon (2008). Algebraically, the derivation of the weighting indices $\lambda_2$ and $\lambda_3$ is straightforward: the inflation terms of $\lambda_2$ and $\lambda_3$ are the sums of the intraclass correlation coefficients in their respective error structures, each weighted by one less than the corresponding cluster size. This implies that all between-cluster variance is captured by either index; however, one must still determine at which cluster levels the between-cluster variance exists. In essence, $\lambda_3$ takes into account the inflation due to the dependency of two levels of nesting (i.e., students nested within teachers, and teachers nested within schools), whereas $\lambda_2$ quantifies the variance inflation due to the dependency of a single level of nesting (i.e., students nested within schools). The following provides additional algebraic proofs.

The index that quantifies the bias of the variance estimate of $\hat{\gamma}_{01}$ due to the omitted middle cluster level is the ratio of $\lambda_3$ and $\lambda_2$:

$VOC_M^{(3\text{-}2,\,2L)} = \frac{\lambda_3}{\lambda_2} = \frac{1 + (n-1)\rho_c + (N-1)\rho_s}{1 + (N-1)\tilde{\rho}}$.

VOC stands for the Variance bias due to the Omitted Cluster level. The superscript (3-2, 2L) indicates that the predictor of interest is at level 3 but modeled at level 2 in a two-level cluster structure. The subscript M stands for the omitted middle cluster level case. The construction of the VOC follows the same logic as DEFF and MF, comparing the variance estimates with and without the omitted cluster level. In practice, researchers can compute the variance inflation magnitude by filling in possible values of the class size $n$ and the average correlation of students from the same class, $\rho_s + \rho_c$. Therefore, I re-express all variance inflation factors in terms of the known $\tilde{\rho}$ from the estimated two-level model and the assumed omitted-level clustering parameters $n$ and $\rho_c$. Since, by Eq. 2.5, $(N-1)\tilde{\rho} = (N-1)\rho_s + (n-1)\rho_c$, then

$VOC_M^{(3\text{-}2,\,2L)} = 1$,

which suggests that the estimated variance of the fixed effect of the school-level predictor does not need any bias correction when the teacher-level cluster is omitted. Since the omitted teacher-level variance is redistributed to the school and student levels, $\tilde{V}_k$ from the two-level model still takes the between-teacher variance into account. Equivalently, in the CRT setting, assuming half of the schools are randomly assigned to the treatment group (i.e., the sample size of each of the treatment and control groups is $K/2$ schools), the standard error formulas of $\hat{\gamma}_{001}$ and $\hat{\gamma}_{01}$ in the three- and two-level balanced CRT designs (Konstantopoulos, 2008a; Spybrook et al., 2016) are, respectively,

$SE(\hat{\gamma}_{001}) = \sqrt{\frac{4\left(N\tau_s^2 + n\tau_c^2 + \sigma^2\right)}{KN}}$ and $SE(\hat{\gamma}_{01}) = \sqrt{\frac{4\left(N\tilde{\tau}^2 + \tilde{\sigma}^2\right)}{KN}}$.

Plugging Eqs. 2.3-2.5 into the algebraic relationships between $(\tilde{\sigma}^2, \tilde{\tau}^2)$ and $(\sigma^2, \tau_c^2, \tau_s^2)$, the two standard error estimates are equal, since

$N\tilde{\tau}^2 + \tilde{\sigma}^2 = N\tau_s^2 + \tau_c^2\left[1 + (N-1)r\right] + \sigma^2 = N\tau_s^2 + n\tau_c^2 + \sigma^2$.

Therefore, if the predictor of interest is at the highest school level, whether a binary treatment or a continuous variable, the corresponding fixed effect standard error estimate is unbiased even if the teacher-level variance is omitted, assuming there is no cluster level higher than schools. This finding is consistent with Wang et al. (2019), Zhu et al. (2012), and Cheong et al. (2001).
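The identity $\lambda_2 = \lambda_3$ under the repartitioned ICC of Eq. 2.5 can be verified numerically. The snippet below, an illustration with assumed parameter values, computes both indices for one configuration.

```r
# Verify lambda_2 = lambda_3 under the repartitioned ICC (Eq. 2.5).
n <- 10; N <- 50                     # class size, school size
rho_c <- 0.15; rho_s <- 0.20         # classroom- and school-level ICCs

r         <- (n - 1) / (N - 1)
rho_tilde <- rho_s + r * rho_c       # Eq. 2.5

lambda3 <- 1 + (n - 1) * rho_c + (N - 1) * rho_s   # three-stage DEFF
lambda2 <- 1 + (N - 1) * rho_tilde                 # two-stage DEFF

c(lambda2 = lambda2, lambda3 = lambda3, VOC = lambda3 / lambda2)  # VOC = 1
```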
Extending to the extreme case where the clustering structure is completely ignored, as in a single-level analysis with OLS estimation, the variance estimate of $\hat{\gamma}_{01}$ needs an adjustment of:

$VOC_M^{(3\text{-}1,\,1L)} = \lambda_3 = 1 + (n-1)\rho_c + (N-1)\rho_s$.

Constructed by dividing $\widehat{\mathrm{Var}}_{3L}$ by $\widehat{\mathrm{Var}}_{OLS}$, $VOC_M^{(3\text{-}1,\,1L)}$ reflects the two-layer nesting structure, and the magnitude of the adjustment depends on the clustering parameters of the intraclass correlation coefficients and cluster sizes. Further, $VOC_M^{(3\text{-}1,\,1L)}$ is also equivalent to $\lambda_2 = 1 + (N-1)\tilde{\rho}$, which captures the total between-cluster variance while blurring the levels of the clustering structure.

Bias of the Standard Error Estimates of the Coefficient of $X_{2jk}$ (i.e., $\hat{\beta}_{01}$ and $\hat{\beta}_2$)

The following discussion switches to the teacher-level predictor, which is falsely disaggregated to the lowest student level as $X_{2ik}$, omitting the teacher-level variance in an estimated two-level model. The inflation of the variance estimate of $\hat{\beta}_2$ is quantified similarly as above, though the focus shifts from the two-layer clustering to the single layer of omitted clustering of students nested within teachers. In this simplification, the true error variance-covariance structure only needs to consider the within-classroom block $\Sigma_{jk}$ of $V_k$ from the three-level model instead of the whole structure of $V_k$. This true variance estimate of $\hat{\beta}_{01}$ produces a variance weighting index

$\lambda_{X_2} = 1 + (n-1)(\rho_s + \rho_c)$.

Dividing by the variance estimate of the single-level analysis with OLS estimation, which falsely assumes students are independent within classrooms and schools, the variance inflation measure yields

$VOC_M^{(2\text{-}1,\,1L)} = 1 + (n-1)(\rho_s + \rho_c)$,

which contains the between-school variance ($\rho_s$) and the between-classroom variance ($\rho_c$) of a correctly specified three-level model. Further, $VOC_M^{(2\text{-}1,\,1L)}$ can be rewritten as a function of the known $\tilde{\rho}$ from the estimated two-level model and the unknown $n$ and $\rho_c$ that researchers can specify:

$VOC_M^{(2\text{-}1,\,1L)} = 1 + (n-1)\left[\tilde{\rho} + (1-r)\rho_c\right]$.

Intuitively, the bracket quantifies the omitted clustering dependency, which consists of (1) the overestimated school-level variance, as represented by $\tilde{\rho}$ from the estimated two-level model, and (2) the uncaptured classroom-level variance $\rho_c$. Noticeably, these two components are weighted by $r = (n-1)/(N-1)$. Relative to the estimated two-level model, which already absorbs part of this dependency through $\tilde{\rho}$, the remaining adjustment for the disaggregated classroom-level predictor is approximately $VOC_M^{(2\text{-}1,\,2L)} = \left[1 + (n-1)(\rho_s + \rho_c)\right]/\left[1 + (n-1)\tilde{\rho}\right]$. When $n = 1$, each sampled classroom contains a single student; the classroom-level predictor then actually measures the singleton sampled student, and it can be disaggregated at the student level. When $\rho_c = 0$, the school-level variance estimate is not overestimated, and $\tilde{\rho}$ equals $\rho_s$. In this case, the estimated two-level model is satisfactory, since the classroom cluster level does not need to be explicitly modeled to produce unbiased random effect estimates at the student and school levels. However, if the classroom-level predictor is still of research interest and is modeled as disaggregated at the student level in a single-level analysis, its coefficient's standard error still needs to be adjusted by the square root of $1 + (n-1)\rho_s$, as the clustering at the higher school level still exists. Further, if $\rho_s = \rho_c = 0$, the cluster-level predictors $W_k$ and $X_{2jk}$ (as shown above) do not need any clustering adjustments anymore. On this occasion, a single-level analysis using OLS estimation is sufficient, as the data effectively has a simple random sampling design.

When the estimated model is a single-level OLS model, the variance inflation issue of the disaggregated classroom-level predictor is equivalent to the well-documented simple two-level clustering situation, in which the teacher-level predictor is modeled at the student level. Consequently, the variance adjustment is constructed by dividing the variance of the satisfactory two-level analysis, which accounts for the clustering of students nested within classrooms through the error variance-covariance structure $\Sigma_{jk}$, by the variance of the single-level analysis. The resulting variance inflation adjustment index for $X_{2jk}$ in the single-level analysis using OLS estimation is the same as the two-stage DEFF (or MF): $1 + (n-1)(\rho_s + \rho_c)$.
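The sketch below wraps the above index in a function, expressed in terms of the quantities a researcher would actually hold: the estimated $\tilde{\rho}$ plus hypothesized values of $n$, $N$, and $\rho_c$. The numeric inputs are illustrative assumptions.

```r
# VOC for a classroom-level predictor disaggregated to the student level,
# relative to a single-level OLS analysis (sketch of the index above).
voc_classroom <- function(rho_tilde, rho_c, n, N) {
  r <- (n - 1) / (N - 1)
  1 + (n - 1) * (rho_tilde + (1 - r) * rho_c)
}

# Example: rho_tilde = 0.23 from a two-level fit; hypothesize rho_c = 0.15
voc <- voc_classroom(rho_tilde = 0.23, rho_c = 0.15, n = 10, N = 50)
voc        # variance inflation factor (~4.2 here)
sqrt(voc)  # multiply the naive SE by this amount
```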
Obviously, fixing the teacher-level variance, $VOC_M^{(2\text{-}1,\,1L)}$ and $\lambda_{X_2}$ increase as the average class size $n$ increases. Therefore, the variance adjustment is more needed in models conducted for large-class-size contexts than for small ones. Meanwhile, the number of classrooms in a sampled school (i.e., $J$) constrains, in practice, the potential values of $n$ and $r$. This point becomes relevant in Chapter 3, in which an empirical example of omitting the middle cluster level demonstrates the sensitivity analysis framework. Furthermore, $\lambda_{X_2}$ is smaller than $\lambda_3$ by $(N-n)\rho_s$, which is intuitive: two sources of clustering dependency affect the standard error estimate of the highest-level coefficient, while only a single middle-level clustering affects that of the classroom-level coefficient. In other words, in the single-level analysis using OLS estimation, a Type I error issue could be more pronounced for the highest-level predictor than for the middle-level one.

Bias of the Standard Error Estimates of the Coefficient of $X_{1ijk}$ (i.e., $\hat{\pi}_1$ and $\hat{\beta}_1$)

Finally, although the student-level predictor is not the focus of the current study, its standard error estimate is upwardly biased when the clustering structure is omitted (Footnote 8). As evidenced in Moerbeek (2004) and Snijders (2005), the standard error of the coefficient of an individual-level predictor in a random-intercept-only model tends to be upwardly biased when the adjacent upper cluster level is omitted in either the two- or single-level models. A Type II error is also undesired, since important individual-level predictor effects could be masked as insignificant. In a satisfactory random intercept two-level HLM, the design effect of the variance estimate of a group-mean-centered $\hat{\beta}_1$ relative to OLS is approximately $1 - \tilde{\rho}$, which is less than 1 when $\tilde{\rho} > 0$, indicating that the multi-stage sampling design is more efficient than simple random sampling in this setting (Snijders, 2005). It is easy to extend to the three-level case for the variance estimate adjustment of the coefficient of $X_{1ijk}$ from the OLS estimation case:

$VOC_M^{(1\text{-}1,\,1L)} \approx 1 - \rho_s - \rho_c$.

For the estimated two-level model case,

$VOC_M^{(1\text{-}1,\,2L)} \approx \frac{1 - \rho_s - \rho_c}{1 - \tilde{\rho}}$,

which is the ratio of the design effects of the satisfactory three-level model and the false two-level model. In Chapter 4, the main predictor of interest encounters the same issue, and a detailed derivation procedure is provided there.

Footnote 8: When $X_{1ijk}$ is the predictor of interest while the cluster-level predictors and the random effects are not the foci, researchers could employ the fixed effects framework to account for the overall clustering dependency. However, when the cluster-level predictors are of research interest, the fixed effects approach is less optimal. In the current setting, when the omitted cluster level data is not available, the shown design-based approach with the sensitivity analysis framework (in Chapter 3) could be preferred.

Lastly, Table 2.2 below summarizes the VOCs of the cluster-level predictors when omitting the teacher-level cluster only and when omitting the clustering structure completely.

Table 2.2 A summary of VOCs when the middle cluster level is omitted in three-level structured clustering data.

Predictor (three-level HLM) | Modeled level (two-level HLM) | Variance adjustment (two-level HLM) | Modeled level (single-level OLS) | Variance adjustment (single-level OLS)
Student $X_1$ | Student | $(1-\rho_s-\rho_c)/(1-\tilde{\rho})$ | Student | $1-\rho_s-\rho_c$
Teacher $X_2$ | Student (disaggregated) | $\left[1+(n-1)(\rho_s+\rho_c)\right]/\left[1+(n-1)\tilde{\rho}\right]$ | Student (disaggregated) | $1+(n-1)(\rho_s+\rho_c)$
School $W$ | School | 1 (no adjustment) | School | $1+(n-1)\rho_c+(N-1)\rho_s$

Note. Predictors shown as disaggregated correspond to cluster levels that are omitted.

2.4.3 Simulation Results

A simulation study is designed to test the estimation bias when the middle cluster level is omitted and the performance of the derived VOC formulas.
In total, 12 conditions of standardized random effect variances and cluster sizes are set, and 500 replications are generated for each condition. The total sample sizes of students ($m$) and schools ($K$) are 5,000 and 100, respectively, which fixes an average school size (i.e., the average number of sampled students within a school) of $N = 50$. The average class size (i.e., the average number of sampled students within a class) is set to $n$ = 5, 10, and 25. The corresponding average numbers of sampled classrooms or teachers in a school are $J$ = 10, 5, and 2, and the ratio measures $r$ of the class and school sizes are 0.08, 0.18, and 0.49. Hedges and Hedberg (2007) provided a comprehensive list of ICCs for planning CRTs based on commonly used multi-stage sampled educational data sets, such as ECLS-K and the National Education Longitudinal Study (NELS). They found that the ICCs are around 0.2 across all grades of all sampled schools. Therefore, with the total variance standardized, the values of the random effect variances in the current study cover the conventional situations in which the school-level random effects are relatively small and relatively large. Then, the teacher-level random effect variances are set to 0.2, 0.5, and 0.7 to meet the conditions of being equal to, larger than, and smaller than the school-level random effects. Finally, the simulation study employed the R package lme4 (Bates et al., 2015), with restricted maximum likelihood (REML) specified for estimating the variance components to accommodate the cases with small cluster samples. The index of relative bias is computed to measure the magnitude of the estimation bias:

$RB(\hat{\theta}) = \frac{\hat{\theta} - \theta}{\theta}$,

where $\theta$ represents the true parameters from the three-level model, including the random effect variances and the standard errors of the teacher-level predictor $X_{2jk}$ and the school-level predictor $W_k$. Correspondingly, $\hat{\theta}$ represents the estimates from the estimated two-level model or the disaggregated OLS estimation. Falsely estimated models lead $RB$ to deviate from zero: a negative $RB$ represents underestimation, and a positive $RB$ represents overestimation. Similarly, a relative bias index of the adjusted estimates is provided to show the need for, and the performance of, the adjustments, in which $\hat{\theta}$ becomes the estimate adjusted by the VOCs or by the repartitioned variance formulas; the better the adjustment performs, the closer this relative bias is to zero. The simulation outputs are summarized in the following, and Appendix 2.C lists the parameter settings and provides detailed simulation results.
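A minimal version of one simulation condition is sketched below: generate balanced three-level data, fit the misspecified two-level model with lme4 under REML, and compute the relative bias of the school-level variance component. The single condition and specific values shown are illustrative of the design rather than a reproduction of all 12 conditions.

```r
library(lme4)

set.seed(1)
K <- 100; n <- 10; J <- 5; N <- n * J        # 100 schools, class size 10
sigma2 <- 0.6; tau_c2 <- 0.2; tau_s2 <- 0.2  # standardized variance components

# Balanced three-level data: students in classrooms in schools
df <- expand.grid(stu = 1:n, class = 1:J, school = 1:K)
df$class_id <- (df$school - 1) * J + df$class
df$y <- rnorm(K,     sd = sqrt(tau_s2))[df$school] +
        rnorm(K * J, sd = sqrt(tau_c2))[df$class_id] +
        rnorm(nrow(df), sd = sqrt(sigma2))

# Misspecified two-level fit (classroom level omitted), REML
m2 <- lmer(y ~ 1 + (1 | school), data = df, REML = TRUE)
tau2_hat <- as.data.frame(VarCorr(m2))$vcov[1]   # school-level variance

# Relative bias against the true school-level variance;
# expected to be positive (overestimation), roughly r * tau_c2 / tau_s2
(tau2_hat - tau_s2) / tau_s2
```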
Bias of the Random Effects and the Adjustment Performance. The estimated two-level models overestimated the individual-level residual variance and the school-level random effect variance: the mean $RB$s are all positive and increase with increased $\rho_c$ or $n$. With increasing $n$ and $r$, the magnitude of the overestimation of $\tilde{\tau}^2$ increases while that of $\tilde{\sigma}^2$ decreases. When the omitted between-classroom-within-school and the between-school variation each take only 20% of the total variance (i.e., $\rho_c = \rho_s = 0.2$) and the individual residual takes most of the total variance, the overestimation of $\tilde{\tau}^2$ is small, particularly when the average classroom size is relatively small (i.e., $n = 5$), with a mean $RB$ of less than 0.01. Under the same conditions, however, the overestimation of the residual variance is large, with estimates capable of being around three times as large as the true parameter. In an extreme converse case, where the omitted between-classroom-within-school variation is considerably large (i.e., $\rho_c = 0.7$) and the individual variance and the between-school variation are small, $\tilde{\tau}^2$ can be twice as large as the true parameter $\tau_s^2$, and the overestimation of $\tilde{\sigma}^2$ is extreme, reaching over seven times the true parameter $\sigma^2$. These patterns are consistent with Eqs. 2.3-2.5, under which the two flanking variance components absorb the complementary shares $(1-r)\tau_c^2$ and $r\tau_c^2$ of the omitted classroom variance under the same conditions of $\rho_c$ and $r$. Moreover, the adjusted variances performed considerably well, as the mean $RB$s are close to 0 across all conditions.

Bias of the Standard Error Estimates of the Coefficient of $W_k$ and the Adjustment Performance. The absolute mean $RB$s of the standard error estimates of $\hat{\gamma}_{01}$ in the two-level models are all very close to 0 (less than 0.01), which supports the previous derivation of $VOC_M^{(3\text{-}2,\,2L)} = 1$. In the single-level model using OLS estimation, the standard error estimates of $\hat{\gamma}_{01}$ are consistently underestimated: the mean $RB$s are all negative, and their standard deviations are nearly zero. The standard error estimates are only around 20 to 30 percent of the true parameter, which is relatively stable across all conditions. This is because OLS estimation ignores the overall error clustering dependency, so distinguishing the sources of clustering matters less.

Bias of the Standard Error Estimates of the Coefficient of $X_2$ and the Adjustment Performance. As shown by the negative mean and nearly zero standard deviation of $RB$, the standard error estimates of the coefficient of $X_{2ik}$ in the estimated two-level models and in the OLS-estimated single-level models are downwardly biased in all conditions. In the two-level models, the standard error estimates are most underestimated when the omitted $\rho_c$ and $n$ are large and when $r$ is small. Under the largest settings of $\rho_c$ and $n$, the standard error estimates can be half or even only 20 percent of the parameter; under more moderate settings, the standard error estimates can still be only 40 to 70 percent of the parameter, which is non-trivial. Further, in the extreme case in which the individual residual variance is considerably small, the underestimation of the standard error estimates is comparable whether the majority of the clustering dependency comes from the school level or from the classroom level; this is intuitive from $\lambda_{X_2}$, which depends on the sum $\rho_s + \rho_c$. These patterns are also found in the single-level models, where the underestimation is positively related to the sizes of $\rho_s$ and $\rho_c$. The performance of the VOC adjustments is generally good in almost all cases, with the absolute mean and standard deviation of $RB$ less than or around 0.1. One exception in the two-level models occurs under the most extreme omitted-clustering setting, where the underestimation adjustment is not sufficient: the adjusted standard error estimate is around 75 percent of the true parameter, though this is a large improvement over the unadjusted estimate of 20 percent of the true parameter. In the single-level models under one condition, the standard error estimates are over-corrected, with adjusted estimates on average 20% larger than the parameter; in that case, the underestimation bias from the single-level model was already close to zero (i.e., -0.05), so little adjustment was required in the first place.

Bias of the Standard Error Estimates of the Coefficient of $X_1$ and the Adjustment Performance. Finally, the simulation found evidence of overestimation bias in the standard error estimates of the coefficients of $X_{1ijk}$ and $X_{1ik}$.
This finding is consistent with Moerbeek (2004). Particularly in cases where $\rho_c$ and $\rho_s$ are large, the bias is substantial. When $\rho_c$ and $n$ are small, the $RB$s of the two-level HLM are less than 0.1. This resonates with the simulation settings with small $\rho_c$ and $n$, and with the corresponding evidence showing that the standard error estimate of $\hat{\gamma}_{01}$ is unbiased.

2.5 Discussion and Conclusion

Extending an emerging body of research debating whether a middle cluster level matters in deciding between a two- and a three-level model, this chapter summarizes and clarifies when a two-level model omitting the middle cluster level would impact the standard error estimates in the settings of multi-stage sampling and CRT designs. In previous studies, the relevant evidence was often shown through simulation and empirical analyses as examples. The current study complements that evidence by producing critical formulations quantifying the standard error estimation bias (i.e., the correction indices, the VOCs), which are functions of the clustering parameters of the omitted middle cluster level. Simulation evidence is provided with settings of practical K-12 education contexts to aid empirical implications. Also, the findings shown by the VOC formulas provide a general conclusion about the statistical mechanisms causing the bias and its degree. The VOCs are specifically listed in Table 2.2 above.

As for recommendations on modeling three- versus two-level models: if the middle cluster level is a deliberate stage in sampling, even if this level is not directly related to the research questions, this cluster level should be explicitly modeled to correctly reflect the complete picture of the study design, including the sampling stages and the levels of the experimental mechanisms. An estimated two-level model omitting the middle cluster level should have its random effect variance estimates corrected, whereas it would not produce biased standard error estimates for the coefficients of the third-level predictors. If the middle cluster level is incidental rather than a deliberate sampling stage or a level receiving treatment assignment, whether to model this level as random effects largely depends on whether the research interests relate to predictors at this middle level. Often, the middle cluster level conveys important mechanisms, so researchers would prefer to include this level and its corresponding predictors in a three-level model. In particular, a two-level model in this situation can easily and falsely disaggregate the middle-level predictors to the lowest level. In this case, the standard error estimates of the disaggregated middle-level predictors' coefficients need to be corrected to avoid Type I errors.

This study also extends the omission of a single middle cluster level to the complete omission of the clustering at both the middle and highest levels. This extension contributes to the conventional design-based robust standard error studies, which capture the overall dependency without distinguishing the sources of dependency in multilevel data structures. This point is best supported by the VOC derivation for the highest-cluster-level predictor. In addition to the one-omitted-cluster-level scenario, this chapter thus also extends to the case in which the overall clustering dependency is omitted because the estimated model is a single-level model. Then the cluster-level predictors' estimates would face Type I error issues, and the individual-level predictor would face a Type II error.
Moreover, the Type I error issue is more pronounced for the highest-level predictor than for the middle-level one. The above findings serve as empirical guidelines for researchers deciding whether the middle cluster level should be modeled. Further, combined with the sensitivity analysis framework and the empirical examples presented in the following Chapter 3, researchers can test the magnitude of the robustness of an inference if a potential middle cluster level is not modeled. The current model-based design sets the basic random intercept model as the satisfactory model. If a random-slope model is the satisfactory model, the error variance-covariance matrix $V_k$ and the standard error expressions should be accommodated accordingly (see Snijders & Bosker, 1993). However, the random intercept model is widely used in empirical education research and is an ideal starting point for more complex models in future research. Another limitation of this study is that the modeling setting assumes balanced designs, which is not always plausible in practice. Future work needs to develop VOCs that accommodate unbalanced situations, such as by including ratios of cluster sizes, particularly in CRT designs (Konstantopoulos, 2010).

CHAPTER 3 SENSITIVITY ANALYSIS FRAMEWORK OF OMITTED CLUSTERING

3.1 Introduction

Good scientific research is expected to present the best design and models that can answer the research questions and satisfy the model assumptions. However, as argued earlier, the issue of omitting a cluster level in a two-level HLM cannot be solved by a model-based approach (i.e., a three-level HLM) in many practical situations, such as data restrictions and unidentifiable error variance-covariance structures. Given these concerns about omitted clustering, Chapter 2 (and later Chapters 4 and 5) provided formulas to quantify the standard error estimation bias of the coefficients, which are functions of the clustering parameters (i.e., ICCs and cluster sample sizes) of the omitted cluster levels. Further, the current chapter builds a sensitivity analysis using the VOCs to test the magnitude of the inference robustness when the model-based approach is not feasible. In practice, if empirical researchers aim to know how robust the inference they made from an estimated model with a potentially omitted cluster level is, they may hypothesize the clustering parameters of the omitted cluster and utilize this sensitivity analysis framework.

In essence, the proposed sensitivity analysis evaluates the deviation of the estimated model from the ideal case in which all crucial random effects are correctly modeled and specified. Simply stated, the larger the deviation, the higher the bias of the standard error estimates due to the omitted clustering, and the lower the robustness of the statistical inference. Panel (a) of Figure 3.1 demonstrates this idea. As defined earlier in Chapter 2, the satisfactory model is the ideal model that meets all the clustering assumptions; it is unknown in practice and is thus outlined with dashed lines at the right end of the figure. The estimated model is the actually conducted model, hypothesized here to omit a necessary cluster level, which could produce biased estimates. The more the estimated model deviates from the satisfactory model due to the omitted clustering, the less robust it is.
The size of the deviation is quantified by the hypothesized clustering effect, via setting parameters of the ICCs and cluster sizes of the omitted cluster level. Consequently, even if the satisfactory model is unknown, it can be hypothesized in order to test how far the estimated model deviates. In Figure 3.1, Model A and Model B are two such hypothesized satisfactory models. Specifically, the estimated model deviates from Model B farther than from Model A, since Model B is set with larger clustering parameters. The size of the deviation from the estimated model to a hypothesized satisfactory model can be represented in terms of the size of the bias of the standard error estimates. Thus, if the deviation is considerable, the bias of the standard error estimates can be large enough to generate a false inference with either a Type I or a Type II error. Therefore, a threshold satisfactory model, defining the minimal deviation size needed to invalidate an inference, is added in panel (b) of Figure 3.1. This idea draws on Frank, Maroulis, et al. (2013), which defines a threshold at which the inference of a non-zero effect switches to one of no effect. The clustering setting of the threshold satisfactory model is then the threshold clustering of the omitted cluster, which can help researchers quantify the robustness of their inferences to omitted clusters. For example, if researchers think Model A fundamentally represents the omitted clustering, then the estimated model does not produce a false null hypothesis decision, since the threshold model is to the right of Model A. In this case, the estimated model would be acceptable, although its interpretations and implications should not be overstated. On the contrary, if Model B best represents the omitted clustering, the magnitude of the standard error estimate bias of the estimated model is large enough that the estimated model generates a false decision about the null hypothesis.

In Frank, Maroulis, et al. (2013), robustness is quantified by the amount of bias in an estimate needed to invalidate the inference, and the threshold estimate is often specified as the one for statistical significance associated with an exact p-value of 0.05. Switching to the current study, the magnitude of the robustness of inference is evaluated by the deviation of the estimated model from the hypothesized satisfactory models: the larger the deviation, the less robust the inference is with regard to the modeling specification of the clustering dependency. Unlike the fixed threshold estimate in Frank, Maroulis, et al. (2013), the position of the hypothesized satisfactory model defined in the current study is flexible, as shown in Figure 3.1; it changes along with the size of the omitted clustering degree (i.e., the VOCs). The current study accordingly constructs a sensitivity measure: the percentage of reduced robustness of inference. This measure quantifies the magnitude of threats to the robustness of inference due to an omitted cluster level. The initial robustness of the estimated model should be considered 100% when the estimated model and the hypothesized satisfactory model have no distance (i.e., no omitted clustering issue). If the estimated model has an omitted clustering issue, so that a deviation between the estimated and the hypothesized satisfactory model exists, its robustness magnitude is smaller than 100%; thus, as the deviation increases, the robustness decreases. Extending the sensitivity analysis to treatment evaluation studies, the percentage of reduced effect size is developed as a second sensitivity measure.
Further, when the hypothesized satisfactory model is to the right of the threshold model (such as the standard error associated with an exact p-value of 0.05), a measure evaluating the risk of making a false null hypothesis decision is provided. For example, in panel (b) of Figure 3.1, a red line presents the distance between the threshold satisfactory model and Model B. As the distance increases, the risk of making a Type I error (or a Type II error) increases. Further, the risk of an invalid inference can be compared across hypothesized satisfactory models. The following discussion focuses on the scenarios of making a Type I error, while Appendix 3.A provides the Type II error discussion. Section 3.2 starts with the simple scenario of conducting a false single-level analysis that omits a higher cluster level and leads to underestimated standard error estimates. The developed measures and formulas are easily applied to the false two-level HLM cases with an omitted cluster level, and they can also accommodate the Type II error cases in which the standard error estimates are upwardly biased (such as the omitted highest cluster level case in Chapter 4). Section 3.3 provides an empirical example of employing the developed sensitivity analysis. The empirical example serves the discussion in Chapter 2, where a two-level HLM is estimated while an incidental middle cluster level is potentially omitted.

3.2 Constructing the Sensitivity Measures for Inference Robustness of Clustering

In Frank, Maroulis, et al. (2013), the magnitude of the inference robustness was quantified by constructing a ratio of a coefficient estimate to a threshold coefficient. Since the standard error estimate is the focus of the current study in evaluating the impacts of the omitted clustering dependency, the current study constructs the ratio using the t statistics from the estimated and hypothesized satisfactory models and the t critical value at an alpha level of 0.05, fixing the coefficient estimate when a cluster level is omitted (McNeish, 2014). For example, consider an estimated single-level analysis with a continuous dependent variable $Y_{ik}$, indicating the outcome of a student $i$ in a school $k$, in which the school level is omitted, as shown in parentheses:

$Y_{ik} = b_0 + b_1 X_{ik} + (u_k) + \epsilon_{ik}$.

The coefficient estimate $\hat{b}_1$ of the predictor of interest has a corresponding standard error estimate $SE(\hat{b}_1)$.

[Figure 3.1 Graphic demonstrations conceptualizing the sensitivity analysis framework: (a) deviation of the estimated model from the unknown satisfactory model; (b) deviations of the estimated model from the hypothesized satisfactory models A and B; (c) deviations of the estimated model from the hypothesized satisfactory models B and C.]

With the omission of the higher cluster level of schools, the standard error estimate is downwardly biased and needs adjustment, becoming $SE(\hat{b}_1)\sqrt{VOC}$, while the point estimate remains the same. The VOC here is the design effect $1 + (N-1)\rho$, where the expected intraclass correlation $\rho$ and the average cluster size $N$ are the clustering parameters. Further, at the common 0.05 alpha level, the threshold model has the t critical value of 1.96 and the standard error of $\hat{b}_1/1.96$. The following uses the general coefficient notation $\hat{b}$ in place of $\hat{b}_1$.

3.2.1 Scenario of No Type I Error

This scenario presents the case of the estimated model (i.e., the single-level model using OLS estimation) deviating from the hypothesized satisfactory Model A with reduced inference robustness and effect size.
However, the deviation is not large enough to result in a Type I error, as Model A is to the left of the threshold model. After transforming into the t-statistic robustness framework, as shown in panel (a) of Figure 3.2, this scenario yields $t_{sat} > 1.96$: the t statistic from the estimated model, $t_{est} = \hat{b}/SE(\hat{b})$, is larger than the threshold by $\delta_{est}$ (i.e., $\delta_{est} = t_{est} - 1.96$), and the t statistic from Model A, $t_{sat} = t_{est}/\sqrt{VOC}$, is larger than the threshold by $\delta_{sat}$ (i.e., $\delta_{sat} = t_{sat} - 1.96$). The deviation of the estimated model from Model A is thus equivalent to the distance between those two differences of t statistics against 1.96 (i.e., $\delta_{est} - \delta_{sat} = t_{est} - t_{sat}$). The larger the distance, the larger the inflation of the t statistic of the estimated model, and the stronger the evidence of a reduced magnitude of robustness. Scaling by $t_{est}$, to quantify the size of the inflation relative to the t statistic, the percentage of reduced robustness is formulated as

$\%ReducedRobustness = \frac{t_{est} - t_{sat}}{t_{est}} = 1 - \frac{1}{\sqrt{VOC}}$.

Consider panel (a) of Figure 3.2 below: when $VOC = 1$ and $t_{est} = t_{sat}$, there is no bias in the standard error estimates due to potential omitted clustering. This is the case in which the estimated model is the best-practice model, whose initial robustness with regard to modeling clustering can be considered 100%. With the increase of the VOC, the initial robustness decreases by $1 - 1/\sqrt{VOC}$.

[Figure 3.2 Two scenarios of comparing the t statistics of the estimated model and the hypothesized models: (a) scenario of no Type I error; (b) scenario of having a Type I error.]

Further, I propose a measure of the change in effect size. In educational research, particularly in experimental design research, the generic idea of effect size is the standardized mean difference, the ratio of the treatment effect to a standard deviation (Hedges, 2007b). The effect size of the predictor of interest from the estimated single-level model is then $d_{est}$, where the numerator is the fixed coefficient, the denominator is the standard deviation, and the conversion between $d$ and the t statistic involves the total sample size $m$ (Footnote 9). Correspondingly, the effect size from the hypothesized satisfactory model is $d_{sat}$, obtained by fixing the coefficient while inflating its sampling variance by the VOC. Then, the percentage of the reduced effect size due to an omitted cluster level can be calculated as

$\%ReducedES = \frac{d_{est} - d_{sat}}{d_{est}}$,

which is identical to $1 - 1/\sqrt{VOC}$. As specified by the scenario setting, the VOC here is smaller than the threshold value, so the estimated model is acceptable in that it does not lead to a false decision about a non-effective intervention or mechanism. However, decisions made on the estimated effect size need to be cautious, as the satisfactory effect size can be smaller. In the context of education interventions and policy evaluations, there are several commonly used lenses for interpreting effect size, such as its magnitude, the cost of a program, and the scalability of programs (Kraft, 2020). As a complement, $\%ReducedES$ can be considered a sensitivity measure serving to quantify the uncertainty of effect size due to the omitted clustering effect. Noticeably, $\%ReducedES$ is different from the conventional sampling uncertainty measures of effect size, such as the standard error and confidence interval (see Cooper et al., 2019).

Footnote 9: This is the definition of Cohen's d (Cohen, 1962, 2009), while other definitions of effect size that satisfy specific research interests exist. A summary and comparison of commonly used effect size measures can be found in Fritz, Morris, and Richler (2012), and those developed for multilevel analysis can be seen in Hedges (2007).
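The two percentage measures share one formula. The helper below, a sketch written for this chapter's quantities with illustrative inputs, converts an estimated t statistic and a hypothesized VOC into the satisfactory-model t statistic and the percentage of reduced robustness (equivalently, reduced effect size).

```r
# Percentage of reduced robustness / effect size given a hypothesized VOC.
reduced_robustness <- function(t_est, voc) {
  t_sat <- t_est / sqrt(voc)        # t statistic under the satisfactory model
  list(t_sat = t_sat,
       pct_reduced = 1 - 1 / sqrt(voc),
       type_I_flag = (t_est > 1.96) && (t_sat < 1.96))  # threshold crossed?
}

reduced_robustness(t_est = 3.2, voc = 2.0)
# t_sat ~ 2.26, about 29% reduced robustness, no Type I error implied
```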
Obviously, the size of $\%ReducedES$ depends on the values of the hypothesized clustering degree (the VOC) and the original effect size estimate of the tested study. By hypothesizing meaningful settings of the clustering degree (i.e., the VOC and its parameters of ICC and cluster size) within the context of a certain study (Footnote 10), $\%ReducedES$ constructs an interval as well. Then, multiplying the original effect size estimate by the range of $\%ReducedES$, researchers gain an interval of effect size under plausible omitted clustering settings. The larger the VOC, the larger the reduction of effect size when fixing the original effect size; the wider the range of the VOC, the more uncertainty in a study due to the omitted clustering. When fixing the VOC, the same $\%ReducedES$ value could carry different meanings with respect to different original effect sizes. For example, at a reduction of roughly 30%, a large effect size estimate of 0.3 reduces only to 0.2, which is still considerably large and indicates an effective and significant program. However, a medium effect size estimate of 0.1 reduces to 0.07, which would lead to a judgment of less strength of the detected effect. As shown, although a 30% reduction of a small effect size (i.e., 0.03 in absolute terms in the example) is much smaller than a 30% reduction of a larger effect (i.e., 0.1 in the example), the judgments about the reduced effect size rest on the magnitude of the original effect size. It is an advantage of this measure that the reduction is expressed as a percentage of the original effect size rather than as an arbitrary absolute value. The interpretation of effect size depends largely on the research context (Hedges, 2008; C. J. Hill et al., 2008; Kraft, 2020). Though it is beyond the scope of this study to discuss benchmarks for interpreting the magnitude of effect size, the current study suggests employing the summarized schema for interpreting effect size along with the cost and scalability of programs from Kraft (2020, p. 20) when interpreting the magnitude of the reduced effect size.

Footnote 10: See Korendijk, Moerbeek.

Researchers need to make decisions about plausible values of the clustering parameters of the omitted cluster when applying the above sensitivity analysis. In the setting of no Type I error, the hypothesized VOC is always smaller than the threshold VOC. If the VOC can be larger than the threshold, the estimated model needs to further consider the Type I error issue discussed next.

3.2.2 Scenario of Having a Type I Error

A Type I error issue occurs when the estimated model is to the left of the threshold model while the hypothesized satisfactory model (i.e., Model B) is to the right, as shown in panel (b) of Figure 3.2: $t_{sat}$ is smaller than the threshold of 1.96 by $\delta_{sat}$ (i.e., $\delta_{sat} = 1.96 - t_{sat}$), while $t_{est}$ is larger than the threshold by $\delta_{est}$ (i.e., $\delta_{est} = t_{est} - 1.96$). The estimated model deviates from Model B by $\delta_{est} + \delta_{sat} = t_{est} - t_{sat}$. The quantification of the reduced robustness of inference and effect size is identical to that in the scenario of no Type I error. Further, as introduced earlier, a large distance between the threshold model and Model B suggests that the estimated model has a high possibility of making a Type I error. This Type I error risk can thus be quantified by the relative size of $\delta_{sat}$ within $t_{sat}$, fixing $t_{est}$:
This is because it is intuitive that when the satisfactory model has a t statistic that equals to the t threshold (that is ), the Type I error i ssue arises. Back to Pane l (c) of Figure 3.1 , it further demonstrates how the risk index can be utilized for comparing hypothesized satisfactory models. A hypothesized satisfactory Model C has a higher clustering setting than Model B, and thus being locat ed on the farther right of the threshold model than Model B, thus . Also, fixing , . That is, if Model C is the satisfactory model, the estimated model has a higher risk of having a Type I error issue 61 than if Model B is satis factory . Noticeably, since the relative size of in is considered in the formulation, the ratio of and is not as simple as . Researchers who intend to know the relative risks of having Type I errors across different cluster ing settings of VOC can further ut ilize a relative risk index of In this manner, the risk of making Type I error increases by a percentage of , if the omitted clustering setting of Model C is preferred than the one of Model B based on the research context . Finally, the above discussions focused on the Type I error issue. In Appendix 3.A, measures of robustness inferences are extended to the Type II error issue. 3. 2 . 3 Heuristics D iagram a nd I nterpretations of the S ensitivity A nalysis The heuristics diagram in Figure 3. 3 depicts a possible flow of conducting the se nsitivity analysis. Starting from the top of the diagram, researchers may first find the threshold . Solving 0 (i.e., ), yields The use of this threshold is straightforward, and it is of great use when empirical researchers need to anchor the threshold clustering parameters of the omitted cluster level. Further, researchers may set an empirical with meaningful clustering parameter values of what best satisfies their prior knowledge abo ut the suspected omitted cluster level. If the scientific is unlikely to be exceeded at the threshold , then researchers may worry 62 less about the Type I error but focus on the magnitude of reduc ed robustness of inferences and ef fect size. If exceeds the switch point value, then researchers need to further take into account the risk of having a Type I error. Setting a reasonable value, researchers can manipulate the implications of an omitted cluster by exploring many possible values of the clustering parameters. Figure 3. 3 Heuristics diagram of sensitivity analysis when the predictor of interest in the original single - level model is statistically sig nificant. Researchers can also conduct sensitivity analysis in the opposite direction. They may start with setting the clustering parameters to gain a , then judge with the . Enlightened by the work of Frank, Maroulis, et al. (2013), a sensi tivity analysis can be of the most practical use by empirical research when it is equipped with a scientific language for interpretations. Her e are the suggested interpretations of the above sensitivity analysis: 1) The robustness of inference (or effect size) reduces by x % (i.e., the values of or if the omitted cluster level has a clustering degree of y (i.e., the value). The clustering degree is characterized by and . 63 2) The risk of making Type I error increases by x % (i.e., the value of if the omitted cluster level has a clustering degree of y . 3. 
3.3 Implication of the Sensitivity Analysis: Using an Empirical Example

This section provides an empirical study example to show how to use the sensitivity analysis in assessing the robustness of inference when an incidental middle cluster level is omitted. The selected study is Heafner et al. (2019), which examined the effects of demographic and course-instruction-related variables on students' economics content knowledge. The employed data is the National Assessment of Educational Progress Economics Assessment (NAEP-E), which has a two-stage sampling design (with PSUs being schools and USUs being students). In that work, a two-level random intercept model is constructed, where the first and second levels are students and schools, because the authors mention that NAEP-E has data constraints in linking students to teachers, making a three-level model prohibitive (as seen on p. 331). In the final estimated model (see their Table 2 on p. 336), each level has corresponding demographic measures. Moreover, course type (such as AP course) and curricular and instructional exposure (such as internet use in a class) measures are assigned at the student level. It is reasonable to argue that some student-level predictors relevant to courses and instruction may be classroom-level predictors. For example, instructional exposure to reading in class and internet use for economic data may be uniform for students within the same classroom and teacher. Also, variations in the between-classrooms-within-schools cluster may be random. Therefore, the classroom level, as an incidental middle cluster level, is assumed to matter enough to be explicitly modeled.

The sensitivity analysis below, shown in Table 3.1, is performed to calculate the robustness of the inference for the student-level predictor of internet use for economic data. The statistics from the estimated model are presented in the estimated two-level HLM section of the table, including the regression coefficient (I use its absolute value in the sensitivity analysis for simplicity, which does not affect the results), the standard error estimate, the random effect variances conditioned on the predictors, the total number of sampled schools, and the average number of students within a school. Meanwhile, the hypothesized average number of students within a classroom ($n$) and the between-classroom variance need special attention, since together they determine whether the VOCs and the corresponding calculated statistics of the three-level model (such as the random effect variances) are plausible. In Table 3.1, three values of $n$ are hypothesized to provide cases of extremely small classroom cluster sizes as well as regular ones.

Following the steps shown in the heuristics diagram of Figure 3.3, I first find the threshold $VOC^{\#}$. This threshold is then used to calculate the corresponding threshold values of $\rho_c$ and the implied variance components. In the cases of $n$ = 2 and 10, the threshold-based $\rho_c$ is not plausible, since it exceeds the boundary of (0, 1). In these two cases, it is more meaningful to find the possible maximum and minimum $\rho_c$. For example, when $n = 2$, the maximum value of $\rho_c$ is 0.665, beyond which the regression estimates in the hypothesized three-level HLM are not feasible. Further, even when $\rho_c$ is large, the corresponding VOC would not exceed $VOC^{\#}$. Thus, there is no need for concern about a potential Type I error when the average classroom size is extremely small. However, the robustness of inference (or effect size) reduces by around 50% (i.e., the values of $\%ReducedRobustness$ or $\%ReducedES$), which is not trivial.
These settings reflect the no-Type-I-error scenario discussed earlier in Section 3.1.1. The following shows the scenario in which a Type I error arises. In the setting of $\bar n_c = 10$, a minimum classroom-level intraclass correlation is needed to yield eligible regression estimates in the hypothesized three-level HLM. This minimum is extremely small, 0.01, yet it can still lead to a Type I error, since the corresponding VOC exceeds the threshold. The risk of making a Type I error increases by 0.02 compared with the threshold setting, where the t statistic sits at the switch point of 1.96. Also, when $\bar n_c = 10$, the maximum feasible intraclass correlation is 0.58, with a Type I error risk increase of 0.24. Finally, when $\bar n_c = 7$, the threshold-based intraclass correlation is plausible, at 0.176, which means that any value larger than 0.176 could result in a Type I error, whereas values smaller than 0.176 could not. Two values, 0.5 and 0.1, are used to demonstrate this point.

This section went through the implication of the sensitivity analysis framework. As the example shows, inferring the magnitude of the robustness of inference depends heavily on the selection of the clustering parameters for the omitted cluster level. In practice, researchers should draw meaningful clustering parameters from previous research evidence to make the best argument about inference robustness. As the example also shows, the calculated between-classroom variation, as measured by the hypothesized intraclass correlations and variance components, is regulated jointly by the VOC formulas and empirical evidence. This evidence encourages researchers to be cautious about excluding the classroom level from modeling and assigning classroom-level predictors to other levels.

Table 3.1 Sensitivity analysis of the student-level predictor: internet use for economic data

Estimated two-level HLM: $J = 560$; $\bar n = 20$; $|\hat\gamma| = 1.44$; $SE = 0.400$; $t = 3.60$; $\hat\sigma^2 = 22.62$; $\hat\tau^2 = 8.58$; $\hat\rho = 0.275$; critical $t = 1.96$.

| $\bar n_c$ | Hypothesized $\rho_c$ | $\omega_c^2$ | VOC | Adjusted SE | Robustness reduction | Type I risk increase |
|---|---|---|---|---|---|---|
| 2 | threshold-based (2.215, not plausible) | NA | NA | NA | NA | NA |
| 2 | 0.665 (maximum) | 20.748 | 1.380 | 0.552 | 0.275 | NA |
| 2 | 0.095 (minimum) | 2.964 | 1.168 | 0.467 | 0.144 | NA |
| 7 | 0.176 (threshold) | 5.499 | 1.837 | 0.735 | 0.456 | switch point |
| 7 | 0.500 | 15.600 | 2.169 | 0.867 | 0.539 | 0.15 |
| 7 | 0.100 | 3.120 | 1.749 | 0.700 | 0.428 | NA |
| 10 | threshold-based (-0.021, not plausible) | NA | NA | NA | NA | NA |
| 10 | 0.010 (minimum) | 0.312 | 1.877 | 0.751 | 0.467 | 0.02 |
| 10 | 0.580 (maximum) | 18.096 | 2.494 | 0.998 | 0.599 | 0.24 |

Note. VOC is applied on the standard-error scale (adjusted SE = 0.400 × VOC; robustness reduction = 1 − 1/VOC). The original table also reports the implied three-level variance components for each scenario (including $\sigma^2$ values of 2.964, 11.946, 17.121, 19.500, 19.812, 22.308, and 4.524; $\tau^2$ values of 7.488, 8.424, 3.654, and 0.008; and school-level ICCs of 0.240, 0.270, 0.219, 0.117, and <0.001), which follow from the repartition identities with weights $(\bar n_c - 1)/(\bar n - 1)$ of 0.053, 0.316, and 0.474 for $\bar n_c$ = 2, 7, and 10.

CHAPTER 4 OMITTED HIGHEST CLUSTER LEVEL

4.1 Introduction

The contexts of schools and districts play important roles in many aspects of education, which has been a major topic in educational effectiveness studies (Gamoran et al., 2000; Rumberger & Palardy, 2004). In many respects, schools and districts provide particular social contexts, physical resources, and leadership distributions, and provoke varying student learning outcomes (Akerlof & Kranton, 2002; Fahle & Reardon, 2018; Muijs, 2020; Muller, 2015; Xia et al., 2020). Current educational databases, such as the NCES-initiated survey programs, provide many significant instruments measuring the contexts of schools and districts, as well as within-school and within-district variations (Muller, 2015). Methodologically, if this rich contextual information is omitted in modeling, studies may reach spurious conclusions, since the substantial but omitted between-school (or between-district) variation would be trapped in the lower levels of classrooms and teachers, whose impacts would thus be misrepresented (see Chapter 1).
This chapter addresses the analytical issues of omitting a highest cluster level (such as schools or districts) in a two-level HLM. Specifically, the chapter considers a conceptual two-level model of students nested within schools, with predictors at both levels, and assumes that an even higher cluster level of districts is omitted. Paralleling the treatment of the omitted middle cluster level, this chapter also applies the mechanisms of sampling and experimental design to the discussion of omitting a necessary highest cluster level, which helps answer when the highest-level clustering dependency matters in modeling. Popular educational survey data sets are used as examples for empirical concerns. Then, the question of how much the omitted cluster level matters in making a robust inference is answered by the derived VOC formulas and evidenced by a simulation study. Further, an empirical study example using a two-level model is provided to implement the VOCs within the sensitivity analysis framework developed in Chapter 3.

4.2 Omitted Highest Cluster Level in Sampling and Experimental Design

4.2.1 Omitting PSUs in a Three-Stage Sampling Structure Data

PSUs can be omitted in empirical analyses of data that have a three-stage sampling design. For instance, the publicly available versions of data sets (e.g., ECLS-K) often do not provide IDs linking the SSUs of schools to the PSUs of districts or counties.[11] In this case, two-level HLM models leave out the clustering of schools within districts or counties, although the clustering dependency due to students nesting within schools is modeled explicitly. The design effect of the true three-stage sampling is

$$DEFF_3 = 1 + (\bar n - 1)\rho_{SSU} + \bar n(\bar m - 1)\rho_{PSU},$$

where $\rho_{PSU}$ is the expected correlation among units of SSUs within a PSU, $\bar m$ is the sample number of SSUs within a PSU, $\rho_{SSU}$ is the expected correlation among USUs within an SSU, and $\bar n$ is the sample number of USUs within an SSU. With only one layer of clustering accounted for, from the second-stage sampling, the corresponding design effect is measured by the single estimated intraclass correlation $\hat\rho$ and $\bar n$ as

$$DEFF_2 = 1 + (\bar n - 1)\hat\rho.$$

Figure 4.1 below visualizes the structures of these two design effects. Obviously, when the first-stage sampling is omitted, $\rho_{PSU}$ is implicitly set to 0, since SSUs are now falsely assumed to be independent of each other even when they belong to the same PSU. Therefore, $DEFF_2$ is insufficient in two ways. One is that the two distinct sources of clustering measured by $\rho_{SSU}$ and $\rho_{PSU}$ are absorbed into a single clustering dependency (i.e., $\hat\rho$). The other is that the sampling structure is reduced from three stages to two. Immediately, the standard error of the estimate is misstated, because the effective sample size calculated from $DEFF_2$ is no longer the true effective sample size given by $DEFF_3$.

11 Sometimes, ignoring a sampling stage can also happen when the sampling scheme is not universal in a large survey study. For example, in some international survey programs, countries may vary their sampling schemes to accommodate local contexts, and researchers may easily treat the general sampling scheme as universal while ignoring certain exceptions. Chen and Rust (2017) introduced such a scenario in the Programme for International Student Assessment (PISA) 2015, which used a general two-stage sampling design in which the two stages are schools and students (OECD, 2015). PISA in Russia, however, used a three-stage sampling design in which geographical areas are PSUs, schools are SSUs, and students are USUs (OECD, 2015). The PSUs of geographical areas may easily be ignored if the two-stage scheme is taken as universal when the employed data are PISA of Russia.
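The two design effects can be compared directly. The sketch below uses the standard Kish-type forms for equal cluster sizes, consistent with the reconstruction above; the parameter names and the illustrative values are assumptions for demonstration, not figures from the dissertation.

```python
# A minimal sketch comparing the three-stage and two-stage design effects.
# rho_ssu: correlation among USUs within an SSU; rho_psu: correlation among
# USUs sharing a PSU but not an SSU; n_bar: USUs per SSU; m_bar: SSUs per PSU.

def deff_three_stage(rho_ssu, rho_psu, n_bar, m_bar):
    return 1 + (n_bar - 1) * rho_ssu + n_bar * (m_bar - 1) * rho_psu

def deff_two_stage(rho_hat, n_bar):
    # When the first stage is ignored, rho_psu is implicitly zero and both
    # sources of clustering are absorbed into the single rho_hat.
    return 1 + (n_bar - 1) * rho_hat

d3 = deff_three_stage(rho_ssu=0.15, rho_psu=0.05, n_bar=20, m_bar=5)
d2 = deff_two_stage(rho_hat=0.20, n_bar=20)
print(d3, d2)   # 7.85 vs 4.80: two different implied effective sample sizes
```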
Equivalently to the design-based approach, when a two-level HLM is fit to data with a three-stage sampling design, the omitted highest cluster level results in repartitioned random effects and a shrunken error variance structure. The comparison of the design effects resonates with Moerbeek (2004) and Opdenakker and Van Damme (2000), which provided simulation evidence that omitting the highest cluster level inflates the standard error estimates of the adjacent lower-level coefficients, and thus produces Type II errors. Later sections provide the detailed mathematical procedures for formulating the biased standard error estimates.

Figure 4.1 Data correlation structures of three-stage sampling designs when the first sampling stage is included versus omitted.

4.2.2 Incidental Highest Level above PSUs

Many times, a higher cluster level emerges even if it is not designed into the sampling but matters for answering the research questions. McNeish and Wentzel (2016) defined such a highest cluster level as an incidental level, to distinguish it from the deliberate levels of the sampling stages; they also provided several example scenarios in which such an incidental highest cluster level can occur. One is when individual two-level data sets are integrated into a single data set to investigate the research question with more statistical power. This scenario applies to meta-analysis, where individual effect sizes are nested within studies. Further, the studies are nested within investigators; thus, the investigators form an incidental highest cluster level, and the between-investigator variation can be relevant to the research question (Konstantopoulos, 2011). Another scenario is that, when a large sample of PSUs of schools is required, a relatively large sample of districts will be incidentally present as a higher cluster level, though it may not directly relate to the research questions. For example, the Education Longitudinal Study of 2002 (ELS:2002) has a two-stage sampling design in which schools are PSUs and students are USUs (Stapleton & Kang, 2018). With 16,197 sample members nationwide in ELS, the district level, with a considerably large sample size, is introduced naturally, while the linked IDs of schools and districts are not accessible in the public-use file. Hence, the district cluster level is omitted due to data restrictions.

The above examples require three-level models to account for the random variation at the incidental highest cluster level, particularly when the highest-level units are samples and the inferences are made to the population. Conversely, the incidental cluster level does not need to be included with a random effect when its units are population units. Take the study of Wong and Li (2008) as an example, which utilized a two-level model to examine school-level contextual effectiveness. As the authors stated that the sampled schools come from all 18 districts in the studied area, the districts are not required to be modeled as random. Similarly, the two-stage design-based approach with a sampling design effect is adequate for the clustering dependency due to sampling. In this case, based on the estimated two-level model, a fixed-effects framework can be further utilized for the higher-level districts (i.e., adding dummy variables indicating district membership) (McNeish & Kelley, 2019).
4.2.3 Omitted PSUs above the Level of Treatment Assignment

Consider a two-level model in a study where the outcome is at the individual student level and treatment is assigned at the higher school level. If the utilized data have a two-stage sampling design in which PSUs and USUs are schools and students respectively, and the statistical inference targets the population of schools, the estimated two-level model is appropriate to capture the clustering with the school-level random effect. This model is the typical CRT shown in Chapter 2. Now consider a three-stage sampling structure in which PSUs are districts, SSUs are schools, and USUs are students. The above two-level model is no longer sufficient, because the random effects of the highest cluster level are omitted. Furthermore, the CRT design becomes a Block Randomized Trial (BRT), since the schools within districts are randomly assigned to treatments. The conceptual differences between these two designs are depicted in Figure 4.2. If the true PSUs of districts are omitted or hidden (as shown by the dashed ovals below the dashed line), the experimental design can be falsely interpreted as treatments being assigned to schools completely at random, with all students in each school receiving the same treatment. With the presence of districts, schools within the blocks of districts are randomly assigned to treatments; schools remain clusters, since students in each school receive the same treatment. See Hedges and Rhoads (2010) for a summary of the relationships between BRT and CRT designs. Since the inference targets the populations of districts and schools, the three-level BRT model explicitly models the between-district variation with a district random effect. Conceptually, the clustering dependency due to sampling is then sufficiently captured in addition to the clustering due to assignment, whereas the (false) two-level CRT only models the latter source of clustering dependency.[12] This argument is consistent with Abadie et al. (2017), who note that clustering arises from the distinct rationales of sampling and assignment.

Figure 4.2 Omitted highest cluster level in a two-level CRT design

In the experimental design planning work of Hedges and Hedberg (2014), defining design parameters such as ICCs needs to consider the omission of districts as blocks when only the schools are kept as clusters. In such cases, the between-district variation is pooled into the between-school variation, and the school-level ICCs are larger than they should be (Hedges & Hedberg, 2014, p. 455). Still, the effects on standard error estimates of omitting the highest cluster level in experimental designs have not been extensively studied, and practical guidelines for empirical studies are particularly lacking.

12 Often, a three-level BRT model includes a random effect for the treatment-by-district interaction, since the treatment effect may vary across districts (Konstantopoulos, 2008a, 2008b). The current paper does not include the random slope of the treatment or the corresponding interaction term in the later modeling settings, to keep consistent with the random intercept setting of the whole study.
4.3 Quantification of Standard Error Bias

4.3.1 Model Setting

Following the examples above, in which the district cluster level is omitted, I first consider an estimated two-level random intercept model that only captures the clustering dependency of students (denoted $i$) nested within a school $j$:

Student-level: $Y_{ij} = \beta_{0j} + \beta_{1}X_{ij} + e_{ij}$

School-level: $\beta_{0j} = \gamma_{00} + \gamma_{01}W_{j} + \gamma_{02}Z_{(j)} + u_{j}$, $e_{ij} \sim N(0, \hat\sigma^2)$, $u_{j} \sim N(0, \hat\tau^2)$

Mixed model: $Y_{ij} = \gamma_{00} + \beta_{1}X_{ij} + \gamma_{01}W_{j} + \gamma_{02}Z_{(j)} + u_{j} + e_{ij}$,

where $X_{ij}$ and $W_{j}$ are student- and school-level predictors, and $Z_{(j)}$ is modeled at the school level whereas it is truly a district-level measure. Also, predictors are group-mean centered so that the exogeneity assumption holds. In the setting of a two-level CRT design, $W_j$ can be the binary treatment variable. The random effects $e_{ij}$ and $u_{j}$ are assumed to be normally distributed with zero means and variances $\hat\sigma^2$ and $\hat\tau^2$, respectively. Identical to Chapter 2, for each school $j$ among the total $J$ sample schools, the error variance-covariance matrix of the two-level model, denoted $\hat\Sigma_j$, is

$$\hat\Sigma_j = \hat\sigma^2 I_{\bar n} + \hat\tau^2 \mathbf{1}_{\bar n}\mathbf{1}_{\bar n}',$$

where $\bar n$ is the average school size (i.e., the average number of students within a school) and $\mathbf{1}_{\bar n}$ is a column vector of ones. There are $J$ sample schools, and the total sample size of students is thus $N = J\bar n$. The matrices $\hat\sigma^2 I_{\bar n}$ and $\hat\tau^2\mathbf{1}\mathbf{1}'$ reflect the composition of the variance components at the student and school level, respectively. Then,

$$\hat\rho = \frac{\hat\tau^2}{\hat\tau^2 + \hat\sigma^2}.$$

This ICC measures the expected correlation between any two randomly selected students from the same school.

Now consider the satisfactory three-level random intercept model, which includes the omitted highest level of districts (denoted $k$):

Student-level: $Y_{ijk} = \beta_{0jk} + \beta_{1}X_{ijk} + e_{ijk}$

School-level: $\beta_{0jk} = \delta_{00k} + \delta_{01}W_{jk} + u_{jk}$

District-level: $\delta_{00k} = \gamma_{000} + \gamma_{001}Z_{k} + v_{k}$, $v_{k} \sim N(0, \omega^2)$

Mixed model: $Y_{ijk} = \gamma_{000} + \beta_{1}X_{ijk} + \delta_{01}W_{jk} + \gamma_{001}Z_{k} + v_{k} + u_{jk} + e_{ijk}$.

The previously disaggregated predictor is now defined at the correct district level as $Z_k$. Further, the random effect of the district level is explicitly modeled and assumed to be normally distributed with mean zero and variance $\omega^2$. Also, the random effects of the student and school levels are assumed to have normal distributions with means of zero and variances $\sigma^2$ and $\tau^2$, respectively: $e_{ijk} \sim N(0, \sigma^2)$ and $u_{jk} \sim N(0, \tau^2)$. These random effects are independent of each other. The three-level model has two ICCs: the expected correlation among students within the same school and the same district,

$$\rho_s = \frac{\tau^2 + \omega^2}{\sigma^2 + \tau^2 + \omega^2},$$

and the expected correlation among students within the same district but from different schools,

$$\rho_d = \frac{\omega^2}{\sigma^2 + \tau^2 + \omega^2}.$$

The average district sample size (i.e., the average number of schools in a district) is $\bar m$. Also, there are $K$ total sample districts, and the average number of students in a district is $\bar m\bar n$. The error variance-covariance matrix of a district $k$ is

$$\Sigma_k = \sigma^2 I_{\bar m\bar n} + \tau^2 (I_{\bar m} \otimes \mathbf{1}_{\bar n}\mathbf{1}_{\bar n}') + \omega^2 \mathbf{1}_{\bar m\bar n}\mathbf{1}_{\bar m\bar n}',$$

where $I_{\bar m\bar n}$ and $I_{\bar m}$ are identity matrices with dimensions given by the average cluster sizes, and $\mathbf{1}_{\bar n}$ and $\mathbf{1}_{\bar m\bar n}$ are column vectors of ones. Conceptually, $\rho_s \geq \rho_d$. I also define $\rho_b = \tau^2/(\sigma^2 + \tau^2 + \omega^2)$, the proportion of the true between-school variance in the total error variance, which is smaller than $\hat\rho$ by $\rho_d$. The detailed definitional rationales of these ICCs have already been given in Chapter 2. Figure 4.3 demonstrates the error variance-covariance structure of $\Sigma_k$ from the three-level model and $\hat\Sigma_j$ from the two-level model omitting the highest cluster level of districts. Noticeably, compared with $\Sigma_k$, the error correlation structure of the two-level model shrank from the block diagonal matrices (i.e., the purple dashed boxes) to the diagonal matrices (i.e., the orange highlighted squares). The shadowed areas represent the shrunken segments due to falsely assumed independence among schools within districts.
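To make the two error structures concrete, the following is a small sketch that builds both matrices for balanced clusters with Kronecker products. The variance labels follow the reconstructed notation above; the dissertation itself presents these structures as figures.

```python
# Sketch of the three-level and (omitted-district) two-level error
# variance-covariance matrices for balanced clusters.
import numpy as np

def sigma_three_level(sigma2, tau2, omega2, n_bar, m_bar):
    J_n = np.ones((n_bar, n_bar))
    block = sigma2 * np.eye(n_bar) + tau2 * J_n        # one school
    within = np.kron(np.eye(m_bar), block)             # block diagonal over schools
    return within + omega2 * np.ones((m_bar * n_bar, m_bar * n_bar))

def sigma_two_level_omitting_district(sigma2, tau2, omega2, n_bar, m_bar):
    # The omitted omega2 is fully absorbed by the school-level variance.
    block = sigma2 * np.eye(n_bar) + (tau2 + omega2) * np.ones((n_bar, n_bar))
    return np.kron(np.eye(m_bar), block)               # shrinks to diagonal blocks

S3 = sigma_three_level(0.5, 0.3, 0.2, n_bar=4, m_bar=3)
S2 = sigma_two_level_omitting_district(0.5, 0.3, 0.2, n_bar=4, m_bar=3)
# Cells linking students in different schools of the same district are
# omega2 in S3 but falsely zero in S2 -- the "shrunken" segments of Figure 4.3.
print(S3[0, 4], S2[0, 4])   # 0.2, 0.0
```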
When the highest cluster level is omitted, the between-district variation is fully redistributed to the between-school variation, while the between-student variation remains the same:

$$\hat\tau^2 = \tau^2 + \omega^2 \quad \text{and} \quad \hat\sigma^2 = \sigma^2.$$

Then,

$$\hat\rho = \rho_b + \rho_d.$$

The shrunken parts in $\Sigma_k$ are the $\omega^2$ blocks linking different schools within a district, which are falsely captured by $\hat\tau^2$ in the estimated two-level model. Unlike the omitted middle cluster case in Chapter 2, the omitted between-cluster variance repartition here is not weighted by the cluster size.

Figure 4.3 Correlation structures of $\Sigma_k$ of the three-level model and $\hat\Sigma_j$ of the two-level model omitting the highest cluster level.

4.3.2 Quantifying the Standard Error Estimate Bias

Bias of the Standard Error Estimates of the Coefficients of $Z_{(k)}$ and $Z_{(jk)}$

Predictor $Z$, though a measure of the districts, is falsely disaggregated at the school level. The letters in the parentheses (i.e., $(k)$ and $(jk)$) indicate the corresponding omitted cluster levels. The estimated variance of the coefficient of $Z_{(k)}$ in the two-level model is

$$\widehat{var}(\hat\gamma_{Z(k)}) = \frac{(\hat\sigma^2 + \hat\tau^2)\,\hat\lambda}{N s_{Z}^2}, \quad \text{where } \hat\lambda = 1 + (\bar n - 1)\hat\rho.$$

In the three-level model, which correctly models the predictor as $Z_k$, the variance estimate of the coefficient is

$$var(\hat\gamma_{Z}) = \frac{(\sigma^2 + \tau^2 + \omega^2)\,\lambda}{N s_{Z}^2}, \quad \text{where } \lambda = 1 + (\bar n - 1)\rho_s + \bar n(\bar m - 1)\rho_d.$$

The inflation is then quantified by the ratio of these two variance estimates, which yields

$$VOC_{Z(k)} = \frac{1 + (\bar n - 1)\rho_s + \bar n(\bar m - 1)\rho_d}{1 + (\bar n - 1)\hat\rho}.$$

Noticeably, $VOC_{Z(k)}$ is identical to its counterpart in Chapter 2, since both solve the same issue of adjusting the standard error estimates of the highest-level predictor coefficient when its two layers of clustering are not fully modeled.

Bias of the Standard Error Estimates of the Coefficients of $W_{(k)}$ and $W_{(jk)}$

The school-level predictor $W$ is the main predictor of interest.[13] Its variance estimate inflation is quantified by comparing the variance estimate from a satisfactory two-level model, in which no higher cluster level exists, with that from a false two-level model, in which a higher level exists but is omitted. The satisfactory two-level model is identical to the case illustrated in Chapter 2 in deriving the VOC for the school-level predictor; its variance weighting index is $\lambda_W = 1 + (\bar n - 1)\rho$, where the only intraclass correlation is $\rho = \tau^2/(\tau^2 + \sigma^2)$, when the district level truly does not exist. In the false two-level model, the error variance-covariance matrix is $\hat\Sigma_j$, which gives the weighting index $\hat\lambda_W = 1 + (\bar n - 1)\hat\rho$. Finally, the variance inflation factor yields

$$VOC_{W(k)} = \frac{1 + (\bar n - 1)(\hat\rho - \rho_d)}{1 + (\bar n - 1)\hat\rho}.$$

As shown, when the district-level cluster should be modeled as a random effect but is omitted, the standard error estimates of the school-level coefficients are inflated, leading to a Type II error. This finding is similar to the VOC developed for the individual-level predictor in Chapter 2. The common idea is that, if the satisfactory model is a two-level one, then the artificial between-group variance of the untrue highest level should be taken out.

13 The following derivation applies whether $W$ is a continuous measure or a binary treatment assignment indicator. The latter applies to the earlier theoretical discussion, in which the estimated model is a two-level random intercept CRT omitting the highest cluster level and the true model is a three-level random intercept BRT. When the true three-level BRT model has no random slope of the treatment, the standard error estimate of the difference between means is the same as the one from the three-level random intercept CRT model (see Konstantopoulos, 2008a, 2008b). The equivalence of these standard error estimates justifies applying the same adjustment in both cases.
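To illustrate the opposite directions of the two adjustments, here is a small numerical sketch of the reconstructed VOC formulas above. Because the original equations were lost to extraction, treat these functions as reconstructions consistent with the surrounding definitions rather than verbatim transcriptions.

```python
# Sketch of the reconstructed VOC formulas for the omitted district level.

def voc_z_k(rho_s, rho_d, n_bar, m_bar):
    """District-level predictor disaggregated at the school level."""
    return (1 + (n_bar - 1) * rho_s + n_bar * (m_bar - 1) * rho_d) / \
           (1 + (n_bar - 1) * rho_s)

def voc_w_k(rho_hat, rho_d, n_bar):
    """School-level predictor: the artificial rho_d is taken out."""
    return (1 + (n_bar - 1) * (rho_hat - rho_d)) / (1 + (n_bar - 1) * rho_hat)

# rho_s: within-school correlation; rho_d: within-district, between-school.
print(voc_z_k(0.25, 0.10, n_bar=20, m_bar=5))  # 2.39 > 1: SE understated (Type I)
print(voc_w_k(0.25, 0.10, n_bar=20))           # 0.67 < 1: SE overstated (Type II)
```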
Further, when both sources and layers of the clustering dependency are completely omitted in a single-level analysis using OLS estimation, and $Z$ is disaggregated at the student level as $Z_{(jk)}$, the corresponding variance inflation factor is

$$VOC_{Z(jk)} = 1 + (\bar n - 1)\rho_s + \bar n(\bar m - 1)\rho_d,$$

which is identical to its counterpart in Chapter 2; the two-level factor $VOC_{Z(k)}$ is smaller by the denominator $1 + (\bar n - 1)\hat\rho$.

Bias of the Standard Error Estimates of the Coefficients of $X$ and $X_{(jk)}$

Finally, since the individual-level variance is not affected by the omitted highest cluster level, the standard error estimate of the coefficient of $X$ remains unbiased. This can be shown by

$$VOC_{X} = \frac{\sigma^2}{\hat\sigma^2} = 1.$$

In terms of OLS estimation, the adjustment is identical to the one in Chapter 2 (see Eq. 2.9):

$$VOC_{X(jk)} = 1 - \rho_s.$$

Table 4.1 below summarizes the derived variance inflation factors corresponding to the predictors of each level in the estimated two-level HLM and single-level OLS models.

Table 4.1 A summary of VOCs when the highest cluster level is omitted in three-level structured clustering data

| Level | Predictor | Two-level HLM variance adjustment | Single-level OLS variance adjustment |
|---|---|---|---|
| Student | $X$ | none ($VOC_X = 1$) | $VOC_{X(jk)}$ |
| School | $W$ | $VOC_{W(k)}$ | $VOC_{W(jk)}$ |
| District | $Z$ | $VOC_{Z(k)}$ | $VOC_{Z(jk)}$ |

Note. The letters in the parentheses indicate the corresponding cluster levels that are omitted.

4.3.3 Simulation Results

Similar to Chapter 2, a simulation study is designed to test the variance estimation bias when the highest cluster level is omitted, as well as the performance of the derived VOC formulas. The total sample size of students ($N$) and the number of schools ($J$) are fixed at 2000 and 100, which leads to a conventional school size of $\bar n = 20$. Four conditions for the number of districts ($K$) are set: 5, 10, 25, and 50, covering a plausible range of sample sizes for the highest cluster level. In each condition of $K$, the residual variance ($\sigma^2$) and the school-level random effect variance ($\tau^2$) of the estimated two-level models are set in two pairs: 0.5 and 0.5, and 0.8 and 0.2. The latter pair reflects empirical evidence in which the between-school variance can reach 0.2 (Hedges & Hedberg, 2014; Konstantopoulos, 2009; Westine et al., 2013). In Fahle and Reardon (2018), the between-district variance of U.S. public schools for Grades 3-8 students in Math and English Language Arts ranges from 0.05 to 0.24. The settings of the omitted district-level random effect variance ($\omega^2$) in the current study therefore cover the evidence found in Fahle and Reardon (2018) plus a hypothetical extreme value: 0.1, 0.4, and 0.6. With the variance components on this standardized scale, the random effect variances map directly onto the corresponding ICCs. Again, the magnitude of the estimation bias and the performance of the VOCs are measured by the index of relative bias and its VOC-adjusted counterpart. See Appendix 4.A for the parameter settings and simulation results.

Bias of the Random Effects and the Adjustment Performance

Previous research found that the omitted $\omega^2$ is taken up by $\hat\tau^2$, while $\hat\sigma^2$ remains the same. The simulation results support this finding. In all conditions, the mean relative biases of $\hat\sigma^2$ are zero. With larger settings of $\omega^2$ (or $\rho_d$) and $K$, the overestimated $\hat\tau^2$ has a larger positive relative bias. In the extreme condition of $\omega^2 = 0.6$, $\hat\tau^2$ can be more than twice and even four times larger than $\tau^2$ as the number of schools within a district increases. Even in cases where $\omega^2$ is as small as 0.1, the between-school variation can be overestimated by at least around 15%. With adjustment, the mean relative biases of the adjusted $\hat\tau^2$ are close to 0 across all conditions.
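For readers who wish to replicate this design, the following is a minimal sketch of a data-generating process consistent with the settings above (balanced districts assumed; the dissertation's actual generation and estimation code is not reproduced here, and the fitting step, e.g., a mixed-model routine, is omitted).

```python
# Sketch of a three-level random intercept data-generating process.
import numpy as np

rng = np.random.default_rng(2020)

def generate(K=10, J=100, n_bar=20, sigma2=0.8, tau2=0.2, omega2=0.4, beta=0.0):
    m_bar = J // K                       # schools per district (balanced)
    rows = []
    for k in range(K):
        v_k = rng.normal(0, np.sqrt(omega2))            # district effect
        for j in range(m_bar):
            u_jk = rng.normal(0, np.sqrt(tau2))         # school effect
            w = rng.normal()                            # school-level predictor
            e = rng.normal(0, np.sqrt(sigma2), n_bar)   # student residuals
            y = beta * w + v_k + u_jk + e
            rows.append((k, j, w, y))
    return rows

data = generate()
# Fitting a two-level model to these data absorbs omega2 into the school
# variance, overstating the SE of the coefficient on w (Type II direction).
```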
Bias of the Standard Error Estimates of the Coefficients of $W_{(k)}$ and $W_{(jk)}$ and the Adjustment Performance

When the district-level cluster is omitted in the estimated two-level model, the positive mean relative bias indicates that the standard error estimate of the coefficient of $W$ is overestimated, which leads to Type II errors. The magnitude of the overestimation increases with $\omega^2$ and $K$. When $\omega^2$ is considerably large, at 0.6, the standard error estimates from the two-level model are 1.5 to 2 times larger than the parameter. When the omitted $\omega^2$ is trivial, at 0.1, the magnitude of the overestimation is minimal. When neither the school- nor the district-level cluster is modeled, as in the single-level model, the standard error estimates of the coefficient of $W$ are underestimated, with all mean relative biases negative. The values are relatively stable, around -0.6, across all conditions. This is because the settings of the overall omitted clustering dependency are relatively similar, at 0.5 and 0.8, and the sample size of students is fixed. Finally, regarding the adjustment performance, both $VOC_{W(k)}$ for the two-level model and $VOC_{W(jk)}$ for the single-level model are desirable, since the mean adjusted relative biases are consistently smaller than 0.1.

Bias of the Standard Error Estimates of the Coefficients of $Z_{(k)}$ and $Z_{(jk)}$ and the Adjustment Performance

When the district-level predictor is falsely disaggregated, either at the school level in the two-level model or at the student level in the single-level model, the standard error estimates of its coefficient are underestimated. In the two-level model, the underestimation bias increases as $\omega^2$ increases and $K$ decreases. With the maximal $\omega^2$ or $\rho_d$, the true standard error can be around 60% larger than the estimate. With OLS estimation, the underestimation magnitude is relatively stable, with mean relative biases around -0.8 across all conditions. Again, this is due to the omission of the overall clustering dependency in the single-level model; notice that the underestimation magnitude in the OLS estimation is always greater than in the two-level model in each condition, because the two-level models have captured part of the omitted clustering through $\hat\tau^2$. The performance of the VOCs is ideal across all settings and models, with mean adjusted relative biases close to or smaller than 0.1.

Bias of the Standard Error Estimates of the Coefficients of $X$ and $X_{(jk)}$ and the Adjustment Performance

As shown, the standard error estimate of the coefficient of $X$ is not biased in the two-level model. However, it is overestimated under OLS estimation. This pattern is consistent with the previous findings in Chapter 2 that omitting the adjacent higher cluster level leads to a Type II error issue. The overestimation is large when $\omega^2$ and $K$ are large. For example, when the omitted $\omega^2$ or $\rho_d$ is 0.6 and $K$ is 50, the standard error estimates of the coefficient of $X$ are two times larger than the parameters. When the omitted $\omega^2$ or $\rho_d$ is 0.1, the estimates can still be 40% larger than the parameters. The VOC performed relatively well: in most cases, the mean adjusted relative biases are around or smaller than 0.1, though in two cases where the omitted $\omega^2$ is large (i.e., 0.6 and 0.4), the mean adjusted relative biases are around 0.2.

4.4 Empirical Example and Sensitivity Analysis

This section employs the same study, Heafner et al. (2019), used in Chapter 3. As shown earlier, the employed NAEP-E data have a two-stage sampling design in which schools are PSUs and students are USUs, and the empirical model is a two-level random intercept HLM.
With a large sample of schools, an incidental highest cluster level of districts can emerge. Further, as stated by the authors, state- and district-level policy predictors, including required economics education for graduation and economics testing, are modeled at the school level (see Heafner et al., 2019, p. 334). In this case, these variables are falsely disaggregated at the school level and produce underestimated standard errors, though no significant evidence was found for them. Since Chapter 3 has already demonstrated examples of Type I errors, this section provides an example of Type II errors for the middle-level predictors when a higher cluster level is omitted. The example predictor used here is the requirement of economics education for graduation, which I treat as a true school-level predictor for the sake of the example.

The procedure follows the steps in the heuristics diagram of Figure 3.3, and the results are shown in Table 4.2. The regression coefficient and random effects estimates from the estimated two-level model are presented in the upper left of the table; the estimated t statistic is not significant, which satisfies the requirement for conducting the Type II error sensitivity analysis. Since the omitted between-district variance ($\omega^2$) is completely captured by the estimated between-school variance of the two-level model ($\hat\tau^2$), with no cluster-size weights, the sensitivity analysis here is straightforward and starts with calculating the threshold $VOC^{*}$ and the corresponding threshold $\rho_d$. For any setting of $\rho_d$ larger than 0.267, with VOC smaller than 0.432, the risk of a Type II error is greater than 0. For example, the maximum $\rho_d$ is 0.275, with a corresponding $\tau^2 = 0$; that is, when the hypothesized district-level ICC is 0.275, the estimated between-school variance is entirely between-district variance. In that case, the magnitude of inference robustness (or effect size) is reduced by 60%, and the risk of making a Type II error increases by 12% compared with the threshold setting. In the current example, the maximum $\rho_d$ does not exceed the estimated $\hat\rho$ of the two-level model.

Table 4.2 Sensitivity analysis of the school-level predictor: economics required for graduation

Estimated two-level HLM: $J = 560$; $\bar n = 20$; $\hat\gamma = 1.820$; $SE = 2.150$; $t = 0.85$; $\hat\sigma^2 = 22.620$; $\hat\tau^2 = 8.580$; $\hat\rho = 0.275$; critical $t = 1.96$. The student-level residual variance remains 22.620 in all hypothesized three-level scenarios.

| Hypothesized $\rho_d$ | $\omega^2$ | $\tau^2$ | $\rho_b$ | VOC | Adjusted SE | Robustness reduction | Type II risk increase |
|---|---|---|---|---|---|---|---|
| 0.267 (threshold) | 8.315 | 0.265 | 0.008 | 0.432 | 0.929 | 0.568 | switch point |
| 0.200 | 6.240 | 2.340 | 0.075 | 0.624 | 1.342 | 0.376 | NA |
| 0.010 | 0.312 | 8.268 | 0.265 | 0.985 | 2.117 | 0.015 | NA |
| 0.275 (maximum) | 8.580 | 0.000 | 0.000 | 0.401 | 0.862 | 0.599 | 0.121 |
| 0.300 (not plausible) | 9.360 | -0.780 | -0.025 | 0.290 | 0.624 | 0.710 | 0.462 |

Note. VOC is reported on the standard-error scale (adjusted SE = 2.150 × VOC; robustness reduction = 1 − VOC).

An implausible example with $\rho_d = 0.300$ is thus demonstrated: if $\rho_d = 0.300$, the between-school variance from the three-level model turns negative, though the corresponding robustness measures can still be computed and are larger than the ones above. Also, $\rho_d = 0.010$ is provided to show the lower boundary of the variance adjustment; in this case, the reduction in robustness and effect size is small (i.e., 1.5%), so there is little concern about a Type II error. The hypothesized $\rho_d$ values above are comparable to the empirical range of around 0.05 to 0.24 summarized in previous literature across nations, grade levels, and subjects (e.g., Fahle & Reardon, 2018).
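The threshold quantities in Table 4.2 can be reproduced directly from the estimated two-level statistics. The sketch below uses the reconstructed $VOC_{W(k)}$ formula from Section 4.3.2, with the table's convention that the reported VOC is the standard-error-scale factor (the square root of the variance ratio); it is a worked check, not the dissertation's code.

```python
import math

rho_hat, n_bar = 0.275, 20           # estimated ICC and school size
coef, se, t_crit = 1.820, 2.150, 1.96

den = 1 + (n_bar - 1) * rho_hat      # weighting index of the estimated model

f_star = coef / (se * t_crit)                       # 0.432: threshold SE factor
rho_d_star = rho_hat - (f_star**2 * den - 1) / (n_bar - 1)
print(round(f_star, 3), round(rho_d_star, 3))       # 0.432 0.267

for rho_d in (0.200, 0.010, 0.275):                 # hypothesized district ICCs
    f = math.sqrt((1 + (n_bar - 1) * (rho_hat - rho_d)) / den)
    print(rho_d, round(f, 3), round(se * f, 3), round(1 - f, 3))
# 0.200 -> 0.624, 1.342, 0.376 ; 0.010 -> 0.985, 2.117, 0.015 ;
# 0.275 -> 0.401, 0.862, 0.599  (matching Table 4.2)
```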
This evidence heightens the significance of conducting this sensitivity analysis to test the estimation bias due to an omitted but empirically plausible district cluster level.

4.5 Discussion and Conclusion

This chapter clarifies the risky practice of estimating two-level models that omit a highest cluster level that is legitimate in the sampling and experimental design. The harmful ramifications of omitting the clustering dependency at the highest level require particular caution from researchers when their research questions relate to the cluster-level predictors' effects and to explaining variance beyond the individual level, since the estimated two-level model will generate biased standard error estimates for the middle- and highest-level coefficients and a biased random effect variance for the middle cluster level. Similar to Chapter 2, the VOCs derived in this chapter quantify the potential magnitude of these biases.

The decision on whether to explicitly model the highest cluster level depends on the research design of the sampling and experimental schemes, as well as on whether the inference is to be generalized to the studied sample groups only or to the population of interest. When the main predictor of interest is at the middle level and the highest-level clusters are population groups, a fixed-effects modeling framework is genuine. In contrast, if the predictors of interest also include highest-level ones and the clusters are sample units, the highest cluster level needs to be modeled as a random effect. As listed in Table 4.1, the estimated two-level model omitting the highest cluster level leads to a Type II error risk for the middle-level predictor estimate and a Type I error risk for the disaggregated highest-level predictor, while individual-level inferences are not affected. The extended single-level scenario omitting the overall clustering dependency was shown in Chapter 2, where Type I errors emerge for the cluster-level predictor estimates and Type II errors for the individual-level predictor.

Another particularity of deciding whether to model the highest cluster level relates to sample size. Frequently, small-sample issues occur at the highest cluster levels, particularly when the highest cluster level is not in the initial sampling design. In such cases, the fixed-effects approach is optimal (McNeish & Wentzel, 2016; McNeish & Kelley, 2019). The current study only tested conditions where the middle-level cluster size is relatively large (the school size was fixed at 20) and the sample size of the highest cluster level is not extremely small (the smallest district sample size is 5); the displayed simulation outputs did not show any exceptional performance of the VOCs related to sample size. In future studies, small-sample conditions relevant to the random effects assumptions and the performance of the estimation methods should be investigated in detail, though this is beyond the scope of the current study. Future studies are also encouraged to develop extended VOCs for unbalanced designs and random slopes.

CHAPTER 5 OMITTED SERIAL CORRELATIONS IN LOWEST CLUSTER LEVEL

5.1 Introduction

Longitudinal data can be conceptualized as clustered data, since repeated measures are clustered within groups, such as the yearly measured performance of students.
A two-level growth model can describe the average change and the variability of change, as well as examine the factors that can explain the growth patterns (Bryk & Raudenbush, 2002; Hoffman, 2015; Singer & Willett, 2003). In the previous chapters, units within groups are exchangeable in conventional clustered data: any pair of units within a cluster has an equal intraclass correlation, as they are assumed to share common unobserved factors at the group level (Alejo et al., 2018; Cameron & Miller, 2015; Hansen, 2007). Assuming homogeneity and two independent levels of random effects, the corresponding error variance-covariance matrix of a two-level model is $\Sigma = \sigma^2 I + \tau^2\mathbf{1}\mathbf{1}'$, where $\sigma^2 I$ is the first-level error structure and $\tau^2\mathbf{1}\mathbf{1}'$ is the second-level error structure, with $\mathbf{1}$ a column vector of ones with a length equal to the cluster size.

A distinguishing feature of longitudinal data is that repeated measures are chronologically ordered (Alejo et al., 2018; Skrondal & Rabe-Hesketh, 2008). The ordering introduces an additional source of dependency, from the correlations of the repeated measures of an outcome within individuals, beyond the mean differences across individuals and the variation of growth across individuals (Hoffman, 2015). Unlike equicorrelated intraclass correlations, serial correlations between successive time measures have certain patterns.[14] Generally, the correlations between two successive time measures are larger than the correlations between two non-successive ones, and as the gap between two occasion measures increases, the correlation decreases: $corr(e_t, e_{t+1}) > corr(e_t, e_{t+2})$. In this case, another form of intraclass correlation, due to serial correlation, emerges in addition to the conventional one due to random effects. The basic identity structure (ID) of the time-level residual matrix, $R = \sigma^2 I$, which assumes independently and identically distributed within-individual repeated-measure residuals, is then overly simplified for multilevel longitudinal data analysis.

In the field of economics, the intraclass correlations due to clustering and serial correlation are explicitly defined as closely related but distinct (Angrist & Pischke, 2008), and corresponding statistical tests have been proposed for evaluating the two forms of intraclass correlation in random effects longitudinal models. As surveyed in Alejo et al. (2018), earlier tests evaluating either random effects or serial correlation (i.e., Baltagi & Li, 1991; Breusch & Pagan, 1980) tend to produce inflated rejection rates if the other form of intraclass correlation exists and is ignored (Bera et al., 2001). Empirical research also presents this issue. In the influential study of Bertrand et al. (2004), a survey of 92 difference-in-differences (DD) studies found that only five had implemented serial correlation corrections; that study found significant over-rejection for a null-effect treatment, due to the omitted serial correlation. On the other hand, intraclass correlation due to clustering alone is commonly handled with cluster-robust standard errors (Moulton, 1986, 1990), alternative estimators such as GLS (Liang & Zeger, 1986; White, 1980), and the block bootstrap (Cameron et al., 2008). Later-developed tests jointly test both forms of intraclass correlation, such as those in Alejo et al. (2018), Baltagi, Jung, and Song (2002, 2010), and King and Roberts (2015), to name a few.

14 The current study focuses on positive serial correlations, which means the error terms keep the same sign from one time measure to the next.
These studies highlight identifying the sources of the intraclass correlation (i.e., clustering effects versus serial correlation) so that appropriate strategies and models can be applied (Alejo et al., 2018). With both forms of intraclass correlation present, corresponding strategies, such as feasible generalized least squares (FGLS) estimation (Hansen, 2007), are required to account for the dependencies (Angrist & Pischke, 2008; Wooldridge, 2003).

The above discussion alerts us to the critical need for detecting and distinguishing the two forms of intraclass correlation. Beyond the approaches popular in economics research, the model-based HLM approach accounts for the two forms of intraclass correlation simultaneously by specifying a correct error variance-covariance structure. However, it is not uncommon in empirical research for the serial correlations among repeated measures to be ignored in the time-level variance structure $R$, so that all the expected correlations among the repeated measures are (falsely) attributed to the individual-level random effect variance $\tau^2$. Consequently, the tested theories and the inferences made for the variance components and fixed effects can be erroneous (Ferron et al., 2002; Hoffman, 2015; LeBeau, 2016, 2018). Therefore, with recognition of serial correlation, a correctly specified $R$ structure is pivotal. As a start, the current study considers the ID structure of $R$ as a scenario of omitted serial correlation at the lowest level, and sets out to mathematically quantify the corresponding estimation bias for robust inference-making. Section 5.2 begins with a review of the approaches to specifying $R$ and a discussion of the bias in estimates due to a misspecified $R$ in empirical research. Section 5.3 details the derivation of generalized formulas quantifying the estimation bias of the variances of the random effects and fixed effects, explored through an example of a two-level random intercept linear growth model that misspecifies $R$ as ID when the truth is AR(1). A Monte Carlo simulation study presents the performance of the formulas. Section 5.4 provides an empirical study example. Finally, Section 5.5 concludes and discusses future research.

5.2 Alternative R Structures with Serial Correlations

The structure of the time-level residual variance matrix $R$ represents the serial correlation patterns. Besides the ID structure $R = \sigma^2 I$, many alternative structures are widely recognized in textbooks of multilevel analysis, including Bryk and Raudenbush (2002, Chapter 6), Hoffman (2015, Section 3), Singer and Willett (2003, Chapter 7), and Snijders and Bosker (2012, Chapter 15), to name a few. Commonly presented alternatives include autoregressive (AR(k)), autoregressive moving average (ARMA(p, q)), Toeplitz (TOEP(k)), and unstructured forms. In practice, the selection of $R$ largely depends on empirical and theoretical needs (Snijders & Bosker, 2012). Nevertheless, this approach is limited by prior experience and generalizability, and is prone to uncertainty in specifying $R$. Moreover, a misspecified $R$, in return, distorts the deduction of theories. Taking a simple example, which will be proved in later sections: when a relatively large serial correlation is completely omitted, the between-individual variance $\tau^2$ is considerably overestimated while $\sigma^2$ is underestimated. Then, in modeling, individual-level predictors are added to explain the overstated between-individual variance, instead of within-individual predictors (Hoffman, 2015).
In this case, the true predictors and mechanisms of individual growth, particularly at the within-individual level, are overlooked. This example applies well to studies of time-varying psychological and cognitive factors in learning. On the other hand, if the predictors of interest are between-individual attributes such as ethnicity and family background, an overstated between-individual variance misattributes to those attributes variation that is actually serial dependence.

In addition to deciding $R$ based on empirical experience and theory, a general statistical approach is to compare goodness-of-fit values among several models with differently specified $R$ structures and select the best-fitting model. However, the arbitrary cutoff values of likelihood ratio tests and information criteria have been critiqued. The criteria's performance depends on many factors, including the number of time measures, the total sample size of individuals, the estimation methods, and the variance-covariance patterns. Therefore, no single criterion performs uniformly better than the others, and certain criteria perform better for selecting certain structures (Vallejo et al., 2011). Also, it is important to note that the best-fitting model is not necessarily the model with the correct $R$ (Murphy & Pituch, 2009). Researchers may turn to the general unstructured $R$, with no prior specification, to best fit the data (Littell et al., 2000). However, the unstructured $R$ is less interpretable for empirical studies that appreciate substantive theories. Further, as evidenced in Murphy and Pituch (2009), although the unstructured $R$ produces the least biased random effects, it inflates the Type I error rate of the fixed effects and has convergence problems, because a large number of parameters must be estimated.

The above selection methods for $R$ are not free from concerns, and empirical research thus remains subject to serious impacts on variance estimates if $R$ is misspecified. Kwok et al. (2007) summarize three scenarios of misspecifying $R$: overspecification, underspecification, and general misspecification. That study develops a network of multiple $R$ structures by their nesting relationships, including independent (ID), first-order autoregressive (AR(1)), first-order autoregressive moving average (ARMA(1,1)), Toeplitz with 2 bands (TOEP(2)), and unstructured, as shown in their Figure 1 and Table 1 (Kwok et al., 2007, pp. 565, 568). For example, an underspecification occurs when the true $R$ is AR(1) while an ID structure is estimated, or when the true one is ARMA(1,1) while AR(1) is estimated. An overspecification occurs when AR(1) is true while ARMA(1,1) is modeled. TOEP(2) and unstructured are considered general misspecifications when the true $R$ is defined as one of the other structures. The general findings are that, if $R$ is underspecified or generally misspecified, the fixed effect estimates are unbiased while their variances are overestimated; overspecification leads to slightly underestimated variances. However, other studies found conflicting patterns. Murphy and Pituch (2009) detected smaller standard error estimates in the underspecified AR(1) model when the true $R$ is ARMA(1,1). They found similar patterns for the estimated ID model when the true $R$ is AR(1): the standard error estimates of the fixed effects are slightly smaller than they should be, with the Type I error rate inflating accordingly. These findings are consistent with a recent Monte Carlo study by LeBeau (2018), which also shows inflated Type I error rates for the fixed effects when the serial correlation is completely omitted from $R$ (i.e., underspecified as ID).
The simulation-based studies above provide evidence of estimation bias due to a misspecified $R$, but the findings are not always consistent. Moreover, they are limited in generalizability, as they are tested over certain ranges of parameters. Therefore, further analytic examination is needed to determine the underlying mechanisms of the estimation bias.

5.2.1 Study Motivation

The above discussion demonstrates that deciding on the $R$ structure is complex. Beyond awareness of the negative impacts of a misspecified $R$, empirical researchers can benefit even more from a strategy that helps evaluate whether the estimated $R$ is specified correctly and that adjusts the potential bias if the estimated $R$ is false. The current study intends to provide such a strategy: instead of deciding on the true $R$, it proposes to quantify the uncertainty that the specified $R$ in the estimated model can cause when an alternative $R$ is hypothesized to be true. Bertrand et al. (2004) provide variance estimate formulas demonstrating exactly why omitting positive serial correlation in OLS estimation understates the standard error estimates. However, no such efforts have been made in the presence of clustering dependencies in multilevel longitudinal analysis. The current study therefore contributes to filling this gap by deriving formulas that identify the sources of the estimation bias due to omitted serial correlation in multilevel longitudinal analysis. The detected bias can then be adjusted with those formulas, which are derived similarly to the VOCs of the previous chapters on omitted middle and highest cluster levels. This quantification approach distinguishes the sources of the estimation bias (i.e., serial correlation versus random effects) across the different levels of predictors, including the growth parameters and time-varying predictors at the time level and the time-invariant predictors at the individual level. In this way, researchers in practice can benefit in model building by selecting predictors that best explain the corresponding variances.

Together with the sensitivity analysis developed in Chapter 3, this approach provides researchers with the flexibility to choose the best model that is statistically robust and appropriate for their theories. This approach also helps researchers and readers who do not have the original data to interrogate published models when the original study does not provide much information on the selection criteria and decision rationale for $R$. For instance, the assumption on $R$ is not explicitly given in a five-year longitudinal study of student achievement and goal setting (Moeller et al., 2012), and the model results do not show serial correlation estimates. In this case, readers may question how the estimated model is specified and ask: if any serial correlation is omitted, how large would the estimation bias be, and how robust are the inferences that were made? Since ID and AR(1) are the most widely used structures in empirical longitudinal research, the current study focuses on the underspecification case in which the estimated $R$ is ID while the true one is AR(1). However, the approach described above is suitable for testing many other misspecification pairs, such as AR(1) versus ARMA(1,1), as long as the structures are nested, as shown in Figure 1 of Kwok et al. (2007).
The current study adopts this concept of nested structures for future work on building a full network of $R$ misspecification pairs.

5.3 Quantification of Standard Error Bias

5.3.1 Model Setting

This section derives formulas to quantify the bias of the variance estimates of both random effects and fixed effects if the true $R$ structure is assumed to be AR(1) while the estimated one is ID. The following defines a two-level random intercept linear growth model describing a mean growth trajectory:

Time-level: $Y_{ti} = \pi_{0i} + \pi_{1}a_{ti} + e_{ti}$

Student-level: $\pi_{0i} = \gamma_{00} + \gamma_{01}Z_{i} + u_{0i}$, $u_{0i} \sim N(0, \tau^2)$

Mixed model: $Y_{ti} = \gamma_{00} + \pi_{1}a_{ti} + \gamma_{01}Z_{i} + u_{0i} + e_{ti}$.

For simplicity of notation and formula derivation, the model is a balanced design in which every student $i$ is measured at the same $T$ occasions. The intercept varies at the student level and is explained by a student-level measure $Z_i$. The occasion measure is $a_{ti}$, and the parameter $\pi_1$ is the mean growth rate of students. Taking five yearly measures as an example, $a_{ti}$ can be coded as 1, 2, 3, 4, 5, or as -2, -1, 0, 1, 2, where the 0 point serves as a meaningful anchor for interpretation (Hoffman, 2015). Here, and in the later simulation, $a_{ti}$ is group-centered, which helps avoid endogeneity issues in which random effects correlate with predictors (Antonakis et al., 2019). Though a random slope is common in longitudinal data analysis, the growth rate in this study is not assumed to be random, as students are taken to grow at the same rate in a shared context, such as the same school. Assuming the true serial correlation pattern is AR(1), the random effects are $e_{ti}$, with an AR(1) covariance structure, and $u_{0i} \sim N(0, \tau^2_{AR})$. Consistent with the modeling settings in Chapters 2 and 4, the homogeneity assumption holds. In the model above, covariates other than $a_{ti}$ and $Z_i$ are not shown for simplicity; in this case, the fixed growth rate and serial correlation are also conditional. The model setting presumes that the empirical model has no omitted confounding variable.

5.3.2 Quantifying the Standard Error Estimate Bias

The error variance-covariance structure is $\Sigma = R + \tau^2\mathbf{1}\mathbf{1}'$, where the dimension of $R$ is $T \times T$ and $\mathbf{1}$ is a column vector of ones. The difference in the error variance-covariance structure between the estimated ID model ($\Sigma_{ID}$) and the AR(1) model ($\Sigma_{AR}$) lies in the structure of $R$. In the AR(1) model,

$$R_{AR} = \sigma^2_{AR}\begin{pmatrix} 1 & \phi & \phi^{2} & \cdots & \phi^{T-1} \\ \phi & 1 & \phi & \cdots & \phi^{T-2} \\ \vdots & & \ddots & & \vdots \\ \phi^{T-1} & \phi^{T-2} & \cdots & \phi & 1 \end{pmatrix}.$$

That is, with an AR(1) serial correlation pattern, the variance of the time-level residual is $\sigma^2_{AR}$, and the covariance of two adjacent time measures is $\sigma^2_{AR}\phi$, where $\phi$ is the lag-1 autocorrelation (Montes-Rojas, 2016). I also assume that there is no measurement error and that the lag-1 autocorrelation is positive (i.e., $0 < \phi < 1$). The $\tau^2\mathbf{1}\mathbf{1}'$ structure does not differ between the true AR(1) model and the estimated ID model; it captures the intraclass correlation due to the individual-level random effect variance. The complete extended form of $\Sigma_{AR}$ is $R_{AR} + \tau^2_{AR}\mathbf{1}\mathbf{1}'$. Unlike $\Sigma_{ID}$, the off-diagonal of $\Sigma_{AR}$ is no longer a single constant but a function of $\tau^2_{AR}$, $\sigma^2_{AR}$, and $\phi$. To achieve a simpler form of $\Sigma_{AR}$ that can be written in a general linear form like $\Sigma_{ID}$, and to achieve a general form of the column sum (as in the previous omitted middle and highest level cases), I construct an average term for the off-diagonal:

$$\bar{c} = \frac{2\,\sigma^2_{AR}\sum_{k=1}^{T-1}(T-k)\phi^{k}}{T(T-1)},$$

where the numerator's sum collects all elements on either side of the diagonal of the symmetric serial correlation matrix, and $T(T-1)/2$ is the number of elements on one side. This averaging approach is also suggested by Montes-Rojas (2016).
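The two structures and the averaging step can be verified numerically. The sketch below builds both matrices and checks the closed form of $\bar c$ against the full matrix; the variance values (144 and 64, with $\phi = 0.5$ and $T = 6$) follow the later simulation settings, and the code itself is an illustration, not the dissertation's own.

```python
# Sketch of the AR(1)+random-intercept and ID error structures, plus the
# averaged off-diagonal serial covariance c_bar.
import numpy as np

def sigma_ar1(sigma2, tau2, phi, T):
    t = np.arange(T)
    R = sigma2 * phi ** np.abs(t[:, None] - t[None, :])   # sigma2 * phi^|t-s|
    return R + tau2 * np.ones((T, T))

def sigma_id(sigma2, tau2, T):
    return sigma2 * np.eye(T) + tau2 * np.ones((T, T))

def average_serial_cov(sigma2, phi, T):
    k = np.arange(1, T)
    return 2 * sigma2 * np.sum((T - k) * phi ** k) / (T * (T - 1))

sigma2, tau2, phi, T = 144, 64, 0.5, 6
S_ar1, S_id = sigma_ar1(sigma2, tau2, phi, T), sigma_id(sigma2, tau2, T)
print(S_ar1[0, 1], S_id[0, 1])      # adjacent covariance: 136.0 vs 64.0

c_bar = average_serial_cov(sigma2, phi, T)          # 38.7
t = np.arange(T)
R = sigma2 * phi ** np.abs(t[:, None] - t[None, :])
assert np.isclose(R[~np.eye(T, dtype=bool)].mean(), c_bar)   # matrix check
```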
Any element in the off-diagonal of $\Sigma_{AR}$ now becomes $\tau^2_{AR} + \bar{c}$. Similar to the construction of the conventional ICC shown in Appendix 2.1, the expected intraclass correlation of $\Sigma_{AR}$ is defined as the ratio of the average off-diagonal element to the total error variance:

$$\bar\rho = \frac{\tau^2_{AR} + \bar{c}}{\tau^2_{AR} + \sigma^2_{AR}} = \rho_r + \rho_a,$$

where $\rho_a = \bar{c}/(\tau^2_{AR} + \sigma^2_{AR})$ and $\rho_r = \tau^2_{AR}/(\tau^2_{AR} + \sigma^2_{AR})$. Straightforwardly, $\bar\rho$ is a function of the two forms of intraclass correlation, from the time series and from the random effect, which emphasizes the legitimacy of the two forms of intraclass correlation coefficients. If we overlook the forms of the intraclass correlations, $\bar\rho$ simplifies to an overall intraclass correlation coefficient and functions equivalently to the conventional ICC; it is the average intraclass correlation of the repeated time measures per individual. The current study defines $\rho_a$ as the intraclass autocorrelation coefficient (IAC) and $\rho_r$ as the intraclass correlation coefficient of random effects (ICR). Unlike the ID model, which has only one intraclass correlation coefficient (i.e., the ICR), the AR(1) model has two forms of intraclass correlation, IAC and ICR, which highlights the serially correlated feature of longitudinal data discussed at the beginning.

The ICR is the conventional intraclass correlation. Further, since the ID specification forces all of the off-diagonal dependency into the random intercept, the relationship between the individual-level random effects of the two models is $\tau^2_{ID} = \tau^2_{AR} + \bar{c}$. Also, since the total error variance is fixed regardless of the model specification, it follows that $\sigma^2_{ID} = \sigma^2_{AR} - \bar{c}$. That is, the estimated intercept random effect variance in the ID model is larger than that in the AR(1) model, while the estimated time-level residual variance in the ID model is smaller than that in the AR(1) model. This formula testifies to the patterns detected in Murphy and Pituch (2009). The size of the gap between the random effects of the two models depends on the size of the IAC. Immediately, the random effects of the AR(1) model can be recovered by

$$\tau^2_{AR} = \tau^2_{ID} - \bar{c} \quad \text{and} \quad \sigma^2_{AR} = \sigma^2_{ID} + \bar{c}.$$

Also, $\rho_r^{ID} = \rho_r + \rho_a$; consequently, the ICR of the AR(1) model is smaller than the ICR of the ID model, and the degree of difference between these two ICRs is weighted by the IAC. When $\phi$ is zero, the forms of intraclass correlation reduce to the ICR only, as $\bar\rho = \rho_r$; in this case, the estimated ID model is the true model. On the other hand, if $\phi = 1$, so that each of the time measures of the dependent variable for an individual is exactly the same, then $\bar\rho = 1$, which is equivalent to a time-aggregated, single-time-point, one-level analysis.

Finally, using the averaged off-diagonal, a simple form of the unified single column sum of $\Sigma_{AR}$ yields the variance weighting indices for the coefficient estimate of the time-level predictor:

$$w_{AR} = 1 + (T-1)(\rho_a\bar\rho_x + \rho_r\bar r_x) \quad \text{and} \quad w_{ID} = 1 + (T-1)\,\rho_r^{ID}\,\bar r_x.$$

Then, I construct the VOC to measure the variance inflation of the estimated variance of the coefficient of the time-level predictor when the AR(1) model is underspecified as ID. The construction rationale is the same as in the previous chapters and the conventional design effect: the VOC is the ratio of the variance estimate of the AR(1) model to the variance estimate of the ID model, which yields

$$VOC_{a} = \frac{w_{AR}}{w_{ID}},$$

where $w_{ID}$ is the scalar weight of the variance estimate of the ID model, which only takes into account the intraclass correlation attributed to the random effect, and $w_{AR}$ is the scalar weight of the variance of the AR(1) model, which takes into account the two forms of intraclass correlation. This equation carries the same idea as the ones in previous chapters quantifying the variance estimate bias when the middle and highest clusters are omitted. Further, $\bar\rho_x$ is the intraclass correlation of the repeated time-measure predictor in the form of an average lag-1 autocorrelation, while $\bar r_x$ is the average conventional correlation coefficient:

$$\bar\rho_x = \frac{\sum_{i=1}^{N}\sum_{t=1}^{T-1}(x_{ti}-\bar x_i)(x_{t+1,i}-\bar x_i)}{\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{ti}-\bar x_i)^2} \quad \text{and} \quad \bar r_x = \frac{2\sum_{i=1}^{N}\sum_{t<s}(x_{ti}-\bar x_i)(x_{si}-\bar x_i)}{(T-1)\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{ti}-\bar x_i)^2},$$

where $N$ is the number of individuals, $x_{ti}$ is the occasion measure at time $t$ for individual $i$, and $\bar x_i$ is the individual-level mean of the occasion measures. Group-mean centering of the occasion measure in balanced studies does not produce different values of $\bar\rho_x$ and $\bar r_x$. These two intraclass correlation measures of predictors are adapted from Angrist and Pischke (2008) and Montes-Rojas (2016), which distinguish these two types of intraclass correlation coefficients of predictors.
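The average lag-1 autocorrelation of a predictor is easy to compute directly. The sketch below implements the pooled within-individual definition; because the dissertation's exact normalization was lost to extraction, treat the formula as a plausible reconstruction.

```python
# Sketch of the average lag-1 autocorrelation of a time-varying predictor.
import numpy as np

def avg_lag1_autocorrelation(X):
    """X: (N individuals, T occasions); group-mean centered internally."""
    Xc = X - X.mean(axis=1, keepdims=True)
    return (Xc[:, :-1] * Xc[:, 1:]).sum() / (Xc ** 2).sum()

X = np.tile(np.arange(6.0), (500, 1))   # a shared linear time trend, T = 6
print(avg_lag1_autocorrelation(X))      # 0.5
```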
Specifically, and where is the number of individuals, is a time occasion measure at time t of an individual i , and is the individual - level mean of the occasion measures. Group - mean centering of the occasion measure in balanced studies does not produce different v alues of and . The above two intraclass correlation measures of predictors are adapted from the ones in Angrist and Pischke (2008) and Montes - Rojas (2016), which distinguish the difference between these two types of intraclass corre lation coefficients of predictors. Specifically, the inclusion of 105 is unique for the time - varying predictors in longitudinal data analysis, which specifies the autocorrelation among one time - measure with the one - time - point - later - measure w ithin individuals. In contrast, the intrac lass correlation of predictors is a matter of clustering with equal correlation between any pairs of time - measure within an individual, due to the nature of the model specification of ID. Therefore, in g eneral, is times smaller than if there are more than two time - measure. If we ignore these two measures of time - varying tends to be smaller than it sho uld be. In Chapter 2 and 4 for omitted middle and higher cluster levels in non - longitudinal data cases, the intraclass correlation coefficients of a predictor in the denominator and numerator are canceled out since they both equal to the conventiona l corre lation coefficient of the same predictor. As shown above, the standard error estimates of the time - are downwardly estimated by the omitted autocorrelation. In contrast, the individual level time - invariant predictor c oeffici no need of distinguishing serial correlation or random effects. In other words, the standard error of the individual level time - invariant predictors does not need adjustments in the es timated ID model. The following equation and further simulation results evidence this point. Different from the previous as shown in Eq. 5.4, the denominator of comes from the estimated model that captures all the dependencies, whereas the sources of dependencies are not recognized. S ince the predictor of interest here is at the 106 cluster level and time - in distinguished, as long as the overall error variance - covariance are captured. The intraclass correlation of a cluster - level predictor is one (i.e., ) and canceled out. Table 5. 1 A summary of VOC s when the serial correlation is omitted Two - level HLM Single - level OLS Estimation Level Predictor Variance adjustment L evel Predictor Variance adjustment Time Time - varying Time - Student Time - varying / Individual Time - invariant Time - invariant However, when the clustering structure is also omitted and a single - level analysis using OLS estimation is conducted, the standard error estimate of then needs to be adjusted by the square root of In essence, shows the sources of the dependencies through , which is a function of and that have shown in Eq. 5.1. Moreover, if there is no clustering issue (i.e., random effect variance is null) but only autocorrelation, then reduces to , which mimics the design - based approaches (e.g., DEFF and MF to so lve the classic situation of omitting serial correlation in the OLS estimation. Table 5.1 shown above summarizes when and which predictor needs VOC adjustment. 
5.3.3 Simulation results

To show the magnitude and direction of the estimation bias when R is misspecified, and to examine the performance of the derived VOC formulas, a simulation study is designed with 12 condition sets for three models: the true AR(1) model, the estimated ID model, and the estimated single-level model. The conditions are set by the two parameters of the VOC formulas: the number of repeated time measures $T$ (6, 10, and 30) and the lag-1 autocorrelation $\rho$ (0.9, 0.7, 0.5, and 0.2). The total number of individuals $N$, the true residual variance $\sigma^2$, and the true individual-level random-effect variance $\tau^2$ are fixed at 500, 144, and 64, respectively, so that the true ICR is $64/(64 + 144) \approx 0.3$. The numbers of repeated time measures are chosen to represent typical cases in empirical research. For example, the periodicity of the ECLS-K:2011 survey measures runs from kindergarten to fifth grade, so $T = 6$. In a daily diary study, the occasion measures can be many more, such as 2 times a day for half a month, so $T = 30$ (e.g., Ilies & Judge, 2004). The combination of extensive time measures and relatively small autocorrelation gives extremely small average autocorrelation values that can be nearly null. These extreme cases serve to show that, under such circumstances, variance adjustments are not necessary. For each condition, 500 replications are generated. Like the earlier discussed simulation studies, an index of relative bias is computed to measure the magnitude of the estimation bias:

$$RB_{\hat{\theta}} = \frac{\hat{\theta} - \theta}{\theta},$$

where $\theta$ represents the true parameters from the AR(1) model, including the random-effect variances and the standard errors of the coefficients of the repeated-time-measure predictor and the individual-level predictor, and $\hat{\theta}$ represents the estimates from the estimated ID models. Falsely estimated models lead $RB_{\hat{\theta}}$ to deviate from zero. Similarly, a relative bias index $RB_{adj}$ is provided for the estimates adjusted by the VOCs. The better the VOCs perform, and the less biased the adjusted estimates are, the closer $RB_{adj}$ is to zero. Further, a larger difference between $RB_{\hat{\theta}}$ and $RB_{adj}$ shows that a biased estimate is more in need of a VOC adjustment. See Appendix 5A for the simulation parameter settings and the detailed simulation results.

Bias of the Random Effects and the Adjustment Performance

The relative bias of the residual variance estimate is consistently negative across all models, and that of the individual-level random-effect variance estimate is positive. In other words, $\sigma^2$ is underestimated and $\tau^2$ is overestimated. The robustness of the ICR is commonly of interest in explaining the proportion of the between-individual variance in the total variance of the outcome. The larger the omitted IAC, the more the relative bias of $\hat{\tau}^2$ deviates from zero. For example, when $\rho = 0.9$ and $T = 6$, $\hat{\tau}^2_{ID}$ can be 2.77 times as large as the true $\tau^2$. When $\rho = 0.2$ and $T = 30$, $\hat{\tau}^2_{ID}$ is almost identical to the true value, since the IAC is close to zero (0.017). Noticeably, even a small IAC can still result in considerable bias in the random-effect estimation: for example, when $\rho = 0.2$ and $T = 6$ (IAC = 0.079), $\hat{\tau}^2_{ID}$ is 1.17 times as large as $\tau^2$ (i.e., $RB = 0.17$). Consequently, the estimated ICR is always larger than the true ICR as long as the IAC is not zero. The adjustments of both $\hat{\tau}^2$ and $\hat{\sigma}^2$ perform ideally across all conditions, with relative biases all close to zero and minimal variances. The detected patterns prove that the omitted serial correlation is falsely taken away from the time-level residual by the individual-level random effect, as well as confirming the previously formulated relationships between the random effects of the AR(1) and ID models.
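A minimal sketch of the relative bias index as it would be applied across replications (my own illustration; the rnorm() draw below is a hypothetical stand-in for 500 ID-model estimates of the individual-level random-effect variance):

  relative_bias <- function(est, truth) (est - truth) / truth

  est_tau2 <- rnorm(500, mean = 177.6, sd = 4)   # stand-in for 500 ID fits
  rb <- relative_bias(est_tau2, truth = 64)      # true AR(1) tau2 = 64
  round(c(mean = mean(rb), variance = var(rb), min = min(rb), max = max(rb)), 2)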
Bias of the Standard Error Estimate of the Coefficient of the Time-Varying Predictor and the Adjustment Performance

If the AR(1) structure is omitted, the estimated standard errors of the time-varying predictor coefficient are underestimated: the relative biases are negative across all models. The magnitude of the underestimation bias rises with increases in $\rho$ and $T$. Fixing $\rho = 0.9$, the estimated standard error is less than one-third of the true parameter when $T = 30$, and about three-fifths of the true one when $T = 6$. Moreover, the relative bias values move toward zero as the IAC closes in on zero, such as when $\rho = 0.2$ across all $T$. However, the underestimation bias does not fully diminish: the estimated standard error can still be about one-fifth less than the true one.

In terms of the bias adjustment, when the IAC is larger than 0.1, the adjustment performs well, with $RB_{adj}$ close to zero and minimal variance, except for the case of $\rho = 0.5$ and $T = 10$ ($RB_{adj} = 0.12$). The performance of the adjustments is also relatively better when the occasion measures are not extensive. When $\rho$ is moderate or small (i.e., 0.5 and 0.2), $RB_{adj}$ tends to be positive, though smaller than 0.1 when $T$ is 6 and smaller than 0.3 when $T$ is 10. If $T$ gets extensively large, at 30, the adjustment tends to make undesired overcorrections: $RB_{adj}$ is larger than 0.5, or even as large as 1. Type II errors can thus be caused. In these cases, the IACs are around 0.05 and smaller. The undesired overcorrection pattern could also be related to the values of the intraclass correlations of the predictor, $\bar{\rho}_x$ and $\rho_x$. As shown in Table 5A.1, an extremely small $\bar{\rho}_x$ produces an extremely small denominator in the adjustment. As a result, the corresponding VOC tends to be much larger than it should be.

Bias of the Standard Error Estimate of the Coefficient of the Individual-Level Predictor and the Adjustment Performance

As expected, when the clustering structure is omitted, there are underestimation issues for the standard error estimates of the individual-level predictor. Consistently across all conditions, the relative biases are negative, ranging from about -0.5 to -0.7. Equivalently, the estimated standard errors from the single-level analyses are only half of, or even smaller than, the true parameter. The VOC adjustment performs desirably across all conditions, with relative biases close to zero, except for one noticeable overcorrection case when $\rho = 0.9$ and $T = 30$ ($RB_{adj} = 0.12$).

5.4 Empirical Example and Sensitivity Analysis

The selected empirical example is Taylor et al. (2010), which applies two-level linear growth models to examine the impacts of between-student and within-student motivational regulations and psychological needs on three motivational outcomes: effort, intentions, and physical activity growth. The 178 participating students come from a school in England and completed repeated surveys. The original study does not specify the time-level random-effect variance structure; thus, assuming an AR(1) structure underspecified as ID, the current study presents examples of utilizing the sensitivity analysis to test the robustness of the time-varying predictors. The employed models are two-level linear growth models (see Tables 1 and 2 of Taylor et al., 2010, for the detailed model reports). In both models, the random slopes are not significant, and the variance estimates are close to 0 (i.e., 0.1 and 0.01, respectively). Thus, I use the above VOC formulas, which are initially constructed for random-intercept models, as an elementary example. Following the suggested steps of conducting the sensitivity analysis in the heuristic diagram of Figure 3.3 of Chapter 3, the threshold VOC is calculated first. Then examples with minimum and maximum IAC values are presented to show the boundaries of robustness.
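Before turning to the tables, the threshold step can be sketched in a few lines of R (my own illustration, using the ID estimates for the competence predictor reported below: b = 0.27, SE = 0.13, so t = 2.08). The threshold square root of the VOC is the factor by which the standard error may grow before t falls to the critical value, and 1 - 1/sqrt(VOC) is the reduction in the robustness of inference reported in Tables 5.2-5.4; the Type I error risk index of Chapter 3 is not reproduced here.

  sensitivity <- function(b, se, sqrt_voc, t_crit = 1.96) {
    t_id <- b / se
    c(threshold_sqrt_voc = t_id / t_crit,     # 1.060 for competence
      threshold_se       = b / t_crit,        # 0.138
      adjusted_se        = se * sqrt_voc,     # 0.144 when sqrt(VOC) = 1.105
      adjusted_t         = t_id / sqrt_voc,
      robustness_loss    = 1 - 1 / sqrt_voc)  # 0.095 when sqrt(VOC) = 1.105
  }
  round(sensitivity(b = 0.27, se = 0.13, sqrt_voc = 1.105), 3)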
Table 5.2 Sensitivity analysis of the time-varying predictor: competence

ID estimates: b = 0.27, SE = 0.13, t = 2.08; critical t = 1.96; $\rho_x$ = 0.79, $\bar{\rho}_x$ = 0.53.

Parameter  | ID   | AR(1), IAC = 0.10 | AR(1), IAC = 0.51 | AR(1), IAC = 0.30
$\tau^2$   | 1.20 | 1.12              | 0.02              | 0.72
$\sigma^2$ | 1.08 | 1.24              | 2.30              | 1.60
ICR        | 0.52 | 0.46              | 0.01              | 0.31
IAC        | 0.00 | 0.10              | 0.51              | 0.30

IAC       | √VOC  | Adjusted SE | Reduction in robustness | Increase in Type I error risk
Threshold | 1.060 | 0.138       | NA                      | NA
0.10      | 1.105 | 0.144       | 0.095                   | 0.265
0.51      | 1.341 | 0.174       | 0.254                   | 0.719
0.30      | 1.170 | 0.152       | 0.145                   | 0.508

Table 5.2 above presents the sensitivity analysis results of the time-varying predictor competence in the first model. The upper portion shows that the robustness of the competence predictor is not desirable, since the t statistic is 2.08, barely above the critical value of 1.96. Therefore, even a small omitted serial correlation can lead to a Type I error, and the threshold VOC is of limited use in this case. In the table, the fixed values include the parameters provided in the original study and those set to achieve the minimum and maximum IAC values. Setting a minimum IAC of 0.10, the corresponding square root of the VOC is 1.105, and the ICR of the AR(1) model (0.46) is close to the ICR of the ID model (0.52). The original study provides the intraclass correlation of the predictor competence ($\rho_x$) as 0.79, and thus $\bar{\rho}_x$ turns out to be 0.53. In this setting, the robustness of inference, or equivalently the effect size, reduces by 9.5%, and the risk of making a Type I error increases by 26.5%. Setting a minimum AR(1) ICR of 0.01, the corresponding square root of the VOC is 1.341, and the IAC is 0.51. This setting offers the upper bound of the possible IAC and of the magnitude of bias: the robustness of inference or the effect size reduces by 25.4%, and the risk of making a Type I error increases by 71.9%. These two settings correspond to lag-1 autocorrelation values of 0.15 and 0.7, which form a reasonable boundary for a potentially omitted serial correlation.

Tables 5.3 and 5.4 show the sensitivity analyses examining the time-varying predictors intrinsic regulation and external regulation in the second model. The two predictors have the same IAC values, since they share the same random effects in the same model, while they have different VOCs due to the different intraclass correlations of the predictors. Provided by the original study, $\rho_x = 0.73$ for intrinsic regulation and $\rho_x = 0.53$ for external regulation. The estimated effects of these two predictors are relatively robust. Specifically, their threshold VOCs are larger than the upper bound of the possible VOCs; thus, no risk of a Type I error emerges.

Table 5.3 Sensitivity analysis of the time-varying predictor: intrinsic regulation

ID estimates: b = 0.36, SE = 0.10, t = 3.60; critical t = 1.96; $\rho_x$ = 0.73, $\bar{\rho}_x$ = 0.49.

Parameter  | ID   | AR(1), IAC = 0.10 | AR(1), IAC = 0.66 | AR(1), IAC = 0.30
$\tau^2$   | 1.61 | 1.52              | 0.02              | 1.26
$\sigma^2$ | 0.81 | 0.90              | 2.40              | 1.16
ICR        | 0.67 | 0.63              | 0.01              | 0.52
IAC        | 0.00 | 0.10              | 0.66              | 0.30

IAC       | √VOC  | Adjusted SE | Reduction in robustness | Increase in Type I error risk
Threshold | 1.837 | 0.184       | NA                      | NA
0.10      | 1.106 | 0.111       | 0.096                   | NA
0.66      | 1.397 | 0.140       | 0.284                   | NA
0.30      | 1.143 | 0.114       | 0.125                   | NA

Table 5.4 Sensitivity analysis of the time-varying predictor: external regulation

ID estimates: b = 0.35, SE = 0.08, t = 4.38; critical t = 1.96; $\rho_x$ = 0.53, $\bar{\rho}_x$ = 0.35.

Parameter  | ID   | AR(1), IAC = 0.10 | AR(1), IAC = 0.66 | AR(1), IAC = 0.30
$\tau^2$   | 1.61 | 1.52              | 0.02              | 1.26
$\sigma^2$ | 0.81 | 0.90              | 2.40              | 1.16
ICR        | 0.67 | 0.63              | 0.01              | 0.52
IAC        | 0.00 | 0.10              | 0.66              | 0.30

IAC       | √VOC  | Adjusted SE | Reduction in robustness | Increase in Type I error risk
Threshold | 2.232 | 0.179       | NA                      | NA
0.10      | 1.087 | 0.087       | 0.080                   | NA
0.66      | 1.301 | 0.104       | 0.232                   | NA
0.30      | 1.116 | 0.089       | 0.104                   | NA

However, the robustness of inference and of the effect size still needs attention. With the maximum IAC of 0.66, the robustness of inference and effect size reduces by 28% and 23%, respectively, for the predictors of intrinsic regulation and external regulation. With the minimum IAC of 0.10, the robustness of inference and effect size reduces by around 10% for both predictors.
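For reference, the threshold square roots of the VOC in Tables 5.3 and 5.4 can be reproduced directly from the reported t statistics (a self-contained check; values as given above):

  # threshold sqrt(VOC) = t statistic / critical t
  c(intrinsic = (0.36 / 0.10) / 1.96,   # 1.837, as in Table 5.3
    external  = (0.35 / 0.08) / 1.96)   # 2.232, as in Table 5.4

Because both thresholds exceed the largest plausible square root of the VOC (1.397 and 1.301 at the maximum IAC), the significance of both regulation predictors survives any omitted AR(1) structure considered here.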
In sum, the sensitivity analysis shows that the inferences made for the regulation predictors are relatively strong even if the AR(1) structure is omitted. However, the inference made for competence needs attention, because even a minimal omitted autocorrelation can lead to a serious Type I error issue. This evidence is critical, since the conclusion drawn about within-student competence is the focus of the original study.

5.5 Conclusion and Future Research

Consistent with previous research (Alejo et al., 2018; Bertrand et al., 2004), the current study proves that when the chronological-order structure within cluster units is omitted in multilevel analysis of longitudinal data, the intraclass correlation due to the individual-level random-effect variance takes over the serial correlation. The current study is the first to formulate this relationship between the random effects and the serial correlation when R is underspecified from AR(1) to ID. This study further determines that the magnitude of the overestimation of the individual-level random-effect variance is weighted by the IAC. The conceptualization of the IAC and the ICR provides new understanding of the conventional intraclass correlation coefficient and of when adjustments are needed. Further, the derivations of the VOCs are conducted separately for time-level and individual-level predictors. These formulas produce suggestions consistent with the simulation-based findings of the earlier discussed prior research, such as Ferron et al. (2002) and LeBeau (2018). Specifically, when the true AR(1) structure is completely omitted, time-varying predictors need adjustments, while time-invariant predictors do not. Noticeably, the current study does not recommend adjusting the standard error estimates of fixed effects when the occasion measures are extensive and the hypothesized lag-1 autocorrelation is small, such that the IAC is smaller than 0.2. Employing the sensitivity analysis framework developed in Chapter 3, empirical researchers and readers are able to easily find evidence of the extent to which an inference is robust. The strategies are demonstrated with an empirical research example (i.e., Taylor et al., 2010).

The current study sets models with random intercepts only. However, random slopes are common in longitudinal data analysis. In particular, if the random effect of slopes is ignored in modeling, the Type I error rate inflates (LeBeau, 2018). Including random slopes in the current study increases the complexity of the variance-covariance structure, since the covariance units depend on the occasion measures. This complexity can be addressed in future studies. For instance, with the experience of constructing an averaged autocorrelation parameter (i.e., the IAC) for the descending serial correlation pattern, an average covariance parameter can be similarly constructed, as long as the overall error variance-covariance is captured correctly. However, the precision and consistency of the averaged autocorrelation and covariance parameters could be affected by missing data and unbalanced designs, which need further tests. Also, the current study only explored the relationship between the ID and AR(1) structures. In future studies, the interrelationships between other commonly used alternative R structures can be developed; for example, AR(1) relates naturally to ARMA(1,1). Finally, future research may study omitted serial correlation in three-level models, for instance when a higher cluster level such as school exists.
Compared with the current study, two additional intraclass correlations then emerge: the school-specific ICR and IAC (Alejo et al., 2018). The quantification of estimation bias due to omitted serial correlation becomes more complex, as the sources of the intraclass correlations must be distinguished.

APPENDICES

APPENDIX 2A

Intraclass Correlation Coefficients in a Three-Level Model

In the current study, the intraclass correlation coefficients (ICCs) of the classroom and school levels are defined as

$$\rho_{classroom} = \frac{\tau_c^2 + \tau_s^2}{\sigma^2 + \tau_c^2 + \tau_s^2} \quad \text{and} \quad \rho_{school} = \frac{\tau_s^2}{\sigma^2 + \tau_c^2 + \tau_s^2},$$

where $\sigma^2$, $\tau_c^2$, and $\tau_s^2$ are the student-, classroom-, and school-level variance components. Another commonly used definition of the ICCs is

$$\rho_{classroom} = \frac{\tau_c^2}{\sigma^2 + \tau_c^2 + \tau_s^2} \quad \text{and} \quad \rho_{school} = \frac{\tau_s^2}{\sigma^2 + \tau_c^2 + \tau_s^2}.$$

The distinction between these two methods occurs only in the classroom-level ICC. Hox, Moerbeek, and Van de Schoot (2010) summarized that these two methods are both correct, though they have slightly different focuses. The latter method focuses on decomposing the variance from each level, which identifies the unique classroom-level variance. In the first method, $\rho_{classroom}$ is derived as follows: the denominator is the total error variance $\sigma^2 + \tau_c^2 + \tau_s^2$, and the numerator is the covariance between two students $i$ and $i'$ in the same classroom $j$ of the same school $k$. With the assumption that the random effects have zero covariance with each other,

$$Cov(Y_{ijk}, Y_{i'jk}) = \tau_c^2 + \tau_s^2.$$

As shown, $\rho_{classroom}$ measures the expected correlation between two students who are in the same class and, also, in the same school. Conversely, $\rho_{school}$ measures the expected correlation between two students who are in the same school but from different classes.
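A minimal R sketch of the two definitions (my own illustration), using the variance shares from the first simulation condition of Appendix 2C:

  iccs_3level <- function(tau2_school, tau2_class, sigma2) {
    total <- tau2_school + tau2_class + sigma2
    c(icc_class_method1 = (tau2_class + tau2_school) / total,  # same classroom
      icc_school        = tau2_school / total,                 # same school only
      icc_class_method2 = tau2_class / total)                  # unique classroom share
  }
  iccs_3level(tau2_school = 0.2, tau2_class = 0.2, sigma2 = 0.6)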
APPENDIX 2B

A Summary of Model Specification, Assumption, and Estimation

Table 2B.1 Summary of model specification, assumptions, and estimation, contrasting the two-level estimated model omitting the middle cluster level with the three-level satisfactory model.

1. Multi-stage sampling design and experimental design with clusters.
Two-level estimated model: (1) In a three-stage sampling design where PSUs are schools, SSUs are classrooms, and USUs are students, the deliberate middle classroom cluster level is omitted in modeling. (2) Or, in a two-stage sampling design where PSUs are schools and SSUs are students, the incidental middle classroom cluster level is omitted in modeling. RCT: treatment is randomly assigned to schools.
Three-level satisfactory model: (1) The model corresponds to the three-stage sampling design, in that all sampling stages, as deliberate cluster levels, are specified in modeling. (2) Or, in a two-stage sampling design where PSUs are schools and SSUs are students, the incidental middle classroom cluster level is included in modeling. RCT: treatment is randomly assigned to schools.

2. All relevant predictors are included in the model.
Two-level estimated model: predictors of interest to answer the research questions are (1) a student-level predictor, (2) a (falsely disaggregated) student-level predictor, and (3) a school-level predictor, plus relevant covariates, such as contextual factors, at each level based on subject-matter knowledge.
Three-level satisfactory model: predictors of interest are (1) a student-level predictor, (2) a classroom-level predictor, and (3) a school-level predictor, plus relevant covariates, such as contextual factors, at each level based on subject-matter knowledge.

3. Random intercepts only.
Two-level estimated model: (1) schools differ in the average value of the outcome; (2) the slopes at all levels do not differ across schools.
Three-level satisfactory model: (1) schools, and classrooms within schools, differ in the average value of the outcome; (2) the slopes at all levels do not differ across schools.

4. The error variance-covariance structure is properly specified.
Two-level estimated model parameters: (1) one ICC, which measures the similarity of students within the same school k, regardless of classrooms; (2) one cluster size, the average number of students within a school k.
Three-level satisfactory model parameters: (1) two ICCs: the expected correlation of two randomly drawn students from the same classroom j in a school k, and the expected correlation of two randomly drawn students from the same school k; (2) two cluster sizes: the average class size and the average number of teachers within each school k.

5. The within-cluster residuals follow a multivariate normal distribution, conditioned on the predictors and covariates (both models).

6. The random effects follow a multivariate normal distribution, conditioned on the predictors and covariates, and the group effects are independent and identically distributed such that no higher cluster level exists (both models; the satisfactory model specifies this at both the classroom and school levels).

7. Homoscedasticity.
Two-level estimated model: (1) constant error variance at all levels conditioned on predictors; (2) or corrected heteroskedastic patterns for the specified nesting structure.
Three-level satisfactory model: (1) constant error variance at all levels conditioned on predictors; (2) or corrected heterogeneity for the specified nesting structure; (3) the assumptions still hold after including the omitted cluster level.

8. The within-cluster residuals and the random effects do not covary (both models, at every specified level).

9. The predictors do not covary with the residuals or random effects at any other level.
Both models: (1) the lower-level predictors are group-mean centered; (2) no omitted confounding variables are assumed at any level.

10. Sample size.
Both models: a sufficiently large sample size (both the number of clusters and the cluster sizes) at all levels to satisfy the desired power and to support asymptotic inference; a balanced design, or at least nearly equal cluster sizes.

11. Estimation.
Two-level estimated model: (1) (restricted) maximum likelihood; (2) a design-based approach for the standard error bias correction.
Three-level satisfactory model: (restricted) maximum likelihood.

Note. The listed model specifications, assumptions, and estimation approaches are summarized from McNeish and Kelley (2019, p. 26), McNeish et al. (2016, p. 116), Snijders and Berkhof (2008), and Snijders and Bosker (2012, p. 102).

APPENDIX 2C

Simulation Parameter Settings and Results of VOCs of Omitting the Middle Cluster Level

Table 2C.1 Simulation parameter settings.

ρ_classroom | ρ_school | Residual share | Avg. class size | Avg. teachers/classrooms per school | (n_class - 1)/(n_school - 1) | Two-level school ICC | Two-level residual share
0.2 | 0.2 | 0.6 | 5  | 10 | 0.08 | 0.22 | 0.78
0.5 | 0.2 | 0.3 | 5  | 10 | 0.08 | 0.24 | 0.76
0.7 | 0.2 | 0.1 | 5  | 10 | 0.08 | 0.26 | 0.74
0.2 | 0.7 | 0.1 | 5  | 10 | 0.08 | 0.72 | 0.28
0.2 | 0.2 | 0.6 | 10 | 5  | 0.18 | 0.24 | 0.76
0.5 | 0.2 | 0.3 | 10 | 5  | 0.18 | 0.29 | 0.71
0.7 | 0.2 | 0.1 | 10 | 5  | 0.18 | 0.33 | 0.67
0.2 | 0.7 | 0.1 | 10 | 5  | 0.18 | 0.74 | 0.26
0.2 | 0.2 | 0.6 | 25 | 2  | 0.49 | 0.30 | 0.70
0.5 | 0.2 | 0.3 | 25 | 2  | 0.49 | 0.45 | 0.55
0.7 | 0.2 | 0.1 | 25 | 2  | 0.49 | 0.54 | 0.46
0.2 | 0.7 | 0.1 | 25 | 2  | 0.49 | 0.80 | 0.20
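The derived columns of Table 2C.1 are consistent with collapsing the two ICCs into the single school-level ICC visible to the misspecified two-level model. The weight lambda in the sketch below is an inference from the tabulated values themselves (it reproduces all twelve derived entries to rounding), not a formula quoted from the text: it equals the share of a student's schoolmates who are also classmates.

  collapsed_icc <- function(rho_class, rho_school, n_class, n_teachers) {
    n_school <- n_class * n_teachers
    lambda <- (n_class - 1) / (n_school - 1)  # 0.08, 0.18, 0.49 in Table 2C.1
    rho_school + lambda * rho_class           # e.g., 0.22 for the first row
  }
  collapsed_icc(rho_class = 0.2, rho_school = 0.2, n_class = 5, n_teachers = 10)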
Table 2C.2 Relative bias of estimates of variances when ρ_classroom = 0.2 and ρ_school = 0.2.

Two-level HLM estimates:
Parameter | Class size | λ | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Residual variance | 5 | 0.08 | 0.30 (0) | [0.13, 0.47] | 0 (0) | [0, 0]
Residual variance | 10 | 0.18 | 0.27 (0) | [0.13, 0.47] | 0 (0) | [0, 0]
Residual variance | 25 | 0.49 | 0.17 (0) | [0.10, 0.26] | 0 (0) | [0, 0.07]
School-level random effect variance | 5 | 0.08 | 0.11 (0) | [0.03, 0.45] | 0 (0) | [0, 0]
School-level random effect variance | 10 | 0.18 | 0.27 (0.05) | [0.06, 2.06] | 0 (0) | [0, 0]
School-level random effect variance | 25 | 0.49 | 0.39 (0.19) | [-1.03, 0.12] | 0 (0) | [0, 0]
Standard error of coefficient | 5 | 0.08 | 0.08 (0) | [0.05, 0.10] | -0.06 (0) | [-0.10, -0.04]
Standard error of coefficient | 10 | 0.18 | 0.09 (0) | [0.06, 0.13] | -0.03 (0) | [-0.05, -0.02]
Standard error of coefficient | 25 | 0.49 | 0.07 (0) | [0.04, 0.11] | -0.01 (0) | [-0.04, 0.00]
Standard error of coefficient | 5 | 0.08 | -0.30 (0) | [-0.36, -0.18] | 0.13 (0) | [0.04, 0.24]
Standard error of coefficient | 10 | 0.18 | -0.45 (0) | [-0.53, -0.34] | 0.16 (0) | [0.06, 0.34]
Standard error of coefficient | 25 | 0.49 | -0.63 (0) | [-0.73, -0.44] | 0.15 (0.01) | [-0.05, 0.42]
Standard error of coefficient | 5 | 0.08 | 0 (0) | [0, 0] | 0 (0) | [0, 0]
Standard error of coefficient | 10 | 0.18 | 0 (0) | [0, 0] | 0 (0) | [0, 0]
Standard error of coefficient | 25 | 0.49 | 0 (0) | [-0.16, 0] | 0 (0) | [-0.01, 0]

Table 2C.2 (cont'd)

Single-level OLS estimates:
Parameter | Class size | λ | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Standard error of coefficient | 5 | 0.08 | 0.20 (0) | [0.14, 0.30] | -0.07 (0) | [-0.13, -0.04]
Standard error of coefficient | 10 | 0.18 | 0.24 (0) | [0.17, 0.31] | -0.04 (0) | [-0.08, -0.01]
Standard error of coefficient | 25 | 0.49 | 0.26 (0) | [0.17, 0.36] | -0.02 (0) | [-0.10, 0.02]
Standard error of coefficient | 5 | 0.08 | -0.21 (0) | [-0.32, -0.10] | 0.06 (0) | [-0.01, 0.13]
Standard error of coefficient | 10 | 0.18 | -0.38 (0) | [-0.50, -0.23] | 0.04 (0) | [-0.01, 0.10]
Standard error of coefficient | 25 | 0.49 | -0.56 (0) | [-0.70, -0.36] | 0.01 (0) | [-0.06, 0.11]
Standard error of coefficient | 5 | 0.08 | -0.68 (0) | [-0.77, -0.49] | -0.01 (0) | [-0.09, 0.12]
Standard error of coefficient | 10 | 0.18 | -0.70 (0) | [-0.78, -0.50] | -0.01 (0) | [-0.09, 0.11]
Standard error of coefficient | 25 | 0.49 | -0.73 (0) | [-0.80, -0.58] | -0.02 (0) | [-0.24, 0.12]

Table 2C.3 Relative bias of estimates of variances when ρ_classroom = 0.5 and ρ_school = 0.2.

Two-level HLM estimates:
Parameter | Class size | λ | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Residual variance | 5 | 0.08 | 1.52 (0.04) | [0.95, 2.08] | 0 (0) | [0, 0]
Residual variance | 10 | 0.18 | 1.36 (0.07) | [0.77, 2.16] | 0 (0) | [0, 0.04]
Residual variance | 25 | 0.49 | 0.81 (0.07) | [0.28, 1.84] | 0.01 (0) | [-0.01, 0.37]
School-level random effect variance | 5 | 0.08 | 0.31 (0.06) | [0.09, 2.50] | 0 (0) | [0, 0]
School-level random effect variance | 10 | 0.18 | 0.39 (0.13) | [0.13, 0.82] | 0 (0) | [0, 0]
School-level random effect variance | 25 | 0.49 | 0.81 (0.43) | [0.13, 2.47] | 0 (0) | [0, 0]
Standard error of coefficient | 5 | 0.08 | 0.45 (0) | [0.38, 0.54] | -0.09 (0) | [-0.35, 0.00]
Standard error of coefficient | 10 | 0.18 | 0.47 (0) | [0.38, 0.59] | -0.04 (0) | [-0.20, 0.06]
Standard error of coefficient | 25 | 0.49 | 0.34 (0) | [0.22, 0.49] | -0.01 (0) | [-0.32, 0.09]
Standard error of coefficient | 5 | 0.08 | -0.48 (0) | [-0.50, -0.44] | 0.01 (0) | [-0.05, 0.08]
Standard error of coefficient | 10 | 0.18 | -0.63 (0) | [-0.66, -0.59] | -0.01 (0) | [-0.09, 0.06]
Standard error of coefficient | 25 | 0.49 | -0.78 (0) | [-0.82, -0.71] | -0.13 (0.01) | [-0.27, 0.03]
Standard error of coefficient | 5 | 0.08 | 0 (0) | [0, 0] | 0 (0) | [0, 0]
Standard error of coefficient | 10 | 0.18 | 0 (0) | [-0.07, 0] | 0 (0) | [0, 0]
Standard error of coefficient | 25 | 0.49 | -0.01 (0) | [-0.21, 0.01] | 0 (0) | [-0.01, 0]

Table 2C.3 (cont'd)

Single-level OLS estimates:
Parameter | Class size | λ | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Standard error of coefficient | 5 | 0.08 | 0.65 (0) | [0.54, 0.81] | -0.09 (0) | [-0.39, 0.01]
Standard error of coefficient | 10 | 0.18 | 0.73 (0) | [0.60, 0.85] | -0.05 (0) | [-0.23, 0.08]
Standard error of coefficient | 25 | 0.49 | 0.78 (0) | [0.59, 1.02] | -0.01 (0) | [-0.44, 0.14]
Standard error of coefficient | 5 | 0.08 | -0.40 (0) | [-0.44, -0.37] | 0.03 (0) | [-0.03, 0.10]
Standard error of coefficient | 10 | 0.18 | -0.57 (0) | [-0.63, -0.46] | <0.01 (0) | [-0.13, 0.17]
Standard error of coefficient | 25 | 0.49 | -0.71 (0) | [-0.77, -0.66] | <0.01 (0.01) | [-0.09, 0.11]
Standard error of coefficient | 5 | 0.08 | -0.70 (0) | [-0.78, -0.51] | -0.01 (0) | [-0.12, 0.12]
Standard error of coefficient | 10 | 0.18 | -0.73 (0) | [-0.80, -0.59] | -0.01 (0) | [-0.19, 0.16]
Standard error of coefficient | 25 | 0.49 | -0.78 (0) | [-0.83, -0.72] | -0.04 (0.01) | [-0.28, 0.19]

Table 2C.4 Relative bias of estimates of variances when ρ_classroom = 0.7 and ρ_school = 0.2.
Two-level HLM estimates:
Parameter | Class size | λ | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Residual variance | 5 | 0.08 | 6.38 (0.55) | [4.19, 8.69] | 0 (0) | [0, 0]
Residual variance | 10 | 0.18 | 5.76 (0.87) | [3.43, 8.74] | 0.01 (0) | [-0.01, 0.41]
Residual variance | 25 | 0.49 | 3.36 (1.17) | [1.15, 8.13] | 0.10 (0.08) | [-0.37, 1.99]
School-level random effect variance | 5 | 0.08 | 0.46 (0.22) | [-1.00, 6.64] | 0 (0) | [-0.02, 0]
School-level random effect variance | 10 | 0.18 | 0.58 (0.18) | [-1.07, 0.21] | 0 (0) | [0, 0]
School-level random effect variance | 25 | 0.49 | 1.00 (0.50) | [-2.83, 0.20] | 0 (0) | [0, 0]
Standard error of coefficient | 5 | 0.08 | 1.47 (0) | [1.30, 1.66] | -0.10 (0.04) | [-0.92, 0.28]
Standard error of coefficient | 10 | 0.18 | 1.47 (0.01) | [1.26, 1.74] | -0.03 (0.06) | [-0.99, 0.45]
Standard error of coefficient | 25 | 0.49 | 1.09 (0.01) | [0.78, 1.46] | 0.03 (0.08) | [-0.94, 0.45]
Standard error of coefficient | 5 | 0.08 | -0.55 (0) | [-0.55, -0.53] | 0.02 (0) | [-0.03, 0.08]
Standard error of coefficient | 10 | 0.18 | -0.69 (0) | [-0.70, -0.68] | -0.08 (0) | [-0.16, 0.01]
Standard error of coefficient | 25 | 0.49 | -0.83 (0) | [-0.85, -0.81] | -0.24 (0.01) | [-0.34, -0.14]
Standard error of coefficient | 5 | 0.08 | 0 (0) | [0, 0] | 0 (0) | [0, 0]
Standard error of coefficient | 10 | 0.18 | 0 (0) | [-0.98, 0] | 0 (0) | [-0.01, 0]
Standard error of coefficient | 25 | 0.49 | -0.01 (0) | [-0.30, 0.04] | 0 (0) | [-0.01, 0]

Table 2C.4 (cont'd)

Single-level OLS estimates:
Parameter | Class size | λ | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Standard error of coefficient | 5 | 0.08 | 1.83 (0.01) | [1.61, 2.13] | -0.11 (0.04) | [-0.93, 0.29]
Standard error of coefficient | 10 | 0.18 | 1.99 (0.01) | [1.68, 2.24] | -0.03 (0.06) | [-0.99, 0.50]
Standard error of coefficient | 25 | 0.49 | 2.07 (0.02) | [1.65, 2.59] | 0.05 (0.10) | [-0.94, 0.58]
Standard error of coefficient | 5 | 0.08 | -0.48 (0) | [-0.53, -0.40] | 0.01 (0) | [-0.12, 0.16]
Standard error of coefficient | 10 | 0.18 | -0.63 (0) | [-0.67, -0.55] | 0.01 (0) | [-0.09, 0.08]
Standard error of coefficient | 25 | 0.49 | -0.75 (0) | [-0.79, -0.61] | <0.01 (0.01) | [-0.14, 0.12]
Standard error of coefficient | 5 | 0.08 | -0.71 (0) | [-1.00, -0.55] | -0.01 (0) | [-0.99, 0.15]
Standard error of coefficient | 10 | 0.18 | -0.74 (0) | [-0.99, -0.66] | -0.02 (0) | [-0.97, 0.21]
Standard error of coefficient | 25 | 0.49 | -0.81 (0) | [-0.84, -0.78] | -0.05 (0.01) | [-0.38, 0.21]

Table 2C.5 Relative bias of estimates of variances when ρ_classroom = 0.2 and ρ_school = 0.7.

Two-level HLM estimates:
Parameter | Class size | λ | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Residual variance | 5 | 0.08 | 1.82 (0.05) | [1.16, 2.48] | 0 (0) | [0, 0]
Residual variance | 10 | 0.18 | 1.63 (0.09) | [0.93, 2.58] | 0 (0) | [0, 0]
Residual variance | 25 | 0.49 | 0.97 (0.12) | [0.24, 2.20] | 0 (0) | [0, 0.05]
School-level random effect variance | 5 | 0.08 | 0.03 (0) | [0.01, 0.09] | 0 (0) | [0, 0]
School-level random effect variance | 10 | 0.18 | 0.07 (0) | [0.02, 0.24] | 0 (0) | [0, 0]
School-level random effect variance | 25 | 0.49 | 0.18 (0.01) | [0.03, 0.76] | 0 (0) | [0, 0]
Standard error of coefficient | 5 | 0.08 | 0.54 (0) | [0.45, 0.63] | -0.05 (0.05) | [-0.80, 0.28]
Standard error of coefficient | 10 | 0.18 | 0.56 (0) | [0.45, 0.69] | -0.02 (0.05) | [-0.82, 0.28]
Standard error of coefficient | 25 | 0.49 | 0.40 (0) | [0.26, 0.57] | 0.01 (0.04) | [-0.77, 0.27]
Standard error of coefficient | 5 | 0.08 | -0.49 (0) | [-0.51, -0.46] | 0.08 (0) | [-0.06, 0.29]
Standard error of coefficient | 10 | 0.18 | -0.64 (0) | [-0.67, -0.61] | 0.07 (0) | [-0.07, 0.24]
Standard error of coefficient | 25 | 0.49 | -0.79 (0) | [-0.83, -0.69] | -0.06 (0) | [-0.23, 0.16]
Standard error of coefficient | 5 | 0.08 | 0 (0) | [0, 0] | 0 (0) | [0, 0]
Standard error of coefficient | 10 | 0.18 | 0 (0) | [0, 0] | 0 (0) | [0, 0]
Standard error of coefficient | 25 | 0.49 | 0 (0) | [-0.03, 0] | 0 (0) | [-0.01, 0]
Table 2C.5 (cont'd)

Single-level OLS estimates:
Parameter | Class size | λ | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Standard error of coefficient | 5 | 0.08 | 1.84 (0.02) | [1.45, 2.51] | -0.01 (0.08) | [-0.83, 0.51]
Standard error of coefficient | 10 | 0.18 | 2.00 (0.03) | [1.58, 2.46] | 0.03 (0.09) | [-0.85, 0.57]
Standard error of coefficient | 25 | 0.49 | 2.06 (0.03) | [1.50, 2.66] | 0.09 (0.11) | [-0.83, 0.67]
Standard error of coefficient | 5 | 0.08 | -0.05 (0.01) | [-0.20, 0.15] | 0.27 (0.00) | [0.08, 0.55]
Standard error of coefficient | 10 | 0.18 | -0.33 (0.01) | [-0.55, -0.05] | 0.15 (0.02) | [0.01, 0.34]
Standard error of coefficient | 25 | 0.49 | -0.55 (0.01) | [-0.65, -0.43] | 0.06 (0) | [-0.13, 0.27]
Standard error of coefficient | 5 | 0.08 | -0.83 (0) | [-0.84, -0.78] | -0.01 (0.01) | [-0.14, 0.09]
Standard error of coefficient | 10 | 0.18 | -0.83 (0) | [-0.85, -0.78] | -0.04 (0.01) | [-0.34, 0.27]
Standard error of coefficient | 25 | 0.49 | -0.84 (0) | [-0.85, -0.83] | -0.01 (0.02) | [-0.14, 0.11]

APPENDIX 3A

Quantifying the Robustness of Inference with Type II Error

In cases where the VOC is smaller than one, a Type II error may occur. This discussion serves scenarios in which the standard error estimates are overestimated. For example, Chapter 2 showed that the VOCs of the individual-level predictor are smaller than 1 when the upper middle cluster level is omitted. Further, in Chapter 4, the standard error estimate of the middle-cluster-level predictor coefficient can also be overestimated when the highest cluster level is omitted. This scenario, which carries a potential risk of making a Type II error, is demonstrated with an empirical study in Chapter 4 by implementing the robustness-of-inference measures below.

Identical to the rationale discussed for comparing the deviation of the estimated models from the true models, Figure 3.A.1 shows the two possible scenarios of having or not having a Type II error when the t statistic of the estimated model is smaller than the critical t value. Unlike Figure 3.2, the VOC turns out to be a deflation instead of an inflation, so the t statistic of the hypothesized satisfactory model exceeds the estimated one. The definitions quantifying the deviations of the t statistics from the critical t value remain the same, while the formulas are reversed relative to the Type I error discussion. In panel (a), a Type II error does not occur, since the satisfactory-model t statistic also falls below the critical value; in panel (b), a Type II error occurs, since the satisfactory-model t statistic exceeds the critical value. Following the ideas of constructing the measures of robustness of inference and of effect size in the Type I error case, these two measures are adapted to the setting with no Type II error, and they retain the same form when a Type II error occurs. Further, the index of the risk of making a Type II error is constructed identically to the Type I error one, with the deviation being positive, since the t statistic of the satisfactory model exceeds the estimated one.

Figure 3.A.1 Two scenarios of comparing the t statistics of the estimated model and the hypothesized satisfactory model: (a) a scenario with no Type II error; (b) a scenario with a Type II error.
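The panel logic of Figure 3.A.1 can be sketched as a small decision rule (my own construction, not the dissertation's code; the deflation factor and t statistic below are hypothetical inputs):

  type2_scenario <- function(t_est, sqrt_voc, t_crit = 1.96) {
    stopifnot(sqrt_voc < 1)      # deflation: the estimated SE is overestimated
    t_sat <- t_est / sqrt_voc    # satisfactory-model t exceeds the estimated t
    if (t_est >= t_crit) return("estimate already significant; no Type II risk")
    if (t_sat < t_crit) "panel (a): no Type II error" else "panel (b): Type II error"
  }
  type2_scenario(t_est = 1.80, sqrt_voc = 0.85)   # hypothetical values; panel (b)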
APPENDIX 4A

Simulation Parameter Settings and Results of VOCs of Omitting the Highest Cluster Level

Table 4A.1 Simulation parameter settings.

Variance and correlation parameters | Avg. teachers per school | No. of schools
0.5, 0.1, 0.4 | 20 | 5
0.5, 0.8, 0.4 | 20 | 5
0.4, 0.2, 0.8 | 20 | 5
0.6, 0.2, 0.2 | 20 | 5
0.5, 0.1, 0.4 | 10 | 10
0.5, 0.8, 0.4 | 10 | 10
0.4, 0.2, 0.8 | 10 | 10
0.6, 0.2, 0.2 | 10 | 10
0.5, 0.1, 0.4 | 4 | 25
0.5, 0.8, 0.4 | 4 | 25
0.4, 0.2, 0.8 | 4 | 25
0.6, 0.2, 0.2 | 4 | 25
0.5, 0.1, 0.4 | 2 | 50
0.5, 0.8, 0.4 | 2 | 50
0.4, 0.2, 0.8 | 2 | 50
0.6, 0.2, 0.2 | 2 | 50

Table 4A.2 Relative bias of estimates of variances.

Two-level HLM estimates:
Parameter | Teachers per school | Schools | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Teacher-level random effect variance | 20 | 5 | 0.14 (0.02) | [0, 1.14] | 0 (0) | [0, 0]
Teacher-level random effect variance | 10 | 10 | 0.19 (0.03) | [0, 1.12] | 0 (0) | [0, 0]
Teacher-level random effect variance | 4 | 25 | 0.24 (0.03) | [0, 1.06] | 0 (0) | [0, 0]
Teacher-level random effect variance | 2 | 50 | 0.31 (0.06) | [0, 1.68] | 0 (0) | [0, 0]
Standard error of coefficient | 20 | 5 | 0.06 (0) | [0, 0.43] | 0 (0) | [0, 0.02]
Standard error of coefficient | 10 | 10 | 0.08 (0) | [0, 0.42] | 0 (0) | [0, 0.02]
Standard error of coefficient | 4 | 25 | 0.10 (0.01) | [0, 0.41] | 0.01 (0) | [0, 0.03]
Standard error of coefficient | 2 | 50 | 0.13 (0.01) | [0, 0.58] | 0.01 (0) | [0, 0.04]
Standard error of coefficient | 20 | 5 | -0.33 (0.04) | [-0.69, 0] | 0 (0) | [-0.01, 0.01]
Standard error of coefficient | 10 | 10 | -0.30 (0.02) | [-0.58, 0] | 0 (0) | [-0.01, 0.01]
Standard error of coefficient | 4 | 25 | -0.17 (0.01) | [-0.37, 0] | 0 (0) | [0, 0]
Standard error of coefficient | 2 | 50 | -0.08 (0) | [-0.21, 0.00] | 0 (0) | [0, 0]

Table 4A.2 (cont'd)

Single-level OLS estimates:
Parameter | Teachers per school | Schools | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Standard error of coefficient | 20 | 5 | 0.38 (0) | [0.24, 0.56] | 0.02 (0) | [-0.16, 0.07]
Standard error of coefficient | 10 | 10 | 0.39 (0) | [0.24, 0.56] | 0.01 (0) | [-0.15, 0.08]
Standard error of coefficient | 4 | 25 | 0.40 (0) | [0.27, 0.61] | <0.01 (0) | [-0.22, 0.07]
Standard error of coefficient | 2 | 50 | 0.40 (0) | [0.25, 0.57] | <0.01 (0) | [-0.18, 0.07]
Standard error of coefficient | 20 | 5 | -0.66 (0) | [-0.71, -0.58] | -0.02 (0) | [-0.11, 0.10]
Standard error of coefficient | 10 | 10 | -0.66 (0) | [-0.70, -0.57] | -0.01 (0) | [-0.11, 0.10]
Standard error of coefficient | 4 | 25 | -0.66 (0) | [-0.71, -0.55] | <0.01 (0) | [-0.10, 0.11]
Standard error of coefficient | 2 | 50 | -0.65 (0) | [-0.71, -0.51] | <0.01 (0) | [-0.10, 0.11]
Standard error of coefficient | 20 | 5 | -0.79 (0) | [-0.91, -0.64] | -0.03 (0) | [-0.12, 0.10]
Standard error of coefficient | 10 | 10 | -0.78 (0) | [-0.87, -0.65] | -0.02 (0) | [-0.11, 0.10]
Standard error of coefficient | 4 | 25 | -0.74 (0) | [-0.82, -0.67] | -0.01 (0) | [-0.11, 0.10]
Standard error of coefficient | 2 | 50 | -0.71 (0) | [-0.76, -0.65] | <0.01 (0) | [-0.11, 0.11]

Table 4A.3 Relative bias of estimates of variances.

Two-level HLM estimates:
Parameter | Teachers per school | Schools | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Teacher-level random effect variance | 20 | 5 | 0.59 (0.26) | [0, 3.79] | 0 (0) | [0, 0]
Teacher-level random effect variance | 10 | 10 | 0.82 (0.20) | [0.00, 2.58] | 0 (0) | [0, 0]
Teacher-level random effect variance | 4 | 25 | 0.95 (0.17) | [0.15, 3.38] | 0 (0) | [0, 0]
Teacher-level random effect variance | 2 | 50 | 1.02 (0.23) | [0.00, 3.29] | 0 (0) | [0, 0]
Standard error of coefficient | 20 | 5 | 0.24 (0.03) | [0, 1.16] | 0.02 (0) | [0, 0.05]
Standard error of coefficient | 10 | 10 | 0.33 (0.02) | [0, 0.87] | 0.02 (0) | [-0.97, 0.06]
Standard error of coefficient | 4 | 25 | 0.38 (0.02) | [0.07, 1.07] | 0.02 (0) | [0.01, 0.05]
Standard error of coefficient | 2 | 50 | 0.40 (0.03) | [0.00, 1.04] | 0.03 (0) | [0.00, 0.09]
Standard error of coefficient | 20 | 5 | -0.57 (0.03) | [-0.75, 0] | -0.01 (0) | [-0.02, 0.01]
Standard error of coefficient | 10 | 10 | -0.52 (0.01) | [-0.99, 0] | -0.01 (0) | [-0.97, 0.01]
Standard error of coefficient | 4 | 25 | -0.35 (0) | [-0.45, -0.15] | 0 (0) | [-0.01, 0.01]
Standard error of coefficient | 2 | 50 | -0.17 (0) | [-0.25, 0.00] | 0 (0) | [0, 0]

Table 4A.3 (cont'd)

Single-level OLS estimates:
Parameter | Teachers per school | Schools | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Standard error of coefficient | 20 | 5 | 1.02 (0.06) | [0.62, 1.92] | 0.21 (0.04) | [-0.68, 0.37]
Standard error of coefficient | 10 | 10 | 1.12 (0.04) | [0.63, 1.75] | 0.13 (0.05) | [-0.91, 0.38]
Standard error of coefficient | 4 | 25 | 1.19 (0.03) | [0.68, 1.74] | 0.05 (0.06) | [-0.97, 0.36]
Standard error of coefficient | 2 | 50 | 1.20 (0.02) | [0.71, 1.63] | 0.03 (0.05) | [-0.83, 0.35]
Standard error of coefficient | 20 | 5 | -0.68 (0) | [-0.74, -0.49] | -0.07 (0.01) | [-0.25, 0.35]
Standard error of coefficient | 10 | 10 | -0.66 (0) | [-0.74, -0.54] | -0.03 (0.01) | [-0.22, 0.25]
Standard error of coefficient | 4 | 25 | -0.65 (0) | [-0.72, -0.51] | <0.01 (0.01) | [-0.19, 0.33]
Standard error of coefficient | 2 | 50 | -0.65 (0) | [-0.74, -0.49] | 0.01 (0) | [-0.16, 0.19]
Standard error of coefficient | 20 | 5 | -0.89 (0) | [-0.94, -0.72] | -0.09 (0.01) | [-0.26, 0.32]
Standard error of coefficient | 10 | 10 | -0.88 (0) | [-1.00, -0.73] | -0.05 (0.01) | [-0.24, 0.21]
Standard error of coefficient | 4 | 25 | -0.84 (0) | [-0.87, -0.78] | -0.03 (0.01) | [-0.20, 0.28]
Standard error of coefficient | 2 | 50 | -0.79 (0) | [-0.82, -0.74] | -0.01 (0) | [-0.18, 0.15]

Table 4A.4 Relative bias of estimates of variances.
Two-level HLM estimates:
Parameter | Teachers per school | Schools | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Teacher-level random effect variance | 20 | 5 | 1.69 (1.97) | [0.00, 0.39] | <0.01 (0) | [-0.01, 0]
Teacher-level random effect variance | 10 | 10 | 2.47 (1.79) | [0.11, 7.11] | <0.01 (0) | [0, 0]
Teacher-level random effect variance | 4 | 25 | 2.85 (1.11) | [0.70, 7.44] | <0.01 (0) | [0, 0]
Teacher-level random effect variance | 2 | 50 | 3.12 (1.06) | [1.06, 6.77] | <0.01 (0) | [0, 0]
Standard error of coefficient | 20 | 5 | 0.57 (0.14) | [0.00, 2.28] | 0.05 (0) | [0.00, 0.12]
Standard error of coefficient | 10 | 10 | 0.80 (0.11) | [0.05, 1.78] | 0.06 (0) | [0.01, 0.13]
Standard error of coefficient | 4 | 25 | 0.91 (0.06) | [0.29, 1.85] | 0.07 (0) | [0.03, 0.13]
Standard error of coefficient | 2 | 50 | 0.97 (0.06) | [0.42, 1.69] | 0.07 (0) | [0.03, 0.17]
Standard error of coefficient | 20 | 5 | -0.67 (0.01) | [-0.77, 0.00] | -0.02 (0) | [-0.06, 0.01]
Standard error of coefficient | 10 | 10 | -0.61 (0) | [-0.66, -0.27] | -0.01 (0) | [-0.04, 0.01]
Standard error of coefficient | 4 | 25 | -0.43 (0) | [-0.48, -0.33] | <0.01 (0) | [-0.02, 0.01]
Standard error of coefficient | 2 | 50 | -0.24 (0) | [-0.27, -0.18] | <0.01 (0) | [-0.01, 0.00]

Table 4A.4 (cont'd)

Single-level OLS estimates:
Parameter | Teachers per school | Schools | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Standard error of coefficient | 20 | 5 | 0.91 (0.12) | [0.38, 2.26] | 0.23 (0.04) | [-0.89, 0.38]
Standard error of coefficient | 10 | 10 | 1.07 (0.08) | [0.45, 1.88] | 0.18 (0.05) | [-0.87, 0.39]
Standard error of coefficient | 4 | 25 | 1.18 (0.04) | [0.53, 1.93] | 0.09 (0.06) | [-0.93, 0.37]
Standard error of coefficient | 2 | 50 | 1.20 (0.03) | [0.64, 1.67] | 0.04 (0.06) | [-0.89, 0.36]
Standard error of coefficient | 20 | 5 | -0.59 (0.01) | [-0.70, -0.23] | -0.10 (0.02) | [-0.34, 0.57]
Standard error of coefficient | 10 | 10 | -0.54 (0) | [-0.69, -0.33] | <0.01 (0.02) | [-0.31, 0.45]
Standard error of coefficient | 4 | 25 | -0.52 (0) | [-0.65, -0.33] | 0.04 (0.01) | [-0.21, 0.48]
Standard error of coefficient | 2 | 50 | -0.51 (0) | [-0.63, -0.33] | 0.06 (0.01) | [-0.16, 0.28]
Standard error of coefficient | 20 | 5 | -0.91 (0) | [-0.94, -0.68] | -0.16 (0.02) | [-0.39, 0.44]
Standard error of coefficient | 10 | 10 | -0.90 (0) | [-0.92, -0.78] | -0.07 (0.02) | [-0.33, 0.38]
Standard error of coefficient | 4 | 25 | -0.86 (0) | [-0.88, -0.82] | -0.03 (0.01) | [-0.25, 0.38]
Standard error of coefficient | 2 | 50 | -0.81 (0) | [-0.82, -0.78] | -0.01 (0) | [-0.22, 0.19]

APPENDIX 5A

Simulation Parameter Settings and Results of VOCs of Omitting the Lowest Cluster Level

Table 5A.1 Simulation parameter settings.

T | ρ | IAC | Lag-1 autocorrelation of x | ρ̄_x
6 | 0.9 | 0.789 | 0.5 | 0.167
6 | 0.7 | 0.476 | 0.5 | 0.167
6 | 0.5 | 0.269 | 0.5 | 0.167
6 | 0.2 | 0.079 | 0.5 | 0.167
10 | 0.9 | 0.697 | 0.7 | 0.140
10 | 0.7 | 0.351 | 0.7 | 0.140
10 | 0.5 | 0.178 | 0.7 | 0.140
10 | 0.2 | 0.049 | 0.7 | 0.140
30 | 0.9 | 0.423 | 0.9 | 0.060
30 | 0.7 | 0.143 | 0.9 | 0.060
30 | 0.5 | 0.064 | 0.9 | 0.060
30 | 0.2 | 0.017 | 0.9 | 0.060

Table 5A.2 Relative bias of estimates of variances when the lag-1 autocorrelation ρ = 0.9.

Two-level ID-model estimates:
Parameter | T | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Residual variance | 6 | -0.79 (0.00) | [-0.81, -0.76] | 0.00 (0.00) | [-0.12, 0.12]
Residual variance | 10 | -0.70 (0.00) | [-0.73, -0.66] | 0.00 (0.00) | [-0.10, 0.11]
Residual variance | 30 | -0.42 (0.00) | [-0.47, -0.36] | 0.00 (0.00) | [-0.07, 0.10]
Individual-level random effect variance | 6 | 1.77 (0.04) | [1.24, 2.23] | 0.00 (0.04) | [-0.58, 0.56]
Individual-level random effect variance | 10 | 1.57 (0.03) | [1.11, 2.09] | 0.01 (0.03) | [-0.54, 0.57]
Individual-level random effect variance | 30 | 0.95 (0.02) | [0.55, 1.32] | 0.00 (0.02) | [-0.38, 0.36]
Standard error of the time-varying predictor coefficient | 6 | -0.39 (0.00) | [-0.41, -0.36] | -0.04 (0.00) | [-0.12, 0.02]
Standard error of the time-varying predictor coefficient | 10 | -0.51 (0.00) | [-0.53, -0.49] | -0.03 (0.00) | [-0.11, -0.04]
Standard error of the time-varying predictor coefficient | 30 | -0.71 (0.00) | [-0.72, -0.69] | -0.03 (0.00) | [-0.08, 0.03]

Single-level OLS estimates:
Standard error of the individual-level predictor coefficient | 6 | -0.56 (0.00) | [-0.57, -0.55] | 0.01 (0.00) | [-0.01, 0.04]
Standard error of the individual-level predictor coefficient | 10 | -0.64 (0.00) | [-0.65, -0.63] | 0.02 (0.00) | [-0.01, 0.05]
Standard error of the individual-level predictor coefficient | 30 | -0.74 (0.00) | [-0.75, -0.73] | 0.12 (0.00) | [0.06, 0.17]
Table 5A.3 Relative bias of estimates of variances when the lag-1 autocorrelation ρ = 0.7.

Two-level ID-model estimates:
Parameter | T | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Residual variance | 6 | -0.48 (0.00) | [-0.53, -0.42] | 0.00 (0.00) | [-0.10, 0.10]
Residual variance | 10 | -0.35 (0.00) | [-0.41, -0.29] | 0.00 (0.00) | [-0.09, 0.10]
Residual variance | 30 | -0.14 (0.00) | [-0.19, -0.09] | 0.00 (0.00) | [-0.06, 0.06]
Individual-level random effect variance | 6 | 1.07 (0.02) | [0.58, 1.53] | 0.00 (0.02) | [-0.53, 0.51]
Individual-level random effect variance | 10 | 0.79 (0.01) | [0.46, 1.13] | 0.00 (0.01) | [-0.30, 0.37]
Individual-level random effect variance | 30 | 0.33 (0.01) | [0.09, 0.59] | 0.01 (0.01) | [-0.24, 0.28]
Standard error of the time-varying predictor coefficient | 6 | -0.33 (0.00) | [-0.40, -0.30] | -0.04 (0.00) | [-0.15, 0.01]
Standard error of the time-varying predictor coefficient | 10 | -0.41 (0.00) | [-0.44, -0.39] | 0.05 (0.00) | [0.01, 0.09]
Standard error of the time-varying predictor coefficient | 30 | -0.63 (0.00) | [-0.65, -0.61] | 0.01 (0.00) | [-0.04, 0.04]

Single-level OLS estimates:
Standard error of the individual-level predictor coefficient | 6 | -0.50 (0.00) | [-0.52, -0.48] | 0.02 (0.00) | [-0.01, 0.06]
Standard error of the individual-level predictor coefficient | 10 | -0.58 (0.00) | [-0.60, -0.57] | 0.02 (0.00) | [-0.01, 0.04]
Standard error of the individual-level predictor coefficient | 30 | -0.62 (0.00) | [-0.64, -0.61] | 0.36 (0.00) | [0.26, 0.41]

Table 5A.4 Relative bias of estimates of variances when the lag-1 autocorrelation ρ = 0.5.

Two-level ID-model estimates:
Parameter | T | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Residual variance | 6 | -0.27 (0.00) | [-0.34, -0.20] | 0.00 (0.00) | [-0.09, 0.09]
Residual variance | 10 | -0.18 (0.00) | [-0.24, -0.12] | 0.00 (0.00) | [-0.08, 0.07]
Residual variance | 30 | -0.07 (0.00) | [-0.11, -0.03] | 0.00 (0.00) | [-0.05, 0.04]
Individual-level random effect variance | 6 | 0.61 (0.01) | [0.28, 0.95] | 0.01 (0.01) | [-0.33, 0.36]
Individual-level random effect variance | 10 | 0.40 (0.01) | [0.11, 0.67] | 0.00 (0.01) | [-0.29, 0.28]
Individual-level random effect variance | 30 | 0.15 (0.01) | [-0.07, 0.38] | 0.00 (0.01) | [-0.22, 0.23]
Standard error of the time-varying predictor coefficient | 6 | -0.27 (0.00) | [-0.36, -0.22] | -0.03 (0.00) | [-0.15, 0.05]
Standard error of the time-varying predictor coefficient | 10 | -0.32 (0.00) | [-0.35, -0.29] | 0.12 (0.00) | [0.07, 0.17]
Standard error of the time-varying predictor coefficient | 30 | -0.38 (0.00) | [-0.40, -0.37] | 0.58 (0.00) | [0.51, 0.66]

Single-level OLS estimates:
Standard error of the individual-level predictor coefficient | 6 | -0.45 (0.00) | [-0.48, -0.39] | 0.02 (0.00) | [-0.01, 0.10]
Standard error of the individual-level predictor coefficient | 10 | -0.54 (0.00) | [-0.57, -0.52] | 0.01 (0.00) | [-0.01, 0.03]
Standard error of the individual-level predictor coefficient | 30 | -0.70 (0.00) | [-0.72, -0.68] | 0.00 (0.00) | [-0.01, 0.01]

Table 5A.5 Relative bias of estimates of variances when the lag-1 autocorrelation ρ = 0.2.

Two-level ID-model estimates:
Parameter | T | Relative bias: Mean (Var) | Range | Adjusted: Mean (Var) | Range
Residual variance | 6 | -0.08 (0.00) | [-0.15, 0.01] | 0.00 (0.00) | [-0.08, 0.10]
Residual variance | 10 | -0.05 (0.00) | [-0.12, 0.02] | 0.00 (0.00) | [-0.08, 0.07]
Residual variance | 30 | -0.02 (0.00) | [-0.06, 0.02] | 0.00 (0.00) | [-0.04, 0.04]
Individual-level random effect variance | 6 | 0.17 (0.01) | [-0.11, 0.44] | 0.00 (0.01) | [-0.30, 0.27]
Individual-level random effect variance | 10 | 0.11 (0.01) | [-0.15, 0.40] | 0.00 (0.01) | [-0.26, 0.29]
Individual-level random effect variance | 30 | 0.04 (0.00) | [-0.18, 0.24] | 0.00 (0.00) | [-0.22, 0.20]
Standard error of the time-varying predictor coefficient | 6 | -0.12 (0.00) | [-0.16, -0.06] | 0.09 (0.00) | [0.02, 0.14]
Standard error of the time-varying predictor coefficient | 10 | -0.14 (0.00) | [-0.17, -0.10] | 0.29 (0.00) | [0.23, 0.35]
Standard error of the time-varying predictor coefficient | 30 | -0.17 (0.00) | [-0.19, -0.15] | 1.05 (0.00) | [0.93, 1.12]

Single-level OLS estimates:
Standard error of the individual-level predictor coefficient | 6 | -0.40 (0.00) | [-0.43, -0.36] | 0.00 (0.00) | [-0.01, 0.01]
Standard error of the individual-level predictor coefficient | 10 | -0.50 (0.00) | [-0.53, -0.47] | 0.00 (0.00) | [-0.01, 0.01]
Standard error of the individual-level predictor coefficient | 30 | -0.69 (0.00) | [-0.70, -0.66] | 0.00 (0.00) | [0.00, 0.00]

BIBLIOGRAPHY

Abadie, A., Athey, S., Imbens, G. W., & Wooldridge, J. (2017). When should you adjust standard errors for clustering? (No. w24003). National Bureau of Economic Research.

Abadie, A., Athey, S., Imbens, G. W., & Wooldridge, J. M. (2020). Sampling-based and design-based uncertainty in regression analysis. Econometrica, 88(1), 265-296.

Abe, Y., & Gee, K. A. (2014).
Sensitivity analyses for clustered data: An illustration from a large-scale clustered randomized controlled trial in education. Evaluation and Program Planning, 47, 26-34.

Adelson, J. L., McCoach, D. B., & Gavin, M. K. (2012). Examining the effects of gifted programming in mathematics and reading using the ECLS-K. Gifted Child Quarterly, 56(1), 25-39.

Akerlof, G. A., & Kranton, R. E. (2002). Identity and schooling: Some lessons for the economics of education. Journal of Economic Literature, 40(4), 1167-1201.

Alejo, J., Montes-Rojas, G., & Sosa-Escudero, W. (2018). Testing for serial correlation in hierarchical linear models. Journal of Multivariate Analysis, 165, 101-116.

Angrist, J. D., & Pischke, J. S. (2008). Mostly harmless econometrics: An empiricist's companion. Princeton University Press.

Antonakis, J., Bastardoz, N., & Rönkkö, M. (2019). On ignoring the random effects assumption in multilevel models: Review, critique, and recommendations. Organizational Research Methods, 1094428119877457.

Barr, R., & Dreeben, R. (1983). How schools work. Chicago: University of Chicago Press.

Baek, E. K., & Ferron, J. M. (2013). Multilevel models for multiple-baseline data: Modeling across-participant variation in autocorrelation and residual variance. Behavior Research Methods, 45(1), 65-74.

Baldwin, S. A., & Fellingham, G. W. (2013). Bayesian methods for the analysis of small sample multilevel data with a complex variance structure. Psychological Methods, 18(2), 151.

Bates, D., Maechler, M., Bolker, B., Walker, S., & Haubo Bojesen Christensen, R. (2015). lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1-7. 2014.

Battaglia, M. (2008). Multi-stage sample. In P. J. Lavrakas (Ed.), Encyclopedia of survey research methods (p. 493). Thousand Oaks, CA: SAGE Publications. doi: 10.4135/9781412963947.n311

Baltagi, B. H., & Li, Q. (1991). A joint test for serial correlation and random individual effects. Statistics & Probability Letters, 11(3), 277-280.

Baltagi, B. H., Song, S. H., & Jung, B. C. (2002). Simple LM tests for the unbalanced nested error component regression model. Econometric Reviews, 21(2), 167-187.

Baltagi, B. H., Jung, B. C., & Song, S. H. (2010). Testing for heteroskedasticity and serial correlation in a random effects panel data model. Journal of Econometrics, 154(2), 122-124.

[Chapter authors not recoverable.] and their relationship to socio-economic status. In D. F. Robitaille & A. E. Beaton (Eds.), Secondary analysis of the TIMSS data (pp. 211-231). Dordrecht: Springer. doi: 10.1007/0-306-47642-8

Bera, A. K., Sosa-Escudero, W., & Yoon, M. (2001). Tests for the error component model in the presence of local misspecification. Journal of Econometrics, 101(1), 1-23.

Berger, M. P., & Wong, W. K. (2009). An introduction to optimal designs for social and biomedical research (Vol. 83). John Wiley & Sons.

Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? The Quarterly Journal of Economics, 119(1), 249-275.

Bloom, H. S., Richburg-Hayes, L., & Black, A. R. (2007). Using covariates to improve precision for studies that randomize schools to evaluate educational interventions. Educational Evaluation and Policy Analysis, 29(1), 30-59.

Breusch, T. S., & Pagan, A. R. (1980). The Lagrange multiplier test and its applications to model specification in econometrics. The Review of Economic Studies, 47(1), 239-253.
Bryk, A. S., & Raudenbush, S. W. (1989). Toward a more appropriate conceptualization of research on school effects: A three-level hierarchical linear model. In R. D. Bock (Ed.), Multilevel analysis of educational data (pp. 159-204). Academic Press.

Cameron, A. C., & Miller, D. L. (2015). A practitioner's guide to cluster-robust inference. Journal of Human Resources, 50(2), 317-372.

Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics, 90(3), 414-427.

Chen, S., & Rust, K. (2017). An extension of Kish's design effects to two- and three-stage designs with stratification. Journal of Survey Statistics and Methodology, 5(2), 111-130.

Cheong, Y. F., Fotiu, R. P., & Raudenbush, S. W. (2001). Efficiency and robustness of alternative estimators for two- and three-level models: The case of NAEP. Journal of Educational and Behavioral Statistics, 26(4), 411-429.

Claessens, A. (2012). Kindergarten child care experiences and child achievement and socioemotional skills. Early Childhood Research Quarterly, 27(3), 365-375.

Coburn, C. E., Russell, J. L., Kaufman, J. H., & Stein, M. K. (2012). Supporting sustainability: Teachers' advice networks and ambitious instructional reform. American Journal of Education, 119(1), 137-182.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65(3), 145.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155.

Cohen, J. (2009). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Psychology Press.

Conaway, C., Keesler, V., & Schwartz, N. (2015). What research do state education agencies really need? The promise and limitations of state longitudinal data systems. Educational Evaluation and Policy Analysis, 37(1_suppl), 16S-28S.

Cooper, H., Hedges, L. V., & Valentine, J. C. (Eds.). (2019). The handbook of research synthesis and meta-analysis. Russell Sage Foundation.

Dedrick, R. F., Ferron, J. M., Hess, M. R., Hogarty, K. Y., Kromrey, J. D., Lang, T. R., Niles, J. D., & Lee, R. S. (2009). Multilevel modeling: A review of methodological issues and applications. Review of Educational Research, 79(1), 69-102. https://doi.org/10.3102/0034654308325581

Fahle, E. M., & Reardon, S. F. (2018). How much do test scores vary among school districts? New estimates using population data, 2009-2015. Educational Researcher, 47(4), 221-234.

Ferron, J., Dailey, R., & Yi, Q. (2002). Effects of misspecifying the first-level error structure in two-level models of change. Multivariate Behavioral Research, 37(3), 379-403.

Fitchett, P. G., & Heafner, T. L. (2017). Student demographics and teacher characteristics as predictors of elementary-age students' history knowledge: Implications for teacher education and practice. Teaching and Teacher Education, 67, 79-92.

Frank, K. A. (1998). Quantitative methods for studying social context in multilevels and through interpersonal relations. Review of Research in Education, 23(1), 171-216.

Frank, K. A., Maroulis, S. J., Duong, M. Q., & Kelcey, B. M. (2013). What would it take to change an inference? Using Rubin's causal model to interpret the robustness of causal inferences. Educational Evaluation and Policy Analysis, 35(4), 437-460.

Frank, K. A., Muller, C., Schiller, K. S., Riegle-Crumb, C., Mueller, A. S., Crosnoe, R., & Pearson, J. (2008). The social dynamics of mathematics coursetaking in high school. American Journal of Sociology, 113(6), 1645-1696.
Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141(1), 2.

Gamoran, A., & Dreeben, R. (1986). Coupling and control in educational organizations. Administrative Science Quarterly, 612-632.

Gamoran, A., Secada, W. G., & Marrett, C. B. (2000). The organizational context of teaching and learning. In Handbook of the sociology of education (pp. 37-63). Springer, Boston, MA.

Goddard, Y. L., Goddard, R. D., & Tschannen-Moran, M. (2007). A theoretical and empirical investigation of teacher collaboration for school improvement and student achievement in public elementary schools. Teachers College Record, 109(4), 877-896.

Goldstein, H. (2011). Multilevel statistical models. Hoboken, NJ: Wiley.

Hallinger, P., & Murphy, J. F. (1986). The social context of effective schools. American Journal of Education, 94(3), 328-355.

Hansen, C. B. (2007). Generalized least squares inference in panel and multilevel models with serial correlation and fixed effects. Journal of Econometrics, 140(2), 670-694.

Heafner, T. L., VanFossen, P. J., & Fitchett, P. G. (2019). Predictors of students' achievement on NAEP-Economics: A multilevel model. The Journal of Social Studies Research, 43(4), 327-341.

Heck, R. H., Larsen, T. J., & Marcoulides, G. A. (1990). Instructional leadership and school achievement: Validation of a causal model. Educational Administration Quarterly, 26(2), 94-125.

Hedges, L. V. (2007). Effect sizes in cluster-randomized designs. Journal of Educational and Behavioral Statistics, 32, 341-370.

Hedges, L. V. (2008). What are effect sizes and why do we need them? Child Development Perspectives, 2(3), 167-171.

Hedges, L. V., & Hedberg, E. C. (2014). Intraclass correlations and covariate outcome correlations for planning two- and three-level cluster-randomized experiments in education. Evaluation Review, 37(6), 445-489. https://doi.org/10.1177/0193841X14529126

Hedges, L. V., & Rhoads, C. (2010). Statistical power analysis in education research. NCSER 2010-3006. National Center for Special Education Research.

Heo, M., & Leon, A. C. (2008). Statistical power and sample size requirements for three level hierarchical cluster randomized trials. Biometrics, 64(4), 1256-1262.

Hill, H. C., Rowan, B., & Ball, D. L. (2005). Effects of teachers' mathematical knowledge for teaching on student achievement. American Educational Research Journal, 42(2), 371-406.

Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2008). Empirical benchmarks for interpreting effect sizes in research. Child Development Perspectives, 2(3), 172-177.

Hoffman, L. (2015). Longitudinal analysis: Modeling within-person fluctuation and change. Routledge.

Hox, J. J., van de Schoot, R., & Matthijsse, S. (2012, July). How few countries will do? Comparative survey analysis from a Bayesian perspective. In Survey Research Methods (Vol. 6, No. 2, pp. 87-93).

Huang, F. L. (2018). Multilevel modeling myths. School Psychology Quarterly, 33(3), 492.

Ilies, R., & Judge, T. A. (2004). An experience-sampling measure of job satisfaction and its relationships with affectivity, mood at work, job beliefs, and general job satisfaction. European Journal of Work and Organizational Psychology, 13(3), 367-389.
Jayanthi, M., Dimino, J., Gersten, R., Taylor, M. J., Haymond, K., Smolkowski, K., & Newman-Gonchar, R. (2018). The impact of teacher study groups in vocabulary on teaching practice, teacher knowledge, and student vocabulary knowledge: A large-scale replication study. Journal of Research on Educational Effectiveness, 11(1), 83-108.

Jennings, J. L., & DiPrete, T. A. (2010). Teacher effects on social and behavioral skills in early elementary school. Sociology of Education, 83(2), 135-159.

Kenward, M. G., & Roger, J. H. (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 983-997.

Kenward, M. G., & Roger, J. H. (2009). An improved approximation to the precision of fixed effects from restricted maximum likelihood. Computational Statistics & Data Analysis, 53(7), 2583-2595.

King, G., & Roberts, M. E. (2015). How robust standard errors expose methodological problems they do not fix, and what to do about it. Political Analysis, 159-179.

Kish, L. (1995). Methods for design effects. Journal of Official Statistics, 11(1), 55.

Konstantopoulos, S. (2008a). The power of the test for treatment effects in three-level cluster randomized designs. Journal of Research on Educational Effectiveness, 1(1), 66-88.

Konstantopoulos, S. (2008b). The power of the test for treatment effects in three-level block randomized designs. Journal of Research on Educational Effectiveness, 1(4), 265-288.

Konstantopoulos, S. (2009). Incorporating cost in power analysis for three-level cluster randomized designs. Evaluation Review, 33(4), 335-357. https://doi.org/10.1177/0193841X09337991

Konstantopoulos, S. (2010). Power analysis in two-level unbalanced designs. The Journal of Experimental Education, 78(3), 291-317.

Konstantopoulos, S. (2011). Fixed effects and variance components estimation in three-level meta-analysis. Research Synthesis Methods, 2(1), 61-76.

Korendijk, E. J., Hox, J. J., Moerbeek, M., & Maas, C. J. (2011). Robustness of parameter and standard error estimates against ignoring a contextual effect of a subject-level covariate in cluster-randomized trials. Behavior Research Methods, 43(4), 1003-1013.

Korendijk, E. J., Moerbeek, M., & Maas, C. J. (2010). The robustness of designs for trials with nested data against incorrect initial intracluster correlation coefficient estimates. Journal of Educational and Behavioral Statistics, 35(5), 566-585.

Kraft, M. A. (2020). Interpreting effect sizes of education interventions. Educational Researcher, 49(4), 241-253.

Krull, J. L., & MacKinnon, D. P. (2001). Multilevel modeling of individual and group level mediated effects. Multivariate Behavioral Research, 36(2), 249-277.

Kwok, O. M., West, S. G., & Green, S. B. (2007). The impact of misspecifying the within-subject covariance structure in multiwave longitudinal multilevel models: A Monte Carlo study. Multivariate Behavioral Research, 42(3), 557-592.

LeBeau, B. (2016). Impact of serial correlation misspecification with the linear mixed model. Journal of Modern Applied Statistical Methods, 15(1), 21.

LeBeau, B. (2018, May). Misspecification of the random effect structure: Implications for the linear mixed model [Working paper]. Retrieved January 20, 2020, from https://iro.uiowa.edu/discovery/fulldisplay/alma9983557687702771/01IOWA_INST:ResearchRepository?tags=scholar
Leckie, G., French, R., Charlton, C., & Browne, W. (2014). Modeling heterogeneous variance-covariance components in two-level models. Journal of Educational and Behavioral Statistics, 39(5), 307-332. doi: 10.3102/1076998614546494

De Leeuw, J., & Meijer, E. (2008). Introduction to multilevel analysis. In J. De Leeuw & E. Meijer (Eds.), Handbook of multilevel analysis (pp. 1-75). New York, NY: Springer.

Lendrum, A., & Humphrey, N. (2012). The importance of studying the implementation of interventions in school settings. Oxford Review of Education, 38(5), 635-652.

Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13-22.

Littell, R. C., Pendergast, J., & Natarajan, R. (2000). Modelling covariance structure in the analysis of repeated measures data. Statistics in Medicine, 19(13), 1793-1819.

Luo, W., & Kwok, O. M. (2009). The impacts of ignoring a crossed factor in analyzing cross-classified data. Multivariate Behavioral Research, 44(2), 182-212.

MacKinnon, J. G., & Webb, M. D. (2019, May). When and how to deal with clustered errors in regression models. Retrieved January 20, 2020, from http://qed.econ.queensu.ca/pub/faculty/mackinnon/working-papers/qed_wp_1421.pdf

Martínez, J. F. (2012). Consequences of omitting the classroom in multilevel models of schooling: An illustration using opportunity to learn and reading achievement. School Effectiveness and School Improvement, 23(3), 305-326.

McNeish, D. M. (2014). Analyzing clustered data with OLS regression: The effect of a hierarchical data structure. Multiple Linear Regression Viewpoints, 40(1), 11-16.

McNeish, D., & Kelley, K. (2019). Fixed effects models versus mixed effects models for clustered data: Reviewing the approaches, disentangling the differences, and making recommendations. Psychological Methods, 24(1), 20.

McNeish, D., Stapleton, L. M., & Silverman, R. D. (2017). On the unnecessary ubiquity of hierarchical linear modeling. Psychological Methods, 22(1), 114.

McNeish, D., & Wentzel, K. R. (2017). Accommodating small sample sizes in three-level models when the third level is incidental. Multivariate Behavioral Research, 52(2), 200-215.

Moeller, A. J., Theiler, J. M., & Wu, C. (2012). Goal setting and student achievement: A longitudinal study. The Modern Language Journal, 96(2), 153-169.

Moerbeek, M. (2004). The consequence of ignoring a level of nesting in multilevel analysis. Multivariate Behavioral Research, 39(1), 129-149.

Montes-Rojas, G. (2016). An equicorrelation Moulton factor in the presence of arbitrary intra-cluster correlation. Economics Letters, 145, 221-224.

Moulton, B. R. (1986). Random group effects and the precision of regression estimates. Journal of Econometrics, 32(3), 385-397.

Moulton, B. R. (1990). An illustration of a pitfall in estimating the effects of aggregate variables on micro units. The Review of Economics and Statistics, 334-338.

Muijs, D. (2020). Extending educational effectiveness: The middle tier and network effectiveness. In J. Hall, A. Lindorff, & P. Sammons (Eds.), International perspectives in educational effectiveness research. Springer, Cham. doi: 10.1007/978-3-030-44810-3_5

Muller, C. L. (2015). Measuring school contexts. AERA Open, 1(4). https://doi.org/10.1177/2332858415613055

Mundfrom, D. J., & Schults, M. R. (2002). A Monte Carlo simulation comparing parameter estimates from multiple linear regression and hierarchical linear modeling. Multiple Regression Viewpoints, 28, 18-21.
Murray, D. M., Hannan, P. J., & Baker, W. L. (1996). A Monte Carlo study of alternative responses to intraclass correlation in community trials: Is it ever possible to avoid Cornfield's penalties? Evaluation Review, 20(3), 313-337.
Musca, S. C., Kamiejski, R., Nugier, A., Méot, A., Er-Rafiy, A., & Brauer, M. (2011). Data with hierarchical structure: Impact of intraclass correlation and sample size on Type I error. Frontiers in Psychology, 2, 74.
Nezlek, J. B., & Zyzniewski, L. E. (1998). Using hierarchical linear modeling to analyze grouped data. Group Dynamics: Theory, Research, and Practice, 2(4), 313.
Niehaus, E., Campbell, C. M., & Inkelas, K. K. (2014). HLM behind the curtain: Unveiling decisions behind the use and interpretation of HLM in higher education research. Research in Higher Education, 55(1), 101-122.
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237-257.
OECD (2017). PISA 2015 technical report. https://www.oecd.org/pisa/data/2015-technical-report/PISA2015_TechRep_Final.pdf
Opdenakker, M. C., & Van Damme, J. (2000). The importance of identifying levels in multilevel analysis: An illustration of the effects of ignoring the top or intermediate levels in school effectiveness research. School Effectiveness and School Improvement, 11(1), 103-130.
Penuel, W. R., Riel, M., Krause, A., & Frank, K. A. (2009). Analyzing teacher interactions in a school as social capital: A social network approach. Teachers College Record, 111(1), 124-163.
Raudenbush, S. W. (2008). Advancing educational policy by advancing research on instruction. American Educational Research Journal, 45(1), 206-230.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods. Thousand Oaks, CA: Sage Publications.
Raudenbush, S. W., & Sadoff, S. (2008). Statistical inference when classroom quality is measured with error. Journal of Research on Educational Effectiveness, 1(2), 138-154.
Raudenbush, S. W., & Schwartz, D. (2020). Randomized experiments in education, with implications for multilevel causal inference. Annual Review of Statistics and Its Application, 7, 177-208.
Raykov, T., Patelis, T., Marcoulides, G. A., & Lee, C. L. (2016). Examining intermediate omitted levels in hierarchical designs via latent variable modeling. Structural Equation Modeling: A Multidisciplinary Journal, 23(1), 111-115.
Rhoads, C. H. (2011). The implications of "contamination" for experimental design in education. Journal of Educational and Behavioral Statistics, 36(1), 76-104.
Robinson, T. Three essays on measuring political behaviour [PhD thesis]. University of Oxford.
Rumberger, R., & Palardy, G. (2004). Multilevel models for school effectiveness research. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 235-257). Thousand Oaks, CA: Sage.
Schochet, P. Z. (2008). Statistical power for random assignment evaluations of education programs. Journal of Educational and Behavioral Statistics, 33(1), 62-87.
Seashore Louis, K., & Lee, M. (2016). Teacher capacity for organizational learning: The effects of school culture and context. School Effectiveness and School Improvement, 27(4), 534-556.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford University Press.
Skinner, C. J. (1986). Design effects of two-stage sampling. Journal of the Royal Statistical Society: Series B (Methodological), 48(1), 89-99.
Skinner, C. J., Holt, D., & Smith, T. F. (1989). Analysis of complex surveys. John Wiley & Sons.
Skinner, C., & Wakefield, J. (2017). Introduction to the design and analysis of complex survey data. Statistical Science, 32(2), 165-175.
Skrondal, A., & Rabe-Hesketh, S. (2008). Multilevel and related models for longitudinal data. In J. De Leeuw & E. Meijer (Eds.), Handbook of multilevel analysis (pp. 275-300). New York, NY: Springer.
Snijders, T. A. (2005). Power and sample size in multilevel modeling. In B. S. Everitt & D. C. Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 3, pp. 1570-1573). Chichester, UK: Wiley. doi: 10.1002/0470013192
Snijders, T. A., & Berkhof, J. (2008). Diagnostic checks for multilevel models. In J. De Leeuw & E. Meijer (Eds.), Handbook of multilevel analysis (pp. 141-175). New York, NY: Springer.
Snijders, T. A., & Bosker, R. J. (2011). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Sage.
Spillane, J. P., Parise, L. M., & Sherer, J. Z. (2011). Organizational routines as coupling mechanisms: Policy, school administration, and the technical core. American Educational Research Journal, 48(3), 586-619.
Spybrook, J., Kelcey, B., & Dong, N. (2016). Power for detecting treatment by moderator effects in two- and three-level cluster randomized trials. Journal of Educational and Behavioral Statistics, 41(6), 605-627.
Spybrook, J., Zhang, Q., Kelcey, B., & Dong, N. (2020). Learning from cluster randomized trials in education: An assessment of the capacity of studies to determine what works, for whom, and under what conditions. Educational Evaluation and Policy Analysis, 42(3), 354-374.
Stapleton, L. M., & Kang, Y. (2018). Design effects of multilevel estimates from national probability samples. Sociological Methods & Research, 47(3), 430-457.
Taylor, I. M., Ntoumanis, N., Standage, M., & Spray, C. M. (2010). Motivational predictors of leisure-time physical activity: A multilevel linear growth analysis. Journal of Sport and Exercise Psychology, 32(1), 99-120.
Tranmer, M., & Steel, D. G. (2001). Ignoring a level in a multilevel model: Evidence from UK census data. Environment and Planning A, 33(5), 941-948.
Vaezghasemi, M., Ng, N., Eriksson, M., & Subramanian, S. V. (2016). Households, the omitted level in contextual analysis: Disentangling the relative influence of households and districts on the variation of BMI about two decades in Indonesia. International Journal for Equity in Health, 15(1), 102.
Vallejo, G., Fernández, M. P., Livacic-Rojas, P. E., & Tuero-Herrero, E. (2011). Selecting the best unbalanced repeated measures model. Behavior Research Methods, 43(1), 18-36.
Valliant, R., Dever, J. A., & Kreuter, F. (2013). Practical tools for designing and weighting survey samples. New York: Springer.
van Breukelen, G., & Moerbeek, M. (2016). Design considerations in multilevel studies. In M. A. Scott, J. S. Simonoff, & D. Marx (Eds.), The SAGE handbook of multilevel modeling (pp. 183-200). London, UK: SAGE Publications Ltd. doi: 10.4135/9781446247600
Van den Noortgate, W., Opdenakker, M. C., & Onghena, P. (2005). The effects of ignoring a level in multilevel analysis. School Effectiveness and School Improvement, 16(3), 281-303.
Voogt, J. M., Pieters, J. M., & Handelzalts, A. (2016). Teacher collaboration in curriculum design teams: Effects, mechanisms, and conditions. Educational Research and Evaluation, 22(3-4), 121-140.
Wang, W., Liao, M., & Stapleton, L. (2019). Incidental second-level dependence in educational survey data with a nested data structure. Educational Psychology Review, 1-26.
Weiss, M. J. (2010). The implications of teacher selection and the teacher effect in individually randomized group treatment trials. Journal of Research on Educational Effectiveness, 3(4), 381-405.
Weiss, M. J., Lockwood, J. R., & McCaffrey, D. F. (2016). Estimating the standard error of the impact estimator in individually randomized trials with clustering. Journal of Research on Educational Effectiveness, 9(3), 421-444.
Westine, C. D., Spybrook, J., & Taylor, J. A. (2013). An empirical investigation of variance design parameters for planning cluster-randomized trials of science achievement. Evaluation Review, 37(6), 490-519. https://doi.org/10.1177/0193841X14531584
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817-838.
Wong, E. M., & Li, S. C. (2008). Framing ICT implementation in a context of educational change: A multilevel analysis. School Effectiveness and School Improvement, 19(1), 99-120.
Wooldridge, J. M. (2003). Cluster-sample methods in applied econometrics. American Economic Review, 93(2), 133-138.
Xia, J., Shen, J., & Sun, J. (2020). Tight, loose, or decoupling? A national study of the decision-making power relationship between district central offices and school principals. Educational Administration Quarterly, 56(3), 396-434.
Zhu, P., Jacob, R., Bloom, H., & Xu, Z. (2012). Designing and analyzing studies that randomize schools to estimate intervention effects on student academic outcomes without classroom-level information. Educational Evaluation and Policy Analysis, 34(1), 45-68.